removeStopWords

Remove stop words from documents

Syntax

newDocuments = removeStopWords(documents)

newDocuments = removeStopWords(documents,'IgnoreCase',false)

Description

Words like "a", "and", "to", and "the" (known as stop words) can add noise to data. Use this function to remove stop words before analysis.

The function supports English, Japanese, German, and Korean text. To remove stop words from text in other languages, see Language Considerations.

newDocuments = removeStopWords(documents) removes the stop words from the tokenizedDocument array documents. By default, the function is case insensitive and uses the stop word list returned by the stopWords function for the language recorded in the language details of documents.

To remove a custom list of words, use the removeWords function.
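As a minimal sketch of the custom-list approach, the following extends the default stop word list with one extra word and passes the combined list to removeWords. The extra word "short" is an arbitrary choice for illustration, and the sketch assumes that stopWords returns a row string array that can be concatenated with other strings.

```matlab
% Tokenize a sample sentence.
documents = tokenizedDocument("an example of a short sentence");

% Combine the default stop word list with an illustrative extra word.
% Assumption: stopWords returns a row string array.
customWords = [stopWords "short"];

% Remove every word in the combined list.
% Note that removeWords matches case by default.
newDocuments = removeWords(documents,customWords)
```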


newDocuments = removeStopWords(documents,'IgnoreCase',false) removes only the words whose case exactly matches an entry in the stop word list given by the stopWords function.
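The following sketch contrasts the two syntaxes on a sentence that contains both "The" and "the". It assumes that the stop word list stores its entries in lowercase, in which case the capitalized "The" would be kept when case matching is enabled. The sample sentence is illustrative only.

```matlab
% Tokenize a sentence that contains both "The" and "the".
documents = tokenizedDocument("The cat sat on the mat");

% Default behavior: matching ignores case.
caseInsensitive = removeStopWords(documents);

% Case-sensitive matching: a token is removed only if its case
% matches the entry in the stop word list exactly.
caseSensitive = removeStopWords(documents,'IgnoreCase',false);

% Compare the number of remaining tokens for each result.
doclength(caseInsensitive)
doclength(caseSensitive)
```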

Tip

Use removeStopWords before using the normalizeWords function, because removeStopWords uses information that normalizeWords removes.
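A short sketch of the recommended order, using an arbitrary sample sentence: stop words are removed first, and the remaining tokens are normalized afterward.

```matlab
% Tokenize a sample sentence.
documents = tokenizedDocument("the cats are running in the garden");

% Step 1: remove stop words while the original token information
% is still available.
documents = removeStopWords(documents);

% Step 2: normalize the remaining words.
documents = normalizeWords(documents)
```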

Examples


Remove Stop Words from Documents


Remove the stop words from an array of documents using removeStopWords. The tokenizedDocument function detects that the documents are in English, so removeStopWords removes English stop words.

documents = tokenizedDocument([
    "an example of a short sentence" 
    "a second short sentence"]);
newDocuments = removeStopWords(documents)

newDocuments = 
  2×1 tokenizedDocument:

    3 tokens: example short sentence
    3 tokens: second short sentence

Remove Japanese Stop Words


Tokenize Japanese text using tokenizedDocument. The function automatically detects Japanese text.

str = [
    "ここは静かなので、とても穏やかです"
    "企業内の顧客データを利用し、今年の売り上げを調べることが出来た。"
    "私は先生です。私は英語を教えています。"];
documents = tokenizedDocument(str);

Remove stop words using removeStopWords. The function uses the language details from documents to determine which language's stop words to remove.

documents = removeStopWords(documents)

documents = 
  3×1 tokenizedDocument:

     4 tokens: 静か 、 とても 穏やか
    10 tokens: 企業 顧客 データ 利用 、 今年 売り上げ 調べる 出来 。
     5 tokens: 先生 。 英語 教え 。

Remove German Stop Words from Documents


Tokenize German text using tokenizedDocument. The function automatically detects German text.

str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str)

documents = 
  2×1 tokenizedDocument:

    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .

Remove stop words using the removeStopWords function. The function uses the language details from documents to determine which language's stop words to remove.

documents = removeStopWords(documents)

documents = 
  2×1 tokenizedDocument:

    5 tokens: Guten Morgen . geht ?
    5 tokens: Heute wird guter Tag .

Input Arguments


`documents` — Input documents
`tokenizedDocument` array

Input documents, specified as a tokenizedDocument array.

Output Arguments


`newDocuments` — Output documents
`tokenizedDocument` array

Output documents, returned as a tokenizedDocument array.

More About


Language Considerations

The stopWords and removeStopWords functions support English, Japanese, German, and Korean stop words only.

To remove stop words from other languages, use removeWords and specify your own stop words to remove.
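For example, the following sketch removes a small, hand-picked set of French stop words with removeWords. The word list here is illustrative only and is not a complete or official French stop word list.

```matlab
% Tokenize French text, a language that removeStopWords does not support.
documents = tokenizedDocument("le chat est sur la table");

% Hypothetical, hand-picked stop words for illustration.
frenchStopWords = ["le" "la" "est" "sur"];

% Remove the custom stop words.
newDocuments = removeWords(documents,frenchStopWords)
```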

Algorithms


Language Details

tokenizedDocument objects contain details about the tokens including language details. The language details of the input documents determine the behavior of removeStopWords. The tokenizedDocument function, by default, automatically detects the language of the input text. To specify the language details manually, use the Language option of tokenizedDocument. To view the token details, use the tokenDetails function.
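As a sketch of setting and inspecting the language details, the following specifies German explicitly with the Language option instead of relying on automatic detection, and then views the token details. The sample sentence is taken from the German example on this page.

```matlab
% Specify the language manually instead of relying on detection.
str = "Guten Morgen. Wie geht es dir?";
documents = tokenizedDocument(str,'Language','de');

% View the token details, which include the language of each token.
tdetails = tokenDetails(documents);
head(tdetails)

% removeStopWords now uses the German stop word list.
newDocuments = removeStopWords(documents)
```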

Version History

Introduced in R2018b
