Language Considerations

Text Analytics Toolbox™ supports the languages English, Japanese, German, and Korean. Most Text Analytics Toolbox functions also work with text in other languages. The following sections summarize how to use Text Analytics Toolbox features for other languages.

Tokenization

The tokenizedDocument function has built-in rules for English, Japanese, German, and Korean only. For English and German text, the 'unicode' tokenization method of tokenizedDocument detects tokens using rules based on Unicode® Standard Annex #29 [1] and the ICU tokenizer [2], modified to better detect complex tokens such as hashtags and URLs. For Japanese and Korean text, the 'mecab' tokenization method detects tokens using rules based on the MeCab tokenizer [3].

For other languages, you can still try using tokenizedDocument. If tokenizedDocument does not produce useful results, then try tokenizing the text manually. To create a tokenizedDocument array from manually tokenized text, set the 'TokenizeMethod' option to 'none'.
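For example, this sketch creates a tokenizedDocument array from pretokenized text, where each element of the cell array contains the tokens of one document (the tokens here are placeholders):

    str = {...
        ["an" "example" "of" "a" "short" "document"]
        ["a" "second" "short" "document"]};
    documents = tokenizedDocument(str,'TokenizeMethod','none');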

For more information, see tokenizedDocument.

Stop word removal

The stopWords and removeStopWords functions support English, Japanese, German, and Korean stop words only.

To remove stop words from other languages, use removeWords and specify your own stop words to remove.
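A minimal sketch, assuming a hypothetical list of French stop words:

    documents = tokenizedDocument([
        "un exemple de document court"
        "un deuxième document court"]);
    customStopWords = ["un" "de" "le" "la"];   % hypothetical stop word list
    documents = removeWords(documents,customStopWords);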

Sentence detection

The addSentenceDetails function detects sentence boundaries based on punctuation characters and line number information. For English and German text, the function also uses a list of abbreviations passed to the function.

For other languages, you might need to specify your own list of abbreviations for sentence detection. To do this, use the 'Abbreviations' option of addSentenceDetails.
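For example, this sketch passes a hypothetical list of abbreviations (specified without the trailing periods) so that the function does not treat them as sentence endings:

    documents = addSentenceDetails(documents, ...
        'Abbreviations',["fig" "eq" "approx"]);   % hypothetical abbreviations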

For more information, see addSentenceDetails.

Word clouds

For string input, the wordcloud and wordCloudCounts functions use English, Japanese, German, and Korean tokenization, stop word removal, and word normalization.

For other languages, you might need to manually preprocess your text data and specify unique words and corresponding sizes in wordcloud.

To specify word sizes in wordcloud, input your data as a table or arrays containing the unique words and corresponding sizes.
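For example, this sketch passes hypothetical preprocessed words and corresponding counts directly to wordcloud:

    words = ["bonjour" "monde" "merci"];   % hypothetical unique words
    counts = [42 28 17];                   % corresponding sizes
    figure
    wordcloud(words,counts);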

For more information, see wordcloud.

Word embeddings

File input to the trainWordEmbedding function requires words separated by whitespace.

For files containing non-English text, you might need to input a tokenizedDocument array to trainWordEmbedding.

To create a tokenizedDocument array from pretokenized text, use the tokenizedDocument function and set the 'TokenizeMethod' option to 'none'.
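For example, given a tokenizedDocument array documents (for instance, created from pretokenized text as in the tokenization example above), this sketch trains an embedding with a placeholder dimension. Note that training a useful embedding requires a large collection of documents.

    emb = trainWordEmbedding(documents,'Dimension',50);   % 50 is a placeholder value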

For more information, see trainWordEmbedding.

Keyword extraction

The rakeKeywords function supports English, Japanese, German, and Korean text only.

The rakeKeywords function extracts keywords using a delimiter-based approach to identify candidate keywords. By default, the function uses punctuation characters and the stop words given by the stopWords function, with the language given by the language details of the input documents, as delimiters.

For other languages, specify an appropriate set of delimiters using the 'Delimiters' and 'MergingDelimiters' options.
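For example, this sketch specifies hypothetical delimiters and merging delimiters:

    delimiters = ["et" "le" "la" "un"];   % hypothetical delimiter tokens
    mergingDelimiters = ["de" "du"];      % hypothetical merging delimiters
    tbl = rakeKeywords(documents, ...
        'Delimiters',delimiters, ...
        'MergingDelimiters',mergingDelimiters);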

For more information, see rakeKeywords.

The textrankKeywords function supports English, Japanese, German, and Korean text only.

The textrankKeywords function extracts keywords by identifying candidate keywords based on their part-of-speech tags. The function uses part-of-speech tags given by the addPartOfSpeechDetails function, which supports English, Japanese, German, and Korean text only.

For other languages, try using the rakeKeywords function instead and specify an appropriate set of delimiters using the 'Delimiters' and 'MergingDelimiters' options, as in the example above.

For more information, see textrankKeywords.

Language-Independent Features

Word and N-Gram Counting

The bagOfWords and bagOfNgrams functions support tokenizedDocument input regardless of language. If you have a tokenizedDocument array containing your data, then you can use these functions.
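For example, given any tokenizedDocument array documents:

    bag = bagOfWords(documents);
    ngrams = bagOfNgrams(documents,'NgramLengths',2);   % count bigrams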

Modeling and Prediction

The fitlda and fitlsa functions support bagOfWords and bagOfNgrams input regardless of language. If you have a bagOfWords or bagOfNgrams object containing your data, then you can use these functions.
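For example, given a bagOfWords object bag, this sketch fits a latent Dirichlet allocation model with a placeholder number of topics:

    numTopics = 5;   % placeholder value
    mdl = fitlda(bag,numTopics);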

The trainWordEmbedding function supports tokenizedDocument or file input regardless of language. If you have a tokenizedDocument array or a file containing your data in the correct format, then you can use this function.

References

[1] Unicode Text Segmentation. Unicode Standard Annex #29. https://www.unicode.org/reports/tr29/

[2] International Components for Unicode (ICU). https://icu.unicode.org/

[3] MeCab: Yet Another Part-of-Speech and Morphological Analyzer. https://taku910.github.io/mecab/
