Language Considerations

Text Analytics Toolbox™ supports the languages English, Japanese, German, and Korean. Most Text Analytics Toolbox functions also work with text in other languages. The following sections summarize how to use Text Analytics Toolbox features for other languages.

Tokenization

The tokenizedDocument function has built-in rules for English, Japanese, German, and Korean only. For English and German text, the 'unicode' tokenization method of tokenizedDocument detects tokens using rules based on Unicode® Standard Annex #29 [1] and the ICU tokenizer [2], modified to better detect complex tokens such as hashtags and URLs. For Japanese and Korean text, the 'mecab' tokenization method detects tokens using rules based on the MeCab tokenizer [3].

For other languages, you can still try using tokenizedDocument. If tokenizedDocument does not produce useful results, then try tokenizing the text manually. To create a tokenizedDocument array from manually tokenized text, set the 'TokenizeMethod' option to 'none'.
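For example, this sketch creates a tokenizedDocument array from pretokenized text, where each element of the cell array contains the tokens of one document (the tokens here are placeholders):

    str = {...
        ["an" "example" "of" "a" "short" "document"]
        ["a" "second" "short" "document"]};
    documents = tokenizedDocument(str,'TokenizeMethod','none');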

For more information, see tokenizedDocument.

Stop word removal

The stopWords and removeStopWords functions support English, Japanese, German, and Korean stop words only.

To remove stop words from other languages, use removeWords and specify your own stop words to remove.
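A minimal sketch, assuming a hypothetical list of French stop words:

    documents = tokenizedDocument([
        "un exemple de document court"
        "un deuxième document court"]);
    customStopWords = ["un" "de" "le" "la"];   % hypothetical stop word list
    documents = removeWords(documents,customStopWords);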

Sentence detection

The addSentenceDetails function detects sentence boundaries based on punctuation characters and line number information. For English and German text, the function also uses a list of abbreviations passed to the function.

For other languages, you might need to specify your own list of abbreviations for sentence detection. To do this, use the 'Abbreviations' option of addSentenceDetails.
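For example, this sketch passes a hypothetical list of abbreviations (specified without the trailing periods) so that the function does not treat them as sentence endings:

    documents = addSentenceDetails(documents, ...
        'Abbreviations',["fig" "eq" "approx"]);   % hypothetical abbreviations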

For more information, see addSentenceDetails.

Word clouds

For string input, the wordcloud and wordCloudCounts functions use English, Japanese, German, and Korean tokenization, stop word removal, and word normalization.

For other languages, you might need to manually preprocess your text data and specify unique words and corresponding sizes in wordcloud.

To specify word sizes in wordcloud, input your data as a table or arrays containing the unique words and corresponding sizes.
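For example, this sketch passes hypothetical preprocessed words and corresponding counts directly to wordcloud:

    words = ["bonjour" "monde" "merci"];   % hypothetical unique words
    counts = [42 28 17];                   % corresponding sizes
    figure
    wordcloud(words,counts);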

For more information, see wordcloud.

Word embeddings

File input to the trainWordEmbedding function requires words separated by whitespace.

For files containing non-English text, you might need to input a tokenizedDocument array to trainWordEmbedding.

To create a tokenizedDocument array from pretokenized text, use the tokenizedDocument function and set the 'TokenizeMethod' option to 'none'.
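For example, given a tokenizedDocument array documents (for instance, created from pretokenized text as in the tokenization example above), this sketch trains an embedding with a placeholder dimension. Note that training a useful embedding requires a large collection of documents.

    emb = trainWordEmbedding(documents,'Dimension',50);   % 50 is a placeholder value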

For more information, see trainWordEmbedding.

Keyword extraction

The rakeKeywords function supports English, Japanese, German, and Korean text only.

The rakeKeywords function extracts keywords using a delimiter-based approach to identify candidate keywords. By default, the function uses punctuation characters and the stop words given by the stopWords function, with the language given by the language details of the input documents, as delimiters.

For other languages, specify an appropriate set of delimiters using the 'Delimiters' and 'MergingDelimiters' options.
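For example, this sketch specifies hypothetical delimiters and merging delimiters:

    delimiters = ["et" "le" "la" "un"];   % hypothetical delimiter tokens
    mergingDelimiters = ["de" "du"];      % hypothetical merging delimiters
    tbl = rakeKeywords(documents, ...
        'Delimiters',delimiters, ...
        'MergingDelimiters',mergingDelimiters);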

For more information, see rakeKeywords.

The textrankKeywords function supports English, Japanese, German, and Korean text only.

The textrankKeywords function extracts keywords by identifying candidate keywords based on their part-of-speech tags. The function uses part-of-speech tags given by the addPartOfSpeechDetails function, which supports English, Japanese, German, and Korean text only.

For other languages, try using the rakeKeywords function instead and specify an appropriate set of delimiters using the 'Delimiters' and 'MergingDelimiters' options, as in the example above.

For more information, see textrankKeywords.

Language-Independent Features

Word and N-Gram Counting

The bagOfWords and bagOfNgrams functions support tokenizedDocument input regardless of language. If you have a tokenizedDocument array containing your data, then you can use these functions.
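For example, given any tokenizedDocument array documents:

    bag = bagOfWords(documents);
    ngrams = bagOfNgrams(documents,'NgramLengths',2);   % count bigrams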

Modeling and Prediction

The fitlda and fitlsa functions support bagOfWords and bagOfNgrams input regardless of language. If you have a bagOfWords or bagOfNgrams object containing your data, then you can use these functions.
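For example, given a bagOfWords object bag, this sketch fits a latent Dirichlet allocation model with a placeholder number of topics:

    numTopics = 5;   % placeholder value
    mdl = fitlda(bag,numTopics);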

The trainWordEmbedding function supports tokenizedDocument or file input regardless of language. If you have a tokenizedDocument array or a file containing your data in the correct format, then you can use this function.

References

[1] Unicode Text Segmentation. Unicode Standard Annex #29. https://www.unicode.org/reports/tr29/

[2] International Components for Unicode (ICU). https://icu.unicode.org/

[3] MeCab: Yet Another Part-of-Speech and Morphological Analyzer. https://taku910.github.io/mecab/
