Language Considerations
Text Analytics Toolbox™ supports the languages English, Japanese, German, and Korean. Most Text Analytics Toolbox functions also work with text in other languages. This table summarizes how to use Text Analytics Toolbox features for other languages.
| Feature | Language Consideration | Workaround |
|---|---|---|
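| Tokenization | The tokenizedDocument function has built-in rules for English, Japanese, German, and Korean only. For English and German text, the 'unicode' tokenization method detects tokens using rules based on Unicode Standard Annex #29 [1] and the ICU tokenizer [2]. For Japanese and Korean text, the 'mecab' tokenization method detects tokens using rules based on the MeCab tokenizer [3]. | For other languages, you can still try using tokenizedDocument. If the results are not useful, tokenize the text manually and create the tokenizedDocument array with the 'TokenizeMethod' option set to 'none'. For more information, see tokenizedDocument. |
| Stop word removal | The stopWords and removeStopWords functions support English, Japanese, German, and Korean stop words only. | To remove stop words from other languages, use removeWords and specify your own stop words to remove. |
| Sentence detection | The addSentenceDetails function detects sentence boundaries based on punctuation characters and line number information. For English and German text, the function also uses a list of abbreviations. | For other languages, you might need to specify your own list of abbreviations for sentence detection. To do this, use the 'Abbreviations' option of addSentenceDetails. For more information, see addSentenceDetails. |
| Word clouds | For string input, the wordcloud and wordCloudCounts functions use English, Japanese, German, and Korean tokenization, stop word removal, and word normalization. | For other languages, you might need to manually preprocess your text data and specify unique words and corresponding sizes in wordcloud. To specify word sizes in wordcloud, input your data as a table or arrays containing the unique words and corresponding sizes. For more information, see wordcloud. |
| Word embeddings | File input to the trainWordEmbedding function requires words separated by whitespace. | For files containing non-English text, you might need to input a tokenizedDocument array to trainWordEmbedding. To create a tokenizedDocument array from pretokenized text, use the tokenizedDocument function and set the 'TokenizeMethod' option to 'none'. For more information, see trainWordEmbedding. |
| Keyword extraction | The rakeKeywords function supports English, Japanese, German, and Korean text only. | The rakeKeywords function extracts keywords using a delimiter-based approach to identify candidate keywords. By default, the delimiters are punctuation characters and the stop words given by stopWords for the language of the input documents. For other languages, specify an appropriate set of delimiters using the 'Delimiters' and 'MergingDelimiters' options. For more information, see rakeKeywords. |
| | The textrankKeywords function supports English, Japanese, German, and Korean text only. | The textrankKeywords function extracts keywords by identifying candidate keywords based on their part-of-speech tags, which are given by the addPartOfSpeechDetails function and are available for English, Japanese, German, and Korean text only. For other languages, try using the rakeKeywords function instead and specify an appropriate set of delimiters. For more information, see textrankKeywords. |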
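The tokenization and stop word workarounds in the table can be sketched as follows. This is a minimal illustration, not a recommended preprocessing pipeline: the Spanish sentence, the whitespace-based split, and the three-word stop list are all assumptions made for the example. It uses tokenizedDocument with 'TokenizeMethod' set to 'none' and removeWords with a custom word list.

```matlab
% Illustrative text in a language without built-in tokenization rules.
str = "el gato duerme en la casa";

% Tokenize manually. A whitespace split is an assumption that suits this
% example; other languages may need a different splitting strategy.
words = split(str)';   % row string vector of words

% Create the document without applying the built-in tokenization rules.
documents = tokenizedDocument(words,'TokenizeMethod','none');

% Remove a custom stop word list (illustrative only, not a complete list).
customStopWords = ["el" "la" "en"];
documents = removeWords(documents,customStopWords);
```

The resulting tokenizedDocument array can then be passed to the language-independent functions described in the next section.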
Language-Independent Features
Word and N-Gram Counting
The bagOfWords and bagOfNgrams functions support tokenizedDocument input regardless of language. If you have a tokenizedDocument array containing your data, then you can use these functions.
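As a rough sketch of this, the following code counts words and bigrams in manually tokenized documents. The example words are made up for illustration, and the documents are created with 'TokenizeMethod' set to 'none' on the assumption that the text has already been split into words.

```matlab
% Pretokenized documents (illustrative data), one string array per document.
tokens = {["el" "gato" "duerme"]; ["el" "gato" "come"]};
documents = tokenizedDocument(tokens,'TokenizeMethod','none');

% Count individual words.
bag = bagOfWords(documents);

% Count n-grams of length 2 (bigrams).
bagBigrams = bagOfNgrams(documents,'NgramLengths',2);

% Inspect the sizes of the two models.
numWords = bag.NumWords;
numBigrams = bagBigrams.NumNgrams;
```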
Modeling and Prediction
The fitlda and fitlsa functions support bagOfWords and bagOfNgrams input regardless of language. If you have a bagOfWords or bagOfNgrams object containing your data, then you can use these functions.
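A minimal sketch of fitting both model types to a bag-of-words model built from manually tokenized text is shown below. The four tiny documents and the choice of two topics and two components are assumptions for illustration only; a corpus this small does not produce meaningful topics.

```matlab
% Pretokenized documents (illustrative data).
tokens = { ...
    ["gato" "duerme" "casa"]; ...
    ["perro" "come" "casa"]; ...
    ["gato" "come" "pescado"]; ...
    ["perro" "duerme" "parque"]};
documents = tokenizedDocument(tokens,'TokenizeMethod','none');
bag = bagOfWords(documents);

% Fix the random seed so that repeated runs give the same LDA fit.
rng(0)

% Fit an LDA model with 2 topics, suppressing the progress output.
ldaMdl = fitlda(bag,2,'Verbose',0);

% Fit an LSA model with 2 components.
lsaMdl = fitlsa(bag,2);

% Per-document topic mixtures, one row per document.
topicMixtures = ldaMdl.DocumentTopicProbabilities;
```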
The trainWordEmbedding function supports tokenizedDocument or file input regardless of language. If you have a tokenizedDocument array or a file containing your data in the correct format, then you can use this function.
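The following sketch trains a small embedding from a tokenizedDocument array of manually tokenized text. The data and option values are assumptions chosen so that the example runs on a toy corpus: 'MinCount' is lowered to 1 because every word here appears only a few times, and the embedding dimension is kept small. An embedding trained on this little data is not useful in practice.

```matlab
% Pretokenized documents (illustrative data).
tokens = { ...
    ["el" "gato" "duerme" "en" "la" "casa"]; ...
    ["el" "perro" "come" "en" "la" "casa"]; ...
    ["el" "gato" "come" "pescado"]};
documents = tokenizedDocument(tokens,'TokenizeMethod','none');

% Train a 10-dimensional embedding, keeping words that appear at least once.
emb = trainWordEmbedding(documents, ...
    'Dimension',10, ...
    'MinCount',1, ...
    'NumEpochs',5, ...
    'Verbose',0);

% Look up the vector for one of the training words.
vec = word2vec(emb,"gato");
```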
References
[1] Unicode Text Segmentation. https://www.unicode.org/reports/tr29/
[2] Boundary Analysis. https://unicode-org.github.io/icu/userguide/boundaryanalysis/
[3] MeCab: Yet Another Part-of-Speech and Morphological Analyzer. https://taku910.github.io/mecab/
See Also
stopWords | removeWords | normalizeWords | bagOfWords | bagOfNgrams | tokenizedDocument | fitlda | fitlsa | wordcloud | addSentenceDetails | addLanguageDetails