
addLanguageDetails

Add language identifiers to documents

Description

Use addLanguageDetails to add language identifiers to documents.

The function supports English, Japanese, German, and Korean text.


updatedDocuments = addLanguageDetails(documents) detects the language of documents and updates the token details. The function adds details only to tokens that are missing language details. To get the language details from updatedDocuments, use tokenDetails.

updatedDocuments = addLanguageDetails(documents,Name,Value) specifies additional options using one or more name-value pairs.
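For instance, a minimal sketch of both syntaxes, using only the arguments documented on this page, might look like the following.

documents = tokenizedDocument("an example of a short sentence");

% Detect the language automatically and fill in any missing token details.
updatedDocuments = addLanguageDetails(documents);

% Or recompute the details and set the language explicitly.
updatedDocuments = addLanguageDetails(documents,'Language','en', ...
    'DiscardKnownValues',true);

tdetails = tokenDetails(updatedDocuments);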

Tip

Use addLanguageDetails before using the lower and upper functions, because addLanguageDetails uses information that these functions remove.
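A sketch of that ordering (the input sentence here is only an illustration):

documents = tokenizedDocument("An Example Of A Short Sentence");

% Add language details first, then lowercase; lower removes case
% information that addLanguageDetails can otherwise make use of.
documents = addLanguageDetails(documents);
documents = lower(documents);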

Examples


Manually tokenize some text by splitting it into an array of words. Convert the manually tokenized text into a tokenizedDocument object by setting the 'TokenizeMethod' option to 'none'.

str = split("an example of a short sentence")';
documents = tokenizedDocument(str,'TokenizeMethod','none');

View the token details using tokenDetails.

tdetails = tokenDetails(documents)
tdetails=6×2 table
      Token       DocumentNumber
    __________    ______________

    "an"                1       
    "example"           1       
    "of"                1       
    "a"                 1       
    "short"             1       
    "sentence"          1       

When you specify 'TokenizeMethod','none', the function does not automatically detect the language details of the documents. To add the language details, use the addLanguageDetails function. This function, by default, automatically detects the language.

documents = addLanguageDetails(documents);

View the updated token details using tokenDetails.

tdetails = tokenDetails(documents)
tdetails=6×4 table
      Token       DocumentNumber     Type      Language
    __________    ______________    _______    ________

    "an"                1           letters       en   
    "example"           1           letters       en   
    "of"                1           letters       en   
    "a"                 1           letters       en   
    "short"             1           letters       en   
    "sentence"          1           letters       en   

Input Arguments


documents — Input documents, specified as a tokenizedDocument array.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'DiscardKnownValues',true specifies to discard previously computed details and recompute them.

'Language' — Language, specified as one of the following:

  • 'en' – English

  • 'ja' – Japanese

  • 'de' – German

  • 'ko' – Korean

If you do not specify a value, then the function detects the language from the input text using the corpusLanguage function.

This option specifies the language details of the tokens. To view the language details of the tokens, use tokenDetails. These language details determine the behavior of the removeStopWords, addPartOfSpeechDetails, normalizeWords, addSentenceDetails, and addEntityDetails functions on the tokens.

For more information about language support in Text Analytics Toolbox™, see Language Considerations.
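As a sketch modeled on the example above, you can skip automatic detection and set the language yourself; the German sentence here is only an illustration.

str = split("ein Beispiel für einen kurzen Satz")';
documents = tokenizedDocument(str,'TokenizeMethod','none');

% Specify German explicitly instead of detecting it with corpusLanguage.
documents = addLanguageDetails(documents,'Language','de');
tdetails = tokenDetails(documents);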

'DiscardKnownValues' — Option to discard previously computed details and recompute them, specified as true or false.

Data Types: logical
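For example, a minimal sketch that overwrites previously detected details:

% Recompute the language details, replacing any previously stored values.
updatedDocuments = addLanguageDetails(documents,'DiscardKnownValues',true);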

Output Arguments


updatedDocuments — Updated documents, returned as a tokenizedDocument array. To get the token details from updatedDocuments, use tokenDetails.

Version History

Introduced in R2018b