German Language Support

This topic summarizes the Text Analytics Toolbox™ features that support German text. For an example showing how to analyze German text data, see Analyze German Text Data.

Tokenization

The tokenizedDocument function automatically detects German input. Alternatively, set the 'Language' option in tokenizedDocument to 'de'. This option specifies the language details of the tokens. To view the language details of the tokens, use tokenDetails. These language details determine the behavior of the removeStopWords, addPartOfSpeechDetails, normalizeWords, addSentenceDetails, and addEntityDetails functions on the tokens.

Tokenize German Text

Tokenize German text using tokenizedDocument. The function automatically detects German text.

str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str)
documents = 
  2x1 tokenizedDocument:

    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .
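
If automatic detection is not reliable for your input, you can set the language explicitly instead. The following is a minimal sketch that reuses one of the example sentences: it passes the 'Language' option to tokenizedDocument and then checks the language recorded in the token details.

```matlab
% Set German language details explicitly instead of relying on detection.
str = "Guten Morgen. Wie geht es dir?";
documents = tokenizedDocument(str,'Language','de');

% The Language variable of the token details records the language
% that downstream functions such as removeStopWords use.
tdetails = tokenDetails(documents);
unique(tdetails.Language)
```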

Sentence Detection

To detect sentence structure in documents, use the addSentenceDetails function. To help create custom lists of abbreviations to detect, use the abbreviations function.

Add Sentence Details to German Documents

Tokenize German text using tokenizedDocument.

str = [
    "Guten Morgen, Dr. Schmidt. Geht es Ihnen wieder besser?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str);

Add sentence details to the documents using addSentenceDetails. This function adds the sentence numbers to the table returned by tokenDetails. View the updated token details of the first few tokens.

documents = addSentenceDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails,10)
      Token      DocumentNumber    SentenceNumber    LineNumber       Type        Language
    _________    ______________    ______________    __________    ___________    ________

    "Guten"            1                 1               1         letters           de   
    "Morgen"           1                 1               1         letters           de   
    ","                1                 1               1         punctuation       de   
    "Dr"               1                 1               1         letters           de   
    "."                1                 1               1         punctuation       de   
    "Schmidt"          1                 1               1         letters           de   
    "."                1                 1               1         punctuation       de   
    "Geht"             1                 2               1         letters           de   
    "es"               1                 2               1         letters           de   
    "Ihnen"            1                 2               1         letters           de   

Table of German Abbreviations

View a table of German abbreviations. Use this table to help create custom tables of abbreviations for sentence detection when using addSentenceDetails.

tbl = abbreviations('Language','de');
head(tbl)
    Abbreviation     Usage 
    ____________    _______

       "A.T"        regular
       "ABl"        regular
       "Abb"        regular
       "Abdr"       regular
       "Abf"        regular
       "Abfl"       regular
       "Abh"        regular
       "Abk"        regular

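One way to use this table is to extend it with your own entries and pass the result to addSentenceDetails through the 'Abbreviations' option. The sketch below is illustrative only: the extra abbreviation "Prof" is a hypothetical addition, and because the list is passed as a string array, all entries are treated as regular abbreviations.

```matlab
str = "Guten Morgen, Dr. Schmidt. Geht es Ihnen wieder besser?";
documents = tokenizedDocument(str,'Language','de');

% Start from the built-in German abbreviations.
tbl = abbreviations('Language','de');

% Append a custom entry ("Prof" is a hypothetical example).
customAbbreviations = [tbl.Abbreviation(:); "Prof"];

% Detect sentences using the extended list.
documents = addSentenceDetails(documents,'Abbreviations',customAbbreviations);
tdetails = tokenDetails(documents);
head(tdetails)
```
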
Part of Speech Details

To add German part of speech details to documents, use the addPartOfSpeechDetails function.

Get Part of Speech Details of German Text

Tokenize German text using tokenizedDocument.

str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str)
documents = 
  2x1 tokenizedDocument:

    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .

To get the part of speech details for German text, first use addPartOfSpeechDetails.

documents = addPartOfSpeechDetails(documents);

To view the part of speech details, use the tokenDetails function.

tdetails = tokenDetails(documents);
head(tdetails)
     Token      DocumentNumber    SentenceNumber    LineNumber       Type        Language    PartOfSpeech
    ________    ______________    ______________    __________    ___________    ________    ____________

    "Guten"           1                 1               1         letters           de       adjective   
    "Morgen"          1                 1               1         letters           de       noun        
    "."               1                 1               1         punctuation       de       punctuation 
    "Wie"             1                 2               1         letters           de       adverb      
    "geht"            1                 2               1         letters           de       verb        
    "es"              1                 2               1         letters           de       pronoun     
    "dir"             1                 2               1         letters           de       pronoun     
    "?"               1                 2               1         punctuation       de       punctuation 

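After the part of speech details are available, you can filter the token details table by tag. The following is a minimal sketch, assuming the same example sentences, that selects the tokens tagged as nouns.

```matlab
str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str);
documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);

% Keep only the rows whose part of speech tag is "noun".
idx = tdetails.PartOfSpeech == "noun";
nouns = tdetails.Token(idx)
```
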
Named Entity Recognition

To add entity tags to documents, use the addEntityDetails function.

Add Named Entity Tags to German Text

Tokenize German text using tokenizedDocument.

str = [
    "Ernst zog von Frankfurt nach Berlin."
    "Besuchen Sie Volkswagen in Wolfsburg."];
documents = tokenizedDocument(str);

To add entity tags to German text, use the addEntityDetails function. This function detects person names, locations, organizations, and other named entities.

documents = addEntityDetails(documents);

To view the entity details, use the tokenDetails function.

tdetails = tokenDetails(documents);
head(tdetails)
       Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language    PartOfSpeech      Entity  
    ___________    ______________    ______________    __________    ___________    ________    ____________    __________

    "Ernst"              1                 1               1         letters           de       proper-noun     person    
    "zog"                1                 1               1         letters           de       verb            non-entity
    "von"                1                 1               1         letters           de       adposition      non-entity
    "Frankfurt"          1                 1               1         letters           de       proper-noun     location  
    "nach"               1                 1               1         letters           de       adposition      non-entity
    "Berlin"             1                 1               1         letters           de       proper-noun     location  
    "."                  1                 1               1         punctuation       de       punctuation     non-entity
    "Besuchen"           2                 1               1         letters           de       verb            non-entity

View the tokens tagged with the entity "person", "location", "organization", or "other", that is, the tokens not tagged with "non-entity".

idx = tdetails.Entity ~= "non-entity";
tdetails(idx,:)
ans=5×8 table
       Token        DocumentNumber    SentenceNumber    LineNumber     Type      Language    PartOfSpeech       Entity   
    ____________    ______________    ______________    __________    _______    ________    ____________    ____________

    "Ernst"               1                 1               1         letters       de       proper-noun     person      
    "Frankfurt"           1                 1               1         letters       de       proper-noun     location    
    "Berlin"              1                 1               1         letters       de       proper-noun     location    
    "Volkswagen"          2                 1               1         letters       de       noun            organization
    "Wolfsburg"           2                 1               1         letters       de       proper-noun     location    

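To get an overview of which entity types occur, you can count the tokens per tag. This sketch assumes the same example sentences and uses the general-purpose groupsummary function on the token details table; it is one possible way to summarize the tags, not a dedicated Text Analytics Toolbox feature.

```matlab
str = [
    "Ernst zog von Frankfurt nach Berlin."
    "Besuchen Sie Volkswagen in Wolfsburg."];
documents = tokenizedDocument(str);
documents = addEntityDetails(documents);
tdetails = tokenDetails(documents);

% Count the number of tokens for each entity tag.
counts = groupsummary(tdetails,"Entity")
```
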
Stop Words

To remove stop words from documents according to the token language details, use the removeStopWords function. To get a list of German stop words, set the 'Language' option in stopWords to 'de'.

Remove German Stop Words from Documents

Tokenize German text using tokenizedDocument. The function automatically detects German text.

str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str)
documents = 
  2x1 tokenizedDocument:

    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .

Remove stop words using the removeStopWords function. The function uses the language details from documents to determine which language stop words to remove.

documents = removeStopWords(documents)
documents = 
  2x1 tokenizedDocument:

    5 tokens: Guten Morgen . geht ?
    5 tokens: Heute wird guter Tag .
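
To inspect the list that removeStopWords draws on, you can call stopWords directly. The following is a minimal sketch that retrieves the German list and checks whether a few words from the example above are in it.

```matlab
% Get the list of German stop words.
words = stopWords('Language','de');

% Number of stop words in the list.
numel(words)

% Check which of these example words are stop words.
candidates = ["wie" "es" "dir" "ein" "morgen"];
isStopWord = ismember(candidates,words)
```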

Stemming

To stem tokens according to the token language details, use normalizeWords.

Stem German Text

Tokenize German text using the tokenizedDocument function. The function automatically detects German text.

str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str);

Stem the tokens using normalizeWords.

documents = normalizeWords(documents)
documents = 
  2x1 tokenizedDocument:

    8 tokens: gut morg . wie geht es dir ?
    6 tokens: heut wird ein gut tag .
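
normalizeWords also accepts a string array of words. In that case there are no token language details to read, so the sketch below passes the 'Language' option explicitly. The word list is only an illustration.

```matlab
% Stem individual German words supplied as a string array.
words = ["Guten" "Morgen" "guter" "Tag"];
stems = normalizeWords(words,'Language','de')
```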

Language-Independent Features

Word and N-Gram Counting

The bagOfWords and bagOfNgrams functions support tokenizedDocument input regardless of language. If you have a tokenizedDocument array containing your data, then you can use these functions.
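
As a minimal sketch, assuming the German example sentences used earlier in this topic, you can preprocess the documents and then count words with bagOfWords. The preprocessing steps shown here are one reasonable choice, not a required pipeline.

```matlab
str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str);

% Optional preprocessing before counting.
documents = removeStopWords(documents);
documents = erasePunctuation(documents);

% Create the bag-of-words model and view the most frequent words.
bag = bagOfWords(documents);
topkwords(bag,5)
```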

Modeling and Prediction

The fitlda and fitlsa functions support bagOfWords and bagOfNgrams input regardless of language. If you have a bagOfWords or bagOfNgrams object containing your data, then you can use these functions.
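
The following sketch shows the shape of a call to fitlda on a bag-of-words model built from German text. The two example sentences are far too little data for meaningful topics, and the number of topics (2) is an arbitrary choice for illustration.

```matlab
str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str);
documents = erasePunctuation(removeStopWords(documents));
bag = bagOfWords(documents);

% Fit an LDA model with 2 topics (arbitrary for this illustration).
rng("default")   % for reproducibility
numTopics = 2;
mdl = fitlda(bag,numTopics,'Verbose',0);

% View the top words of the first topic.
topkwords(mdl,3,1)
```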

The trainWordEmbedding function supports tokenizedDocument or file input regardless of language. If you have a tokenizedDocument array or a file containing your data in the correct format, then you can use this function.
