
addPartOfSpeechDetails

Add part-of-speech tags to documents

Syntax

newDocuments = addPartOfSpeechDetails(documents)
newDocuments = addPartOfSpeechDetails(documents,'RetokenizeMethod',method)

Description


newDocuments = addPartOfSpeechDetails(documents) detects parts of speech in documents and updates the token details. By default, the function retokenizes the text for part-of-speech tagging; for example, it splits the word "you're" into the tokens "you" and "'re". To get the part-of-speech details from newDocuments, use tokenDetails.

newDocuments = addPartOfSpeechDetails(documents,'RetokenizeMethod',method) also specifies the method to use for retokenizing the documents.
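A minimal sketch of both call forms (the input text and variable names here are illustrative, not from the toolbox examples):

```matlab
% Tokenize a short English sentence (illustrative input).
documents = tokenizedDocument("you're the fairest");

% Default behavior: retokenize for part-of-speech tagging,
% so "you're" is split into "you" and "'re".
newDocuments = addPartOfSpeechDetails(documents);

% Alternative: tag without retokenizing the documents.
newDocumentsNone = addPartOfSpeechDetails(documents,'RetokenizeMethod','none');

% View the part-of-speech details.
tdetails = tokenDetails(newDocuments);
```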

Examples


Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

View the token details of the first few tokens.

tdetails = tokenDetails(documents);
head(tdetails)
ans=8×5 table
       Token       DocumentNumber    LineNumber     Type      Language
    ___________    ______________    __________    _______    ________

    "fairest"            1               1         letters       en   
    "creatures"          1               1         letters       en   
    "desire"             1               1         letters       en   
    "increase"           1               1         letters       en   
    "thereby"            1               1         letters       en   
    "beautys"            1               1         letters       en   
    "rose"               1               1         letters       en   
    "might"              1               1         letters       en   

Add part-of-speech details to the documents using the addPartOfSpeechDetails function. This function first adds sentence information to the documents, and then adds the part-of-speech tags to the table returned by tokenDetails. View the updated token details of the first few tokens.

documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×7 table
       Token       DocumentNumber    SentenceNumber    LineNumber     Type      Language     PartOfSpeech 
    ___________    ______________    ______________    __________    _______    ________    ______________

    "fairest"            1                 1               1         letters       en       adjective     
    "creatures"          1                 1               1         letters       en       noun          
    "desire"             1                 1               1         letters       en       verb          
    "increase"           1                 1               1         letters       en       noun          
    "thereby"            1                 1               1         letters       en       adverb        
    "beautys"            1                 1               1         letters       en       verb          
    "rose"               1                 1               1         letters       en       noun          
    "might"              1                 1               1         letters       en       auxiliary-verb

Tokenize Japanese text using tokenizedDocument.

str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで 苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"
    "駅までは遠くて、歩けない。"
    "遠くの駅まで歩けない。"
    "すもももももももものうち。"];
documents = tokenizedDocument(str);

For Japanese text, you can get the part-of-speech details using tokenDetails. For English text, you must first use addPartOfSpeechDetails.

tdetails = tokenDetails(documents);
head(tdetails)
ans=8×7 table
     Token     DocumentNumber    LineNumber       Type        Language    PartOfSpeech     Lemma 
    _______    ______________    __________    ___________    ________    ____________    _______

    "恋"             1               1         letters           ja       noun            "恋"   
    "に"             1               1         letters           ja       adposition      "に"   
    "悩み"           1               1         letters           ja       verb            "悩む"  
    "、"             1               1         punctuation       ja       punctuation     "、"   
    "苦しむ"          1               1         letters           ja       verb            "苦しむ"
    "。"             1               1         punctuation       ja       punctuation     "。"   
    "恋"             2               1         letters           ja       noun            "恋"   
    "の"             2               1         letters           ja       adposition      "の"   

Input Arguments


documents — Input documents
tokenizedDocument array

Input documents, specified as a tokenizedDocument array.

method — Method to retokenize documents
'part-of-speech' (default) | 'none'

Method to retokenize the documents, specified as one of the following:

  • 'part-of-speech' – Transform the tokens for part-of-speech tagging. The function performs these tasks:

    • Split compound words. For example, split the compound word "wanna" into the tokens "want" and "to". This includes compound words containing apostrophes. For example, the function splits the word "don't" into the tokens "do" and "n't".

    • Merge periods with preceding abbreviations. For example, merge the tokens "Mr" and "." into the token "Mr.".

    • Merge runs of periods into ellipses. For example, merge three instances of "." into the single token "...".

  • 'none' – Do not retokenize the documents.
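The retokenization tasks above can be illustrated with a short sketch (the input sentence is chosen for illustration; the comments restate the documented behavior):

```matlab
% Text containing an abbreviation and an apostrophe contraction.
documents = tokenizedDocument("Mr. Smith don't know.");

% With the default 'part-of-speech' method, the function splits
% "don't" into "do" and "n't" and merges "Mr" with the following
% period into the single token "Mr.".
newDocuments = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(newDocuments);

% With 'none', the original tokens are left unchanged.
unchanged = addPartOfSpeechDetails(documents,'RetokenizeMethod','none');
```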

Output Arguments


newDocuments — Updated documents
tokenizedDocument array

Updated documents, returned as a tokenizedDocument array. To get the part-of-speech details from newDocuments, use tokenDetails.

Algorithms

If the input documents do not contain sentence details, then the function first runs addSentenceDetails.
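A sketch of the equivalence this implies (variable names are illustrative):

```matlab
documents = tokenizedDocument("fairest creatures desire increase");

% Calling addPartOfSpeechDetails on documents without sentence details
% adds them first, so the token details include a SentenceNumber column.
newDocuments = addPartOfSpeechDetails(documents);

% This is equivalent to adding the sentence details explicitly beforehand.
withSentences = addSentenceDetails(documents);
alsoTagged = addPartOfSpeechDetails(withSentences);
```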

Introduced in R2018b