tokenizedDocument

Array of tokenized documents for text analysis

Description

A tokenized document is a document represented as a collection of words (also known as tokens) for use in text analysis.

Use tokenized documents for text analysis tasks such as detecting complex tokens (for example, web addresses and hashtags), removing stop words, and stemming or lemmatizing words.

Creation

Syntax

documents = tokenizedDocument
documents = tokenizedDocument(str)
documents = tokenizedDocument(str,Name,Value)

Description

documents = tokenizedDocument creates a scalar tokenized document with no tokens.
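
For instance, calling the function with no input arguments returns an empty document (a minimal sketch; the display below is indicative and can vary by release):

documents = tokenizedDocument

documents = 
  tokenizedDocument:

   0 tokens: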

documents = tokenizedDocument(str) tokenizes the elements of str and returns a tokenized document array.

documents = tokenizedDocument(str,Name,Value) specifies additional options using one or more name-value pair arguments.

Input Arguments

str – Input text

Input text, specified as a string array, character vector, cell array of character vectors, or cell array of string arrays.

If the input text has not already been split into words, then str must be a string array, character vector, cell array of character vectors, or a cell array of string scalars.

Example: ["an example of a short document";"a second short document"]

Example: 'an example of a short document'

Example: {'an example of a short document';'a second short document'}

Example: {"an example of a short document";"a second short document"}

If the input text has already been tokenized, then specify 'TokenizeMethod' to be 'none'. If str contains a single document, then it must be a string vector of words, a row cell array of character vectors, or a cell array containing a single string vector of words. If str contains multiple documents, then it must be a cell array of string arrays.

Example: ["an" "example" "document"]

Example: {'an','example','document'}

Example: {["an" "example" "of" "a" "short" "document"]}

Example: {["an" "example" "of" "a" "short" "document"];["a" "second" "short" "document"]}

Data Types: string | char | cell
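
For example, a minimal sketch that creates documents from text that is already split into words (using the 'TokenizeMethod','none' option described below):

str = {["an" "example" "document"];["a" "second" "document"]};
documents = tokenizedDocument(str,'TokenizeMethod','none')

documents = 
  2x1 tokenizedDocument:

    3 tokens: an example document
    3 tokens: a second document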

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'DetectPatterns',{'email-address','web-address'} detects email addresses and web addresses

Method to tokenize documents, specified as the comma-separated pair consisting of 'TokenizeMethod' and one of the following:

  • 'unicode' – Tokenize input text using rules based on Unicode Standard Annex #29 [1]. If str is a cell array, then the elements of str must be string scalars or character vectors. If 'Language' is 'en', then 'unicode' is the default.

  • 'mecab' – Tokenize Japanese text using the MeCab tokenizer [2]. If 'Language' is 'ja', then 'mecab' is the default.

  • 'none' – Do not tokenize the input text.

If the input text has already been tokenized, then specify 'TokenizeMethod' to be 'none'. If str contains a single document, then it must be a string vector of words, a row cell array of character vectors, or a cell array containing a single string vector of words. If str contains multiple documents, then it must be a cell array of string arrays.

If 'TokenizeMethod' is 'none', then the tokenDetails function returns an empty table. To add tokens with document and sentence numbers to the table, use addSentenceDetails.

Example: 'none'
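
As a sketch of this behavior (the exact table variables may differ by release), pre-tokenized input yields an empty details table until you add sentence details:

documents = tokenizedDocument({["an" "example" "document"]},'TokenizeMethod','none');
tdetails = tokenDetails(documents)     % empty table
documents = addSentenceDetails(documents);
tdetails = tokenDetails(documents)     % tokens with document and sentence numbers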

Patterns of complex tokens to detect, specified as the comma-separated pair consisting of 'DetectPatterns' and 'none', 'all', or a string or cell array containing one or more of the following:

  • 'email-address' – Detect email addresses. For example, treat user@domain.com as a single token.

  • 'web-address' – Detect web addresses. For example, treat www.mathworks.com as a single token.

  • 'hashtag' – Detect hashtags. For example, treat #MATLAB as a single token.

  • 'at-mention' – Detect at-mentions. For example, treat @MathWorks as a single token.

  • 'emoticon' – Detect emoticons. For example, treat :-D as a single token.

If 'DetectPatterns' is 'none', then the function does not detect any complex token patterns. If 'DetectPatterns' is 'all', then the function detects all the listed complex token patterns.

Example: 'DetectPatterns','hashtag'

Example: 'DetectPatterns',{'email-address','web-address'}

Data Types: char | string | cell

Top-level domains to use for web address detection, specified as the comma-separated pair consisting of 'TopLevelDomains' and a character vector, string array, or cell array of character vectors. By default, the function uses the output of topLevelDomains.

This option only applies if 'DetectPatterns' is 'all' or contains 'web-address'.

Example: 'TopLevelDomains',["com" "net" "org"]

Data Types: char | string | cell
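
For instance, a sketch that restricts web address detection to a custom list of top-level domains:

str = "visit www.mathworks.com for help";
documents = tokenizedDocument(str, ...
    'DetectPatterns','web-address', ...
    'TopLevelDomains',["com" "net" "org"])   % www.mathworks.com remains a single token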

Language, specified as the comma-separated pair consisting of 'Language' and one of the following:

  • 'en' – English. This option also sets the default value for 'TokenizeMethod' to 'unicode'.

  • 'ja' – Japanese. This option also sets the default value for 'TokenizeMethod' to 'mecab'.

If you do not specify a value, then the function detects the language from the input text using the corpusLanguage function.

This option specifies the language details of the tokens. To view the language details of the tokens, use tokenDetails. These language details determine the behavior of the removeStopWords, addPartOfSpeechDetails, normalizeWords, and addSentenceDetails functions on the tokens.

For more information about language support in Text Analytics Toolbox™, see Language Support.

Example: 'Language','ja'
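
For example, a sketch that sets the language explicitly instead of relying on automatic detection, so that functions such as removeStopWords use English stop words:

documents = tokenizedDocument("an example of a short sentence",'Language','en');
newDocuments = removeStopWords(documents)   % removes English stop words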

Properties

Vocabulary – Unique words in the documents

Unique words in the documents, specified as a string array. The words do not appear in any particular order.

Data Types: string
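
For instance, a sketch that inspects the vocabulary of a document (the words are not returned in any particular order):

documents = tokenizedDocument("a short example of a short sentence");
documents.Vocabulary   % unique words, for example: "a" "short" "example" "of" "sentence"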

Object Functions

erasePunctuation – Erase punctuation from text and documents
removeStopWords – Remove stop words from documents
removeWords – Remove selected words from documents or bag-of-words model
normalizeWords – Stem or lemmatize words
removeEmptyDocuments – Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
lower – Convert documents to lowercase
upper – Convert documents to uppercase
tokenDetails – Details of tokens in tokenized document array
addSentenceDetails – Add sentence numbers to documents
addPartOfSpeechDetails – Add part-of-speech tags to documents
addLanguageDetails – Add language identifiers to documents
addTypeDetails – Add token type details to documents
addLemmaDetails – Add lemma forms of tokens to documents
writeTextDocument – Write documents to text file
doclength – Length of documents in document array
context – Search documents for word occurrences in context
joinWords – Convert documents to string by joining words
doc2cell – Convert documents to cell array of string vectors
string – Convert scalar document to string vector
plus – Append documents
replace – Find and replace substrings in documents
docfun – Apply function to words in documents
regexprep – Replace text in words of documents using regular expression
wordcloud – Create word cloud chart from text, bag-of-words model, bag-of-n-grams model, or LDA model

Examples

Create tokenized documents from a string array.

str = [
    "an example of a short sentence" 
    "a second short sentence"]
str = 2x1 string array
    "an example of a short sentence"
    "a second short sentence"

documents = tokenizedDocument(str)
documents = 
  2x1 tokenizedDocument:

    6 tokens: an example of a short sentence
    4 tokens: a second short sentence

Create a tokenized document from the string str. By default, the function treats the hashtag "#MATLAB", the emoticon ":-D", and the web address "https://www.mathworks.com/help/" as single tokens.

str = "Learn how to analyze text in #MATLAB! :-D see https://www.mathworks.com/help/";
document = tokenizedDocument(str)
document = 
  tokenizedDocument:

   11 tokens: Learn how to analyze text in #MATLAB ! :-D see https://www.mathworks.com/help/

To detect only hashtags as complex tokens, specify the 'DetectPatterns' option as 'hashtag'. The function then splits the emoticon ":-D" and the web address "https://www.mathworks.com/help/" into multiple tokens.

document = tokenizedDocument(str,'DetectPatterns','hashtag')
document = 
  tokenizedDocument:

   24 tokens: Learn how to analyze text in #MATLAB ! : - D see https : / / www . mathworks . com / help /

Remove the stop words from an array of documents using removeStopWords. The tokenizedDocument function detects that the documents are in English, so removeStopWords removes English stop words.

documents = tokenizedDocument([
    "an example of a short sentence" 
    "a second short sentence"]);
newDocuments = removeStopWords(documents)
newDocuments = 
  2x1 tokenizedDocument:

    3 tokens: example short sentence
    3 tokens: second short sentence

Stem the words in a document array using the Porter stemmer.

documents = tokenizedDocument([
    "a strongly worded collection of words"
    "another collection of words"]);
newDocuments = normalizeWords(documents)
newDocuments = 
  2x1 tokenizedDocument:

    6 tokens: a strongli word collect of word
    4 tokens: anoth collect of word

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Search for the word "life".

tbl = context(documents,"life");
head(tbl)
ans=8×3 table
                            Context                             Document    Word
    ________________________________________________________    ________    ____

    "consumst thy self single life ah thou issueless shalt "        9        10 
    "ainted counterfeit lines life life repair times pencil"       16        35 
    "d counterfeit lines life life repair times pencil pupi"       16        36 
    " heaven knows tomb hides life shows half parts write b"       17        14 
    "he eyes long lives gives life thee                    "       18        69 
    "tender embassy love thee life made four two alone sink"       45        23 
    "ves beauty though lovers life beauty shall black lines"       63        50 
    "s shorn away live second life second head ere beautys "       68        27 

View the occurrences in a string array.

tbl.Context
ans = 23x1 string array
    "consumst thy self single life ah thou issueless shalt "
    "ainted counterfeit lines life life repair times pencil"
    "d counterfeit lines life life repair times pencil pupi"
    " heaven knows tomb hides life shows half parts write b"
    "he eyes long lives gives life thee                    "
    "tender embassy love thee life made four two alone sink"
    "ves beauty though lovers life beauty shall black lines"
    "s shorn away live second life second head ere beautys "
    "e rehearse let love even life decay lest wise world lo"
    "st bail shall carry away life hath line interest memor"
    "art thou hast lost dregs life prey worms body dead cow"
    "           thoughts food life sweetseasond showers gro"
    "tten name hence immortal life shall though once gone w"
    " beauty mute others give life bring tomb lives life fa"
    "ve life bring tomb lives life fair eyes poets praise d"
    " steal thyself away term life thou art assured mine li"
    "fe thou art assured mine life longer thy love stay dep"
    " fear worst wrongs least life hath end better state be"
    "anst vex inconstant mind life thy revolt doth lie o ha"
    " fame faster time wastes life thou preventst scythe cr"
    "ess harmful deeds better life provide public means pub"
    "ate hate away threw savd life saying                  "
    " many nymphs vowd chaste life keep came tripping maide"

Tokenize Japanese text using tokenizedDocument. The function automatically detects Japanese text.

str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"];
documents = tokenizedDocument(str)
documents = 
  4x1 tokenizedDocument:

     6 tokens: 恋 に 悩み 、 苦しむ 。
     6 tokens: 恋 の 悩み で 苦しむ 。
    10 tokens: 空 に 星 が 輝き 、 瞬い て いる 。
    10 tokens: 空 の 星 が 輝き を 増し て いる 。

Compatibility Considerations

Behavior changed in R2018b

References

[1] Unicode Text Segmentation. https://www.unicode.org/reports/tr29/

[2] MeCab: Yet Another Part-of-Speech and Morphological Analyzer. https://taku910.github.io/mecab/

Introduced in R2017b