tokenDetails

Details of tokens in tokenized document array

Description

tdetails = tokenDetails(documents) returns a table of token details for the tokens in the tokenizedDocument array documents.

Examples

Create a tokenized document array.

str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence and an emoticon. :)"
    "Here is another example document. :D"];
documents = tokenizedDocument(str);

View the token details of the first few tokens.

tdetails = tokenDetails(documents);
head(tdetails)
      Token       DocumentNumber    LineNumber       Type        Language
    __________    ______________    __________    ___________    ________

    "This"              1               1         letters           en   
    "is"                1               1         letters           en   
    "an"                1               1         letters           en   
    "example"           1               1         letters           en   
    "document"          1               1         letters           en   
    "."                 1               1         punctuation       en   
    "It"                1               1         letters           en   
    "has"               1               1         letters           en   

The Type variable contains the type of each token. View the emoticons in the documents.

idx = tdetails.Type == "emoticon";
tdetails(idx,:)
ans=2×5 table
    Token    DocumentNumber    LineNumber      Type      Language
    _____    ______________    __________    ________    ________

    ":)"           2               1         emoticon       en   
    ":D"           3               1         emoticon       en   

Create a tokenized document array.

str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence."
    "Here is another example document. It also has two sentences."];
documents = tokenizedDocument(str);

Add sentence details to the documents using addSentenceDetails. This function adds the sentence numbers to the table returned by tokenDetails. View the updated token details of the first few tokens.

documents = addSentenceDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
      Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language
    __________    ______________    ______________    __________    ___________    ________

    "This"              1                 1               1         letters           en   
    "is"                1                 1               1         letters           en   
    "an"                1                 1               1         letters           en   
    "example"           1                 1               1         letters           en   
    "document"          1                 1               1         letters           en   
    "."                 1                 1               1         punctuation       en   
    "It"                1                 2               1         letters           en   
    "has"               1                 2               1         letters           en   

View the token details of the second sentence of the third document.

idx = tdetails.DocumentNumber == 3 & ...
    tdetails.SentenceNumber == 2;
tdetails(idx,:)
ans=6×6 table
       Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language
    ___________    ______________    ______________    __________    ___________    ________

    "It"                 3                 2               1         letters           en   
    "also"               3                 2               1         letters           en   
    "has"                3                 2               1         letters           en   
    "two"                3                 2               1         letters           en   
    "sentences"          3                 2               1         letters           en   
    "."                  3                 2               1         punctuation       en   

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

View the token details of the first few tokens.

tdetails = tokenDetails(documents);
head(tdetails)
       Token       DocumentNumber    LineNumber     Type      Language
    ___________    ______________    __________    _______    ________

    "fairest"            1               1         letters       en   
    "creatures"          1               1         letters       en   
    "desire"             1               1         letters       en   
    "increase"           1               1         letters       en   
    "thereby"            1               1         letters       en   
    "beautys"            1               1         letters       en   
    "rose"               1               1         letters       en   
    "might"              1               1         letters       en   

Add part-of-speech details to the documents using the addPartOfSpeechDetails function. This function first adds sentence information to the documents, and then adds the part-of-speech tags to the table returned by tokenDetails. View the updated token details of the first few tokens.

documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
       Token       DocumentNumber    SentenceNumber    LineNumber     Type      Language     PartOfSpeech 
    ___________    ______________    ______________    __________    _______    ________    ______________

    "fairest"            1                 1               1         letters       en       adjective     
    "creatures"          1                 1               1         letters       en       noun          
    "desire"             1                 1               1         letters       en       noun          
    "increase"           1                 1               1         letters       en       noun          
    "thereby"            1                 1               1         letters       en       adverb        
    "beautys"            1                 1               1         letters       en       noun          
    "rose"               1                 1               1         letters       en       noun          
    "might"              1                 1               1         letters       en       auxiliary-verb

Input Arguments

documents

Input documents, specified as a tokenizedDocument array.

Output Arguments

Table of token details. tdetails has the following variables:

Token

Token text, returned as a string scalar.

DocumentNumber

Index of the document that the token belongs to, returned as a positive integer.

SentenceNumber

Sentence number of the token in the document, returned as a positive integer. If these details are missing, then first add sentence details to documents using the addSentenceDetails function.

LineNumber

Line number of the token in the document, returned as a positive integer.

Type

The type of token, returned as one of these types:

  • letters — string of letter characters only

  • digits — string of digits only

  • punctuation — string of punctuation and symbol characters only

  • email-address — detected email address

  • web-address — detected web address

  • hashtag — detected hashtag (starts with "#" character followed by a letter)

  • at-mention — detected at-mention (starts with "@" character, followed by 1 to 15 ASCII letter, digit, or underscore characters)

  • emoticon — detected emoticon

  • emoji — detected emoji

  • other — does not belong to the previous types and is not a custom type

If these details are missing, then first add type details to documents using the addTypeDetails function.
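
As a minimal sketch of how the Type variable can be used, the following code counts the tokens in each type category. The input text is illustrative and not taken from the examples above, and the sketch assumes that groupsummary (a general MATLAB table function) accepts the Type variable as a grouping variable.

```matlab
% Illustrative input text (an assumption, not from the examples above).
str = [ ...
    "Contact us at 3 pm today!"
    "Great news :) #release"];
documents = tokenizedDocument(str);
tdetails = tokenDetails(documents);

% Count the tokens that fall into each Type category.
typeCounts = groupsummary(tdetails,"Type")
```

Every token belongs to exactly one group, so the values in the GroupCount variable should sum to the number of rows in tdetails.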

Language

Language of the token, returned as one of these languages:

  • en — English

  • ja — Japanese

  • de — German

  • ko — Korean

These language details determine the behavior of the removeStopWords, addPartOfSpeechDetails, normalizeWords, addSentenceDetails, and addEntityDetails functions on the tokens.

If these details are missing, then first add language details to documents using the addLanguageDetails function.

For more information about language support in Text Analytics Toolbox™, see Language Considerations.
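
As a hedged sketch of how the Language variable is populated, the following code tokenizes German text with the language specified explicitly and then inspects the language details. The input text is illustrative, and the sketch assumes that your release of tokenizedDocument supports the Language name-value argument.

```matlab
% Illustrative German text (an assumption, not from the examples above).
str = "Guten Morgen. Wie geht es dir?";

% Specify the language explicitly instead of relying on automatic detection.
documents = tokenizedDocument(str,"Language","de");
tdetails = tokenDetails(documents);

% List the distinct language codes assigned to the tokens.
languages = unique(tdetails.Language)
```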

PartOfSpeech

Part of speech tag, returned as one of these tags:

  • adjective — Adjective

  • adposition — Adposition

  • adverb — Adverb

  • auxiliary-verb — Auxiliary verb

  • coord-conjunction — Coordinating conjunction

  • determiner — Determiner

  • interjection — Interjection

  • noun — Noun

  • numeral — Numeral

  • particle — Particle

  • pronoun — Pronoun

  • proper-noun — Proper noun

  • punctuation — Punctuation

  • subord-conjunction — Subordinating conjunction

  • symbol — Symbol

  • verb — Verb

  • other — Other

If these details are missing, then first add part-of-speech details to documents using the addPartOfSpeechDetails function.
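
The part-of-speech tags listed above can be used to select tokens by tag. The following sketch extracts the tokens tagged as nouns. The input sentence is illustrative, and the tags actually assigned depend on the tagger.

```matlab
% Illustrative sentence (an assumption, not from the examples above).
str = "The quick brown fox jumps over the lazy dog.";
documents = tokenizedDocument(str);

% Add part-of-speech tags, then retrieve the token details.
documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);

% Keep only the tokens tagged as nouns.
isNoun = tdetails.PartOfSpeech == "noun";
nouns = tdetails.Token(isNoun)
```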

Entity

Entity tag, returned as one of these tags:

  • location — detected location

  • organization — detected organization

  • person — detected person

  • other — detected entity, not belonging to the above categories

  • non-entity — no entity detected

If these details are missing, then first add entity details to documents using the addEntityDetails function.
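
As a sketch of how the Entity variable might be used, the following code adds entity details and keeps only the tokens for which an entity was detected. The input sentence is illustrative, and the entities actually detected depend on the model.

```matlab
% Illustrative sentence (an assumption, not from the examples above).
str = "Mary moved to Natick, Massachusetts, to work at MathWorks.";
documents = tokenizedDocument(str);

% Add entity tags, then retrieve the token details.
documents = addEntityDetails(documents);
tdetails = tokenDetails(documents);

% Keep the rows whose tag is anything other than non-entity.
isEntity = tdetails.Entity ~= "non-entity";
tdetails(isEntity,["Token" "Entity"])
```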

Lemma

Lemma form. If these details are missing, then first add lemma details to documents using the addLemmaDetails function.
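
A minimal sketch, assuming that addLemmaDetails is available in your release: the following code adds lemma details and shows each token next to its lemma form. The input sentence is illustrative.

```matlab
% Illustrative sentence (an assumption, not from the examples above).
str = "The children were running quickly.";
documents = tokenizedDocument(str);

% Add lemma forms, then retrieve the token details.
documents = addLemmaDetails(documents);
tdetails = tokenDetails(documents);

% Show each token alongside its lemma.
tdetails(:,["Token" "Lemma"])
```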

Head

Grammatical dependency head, returned as the index of the token that this token modifies. If these details are missing, then first add grammatical dependency details to documents using the addDependencyDetails function.

Dependency

Grammatical dependency type, returned as one of these tags.

The dependency types listed here are only a subset. For a complete list of dependency types, including subtypes, see [1].

  • acl — clausal modifier of noun (adnominal clause)

  • advcl — adverbial clause modifier

  • advmod — adverbial modifier

  • amod — adjectival modifier

  • appos — appositional modifier

  • aux — auxiliary

  • case — case marking

  • cc — coordinating conjunction

  • ccomp — clausal complement

  • clf — classifier

  • compound — compound

  • conj — conjunct

  • cop — copula

  • csubj — clausal subject

  • dep — unspecified dependency

  • det — determiner

  • discourse — discourse element

  • dislocated — dislocated elements

  • expl — expletive

  • fixed — fixed multiword expression

  • flat — flat multiword expression

  • goeswith — goes with

  • iobj — indirect object

  • list — list

  • mark — marker

  • nmod — nominal modifier

  • nsubj — nominal subject

  • nummod — numeric modifier

  • obj — object

  • obl — oblique nominal

  • orphan — orphan

  • parataxis — parataxis

  • punct — punctuation

  • reparandum — overridden disfluency

  • root — root

  • vocative — vocative

  • xcomp — open clausal complement

If these details are missing, then first add grammatical dependency details to documents using the addDependencyDetails function.
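
As a hedged sketch of how the Head and Dependency variables might be used together, the following code adds grammatical dependency details and selects the tokens tagged as root. The input sentence is illustrative. Depending on your release, addDependencyDetails might require an additional support package to be installed.

```matlab
% Illustrative sentence (an assumption, not from the examples above).
str = "The quick brown fox jumps over the lazy dog.";
documents = tokenizedDocument(str);

% Add grammatical dependency details, then retrieve the token details.
documents = addDependencyDetails(documents);
tdetails = tokenDetails(documents);

% Select the tokens whose dependency type is root.
isRoot = tdetails.Dependency == "root";
tdetails(isRoot,["Token" "Head" "Dependency"])
```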

References

[1] Universal Dependency Relations. https://universaldependencies.org/u/dep/index.html

Version History

Introduced in R2018a
