tokenDetails
Details of tokens in tokenized document array
Description
Examples
View Token Details of Documents
Create a tokenized document array.
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence and an emoticon. :)"
    "Here is another example document. :D"];
documents = tokenizedDocument(str);
View the token details of the first few tokens.
tdetails = tokenDetails(documents);
head(tdetails)
Token DocumentNumber LineNumber Type Language
__________ ______________ __________ ___________ ________
"This" 1 1 letters en
"is" 1 1 letters en
"an" 1 1 letters en
"example" 1 1 letters en
"document" 1 1 letters en
"." 1 1 punctuation en
"It" 1 1 letters en
"has" 1 1 letters en
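Because tokenDetails returns an ordinary MATLAB table, standard table functions can summarize it. The following is a minimal sketch rather than part of the original example: it assumes the same input strings as above and uses the MATLAB groupcounts function (R2019a or later) to count tokens per document and per token type. The names docCounts and typeCounts are illustrative only.

```matlab
% Rebuild the example documents so that this sketch is self-contained.
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence and an emoticon. :)"
    "Here is another example document. :D"];
documents = tokenizedDocument(str);
tdetails = tokenDetails(documents);

% Count the tokens in each document (one row per document).
docCounts = groupcounts(tdetails,"DocumentNumber");

% Count the tokens of each type across all documents.
% Only the types that actually occur are listed.
typeCounts = groupcounts(tdetails,"Type");
```

Each result is a table with one row per group and a GroupCount variable, so the counts can be indexed in the same way as tdetails.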
The Type variable contains the type of each token. View the emoticons in the documents.
idx = tdetails.Type == "emoticon";
tdetails(idx,:)
ans=2×5 table
Token DocumentNumber LineNumber Type Language
_____ ______________ __________ ________ ________
":)" 2 1 emoticon en
":D" 3 1 emoticon en
Add Sentence Details to Documents
Create a tokenized document array.
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence."
    "Here is another example document. It also has two sentences."];
documents = tokenizedDocument(str);
Add sentence details to the documents using addSentenceDetails. This function adds the sentence numbers to the table returned by tokenDetails. View the updated token details of the first few tokens.
documents = addSentenceDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
Token DocumentNumber SentenceNumber LineNumber Type Language
__________ ______________ ______________ __________ ___________ ________
"This" 1 1 1 letters en
"is" 1 1 1 letters en
"an" 1 1 1 letters en
"example" 1 1 1 letters en
"document" 1 1 1 letters en
"." 1 1 1 punctuation en
"It" 1 2 1 letters en
"has" 1 2 1 letters en
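Once SentenceNumber is available, the tokens can be grouped by sentence. The following is a minimal sketch under the same assumptions as this example (same input strings); it counts the tokens in every sentence with groupcounts and rebuilds the text of one sentence with join. The names sentenceCounts and firstSentence are illustrative and do not come from the toolbox.

```matlab
% Rebuild the example documents so that this sketch is self-contained.
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence."
    "Here is another example document. It also has two sentences."];
documents = tokenizedDocument(str);
documents = addSentenceDetails(documents);
tdetails = tokenDetails(documents);

% Count the tokens in each sentence of each document.
sentenceCounts = groupcounts(tdetails,["DocumentNumber","SentenceNumber"]);

% Join the tokens of the first sentence of the first document into
% a single string, with a space between consecutive tokens.
idx = tdetails.DocumentNumber == 1 & tdetails.SentenceNumber == 1;
firstSentence = join(tdetails.Token(idx));
```

Because join places a space between every pair of tokens, punctuation tokens are also separated from the preceding word, so the result is not identical to the original text.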
View the token details of the second sentence of the third document.
idx = tdetails.DocumentNumber == 3 & ...
tdetails.SentenceNumber == 2;
tdetails(idx,:)
ans=6×6 table
Token DocumentNumber SentenceNumber LineNumber Type Language
___________ ______________ ______________ __________ ___________ ________
"It" 3 2 1 letters en
"also" 3 2 1 letters en
"has" 3 2 1 letters en
"two" 3 2 1 letters en
"sentences" 3 2 1 letters en
"." 3 2 1 punctuation en
Add Part-of-Speech Details to Documents
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
View the token details of the first few tokens.
tdetails = tokenDetails(documents);
head(tdetails)
Token DocumentNumber LineNumber Type Language
___________ ______________ __________ _______ ________
"fairest" 1 1 letters en
"creatures" 1 1 letters en
"desire" 1 1 letters en
"increase" 1 1 letters en
"thereby" 1 1 letters en
"beautys" 1 1 letters en
"rose" 1 1 letters en
"might" 1 1 letters en
Add part-of-speech details to the documents using the addPartOfSpeechDetails function. This function first adds sentence information to the documents, and then adds the part-of-speech tags to the table returned by tokenDetails. View the updated token details of the first few tokens.
documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
Token DocumentNumber SentenceNumber LineNumber Type Language PartOfSpeech
___________ ______________ ______________ __________ _______ ________ ______________
"fairest" 1 1 1 letters en adjective
"creatures" 1 1 1 letters en noun
"desire" 1 1 1 letters en noun
"increase" 1 1 1 letters en noun
"thereby" 1 1 1 letters en adverb
"beautys" 1 1 1 letters en noun
"rose" 1 1 1 letters en noun
"might" 1 1 1 letters en auxiliary-verb
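The PartOfSpeech variable can be used to filter tokens in the same way as Type. The following is a minimal sketch, not part of the original example: to stay self-contained it uses a short illustrative string instead of the sonnets file, and the names nounDetails and nouns are hypothetical. The tags assigned to individual words depend on the tagger, so no particular output is assumed.

```matlab
% Illustrative input text (not the sonnets data from this example).
str = "This is an example document. It has two sentences.";
documents = tokenizedDocument(str);

% Add part-of-speech tags, then retrieve the token details.
documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);

% Keep only the rows that are tagged as nouns.
idx = tdetails.PartOfSpeech == "noun";
nounDetails = tdetails(idx,:);

% List each distinct noun token once.
nouns = unique(nounDetails.Token);
```

The same logical-indexing pattern applies to any other tag that appears in the PartOfSpeech variable, such as adjective or adverb.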
Input Arguments
documents
— Input documents
tokenizedDocument array
Input documents, specified as a tokenizedDocument array.
Output Arguments
tdetails
— Table of token details
table
Table of token details. tdetails has the following variables:
Name | Description |
---|---|
Token | Token text, returned as a string scalar. |
DocumentNumber | Index of the document that the token belongs to, returned as a positive integer. |
SentenceNumber | Sentence number of the token in the document, returned as a positive integer. If these details are missing, then first add sentence details to documents using the addSentenceDetails function. |
LineNumber | Line number of the token in the document, returned as a positive integer. |
Type | Type of the token, returned as a token type such as letters, punctuation, emoticon, emoji, or other. If these details are missing, then first add type details to documents using the addTypeDetails function. |
Language | Language of the token, returned as a language code such as en. These language details determine the behavior of language-dependent functions on the tokens. If these details are missing, then first add language details to documents using the addLanguageDetails function. For more information about language support in Text Analytics Toolbox™, see Language Considerations. |
PartOfSpeech | Part-of-speech tag, returned as a tag such as noun, adjective, adverb, or auxiliary-verb. If these details are missing, then first add part-of-speech details to documents using the addPartOfSpeechDetails function. |
Entity | Entity tag. If these details are missing, then first add entity details to documents using the addEntityDetails function. |
Lemma | Lemma form. If these details are missing, then first add lemma details to documents using the addLemmaDetails function. |
Head | Grammatical dependency head, returned as the index of the token that this token modifies. If these details are missing, then first add grammatical dependency details to documents using the addDependencyDetails function. |
Dependency | Grammatical dependency type, returned as a dependency tag. For a complete list of dependency types, including subtypes, see [1]. If these details are missing, then first add grammatical dependency details to documents using the addDependencyDetails function. |
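Several variables in this table appear only after the corresponding details have been added to the documents. The following is a minimal sketch of one way to check for a variable before relying on it, assuming a short illustrative input string; the variable name names is hypothetical, and the same check can be adapted to the other optional variables.

```matlab
% Illustrative input text.
documents = tokenizedDocument("This is an example document. It has two sentences.");
tdetails = tokenDetails(documents);

% List the variables currently present in the token details table.
names = string(tdetails.Properties.VariableNames);

% Add sentence details only if the SentenceNumber variable is missing,
% then retrieve the updated token details.
if ~ismember("SentenceNumber",names)
    documents = addSentenceDetails(documents);
    tdetails = tokenDetails(documents);
end
```

Checking the variable names first avoids an indexing error when code such as tdetails.SentenceNumber runs on documents that do not yet carry those details.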
References
[1] Universal Dependency Relations. https://universaldependencies.org/u/dep/index.html
Version History
Introduced in R2018a
R2018b: tokenDetails returns token type emoji for emoji characters
Starting in R2018b, tokenizedDocument detects emoji characters and the tokenDetails function reports these tokens with type "emoji". This makes it easier to analyze text containing emoji characters.
In R2018a, tokenDetails reports emoji characters with type "other".
To find the indices of the tokens with type "emoji" or "other", use the indices idx = tdetails.Type == "emoji" | tdetails.Type == "other", where tdetails is a table of token details.