addSentenceDetails

Add sentence numbers to documents

Use addSentenceDetails to add sentence information to documents.

The function supports English, Japanese, and German text.

Syntax

updatedDocuments = addSentenceDetails(documents)
updatedDocuments = addSentenceDetails(documents,Name,Value)

Description

updatedDocuments = addSentenceDetails(documents) detects the sentence boundaries in documents and updates the token details. To get the sentence details from updatedDocuments, use tokenDetails.

updatedDocuments = addSentenceDetails(documents,Name,Value) specifies additional options using one or more name-value pair arguments.

Tip

Call addSentenceDetails before using the lower, upper, erasePunctuation, normalizeWords, removeWords, and removeStopWords functions, because addSentenceDetails relies on information that these functions remove.
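As an illustration of this ordering, the following minimal sketch adds sentence details first and only then lowercases the text and erases punctuation. The input string is made up for this sketch.

```matlab
% Hypothetical input text, used only for this sketch.
str = "This is the first sentence. This is the second sentence.";
documents = tokenizedDocument(str);

% Add sentence details while letter case and punctuation are still intact.
documents = addSentenceDetails(documents);

% Apply preprocessing that removes information used for sentence detection.
documents = lower(documents);
documents = erasePunctuation(documents);

% The remaining tokens keep the sentence numbers assigned earlier.
tdetails = tokenDetails(documents);
```

Reversing the order would leave addSentenceDetails without the capitalization and periods it uses to find sentence boundaries.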

Examples

Create a tokenized document array.

str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence."
    "Here is another example document. It also has two sentences."];
documents = tokenizedDocument(str);

Add sentence details to the documents using addSentenceDetails. This function adds the sentence numbers to the table returned by tokenDetails. View the updated token details of the first few tokens.

documents = addSentenceDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×6 table
      Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language
    __________    ______________    ______________    __________    ___________    ________

    "This"              1                 1               1         letters           en   
    "is"                1                 1               1         letters           en   
    "an"                1                 1               1         letters           en   
    "example"           1                 1               1         letters           en   
    "document"          1                 1               1         letters           en   
    "."                 1                 1               1         punctuation       en   
    "It"                1                 2               1         letters           en   
    "has"               1                 2               1         letters           en   

View the token details of the second sentence of the third document.

idx = tdetails.DocumentNumber == 3 & ...
    tdetails.SentenceNumber == 2;
tdetails(idx,:)
ans=6×6 table
       Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language
    ___________    ______________    ______________    __________    ___________    ________

    "It"                 3                 2               1         letters           en   
    "also"               3                 2               1         letters           en   
    "has"                3                 2               1         letters           en   
    "two"                3                 2               1         letters           en   
    "sentences"          3                 2               1         letters           en   
    "."                  3                 2               1         punctuation       en   

Input Arguments

documents

Input documents, specified as a tokenizedDocument array.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'Abbreviations',["cm" "mm" "in"] specifies to detect sentence boundaries where these abbreviations are followed by a period and a capitalized sentence starter.

'Abbreviations'

List of abbreviations, specified as a string array, character vector, cell array of character vectors, or a table.

If Abbreviations is a string array, character vector, or cell array of character vectors, then the function treats these as regular abbreviations. If the next word is a capitalized sentence starter, then the function breaks at the trailing period. The function ignores any differences in the letter case of the abbreviations. Specify the sentence starters using the Starters name-value pair.
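A minimal sketch of the string array form follows. The input text is made up, and the abbreviations are written without the trailing period, following the ["cm" "mm" "in"] example on this page. Note that a list passed this way takes the place of the default list.

```matlab
% Hypothetical input text, used only for this sketch.
str = "The part is 5 cm. It fits the 12 mm slot.";
documents = tokenizedDocument(str);

% Treat "cm", "mm", and "in" as regular abbreviations.
documents = addSentenceDetails(documents,'Abbreviations',["cm" "mm" "in"]);

% Inspect the sentence number assigned to each token.
tdetails = tokenDetails(documents);
tdetails(:,["Token" "SentenceNumber"])
```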

To specify different behaviors when splitting sentences at abbreviations, specify Abbreviations as a table. The table must have variables named Abbreviation and Usage, where Abbreviation contains the abbreviations, and Usage contains the type of each abbreviation. The following table describes the possible values of Usage, and the behavior of the function when passed abbreviations of these types.

Usage: regular
Example abbreviation: "appt."
Behavior: If the next word is a capitalized sentence starter, then break at the trailing period. Otherwise, do not break at the trailing period.
    Example text: "Book an appt. We'll meet then."
    Detected sentences: "Book an appt." and "We'll meet then."
    Example text: "Book an appt. today."
    Detected sentences: "Book an appt. today."

Usage: inner
Example abbreviation: "Dr."
Behavior: Do not break after trailing period.
    Example text: "Dr. Smith."
    Detected sentences: "Dr. Smith."

Usage: reference
Example abbreviation: "fig."
Behavior: If the next token is not a number, then break at a trailing period. If the next token is a number, then do not break at the trailing period.
    Example text: "See fig. 3."
    Detected sentences: "See fig. 3."
    Example text: "Try a fig. They are nice."
    Detected sentences: "Try a fig." and "They are nice."

Usage: unit
Example abbreviation: "in."
Behavior: If the previous word is a number and the following word is a capitalized sentence starter, then break at a trailing period.
    Example text: "The height is 30 in. The width is 10 in."
    Detected sentences: "The height is 30 in." and "The width is 10 in."
Behavior: If the previous word is a number and the following word is not capitalized, then do not break at a trailing period.
    Example text: "The item is 10 in. wide."
    Detected sentences: "The item is 10 in. wide."
Behavior: If the previous word is not a number, then break at a trailing period.
    Example text: "Come in. Sit down."
    Detected sentences: "Come in." and "Sit down."

The default value is the output of the abbreviations function.
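The table form could be built along the following lines. This is a sketch under two assumptions that this page does not confirm: the abbreviations are written without the trailing period, and Usage is stored as a categorical array. The input text is made up.

```matlab
% Assumption: abbreviations are listed without the trailing period.
Abbreviation = ["appt"; "Dr"; "fig"; "in"];

% Assumption: Usage is given as a categorical array.
Usage = categorical(["regular"; "inner"; "reference"; "unit"]);

% The variable names must be Abbreviation and Usage.
tbl = table(Abbreviation,Usage);

% Hypothetical input text, used only for this sketch.
str = "See fig. 3. The height is 30 in. The width is 10 in.";
documents = tokenizedDocument(str);
documents = addSentenceDetails(documents,'Abbreviations',tbl);

% Inspect the sentence number assigned to each token.
tdetails = tokenDetails(documents);
tdetails(:,["Token" "SentenceNumber"])
```

Because the custom table takes the place of the default list, include every abbreviation that matters for your text.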

Tip

By default, the function treats single-letter abbreviations, such as "V.", and tokens with mixed single letters and periods, such as "U.S.A.", as regular abbreviations. You do not need to include these abbreviations in Abbreviations.

Example: ["cm" "mm" "in"]

Data Types: char | string | table | cell

'Starters'

Words that start a sentence, specified as a string array, character vector, or a cell array of character vectors. If a sentence starter appears capitalized after a regular abbreviation, then the function detects a sentence boundary at the trailing period. The function ignores any differences in the letter case of the sentence starters.

The default value is the output of the stopWords function.

Data Types: char | string | cell
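One way to use this option is to extend the default starters instead of replacing them. The following sketch appends a made-up starter, "tomorrow", to the output of stopWords and passes "appt" as a regular abbreviation. The input text is made up as well.

```matlab
% Start from the default sentence starters and append a hypothetical one.
words = stopWords;
starters = [words(:); "tomorrow"];

% Hypothetical input text, used only for this sketch.
str = "Book an appt. Tomorrow works best.";
documents = tokenizedDocument(str);

% "appt" is a regular abbreviation, so the boundary after "appt."
% depends on whether the next word is a capitalized sentence starter.
documents = addSentenceDetails(documents, ...
    'Abbreviations',"appt", ...
    'Starters',starters);

% Inspect the sentence number assigned to each token.
tdetails = tokenDetails(documents);
tdetails(:,["Token" "SentenceNumber"])
```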

Output Arguments

updatedDocuments

Updated documents, returned as a tokenizedDocument array. To get the token details from updatedDocuments, use tokenDetails.
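As a sketch of what can be done with the returned details, the following code counts the sentences in each document by taking the largest sentence number per document. The accumarray step is an illustration chosen here, not part of addSentenceDetails.

```matlab
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence."];
documents = tokenizedDocument(str);
updatedDocuments = addSentenceDetails(documents);
tdetails = tokenDetails(updatedDocuments);

% The largest sentence number in a document is its sentence count.
% Cast to double so that accumarray accepts the values as subscripts.
docIdx = double(tdetails.DocumentNumber);
sentIdx = double(tdetails.SentenceNumber);
numSentences = accumarray(docIdx,sentIdx,[],@max);
```

Here numSentences has one element per document, in document order.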

More About

Language Considerations

The addSentenceDetails function detects sentence boundaries based on punctuation characters and line number information. For English and German text, the function also uses a list of abbreviations passed to the function.

For other languages, you might need to specify your own list of abbreviations for sentence detection. To do this, use the 'Abbreviations' option of addSentenceDetails.

Algorithms

If emoticons or emoji characters appear after a terminating punctuation character, then the function splits the sentence after the emoticons and emoji.
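To see how this applies to your own text, inspect the sentence numbers around an emoticon. The following sketch uses a made-up string and assumes that tokenizedDocument keeps ":-)" as a single token.

```matlab
% Hypothetical input text containing an emoticon after "!".
str = "That was fun! :-) Let's do it again.";
documents = tokenizedDocument(str);
documents = addSentenceDetails(documents);

% Compare the sentence number of the emoticon with that of the
% tokens before and after it.
tdetails = tokenDetails(documents);
tdetails(:,["Token" "SentenceNumber" "Type"])
```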

Introduced in R2018a