docfun

Apply function to words in documents

Syntax

newDocuments = docfun(func,documents)

newDocuments = docfun(func,documents1,...,documentsN)

Description

newDocuments = docfun(func,documents) calls the function specified by the function handle func once for each element of documents, passing the words of that element as a string vector.

If func accepts exactly one input argument, then the words of newDocuments(i) are the output of func(string(documents(i))).
If func accepts two input arguments, then the words of newDocuments(i) are the output of func(string(documents(i)),details), where details contains the corresponding token details output by tokenDetails.
If func changes the number of words in the document, then docfun removes the token details from that document.
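
As a minimal sketch of the two-input form described above, the following tags each word with its token type. The example text and the variable name tagWithType are illustrative assumptions. The sketch relies on the Type variable of the table returned by tokenDetails and reshapes it to match the orientation of words before concatenating.

```matlab
% Illustrative input; any tokenizedDocument array would do.
documents = tokenizedDocument([ ...
    "an example of a short sentence"
    "a second short sentence"]);

% tagWithType is a hypothetical name. Because the handle accepts two
% inputs, docfun also passes the token details table for each document.
% details.Type is categorical, so convert it to string and reshape it
% to the same size as words before appending it to each word.
tagWithType = @(words,details) ...
    words + "_" + reshape(string(details.Type),size(words));

newDocuments = docfun(tagWithType,documents)
```

The function returns one output word per input word, so the number of words in each document is unchanged.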

docfun does not call func in any particular order, so func should not rely on the documents being processed in sequence.

newDocuments = docfun(func,documents1,...,documentsN) calls the function specified by the function handle func and passes elements of documents1,…,documentsN as string vectors of words, where N is the number of inputs to the function func. The words of newDocuments(i) are the output of func(string(documents1(i)),...,string(documentsN(i))).

The arrays documents1,…,documentsN must all be the same size.
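
As a hedged sketch of the multiple-input syntax, the following joins each word with a tag from a second document array using an anonymous function. The variable names and the "/" separator are illustrative choices. Because the function concatenates the two string vectors element by element, the sketch also assumes that corresponding documents contain the same number of words.

```matlab
% Two document arrays of the same size (2-by-1).
documents1 = tokenizedDocument([ ...
    "an example of a short sentence"
    "a second short sentence"]);
documents2 = tokenizedDocument([ ...
    "_det _noun _prep _det _adj _noun"
    "_det _adj _adj _noun"]);

% joinWithSlash is a hypothetical name. For each i, docfun passes
% string(documents1(i)) and string(documents2(i)) to the function.
% extractAfter(tags,1) drops the first character of each tag.
joinWithSlash = @(words,tags) words + "/" + extractAfter(tags,1);

newDocuments = docfun(joinWithSlash,documents1,documents2)
```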

Examples

Reverse Words in Documents

Apply reverse to each word in a document array.

documents = tokenizedDocument([ ...
    "an example of a short sentence" 
    "a second short sentence"])

documents = 
  2×1 tokenizedDocument:

    6 tokens: an example of a short sentence
    4 tokens: a second short sentence

func = @reverse;
newDocuments = docfun(func,documents)

newDocuments = 
  2×1 tokenizedDocument:

    6 tokens: na elpmaxe fo a trohs ecnetnes
    4 tokens: a dnoces trohs ecnetnes

Specify Document Function with Multiple Inputs

Tag words by combining the words from one document array with another, using the string function plus.

Create the first tokenizedDocument array. Erase the punctuation and convert the text to lowercase.

str = [ ...
    "An example of a short sentence."
    "A second short sentence."];
str = erasePunctuation(str);
str = lower(str);
documents1 = tokenizedDocument(str)

documents1 = 
  2×1 tokenizedDocument:

    6 tokens: an example of a short sentence
    4 tokens: a second short sentence

Create the second tokenizedDocument array. Each document has the same number of words as the corresponding document in documents1. The words of documents2 are part-of-speech (POS) tags for the corresponding words.

documents2 = tokenizedDocument([ ...
    "_det _noun _prep _det _adj _noun"
    "_det _adj _adj _noun"])

documents2 = 
  2×1 tokenizedDocument:

    6 tokens: _det _noun _prep _det _adj _noun
    4 tokens: _det _adj _adj _noun

func = @plus;
newDocuments = docfun(func,documents1,documents2)

newDocuments = 
  2×1 tokenizedDocument:

    6 tokens: an_det example_noun of_prep a_det short_adj sentence_noun
    4 tokens: a_det second_adj short_adj sentence_noun

The output is not the same as calling plus on the documents directly, which instead concatenates the tokens of corresponding documents.

plus(documents1,documents2)

ans = 
  2×1 tokenizedDocument:

    12 tokens: an example of a short sentence _det _noun _prep _det _adj _noun
     8 tokens: a second short sentence _det _adj _adj _noun

Input Arguments

`func` — Function handle
function handle

Function to apply to the words in documents, specified as a function handle. func must return a string array and must accept string(documents1(i)),...,string(documentsN(i)) as input.

The function must have one of these syntaxes:

newWords = func(words), where words is a string array of the words of a single document.
newWords = func(words,details), where words is a string array of the words of a single document, and details is the corresponding table of token details given by tokenDetails.
newWords = func(words1,...,wordsN), where words1,...,wordsN are string arrays of words.

Example: @reverse

Data Types: function_handle

`documents` — Input documents
`tokenizedDocument` array

Input documents, specified as a tokenizedDocument array.

Output Arguments

`newDocuments` — Output documents
`tokenizedDocument` array

Output documents, returned as a tokenizedDocument array.

Version History

Introduced in R2017b
