rakeKeywords
Description
tbl = rakeKeywords(documents) extracts keywords and respective scores using the Rapid Automatic Keyword Extraction (RAKE) algorithm. The function supports English, Japanese, German, and Korean text. To learn how to use rakeKeywords for other languages, see Language Considerations.
tbl = rakeKeywords(documents,Name=Value) specifies additional options using one or more name-value arguments.
Tip
The rakeKeywords function, by default, extracts keywords using stop words and punctuation characters. When using the default values for the Delimiters and MergingDelimiters options, do not remove stop words or punctuation characters from the input text.
Examples
Extract Keywords Using RAKE
Create an array of tokenized documents containing the text data.
textData = [
    "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
    "Analyze text and images. You can import text and images."
    "Analyze text and images. Analyze text, images, and videos in MATLAB."];
documents = tokenizedDocument(textData);
Extract the keywords using the rakeKeywords function.
tbl = rakeKeywords(documents)
tbl=12×3 table
                     Keyword                     DocumentNumber    Score
    _________________________________________    ______________    _____
    "MATLAB"        "provides"    "tools"              1             8
    "MATLAB"        ""            ""                   1             2
    "scientists"    "and"         "engineers"          1             2
    "scientists"    ""            ""                   1             1
    "engineers"     ""            ""                   1             1
    "Analyze"       "text"        ""                   2             4
    "import"        "text"        ""                   2             4
    "images"        ""            ""                   2             1
    "Analyze"       "text"        ""                   3             4
    "images"        ""            ""                   3             1
    "videos"        ""            ""                   3             1
    "MATLAB"        ""            ""                   3             1
If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".
For readability, transform the multi-word keywords into a single string using the join and strip functions.
if size(tbl.Keyword,2) > 1
    tbl.Keyword = strip(join(tbl.Keyword));
end
tbl
tbl=12×3 table
             Keyword              DocumentNumber    Score
    __________________________    ______________    _____
    "MATLAB provides tools"             1             8
    "MATLAB"                            1             2
    "scientists and engineers"          1             2
    "scientists"                        1             1
    "engineers"                         1             1
    "Analyze text"                      2             4
    "import text"                       2             4
    "images"                            2             1
    "Analyze text"                      3             4
    "images"                            3             1
    "videos"                            3             1
    "MATLAB"                            3             1
Specify Maximum Number of Keywords Per Document
Create an array of tokenized documents containing the text data.
textData = [
    "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
    "Analyze text and images. You can import text and images."
    "Analyze text and images. Analyze text, images, and videos in MATLAB."];
documents = tokenizedDocument(textData);
Extract the top two keywords per document by using the rakeKeywords function and setting the MaxNumKeywords option to 2.
tbl = rakeKeywords(documents,MaxNumKeywords=2)
tbl=6×3 table
                 Keyword                  DocumentNumber    Score
    __________________________________    ______________    _____
    "MATLAB"     "provides"    "tools"          1             8
    "MATLAB"     ""            ""               1             2
    "Analyze"    "text"        ""               2             4
    "import"     "text"        ""               2             4
    "Analyze"    "text"        ""               3             4
    "images"     ""            ""               3             1
If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".
For readability, transform the multi-word keywords into a single string using the join and strip functions.
if size(tbl.Keyword,2) > 1
    tbl.Keyword = strip(join(tbl.Keyword));
end
tbl
tbl=6×3 table
            Keyword            DocumentNumber    Score
    _______________________    ______________    _____
    "MATLAB provides tools"          1             8
    "MATLAB"                         1             2
    "Analyze text"                   2             4
    "import text"                    2             4
    "Analyze text"                   3             4
    "images"                         3             1
Input Arguments
documents — Input documents
tokenizedDocument array | string array | cell array of character vectors
Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.
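As a minimal sketch of the input form that is not a tokenizedDocument array, the following passes a single document as a row vector of words. The sentence is illustrative and does not come from this page. The period is kept as its own element so that the default punctuation delimiters can still split the text.

```matlab
% A single document given as a 1-by-N string array, one word per element.
% The sentence is an illustrative assumption.
words = ["MATLAB" "provides" "tools" "for" "scientists" "and" "engineers" "."];

% Extract keywords from the single document.
tbl = rakeKeywords(words);
```

To process several documents in one call, convert the text with tokenizedDocument instead, as in the examples above.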
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: rakeKeywords(documents,MaxNumKeywords=20) returns at most 20 keywords per document.
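The two name-value syntaxes described above can be compared side by side. This is a short sketch that reuses one sentence from the examples on this page. The variable names tblNew and tblOld are illustrative.

```matlab
documents = tokenizedDocument("Analyze text and images. You can import text and images.");

% R2021a and later: Name=Value syntax.
tblNew = rakeKeywords(documents,MaxNumKeywords=20);

% Before R2021a: comma-separated name and value, with the name in quotes.
tblOld = rakeKeywords(documents,"MaxNumKeywords",20);
```

Both calls specify the same option, so they are expected to return the same table.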
MaxNumKeywords — Maximum number of keywords to return per document
Inf (default) | positive integer
Maximum number of keywords to return per document, specified as a positive integer or Inf.
If MaxNumKeywords is Inf, then the function returns all identified keywords.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
IgnoreKeywordCase — Option to ignore keyword case
0 (false) (default) | 1 (true)
Option to ignore keyword case, specified as one of the following:
0 (false) – Extract case-sensitive keywords.
1 (true) – Extract keywords ignoring case. Use this option when you expect the same keywords to appear with variations in letter case and want to treat them as the same keyword, for example, the words "analytics", "Analytics", and "ANALYTICS".
When IgnoreKeywordCase is 1, the function returns keywords with the most commonly occurring letter case pattern. When two or more patterns appear with the same frequency, the function returns the keyword with the letter case pattern that occurs first in the input.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical
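A brief sketch of the IgnoreKeywordCase option follows. The input text is an illustrative assumption in which the same phrase appears with three letter case patterns.

```matlab
% Illustrative text: "text analytics" appears in three letter case patterns.
str = "Text analytics is useful. Many teams use Text Analytics. TEXT ANALYTICS scales.";
documents = tokenizedDocument(str);

% Treat keywords that differ only by letter case as the same keyword.
tbl = rakeKeywords(documents,IgnoreKeywordCase=true);
```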
Delimiters — Tokens for splitting documents into keywords
string array | character vector | cell array of character vectors
Tokens for splitting documents into keywords, specified as a string array, a character vector, or a cell array of character vectors. If Delimiters is a character vector, then it must represent a single delimiter.
The default list of delimiters is a list of punctuation characters.
If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.
To specify delimiters for merging, use the MergingDelimiters option.
Data Types: char | string | cell
MergingDelimiters — Delimiters also used for merging keywords
string array | character vector | cell array of character vectors
Delimiters also used for merging keywords, specified as a string array, a character vector, or a cell array of character vectors. If MergingDelimiters is a character vector, then it must represent a single delimiter.
The default list of merging delimiters is the list of stop words given by the stopWords function.
If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.
To specify delimiters that should not be used for merging, use the Delimiters option.
Data Types: char | string | cell
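The following sketch shows how the Delimiters and MergingDelimiters options might be set together. The delimiter lists delims and mergingDelims are illustrative assumptions, not the defaults that rakeKeywords uses.

```matlab
documents = tokenizedDocument("Analyze text and images. Import text, images, and videos.");

% Illustrative delimiter lists (assumptions, not the function defaults).
delims = ["." "," "!" "?"];         % split only, never merged into keywords
mergingDelims = ["and" "of" "the"]; % split, and eligible for merging

tbl = rakeKeywords(documents, ...
    Delimiters=delims, ...
    MergingDelimiters=mergingDelims);
```

Words placed in mergingDelims can end up inside a merged keyword, whereas tokens in delims only separate candidates.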
IgnoreDelimiterCase — Option to ignore delimiter case
1 (true) (default) | 0 (false)
Option to ignore delimiter case, specified as one of the following:
1 (true) – Ignore delimiter case.
0 (false) – Use case-sensitive delimiters. Use this option when you expect keywords and delimiters that differ only by case, for example, the delimiter "and" and the acronym "AND".
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical
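A short sketch of case-sensitive delimiters follows. The sentence is an illustrative assumption in which the acronyms "AND" and "OR" differ from common stop words only by case.

```matlab
% Illustrative text that mixes the acronym "AND" with the word "and".
documents = tokenizedDocument("The circuit uses AND gates and OR gates.");

% Match delimiters case-sensitively so that only the lowercase forms
% are candidates for matching the delimiter lists.
tbl = rakeKeywords(documents,IgnoreDelimiterCase=false);
```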
Output Arguments
tbl — Extracted keywords and scores
table
Extracted keywords and scores, returned as a table with the following variables:
Keyword – Extracted keyword, returned as a 1-by-maxNgramLength string array, where maxNgramLength is the number of words in the longest keyword.
DocumentNumber – Number of the document containing the corresponding keyword.
Score – Score of the keyword.
If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.
If a keyword contains multiple words, then the ith element of the corresponding string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".
For more information, see Rapid Automatic Keyword Extraction.
More About
Language Considerations
The rakeKeywords function supports English, Japanese, German, and Korean text only.
The rakeKeywords function extracts keywords using a delimiter-based approach to identify candidate keywords. By default, the function uses two sets of delimiters: punctuation characters and the stop words given by the stopWords function, with the language given by the language details of the input documents.
For other languages, specify an appropriate set of delimiters using the Delimiters and MergingDelimiters options.
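As a hedged sketch of the approach for other languages, the following supplies both delimiter lists for French text. The sentence and the short stop word list are illustrative assumptions. A realistic application would need a fuller stop word list for the language.

```matlab
% Hypothetical French text; the sentence and word lists are illustrative.
str = "Le traitement du texte et des images. Analyse du texte, des images et des vidéos.";
documents = tokenizedDocument(str);

% Punctuation used only for splitting.
delims = ["." "," ";" ":" "!" "?"];

% Small, incomplete list of French stop words used as merging delimiters.
mergingDelims = ["le" "la" "les" "du" "de" "des" "et" "un" "une"];

tbl = rakeKeywords(documents, ...
    Delimiters=delims, ...
    MergingDelimiters=mergingDelims);
```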
Tips
You can experiment with different keyword extraction algorithms to see what works best with your data. Because the RAKE algorithm uses a delimiter-based approach to extract candidate keywords, the extracted keywords can be very long. Alternatively, you can try extracting keywords using the TextRank algorithm, which starts with individual tokens as candidate keywords and then merges them when appropriate. To extract keywords using TextRank, use the textrankKeywords function. To learn more, see Extract Keywords from Text Data Using TextRank.
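One way to compare the two algorithms is to run both functions on the same documents. This sketch reuses the text data from the examples on this page. The variable names tblRake and tblTextRank are illustrative.

```matlab
textData = [
    "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
    "Analyze text and images. You can import text and images."];
documents = tokenizedDocument(textData);

% Delimiter-based candidates (RAKE).
tblRake = rakeKeywords(documents);

% Token-based candidates that are merged when appropriate (TextRank).
tblTextRank = textrankKeywords(documents);
```

The scores from the two functions come from different algorithms, so compare the extracted keywords rather than the raw score values.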
Algorithms
Rapid Automatic Keyword Extraction
For each document, the rakeKeywords function extracts keywords independently using the following steps based on [1]:
1. Determine candidate keywords:
   - Extract sequences of tokens between the delimiters specified by the Delimiters and MergingDelimiters options. The function treats each sequence as a single candidate keyword.
2. Calculate scores for the candidate keywords:
   - Create an undirected, unweighted graph with nodes corresponding to the individual tokens in the candidate keywords.
   - Add edges between nodes where tokens co-occur in a candidate keyword, including self co-occurrences, weighted by the number of candidate keywords containing that co-occurrence.
   - Score each token using the formula deg(token) / freq(token), where deg(token) is the number of edges for the specified token and freq(token) is the number of times that the specified token occurs in the document.
   - For each candidate keyword, assign a score given by the sum of scores of the contained tokens.
3. Extract top keywords from candidates:
   - If there are multiple instances of the same pair of candidate keywords separated by the same single merging delimiter, then merge the candidate keywords and the delimiter into a single keyword and sum the corresponding scores.
   - Return the top k keywords, where k is given by the MaxNumKeywords option.
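The scoring step can be sketched in plain MATLAB for a single document. This is an illustration under stated assumptions, not the internal implementation of rakeKeywords. The candidate list is read off the example output for the first document on this page, and deg(token) is computed as in the standard RAKE formulation, where each occurrence of a token adds the length of its candidate keyword. The merging step is not implemented.

```matlab
% Candidate keywords for one document, one string row vector per occurrence.
% Assumption: these are the candidates for the first example document.
candidates = {
    ["MATLAB" "provides" "tools"]
    "scientists"
    "engineers"
    "MATLAB"
    "scientists"
    "engineers"};

% Unique tokens across all candidates.
tokens = unique([candidates{:}]);

% Accumulate deg(token) and freq(token).
deg = zeros(size(tokens));
freq = zeros(size(tokens));
for i = 1:numel(candidates)
    c = candidates{i};
    for j = 1:numel(c)
        idx = tokens == c(j);
        freq(idx) = freq(idx) + 1;
        % Each occurrence co-occurs with every token in its candidate,
        % including itself.
        deg(idx) = deg(idx) + numel(c);
    end
end
tokenScore = deg ./ freq;

% Score each candidate as the sum of the scores of its tokens.
keywordScore = zeros(numel(candidates),1);
for i = 1:numel(candidates)
    [~,loc] = ismember(candidates{i},tokens);
    keywordScore(i) = sum(tokenScore(loc));
end
```

For these candidates, keywordScore is 8 for "MATLAB provides tools", 2 for "MATLAB", and 1 for each occurrence of "scientists" and "engineers". Summing the two single-word scores gives 2 for the merged keyword "scientists and engineers". These values agree with the Score column of the first example.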
Language Details
tokenizedDocument objects contain details about the tokens including language details. The language details of the input documents determine the behavior of rakeKeywords. The tokenizedDocument function, by default, automatically detects the language of the input text. To specify the language details manually, use the Language option of tokenizedDocument. To view the token details, use the tokenDetails function.
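A short sketch of setting and inspecting the language details follows. The sentence is taken from the examples on this page, and the variable name tdetails is illustrative.

```matlab
% Specify the language manually instead of relying on automatic detection.
str = "MATLAB provides tools for scientists and engineers.";
documents = tokenizedDocument(str,Language="en");

% Inspect the token details, which include the language of each token.
tdetails = tokenDetails(documents);
head(tdetails)

% rakeKeywords uses the language details to choose its default delimiters.
tbl = rakeKeywords(documents);
```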
References
[1] Rose, Stuart, Dave Engel, Nick Cramer, and Wendy Cowley. "Automatic keyword extraction from individual documents." Text mining: applications and theory 1 (2010): 1-20.
Version History
Introduced in R2020b