topkngrams
Most frequent n-grams
Description
tbl = topkngrams(___,Name,Value) specifies additional options using one or more name-value pair arguments.
Examples
Most Frequent Bigrams of Bag-of-N-Grams Model
Create a table of the most frequent bigrams of a bag-of-n-grams model.
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Create a bag-of-n-grams model.
bag = bagOfNgrams(documents)
bag =
  bagOfNgrams with properties:

          Counts: [154×8799 double]
      Vocabulary: [1×3092 string]
          Ngrams: [8799×2 string]
    NgramLengths: 2
       NumNgrams: 8799
    NumDocuments: 154
Find the top 5 bigrams.
tbl = topkngrams(bag)
tbl=5×3 table
Ngram Count NgramLength
________________ _____ ___________
"thou" "art" 34 2
"mine" "eye" 15 2
"thy" "self" 14 2
"thou" "dost" 13 2
"mine" "own" 13 2
Find the top 10 bigrams.
tbl = topkngrams(bag,10)
tbl=10×3 table
Ngram Count NgramLength
_________________ _____ ___________
"thou" "art" 34 2
"mine" "eye" 15 2
"thy" "self" 14 2
"thou" "dost" 13 2
"mine" "own" 13 2
"thy" "sweet" 12 2
"thy" "love" 11 2
"dost" "thou" 10 2
"thou" "wilt" 10 2
"love" "thee" 9 2
Count N-Grams of Different Lengths
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Create a bag-of-n-grams model. To count n-grams of length 2 and 3 (bigrams and trigrams), specify 'NgramLengths' to be the vector [2 3].
bag = bagOfNgrams(documents,'NgramLengths',[2 3])
bag =
  bagOfNgrams with properties:

          Counts: [154×18022 double]
      Vocabulary: [1×3092 string]
          Ngrams: [18022×3 string]
    NgramLengths: [2 3]
       NumNgrams: 18022
    NumDocuments: 154
View the 10 most common n-grams of length 2 (bigrams).
topkngrams(bag,10,'NgramLengths',2)
ans=10×3 table
Ngram Count NgramLength
_______________________ _____ ___________
"thou" "art" "" 34 2
"mine" "eye" "" 15 2
"thy" "self" "" 14 2
"thou" "dost" "" 13 2
"mine" "own" "" 13 2
"thy" "sweet" "" 12 2
"thy" "love" "" 11 2
"dost" "thou" "" 10 2
"thou" "wilt" "" 10 2
"love" "thee" "" 9 2
View the 10 most common n-grams of length 3 (trigrams).
topkngrams(bag,10,'NgramLengths',3)
ans=10×3 table
Ngram Count NgramLength
____________________________ _____ ___________
"thy" "sweet" "self" 4 3
"why" "dost" "thou" 4 3
"thy" "self" "thy" 3 3
"thou" "thy" "self" 3 3
"mine" "eye" "heart" 3 3
"thou" "shalt" "find" 3 3
"fair" "kind" "true" 3 3
"thou" "art" "fair" 2 3
"love" "thy" "self" 2 3
"thy" "self" "thou" 2 3
Input Arguments
bag
— Input bag-of-n-grams model
bagOfNgrams object
Input bag-of-n-grams model, specified as a bagOfNgrams object.
k
— Number of n-grams
positive integer | Inf
Number of n-grams to return, specified as a positive integer or Inf.
If k is Inf, then the function returns all n-grams. The function sorts the n-grams in descending order of frequency.
Example: 20
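As a minimal sketch of passing Inf for k, the following uses two made-up documents rather than the sonnets data. It assumes Text Analytics Toolbox is installed, and the variable names are illustrative only.

```matlab
% Two short, made-up documents (illustrative data only).
str = [
    "to be or not to be"
    "to be is to do"];
documents = tokenizedDocument(str);

% Bag of bigrams (the default n-gram length is 2).
bag = bagOfNgrams(documents);

% Request every n-gram in the model by passing Inf for k.
tblAll = topkngrams(bag,Inf);

% Compare the number of returned rows with the number of n-grams in the model.
numRows = height(tblAll);
numNgramsInBag = bag.NumNgrams;
```

With k set to Inf, numRows is expected to equal numNgramsInBag, and the Count column should be in descending order.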
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'NgramLengths',[2 3] specifies to return the top bigrams and trigrams.
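The two ways of writing a name-value argument can be compared side by side. This is a sketch on a made-up sentence, assuming R2021a or later for the Name=Value form; the variable names are illustrative.

```matlab
% A single made-up document (illustrative data only).
documents = tokenizedDocument("the cat sat on the mat the cat slept");

% Count both unigrams and bigrams.
bag = bagOfNgrams(documents,'NgramLengths',[1 2]);

% Quoted name followed by a comma and the value.
tblQuoted = topkngrams(bag,3,'NgramLengths',2);

% Name=Value form (assumes R2021a or later).
tblEquals = topkngrams(bag,3,NgramLengths=2);
```

Both calls are expected to return the same table, so the choice between them is mainly a matter of release compatibility and style.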
NgramLengths
— N-gram lengths
positive integer | vector of positive integers
N-gram lengths, specified as the comma-separated pair consisting of 'NgramLengths' and a positive integer or a vector of positive integers.
If you specify NgramLengths, then the function returns n-grams of these lengths only. If you do not specify NgramLengths, then the function returns the top n-grams regardless of length.
Example: [1 2 3]
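To illustrate the difference between omitting NgramLengths and restricting it, here is a small sketch on made-up documents. The data and variable names are assumptions for illustration only.

```matlab
% Two made-up documents that share a three-word prefix (illustrative data only).
str = [
    "the quick brown fox"
    "the quick brown dog"];
documents = tokenizedDocument(str);

% Count unigrams, bigrams, and trigrams in one model.
bag = bagOfNgrams(documents,'NgramLengths',[1 2 3]);

% No NgramLengths: the top n-grams are chosen regardless of length.
tblAny = topkngrams(bag,4);

% Restrict the result to bigrams and trigrams only.
tblSelected = topkngrams(bag,4,'NgramLengths',[2 3]);

% Lengths that actually appear in each result.
lengthsAny = unique(tblAny.NgramLength);
lengthsSelected = unique(tblSelected.NgramLength);
```

In tblSelected, the NgramLength column should contain only the values 2 and 3, while tblAny may also include unigrams.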
IgnoreCase
— Option to ignore case
false (default) | true
Option to ignore case, specified as the comma-separated pair consisting of 'IgnoreCase' and one of the following:
false – treat n-grams differing only by case as separate n-grams.
true – treat n-grams differing only by case as the same n-gram and merge counts.
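The effect of IgnoreCase can be seen on a pair of made-up documents whose first bigram differs only by capitalization. This is a sketch under that assumption, with illustrative variable names.

```matlab
% Two made-up documents; "Deep learning" and "deep learning" differ only by case.
str = [
    "Deep learning is fun"
    "deep learning is hard"];
documents = tokenizedDocument(str);
bag = bagOfNgrams(documents);   % bigrams by default

% Default (case sensitive): the two capitalizations are counted separately.
tblSensitive = topkngrams(bag,Inf);

% Ignore case: n-grams differing only by case are merged and their counts combined.
tblInsensitive = topkngrams(bag,Inf,'IgnoreCase',true);

% Number of distinct n-grams reported by each call.
numSensitive = height(tblSensitive);
numInsensitive = height(tblInsensitive);
```

Here the case-sensitive table should list five distinct bigrams and the case-insensitive table four, with the total count unchanged.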
ForceCellOutput
— Indicator for forcing output to be returned as cell array
false (default) | true
Indicator for forcing output to be returned as a cell array, specified as the comma-separated pair consisting of 'ForceCellOutput' and true or false.
Data Types: logical
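As a sketch of how ForceCellOutput changes the return type for a scalar model, the following uses a single made-up document; the variable names are illustrative.

```matlab
% A single made-up document (illustrative data only).
documents = tokenizedDocument("a rose is a rose");
bag = bagOfNgrams(documents);

% Default for a scalar model: the result is a table.
tbl = topkngrams(bag,2);

% Forced cell output: the result is a cell array containing a table.
c = topkngrams(bag,2,'ForceCellOutput',true);

% Unwrap the table from the cell array.
tblFromCell = c{1};
```

Forcing cell output can be convenient when the same downstream code has to handle both scalar and non-scalar bag inputs, because the result can then always be indexed with braces.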
Output Arguments
tbl
— Top n-grams
table | cell array of tables
Top n-grams, returned as a table or a cell array of tables. The function sorts the n-grams in descending order of frequency.
The table has the following columns:
Ngram | N-gram, specified as a string vector. |
Count | Number of times the n-gram appears in the bag-of-n-grams model. |
NgramLength | Length of the n-gram. |
If bag is a non-scalar array or 'ForceCellOutput' is true, then the function returns the outputs as a cell array of tables. Each element in the cell array is a table containing the top n-grams of the corresponding element of bag.
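Because the Ngram column stores each n-gram as a row of strings, it can help to join the words into a single phrase for display. The following is a sketch on a made-up document, using the standard MATLAB string functions join and strip; the variable names are illustrative.

```matlab
% A single made-up document (illustrative data only).
documents = tokenizedDocument("a rose is a rose is a rose");
bag = bagOfNgrams(documents);

% Top two bigrams as a table with columns Ngram, Count, and NgramLength.
tbl = topkngrams(bag,2);

% Join the words of each n-gram along the second dimension.
% strip removes trailing spaces left by "" padding when lengths are mixed.
phrases = strip(join(tbl.Ngram," ",2));

% Pair each phrase with its count in a new table.
summary = table(phrases,tbl.Count,'VariableNames',["Phrase" "Count"]);
```

In this example the most frequent bigram is "a rose", which appears three times.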
Version History
Introduced in R2018a
See Also
bagOfWords | bagOfNgrams | removeInfrequentNgrams | removeNgrams | topkwords | tfidf | tokenizedDocument