trainWordEmbedding
Train word embedding
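Syntax
emb = trainWordEmbedding(filename)
emb = trainWordEmbedding(documents)
emb = trainWordEmbedding(___,Name,Value)
Description
emb = trainWordEmbedding(filename) trains a word embedding using the training data stored in the text file filename.
emb = trainWordEmbedding(documents) trains a word embedding using the tokenizedDocument array documents.
emb = trainWordEmbedding(___,Name,Value) specifies additional options using one or more name-value pair arguments. For example, 'Dimension',50 specifies the word embedding dimension to be 50.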
Examples
Train Word Embedding from File
Train a word embedding of dimension 100 using the example text file sonnetsPreprocessed.txt. This file contains preprocessed versions of Shakespeare's sonnets, with one sonnet per line and words separated by a space.
filename = "sonnetsPreprocessed.txt";
emb = trainWordEmbedding(filename)
Training: 100% Loss: 3.23452 Remaining time: 0 hours 0 minutes.
emb =
  wordEmbedding with properties:
     Dimension: 100
    Vocabulary: ["thy" "thou" "love" "thee" "doth" "mine" "shall" "eyes" "sweet" "time" "nor" "beauty" "yet" "art" "heart" "o" "thine" "hath" "fair" "make" "still" ... ] (1x401 string)
View the word embedding in a text scatter plot using tsne.
words = emb.Vocabulary;
V = word2vec(emb,words);
XY = tsne(V);
textscatter(XY,words)
Train Word Embedding from Documents
Train a word embedding using the example data sonnetsPreprocessed.txt. This file contains preprocessed versions of Shakespeare's sonnets, with one sonnet per line and words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Train a word embedding using trainWordEmbedding.
emb = trainWordEmbedding(documents)
Training: 100% Loss: 2.95291 Remaining time: 0 hours 0 minutes.
emb =
  wordEmbedding with properties:
     Dimension: 100
    Vocabulary: ["thy" "thou" "love" "thee" "doth" "mine" "shall" "eyes" "sweet" "time" "nor" "beauty" "yet" "art" "heart" "o" "thine" "hath" "fair" "make" "still" ... ] (1x401 string)
Visualize the word embedding in a text scatter plot using tsne.
words = emb.Vocabulary;
V = word2vec(emb,words);
XY = tsne(V);
textscatter(XY,words)
Specify Word Embedding Options
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets, with one sonnet per line and words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Specify the word embedding dimension to be 50. To reduce the number of words discarded by the model, set 'MinCount' to 3. To train for longer, set the number of epochs to 10.
emb = trainWordEmbedding(documents, ...
    'Dimension',50, ...
    'MinCount',3, ...
    'NumEpochs',10)
Training: 100% Loss: 3.13646 Remaining time: 0 hours 0 minutes.
emb =
  wordEmbedding with properties:
     Dimension: 50
    Vocabulary: ["thy" "thou" "love" "thee" "doth" "mine" "shall" "eyes" "sweet" "time" "nor" "beauty" "yet" "art" "heart" "o" "thine" "hath" "fair" "make" "still" ... ] (1x750 string)
View the word embedding in a text scatter plot using tsne.
words = emb.Vocabulary;
V = word2vec(emb,words);
XY = tsne(V);
textscatter(XY,words)
Input Arguments
filename
— Name of file
string scalar | character vector | 1-by-1 cell array containing a character vector
Name of the file, specified as a string scalar, character vector, or a 1-by-1 cell array containing a character vector.
Data Types: string | char | cell
documents
— Input documents
tokenizedDocument array
Input documents, specified as a tokenizedDocument array.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'Dimension',50 specifies the word embedding dimension to be 50.
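The two name-value syntaxes described above can be sketched side by side. This assumes the example file sonnetsPreprocessed.txt used elsewhere on this page is on the MATLAB path, and that the first form is run in R2021a or later; the variable names embNew and embOld are illustrative only.

```matlab
filename = "sonnetsPreprocessed.txt";

% R2021a and later: Name=Value syntax, no quotes around the name.
embNew = trainWordEmbedding(filename,Dimension=50,Verbose=0);

% Any release: comma-separated pairs with the name in quotes.
embOld = trainWordEmbedding(filename,'Dimension',50,'Verbose',0);

% Both calls request the same embedding dimension.
disp([embNew.Dimension embOld.Dimension])
```

Both forms set the same options; only the calling syntax differs.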
Dimension
— Dimension of word embedding
100 (default) | positive integer
Dimension of the word embedding, specified as the comma-separated pair consisting of 'Dimension' and a positive integer.
Example: 300
Window
— Size of context window
5 (default) | nonnegative integer
Size of the context window, specified as the comma-separated pair consisting of 'Window' and a nonnegative integer.
Example: 10
Model
— Model
'skipgram' (default) | 'cbow'
Model, specified as the comma-separated pair consisting of 'Model' and 'skipgram' (skip gram) or 'cbow' (continuous bag-of-words).
Example: 'cbow'
DiscardFactor
— Factor to determine word discard rate
1e-4 (default) | positive scalar
Factor to determine the word discard rate, specified as the comma-separated pair consisting of 'DiscardFactor' and a positive scalar. The function discards a word from the input window with probability 1 - sqrt(t/f) - t/f, where f is the unigram probability of the word and t is DiscardFactor. Usually, DiscardFactor is in the range of 1e-3 through 1e-5.
Example: 0.005
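To get a feel for the discard formula, the following sketch evaluates 1 - sqrt(t/f) - t/f for a few unigram probabilities. The values in f are hypothetical, and clamping negative results to zero is an assumption made here for readability (the formula goes negative for rare words, which this sketch reads as "never discarded").

```matlab
t = 1e-4;                 % DiscardFactor (the default value)
f = [1e-4 1e-3 1e-2];     % hypothetical unigram probabilities

% Discard probability from the formula above.
pDiscard = 1 - sqrt(t./f) - t./f;

% Assumption: treat negative values as a discard probability of zero.
pDiscard = max(pDiscard,0);

disp(table(f',pDiscard','VariableNames',{'f','pDiscard'}))
```

With these inputs, more frequent words get a higher discard probability, which is the point of the option: very common words are subsampled more aggressively.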
LossFunction
— Loss function
'ns' (default) | 'hs' | 'softmax'
Loss function, specified as the comma-separated pair consisting of 'LossFunction' and 'ns' (negative sampling), 'hs' (hierarchical softmax), or 'softmax' (softmax).
Example: 'hs'
NumNegativeSamples
— Number of negative samples
5 (default) | positive integer
Number of negative samples for the negative sampling loss function, specified as the comma-separated pair consisting of 'NumNegativeSamples' and a positive integer. This option is only valid when LossFunction is 'ns'.
Example: 10
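As a sketch of how the Model, LossFunction, and NumNegativeSamples options combine, the call below trains a continuous bag-of-words model with negative sampling. It assumes the example file sonnetsPreprocessed.txt from the examples on this page is on the MATLAB path; the specific option values are illustrative, not recommendations.

```matlab
filename = "sonnetsPreprocessed.txt";

% NumNegativeSamples applies only because LossFunction is 'ns'.
emb = trainWordEmbedding(filename, ...
    'Model','cbow', ...
    'LossFunction','ns', ...
    'NumNegativeSamples',10, ...
    'Verbose',0);

disp(emb.Dimension)   % Dimension was not specified, so the default applies
```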
NumEpochs
— Number of epochs
5 (default) | positive integer
Number of epochs for training, specified as the comma-separated pair consisting of 'NumEpochs' and a positive integer.
Example: 10
MinCount
— Minimum count of words
5 (default) | positive integer
Minimum count of words to include in the embedding, specified as the comma-separated pair consisting of 'MinCount' and a positive integer. The function discards from the vocabulary any word that appears fewer than MinCount times in the training data.
Example: 10
NGramRange
— Inclusive range for subword n-grams
[3 6] (default) | vector of two nonnegative integers
Inclusive range for subword n-grams, specified as the comma-separated pair consisting of 'NGramRange' and a vector of two nonnegative integers [min max]. If you do not want to use n-grams, then set 'NGramRange' to [0 0].
Example: [5 10]
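To illustrate what an inclusive subword n-gram range covers, the sketch below lists the character n-grams of one word for the default range [3 6]. Wrapping the word in "<" and ">" boundary markers is an assumption borrowed from common subword-embedding implementations; this page does not state how trainWordEmbedding forms its n-grams internally, so treat this as an illustration of the range only.

```matlab
word = "where";
ngramRange = [3 6];                 % the default NGramRange

% Assumption: boundary markers around the word (not stated on this page).
padded = char("<" + word + ">");

ngrams = strings(0,1);
for n = ngramRange(1):ngramRange(2)
    for k = 1:(numel(padded) - n + 1)
        % Collect each length-n character window.
        ngrams(end+1,1) = string(padded(k:k+n-1)); %#ok<SAGROW>
    end
end

disp(ngrams')
```

Widening the range increases the number of subword units per word, which generally increases training cost.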
InitialLearnRate
— Initial learn rate
0.05 (default) | positive scalar
Initial learn rate, specified as the comma-separated pair consisting of 'InitialLearnRate' and a positive scalar.
Example: 0.01
UpdateRate
— Rate for updating learn rate
100 (default) | positive integer
Rate for updating the learn rate, specified as the comma-separated pair consisting of 'UpdateRate' and a positive integer. The learn rate decreases to zero linearly in steps every N words, where N is the UpdateRate.
Example: 50
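One plausible reading of this stepped linear decay is sketched below: the learn rate is recomputed every UpdateRate words and reaches zero once all training words have been processed. The total word count is a hypothetical value, and the exact internal schedule is not given on this page, so this is an illustration rather than the function's implementation.

```matlab
initialLearnRate = 0.05;   % InitialLearnRate default
updateRate = 100;          % UpdateRate default
totalWords = 1000;         % hypothetical number of words over all epochs

% Word counts at which the learn rate is recomputed.
wordsSeen = 0:updateRate:totalWords;

% Linear decay from the initial value down to zero.
learnRate = initialLearnRate * (1 - wordsSeen/totalWords);

stairs(wordsSeen,learnRate)
xlabel("Words processed")
ylabel("Learn rate")
```

A smaller UpdateRate gives finer steps along the same straight line; it does not change the start or end value in this sketch.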
Verbose
— Verbosity level
1 (default) | 0
Verbosity level, specified as the comma-separated pair consisting of 'Verbose' and one of the following:
0 – Do not display verbose output.
1 – Display progress information.
Example: 'Verbose',0
Output Arguments
emb
— Output word embedding
word embedding
Output word embedding, returned as a wordEmbedding object.
More About
Language Considerations
File input to the trainWordEmbedding function requires words separated by whitespace.
For files containing non-English text, you might need to input a tokenizedDocument array to trainWordEmbedding.
To create a tokenizedDocument array from pretokenized text, use the tokenizedDocument function and set the 'TokenizeMethod' option to 'none'.
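A minimal sketch of the pretokenized route follows. The tokens here are hypothetical placeholder data standing in for text that was split into words by some other tool; each cell holds one document as a string vector of words.

```matlab
% Hypothetical pretokenized documents, one string vector per document.
tokens = {
    ["the" "quick" "brown" "fox"]
    ["jumps" "over" "the" "lazy" "dog"]};

% 'none' tells tokenizedDocument to keep the tokens exactly as given.
documents = tokenizedDocument(tokens,'TokenizeMethod','none');

disp(doclength(documents))

% The resulting array can then be passed to trainWordEmbedding(documents).
% A corpus this small is for illustration only and is not trained here.
```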
Tips
The training algorithm uses the number of threads given by the function maxNumCompThreads. To learn how to change the number of threads used by MATLAB®, see maxNumCompThreads.
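As a brief sketch of the tip above, the following code queries the current thread count, lowers it temporarily, and then restores it. The value 2 is an arbitrary illustration, and the training call is left as a comment so that the snippet runs without any data.

```matlab
% Query the current maximum number of computational threads.
numThreads = maxNumCompThreads;

% Set a new limit; the call returns the previous setting.
previous = maxNumCompThreads(2);

% ... call trainWordEmbedding here to train with the reduced limit ...

% Restore the earlier setting afterward.
maxNumCompThreads(previous);
```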
Version History
Introduced in R2017b