Generate Text with Deep Learning "Invalid training data. Labels must not contain undefined values" ERROR

I am using the Generate Text with Deep Learning MATLAB example.
It works fine when I use the Shakespeare text provided in the example, but none of my own texts are accepted. I always get the error: "Invalid training data. Labels must not contain undefined values."
My text and code are provided below.
filename = 'RWE Nature.txt';
textData = fileread(filename);
textData = replace(textData," ","");
textData = split(textData,[newline]); % USE NEWLINE TO SPLIT TEXT INTO CELLS
% textData = textData(5:2:end);
textData(1:5) % 154 X 1 string array
startOfTextCharacter = compose("\x0002");
whitespaceCharacter = compose("\x00B7");
endOfTextCharacter = compose("\x2403");
newlineCharacter = compose("\x00B6");
textData = startOfTextCharacter + textData;
textData = replace(textData,[" " newline],[whitespaceCharacter newlineCharacter]);
uniqueCharacters = unique([textData{:}]); % '!'(),-.:;?ABCDEFGHIJKLMNOPRSTUVWYabcdefghijklmnopqrstuvwxyz¶·'
numUniqueCharacters = numel(uniqueCharacters); % 62
%
numDocuments = numel(textData); % 154 SONNETS, 89 PARAGRAPHS IN MAYER
XTrain = cell(1,numDocuments);
YTrain = cell(1,numDocuments);
for i = 1:numel(textData)
    characters = textData{i};
    sequenceLength = numel(characters);
    % Get indices of characters.
    [~,idx] = ismember(characters,uniqueCharacters);
    % Convert characters to vectors.
    X = zeros(numUniqueCharacters,sequenceLength);
    for j = 1:sequenceLength
        X(idx(j),j) = 1;
    end
    % Create vector of categorical responses with end of text character.
    charactersShifted = [cellstr(characters(2:end)')' endOfTextCharacter];
    Y = categorical(charactersShifted);
    XTrain{i} = X;
    YTrain{i} = Y;
end
% textData{1}
inputSize = size(XTrain{1},1);
numHiddenUnits = 200;
numClasses = numel(categories([YTrain{:}]));
layers = [
    sequenceInputLayer(inputSize)
    lstmLayer(numHiddenUnits,'OutputMode','sequence')
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];
options = trainingOptions('adam', ...
    'MaxEpochs',500, ...
    'InitialLearnRate',0.01, ...
    'GradientThreshold',2, ...
    'MiniBatchSize',77, ...
    'Shuffle','every-epoch', ...
    'Plots','training-progress', ...
    'Verbose',false);
% Train the network.
'a'
net = trainNetwork(XTrain,YTrain,layers,options);
'b'
% Generate text using the trained network.
generatedText = generateText(net,uniqueCharacters,startOfTextCharacter,newlineCharacter,whitespaceCharacter,endOfTextCharacter)
'end'
function generatedText = generateText(net,uniqueCharacters,startOfTextCharacter,newlineCharacter,whitespaceCharacter,endOfTextCharacter)
    numUniqueCharacters = numel(uniqueCharacters);
    X = zeros(numUniqueCharacters,1);
    idx = strfind(uniqueCharacters,startOfTextCharacter);
    X(idx) = 1;
    generatedText = "";
    vocabulary = string(net.Layers(end).Classes);
    maxLength = 500;
    while strlength(generatedText) < maxLength
        % Predict the next character scores.
        [net,characterScores] = predictAndUpdateState(net,X,'ExecutionEnvironment','cpu');
        % Sample the next character.
        newCharacter = datasample(vocabulary,1,'Weights',characterScores);
        % Stop predicting at the end of text.
        if newCharacter == endOfTextCharacter
            break
        end
        % Add the character to the generated text.
        generatedText = generatedText + newCharacter;
        % Create a new vector for the next input.
        X(:) = 0;
        idx = strfind(uniqueCharacters,newCharacter);
        X(idx) = 1;
    end
    generatedText = replace(generatedText,[newlineCharacter whitespaceCharacter],[newline " "]);
end

Answers (1)

Ben on 28 Nov 2022
There are a few things to fix:
  1. The call Y = categorical(charactersShifted) needs a valueset argument that lists all the unique characters in your dataset: Y = categorical(charactersShifted,allUniqueCharacters).
  2. To make that work with the uniqueCharacters variable, you need to convert it to the same class as charactersShifted, which is string.
  3. The endOfTextCharacter needs to be included in the valueset too, otherwise it will become <undefined> in Y.
  4. Finally, the line charactersShifted = [cellstr(characters(2:end)')' endOfTextCharacter]; might prepend an empty "" when characters is only 1 character long. That gives Y a length of 2 but X a length of 1, and you will get a sequence length mismatch when you try to train.
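The points above could be put together roughly as follows. This is only a sketch on toy data, not tested against the original text file: the three toy documents are made up for illustration, allUniqueCharacters is just the name used in point 1, and dropping documents that contain only the start-of-text character is one possible way to handle point 4 (an assumption, other approaches would also work).

```matlab
% Sketch: build every label vector against one fixed value set.
startOfTextCharacter = compose("\x0002");
endOfTextCharacter   = compose("\x2403");

% Toy documents standing in for the real text; the second one is an empty line.
textData = startOfTextCharacter + ["ab"; ""; "ba"];

% Point 4 (assumed handling): drop documents that hold only the start character.
textData = textData(strlength(textData) > 1);

% Unique characters as a char row vector, as in the question.
uniqueCharacters = unique([textData{:}]);

% Points 2 and 3: value set as a string array, with end of text appended.
allUniqueCharacters = [string(uniqueCharacters(:)); endOfTextCharacter];

numDocuments = numel(textData);
YTrain = cell(1,numDocuments);
for i = 1:numDocuments
    characters = textData{i};
    % Next-character targets: characters 2..end, followed by end of text.
    charactersShifted = [string(characters(2:end)')' endOfTextCharacter];
    % Point 1: pass the value set so every label maps to a known category.
    YTrain{i} = categorical(charactersShifted,allUniqueCharacters);
end
```

Because every Y shares the same value set, numClasses = numel(categories([YTrain{:}])) then counts all characters plus the end-of-text character, even those that never appear as a label.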
I think training should work once you resolve these things. Hope that helps.
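Before calling trainNetwork, a quick check along these lines may help locate the documents that trigger the error. It is a sketch on made-up toy data: the variable names XTrain and YTrain mirror the question, while hasUndefined, lengthMismatch and badDocuments are hypothetical helper names.

```matlab
% Sketch: flag documents whose labels are undefined or whose lengths disagree.
% Toy data: document 2 has an undefined label and one label too many.
XTrain = {zeros(3,2), zeros(3,1)};
YTrain = {categorical(["a" "b"],["a" "b"]), categorical(["" "b"],["a" "b"])};

numDocuments   = numel(YTrain);
hasUndefined   = false(1,numDocuments);
lengthMismatch = false(1,numDocuments);
for i = 1:numDocuments
    % Any label outside the value set shows up as <undefined>.
    hasUndefined(i) = any(isundefined(YTrain{i}));
    % X has one column per time step, so it must match the number of labels.
    lengthMismatch(i) = size(XTrain{i},2) ~= numel(YTrain{i});
end

% Indices of the documents that need attention.
badDocuments = find(hasUndefined | lengthMismatch);
disp(badDocuments)
```

Running the same loop on the real XTrain and YTrain should point to the lines of the text file that need cleaning, for example blank lines.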

Release

R2021b
