Hybrid method to sentiment analysis column number error

Question

oliver on 12 Apr 2023

0
Link

Direct link to this question

https://uk.mathworks.com/matlabcentral/answers/1945859-hybrid-method-to-sentiment-analysis-column-number-error

Edited: Sanguk on 14 Apr 2023

IMBD_reviews_smol.csv

I am trying to perofrm the hybrid approach to sentiment analysis using both vader sentiment scores and machine learning using smv. I am trying to concatenate the sentiment scores with the bag of words features and then predict the sentiment labels for the next test set before evaluating th performance. however my X data is 1 coumn off being correct for the test. The error i recieve is

"Error using classreg.learning.internal.numPredictorsCheck

X data must have 1662 column(s).

Error in classreg.learning.classif.CompactClassificationECOC/predict (line 335)

classreg.learning.internal.numPredictorsCheck(X,...

Error in assessment_hybrid (line 50)

YPred = predict(mdl, XTest);"

my code is

% Load the movie review dataset
filename = "IMBD_reviews_first5000.csv"; 
data = readtable(filename,'TextType','string');
data.sentiment = categorical(data.sentiment);
% Split dataset into training and test sets using holdout
cvp = cvpartition(data.sentiment, 'Holdout', 0.1);
dataTrain = data(cvp.training, :);
dataTest = data(cvp.test, :);
% Extract review text and sentiment labels from training and test set 
textDataTrain = dataTrain.review;
textDataTest = dataTest.review;
YTrain = dataTrain.sentiment;
YTest = dataTest.sentiment;
% Preprocess training set
documents = preprocessText(textDataTrain);
% Create bag of words and remove infrequent words
bag = bagOfWords(documents);
bag = removeInfrequentWords(bag,2);
[bag,idx] = removeEmptyDocuments(bag);
YTrain(idx) = [];
% Encode training set using bag of words
XTrain = bag.Counts;
% Train SVM classifier
mdl = fitcecoc(XTrain, YTrain, "Learners", "linear");  
% Preprocess test set
documentsTest = preprocessText(textDataTest);
documentsTrain = preprocessText(textDataTrain);
% Encode test set using bag of words
XTest = encode(bag, documentsTest);
% Compute sentiment scores for training and test sets using VADER
sentimentScoresTrain = vaderSentimentScores(documentsTrain);
sentimentScoresTest = vaderSentimentScores(documentsTest);
% Concatenate sentiment scores with bag of words features
XTrain = [XTrain, sentimentScoresTrain];
XTest = [XTest, sentimentScoresTest];
% Predict sentiment labels for test set
YPred = predict(mdl, XTest);
% Evaluate performance
accuracy = sum(YPred == YTest) / numel(YTest);
fprintf("Accuracy: %.2f%%\n", accuracy * 100);
confusion = confusionmat(YTest, YPred);
truePositive = confusion(1, 1);
falsePositive = confusion(2, 1);
trueNegative = confusion(2, 2);
falseNegative = confusion(1, 2);
% Compute precision, recall, and F-measure
precision = truePositive / (truePositive + falsePositive);
recall = truePositive / (truePositive + falseNegative);
fMeasure = 2 * precision * recall / (precision + recall);
% Compute accuracy 
accuracy2 = (truePositive + trueNegative) / numel(YTest);
% Display results 
disp(['True positive: ' num2str(truePositive)]);
disp(['False positive: ' num2str(falsePositive)]);
disp(['True negative: ' num2str(trueNegative)]);
disp(['False negative: ' num2str(falseNegative)]);
disp(['Precision: ' num2str(precision)]);
disp(['Recall: ' num2str(recall)]);
disp(['F-measure: ' num2str(fMeasure)]);
function documents = preprocessText(textData)
documents = tokenizedDocument(textData);
documents = addPartOfSpeechDetails(documents);
documents = removeStopWords(documents);
documents = erasePunctuation(documents);
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);
end

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Sign in to answer this question.

Answer 1

Drew on 12 Apr 2023

1
Link

Direct link to this answer

https://uk.mathworks.com/matlabcentral/answers/1945859-hybrid-method-to-sentiment-analysis-column-number-error#answer_1214554

Edited: Drew on 12 Apr 2023

Open in MATLAB Online

Updating this answer based on the comment below:

You have built one svm classifier based on the bag-of-words features. It sounds like what you want to do is build another classifier that uses both the bag-of-words features and the vader sentiment score feature. So, after you concatentate the sentiment scores with the bag-of-words features, you should train a new model on those combined features, and then evaluate that model.

% Concatenate sentiment scores with bag of words features
XTrain = [XTrain, sentimentScoresTrain];
XTest = [XTest, sentimentScoresTest];
% Build new svm model using both bag-of-words and vader sentiment scores as
% features
mdl2 = fitcecoc(XTrain, YTrain, "Learners", "linear");  

Original answer:

As the error message indicates, on line 50 of your code (shown below), the number of predictors in XTest does not match the number of predictors used to train the model.

YPred = predict(mdl, XTest);

After training the model, the lines below added one extra predictor, and hence the mismatch.

% Concatenate sentiment scores with bag of words features
XTrain = [XTrain, sentimentScoresTrain];
XTest = [XTest, sentimentScoresTest];

The number of predictors at model training time needs to match the number of predictors at model testing time.

2 Comments
Show NoneHide None

oliver on 12 Apr 2023

do you know how i would change the code to prevent adding one extra predictor

Sanguk on 13 Apr 2023

Edited: Sanguk on 14 Apr 2023

Open in MATLAB Online

I get this error

"Error using horzcat

Dimensions of arrays being concatenated are not consistent.

Error in reun (line 44)

XTrain = [XTrain, sentimentScoresTrain];"

from running:

filename = "sentiment_irrelevantdrop_Chegg"; 
data = readtable(filename,'TextType','string');
data.sentiment = categorical(data.sentiment);
% Split dataset into training and test sets using holdout
cvp = cvpartition(data.sentiment, 'Holdout', 0.1);
dataTrain = data(cvp.training, :);
dataTest = data(cvp.test, :);
% Extract review text and sentiment labels from training and test set 
textDataTrain = dataTrain.text;
textDataTest = dataTest.text;
YTrain = dataTrain.sentiment;
YTest = dataTest.sentiment;
% Preprocess training set
documents = preprocessText(textDataTrain);
% Create bag of words and remove infrequent words
bag = bagOfWords(documents);
bag = removeInfrequentWords(bag,2);
[bag,idx] = removeEmptyDocuments(bag);
YTrain(idx) = [];
% Encode training set using bag of words
XTrain = bag.Counts;
% Train SVM classifier
mdl = fitcecoc(XTrain, YTrain, "Learners", "linear");  
% Preprocess test set
documentsTest = preprocessText(textDataTest);
documentsTrain = preprocessText(textDataTrain);
% Encode test set using bag of words
XTest = encode(bag, documentsTest);
% Compute sentiment scores for training and test sets using VADER
sentimentScoresTrain = vaderSentimentScores(documentsTrain);
sentimentScoresTest = vaderSentimentScores(documentsTest);
% Concatenate sentiment scores with bag of words features
XTrain = [XTrain, sentimentScoresTrain];
XTest = [XTest, sentimentScoresTest];
% Build new svm model using both bag-of-words and vader sentiment scores as
% features
mdl2 = fitcecoc(XTrain, YTrain, "Learners", "linear");  
% Predict sentiment labels for test set
YPred = predict(mdl, XTest);
% Evaluate performance
accuracy = sum(YPred == YTest) / numel(YTest);
fprintf("Accuracy: %.2f%%\n", accuracy * 100);
confusion = confusionmat(YTest, YPred);
truePositive = confusion(1, 1);
falsePositive = confusion(2, 1);
trueNegative = confusion(2, 2);
falseNegative = confusion(1, 2);
% Compute precision, recall, and F-measure
precision = truePositive / (truePositive + falsePositive);
recall = truePositive / (truePositive + falseNegative);
fMeasure = 2 * precision * recall / (precision + recall);
% Compute accuracy 
accuracy2 = (truePositive + trueNegative) / numel(YTest);
% Display results 
disp(['True positive: ' num2str(truePositive)]);
disp(['False positive: ' num2str(falsePositive)]);
disp(['True negative: ' num2str(trueNegative)]);
disp(['False negative: ' num2str(falseNegative)]);
disp(['Precision: ' num2str(precision)]);
disp(['Recall: ' num2str(recall)]);
disp(['F-measure: ' num2str(fMeasure)]);
function documents = preprocessText(textData)
documents = tokenizedDocument(textData);
documents = addPartOfSpeechDetails(documents);
documents = removeStopWords(documents);
documents = erasePunctuation(documents);
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);
end

btw, thank for your coding Oliver

Sign in to comment.

Hybrid method to sentiment analysis column number error

0 Comments
Show -2 older commentsHide -2 older comments

Accepted Answer

2 Comments
Show NoneHide None

More Answers (0)

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

Hybrid method to sentiment analysis column number error

0 Comments Show -2 older commentsHide -2 older comments

Accepted Answer

2 Comments Show NoneHide None

More Answers (0)

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

0 Comments
Show -2 older commentsHide -2 older comments

2 Comments
Show NoneHide None