Text classification/similarity measures when text data is very similar

Question

Ess Gee on 19 Feb 2021

Direct link to this question

https://uk.mathworks.com/matlabcentral/answers/750334-text-classification-similarity-measures-when-text-data-is-very-similar

Answered: Anay on 3 Jul 2025

I'm trying to use MATLAB to perform a type of text classification/similarity measure, and I'm having some difficulty finding the right algorithm for my specific use case.

I have a list of 'gold standard' product names which can be quite similar to one another e.g.

productnames_examples = [
"Apple iPhone 11 128GB Black"
"Apple iPhone 11 128GB Blue"
"Apple iPhone 11 256GB Black"
"Apple iPhone 11 256GB Blue"
"Apple iPhone 11 Pro 128GB Black"
"Apple iPhone 11 Pro 128GB Blue"
"Apple iPhone 11 Pro 256GB Black"
"Apple iPhone 11 Pro 256GB Blue"
"Apple iPhone 12 128GB Black"
"Apple iPhone 12 128GB Blue"
"Apple iPhone 12 256GB Black"
"Apple iPhone 12 256GB Blue"
"Apple iPhone 12 Pro 128GB Black"
"Apple iPhone 12 Pro 128GB Blue"
"Apple iPhone 12 Pro 256GB Black"
"Apple iPhone 12 Pro 256GB Blue"]

I'm trying to match a long list of queries to these product names. Examples of the query values are:

querynames_examples = [
"Apple iPhone 11 256 Gb Black"
"Apple iPhone 12 256GB Pro Blue" 
"Apple iPhone 12 Pro iOS 10 5.5 5G LTE 128GB Blue"
"Apple iPhone 11 Pro Black 5.5 256GB"
"iPhone 12 Pro Blue 256GB" ]

Edit-distance algorithms (such as the editDistance function) don't seem appropriate, because character-level distances can't discriminate between product names as similar as the ones in this example. For instance, the query "Apple iPhone 12 256GB Pro Blue" is closer to "Apple iPhone 12 256GB Blue" than to the correct "Apple iPhone 12 Pro 256GB Blue".
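The edit-distance problem described above can be reproduced with a short check. This is a minimal sketch, assuming Text Analytics Toolbox is installed (it provides editDistance); the variable names are illustrative only.

```matlab
% Compare one query against the wrong and the correct product name.
% Assumes Text Analytics Toolbox (editDistance).
query = "Apple iPhone 12 256GB Pro Blue";
candidates = [
    "Apple iPhone 12 256GB Blue"
    "Apple iPhone 12 Pro 256GB Blue"];

% Character-level edit distance from the query to each candidate.
d = arrayfun(@(c) editDistance(query, c), candidates);

% The first candidate only needs "Pro " deleted, so it scores lower
% (closer) than the second, which is the intended match.
disp(table(candidates, d, VariableNames=["Candidate","EditDistance"]))
```

Because the intended product differs from the query only by a moved token, a character-level distance penalizes it more heavily than it penalizes a missing token.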

I've also looked at the family of BM25 algorithms and, again, they don't seem able to get past the similarity of the contents of productnames.

I've also looked at training a simple text classifier on word frequency counts using a bag-of-words model based on https://mathworks.com/help/textanalytics/ug/create-simple-text-model-for-classification.html, with some alterations (e.g. no minimum word length in the preprocessing, so that numeric values like 11 can be captured) and using pre-existing matched data as training data, but again I don't seem to be getting anything useful out of it.

Is there a function in MATLAB that can suitably match these queries?

Answer 1

Anay on 3 Jul 2025

Direct link to this answer

https://uk.mathworks.com/matlabcentral/answers/750334-text-classification-similarity-measures-when-text-data-is-very-similar#answer_1567503

Hi Ess,

Based on your description, I suggest trying a “dense document embedding”. It uses neural networks with “attention layers”, which help capture the semantic and contextual meaning of the text beyond the individual words. To use dense document embeddings, you need to install the “Text Analytics Toolbox” add-on.

Models such as BM25, or classifiers trained on bag-of-words features, use sparse vectors that capture neither the semantic meaning of the text nor the order of the words. That may make them unsuitable for your application, where product names can be very similar to one another.
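To illustrate the point about word order, here is a small sketch (assuming Text Analytics Toolbox for tokenizedDocument and bagOfWords; the variable names are illustrative) showing that two names containing the same words in a different order produce identical count vectors:

```matlab
% Two product-name strings with the same words in a different order.
% Assumes Text Analytics Toolbox (tokenizedDocument, bagOfWords).
names = [
    "Apple iPhone 12 256GB Pro Blue"
    "Apple iPhone 12 Pro 256GB Blue"];

docs = tokenizedDocument(names);
bag = bagOfWords(docs);

% Counts is a sparse documents-by-vocabulary matrix; convert to full
% so the two rows can be compared directly.
counts = full(bag.Counts);
sameVector = isequal(counts(1,:), counts(2,:));
disp(sameVector)
```

Any model built only on these counts sees the two strings as the same document, so it cannot use token position as a signal.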

You can use the following code for reference:

documents = [
"Apple iPhone 11 128GB Black"
"Apple iPhone 11 128GB Blue"
"Apple iPhone 11 256GB Black"
"Apple iPhone 11 256GB Blue"
"Apple iPhone 11 Pro 128GB Black"
"Apple iPhone 11 Pro 128GB Blue"
"Apple iPhone 11 Pro 256GB Black"
"Apple iPhone 11 Pro 256GB Blue"
"Apple iPhone 12 128GB Black"
"Apple iPhone 12 128GB Blue"
"Apple iPhone 12 256GB Black"
"Apple iPhone 12 256GB Blue"
"Apple iPhone 12 Pro 128GB Black"
"Apple iPhone 12 Pro 128GB Blue"
"Apple iPhone 12 Pro 256GB Black"
"Apple iPhone 12 Pro 256GB Blue"];
emb = documentEmbedding(Model="all-MiniLM-L12-v2");
embeddedDocuments = embed(emb,documents);
query = "Apple iPhone 12 Pro iOS 10 5.5 5G LTE 128GB Blue";
embeddedQuery = embed(emb,query);
scores = cosineSimilarity(embeddedDocuments,embeddedQuery);
[~,idx] = sort(scores,"descend");
% Display all documents ranked by similarity to the query (best match first)
disp("query: " + query)
disp("ranked documents (in descending order of similarity):")
disp(documents(idx));

I get this output:

To use the “all-MiniLM-L12-v2” model, you need to install the “Text Analytics Toolbox Model for all-MiniLM-L12-v2 Network” support package. It is available from its download page:

https://www.mathworks.com/matlabcentral/fileexchange/156394-text-analytics-toolbox-model-for-all-minilm-l12-v2-network

To learn more about dense document embeddings, see:

https://www.mathworks.com/help/textanalytics/ug/information-retrieval-with-document-embeddings.html#mw_e7bc4de4-8e8b-4cdc-a50c-610306b7b0b8
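Since the question involves a long list of queries, the same approach can be applied to all of them at once. The following is a minimal sketch under the same assumptions as the code above (Text Analytics Toolbox plus the all-MiniLM-L12-v2 support package); the shortened document list and the table column names are illustrative, and the matches should be checked against known pairs rather than taken on trust, since near-identical names can still be confused.

```matlab
% Match several queries against the product list in one pass.
% Assumes Text Analytics Toolbox and the all-MiniLM-L12-v2 support package.
documents = [
    "Apple iPhone 11 256GB Black"
    "Apple iPhone 11 Pro 256GB Black"
    "Apple iPhone 12 256GB Blue"
    "Apple iPhone 12 Pro 128GB Blue"
    "Apple iPhone 12 Pro 256GB Blue"];
queries = [
    "Apple iPhone 11 256 Gb Black"
    "Apple iPhone 12 256GB Pro Blue"
    "iPhone 12 Pro Blue 256GB"];

emb = documentEmbedding(Model="all-MiniLM-L12-v2");
embeddedDocuments = embed(emb, documents);
embeddedQueries = embed(emb, queries);

% scores(i,j) is the similarity between document i and query j.
scores = cosineSimilarity(embeddedDocuments, embeddedQueries);

% For each query (column), pick the document with the highest score.
[bestScore, bestIdx] = max(scores, [], 1);

results = table(queries, documents(bestIdx(:)), bestScore(:), ...
    VariableNames=["Query","BestMatch","Score"]);
disp(results)
```

Keeping the score alongside each match makes it possible to flag low-confidence pairs for manual review.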
