Compare strings of different size/length
I'm struggling to code a procedure that measures the similarity between two strings and returns the index of the best match in a cell array of more than 10,000 elements.
The i-th element of the first cell array is something like:
str1= 'Class music n. 12 160b'
which is the element I want to search for in the second cell array. The corresponding matching element of the second array is, e.g.:
str2= 'Classical musical n. 12 160beats'
and so on.
I would like a procedure that determines whether this pair is the most similar compared with all the others (the others can be like
str3 = 'Techno music n. 7 120beats'
str4 = 'Rock disco n. 12 140beats'
str5 = 'Punk metal n. 18 180 beats'
or even more different).
I want to find the index in the cell array where str2 is located, so that I can manipulate it.
I've tried several approaches, but none of them gave consistent results.
Would you be able to assist me in this?
Thank you
M
6 Comments
Michele Rizzato
on 4 Dec 2020
Rik
on 4 Dec 2020
You need something like fuzzy matching.
To do this manually, you can parse each char array into its constituents: break it up into the style, number and speed. Then you can more easily attempt to match.
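A minimal sketch of that parsing step, assuming every entry follows the layout '<style> n. <number> <speed>b' or '<speed>beats' seen in the examples. The pattern and the variable names (pat, tok, hit) are illustrative assumptions, not something from the thread, and entries that do not follow the layout will not match.

```matlab
% Sketch only: assumes the layout '<style> n. <number> <speed>b[eats]'.
str = {'Class music n. 12 160b', ...            % query (first entry)
       'Classical musical n. 12 160beats', ...
       'Punk metal n. 18 180 beats'};

% Named tokens split each entry into style, number and speed.
pat = '^(?<style>.*?)\s+n\.\s*(?<number>\d+)\s+(?<speed>\d+)\s*b(?:eats)?$';
tok = regexp(str, pat, 'names', 'once');   % cell array, one struct per entry
tok = [tok{:}];                            % struct array (unmatched entries drop out)

number = str2double({tok.number});
speed  = str2double({tok.speed});

% Candidates whose number and speed both equal those of the query.
hit = find(number(2:end) == number(1) & speed(2:end) == speed(1)) + 1;
```

The numeric fields can be compared exactly, which leaves only the style text for fuzzy comparison.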
Michele Rizzato
on 4 Dec 2020
To solve this task you need to define what "most similar" means mathematically. As you have probably discovered, a naive metric such as the Levenshtein distance is quite possibly not the most suitable: short strings can match because they have a small edit distance even if most characters differ, whereas you want to match based on the meaning of the content, which is certainly not a trivial task.
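To make the metric concrete, here is a small sketch of the Levenshtein distance computed by dynamic programming for one pair of strings, with no toolbox required. The final normalisation by the longer string length is an assumption added for illustration (it is not proposed in the thread); it is one possible way to stop short strings from scoring well merely because they are short.

```matlab
% Sketch only: plain Levenshtein distance between two char vectors.
a = 'Class music n. 12 160b';
b = 'Classical musical n. 12 160beats';

D = zeros(numel(a)+1, numel(b)+1);
D(:,1) = 0:numel(a);                 % cost of deleting every character of a
D(1,:) = 0:numel(b);                 % cost of inserting every character of b
for i = 1:numel(a)
    for j = 1:numel(b)
        cost = double(a(i) ~= b(j));            % 0 if equal, 1 if substituted
        D(i+1,j+1) = min([D(i,j+1) + 1, ...     % deletion
                          D(i+1,j) + 1, ...     % insertion
                          D(i,j)   + cost]);    % match or substitution
    end
end
dist = D(end,end);

% Assumed normalisation: 0 means identical, 1 means nothing in common.
normDist = dist / max(numel(a), numel(b));
```

For a 10,000-element cell array this loop would run once per candidate, so a vectorised or toolbox implementation is likely preferable in practice.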
Possibly you could rely on some prior knowledge to preprocess the strings (e.g. replace all abbreviations with the equivalent full words) and then try measuring the edit distance. For example:
Does 'Class ' always represent 'Classical ' ?
Does 'b' at the end always represent 'beats' ?
etc.
You could trivially define these replacements using a regular expression and then calculate the edit distance.
Michele Rizzato
on 4 Dec 2020
Michele Rizzato
on 4 Dec 2020
Answers (2)
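in1 = 'Class music n. 12 160b'; % string you want to match
in2 = {'Classical musical n. 12 160beats','Techno music n. 7 120beats','Rock disco n. 12 140beats','Punk metal n. 18 180 beats'}; % candidates
rgx = {'([Cc])lass(\s+)','\d+b$'}; % abbreviations: 'Class ' and a trailing 'b'
rpl = { '$1lassical$2','$&eats'}; % their full forms
tm1 = regexprep(in1,rgx,rpl); % expand abbreviations in the query
tm2 = regexprep(in2,rgx,rpl); % expand abbreviations in every candidate
edd = editDistance(tm1,tm2) % requires the Text Analytics Toolbox
[~,idx] = min(edd); % index of the closest candidate
in2{idx}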
2 Comments
Michele Rizzato
on 4 Dec 2020
Edited: Michele Rizzato
on 4 Dec 2020
"i should write a different code"
No, that is not the idea at all: there should be just one list of all abbreviations and their replacements (this assumes that you have this prior knowledge) which you can apply to all strings. What I showed is just a demonstration using your example data, but you will need to complete it with all abbreviations. You can then use the same code for any string that you want to match.
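As a sketch of what "one list of all abbreviations" could look like, the table below keeps each pattern next to its replacement and wraps the expansion in a single function handle. The names abbr and expand and the two entries are hypothetical and only cover the example data; the patterns are case-sensitive, and a real list would need to be completed from your own knowledge of the abbreviations.

```matlab
% Sketch only: hypothetical abbreviation table, one row per rule.
abbr = { ...
    '\<Class\>',   'Classical'; ...   % whole word 'Class' only, not 'Classical'
    '(?<=\d)b$',   'beats'};          % trailing 'b' directly after a digit

% One function applies every rule to a char vector or a cell array of them.
expand = @(s) regexprep(s, abbr(:,1)', abbr(:,2)');

query = expand('Class music n. 12 160b');
list  = expand({'Classical musical n. 12 160beats', 'Techno music n. 7 120b'});
```

Adding a new abbreviation then means adding one row to the table; the matching code itself stays unchanged.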
If the order of the words can be "random" as you wrote, then first replace the abbreviations, split the words, sort the words alphabetically (or alphanumerically), join the words, and finally measure the edit distance:
in1 = 'Class music n. 12 160b'; % string you want to match
in2 = {'Classical musical n. 12 160beats','Techno music n. 7 120beats','Rock disco n. 12 140beats','Punk metal n. 18 180 beats'};
rgx = {'([Cc])lass(\s+)', '\d+b$'};
rpl = { '$1lassical$2','$&eats'};
fun = @(s)join(sort(split(s))); % or use NATSORT (must be downloaded)
tm1 = fun(regexprep(in1,rgx,rpl));
tm2 = cellfun(fun,regexprep(in2,rgx,rpl));
edd = editDistance(tm1,tm2)
[~,idx] = min(edd);
in2{idx}
try this,
4 Comments
Rik
on 4 Dec 2020
I was just about to remove your answer from the spam filter. Feel free to put the code in your answer again. If it gets flagged, I'll remove the flag.
Michele Rizzato
on 4 Dec 2020
Sibi
on 4 Dec 2020
R='T m n. 7 120b';
code will work for this one also.
Michele Rizzato
on 4 Dec 2020