Vectorized Levenshtein distances between arrays of text labels?
I have to compare N ID labels (several thousand) to each other in order to determine which are mistypings of one another. The labels have up to 20 characters. Preliminarily, I am considering calculating the N(N-1)/2 Levenshtein distances between them and using clustering to determine which labels correspond to the same ID. The work is being done in Python, but none of the Levenshtein distance implementations I have found are vectorized. That is, the NxN array of distances has to be filled in element by element.
I thought that there might be a vectorized MATLAB version of Levenshtein distance, which I could package for deployment and invocation from Python. I found a few, shown in the Annex below, as well as an "editDistance" function available in R2023b. None of these vectorize the calculation of the N(N-1)/2 distances. I'm surprised that a vectorized implementation doesn't exist. Am I missing something obvious?
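For reference, the per-pair baseline the question describes can be sketched in Python. This is the plain Wagner-Fischer dynamic program, i.e. exactly the unvectorized, element-by-element approach being asked about; the label values are made up for illustration:

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Classic Wagner-Fischer dynamic program, O(len(a)*len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

labels = ["ID_007", "ID_07", "ID_008"]  # illustrative labels
# N(N-1)/2 pairwise distances, computed one pair at a time
dists = {(x, y): levenshtein(x, y) for x, y in combinations(labels, 2)}
```

For several thousand labels this inner loop is what dominates, which is what motivates looking for either a vectorized implementation or, as in the accepted answer below, a search structure that avoids computing most of the pairs.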
Annex: MATLAB implementations of Levenshtein distance
Accepted Answer
More Answers (2)
Christopher Creutzig
on 14 Jun 2024
For this application, I recommend using editDistanceSearcher: you probably don't want to look at clusters larger than some size anyway, and knnsearch on an editDistanceSearcher will give you the neighbors of interest. It performs precomputation, trading memory for faster amortized time as long as you query the same dataset often enough, which sounds like exactly what you want to do.
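On the Python side there is no drop-in equivalent of editDistanceSearcher, but the same precompute-then-query idea can be sketched with a BK-tree, which uses the triangle inequality to prune lookups. This is an illustrative analogue under that assumption, not a description of what the MATLAB function does internally; the metric and words are made up:

```python
def levenshtein(a, b):
    # Plain Wagner-Fischer dynamic program used as the tree's metric.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[-1] + 1,                 # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]


class BKTree:
    """Precompute once, query many times: the triangle inequality
    lets each lookup skip most distance computations."""

    def __init__(self, dist, words):
        self.dist = dist
        self.root = None
        for w in words:
            self.add(w)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node, children = self.root
        while True:
            d = self.dist(word, node)
            if d == 0:
                return  # duplicate word, nothing to store
            if d in children:
                node, children = children[d]
            else:
                children[d] = (word, {})
                return

    def query(self, word, radius):
        """All stored words within `radius` of `word`, as (dist, word) pairs."""
        out = []
        stack = [self.root] if self.root else []
        while stack:
            node, children = stack.pop()
            d = self.dist(word, node)
            if d <= radius:
                out.append((d, node))
            # Only subtrees whose edge label lies in [d - radius, d + radius]
            # can contain matches, so all other branches are skipped.
            stack.extend(child for edge, child in children.items()
                         if d - radius <= edge <= d + radius)
        return sorted(out)
```

Usage would be `tree = BKTree(levenshtein, labels)` once, then `tree.query(label, 3)` per label: each query returns candidate neighbors without scanning all N labels, which is the same memory-for-time trade described above.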
FM
on 17 Jun 2024
Christopher Creutzig
on 17 Jun 2024
I was thinking of taking the 30,011 labels as the base set. If you then take the 5 closest elements within an edit distance of 3 for each of them, is that a start?
FM
on 17 Jun 2024
> To get the 5 closest elements to each of the 30,011 labels, I need to calculate 30,010 distances for each label
No, you don't. That's exactly what editDistanceSearcher helps you avoid. If it helps to think of it that way, it precomputes "bubble-shaped detectors" around the input values, avoiding most computations of distances that exceed the original threshold. (Those bubbles aren't perfectly "distance 3" shaped, so some distances still need to be computed, but they filter out the overwhelming majority of points further out.)
data = string(rand([1,10000]));              % 10,000 synthetic string labels
searcher = editDistanceSearcher(data,3);     % precompute, maximum distance 3
timeit(@() editDistance(data(1),data))       % brute force: one-vs-all distances
timeit(@() knnsearch(searcher,data(1),K=5))  % 5 nearest neighbors via the searcher
Christopher Creutzig
on 19 Jun 2024
There are many ways to define clusters. What exactly is the one you are looking for?
If your clusters are defined by "groups of words separated by edit distance," i.e., you regard all of those as one cluster where you have stepping stones to get from A to B without making any step with a Levenshtein distance of, say, 4 or more, then knowing all the neighbors of your words is all you need.
That is not data you would put into kmeans or kmedoids, of course. Such neighborhood information turns the clustering into a graph-theoretic problem: set up an undirected graph from the neighbor relation you have and ask for its connected components (or block-cut components, to avoid spurious connections through outliers).
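To make that graph step concrete in the asker's language, here is a minimal union-find sketch of the connected-components clustering: join any two labels within the threshold, then read off the components. The metric, labels, and threshold are illustrative assumptions, not the MATLAB graph workflow:

```python
from itertools import combinations

def levenshtein(a, b):
    # Plain Wagner-Fischer dynamic program.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[-1] + 1,                 # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def cluster(labels, max_dist):
    """Connected components of the graph whose edges join labels
    within `max_dist` of each other (the 'stepping stone' clusters)."""
    parent = {w: w for w in labels}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for a, b in combinations(labels, 2):
        if levenshtein(a, b) <= max_dist:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb  # union the two components

    groups = {}
    for w in labels:
        groups.setdefault(find(w), []).append(w)
    return list(groups.values())

labels = ["ID_007", "ID_O07", "ID_007X", "XYZ_441", "XYZ_44I"]
clusters = cluster(labels, max_dist=1)
```

Note that "ID_O07" and "ID_007X" end up in the same cluster even though their mutual distance is 2: "ID_007" is the stepping stone, which is exactly the transitive behavior described above. In practice the edge list would come from the searcher's neighbor queries rather than from all N(N-1)/2 distances.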
FM
on 20 Jun 2024