Vectorized Levenshtein distances between arrays of text labels?

I have to compare N ID labels (several thousand) against each other in order to determine which are mistypings of one another. The labels have up to 20 characters. Preliminarily, I am considering computing the N(N-1)/2 Levenshtein distances between them and using clustering to determine which labels correspond to the same ID. This is being done in Python, but none of the Levenshtein distance implementations are vectorized. That is, the NxN array of distances has to be filled in on an element-by-element basis.
I thought that there might be a vectorized Matlab version of Levenshtein distance, which I could package for deployment and invocation from Python. I found a few, shown in the Annex below, as well as an "editDistance" function available in R2023b. None of these vectorize the calculation of the N(N-1)/2 distances. I'm surprised that a vectorized implementation doesn't exist. Am I missing something obvious?
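For context, the element-by-element computation described above can be sketched in plain Python using the standard two-row dynamic-programming formulation (the label list is hypothetical; a real run would iterate over the several-thousand-label dataset):

```python
from itertools import combinations

def levenshtein(s1, s2):
    """Two-row dynamic-programming Levenshtein distance."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string as the inner dimension
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

labels = ["alpha1", "alpha2", "beta1"]  # hypothetical labels
# N(N-1)/2 pairwise distances, computed one element at a time
distances = {(a, b): levenshtein(a, b) for a, b in combinations(labels, 2)}
```

This inner double loop over character positions is exactly the data-dependent recurrence that resists vectorization: each cell of the DP table depends on its three already-computed neighbors.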
Annex: Matlab implementations of Levenshtein distance

2 Comments

"Am I missing something obvious?"
The lack of an easily vectorizable algorithm.
OK, so that confirms that I haven't overlooked anything in my search. Thanks!


 Accepted Answer

Hi FM,
I understand that you want to compute the Levenshtein distances between several thousand ID labels in a vectorized manner using MATLAB and then interface it with Python.
Here is a MATLAB code that utilizes parallel processing to compute the Levenshtein distances more efficiently. This approach is not fully vectorized due to the nature of the algorithm, but it leverages MATLAB's parallel computing capabilities to speed up the computation.
function distanceMatrix = computeLevenshteinDistances(labels)
    % Number of labels
    N = numel(labels);
    % Initialize the distance matrix
    distanceMatrix = zeros(N, N);
    % Parallel loop: each iteration fills one row of the upper triangle.
    % (Assigning distanceMatrix(j, i) inside the parfor would violate the
    % sliced-variable rules, so the matrix is symmetrized afterwards.)
    parfor i = 1:N
        row = zeros(1, N);
        for j = i+1:N
            row(j) = levenshtein(labels{i}, labels{j});
        end
        distanceMatrix(i, :) = row;
    end
    % The diagonal is zero, so adding the transpose symmetrizes the matrix
    distanceMatrix = distanceMatrix + distanceMatrix.';
end
function d = levenshtein(s1, s2)
    % Calculate the Levenshtein distance between two strings
    m = length(s1);
    n = length(s2);
    % Initialize the distance matrix with the base cases
    D = zeros(m+1, n+1);
    D(:, 1) = (0:m).';
    D(1, :) = 0:n;
    % Compute the distances
    for i = 2:m+1
        for j = 2:n+1
            cost = (s1(i-1) ~= s2(j-1));
            D(i, j) = min([D(i-1, j) + 1, ...    % Deletion
                           D(i, j-1) + 1, ...    % Insertion
                           D(i-1, j-1) + cost]); % Substitution
        end
    end
    d = D(m+1, n+1);
end
To integrate this MATLAB function with Python, you can use the MATLAB Engine API for Python. Here's an example of how to call the MATLAB function from Python:
import matlab.engine
import numpy as np
# Start MATLAB engine
eng = matlab.engine.start_matlab()
# Example labels; a Python list of str is passed to MATLAB as a cell array of char
labels = ['label1', 'label2', 'label3', 'labelN']
# Call the MATLAB function
distance_matrix = eng.computeLevenshteinDistances(labels)
# Convert the result to a numpy array
distance_matrix = np.array(distance_matrix)
# Stop MATLAB engine
eng.quit()
# Print the distance matrix
print(distance_matrix)
To summarize:
  1. The MATLAB function computeLevenshteinDistances computes the Levenshtein distances between all pairs of labels using parallel processing.
  2. The levenshtein function calculates the distance between two strings.
  3. The Python script uses the MATLAB Engine API to call the MATLAB function and retrieve the distance matrix.
By using MATLAB's parallel computing capabilities, you can significantly speed up the computation of the Levenshtein distances for a large number of labels. I recommend using "parfor" instead of "for" loops to leverage parallel computation in MATLAB.
Refer to the following MathWorks documentation for more information on "parfor" in MATLAB: https://www.mathworks.com/help/parallel-computing/parfor.html
Hope this helps.
Regards,
Nipun

3 Comments

Thanks, Nipun. This is my initial exposure to the Parallel Computing Toolbox. I think it could be useful in the future. For this particular case, it might not be the right way to go.
I was initially hoping that MATLAB's JIT compilation would allow it to bust through a bottleneck of 1.5 hours in my Python implementation. My tests of an unparallelized MATLAB implementation show that it would take 70 hours ( https://www.mathworks.com/matlabcentral/answers/2127991 ). For a typical machine with (spitballing) 5 cores, and assuming 2 threads/core, we can estimate an upper bound on the parallelization speedup of 2x5 = 10 times, which is not enough to bust through my current bottleneck.
I use the above ballparking calculations because I can't use the "conventional" way of estimating the time it would take. The "conventional" way is just to use a variable to count the number of Levenshtein distance calculations, then print this out alongside the time stamp. Using values from the first few minutes of execution, I can project out to 70 hours for the 450+ million calculations required. Unfortunately, however, the serial counting of calculations doesn't make sense when the outer loop is parallelized. A much more complex means is needed to estimate the total wall clock time, but it doesn't seem justified because the ballparking calculation above is sufficiently telling.
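The projection itself is simple arithmetic; a sketch, with the sample counts chosen purely for illustration so that they reproduce the ~70-hour figure (they are not measured values):

```python
# Project total wall-clock time from an early sample of a serial run
N = 30_011
total_pairs = N * (N - 1) // 2        # 450+ million distance calculations

sample_pairs = 500_000                # hypothetical: pairs completed so far
sample_seconds = 280.0                # hypothetical: elapsed wall-clock time
rate = sample_pairs / sample_seconds  # calculations per second

projected_hours = total_pairs / rate / 3600
print(f"{total_pairs:,} pairs, projected {projected_hours:.1f} h")
```

As noted above, this only makes sense for a serial run; under parfor the per-worker counters would have to be aggregated, which is why the cores-times-threads ballpark is used instead.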
I am test-driving your levenshtein() function to see if it is faster than editDistance and will report back in this comment. From the notation, it seems to operate on cell arrays of char instead of strings per se.
Update: Replacing editDistance() with your levenshtein() reduces the estimated runtime to 2.2 hours for an unparallelized implementation. Parallelizing might knock that down by up to an order of magnitude on a non-specialized machine. This definitely puts the solution in the realm of worth considering. There are challenges with progress monitoring. Shared utilities for parfor progress monitoring exist, but they could take some time to figure out, and if they simply show the percentage of outer loops completed, that would be nonideal because the inner loops iterate a highly variable number of times. An added unknown is that I'm not yet sure how compatible such 3rd-party software components are with our computing restrictions. If it were just MATLAB source code, that would be a non-issue.
I still need to test-drive the use of the MATLAB Engine API for Python. For now, I will let the unparallelized implementation finish, confirm that the Levenshtein distances match those from Python (afternote: the results match), and actually test-drive the clustering algorithm being fed by the distance calculations before trying to find faster ways to generate them.
editDistance uses both recursion and arrayfun. It basically calls itself in a loop, and on each call it validates the input strings, which really only needs to be done once, not to mention the overhead of arrayfun (which I think has been discussed elsewhere on this forum) and all of the other checks it runs each time it's called. Maybe all of that overhead contributes to its slowdown by a factor of 70/2.2.
Yes, it does seem that editDistance is for interactive one-offs rather than large data sets en masse. I'm new enough to this area that I'm not sure what the use case for that is, at least in a programmatic context.


More Answers (2)

For this application, I recommend using editDistanceSearcher: you probably don't want to look at clusters larger than some size anyway, and knnsearch on an editDistanceSearcher will give you the neighbors of interest. It performs precomputation, trading memory for speed in amortized time as long as you call it often enough on the same dataset, which sounds like exactly what you want to do.

7 Comments

@Christopher: I'm actually trying to explore the clustering of the 30,011 labels to see which can be considered typos or embellishments of each other. At the outset, I don't have specific words for which I'm searching for possible matches within a certain edit distance. After exploring the clustering, I might be able to set such thresholds.
(Originally responded to at https://www.mathworks.com/matlabcentral/answers/2127991-does-matlab-automatically-vectorize-for-loops#comment_3186891)
I was thinking of taking the 30,011 labels as the base set. If you then take the 5 closest elements within an edit distance of 3 for each of them, is that a start?
To get the 5 closest elements to each of the 30,011 labels, I need to calculate 30,010 distances for each label (or rather, an average of 30,010 / 2 distances for each label, since some of the distances will already have been calculated when processing other labels). I'm not sure how I can avoid the 450+ million distance calculations as a precursor. Am I missing something about your suggestion?
> To get the 5 closest elements to each of the 30,011 labels, I need to calculate 30,010 distances for each label
No, you don't. That's exactly what editDistanceSearcher helps you avoid. If it helps thinking about it that way, it precomputes “bubble-shaped detectors” around the input values, avoiding most computations of distances that are larger than the original threshold. (Those bubbles aren't perfectly “distance 3” shaped, so some distances still need to be computed, but they will filter out the overwhelming majority of points further out.)
data = string(rand([1,10000]));
searcher = editDistanceSearcher(data,3);
timeit(@() editDistance(data(1),data))
ans = 2.3585
timeit(@() knnsearch(searcher,data(1),K=5))
ans = 0.7855
I think it's useful to have some specific labels for which I seek the closest matches out of a large pool of labels. However, I'm wondering how well that matches my problem. I want to feed a clustering algorithm so that I get clusters of labels that are all to be potentially considered the same (not necessarily just two labels). I don't have specific labels beforehand. And furthermore, even if it hypothetically took no time to find the closest matches for each of the 30,011 labels, I still need to do the clustering. I welcome your thoughts on the match between editDistanceSearcher and the problem as it is now described.
There are many ways to define clusters. What exactly is the one you are looking for?
If your clusters are defined by "groups of words separated by edit distance," i.e., you regard all those as a cluster where you have stepping stones to get from A to B without making any steps with a Levenshtein distance of, say, 4 or more, then knowing all the neighbors of your words is all you need.
That is not data you would put into kmeans or kmedoids, of course. Such neighboring information transforms the clustering into a graph theoretical problem. You'd set up an undirected graph with the neighborhood information you have and ask for connected components (or block-cut components to avoid spurious connections from outliers).
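A minimal sketch of that graph-theoretic clustering, assuming the neighbor search has already produced pairs of labels within the edit-distance threshold (the labels and pairs here are made up):

```python
def connected_components(nodes, edges):
    """Union-find: group nodes that are linked, directly or transitively."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)  # union the two components

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())

labels = ["id_007", "id_0O7", "id_o07", "idx_42", "idx_4Z"]  # hypothetical
neighbor_pairs = [("id_007", "id_0O7"), ("id_0O7", "id_o07"),
                  ("idx_42", "idx_4Z")]  # pairs within the threshold
print(connected_components(labels, neighbor_pairs))
```

With MATLAB's graph object, the same step would be graph(...) followed by conncomp; either way, no all-pairs distance matrix is required, only the thresholded neighbor pairs.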
This is my first foray into clustering, so it may be naive. My plan, however, was to get a 2D matrix of all-pairs distances and feed that into HDBSCAN. I've only read about HDBSCAN, so as I try this, I will become more familiar with it. Of course, the O(N^2) calculation of all-pairs distances may make this infeasible, depending on the number of text labels and whether the distance calculation is vectorized or parallelized.


FM on 17 Jun 2024
Edited: FM on 18 Jun 2024
@Stephen23 provided the technically correct answer, while @Nipun provided an extremely helpful alternative to explore, though I haven't yet decided to adopt it. I feel that both responses are useful answers, but only one can be accepted. Is there any MATLAB forum guidance on what should be formally designated as the answer, i.e., whether it is decided by which technically answers the question (which is still valuable) vs. which contributes possible alternative solutions?

Release: R2023b

Asked: FM on 10 Jun 2024

Commented: FM on 20 Jun 2024
