Fastest match between 2 sequences
19 views (last 30 days)
Show older comments
Hi All,
I am matching many long sequences against each other in a way that the shorter sequence x slides against the longer reference sequence r in a search for the best match: the number of matching symbols (characters).
Here is my naive looped code that finds the best match. Do you have any suggestions on how to turn it into the fastest possible (vectorized) code? There might be a number of same sized shorter vectors x coming as inputs, so if possible to exploit it in the vectorization it would be ideal. I know this could be done with convolution on the binary vectors but my vectors are character symbols. Any ideas would be much appreciated.
%Finds the best match between shorter sequence x sliding against a longer reference sequence r
%The slide starts from the last element x on top of the 1st element of r:
%start (i=-2) 'THE'->
% 'ABCFGHETYUI'
%best match (i=4): 'THE'
%match score s is the number of elements in x that match (align) with r.
%match position i could be from -m+1:n+m-2 i.e. takes n+m-1 possible values
function [i,s]=find_seq(x,r)
m=numel(x); n=numel(r);
for i=1:m-1 %Counts matches in the left partial overlap
s(i)=sum(x(end-i+1:end)==r(1:i));
end
for i=1:n-m+1 %Counts matches when x fully overlaps with r
s(i+m-1)=sum(x==r(i:i+m-1));
end
for i=1:m-1 %Counts matches in the right partial overlap
s(n+i)=sum(x(1:m-i)==r(n-m+i+1:end));
end
[s,i]=max(s); i=i-m+1;
6 Comments
Bruno Luong
on 27 Jan 2021
Well you didn't really answer my question. You have the article and you know now char can be converted too double numerical. You previously state
"I know this could be done with convolution on the binary vectors but my vectors are character symbols. A"
Now the vectors of character is no longer an obstacle for you. So I ask to know where intend to do.
Note that the author claim their technique is "faster known" and it is published in hight reputation journal. I'll not try to beat their algorithm like that.
Accepted Answer
Jan
on 28 Jan 2021
Another approach:
function [i, s] = find_seq_v3(x, r)
m = numel(x);
n = numel(r);
s = zeros(1, n + m - 1);
for i = 1:m-1 % left partial overlap
s(i) = sum(x(m-i+1:m) == r(1:i));
end
% Original: Long loop over r:
% for i = 1:n-m+1 % fully overlaps with r
% s(i+m-1) = sum(x == r(i:i+m-1));
% end
% Faster: Short loop over x:
q = 0; % full overlap
for i = 1:m
q = q + (x(i) == r(i:n-m+i));
end
s(m:n) = q;
for i = 1:m-1 % right partial overlap
s(n+i) = sum(x(1:m-i) == r(n-m+i+1:n));
end
[s,i] = max(s);
i = i - m + 1;
end
Speed comparison:
- Elapsed time is 2.224595 seconds. Original
- Elapsed time is 0.802787 seconds. Pre-allocation & UINT8
- Elapsed time is 0.184660 seconds. Pre-allocation & UINT8 & Short loop
Just 8% of the original runtime. The FFT method of the paper might be magically fast. But some small changes can accelerate the naive implementation remarkably already.
I'm going to mex this tomorrow.
2 Comments
More Answers (1)
Jan
on 27 Jan 2021
As far as I understand, you want to accelerate this piece of code:
function main
r = char(randi(256,1,1e6)-1);
x = char(randi(256,1,1e2)-1);
tic
[i,s] = find_seq(x,r);
toc
end
function [i,s]=find_seq(x,r)
m=numel(x); n=numel(r);
for i=1:m-1 %Counts matches in the left partial overlap
s(i)=sum(x(end-i+1:end)==r(1:i));
end
for i=1:n-m+1 %Counts matches when x fully overlaps with r
s(i+m-1)=sum(x==r(i:i+m-1));
end
for i=1:m-1 %Counts matches in the right partial overlap
s(n+i)=sum(x(1:m-i)==r(n-m+i+1:end));
end
[s,i]=max(s); i=i-m+1;
end
This is possible with some cheap methods:
function main2
r = randi([0, 255], 1, 1e6, 'uint8'); % UINT8 instead of CHAR
x = randi([0, 255], 1, 1e2, 'uint8');
tic
[i, s] = find_seq_v2(x,r);
toc
end
function [i, s] = find_seq_v2(x,r)
m = numel(x);
n = numel(r);
s = zeros(1, n + m + 1); % Pre-allocation!!!
for i = 1:m-1 % left partial overlap
s(i) = sum(x(end-i+1:end) == r(1:i));
end
for i = 1:n-m+1 % fully overlaps with r
s(i+m-1) = sum(x == r(i:i+m-1));
end
for i = 1:m-1 % right partial overlap
s(n+i) = sum(x(1:m-i) == r(n-m+i+1:end));
end
[s, i] = max(s);
i = i - m + 1;
end
Point 1:
Pre-allocating the output is essential, because the iterative growing of arrays needs a lot of ressources:
x = []; for k = 1:1e6, x(k)=k; end
Although the final array needs only 8MB, the operating system creates a new copy of the array in each iteration, such that sum(1:1e6)*8 Bytes are used: 4 TB!
Point 2:
CHAR is a 16 bit type. If you store values from [0, 255] only, use UINT8 instead.
On my mobile i5:
- Elapsed time is 2.216123 seconds. Original
- Elapsed time is 0.791529 seconds. Pre-allocation & UINT8
65% faster.
3 Comments
Jan
on 28 Jan 2021
Edited: Jan
on 28 Jan 2021
You are right, Bruno. The memory copies are important.
I've created a lean copy of spdiags omitting all not needed parts, but this is still 40% slower than the above code.
for i = 1:n-m+1 % fully overlaps with r
s(i+m-1) = sum(x == r(i:i+m-1));
end
This sums the diagonals of x.'==r . Maybe a vectorized method can be faster than the loop, but creating the matrix x.'==r will exhaust my 8GB of RAM, if the real world application searchs 1e3 in 1e7 elements instead of 1e2 in 1e6 as in the test data.
I'm trying another approach: find_seq_v2 runs n+m loops and sums over m elements of the searched string. Because m<<n it might be faster to run m loops and accumulate the sums.
[EDITED] Yes, see version v3 in my next answer. 0.18 instead of 0.80 seconds.
Bruno Luong
on 28 Jan 2021
Edited: Bruno Luong
on 28 Jan 2021
My simple MEX (no openmp) timing
>> main
Elapsed time is 0.061210 seconds.
But dymitr is already happy with your answer so I won't post it.
See Also
Categories
Find more on Loops and Conditional Statements in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!