Fastest match between 2 sequences

19 views (last 30 days)
dymitr ruta
dymitr ruta on 27 Jan 2021
Edited: Bruno Luong on 28 Jan 2021
Hi All,
I am matching many long sequences against each other in a way that the shorter sequence x slides against the longer reference sequence r in a search for the best match: the number of matching symbols (characters).
Here is my naive looped code that finds the best match. Do you have any suggestions on how to turn it into the fastest possible (vectorized) code? There might be a number of same sized shorter vectors x coming as inputs, so if possible to exploit it in the vectorization it would be ideal. I know this could be done with convolution on the binary vectors but my vectors are character symbols. Any ideas would be much appreciated.
%Finds the best match between shorter sequence x sliding against a longer reference sequence r
%The slide starts from the last element x on top of the 1st element of r:
%start (i=-2) 'THE'->
% 'ABCFGHETYUI'
%best match (i=4): 'THE'
%match score s is the number of elements in x that match (align) with r.
%match position i could be from -m+1:n+m-2 i.e. takes n+m-1 possible values
function [i,s]=find_seq(x,r)
m=numel(x); n=numel(r);
for i=1:m-1 %Counts matches in the left partial overlap
s(i)=sum(x(end-i+1:end)==r(1:i));
end
for i=1:n-m+1 %Counts matches when x fully overlaps with r
s(i+m-1)=sum(x==r(i:i+m-1));
end
for i=1:m-1 %Counts matches in the right partial overlap
s(n+i)=sum(x(1:m-i)==r(n-m+i+1:end));
end
[s,i]=max(s); i=i-m+1;
  6 Comments
Bruno Luong
Bruno Luong on 27 Jan 2021
Well you didn't really answer my question. You have the article and you know now char can be converted too double numerical. You previously state
"I know this could be done with convolution on the binary vectors but my vectors are character symbols. A"
Now the vectors of character is no longer an obstacle for you. So I ask to know where intend to do.
Note that the author claim their technique is "faster known" and it is published in hight reputation journal. I'll not try to beat their algorithm like that.
dymitr ruta
dymitr ruta on 27 Jan 2021
Edited: dymitr ruta on 27 Jan 2021
Bruno, I meant convolution on binary, not numeric. Obviously I am aware that char is implemented as a de facto uint8 ('A'+0=65). Let me put it this way: Matlab's swalign that uses famous Smith-Waterman algorithm for aligning nucleotide sequences - a part of expensive Bioinformatics Toolbox works slower and produces worse match than the simple code above (tested on 30k sars-cov2 sequences). Hence since even this baseline looped solution gives me a speedup I am looking for further speedups while being cautious about big claims out there. I picked this paper as it appears to capture state of the art, but I am not sure if this could work fastest in specific Matlab environment in the multi-vector input format.
I am not a fanatic of conv, happy to take simpler approaches if they are faster in Matlab. In the meantime I was falsely excited about a new vectorization of the sliding vectors sum of matches via spdiags(bsxfun(@eq...:
function [i,s]=find_seq(x,r)
m=numel(x); n=numel(r); s=zeros(1,m+n-1);
[x,i]=spdiags(bsxfun(@eq,flipud(r'),fliplr(x)));
s(i+n)=sum(x); [s,i]=max(s); i=i-m+1;
Code reduced to loopless 4 lines but unfortunately spdiags (summing over matrix diagonals) takes forever and slows everything down :( resulting in the similar execution time. For the typical inputs of:
r=char(randi(256,1,1e6)-1); x=char(randi(256,1,1e2)-1);
I am getting a similar processing time of about 1 second for both versions, need to get much faster than that.

Sign in to comment.

Accepted Answer

Jan
Jan on 28 Jan 2021
Another approach:
function [i, s] = find_seq_v3(x, r)
m = numel(x);
n = numel(r);
s = zeros(1, n + m - 1);
for i = 1:m-1 % left partial overlap
s(i) = sum(x(m-i+1:m) == r(1:i));
end
% Original: Long loop over r:
% for i = 1:n-m+1 % fully overlaps with r
% s(i+m-1) = sum(x == r(i:i+m-1));
% end
% Faster: Short loop over x:
q = 0; % full overlap
for i = 1:m
q = q + (x(i) == r(i:n-m+i));
end
s(m:n) = q;
for i = 1:m-1 % right partial overlap
s(n+i) = sum(x(1:m-i) == r(n-m+i+1:n));
end
[s,i] = max(s);
i = i - m + 1;
end
Speed comparison:
  • Elapsed time is 2.224595 seconds. Original
  • Elapsed time is 0.802787 seconds. Pre-allocation & UINT8
  • Elapsed time is 0.184660 seconds. Pre-allocation & UINT8 & Short loop
Just 8% of the original runtime. The FFT method of the paper might be magically fast. But some small changes can accelerate the naive implementation remarkably already.
I'm going to mex this tomorrow.
  2 Comments
dymitr ruta
dymitr ruta on 28 Jan 2021
Great, many thanks Ian, lots of eseful tips, did not realize chars is 16 bits.
dymitr ruta
dymitr ruta on 28 Jan 2021
I am surprized, though, that when vectorizing it to accomodate matrices as inputs (multiple candidate patterns x), it does not accelerate the execution i.e. your riginal _v3 code run in a loop for every vector x completes faster than the below:
function [i,s] = find_seq_v4(x, r)
[N,m]=size(x); n=numel(r); s=zeros(N,m+n-1,'uint8');
for i=1:m-1 % left partial overlap
s(:,i)=sum(bsxfun(@eq,x(:,m-i+1:m),r(1:i)),2);
end
q=uint8(0); % full overlap
for i=1:m
q=q + cast(bsxfun(@eq,x(:,i),r(i:n-m+i)),'uint8');
end
s(:,m:n)=q;
for i=1:m-1 % right partial overlap
s(:,n+i) = sum(bsxfun(@eq,x(:,1:m-i),r(n-m+i+1:n)),2);
end
[s,i]=max(s,[],2); i=i-m+1;

Sign in to comment.

More Answers (1)

Jan
Jan on 27 Jan 2021
As far as I understand, you want to accelerate this piece of code:
function main
r = char(randi(256,1,1e6)-1);
x = char(randi(256,1,1e2)-1);
tic
[i,s] = find_seq(x,r);
toc
end
function [i,s]=find_seq(x,r)
m=numel(x); n=numel(r);
for i=1:m-1 %Counts matches in the left partial overlap
s(i)=sum(x(end-i+1:end)==r(1:i));
end
for i=1:n-m+1 %Counts matches when x fully overlaps with r
s(i+m-1)=sum(x==r(i:i+m-1));
end
for i=1:m-1 %Counts matches in the right partial overlap
s(n+i)=sum(x(1:m-i)==r(n-m+i+1:end));
end
[s,i]=max(s); i=i-m+1;
end
This is possible with some cheap methods:
function main2
r = randi([0, 255], 1, 1e6, 'uint8'); % UINT8 instead of CHAR
x = randi([0, 255], 1, 1e2, 'uint8');
tic
[i, s] = find_seq_v2(x,r);
toc
end
function [i, s] = find_seq_v2(x,r)
m = numel(x);
n = numel(r);
s = zeros(1, n + m + 1); % Pre-allocation!!!
for i = 1:m-1 % left partial overlap
s(i) = sum(x(end-i+1:end) == r(1:i));
end
for i = 1:n-m+1 % fully overlaps with r
s(i+m-1) = sum(x == r(i:i+m-1));
end
for i = 1:m-1 % right partial overlap
s(n+i) = sum(x(1:m-i) == r(n-m+i+1:end));
end
[s, i] = max(s);
i = i - m + 1;
end
Point 1:
Pre-allocating the output is essential, because the iterative growing of arrays needs a lot of ressources:
x = []; for k = 1:1e6, x(k)=k; end
Although the final array needs only 8MB, the operating system creates a new copy of the array in each iteration, such that sum(1:1e6)*8 Bytes are used: 4 TB!
Point 2:
CHAR is a 16 bit type. If you store values from [0, 255] only, use UINT8 instead.
On my mobile i5:
  • Elapsed time is 2.216123 seconds. Original
  • Elapsed time is 0.791529 seconds. Pre-allocation & UINT8
65% faster.
  3 Comments
Jan
Jan on 28 Jan 2021
Edited: Jan on 28 Jan 2021
You are right, Bruno. The memory copies are important.
I've created a lean copy of spdiags omitting all not needed parts, but this is still 40% slower than the above code.
for i = 1:n-m+1 % fully overlaps with r
s(i+m-1) = sum(x == r(i:i+m-1));
end
This sums the diagonals of x.'==r . Maybe a vectorized method can be faster than the loop, but creating the matrix x.'==r will exhaust my 8GB of RAM, if the real world application searchs 1e3 in 1e7 elements instead of 1e2 in 1e6 as in the test data.
I'm trying another approach: find_seq_v2 runs n+m loops and sums over m elements of the searched string. Because m<<n it might be faster to run m loops and accumulate the sums.
[EDITED] Yes, see version v3 in my next answer. 0.18 instead of 0.80 seconds.
Bruno Luong
Bruno Luong on 28 Jan 2021
Edited: Bruno Luong on 28 Jan 2021
My simple MEX (no openmp) timing
>> main
Elapsed time is 0.061210 seconds.
But dymitr is already happy with your answer so I won't post it.

Sign in to comment.

Categories

Find more on Loops and Conditional Statements in Help Center and File Exchange

Products


Release

R2020b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!