Fastest match between 2 sequences

Question

dymitr ruta on 27 Jan 2021

0
Link

Direct link to this question

https://uk.mathworks.com/matlabcentral/answers/728503-fastest-match-between-2-sequences

Edited: Bruno Luong on 28 Jan 2021

Hi All,

I am matching many long sequences against each other in a way that the shorter sequence x slides against the longer reference sequence r in a search for the best match: the number of matching symbols (characters).

Here is my naive looped code that finds the best match. Do you have any suggestions on how to turn it into the fastest possible (vectorized) code? There might be a number of same sized shorter vectors x coming as inputs, so if possible to exploit it in the vectorization it would be ideal. I know this could be done with convolution on the binary vectors but my vectors are character symbols. Any ideas would be much appreciated.

%Finds the best match between shorter sequence x sliding against a longer reference sequence r
%The slide starts from the last element x on top of the 1st element of r:
%start (i=-2)    'THE'->
%                  'ABCFGHETYUI'
%best match (i=4):     'THE'
%match score s is the number of elements in x that match (align) with r.
%match position i could be from -m+1:n+m-2 i.e. takes n+m-1 possible values
function [i,s]=find_seq(x,r)
m=numel(x); n=numel(r);
for i=1:m-1                             %Counts matches in the left partial overlap 
    s(i)=sum(x(end-i+1:end)==r(1:i));
end
for i=1:n-m+1                           %Counts matches when x fully overlaps with r
    s(i+m-1)=sum(x==r(i:i+m-1));
end
for i=1:m-1                             %Counts matches in the right partial overlap
    s(n+i)=sum(x(1:m-i)==r(n-m+i+1:end));
end
[s,i]=max(s); i=i-m+1;

6 Comments
Show 4 older commentsHide 4 older comments

Bruno Luong on 27 Jan 2021

Well you didn't really answer my question. You have the article and you know now char can be converted too double numerical. You previously state

"I know this could be done with convolution on the binary vectors but my vectors are character symbols. A"

Now the vectors of character is no longer an obstacle for you. So I ask to know where intend to do.

Note that the author claim their technique is "faster known" and it is published in hight reputation journal. I'll not try to beat their algorithm like that.

dymitr ruta on 27 Jan 2021

Edited: dymitr ruta on 27 Jan 2021

Open in MATLAB Online

Bruno, I meant convolution on binary, not numeric. Obviously I am aware that char is implemented as a de facto uint8 ('A'+0=65). Let me put it this way: Matlab's swalign that uses famous Smith-Waterman algorithm for aligning nucleotide sequences - a part of expensive Bioinformatics Toolbox works slower and produces worse match than the simple code above (tested on 30k sars-cov2 sequences). Hence since even this baseline looped solution gives me a speedup I am looking for further speedups while being cautious about big claims out there. I picked this paper as it appears to capture state of the art, but I am not sure if this could work fastest in specific Matlab environment in the multi-vector input format.

I am not a fanatic of conv, happy to take simpler approaches if they are faster in Matlab. In the meantime I was falsely excited about a new vectorization of the sliding vectors sum of matches via spdiags(bsxfun(@eq...:

function [i,s]=find_seq(x,r)
m=numel(x); n=numel(r); s=zeros(1,m+n-1);
[x,i]=spdiags(bsxfun(@eq,flipud(r'),fliplr(x))); 
s(i+n)=sum(x); [s,i]=max(s); i=i-m+1;

Code reduced to loopless 4 lines but unfortunately spdiags (summing over matrix diagonals) takes forever and slows everything down :( resulting in the similar execution time. For the typical inputs of:

r=char(randi(256,1,1e6)-1); x=char(randi(256,1,1e2)-1);

I am getting a similar processing time of about 1 second for both versions, need to get much faster than that.

Sign in to comment.

Sign in to answer this question.

Answer 1

Jan on 28 Jan 2021

0
Link

Direct link to this answer

https://uk.mathworks.com/matlabcentral/answers/728503-fastest-match-between-2-sequences#answer_607963

Open in MATLAB Online

Another approach:

function [i, s] = find_seq_v3(x, r)
m = numel(x);
n = numel(r);
s = zeros(1, n + m - 1);
for i = 1:m-1               % left partial overlap
   s(i) = sum(x(m-i+1:m) == r(1:i));
end
% Original: Long loop over r:
% for i = 1:n-m+1             % fully overlaps with r
%    s(i+m-1) = sum(x == r(i:i+m-1));
% end
% Faster: Short loop over x:
q = 0;                      % full overlap
for i = 1:m
   q = q + (x(i) == r(i:n-m+i));
end
s(m:n) = q;
for i = 1:m-1               % right partial overlap
   s(n+i) = sum(x(1:m-i) == r(n-m+i+1:n));
end
[s,i] = max(s);
i     = i - m + 1;
end

Speed comparison:

Elapsed time is 2.224595 seconds. Original
Elapsed time is 0.802787 seconds. Pre-allocation & UINT8
Elapsed time is 0.184660 seconds. Pre-allocation & UINT8 & Short loop

Just 8% of the original runtime. The FFT method of the paper might be magically fast. But some small changes can accelerate the naive implementation remarkably already.

I'm going to mex this tomorrow.

2 Comments
Show NoneHide None

dymitr ruta on 28 Jan 2021

Great, many thanks Ian, lots of eseful tips, did not realize chars is 16 bits.

dymitr ruta on 28 Jan 2021

Open in MATLAB Online

I am surprized, though, that when vectorizing it to accomodate matrices as inputs (multiple candidate patterns x), it does not accelerate the execution i.e. your riginal _v3 code run in a loop for every vector x completes faster than the below:

function [i,s] = find_seq_v4(x, r)
[N,m]=size(x); n=numel(r); s=zeros(N,m+n-1,'uint8');
for i=1:m-1                         % left partial overlap
    s(:,i)=sum(bsxfun(@eq,x(:,m-i+1:m),r(1:i)),2);
end
q=uint8(0);                         % full overlap
for i=1:m
   q=q + cast(bsxfun(@eq,x(:,i),r(i:n-m+i)),'uint8');
end
s(:,m:n)=q;
for i=1:m-1                         % right partial overlap
   s(:,n+i) = sum(bsxfun(@eq,x(:,1:m-i),r(n-m+i+1:n)),2);
end
[s,i]=max(s,[],2); i=i-m+1;

Sign in to comment.

Answer 2

Jan on 27 Jan 2021

0
Link

Direct link to this answer

https://uk.mathworks.com/matlabcentral/answers/728503-fastest-match-between-2-sequences#answer_607913

Open in MATLAB Online

As far as I understand, you want to accelerate this piece of code:

function main
r = char(randi(256,1,1e6)-1);
x = char(randi(256,1,1e2)-1);
tic
[i,s] = find_seq(x,r);
toc
end
function [i,s]=find_seq(x,r)
m=numel(x); n=numel(r);
for i=1:m-1                             %Counts matches in the left partial overlap 
    s(i)=sum(x(end-i+1:end)==r(1:i));
end
for i=1:n-m+1                           %Counts matches when x fully overlaps with r
    s(i+m-1)=sum(x==r(i:i+m-1));
end
for i=1:m-1                             %Counts matches in the right partial overlap
    s(n+i)=sum(x(1:m-i)==r(n-m+i+1:end));
end
[s,i]=max(s); i=i-m+1;
end

This is possible with some cheap methods:

function main2
r = randi([0, 255], 1, 1e6, 'uint8');  % UINT8 instead of CHAR
x = randi([0, 255], 1, 1e2, 'uint8');
tic
[i, s] = find_seq_v2(x,r);
toc
end
function [i, s] = find_seq_v2(x,r)
m = numel(x);
n = numel(r);
s = zeros(1, n + m + 1);    % Pre-allocation!!!
for i = 1:m-1               % left partial overlap 
    s(i) = sum(x(end-i+1:end) == r(1:i));
end
for i = 1:n-m+1             % fully overlaps with r
    s(i+m-1) = sum(x == r(i:i+m-1));
end
for i = 1:m-1               % right partial overlap
    s(n+i) = sum(x(1:m-i) == r(n-m+i+1:end));
end
[s, i] = max(s);
i      = i - m + 1;
end

Point 1:

Pre-allocating the output is essential, because the iterative growing of arrays needs a lot of ressources:

x = []; for k = 1:1e6, x(k)=k; end

Although the final array needs only 8MB, the operating system creates a new copy of the array in each iteration, such that sum(1:1e6)*8 Bytes are used: 4 TB!

Point 2:

CHAR is a 16 bit type. If you store values from [0, 255] only, use UINT8 instead.

On my mobile i5:

Elapsed time is 2.216123 seconds. Original
Elapsed time is 0.791529 seconds. Pre-allocation & UINT8

65% faster.

3 Comments
Show 1 older commentHide 1 older comment

Jan on 28 Jan 2021

Edited: Jan on 28 Jan 2021

Open in MATLAB Online

You are right, Bruno. The memory copies are important.

I've created a lean copy of spdiags omitting all not needed parts, but this is still 40% slower than the above code.

for i = 1:n-m+1             % fully overlaps with r
    s(i+m-1) = sum(x == r(i:i+m-1));
end

This sums the diagonals of x.'==r . Maybe a vectorized method can be faster than the loop, but creating the matrix x.'==r will exhaust my 8GB of RAM, if the real world application searchs 1e3 in 1e7 elements instead of 1e2 in 1e6 as in the test data.

I'm trying another approach: find_seq_v2 runs n+m loops and sums over m elements of the searched string. Because m<<n it might be faster to run m loops and accumulate the sums.

[EDITED] Yes, see version v3 in my next answer. 0.18 instead of 0.80 seconds.

Bruno Luong on 28 Jan 2021

Edited: Bruno Luong on 28 Jan 2021

Open in MATLAB Online

My simple MEX (no openmp) timing

>> main
Elapsed time is 0.061210 seconds.

But dymitr is already happy with your answer so I won't post it.

Sign in to comment.

Fastest match between 2 sequences

6 Comments
Show 4 older commentsHide 4 older comments

Accepted Answer

2 Comments
Show NoneHide None

More Answers (1)

3 Comments
Show 1 older commentHide 1 older comment

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

Fastest match between 2 sequences

6 Comments Show 4 older commentsHide 4 older comments

Accepted Answer

2 Comments Show NoneHide None

More Answers (1)

3 Comments Show 1 older commentHide 1 older comment

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

6 Comments
Show 4 older commentsHide 4 older comments

2 Comments
Show NoneHide None

3 Comments
Show 1 older commentHide 1 older comment