Fastest multiple binary search in the same sorted large vector

Question

dymitr ruta on 2 Jan 2021

0
Link

Direct link to this question

https://uk.mathworks.com/matlabcentral/answers/706923-fastest-multiple-binary-search-in-the-same-sorted-large-vector

Commented: Bruno Luong on 10 Jan 2021

Hi All, I have a challenging problem to find the fastest solution for as it is a big bottleneck in my code right now. Vector v [Nx1] contains the sorted rates of v(i)=c(i)/n(i) where c and n are positive integers all of which are accessible. Now I have another vector of candidate rates u that are created by the averages of various pairs of the original v rates, for example u(i,j)=(c(i)+c(j))/(n(i)+n(j)). All I want is to quickly find for all candidate rates in a vector u where it would appear in the original vector v - i.e. what is the first value v that is greater or equal than my candidate rate u(i,j). Since the candidate rates are generated by averageing pairs of rates I know that the searched position would be somwhere between i and j so the fastest solution I have so far is to run the binary search on the section of v(i:j) for the first value greater or equal than u(i,j) and then add the offset i-1. I am not sure if it is possible to exploit efficiently the fact that v is formed by rates of integer values and u is generated by averaging of these rates. If this is not possible to exploit efficiently I guess my question can be reduced how to what is the fastest implementation of multiple binary search within the same sorted (large) vector. For instance is it better to also first sort u and then exploit the results of binary search of the previous smaller u elements to speed up subsequent binary searches of larger values of u? Cheers

2 Comments
Show NoneHide None

Bruno Luong on 2 Jan 2021

Edited: Bruno Luong on 2 Jan 2021

All the options you cited might be good.

You can use stock function such as histc and histcounts. They are fast.

You might try some of my FEX using MEX such as

https://www.mathworks.com/matlabcentral/fileexchange/28930-merge-sorted-arrays?s_tid=srchtitle

by sorting u and try to find the location in v.

https://www.mathworks.com/matlabcentral/fileexchange/23049-find_idx?s_tid=srchtitle

You might adapt the later to start the search in (i,j) if you are fluent in C.

It would be helpful to know how big is your data, how fast you want it to perform and how do you stand now in term of speed.

dymitr ruta on 2 Jan 2021

Open in MATLAB Online

Thanks Bruno for a rapid response and seeding lots of ideas. My current performance on finding the positions of random 1m u values within the sorted 1m-element vector v is about 0.5 s:

v=sort(rand(1e6,1)); u=rand(1e6,1); tic; for i=1:numel(u) j(i)=find1(u(i),v); end; toc

Elapsed time is 0.480259 seconds.

Sign in to comment.

Sign in to answer this question.

Answer 1

Bruno Luong on 2 Jan 2021

1
Link

Direct link to this answer

https://uk.mathworks.com/matlabcentral/answers/706923-fastest-multiple-binary-search-in-the-same-sorted-large-vector#answer_589183

Edited: Bruno Luong on 2 Jan 2021

Open in MATLAB Online

v=sort(rand(1e6,1)); u=rand(1e6,1);
ntests = 10;
t=zeros(1,ntests);
for k=1:ntests
    tic; 
    [~,~,loc]=histcounts(u,v); % loc is the last index such that v(loc(i))<=u(i)
    t(k)=toc;
end
fprintf('mean time = %g sec\n', mean(t)) % mean time = 0.148221 sec

2 Comments
Show NoneHide None

dymitr ruta on 2 Jan 2021

Edited: dymitr ruta on 2 Jan 2021

Brilliant and impressively fast. Would you know how/if the loc can be extracted using histcounts or discretize functions such that it is the first index greater or equal u(i), rather than last lower equal?

Found it with parameter: ...'IncludedEdge','right') instead or 'left', smashing, thanks

Bruno Luong on 2 Jan 2021

Open in MATLAB Online

Like this.

[~,~,loc] = histcounts(-u,-flip(v)); 
loc = (length(v)+1)-loc; % loc is the first index such that v(loc(i))>=u(i)

But my other answer provide even faseter solution.

Sign in to comment.

Answer 2

Bruno Luong on 2 Jan 2021

0
Link

Direct link to this answer

https://uk.mathworks.com/matlabcentral/answers/706923-fastest-multiple-binary-search-in-the-same-sorted-large-vector#answer_589213

Open in MATLAB Online

Use FEX file here

v=sort(rand(1e6,1));
u=rand(1e6,1);
% Make sure v covers u on the right side
v(end)=max(v(end),max(u));
ntests = 10;
t=zeros(1,ntests);
for k=1:ntests
    tic
    [w,is] = sort(u);
    % Fex file
    % https://www.mathworks.com/matlabcentral/fileexchange/28930-merge-sorted-arrays?s_tid=srchtitle
    [~,j] = mergesa(v,w);
    loc = zeros(size(w));
    l=length(v);
    for n=length(j):-1:1
        jn=j(n);
        if jn<0
            loc(-jn)=l;
        else
            l=jn;
        end
    end
    loc(is) = loc;
    t(k) = toc;
end
fprintf('mean time = %g sec\n', mean(t)) % mean time = 0.0816771 sec

2 Comments
Show NoneHide None

dymitr ruta on 6 Jan 2021

Open in MATLAB Online

I have done some time analysis for searching for the first element in large sorted vector v that is >= the query elements in small vector u. For very large difference between the size of v and u for instance: v=sort(rand(1e8,1)), u=rand(1e2,1), my simple binary search:

%Finds an index of the first element j in (sorted) v that is >= elements in u: v(j)>=u
function j=bin_find(u,v)
n=numel(u); 
if n>1 
    for i=1:n j(i)=bin_find(u(i),v); end
    return;
end
i=1; j=numel(v); %Lower and upper indices
if u<=v(i) %Out of bounds handling;
    j=i; 
elseif u>v(j) j=0; 
else 
    while (i+1<j)
        m=(floor((i+j)/2)); %mid point
        if (v(m)<u) i=m; else j=m; end
    end
end

is the fastest I have found to date, much faster than 2nd best Matlab's discretize and way faster than Matalab's find:

nu=1e2; nv=1e8; u=rand(nu,1); v=sort(rand(nv,1)); 
tic; j=bin_find(u,v); t(1)=toc; 
tic; i=uint32(discretize(u,v,'IncludedEdge','right'))+1; t(2)=toc:
%Elapsed time t=[0.0046463 0.1795119]

However when the size of the query vector u is also large then Matlab's discretize is much better:

nu=1e7; nv=1e7; u=rand(nu,1); v=sort(rand(nv,1)); 
j=uint32(zeros(nu,1)); i=j; tic; j=bin_find(u,v); t(1)=toc; 
tic; i=uint32(discretize(u,v,'IncludedEdge','right'))+1; t(2)=toc
%Elapsed time t=[9.382821 0.5352835]

So my question: is it possible to vectorize binary search or update it to exploit previous searches such that it retains its lighning speed for multiple searches on the same sorted vector v?

Bruno Luong on 7 Jan 2021

Edited: Bruno Luong on 7 Jan 2021

Open in MATLAB Online

benchbinsearch.m

I did not find the similar time ratio as you, especially case 1e7 and 1e7, where discretize is only 3 time faster (you show 15 times faster)

Here is my results (R2020b, Windows laptop) with the attached test script (find_idx is mex binary search with the link I posted above)

--------------------------------------
Number of bins (nv)         = 10^(7)
Number of query points (nu) = 10^(7)
	Method "discretize"	: mean time = 1.95056 sec
	Method "find_idx"	: mean time = 0.996632 sec
	Method "bin_find"	: mean time = 6.77919 sec
--------------------------------------
Number of bins (nv)         = 10^(8)
Number of query points (nu) = 10^(2)
	Method "discretize"	: mean time = 0.150994 sec
	Method "find_idx"	: mean time = 0.00023296 sec
	Method "bin_find"	: mean time = 0.00030678 sec

You might implement a top level manager function that selects the fast algorithm depending on the size nu or the ratio nu/nv or such.

EDIT: I just improve my FEX submission FIND_IDX using multi threading MEX.

It now becomes the fatest in two test cases.

Sign in to comment.

Answer 3

dymitr ruta on 10 Jan 2021

0
Link

Direct link to this answer

https://uk.mathworks.com/matlabcentral/answers/706923-fastest-multiple-binary-search-in-the-same-sorted-large-vector#answer_594548

Open in MATLAB Online

Thanks Bruno find_idx is the current leader. Jus a remark that I found the matlab internal: matlab.internal.math.discretize(u,v,true) to be better than discretize (yet still slightly worse than your find_idx). I also improved my bin_find by presorting u and calling binary search alternating from both ends of sorted u thereby continuously narrowing down the bound for the search:

%Finds an index of the first element in (sorted) v that is >= elements in u 
function j=bin_find(u,v)
n=numel(u); N=numel(v);
[u,I]=sort(u);
j=zeros(n,1,'uint32'); J=[1 N];
j(1)=bf(u(1),v,1,N);
j(n)=bf(u(n),v,j(1),N);
for i=2:ceil(n/2) 
    j(i)=bf(u(i),v,j(i-1),j(n-i+2));
    j(n-i+1)=bf(u(n-i+1),v,j(i),j(n-i+2)); 
end
j(I)=j;
function j=bf(u,v,i,j)
if u<=v(i) j=i; elseif u>v(j) j=0; end %Out of bounds handling;
while (i+1<j)
    m=(floor((i+j)/2)); %mid point
    if (v(m)<u) i=m; else j=m; end
end

This improved it 3x for 1e7 x 1e7 case and also a little bit for 1e8 x 1e2 case, but still behind you and internal discretize:

%case\method     find_idx        matlab..discretize      bin_find_alt      bin_find             
%-----------------------------------------------------------------------------------------
%1e7 x 1e7 | 0.387706260000000   0.486588070000000   2.431423150000000   9.367798840000001
%1e8 x 1e2 | 0.000031492080000   0.000087360630000   0.000169353520000   0.000170555760000

I also implemented a fully recirsive binary search where after presorting u I check middle u in v and then send smaller half o u to one branch and the upper half to the the other with maximally narrowed bounds but it surprisingly takes much more time :(.

5 Comments
Show 3 older commentsHide 3 older comments

dymitr ruta on 10 Jan 2021

Open in MATLAB Online

you have to call it by:

matlab.internal.math.discretize(u,v,true);

true is not documented, just an internal matlab function of similar kind like ismembc2 and other alike, but its faster. You can get the feel for how its used by inspecting matlab's discretize code (edit discretize)

Bruno Luong on 10 Jan 2021

I see. Thanks for the tip.

Sign in to comment.

Fastest multiple binary search in the same sorted large vector

2 Comments
Show NoneHide None

Answers (3)

2 Comments
Show NoneHide None

2 Comments
Show NoneHide None

5 Comments
Show 3 older commentsHide 3 older comments

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

Fastest multiple binary search in the same sorted large vector

2 Comments Show NoneHide None

Answers (3)

2 Comments Show NoneHide None

2 Comments Show NoneHide None

5 Comments Show 3 older commentsHide 3 older comments

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

2 Comments
Show NoneHide None

2 Comments
Show NoneHide None

2 Comments
Show NoneHide None

5 Comments
Show 3 older commentsHide 3 older comments