Asked by Michal Kvasnicka
on 20 Oct 2017

Hi, this is my code for permutation cycles counting. Any idea how to vectorize it to get some additional speedup for large set of permutations?

function count = PermCyclesCount(p)

[nR,nE] = size(p);

count = zeros(nR,1);

for j = 1:nR

i = 0;

count(j) = 0;

while i < nE

i = i + 1;

if p(j,i) > 0

flag = false; % flag is just to trick the computer into entering the while loop

first = i;

while (first ~= i) || ~flag

flag = true;

r = first;

first = p(j,first);

p(j,r) = 0; % The entries in the cycle are changed to zero to indicate that they have been `used'

end

count(j) = count(j) + 1;

end

end

end

end

Some testing data:

[~,perms] = sort(rand(1000000,60),2);

tic;c= PermCyclesCount(perms);toc

Elapsed time is 1.559758 seconds.

Answer by Jan
on 20 Oct 2017

Edited by Jan
on 20 Oct 2017

Accepted Answer

% VERSION 1

[nR, nE] = size(p);

count = zeros(nR, 1);

for j = 1:nR

q = p(j, :); % <== Cut out a vector to operate on

i = 0;

while i < nE

i = i + 1;

if q(i) > 0

flag = false;

first = i;

while (first ~= i) || ~flag

flag = true;

r = first;

first = q(first);

q(r) = 0;

end

count(j) = count(j) + 1;

end

end

end

This tiny change let the code run in 1.41 instead of 2.37 sec on my computer.

And this in 1.25 sec without the flag:

% VERSION 2

[nR, nE] = size(p);

count = zeros(nR, 1);

for j = 1:nR

q = p(j, :);

for i = 1:nE

if q(i) > 0

f = q(i);

q(i) = 0;

while f ~= i

r = f;

f = q(f);

q(r) = 0;

end

count(j) = count(j) + 1;

end

end

end

Michal Kvasnicka
on 20 Oct 2017

But your last code is about 2 times faster than mine. Thanks!

tic;c_= PermCyclesCount_(perms);toc

Elapsed time is 0.797457 seconds.

Jan
on 20 Oct 2017

This code is 20% faster, if the data a provided in columnwise order:

[~,perms] = sort(rand(60, 1000000), 1);

% VERSION 2.1

[nE, nE] = size(p); % <== Swapped

count = zeros(nR, 1);

for j = 1:nR

q = p(:, j); % <== Copy column

for i = 1:nE

if q(i) > 0

f = q(i);

q(i) = 0;

while f ~= i

r = f;

f = q(f);

q(r) = 0;

end

count(j) = count(j) + 1;

end

end

end

The C code is even 33% faster - see in the other answer.

Does the creation of perms matter also? Then:

[~,perms] = sort(rand(60, 1000000),1); % 4.88 sec

perms = Shuffle(repmat((1:60).', 1, 1000000), 1); % 2.95 sec

Michal Kvasnicka
on 23 Oct 2017

Sign in to comment.

Answer by Jan
on 20 Oct 2017

Edited by Jan
on 23 Oct 2017

A C-Mex has about the double speed of the VERSION 2:

// VERSION 3

#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])

{

size_t nR, nE, j, i;

double *c, *p, *pr;

int *q, f, r;

// Get input:

nR = mxGetM(prhs[0]);

nE = mxGetN(prhs[0]);

p = mxGetPr(prhs[0]);

// Create output:

plhs[0] = mxCreateDoubleMatrix((mwSize) nR, 1, mxREAL);

c = mxGetPr(plhs[0]);

q = (int *) mxMalloc((nE + 1) * sizeof(int));

for (j = 0; j < nR; j++) {

// Copy row p(j, :) to q:

pr = p++;

for (i = 1; i <= nE; i++) {

q[i] = (int) (*pr); // One based indexing for q

pr += nR;

}

for (i = 1; i <= nE; i++) {

if (q[i] != 0) {

f = q[i];

q[i] = 0;

while (f != i) {

r = f;

f = q[f];

q[r] = 0;

}

*c += 1.0;

}

}

c++;

}

mxFree(q);

return;

}

This is 33% faster, if you provide the input in columnwise order instead:

[~,perms] = sort(rand(60, 1000000),1);

// VERSION 3.1

#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])

{

// !!! FOR [nE x nR] !!!

size_t nR, nE, j, i;

double *c, *p, *pr;

int *q, f, r, ic;

// Get input:

nE = mxGetM(prhs[0]);

nR = mxGetN(prhs[0]);

p = mxGetPr(prhs[0]);

// Create output:

plhs[0] = mxCreateDoubleMatrix((mwSize) nR, 1, mxREAL);

c = mxGetPr(plhs[0]);

q = (int *) mxMalloc((nE + 1) * sizeof(int));

for (j = 0; j < nR; j++) {

// Copy row p(j, :) to q:

pr = p;

for (i = 1; i <= nE; i++) {

q[i] = (int) (*pr++); // One based indexing for q

}

ic = 0;

for (i = 1; i <= nE; i++) {

if (q[i] != 0) {

f = q[i];

q[i] = 0;

while (f != i) {

r = f;

f = q[f];

q[r] = 0;

}

ic++;

}

}

*c++ = (double) ic;

pr += nE;

}

mxFree(q);

return;

}

Michal Kvasnicka
on 23 Oct 2017

Sign in to comment.

Answer by Jos (10584)
on 20 Oct 2017

Not vectorised, but less checking, not sure if it is faster

nR = size(p,1) ;

count2 = zeros(nR,1) ;

tf = ~isnan(p) ;

for j = 1:nR

r1 = find(tf(j,:), 1 ,'first') ; % find the first cycle

while ~isempty(r1) % is there a cycle

count2(j) = count2(j) + 1 ;

while tf(j,r1)

tf(j,r1) = false ;

r2 = p(j,r1) ;

r1 = r2 ;

end

r1 = find(tf(j,:),1,'first') ; % find the next cycle

end

end

Michal Kvasnicka
on 20 Oct 2017

So, what is the main purpose to use profiler now?

I have two codes, doing same thing. First code is in profiler mode slower than second code. But, in non-profiler mode is the first code significantly faster than second code.

Please, let me know what the Matlab user is supposed to do in this situation? Or, what is the reason to use profiler, when its results are completely distorted?

Jan
on 20 Oct 2017

The profiler disables the JIT acceleration, which could re-order the code lines to improve speed. This cannot be allowed, if the time per line should be measured.

But the JIT accelerator is essential for the run time of such loops. Therefore the profiler has a limited use here. You can determine the commands which take the most time, but not the code blocks.

Michal Kvasnicka
on 20 Oct 2017

Sign in to comment.

Answer by Jos (10584)
on 20 Oct 2017

Edited by Jos (10584)
on 20 Oct 2017

This is upto 10 times faster than your code (and indeed about 100 times faster than my previous code):

[nR, nE] = size(p) ;

count3 = zeros(nR,1) ;

tf = true(nR,nE) ;

for Ri = 1:nR

for Ei = 1:nE,

if tf(Ri,Ei)

count3(Ri) = count3(Ri) + 1 ;

while tf(Ri,Ei)

tf(Ri,Ei) = false ;

r = p(Ri,Ei) ;

Ei = r ;

end

end

end

end

Michal Kvasnicka
on 20 Oct 2017

This code is definitely still a bit slower than mine :)

[~,perms] = sort(rand(1000000,60),2);

My code:

tic;c= PermCyclesCount(perms);toc

Elapsed time is 1.552508 seconds.

Your code:

tic;c_= PermCyclesCount_(perms);toc

Elapsed time is 1.664062 seconds.

Jan
on 20 Oct 2017

An improved version of Jos' code:

[nR, nE] = size(p) ;

count = zeros(nR,1) ;

tf = true(nE, nR) ; % Transposed to operate on column

for Ri = 1:nR

q = p(Ri, :); % Vector cut out

for Ei = 1:nE

k = Ei;

if tf(k, Ri)

count(Ri) = count(Ri) + 1 ;

while tf(k, Ri)

tf(k, Ri) = false ;

r = q(k) ;

k = r ;

end

end

end

end

To my surprise using a vector tf = true(nE, 1) is not faster.

Sign in to comment.

Opportunities for recent engineering grads.

Apply Today
## 0 Comments

Sign in to comment.