A fast way to perform sparse matrix-free product with a vector

Question

Paulo Ribeiro on 4 Sep 2018

0
Link

Direct link to this question

https://uk.mathworks.com/matlabcentral/answers/417540-a-fast-way-to-perform-sparse-matrix-free-product-with-a-vector

Edited: Paulo Ribeiro on 8 Sep 2018

Dear all,

I was wondering if it is possible to increase the performance of a matrix-free product of a sparse matrix defined by 3 vectors (rows, columns and values) with another vector {B}. That is:

A(l,m)=v and performing {C}=[A]*{B}

So far I have the following strategy:

function C = mfree_times(l,m,v,B)
     aux = B(m);
     prod = v.*aux;
     C = accumarray(l,prod);
end

My intentions are to increase this performance as much as possible. The bottleneck is majorly given by accumarray function. I am already using l, m and v as gpuArrays. Please share your thoughts. Thanks!

10 Comments
Show 8 older commentsHide 8 older comments

James Tursa on 6 Sep 2018

@Paulo: So, based on your replies, your variables appear to have nothing to do with CSR format sparse matrices ... you simply have 3-tuples of row,column,value to use. My answer below is based on that assumption.

Paulo Ribeiro on 6 Sep 2018

James, Thanks. That’s right. The question is now updated.

Sign in to comment.

Sign in to answer this question.

Answer 1

James Tursa on 6 Sep 2018

1
Link

Direct link to this answer

https://uk.mathworks.com/matlabcentral/answers/417540-a-fast-way-to-perform-sparse-matrix-free-product-with-a-vector#answer_335760

Edited: James Tursa on 6 Sep 2018

Open in MATLAB Online

Assuming the vectors are 3-tuples of row, column, and value, and that all of the variables involved are full double vectors, here is a naive mex routine to do the multiplication and produce a full vector result. If you are working with R2018a and compiling with the -R2018a option then replace the mxGetPr calls with mxGetDoubles calls.

/*   C = mfree_timesx(r,c,v,B) does the equivalent of:
 *
 *   C = sparse(r,c,v,numel(B),numel(B)) * B
 *
 * where
 *
 *   r,c,v are 3-tuples of values from an MxM matrix:
 *     r = row index (1-based)
 *     c = column index (1-based)
 *     v = values
 *   B = column vector of size Mx1
 *   C = column vector of size Mx1
 *
 * It is assumed that the indexing contained in the r and c vectors are
 * all within an MxM square matrix (i.e., are integers between 1 and M).
 * (The code does *not* check for this. A violation will bomb this routine)
 *
 * Duplicate indexing within the r and c vectors is allowed, and is handled
 * by simply adding the results together (same as what would happen if you
 * did C = sparse(r,c,v,numel(B),numel(B)) * B).
 *
 * All inputs must be full real double vectors.
 *
 * The output is a full real double column vector.
 *
 * Programmer: James Tursa
 *
 */
/* Includes ----------------------------------------------------------- */
#include "mex.h"
/* Gateway ------------------------------------------------------------ */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mwSize i, n, m;
    double *Ar, *Ac, *Av, *B, *C;
      if( nrhs != 4 ) {
          mexErrMsgTxt("Need exactly four double input vectors");
      }
      if( nlhs > 1 ) {
          mexErrMsgTxt("Too many outputs");
      }
      if( !(mxIsDouble(prhs[0]) && mxIsDouble(prhs[1]) && mxIsDouble(prhs[2]) && mxIsDouble(prhs[3])) ||
            mxIsComplex(prhs[0]) || mxIsComplex(prhs[1]) || mxIsComplex(prhs[2]) || mxIsComplex(prhs[3]) ||
            mxIsSparse(prhs[0]) || mxIsSparse(prhs[1]) || mxIsSparse(prhs[2]) || mxIsSparse(prhs[3]) ) {
          mexErrMsgTxt("All inputs must be full real double vectors");
      }
      n = mxGetNumberOfElements(prhs[0]);
      if( mxGetNumberOfElements(prhs[1]) != n || mxGetNumberOfElements(prhs[2]) != n ) {
          mexErrMsgTxt("The first three inputs must be the same size");
      }
      m = mxGetNumberOfElements(prhs[3]);
      if( mxGetM(prhs[3]) != m ) {
          mexErrMsgTxt("The last input must be a column vector");
      }
      Ar = mxGetPr(prhs[0]);
      Ac = mxGetPr(prhs[1]);
      Av = mxGetPr(prhs[2]);
      B = mxGetPr(prhs[3]);
      plhs[0] = mxCreateDoubleMatrix(m,1,mxREAL);
      C = mxGetPr(plhs[0]);
      for( i=0; i<n; i++ ) {
          C[(mwSize)*Ar++ - 1] += *Av++ * B[(mwSize)*Ac++ - 1];
      }
  }

Some sample test code:

N = 5000;
r = ceil(N*rand(N*N,1));
c = ceil(N*rand(N*N,1));
v = rand(N*N,1);
B = rand(N,1);
disp(' ')
disp('mex routine')
tic
x = mfree_timesx(r,c,v,B);
toc
disp('sparse')
tic
s = sparse(r,c,v,numel(B),numel(B))*B;
toc
disp('accumarray')
tic
aux = B(c);
prod = v.*aux;
C = accumarray(r,prod);
toc
disp(' ')
disp('mex vs sparse')
max(abs(x-s))
disp('mex vs accumarray')
max(abs(x-C))
disp('sparse vs accumarray')
max(abs(s-C))
disp(' ')

and results:

mex routine
Elapsed time is 0.150372 seconds.
sparse
Elapsed time is 5.806035 seconds.
accumarray
Elapsed time is 0.666030 seconds.
mex vs sparse
ans =
  1.2960e-011
mex vs accumarray
ans =
     0
sparse vs accumarray
ans =
  1.2960e-011

4 Comments
Show 2 older commentsHide 2 older comments

Paulo Ribeiro on 7 Sep 2018

Dear James,

A GPU version for the 3 tuples and B will be more than welcome. Do you think it is manageable? Many thanks for your attention and work on this thread. Glad to find an expert :)

Best regards,

Paulo

Paulo Ribeiro on 8 Sep 2018

Edited: Paulo Ribeiro on 8 Sep 2018

Here are some values after @James contributions:

Problem: finite element, 2D elasticity, 116.000 nodes. Setup: i7 7700HQ, NVIDIA GTX 1050, Linux (gcc compiler); Solver: matrix-free conjugate gradient, as described on this thread

Solver time for CPU with MEX compilation (mfree_timesx) = 42s;

Solver time for CPU with mfree_times (no MEX) = 317s;

Solver time for GPU with mfree_times (no MEX) = 82s.

Really impressive results with MEX on CPU

Sign in to comment.

Answer 2

Bruno Luong on 5 Sep 2018

0
Link

Direct link to this answer

https://uk.mathworks.com/matlabcentral/answers/417540-a-fast-way-to-perform-sparse-matrix-free-product-with-a-vector#answer_335617

Can you give the detail about how you iterate ?

There seems no room to improve the basic block mfree_times

If you output is sparse and you add them, and internally it requires new memory allocation and merge sorted array.

Better building all collection triplets then call ACCUMARRAY once.

4 Comments
Show 2 older commentsHide 2 older comments

Paulo Ribeiro on 6 Sep 2018

Bruno,

After several attempts, I am really out of options. Vectors l,m and v are very large, up to 1e9. I imagine two possible scenarios:

1. Fast convergence and reduced number of iterations; 2. Translate this code to Fortran or C/C++.

Thanks!

Bruno Luong on 6 Sep 2018

Edited: Bruno Luong on 6 Sep 2018

It seems that your problem is memory and not inneficient calculation.

If the matrix is sparse (a lot of zero) you should consider my suggestion if squeezing subspace.

By do try to reduce memory by using storage the indexes with INT32, matrix values with SINGLE, or using TALL ARRAY for more efficient HD cache.

Sign in to comment.

A fast way to perform sparse matrix-free product with a vector

10 Comments
Show 8 older commentsHide 8 older comments

Accepted Answer

4 Comments
Show 2 older commentsHide 2 older comments

More Answers (1)

4 Comments
Show 2 older commentsHide 2 older comments

See Also

Categories

Tags

Community Treasure Hunt

A fast way to perform sparse matrix-free product with a vector

10 Comments Show 8 older commentsHide 8 older comments

Accepted Answer

4 Comments Show 2 older commentsHide 2 older comments

More Answers (1)

4 Comments Show 2 older commentsHide 2 older comments

See Also

Categories

Tags

Community Treasure Hunt

10 Comments
Show 8 older commentsHide 8 older comments

4 Comments
Show 2 older commentsHide 2 older comments

4 Comments
Show 2 older commentsHide 2 older comments