Fastest way to dot product all columns of two matrices of the same size

I have come up against this problem over and over, and I have a solution that works, but it seems non-optimal speed-wise. Does anybody have a better way?
Problem: I will typically have two matrices of the same size (A and B) that have a small number of rows (m ~ 5-10) and a large number of columns (n ~ 1000) (think of each row as a "state" and each column as the state values at a point in time). Is there an efficient way to take the corresponding columns of each matrix and dot product them, ending up with a row vector of length n?
My current solution is:
C = sum(A.*B, 1);
The reason I don't like this is that for single column vectors
sum(v1.*v2, 1)
is about twice as slow as the regular dot product, so I am thinking there is a better way to do this for the matrix problem (it is typically executed hundreds of thousands of times in my code, which is why I care).
Btw, looping over the columns and taking the dot product of each is really slow, so don't bother trying anything with a for loop...

Answers (2)

A direct C-mex can avoid creating the large temporary array A.*B by accumulating the sum immediately:
// Compile:
// DOT as C-loop: mex -O SumAB1.c libmwblas.lib -DUSE_C_LOOP
// DOT as BLAS call: mex -O SumAB1.c libmwblas.lib
//
// Tested: Matlab 7.8, 7.13, Win7/64, Compiler: MSVC2008
// Author: Jan Simon, Heidelberg, (C) 2012 matlab.2010ATnMINUSsimonDOTde
// License: BSD
#include "mex.h"

#ifndef USE_C_LOOP
double ddot(mwSignedIndex *k, double *a, mwSignedIndex *ai,
            double *b, mwSignedIndex *bi);
void CoreBlas(double *a, double *b, double *c, mwSignedIndex M, mwSignedIndex N);
#else
void CoreLoop(double *a, double *b, double *c, mwSize M, mwSize N);
#endif

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
  mwSize M, N;
  const mxArray *A, *B;

  // Proper number of arguments:
  if (nrhs != 2) {
     mexErrMsgTxt("*** SumAB1[mex]: 2 inputs required.");
  }
  if (nlhs > 1) {
     mexErrMsgTxt("*** SumAB1[mex]: 1 output allowed.");
  }

  // Pointers to inputs:
  A = prhs[0];
  B = prhs[1];
  M = mxGetM(A);
  N = mxGetN(A);

  // Inputs must be real double matrices of the same size:
  if (!mxIsDouble(A) || mxIsComplex(A) || mxGetNumberOfDimensions(A) != 2 ||
      !mxIsDouble(B) || mxIsComplex(B) || mxGetNumberOfDimensions(B) != 2 ||
      mxGetM(B) != M || mxGetN(B) != N) {
     mexErrMsgTxt("*** SumAB1[mex]: "
                  "Inputs must be real double matrices of the same size.");
  }

  // Create output:
  plhs[0] = mxCreateDoubleMatrix(1, N, mxREAL);

#ifndef USE_C_LOOP
  CoreBlas(mxGetPr(A), mxGetPr(B), mxGetPr(plhs[0]), M, N);
#else
  CoreLoop(mxGetPr(A), mxGetPr(B), mxGetPr(plhs[0]), M, N);
#endif
  return;
}

// *****************************************************************************
#ifdef USE_C_LOOP
void CoreLoop(double *a, double *b, double *c, mwSize M, mwSize N)
{
  // DOT product as C loop:
  mwSize i, j;
  double s;

  for (i = 0; i < N; i++) {
     s = 0.0;
     for (j = 0; j < M; j++) {
        s += *a++ * *b++;
     }
     c[i] = s;
  }
  return;
}

// *****************************************************************************
#else
void CoreBlas(double *a, double *b, double *c, mwSignedIndex M, mwSignedIndex N)
{
  // DOT product as BLAS call, faster for M > 100:
  mwSignedIndex i, one = 1;

  for (i = 0; i < N; i++) {
     c[i] = ddot(&M, a + i * M, &one, b + i * M, &one);
  }
  return;
}
#endif
With MSVC 2008 I get about double the speed of sum(A.*B, 1) in Matlab 2009a/64/Win7.
[EDITED] Inserted a compiler directive to perform the DOT using the faster BLAS DDOT call.
[EDITED 2] Sorry, DDOT is faster only for about M>100, not for M=10!
A hard-coded loop is 16% faster than the generic C-loop method above for [10 x 1000] matrices:
void CoreLoop_10(double *a, double *b, double *c, mwSize N)
{
  // DOT product as unrolled C loop for !!! M == 10 !!!:
  mwSize i;
  double s;
  register double *a0 = a, *a1 = a + 1, *a2 = a + 2, *a3 = a + 3, *a4 = a + 4,
                  *b0 = b, *b1 = b + 1, *b2 = b + 2, *b3 = b + 3, *b4 = b + 4;

  for (i = 0; i < N; i++) {
     s  = *a0 * *b0 + *a1 * *b1 + *a2 * *b2 + *a3 * *b3 + *a4 * *b4;
     a0 += 5;  a1 += 5;  a2 += 5;  a3 += 5;  a4 += 5;
     b0 += 5;  b1 += 5;  b2 += 5;  b3 += 5;  b4 += 5;
     s += *a0 * *b0 + *a1 * *b1 + *a2 * *b2 + *a3 * *b3 + *a4 * *b4;
     a0 += 5;  a1 += 5;  a2 += 5;  a3 += 5;  a4 += 5;
     b0 += 5;  b1 += 5;  b2 += 5;  b3 += 5;  b4 += 5;
     c[i] = s;
  }
  return;
}
But then the main function has to branch to a different kernel for each value of M. An SSE3 approach would be even faster, but then you have to take care of the 128-bit alignment.

6 Comments

Wow, nice work. I was really just wondering if I was missing some built-in matrix-slicing functionality in Matlab, but this solution certainly passes the "optimized" test.
Cheers.
See [EDITED]: a further 20% gained by calling BLAS:DDOT.
This C-mex uses one core only. Unfortunately your arrays are small, and starting multiple threads for [1000 x 10] matrices would consume too much time. But you can open a thread pool, such that the threads do not have to be restarted on every call but sleep in between.
If you use PARFOR in the Matlab level, the CPU cores are loaded already and the single-threaded C-mex is sufficient.
@Jan: My experience with MTIMESX is that the speed of DDOT vs unrolled self-coded loops is highly dependent on compiler and MATLAB version. I have found it very difficult to come up with an algorithm for determining at runtime which method is going to be fastest for a given size.
@James: I assume that in Matlab 2009a the very fast MKL library is used when a C-mex calls DDOT. But this is only guessing: how do I check which library is actually used?
Can MTIMESX help to solve the OP's problem faster?
@Jan: Well, in my 32-bit WinXP R2011a, DDOT is 25% slower than the hand coded loops you have above (timing for 10x1000000). Not sure what 3rd party BLAS library R2011a uses, but as I recall it is not MKL (this info used to be listed in the doc). But on your machine the BLAS is faster. This is a poster child case for why I have difficulty figuring out at runtime which method to use in MTIMESX. Regarding MTIMESX use, this would likely not help in this case since it just adds some overhead and performs the same basic loops you have above in the background. Also, there would be some reshapes involved to coax MTIMESX to do this calculation, which again adds a little bit of overhead (maybe I can put this particular calculation as a direct calculation option in a future version of MTIMESX to avoid the reshapes). One exception would be if the dot product size was very large (>>10), in which case the unrolled loops and OpenMP in MTIMESX perform a little faster than the straight loops you have above.
Doh. I've read the corresponding sentence in the question three times because I've confused this too often in the forums already. Nevertheless, I've measured the timings with 1000x10 matrices instead of 10x1000.
Then the BLAS:DDOT method takes 30% more time than SUM(A.*B)!
New timings:
A = rand(10, 1000); B = rand(10, 1000);
tic; for i=1:10000; C2 = MyMexFunc(A, B); end; toc
Hard coded for M=10: 0.162 sec
The C-loop "s += *a++ * *b++": 0.196 sec
BLAS:DDOT: 0.444 sec
sum(A.*B): 0.344 sec


When you say you're doing the dot product, I assume you're doing this:
v1'*v2
and not
dot(v1, v2)
The equivalent for your scenario, where you have matrices, would be something like this:
diag(A'*B)'
But that adds a lot of overhead with extra calls to diag and transpose.
Have you tried calling sum without the dimension parameter?
C = sum(A.*B);
This is slightly faster than
C = sum(A.*B, 1);

1 Comment

Using diag(A'*B) would compute 1,000,000 values (instead of just the 1000 that I want), so it would be far from efficient.
Dropping the "1" from the sum command gives a 10% or so speed-up, which is fairly nice. Thanks for that one.



Asked: on 20 Feb 2012
Edited: Jan on 11 Mar 2014
