Fastest way to dot product all columns of two matrices of the same size

I have come up against this problem over and over, and I have a solution that works, but it seems non-optimal speed-wise. Does anybody have a better way?
Problem: I will typically have two matrices of the same size (A and B) that have a small number of rows (m ~ 5-10) and a large number of columns (n ~ 1000) (think of each row as a "state" and each column as the state values at a point in time). Is there an efficient way to take the corresponding columns of each matrix and dot product them, ending up with a row vector of length n?
My current solution is:
C = sum(A.*B, 1);
The reason I don't like this is that for single column vectors
sum(v1.*v2, 1)
is about twice as slow as the regular dot product, so I am thinking there is a better way to do this for the matrix problem (it is typically executed hundreds of thousands of times in my code, which is why I care).
Btw, looping over the columns and taking the dot product of each is really slow, so don't bother trying anything with a for loop...

Answers (2)

A direct C-mex can avoid creating the large temporary array A.*B by accumulating the sum immediately:
// Compile:
// DOT as C-loop: mex -O SumAB1.c libmwblas.lib -DUSE_C_LOOP
// DOT as BLAS call: mex -O SumAB1.c libmwblas.lib
//
// Tested: Matlab 7.8, 7.13, Win7/64, Compiler: MSVC2008
// Author: Jan Simon, Heidelberg, (C) 2012 matlab.2010ATnMINUSsimonDOTde
// License: BSD
#include "mex.h"

#ifndef USE_C_LOOP
double ddot(mwSignedIndex *k, double *a, mwSignedIndex *ai,
            double *b, mwSignedIndex *bi);
void CoreBlas(double *a, double *b, double *c, mwSignedIndex M, mwSignedIndex N);
#else
void CoreLoop(double *a, double *b, double *c, mwSize M, mwSize N);
#endif

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
  mwSize M, N;
  const mxArray *A, *B;

  // Proper number of arguments:
  if (nrhs != 2) {
     mexErrMsgTxt("*** SumAB1[mex]: 2 inputs required.");
  }
  if (nlhs > 1) {
     mexErrMsgTxt("*** SumAB1[mex]: 1 output allowed.");
  }

  // Pointers to inputs:
  A = prhs[0];
  B = prhs[1];
  M = mxGetM(A);
  N = mxGetN(A);

  // Inputs must be real double matrices of the same size:
  if (!mxIsDouble(A) || mxIsComplex(A) || mxGetNumberOfDimensions(A) != 2 ||
      !mxIsDouble(B) || mxIsComplex(B) || mxGetNumberOfDimensions(B) != 2 ||
      mxGetM(B) != M || mxGetN(B) != N) {
     mexErrMsgTxt("*** SumAB1[mex]: "
                  "Inputs must be real double matrices of the same size.");
  }

  // Create output:
  plhs[0] = mxCreateDoubleMatrix(1, N, mxREAL);

#ifndef USE_C_LOOP
  CoreBlas(mxGetPr(A), mxGetPr(B), mxGetPr(plhs[0]), M, N);
#else
  CoreLoop(mxGetPr(A), mxGetPr(B), mxGetPr(plhs[0]), M, N);
#endif
  return;
}

// *****************************************************************************
#ifdef USE_C_LOOP
void CoreLoop(double *a, double *b, double *c, mwSize M, mwSize N)
{
  // DOT product as C loop:
  mwSize i, j;
  double s;

  for (i = 0; i < N; i++) {
     s = 0.0;
     for (j = 0; j < M; j++) {
        s += *a++ * *b++;
     }
     c[i] = s;
  }
  return;
}

// *****************************************************************************
#else
void CoreBlas(double *a, double *b, double *c, mwSignedIndex M, mwSignedIndex N)
{
  // DOT product as BLAS call, faster for M > 100:
  mwSignedIndex i, one = 1;

  for (i = 0; i < N; i++) {
     c[i] = ddot(&M, a + i * M, &one, b + i * M, &one);
  }
  return;
}
#endif
With MSVC 2008 I get about double the speed of sum(A.*B, 1) in Matlab 2009a/64/Win7.
[EDITED] Inserted a compiler directive to perform the DOT using the faster BLAS DDOT call.
[EDITED 2] Sorry, DDOT is faster only for about M>100, not for M=10!
A hard-coded loop is 16% faster than the generic C-loop method above for [10 x 1000] matrices:
void CoreLoop_10(double *a, double *b, double *c, mwSize N)
{
  // DOT product as unrolled C loop for !!! M == 10 !!!:
  mwSize i;
  double s;
  register double *a0 = a, *a1 = a + 1, *a2 = a + 2, *a3 = a + 3, *a4 = a + 4,
                  *b0 = b, *b1 = b + 1, *b2 = b + 2, *b3 = b + 3, *b4 = b + 4;

  for (i = 0; i < N; i++) {
     s  = *a0 * *b0 + *a1 * *b1 + *a2 * *b2 + *a3 * *b3 + *a4 * *b4;
     a0 += 5;  a1 += 5;  a2 += 5;  a3 += 5;  a4 += 5;
     b0 += 5;  b1 += 5;  b2 += 5;  b3 += 5;  b4 += 5;
     s += *a0 * *b0 + *a1 * *b1 + *a2 * *b2 + *a3 * *b3 + *a4 * *b4;
     a0 += 5;  a1 += 5;  a2 += 5;  a3 += 5;  a4 += 5;
     b0 += 5;  b1 += 5;  b2 += 5;  b3 += 5;  b4 += 5;
     c[i] = s;
  }
  return;
}
But then the main function has to branch to a different kernel for each value of M. An SSE3 approach would be even faster, but then you have to take care of the 128-bit alignment.

6 Comments

Wow, nice work. I was really just wondering if I was missing some built-in matrix-slicing functionality in Matlab, but this solution certainly passes the "optimized" test.
Cheers.
See [EDITED]: a further 20% gained by calling BLAS:DDOT.
This C-mex uses one core only. Unfortunately your arrays are small, and starting multiple threads for [1000 x 10] matrices would consume too much time. But you can open a thread pool, such that the threads do not have to be restarted on every call but sleep in between.
If you use PARFOR in the Matlab level, the CPU cores are loaded already and the single-threaded C-mex is sufficient.
@Jan: My experience with MTIMESX is that the speed of DDOT vs unrolled self-coded loops is highly dependent on compiler and MATLAB version. I have found it very difficult to come up with an algorithm for determining at runtime which method is going to be fastest for a given size.
@James: I assume that in Matlab 2009a the very fast MKL library is used when a C-mex calls DDOT. But this is only guessing: how do I check which library is actually used?
Can MTIMESX help to solve the OP's problem faster?
@Jan: Well, in my 32-bit WinXP R2011a, DDOT is 25% slower than the hand coded loops you have above (timing for 10x1000000). Not sure what 3rd party BLAS library R2011a uses, but as I recall it is not MKL (this info used to be listed in the doc). But on your machine the BLAS is faster. This is a poster child case for why I have difficulty figuring out at runtime which method to use in MTIMESX. Regarding MTIMESX use, this would likely not help in this case since it just adds some overhead and performs the same basic loops you have above in the background. Also, there would be some reshapes involved to coax MTIMESX to do this calculation, which again adds a little bit of overhead (maybe I can put this particular calculation as a direct calculation option in a future version of MTIMESX to avoid the reshapes). One exception would be if the dot product size was very large (>>10), in which case the unrolled loops and OpenMP in MTIMESX perform a little faster than the straight loops you have above.
Doh. I've read the corresponding sentence in the question three times because I've confused this too often in the forums already. Nevertheless, I've measured the timings with 1000x10 matrices instead of 10x1000.
Then the BLAS:DDOT method takes 30% more time than SUM(A.*B)!
New timings:
A = rand(10, 1000); B = rand(10, 1000);
tic; for i=1:10000; C2 = MyMexFunc(A, B); end; toc
Hard coded for M=10: 0.162 sec
The C-loop "s += *a++ * *b++": 0.196 sec
BLAS:DDOT: 0.444 sec
sum(A.*B): 0.344 sec


When you say you're doing the dot product, I assume you're doing this:
v1'*v2
and not
dot(v1, v2)
The equivalent for your scenario, where you have matrices, would be something like this:
diag(A'*B)'
But that adds a lot of overhead with extra calls to diag and transpose.
Have you tried calling sum without the dimension parameter?
C = sum(A.*B);
This is slightly faster than
C = sum(A.*B, 1);

1 Comment

Using diag(A'*B) would compute 1,000,000 values (instead of just the 1000 that I want), so it would be far from efficient.
Dropping the "1" from the sum command gives a 10% or so speed-up, which is fairly nice. Thanks for that one.



Asked: on 20 Feb 2012
Edited: Jan on 11 Mar 2014
