Need to vectorize efficient calculation of only certain values in the matrix multiplication A * B, using a logical array L the size of A * B

I have matrices A (m by v) and B (v by n). I also have a logical matrix L (m by n).
I am interested in calculating, as efficiently as possible, only the values of A * B that correspond to 1s in the logical matrix L. Essentially, I am interested in the quantity ( A * B ) .* L .
For my problem, a typical L matrix has fewer than 0.1% of its values equal to 1; the vast majority are 0s. Thus it makes no sense for me to directly perform ( A * B ) .* L ; it would actually be faster to loop over each element of A * B that I want to compute, but even that is inefficient.
-------------------------------------------------------------------------------------------------------------------------------------------------------
Possible solution (need help vectorizing this code if possible)
My particular problem may have a nice solution given that the logical matrix L has a nice structure.
This L matrix is nice in that it can be represented as something like a permuted block matrix. The example L is composed of 9 "blocks" of 1s, where each block of 1s has its own set of row and column indices; each highlighted block of 1s can be seen as a particular submatrix of L.
My solution was as follows. I can get the row and column indices of each block's submatrix in L, organized in two cell arrays, "rowidxs_list" and "colidxs_list", each with one cell per block. For instance, for subblock 1 in the block example I gave, I could calculate those particular values in A * B by simply doing A( rowidxs_list{1} , : ) * B( : , colidxs_list{1} ) .
That means that if I precompute rowidxs_list and colidxs_list (ignore the cost of calculating these lists; it is negligible for my application), then my problem of calculating C = ( A * B ) .* L could effectively be done by:
C = sparse( m , n );
for i = 1:length( rowidxs_list )
    C( rowidxs_list{i} , colidxs_list{i} ) = A( rowidxs_list{i} , : ) * B( : , colidxs_list{i} );
end
This seems like it would be the most efficient way to solve this problem if I knew how to vectorize this for loop. Does anyone see a way to vectorize this?
There may be ways to vectorize if certain things hold, e.g. if rowidxs_list and colidxs_list were numeric matrices instead of cell arrays (with each row of a matrix being an index list, replacing rowidxs_list{i} with rowidxs_list(i,:) ). I'd prefer to keep cell arrays here if possible, since different lists can have different numbers of elements.
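One change that usually helps regardless of vectorization: repeated indexed assignment into a sparse matrix (the C( rowidxs_list{i} , colidxs_list{i} ) = ... line) reallocates the sparse structure on every iteration. Here is a sketch of the same block loop that instead gathers (row, col, value) triplets and calls sparse once at the end, assuming the same A, B, m, n, rowidxs_list, and colidxs_list as above:

```matlab
nblocks = numel(rowidxs_list);
Ic = cell(1, nblocks); Jc = cell(1, nblocks); Vc = cell(1, nblocks);
for i = 1:nblocks
    % All (row, col) pairs of this block, in column-major order
    [rr, cc] = ndgrid(rowidxs_list{i}, colidxs_list{i});
    % The block of A*B we actually need (one small dense multiply)
    Vi = A(rowidxs_list{i}, :) * B(:, colidxs_list{i});
    Ic{i} = rr(:); Jc{i} = cc(:); Vc{i} = Vi(:);
end
% One sparse() call builds C; since blocks do not overlap, no values collide
C = sparse(vertcat(Ic{:}), vertcat(Jc{:}), vertcat(Vc{:}), m, n);
```

The loop still exists, but each iteration is one dense BLAS multiply plus cheap index bookkeeping, and the expensive sparse assembly happens exactly once.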
-------------------------------------------------------------------------------------------------------------------------------------------------------
other suggested solution (creating a mex file?)
I first posted this question on the /r/matlab subreddit; see the reddit thread. The user "qtac" recommended a C-MEX function, i.e. a MATLAB-callable function written in the C programming language:
My gut feeling is the only way to really optimize this is with a C-MEX solution; otherwise, you are going to get obliterated by overhead from subsref in these loops. With C you could loop over L until you find a nonzero element, and then do only the row-column dot product needed to populate that specific element. You will miss out on a lot of the BLAS optimizations but the computational savings may make up for it.
Honestly I bet an LLM could write 90%+ of that MEX function for you; it's a well-formulated problem.
I think this could be a good solution to pursue, but I'd like other opinions as well.

8 Comments

We would need to know typical sizes involved to offer any meaningful advice. Is there any pattern to the 1's in the L matrix?
"Thus, it makes no sense for me to directly perform ( A * B ) .* L , it would actually be faster to loop over each element of A * B"
Have you actually tested to see if it is faster? Matlab extensively parallelizes operations like (A*B).*L, so it is not always clear if reducing the number of operations will translate into improved speed.
responding to James:
"We would need to know typical sizes involved to offer any meaningful advice. Is there any pattern to the 1's in the L matrix?"
It's hard for me to say what the "typical" sizes of m, n, and v are for my application; they vary over huge ranges.
Dimension "v" can be anywhere from very small to very big, e.g. 20 to 20,000. Dimensions "m" and "n" also have very wide ranges; either one can be between 400 and 200,000. Assuming max values for m and n of 200,000, L can be at most a 40-billion-entry sparse matrix. Additionally, the larger these L matrices are, the smaller the percentage of nonzero entries (e.g. in the max case of 40 billion entries, I'd assume that fewer than 1e-5 % of the entries of L are nonzero, about 400,000). But I'd assume that in most of my cases, the vast majority of entries are zero.
--------------------------------------------------------------------------------------------------
"Is there any pattern to the 1's in the L matrix?"
Yes, the L matrix has a well-structured pattern that should be exploitable. I can explain it in a few different ways.
consider the operation
J_i = sparse( m , n );
J_i( rowidxs_list{i} , colidxs_list{i} ) = ones( length(rowidxs_list{i}) , length(colidxs_list{i}) );
This will make a rank-1 logical matrix: a permuted "block" of 1s within a larger matrix of 0s, with the row/column indices of this block corresponding to some ith set of row and column indices stored in the cell arrays. For example, at a very small scale (just for visualization purposes), an example L could contain a total of 9 such submatrices.
Each submatrix forms a rank-1 matrix in L, which can be written as the outer product of two logical vectors:
j_row = sparse(m,1);
j_row(rowidxs_list{i}) = 1;
j_col = sparse(n,1);
j_col(colidxs_list{i}) = 1;
J_i = j_row * j_col.' ;
The entire matrix L is then a sum of several (say "p") of these rank-1 matrices J_i, where the blocks of the different rank-1 terms do not overlap (so that values are either 0 or 1, never 2 or more). It can be written using rank-1 logical matrices as:
L = sum_(i=1)^p J_i
or in code as (ignoring efficiency):
L = sparse(m,n);
for i = 1:p
    j_row = sparse(m,1);
    j_row(rowidxs_list{i}) = 1;
    j_col = sparse(n,1);
    j_col(colidxs_list{i}) = 1;
    J_i = j_row * j_col.' ;
    L = L + J_i;
end
"p" can also vary considerably, e.g. between 9 to 40,000 rank-1 matrices, and here p = length(rowidxs_list) = length(colidxs_list) for the cell arrays rowidxs_list and colidxs_list.
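For completeness, L itself can be materialized from the same lists with a single sparse() call instead of accumulating p rank-1 terms. A sketch using the rowidxs_list/colidxs_list cell arrays above:

```matlab
I = cell(1, p); J = cell(1, p);
for i = 1:p
    % Every (row, col) pair covered by the ith block
    [rr, cc] = ndgrid(rowidxs_list{i}, colidxs_list{i});
    I{i} = rr(:); J{i} = cc(:);
end
% Single assembly into a sparse logical; the loop above only gathers
% indices, so it stays cheap even for large p
L = sparse(vertcat(I{:}), vertcat(J{:}), true, m, n);
```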
responding to Catalytic:
"Thus, it makes no sense for me to directly perform ( A * B ) .* L , it would actually be faster to loop over each element of A * B". Have you actually tested to see if it is faster? Matlab extensively parallelizes operations like (A * B) .* L, so it is not always clear if reducing the number of operations will translate into improved speed.
I tested many different combinations of dimensions (m, n, and v), for various small-scale and large-scale matrices. In all cases I tried, I found that ( A * B ) .* L is unfortunately always very slightly slower than ( A * B ) alone. It appears that MATLAB doesn't exploit L before calculating ( A * B ); it seems to calculate all of ( A * B ) first and then apply L.
For all cases I tried, I always found that ( A * B ) .* L is unfortunately very slightly slower than ( A * B ).
That's not what I asked. I asked if you had compared (A*B).*L to the coded loop that you claimed would be faster. Below is an example where the loop does worse than the direct operation -
m = 5000; n = m; v = 3000;
A = rand(m,v); B = rand(v,m); L = sprand(m,n,0.1/100) > 0;
timeit(@()direct(A,B,L))
ans = 0.3599
timeit(@()looping(A,B,L))
ans = 1.0783

function direct(A,B,L)
    C = (A*B).*L;
end
function looping(A,B,L)
    [I,J] = find(L);
    K = numel(I);
    C = double(L);
    for k = 1:K
        C(I(k),J(k)) = A(I(k),:)*B(:,J(k));
    end
end
In both of your examples, L appears to have a 3x4 tiled pattern. In other words, it is a Kronecker product of the form L=kron(ones(3,4), Lsub) where Lsub is a smaller binary matrix. Will L always be a Kronecker product?
To Catalytic:
> I asked if you had compared (A*B).* L to the mcoded loop that you claimed would be faster. Below is an example where the loop does poorer than the direct operation.
I just tested both methods you mentioned in your comment here, in addition to the loop I describe in my original post under "Possible solution (need help vectorizing this code if possible)". The methods I tested are:
  • only performing A*B (I'll call this "AtimesB")
  • directly doing (A*B).*L (the function you call "direct")
  • doing the looping over each index (the function you call "looping")
  • doing the loop over each nonzero block in (A*B).*L, as I describe in my OP (I'll call this "blocklooping")
  • another solution recommended on reddit by the user "86BillionFireflies", who provides a pretty nice vectorized solution.
I tested with as large a matrix as I could on my current computer. E.g., I tested when 2e-2% of the entries were 1s in L, and A * B was (m=5625 by n=90000).
I observed these cpu times for the methods:
  • "AtimesB" = 18.8 sec
  • "direct" = 28.5 sec
  • "looping" = 23.1 sec
  • "blocklooping" = 7.0 sec
  • "86BillionFireflies" = 5.4 sec
The looping you suggested over individual indices is faster than "direct", but my original post's "blocklooping" exploits the fact that the L matrix is composed of blocks, using fewer loop iterations and larger matrix multiplications per iteration. However, 86BillionFireflies provided a solution that is faster than mine, and it appears to be the fastest of all the solutions recommended to me (aside from a solution that uses a "mex" file so that for loops can be done more efficiently in C; I am still trying to get that solution to work).
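For reference, a common fully vectorized pattern for this kind of masked product (a generic sketch, not necessarily the same as 86BillionFireflies' actual code) batches one dot product per nonzero of L; note that A(I,:) is nnz(L)-by-v, so this trades loops for memory:

```matlab
[I, J] = find(L);                      % row/col indices of the 1s in L
vals   = sum(A(I, :) .* B(:, J).', 2); % dot product A(i,:)*B(:,j) per (i,j) pair
C      = sparse(I, J, vals, size(L, 1), size(L, 2));
```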
To Matt J:
> In both of your examples, L appears to have a 3x4 tiled pattern. In other words, it is a Kronecker product of the form L=kron(ones(3,4), Lsub) where Lsub is a smaller binary matrix. Will L always be a Kronecker product?
Yes! L is always a Kronecker product of some matrix of 1s with some smaller binary matrix you can call "Lsub". The matrix of 1s can be any sized matrix of 1s (including ones(1)).
I didn't realize this was a Kronecker product originally, but I sort of exploited this Kronecker product structure to perform my original post's solution under "Possible solution (need help vectorizing this code if possible)".
Essentially for that small binary matrix "Lsub", each row of Lsub corresponded to some submatrix of all 1s in C = ( A * B ) .* L. In the visual example I provided, Lsub has 9 rows, thus there are 9 submatrices ("blocks") of 1s in L, each having their own row and column indices in L (and thus C), and I went over each of these submatrices (per each row in Lsub) to evaluate that submatrix's entries in C.
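A tiny sketch of that structure (the Lsub values here are made up purely for illustration):

```matlab
Lsub = logical([1 0 1; 0 1 0]);          % hypothetical small binary matrix
L    = logical(kron(ones(3, 4), Lsub));  % 6x12 mask: Lsub tiled in a 3x4 grid
```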


 Accepted Answer

You can try this naive mex routine. Not optimized at all for cache hits or patterns in L. Might be faster to transpose A first, but I haven't looked into that yet. Also doesn't contain any code to clean 0's from result.
/* File: ABL.c
*
* C = ABL(A,B,L)
*
* Performs the function
*
* C = (A*B).*L
*
* Where
*
* A = Full M x P real double matrix
* B = Full P x N real double matrix
* L = Sparse M x N logical matrix
* C = Sparse M x N real double matrix
*
* Brute force algorithm, no attempt at multi-threading, etc.
*
* Building: mex ABL.c -R2018a
*
* Programmer: James Tursa
*/
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mwSize Am, An, Bm, Bn, Lm, Ln, i, j, k;
    mwIndex nrow;
    mwIndex *Ir, *Jc, *Cir, *Cjc;
    double *A, *B, *C, *a, *b;
    /* Check inputs */
    if( nrhs != 3 ||
        !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) || mxIsSparse(prhs[0]) || mxGetNumberOfDimensions(prhs[0]) != 2 ||
        !mxIsDouble(prhs[1]) || mxIsComplex(prhs[1]) || mxIsSparse(prhs[1]) || mxGetNumberOfDimensions(prhs[1]) != 2 ||
        !mxIsLogical(prhs[2]) || !mxIsSparse(prhs[2]) ) {
        mexErrMsgTxt("Invalid Inputs");
    }
    Am = mxGetM(prhs[0]);
    An = mxGetN(prhs[0]);
    Bm = mxGetM(prhs[1]);
    Bn = mxGetN(prhs[1]);
    Lm = mxGetM(prhs[2]);
    Ln = mxGetN(prhs[2]);
    if( Am != Lm || Bn != Ln || An != Bm ) {
        mexErrMsgTxt("Matrix sizes incompatible");
    }
    /* Get sparse indexing pointers */
    Ir = mxGetIr(prhs[2]);
    Jc = mxGetJc(prhs[2]);
    /* Create output array same size and sparsity as L */
    // mexCallMATLAB(plhs,1,(mxArray **)(prhs+2),1,"double"); /* slower? will need to manually 0 the data spots */
    plhs[0] = mxCreateSparse(Lm, Ln, Jc[Ln], mxREAL); /* will need to manually fill in Ir and Jc */
    /* Get data pointers */
    A = (double *) mxGetData(prhs[0]);
    B = (double *) mxGetData(prhs[1]);
    C = (double *) mxGetData(plhs[0]);
    Cir = mxGetIr(plhs[0]);
    Cjc = mxGetJc(plhs[0]);
    /* Loop through the logical L matrix indexing, C has exact same structure */
    for( j=0; j<Ln; j++ ) {
        *Cjc++ = Jc[j];
        nrow = Jc[j+1] - Jc[j]; /* number of elements in this column */
        for( i=0; i<nrow; i++ ) {
            *Cir++ = *Ir;
            a = A + *Ir++;
            b = B + Bm * j;
            for( k=0; k<An; k++ ) {
                *C += *a * *b++;
                a += Am;
            }
            C++;
        }
    }
    *Cjc = Jc[Ln];
}

9 Comments

Thank you James for making this mex file for me.
This is my first experience working with a mex file or C, so sorry for my ignorance. I am getting an error running the code. I tried implementing your mex file solution by:
  • dropping the code you wrote here into a text editor and saving the file with extension ".c"
  • using this guide to build the C-MEX function from the ".c" code, pretty much exactly as they do there (but for your function "ABL"). The build appears to have completed (I now have a file in my directory called "ABL.mexw64").
  • then just running C = ABL(A,B,L) in MATLAB.
Unfortunately, when I run it in MATLAB, I get the error message "Error using ABL, Invalid Inputs". It appears MATLAB can access the function you wrote, but there may be a bug in the code. Could there be something wrong with its handling of A, B, and L? Here is the output of "whos A B L":
  Name        Size                   Bytes  Class     Attributes
  A       11250x256               23040000  double
  B         256x180000           368640000  double
  L       11250x180000        16200000000  double
@Cal the MEX-file expects
L = Sparse M x N logical matrix
You're passing in M x N real full double matrix. That's why the MEX-file is throwing the error.
Sorry Raymond I didn't catch that L was stored as a double originally.
I now specify it as a sparse array when I create it, but still get the same error "Error using ABL, invalid inputs".
This time when I do "whos A B L":
  Name        Size                   Bytes  Class     Attributes
  A       11250x256               23040000  double
  B         256x180000           368640000  double
  L       11250x180000        16200000000  double    sparse
Right, because L is a double, not logical (which you originally stated it was):
I have matrices A (m by v) and B (v by n). I also have a logical matrix L (m by n).
The MEX-file @James Tursa wrote requires
L = Sparse M x N logical matrix
Thank you Raymond for your patience and apologies for not being more careful,
Yes now that L is sparse and logical, the code works.
It is orders of magnitude faster than any other solution I have tested, which is great. And James even notes in his file "Brute force algorithm, no attempt at multi-threading, etc.", so I think I can work on making it even faster (e.g. by turning my original loop over "blocks" in L into a mex file), but regardless, this solution is great and sufficient for now!
Many, many thanks to both of you for your help :D
*** UPDATE ***
Here is a more CPU cache friendly version of the code. By transposing A first, the A elements used for the dot products are contiguous in memory (the B elements already are) and this makes much better use of the CPU cache for the dot product operations and gives better performance. This version would also be more friendly to multi-threading (although I don't think I will bother attempting that at the moment ... this can get tricky to figure out the best way to split things up depending on matrix sizes). Keep in mind that you have to pass in the transpose of A to this mex routine. I.e., the following are all mathematically equivalent (but not numerically because of different order of operations):
(A*B).*L
ABL(A,B,L)
ATBL(A.',B,L) % <-- Pass in the transpose of A!
That is, ABL(A,B,L) should give the exact same result as ATBL(A.',B,L) because the order of operations is exactly the same, but neither of these would be expected to give the exact same result as (A*B).*L because of the different order of operations involved, even though they are all mathematically equivalent.
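A quick sanity check of this equivalence (a sketch; it assumes both mex files have already been built with mex ABL.c -R2018a and mex ATBL.c -R2018a):

```matlab
C0 = (A*B).*L;          % reference, different order of operations
C1 = ABL(A, B, L);
C2 = ATBL(A.', B, L);   % note: pass in the transpose of A
isequal(C1, C2)         % expected true: identical order of operations
norm(C1 - C0, 'fro')    % expected tiny, but generally not exactly zero
```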
The mex code:
/* File: ATBL.c
*
* C = ATBL(A,B,L)
*
* Performs the function
*
* C = ((A.')*B).*L
*
* Where
*
* A = Full P x M real double matrix
* B = Full P x N real double matrix
* L = Sparse M x N logical matrix
* C = Sparse M x N real double matrix
*
* Brute force algorithm, no attempt at multi-threading, etc.
*
* Building: mex ATBL.c -R2018a
*
* Programmer: James Tursa
*/
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mwSize Am, An, Bm, Bn, Lm, Ln, i, j, k;
    mwIndex nrow;
    mwIndex *Lir, *Ljc, *Cir, *Cjc;
    double *A, *B, *C, *a, *b, d;
    /* Check inputs */
    if( nrhs != 3 ||
        !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) || mxIsSparse(prhs[0]) || mxGetNumberOfDimensions(prhs[0]) != 2 ||
        !mxIsDouble(prhs[1]) || mxIsComplex(prhs[1]) || mxIsSparse(prhs[1]) || mxGetNumberOfDimensions(prhs[1]) != 2 ||
        !mxIsLogical(prhs[2]) || !mxIsSparse(prhs[2]) ) {
        mexErrMsgTxt("Invalid Inputs");
    }
    Am = mxGetM(prhs[0]);
    An = mxGetN(prhs[0]);
    Bm = mxGetM(prhs[1]);
    Bn = mxGetN(prhs[1]);
    Lm = mxGetM(prhs[2]);
    Ln = mxGetN(prhs[2]);
    if( An != Lm || Bn != Ln || Am != Bm ) {
        mexErrMsgTxt("Matrix sizes incompatible");
    }
    /* Get sparse indexing pointers */
    Lir = mxGetIr(prhs[2]);
    Ljc = mxGetJc(prhs[2]);
    /* Create output array same size and sparsity as L */
    plhs[0] = mxCreateSparse(Lm, Ln, Ljc[Ln], mxREAL); /* will need to manually fill in Ir and Jc */
    /* Get data pointers */
    A = (double *) mxGetData(prhs[0]);
    B = (double *) mxGetData(prhs[1]);
    C = (double *) mxGetData(plhs[0]);
    Cir = mxGetIr(plhs[0]);
    Cjc = mxGetJc(plhs[0]) + 1; /* 2nd value is for 1st column */
    /* Loop through the logical L matrix indexing */
    /* C has same basic structure, but we need to clean the 0's as we go */
    for( j=0; j<Ln; j++ ) { /* for each column of L */
        *Cjc = *(Cjc-1); /* Copy running total from last column */
        nrow = Ljc[j+1] - Ljc[j]; /* number of elements in this column of L */
        for( i=0; i<nrow; i++ ) { /* for each index in this column of L */
            a = A + Am * *Lir; /* point to correct column of A */
            b = B; /* point to this column of B */
            d = 0.0; /* initialize dot product value */
            for( k=0; k<Am; k++ ) { /* dot product */
                d += *a++ * *b++;
            }
            if( d ) { /* only include in sparse result if non-zero */
                *Cir++ = *Lir; /* copy row index and increment pointer */
                *Cjc += 1; /* bump up running total */
                *C++ = d; /* store the value and increment pointer */
            }
            Lir++; /* point to next row value */
        }
        B += Bm; /* point to next column of B */
        Cjc++; /* point to next value of non-zero running total */
    }
}
Also note that this version of the code is production quality, meaning that it checks for 0's along the way and stores only genuine non-zeros in the sparse result (no explicit stored zeros). The ABL.c code posted above does not do that. E.g., here is a run with forced 0 results:
>> A = reshape(1:12,3,4); A = [A,fliplr(A)]
A =
1 4 7 10 10 7 4 1
2 5 8 11 11 8 5 2
3 6 9 12 12 9 6 3
>> B = [ones(4,4);-ones(4,4)]
B =
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
>> L = sparse(rand(3,4)<.5)
L =
3×4 sparse logical array
(1,1) 1
(1,2) 1
(2,2) 1
(3,2) 1
(1,3) 1
(2,3) 1
(3,3) 1
(2,4) 1
>> (A*B).*L
ans =
All zero sparse: 3×4
>> sarek(ans)
SAREK -- Sparse Analyzer Real Et Komplex , by James Tursa
Compiled in version R2020a (with -R2018a option)
Running in version R2020a
Matrix is double ...
Matrix is real ...
M = 3 >= 0 OK ...
N = 4 >= 0 OK ...
Nzmax = 1 >= 1 OK ...
Jc[0] == 0 OK ...
Jc[N] = 0 <= Nzmax = 1 OK ...
Jc array OK ...
Ir array OK ...
All stored elements nonzero OK ...
All sparse integrity checks OK
ans =
0
>>
>> ABL(A,B,L)
ans =
(1,1) 0
(1,2) 0
(2,2) 0
(3,2) 0
(1,3) 0
(2,3) 0
(3,3) 0
(2,4) 0
>> sarek(ans)
SAREK -- Sparse Analyzer Real Et Komplex , by James Tursa
Compiled in version R2020a (with -R2018a option)
Running in version R2020a
Matrix is double ...
Matrix is real ...
M = 3 >= 0 OK ...
N = 4 >= 0 OK ...
Nzmax = 8 >= 1 OK ...
Jc[0] == 0 OK ...
Jc[N] = 8 <= Nzmax = 8 OK ...
Jc array OK ...
Ir array OK ...
ERROR: There are 8 explicit 0's in matrix
TO FIX: B = 1*A;
There were ERRORS found in matrix!
ans =
1
>>
>> ATBL(A.',B,L)
ans =
All zero sparse: 3×4
>> sarek(ans)
SAREK -- Sparse Analyzer Real Et Komplex , by James Tursa
Compiled in version R2020a (with -R2018a option)
Running in version R2020a
Matrix is double ...
Matrix is real ...
M = 3 >= 0 OK ...
N = 4 >= 0 OK ...
Nzmax = 8 >= 1 OK ...
Jc[0] == 0 OK ...
Jc[N] = 0 <= Nzmax = 8 OK ...
Jc array OK ...
Ir array OK ...
All stored elements nonzero OK ...
All sparse integrity checks OK
ans =
0
You can see that the ABL mex routine stores explicit 0's in the sparse result, whereas the ATBL mex routine checks for this and does not store them. The sarek function is a mex routine that checks sparse matrices for integrity. It can be found here:
As a side issue, this 0 check makes it trickier to multi-thread the code. You can't have each thread work completely independently on part of the result since the resulting memory locations of one part depend on the number of actual non-zeros of the other part(s). What you could do is multi-thread where each thread assumes no 0's in previous threads (that way each thread knows where to put its results in memory), then in a second pass clean the 0's if there were any detected. That would not be a bad strategy, particularly if you don't expect to have 0's in the resulting sparse spots (e.g., if you were typically working with arbitrary non-integer real values). You would only have to pass through the memory twice to fix things up if you detected a 0 in the first pass.


More Answers (1)

Is v much smaller than m or n? If so, one approach might be with an outer product decomposition:
L = sparse(L); % if L was not previously in sparse form
C = 0;
for i = 1:v
    C = C + A(:,i).*L.*B(i,:);
end
Either way, please respond to @James Tursa's question about the typical matrix dimensions.

1 Comment

Either way, please respond to @James Tursa's question about the typical matrix dimensions.
copying my answer from responding to James Tursa:
> v can be very small to very big, e.g. anywhere from 20 to 20,000. m and n also have very wide ranges; either one can be between 400 and 200,000. Assuming max values for m and n of 200,000, L can be at most a 40-billion-entry sparse matrix. Additionally, the larger these L matrices are, the smaller the percentage of nonzero entries (e.g. in the max case of 40 billion entries, I'd assume that fewer than 1e-5 % of the entries of L are nonzero, about 400,000). But I'd assume that in most of my cases, the vast, vast majority of entries are zero.
Is v much smaller than m or n? If so, one approach might be with an outer product decomposition:
Thank you for the suggestion.
I tried the loop you gave for the "outer product decomposition" as written, and unfortunately found it to be slower than (A * B) in almost all cases: it was faster only for v = 1 (around 1.5 times as fast), was around 5 times slower for v = 2, and became progressively slower as v increased. Perhaps there is a way to vectorize this loop over v?




Release

R2024a

Asked:

Cal
on 17 Feb 2025

Edited:

on 22 Feb 2025
