Need to vectorize efficient calculation of only certain values in the matrix multiplication A * B, using a logical array L the size of A * B

I have matrices A (m by v) and B (v by n). I also have a logical matrix L (m by n).
I am interested in calculating, as efficiently as possible, only the values of A * B that correspond to 1s in the logical matrix L. Essentially, I am interested in the quantity ( A * B ) .* L .
For my problem, a typical L matrix has fewer than 0.1% of its values equal to 1; the vast majority are 0s. Thus it makes no sense for me to directly perform ( A * B ) .* L ; it would actually be faster to loop over each element of A * B that I want to compute, but even that is inefficient.
-------------------------------------------------------------------------------------------------------------------------------------------------------
Possible solution (need help vectorizing this code if possible)
My particular problem may have a nice solution given that the logical matrix L has a nice structure.
This L matrix is nice in that it can be represented as something like a permuted block matrix. The example L is composed of 9 "blocks" of 1s, where each block of 1s has its own set of row and column indices; each highlighted block of 1s can be seen as a particular submatrix of L.
My solution was as follows. I can get the row and column indices of each block's submatrix in L, organized in two cell arrays, "rowidxs_list" and "colidxs_list", each with one cell per block. For instance, for subblock 1 in the block example I gave, I could calculate those particular values in A * B by simply doing A( rowidxs_list{1} , : ) * B( : , colidxs_list{1} ) .
That means that if I precompute rowidxs_list and colidxs_list (ignore the cost of calculating these lists; it is negligible for my application), then my problem of calculating C = ( A * B ) .* L could effectively be done by:
C = sparse( m , n );
for i = 1:length( rowidxs_list )
    C( rowidxs_list{i} , colidxs_list{i} ) = A( rowidxs_list{i} , : ) * B( : , colidxs_list{i} );
end
This seems like it would be the most efficient way to solve this problem if I knew how to vectorize this for loop. Does anyone see a way to vectorize this?
There may be ways to vectorize if certain things hold, e.g. if rowidxs_list and colidxs_list were numeric matrices instead of cell arrays (with each row of a matrix being an index list, replacing rowidxs_list{i} with rowidxs_list(i,:) ). I'd prefer to keep cell arrays here if possible, since different lists can have different numbers of elements.
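One change that usually helps regardless of vectorization: repeated indexed assignment into a sparse matrix (the C( rowidxs_list{i} , colidxs_list{i} ) = ... line) reallocates the sparse structure on every iteration. Here is a sketch of the same block loop that instead gathers (row, col, value) triplets and calls sparse once at the end, assuming the same A, B, m, n, rowidxs_list, and colidxs_list as above:

```matlab
nblocks = numel(rowidxs_list);
Ic = cell(1, nblocks); Jc = cell(1, nblocks); Vc = cell(1, nblocks);
for i = 1:nblocks
    % All (row, col) pairs of this block, in column-major order
    [rr, cc] = ndgrid(rowidxs_list{i}, colidxs_list{i});
    % The block of A*B we actually need (one small dense multiply)
    Vi = A(rowidxs_list{i}, :) * B(:, colidxs_list{i});
    Ic{i} = rr(:); Jc{i} = cc(:); Vc{i} = Vi(:);
end
% One sparse() call builds C; since blocks do not overlap, no values collide
C = sparse(vertcat(Ic{:}), vertcat(Jc{:}), vertcat(Vc{:}), m, n);
```

The loop still exists, but each iteration is one dense BLAS multiply plus cheap index bookkeeping, and the expensive sparse assembly happens exactly once.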
-------------------------------------------------------------------------------------------------------------------------------------------------------
other suggested solution (creating a mex file?)
I first posted this question on the /r/matlab subreddit; see the reddit thread. The user "qtac" recommended a C-MEX function, i.e. a MATLAB-callable function written in the C programming language:
My gut feeling is the only way to really optimize this is with a C-MEX solution; otherwise, you are going to get obliterated by overhead from subsref in these loops. With C you could loop over L until you find a nonzero element, and then do only the row-column dot product needed to populate that specific element. You will miss out on a lot of the BLAS optimizations but the computational savings may make up for it.
Honestly I bet an LLM could write 90%+ of that MEX function for you; it's a well-formulated problem.
I think this could be a good solution to pursue, but I'd like other opinions as well.

8 Comments

We would need to know typical sizes involved to offer any meaningful advice. Is there any pattern to the 1's in the L matrix?
"Thus, it makes no sense for me to directly perform ( A * B ) .* L , it would actually be faster to loop over each element of A * B"
Have you actually tested to see if it is faster? Matlab extensively parallelizes operations like (A*B).*L, so it is not always clear if reducing the number of operations will translate into improved speed.
responding to James:
"We would need to know typical sizes involved to offer any meaningful advice. Is there any pattern to the 1's in the L matrix?"
It's hard for me to say what the "typical" sizes of m, n, and v are for my application; they vary over huge ranges.
Dimension "v" can be anywhere from very small to very big, e.g. 20 to 20,000. Dimensions "m" and "n" also have very wide ranges; either one can be between 400 and 200,000. Assuming max values for m and n of 200,000, L can be at most a 40-billion-entry sparse matrix. Additionally, the larger these L matrices are, the smaller the percentage of nonzero entries (e.g. in the max case of 40 billion entries, I'd assume that fewer than 1e-5 % of the entries of L are nonzero, about 400,000). But I'd assume that in most of my cases, the vast majority of entries are zero.
--------------------------------------------------------------------------------------------------
"Is there any pattern to the 1's in the L matrix?"
Yes, the L matrix has a well-structured pattern that should be exploitable. I can explain it in a few different ways.
consider the operation
J_i = sparse( m , n );
J_i( rowidxs_list{i} , colidxs_list{i} ) = ones( length(rowidxs_list{i}) , length(colidxs_list{i}) );
This will make a rank-1 logical matrix: a permuted "block" of 1s within a larger matrix of 0s, with the row/column indices of this block corresponding to some ith set of row and column indices stored in the cell arrays. For example, at a very small scale (just for visualization purposes), an example L could contain a total of 9 such submatrices.
Each submatrix forms a rank-1 matrix in L, which can be written as the outer product of two logical vectors:
j_row = sparse(m,1);
j_row(rowidxs_list{i}) = 1;
j_col = sparse(n,1);
j_col(colidxs_list{i}) = 1;
J_i = j_row * j_col.' ;
The entire matrix L is then a sum of several (say "p") of these rank-1 matrices J_i, where the blocks of the different rank-1 terms do not overlap (so that values are either 0 or 1, never 2 or more). It can be written using rank-1 logical matrices as:
L = sum_(i=1)^p J_i
or in code as (ignoring efficiency):
L = sparse(m,n);
for i = 1:p
    j_row = sparse(m,1);
    j_row(rowidxs_list{i}) = 1;
    j_col = sparse(n,1);
    j_col(colidxs_list{i}) = 1;
    J_i = j_row * j_col.' ;
    L = L + J_i;
end
"p" can also vary considerably, e.g. between 9 to 40,000 rank-1 matrices, and here p = length(rowidxs_list) = length(colidxs_list) for the cell arrays rowidxs_list and colidxs_list.
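For completeness, L itself can be materialized from the same lists with a single sparse() call instead of accumulating p rank-1 terms. A sketch using the rowidxs_list/colidxs_list cell arrays above:

```matlab
I = cell(1, p); J = cell(1, p);
for i = 1:p
    % Every (row, col) pair covered by the ith block
    [rr, cc] = ndgrid(rowidxs_list{i}, colidxs_list{i});
    I{i} = rr(:); J{i} = cc(:);
end
% Single assembly into a sparse logical; the loop above only gathers
% indices, so it stays cheap even for large p
L = sparse(vertcat(I{:}), vertcat(J{:}), true, m, n);
```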
responding to Catalytic:
"Thus, it makes no sense for me to directly perform ( A * B ) .* L , it would actually be faster to loop over each element of A * B". Have you actually tested to see if it is faster? Matlab extensively parallelizes operations like (A * B) .* L, so it is not always clear if reducing the number of operations will translate into improved speed.
I tested many different combinations of dimensions (m, n, and v), for various small-scale and large-scale matrices. In all cases I tried, I found that ( A * B ) .* L is unfortunately always very slightly slower than ( A * B ) alone. It appears that MATLAB doesn't exploit L before calculating ( A * B ); it seems to calculate all of ( A * B ) first and then apply L.
For all cases I tried, I always found that ( A * B ) .* L is unfortunately very slightly slower than ( A * B ).
That's not what I asked. I asked if you had compared (A*B).*L to the coded loop that you claimed would be faster. Below is an example where the loop does worse than the direct operation -
m = 5000; n = m; v = 3000;
A = rand(m,v); B = rand(v,m); L = sprand(m,n,0.1/100) > 0;
timeit(@()direct(A,B,L))
ans = 0.3599
timeit(@()looping(A,B,L))
ans = 1.0783

function direct(A,B,L)
    C = (A*B).*L;
end
function looping(A,B,L)
    [I,J] = find(L);
    K = numel(I);
    C = double(L);
    for k = 1:K
        C(I(k),J(k)) = A(I(k),:)*B(:,J(k));
    end
end
In both of your examples, L appears to have a 3x4 tiled pattern. In other words, it is a Kronecker product of the form L=kron(ones(3,4), Lsub) where Lsub is a smaller binary matrix. Will L always be a Kronecker product?
To Catalytic:
> I asked if you had compared (A*B).* L to the mcoded loop that you claimed would be faster. Below is an example where the loop does poorer than the direct operation.
I just tested both methods you mentioned in your comment here, in addition to the loop I describe in my original post under "Possible solution (need help vectorizing this code if possible)". The methods I tested are:
  • only performing A*B (I'll call this "AtimesB")
  • directly doing (A*B).*L (the function you call "direct")
  • doing the looping over each index (the function you call "looping")
  • doing the loop over each nonzero block in (A*B).*L, as I describe in my OP (I'll call this "blocklooping")
  • another solution recommended on reddit by the user "86BillionFireflies", who provides a pretty nice vectorized solution.
I tested with as large a matrix as I could on my current computer. E.g., I tested when 2e-2% of the entries were 1s in L, and A * B was (m=5625 by n=90000).
I observed these cpu times for the methods:
  • "AtimesB" = 18.8 sec
  • "direct" = 28.5 sec
  • "looping" = 23.1 sec
  • "blocklooping" = 7.0 sec
  • "86BillionFireflies" = 5.4 sec
The looping you suggested over individual indices is faster than "direct", but my original post's "blocklooping" exploits the fact that the L matrix is composed of blocks, using fewer loop iterations and larger matrix multiplications per iteration. However, 86BillionFireflies provided a solution that is faster than mine, and it appears to be the fastest of all the solutions recommended to me (aside from a solution that uses a "mex" file so that for loops can be done more efficiently in C; I am still trying to get that solution to work).
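For reference, a common fully vectorized pattern for this kind of masked product (a generic sketch, not necessarily the same as 86BillionFireflies' actual code) batches one dot product per nonzero of L; note that A(I,:) is nnz(L)-by-v, so this trades loops for memory:

```matlab
[I, J] = find(L);                      % row/col indices of the 1s in L
vals   = sum(A(I, :) .* B(:, J).', 2); % dot product A(i,:)*B(:,j) per (i,j) pair
C      = sparse(I, J, vals, size(L, 1), size(L, 2));
```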
To Matt J:
> In both of your examples, L appears to have a 3x4 tiled pattern. In other words, it is a Kronecker product of the form L=kron(ones(3,4), Lsub) where Lsub is a smaller binary matrix. Will L always be a Kronecker product?
Yes! L is always a Kronecker product of some matrix of 1s with some smaller binary matrix you can call "Lsub". The matrix of 1s can be any sized matrix of 1s (including ones(1)).
I didn't realize this was a Kronecker product originally, but I sort of exploited this Kronecker product structure to perform my original post's solution under "Possible solution (need help vectorizing this code if possible)".
Essentially for that small binary matrix "Lsub", each row of Lsub corresponded to some submatrix of all 1s in C = ( A * B ) .* L. In the visual example I provided, Lsub has 9 rows, thus there are 9 submatrices ("blocks") of 1s in L, each having their own row and column indices in L (and thus C), and I went over each of these submatrices (per each row in Lsub) to evaluate that submatrix's entries in C.
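A tiny sketch of that structure (the Lsub values here are made up purely for illustration):

```matlab
Lsub = logical([1 0 1; 0 1 0]);          % hypothetical small binary matrix
L    = logical(kron(ones(3, 4), Lsub));  % 6x12 mask: Lsub tiled in a 3x4 grid
```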


 Accepted Answer

You can try this naive mex routine. Not optimized at all for cache hits or patterns in L. Might be faster to transpose A first, but I haven't looked into that yet. Also doesn't contain any code to clean 0's from result.
/* File: ABL.c
*
* C = ABL(A,B,L)
*
* Performs the function
*
* C = (A*B).*L
*
* Where
*
* A = Full M x P real double matrix
* B = Full P x N real double matrix
* L = Sparse M x N logical matrix
* C = Sparse M x N real double matrix
*
* Brute force algorithm, no attempt at multi-threading, etc.
*
* Building: mex ABL.c -R2018a
*
* Programmer: James Tursa
*/
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mwSize Am, An, Bm, Bn, Lm, Ln, i, j, k;
    mwIndex nrow;
    mwIndex *Ir, *Jc, *Cir, *Cjc;
    double *A, *B, *C, *a, *b;
    /* Check inputs */
    if( nrhs != 3 ||
        !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) || mxIsSparse(prhs[0]) || mxGetNumberOfDimensions(prhs[0]) != 2 ||
        !mxIsDouble(prhs[1]) || mxIsComplex(prhs[1]) || mxIsSparse(prhs[1]) || mxGetNumberOfDimensions(prhs[1]) != 2 ||
        !mxIsLogical(prhs[2]) || !mxIsSparse(prhs[2]) ) {
        mexErrMsgTxt("Invalid Inputs");
    }
    Am = mxGetM(prhs[0]);
    An = mxGetN(prhs[0]);
    Bm = mxGetM(prhs[1]);
    Bn = mxGetN(prhs[1]);
    Lm = mxGetM(prhs[2]);
    Ln = mxGetN(prhs[2]);
    if( Am != Lm || Bn != Ln || An != Bm ) {
        mexErrMsgTxt("Matrix sizes incompatible");
    }
    /* Get sparse indexing pointers */
    Ir = mxGetIr(prhs[2]);
    Jc = mxGetJc(prhs[2]);
    /* Create output array same size and sparsity as L */
    // mexCallMATLAB(plhs,1,(mxArray **)(prhs+2),1,"double"); /* slower? will need to manually 0 the data spots */
    plhs[0] = mxCreateSparse(Lm, Ln, Jc[Ln], mxREAL); /* will need to manually fill in Ir and Jc */
    /* Get data pointers */
    A = (double *) mxGetData(prhs[0]);
    B = (double *) mxGetData(prhs[1]);
    C = (double *) mxGetData(plhs[0]);
    Cir = mxGetIr(plhs[0]);
    Cjc = mxGetJc(plhs[0]);
    /* Loop through the logical L matrix indexing, C has exact same structure */
    for( j=0; j<Ln; j++ ) {
        *Cjc++ = Jc[j];
        nrow = Jc[j+1] - Jc[j]; /* number of elements in this column */
        for( i=0; i<nrow; i++ ) {
            *Cir++ = *Ir;
            a = A + *Ir++;
            b = B + Bm * j;
            for( k=0; k<An; k++ ) {
                *C += *a * *b++;
                a += Am;
            }
            C++;
        }
    }
    *Cjc = Jc[Ln];
}

9 Comments

Thank you James for making this mex file for me.
This is my first experience working with a mex file or C, so sorry for my ignorance. I am getting an error running the code. I tried implementing your mex file solution by:
  • dropping the code you wrote here into a text editor and saving the file with extension ".c"
  • using this guide to build the C-MEX function from the ".c" code, pretty much exactly as they do there (but for your function "ABL"). The build appears to have completed (I now have a file in my directory called "ABL.mexw64").
  • then just running C = ABL(A,B,L) in MATLAB.
Unfortunately, when I run it in MATLAB, I get the error message "Error using ABL, Invalid Inputs". It appears MATLAB can access the function you wrote, but there may be a bug in the code. Could there be something wrong with its handling of A, B, and L? Here is the output of "whos A B L":
  Name        Size                   Bytes  Class     Attributes
  A       11250x256               23040000  double
  B         256x180000           368640000  double
  L       11250x180000        16200000000  double
@Cal the MEX-file expects
L = Sparse M x N logical matrix
You're passing in M x N real full double matrix. That's why the MEX-file is throwing the error.
Sorry Raymond I didn't catch that L was stored as a double originally.
I now specify it as a sparse array when I create it, but still get the same error "Error using ABL, invalid inputs".
This time when I do "whos A B L":
  Name        Size                   Bytes  Class     Attributes
  A       11250x256               23040000  double
  B         256x180000           368640000  double
  L       11250x180000        16200000000  double    sparse
Right, because L is a double, not logical (which you originally stated it was):
I have matrices A (m by v) and B (v by n). I also have a logical matrix L (m by n).
The MEX-file @James Tursa wrote requires
L = Sparse M x N logical matrix
Thank you Raymond for your patience and apologies for not being more careful,
Yes now that L is sparse and logical, the code works.
It is orders of magnitude faster than any other solution I have tested, which is great. And James even notes in his file "Brute force algorithm, no attempt at multi-threading, etc.", so I think I can work on making it even faster (e.g. by turning my original loop over "blocks" in L into a mex file), but regardless, this solution is great and sufficient for now!
Many, many thanks to both of you for your help :D
*** UPDATE ***
Here is a more CPU cache friendly version of the code. By transposing A first, the A elements used for the dot products are contiguous in memory (the B elements already are) and this makes much better use of the CPU cache for the dot product operations and gives better performance. This version would also be more friendly to multi-threading (although I don't think I will bother attempting that at the moment ... this can get tricky to figure out the best way to split things up depending on matrix sizes). Keep in mind that you have to pass in the transpose of A to this mex routine. I.e., the following are all mathematically equivalent (but not numerically because of different order of operations):
(A*B).*L
ABL(A,B,L)
ATBL(A.',B,L) % <-- Pass in the transpose of A!
That is, ABL(A,B,L) should give the exact same result as ATBL(A.',B,L) because the order of operations is exactly the same, but neither of these would be expected to give the exact same result as (A*B).*L because of the different order of operations involved, even though they are all mathematically equivalent.
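A quick sanity check of this equivalence (a sketch; it assumes both mex files have already been built with mex ABL.c -R2018a and mex ATBL.c -R2018a):

```matlab
C0 = (A*B).*L;          % reference, different order of operations
C1 = ABL(A, B, L);
C2 = ATBL(A.', B, L);   % note: pass in the transpose of A
isequal(C1, C2)         % expected true: identical order of operations
norm(C1 - C0, 'fro')    % expected tiny, but generally not exactly zero
```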
The mex code:
/* File: ATBL.c
*
* C = ATBL(A,B,L)
*
* Performs the function
*
* C = ((A.')*B).*L
*
* Where
*
* A = Full P x M real double matrix
* B = Full P x N real double matrix
* L = Sparse M x N logical matrix
* C = Sparse M x N real double matrix
*
* Brute force algorithm, no attempt at multi-threading, etc.
*
* Building: mex ATBL.c -R2018a
*
* Programmer: James Tursa
*/
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mwSize Am, An, Bm, Bn, Lm, Ln, i, j, k;
    mwIndex nrow;
    mwIndex *Lir, *Ljc, *Cir, *Cjc;
    double *A, *B, *C, *a, *b, d;
    /* Check inputs */
    if( nrhs != 3 ||
        !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) || mxIsSparse(prhs[0]) || mxGetNumberOfDimensions(prhs[0]) != 2 ||
        !mxIsDouble(prhs[1]) || mxIsComplex(prhs[1]) || mxIsSparse(prhs[1]) || mxGetNumberOfDimensions(prhs[1]) != 2 ||
        !mxIsLogical(prhs[2]) || !mxIsSparse(prhs[2]) ) {
        mexErrMsgTxt("Invalid Inputs");
    }
    Am = mxGetM(prhs[0]);
    An = mxGetN(prhs[0]);
    Bm = mxGetM(prhs[1]);
    Bn = mxGetN(prhs[1]);
    Lm = mxGetM(prhs[2]);
    Ln = mxGetN(prhs[2]);
    if( An != Lm || Bn != Ln || Am != Bm ) {
        mexErrMsgTxt("Matrix sizes incompatible");
    }
    /* Get sparse indexing pointers */
    Lir = mxGetIr(prhs[2]);
    Ljc = mxGetJc(prhs[2]);
    /* Create output array same size and sparsity as L */
    plhs[0] = mxCreateSparse(Lm, Ln, Ljc[Ln], mxREAL); /* will need to manually fill in Ir and Jc */
    /* Get data pointers */
    A = (double *) mxGetData(prhs[0]);
    B = (double *) mxGetData(prhs[1]);
    C = (double *) mxGetData(plhs[0]);
    Cir = mxGetIr(plhs[0]);
    Cjc = mxGetJc(plhs[0]) + 1; /* 2nd value is for 1st column */
    /* Loop through the logical L matrix indexing */
    /* C has same basic structure, but we need to clean the 0's as we go */
    for( j=0; j<Ln; j++ ) { /* for each column of L */
        *Cjc = *(Cjc-1); /* Copy running total from last column */
        nrow = Ljc[j+1] - Ljc[j]; /* number of elements in this column of L */
        for( i=0; i<nrow; i++ ) { /* for each index in this column of L */
            a = A + Am * *Lir; /* point to correct column of A */
            b = B; /* point to this column of B */
            d = 0.0; /* initialize dot product value */
            for( k=0; k<Am; k++ ) { /* dot product */
                d += *a++ * *b++;
            }
            if( d ) { /* only include in sparse result if non-zero */
                *Cir++ = *Lir; /* copy row index and increment pointer */
                *Cjc += 1; /* bump up running total */
                *C++ = d; /* store the value and increment pointer */
            }
            Lir++; /* point to next row value */
        }
        B += Bm; /* point to next column of B */
        Cjc++; /* point to next value of non-zero running total */
    }
}
Also note that this version of the code is production quality, meaning that it checks for 0's along the way and stores only genuine non-zeros in the sparse result (no explicit stored zeros). The ABL.c code posted above does not do that. E.g., here is a run with forced 0 results:
>> A = reshape(1:12,3,4); A = [A,fliplr(A)]
A =
1 4 7 10 10 7 4 1
2 5 8 11 11 8 5 2
3 6 9 12 12 9 6 3
>> B = [ones(4,4);-ones(4,4)]
B =
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
>> L = sparse(rand(3,4)<.5)
L =
3×4 sparse logical array
(1,1) 1
(1,2) 1
(2,2) 1
(3,2) 1
(1,3) 1
(2,3) 1
(3,3) 1
(2,4) 1
>> (A*B).*L
ans =
All zero sparse: 3×4
>> sarek(ans)
SAREK -- Sparse Analyzer Real Et Komplex , by James Tursa
Compiled in version R2020a (with -R2018a option)
Running in version R2020a
Matrix is double ...
Matrix is real ...
M = 3 >= 0 OK ...
N = 4 >= 0 OK ...
Nzmax = 1 >= 1 OK ...
Jc[0] == 0 OK ...
Jc[N] = 0 <= Nzmax = 1 OK ...
Jc array OK ...
Ir array OK ...
All stored elements nonzero OK ...
All sparse integrity checks OK
ans =
0
>>
>> ABL(A,B,L)
ans =
(1,1) 0
(1,2) 0
(2,2) 0
(3,2) 0
(1,3) 0
(2,3) 0
(3,3) 0
(2,4) 0
>> sarek(ans)
SAREK -- Sparse Analyzer Real Et Komplex , by James Tursa
Compiled in version R2020a (with -R2018a option)
Running in version R2020a
Matrix is double ...
Matrix is real ...
M = 3 >= 0 OK ...
N = 4 >= 0 OK ...
Nzmax = 8 >= 1 OK ...
Jc[0] == 0 OK ...
Jc[N] = 8 <= Nzmax = 8 OK ...
Jc array OK ...
Ir array OK ...
ERROR: There are 8 explicit 0's in matrix
TO FIX: B = 1*A;
There were ERRORS found in matrix!
ans =
1
>>
>> ATBL(A.',B,L)
ans =
All zero sparse: 3×4
>> sarek(ans)
SAREK -- Sparse Analyzer Real Et Komplex , by James Tursa
Compiled in version R2020a (with -R2018a option)
Running in version R2020a
Matrix is double ...
Matrix is real ...
M = 3 >= 0 OK ...
N = 4 >= 0 OK ...
Nzmax = 8 >= 1 OK ...
Jc[0] == 0 OK ...
Jc[N] = 0 <= Nzmax = 8 OK ...
Jc array OK ...
Ir array OK ...
All stored elements nonzero OK ...
All sparse integrity checks OK
ans =
0
You can see that the ABL mex routine stores explicit 0's in the sparse result, whereas the ATBL mex routine checks for this and does not store them. The sarek function is a mex routine that checks sparse matrices for integrity. It can be found here:
As a side issue, this 0 check makes it trickier to multi-thread the code. You can't have each thread work completely independently on part of the result since the resulting memory locations of one part depend on the number of actual non-zeros of the other part(s). What you could do is multi-thread where each thread assumes no 0's in previous threads (that way each thread knows where to put its results in memory), then in a second pass clean the 0's if there were any detected. That would not be a bad strategy, particularly if you don't expect to have 0's in the resulting sparse spots (e.g., if you were typically working with arbitrary non-integer real values). You would only have to pass through the memory twice to fix things up if you detected a 0 in the first pass.


More Answers (1)

Is v much smaller than m or n? If so, one approach might be with an outer product decomposition:
L = sparse(L); % if L was not previously in sparse form
C = 0;
for i = 1:v
    C = C + A(:,i).*L.*B(i,:);
end
Either way, please respond to @James Tursa's question about the typical matrix dimensions.

1 Comment

Either way, please respond to @James Tursa's question about the typical matrix dimensions.
copying my answer from responding to James Tursa:
> v can be very small to very big, e.g. anywhere from 20 to 20,000. m and n also have very wide ranges; either one can be between 400 and 200,000. Assuming max values for m and n of 200,000, L can be at most a 40-billion-entry sparse matrix. Additionally, the larger these L matrices are, the smaller the percentage of nonzero entries (e.g. in the max case of 40 billion entries, I'd assume that fewer than 1e-5 % of the entries of L are nonzero, about 400,000). But I'd assume that in most of my cases, the vast, vast majority of entries are zero.
Is v much smaller than m or n? If so, one approach might be with an outer product decomposition:
Thank you for the suggestion.
I tried the loop you gave for the "outer product decomposition" as written, and unfortunately found it to be slower than (A * B) in almost all cases: it was faster only for v = 1 (around 1.5 times as fast), was around 5 times slower for v = 2, and became progressively slower as v increased. Perhaps there is a way to vectorize this loop over v?




Release

R2024a

Asked:

Cal
on 17 Feb 2025

Edited:

on 22 Feb 2025
