Batched matrix multiplication with CUDA

Peter Egli on 28 Apr 2020
Edited: Erik Meade on 5 May 2020
Hi,
I saw that MATLAB R2020a adds new GPU Coder features, in particular gpucoder.stridedMatrixMultiply. However, I don't understand how the batch is defined there. The generated CUDA code shown in the example passes 1 as the batch size (cf. NVIDIA documentation). Also, the variables A, B and C are expected to be 2D, with the dimensions of the matrices to be processed.
How do I use the function correctly? I have a 3D array in MATLAB that holds many small matrices: A(:,:,1), A(:,:,2) and so on. The same applies to B. I would like to process them all at once with a CUDA function, i.e. calculate A(:,:,1)*B(:,:,1) etc. How can I achieve that with the new GPU Coder functionality, and how do I call it from MATLAB?
Peter

Answers (1)

Erik Meade on 5 May 2020
Edited: Erik Meade on 5 May 2020
Hi Peter,
gpucoder.stridedMatrixMultiply works exactly as you want. You can pass the 3D arrays A and B directly to gpucoder.stridedMatrixMultiply, and it will compute A(:,:,i)*B(:,:,i) for each i.
As a small example, say you have a function called stridedMultiply:
function c = stridedMultiply(a, b)
c = gpucoder.stridedMatrixMultiply(a, b);
end
Then we can generate code for it and verify that the answer is correct with the following code:
% 3D array inputs
a = rand(5,4,100);
b = rand(4,5,100);
% Generate Code
codegen -config coder.gpuConfig('mex') -args {a, b} stridedMultiply
% Verify correctness
c_mex = stridedMultiply_mex(a, b);
c = zeros(size(c_mex));
for i = 1:100
c(:,:,i) = a(:,:,i) * b(:,:,i);
end
% Check MATLAB answer vs. stridedMatrixMultiply generated code
tolerance = 1e-8;
assert(all(abs(c(:) - c_mex(:)) < tolerance));
If we look at the generated code, we will see that the batch size has been properly set to 100:
cublasDgemmStridedBatched(getCublasGlobalHandle(), CUBLAS_OP_N, CUBLAS_OP_N, 5,
5, 4, (double *)gpu_alpha1, (double *)&(*gpu_a)[0], 5, 20, (double *)
&(*gpu_b)[0], 4, 20, (double *)gpu_beta1, (double *)&(*gpu_c)[0], 5, 25, 100);
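The numeric arguments in that call follow from the input sizes. As a rough sketch, assuming column-major storage (which MATLAB uses) and the argument order given in the cuBLAS documentation for cublasDgemmStridedBatched (m, n, k, then a leading dimension and stride for each of A, B and C, then the batch count), the same values can be reproduced in plain MATLAB. The variable names below are illustrative only and do not appear in the generated code:

```matlab
% Sizes taken from the example inputs
a = rand(5,4,100);
b = rand(4,5,100);

m = size(a,1);            % rows of each page of A and C      -> 5
n = size(b,2);            % columns of each page of B and C   -> 5
k = size(a,2);            % shared inner dimension            -> 4

lda = m;  strideA = m*k;  % elements between consecutive A pages -> 5, 20
ldb = k;  strideB = k*n;  % elements between consecutive B pages -> 4, 20
ldc = m;  strideC = m*n;  % elements between consecutive C pages -> 5, 25
batchCount = size(a,3);   % number of pages                      -> 100

fprintf('%d %d %d | %d %d | %d %d | %d %d | %d\n', ...
    m, n, k, lda, strideA, ldb, strideB, ldc, strideC, batchCount);
```

Each stride is simply the number of elements in one page, so the pages of a 3D array are already laid out the way a strided batched call expects.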
Regarding the example on the doc page you cited: both input matrices in that example are 2D, so there is only 1 batch to compute and the parameter is set to 1. To clarify, gpucoder.stridedMatrixMultiply multiplies along the first two dimensions only. I understand how that example can be confusing, since gpucoder.stridedMatrixMultiply is mostly intended to be used with 3D inputs, and we will look into updating it.
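To make the "first two dimensions only" behaviour concrete, here is a minimal CPU reference in plain MATLAB. The function pagewiseMultiplyReference is a hypothetical helper written for illustration, not part of GPU Coder, and it assumes both inputs have the same number of pages:

```matlab
function c = pagewiseMultiplyReference(a, b)
% Hypothetical helper (not a GPU Coder function): multiplies matching
% pages a(:,:,i) * b(:,:,i) on the CPU, for comparison with generated code.
assert(size(a,2) == size(b,1), 'Inner dimensions must agree.');
assert(size(a,3) == size(b,3), 'Page counts must agree.');
nPages = size(a,3);                          % equals 1 for 2D inputs
c = zeros(size(a,1), size(b,2), nPages, 'like', a);
for i = 1:nPages
    c(:,:,i) = a(:,:,i) * b(:,:,i);          % ordinary 2D matrix product
end
end
```

For 2D inputs, size(a,3) returns 1, so the loop runs once and the result is an ordinary matrix product. That is consistent with the batch count of 1 in the documentation example.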
I hope that answers your question!
