cuda kernel does not work when a data is transfer to the GPU. any problem with my gateway MEX code?
Show older comments
Hi,
I'm trying to transfer a data to a KUDA kernel, do some processings and have the output back to Matlab. I have already evaluated the Kernel in visual studio and it works as it should. However, when I make a MEX file out of the code, It does not provide me the output i expect. here is my MEX gateway code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "cuda.h"
using namespace std;
#include <mex.h>
__global__ void kernel_Reconstruction2(int* Dev_RfData, int* ReconstructedImage_GPU, int transmit, int NStart_Transmit) {
int TID = threadIdx.y * blockDim.x + threadIdx.x;
int BlockOFFset = blockDim.x * blockDim.y * blockIdx.x;
int RowOFFset = blockDim.x * blockDim.y * gridDim.x * blockIdx.y;
int GID = RowOFFset + BlockOFFset + TID;
}
void mexFunction(int nlhs, mxArray* plhs[],
int nrhs, const mxArray* prhs[]) {
int* RfData; // RF data; a pinned memory was dedicated to this
int* ReconstructedImage_GPU;
RfData = (int*)mxGetPr(prhs[0]);
plhs[0] = mxCreateNumericMatrix(1, 64 * 64, mxINT32_CLASS, mxREAL);
ReconstructedImage_GPU = (int*)mxGetData(plhs[0]);
int ArrayByteSize_RfData = sizeof(int) * (96 * 96 * 4096);
int BYTES_PER_STREAM = ArrayByteSize_RfData / 96;
//Memory allocation: RfData ; we send the RF data to the device with streaming
int* Device_RfData; // device pointer to the RF data.
(cudaMalloc((int**)&Device_RfData, ArrayByteSize_RfData));
int* Device_ReconstructedImage_GPU; // device pointer to the reconstructed image
int ArrayByteSize_ReconstructedImage_GPU = sizeof(int) * (96*96);
(cudaMalloc((int**)&Device_ReconstructedImage_GPU, ArrayByteSize_ReconstructedImage_GPU));
printf("The CUDA reconstruction started... \n");
dim3 block(1024, 1);
dim3 grid(64 * 64, 96);//SystemSetup.NumberOfTransmitter
cudaStream_t* streams = new cudaStream_t[96]; //SystemSetup.NumberOfTransmitter
int NStart_Transmit{};
for (int transmit = 0; transmit < 96; transmit++) {
cudaStreamCreate(&streams[transmit]);
NStart_Transmit = transmit * (96 * 4096);
cudaMemcpyAsync(&Device_RfData[NStart_Transmit], &RfData[NStart_Transmit], BYTES_PER_STREAM, cudaMemcpyHostToDevice, streams[transmit]);
kernel_Reconstruction2 << <grid, block, 0, streams[transmit] >> > (&Device_RfData[NStart_Transmit], Device_ReconstructedImage_GPU, transmit, NStart_Transmit);
(cudaPeekAtLastError());
}
for (int transmit = 0; transmit < 96; transmit++) { cudaStreamDestroy(streams[transmit]); } // destroy the streams
delete[] streams;
cudaDeviceSynchronize();
(cudaMemcpy(ReconstructedImage_GPU, Device_ReconstructedImage_GPU, ArrayByteSize_ReconstructedImage_GPU, cudaMemcpyDeviceToHost));
(cudaFree(Device_RfData));
(cudaFree(Device_ReconstructedImage_GPU));
}
I have also checked if the "RfData" contains the actual values in the host or not; all is good with this as well. So, i think there is something wrong with the "Device_RfData" values. Is there anything that I'm missing ?
Regards,
Moein.
2 Comments
Richard
on 22 Mar 2021
Hi Moein, could you add a Matlab code example to show how you are calling this function, and what output you expect for that example? (I can't see where ReconstructedImage_GPU is ever altered on the device - are you just expecting a zeros matrix as output and not getting it?)
Moein Mozaffarzadeh
on 22 Mar 2021
Accepted Answer
More Answers (1)
Moein, there are a couple of issues that I can see in the code.
The first issue is that it is not initializing the GPU device in any way. The easiest way to do this is to add a call to mxInitGPU as documented in https://www.mathworks.com/help/parallel-computing/mxinitgpu.html . This will use MATLAB's normal rules to select the default GPU. I added a simple assignment line in the kernel to produce dummy output values which I also observed were not returned, and then confirmed that adding a call to mxInitGPU at the start of mexFunction() caused the expected values to be returned (after restarting MATLAB to fix the GPU state).
The second issue is that the code sample looks to be over-running array bounds when copying back from Device_ReconstructedImage_GPU to ReconstructedImage_GPU. The output mxArray is created as a 1-by-(64x64) matrix but the array size used on the device is 1-by-(96x96). This is likely to cause MATLAB to crash - the output mxArray needs to match the GPU device array size.
7 Comments
Moein Mozaffarzadeh
on 22 Mar 2021
Moein Mozaffarzadeh
on 22 Mar 2021
Richard
on 22 Mar 2021
Hi Moein,
Thanks for pointing out the broken link. I think it picked up a trailing dot and hopefully I have fixed it.
mxInitGPU is effectively the same as calling gpuDevice. The main advantage is that you don't have to remember to call gpuDevice, but it is also best practice to use it in case a future version of MATLAB needs to add extra initialization before mxGPU functions are used.
gpu/mxGPUarray.h is in <matlabroot>/toolbox/parallel/gpu/extern/include. The mexcuda command adds this to the include path for you but if you are using the VS build system then you will need to add that path.
Building with VS should not be an issue and since you are compiling successfully and not causing crashes it is unlikely that is an issue.
I did compile and run your original example with just a minor addition of assigning a fixed output value inside the kernel. This ran successfully for me and returned an array containing that fixed value.
I have managed to compile your full example but I don't know how to run it - like you say it is more complex. Is there an easy way for me to generate data, run this, and compare with expected outputs? Alternatively, is there a simple addition to the kernel of the original example that produces an output that you think is incorrect?
Moein Mozaffarzadeh
on 22 Mar 2021
Richard
on 23 Mar 2021
Hi Moein, thanks for the example data.
I have a few things that you might want check. Firstly, I did manage to run the code and I have some images. This is the image I get using excatly the code you posted:

Is this the same incorrect output that you are seeing?
I think that the reshaping code you are using to turn the output into a matrix is incorrect. Remember that arrays in MATLAB are column-major vs row-major in C. When I use this code:
DAS_Ccode=reshape(Image,[64, 256])';
abs1= abs(DAS_Ccode);
figure;imagesc(abs1);
I get this image instead:

which looks a lot more sensible to me, but still isn't the same as the image you posted. Is the above a correct image, or not?
There are a couple of possible code issues I did note:
(1) You have a mismatch in types for the input data. The MATLAB data is in uint32 arrays - unsigned. In the C code, you are reading it via a signed int pointer. Given the values in the example data you appear to be getting away with it, but probably should fix it.
(2) I experienced persistent crashes when running the code on certain versions of MATLAB, and eventually I tracked it down to a memory access error being returned from the cudaMemcpyAsync calls that transfer data to the device. The async API requires "pinned" memory, and the source in this case is mxArray data that is not allocated via cuda at all.
Richard
Moein Mozaffarzadeh
on 23 Mar 2021
Moein Mozaffarzadeh
on 23 Mar 2021
Categories
Find more on GPU CUDA and MEX Programming in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!