Memory Bottleneck Analysis
MATLAB stores arrays in column-major order, but your algorithm may have been written and optimized for row-major data. In the generated code, if the fastest-changing dimension is not indexed by the innermost loop, memory accesses are not coalesced. Transposing the input matrices is often enough to fix this problem.
Try transposing the data.
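As a minimal sketch of the indexing pattern described above, the hypothetical function below (scaleTransposed, not part of GPU Coder) works on data that the caller has already transposed, so the innermost loop walks the first dimension, which is the one stored contiguously in MATLAB. The element-wise operation is an arbitrary placeholder, and how GPU Coder maps these loops to threads is not shown here.

```matlab
function Bt = scaleTransposed(At) %#codegen
% scaleTransposed  Hypothetical element-wise computation over transposed data.
% At is the transpose of the original m-by-n data, so it is n-by-m.
    coder.gpu.kernelfun();           % GPU Coder code generation pragma
    [n, m] = size(At);
    Bt = zeros(n, m, 'like', At);
    for r = 1:m                      % original row index: now the slow (second) dimension
        for c = 1:n                  % original column index: now the fast (first) dimension
            Bt(c, r) = 2 * At(c, r) + 1;
        end
    end
end
```

A caller would pass A.' and transpose the result back, for example Bt = scaleTransposed(A.'); B = Bt.'; whether the extra transposes pay off depends on the algorithm and the data size.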
Small Data Sizes
If your problem or data size is too small, the overhead of moving data to the GPU (even if it happens only at the I/O boundary) can offset the performance gains of running on the GPU.
Try the algorithm with larger data sizes.
Too Many cudaMemcpys
If you use only coder.gpu.kernel, then everything outside the loop goes to the CPU. To keep most of the code on the GPU, using both pragmas (coder.gpu.kernelfun and coder.gpu.kernel) is recommended. Unsupported functions, or any function or statement that cannot run on the GPU, also cause more cudaMemcpys to be generated.
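The following is a minimal sketch of an entry-point function that combines the two pragmas. The function name scaleAndSquare and its arithmetic are hypothetical; only coder.gpu.kernelfun, coder.gpu.kernel, coder.gpuConfig, and codegen are actual GPU Coder or MATLAB Coder names.

```matlab
function out = scaleAndSquare(in) %#codegen
% scaleAndSquare  Hypothetical entry point that uses both kernel pragmas.
%
% Hypothetical build commands (run at the MATLAB prompt, requires GPU Coder):
%   cfg = coder.gpuConfig('mex');
%   codegen -config cfg -args {ones(1,1024,'single')} scaleAndSquare
    coder.gpu.kernelfun();           % ask GPU Coder to map the whole function body
    scaled = 0.5 .* in;              % code outside the explicit loop
    out = zeros(size(in), 'like', in);
    coder.gpu.kernel();              % explicitly request a kernel for the next loop
    for k = 1:numel(in)
        out(k) = scaled(k) * scaled(k) + 1;
    end
end
```

With coder.gpu.kernelfun at the top, the scaling step before the loop is also a candidate for the GPU, rather than being left on the CPU with a copy on either side of the loop.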
If certain inputs of your entry-point function are constant, wrap them using coder.const. Doing so indicates that these variables are constant during code generation. Without it, GPU Coder™ considers these inputs to be variables and therefore treats matrices sized by them as variable-dimension matrices. GPU Coder does not create good kernels out of variable-dimension matrices, because dynamic sizing of kernels and dynamic calls are not currently supported.
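As a hedged illustration, the hypothetical function below sizes its output from an input n. The text above names coder.const; the build command in the comments assumes the related coder.Constant class, which is what the codegen -args option accepts for declaring an input as a compile-time constant.

```matlab
function out = rampTimesTwo(n) %#codegen
% rampTimesTwo  Hypothetical entry point whose output size depends on n.
%
% Hypothetical build commands (run at the MATLAB prompt, requires GPU Coder):
%   cfg = coder.gpuConfig('mex');
%   codegen -config cfg -args {coder.Constant(1024)} rampTimesTwo
    coder.gpu.kernelfun();
    out = zeros(1, n, 'single');     % fixed size only if n is known at code generation
    for k = 1:n
        out(k) = 2 * single(k);
    end
end
```

If n were instead passed as an ordinary scalar type in -args, the output would be sized by a run-time value, which is the variable-dimension case described above.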
Stack Memory Usage
Using a large amount of stack memory inside kernels can reduce the performance of the generated code. In that case, consider rewriting the algorithm or breaking it into smaller computations to reduce stack memory usage and improve performance.
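One possible rewrite, sketched under assumptions: the hypothetical function colEnergy below computes a per-column sum of squares with a running scalar instead of building a per-iteration temporary array. Whether a given temporary actually lands on the kernel stack depends on the generated code, so treat this as an illustration of the idea, not a guaranteed fix.

```matlab
function out = colEnergy(A) %#codegen
% colEnergy  Hypothetical example: sum of squares of each column of A.
    coder.gpu.kernelfun();
    [m, n] = size(A);
    out = zeros(1, n, 'like', A);
    for j = 1:n
        % A variant that may need more per-iteration storage would first
        % build a temporary array:
        %   tmp = A(:, j) .^ 2;      % m-element local array
        %   out(j) = sum(tmp);
        % The running scalar below avoids that temporary.
        acc = zeros('like', A);
        for i = 1:m
            acc = acc + A(i, j) * A(i, j);
        end
        out(j) = acc;
    end
end
```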