Analysis with NVIDIA Profiler
Not Enough Parallelism
Condition
If the kernel does little work, the overhead of memcpy operations and kernel launches can offset any performance gains. Consider working on a larger sample set (thus increasing the loop size). To detect this condition, look at the nvvp report.
Action
Do more work in the loop or increase the sample set size.
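The trade-off between fixed overhead and useful work can be illustrated with a rough back-of-the-envelope model. The sketch below is plain MATLAB arithmetic, not a measurement: every timing constant in it is a made-up assumption chosen only to show the shape of the curve, and the variable names are hypothetical.

```matlab
% Rough cost model: all constants are assumed values for illustration,
% not measurements from any real GPU or from the nvvp report.
launchOverhead = 10e-6;   % assumed fixed cost per kernel launch (s)
copyPerElem    = 2e-9;    % assumed memcpy cost per element (s)
gpuPerElem     = 0.1e-9;  % assumed GPU compute cost per element (s)
cpuPerElem     = 5e-9;    % assumed CPU compute cost per element (s)

n = [1e3 1e4 1e5 1e6 1e7];          % sample set sizes to compare

cpuTime = n .* cpuPerElem;                              % CPU-only time
gpuTime = launchOverhead + n .* (copyPerElem + gpuPerElem); % launch + copy + compute
speedup = cpuTime ./ gpuTime;       % values below 1 mean the GPU version is slower

disp([n.' speedup.']);
```

With these assumed constants, the ratio is below 1 for the smallest sample set, because the fixed launch cost dominates, and it rises as n grows. Real numbers should come from the profiler, not from this model.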
Too Many Local per-Thread Registers
Condition
If the loop body uses too many local or temporary variables, it causes high register pressure in the per-thread register file. You can detect this condition by running in GPU safe-build mode. The nvvp report also shows this condition.
Action
Consider using different block sizes in the coder.gpu.kernel pragma.
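As a minimal sketch of what changing the block size looks like, the following entry-point function passes explicit block and thread dimensions to coder.gpu.kernel. The function name saxpyLike, the fixed length of 4096, and the chosen dimensions are assumptions made for illustration; suitable values depend on the kernel and the target GPU, so compare a few settings in the profiler.

```matlab
function out = saxpyLike(x, y) %#codegen
    % Hypothetical example function; the name and sizes are illustrative only.
    n = 4096;                           % assumed fixed length of x and y
    out = coder.nullcopy(zeros(1, n));  % allocate the output without initializing it

    % Request 32 blocks of 128 threads each (32 * 128 = 4096 iterations).
    % A setting to compare against could be [16, 1, 1] and [256, 1, 1].
    coder.gpu.kernel([32, 1, 1], [128, 1, 1]);
    for i = 1:n
        out(i) = 2 * x(i) + y(i);
    end
end
```

The pragma applies to the for-loop that immediately follows it. Regenerating the code with a different thread count per block and profiling again shows whether the change helps for a given kernel.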
Related Topics
- Code Generation Using the Command Line Interface
- Code Generation by Using the GPU Coder App
- Code Generation Reports
- Trace Between Generated CUDA Code and MATLAB Source Code
- Generating a GPU Code Metrics Report for Code Generated from MATLAB Code
- Kernel Analysis
- Memory Bottleneck Analysis
- GPU Performance Analyzer