
Kernel Creation from MATLAB Code

MATLAB code structures and patterns that create CUDA® GPU kernels

GPU Coder™ generates and executes optimized CUDA kernels for specific algorithm structures and patterns in your MATLAB® code. The generated code calls optimized NVIDIA® CUDA libraries, including cuFFT, cuSOLVER, cuBLAS, cuDNN, and TensorRT. You can integrate the generated code into your project as source code, static libraries, or dynamic libraries, and compile it for desktops, servers, and GPUs embedded on NVIDIA Jetson, DRIVE, and other platforms. GPU Coder also lets you incorporate handwritten CUDA code into your algorithms and into the generated code.
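
For example, the following is a minimal sketch of the typical workflow: annotate a MATLAB entry-point function with the coder.gpu.kernelfun pragma and generate CUDA MEX code with codegen. The function name, input size, and target shown here are illustrative assumptions, not values taken from this page.

% myScale.m -- illustrative entry-point function
function y = myScale(x) %#codegen
coder.gpu.kernelfun;      % map the computation in this function to CUDA kernels
y = 2*x + 1;              % element-wise work that GPU Coder can parallelize
end

% Generate a CUDA MEX function for single-precision 4096-by-4096 inputs.
cfg = coder.gpuConfig('mex');
codegen -config cfg myScale -args {ones(4096,4096,'single')}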

Apps


GPU Coder - Generate CUDA code from MATLAB code
GPU Environment Check - Verify and set up GPU code generation environment (see the command-line sketch after this list)
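
As a command-line sketch of what these apps automate, the calls below open the GPU Coder app and run a basic environment check; the no-argument form of coder.checkGpuInstall is assumed here to perform a basic check on the host computer.

gpucoder                  % open the GPU Coder app
coder.checkGpuInstall     % verify the CUDA toolchain and GPU on the host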

Functions


codegen - Generate C/C++ code from MATLAB code
gpucoder - Open GPU Coder app
coder.checkGpuInstall - Verify GPU code generation environment
coder.gpuConfig - Configuration parameters for CUDA code generation from MATLAB code by using GPU Coder
coder.gpu.kernel - Pragma that maps for-loops to GPU kernels (see the loop sketch after this list)
coder.gpu.kernelfun - Pragma that maps a function to GPU kernels
coder.gpu.nokernel - Pragma to disable kernel creation for loops
coder.ceval - Call C/C++ function from generated code
coder.gpu.iterations - Pragma that provides information to the code generator for making parallelization decisions on variable bound loops
coder.gpu.constantMemory - Pragma that maps a variable to the constant memory on the GPU
coder.gpu.persistentMemory - Pragma to allocate a variable as persistent memory on the GPU (Since R2020b)
cudaMemoryManager - Query memory usage by shared GPU memory manager for MEX functions (Since R2024a)
gpucoder.atomicAdd - Atomically add a specified value to a variable in global or shared memory (Since R2021b)
gpucoder.atomicAnd - Atomically perform bit-wise AND between a specified value and a variable in global or shared memory (Since R2021b)
gpucoder.atomicCAS - Atomically compare and swap the value of a variable in global or shared memory (Since R2021b)
gpucoder.atomicDec - Atomically decrement a variable in global or shared memory within a specified upper bound (Since R2021b)
gpucoder.atomicExch - Atomically exchange a variable in global or shared memory with the specified value (Since R2021b)
gpucoder.atomicInc - Atomically increment a variable in global or shared memory within a specified upper bound (Since R2021b)
gpucoder.atomicMax - Atomically find the maximum between a specified value and a variable in global or shared memory (Since R2021b)
gpucoder.atomicMin - Atomically find the minimum between a specified value and a variable in global or shared memory (Since R2021b)
gpucoder.atomicOr - Atomically perform bit-wise OR between a specified value and a variable in global or shared memory (Since R2021b)
gpucoder.atomicSub - Atomically subtract a specified value from a variable in global or shared memory (Since R2021b)
gpucoder.atomicXor - Atomically perform bit-wise XOR between a specified value and a variable in global or shared memory (Since R2021b)
half - Construct half-precision numeric object
stencilfun - Generate CUDA code for stencil functions (Since R2022b)
selectdata - Select slices of arrays and generate CUDA code (Since R2025a)
gpucoder.matrixMatrixKernel - Optimized GPU implementation of functions containing matrix-matrix operations
gpucoder.batchedMatrixMultiply - Optimized GPU implementation of batched matrix multiply operation
gpucoder.stridedMatrixMultiply - Optimized GPU implementation of strided and batched matrix multiply operation
gpucoder.batchedMatrixMultiplyAdd - Optimized GPU implementation of batched matrix multiply with add operation
gpucoder.stridedMatrixMultiplyAdd - Optimized GPU implementation of strided, batched matrix multiply with add operation
gpucoder.sort - Optimized GPU implementation of the MATLAB sort function
gpucoder.ctranspose - Optimized GPU implementation of the MATLAB ctranspose function
gpucoder.transpose - Optimized GPU implementation of the MATLAB transpose function
gpucoder.reduce - Optimized GPU implementation for reduction operations
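
As an illustration of how the kernel pragmas listed above are used, the sketch below maps a for-loop to a CUDA kernel with coder.gpu.kernel. The function name, loop body, and data sizes are illustrative assumptions, not taken from this page.

function y = addVectors(a, b) %#codegen
% Illustrative example: request a CUDA kernel for the loop that follows.
y = zeros(size(a), 'like', a);
coder.gpu.kernel;             % map the next for-loop to a GPU kernel
for i = 1:numel(a)
    y(i) = a(i) + b(i);       % each iteration can run as independent GPU work
end
end

A MEX target can then be generated with, for example, codegen -config coder.gpuConfig('mex') addVectors -args {ones(1,4096,'single'), ones(1,4096,'single')}.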

Code Configuration Settings


Generate GPU Code - Control GPU code generation
GPU device ID - CUDA device selection
Minimum compute capability - Minimum compute capability for code generation
Custom compute capability - Virtual GPU architecture
Malloc mode - GPU memory allocation
Malloc threshold - Threshold for GPU memory allocation
Stack limit - Stack limit per GPU thread
Maximum blocks per kernel - Maximum number of blocks created during a kernel launch
Benchmarking - Add benchmarking to the generated code
Safe build - Error checking in the generated code
Kernel name prefix - Custom kernel name prefixes
Compiler flags - Pass additional flags to GPU compiler
Enable cuBLAS - Replace math function calls with cuBLAS library calls
Enable cuSOLVER - Replace math function calls with cuSOLVER library calls
Enable cuFFT - Replace fft function calls with cuFFT library calls
Enable GPU memory manager - Use GPU memory manager (see the configuration sketch after this list)
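
The sketch below shows how these settings can be specified programmatically through the coder.gpuConfig object; the property names and values are assumptions based on the settings listed above, so check the coder.gpuConfig reference for the exact names and defaults.

% Illustrative configuration; property names are assumed to mirror the settings above.
cfg = coder.gpuConfig('lib');                 % generate a static library
cfg.GpuConfig.ComputeCapability = '6.1';      % minimum compute capability
cfg.GpuConfig.EnableCUBLAS = true;            % replace math calls with cuBLAS
cfg.GpuConfig.EnableCUFFT = true;             % replace fft calls with cuFFT
cfg.GpuConfig.MallocMode = 'discrete';        % GPU memory allocation mode
cfg.GpuConfig.KernelNamePrefix = 'myproj_';   % prefix for generated kernel names
codegen -config cfg addVectors -args {ones(1,4096,'single'), ones(1,4096,'single')}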

Objects


coder.gpuConfig - Configuration parameters for CUDA code generation from MATLAB code by using GPU Coder
coder.CodeConfig - Configuration parameters for C/C++ code generation from MATLAB code
coder.EmbeddedCodeConfig - Configuration parameters for C/C++ code generation from MATLAB code with Embedded Coder
coder.gpuEnvConfig - Configuration object for checking the GPU code generation environment (see the sketch after this list)
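
For a more detailed environment check than the basic coder.checkGpuInstall call shown earlier, a coder.gpuEnvConfig object can be passed to the check. This is a minimal sketch; the BasicCodegen and Quiet properties are assumptions based on common usage of this object.

% Illustrative environment check on the host development computer.
envCfg = coder.gpuEnvConfig('host');
envCfg.BasicCodegen = 1;      % also run a basic code generation and execution test (assumed property)
envCfg.Quiet = 1;             % suppress detailed progress messages (assumed property)
coder.checkGpuInstall(envCfg);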

Topics

Featured Examples