Main Content

coder.gpu.kernel

Pragma that maps for-loops to GPU kernels

Description

coder.gpu.kernel() is a loop-level pragma that you must place immediately before a for loop. It generates a kernel with the dimensions computed from the loop parameters.

Note

The coder.gpu.kernel pragma overrides all parallel loop analysis checks that the software performs. Use coder.gpu.kernelfun first before using the more advanced functionality of the coder.gpu.kernel pragma.

coder.gpu.kernel(B,T) is a loop-level pragma that you must place immediately before a for loop. It generates a kernel with the dimensions specified by B and T. B[Bx,By,1] is an array that defines the number of blocks in the grid along dimensions x and y (z not used). T[Tx,Ty,Tz] is an array that defines the number of threads in the block along dimensions x, y, and z.

A value of -1 for B and T indicates that GPU Coder™ must infer the grid and block dimensions automatically. The coder.gpu.kernel pragma generates errors for invalid grid and block dimensions.

example

coder.gpu.kernel(B,T,M,name) expects the same B and T arguments. You can specify optional arguments M and name. M is a positive integer specifying the minimum number of blocks per streaming multiprocessor. Sometimes, increasing M can reduce the register usage within a kernel and improve kernel occupancy. A value of -1 for M indicates that GPU Coder must use the default value of 1. name is a character array that allows you to customize the name of the generated kernel.

Specifying the kernel pragma overrides all parallel loop analysis checks. This override allows loops to be parallelized in situations where parallel loop analysis cannot prove that all iterations are independent of each other. First, ensure that the loop is safe to parallelize.

This function is a code generation function. It has no effect in MATLAB®.

Examples

collapse all

This example shows how to use the kernel pragma in a function and generate CUDA® code.

In one file, write the entry-point function scalars that accepts two vector inputs x,y of size 1x4096 and one scalar input scale. The function has two for-loops of different iteration lengths, one for vector addition and one for finding the cumulative sum. Place the coder.gpu.kernel(1,1024) pragma outside the first loop. This pragma creates a kernel with one block having 1024 threads. Place the coder.gpu.kernel(8,512,512,'reduction') pragma outside the second loop. This pragma creates a kernel with eight blocks having 512 threads per block. The kernel created for this block is named reduction.

function [vout, sout1] = scalars(x,y,scale)
    sout1 = 0;
    vout = coder.nullcopy(zeros(size(x)));
    
    coder.gpu.kernel(1,1024);
    for i=1:1024
        vout(i) = x(i) + y(i);
    end
    
    coder.gpu.kernel(8,512,512,'reduction');
    for i=1:4096
        sout1 = (x(i)*scale) + sout1;    
    end
end

Use the codegen function to generate CUDA MEX function.

codegen -config coder.gpuConfig('mex')...
 -args {ones(1,4096,'double'),ones(1,4096,'double'),coder.typeof(0)}...
 -report scalars

GPU Coder creates two kernels: scalars_kernel1 for vector addition and scalarsreduction kernel for the cumulative sum. No kernel is needed for initializing sout1=0.

  cudaMemcpy(gpu_y, y, 32768U, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_x, x, 32768U, cudaMemcpyHostToDevice);  
  scalars_kernel1<<<dim3(1U, 1U, 1U), dim3(1024U, 1U, 1U)>>>(gpu_y, gpu_x, gpu_vout);
  cudaMemcpy(gpu_sout1, sout1, 8U, cudaMemcpyHostToDevice);
  scalarsreduction<<<dim3(8U, 1U, 1U), dim3(512U, 1U, 1U>>>(scale, gpu_x, gpu_sout1);
  cudaMemcpy(vout, gpu_vout, 32768U, cudaMemcpyDeviceToHost);
  cudaMemcpy(sout1, gpu_sout1, 8U, cudaMemcpyDeviceToHost);

scalars_kernel1 has one block with 1024 threads per block, one for adding each element. scalarsreduction kernel has eight blocks with 512 threads per block, resulting in a total of 4096 threads.

You can use variables or expressions when specifying the kernel dimensions. For example, you can rewrite the scalars entry-point function such that the grid and block dimensions are specified at compile time.

function [vout, sout1] = scalars(x,y,scale, a, b)
    sout1 = 0;
    vout = zeros(size(x));
    
    coder.gpu.kernel(1,1024);
    for i=1:1024
        vout(i) = x(i) + y(i);
    end
    
    coder.gpu.kernel([a,a*b,1], [a*b, 1, 1], 'reduction');
    for i=1:length(x)
        sout1 = (x(i)*scale) + sout1;    
    end
end

Use the codegen function to generate CUDA MEX function.

codegen -config coder.gpuConfig('mex')...
 -args {ones(1,4096,'double'),ones(1,4096,'double'),20,8,4}...
 -report scalars
Introduced in R2017b