cuBLAS Example

This example multiplies two matrices A and B by using the cuBLAS library. The MATLAB® implementation of GEneral Matrix-Matrix Multiplication (GEMM) is:

function [C] = blas_gemm(A,B)
    C = zeros(size(A));
    C = A * B;
end

Generated CUDA Code

When you generate CUDA® code, GPU Coder™ creates function calls to initialize the cuBLAS library, perform matrix-matrix operations, and release hardware resources that the cuBLAS library uses. The following is a snippet of the generated CUDA code.

  cublasEnsureInitialization();
  blas_gemm_kernel1<<<dim3(2048U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_C);
  alpha1 = 1.0;
  beta1 = 0.0;
  cudaMemcpy((void *)gpu_alpha1, (void *)&alpha1, 8ULL, cudaMemcpyHostToDevice);
  cudaMemcpy((void *)gpu_A, (void *)A, 8388608ULL, cudaMemcpyHostToDevice);
  cudaMemcpy((void *)gpu_B, (void *)B, 8388608ULL, cudaMemcpyHostToDevice);
  cudaMemcpy((void *)gpu_beta1, (void *)&beta1, 8ULL, cudaMemcpyHostToDevice);
  cublasDgemm(cublasGlobalHandle, CUBLAS_OP_N, CUBLAS_OP_N, 1024, 1024, 1024,
              (double *)gpu_alpha1, (double *)&gpu_A[0], 1024,
              (double *)&gpu_B[0], 1024, (double *)gpu_beta1,
              (double *)&gpu_C[0], 1024);
  cublasEnsureDestruction();
  cudaMemcpy((void *)C, (void *)gpu_C, 8388608ULL, cudaMemcpyDeviceToHost);

To initialize the cuBLAS library and create a handle to the cuBLAS library context, the function cublasEnsureInitialization() calls the cublasCreate() cuBLAS API. This call allocates hardware resources on the host and device. The helper also sets the pointer mode to CUBLAS_POINTER_MODE_DEVICE, which is why the generated code copies the scalars alpha1 and beta1 to the device before calling cublasDgemm.

static void cublasEnsureInitialization(void)
{
  if (cublasGlobalHandle == NULL) {
    cublasCreate(&cublasGlobalHandle);
    cublasSetPointerMode(cublasGlobalHandle, CUBLAS_POINTER_MODE_DEVICE);
  }
}
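
Because the handle is placed in CUBLAS_POINTER_MODE_DEVICE, cublasDgemm reads the scalars α and β from device memory rather than from host pointers. The following sketch illustrates the transfer pattern the generated code follows; allocDeviceScalar is a hypothetical helper, not part of the generated code.

#include <cuda_runtime.h>

/* Hypothetical helper: allocate a device-resident copy of a host scalar,
   mirroring the gpu_alpha1/gpu_beta1 transfers in the generated code. */
static double *allocDeviceScalar(double value)
{
  double *d_scalar = NULL;
  cudaMalloc((void **)&d_scalar, sizeof(double));
  cudaMemcpy(d_scalar, &value, sizeof(double), cudaMemcpyHostToDevice);
  return d_scalar;
}

With this helper, the transfers before the cublasDgemm call reduce to gpu_alpha1 = allocDeviceScalar(1.0) and gpu_beta1 = allocDeviceScalar(0.0).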

blas_gemm_kernel1 initializes the result matrix C to zero. The kernel launches with 2048 blocks of 512 threads each, one thread for each of the 1024 × 1024 = 1,048,576 elements of C.

static __global__ __launch_bounds__(512, 1) void blas_gemm_kernel1(real_T *C)
{
  int32_T threadIdX;

  /* Linear index of this thread across the whole grid. */
  threadIdX = (int32_T)(blockDim.x * blockIdx.x + threadIdx.x);

  /* C has 1024 x 1024 = 1048576 elements; guard against
     out-of-range indices before writing. */
  if (!(threadIdX >= 1048576)) {
    C[threadIdX] = 0.0;
  }
}
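
The generated code hard-codes the launch configuration, but it follows directly from the element count. A minimal sketch of the equivalent computation; numElements and threadsPerBlock are illustrative names, not identifiers from the generated code.

const int numElements = 1024 * 1024;  /* elements of the result matrix C */
const int threadsPerBlock = 512;      /* threads in each block           */
const int numBlocks =                 /* ceiling division: 2048 blocks   */
    (numElements + threadsPerBlock - 1) / threadsPerBlock;

blas_gemm_kernel1<<<numBlocks, threadsPerBlock>>>(gpu_C);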

Calls to cudaMemcpy transfer the matrices A and B from the host to the device. The function cublasDgemm is a level-3 Basic Linear Algebra Subprogram (BLAS3) that performs the matrix-matrix multiplication:

C = αAB + βC

where α and β are scalars, and A, B, and C are matrices stored in column-major format. The CUBLAS_OP_N arguments specify that neither A nor B is transposed before the multiplication.
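
For comparison, the same DGEMM operation can be called from a hand-written host program. The following is a minimal sketch, not GPU Coder output; it keeps the handle in the default CUBLAS_POINTER_MODE_HOST, so alpha and beta stay in host memory instead of being copied to the device as in the generated code.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
  const int n = 1024;                                  /* matrix dimension       */
  const size_t bytes = (size_t)n * n * sizeof(double); /* 8388608, as in the
                                                          cudaMemcpy calls above */
  const double alpha = 1.0, beta = 0.0;

  /* Host matrices: A filled with ones, B with twos. */
  double *A = (double *)malloc(bytes);
  double *B = (double *)malloc(bytes);
  double *C = (double *)malloc(bytes);
  for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }

  /* Device matrices; dC need not be zeroed because beta is 0,
     so cublasDgemm does not read C. */
  double *dA, *dB, *dC;
  cudaMalloc((void **)&dA, bytes);
  cudaMalloc((void **)&dB, bytes);
  cudaMalloc((void **)&dC, bytes);
  cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

  cublasHandle_t handle;
  cublasCreate(&handle);

  /* C = alpha*A*B + beta*C; CUBLAS_OP_N leaves both inputs untransposed. */
  cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, dA, n, dB, n, &beta, dC, n);

  cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
  printf("C[0] = %.1f (expected 2048.0)\n", C[0]);

  cublasDestroy(handle);
  cudaFree(dA); cudaFree(dB); cudaFree(dC);
  free(A); free(B); free(C);
  return 0;
}

With A filled with ones and B with twos, every element of the product is 1024 × 1 × 2 = 2048, which gives a quick correctness check.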

The final calls are to cublasEnsureDestruction() and another cudaMemcpy. cublasEnsureDestruction() calls the cublasDestroy() cuBLAS API to release the hardware resources that the cuBLAS library uses. cudaMemcpy copies the result matrix C from the device back to the host.

static void cublasEnsureDestruction(void)
{
  if (cublasGlobalHandle != NULL) {
    cublasDestroy(cublasGlobalHandle);
    cublasGlobalHandle = NULL;
  }
}
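
The generated helpers discard the status codes that the cuBLAS APIs return. A hedged variant of the teardown helper with status checking; the Checked suffix is illustrative, not part of the generated code.

#include <stdio.h>
#include <cublas_v2.h>

static cublasHandle_t cublasGlobalHandle = NULL;

/* Variant of cublasEnsureDestruction that inspects the status code
   returned by cublasDestroy instead of discarding it. */
static void cublasEnsureDestructionChecked(void)
{
  if (cublasGlobalHandle != NULL) {
    if (cublasDestroy(cublasGlobalHandle) != CUBLAS_STATUS_SUCCESS) {
      fprintf(stderr, "cublasDestroy reported an error\n");
    }
    cublasGlobalHandle = NULL;
  }
}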

Prepare blas_gemm for Kernel Creation

GPU Coder requires no special pragma to generate calls to libraries. There are two ways to generate CUDA kernels: coder.gpu.kernelfun and coder.gpu.kernel. This example uses the coder.gpu.kernelfun pragma to generate CUDA kernels. The modified blas_gemm function is:

function [C] = blas_gemm(A,B) %#codegen
    C = coder.nullcopy(zeros(size(A)));

    coder.gpu.kernelfun;
    C = A * B;
end

Note

The input data must contain a minimum of 128 elements for GPU Coder to replace math operators and functions with cuBLAS library implementations.