## Measure and Improve GPU Performance

### Measure GPU Performance

#### Measure Code Performance on a GPU

An important measure of the performance of your code is how long it takes to run. The best way to time code running on a GPU is to use the `gputimeit` function, which runs a function multiple times to average out variation and compensate for overhead. The `gputimeit` function also ensures that all operations on the GPU are complete before recording the time.

For example, measure the time that the `lu` function takes to compute the LU factorization of a random matrix `A` of size `N`-by-`N`. To perform this measurement, create a function handle to the `lu` function and pass the function handle to `gputimeit`.

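```matlab
N = 1000;
A = rand(N,"gpuArray");   % create the random matrix directly on the GPU
f = @() lu(A);            % function handle to the operation to time
numOutputs = 2;           % time the two-output form, [L,U] = lu(A)
gputimeit(f,numOutputs)
```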

You can also time your code using `tic` and `toc`. However, to get accurate timing information for code running on a GPU, you must wait for operations to complete before calling `tic` and `toc`. To do this, you can use the `wait` function with a `gpuDevice` object as its input. For example, measure the time taken to compute the LU factorization of matrix `A` using `tic`, `toc`, and `wait`.

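```matlab
D = gpuDevice;
wait(D)           % finish any pending GPU work before starting the timer
tic
[L,U] = lu(A);
wait(D)           % make sure the factorization is complete before stopping the timer
toc
```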

You can view how long each part of your code takes using the MATLAB® Profiler. For more information about profiling your code, see `profile` and Profile Your Code to Improve Performance. The Profiler is useful for identifying performance bottlenecks in your code but cannot accurately time GPU code as it does not account for overlapping execution, which is common when you use a GPU.

Use this table to help you decide which timing method to use.

| Timing Method | Suitable Tasks | Limitations |
| --- | --- | --- |
| `gputimeit` | Timing individual functions | Because the `gputimeit` function requires a function handle as an argument, you can only use this method to time a single function. However, the function that you time can contain calls to other functions. Because the `gputimeit` function executes the function a number of times to account for initialization overhead, this method is often unsuitable for timing long-running functions. |
| `tic` and `toc` | Timing multiple lines of code or entire workflows | To ensure that all GPU computations are complete, you must call `wait` before calling `toc`. Similarly, if the preceding code runs on the GPU, call `wait` before calling `tic`. You cannot use `tic` and `toc` to measure the execution time of `gputimeit`. |
| MATLAB Profiler | Finding performance bottlenecks | The Profiler runs each line of code independently and does not account for overlapping execution, which is common when you use a GPU. You cannot use the Profiler as a way to accurately time GPU code. |

#### GPU Benchmarking

Benchmark tests are useful for identifying the strengths and weaknesses of a GPU and for comparing the performance of different GPUs. Measure the performance of your GPU by using these benchmark tests:

• Run the Measure GPU Performance example to obtain detailed information about your GPU, including PCI bus speed, GPU memory read/write, and peak calculation performance for double-precision matrix calculations.

• Use `gpuBench` to test memory- and computation-intensive tasks in single and double precision. `gpuBench` can be downloaded from the Add-On Explorer or from the MATLAB Central File Exchange. For more information, see https://www.mathworks.com/matlabcentral/fileexchange/34080-gpubench.

### Improve GPU Performance

The purpose of GPU computing in MATLAB is to speed up your code. You can achieve better performance on the GPU by implementing best practices for writing code and configuring your GPU hardware. Various methods to improve performance are discussed below, starting with the most straightforward to implement.

| Performance Improvement Method | When Should I Use This Method? | Limitations |
| --- | --- | --- |
| Use GPU Arrays – pass GPU arrays to supported functions to run your code on the GPU | Generally applicable | Your functions must support `gpuArray` input. For a list of MATLAB functions that support `gpuArray` input, see Run MATLAB Functions on a GPU. |
| Profile and Improve Your MATLAB Code – profile your code to identify bottlenecks | Generally applicable | The Profiler cannot be used to accurately time code running on the GPU, as described in the Measure Code Performance on a GPU section. |
| Vectorize Calculations – replace for-loops with matrix and vector operations | When running code that operates on vectors or matrices inside a for-loop | |
| Perform Calculations in Single Precision – reduce computation by using lower precision data | When smaller ranges of values and lower accuracy are acceptable | Some types of calculation, such as linear algebra problems, might require double-precision processing. |
| Use `arrayfun` – execute element-wise functions using a custom CUDA® kernel | When using a function that performs many element-wise operations, or when a nested function needs access to variables declared in its parent function | Operations that change the size or shape of the input or output arrays (`cat`, `reshape`, and so on) are not supported. Not all built-in MATLAB functions are supported. For information about supported functions and additional limitations, see `arrayfun`. |
| Use `pagefun` – perform large batches of matrix operations in a single call | When using a function that performs independent matrix operations on a large number of small matrices | Not all built-in MATLAB functions are supported. For information about supported functions and additional limitations, see `pagefun`. |
| Write MEX File Containing CUDA Code – access additional libraries of GPU functions | When you want access to NVIDIA® libraries or advanced CUDA features | Requires code written using the CUDA C++ framework. |
| Configure Your Hardware for GPU Performance – make the best use of your hardware | Generally applicable | Not all NVIDIA GPU devices support TCC mode. A GPU device in TCC mode is used for computation only and does not provide output for a display. |

#### Use GPU Arrays

If all the functions that your code uses are supported on the GPU, the only necessary modification is to transfer the input data to the GPU by calling `gpuArray`. For a list of MATLAB functions that support `gpuArray` input, see Run MATLAB Functions on a GPU.

A `gpuArray` object stores data in GPU memory. Because most numeric functions in MATLAB and in many other toolboxes support `gpuArray` objects, you can usually run your code on a GPU by making minimal changes. These functions take `gpuArray` inputs, perform calculations on the GPU, and return `gpuArray` outputs. In general, these functions support the same arguments and data types as standard MATLAB functions that run on the CPU.

Tip

To reduce overhead, limit the number of times you transfer data between the host memory and the GPU. Create arrays directly on the GPU where possible. For more information, see Create GPU Arrays Directly. Similarly, only transfer data from the GPU back to the host memory using `gather` if the data needs to be displayed, saved, or used in code that does not support `gpuArray` objects.
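As a minimal sketch of the tip above, the following code creates its inputs directly in GPU memory, keeps intermediate results on the GPU, and calls `gather` only for the final scalar. The matrix size and the particular calculation are arbitrary choices for illustration.

```matlab
% Create the input data directly on the GPU rather than creating it on the
% host and copying it across.
N = 1000;
A = rand(N,"gpuArray");      % allocated and filled in GPU memory
B = ones(N,"like",A);        % "like" makes B a gpuArray of the same type as A

% Intermediate results are gpuArray objects and stay in GPU memory.
C = A*B + A.';
s = sum(C,"all");

% Transfer only the final scalar result back to host memory.
sHost = gather(s);
disp(sHost)
```

Because `C` and `s` never leave the GPU, the only host-device transfer in this sketch is the single scalar returned by `gather`.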

#### Profile and Improve Your MATLAB Code

When converting MATLAB code to run on a GPU, it is best to start with MATLAB code that already performs well. Many of the guidelines for writing code that runs well on a CPU will also improve the performance of code that runs on a GPU. You can profile your CPU code using the MATLAB Profiler. The lines of code that take the most time on the CPU will likely be ones that you should improve or consider moving onto the GPU using `gpuArray` objects. For more information about profiling your code, see Profile Your Code to Improve Performance.

Because the MATLAB Profiler runs each line of code independently, it does not account for overlapping execution, which is common when you use a GPU. To time whole algorithms, use `tic` and `toc` or `gputimeit`, as described in the Measure Code Performance on a GPU section.

#### Vectorize Calculations

Vector, matrix, and higher-dimensional operations typically perform much better than scalar operations on a GPU because GPUs achieve high performance by calculating many results in parallel. You can achieve better performance by rewriting loops to make use of higher-dimensional operations. The process of revising loop-based, scalar-oriented code to use MATLAB matrix and vector operations is called vectorization. For information on vectorization, see Using Vectorization and Improve Performance Using a GPU and Vectorized Calculations. A plot in the Improve Performance Using a GPU and Vectorized Calculations example shows the increase in performance achieved by vectorizing a function executing on the CPU and on the GPU.
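The following sketch illustrates vectorization on a small, made-up calculation (evaluating a quadratic polynomial for every element of a vector). The polynomial and the vector length are assumptions chosen only to keep the loop version quick to run. Actual timings depend on your hardware.

```matlab
x = rand(1e4,1,"gpuArray");          % illustrative input data on the GPU
D = gpuDevice;

% Loop-based version: each iteration launches small scalar operations.
yLoop = zeros(size(x),"like",x);
wait(D)
tic
for k = 1:numel(x)
    yLoop(k) = 3*x(k)^2 + 2*x(k) + 1;
end
wait(D)
tLoop = toc;

% Vectorized version: one set of element-wise operations on the whole vector.
yVec = 3*x.^2 + 2*x + 1;
tVec = gputimeit(@() 3*x.^2 + 2*x + 1);

fprintf("loop: %.4f s, vectorized: %.6f s\n",tLoop,tVec);
```

Both versions compute the same values. The vectorized form gives the GPU all elements to process at once instead of one element per iteration.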

#### Perform Calculations in Single Precision

You can improve the performance of code running on your GPU by calculating in single precision instead of double precision. CPU computations typically do not show a comparable improvement when switching from double to single precision, whereas most GPU cards are designed for graphics display, which demands high single-precision performance. For more information on converting data to single precision and performing arithmetic operations on single-precision data, see Floating-Point Numbers.

Typical examples of workflows suitable for single-precision computation on the GPU include image processing and machine learning. However, other types of calculation, such as linear algebra problems, typically require double-precision processing. The Deep Learning Toolbox™ performs many operations in single precision by default. For more information, see Deep Learning Precision (Deep Learning Toolbox).

The exact performance improvement depends on the GPU card and total number of cores. High-end compute cards typically show a smaller improvement. For a comprehensive performance overview of NVIDIA GPU cards, including single- and double-precision processing power, see https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units.
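One way to check the difference on your own hardware is to time the same operation in both precisions with `gputimeit`. The sketch below uses matrix multiplication and a matrix size of 2000 as arbitrary illustrative choices. The measured ratio varies between GPU models.

```matlab
N = 2000;
Ad = rand(N,"gpuArray");     % double precision is the default
As = single(Ad);             % the same data converted to single precision

% Time the same matrix multiplication in each precision.
tDouble = gputimeit(@() Ad*Ad);
tSingle = gputimeit(@() As*As);

fprintf("double: %.4f s, single: %.4f s, ratio: %.1f\n", ...
    tDouble,tSingle,tDouble/tSingle);
```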

#### Improve Performance of Element-Wise Functions

If you have an element-wise function, you can often improve its performance by calling it with `arrayfun`. The `arrayfun` function on the GPU turns an element-wise MATLAB function into a custom CUDA kernel, which reduces the overhead of performing the operation. You can often use `arrayfun` with a subset of your code even if `arrayfun` does not support your entire code. The performance of a wide variety of element-wise functions can be improved using `arrayfun`, including functions performing many element-wise operations within looping or branching code, and nested functions where the nested function accesses variables declared in its parent function.

The Improve Performance of Element-Wise MATLAB Functions on the GPU Using arrayfun example shows a basic application of `arrayfun`. The Using GPU arrayfun for Monte-Carlo Simulations example shows `arrayfun` used to improve the performance of a function executing element-wise operations within a loop. The Stencil Operations on a GPU example shows `arrayfun` used to call a nested function that accesses variables declared in a parent function.
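As a minimal sketch of the idea, the script below applies a scalar function containing a branch to every pair of elements in two GPU vectors. The function `clampedRatio` is a hypothetical example written for this illustration, not part of any toolbox. It is defined as a local function at the end of the script, where older MATLAB releases require local functions to appear.

```matlab
x = rand(1e6,1,"gpuArray");
y = rand(1e6,1,"gpuArray");

% arrayfun compiles clampedRatio into a single GPU kernel and applies it
% to each pair of elements (x(k), y(k)).
z = arrayfun(@clampedRatio,x,y);

% Equivalent calculation using separate vectorized operations, for comparison.
zRef = x.*y;
idx = y > 0.5;
zRef(idx) = x(idx)./y(idx);

fprintf("max difference: %g\n",gather(max(abs(z - zRef))));

% Hypothetical element-wise function: written for scalar inputs, with a branch.
function z = clampedRatio(a,b)
    if b > 0.5
        z = a/b;
    else
        z = a*b;
    end
end
```

The function body operates on scalars only, so it does not change the size or shape of its inputs, in line with the limitations that apply to `arrayfun` on the GPU.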

#### Improve Performance of Operations on Small Matrices

If you have a function that performs independent matrix operations on a large number of small matrices, you can improve its performance by calling it with `pagefun`. You can use `pagefun` to perform matrix operations in parallel on the GPU instead of looping over the matrices. The Improve Performance of Small Matrix Problems on the GPU Using pagefun example shows how to improve performance using `pagefun` when operating on many small matrices.
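For illustration, the sketch below multiplies 1000 independent pairs of 4-by-4 matrices, first in a loop and then with one `pagefun` call. The matrix size and page count are arbitrary assumptions. Each page along the third dimension is treated as a separate matrix.

```matlab
numPages = 1000;
A = rand(4,4,numPages,"gpuArray");   % numPages independent 4-by-4 matrices
B = rand(4,4,numPages,"gpuArray");

% Loop version: one small matrix multiplication per iteration.
Cloop = zeros(4,4,numPages,"like",A);
for k = 1:numPages
    Cloop(:,:,k) = A(:,:,k)*B(:,:,k);
end

% pagefun version: all page-wise multiplications in a single call.
C = pagefun(@mtimes,A,B);

fprintf("max difference: %g\n",gather(max(abs(C(:) - Cloop(:)))));
```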

#### Write MEX File Containing CUDA Code

While MATLAB provides an extensive library of GPU-enabled functions, you can access libraries of additional functions that do not have analogs in MATLAB. Examples include NVIDIA libraries such as the NVIDIA Performance Primitives (NPP) and cuRAND libraries. You can compile MEX files that you write in the CUDA C++ framework using the `mexcuda` function. You can execute the compiled MEX files in MATLAB and call functions from NVIDIA libraries. For an example that shows how to write and run MEX functions that take `gpuArray` input and return `gpuArray` output, see Run MEX Functions Containing CUDA Code.

#### Configure Your Hardware for GPU Performance

Because many computations require large quantities of memory and most systems use the GPU constantly for graphics, using the same GPU for computations and graphics is usually impractical.

On Windows® systems, a GPU device can use one of two driver models: Windows Display Driver Model (WDDM) or Tesla Compute Cluster (TCC). To attain the best performance for your code, set the devices that you use for computing to use the TCC model. To see which model your GPU device is using, inspect the `DriverModel` property returned by the `gpuDevice` function. For more information about switching models and which GPU devices support the TCC model, consult the NVIDIA documentation.

To reduce the likelihood of running out of memory on the GPU, do not use the same GPU in multiple instances of MATLAB. To see which GPU devices are available and selected, use the `gpuDeviceTable` function.
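A short sketch of these checks follows. It assumes a supported GPU is present. The value reported by `DriverModel` depends on your operating system and device.

```matlab
% Query the currently selected GPU device.
D = gpuDevice;
fprintf("Device: %s\n",D.Name);

% Driver model of the selected device (for example, WDDM or TCC on Windows).
disp(D.DriverModel)

% List the available GPU devices and show which one is selected.
T = gpuDeviceTable;
disp(T)
```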