Shallow Neural Networks with Parallel and GPU Computing
Note
For deep learning, parallel and GPU support is automatic. You can train a convolutional neural network (CNN, ConvNet) or long short-term memory networks (LSTM or BiLSTM networks) using the trainnet function, and choose the execution environment (CPU, GPU, multi-GPU, and parallel) using trainingOptions.
Training in parallel, or on a GPU, requires Parallel Computing Toolbox™. For more information on deep learning with GPUs and in parallel, see Deep Learning with Big Data on CPUs, GPUs, in Parallel, and on the Cloud.
Modes of Parallelism
Neural networks are inherently parallel algorithms. Multicore CPUs, graphics processing units (GPUs), and clusters of computers with multiple CPUs and GPUs can take advantage of this parallelism.
Parallel Computing Toolbox, when used in conjunction with Deep Learning Toolbox™, enables neural network training and simulation to take advantage of each mode of parallelism.
For example, the following shows a standard single-threaded training and simulation session:
[x, t] = bodyfat_dataset;
net1 = feedforwardnet(10);
net2 = train(net1, x, t);
y = net2(x);
The two steps you can parallelize in this session are the call to train and the implicit call to sim (where the network net2 is called as a function).
In Deep Learning Toolbox you can divide any data, such as x and t in the previous example code, across samples. If x and t contain only one sample each, there is no parallelism. But if x and t contain hundreds or thousands of samples, parallelism can provide both speed and problem size benefits.
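As a rough illustration of what dividing data across samples means, the following sketch splits the columns of a sample matrix into contiguous blocks, one per worker. The variable names and the even-split rule are assumptions made for illustration only; they are not a description of how train partitions data internally.

```matlab
% Illustrative only: divide Q samples (columns) into one block per worker.
x = rand(2, 1000);            % 2 inputs, 1000 samples stored as columns
numWorkers = 4;               % assumed worker count
Q = size(x, 2);

% Block boundaries that spread the samples as evenly as possible.
edges = round(linspace(0, Q, numWorkers + 1));

blocks = cell(1, numWorkers);
for i = 1:numWorkers
    blocks{i} = x(:, edges(i)+1:edges(i+1));
end

% Number of samples that ended up in each block.
blockSizes = cellfun(@(b) size(b, 2), blocks);
```

With a single sample there is only one column to hand out, which is why no parallelism is possible in that case.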
Distributed Computing
Parallel Computing Toolbox allows neural network training and simulation to run across multiple CPU cores on a single PC, or across multiple CPUs on multiple computers on a network using MATLAB® Parallel Server™.
Using multiple cores can speed calculations. Using multiple computers can allow you to solve problems using data sets too big to fit in the RAM of a single computer. The only limit to problem size is the total quantity of RAM available across all computers.
To manage cluster configurations, use the Cluster Profile Manager. To open it, go to the MATLAB Home tab and, in the Environment section, select Parallel > Manage Cluster Profiles.
To open a pool of MATLAB workers using the default cluster profile, which is usually the local CPU cores, use this command:
pool = parpool
Starting parallel pool (parpool) using the 'Processes' profile ... connected to 4 workers.
When parpool runs, it displays the number of workers available in the pool. Another way to determine the number of workers is to query the pool:
pool.NumWorkers
4
Now you can train and simulate the neural network with data split by sample across all the workers. To do this, set the train and sim parameter 'useParallel' to 'yes'.
net2 = train(net1,x,t,'useParallel','yes')
y = net2(x,'useParallel','yes')
Use the 'showResources' argument to verify that the calculations ran across multiple workers.
net2 = train(net1,x,t,'useParallel','yes','showResources','yes');
y = net2(x,'useParallel','yes','showResources','yes');
MATLAB indicates which resources were used. For example:
Computing Resources:
Parallel Workers
  Worker 1 on MyComputer, MEX on PCWIN64
  Worker 2 on MyComputer, MEX on PCWIN64
  Worker 3 on MyComputer, MEX on PCWIN64
  Worker 4 on MyComputer, MEX on PCWIN64
When train and sim are called, they divide the input matrix or cell array data into distributed Composite values before training and simulation. When sim has calculated a Composite, this output is converted back to the same matrix or cell array form before it is returned.
However, you might want to perform this data division manually if:
The problem size is too large for the host computer. Manually defining the elements of Composite values sequentially allows much bigger problems to be defined.
It is known that some workers are on computers that are faster or have more memory than others. You can distribute the data with differing numbers of samples per worker. This is called load balancing.
The following code sequentially creates a series of random datasets and saves them to separate files:
pool = gcp;
for i=1:pool.NumWorkers
    x = rand(2,1000);
    save(['inputs' num2str(i)],'x');
    t = x(1,:) .* x(2,:) + 2 * (x(1,:) + x(2,:));
    save(['targets' num2str(i)],'t');
    clear x t
end
Because the data was defined sequentially, you can define a total dataset larger than can fit in the host PC memory. PC memory must accommodate only a sub-dataset at a time.
Now you can load the datasets sequentially across parallel workers, and train and simulate a network on the Composite data. When train or sim is called with Composite data, the 'useParallel' argument is automatically set to 'yes'. When using Composite data, manually configure the network's inputs and outputs to match one of the datasets using the configure function before training.
xc = Composite;
tc = Composite;
for i=1:pool.NumWorkers
    data = load(['inputs' num2str(i)],'x');
    xc{i} = data.x;
    data = load(['targets' num2str(i)],'t');
    tc{i} = data.t;
    clear data
end
net2 = configure(net1,xc{1},tc{1});
net2 = train(net2,xc,tc);
yc = net2(xc);
To convert the Composite output returned by sim, you can access each of its elements separately, which is useful if you are concerned about memory limitations.
for i=1:pool.NumWorkers
    yi = yc{i}
end
If you are not concerned about memory limitations, combine the Composite value into one local value.
y = {yc{:}};
When load balancing, the same process happens, but, instead of each dataset having the same number of samples (1000 in the previous example), the numbers of samples can be adjusted to best take advantage of the memory and speed differences of the worker host computers.
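One way to choose the per-worker sample counts is to make them proportional to an estimate of each worker's relative capacity. The sketch below assumes hypothetical integer weights (speedWeights) and a hypothetical total sample count; neither comes from the toolbox, and the proportional rule is only one possible choice.

```matlab
% Illustrative only: size each worker's dataset in proportion to an
% assumed capacity weight.
totalSamples = 3300;               % hypothetical total number of samples
speedWeights = [10 10 10 3];       % hypothetical: worker 4 is slower

% Proportional share for each worker, rounded down to whole samples.
numSamples = floor(totalSamples * speedWeights / sum(speedWeights));

% Give any samples lost to rounding to the highest-weighted worker.
[~, fastest] = max(speedWeights);
numSamples(fastest) = numSamples(fastest) + (totalSamples - sum(numSamples));
```

The resulting numSamples vector can then drive the loop that builds each Composite element, in place of the fixed 1000 samples used earlier.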
It is not required that each worker have data. If element i of a Composite value is undefined, worker i will not be used in the computation.
Single GPU Computing
The number of cores, size of memory, and speed efficiencies of GPU cards are growing rapidly with each new generation. Where video games have long benefited from improved GPU performance, these cards are now flexible enough to perform general numerical computing tasks like training neural networks.
For the latest GPU requirements, see the web page for Parallel Computing Toolbox; or query MATLAB to determine whether your PC has a supported GPU. This function returns the number of GPUs in your system:
count = gpuDeviceCount
count = 1
If the result is one or more, you can query each GPU by index for its characteristics. This includes its name, number of multiprocessors, SIMDWidth of each multiprocessor, and total memory.
gpu1 = gpuDevice(1)
gpu1 =
  CUDADevice with properties:
    Name: 'NVIDIA RTX A5000'
    Index: 1
    ComputeCapability: '8.6'
    SupportsDouble: 1
    DriverVersion: 11.6000
    ToolkitVersion: 11.2000
    MaxThreadsPerBlock: 1024
    MaxShmemPerBlock: 49152 (49.15 KB)
    MaxThreadBlockSize: [1024 1024 64]
    MaxGridSize: [2.1475e+09 65535 65535]
    SIMDWidth: 32
    TotalMemory: 25553076224 (25.55 GB)
    AvailableMemory: 25153765376 (25.15 GB)
    MultiprocessorCount: 64
    ClockRateKHz: 1695000
    ComputeMode: 'Default'
    GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
    CanMapHostMemory: 1
    DeviceSupported: 1
    DeviceAvailable: 1
    DeviceSelected: 1
The simplest way to take advantage of the GPU is to call train and sim with the parameter argument 'useGPU' set to 'yes' ('no' is the default).
net2 = train(net1,x,t,'useGPU','yes')
y = net2(x,'useGPU','yes')
If net1 has the default training function trainlm, you see a warning that GPU calculations do not support Jacobian training, only gradient training. So the training function is automatically changed to the gradient training function trainscg. To avoid the warning, you can specify the function before training:
net1.trainFcn = 'trainscg';
To verify that the training and simulation occur on the GPU device, request that the computer resources be shown:
net2 = train(net1,x,t,'useGPU','yes','showResources','yes')
y = net2(x,'useGPU','yes','showResources','yes')
Each of the above lines of code outputs the following resources summary:
Computing Resources: GPU device #1, GeForce GTX 470
Many MATLAB functions automatically execute on a GPU when any of the input arguments is a gpuArray. Normally you move arrays to and from the GPU with the functions gpuArray and gather. However, for neural network calculations on a GPU to be efficient, matrices need to be transposed and the columns padded so that the first element in each column aligns properly in the GPU memory. Deep Learning Toolbox provides a special function called nndata2gpu to move an array to a GPU and properly organize it:
xg = nndata2gpu(x);
tg = nndata2gpu(t);
Now you can train and simulate the network using the converted data already on the GPU, without having to specify the 'useGPU' argument. Then convert and return the resulting GPU array back to MATLAB with the complementary function gpu2nndata.
Before training with gpuArray data, the network's inputs and outputs must be manually configured with regular MATLAB matrices using the configure function:
net2 = configure(net1,x,t);  % Configure with MATLAB arrays
net2 = train(net2,xg,tg);    % Execute on GPU with NNET formatted gpuArrays
yg = net2(xg);               % Execute on GPU
y = gpu2nndata(yg);          % Transfer array to local workspace
On GPUs and other hardware where you might want to deploy your neural networks, it is often the case that the exponential function exp is not implemented with hardware, but with a software library. This can slow down neural networks that use the tansig sigmoid transfer function. An alternative function is the Elliot sigmoid function whose expression does not include a call to any higher-order functions:
$$a = \frac{n}{1 + |n|}$$
Before training, the network's tansig layers can be converted to elliotsig layers as follows:
for i=1:net.numLayers
    if strcmp(net.layers{i}.transferFcn,'tansig')
        net.layers{i}.transferFcn = 'elliotsig';
    end
end
Now training and simulation might be faster on the GPU and simpler deployment hardware.
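To get a feel for how the Elliot sigmoid compares with tansig, the following sketch evaluates both expressions directly in base MATLAB. It uses tanh as a stand-in for tansig, since the two are mathematically equivalent, and it does not call the toolbox transfer functions themselves. The variable names are illustrative.

```matlab
% Illustrative only: compare the Elliot sigmoid with tanh on a few points.
n = -5:5;

aElliot = n ./ (1 + abs(n));   % Elliot sigmoid: division and abs only, no exp
aTanh = tanh(n);               % mathematically equivalent to tansig

% Both curves are odd and bounded by -1 and 1, but the Elliot sigmoid
% approaches its limits more slowly than tanh does.
maxDiff = max(abs(aElliot - aTanh));
```

Because the two curves have different shapes, a network converted to elliotsig should be trained after the conversion, as described above, and not reuse weights trained with tansig.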
Distributed GPU Computing
Distributed and GPU computing can be combined to run calculations across multiple CPUs and/or GPUs on a single computer, or on a cluster with MATLAB Parallel Server.
The simplest way to do this is to call train and sim with both 'useParallel' and 'useGPU' set to 'yes', using the parallel pool determined by the cluster profile you use. The 'showResources' option is especially recommended in this case, to verify that the expected hardware is being employed:
net2 = train(net1,x,t,'useParallel','yes','useGPU','yes','showResources','yes')
y = net2(x,'useParallel','yes','useGPU','yes','showResources','yes')
These lines of code use all available workers in the parallel pool. One worker for each unique GPU employs that GPU, while other workers operate as CPUs. In some cases, it might be faster to use only GPUs. For instance, if a single computer has three GPUs and four workers, the three workers that are accelerated by the three GPUs might be speed limited by the fourth CPU worker. In these cases, you can specify that train and sim use only workers with unique GPUs.
net2 = train(net1,x,t,'useParallel','yes','useGPU','only','showResources','yes')
y = net2(x,'useParallel','yes','useGPU','only','showResources','yes')
As with simple distributed computing, distributed GPU computing can benefit from manually created Composite values. Defining the Composite values yourself lets you indicate which workers to use, how many samples to assign to each worker, and which workers use GPUs.
For instance, if you have four workers and only three GPUs, you can define larger datasets for the GPU workers. Here, a random dataset is created with different sample loads per Composite element:
numSamples = [1000 1000 1000 300];
xc = Composite;
tc = Composite;
for i=1:4
    xi = rand(2,numSamples(i));
    ti = xi(1,:).^2 + 3*xi(2,:);
    xc{i} = xi;
    tc{i} = ti;
end
You can now specify that train and sim use the three GPUs available:
net2 = configure(net1,xc{1},tc{1});
net2 = train(net2,xc,tc,'useGPU','yes','showResources','yes');
yc = net2(xc,'useGPU','yes','showResources','yes');
To ensure that the GPUs get used by the first three workers, manually convert each worker's Composite elements to gpuArrays. Each worker performs this transformation within a parallel executing spmd block.
spmd
    if spmdIndex <= 3
        xc = nndata2gpu(xc);
        tc = nndata2gpu(tc);
    end
end
Now the data specifies when to use GPUs, so you do not need to tell train and sim to do so.
net2 = configure(net1,xc{1},tc{1});
net2 = train(net2,xc,tc,'showResources','yes');
yc = net2(xc,'showResources','yes');
Ensure that each GPU is used by only one worker, so that the computations are most efficient. If multiple workers assign gpuArray data on the same GPU, the computation will still work but will be slower, because the GPU will operate on the multiple workers’ data sequentially.
Parallel Time Series
For time series networks, simply use cell array values for x and t, and optionally include initial input delay states xi and initial layer delay states ai, as required.
net2 = train(net1,x,t,xi,ai,'useGPU','yes')
y = net2(x,xi,ai,'useParallel','yes','useGPU','yes')
net2 = train(net1,x,t,xi,ai,'useParallel','yes')
y = net2(x,xi,ai,'useParallel','yes','useGPU','only')
net2 = train(net1,x,t,xi,ai,'useParallel','yes','useGPU','only')
y = net2(x,xi,ai,'useParallel','yes','useGPU','only')
Note that parallelism happens across samples, or in the case of time series across different series. However, if the network has only input delays, with no layer delays, the delayed inputs can be precalculated so that for the purposes of computation, the time steps become different samples and can be parallelized. This is the case for networks such as timedelaynet and open-loop versions of narxnet and narnet. If a network has layer delays, then time cannot be "flattened" for purposes of computation, and so single series data cannot be parallelized. This is the case for networks such as layrecnet and closed-loop versions of narxnet and narnet.
However, if the data consists of multiple sequences, it can be parallelized across the separate sequences.
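The idea of precalculating delayed inputs can be sketched in base MATLAB. The example below assumes one scalar series and input delays of 1 and 2 time steps, and builds a matrix in which every column is an independent sample. The variable names are illustrative, and this is not the toolbox's internal routine.

```matlab
% Illustrative only: turn one series into independent samples when the
% network has input delays but no layer delays.
x = 1:10;                 % one scalar series with 10 time steps
delays = [1 2];           % assumed input delays
maxDelay = max(delays);
numSteps = numel(x) - maxDelay;   % time steps with a full set of delayed inputs

% Row k holds the series delayed by delays(k); each column is one sample.
xDelayed = zeros(numel(delays), numSteps);
for k = 1:numel(delays)
    xDelayed(k, :) = x((maxDelay + 1 - delays(k)):(end - delays(k)));
end
```

Each column of xDelayed depends only on the original series, not on earlier network outputs, so the columns can be divided across workers like any other samples. With layer delays, each step would depend on the previous step's layer output, and no such table could be built ahead of time.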
Parallel Availability, Fallbacks, and Feedback
As mentioned previously, you can query MATLAB to discover the current parallel resources that are available.
To see what GPUs are available on the host computer:
gpuCount = gpuDeviceCount
for i=1:gpuCount
    gpuDevice(i)
end
To see how many workers are running in the current parallel pool:
poolSize = pool.NumWorkers
To see the GPUs available across a parallel pool running on a PC cluster using MATLAB Parallel Server:
spmd
    worker.index = spmdIndex;
    [~,hostName] = system('hostname');  % first output is the exit status
    worker.name = strtrim(hostName);
    worker.gpuCount = gpuDeviceCount;
    try
        worker.gpuInfo = gpuDevice;
    catch
        worker.gpuInfo = [];
    end
    worker
end
When 'useParallel' or 'useGPU' is set to 'yes', the convention is that requested resources are used if they are available. If parallel or GPU workers are unavailable, the computation is still performed without error. This process of falling back from requested resources to actual resources happens as follows:
If 'useParallel' is 'yes' but Parallel Computing Toolbox is unavailable, or a parallel pool is not open, then computation reverts to single-threaded MATLAB.
If 'useGPU' is 'yes' but the gpuDevice for the current MATLAB session is unassigned or not supported, then computation reverts to the CPU.
If 'useParallel' and 'useGPU' are 'yes', then each worker with a unique GPU uses that GPU, and other workers revert to CPU.
If 'useParallel' is 'yes' and 'useGPU' is 'only', then workers with unique GPUs are used. Other workers are not used, unless no workers have GPUs. In the case with no GPUs, all workers use CPUs.
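The fallback rules for a parallel pool can be restated as a small decision sketch. Everything here is hypothetical: workerHasUniqueGPU is an assumed logical vector describing a pool of four workers, and the code only mirrors the rules listed above. It is not how train or sim is implemented.

```matlab
% Illustrative only: mirror the fallback rules for an open parallel pool.
useGPU = 'only';                                % 'yes', 'no', or 'only'
workerHasUniqueGPU = [true true true false];    % assumed pool of 4 workers

allFalse = false(size(workerHasUniqueGPU));
switch useGPU
    case 'yes'
        % Workers with a unique GPU use it; the rest revert to CPU.
        gpuWorkers = workerHasUniqueGPU;
        cpuWorkers = ~workerHasUniqueGPU;
    case 'only'
        if any(workerHasUniqueGPU)
            % Only GPU workers take part; the others are left unused.
            gpuWorkers = workerHasUniqueGPU;
            cpuWorkers = allFalse;
        else
            % No GPUs anywhere: every worker falls back to CPU.
            gpuWorkers = allFalse;
            cpuWorkers = ~allFalse;
        end
    otherwise
        % 'no': all workers compute on CPU.
        gpuWorkers = allFalse;
        cpuWorkers = ~allFalse;
end
```

With the values shown, the first three workers compute on GPUs and the fourth worker is left out, which matches the 'only' rule.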
When unsure about what hardware is actually being employed, check gpuDeviceCount, gpuDevice, and pool.NumWorkers to ensure the desired hardware is available, and call train and sim with 'showResources' set to 'yes' to verify what resources were actually used.