This example shows how to train deep networks in parallel using Experiment Manager. Running an experiment in parallel allows you to try different training configurations at the same time. You can also use MATLAB® while the training is in progress. Parallel execution requires Parallel Computing Toolbox™.
In this example, you train two networks to classify images of digits from 0 to 9. The experiment trains the networks with augmented image data produced by applying random translations and horizontal reflections to the Digits data set. Data augmentation helps prevent the networks from overfitting and memorizing the exact details of the training images. When you run the experiment, Experiment Manager starts the parallel pool and executes multiple simultaneous trials, depending on the number of parallel workers available. Each trial uses a different combination of network and training options. While you monitor the training progress, you can stop trials that appear to be underperforming.
As an alternative, you can use parfeval to train multiple networks in parallel programmatically. For more information, see Train Deep Learning Networks in Parallel.
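The programmatic route can be sketched as follows. This is a minimal illustration of the scheduling pattern only, not the code from the linked example: trainOneConfig is a hypothetical helper that returns a placeholder score instead of training a network, and the configuration labels are borrowed from the TrainingOptions values used later in this example. The sketch assumes Parallel Computing Toolbox is installed.

```matlab
% Sketch: schedule one asynchronous task per configuration with parfeval.
configs = ["fast" "slow"];                 % labels taken from this example
pool = gcp;                                % current pool, or start the default one

% Preallocate the array of futures.
futures(1:numel(configs)) = parallel.FevalFuture;
for k = 1:numel(configs)
    % Request 1 output from the hypothetical helper trainOneConfig.
    futures(k) = parfeval(pool, @trainOneConfig, 1, configs(k));
end

% Collect results in completion order. fetchNext blocks until the next
% unread future finishes and returns its index in the futures array.
scores = zeros(1, numel(configs));
for k = 1:numel(configs)
    [idx, score] = fetchNext(futures);
    scores(idx) = score;
    fprintf("Finished configuration %s\n", configs(idx));
end

function score = trainOneConfig(name)
% Hypothetical placeholder: replace the body with code that trains a
% network for the given configuration and returns a validation metric.
score = strlength(name);
end
```

Because each call to parfeval returns immediately, the client session is free between scheduling the tasks and fetching their results. Calling cancel on a future stops that task.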
First, open the example. Experiment Manager loads a project with a preconfigured experiment that you can inspect and run. To open the experiment, in the Experiment Browser pane, double-click the name of the experiment.
Built-in training experiments consist of a description, a table of hyperparameters, a setup function, and a collection of metric functions to evaluate the results of the experiment. For more information, see Configure Built-In Training Experiment.
The Description field contains a textual description of the experiment. For this example, the description is:
Classification using data image augmentation to apply random translations and horizontal reflections to the Digits data set.
The Hyperparameters section specifies the strategy (Exhaustive Sweep) and hyperparameter values to use for the experiment. When you run the experiment, Experiment Manager trains the network using every combination of hyperparameter values specified in the hyperparameter table. This example uses two hyperparameters, Network and TrainingOptions.
Network specifies the network to train. The possible values for this hyperparameter are:
TrainingOptions indicates the set of options used to train the network. The possible values for this hyperparameter are:
"fast" — Experiment Manager trains the network for a maximum of 10 epochs with an initial learning rate of 0.1.
"slow" — Experiment Manager trains the network for a maximum of 15 epochs with an initial learning rate of 0.001.
The Setup Function configures the training data, network architecture, and training options for the experiment. To inspect the setup function, under Setup Function, click Edit. The setup function opens in MATLAB Editor.
The input to the setup function is a structure with fields from the hyperparameter table. The setup function returns three outputs that you use to train a network for image classification problems. The setup function has three sections.
Load Training Data loads images from the Digits data set and splits this data set into training and validation sets. For the training data, this example creates an augmented image datastore object by applying random translations and horizontal reflections. The validation data is stored in an imageDatastore object with no augmentation. For more information on this data set, see Image Data Sets.
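The data-loading step described above might look like the following sketch. It is not the shipped setup function. It assumes Deep Learning Toolbox is installed, that the augmented training data is held in an augmentedImageDatastore, that 90% of each class is used for training, and that translations range over 5 pixels in each direction. The split ratio and the translation range are illustrative guesses, not values stated in this example.

```matlab
% Sketch of the data-loading step (requires Deep Learning Toolbox).
digitsPath = fullfile(toolboxdir('nnet'), 'nndemos', 'nndatasets', 'DigitDataset');
imds = imageDatastore(digitsPath, ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');

% Assumption: hold out 10% of each class for validation.
[imdsTrain, imdsValidation] = splitEachLabel(imds, 0.9, 'randomized');

% Assumption: translate by up to 5 pixels in each direction.
augmenter = imageDataAugmenter( ...
    'RandXTranslation', [-5 5], ...
    'RandYTranslation', [-5 5], ...
    'RandXReflection', true);          % random left-right (horizontal) flip

% The Digits images are 28-by-28 grayscale. Augmentation is applied to
% the training set only; imdsValidation is used without augmentation.
augimdsTrain = augmentedImageDatastore([28 28 1], imdsTrain, ...
    'DataAugmentation', augmenter);
```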
Define Network Architecture defines the architecture for a convolutional neural network for deep learning classification. This example trains the network you specify for the hyperparameter Network.
Specify Training Options defines a training options object for the experiment. In this example, the value you specify for the hyperparameter TrainingOptions determines the training options.
Note that Experiment Manager does not support parallel execution when you set the training option 'ExecutionEnvironment' to 'parallel' or enable the training option 'DispatchInBackground'. For more information, see Configure Built-In Training Experiment.
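One way to express the mapping from the TrainingOptions hyperparameter to a set of training options is sketched below. selectTrainingOptions is a hypothetical helper name, and the "sgdm" solver is an assumption because this text does not name the solver. The epoch counts and learning rates are the ones listed in the hyperparameter description.

```matlab
function options = selectTrainingOptions(name)
% Hypothetical helper: map a TrainingOptions hyperparameter value to a
% training options object (requires Deep Learning Toolbox).
switch name
    case "fast"
        maxEpochs = 10;
        learnRate = 0.1;
    case "slow"
        maxEpochs = 15;
        learnRate = 0.001;
    otherwise
        error("Unknown TrainingOptions value: %s", name);
end

% Assumption: the "sgdm" solver. ExecutionEnvironment is left at its
% default and DispatchInBackground is not enabled, so the options do not
% conflict with running the experiment in parallel.
options = trainingOptions("sgdm", ...
    "MaxEpochs", maxEpochs, ...
    "InitialLearnRate", learnRate, ...
    "Verbose", false);
end
```

Inside a setup function, a call such as selectTrainingOptions(params.TrainingOptions) would pick the options for the current trial, assuming the input structure is named params.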
The Metrics section specifies optional functions that evaluate the results of the experiment. This example does not include any custom metric functions.
If you have multiple GPUs, parallel execution typically increases the speed of your experiment. For best results, before you run your experiment, start a parallel pool with as many workers as GPUs. You can check the number of available GPUs by using the gpuDeviceCount function:
    numGPUs = gpuDeviceCount("available");
    parpool(numGPUs);
However, if you have a single GPU, all workers share that GPU, so you do not obtain the training speed-up and you increase the chances of the GPU running out of memory. To continue using MATLAB while you train a deep network on a single GPU, start a parallel pool with a single worker before you run your experiment in parallel.
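The two cases above (several GPUs versus a single GPU) can be combined into one sketch. Treating a machine with no available GPU the same as a single-GPU machine, and restarting an existing pool of the wrong size, are assumptions made for illustration. Both go beyond what this example states.

```matlab
% Sketch: size the parallel pool from the number of available GPUs.
numGPUs = gpuDeviceCount("available");

% One worker per GPU. With at most one GPU, use a single worker so that
% workers do not compete for the same device (assumption: also applied
% when no GPU is available).
numWorkers = max(1, numGPUs);

pool = gcp("nocreate");                % empty if no pool is running
if isempty(pool)
    pool = parpool(numWorkers);
elseif pool.NumWorkers ~= numWorkers
    % Assumption: replacing an existing pool of a different size is acceptable.
    delete(pool);
    pool = parpool(numWorkers);
end
```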
Using a GPU for deep learning requires Parallel Computing Toolbox and a supported GPU device. For more information, see GPU Support by Release (Parallel Computing Toolbox).
To run your experiment, on the Experiment Manager toolstrip, click Use Parallel and then Run. If there is no current parallel pool, Experiment Manager starts one using the default cluster profile. Experiment Manager then executes multiple simultaneous trials, depending on the number of parallel workers available. Each trial uses a different combination of hyperparameter values.
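To see which combinations an exhaustive sweep covers, the hyperparameter table can be expanded by hand. The sketch below is illustrative only: "7 layers" is a Network value mentioned in this example, while "otherNetwork" is a hypothetical placeholder for the second network. The row order produced here is not necessarily the order in which Experiment Manager runs the trials.

```matlab
% Sketch: list every combination an exhaustive sweep would cover.
networks = ["otherNetwork" "7 layers"];    % "otherNetwork" is a placeholder
optionSets = ["fast" "slow"];

% ndgrid pairs every network index with every option-set index.
[netIdx, optIdx] = ndgrid(1:numel(networks), 1:numel(optionSets));

trials = table( ...
    reshape(networks(netIdx), [], 1), ...
    reshape(optionSets(optIdx), [], 1), ...
    'VariableNames', {'Network', 'TrainingOptions'});

numTrials = height(trials);    % 2 networks x 2 option sets
disp(trials)
```

With a pool of fewer than numTrials workers, only that many trials run at once and the rest wait in the queue.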
A table of results displays the accuracy and loss for each trial.
While the experiment is running, you can track its progress by displaying the training plot for each trial. Select a trial and click Training Plot.
Experiment Manager runs as many simultaneous trials as there are workers in your parallel pool. All other trials in your experiment are queued for later evaluation. While your experiment is running, you can stop a trial that is running or cancel a queued trial. In the Progress column of the results table, click the red square icon for each trial you want to stop or cancel.
For example, the validation loss for trials that use the "7 layers" network becomes undefined after only a few iterations.
Continuing the training for those trials does not produce any useful results, so you can stop those trials before the training is complete. Experiment Manager continues the training for the remaining trials.
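If you track a loss history programmatically, the same check can be sketched as below. firstUndefinedLoss is a hypothetical helper, and it assumes that an undefined loss is recorded as NaN.

```matlab
function iter = firstUndefinedLoss(lossHistory)
% Hypothetical helper: return the first iteration at which the recorded
% loss is NaN (assumed here to be how an undefined loss is stored), or
% an empty array if the loss stays defined throughout.
iter = find(isnan(lossHistory), 1);
end
```

For instance, firstUndefinedLoss([2.3 1.9 NaN NaN]) returns 3, and a history with no NaN values returns an empty array.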
To record your reason for stopping each trial, add an annotation.
In the results table, right-click the Validation Loss cell for the first stopped trial.
Select Add Annotation.
In the Annotations pane, enter your observations in the text box.
Repeat the previous steps for the second stopped trial.
When the training is complete, you can rerun a trial that you stopped or canceled. In the Progress column of the results table, click the green triangle icon for the trial.
Alternatively, to rerun all the trials that you canceled, in the Experiment Manager toolstrip, click Restart All Canceled.
In the Experiment Browser pane, right-click the name of the project and select Close Project. Experiment Manager closes all of the experiments and results contained in the project.
gpuDeviceCount (Parallel Computing Toolbox) | parfeval (Parallel Computing Toolbox) | parfor (Parallel Computing Toolbox) | parpool (Parallel Computing Toolbox)