trainYOLOv2ObjectDetector

Train YOLO v2 object detector

Description

trainedDetector = trainYOLOv2ObjectDetector(trainingData,detector,options) returns an object detector trained using the you only look once version 2 (YOLO v2) network specified by detector. The options argument specifies training parameters for the detection network.

You can use this syntax for training an untrained detector or for fine-tuning a pretrained detector.

trainedDetector = trainYOLOv2ObjectDetector(trainingData,checkpoint,options) resumes training from the saved detector checkpoint.

You can use this syntax to:

  • Add more training data and continue the training.

  • Improve training accuracy by increasing the maximum number of iterations.

[trainedDetector,info] = trainYOLOv2ObjectDetector(___) also returns information on the training progress, such as the training accuracy and learning rate for each iteration.

___ = trainYOLOv2ObjectDetector(___,Name=Value) uses additional options specified by one or more name-value arguments and any of the previous inputs. For example, ExperimentMonitor=[] specifies not to track metrics with the Experiment Manager (Deep Learning Toolbox) app.

Examples

Load the training data for vehicle detection into the workspace.

data = load("vehicleTrainingData.mat");
trainingData = data.vehicleTrainingData;

Specify the directory in which the training samples are stored, and add the full path to the file names in the training data.

dataDir = fullfile(toolboxdir("vision"),"visiondata");
trainingData.imageFilename = fullfile(dataDir,trainingData.imageFilename);

Randomly shuffle data for training.

rng(0)
shuffledIdx = randperm(height(trainingData));
trainingData = trainingData(shuffledIdx,:);

Create an imageDatastore using the files from the table.

imds = imageDatastore(trainingData.imageFilename);

Create a boxLabelDatastore using the label columns from the table.

blds = boxLabelDatastore(trainingData(:,2:end));

Combine the datastores.

ds = combine(imds,blds);

Specify the class names using the label columns from the table.

classes = trainingData.Properties.VariableNames(2:end);

Specify anchor boxes.

anchorBoxes = [8 8; 32 48; 40 24; 72 48];

Load a preinitialized YOLO v2 object detection network.

load("yolov2VehicleDetectorNet.mat","net");

Create the YOLO v2 object detection network.

detector = yolov2ObjectDetector(net,classes,anchorBoxes)
detector = 
  yolov2ObjectDetector with properties:

                  Network: [1×1 dlnetwork]
                InputSize: [128 128 3]
        TrainingImageSize: [128 128]
              AnchorBoxes: [4×2 double]
               ClassNames: vehicle
    ReorganizeLayerSource: ''
              LossFactors: [5 1 1 1]
                ModelName: ''

Configure the network training options.

options = trainingOptions("sgdm", ...
    InitialLearnRate=0.001, ...
    Verbose=true, ...
    MiniBatchSize=16, ...
    MaxEpochs=30, ...
    Shuffle="never", ...
    VerboseFrequency=30, ...
    CheckpointPath=tempdir);

Train the YOLO v2 network.

[trainedDetector,info] = trainYOLOv2ObjectDetector(ds,detector,options);
*************************************************************************
Training a YOLO v2 Object Detector for the following object classes:

* vehicle

Training on single CPU.
|========================================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |     RMSE     |     Loss     |      Rate       |
|========================================================================================|
|       1 |           1 |       00:00:00 |         7.13 |         50.8 |          0.0010 |
|       2 |          30 |       00:00:13 |         1.18 |          1.4 |          0.0010 |
|       4 |          60 |       00:00:25 |         0.98 |          1.0 |          0.0010 |
|       5 |          90 |       00:00:35 |         0.59 |          0.3 |          0.0010 |
|       7 |         120 |       00:00:46 |         0.53 |          0.3 |          0.0010 |
|       9 |         150 |       00:00:57 |         0.63 |          0.4 |          0.0010 |
|      10 |         180 |       00:01:07 |         0.45 |          0.2 |          0.0010 |
|      12 |         210 |       00:01:18 |         0.39 |          0.2 |          0.0010 |
|      14 |         240 |       00:01:28 |         0.60 |          0.4 |          0.0010 |
|      15 |         270 |       00:01:39 |         0.42 |          0.2 |          0.0010 |
|      17 |         300 |       00:01:49 |         0.35 |          0.1 |          0.0010 |
|      19 |         330 |       00:02:00 |         0.47 |          0.2 |          0.0010 |
|      20 |         360 |       00:02:10 |         0.36 |          0.1 |          0.0010 |
|      22 |         390 |       00:02:21 |         0.34 |          0.1 |          0.0010 |
|      24 |         420 |       00:02:32 |         0.44 |          0.2 |          0.0010 |
|      25 |         450 |       00:02:42 |         0.54 |          0.3 |          0.0010 |
|      27 |         480 |       00:02:52 |         0.39 |          0.2 |          0.0010 |
|      29 |         510 |       00:03:03 |         0.44 |          0.2 |          0.0010 |
|      30 |         540 |       00:03:13 |         0.37 |          0.1 |          0.0010 |
|========================================================================================|
Training finished: Max epochs completed.
Detector training complete.
*************************************************************************

Verify the training accuracy by inspecting the training loss for each iteration.

figure
plot(info.TrainingLoss)
grid on
xlabel("Number of Iterations")
ylabel("Training Loss for Each Iteration")

The figure shows the training loss plotted against the iteration number.

Read a test image into the workspace.

img = imread("detectcars.png");

Run the trained YOLO v2 object detector on the test image for vehicle detection.

[bboxes,scores] = detect(trainedDetector,img);

Display the detection results.

if(~isempty(bboxes))
    img = insertObjectAnnotation(img,"rectangle",bboxes,scores);
end
figure
imshow(img)

The figure shows the test image annotated with the detected vehicle bounding boxes and their detection scores.

Input Arguments

Labeled ground truth images, specified as a datastore or a table.

  • If you use a datastore, your data must be set up so that calling the datastore with the read and readall functions returns a cell array or table with two or three columns. When the output contains two columns, the first column must contain bounding boxes, and the second column must contain labels, {boxes,labels}. When the output contains three columns, the second column must contain the bounding boxes, and the third column must contain the labels. In this case, the first column can contain any type of data. For example, the first column can contain images or point cloud data.

    data: The first column must be images.

    boxes: M-by-4 matrices of bounding boxes of the form [x, y, width, height], where [x, y] represents the top-left coordinates of the bounding box.

    labels: The third column must be a cell array that contains M-by-1 categorical vectors containing object class names. All categorical data returned by the datastore must contain the same categories.

    For more information, see Datastores for Deep Learning (Deep Learning Toolbox).

  • If you use a table, the table must have two or more columns. The first column of the table must contain image file names with paths. The images must be grayscale or truecolor (RGB) and they can be in any format supported by imread. Each of the remaining columns must be a cell vector that contains M-by-4 matrices that represent a single object class, such as vehicle, flower, or stop sign. The columns contain 4-element double arrays of M bounding boxes in the format [x,y,width,height]. The format specifies the upper-left corner location and size of the bounding box in the corresponding image. To create a ground truth table, you can use the Image Labeler app or Video Labeler app. To create a table of training data from the generated ground truth, use the objectDetectorTrainingData function.
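
As an illustration of the table format, this is a minimal sketch of a two-image training table for a single vehicle class; the file names and box values are purely illustrative.

% Illustrative table-format training data: one column of image file names followed by
% one column of M-by-4 [x y width height] boxes per object class.
trainingData = table( ...
    ["vehicles/img001.png"; "vehicles/img002.png"], ...
    {[50 60 120 80]; [30 40 100 70; 200 90 90 60]}, ...
    VariableNames=["imageFilename","vehicle"]);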

Note

When the training data is specified using a table, the trainYOLOv2ObjectDetector function checks these conditions:

  • The bounding box values must be integers. Otherwise, the function automatically rounds each noninteger value to its nearest integer.

  • The bounding boxes must not be empty and must lie within the image region. While training the network, the function ignores empty bounding boxes and bounding boxes that lie partially or fully outside the image region.
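
If you supply a custom datastore, you can confirm that it returns data in the required format before training. This is a minimal sketch, assuming ds is a combined datastore of the kind described above; the variable names are illustrative.

% Minimal format check for a custom training datastore.
% Each read of ds must return a cell array of the form {data, boxes, labels}.
sample = preview(ds);        % returns the first observation without advancing ds
I      = sample{1};          % image (or other data, such as a point cloud)
boxes  = sample{2};          % M-by-4 matrix of [x y width height] bounding boxes
labels = sample{3};          % M-by-1 categorical vector of object class names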

Pretrained or untrained YOLO v2 object detector, specified as a yolov2ObjectDetector object. If detector is a pretrained detector, then you can continue training the detector with additional training data or perform more training iterations to improve detector accuracy.

Training options, specified as a TrainingOptionsSGDM, TrainingOptionsRMSProp, or TrainingOptionsADAM object returned by the trainingOptions (Deep Learning Toolbox) function. To specify the solver name and other options for network training, use the trainingOptions (Deep Learning Toolbox) function.

Note

The trainYOLOv2ObjectDetector function does not support these training options:

  • The trainingOptions Shuffle values "once" and "every-epoch" are not supported when you use a datastore input.

  • Datastore inputs are not supported when you set the DispatchInBackground training option to true.

Saved detector checkpoint, specified as a yolov2ObjectDetector object. To periodically save a detector checkpoint during training, specify the CheckpointPath training option. To control how frequently checkpoints are saved, see the CheckpointFrequency and CheckpointFrequencyUnit training options.

To load a checkpoint for a previously trained detector, load the MAT file from the checkpoint path. For example, if the CheckpointPath property of the object specified by options is "/checkpath", you can load a checkpoint MAT file by using this code.

data = load("/checkpath/yolov2_checkpoint__216__2018_11_16__13_34_30.mat");
checkpoint = data.detector;

The name of the MAT file includes the iteration number and timestamp of when the detector checkpoint was saved. The detector is saved in the detector variable of the file. Pass this file back into the trainYOLOv2ObjectDetector function:

yoloDetector = trainYOLOv2ObjectDetector(trainingData,checkpoint,options);

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: ExperimentMonitor=[] specifies not to monitor the detector training.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: "ExperimentManager","none"

Detector training experiment monitoring, specified as an experiments.Monitor (Deep Learning Toolbox) object for use with the Experiment Manager (Deep Learning Toolbox) app. You can use this object to track the progress of training, update information fields in the training results table, record values of the metrics used by the training, and to produce training plots. For an example using this app, see Train Object Detectors in Experiment Manager.

Information monitored during training:

  • Training loss at each iteration.

  • Training accuracy at each iteration.

  • Training root mean square error (RMSE) for the box regression layer.

  • Learning rate at each iteration.

Validation information when the training options input contains validation data:

  • Validation loss at each iteration.

  • Validation accuracy at each iteration.

  • Validation RMSE at each iteration.
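
For illustration only, this sketch shows how a custom training function in Experiment Manager might forward its experiments.Monitor object to trainYOLOv2ObjectDetector. The function name and the setupTraining helper are hypothetical.

% Hypothetical Experiment Manager training function (sketch, not a definitive API).
function trainedDetector = trainDetectorExperiment(params,monitor)
    % setupTraining is a hypothetical helper that builds the datastore,
    % detector, and training options from the experiment parameters.
    [ds,detector,options] = setupTraining(params);
    trainedDetector = trainYOLOv2ObjectDetector(ds,detector,options, ...
        ExperimentMonitor=monitor);
end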

Output Arguments

Trained YOLO v2 object detector, returned as a yolov2ObjectDetector object.

Training progress information, returned as a structure array with seven fields. Each field corresponds to a training or validation metric.

  • TrainingLoss — Training loss at each iteration is the mean squared error (MSE) calculated as the sum of localization error, confidence loss, and classification loss. For more information about the training loss function, see YOLO v2 Training Loss.

  • TrainingRMSE — Training root mean squared error (RMSE) is the RMSE calculated from the training loss at each iteration.

  • BaseLearnRate — Learning rate at each iteration.

  • ValidationLoss — Validation loss at each iteration.

  • ValidationRMSE — Validation RMSE at each iteration.

  • FinalValidationLoss — Final validation loss at the end of training.

  • FinalValidationRMSE — Final validation RMSE at the end of training.

Each field is a numeric vector with one element per training iteration. Values that are not calculated at a specific iteration are returned as NaN. The structure contains the ValidationLoss, ValidationRMSE, FinalValidationLoss, and FinalValidationRMSE fields only when options specifies validation data.
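
As a small usage sketch, assuming info was returned as in the earlier example, you can inspect the recorded metrics directly:

% Inspect recorded training metrics (sketch; field availability depends on options).
finalTrainingLoss = info.TrainingLoss(end);
plot(info.BaseLearnRate)                       % learning rate at each iteration
if isfield(info,"FinalValidationRMSE")
    disp(info.FinalValidationRMSE)             % present only when validation data is specified
end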

More About

Data Preprocessing

By default, the trainYOLOv2ObjectDetector function preprocesses the training images by:

  • Resizing the input images to match the input size of the network.

  • Normalizing the pixel values of the input images to lie in the range [0, 1].

When you specify the training data by using a table, the trainYOLOv2ObjectDetector function also augments the input dataset by:

  • Reflecting the training data horizontally. The probability for horizontally flipping each image in the training data is 0.5.

  • Uniformly scaling (resizing) the training data by a scale factor that is randomly picked from a continuous uniform distribution in the range [1, 1.1].

  • Random color jittering for brightness, hue, saturation, and contrast.

When you specify the training data by using a datastore, the trainYOLOv2ObjectDetector function does not perform data augmentation. Instead, you can augment the training data in the datastore by using the transform function, and then train the network with the augmented training data. For more information on how to apply augmentation while using datastores, see Preprocess Images for Deep Learning (Deep Learning Toolbox).
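
The following is a minimal augmentation sketch, assuming the Image Processing Toolbox functions randomAffine2d, affineOutputView, and imwarp and the Computer Vision Toolbox function bboxwarp are available; adapt it to your own augmentation pipeline.

% Apply a simple random flip-and-scale augmentation to a combined datastore (sketch).
augmentedDs = transform(ds,@augmentData);

function out = augmentData(in)
% in is a 1-by-3 cell array of the form {image, boxes, labels}.
I = in{1}; boxes = in{2}; labels = in{3};
tform = randomAffine2d(XReflection=true,Scale=[1 1.1]);   % random horizontal flip and scale
rout  = affineOutputView(size(I),tform,BoundsStyle="CenterOutput");
I     = imwarp(I,tform,OutputView=rout);
[boxes,valid] = bboxwarp(boxes,tform,rout,OverlapThreshold=0.25);
out = {I,boxes,labels(valid)};
end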

YOLO v2 Training Loss

During training, the trainYOLOv2ObjectDetector function predicts refined bounding box locations by optimizing the mean squared error (MSE) loss between predicted bounding boxes and the ground truth. The loss function is defined as

$$
\begin{aligned}
& K_1 \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
+\; & K_1 \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
+\; & K_2 \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left(C_i - \hat{C}_i\right)^2 \\
+\; & K_3 \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
+\; & K_4 \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$

where:

  • S is the size of the grid along each dimension; the image is divided into S^2 grid cells.

  • B is the number of bounding boxes in each grid cell.

  • 1_ij^obj is 1 if the jth bounding box in grid cell i is responsible for detecting the object. Otherwise it is set to 0. A grid cell i is responsible for detecting the object if the overlap between the ground truth and a bounding box in that grid cell is greater than or equal to 0.6.

  • 1_ij^noobj is 1 if the jth bounding box in grid cell i does not contain any object. Otherwise it is set to 0.

  • 1_i^obj is 1 if an object is detected in grid cell i. Otherwise it is set to 0.

  • K1, K2, K3, and K4 are the weights. To adjust the weights, modify the LossFactors property of the detector.

The loss function can be split into three parts:

  • Localization loss

    The first and second terms in the loss function comprise the localization loss, which measures the error between the predicted bounding box and the ground truth. The parameters for computing the localization loss are the position and size of the predicted bounding box and of the ground truth, defined as follows.

    • (x_i, y_i) is the center of the jth bounding box relative to grid cell i.

    • (x̂_i, ŷ_i) is the center of the ground truth relative to grid cell i.

    • w_i and h_i are the width and the height, respectively, of the jth bounding box in grid cell i. The size of the predicted bounding box is specified relative to the input image size.

    • ŵ_i and ĥ_i are the width and the height, respectively, of the ground truth in grid cell i.

    • K1 is the weight for localization loss. Increase this value to increase the weightage for bounding box prediction errors.

  • Confidence loss

    The third and fourth terms in the loss function comprise the confidence loss. The third term measures the objectness (confidence score) error when an object is detected in the jth bounding box of grid cell i. The fourth term measures the objectness error when no object is detected in the jth bounding box of grid cell i. The parameters for computing the confidence loss are defined as follows.

    • Ci is the confidence score of the jth bounding box in grid cell i.

    • Ĉi is the confidence score of the ground truth in grid cell i.

    • K2 is the weight for objectness error, when an object is detected in the predicted bounding box. You can adjust the value of K2 to weigh confidence scores from grid cells that contain objects.

    • K3 is the weight for objectness error, when an object is not detected in the predicted bounding box. You can adjust the value of K3 to weigh confidence scores from grid cells that do not contain objects.

    The confidence loss can cause the training to diverge when the number of grid cells that do not contain objects is more than the number of grid cells that contain objects. To remedy this, increase the value for K2 and decrease the value for K3.

  • Classification loss

    The fifth term in the loss function comprises the classification loss. When an object is detected in the predicted bounding box in grid cell i, the classification loss measures the squared error between the predicted and the actual conditional class probabilities for each class in grid cell i. The parameters for computing the classification loss are defined as follows.

    • p_i(c) is the estimated conditional class probability for object class c in grid cell i.

    • p̂_i(c) is the actual conditional class probability for object class c in grid cell i.

    • K4 is the weight for classification error when an object is detected in the grid cell. Increase this value to increase the weightage for classification loss.
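
The weights K1, K2, K3, and K4 correspond to the four elements of the LossFactors property of the detector, shown as [5 1 1 1] in the object display earlier on this page. A minimal sketch, assuming the yolov2ObjectDetector function accepts a LossFactors name-value argument and that the ordering is [K1 K2 K3 K4]; the values are illustrative.

% Rebalance the loss terms when creating the detector (sketch).
% Assumed ordering [K1 K2 K3 K4]: localization, object confidence,
% no-object confidence, and classification weights.
detector = yolov2ObjectDetector(net,classes,anchorBoxes, ...
    LossFactors=[5 2 0.5 1]);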

Tips

  • To generate the ground truth, use the Image Labeler or Video Labeler app. To create a table of training data from the generated ground truth, use the objectDetectorTrainingData function.

  • To improve prediction accuracy:

    • Increase the number of images used to train the network. You can expand the training dataset through data augmentation. For information on how to apply data augmentation for preprocessing, see Preprocess Images for Deep Learning (Deep Learning Toolbox).

    • Perform multiscale training by specifying an input detector whose TrainingImageSize property is a matrix with two or more rows. For each training epoch, the trainYOLOv2ObjectDetector function randomly resizes the input training images to one of the specified training image sizes.

    • Choose anchor boxes appropriate to the dataset for training the network. You can use the estimateAnchorBoxes function to compute anchor boxes directly from the training data.
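
The last two tips can be combined in a short sketch. It assumes blds is the boxLabelDatastore from the earlier example and that the yolov2ObjectDetector function accepts a TrainingImageSize name-value argument for multiscale training; the sizes are illustrative.

% Estimate anchor boxes from the training labels and enable multiscale training (sketch).
numAnchors = 4;
[anchorBoxes,meanIoU] = estimateAnchorBoxes(blds,numAnchors);
detector = yolov2ObjectDetector(net,classes,anchorBoxes, ...
    TrainingImageSize=[128 128; 160 160; 192 192]);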

References

[1] Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. "You Only Look Once: Unified, Real-Time Object Detection." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779–788. Las Vegas, NV, 2016.

[2] Redmon, Joseph, and Ali Farhadi. "YOLO9000: Better, Faster, Stronger." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6517–6525. Honolulu, HI, 2017.

Version History

Introduced in R2019a
