Classify Videos Using Deep Learning

This example shows how to create a network for video classification by combining a pretrained image classification model and an LSTM network.

To create a deep learning network for video classification:

  1. Convert videos to sequences of feature vectors using a pretrained convolutional neural network, such as GoogLeNet, to extract features from each frame.

  2. Train an LSTM network on the sequences to predict the video labels.

  3. Assemble a network that classifies videos directly by combining layers from both networks.

The following diagram illustrates the network architecture.

  • To input image sequences to the network, use a sequence input layer.

  • To extract features, that is, to apply the convolutional operations to each frame of the videos independently, use convolutional layers.

  • To classify the resulting vector sequences, include the LSTM layers followed by the output layers.

Flow diagram of the network architecture, showing the sequence input, the convolutional layers, the LSTM layers, and the output layers.

Load Data

Download the HMDB51 data set from HMDB: a large human motion database and extract the RAR file into a folder named "hmdb51_org". The data set contains about 2 GB of video data for 7000 clips over 51 classes, such as "drink", "run", and "shake_hands".
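
Before continuing, you can optionally check that the extracted folder exists. This is a minimal check, not part of the original workflow; adjust the folder name if you extracted the files elsewhere.

if ~isfolder("hmdb51_org")
    error("Extract the HMDB51 RAR files into the ""hmdb51_org"" folder before running this example.")
end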

After extracting the RAR files, use the supporting function hmdb51Files to get the file names and the labels of the videos.

dataFolder = "hmdb51_org";
[files,labels] = hmdb51Files(dataFolder);
classNames = categories(labels);
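
Optionally, to check how many videos belong to each of the 51 classes, view a summary of the categorical labels. This step is not required for the rest of the example.

summary(labels)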

Read the first video using the readVideo helper function, defined at the end of this example, and view the size of the video. The video is an H-by-W-by-C-by-S array, where H, W, C, and S are the height, width, number of channels, and number of frames of the video, respectively.

idx = 1;
filename = files(idx);
video = readVideo(filename);
[height,width,numChannels,numFrames] = size(video);

View the corresponding label.

labels(idx)
ans = categorical
     brush_hair 

To view the video, use the implay function (requires Image Processing Toolbox™). This function expects data in the range [0,1], so you must first divide the data by 255. Alternatively, you can loop over the individual frames and use the imshow function.
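
For example, assuming Image Processing Toolbox is installed, the implay call is a one-liner.

implay(video/255)

The following loop shows the imshow alternative.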

figure
for i = 1:numFrames
    frame = video(:,:,:,i);
    imshow(frame/255);
    drawnow
end

Load Pretrained Convolutional Network

To convert frames of videos to feature vectors, use the activations of a pretrained network.

Load a pretrained GoogLeNet model using the imagePretrainedNetwork function. This function requires the Deep Learning Toolbox™ Model for GoogLeNet Network support package. If this support package is not installed, then the function provides a download link.

netCNN = imagePretrainedNetwork("googlenet");
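
To confirm the input size and normalization that the network expects, you can, for example, inspect its image input layer.

netCNN.Layers(1)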

Convert Frames to Feature Vectors

Use the convolutional network as a feature extractor: input the video frames to the network and record the activations. Convert the videos to sequences of feature vectors, where the feature vectors are the output of the last pooling layer of the GoogLeNet network ("pool5-7x7_s1").

This diagram illustrates the data flow through the network.

Flow diagram of the feature extractor network architecture, showing the image input, the convolutional layers, and the output layers. The feature vectors are taken from the network after the convolutional layers and before the output layers.

To read the video data and resize it to match the input size of the GoogLeNet network, use the readVideo and centerCrop helper functions, defined at the end of this example. This step can take a long time to run. After converting the videos to sequences, save the sequences in a MAT-file in the tempdir folder. If the MAT-file already exists, then load the sequences from the MAT-file without reconverting them.

inputSize = netCNN.Layers(1).InputSize(1:2);
layerName = "pool5-7x7_s1";

tempFile = fullfile(tempdir,"hmdb51_org.mat");

if exist(tempFile,"file")
    load(tempFile,"sequences")
else
    numFiles = numel(files);
    sequences = cell(numFiles,1);
    
    for i = 1:numFiles
        fprintf("Reading file %d of %d...\n", i, numFiles)
        
        video = readVideo(files(i));
        video = centerCrop(video,inputSize);

        sequences{i,1} = predict(netCNN,video,Outputs=layerName);
        sequences{i,1} = squeeze(sequences{i,1})';

    end
    
    save(tempFile,"sequences","-v7.3");
end

View the sizes of the first few sequences. Each sequence is an S-by-D array, where S is the number of frames of the video and D is the number of features (the output size of the pooling layer).

sequences(1:10)
ans=10×1 cell array
    {409×1024 single}
    {395×1024 single}
    {323×1024 single}
    {246×1024 single}
    {159×1024 single}
    {137×1024 single}
    {359×1024 single}
    {191×1024 single}
    {439×1024 single}
    {528×1024 single}

Prepare Training Data

Prepare the data for training by partitioning it into training and validation partitions and removing any long sequences.

Create Training and Validation Partitions

Partition the data. Assign 90% of the data to the training partition and 10% to the validation partition.

numObservations = numel(sequences);
idx = randperm(numObservations);
N = floor(0.9 * numObservations);

idxTrain = idx(1:N);
sequencesTrain = sequences(idxTrain);
labelsTrain = labels(idxTrain);

idxValidation = idx(N+1:end);
sequencesValidation = sequences(idxValidation);
labelsValidation = labels(idxValidation);

Remove Long Sequences

Sequences that are much longer than the typical sequences in the training data can introduce a lot of padding into the training process. Too much padding can negatively impact the classification accuracy.

Get the sequence lengths of the training data and visualize them in a histogram.

numObservationsTrain = numel(sequencesTrain);
sequenceLengths = zeros(1,numObservationsTrain);

for i = 1:numObservationsTrain
    sequence = sequencesTrain{i};
    sequenceLengths(i) = size(sequence,1);
end

figure
histogram(sequenceLengths)
title("Sequence Lengths")
xlabel("Sequence Length")
ylabel("Frequency")

Only a few sequences have more than 400 time steps. To improve the classification accuracy, remove the training sequences that have more than 400 time steps along with their corresponding labels.

maxLength = 400;
idx = sequenceLengths > maxLength;
sequencesTrain(idx) = [];
labelsTrain(idx) = [];

Create LSTM Network

Next, create an LSTM network that can classify the sequences of feature vectors representing the videos.

Define the LSTM network architecture. Specify the following network layers.

  • A sequence input layer with an input size corresponding to the feature dimension of the feature vectors.

  • A BiLSTM layer with 2000 hidden units, followed by a dropout layer. To output only one label for each sequence, set the OutputMode option of the BiLSTM layer to "last".

  • A fully connected layer with an output size corresponding to the number of classes, and a softmax layer.

numFeatures = size(sequencesTrain{1},2);
numClasses = numel(categories(labelsTrain));

layers = [
    sequenceInputLayer(numFeatures,Name="sequence")
    bilstmLayer(2000,OutputMode="last",Name="bilstm")
    dropoutLayer(0.5,Name="drop")
    fullyConnectedLayer(numClasses,Name="fc")
    softmaxLayer(Name="softmax")];

Specify Training Options

Specify the training options using the trainingOptions function.

  • Set the mini-batch size to 32, the initial learning rate to 0.0001, and the gradient threshold to 2 (to prevent the gradients from exploding).

  • Shuffle the data every epoch.

  • Validate the network once per epoch.

  • Stop training if the validation loss is greater than or equal to its previous lowest value for five epochs.

  • Display the training progress in a plot, including the accuracy of the network, and suppress verbose output.

miniBatchSize = 32;
numObservations = numel(sequencesTrain);
numIterationsPerEpoch = floor(numObservations / miniBatchSize);

options = trainingOptions("adam", ...
    MiniBatchSize=miniBatchSize, ...
    InitialLearnRate=1e-4, ...
    GradientThreshold=2, ...
    Shuffle="every-epoch", ...
    ValidationData={sequencesValidation,labelsValidation}, ...
    ValidationFrequency=numIterationsPerEpoch, ...
    ValidationPatience=5, ...
    Plots="training-progress", ...
    Metrics="accuracy", ...
    Verbose=false);

Train LSTM Network

Train the network using the trainnet function. This can take a long time to run. By default, the trainnet function uses a GPU if one is available. Training on a GPU requires a Parallel Computing Toolbox™ license and a supported GPU device. For information on supported devices, see GPU Computing Requirements (Parallel Computing Toolbox). Otherwise, the trainnet function uses the CPU. To select the execution environment manually, use the ExecutionEnvironment training option.
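
For example, to force CPU execution, you can add the option when creating the training options. This is a hypothetical variant that this example does not use.

optionsCPU = trainingOptions("adam", ...
    MiniBatchSize=miniBatchSize, ...
    ExecutionEnvironment="cpu");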

[netLSTM,info] = trainnet(sequencesTrain,labelsTrain,layers,"crossentropy",options);

Calculate the classification accuracy of the network on the validation set. Use the same mini-batch size as for the training options.

accuracy = testnet(netLSTM,sequencesValidation,labelsValidation,"accuracy",MiniBatchSize=miniBatchSize)
accuracy = 
65.7312
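
For a more detailed view of per-class performance, you can, for example, compute the predicted labels for the validation sequences and plot a confusion chart. This optional step is not part of the original workflow.

scoresValidation = minibatchpredict(netLSTM,sequencesValidation,MiniBatchSize=miniBatchSize);
YValidation = scores2label(scoresValidation,classNames);
figure
confusionchart(labelsValidation,YValidation)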

Assemble Video Classification Network

To create a network that classifies videos directly, assemble a network using layers from both of the created networks. Use the layers from the convolutional network to transform the videos into vector sequences and the layers from the LSTM network to classify the vector sequences.

The following diagram illustrates the network architecture.

  • To input image sequences to the network, use a sequence input layer.

  • To extract features, use convolutional layers.

  • To classify the resulting vector sequences, include the LSTM layers followed by the output layers. The output layers, sometimes called the model head, include the final fully connected layer and softmax layer.

Flow diagram of the network architecture, showing the sequence input, the convolutional layers, the LSTM layers, and the output layers.

Add Convolutional Layers

Extract the average image from the input layer of the pretrained network. The sequence input layer uses this average image to normalize the input images.

averageImage = netCNN.Layers(1).Mean;

Remove the input layer ("data") and the layers after the pooling layer used for the activations ("pool5-drop_7x7_s1", "loss3-classifier", and "prob").

layerNames = ["data" "pool5-drop_7x7_s1" "loss3-classifier" "prob"];
net = removeLayers(netCNN,layerNames);

Add Sequence Input Layer

Create a sequence input layer that accepts image sequences containing images of the same input size as the GoogLeNet network. To normalize the images using the same average image as the GoogLeNet network, set the Normalization option of the sequence input layer to "zerocenter" and the Mean option to the average image of the input layer of GoogLeNet.

inputLayer = sequenceInputLayer([inputSize 3], ...
    Normalization="zerocenter", ...
    Mean=averageImage, ...
    Name="input");

Add the sequence input layer to the network and connect its output to the input of the first convolutional layer ("conv1-7x7_s2").

net = addLayers(net,inputLayer);
net = connectLayers(net,"input/out","conv1-7x7_s2/in");

Add LSTM Layers

Take the layers from the LSTM network and remove the sequence input layer.

lstmLayers = netLSTM.Layers;
lstmLayers(1) = [];

Add the LSTM layers to the network. Connect the last convolutional layer ("pool5-7x7_s1") to the input of the BiLSTM layer ("bilstm/in").

net = addLayers(net,lstmLayers);
net = connectLayers(net,"pool5-7x7_s1/out","bilstm/in");

Check Network

Check that the network is valid using the analyzeNetwork function.

analyzeNetwork(net)

Classify Using New Data

Read and center-crop the video "pushup.mp4" using the same steps as before.

filename = "pushup.mp4";
video = readVideo(filename);

To view the video, use the implay function (requires Image Processing Toolbox). This function expects data in the range [0,1], so you must first divide the data by 255. Alternatively, you can loop over the individual frames and use the imshow function.

numFrames = size(video,4);
figure
for i = 1:numFrames
    frame = video(:,:,:,i);
    imshow(frame/255);
    drawnow
end

Initialize the network and use it to classify the video.

net = initialize(net);
video = centerCrop(video,inputSize);
scoresPred = predict(net,video);
Y = scores2label(scoresPred,classNames)
Y = categorical
     pushup 
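
To see how confident the network is in its prediction, you can, for example, list the five highest-scoring classes. This is an optional, illustrative step.

scoresVec = double(scoresPred(:));
[scoresTop,idxTop] = maxk(scoresVec,5);
table(string(classNames(idxTop)),scoresTop,VariableNames=["Class","Score"])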

Helper Functions

The readVideo function reads the video in filename and returns an H-by-W-by-C-by-S array, where H, W, C, and S are the height, width, number of channels, and number of frames of the video, respectively.

function video = readVideo(filename)

vr = VideoReader(filename);
H = vr.Height;
W = vr.Width;
C = 3;

% Preallocate video array
numFrames = floor(vr.Duration * vr.FrameRate);
video = zeros(H,W,C,numFrames);

% Read frames
i = 0;
while hasFrame(vr)
    i = i + 1;
    video(:,:,:,i) = readFrame(vr);
end

% Remove unallocated frames
if size(video,4) > i
    video(:,:,:,i+1:end) = [];
end

end

The centerCrop function center-crops the video along its longer spatial dimension and resizes it to size inputSize.

function videoResized = centerCrop(video,inputSize)

[height,width] = size(video,1:2);

if height < width
    % Video is landscape
    idx = floor((width - height)/2);
    video(:,1:(idx-1),:,:) = [];
    video(:,(height+1):end,:,:) = [];
    
elseif width < height
    % Video is portrait
    idx = floor((height - width)/2);
    video(1:(idx-1),:,:,:) = [];
    video(width+1:end,:,:,:) = [];
end

videoResized = imresize(video,inputSize(1:2));

end
