Classify Videos Using Deep Learning
This example shows how to create a network for video classification by combining a pretrained image classification model and an LSTM network.
To create a deep learning network for video classification:
Convert videos to sequences of feature vectors using a pretrained convolutional neural network, such as GoogLeNet, to extract features from each frame.
Train an LSTM network on the sequences to predict the video labels.
Assemble a network that classifies videos directly by combining layers from both networks.
The following diagram illustrates the network architecture.
To input image sequences to the network, use a sequence input layer.
Use convolutional layers to extract features, that is, to apply convolutional operations to each frame of the video independently.
To classify the resulting vector sequences, include the LSTM layers followed by the output layers.
Load Data
Download the HMDB51 data set from HMDB: a large human motion database and extract the RAR file into a folder named "hmdb51_org". The data set contains about 2 GB of video data for 7000 clips over 51 classes, such as "drink", "run", and "shake_hands".
After extracting the RAR files, use the supporting function hmdb51Files to get the file names and the labels of the videos.
dataFolder = "hmdb51_org";
[files,labels] = hmdb51Files(dataFolder);
classNames = categories(labels);
Read the first video using the readVideo helper function, defined at the end of this example, and view the size of the video. The video is an H-by-W-by-C-by-S array, where H, W, C, and S are the height, width, number of channels, and number of frames of the video, respectively.
idx = 1;
filename = files(idx);
video = readVideo(filename);
[height,width,numChannels,numFrames] = size(video);
View the corresponding label.
labels(idx)
ans = categorical
brush_hair
To view the video, use the implay function (requires Image Processing Toolbox™). This function expects data in the range [0,1], so you must first divide the data by 255. Alternatively, you can loop over the individual frames and use the imshow function.
figure
for i = 1:numFrames
    frame = video(:,:,:,i);
    imshow(frame/255);
    drawnow
end
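If you have Image Processing Toolbox, the implay call might look like this one-line sketch, which uses the same scaling by 255 described above:

% Optional sketch: play the video in the Video Viewer app (requires Image Processing Toolbox)
implay(video/255)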
Load Pretrained Convolutional Network
To convert frames of videos to feature vectors, use the activations of a pretrained network.
Load a pretrained GoogLeNet model using the imagePretrainedNetwork function. This function requires the Deep Learning Toolbox™ Model for GoogLeNet Network support package. If this support package is not installed, then the function provides a download link.
netCNN = imagePretrainedNetwork("googlenet");
Convert Frames to Feature Vectors
Use the convolutional network as a feature extractor by getting the activations when inputting the video frames to the network. Convert the videos to sequences of feature vectors, where the feature vectors are the output of the last pooling layer of the GoogLeNet network ("pool5-7x7_s1").
This diagram illustrates the data flow through the network.
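Before converting the whole data set, you can sanity-check the feature extraction on a single frame. This is an optional sketch and not part of the main workflow; it reuses the video variable loaded earlier and resizes one frame with imresize (rather than the centerCrop helper used below):

% Optional sketch: activations of the last pooling layer for a single frame
inputSize = netCNN.Layers(1).InputSize(1:2);                       % GoogLeNet input size
frame = imresize(video(:,:,:,1),inputSize);                        % resize one frame to that size
featureVector = squeeze(predict(netCNN,frame,Outputs="pool5-7x7_s1"));
size(featureVector)                                                % 1024-by-1 feature vector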
To read the video data and resize it to match the input size of the GoogLeNet network, use the readVideo and centerCrop helper functions, defined at the end of this example. This step can take a long time to run. After converting the videos to sequences, save the sequences in a MAT-file in the tempdir folder. If the MAT-file already exists, then load the sequences from the MAT-file without reconverting them.
inputSize = netCNN.Layers(1).InputSize(1:2);
layerName = "pool5-7x7_s1";

tempFile = fullfile(tempdir,"hmdb51_org.mat");

if exist(tempFile,"file")
    load(tempFile,"sequences")
else
    numFiles = numel(files);
    sequences = cell(numFiles,1);

    for i = 1:numFiles
        fprintf("Reading file %d of %d...\n", i, numFiles)

        video = readVideo(files(i));
        video = centerCrop(video,inputSize);

        sequences{i,1} = predict(netCNN,video,Outputs=layerName);
        sequences{i,1} = squeeze(sequences{i,1})';
    end

    save(tempFile,"sequences","-v7.3");
end
View the sizes of the first few sequences. Each sequence is an S-by-D array, where S is the number of frames of the video, and D is the number of features (the output size of the pooling layer).
sequences(1:10)
ans=10×1 cell array
{409×1024 single}
{395×1024 single}
{323×1024 single}
{246×1024 single}
{159×1024 single}
{137×1024 single}
{359×1024 single}
{191×1024 single}
{439×1024 single}
{528×1024 single}
Prepare Training Data
Prepare the data for training by partitioning the data into training and validation partitions and removing any long sequences.
Create Training and Validation Partitions
Partition the data. Assign 90% of the data to the training partition and 10% to the validation partition.
numObservations = numel(sequences);
idx = randperm(numObservations);
N = floor(0.9 * numObservations);

idxTrain = idx(1:N);
sequencesTrain = sequences(idxTrain);
labelsTrain = labels(idxTrain);

idxValidation = idx(N+1:end);
sequencesValidation = sequences(idxValidation);
labelsValidation = labels(idxValidation);
Remove Long Sequences
Sequences that are much longer than the typical sequence introduce a large amount of padding when the sequences are batched for training. Too much padding can negatively impact the classification accuracy.
Get the sequence lengths of the training data and visualize them in a histogram.
numObservationsTrain = numel(sequencesTrain);
sequenceLengths = zeros(1,numObservationsTrain);

for i = 1:numObservationsTrain
    sequence = sequencesTrain{i};
    sequenceLengths(i) = size(sequence,1);
end

figure
histogram(sequenceLengths)
title("Sequence Lengths")
xlabel("Sequence Length")
ylabel("Frequency")
Only a few sequences have more than 400 time steps. To improve the classification accuracy, remove the training sequences that have more than 400 time steps along with their corresponding labels.
maxLength = 400;
idx = sequenceLengths > maxLength;
sequencesTrain(idx) = [];
labelsTrain(idx) = [];
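To confirm how many sequences this step removes, you can, for example, print the count. This is an optional check and not part of the original workflow:

% Optional check: report how many training sequences exceed the length threshold
fprintf("Removed %d of %d training sequences longer than %d time steps.\n", ...
    nnz(idx), numObservationsTrain, maxLength)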
Create LSTM Network
Next, create an LSTM network that can classify the sequences of feature vectors representing the videos.
Define the LSTM network architecture. Specify the following network layers.
A sequence input layer with an input size corresponding to the feature dimension of the feature vectors.
A BiLSTM layer with 2000 hidden units, followed by a dropout layer. To output only one label for each sequence, set the OutputMode option of the BiLSTM layer to "last".
A fully connected layer with an output size corresponding to the number of classes, and a softmax layer.
numFeatures = size(sequencesTrain{1},2);
numClasses = numel(categories(labelsTrain));

layers = [
    sequenceInputLayer(numFeatures,Name="sequence")
    bilstmLayer(2000,OutputMode="last",Name="bilstm")
    dropoutLayer(0.5,Name="drop")
    fullyConnectedLayer(numClasses,Name="fc")
    softmaxLayer(Name="softmax")];
Specify Training Options
Specify the training options using the trainingOptions function.
Set the mini-batch size to 32, the initial learning rate to 0.0001, and the gradient threshold to 2 (to prevent the gradients from exploding).
Shuffle the data every epoch.
Validate the network once per epoch.
Stop training if the validation loss is greater than or equal to its previous lowest value for five epochs.
Display the training progress in a plot, including the accuracy of the network, and suppress verbose output.
miniBatchSize = 32;
numObservations = numel(sequencesTrain);
numIterationsPerEpoch = floor(numObservations / miniBatchSize);

options = trainingOptions("adam", ...
    MiniBatchSize=miniBatchSize, ...
    InitialLearnRate=1e-4, ...
    GradientThreshold=2, ...
    Shuffle="every-epoch", ...
    ValidationData={sequencesValidation,labelsValidation}, ...
    ValidationFrequency=numIterationsPerEpoch, ...
    ValidationPatience=5, ...
    Plots="training-progress", ...
    Metrics="accuracy", ...
    Verbose=false);
Train LSTM Network
Train the network using the trainnet function. This can take a long time to run. By default, the trainnet function uses a GPU if one is available. Training on a GPU requires a Parallel Computing Toolbox™ license and a supported GPU device. For information on supported devices, see GPU Computing Requirements (Parallel Computing Toolbox). Otherwise, the trainnet function uses the CPU. To select the execution environment manually, use the ExecutionEnvironment training option.
[netLSTM,info] = trainnet(sequencesTrain,labelsTrain,layers,"crossentropy",options);
Calculate the classification accuracy of the network on the validation set. Use the same mini-batch size as for the training options.
accuracy = testnet(netLSTM,sequencesValidation,labelsValidation,"accuracy",MiniBatchSize=miniBatchSize)
accuracy = 65.7312
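For a per-class view of the validation performance, you could also compute the predicted labels for the validation set and plot a confusion chart. This is an optional sketch; it assumes the minibatchpredict and confusionchart functions are available in your release:

% Optional sketch: per-class validation performance
scoresValidation = minibatchpredict(netLSTM,sequencesValidation,MiniBatchSize=miniBatchSize);
YValidation = scores2label(scoresValidation,classNames);
figure
confusionchart(labelsValidation,YValidation)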
Assemble Video Classification Network
To create a network that classifies videos directly, assemble a network using layers from both of the created networks. Use the layers from the convolutional network to transform the videos into vector sequences and the layers from the LSTM network to classify the vector sequences.
The following diagram illustrates the network architecture.
To input image sequences to the network, use a sequence input layer.
To extract features, use convolutional layers.
To classify the resulting vector sequences, include the LSTM layers followed by the output layers. The output layers, sometimes called the model head, include the final fully connected layer and softmax layer.
Add Convolutional Layers
Extract the average image from the input layer of the GoogLeNet network. The sequence input layer that you create later uses this image to normalize the input images.
averageImage = netCNN.Layers(1).Mean;
Remove the input layer ("data"
) and the layers after the pooling layer used for the activations ("pool5-drop_7x7_s1"
, "loss3-classifier"
, and "prob"
).
layerNames = ["data" "pool5-drop_7x7_s1" "loss3-classifier" "prob"]; net = removeLayers(netCNN,layerNames);
Add Sequence Input Layer
Create a sequence input layer that accepts image sequences containing images of the same input size as the GoogLeNet network. To normalize the images using the same average image as the GoogLeNet network, set the Normalization option of the sequence input layer to "zerocenter" and the Mean option to the average image of the input layer of GoogLeNet.
inputLayer = sequenceInputLayer([inputSize 3], ...
    Normalization="zerocenter", ...
    Mean=averageImage, ...
    Name="input");
Add the sequence input layer to the network and connect its output to the input of the first convolutional layer ("conv1-7x7_s2").
net = addLayers(net,inputLayer);
net = connectLayers(net,"input/out","conv1-7x7_s2/in");
Add LSTM Layers
Take the layers from the LSTM network and remove the sequence input layer.
lstmLayers = netLSTM.Layers;
lstmLayers(1) = [];
Add the LSTM layers to the network. Connect the last convolutional layer ("pool5-7x7_s1") to the input of the BiLSTM layer ("bilstm/in").
net = addLayers(net,lstmLayers);
net = connectLayers(net,"pool5-7x7_s1/out","bilstm/in");
Check Network
Check that the network is valid using the analyzeNetwork function.
analyzeNetwork(net)
Classify Using New Data
Read and center-crop the video "pushup.mp4" using the same steps as before.
filename = "pushup.mp4";
video = readVideo(filename);
To view the video, use the implay function (requires Image Processing Toolbox). This function expects data in the range [0,1], so you must first divide the data by 255. Alternatively, you can loop over the individual frames and use the imshow function.
numFrames = size(video,4);

figure
for i = 1:numFrames
    frame = video(:,:,:,i);
    imshow(frame/255);
    drawnow
end
Initialize the network and use it to classify the video.
net = initialize(net);

video = centerCrop(video,inputSize);
scoresPred = predict(net,video);
Y = scores2label(scoresPred,classNames)
Y = categorical
pushup
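To see how confident the network is in this prediction, you can sort the class scores and list the highest-scoring classes. This is an optional sketch:

% Optional sketch: show the five highest-scoring classes for the video
[scoresSorted,idxSorted] = sort(scoresPred(:),"descend");
table(classNames(idxSorted(1:5)),scoresSorted(1:5),VariableNames=["Class","Score"])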
Helper Functions
The readVideo function reads the video in filename and returns an H-by-W-by-C-by-S array, where H, W, C, and S are the height, width, number of channels, and number of frames of the video, respectively.
function video = readVideo(filename)

vr = VideoReader(filename);
H = vr.Height;
W = vr.Width;
C = 3;

% Preallocate video array
numFrames = floor(vr.Duration * vr.FrameRate);
video = zeros(H,W,C,numFrames);

% Read frames
i = 0;
while hasFrame(vr)
    i = i + 1;
    video(:,:,:,i) = readFrame(vr);
end

% Remove unallocated frames
if size(video,4) > i
    video(:,:,:,i+1:end) = [];
end

end
The centerCrop function crops the longest edges of a video and resizes it to size inputSize.
function videoResized = centerCrop(video,inputSize)

[height,width] = size(video,1:2);

if height < width
    % Video is landscape
    idx = floor((width - height)/2);
    video(:,1:(idx-1),:,:) = [];
    video(:,(height+1):end,:,:) = [];

elseif width < height
    % Video is portrait
    idx = floor((height - width)/2);
    video(1:(idx-1),:,:,:) = [];
    video(width+1:end,:,:,:) = [];
end

videoResized = imresize(video,inputSize(1:2));

end
See Also
trainnet | trainingOptions | dlnetwork | lstmLayer | sequenceInputLayer | flattenLayer