# Sequence-to-Sequence Classification Using 1-D Convolutions

This example shows how to classify each time step of sequence data using a generic temporal convolutional network (TCN).

While sequence-to-sequence tasks are commonly solved with recurrent neural network architectures, Bai et al. [1] show that convolutional neural networks can match or even outperform recurrent networks on typical sequence modeling tasks. Potential benefits of using convolutional networks include better parallelism, better control over the receptive field size, better control of the network's memory footprint during training, and more stable gradients. Just like recurrent networks, convolutional networks can operate on variable-length input sequences and can be used to model sequence-to-sequence or sequence-to-one tasks.

This example trains a TCN to recognize the activity of a person wearing a smartphone on the body, given time series data representing accelerometer readings in three different directions. The example defines the TCN as a function and trains the model using a custom training loop.

To convolve over the time dimension of the input data (also known as 1-D convolutions), use the `dlconv` function and specify the dimensions of the weights using the `'WeightsFormat'` option or specify the weights as a formatted `dlarray`.
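As a minimal sketch (using random data and arbitrary sizes chosen for illustration, not the values used later in this example), a single causal 1-D convolution over a formatted `dlarray` looks like this:

```
% Random input sequence: 3 channels, 100 time steps, 1 observation.
dlX = dlarray(randn(3,100,1,'single'),'CTB');

% Weights for 8 filters of size 5 over the 3 input channels,
% specified in 'TCU' (time, channel, unspecified) format.
weights = dlarray(randn(5,3,8,'single'));
bias = dlarray(zeros(8,1,'single'));

% Convolve over the time dimension. Padding only at the start of
% the sequence keeps the convolution causal.
dlY = dlconv(dlX,weights,bias, ...
    'WeightsFormat','TCU', ...
    'Padding',[4; 0]);
```

Because the input is a formatted `dlarray` with a `'T'` dimension, `dlconv` convolves over time, and left-only padding of `filterSize - 1 = 4` steps means each output time step depends only on the current and past inputs.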

Load the human activity recognition data. The data contains seven time series of sensor data obtained from a smartphone worn on the body. Each sequence has three features and varies in length. The three features correspond to the accelerometer readings in three different directions. Six sequences are used for training and one sequence is used for testing after training.

`s = load("HumanActivityTrain.mat");`

Create `arrayDatastore` objects containing the data and combine them using the `combine` function.

```
dsXTrain = arrayDatastore(s.XTrain,'OutputType','same');
dsYTrain = arrayDatastore(s.YTrain,'OutputType','same');
dsTrain = combine(dsXTrain,dsYTrain);
```

View the number of observations in the training data.

`numObservations = numel(s.XTrain)`
```
numObservations = 6
```

View the number of classes in the training data.

```
classes = categories(s.YTrain{1});
numClasses = numel(classes)
```
```
numClasses = 5
```

Visualize one of the training sequences in a plot. Plot the features of the first training sequence and the corresponding activity.

```
figure
for i = 1:3
    X = s.XTrain{1}(i,:);

    subplot(4,1,i)
    plot(X)
    ylabel("Feature " + i + newline + "Acceleration")
end

subplot(4,1,4)
hold on
plot(s.YTrain{1})
hold off

xlabel("Time Step")
ylabel("Activity")

subplot(4,1,1)
title("Training Sequence 1")
```

### Define Deep Learning Model

The main building block of a temporal convolutional network is a dilated causal convolution layer which operates over the time steps of each sequence. In this context, "causal" means that the activations computed for a particular time step cannot depend on activations from future time steps.

To build up context from previous time steps, multiple convolutional layers are typically stacked on top of each other. To achieve large receptive field sizes, the dilation factor of subsequent convolution layers is increased exponentially, as shown in the image below. Assuming that the dilation factor of the k-th convolutional layer is ${2}^{\left(\mathit{k}-1\right)}$ and the stride is 1, the receptive field size of such a network can be computed as $\mathit{R}=\left(\mathit{f}-1\right)\left({2}^{\mathit{K}}-1\right)+1$, where $\mathit{f}$ is the filter size and $\mathit{K}$ is the number of convolutional layers. Change the filter size and number of layers to easily adjust the receptive field size and the number of learnable parameters as necessary for the data and task at hand.
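As a quick check of this formula, a filter size of $\mathit{f}=3$ and $\mathit{K}=4$ convolutional layers with dilation factors 1, 2, 4, and 8 give a receptive field of $\left(3-1\right)\left({2}^{4}-1\right)+1=31$ time steps:

```
% Receptive field of K stacked dilated causal convolutions with
% filter size f and dilation factors 2^(k-1).
f = 3;
K = 4;
R = (f - 1)*(2^K - 1) + 1
```

```
R = 31
```

Doubling the number of layers to $\mathit{K}=8$ gives $\mathit{R}=511$, illustrating how the receptive field grows exponentially with depth while the number of learnable parameters grows only linearly.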

One of the disadvantages of TCNs compared to RNNs is their higher memory footprint during inference: rather than a compact hidden state, the raw input sequence is required to compute the next time step. To reduce inference time and memory consumption, especially for step-ahead predictions, it can be beneficial to train with the smallest sensible receptive field size $\mathit{R}$ and perform prediction with only the last $\mathit{R}$ time steps of the input sequence.

A general TCN architecture (as described in [1]) consists of multiple residual blocks, each containing two sets of dilated causal convolution layers with the same dilation factor, followed by normalization, ReLU activation, and spatial dropout layers. The input to each block is added to the output of the block (including a 1-by-1 convolution on the input when the numbers of channels of the input and output do not match), and a final activation function is applied.

#### Define and Initialize Model Parameters

Specify the parameters for the TCN architecture: four residual blocks, each containing dilated causal convolution layers with 175 filters of size 3.

```
numBlocks = 4;
numFilters = 175;
filterSize = 3;
dropoutFactor = 0.05;
```

To pass the model hyperparameters (the number of blocks and the dropout factor) to the model functions, create a structure containing these values.

```
hyperparameters = struct;
hyperparameters.NumBlocks = numBlocks;
hyperparameters.DropoutFactor = dropoutFactor;
```

Create a structure containing `dlarray` objects for all the learnable parameters of the model based on the number of input channels and the hyperparameters that define the model architecture. Each residual block requires weights and bias parameters for each of the two convolution operations. The first residual block usually also requires weights and biases for an additional convolution operation with filter size 1. The final fully connected operation also requires weights and bias parameters. Initialize the learnable layer weights and biases using the `initializeGaussian` and `initializeZeros` functions, respectively. These functions are attached to this example as supporting files.

```
numInputChannels = 3;

parameters = struct;
numChannels = numInputChannels;

mu = 0;
sigma = 0.01;

for k = 1:numBlocks
    parametersBlock = struct;
    blockName = "Block"+k;

    % 1-D convolution.
    sz = [filterSize numChannels numFilters];
    parametersBlock.Conv1.Weights = initializeGaussian(sz,mu,sigma);
    parametersBlock.Conv1.Bias = initializeZeros([numFilters 1]);

    % 1-D convolution.
    sz = [filterSize numFilters numFilters];
    parametersBlock.Conv2.Weights = initializeGaussian(sz,mu,sigma);
    parametersBlock.Conv2.Bias = initializeZeros([numFilters 1]);

    % If the input and output of the block have different numbers of
    % channels, then add a convolution with filter size 1.
    if numChannels ~= numFilters
        % 1-D convolution.
        sz = [1 numChannels numFilters];
        parametersBlock.Conv3.Weights = initializeGaussian(sz,mu,sigma);
        parametersBlock.Conv3.Bias = initializeZeros([numFilters 1]);
    end

    numChannels = numFilters;

    parameters.(blockName) = parametersBlock;
end

% Fully connect.
sz = [numClasses numChannels];
parameters.FC.Weights = initializeGaussian(sz,mu,sigma);
parameters.FC.Bias = initializeZeros([numClasses 1]);
```

View the network parameters.

`parameters`
```
parameters = struct with fields:
    Block1: [1×1 struct]
    Block2: [1×1 struct]
    Block3: [1×1 struct]
    Block4: [1×1 struct]
        FC: [1×1 struct]
```

View the parameters of the first block.

`parameters.Block1`
```
ans = struct with fields:
    Conv1: [1×1 struct]
    Conv2: [1×1 struct]
    Conv3: [1×1 struct]
```

View the parameters of the first convolutional operation of the first block.

`parameters.Block1.Conv1`
```
ans = struct with fields:
    Weights: [3×3×175 dlarray]
       Bias: [175×1 dlarray]
```

#### Define Model and Model Gradients Functions

Create the function `model`, listed in the Model Function section at the end of the example, that computes the outputs of the deep learning model. The function `model` takes the input data, the learnable model parameters, the model hyperparameters, and a flag that specifies whether the model should return outputs for training or prediction. The network outputs the predictions for the labels at each time step of the input sequence.

Create the function `modelGradients`, listed in the Model Gradients Function section at the end of the example, that takes a mini-batch of input data, the corresponding target sequences, and the parameters of the network, and returns the gradients of the loss with respect to the learnable parameters and the corresponding loss.

### Specify Training Options

Specify a set of training options used in the custom training loop.

• Train for 30 epochs with mini-batch size 1.

• Multiply the learn rate by 0.1 every 12 epochs.

• Clip the gradients using the ${L}_{2}$ norm with threshold 1.

```
maxEpochs = 30;
miniBatchSize = 1;

initialLearnRate = 0.001;
learnRateDropFactor = 0.1;
learnRateDropPeriod = 12;

gradientThreshold = 1;
```

To monitor the training progress, you can plot the training loss after each iteration. Create the variable `plots` that contains `"training-progress"`. If you do not want to plot the training progress, then set this value to `"none"`.

`plots = "training-progress";`

### Train Model

Train the network via stochastic gradient descent by looping over the sequences in the training dataset, computing parameter gradients, and updating the network parameters via the Adam update rule. This process is repeated multiple times (each full pass through the training data is called an epoch) until the maximum number of epochs is reached.

Use `minibatchqueue` to process and manage mini-batches of sequences during training. For each mini-batch:

• Preprocess the data using the custom mini-batch preprocessing function `preprocessMiniBatch` (defined at the end of this example) that returns padded predictors, padded one-hot encoded targets, and the padding mask. The function has three outputs, so specify three output variables for the `minibatchqueue` object.

• Convert the outputs to `dlarray` objects with format `'CTB'` (channel, time, batch). The `minibatchqueue` object, by default, outputs data with underlying type `single`.

• Train on a GPU if one is available. The `minibatchqueue` object, by default, converts each output to a `gpuArray` if a GPU is available. Using a GPU requires Parallel Computing Toolbox™ and a CUDA® enabled NVIDIA® GPU. For more information, see GPU Support by Release (Parallel Computing Toolbox).

```
numDatastoreOutputs = 3;

mbq = minibatchqueue(dsTrain,numDatastoreOutputs,...
    'MiniBatchSize',miniBatchSize,...
    'MiniBatchFormat',{'CTB','CTB','CTB'},...
    'MiniBatchFcn', @preprocessMiniBatch);
```

For each epoch, shuffle the training data. For each mini-batch:

• Evaluate the model gradients and loss using `dlfeval` and the `modelGradients` function.

• Clip any large gradients using the function `thresholdL2Norm`, listed at the end of the example, and the `dlupdate` function.

• Update the network parameters using the `adamupdate` function.

• Update the training progress plot.

After completing `learnRateDropPeriod` epochs, reduce the learn rate by multiplying the current learning rate by `learnRateDropFactor`.

Initialize the learning rate, which is multiplied by the `learnRateDropFactor` value every `learnRateDropPeriod` epochs.

`learnRate = initialLearnRate;`

Initialize the moving average of the parameter gradients and the element-wise squares of the gradients used by the Adam optimizer.

```
trailingAvg = [];
trailingAvgSq = [];
```

Initialize a plot showing the training progress.

```
if plots == "training-progress"
    figure
    lineLossTrain = animatedline('Color',[0.85 0.325 0.098]);
    ylim([0 inf])
    xlabel("Iteration")
    ylabel("Loss")
    grid on
end
```

Train the model.

```
iteration = 0;
start = tic;

% Loop over epochs.
for epoch = 1:maxEpochs

    % Shuffle the data.
    shuffle(mbq)

    % Loop over mini-batches.
    while hasdata(mbq)
        iteration = iteration + 1;

        [dlX,dlY,mask] = next(mbq);

        % Evaluate the model gradients and loss using dlfeval.
        [gradients,loss] = dlfeval(@modelGradients,parameters,hyperparameters,dlX,dlY,mask);

        % Clip the gradients.
        gradients = dlupdate(@(g) thresholdL2Norm(g,gradientThreshold),gradients);

        % Update the network parameters using the Adam optimizer.
        [parameters,trailingAvg,trailingAvgSq] = adamupdate(parameters,gradients, ...
            trailingAvg,trailingAvgSq,iteration,learnRate);

        if plots == "training-progress"
            % Plot training progress.
            D = duration(0,0,toc(start),'Format','hh:mm:ss');

            % Normalize the loss over the sequence lengths.
            numTimeSteps = sum(mask(1,:,:),3);
            loss = mean(loss ./ numTimeSteps);
            loss = double(gather(extractdata(loss)));
            loss = mean(loss);

            addpoints(lineLossTrain,iteration,loss)

            title("Epoch: " + epoch + ", Elapsed: " + string(D))
            drawnow
        end
    end

    % Reduce the learning rate after learnRateDropPeriod epochs.
    if mod(epoch,learnRateDropPeriod) == 0
        learnRate = learnRate*learnRateDropFactor;
    end
end
```

### Test Model

Test the classification accuracy of the model by comparing the predictions on a held-out test set with the true labels for each time step.

Create a datastore containing the test predictors.

```
s = load("HumanActivityTest.mat");
dsXTest = arrayDatastore(s.XTest,'OutputType','same');
```

After training, making predictions on new data does not require the labels. Create a `minibatchqueue` object containing only the predictors of the test data:

• To ignore the labels for testing, set the number of outputs of the mini-batch queue to 1.

• Specify the same mini-batch size used for training.

• Preprocess the predictors using the `preprocessMiniBatchPredictors` function, listed at the end of the example.

• For the single output of the datastore, specify the mini-batch format `'CTB'` (channel, time, batch).

```
numDatastoreOutputs = 1;

mbqTest = minibatchqueue(dsXTest,numDatastoreOutputs,...
    'MiniBatchSize',miniBatchSize,...
    'MiniBatchFormat','CTB', ...
    'MiniBatchFcn', @preprocessMiniBatchPredictors);
```

Loop over the mini-batches and classify the sequences using the `modelPredictions` function, listed at the end of the example.

`predictions = modelPredictions(parameters,hyperparameters,mbqTest,classes);`

Evaluate the classification accuracy by comparing the predictions to the test labels.

```
YTest = s.YTest{1};
accuracy = mean(predictions == YTest)
```
```
accuracy = 0.9996
```

Visualize the test sequence in a plot and compare the predictions with the corresponding test data.

```
figure
for i = 1:3
    X = s.XTest{1}(i,:);

    subplot(4,1,i)
    plot(X)
    ylabel("Feature " + i + newline + "Acceleration")
end

subplot(4,1,4)

idx = 1;
plot(predictions(idx,:),'.-')
hold on
plot(YTest(idx,:))
hold off

xlabel("Time Step")
ylabel("Activity")
legend(["Predicted" "Test Data"],'Location','northeast')

subplot(4,1,1)
title("Test Sequence")
```

### Model Function

The function `model`, described in the Define Model and Model Gradients Functions section of the example, takes as input the model parameters and hyperparameters, input data `dlX`, and the flag `doTraining` which specifies whether the model should return outputs for training or prediction. The network outputs the predictions for the labels at each time step of the input sequence. The model consists of multiple residual blocks with exponentially increasing dilation factors. After the last residual block, a final `fullyconnect` operation maps the output to the number of classes in the target data. The model function creates residual blocks using the `residualBlock` function listed in the Residual Block Function section of the example.

```
function dlY = model(parameters,hyperparameters,dlX,doTraining)

numBlocks = hyperparameters.NumBlocks;
dropoutFactor = hyperparameters.DropoutFactor;

dlY = dlX;

% Residual blocks.
for k = 1:numBlocks
    dilationFactor = 2^(k-1);
    parametersBlock = parameters.("Block"+k);

    dlY = residualBlock(dlY,dilationFactor,dropoutFactor,parametersBlock,doTraining);
end

% Fully connect.
weights = parameters.FC.Weights;
bias = parameters.FC.Bias;
dlY = fullyconnect(dlY,weights,bias);

% Softmax.
dlY = softmax(dlY);

end
```

#### Residual Block Function

The function `residualBlock` implements the core building block of the temporal convolutional network.

To apply 1-D causal dilated convolution, use the `dlconv` function:

• To convolve over the time dimension, set the `'WeightsFormat'` option to `'TCU'` (time, channel, unspecified).

• Set the `'DilationFactor'` option according to the dilation factor of the residual block.

• To ensure only the past time steps are used, apply padding only at the beginning of the sequence.

```
function dlY = residualBlock(dlX,dilationFactor,dropoutFactor,parametersBlock,doTraining)

% Convolution options.
filterSize = size(parametersBlock.Conv1.Weights,1);
paddingSize = (filterSize - 1) * dilationFactor;

% Convolution.
weights = parametersBlock.Conv1.Weights;
bias = parametersBlock.Conv1.Bias;
dlY = dlconv(dlX,weights,bias, ...
    'WeightsFormat','TCU', ...
    'DilationFactor', dilationFactor, ...
    'Padding', [paddingSize; 0]);

% Normalization.
dim = find(dims(dlY)=='T');
mu = mean(dlY,dim);
sigmaSq = var(dlY,1,dim);
epsilon = 1e-5;
dlY = (dlY - mu) ./ sqrt(sigmaSq + epsilon);

% ReLU, spatial dropout.
dlY = relu(dlY);
dlY = spatialDropout(dlY,dropoutFactor,doTraining);

% Convolution.
weights = parametersBlock.Conv2.Weights;
bias = parametersBlock.Conv2.Bias;
dlY = dlconv(dlY,weights,bias, ...
    'WeightsFormat','TCU', ...
    'DilationFactor', dilationFactor, ...
    'Padding',[paddingSize; 0]);

% Normalization.
dim = find(dims(dlY)=='T');
mu = mean(dlY,dim);
sigmaSq = var(dlY,1,dim);
epsilon = 1e-5;
dlY = (dlY - mu) ./ sqrt(sigmaSq + epsilon);

% ReLU, spatial dropout.
dlY = relu(dlY);
dlY = spatialDropout(dlY,dropoutFactor,doTraining);

% Optional 1-by-1 convolution.
if ~isequal(size(dlX),size(dlY))
    weights = parametersBlock.Conv3.Weights;
    bias = parametersBlock.Conv3.Bias;
    dlX = dlconv(dlX,weights,bias,'WeightsFormat','TCU');
end

% Addition and ReLU.
dlY = relu(dlX + dlY);

end
```

### Model Gradients Function

The `modelGradients` function, described in the Define Model and Model Gradients Functions section of the example, takes as input the model parameters and hyperparameters, a mini-batch of input data `dlX`, the corresponding target sequences `dlT`, and the sequence padding mask, and returns the gradients of the loss with respect to the learnable parameters and the corresponding loss. To calculate the masked cross-entropy loss, use the `'Mask'` option of the `crossentropy` function. To compute the gradients, evaluate the `modelGradients` function using the `dlfeval` function in the training loop.

```
function [gradients,loss] = modelGradients(parameters,hyperparameters,dlX,dlT,mask)

doTraining = true;
dlY = model(parameters,hyperparameters,dlX,doTraining);

mask = stripdims(mask);
loss = crossentropy(dlY,dlT, ...
    'Mask',mask, ...
    'Reduction','none');

loss = sum(loss,[1 3]);
loss = mean(loss);

gradients = dlgradient(loss,parameters);

end
```

### Model Predictions Function

The `modelPredictions` function takes as input the model parameters and hyperparameters, a `minibatchqueue` of input data `mbq`, and the network classes, and computes the model predictions by iterating over all data in the `minibatchqueue` object. The function uses the `onehotdecode` function to find the predicted classes with the highest score.

```
function predictions = modelPredictions(parameters,hyperparameters,mbq,classes)

doTraining = false;
predictions = [];

while hasdata(mbq)
    dlX = next(mbq);

    dlYPred = model(parameters,hyperparameters,dlX,doTraining);
    YPred = onehotdecode(dlYPred,classes,1);

    predictions = [predictions; YPred];
end

predictions = permute(predictions,[2 3 1]);

end
```

### Spatial Dropout Function

The `spatialDropout` function performs spatial dropout [3] on the input `dlX` when the `doTraining` flag is `true`. Spatial dropout drops an entire channel of the input data. That is, all time steps of a given channel are dropped with the probability specified by `dropoutFactor`. The channels are dropped out independently along the batch dimension.

```
function dlY = spatialDropout(dlX,dropoutFactor,doTraining)

fmt = dims(dlX);

if doTraining
    maskSize = size(dlX);
    maskSize(ismember(fmt,'ST')) = 1;

    dropoutScaleFactor = single(1 - dropoutFactor);
    dropoutMask = (rand(maskSize,'like',dlX) > dropoutFactor) / dropoutScaleFactor;

    dlY = dlX .* dropoutMask;
else
    dlY = dlX;
end

end
```

### Mini-Batch Preprocessing Function

The `preprocessMiniBatch` function preprocesses the data for training. The function transforms the input sequences to a numeric array of left-padded 1-D sequences and also returns the padding mask.

The `preprocessMiniBatch` function preprocesses the data using the following steps:

1. Preprocess the predictors using the `preprocessMiniBatchPredictors` function.

2. One-hot encode the categorical labels of each time step into numeric arrays.

3. Pad the sequences to the same length as the longest sequence in the mini-batch using the `padsequences` function.

```
function [XTransformed,YTransformed,mask] = preprocessMiniBatch(XCell,YCell)

XTransformed = preprocessMiniBatchPredictors(XCell);

miniBatchSize = numel(XCell);
responses = cell(1,miniBatchSize);

for i = 1:miniBatchSize
    responses{i} = onehotencode(YCell{i},1);
end

[YTransformed,mask] = padsequences(responses,2,'Direction','left');

end
```

### Mini-Batch Predictors Preprocessing Function

The `preprocessMiniBatchPredictors` function preprocesses a mini-batch of predictors by extracting the sequence data from the input cell array and left-padding the sequences so that they have the same length.

```
function XTransformed = preprocessMiniBatchPredictors(XCell)

XTransformed = padsequences(XCell,2,'Direction','left');

end
```

### Gradient Clipping Function

The `thresholdL2Norm` function scales the gradient `g` so that its ${L}_{2}$ norm equals `gradientThreshold` when the ${L}_{2}$ norm of the gradient is larger than `gradientThreshold`.

```
function g = thresholdL2Norm(g,gradientThreshold)

gradientNorm = sqrt(sum(g.^2,'all'));
if gradientNorm > gradientThreshold
    g = g * (gradientThreshold / gradientNorm);
end

end
```

### References

[1] Bai, Shaojie, J. Zico Kolter, and Vladlen Koltun. "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling." arXiv preprint arXiv:1803.01271 (2018).

[2] Van Den Oord, Aäron, et al. "WaveNet: A generative model for raw audio." SSW 125 (2016).

[3] Tompson, Jonathan, et al. "Efficient object localization using convolutional networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.