Is there any documentation on how to build a transformer encoder from scratch in matlab?

Question

Nour on 30 Aug 2023


https://uk.mathworks.com/matlabcentral/answers/2014811-is-there-any-documentation-on-how-to-build-a-transformer-encoder-from-scratch-in-matlab


I am building a transformer encoder, and I came across the following File Exchange submission: https://www.mathworks.com/matlabcentral/fileexchange/107375-transformer-models

However, the examples in that submission show how to use a pretrained transformer model. I just need an example of how to build a model, something to give a general idea that I can build on. I have studied the basics of transformers, but I am having some difficulty building the model from scratch.

Thank you in advance.


Shubham on 8 Sep 2023

Hi,

You can refer to this documentation:

https://machinelearningmastery.com/implementing-the-transformer-encoder-from-scratch-in-tensorflow-and-keras/

The article uses TensorFlow, but you can replicate the approach in MATLAB.


Answer 1

Ben on 18 Sep 2023


https://uk.mathworks.com/matlabcentral/answers/2014811-is-there-any-documentation-on-how-to-build-a-transformer-encoder-from-scratch-in-matlab#answer_1312937

You can use selfAttentionLayer to build the encoder from layers.
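To make the "from scratch" part concrete, here is a minimal sketch of what a single attention head computes, written with base MATLAB only. The weight names Wq, Wk and Wv, the sizes, and the channels-by-time layout are assumptions chosen for illustration; selfAttentionLayer additionally handles multiple heads, an output projection and learnable parameters, so this is not a description of its internals.

```matlab
% Single-head scaled dot-product self-attention on a C-by-T matrix X
% (C channels, T time steps). All names and sizes here are hypothetical.
rng(0);
C = 4; T = 6; dk = 3;
X  = randn(C,T);
Wq = randn(dk,C);    % query projection (would be learned)
Wk = randn(dk,C);    % key projection (would be learned)
Wv = randn(dk,C);    % value projection (would be learned)

Q = Wq*X;            % dk-by-T queries
K = Wk*X;            % dk-by-T keys
V = Wv*X;            % dk-by-T values

% scores(j,i) is the scaled dot product of key j with query i
scores = (K.'*Q)/sqrt(dk);

% Softmax over the key dimension; subtracting the column maximum
% first avoids overflow in exp
scores = scores - max(scores,[],1);
A = exp(scores)./sum(exp(scores),1);   % each column sums to 1

% Output at time step i is the attention-weighted sum of the values
Z = V*A;             % dk-by-T
```

Every output column depends on every input column, and nothing in the computation depends on the order of the columns except through X itself, which is why position information has to be added separately.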

The general structure of the intermediate encoder blocks is like:

selfAttentionLayer(numHeads,numKeyChannels) % self attention
additionLayer(2,Name="attention_add") % residual connection around attention
layerNormalizationLayer(Name="attention_norm") % layer norm
fullyConnectedLayer(feedforwardHiddenSize) % feedforward part 1
reluLayer % nonlinear activation
fullyConnectedLayer(attentionHiddenSize) % feedforward part 2
additionLayer(2,Name="feedforward_add") % residual connection around feedforward
layerNormalizationLayer() % layer norm

You would need to hook up the connections to the addition layers appropriately.

Typically you would have multiple copies of this encoder block in a transformer encoder.
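One way to stack several blocks and hook up the residual connections is to generate the layers in a loop with a unique name prefix per block. This is only a sketch: the sizes and the "enc<k>_" naming scheme are assumptions made for illustration, and the input is assumed to already have modelHiddenSize channels (in a full model an embedding would come first). It needs Deep Learning Toolbox in a release that supports connectLayers on a dlnetwork.

```matlab
% Hypothetical sizes for illustration
numBlocks = 3;
numHeads = 2;
numKeyChannels = 16;        % must be divisible by numHeads
modelHiddenSize = 16;
feedforwardHiddenSize = 64;

% Build the layer array; each block gets its own name prefix
layers = sequenceInputLayer(modelHiddenSize,Name="in");
for k = 1:numBlocks
    p = "enc" + k + "_";
    layers = [layers
        selfAttentionLayer(numHeads,numKeyChannels,Name=p+"attn")
        additionLayer(2,Name=p+"attn_add")
        layerNormalizationLayer(Name=p+"attn_norm")
        fullyConnectedLayer(feedforwardHiddenSize,Name=p+"ff1")
        reluLayer(Name=p+"relu")
        fullyConnectedLayer(modelHiddenSize,Name=p+"ff2")
        additionLayer(2,Name=p+"ff_add")
        layerNormalizationLayer(Name=p+"out")]; %#ok<AGROW>
end

% Add the skip connections before initializing
net = dlnetwork(layers,Initialize=false);
prev = "in";                 % output that feeds the current block
for k = 1:numBlocks
    p = "enc" + k + "_";
    net = connectLayers(net,prev,p+"attn_add/in2");          % around attention
    net = connectLayers(net,p+"attn_norm",p+"ff_add/in2");   % around feedforward
    prev = p+"out";
end
net = initialize(net);
```

Calling analyzeNetwork(net) afterwards is a quick way to confirm that each addition layer has both inputs connected.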

You also typically need an embedding at the start of the model. For text data it's common to use wordEmbeddingLayer, whereas for image data you would use patchEmbeddingLayer.

Also the above encoder block makes no use of positional information, so if your training task requires positional information to be used, you would typically inject the position information via a positionEmbeddingLayer or sinusoidalPositionEncodingLayer.
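For intuition about what a fixed sinusoidal encoding looks like, the sketch below computes the textbook sine/cosine table in base MATLAB. The interleaved sine/cosine layout, the zero-based positions and the constant 10000 follow the common formulation and are assumptions here; the exact convention used inside sinusoidalPositionEncodingLayer may differ, so treat this as an illustration rather than a reimplementation.

```matlab
% Textbook sinusoidal position encoding, d channels by T positions.
d = 8;                       % number of channels (assumed even)
T = 5;                       % sequence length
pos = 0:T-1;                 % 1-by-T positions, zero-based
i = (0:d/2-1).';             % (d/2)-by-1 frequency index

% One angular frequency per channel pair, evaluated at every position
angles = pos ./ (10000.^(2*i/d));   % (d/2)-by-T

PE = zeros(d,T);
PE(1:2:end,:) = sin(angles);        % odd rows hold the sine terms
PE(2:2:end,:) = cos(angles);        % even rows hold the cosine terms
```

Adding PE to the embedded inputs gives each time step a distinct signature, and because the table is computed rather than learned it can be extended to any sequence length.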

Finally the last encoder block will typically feed into a model "head" to map the encoder output back to the dimensions of the training targets. Typically this can just be some simple fullyConnectedLayer-s.

Note that for both image and sequence input data, the output of the encoder is still an image or sequence, so for image classification and sequence-to-one tasks you need some way to map that sequence of encoder outputs to a fixed-size representation. For this you could use indexing1dLayer or pooling layers like globalMaxPooling1dLayer.
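The difference between these reductions is easy to see on a small matrix. The sketch below uses base MATLAB on a hypothetical channels-by-time encoder output Z; it only illustrates the operations (pick one time step, or pool over time) and does not call the layers themselves.

```matlab
% Hypothetical encoder output: 2 channels, 3 time steps
Z = [1 5 2
     4 0 3];

zFirst = Z(:,1);        % keep only the first time step
zLast  = Z(:,end);      % keep only the last time step
zMax   = max(Z,[],2);   % global max pooling over time
zMean  = mean(Z,2);     % global average pooling over time
```

Each result is a 2-by-1 vector regardless of the sequence length, which is what a fully connected head needs for a sequence-to-one task.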

Here's a demonstration of the general architecture for a toy task. Given a sequence $x = (x_1, x_2, \ldots, x_n)$ where each $x_i \in \{1, 2, \ldots, n\}$, we can specify a task $y = x_{x_1} + x_{x_2}$. For example, $x = (2, 4, 1, 3)$ would have $x_1 = 2$ and $x_2 = 4$, then $y = x_2 + x_4 = 4 + 3 = 7$. This is a toy problem that requires positional information to solve and can be easily implemented in code. You can train a transformer encoder to predict y from x as follows:

% Create model
% We will use 2 encoder layers.
numHeads = 1;
numKeyChannels = 20;
feedforwardHiddenSize = 100;
modelHiddenSize = 20;

% Since the values in the sequence can be 1,2, ..., 10 the "vocabulary" size is 10.
vocabSize = 10;
inputSize = 1;

encoderLayers = [
    sequenceInputLayer(1,Name="in") % input
    wordEmbeddingLayer(modelHiddenSize,vocabSize,Name="embedding") % embedding
    positionEmbeddingLayer(modelHiddenSize,vocabSize) % position embedding; the second argument is the maximum position, which equals vocabSize here only because the sequences have length 10
    additionLayer(2,Name="embed_add") % add the data and position embeddings
    selfAttentionLayer(numHeads,numKeyChannels) % encoder block 1: self attention
    additionLayer(2,Name="attention_add") % residual connection around attention
    layerNormalizationLayer(Name="attention_norm") % layer norm
    fullyConnectedLayer(feedforwardHiddenSize) % feedforward part 1
    reluLayer % nonlinear activation
    fullyConnectedLayer(modelHiddenSize) % feedforward part 2
    additionLayer(2,Name="feedforward_add") % residual connection around feedforward
    layerNormalizationLayer(Name="encoder1_out") % layer norm, output of block 1
    selfAttentionLayer(numHeads,numKeyChannels) % encoder block 2: self attention
    additionLayer(2,Name="attention2_add") % residual connection around attention
    layerNormalizationLayer(Name="attention2_norm") % layer norm
    fullyConnectedLayer(feedforwardHiddenSize) % feedforward part 1
    reluLayer % nonlinear activation
    fullyConnectedLayer(modelHiddenSize) % feedforward part 2
    additionLayer(2,Name="feedforward2_add") % residual connection around feedforward
    layerNormalizationLayer() % layer norm, output of block 2
    indexing1dLayer % reduce the sequence to a single time step
    fullyConnectedLayer(inputSize)]; % output head

% Create the network uninitialized so the skip connections can be added first
net = dlnetwork(encoderLayers,Initialize=false);
net = connectLayers(net,"embed_add","attention_add/in2");
net = connectLayers(net,"embedding","embed_add/in2");
net = connectLayers(net,"attention_norm","feedforward_add/in2");
net = connectLayers(net,"encoder1_out","attention2_add/in2");
net = connectLayers(net,"attention2_norm","feedforward2_add/in2");
net = initialize(net);

% analyze the network to see how data flows through it
analyzeNetwork(net)

% create toy training data
% We will generate 10,000 sequences of length 10
% with values that are random integers 1-10
numObs = 10000;
seqLen = 10;
x = randi([1,10],[seqLen,numObs]);

% Loop over to create y(i) = x(x(1),i) + x(x(2),i)
y = zeros(numObs,1);
for i = 1:numObs
    idx = x(1:2,i);
    y(i) = sum(x(idx,i));
end
x = num2cell(x,1);

% specify training options and train
opts = trainingOptions("adam", ...
    MaxEpochs = 200, ...
    MiniBatchSize = numObs/10, ...
    Plots="training-progress", ...
    Shuffle="every-epoch", ...
    InitialLearnRate=1e-2, ...
    LearnRateDropFactor=0.9, ...
    LearnRateDropPeriod=10, ...
    LearnRateSchedule="piecewise");

net = trainnet(x,y,net,"mse",opts);

% test the network on a new input
x = randi([1,10],[seqLen,1]);
ypred = predict(net,x)
yact = x(x(1)) + x(x(2))

Obviously this is a toy task, but I think it demonstrates the parts of the standard transformer architecture. Two additional things you would likely need to deal with in real tasks are:

For sequence data, the observations often have different sequence lengths. For this you need to pad the data and pass padding masks to the selfAttentionLayer so that no attention is paid to padding elements.
Often the encoder will be initially pre-trained on a self-supervised task, e.g. masked-language-modeling for natural language encoders.
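The padding point can be illustrated with base MATLAB: pad the sequences to a common length, record which entries are real data, and set the attention scores of padded keys to -Inf before the softmax so that their weights become exactly zero. The data, the pad value of 0 and the random scores below are made up for illustration. In practice Deep Learning Toolbox has helpers for this (I believe padsequences returns a mask, and selfAttentionLayer has a HasPaddingMaskInput option), but check the documentation for your release rather than relying on this sketch.

```matlab
% Three hypothetical sequences of different lengths
seqs = {[3 1 4], [2 7], [5 9 2 6]};
N = numel(seqs);
T = max(cellfun(@numel,seqs));   % pad everything to the longest length

XPad = zeros(N,T);               % pad value 0 (an arbitrary choice)
mask = false(N,T);               % true where the entry is real data
for n = 1:N
    len = numel(seqs{n});
    XPad(n,1:len) = seqs{n};
    mask(n,1:len) = true;
end

% Masked softmax for observation 2, whose last two steps are padding.
% scores(j,i) stands for the score of key j against query i.
rng(0);
scores = randn(T,T);
keyIsPad = ~mask(2,:);
scores(keyIsPad,:) = -Inf;               % padded keys can never be attended to
scores = scores - max(scores,[],1);      % numerical stability
A = exp(scores)./sum(exp(scores),1);     % columns sum to 1 over real keys only
```

Rows 3 and 4 of A are all zeros, so the padded positions contribute nothing to any output.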

Hope that helps.


Ben on 11 Dec 2023

Could you provide code to reproduce the issue with the output format?

I am able to get this example to run with a transformer encoder in place of the LSTM. I would say there are a handful of considerations to make when doing this:

1. The input data appear to be integers. Do these integers have meaningful values, or are they simply class labels? In the latter case you typically use an embedding, like the wordEmbeddingLayer above, to create initial vector embeddings of those class labels. However, the original example passes them directly to the LSTM, so perhaps this is unnecessary.
2. The input data appear to be all sequences of length 5000. If the sequences to be used at test time will always have length at most 5000, then you can use positionEmbeddingLayer to provide positional information, but if the sequences might have arbitrary length at test time you might want to use sinusoidalPositionEncodingLayer.
3. The sequence length of 5000 is quite large for selfAttentionLayer, because the computation scales quadratically with sequence length. This caused my 12GB GPU to go out of memory. A potential workaround would be to use convolution, pooling, and transposed convolution to downsample initially before the selfAttentionLayer, then upsample after the transformer encoder.
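A back-of-the-envelope calculation shows why the quadratic scaling matters. The sketch below estimates only the memory for the attention score matrices, one T-by-T matrix per head per observation, in single precision. The mini-batch size of 32 and the head count of 5 are assumptions picked for illustration, and a real training run also stores activations and gradients, so the true footprint is larger.

```matlab
% Rough size of the attention scores alone (assumed settings)
bytesPerValue = 4;                 % single precision
numHeads = 5;                      % hypothetical
miniBatchSize = 32;                % hypothetical

seqLens = 5000 ./ [1 5 25];        % no downsampling, then 5x and 25x shorter

scoreBytes = seqLens.^2 * bytesPerValue * numHeads * miniBatchSize;
scoreGB = scoreBytes/1e9           % approximately [16 0.64 0.0256]
```

Shortening the sequence by a factor of 5 cuts this term by a factor of 25, which is why downsampling before the attention layer makes such a large difference.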

I don't know the ECG data that well, so the following choices may not be appropriate. For 1, I chose not to use a class embedding because the LSTM case doesn't. For 2, I chose to use sinusoidal position encodings. For 3, I chose to downsample from 5000 -> 1000 -> 200 using 2 conv-activation-pool blocks with a stride of 5 in the pooling layers, and to upsample using transposed convolution. Additionally, I concatenated the inputs to the conv-activation-pool blocks with their counterparts after the transformer encoder. That led me to try this architecture:

modelHiddenSize = 50;
filterSize = 10;

layers = [ ...
    sequenceInputLayer(1,MinLength=5000)
    fullyConnectedLayer(modelHiddenSize,Name="emb")
    sinusoidalPositionEncodingLayer(modelHiddenSize)
    additionLayer(2,Name="add")
    convolution1dLayer(filterSize,modelHiddenSize,Padding="same")
    reluLayer
    maxPooling1dLayer(filterSize,Stride=5,Padding="same",Name="pool_1")
    convolution1dLayer(filterSize,modelHiddenSize,Padding="same")
    reluLayer
    maxPooling1dLayer(filterSize,Stride=5,Name="downsample_out",Padding="same")
    selfAttentionLayer(5,modelHiddenSize)
    additionLayer(2,Name="attn_add")
    layerNormalizationLayer
    fullyConnectedLayer(modelHiddenSize*2)
    geluLayer
    fullyConnectedLayer(modelHiddenSize)
    concatenationLayer(1,2,Name="cat_1")
    transposedConv1dLayer(filterSize,modelHiddenSize,Cropping="same",Stride=5)
    geluLayer
    concatenationLayer(1,2,Name="cat_2")
    transposedConv1dLayer(filterSize,modelHiddenSize,Cropping="same",Stride=5)
    geluLayer
    concatenationLayer(1,2,Name="cat_3")
    fullyConnectedLayer(4)
    softmaxLayer
    classificationLayer];

lg = layerGraph(layers);
lg = lg.connectLayers("emb","add/in2");
lg = lg.connectLayers("downsample_out","attn_add/in2");
lg = lg.connectLayers("add","cat_3/in2");
lg = lg.connectLayers("pool_1","cat_2/in2");
lg = lg.connectLayers("downsample_out","cat_1/in2");

This beats the LSTM on the raw data and the filtered signals. Note however the above model is quite a bit more complex, so you need to consider what metrics to compare it to LSTM with.

For the time-frequency representation signals that have passed through the FSST, the LSTM seems to perform better - I tried a number of adaptations of the above model but didn't have any luck. This suggests to me that either the FSST extracted features are quite useful representations on their own, and it takes time for the convolution layers to learn how to use these, or that the downsampling in time used to make the transformer feasible to train destroys too much information.

In the above, the hyperparameters are more-or-less arbitrary choices. For real tasks you might want to experiment with various values for each hyperparameter to see which ones affect the model performance, in which case Experiment Manager could be useful.

DGM on 5 Mar 2024

Posted as a comment-as-flag by chang gao:

Useful answer.

haohaoxuexi1 on 27 Jul 2024

@Ben Hi Ben, Is it possible for you to provide me an example of applying Transformer network for classification task?
