How to prepare irregularly spaced time-series data for classification using LSTM

(The first 50 days' worth of data are included.)
I have 215 days' worth of data in a variable structured like this: processed_data is a 215×1 cell array, where each cell holds the data for one day. Each day has a varying number of observations (roughly 12,000 rows on average). Each row represents one observation, where:

- the first column is the seconds elapsed since the previous row (not normalized),
- the second column is the price of a specified security (normalized using z-score),
- the third column is the target variable: 1 if the price will be 0.01% higher 60 seconds later, 0 otherwise.

I'm using the first two columns as the predictors, and I imagined this network making a prediction for every observation in the data. I keep the days separate because hours pass between the last row of day i and the first row of day i+1. Below is a sample of data from an arbitrary day, followed by a quick sanity check of the structure:
2.57500000000437 0.502515050312692 0
1.03600000000006 0.469361050915526 1
1.05899999999383 0.386501335237771 1
0.838000000003376 0.436219680495852 0
1.12999999999738 0.469361050915526 0
0.824000000000524 0.369924327252462 1
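A quick check that the variable matches this description:

% Sanity check of the structure described above.
size(processed_data)         % 215×1 cell array
rows_per_day = cellfun(@(d) size(d, 1), processed_data);
mean(rows_per_day)           % roughly 12,000 observations per day
size(processed_data{1}, 2)   % 3 columns: seconds elapsed, price, target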
I'm just a beginner in ML, and I'm having a really hard time picturing how the data for the LSTM layer should be formatted. If I understand correctly, it needs 3-dimensional data, where one dimension represents the channel, another the time step, and another the batch. I'm now fairly sure I misunderstood these concepts when I wrote the code below:
%% Partitioning data.
train_data_length = round(length(processed_data) * 0.9);
train_data = processed_data(1:train_data_length);
test_data = processed_data(train_data_length+1:end);

%% Training setup
% Convert data to cell arrays of dlarray.
train_X = cell(size(train_data));
train_Y = cell(size(train_data));
for day = 1:length(train_data)
    % Add batch dimension (C×B×T where B = 1).
    data = permute(train_data{day}(:, 1:2)', [1 3 2]);   % [2×1×T]
    train_X{day} = dlarray(data, "CBT");
    % Convert labels to one-hot encoded CBT format [2×1×T].
    labels = train_data{day}(:, 3)';                     % [1×T]
    one_hot_labels = onehotencode(labels, 1, 'ClassNames', [0 1]); % [2×T]
    one_hot_labels = reshape(one_hot_labels, 2, 1, []);  % [2×1×T]
    train_Y{day} = dlarray(single(one_hot_labels), "CBT");
end
ds = combine( ...
    arrayDatastore(train_X, 'OutputType', 'same'), ...
    arrayDatastore(train_Y, 'OutputType', 'same'));

%clearvars -except ds test_data ml_method
num_features = 2;
num_hidden_units = 128;
num_classes = 2;
mini_batch_size = 32;

layers = [
    sequenceInputLayer(num_features, 'Name', 'input')
    lstmLayer(num_hidden_units, 'OutputMode', 'sequence')
    fullyConnectedLayer(num_classes)
    softmaxLayer
    ];
net = dlnetwork(layers);

options = trainingOptions('adam', ...
    'MaxEpochs', 30, ...
    'MiniBatchSize', mini_batch_size, ...
    'SequenceLength', 'longest', ...
    'Shuffle', 'every-epoch', ...
    'Plots', 'training-progress', ...
    'InputDataFormats', 'CBT', ...
    'Verbose', false, ...
    'ExecutionEnvironment', 'gpu');
net = trainnet(ds, net, 'crossentropy', options);
In the code above, I tried to define the channel dimension as the number of predictors (2 in my case, and probably the only dimension I defined correctly). I set the batch to 1 because I thought it meant the network would use one observation at a time to make predictions. I set the time dimension to the first column of a day's data (the seconds elapsed since the last observation) because I thought it literally meant steps in time. Now I know that I was completely wrong. I also had to reduce mini_batch_size from 128 to 32, which I found surprisingly low, because otherwise I would run out of memory. I suspect this is because of my incorrectly formatted data (in case it's an important detail, my GPU is an RTX 2070 Super with 8 GB of memory). My question is: how should I format my data for the LSTM layer based on my goals? Or are my goals unrealistic?
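For reference, here is a minimal check (assuming the train_X built above) of how the C, B, and T dimensions of one formatted day line up:

% Inspect one formatted day (assumes train_X from the code above).
X = train_X{1};
dims(X)      % 'CBT'
size(X, 1)   % C: number of channels (predictors), 2 here
size(X, 2)   % B: batch dimension within the sequence, 1 here
size(X, 3)   % T: number of time steps, i.e. rows in that day's matrix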

Accepted Answer

Joss Knight on 1 Mar 2025

It seems like your observations (batch) are the days, time is the rows, and you perhaps have two channels in the first two columns. I'm not sure your data isn't already in the format trainnet expects. But it doesn't sound like you want a fully connected and softmax layer: you are not trying to classify each day, you are performing regression, trying to match your predictions to targets. You just want one or more LSTM layers separated by activation layers, and you want the last one to output a single channel.
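One possible reading of that suggestion, as a rough sketch (the layer widths, the 'mse' loss, and reusing ds and options from the question are assumptions, not part of the answer):

% Rough sketch: stacked LSTM layers separated by activation layers,
% with the last one emitting a single channel per time step.
layers = [
    sequenceInputLayer(2)                    % two predictor channels
    lstmLayer(128, 'OutputMode', 'sequence')
    reluLayer
    lstmLayer(1, 'OutputMode', 'sequence')   % single-channel output
    ];
net = dlnetwork(layers);
% For a regression loss, the targets would be the raw 0/1 column as a
% 1×1×T "CBT" dlarray per day, not the one-hot arrays from the question.
net = trainnet(ds, net, 'mse', options);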

  5 Comments
Joss Knight on 2 Mar 2025

Is there really no example in the MATLAB documentation that fits your use case? This isn't exactly my area of expertise and I'm just trying to avoid literally doing a search for you.

Classification and regression are not fundamentally different. The softmax operation is what lets us convert a regression loss (match the values) into a classification loss (match the highest number), but there's nothing fundamentally different about the underlying algorithms here.
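A tiny numeric illustration of that point:

% Softmax turns raw scores into probabilities; a cross-entropy loss then
% rewards putting the highest probability on the correct class.
z = [2.0; 0.5];             % raw outputs for two classes
p = exp(z) ./ sum(exp(z))   % softmax: approximately [0.82; 0.18]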

Eugen Fekete on 2 Mar 2025
Of course! I don't want to dump my work on you. I'm sorry if it seemed that way! I did look at numerous official MATLAB examples (I mostly based my network on this example), but unfortunately, most of them are outdated since they use the trainNetwork function, which is no longer recommended as of R2024a. I don't want to waste your time, you've already helped a lot, and I'm very grateful! I'll mark your comment as the accepted answer. Thanks again for the tips @Joss Knight!

