
Manage Data Sets for Machine Learning and Deep Learning Workflows

Use MATLAB® and Signal Processing Toolbox™ functionality to create a successful artificial intelligence (AI) workflow from labeling to training to deployment.

Common AI Tasks

Common AI tasks are signal classification, sequence-to-sequence classification, and regression. An AI model predicts:

  • For signal classification — A discrete class label for each input signal

  • For sequence-to-sequence classification — A label for each time step of the sequence data

  • For regression — A continuous numeric value

Data Organization

For many machine learning and deep learning applications, data sets are large and consist of both signal and label variables. Based on how your data set is organized, you can use datastores and functions in MATLAB and Signal Processing Toolbox to manage your data.

There are various methods to collect and store data that influence how you can access it in a workflow. In the data preparation stage, you might come across one or more of these common questions:

  • How do I organize my data?

  • How do I access data for training?

  • How do I create labels?

  • How do I combine signal and label data?

This table provides different data organization scenarios and shows you how to create datastores that correspond to these scenarios, so that you can access and prepare your data for your workflow.

Data Organization | Task | Related Datastore | Example
Signal and label variables stored separately in memory
  • Signal classification

Consider a data set consisting of signals stored in matrix sig and corresponding labels stored in vector lbls. Create an arrayDatastore object for the signal data and another for the labels. You can use the IterationDimension property of arrayDatastore to specify whether the data is stored in columns or rows.

ads1 = arrayDatastore(sig);
ads2 = arrayDatastore(lbls);

Use the combine function to combine the data from the two datastores into a single CombinedDatastore.

cds = combine(ads1,ads2);

Determine the count of each label in the data set. Specify the underlying datastore index to count the labels in ads2.

cnt = countlabels(cds,UnderlyingDatastoreIndex=2)
cnt =

  4×3 table

    Label    Count    Percent
    _____    _____    _______

      a       20        25   
      b       20        25   
      c       20        25   
      d       20        25   

Use the splitlabels function to split the data at random into training, validation, and testing sets.

idxs = splitlabels(cds,[0.7 0.2],"randomized",UnderlyingDatastoreIndex=2);
trainDs = subset(cds,idxs{1});
valDs = subset(cds,idxs{2});
testDs = subset(cds,idxs{3});

Count the number of labels in the training subset datastore.

trainCnt = countlabels(trainDs,UnderlyingDatastoreIndex=2)
trainCnt =

  4×3 table

    Label    Count    Percent
    _____    _____    _______

      a       14        25   
      b       14        25   
      c       14        25   
      d       14        25   
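
As a self-contained illustration of the steps above, this sketch builds a small synthetic data set and passes it through arrayDatastore and combine. The random signal values, the 1000-by-80 size, and the one-signal-per-column layout are assumptions made for illustration only.

```matlab
% Hypothetical data: 80 random signals with 1000 samples each, stored
% one signal per column, and 20 labels for each of four classes.
rng("default")
sig = randn(1000,80);
lbls = categorical(repelem(["a";"b";"c";"d"],20));

% Signals are stored in columns, so iterate along dimension 2.
ads1 = arrayDatastore(sig,IterationDimension=2);
% Labels form a column vector, so the default row iteration applies.
ads2 = arrayDatastore(lbls);
cds = combine(ads1,ads2);

% Each read returns one signal and its label in a 1-by-2 cell array.
sample = read(cds);
```

If the signals were stored one per row instead, the default IterationDimension of 1 would apply and the name-value argument could be dropped.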
Signal and label variables stored in separate MAT-files
  • Signal classification

Consider a data set consisting of two sets of MAT-files. The first set contains signal data and the second set contains corresponding labels. All files are saved in the same folder and have either "signal" or "label" as a prefix. Create a signalDatastore object that points to the location of the files.

sds = signalDatastore(datasetFolder);

Use the subset function to create two new datastores, where one datastore contains signal data and the other datastore contains label data. All signal data filenames contain "signal" and all label data filenames contain "label".

sigds = subset(sds,contains(sds.Files,"signal"));
lblds = subset(sds,contains(sds.Files,"label"));

Read the label data into memory. Convert the labels to a categorical array with categories a, b, and c.

labeldata = readall(lblds);
lblcat = categorical(labeldata,{'a' 'b' 'c'});

Create an arrayDatastore object that contains the categorical labels. Combine the labels with the signal data.

ads = arrayDatastore(lblcat);
allds = combine(sigds,ads);

Preview the first signal and the corresponding label in the datastore.

preview(allds)
ans =

  1×2 cell array

    {1000×1 double}    {[a]}

Note

A datastore parses files alphabetically. To ensure that signal variables and label variables stored in separate files are paired correctly, use a matching identifier for corresponding filenames.
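
One way to check that signal files and label files line up is to compare the identifiers in their names. The following is a minimal sketch under assumed naming: the file lists and the "signal_" and "label_" prefixes are hypothetical and stand in for string(sigds.Files) and string(lblds.Files).

```matlab
% Hypothetical file lists, standing in for the Files property of each datastore.
sigFiles = ["data/signal_001.mat"; "data/signal_002.mat"; "data/signal_010.mat"];
lblFiles = ["data/label_001.mat"; "data/label_002.mat"; "data/label_010.mat"];

% Strip the folder and extension, then the prefix, to leave the identifier.
[~,sigNames] = fileparts(sigFiles);
[~,lblNames] = fileparts(lblFiles);
sigIds = erase(sigNames,"signal_");
lblIds = erase(lblNames,"label_");

% The pairing is consistent only if the identifiers match element by element.
isPaired = isequal(sigIds,lblIds);
```

Running a check like this before calling combine can catch a missing or misnamed file that would otherwise shift every later signal onto the wrong label.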

Signal and label variables stored in a single MAT-file
  • Sequence-to-sequence classification

Consider a data set consisting of MAT-files that contain both signal (sig) and label (lbl) data. The files are saved in folders. Create a signalDatastore object that points to the location of the files and include subfolders in the path. Specify sig and lbl as the variable names in the SignalVariableNames property of the datastore.

sds = signalDatastore(datasetFolder,IncludeSubfolders=true, ...
      SignalVariableNames=["sig" "lbl"]);

Read the first pair of signal and label data.

read(sds)
ans =

  2×1 cell array

    {225000×1 double}
    {225000×1 categorical}

Divide the data at random into training and testing sets. Use 80% of the data to train the network and 20% of the data to test the network.

[trainIdx,~,testIdx] = dividerand(numel(sds.Files),0.8,0,0.2);
trainds = subset(sds,trainIdx);
testds = subset(sds,testIdx);
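
The dividerand function is part of Deep Learning Toolbox. When only base MATLAB is available, a comparable random split can be sketched with randperm. The file count of 10 is a placeholder for numel(sds.Files) and the rounding rule is an assumption; this sketch is not how dividerand itself is implemented.

```matlab
% Placeholder for numel(sds.Files).
numFiles = 10;

% Shuffle the file indices reproducibly.
rng("default")
idx = randperm(numFiles);

% Use the first 80% of the shuffled indices for training and the rest for testing.
numTrain = round(0.8*numFiles);
trainIdx = idx(1:numTrain);
testIdx = idx(numTrain+1:end);
```

The resulting index vectors can be passed to subset in the same way as the outputs of dividerand.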
Signals stored in MAT-files and labels stored in memory
  • Signal classification

  • Sequence-to-sequence classification

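Consider a data set consisting of signals stored in MAT-files in location folder and corresponding labels stored in array lbls in memory. For signal classification, lbls can be a vector with one label per signal. For sequence-to-sequence classification, the label values are stored in a matrix where each column corresponds to a label sequence; in that case, set the IterationDimension property of arrayDatastore to 2 so that each read returns one column. Create a signalDatastore object to consume the signal data and an arrayDatastore object from the labels.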

sds = signalDatastore(folder);
ads = arrayDatastore(lbls);

Use the combine function to combine the data from the two datastores into a single CombinedDatastore.

cds = combine(sds,ads)
cds = 

  CombinedDatastore with properties:

      UnderlyingDatastores: {[1×1 signalDatastore]  [1×1 matlab.io.datastore.ArrayDatastore]}
    SupportedOutputFormats: ["txt"    "csv"    "xlsx"    "xls"    "parquet"    "parq"    …    ]
Signals stored in MAT-files saved in folders containing label names
  • Signal classification

Consider a data set consisting of signals stored in MAT-files. The files are saved in folders, and each folder name corresponds to a label. Create a signalDatastore object that points to the location of the folders.

sds = signalDatastore(location);

Use the folders2labels function to obtain a list of label names. Create an arrayDatastore object containing the labels.

lbls = folders2labels(location,FileExtensions=".mat");
ads = arrayDatastore(lbls);

Combine the signal datastore and the array datastore using the combine function.

cds = combine(sds,ads);
Signals stored in MAT-files and region-of-interest (ROI) limits stored in separate MAT-files
  • Sequence-to-sequence classification

Consider a data set consisting of MAT-files that contain signal data and other MAT-files that contain label data. The label data is stored as region-of-interest tables that define a label value for different signal regions. Create two separate datastores to consume the data.

sds1 = signalDatastore(FileLocation1,SampleRate=fs);
sds2 = signalDatastore(FileLocation2,SignalVariableNames=["LabelVals";"LabelROIs"]);

Convert the ROI limits and labels to a categorical sequence that you can use to train a model.

i = 1;
lbls = {};   % Start from an empty cell array of label sequences
while hasdata(sds1)
    signal = read(sds1);
    label = read(sds2);

    % Convert label values to categorical vector
    labelCats = categorical(label{2,1}.Value,{'a' 'b' 'c' 'd'});

    % Convert label values and ROI limits to table for input into signalMask
    roiTable = table(label{2,1}.ROILimits,labelCats);
    m = signalMask(roiTable);

    % Obtain categorical sequence mask
    mask = catmask(m,length(signal));
    lbls{i} = mask;

    i = i+1;
end

% Store categorical sequence mask in array datastore
ads = arrayDatastore(lbls,IterationDimension=2);

Combine sds1 and ads into a single datastore.

sds4 = combine(sds1,ads);
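
To see what the conversion inside the loop produces, this sketch applies signalMask and catmask to a small hand-made ROI table. The ROI limits, the two labels, and the 10-sample signal length are hypothetical values chosen for illustration.

```matlab
% Hypothetical ROI table: samples 1 to 3 are labeled "a", samples 6 to 8 "b".
roiLimits = [1 3; 6 8];
labelCats = categorical(["a";"b"],["a" "b"]);
roiTable = table(roiLimits,labelCats);

% With no sample rate specified, the limits are treated as sample indices.
m = signalMask(roiTable);

% Expand the ROIs to one label per sample for a 10-sample signal.
mask = catmask(m,10);
```

Samples that fall outside every region come back as undefined categorical values, so a network trained on these sequences may need an explicit background category for them.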
Labeled signal set containing signal and label data
  • Signal classification

  • Sequence-to-sequence classification

Consider a labeled signal set lss that contains signal data and label information returned by the Signal Labeler app. The data set includes two recordings of whale songs. Use the getLabelNames function to obtain the list of label names in the labeled signal set. You can also retrieve label names for a specified label type.

lblnames = getLabelNames(lss)
lblnames = 3×1 string
    "WhaleType"
    "MoanRegions"
    "TrillRegions"

Use the createDatastores function to create a signalDatastore containing the signal data and an arrayDatastore containing the corresponding labels.

[sds,ads] = createDatastores(lss,lblnames)
sds = 

  signalDatastore with properties:

    MemberNames:{
                'Whale1';
                'Whale2'
                }
       Members: {2×1 cell}
      ReadSize: 1
    SampleRate: 4000


ads = 

  ArrayDatastore with properties:

              ReadSize: 1
    IterationDimension: 1
            OutputType: "cell"
Input and output signals stored in the same MAT-file
  • Regression

Consider a data set consisting of MAT-files stored in folder. Each file contains an input variable xIn and an output variable xOut that you want to feed to a regression model. Create a signal datastore that contains both variables.

sds = signalDatastore(folder,SignalVariableNames=["xIn" "xOut"]);

You can input sds directly to trainNetwork.
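
To try this layout without an existing data set, the following sketch writes a few synthetic MAT-files to a temporary folder and reads one input-output pair back. The file names, signal length, and moving-average target are assumptions used only to make the example self-contained.

```matlab
% Write three hypothetical recordings to a temporary folder.
folder = tempname;
mkdir(folder)
for k = 1:3
    xIn = randn(500,1);        % placeholder input signal
    xOut = movmean(xIn,5);     % placeholder regression target
    save(fullfile(folder,"rec" + k + ".mat"),"xIn","xOut")
end

% Read both variables from each file, in the order listed.
sds = signalDatastore(folder,SignalVariableNames=["xIn" "xOut"]);
data = read(sds);   % cell array holding xIn and xOut from the first file
```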

Consider a different data set consisting of MAT-files stored in location. Each file contains both input and output variables. Create two signal datastores to separate the variables.

inDs = signalDatastore(location,SignalVariableNames=["a" "b" "c"]);
outDs = signalDatastore(location,SignalVariableNames=["d" "e"]);

When your data is ready, you can use the trainNetwork (Deep Learning Toolbox) function to train a neural network. Common functions that you can use for network training, like trainNetwork or minibatchqueue (Deep Learning Toolbox), accept datastores as an input for training data and responses.

net = trainNetwork(ds,...)
For more information about how to create a deep learning network for signal classification, see Create Simple Deep Learning Neural Network for Classification (Deep Learning Toolbox).

Note

When data is stored in memory, you can input a cell array directly to the trainNetwork function. If you need to transform in-memory data before training, use a TransformedDatastore.
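
As a minimal sketch of transforming in-memory data, the following wraps a cell array of signals in an arrayDatastore and normalizes each signal when it is read. The two random signals and the choice of normalize as the transformation are assumptions for illustration.

```matlab
% Hypothetical in-memory data: two signals of different lengths.
rng("default")
signals = {3 + randn(100,1); 5*randn(120,1)};

% OutputType="same" makes each read return a 1-by-1 cell holding one signal.
ads = arrayDatastore(signals,OutputType="same");

% Normalize each signal to zero mean and unit standard deviation on read.
tds = transform(ads,@(c) {normalize(c{1})});
first = read(tds);
```

Because the transformation runs at read time, it is reapplied at every epoch unless the results are read into memory or written to files first.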

Data Preprocessing

Some workflows require you to preprocess the data before feeding it to a network. For example, you can resample, resize, or filter signals before or during training. You can precompute features or use datastore transformations to prepare the data for training.

Example: Compute Fourier synchrosqueezed transform (FSST)

Calculate the FSST of each signal in datastore ds.

fsstDs = transform(ds,@fsst);

If the transformed data fits in memory, use the readall function to read all of the data from the TransformedDatastore into memory so that the FSST computations are performed only once, before training, rather than at every epoch.

transformedData = readall(fsstDs);
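
The following sketch runs the same idea on synthetic in-memory data. The four random signals are hypothetical, and wrapping each FSST result in a cell is an assumption about the desired output: it keeps one matrix per signal, whereas readall would otherwise concatenate the matrices vertically.

```matlab
% Hypothetical data: four 1000-sample signals, one per column.
rng("default")
sig = randn(1000,4);

% OutputType="same" makes each read return a numeric column that fsst accepts.
ds = arrayDatastore(sig,IterationDimension=2,OutputType="same");

% Wrap each FSST matrix in a cell so that readall returns one cell per signal.
fsstDs = transform(ds,@(x) {fsst(x)});
transformedData = readall(fsstDs);
```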

Example: Extract time-frequency features from signal data

Obtain the short-time Fourier transform (STFT) of each signal in datastore ds. Call the transform function to compute the STFT and then use the writeall function to write the output to disk.

tds = transform(ds,@stft);
writeall(tds,outputLocation);

Create a new datastore that points to the out-of-memory features.

ds = signalDatastore(outputLocation);

Example: Filter and downsample signal data and downsample label data with custom preprocessing function

Create a datastore that points to a location containing files that each store signal data (variable data) and label data (variable labels).

sds = signalDatastore(location,SignalVariableNames=["data" "labels"]);

Define a custom preprocessing function that bandpass-filters and downsamples the signal data and downsamples the label data.

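function dataOut = downsampleData(dataIn)
    % dataIn is a cell array holding the signal and its labels
    sig = dataIn{1};
    lbls = dataIn{2};

    % Bandpass-filter the signal between 10 Hz and 400 Hz (sample rate 3000 Hz)
    filtsig = bandpass(sig,[10 400],3000);

    % Keep every third sample of both the signal and the labels
    downsig = downsample(filtsig,3);
    downlbls = downsample(lbls,3);

    % Return a cell array so that signal and labels keep their own data types
    dataOut = {downsig,downlbls};
end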

Call transform on sds to apply the custom preprocessing function to each file.

tds = transform(sds,@downsampleData);

For more information about preprocessing in deep learning workflows, see Preprocess Data for Domain-Specific Deep Learning Applications (Deep Learning Toolbox).

Workflow Scenarios

A general workflow for any machine learning or deep learning task involves these steps:

  1. Data preparation

  2. Network training

  3. Model deployment

This table shows examples and functions you can use to go from preparing data to training a network for signal classification tasks.

Example | Data | Related Functions | Highlights

Spoken Digit Recognition with Custom Log Spectrogram Layer and Deep Learning

  • .wav files

  • Filenames contain labels

  • File collection too large to fit in memory

Predict labels for audio recordings using deep convolutional neural network (DCNN) and custom log spectrogram layer

  • Define custom log spectrogram layer to insert into network

  • Compute log spectrogram of each signal inside network during training

Hand Gesture Classification Using Radar Signals and Deep Learning

  • .mat files

  • Each file contains three data matrices

  • Filenames contain labels

Preprocess signals using custom functions and train multiple-input single-output convolutional neural network (CNN)

  • Combine signal and label data into single datastore

  • Read all data into memory and apply preprocessing simultaneously

Train Spoken Digit Recognition Network Using Out-of-Memory Features

  • .wav files

  • Filenames contain labels

  • Collection of files too large to fit in memory

Predict labels for audio recordings using a network trained on mel-frequency spectrograms

  • Convert all signals to mel-frequency spectrograms

  • Write spectrograms to disk

  • Train CNN classifier

This table shows examples and functions you can use to go from preparing data to training a network for sequence-to-sequence classification tasks.

Example | Data | Related Functions | Highlights
Waveform Segmentation Using Deep Learning

  • .mat files

  • Each file contains:

    • Signal variable

    • Label variable

    • Sample rate variable

Segment regions of interest in signals

  • Transform region labels to categorical sequences such that each signal sample has a corresponding label

  • Train network

  • Apply filter and time-frequency transformations to signals to improve network performance

  • Retrain network

Classify Arm Motions Using EMG Signals and Deep Learning

  • .mat files

    • One set of files contains signal data

    • One set of files contains label data

Classify signal ROIs

  • Define regions of interest based on label data

  • Combine signals and labels into single datastore

  • Read all data into memory and apply preprocessing transformations to entire data set once before training

  • Train network

This table shows examples and functions you can use to go from preparing data to training a network for regression tasks.

Example | Data | Related Functions | Highlights

Denoise EEG Signals Using Differentiable Signal Processing Layers

  • .mat files

    • One file contains matrix of clean signal data

    • One file contains matrix of artifact data

Denoise signals using regression model

  • Generate pairs of clean and noisy signals

  • Define LSTM network with output regression layer

  • Train network with noisy signals as input and clean signals as requested output

  • Improve network performance using features extracted from the short-time Fourier transform

  • Denoise raw signals using deep learning regression

Tip

Use the read, readall, and writeall functions to read data in a datastore or write data from a datastore to files.

  • read — Use this function to read data iteratively from a datastore that contains file data or in-memory data.

  • readall — Use this function to read all the data in a datastore at once when the data set fits in memory. If the data set is too large to fit in memory, you can transform the data at each training epoch or use the writeall function to store the transformed data that you can then read using a signalDatastore.

  • writeall — Use this function to write preprocessed data that does not fit in memory to files. You can then create a new datastore that points to the location of the output files.
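
The difference between iterative and all-at-once reading can be sketched with a small in-memory datastore. The integer data is a hypothetical stand-in for signal data.

```matlab
% Hypothetical in-memory data: the integers 1 through 5.
ads = arrayDatastore((1:5)');

% Read one element at a time until the datastore is exhausted.
total = 0;
while hasdata(ads)
    c = read(ads);          % 1-by-1 cell for the default OutputType
    total = total + c{1};
end

% Rewind the datastore, then read everything at once.
reset(ads)
allData = readall(ads);
```

The same hasdata and read loop applies to file-based datastores, where it avoids loading the whole data set into memory.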

Available Data Sets

There are several data sets readily available for use in an AI workflow:

  • QT Database — 210 ECG signals with region labels. Available for download at https://www.mathworks.com/supportfiles/SPT/data/QTDatabaseECGData.zip.

  • EEGdenoiseNet — 4514 clean EEG segments and 3400 ocular artifact segments. Available for download at https://ssd.mathworks.com/supportfiles/SPT/data/EEGEOGDenoisingData.zip.

  • UWB-gestures — 96 multichannel UWB impulse radar signals. Available for download at https://ssd.mathworks.com/supportfiles/SPT/data/uwb-gestures.zip.

  • Myoelectric Data — 720 multichannel EMG signals with region labels. Available for download at https://ssd.mathworks.com/supportfiles/SPT/data/MyoelectricData.zip.

  • Mendeley Data — 327 accelerometer signals with class labels. Available for download at https://ssd.mathworks.com/supportfiles/wavelet/crackDetection/transverse_crack.zip.

For additional data sets, see Time Series and Signal Data Sets (Deep Learning Toolbox).
