# batchnorm

Normalize data across all observations for each channel independently

## Description

The batch normalization operation normalizes the input data across all observations for each channel independently. To speed up training of convolutional neural networks and reduce the sensitivity to network initialization, use batch normalization between convolution and nonlinear operations such as relu.

After normalization, the operation scales the result by a learnable scale factor γ and then shifts it by a learnable offset β.

The batchnorm function applies the batch normalization operation to dlarray data. Using dlarray objects makes working with high dimensional data easier by allowing you to label the dimensions. For example, you can label which dimensions correspond to spatial, time, channel, and batch dimensions using the "S", "T", "C", and "B" labels, respectively. For unspecified and other dimensions, use the "U" label. For dlarray object functions that operate over particular dimensions, you can specify the dimension labels by formatting the dlarray object directly, or by using the DataFormat option.

Note

To apply batch normalization within a layerGraph object or Layer array, use batchNormalizationLayer.

dlY = batchnorm(dlX,offset,scaleFactor) applies the batch normalization operation to the input data dlX using the population mean and variance of the input data and the specified offset and scale factor.

The function normalizes over the 'S' (spatial), 'T' (time), 'B' (batch), and 'U' (unspecified) dimensions of dlX for each channel in the 'C' (channel) dimension, independently.

For unformatted input data, use the 'DataFormat' option.

[dlY,popMu,popSigmaSq] = batchnorm(dlX,offset,scaleFactor) applies the batch normalization operation and also returns the population mean and variance of the input data dlX.

[dlY,updatedMu,updatedSigmaSq] = batchnorm(dlX,offset,scaleFactor,runningMu,runningSigmaSq) applies the batch normalization operation and also returns the updated moving mean and variance statistics. runningMu and runningSigmaSq are the mean and variance values after the previous training iteration, respectively.

Use this syntax to maintain running values for the mean and variance statistics during training. When you have finished training, use the final updated values of the mean and variance for the batch normalization operation during prediction and classification.

dlY = batchnorm(dlX,offset,scaleFactor,trainedMu,trainedSigmaSq) applies the batch normalization operation using the mean trainedMu and variance trainedSigmaSq.

Use this syntax during classification and prediction, where trainedMu and trainedSigmaSq are the final values of the mean and variance after you have finished training, respectively.
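The training and prediction syntaxes above fit together roughly as in the following sketch, which maintains running statistics across a few iterations and then reuses the final values as fixed statistics. The random mini-batches, the initial statistics (zeros and ones), and the iteration count are illustrative assumptions, not values prescribed by batchnorm.

```matlab
numChannels = 3;
offset = zeros(numChannels,1);
scaleFactor = ones(numChannels,1);

% Assumed starting values for the running statistics.
runningMu = zeros(numChannels,1);
runningSigmaSq = ones(numChannels,1);

% Training: feed each updated statistic back in on the next iteration.
for iteration = 1:5
    % Random data stands in for a real mini-batch.
    dlX = dlarray(rand(28,28,numChannels,32),'SSCB');
    [dlY,runningMu,runningSigmaSq] = batchnorm(dlX,offset,scaleFactor, ...
        runningMu,runningSigmaSq);
end

% Prediction: the final running values act as fixed statistics.
trainedMu = runningMu;
trainedSigmaSq = runningSigmaSq;
dlXNew = dlarray(rand(28,28,numChannels,8),'SSCB');
dlYNew = batchnorm(dlXNew,offset,scaleFactor,trainedMu,trainedSigmaSq);
```

The only difference between the two calls is the number of requested outputs: asking for the updated statistics corresponds to the training syntax, while a single output corresponds to the prediction syntax.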

[___] = batchnorm(___,'DataFormat',FMT) applies the batch normalization operation to unformatted input data with format specified by FMT using any of the input or output combinations in previous syntaxes. The output dlY is an unformatted dlarray object with dimensions in the same order as dlX. For example, 'DataFormat','SSCB' specifies data for 2-D image input with the format 'SSCB' (spatial, spatial, channel, batch).

[___] = batchnorm(___,Name,Value) specifies additional options using one or more name-value pair arguments. For example, 'MeanDecay',0.3 sets the decay rate of the moving average computation.
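As a rough illustration of the name-value syntax, the sketch below sets both decay rates explicitly. The data and the starting statistics are made up for the example. Because the inputs are uniform random numbers with mean near 0.5 and the starting mean is 0, the moving-average update described under 'MeanDecay' suggests an updated mean near 0.3*0.5 + 0.7*0 = 0.15.

```matlab
numChannels = 5;
dlX = dlarray(rand(10,10,numChannels,20),'SSCB');
offset = zeros(numChannels,1);
scaleFactor = ones(numChannels,1);

% Assumed starting values for the running statistics.
runningMu = zeros(numChannels,1);
runningSigmaSq = ones(numChannels,1);

% Weight the current batch by 0.3 for the mean and 0.5 for the variance.
[dlY,updatedMu,updatedSigmaSq] = batchnorm(dlX,offset,scaleFactor, ...
    runningMu,runningSigmaSq,'MeanDecay',0.3,'VarianceDecay',0.5);
```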

## Examples

Create a formatted dlarray object containing a batch of 128 28-by-28 images with 3 channels. Specify the format 'SSCB' (spatial, spatial, channel, batch).

miniBatchSize = 128;
inputSize = [28 28];
numChannels = 3;
X = rand(inputSize(1),inputSize(2),numChannels,miniBatchSize);
dlX = dlarray(X,'SSCB');

View the size and format of the input data.

size(dlX)
ans = 1×4

28    28     3   128

dims(dlX)
ans =
'SSCB'

Initialize the scale and offset for batch normalization. For the scale, specify a vector of ones. For the offset, specify a vector of zeros.

scaleFactor = ones(numChannels,1);
offset = zeros(numChannels,1);

Apply the batch normalization operation using the batchnorm function and return the mini-batch statistics.

[dlY,mu,sigmaSq] = batchnorm(dlX,offset,scaleFactor);

View the size and format of the output dlY.

size(dlY)
ans = 1×4

28    28     3   128

dims(dlY)
ans =
'SSCB'

View the mini-batch mean mu.

mu
mu = 3×1

0.4998
0.4993
0.5011

View the mini-batch variance sigmaSq.

sigmaSq
sigmaSq = 3×1

0.0831
0.0832
0.0835
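These values can be cross-checked without batchnorm. The sketch below recomputes per-channel statistics in plain MATLAB by reducing over the spatial and batch dimensions (1, 2, and 4). Normalizing the variance by the number of elements, rather than by one less, is an assumption here; with this many elements per channel the difference is negligible. The data is regenerated, so the numbers differ slightly from those shown above, but they should sit near 0.5 and 1/12 ≈ 0.0833, the mean and variance of uniform random numbers on [0,1].

```matlab
% Same shape as the example data: 28-by-28 images, 3 channels, 128 observations.
X = rand(28,28,3,128);

% Reduce over every dimension except the channel dimension (dimension 3).
manualMu = squeeze(mean(X,[1 2 4]))
manualSigmaSq = squeeze(var(X,1,[1 2 4]))
```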

Use the batchnorm function to normalize several batches of data and update the statistics of the whole data set after each normalization.

Create three batches of data. The data consists of 10-by-10 random arrays with five channels. Each batch contains 20 observations. The second and third batches are scaled by a multiplicative factor of 1.5 and 2.5, respectively, so the mean of the data set increases with each batch.

height = 10;
width = 10;
numChannels = 5;
observations = 20;

X1 = rand(height,width,numChannels,observations);
dlX1 = dlarray(X1,"SSCB");

X2 = 1.5*rand(height,width,numChannels,observations);
dlX2 = dlarray(X2,"SSCB");

X3 = 2.5*rand(height,width,numChannels,observations);
dlX3 = dlarray(X3,"SSCB");

Create the learnable parameters.

offset = zeros(numChannels,1);
scale = ones(numChannels,1);

Normalize the first batch of data dlX1 using batchnorm. Obtain the values of the mean and variance of this batch as outputs.

[dlY1,mu,sigmaSq] = batchnorm(dlX1,offset,scale);

Normalize the second batch of data dlX2. Use mu and sigmaSq as inputs to obtain updated running values of the mean and variance, which blend the statistics of dlX2 with those of dlX1 according to the 'MeanDecay' and 'VarianceDecay' values.

[dlY2,datasetMu,datasetSigmaSq] = batchnorm(dlX2,offset,scale,mu,sigmaSq);

Normalize the final batch of data dlX3. Pass the running statistics datasetMu and datasetSigmaSq as inputs to obtain updated values that also reflect dlX3, so the result carries information from all three batches dlX1, dlX2, and dlX3.

[dlY3,datasetMuFull,datasetSigmaSqFull] = batchnorm(dlX3,offset,scale,datasetMu,datasetSigmaSq);

Observe the change in the mean of each channel as each batch is normalized.

plot([mu datasetMu datasetMuFull]')
legend("Channel " + string(1:5),"Location","southeast")
xticks([1 2 3])
xlabel("Number of Batches")
xlim([0.9 3.1])
ylabel("Per-Channel Mean")
title("Data Set Mean")

## Input Arguments

dlX: Input data, specified as a formatted dlarray, an unformatted dlarray, or a numeric array.

If dlX is an unformatted dlarray or a numeric array, then you must specify the format using the 'DataFormat' option. If dlX is a numeric array, then either scaleFactor or offset must be a dlarray object.

dlX must have a 'C' (channel) dimension.

offset: Offset β, specified as a formatted dlarray, an unformatted dlarray, or a numeric array with one nonsingleton dimension with size matching the size of the 'C' (channel) dimension of the input dlX.

If offset is a formatted dlarray object, then the nonsingleton dimension must have label 'C' (channel).

scaleFactor: Scale factor γ, specified as a formatted dlarray, an unformatted dlarray, or a numeric array with one nonsingleton dimension with size matching the size of the 'C' (channel) dimension of the input dlX.

If scaleFactor is a formatted dlarray object, then the nonsingleton dimension must have label 'C' (channel).

runningMu: Running value of the mean statistic, specified as a numeric vector of the same length as the 'C' dimension of the input data.

To maintain a running value for the mean during training, provide runningMu as the updatedMu output of the previous training iteration.

Data Types: single | double

runningSigmaSq: Running value of the variance statistic, specified as a numeric vector of the same length as the 'C' dimension of the input data.

To maintain a running value for the variance during training, provide runningSigmaSq as the updatedSigmaSq output of the previous training iteration.

Data Types: single | double

trainedMu: Final value of the mean statistic after training, specified as a numeric vector of the same length as the 'C' dimension of the input data.

During classification and prediction, provide trainedMu as the updatedMu output of the final training iteration.

Data Types: single | double

trainedSigmaSq: Final value of the variance statistic after training, specified as a numeric vector of the same length as the 'C' dimension of the input data.

During classification and prediction, provide trainedSigmaSq as the updatedSigmaSq output of the final training iteration.

Data Types: single | double

### Name-Value Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'MeanDecay',0.3,'VarianceDecay',0.5 sets the decay rate for the moving average computations of the mean and variance of several batches of data to 0.3 and 0.5, respectively.

DataFormat: Dimension order of unformatted input data, specified as a character vector or string scalar FMT that provides a label for each dimension of the data.

When you specify the format of a dlarray object, each character provides a label for each dimension of the data and must be one of the following:

• "S" — Spatial

• "C" — Channel

• "B" — Batch (for example, samples and observations)

• "T" — Time (for example, time steps of sequences)

• "U" — Unspecified

You can specify multiple dimensions labeled "S" or "U". You can use the labels "C", "B", and "T" at most once.

You must specify DataFormat when the input data is not a formatted dlarray.

Data Types: char | string
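A minimal sketch of the two cases that need 'DataFormat', using made-up data: an unformatted dlarray, and a plain numeric array paired with a dlarray offset (following the requirement that either scaleFactor or offset be a dlarray when the input is numeric).

```matlab
X = rand(28,28,3,16);
offset = zeros(3,1);
scaleFactor = ones(3,1);

% Unformatted dlarray: the dimension labels come from 'DataFormat'.
dlY1 = batchnorm(dlarray(X),offset,scaleFactor,'DataFormat','SSCB');

% Numeric input: here the offset is wrapped in a dlarray.
dlY2 = batchnorm(X,dlarray(offset),scaleFactor,'DataFormat','SSCB');

dims(dlY1)     % expected to be empty, because the output is unformatted
class(dlY2)
```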

Variance offset for preventing divide-by-zero errors, specified as the comma-separated pair consisting of 'Epsilon' and a numeric scalar greater than or equal to 1e-5.

Data Types: single | double
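To see why the variance offset matters, consider a channel whose values are all identical, so its variance is exactly zero. The following plain-MATLAB sketch (it does not call batchnorm, and the value 1e-5 is chosen only for illustration) compares the normalization with and without the offset.

```matlab
x = 3*ones(4,4,1,8);         % constant channel: variance is exactly 0
mu = mean(x,'all');
sigmaSq = var(x,1,'all');
epsilon = 1e-5;              % illustrative variance offset

withoutEps = (x - mu)./sqrt(sigmaSq);            % 0/0 evaluates to NaN
withEps = (x - mu)./sqrt(sigmaSq + epsilon);     % 0 divided by a small positive number

any(isnan(withoutEps),'all')
any(isnan(withEps),'all')
```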

MeanDecay: Decay value for the moving mean computation, specified as a numeric scalar between 0 and 1.

The function updates the moving mean value using

$\mu^{*} = \lambda_{\mu}\hat{\mu} + (1-\lambda_{\mu})\mu,$

where $\mu^{*}$ denotes the updated mean updatedMu, $\lambda_{\mu}$ denotes the mean decay value 'MeanDecay', $\hat{\mu}$ denotes the mean of the input data, and $\mu$ denotes the current value of the mean runningMu.

Data Types: single | double

VarianceDecay: Decay value for the moving variance computation, specified as a numeric scalar between 0 and 1.

The function updates the moving variance value using

${\sigma^{2}}^{*} = \lambda_{\sigma^{2}}\widehat{\sigma^{2}} + (1-\lambda_{\sigma^{2}})\sigma^{2},$

where ${\sigma^{2}}^{*}$ denotes the updated variance updatedSigmaSq, $\lambda_{\sigma^{2}}$ denotes the variance decay value 'VarianceDecay', $\widehat{\sigma^{2}}$ denotes the variance of the input data, and $\sigma^{2}$ denotes the current value of the variance runningSigmaSq.

Data Types: single | double
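The two moving-average updates can be worked through by hand. The sketch below applies them in plain MATLAB to made-up two-channel statistics, using the decay values 0.3 and 0.5 from the earlier example; the variable names batchMu and batchSigmaSq are hypothetical stand-ins for the statistics of the current input data.

```matlab
meanDecay = 0.3;
varianceDecay = 0.5;

runningMu = [0; 1];          % current running mean per channel
runningSigmaSq = [1; 2];     % current running variance per channel
batchMu = [1; 3];            % mean of the current input data
batchSigmaSq = [3; 4];       % variance of the current input data

% Weighted blend of the new batch statistic and the previous running value.
updatedMu = meanDecay*batchMu + (1 - meanDecay)*runningMu
% 0.3*1 + 0.7*0 = 0.3 and 0.3*3 + 0.7*1 = 1.6

updatedSigmaSq = varianceDecay*batchSigmaSq + (1 - varianceDecay)*runningSigmaSq
% 0.5*3 + 0.5*1 = 2 and 0.5*4 + 0.5*2 = 3
```

A decay value close to 1 makes the running statistic follow the latest batch closely, while a value close to 0 makes it change slowly.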

## Output Arguments

dlY: Normalized data, returned as a dlarray with the same underlying data type as dlX.

If the input data dlX is a formatted dlarray, then dlY has the same format as dlX. If the input data is not a formatted dlarray, then dlY is an unformatted dlarray with the same dimension order as the input data.

The size of the output dlY matches the size of the input dlX.

popMu: Per-channel mean of the input data, returned as a numeric column vector with length equal to the size of the 'C' dimension of the input data.

popSigmaSq: Per-channel variance of the input data, returned as a numeric column vector with length equal to the size of the 'C' dimension of the input data.

updatedMu: Updated mean statistic, returned as a numeric vector with length equal to the size of the 'C' dimension of the input data.

The function updates the moving mean value using

$\mu^{*} = \lambda_{\mu}\hat{\mu} + (1-\lambda_{\mu})\mu,$

where $\mu^{*}$ denotes the updated mean updatedMu, $\lambda_{\mu}$ denotes the mean decay value 'MeanDecay', $\hat{\mu}$ denotes the mean of the input data, and $\mu$ denotes the current value of the mean runningMu.

updatedSigmaSq: Updated variance statistic, returned as a numeric vector with length equal to the size of the 'C' dimension of the input data.

The function updates the moving variance value using

${\sigma^{2}}^{*} = \lambda_{\sigma^{2}}\widehat{\sigma^{2}} + (1-\lambda_{\sigma^{2}})\sigma^{2},$

where ${\sigma^{2}}^{*}$ denotes the updated variance updatedSigmaSq, $\lambda_{\sigma^{2}}$ denotes the variance decay value 'VarianceDecay', $\widehat{\sigma^{2}}$ denotes the variance of the input data, and $\sigma^{2}$ denotes the current value of the variance runningSigmaSq.

## Algorithms

The batch normalization operation normalizes the elements $x_i$ of the input by first calculating the mean $\mu_B$ and variance $\sigma_B^2$ over the spatial, time, and observation dimensions for each channel independently. Then, it calculates the normalized activations as

$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},$

where $\epsilon$ is a constant that improves numerical stability when the variance is very small.

To allow for the possibility that inputs with zero mean and unit variance are not optimal for the operations that follow batch normalization, the batch normalization operation further shifts and scales the activations using the transformation

$y_i = \gamma \hat{x}_i + \beta,$

where the offset β and scale factor γ are learnable parameters that are updated during network training.

To make predictions with the network after training, batch normalization requires a fixed mean and variance to normalize the data. This fixed mean and variance can be calculated from the training data after training, or approximated during training using running statistic computations.
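The normalization and the scale-and-shift step described in this section can be sketched in plain MATLAB without dlarray objects. This is an illustration of the arithmetic, not the implementation used by batchnorm; the epsilon value and the normalization of the variance by the element count are assumptions.

```matlab
X = rand(10,10,5,20);          % height-by-width-by-channel-by-observation
gamma = ones(1,1,5);           % scale factor, one value per channel
beta = zeros(1,1,5);           % offset, one value per channel
epsilon = 1e-5;                % assumed stability constant

% Per-channel statistics over the spatial and observation dimensions.
muB = mean(X,[1 2 4]);
sigmaSqB = var(X,1,[1 2 4]);

% Normalize, then scale and shift. Implicit expansion broadcasts the
% 1-by-1-by-5 statistics across the other dimensions.
Xhat = (X - muB)./sqrt(sigmaSqB + epsilon);
Y = gamma.*Xhat + beta;

% With gamma = 1 and beta = 0, each channel of Y should have a mean
% near 0 and a variance near 1.
squeeze(mean(Y,[1 2 4]))
squeeze(var(Y,1,[1 2 4]))
```

For prediction, the same two lines that compute Xhat and Y apply, with muB and sigmaSqB replaced by the fixed statistics obtained from training.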