Regularization is the technique for specifying constraints on a model's flexibility, thereby reducing uncertainty in the estimated parameter values.
Model parameters are obtained by fitting measured data to the predicted model response, such as a transfer function with three poles or a second-order state-space model. The model order is a measure of its flexibility — higher the order, the greater the flexibility. For example, a model with three poles is more flexible than one with two poles. Increasing the order causes the model to fit the observed data with increasing accuracy. However, the increased flexibility comes with the price of higher uncertainty in the estimates, measured by a higher value of random or variance error. On the other hand, choosing a model with too low an order leads to larger systematic errors. Such errors cannot be attributed to measurement noise and are also known as bias error.
Ideally, the parameters of a good model should minimize the mean square error (MSE), given by a sum of systematic error (bias) and random error (variance):
MSE = |Bias|^{2} + Variance
The minimization is thus a tradeoff in constraining the model. A flexible (high order) model gives small bias and large variance, while a simpler (low order) model results in larger bias and smaller variance errors. Typically, you can investigate this tradeoff between bias and variance errors by cross-validation tests on a set of models of increasing flexibility. However, such tests do not always give full control in managing the parameter estimation behavior. For example:
You cannot use the known (a priori) information about the model to influence the quality of the fits.
In grey-box and other structured models, the order is fixed by the underlying ODEs and cannot be changed. If the data is not rich enough to capture the full range of dynamic behavior, this typically leads to high uncertainty in the estimated values.
Varying the model order does not let you explicitly shape the variance of the underlying parameters.
Regularization gives you a better control over the bias versus variance tradeoff by introducing an additional term in the minimization criterion that penalizes the model flexibility. Without regularization, the parameter estimates are obtained by minimizing a weighted quadratic norm of the prediction errors ε(t,θ):
$${V}_{N}\left(\theta \right)={\displaystyle \sum}_{t=1}^{N}{\epsilon}^{2}(t,\theta )$$
where t is the time variable, N is the number of data samples, and ε(t,θ) is the predicted error computed as the difference between the observed output and the predicted output of the model.
Regularization modifies the cost function by adding a term proportional to the square of the norm of the parameter vector θ, so that the parameters θ are obtained by minimizing:
${\widehat{V}}_{N}\left(\theta \right)={\displaystyle \sum}_{t=1}^{N}{\epsilon}^{2}\left(t,\theta \right)+\lambda {\Vert \theta \Vert}^{2}$
where λ is a positive constant that has the effect of trading variance error in V_{N}(θ) for bias error — the larger the value of λ, the higher the bias and lower the variance of θ. The added term penalizes the parameter values with the effect of keeping their values small during estimation. In statistics, this type of regularization is called ridge regression. For more information, see Ridge Regression in the Statistics and Machine Learning Toolbox™ documentation.
Note: Another choice for the norm of θ vector is the L_{1}-norm, known as lasso regularization. However, System Identification Toolbox™ supports only the 2-norm based penalty, known as L_{2} regularization, as shown in the previous equation. |
The penalty term is made more effective by using a positive definite matrix R, which allows weighting and/or rotation of the parameter vector:
${\widehat{V}}_{N}\left(\theta \right)={\displaystyle \sum}_{t=1}^{N}{\epsilon}^{2}\left(t,\theta \right)+\lambda {\theta}^{T}R\theta $
The square matrix R gives additional freedom for:
Shaping the penalty term to meet the required constraints, such as keeping the model stable
Adding known information about the model parameters, such as reliability of the individual parameters in the θ vector
For structured models such as grey-box models, you may want to keep the estimated parameters close to their guess values to maintain the physical validity of the estimated model. This can be achieved by generalizing the penalty term to $$\lambda {\left(\theta -{\theta}^{*}\right)}^{T}R\left(\theta -{\theta}^{*}\right)$$, such that the cost function becomes:
${\widehat{V}}_{N}\left(\theta \right)={\displaystyle \sum}_{t=1}^{N}{\epsilon}^{2}\left(t,\theta \right)+\lambda {\left(\theta -{\theta}^{*}\right)}^{T}R\left(\theta -{\theta}^{*}\right)$
Minimizing this cost function has the effect of estimating θ such that their values remain close to initial guesses θ*.
In regularization:
θ* represents prior knowledge about the unknown parameters.
λ*R represents the confidence in the prior knowledge of the unknown parameters. This implies that the larger the value, the higher the confidence.
A formal interpretation in a Bayesian setting is that θ has a prior distribution that is Gaussian with mean θ* and covariance matrix $${\sigma}^{2}/\lambda {R}^{-1}$$, where σ^{2} is the variance of ε(t). The use of regularization can therefore be linked to some prior information about the system. This could be quite soft, such as the system is stable.
You can use the regularization variables λ and R as
tools to find a good model that balances complexity and provides the
best tradeoff between bias and variance. You can obtain regularized
estimates of parameters for transfer function, state-space, polynomial,
grey-box, process and nonlinear black-box models. The three terms
defining the penalty term, λ, R and θ*,
are represented by regularization options Lambda
, R
,
and Nominal
, respectively in the toolbox. You can
specify their values in the estimation option sets for both linear
and nonlinear models. In the System Identification app, click Regularization in
the linear model estimation dialog box or Estimation Options in
the Nonlinear Models dialog box.
Use regularization for:
Over-parameterized models are rich in parameters. Their estimation
typically yields parameter values with a high level of uncertainty.
Over-parameterization is quite common for nonlinear ARX (idnlarx
) models and can also be for linear
state-space models using free parameterization.
In such cases, regularization improves the numerical conditioning
of the estimation. You can explore the bias-vs.-variance tradeoff
using various values of the regularization constant Lambda
.
Typically, the Nominal
option is its default value
of 0
, and R
is an identity matrix
such that the following cost function is minimized:
${\widehat{V}}_{N}\left(\theta \right)={\displaystyle \sum}_{t=1}^{N}{\epsilon}^{2}\left(t,\theta \right)+\lambda {\Vert \theta \Vert}^{2}$
In the following example, a nonlinear ARX model estimation using a large number of neurons leads to an ill-conditioned estimation problem.
% Load estimation data load regularizationExampleData.mat nldata % Estimate model without regularization Orders = [1 2 1]; NL = sigmoidnet('NumberOfUnits',30); sys = nlarx(nldata,Orders,NL); compare(nldata,sys)
Applying even a small regularizing penalty produces a good fit for the model to the data.
% Estimate model using regularization constant λ = 1e-8;
opt = nlarxOptions;
opt.Regularization.Lambda = 1e-8;
sysr = nlarx(nldata,Orders,NL,opt);
compare(nldata,sysr)
In models derived from differential equations, the parameters have physical significance. You may have a good guess for typical values of those parameters even if the reliability of the guess may be different for each parameter. Because the model structure is fixed in such cases, you cannot simplify the structure to reduce variance errors.
Using the regularization constant Nominal
,
you can keep the estimated values close to their initial guesses.
You can also design R
to reflect the confidence
in the initial guesses of the parameters. For example, if θ
is a 2-element vector and you can guess the value of the first element
with more confidence than the second one, set R
to
be a diagonal matrix of size 2-by-2 such that R(1,1) >> R(2,2).
In the following example, a model of a DC motor is parameterized
by static gain G
and time constant τ. From
prior knowledge, suppose you know that G
is about
4 and τ is about 1. Also, assume that you have more confidence
in the value of τ than G
and would like to
guide the estimation to remain close to the initial guess.
% Load estimation data load regularizationExampleData.mat motorData % Create idgrey model for DC motor dynamics mi = idgrey(@DCMotorODE,{'G', 4; 'Tau', 1},'cd',{}, 0); mi = setpar(mi, 'label', 'default'); % Configure Regularization options opt = greyestOptions; opt.Regularization.Lambda = 100; % Specify that the second parameter better known than the first opt.Regularization.R = [1, 1000]; % Specify initial guess as Nominal opt.Regularization.Nominal = 'model'; % Estimate model sys = greyest(motorData, mi, opt) getpar(sys)
In many situations, you may know the shape of the system impulse
response from impact tests. For example, it is quite common for stable
systems to have an impulse response that is smooth and exponentially
decaying. You can use such prior knowledge of system behavior to derive
good values of regularization constants for linear-in-parameter models
such as ARX and FIR structure models using the arxRegul
command.
For black-box models of arbitrary structure, it is often difficult
to determine the optimal values of Lambda
and R
that
yield the best bias-vs.-variance tradeoff. Therefore, it is recommended
that you start by obtaining the regularized estimate of an ARX or
FIR structure model. Then, convert the model to a state-space, transfer
function or polynomial model using the idtf
, idss
, or idpoly
commands,
followed by order reduction if required.
In the following example, direct estimation of a 15th order continuous-time transfer function model fails due to numerical ill-conditioning.
% Load estimation data load dryer2 Dryer = iddata(y2,u2,0.08); Dryerd = detrend(Dryer,0); Dryerde = Dryerd(1:500); xe = Dryerd(1:500); ze = Dryerd(1:500); zv = Dryerd(501:end); % Estimate model without regularization sys1 = tfest(ze, 15);
Therefore, use regularized ARX estimation and then convert the model to transfer function structure.
% Specify regularization constants [L, R] = arxRegul(ze, [15 15 1]); optARX = arxOptions; optARX.Regularization.Lambda = L; optARX.Regularization.R = R; % Estimate ARX model sysARX = arx(ze, [15 15 1], optARX); % Convert model to continuous time sysc = d2c(sysARX); % Convert model to transfer function sys2 = idtf(sysc); % Validate the models sys1 and sys2 compare(zv, sys1, sys2)
A guideline for selecting the regularization constants λ
and R
is in the Bayesian interpretation. The added
penalty term is an assumption that the parameter vector θ is
a Gaussian random vector with mean θ* and covariance matrix $${\sigma}^{2}/\lambda {R}^{-1}$$.
You can relate naturally to such an assumption for a grey-box
model, where the parameters are of known physical interpretation.
In other cases, this may be more difficult. Then, you have to use
ridge regression (R
= 1; θ* = 0) and tune
λ by trial and error.
Use the following techniques for determining λ and R
values:
Tuning the regularization constants for ARX models in arxRegul
is based on simple assumptions
about the properties of the true impulse responses.
In the case of an FIR model, the parameter vector contains the impulse response coefficients b_{k} for the system. From prior knowledge of the system, it is often known that the impulse response is smooth and exponentially decaying:
$$E{\left[{b}_{k}\right]}^{2}=C{\mu}^{k},\text{}corr\left\{{b}_{k}{b}_{k-1}\right\}=\rho $$
where corr means correlation. The equation is a parameterization of the regularization constants in terms of coefficients C, μ, and ρ and the chosen shape (decaying polynomial) is called a kernel. The kernel thus contains information about parameterization of the prior covariance of the impulse response coefficients.
You can estimate the parameters of the kernel by adjusting them
to the measured data using the RegulKernel
input
of the arxRegul
command. For
example, the DC
kernel estimates all three parameters
while the TC
kernel links $$\rho =\sqrt{\mu}$$.
This technique of tuning kernels applies to all linear-in-parameter
models such as ARX and FIR models.
A general way to test and evaluate any regularization parameters is to estimate a model based on certain parameters on an estimation data set, and evaluate the model fit for another validation data set. This is known as cross-validation.
Cross-validation is entirely analogous to the method for selecting model order:
Generate a list of candidate λ and R
values
to be tested.
Estimate a model for each candidate regularization constant set.
Compare the model fit to the validation data.
Use the constants that give the best fit to the validation data.
For example:
% Create estimation and validation data sets ze = z(1:N/2); zv = z(N/2:end); % Specify regularization options and estimate models opt = ssestOptions; for tests = 1:M opt.Regularization.Lambda = Lvalue(test); opt.Regularization.R = Rvalue(test); m{test} = ssest(ze,order,opt); end % Compare models with validation data for model fit [~,fit] = compare(zv,m{:))
[1] L. Ljung. "Some Classical and Some New Ideas for Identification of Linear Systems." Journal of Control, Automation and Electrical Systems. April 2013, Volume 24, Issue 1-2, pp 3-10.
[2] L. Ljung, and T. Chen. "What can regularization offer for estimation of dynamical systems?" In Proceedings of IFAC International Workshop on Adaptation and Learning in Control and Signal Processing, ALCOSP13, Caen, France, July 2013.
[3] L. Ljung, and T. Chen. "Convexity issues in system identification." In Proceedings of the 10th IEEE International Conference on Control & Automation, ICCA 2013, Hangzhou, China, June 2013.