Superclasses: CompactLinearModel
Linear regression model class
An object comprising training data, model description, diagnostic
information, and fitted coefficients for a linear regression. Predict
model responses with the predict or feval methods.
mdl = fitlm(tbl) or mdl = fitlm(X,y) creates a linear model of a table
or dataset array tbl, or of the responses y to a data matrix X.
For details, see fitlm.
mdl = stepwiselm(tbl) or mdl = stepwiselm(X,y) creates a linear model
of a table or dataset array tbl, or of the responses y to a data
matrix X, with unimportant predictors excluded. For details, see
stepwiselm.
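As a sketch of the stepwise workflow (using the built-in hald data set that also appears in the examples below; the 'constant' starting model is an illustrative choice, not a requirement):

```matlab
% Stepwise linear regression on the hald data, starting from a
% constant model. stepwiselm adds or removes terms until no single
% change improves the criterion.
load hald
mdl = stepwiselm(ingredients, heat, 'constant');
disp(mdl.Formula)         % formula of the final model
disp(mdl.Steps.History)   % record of the steps taken
```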
tbl — Input data
Input data, specified as a table or dataset array. When modelspec is
a formula, it specifies the variables to be used as the predictors and
response. Otherwise, if you do not specify the predictor and response
variables, the last variable is the response variable and the others
are the predictor variables by default.
Predictor variables can be numeric, or any grouping variable type, such as logical or categorical (see Grouping Variables). The response must be numeric or logical.
To set a different column as the response variable, use the ResponseVar
name-value pair argument. To use a subset of the columns as predictors,
use the PredictorVars name-value pair argument.
X — Predictor variables
Predictor variables, specified as an n-by-p matrix, where n is the
number of observations and p is the number of predictor variables.
Each column of X represents one variable, and each row represents one
observation.
By default, there is a constant term in the model, unless you
explicitly remove it, so do not include a column of 1s in X.
Data Types: single | double | logical
y — Response variable
Response variable, specified as an n-by-1 vector, where n is the
number of observations. Each entry in y is the response for the
corresponding row of X.
Data Types: single | double | logical
CoefficientCovariance — Covariance matrix of coefficient estimates
Covariance matrix of coefficient estimates, specified as a p-by-p matrix of numeric values. p is the number of coefficients in the fitted model.
CoefficientNames — Coefficient names
Coefficient names, specified as a cell array of character vectors containing a label for each coefficient.
Coefficients — Coefficient values
Coefficient values, specified as a table. Coefficients has one row for
each coefficient and the following columns:
Estimate — Estimated coefficient value
SE — Standard error of the estimate
tStat — t-statistic for a test that the coefficient is zero
pValue — p-value for the t-statistic
To obtain any of these columns as a vector, index into the property
using dot notation. For example, in mdl
the estimated
coefficient vector is
beta = mdl.Coefficients.Estimate
Use coefTest
to perform other tests on the
coefficients.
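For example, coefTest can test linear hypotheses on the coefficients; a brief sketch (the contrast vector H is an illustrative assumption for a model with an intercept and four predictors):

```matlab
% Joint F-test that all slope coefficients are zero (the default test),
% and a test of a single coefficient via a contrast vector.
load hald
mdl = fitlm(ingredients, heat);
p  = coefTest(mdl);       % p-value of the joint test vs. the constant model
H  = [0 0 1 0 0];         % picks out the x2 coefficient (intercept is first)
p2 = coefTest(mdl, H);    % p-value for the hypothesis that x2's coefficient is 0
```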
DFE — Degrees of freedom for error
Degrees of freedom for error (residuals), equal to the number of observations minus the number of estimated coefficients, specified as a positive integer value.
Diagnostics — Diagnostic values
Diagnostic values, specified as a table with the same number of rows as
the input data (tbl or X). Diagnostics contains diagnostics helpful in
finding outliers and influential observations. Many diagnostics describe
the effect on the fit of deleting single observations. Diagnostics
contains the following fields.

Leverage — Diagonal elements of HatMatrix. Leverage indicates to what
extent the predicted value for an observation is determined by the
observed value for that observation. A value close to 1 indicates that
the prediction is largely determined by that observation, with little
contribution from the other observations. A value close to 0 indicates
the fit is largely determined by the other observations. For a model
with P coefficients and N observations, the average value of Leverage
is P/N. An observation with Leverage larger than 2*P/N can be regarded
as having high leverage.

CooksDistance — Cook's measure of scaled change in fitted values. An
observation with CooksDistance larger than three times the mean Cook's
distance can be an outlier.

Dffits — Delete-1 scaled differences in fitted values. Dffits is the
scaled change in the fitted values for each observation that would
result from excluding that observation from the fit. Values with an
absolute value larger than 2*sqrt(P/N) may be considered influential.

S2_i — Delete-1 variance estimates. S2_i is a set of residual variance
estimates obtained by deleting each observation in turn. These can be
compared with the value of the MSE property.

CovRatio — Delete-1 ratio of determinant of covariance. CovRatio is the
ratio of the determinant of the coefficient covariance matrix with each
observation deleted in turn to the determinant of the covariance matrix
for the full model. Values larger than 1+3*P/N or smaller than 1-3*P/N
indicate influential points.

Dfbetas — Delete-1 scaled differences in coefficient estimates. Dfbetas
is an N-by-P matrix of the scaled change in the coefficient estimates
that would result from excluding each observation in turn. Values larger
than 3/sqrt(N) in absolute value indicate that the observation has a
large influence on the corresponding coefficient.

HatMatrix — Projection matrix to compute fitted from observed responses.
HatMatrix is an N-by-N matrix such that Fitted = HatMatrix*Y, where Y is
the response vector and Fitted is the vector of fitted response values.

Rows not used in the fit because of missing values (in
ObservationInfo.Missing) or excluded values (in
ObservationInfo.Excluded) contain NaN values.
Delete-1 diagnostics refer to the statistic with and without that
observation (row) included in the fit. These diagnostics help identify
important observations.
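The rule-of-thumb thresholds above can be applied directly to the Diagnostics table; a minimal sketch using the hald data:

```matlab
% Flag observations with high leverage (> 2*P/N) or large Cook's
% distance (> 3 times the mean), per the guidelines above.
load hald
mdl = fitlm(ingredients, heat);
P = mdl.NumCoefficients;
N = mdl.NumObservations;
highLeverage = find(mdl.Diagnostics.Leverage > 2*P/N)
cd = mdl.Diagnostics.CooksDistance;
possibleOutliers = find(cd > 3*mean(cd, 'omitnan'))
```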
Fitted — Fitted response values based on input data
Fitted (predicted) response values based on input data, specified as an
n-by-1 vector of numeric values. n is the number of observations in the
input data. Use predict to compute predictions for other predictor
values, or to compute confidence bounds on Fitted.
Formula — Model information
Model information, specified as a LinearFormula object or a
NonLinearFormula object. If you fit a linear or generalized linear
regression model, then Formula is a LinearFormula object. If you fit
a nonlinear regression model, then Formula is a NonLinearFormula
object.
LogLikelihood — Log likelihood
Log likelihood of the model distribution at the response values, specified as a numeric value. The mean is fitted from the model, and other parameters are estimated as part of the model fit.
ModelCriterion — Criterion for model comparison
Criterion for model comparison, specified as a structure with the following fields:
AIC — Akaike information criterion. AIC = -2*logL + 2*m, where logL is
the log-likelihood and m is the number of estimated parameters.
AICc — Akaike information criterion corrected for the sample size.
AICc = AIC + (2*m*(m+1))/(n-m-1), where n is the number of observations.
BIC — Bayesian information criterion. BIC = -2*logL + m*log(n).
CAIC — Consistent Akaike information criterion.
CAIC = -2*logL + m*(log(n)+1).
Information criteria are model selection tools that you can use to compare multiple models fit to the same data. These criteria are likelihood-based measures of model fit that include a penalty for complexity (specifically, the number of parameters). Different information criteria are distinguished by the form of the penalty.
When you compare multiple models, the model with the lowest information criterion value is the best-fitting model. The best-fitting model can vary depending on the criterion used for model comparison.
To obtain any of the criterion values as a scalar, index into the property by using dot
notation. For example, in the model mdl
, the AIC value
aic
is:
aic = mdl.ModelCriterion.AIC
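For instance, to compare a full and a reduced model by AIC (the choice of predictors in the reduced model is an illustrative assumption):

```matlab
% Compare two nested models for the hald data by AIC; the model with
% the lower value is the better fit by this criterion.
load hald
mdl1 = fitlm(ingredients, heat);            % all four predictors
mdl2 = fitlm(ingredients(:,[1 2]), heat);   % predictors x1 and x2 only
aic1 = mdl1.ModelCriterion.AIC;
aic2 = mdl2.ModelCriterion.AIC;
fprintf('AIC full: %.2f, AIC reduced: %.2f\n', aic1, aic2)
```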
MSE — Mean squared error
Mean squared error (residuals), specified as a numeric value. The mean squared error is calculated as MSE = SSE / DFE, where SSE is the sum of squared errors and DFE is the degrees of freedom for error.
NumCoefficients — Number of model coefficients
Number of model coefficients, specified as a positive integer.
NumCoefficients includes coefficients that are set to zero when the
model terms are rank deficient.
NumEstimatedCoefficients — Number of estimated coefficients
Number of estimated coefficients in the model, specified as a positive
integer. NumEstimatedCoefficients does not include coefficients that
are set to zero when the model terms are rank deficient.
NumEstimatedCoefficients is the degrees of freedom for regression.
NumObservations — Number of observations
Number of observations the fitting function used in fitting, specified
as a positive integer. This is the number of observations supplied in
the original table, dataset, or matrix, minus any excluded rows (set
with the Exclude name-value pair) or rows with missing values.
NumPredictors — Number of predictor variables
Number of predictor variables used to fit the model, specified as a positive integer.
NumVariables — Number of variables
Number of variables in the input data, specified as a positive integer.
NumVariables is the number of variables in the original table or
dataset, or the total number of columns in the predictor matrix and
response vector when the fit is based on those arrays. It includes
variables, if any, that are not used as predictors or as the response.
ObservationInfo — Observation information
Observation information, specified as an n-by-4 table, where n is equal
to the number of rows of input data. The four columns of
ObservationInfo contain the following:
Weights — Observation weights. Default is all 1.
Excluded — Logical value, where 1 indicates an observation that you
excluded from the fit with the Exclude name-value pair.
Missing — Logical value, where 1 indicates a missing value in the
input. Missing values are not used in the fit.
Subset — Logical value, where 1 indicates the observation is not
excluded or missing, so is used in the fit.
ObservationNames — Observation names
Observation names, specified as a cell array of character vectors
containing the names of the observations used in the fit.
If the fit is based on a table or dataset containing observation names,
ObservationNames uses those names. Otherwise, ObservationNames is an
empty cell array.
PredictorNames — Names of predictors used to fit the model
Names of predictors used to fit the model, specified as a cell array of character vectors.
Residuals — Residuals for fitted model
Residuals for fitted model, specified as a table that contains one row
for each observation and the following columns.
Raw — Observed minus fitted values.
Pearson — Raw residuals divided by RMSE.
Standardized — Raw residuals divided by their estimated standard deviation.
Studentized — Residual divided by an independent estimate of the
residual standard deviation. The residual for observation i is divided
by an estimate of the error standard deviation based on all
observations except for observation i.
To obtain any of these columns as a vector, index into the property
using dot notation. For example, in a model mdl, the ordinary raw
residual vector r is:
r = mdl.Residuals.Raw
Rows not used in the fit because of missing values (in
ObservationInfo.Missing) contain NaN values.
Rows not used in the fit because of excluded values (in
ObservationInfo.Excluded) contain NaN values, with the following
exceptions:
Raw contains the difference between the observed and predicted values.
Standardized is the residual, standardized in the usual way.
Studentized matches the Standardized values because this residual is
not used in the estimate of the residual standard deviation.
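As a brief sketch, the Studentized column can be screened for large values (the |r| > 3 cutoff is a common rule of thumb, used here as an illustrative assumption rather than a fixed threshold):

```matlab
% Flag observations whose studentized residual exceeds 3 in magnitude.
load hald
mdl = fitlm(ingredients, heat);
rstud = mdl.Residuals.Studentized;
suspect = find(abs(rstud) > 3)   % indices of suspect observations, if any
```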
ResponseName — Response variable name
Response variable name, specified as a character vector.
RMSE — Root mean squared error
Root mean squared error (residuals), specified as a numeric value. The root mean squared error (RMSE) is equal to sqrt(MSE), where MSE is the mean squared error.
Robust — Robust fit information
Robust fit information, specified as a structure with the following fields:
WgtFun — Robust weighting function, such as 'bisquare' (see robustfit).
Tune — Value specified for the tuning parameter (can be []).
Weights — Vector of weights used in the final iteration of the robust
fit. This field is empty for compacted CompactLinearModel models.
This structure is empty unless fitlm constructed the model using robust
regression.
Rsquared — R-squared value for the model
R-squared value for the model, specified as a structure.
For a linear or nonlinear model, Rsquared is a structure with two fields:
Ordinary — Ordinary (unadjusted) R-squared
Adjusted — R-squared adjusted for the number of coefficients
For a generalized linear model, Rsquared is a structure with five fields:
Ordinary — Ordinary (unadjusted) R-squared
Adjusted — R-squared adjusted for the number of coefficients
LLR — Log-likelihood ratio
Deviance — Deviance
AdjGeneralized — Adjusted generalized R-squared
The R-squared value is the proportion of the total sum of squares
explained by the model. The ordinary R-squared value relates to the SSR
and SST properties:
Rsquared = SSR/SST = 1 - SSE/SST.
To obtain any of these values as a scalar, index into the property
using dot notation. For example, the adjusted Rsquared value in mdl
is
r2 = mdl.Rsquared.Adjusted
SSE — Sum of squared errors
Sum of squared errors (residuals), specified as a numeric value.
The Pythagorean theorem implies SST = SSE + SSR.
SSR — Regression sum of squares
Regression sum of squares, specified as a numeric value. The regression sum of squares is equal to the sum of squared deviations of the fitted values from their mean.
The Pythagorean theorem implies SST = SSE + SSR.
SST — Total sum of squares
Total sum of squares, specified as a numeric value. The total sum of
squares is equal to the sum of squared deviations of the response
vector y from mean(y).
The Pythagorean theorem implies SST = SSE + SSR.
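The decomposition can be checked numerically for any fitted model; for example:

```matlab
% Confirm SST = SSE + SSR up to floating-point round-off.
load hald
mdl = fitlm(ingredients, heat);
gap = mdl.SST - (mdl.SSE + mdl.SSR);
fprintf('SST - (SSE + SSR) = %g\n', gap)   % essentially zero
```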
Steps — Stepwise fitting information
Stepwise fitting information, specified as a structure with the following fields.
Start — Formula representing the starting model.
Lower — Formula representing the lower bound model; these terms must remain in the model.
Upper — Formula representing the upper bound model; the model cannot contain more terms than Upper.
Criterion — Criterion used for the stepwise algorithm, such as 'sse'.
PEnter — Value of the parameter, such as 0.05.
PRemove — Value of the parameter, such as 0.10.
History — Table representing the steps taken in the fit.
The History table has one row for each step, including the initial fit,
and the following variables (columns).
Action — Action taken during this step.
TermName — Name of the term the action applies to.
Terms — Terms matrix (see modelspec of fitlm).
DF — Regression degrees of freedom after this step.
delDF — Change in regression degrees of freedom from the previous step (negative for steps that remove a term).
Deviance — Deviance (residual sum of squares) at that step.
FStat — F statistic that led to this step.
PValue — p-value of the F statistic.
The structure is empty unless you use stepwiselm
or stepwiseglm
to fit the model.
VariableInfo — Information about input variables
Information about input variables contained in Variables, specified as
a table with one row for each model term and the following columns.
Class — Character vector giving the variable class, such as 'double'.
Range — Cell array giving the variable range.
InModel — Logical vector, where true indicates the variable is in the model.
IsCategorical — Logical vector, where true indicates a categorical variable.
VariableNames — Names of variables used in fit
Names of variables used in fit, specified as a cell array of character vectors.
If the fit is based on a table or dataset, this property provides the
names of the variables in that table or dataset.
If the fit is based on a predictor matrix and response vector,
VariableNames contains the values specified in the VarNames name-value
pair of the fitting method.
Otherwise, the variables have the default fitting names.
Variables — Data used to fit the model
Data used to fit the model, specified as a table. Variables contains
both observation and response values. If the fit is based on a table or
dataset array, Variables contains all of the data from that table or
dataset array. Otherwise, Variables is a table created from the input
data matrix X and response vector y.
addTerms  Add terms to linear regression model 
compact  Compact linear regression model 
dwtest  Durbin-Watson test of linear model 
fit  Create linear regression model 
plot  Scatter plot or added variable plot of linear model 
plotAdded  Added variable plot or leverage plot for linear model 
plotAdjustedResponse  Adjusted response plot for linear regression model 
plotDiagnostics  Plot diagnostics of linear regression model 
plotResiduals  Plot residuals of linear regression model 
removeTerms  Remove terms from linear model 
step  Improve linear regression model by adding or removing terms 
stepwise  Create linear regression model by stepwise regression 
anova  Analysis of variance for linear model 
coefCI  Confidence intervals of coefficient estimates of linear model 
coefTest  Linear hypothesis test on linear regression model coefficients 
disp  Display linear regression model 
feval  Evaluate linear regression model prediction 
plotEffects  Plot main effects of each predictor in linear regression model 
plotInteraction  Plot interaction effects of two predictors in linear regression model 
plotSlice  Plot of slices through fitted linear regression surface 
predict  Predict response of linear regression model 
random  Simulate responses for linear regression model 
Value. To learn how value classes affect copy operations, see Copying Objects (MATLAB).
Fit a linear model of the Hald data.
Load the data.
load hald
X = ingredients; % Predictor variables
y = heat;        % Response
Fit a default linear model to the data.
mdl = fitlm(X,y)
mdl = 
Linear regression model:
    y ~ 1 + x1 + x2 + x3 + x4

Estimated Coefficients:
                   Estimate      SE        tStat       pValue 
                   ________    _______    ________    ________

    (Intercept)      62.405     70.071      0.8906     0.39913
    x1               1.5511    0.74477      2.0827    0.070822
    x2              0.51017    0.72379     0.70486      0.5009
    x3              0.10191    0.75471     0.13503     0.89592
    x4             -0.14406    0.70905    -0.20317     0.84407

Number of observations: 13, Error degrees of freedom: 8
Root Mean Squared Error: 2.45
R-squared: 0.982,  Adjusted R-Squared 0.974
F-statistic vs. constant model: 111, p-value = 4.76e-07
Fit a model of a table that contains a categorical predictor.
Load the carsmall
data.
load carsmall
Construct a table containing continuous predictor variable Weight
, categorical predictor variable Year
, and response variable MPG
.
tbl = table(MPG,Weight);
tbl.Year = categorical(Model_Year);
Create a fitted model of MPG as a function of Year, Weight, and
Weight^2. You don't have to include Weight explicitly in your formula
because it is a lower-order term of Weight^2, so the fitlm function
includes it automatically.
mdl = fitlm(tbl,'MPG ~ Year + Weight^2')
mdl = 
Linear regression model:
    MPG ~ 1 + Weight + Year + Weight^2

Estimated Coefficients:
                    Estimate         SE         tStat       pValue  
                   __________    __________    _______    __________

    (Intercept)        54.206        4.7117     11.505    2.6648e-19
    Weight          -0.016404     0.0031249    -5.2493    1.0283e-06
    Year_76            2.0887       0.71491     2.9215     0.0044137
    Year_82            8.1864       0.81531     10.041    2.6364e-16
    Weight^2       1.5573e-06    4.9454e-07      3.149     0.0022303

Number of observations: 94, Error degrees of freedom: 89
Root Mean Squared Error: 2.78
R-squared: 0.885,  Adjusted R-Squared 0.88
F-statistic vs. constant model: 172, p-value = 5.52e-41
fitlm creates two dummy (indicator) variables for the categorical
variable Year. The dummy variable Year_76 takes the value 1 if the
model year is 1976 and 0 otherwise. The dummy variable Year_82 takes
the value 1 if the model year is 1982 and 0 otherwise. The year 1970 is
the reference year. The corresponding model includes an intercept and
terms for Weight, Year_76, Year_82, and Weight^2, as shown in the
fitted formula above.
Fit a linear regression model using a robust fitting method.
Load the sample data.
load hald
The hald
data measures the effect of cement composition on its hardening heat. The matrix ingredients
contains the percent composition of four chemicals present in the cement. The array heat
contains the heat of hardening after 180 days for each cement sample.
Fit a robust linear model to the data.
mdl = fitlm(ingredients,heat,'linear','RobustOpts','on')
mdl = 
Linear regression model (robust fit):
    y ~ 1 + x1 + x2 + x3 + x4

Estimated Coefficients:
                   Estimate      SE        tStat      pValue  
                   ________    _______    _______    ________

    (Intercept)       60.09     75.818    0.79256      0.4509
    x1               1.5753    0.80585     1.9548    0.086346
    x2               0.5322    0.78315    0.67957     0.51596
    x3              0.13346     0.8166    0.16343     0.87424
    x4              0.12052     0.7672    0.15709     0.87906

Number of observations: 13, Error degrees of freedom: 8
Root Mean Squared Error: 2.65
R-squared: 0.979,  Adjusted R-Squared 0.969
F-statistic vs. constant model: 94.6, p-value = 9.03e-07
The hat matrix H is defined in terms of the data matrix X:
H = X(X^{T}X)^{–1}X^{T}.
The diagonal elements h_{ii} satisfy
$$\begin{array}{l}0\le {h}_{ii}\le 1\\ {\displaystyle \sum _{i=1}^{n}{h}_{ii}}=p,\end{array}$$
where n is the number of observations (rows of X), and p is the number of coefficients in the regression model.
The leverage of observation i is the value of the ith diagonal term, h_{ii}, of the hat matrix H. Because the sum of the leverage values is p (the number of coefficients in the regression model), an observation i can be considered to be an outlier if its leverage substantially exceeds p/n, where n is the number of observations.
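These properties can be verified directly; a sketch for the hald data (building the design matrix by hand is the illustrative part, since fitlm does this internally):

```matlab
% Compute the hat matrix explicitly and check that its diagonal matches
% the Leverage diagnostic and that its trace equals the number of
% coefficients p.
load hald
mdl = fitlm(ingredients, heat);
X = [ones(size(ingredients,1),1) ingredients];  % design matrix with intercept
H = X / (X'*X) * X';                            % H = X*inv(X'*X)*X'
maxDiff = max(abs(diag(H) - mdl.Diagnostics.Leverage))  % ~0
traceH = trace(H)                               % equals p (5 here)
```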
Cook’s distance is the scaled change in fitted values.
Each element in CooksDistance
is the normalized
change in the vector of coefficients due to the deletion of an observation.
The Cook’s distance, D_{i},
of observation i is
$${D}_{i}=\frac{{\displaystyle \sum _{j=1}^{n}{\left({\widehat{y}}_{j}-{\widehat{y}}_{j(i)}\right)}^{2}}}{p\text{\hspace{0.17em}}MSE},$$
where
$${\widehat{y}}_{j}$$ is the jth fitted response value.
$${\widehat{y}}_{j(i)}$$ is the jth fitted response value, where the fit does not include observation i.
MSE is the mean squared error.
p is the number of coefficients in the regression model.
Cook’s distance is algebraically equivalent to the following expression:
$${D}_{i}=\frac{{r}_{i}^{2}}{p\text{\hspace{0.17em}}MSE}\left(\frac{{h}_{ii}}{{\left(1-{h}_{ii}\right)}^{2}}\right),$$
where r_{i} is the ith residual, and h_{ii} is the ith leverage value.
CooksDistance
is an nby1
column vector in the Diagnostics
table of the LinearModel
object.
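The algebraic form can be checked against the stored diagnostics; for example:

```matlab
% Recompute Cook's distance from residuals and leverage values, and
% compare with the Diagnostics table.
load hald
mdl = fitlm(ingredients, heat);
r = mdl.Residuals.Raw;
h = mdl.Diagnostics.Leverage;
p = mdl.NumCoefficients;
D = (r.^2 ./ (p * mdl.MSE)) .* (h ./ (1 - h).^2);
maxDiff = max(abs(D - mdl.Diagnostics.CooksDistance))   % ~0
```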
The main fitting algorithm is QR decomposition. For robust fitting,
the algorithm is robustfit
.
To remove redundant predictors in linear regression using lasso
or elastic net, use the lasso
function.
To regularize a regression with correlated terms using ridge
regression, use the ridge
or lasso
functions.
To regularize a regression with correlated terms using partial
least squares, use the plsregress
function.
Usage notes and limitations:
When you fit a model by using fitlm or stepwiselm, you cannot supply
training data in a table that contains at least one categorical
predictor, and you cannot use the 'CategoricalVars' name-value pair
argument. Code generation does not support categorical predictors. To
dummy-code variables that you want treated as categorical, preprocess
the categorical data by using dummyvar before fitting the model.
For more information, see Introduction to Code Generation.