regress

Multiple linear regression

Description

b = regress(y,X) returns a vector b of coefficient estimates for a multiple linear regression of the responses in vector y on the predictors in matrix X. To compute coefficient estimates for a model with a constant term (intercept), include a column of ones in the matrix X.

[b,bint] = regress(y,X) also returns a matrix bint of 95% confidence intervals for the coefficient estimates.

[b,bint,r] = regress(y,X) also returns an additional vector r of residuals.

[b,bint,r,rint] = regress(y,X) also returns a matrix rint of intervals that can be used to diagnose outliers.

[b,bint,r,rint,stats] = regress(y,X) also returns a vector stats that contains the ${R}^{2}$ statistic, the F-statistic and its p-value, and an estimate of the error variance. The matrix X must include a column of ones for the software to compute the model statistics correctly.

[___] = regress(y,X,alpha) uses a 100*(1-alpha)% confidence level to compute bint and rint. Specify any of the output argument combinations in the previous syntaxes.
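The following is a minimal sketch of how alpha affects the returned intervals. The data are synthetic and purely illustrative (the variables x, y, and X are assumptions, not part of any shipped data set). For a full-rank X, the coefficient estimates should agree with the least-squares solution X\y, and a smaller alpha should give wider confidence intervals.

```matlab
% Synthetic, deterministic data used only for illustration
x = (1:20)';
y = 2 + 0.5*x + 0.3*sin(x);          % linear trend plus a small perturbation
X = [ones(size(x)) x];               % column of ones adds an intercept

[b,bint95] = regress(y,X);           % default alpha = 0.05 (95% intervals)
[~,bint99] = regress(y,X,0.01);      % alpha = 0.01 (99% intervals)

% Interval widths for each coefficient
width95 = bint95(:,2) - bint95(:,1);
width99 = bint99(:,2) - bint99(:,1);
disp([width95 width99])              % second column is larger than the first
```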

Examples

Load the carsmall data set. Identify weight and horsepower as predictors and mileage as the response.

load carsmall
x1 = Weight;
x2 = Horsepower;    % Contains NaN data
y = MPG;

Compute the regression coefficients for a linear model with an interaction term.

X = [ones(size(x1)) x1 x2 x1.*x2];
b = regress(y,X)    % Removes NaN data
b = 4×1

60.7104
-0.0102
-0.1882
0.0000

Plot the data and the model.

scatter3(x1,x2,y,'filled')
hold on
x1fit = min(x1):100:max(x1);
x2fit = min(x2):10:max(x2);
[X1FIT,X2FIT] = meshgrid(x1fit,x2fit);
YFIT = b(1) + b(2)*X1FIT + b(3)*X2FIT + b(4)*X1FIT.*X2FIT;
mesh(X1FIT,X2FIT,YFIT)
xlabel('Weight')
ylabel('Horsepower')
zlabel('MPG')
view(50,10)
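hold off

Load the examgrades data set. Use the last exam scores as response data and the first two exam scores as predictor data.

load examgrades
y = grades(:,5);
X = [ones(size(grades(:,1))) grades(:,1:2)];    % Includes column of ones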

Perform multiple linear regression with alpha = 0.01.

[~,~,r,rint] = regress(y,X,0.01);

Diagnose outliers by finding the residual intervals rint that do not contain 0.

contain0 = (rint(:,1)<0 & rint(:,2)>0);
idx = find(contain0==false)
idx = 2×1

53
54

Observations 53 and 54 are possible outliers.

Create a scatter plot of the residuals. Fill in the points corresponding to the outliers.

hold on
scatter(y,r)
scatter(y(idx),r(idx),'b','filled')
ylabel("Residuals")
hold off

Load the hald data set. Use heat as the response variable and ingredients as the predictor data.

load hald
y = heat;
X1 = ingredients;
x1 = ones(size(X1,1),1);
X = [x1 X1];    % Includes column of ones

Perform multiple linear regression and generate model statistics.

[~,~,~,~,stats] = regress(y,X)
stats = 1×4

0.9824  111.4792    0.0000    5.9830

Because the ${R}^{2}$ value of 0.9824 is close to 1, and the p-value of 0.0000 is less than the default significance level of 0.05, a significant linear regression relationship exists between the response y and the predictor variables in X.
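As a rough check on what the four elements of stats mean, the sketch below recomputes them from the residuals using the standard textbook formulas for a model with an intercept. It is an illustration under that assumption, not a description of the internal implementation; the names SSE, SST, MSE, and statsManual are introduced here for the example.

```matlab
load hald
y = heat;
X = [ones(size(ingredients,1),1) ingredients];   % includes column of ones
[~,~,r,~,stats] = regress(y,X);

[n,p] = size(X);                     % n observations, p coefficients
SSE = sum(r.^2);                     % residual sum of squares
SST = sum((y - mean(y)).^2);         % total sum of squares about the mean
R2  = 1 - SSE/SST;                   % coefficient of determination
MSE = SSE/(n - p);                   % estimate of the error variance
F   = ((SST - SSE)/(p - 1))/MSE;     % F-statistic against the constant model
pval = fcdf(F,p-1,n-p,'upper');      % upper-tail probability of F

statsManual = [R2 F pval MSE];
disp([stats; statsManual])           % the two rows should agree closely
```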

Input Arguments

Response data, specified as an n-by-1 numeric vector. Rows of y correspond to different observations. y must have the same number of rows as X.

Data Types: single | double

Predictor data, specified as an n-by-p numeric matrix. Rows of X correspond to observations, and columns correspond to predictor variables. X must have the same number of rows as y.

Data Types: single | double

Significance level, specified as a positive scalar. alpha must be between 0 and 1. The default value is 0.05.

Data Types: single | double

Output Arguments

Coefficient estimates for multiple linear regression, returned as a numeric vector. b is a p-by-1 vector, where p is the number of columns in X (the predictors, including any column of ones). If the columns of X are linearly dependent, regress sets the maximum number of elements of b to zero.
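The linearly dependent case can be illustrated with a small sketch. The data below are synthetic (x, y, and X are assumptions for the example), and the third column of X is an exact multiple of the second. In this situation regress is expected to warn that X is rank deficient and to return at least one coefficient that is exactly zero, while the fitted values still match those of the reduced two-column model.

```matlab
% Synthetic data with a redundant predictor column
x = (1:10)';
y = 1 + 2*x + 0.1*cos(x);
X = [ones(size(x)) x 2*x];           % third column = 2 * second column

b = regress(y,X);                    % a rank-deficiency warning is expected
numZero = sum(b == 0);               % count of coefficients set to zero
disp(b)

% Fitted values agree with the fit that drops the redundant column
Xfull = X(:,1:2);
yhatReduced = Xfull*(Xfull\y);
disp(norm(X*b - yhatReduced))        % close to zero
```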

Data Types: double

Lower and upper confidence bounds for coefficient estimates, returned as a numeric matrix. bint is a p-by-2 matrix, where p is the number of columns in X. The first column of bint contains lower confidence bounds for each of the coefficient estimates; the second column contains upper confidence bounds. If the columns of X are linearly dependent, regress returns zeros in elements of bint corresponding to the zero elements of b.
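To make the meaning of bint concrete, the sketch below rebuilds the intervals from the usual t-based formula, b ± t·se(b), for a full-rank X. This is a sketch under standard linear-model assumptions rather than a statement of how regress is implemented; the names nu, s2, se, tval, and bintManual are introduced for the example.

```matlab
load hald
y = heat;
X = [ones(size(ingredients,1),1) ingredients];
alpha = 0.05;
[b,bint,r] = regress(y,X,alpha);

[n,p] = size(X);
nu = n - p;                          % residual degrees of freedom
s2 = sum(r.^2)/nu;                   % error variance estimate

% Standard errors from the QR factor: diag(inv(X'*X)) equals the
% row sums of squares of inv(R), which avoids forming X'*X explicitly
[~,R] = qr(X,0);
Rinv = R\eye(p);
se = sqrt(s2*sum(Rinv.^2,2));

tval = tinv(1 - alpha/2,nu);         % two-sided t critical value
bintManual = [b - tval*se, b + tval*se];
disp([bint bintManual])
```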

Data Types: double

Residuals, returned as a numeric vector. r is an n-by-1 vector, where n is the number of observations, or rows, in X.

Data Types: single | double

Intervals to diagnose outliers, returned as a numeric matrix. rint is an n-by-2 matrix, where n is the number of observations, or rows, in X. If the interval rint(i,:) for observation i does not contain zero, the corresponding residual is larger than expected in 100*(1-alpha)% of new observations, suggesting an outlier. For more information, see Algorithms.

Data Types: single | double

Model statistics, returned as a numeric vector stats containing, in order, the ${R}^{2}$ statistic, the F-statistic and its p-value, and an estimate of the error variance.

• X must include a column of ones so that the model contains a constant term. The F-statistic and its p-value are computed under this assumption and are not correct for models without a constant.

• The F-statistic is the test statistic of the F-test on the regression model. The F-test looks for a significant linear regression relationship between the response variable and the predictor variables.

• The ${R}^{2}$ statistic can be negative for models without a constant, indicating that the model is not appropriate for the data.

Data Types: single | double

Tips

• regress treats NaN values in X or y as missing values. regress omits observations with missing values from the regression fit.
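The missing-value behavior can be checked with a short sketch. The data are synthetic (x, y, X, and keep are assumptions for the example): fitting with NaN values present should give the same coefficients as fitting after manually deleting every row that has a NaN in X or y.

```matlab
% Synthetic data with two missing values
x = (1:12)';
y = 5 - 0.7*x + 0.2*sin(3*x);
X = [ones(size(x)) x];
y(4) = NaN;                          % missing response
X(9,2) = NaN;                        % missing predictor value

bAll = regress(y,X);                 % regress omits rows 4 and 9

% Manual equivalent: keep only the complete rows
keep = ~(isnan(y) | any(isnan(X),2));
bKeep = regress(y(keep),X(keep,:));

disp([bAll bKeep])                   % the two columns should match
```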

Algorithms

Residual Intervals

In a linear model, observed values of y and their residuals are random variables. Residuals have normal distributions with zero mean but with different variances at different values of the predictors. To put residuals on a comparable scale, regress “Studentizes” the residuals. That is, regress divides the residuals by an estimate of their standard deviation that is independent of their value. Studentized residuals have t-distributions with known degrees of freedom. The intervals returned in rint are shifts of the 100*(1-alpha)% confidence intervals of these t-distributions, centered at the residuals.
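The description above can be sketched numerically. The code below assumes the standard leave-one-out (externally studentized) variance estimate from the regression diagnostics literature and a t critical value with n − p degrees of freedom; both choices are assumptions made for illustration, as are the names h, s2i, ser, and rintManual. It also assumes X has full rank and no observation has leverage equal to 1.

```matlab
load hald
y = heat;
X = [ones(size(ingredients,1),1) ingredients];
alpha = 0.05;
[~,~,r,rint] = regress(y,X,alpha);

[n,p] = size(X);
nu = n - p;                          % residual degrees of freedom
s2 = sum(r.^2)/nu;                   % error variance estimate

[Q,~] = qr(X,0);
h = sum(Q.^2,2);                     % leverage: diagonal of the hat matrix

% Variance estimate with observation i left out, so that the scale
% estimate does not depend on the residual being studentized
s2i = (nu*s2 - r.^2./(1 - h))/(nu - 1);
ser = sqrt(s2i.*(1 - h));            % standard error of each residual

tval = tinv(1 - alpha/2,nu);
rintManual = [r - tval*ser, r + tval*ser];
disp(max(abs(rintManual(:) - rint(:))))   % expected to be near zero
```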

Alternative Functionality

regress is useful when you simply need the output arguments of the function and when you want to repeat fitting a model multiple times in a loop. If you need to investigate a fitted regression model further, create a linear regression model object LinearModel by using fitlm or stepwiselm. A LinearModel object provides more features than regress.

• Use the properties of LinearModel to investigate a fitted linear regression model. The object properties include information about coefficient estimates, summary statistics, fitting method, and input data.

• Use the object functions of LinearModel to predict responses and to modify, evaluate, and visualize the linear regression model.

• Unlike regress, the fitlm function does not require a column of ones in the input data. A model created by fitlm always includes an intercept term unless you specify not to include it by using the 'Intercept' name-value pair argument.

• You can find the information in the output of regress using the properties and object functions of LinearModel.

| Output of regress | Equivalent Values in LinearModel |
|---|---|
| b | See the Estimate column of the Coefficients property. |
| bint | Use the coefCI function. |
| r | See the Raw column of the Residuals property. |
| rint | Not supported. Instead, use studentized residuals (Residuals property) and observation diagnostics (Diagnostics property) to find outliers. |
| stats | See the model display in the Command Window. You can find the statistics in the model properties (MSE and Rsquared) and by using the anova function. |
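The correspondence between regress outputs and LinearModel can be sketched with the hald data. Variable names such as mdl, bLM, and bintLM are introduced for the example. Note that fitlm receives the predictors without the column of ones, because it adds the intercept itself.

```matlab
load hald
y = heat;
X = [ones(size(ingredients,1),1) ingredients];
[b,bint,r,~,stats] = regress(y,X);

mdl = fitlm(ingredients,y);          % intercept is included by default

bLM    = mdl.Coefficients.Estimate;  % corresponds to b
bintLM = coefCI(mdl);                % corresponds to bint (95% by default)
rLM    = mdl.Residuals.Raw;          % corresponds to r
R2LM   = mdl.Rsquared.Ordinary;      % corresponds to stats(1)
mseLM  = mdl.MSE;                    % corresponds to stats(4)

disp([b bLM])
```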

 Chatterjee, S., and A. S. Hadi. “Influential Observations, High Leverage Points, and Outliers in Linear Regression.” Statistical Science. Vol. 1, 1986, pp. 379–416.