Delete-1 Statistics

Delete-1 Change in Covariance (`CovRatio`)

Purpose

Delete-1 change in covariance (CovRatio) identifies the observations that are influential in the regression fit. An influential observation is one whose exclusion from the model might significantly alter the regression function. Values of CovRatio larger than 1 + 3*p/n or smaller than 1 - 3*p/n indicate influential points, where p is the number of regression coefficients and n is the number of observations.

Definition

The CovRatio statistic is the ratio of the determinant of the coefficient covariance matrix with observation i deleted to the determinant of the covariance matrix for the full model:

$CovRatio = \frac{\det \left\{ MSE_{(i)} {\left[ X_{(i)}' X_{(i)} \right]}^{-1} \right\}}{\det \left[ MSE {\left( X' X \right)}^{-1} \right]} .$

CovRatio is an n-by-1 vector in the Diagnostics table of the fitted LinearModel object. Each element is the ratio of the generalized variance of the estimated coefficients when the corresponding observation is deleted to the generalized variance of the coefficients using all the data.
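The determinant ratio can be checked by brute force on a small problem. The following is a minimal sketch, not the LinearModel implementation: the data are synthetic and hypothetical, the variable names are illustrative, and the model is refit once per deleted observation. It also compares the result against the algebraically equivalent shortcut (MSE_(i)/MSE)^p / (1 - h_ii), which follows from det(X_(i)'X_(i)) = (1 - h_ii) det(X'X).

```matlab
% Hypothetical deterministic data: intercept plus two predictors.
t = (1:30)';
X = [ones(30,1) sin(t) cos(2*t)];            % design matrix with intercept column
y = 1 + 2*X(:,2) - X(:,3) + 0.5*sin(7*t);    % response with a small disturbance
[n,p] = size(X);

e   = y - X*(X\y);                           % raw residuals of the full fit
MSE = (e'*e)/(n - p);                        % full-model mean squared error
h   = sum((X/(X'*X)).*X, 2);                 % leverage values h_ii

covRatioDef   = zeros(n,1);                  % from the determinant definition
covRatioShort = zeros(n,1);                  % from the shortcut formula
for i = 1:n
    keep = true(n,1); keep(i) = false;       % drop observation i
    Xi = X(keep,:);  yi = y(keep);
    ri = yi - Xi*(Xi\yi);                    % residuals of the delete-1 fit
    MSEi = (ri'*ri)/(n - 1 - p);             % delete-1 mean squared error
    covRatioDef(i)   = det(MSEi*inv(Xi'*Xi)) / det(MSE*inv(X'*X));
    covRatioShort(i) = (MSEi/MSE)^p / (1 - h(i));
end
maxDiff = max(abs(covRatioDef - covRatioShort));   % should be near zero
```

The explicit refit costs n regressions, so it is only practical as a check on small data sets.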

How To

After obtaining a fitted model, say, mdl, using fitlm or stepwiselm, you can:

Display the CovRatio by indexing into the property using dot notation
```
mdl.Diagnostics.CovRatio
```
Plot the delete-1 change in covariance using
```
plotDiagnostics(mdl,'CovRatio')
```
For details, see the plotDiagnostics method of the LinearModel class.

Determine Influential Observations Using `CovRatio`

This example shows how to use the CovRatio statistics to determine the influential points in data. Load the sample data and define the response and predictor variables.

load hospital
y = hospital.BloodPressure(:,1);
X = double(hospital(:,2:5));

Fit a linear regression model.

mdl = fitlm(X,y);

Plot the CovRatio statistics.

plotDiagnostics(mdl,'CovRatio')

For this example, the threshold limits are 1 + 3*5/100 = 1.15 and 1 - 3*5/100 = 0.85. There are a few points beyond the limits, which might be influential points.

Find the observations that are beyond the limits.

find((mdl.Diagnostics.CovRatio)>1.15|(mdl.Diagnostics.CovRatio)<0.85)
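The limits 1.15 and 0.85 are specific to this data set. As a hedged alternative, the thresholds can be derived from the fitted model itself so that the same code works for other data. This sketch assumes the `NumCoefficients` and `NumObservations` properties of the LinearModel object; the variable names `upper`, `lower`, and `flagged` are illustrative.

```matlab
load hospital
y = hospital.BloodPressure(:,1);
X = double(hospital(:,2:5));
mdl = fitlm(X,y);

p = mdl.NumCoefficients;     % 5 here: the intercept plus four predictors
n = mdl.NumObservations;     % 100 here
upper = 1 + 3*p/n;           % upper CovRatio limit
lower = 1 - 3*p/n;           % lower CovRatio limit

cr = mdl.Diagnostics.CovRatio;
flagged = find(cr > upper | cr < lower);   % observations beyond either limit
table(flagged, cr(flagged), 'VariableNames', {'Observation','CovRatio'})
```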

Delete-1 Scaled Difference in Coefficient Estimates (`Dfbetas`)

Purpose

The sign of a delete-1 scaled difference in coefficient estimate (Dfbetas) for coefficient j and observation i indicates whether that observation causes an increase or decrease in the estimate of the regression coefficient. The absolute value of a Dfbetas indicates the magnitude of the difference relative to the estimated standard deviation of the regression coefficient. A Dfbetas value larger than 3/sqrt(n) in absolute value indicates that the observation has a large influence on the corresponding coefficient.

Definition

Dfbetas for coefficient j and observation i is the ratio of the difference in the estimate of coefficient j using all observations and the one obtained by removing observation i, and the standard error of the coefficient estimate obtained by removing observation i. The Dfbetas for coefficient j and observation i is

${Dfbetas}_{ij} = \frac{b_{j} - b_{j(i)}}{\sqrt{MSE_{(i)}} \, (1 - h_{ii})},$

where b_j is the estimate for coefficient j, b_j(i) is the estimate for coefficient j obtained by removing observation i, MSE_(i) is the mean squared error of the regression fit obtained by removing observation i, and h_ii is the leverage value for observation i. Dfbetas is an n-by-p matrix in the Diagnostics table of the fitted LinearModel object. Each cell of Dfbetas corresponds to the Dfbetas value for the corresponding coefficient obtained by removing the corresponding observation.
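The numerator b_j - b_j(i), whose sign determines the sign of Dfbetas, can be illustrated by refitting without each observation. This is a minimal sketch on hypothetical synthetic data and covers only the unscaled coefficient change; the scaling is left to `mdl.Diagnostics.Dfbetas`. It also checks the refit against the standard identity b - b_(i) = (X'X)^(-1) x_i' e_i / (1 - h_ii), where e_i is the raw residual.

```matlab
% Hypothetical deterministic data: intercept plus two predictors.
t = (1:30)';
X = [ones(30,1) sin(t) cos(2*t)];
y = 1 + 2*X(:,2) - X(:,3) + 0.5*sin(7*t);
[n,p] = size(X);

b = X\y;                                  % estimate using all observations
e = y - X*b;                              % raw residuals
h = sum((X/(X'*X)).*X, 2);                % leverage values h_ii

dRefit  = zeros(n,p);                     % b - b_(i) from explicit refits
dClosed = zeros(n,p);                     % b - b_(i) from the identity
for i = 1:n
    keep = true(n,1); keep(i) = false;    % drop observation i
    bi = X(keep,:)\y(keep);               % estimate without observation i
    dRefit(i,:)  = (b - bi)';
    dClosed(i,:) = ((X'*X)\X(i,:)')' * e(i)/(1 - h(i));
end
maxDiff = max(abs(dRefit(:) - dClosed(:)));   % should be near zero
```

A positive entry in row i, column j means that including observation i raises the estimate of coefficient j.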

How To

After obtaining a fitted model, say, mdl, using fitlm or stepwiselm, you can obtain the Dfbetas values as an n-by-p matrix by indexing into the property using dot notation,

mdl.Diagnostics.Dfbetas

Determine Observations Influential on Coefficients Using `Dfbetas`

This example shows how to determine the observations that have large influence on coefficients using Dfbetas. Load the sample data and define the response and independent variables.

load hospital
y = hospital.BloodPressure(:,1);
X = double(hospital(:,2:5));

Fit a linear regression model.

mdl = fitlm(X,y);

Find the Dfbetas values that are high in absolute value.

[row,col] = find(abs(mdl.Diagnostics.Dfbetas)>3/sqrt(100));
disp([row col])
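The two-column output lists observation indices and coefficient indices only. A possible variation, sketched here under the assumption that the `CoefficientNames` and `NumObservations` properties of the LinearModel object are available, labels each flagged entry with its coefficient name and Dfbetas value. The names `D`, `coefName`, and `value` are illustrative.

```matlab
load hospital
y = hospital.BloodPressure(:,1);
X = double(hospital(:,2:5));
mdl = fitlm(X,y);

n = mdl.NumObservations;                  % 100 for this data set
D = mdl.Diagnostics.Dfbetas;              % n-by-p matrix of Dfbetas values
[row,col] = find(abs(D) > 3/sqrt(n));     % entries beyond the threshold

coefName = reshape(mdl.CoefficientNames(col), [], 1);   % name of each flagged coefficient
value    = D(sub2ind(size(D), row, col));               % signed Dfbetas value
table(row, coefName, value, 'VariableNames', {'Observation','Coefficient','Dfbetas'})
```

The sign of each listed value shows the direction in which the observation shifts the corresponding coefficient estimate.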

Delete-1 Scaled Change in Fitted Values (`Dffits`)

Purpose

The delete-1 scaled change in fitted values (Dffits) shows the influence of each observation on the fitted response values. Observations whose Dffits values are larger than 2*sqrt(p/n) in absolute value might be influential.

Definition

Dffits for observation i is

${Dffits}_{i} = {sr}_{i} \sqrt{\frac{h_{ii}}{1 - h_{ii}}},$

where sr_i is the studentized residual, and h_ii is the leverage value of the fitted LinearModel object. Dffits is an n-by-1 column vector in the Diagnostics table of the fitted LinearModel object. Each element in Dffits is the change in the fitted value caused by deleting the corresponding observation, scaled by the standard error.
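The link between the formula and the "scaled change in fitted value" description can be verified numerically. The sketch below uses hypothetical synthetic data and assumes that the studentized residual is the externally studentized one, that is, the raw residual divided by sqrt(MSE_(i) (1 - h_ii)). It compares the formula with the change in the fitted value at observation i after an explicit refit, divided by sqrt(MSE_(i) h_ii).

```matlab
% Hypothetical deterministic data: intercept plus two predictors.
t = (1:30)';
X = [ones(30,1) sin(t) cos(2*t)];
y = 1 + 2*X(:,2) - X(:,3) + 0.5*sin(7*t);
[n,p] = size(X);

b = X\y;                                   % full-data estimate
e = y - X*b;                               % raw residuals
h = sum((X/(X'*X)).*X, 2);                 % leverage values h_ii

dffitsFormula = zeros(n,1);                % sr_i * sqrt(h_ii/(1 - h_ii))
dffitsRefit   = zeros(n,1);                % scaled change in fitted value
for i = 1:n
    keep = true(n,1); keep(i) = false;     % drop observation i
    Xi = X(keep,:);  yi = y(keep);
    bi = Xi\yi;                            % delete-1 estimate
    ri = yi - Xi*bi;
    MSEi = (ri'*ri)/(n - p - 1);           % delete-1 mean squared error
    sr = e(i)/sqrt(MSEi*(1 - h(i)));       % assumed: externally studentized residual
    dffitsFormula(i) = sr*sqrt(h(i)/(1 - h(i)));
    dffitsRefit(i)   = (X(i,:)*b - X(i,:)*bi)/sqrt(MSEi*h(i));
end
maxDiff = max(abs(dffitsFormula - dffitsRefit));   % should be near zero
```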

How To

After obtaining a fitted model, say, mdl, using fitlm or stepwiselm, you can:

Display the Dffits values by indexing into the property using dot notation
```
mdl.Diagnostics.Dffits
```
Plot the delete-1 scaled change in fitted values using
```
plotDiagnostics(mdl,'Dffits')
```
For details, see the plotDiagnostics method of the LinearModel class.

Determine Observations Influential on Fitted Response Using `Dffits`

This example shows how to determine the observations that are influential on the fitted response values using Dffits values. Load the sample data and define the response and independent variables.

load hospital
y = hospital.BloodPressure(:,1);
X = double(hospital(:,2:5));

Fit a linear regression model.

mdl = fitlm(X,y);

Plot the Dffits values.

plotDiagnostics(mdl,'Dffits')

The influential threshold limit for the absolute value of Dffits in this example is 2*sqrt(5/100) = 0.45. Again, there are some observations with Dffits values beyond the recommended limits.

Find the Dffits values that are large in absolute value.

find(abs(mdl.Diagnostics.Dffits)>2*sqrt(5/100))

Delete-1 Variance (`S2_i`)

Purpose

The delete-1 variance (S2_i) shows how the mean squared error changes when an observation is removed from the data set. You can compare the S2_i values with the value of the mean squared error.

Definition

S2_i is a set of residual variance estimates obtained by deleting each observation in turn. The S2_i value for observation i is

$S2\_i = MSE_{(i)} = \frac{\sum_{j \neq i}^{n} {\left[ y_{j} - {\hat{y}}_{j(i)} \right]}^{2}}{n - p - 1},$

where y_j is the jth observed response value and ŷ_j(i) is the fitted value for observation j from the model fit without observation i. S2_i is an n-by-1 vector in the Diagnostics table of the fitted LinearModel object. Each element in S2_i is the mean squared error of the regression obtained by deleting that observation.
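The definition can be reproduced directly by refitting without each observation. The following minimal sketch uses hypothetical synthetic data and illustrative variable names. It also compares the refit against the standard shortcut MSE_(i) = [(n - p) MSE - e_i^2/(1 - h_ii)]/(n - p - 1), where e_i is the raw residual, which avoids the n separate regressions.

```matlab
% Hypothetical deterministic data: intercept plus two predictors.
t = (1:30)';
X = [ones(30,1) sin(t) cos(2*t)];
y = 1 + 2*X(:,2) - X(:,3) + 0.5*sin(7*t);
[n,p] = size(X);

e   = y - X*(X\y);                         % raw residuals of the full fit
MSE = (e'*e)/(n - p);                      % full-model mean squared error
h   = sum((X/(X'*X)).*X, 2);               % leverage values h_ii

s2Refit = zeros(n,1);                      % delete-1 variance from explicit refits
for i = 1:n
    keep = true(n,1); keep(i) = false;     % drop observation i
    Xi = X(keep,:);  yi = y(keep);
    ri = yi - Xi*(Xi\yi);                  % residuals of the delete-1 fit
    s2Refit(i) = (ri'*ri)/(n - p - 1);
end

% Shortcut that needs only the full fit.
s2Closed = ((n - p)*MSE - e.^2./(1 - h))/(n - p - 1);
maxDiff  = max(abs(s2Refit - s2Closed));   % should be near zero
```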

How To

After obtaining a fitted model, say, mdl, using fitlm or stepwiselm, you can:

Display the S2_i vector by indexing into the property using dot notation
```
mdl.Diagnostics.S2_i
```
Plot the delete-1 variance values using
```
plotDiagnostics(mdl,'S2_i')
```
For details, see the plotDiagnostics method of the LinearModel class.

Compute and Examine Delete-1 Variance Values

This example shows how to compute and plot S2_i values to examine the change in the mean squared error when an observation is removed from the data. Load the sample data and define the response and independent variables.

load hospital
y = hospital.BloodPressure(:,1);
X = double(hospital(:,2:5));

Fit a linear regression model.

mdl = fitlm(X,y);

Display the MSE value for the model.

mdl.MSE

ans = 23.1140

Plot the S2_i values.

plotDiagnostics(mdl,'S2_i')

This plot makes it easy to compare the S2_i values to the MSE value of 23.114, indicated by the horizontal dashed lines. You can see how deleting one observation changes the error variance.
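The same comparison can be made numerically. As a rough sketch, the code below expresses each S2_i value as a percentage change relative to the model MSE and lists the five observations with the largest change. The cutoff of five and the variable names are arbitrary choices for illustration, not recommended limits.

```matlab
load hospital
y = hospital.BloodPressure(:,1);
X = double(hospital(:,2:5));
mdl = fitlm(X,y);

s2 = mdl.Diagnostics.S2_i;                   % delete-1 variance for each observation
pctChange = 100*(s2 - mdl.MSE)/mdl.MSE;      % percent change relative to the full-model MSE

[~,order] = sort(abs(pctChange), 'descend'); % largest changes first
top = order(1:5);                            % arbitrary cutoff, for illustration only
table(top, s2(top), pctChange(top), 'VariableNames', {'Observation','S2_i','PercentChange'})
```

A negative percentage means that removing the observation lowers the error variance, which is typical for observations with large residuals.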