Multicollinearity/Regression/PCA and choice of optimal model (2nd try)

laurie on 13 Apr 2012
Hi there
I have a set of results and 4 "candidate" explanatory variables. Those variables are correlated with each other (only two of them are not correlated with one another). What I want to figure out is which one(s) of them is (are) the best at explaining the results.
I understand stepwise regression is thrown off by multicollinearity (I tried to run it, and it all went fine until I tried to put the interactions in the mix).
I tried an ANOVA: two of the variables are significant, but I get NaNs when I ask about interactions.
I tried to run a PCA on all the explanatory variables, but 1) I don't understand how the PCA isn't concerned with the results I am trying to explain, and 2) I don't understand the results I am getting with pcacov: what do those coefficients in the matrix mean? How am I supposed to rank the variables?
Does it make sense? Thank you very much.
PS: I also learned about the Akaike information criterion, but I am unsure how it would apply here. I hope something simpler could help me, because it feels like trying to crush a fly with a bulldozer.
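On the pcacov question, a minimal sketch may help. It assumes made-up data (two pairs of correlated predictors), not the data from this post, and every variable name in it is illustrative. It shows that pcacov only ever sees the covariance of the explanatory variables, which is why the results never enter the PCA, and how the coefficient matrix is laid out.

```matlab
% Hypothetical data (not from the thread): 4 predictors in two correlated pairs
rng(0);
n = 50;
z = randn(n, 2);
X = [z(:,1), z(:,1) + 0.3*randn(n,1), z(:,2), z(:,2) + 0.3*randn(n,1)];

% pcacov takes only the covariance of the predictors;
% the response is never passed in
[coeff, latent, explained] = pcacov(cov(X));

% Rows of coeff are the original variables, columns are principal components.
% coeff(j,k) is the weight (loading) of variable j in component k.
disp(coeff)

% Percentage of the predictors' total variance captured by each component
disp(explained')
```

Because the response is not involved, the loadings describe how the predictors vary together; on their own they do not rank the predictors by how well each one explains the results.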

Richard Willey on 13 Apr 2012
I'm attaching some code that might prove helpful.
I also have a two part blog posting on this same subject that provides a bit more depth...
%% Introduction to using LASSO
% This demo explains how to start using the lasso functionality introduced
% in R2011b. It is motivated by an example in Tibshirani’s original paper
% on the lasso.
% Tibshirani, R. (1996). Regression shrinkage and selection via the lasso.
% J. Royal. Statist. Soc B., Vol. 58, No. 1, pages 267-288.
% The data set that we’re working with in this demo is a wide
% data set with correlated variables. It includes 8 different
% variables and only 20 observations. 5 out of the 8 variables have
% coefficients of zero, so they have zero impact on the model. The
% other three variables have nonzero coefficients and do impact the model.
%% Clean up workspace and set random seed
clear all
clc
% Set the random number stream
rng(1981);
%% Create a data set with specific characteristics
% Create eight X variables
% The mean of each variable will be equal to zero
mu = [0 0 0 0 0 0 0 0];
% The variables are correlated with one another
% The covariance matrix has entries covariance(j,k) = 0.5^|j-k|
i = 1:8;
matrix = abs(bsxfun(@minus,i',i));
covariance = repmat(.5,8,8).^matrix;
% Use these parameters to generate a set of multivariate normal random numbers
X = mvnrnd(mu, covariance, 20);
% Create a hyperplane that describes Y = f(X)
Beta = [3; 1.5; 0; 0; 2; 0; 0; 0];
ds = dataset(Beta);
% Add in a noise vector
Y = X * Beta + 3 * randn(20,1);
%% Use linear regression to fit the model
b = regress(Y,X);
ds.Linear = b;
%% Use a lasso to fit the model
[B, Stats] = lasso(X,Y, 'CV', 5);
disp(B)
disp(Stats)
%% Create a plot showing MSE versus lambda
lassoPlot(B, Stats, 'PlotType', 'CV')
%% Identify a reasonable set of lasso coefficients
% View the regression coefficients associated with Index1SE
ds.Lasso = B(:,Stats.Index1SE);
disp(ds)
%% Create a plot showing coefficient values versus L1 norm
lassoPlot(B, Stats)
%% Run a Simulation
% Preallocate some variables
MSE = zeros(100,1);
mse = zeros(100,1);
Coeff_Num = zeros(100,1);
Betas = zeros(8,100);
cv_Reg_MSE = zeros(1,100);
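for i = 1 : 100
    % Draw a fresh sample from the same model (noise standard deviation 1 here)
    X = mvnrnd(mu, covariance, 20);
    Y = X * Beta + randn(20,1);
    [B, Stats] = lasso(X,Y, 'CV', 5);
    % Pick a lambda index midway between IndexMinMSE and Index1SE
    Shrink = Stats.Index1SE - ceil((Stats.Index1SE - Stats.IndexMinMSE)/2);
    % Flag and count nonzero coefficients (~= 0 also catches negative ones)
    Betas(:,i) = B(:,Shrink) ~= 0;
    Coeff_Num(i) = sum(B(:,Shrink) ~= 0);
    MSE(i) = Stats.MSE(:, Shrink);
    % 5-fold cross-validated MSE of ordinary least squares, for comparison
    regf = @(XTRAIN, ytrain, XTEST)(XTEST*regress(ytrain,XTRAIN));
    cv_Reg_MSE(i) = crossval('mse',X,Y,'predfun',regf, 'kfold', 5);
end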
Number_Lasso_Coefficients = mean(Coeff_Num);
disp(Number_Lasso_Coefficients)
MSE_Ratio = median(cv_Reg_MSE)/median(MSE);
disp(MSE_Ratio)
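The same idea can be cut down to the four-variable situation in the question. This is only a sketch under stated assumptions: the data below are simulated stand-ins (X4, y, and the choice of which predictors matter are all hypothetical), and it simply reads off which predictors the cross-validated lasso keeps at Index1SE.

```matlab
% Hypothetical stand-in for the question's data: 4 correlated predictors,
% of which only the first and third drive the response (illustrative only)
rng(0);
n  = 60;
z  = randn(n, 2);
X4 = [z(:,1), z(:,1) + 0.3*randn(n,1), z(:,2), z(:,2) + 0.3*randn(n,1)];
y  = 2*X4(:,1) - 1.5*X4(:,3) + 0.5*randn(n,1);

% Cross-validated lasso, as in the demo above
[B4, Stats4] = lasso(X4, y, 'CV', 5);

% Coefficients at the most regularized fit whose CV error is within
% one standard error of the minimum
b1se = B4(:, Stats4.Index1SE);

% Indices of the predictors the lasso kept (nonzero, either sign)
kept = find(b1se ~= 0);
disp(kept')
```

With strongly correlated predictors the lasso tends to keep one member of a correlated group and drop the others, so the kept set is better read as "a sufficient subset" than as a strict ranking.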
laurie on 19 Apr 2012
Hi. Thank you, but this is way too complicated for me. I tried some easier ways to figure out what was happening in the data... Maybe I'll come back to your method later :)