Lasso/Elastic Net feature selection with k-fold cross-validation

I want to understand how Lasso/Elastic Net regression selects the final features when k-fold cross-validation is requested through the call [B, stats] = lasso(featData, classData, 'CV', 10) (from the Statistics & ML toolbox).
In my understanding, if the model is trained 10 times on different subsets of the total sample, this may result in different features selected/penalized in every fold. However, the cross-validated model output does not provide any insight on the variability of those features across different folds. Is the best model simply chosen among all folds and applied to the entire training set? Or are features averaged/weighted based on their stability across folds?
There was a related question previously, but nobody ever answered it:
https://www.mathworks.com/matlabcentral/answers/125357-understanding-k-fold-cross-validation
Thanks for your help!
  1 Comment
Tyson on 23 Jul 2018
This is an important thread. We are also looking for clarification on this exact question. We do not find any info about the beta values for the k-folds in the FitInfo, only a single set of beta values for each lambda. Exactly how were these betas determined?


Answers (1)

Bernhard Suhm on 22 Apr 2018

Cross-validation only applies to assessing model performance. As described in the documentation, with k-fold the reported error is the average across the k different partitions. The model itself is trained on the complete dataset that you provide to the training function, in this case "lasso".
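
To make this concrete, here is a minimal sketch of what the cross-validated call returns. It assumes the Statistics and Machine Learning Toolbox is installed; the synthetic data and the variable names (X, y, selectedMin, selected1SE) are made up for illustration and are not from the original question.

```matlab
% Synthetic data (assumption): 100 observations, 8 predictors,
% of which only predictors 1 and 3 carry signal.
rng(0);                                  % reproducible data and fold assignment
X = randn(100, 8);
y = 2*X(:,1) - 3*X(:,3) + 0.5*randn(100, 1);

[B, FitInfo] = lasso(X, y, 'CV', 10);

% B has one column per lambda, and each column is fit on the complete data set.
% Cross-validation adds error estimates (FitInfo.MSE, FitInfo.SE) and two
% recommended lambda indices; it does not return per-fold coefficients.
idxMin = FitInfo.IndexMinMSE;            % lambda with the lowest cross-validated MSE
idx1SE = FitInfo.Index1SE;               % sparsest model within one SE of that minimum

selectedMin = find(B(:, idxMin) ~= 0);   % features kept at LambdaMinMSE
selected1SE = find(B(:, idx1SE) ~= 0);   % features kept at Lambda1SE
```

So the "final" features are simply the nonzero rows of the column of B that you pick, typically the one at IndexMinMSE or Index1SE.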

  3 Comments
Bernhard Suhm on 30 Apr 2018
You are right, and I asked internally for additional clarification. If you use the kfold argument, you don't get a "final" model back with features weighted or averaged, but pointers to all k models, whose coefficients (or selected features) may differ slightly. If they do differ, that is a sign those features aren't very strong, so you wouldn't want them in your final model. You can get additional information on the various fitted models in the FitInfo field of the output object, but you have to analyze the variability across the different objects yourself. Alternatively, you can retrain the model without k-fold, which will give you the best features using the complete data set.
Juliana Corlier on 11 May 2018
Thanks for clarifying this! This is very helpful. I have a practical follow-up question:
I was looking for these pointers, but I can't seem to find them. In the FitInfo struct I only get coefficients for the 72 different Lambda values (which I also get if I don't run crossvalidation). I would have expected a multidimensional struct/object for different kFolds, but my FitInfo is a 1x1 struct. Any ideas on that? Many thanks!
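
Since the FitInfo struct returned by lasso holds no per-fold coefficients, one way to inspect fold-to-fold variability is to refit on each training fold yourself. The following is only a sketch under stated assumptions: the data are synthetic, and Bfold, selectionFreq, and coefStd are hypothetical names. It passes a cvpartition object to 'CV' so the manual loop reuses the same folds as the cross-validated call.

```matlab
% Synthetic data (assumption): only predictors 1 and 3 carry signal.
rng(0);
X = randn(100, 8);
y = 2*X(:,1) - 3*X(:,3) + 0.5*randn(100, 1);

k = 10;
cvp = cvpartition(size(X, 1), 'KFold', k);   % explicit folds, reused below

% Cross-validated fit, used here only to choose a single lambda.
[~, FitInfo] = lasso(X, y, 'CV', cvp);
lambda = FitInfo.LambdaMinMSE;               % fix one lambda so folds are comparable

% Refit on each training fold at that lambda and store the coefficients.
Bfold = zeros(size(X, 2), k);
for i = 1:k
    trainIdx = training(cvp, i);             % logical index of the i-th training set
    Bfold(:, i) = lasso(X(trainIdx, :), y(trainIdx), 'Lambda', lambda);
end

selectionFreq = mean(Bfold ~= 0, 2);         % fraction of folds that keep each feature
coefStd = std(Bfold, 0, 2);                  % spread of each coefficient across folds
```

Features with selectionFreq near 1 and a small coefStd are stable across folds; features that appear in only some folds are the weak ones described above.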
