To determine a good lasso-penalty strength for a linear classification model that uses a logistic regression learner, compare distributions of k-fold margins.
Load the NLP data set. Preprocess the data as in Estimate k-Fold Cross-Validation Margins.
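A sketch of the loading and preprocessing step, following the linked example (the `nlpdata` set ships with Statistics and Machine Learning Toolbox):

```matlab
load nlpdata                % loads sparse predictor matrix X and categorical labels Y
Ystats = Y == 'stats';      % binary response: pages from the Statistics Toolbox documentation
X = X';                     % orient observations as columns for speed with sparse data
```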
Create a set of 11 logarithmically spaced regularization strengths.
Cross-validate a binary linear classification model using 5-fold cross-validation and each of the regularization strengths. Optimize the objective function using SpaRSA. Lower the tolerance on the gradient of the objective function to 1e-8.
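A sketch of the cross-validation call, continuing from the preprocessing step (`X` and `Ystats` assumed in the workspace). The grid endpoints and the seed are illustrative assumptions, since the original values are not shown here:

```matlab
Lambda = logspace(-8,1,11);   % 11 logarithmically spaced strengths (assumed range)
rng(10)                       % for reproducibility (assumed seed)
CVMdl = fitclinear(X,Ystats,'ObservationsIn','columns', ...
    'KFold',5,'Learner','logistic','Solver','sparsa', ...
    'Regularization','lasso','Lambda',Lambda,'GradientTolerance',1e-8)
```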
CVMdl = 

  ClassificationPartitionedLinear
    CrossValidatedModel: 'Linear'
           ResponseName: 'Y'
        NumObservations: 31572
                  KFold: 5
              Partition: [1x1 cvpartition]
             ClassNames: [0 1]
         ScoreTransform: 'none'

  Properties, Methods
CVMdl is a ClassificationPartitionedLinear model. Because fitclinear implements 5-fold cross-validation, CVMdl contains 5 ClassificationLinear models that the software trains on each fold.
Estimate the k-fold margins for each regularization strength.
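This step can be sketched as follows, assuming `CVMdl` from the previous step is in the workspace. `kfoldMargin` returns one margin per observation and per regularization strength:

```matlab
m = kfoldMargin(CVMdl);   % cross-validated margins, one column per Lambda value
size(m)
```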
m is a 31572-by-11 matrix of cross-validated margins for each observation. The columns correspond to the regularization strengths.
Plot the k-fold margins for each regularization strength. Because logistic regression scores are in [0,1], margins are in [-1,1]. Rescale the margins to help identify the regularization strength that maximizes the margins over the grid.
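One way to rescale and plot, consistent with the distributions described below being compacted near 10000: exponentiate the margins so a margin of 1 maps to 10000. The exact base used in the original plot is an assumption here:

```matlab
figure
boxplot(10000.^m)         % one box per regularization strength
ylabel('Exponentiated test-sample margins')
xlabel('Lambda indices')
```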
Several values of Lambda yield k-fold margin distributions that are compacted near 10000. Higher values of Lambda lead to predictor variable sparsity, which is a good quality of a classifier.
Choose the regularization strength that occurs just before the centers of the k-fold margin distributions start decreasing.
Train a linear classification model using the entire data set and specify the desired regularization strength.
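A hedged sketch of this final training step. The index into the grid is hypothetical (suppose the margin distributions begin to shrink after the fifth value); `Lambda`, `X`, and `Ystats` are assumed from the earlier steps:

```matlab
LambdaFinal = Lambda(5);   % illustrative choice, not the documented index
MdlFinal = fitclinear(X,Ystats,'ObservationsIn','columns', ...
    'Learner','logistic','Solver','sparsa', ...
    'Regularization','lasso','Lambda',LambdaFinal);
```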
To estimate labels for new observations, pass MdlFinal and the new data to predict.
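For example, with a hypothetical matrix `XNew` of new observations, oriented as columns to match the training orientation:

```matlab
labels = predict(MdlFinal,XNew,'ObservationsIn','columns');
```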