Advice on data normalization in k-fold cross validation

Hello,
I'd like to compare the performance of two classifiers, logistic regression and SVM, to see whether they can accurately classify participants' binary responses (0/1) from predictor data. My data are repeated measures, so I block-partition the data by participant using a custom cvpartition to avoid data leakage. I pass this cvpartition to fitclinear() and fitcsvm() to perform 10-fold cross-validation.
However, I'd also like to scale/normalize my predictor data. Applying feature scaling (normalization) before splitting the data into training and test sets would result in data leakage (Kapoor & Narayanan, 2023; Zhu et al., 2023). Therefore, I would like to scale my training data separately from my test data.
First, fitcsvm() has a 'Standardize' option, but it is unclear whether standardization occurs separately for the training and test data on each iteration of the 10-fold cross-validation, or whether standardization occurs before the data are split, which would result in leakage.
Second, fitclinear() has no built-in option to standardize. So it seems I cannot fairly compare the results from fitclinear() and fitcsvm() at this stage, because the normalization cannot be done the same way in both.
Has anyone run into this issue before?
If so, am I better off writing my own for-loop in which I perform the 10-fold cross-validation and standardize the training and test data separately on each iteration? I can write this loop myself; I am merely wondering whether I am missing an existing MATLAB function that does this already.
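The loop I have in mind would look roughly like this (a sketch; assumes a predictor matrix X, a label vector y, and a row-level cvpartition cvp):

```matlab
% Sketch: manual 10-fold CV with per-fold standardization.
% Scaling parameters come from the training fold only and are then
% applied unchanged to the held-out test fold.
acc = zeros(cvp.NumTestSets, 1);
for k = 1:cvp.NumTestSets
    trIdx = training(cvp, k);
    teIdx = test(cvp, k);
    mu = mean(X(trIdx, :));            % statistics from training data only
    sg = std(X(trIdx, :));
    sg(sg == 0) = 1;                   % guard against constant columns
    Xtr = (X(trIdx, :) - mu) ./ sg;    % standardize training fold
    Xte = (X(teIdx, :) - mu) ./ sg;    % apply the SAME parameters to test fold
    mdl = fitclinear(Xtr, y(trIdx), 'Learner', 'logistic');  % or fitcsvm(Xtr, y(trIdx))
    acc(k) = mean(predict(mdl, Xte) == y(teIdx));
end
```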
Thank you for your time.
Kapoor & Narayanan (2023) - Leakage and the reproducibility crisis in machine-learning-based science
Zhu, Yang, and Ren (2023) - Machine Learning in Environmental Research: Common Pitfalls and Best Practices (p. 17677)
  6 Comments
Lars K on 22 Dec 2023
Thank you for your in-depth response Ive J! This is very clear and very helpful!
Your comment "note that in cases where your features are independent, and there is no collinearity or they're fairly normal, it's ok to apply normalization on the whole training set" is also very interesting, as I have been thinking about the influence of multicollinearity in ML.
In any case, your feedback is very insightful and I will play around with your suggestions in MATLAB to see if I can create an ML classifier on some dummy data (e.g., spirals) or some existing data sets.
Thank you again for your time. Happy holidays!
Cheers


Answers (1)

Sulaymon Eshkabilov on 20 Dec 2023
For data normalization, you can use MATLAB's built-in function normalize(), e.g.:
A = 1:5;
A_nor = normalize(A, 'range')
A_nor = 1×5
0 0.2500 0.5000 0.7500 1.0000
Once you have created a model with the 'Standardize' option, that standardization would be applicable to your training and testing/validation data.
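As a sketch of how to avoid leakage with normalize() itself: when called with extra outputs, normalize() returns the centering and scaling values it used, and these can be reapplied to the test fold (Xtrain/Xtest here stand for one fold's data):

```matlab
% Sketch: compute z-score parameters on the training fold only,
% then reuse them on the test fold of the same CV iteration.
Xtrain = randn(20, 3) * 5 + 2;          % illustrative training-fold data
Xtest  = randn(5, 3) * 5 + 2;           % illustrative test-fold data
[XtrainN, C, S] = normalize(Xtrain);    % z-score; C = centers, S = scales
XtestN = normalize(Xtest, 'center', C, 'scale', S);  % training parameters only
```

This way the test data never influence the scaling parameters, which is the leakage the question is about.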

Release: R2023b
