crossvalind

Generate indices for training and test sets

Description

cvIndices = crossvalind(cvMethod,N,M) returns the indices cvIndices after applying cvMethod on N observations using M as the selection parameter.

[train,test] = crossvalind(cvMethod,N,M) returns the logical vectors train and test, representing observations that belong to the training set and the test (evaluation) set, respectively. You can specify any supported method except 'Kfold', which supports a single output argument only.
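As a minimal sketch of the two-output syntax, the following holds out roughly 30% of 100 observations. The sample size, the proportion, and the rng seed are illustrative assumptions, not values taken from this page.

```matlab
rng(0);                                   % illustrative seed, for repeatability only
N = 100;                                  % hypothetical number of observations
[train,test] = crossvalind('HoldOut',N,0.3);

% Both outputs are logical vectors with one element per observation.
nTrain = nnz(train);
nTest  = nnz(test);
fprintf('%d training, %d test observations\n',nTrain,nTest);
```

The counts are only approximately N*M, so code that depends on an exact split size should check nnz(test) rather than assume it.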

___ = crossvalind(___,Name,Value) specifies additional options using one or more name-value pair arguments in addition to the arguments in previous syntaxes. For example, cvIndices = crossvalind('HoldOut',Groups,0.2,'Classes',{'Cancer','Control'}) uses only observations from the 'Cancer' and 'Control' groups to generate indices that assign 20% of the observations to the holdout set and 80% to the training set.

Examples

Create indices for the 10-fold cross-validation and classify measurement data for the Fisher iris data set. The Fisher iris data set contains width and length measurements of petals and sepals from three species of irises.

Load the data set.

load fisheriris

Create indices for the 10-fold cross-validation.

indices = crossvalind('Kfold',species,10);

Initialize an object to measure the performance of the classifier.

cp = classperf(species);

Perform the classification using the measurement data and report the error rate, which is the number of incorrectly classified samples divided by the total number of classified samples.

for i = 1:10
    test = (indices == i); 
    train = ~test;
    class = classify(meas(test,:),meas(train,:),species(train,:));
    classperf(cp,class,test);
end
cp.ErrorRate
ans = 
0.0200

Suppose you want to use the observation data from the setosa and virginica species only and exclude the versicolor species from cross-validation.

labels = {'setosa','virginica'};
indices = crossvalind('Kfold',species,10,'Classes',labels);

indices now contains zeros for the rows that belong to the versicolor species.
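A quick way to confirm this is to inspect the rows of the excluded class. The sketch below repeats the call so that it runs on its own. The variable names excluded, allZero, and inFolds are hypothetical helpers introduced here for illustration.

```matlab
load fisheriris                           % provides meas and species
labels = {'setosa','virginica'};
indices = crossvalind('Kfold',species,10,'Classes',labels);

excluded = strcmp(species,'versicolor');  % rows of the excluded class
allZero  = all(indices(excluded) == 0);   % true if every excluded row is marked 0
inFolds  = all(indices(~excluded) >= 1 & indices(~excluded) <= 10);
```

Note that a row marked 0 never matches indices == i, so it is never placed in a test set by the loop that follows.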

Perform the classification again.

for i = 1:10
    test = (indices == i); 
    train = ~test;
    class = classify(meas(test,:),meas(train,:),species(train,:));
    classperf(cp,class,test);
end
cp.ErrorRate
ans = 
0.0160

Load the carbig data set.

load carbig;
x = Displacement; 
y = Acceleration;
N = length(x);

Train a second-degree polynomial model with leave-one-out cross-validation, and evaluate the averaged cross-validation error. On each iteration, the function randomly selects one observation to hold out for the evaluation set. Using this method within a loop does not guarantee disjoint evaluation sets, so CVerr can differ from run to run.

sse = 0; % Initialize the sum of squared error.
for i = 1:100
    [train,test] = crossvalind('LeaveMOut',N,1);
    yhat = polyval(polyfit(x(train),y(train),2),x(test));
    sse = sse + sum((yhat - y(test)).^2);
end
CVerr = sse / 100;
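If disjoint evaluation sets matter, one alternative is to swap 'LeaveMOut' for 'Kfold', as the method descriptions on this page suggest. The following is a sketch of that variant, not the approach used in the example above. The choice of 10 folds is an assumption made for illustration.

```matlab
load carbig                               % provides Displacement and Acceleration
x = Displacement;
y = Acceleration;
N = length(x);

K = 10;                                   % illustrative number of folds
indices = crossvalind('Kfold',N,K);       % each observation gets exactly one fold label

sse = 0;                                  % running sum of squared errors
for k = 1:K
    test  = (indices == k);               % evaluation set for this fold
    train = ~test;
    p     = polyfit(x(train),y(train),2); % second-degree polynomial fit
    yhat  = polyval(p,x(test));
    sse   = sse + sum((yhat - y(test)).^2);
end
CVerr = sse / N;                          % every observation is evaluated exactly once
```

Because every observation appears in exactly one evaluation set, the divisor is N rather than the number of loop iterations.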

Input Arguments

cvMethod: Cross-validation method, specified as a character vector or string.

This table describes the valid cross-validation methods. Depending on the method, the third input argument (M) has different meanings and requirements.

'Kfold'

M: The fold parameter, most commonly known as K in K-fold cross-validation. M must be a positive integer. The default value is 5.

Description: The method uses K-fold cross-validation to generate indices. It uses M-1 folds for training and the remaining fold for evaluation. The method repeats this process M times, leaving a different fold out for evaluation each time.

'HoldOut'

M: The proportion of observations to hold out for the test set. M must be a scalar between 0 and 1. The default value is 0.5, corresponding to a 50% holdout.

Description: The method randomly selects approximately N*M observations to hold out for the test (evaluation) set. Using this method within a loop is similar to using K-fold cross-validation one time outside the loop, except that the evaluation subsets are not guaranteed to be disjoint.

'LeaveMOut'

M: The number of observations to leave out for the test set. M must be a positive integer. The default value is 1, corresponding to leave-one-out cross-validation (LOOCV).

Description: The method randomly selects M observations to hold out for the evaluation set. Using this cross-validation method within a loop does not guarantee disjoint evaluation sets. To guarantee disjoint evaluation sets, use 'Kfold' instead.

'Resubstitution'

M: A two-element vector [P,Q]. Each element must be a scalar between 0 and 1. The default value is [1,1], corresponding to full resubstitution.

Description: The method randomly selects N*P observations for the evaluation set and N*Q observations for the training set, while minimizing the number of observations used in both sets. Q = 1-P corresponds to holding out (100*P)% of the observations.
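The [P,Q] parameter of 'Resubstitution' is the least obvious of the four, so a short sketch may help. It assumes 100 observations and an arbitrary rng seed, both chosen for illustration. The variable overlap is a hypothetical helper that counts observations appearing in both sets.

```matlab
rng(1);                                   % illustrative seed
N = 100;                                  % hypothetical number of observations

% Full resubstitution (the documented default [1,1]): every observation
% is used for both evaluation and training.
[trainAll,testAll] = crossvalind('Resubstitution',N,[1,1]);

% P = 0.3 for evaluation and Q = 0.7 for training, that is, Q = 1-P.
[train,test] = crossvalind('Resubstitution',N,[0.3,0.7]);
overlap = nnz(train & test);              % observations used in both sets
```

When P + Q exceeds 1, some overlap between the two sets is unavoidable, which is what distinguishes this method from 'HoldOut'.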

Example: 'Kfold'

Data Types: char | string

N: Total number of observations or grouping information, specified as a positive integer, vector of positive integers, logical vector, or cell array of character vectors.

N can be a positive integer specifying the total number of samples in your data set.

N can also be a vector of positive integers or logical values, or a cell array of character vectors, containing grouping information or labels for your samples. The partition of the groups depends on the type of cross-validation. For 'Kfold', each group is divided into M subsets, approximately equal in size. For all other methods, approximately equal numbers of observations from each group are selected for the evaluation (test) set. The training set contains at least one observation from each group regardless of the cross-validation method you use.
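To see how a grouping variable affects the partition, the sketch below tabulates fold membership by species for the Fisher iris data. The seed and the use of crosstab (from Statistics and Machine Learning Toolbox) are choices made here for illustration.

```matlab
load fisheriris                           % species holds 50 labels for each of 3 classes
rng(2);                                   % illustrative seed
indices = crossvalind('Kfold',species,10);

% Rows are species, columns are folds 1 through 10.
counts = crosstab(species,indices);
disp(counts)
```

Each row of counts should be close to uniform, reflecting that every group is split into M subsets of approximately equal size.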

Example: 100

Data Types: double | cell

M: Cross-validation parameter, specified as a scalar between 0 and 1, a positive integer, or a two-element vector. The requirements for M depend on the cross-validation method. For details, see cvMethod.

Example: 5

Data Types: double

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: [train,test] = crossvalind('LeaveMOut',groups,1,'Min',3) requires at least three observations from each group in the training set when performing leave-one-out cross-validation.

Class or group information, specified as the comma-separated pair consisting of 'Classes' and a vector of positive integers, character vector, string, string vector, or cell array of character vectors. This option lets you restrict the observations to only the specified groups.

This name-value pair argument is applicable only when you specify N as a grouping variable. The data type of 'Classes' must match that of N. For example, if you specify N as a cell array of character vectors containing class labels, you must use a cell array of character vectors to specify 'Classes'. The output arguments you specify contain the value 0 for observations belonging to excluded classes.

Example: 'Classes',{'Cancer','Control'}

Data Types: double | cell

Minimum number of observations for each group in the training set, specified as the comma-separated pair consisting of 'Min' and a positive integer. Setting a large value can help to balance the training groups, but causes partial resubstitution when there are not enough observations.

This name-value pair argument is not applicable for the 'Kfold' method.

Example: 'Min',3
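As a hedged illustration of 'Min', the following sketch applies it to the Fisher iris labels and counts the training observations per species. The data set, the seed, and the variable trainCounts are assumptions made for the example.

```matlab
load fisheriris                           % species is a cell array of class labels
rng(3);                                   % illustrative seed
[train,test] = crossvalind('LeaveMOut',species,1,'Min',3);

% Count the training observations that belong to each species.
trainCounts = countcats(categorical(species(train)));
disp(trainCounts)
```

With 50 observations per species the minimum is met easily. The option matters mainly for small or unbalanced groups, where it can force partial resubstitution.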

Data Types: double

Output Arguments

cvIndices: Cross-validation indices, returned as a vector.

If you are using 'Kfold' as the cross-validation method, cvIndices contains equal (or approximately equal) proportions of the integers 1 through M, which define a partition of the N observations into M disjoint subsets.

For other cross-validation methods, cvIndices is a logical vector containing 1s for observations that belong to the training set and 0s for observations that belong to the test (evaluation) set.
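The two forms of the single output can be compared side by side. This is a minimal sketch with a hypothetical sample size of 100. The names foldIds, foldSizes, isTrain, and isTest are introduced here for illustration.

```matlab
rng(4);                                   % illustrative seed
N = 100;                                  % hypothetical number of observations

foldIds   = crossvalind('Kfold',N,5);     % integers 1 through 5
foldSizes = accumarray(foldIds(:),1);     % number of observations in each fold

isTrain = crossvalind('HoldOut',N,0.2);   % logical: true marks training rows
isTest  = ~isTrain;                       % the remaining rows form the test set
```

For 'Kfold', a test set is derived by comparing against a fold number, for example foldIds == 1. For the other methods, the output can be used directly as a logical mask.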

train: Training set, returned as a logical vector. This argument specifies which observations belong to the training set.

test: Test set, returned as a logical vector. This argument specifies which observations belong to the test set.

Version History

Introduced before R2006a