Split dataset into three different size sets without overlapping

I am working on image processing in MATLAB. I need to split a large dataset into three non-overlapping subsets (25%, 25%, and 50%). The dataset (say 1,000 images) has 10 classes of 100 images each. From each class, 25% of the images should go into the training set, another 25% into the validation set, and the remaining 50% into the test set. There must be no repetition: if an image from a class has been stored in one subset, it must not appear in either of the other subsets. How do I do that in MATLAB?
My code is as follows:
load('data.mat')
for i = 1:size(data, 1)
    for j = 1:78
        if mod(i,2)==0
            trainingset(i/2,j) = data(i,j);
        else
            remainset((i-1)/2+1,j) = data(i,j);
        end
    end
end
for i = 1:size(remainset, 1)
    for j = 1:78
        if mod(i,2)==0
            testset(i/2,j) = remainset(i,j);
        else
            validationset((i-1)/2+1,j) = remainset(i,j);
        end
    end
end
Although it somehow works, I am looking for a better algorithm as some parts of data are lost.
  2 Comments
david jones on 3 Sep 2016
I need to split the data into three subsets, but 'datasample' only calculates indices for one subset. If I call it again to calculate the indices of the other subsets, it is likely to produce duplicate indices across subsets. I can use randperm, but the same issue exists. I need to split the dataset into three subsets that each contain a fixed percentage of every class. A simple sampling method such as using
1:250, 251:500, 501:1000
does not work, because each subset would then contain members of only some of the classes. Example: subset 1 should have 25% of class 1, 25% of class 2, 25% of class 3, ..., 25% of class n. Subset 2 should have 25% of class 1, 25% of class 2, 25% of class 3, ..., 25% of class n. Subset 3 should have 50% of class 1, 50% of class 2, 50% of class 3, ..., 50% of class n.
The intersection of any two subsets must be empty, and the union of the three subsets must cover the whole dataset.
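The two conditions above (empty intersections, full union) can be checked directly on index vectors. The following is a minimal sketch, not a complete solution: it assumes a hypothetical dataset of 1000 rows and slices a single randperm result, which avoids duplicate indices because every index appears exactly once in the permutation. The names n, p, idx1, idx2, and idx3 are illustrative. It does not balance the classes; for that, the same slicing would have to be applied within each class.

```matlab
% Hypothetical example: 1000 samples split by slicing ONE random permutation.
n = 1000;                 % assumed dataset size
p = randperm(n);          % every index 1..n appears exactly once
idx1 = p(1:250);          % subset 1 (25%)
idx2 = p(251:500);        % subset 2 (25%)
idx3 = p(501:end);        % subset 3 (50%)

% Pairwise intersections must be empty.
noOverlap = isempty(intersect(idx1, idx2)) && ...
            isempty(intersect(idx1, idx3)) && ...
            isempty(intersect(idx2, idx3));

% The union must contain every index exactly once.
coversAll = isequal(sort([idx1, idx2, idx3]), 1:n);
```

Both noOverlap and coversAll should be true for any split built this way, so the same two checks can be reused on index vectors produced by other methods.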


Answers (1)

Frank B. on 8 May 2018
Here is a quick answer using datasample, for a single vector named data. Loop over your classes, or use the returned indices if the same split has to be shared across arrays.
load('data.mat')
% Declaring data division ratio
% 25% for training, 25% for validation, 50% for test
dataset_div = [0.25 0.25 0.5];
% Number of data in each set (rounded, since datasample needs integer counts)
nb_total = numel(data);
nb_train = round((dataset_div(1)/sum(dataset_div))*nb_total);
nb_valid = round((dataset_div(2)/sum(dataset_div))*nb_total);
% Splitting data in 3 non-overlapping vectors
% Training data
[data_train, idx_sample] = datasample(data, nb_train, 'Replace', false);
% Removing used values
idx_left = 1:nb_total;
idx_left(idx_sample) = [];
val_left = data(idx_left);
% Validation data
[data_valid, idx_sample] = datasample(val_left, nb_valid, 'Replace', false);
% Removing used values (index into val_left, not the original data)
idx_left = 1:numel(val_left);
idx_left(idx_sample) = [];
val_left = val_left(idx_left);
% Test data: everything that remains, so no element is lost or repeated
data_test = val_left;
Cheers
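The answer above handles a single vector. One way to loop over the classes, as suggested, is sketched below. It uses randperm instead of datasample, and it is an illustration under assumptions rather than a tested solution for the original data: the variable labels (one class label per row) is hypothetical, and data is filled with placeholder values of the 78 columns used in the question.

```matlab
% Assumed inputs (hypothetical): data is N-by-78, labels is N-by-1.
rng(0);                                  % fixed seed so the split is repeatable
labels = repelem((1:10)', 100);          % example: 10 classes, 100 rows each
data = rand(numel(labels), 78);          % placeholder features

ratios = [0.25 0.25 0.5];                % training, validation, test
trainIdx = []; validIdx = []; testIdx = [];

classes = unique(labels);
for c = 1:numel(classes)
    members = find(labels == classes(c));          % rows of this class
    members = members(randperm(numel(members)));   % shuffle them once
    nTrain = round(ratios(1) * numel(members));
    nValid = round(ratios(2) * numel(members));
    % Consecutive slices of one shuffled list cannot overlap.
    trainIdx = [trainIdx; members(1:nTrain)];
    validIdx = [validIdx; members(nTrain+1:nTrain+nValid)];
    testIdx  = [testIdx;  members(nTrain+nValid+1:end)];  % remainder
end

trainingset   = data(trainIdx, :);
validationset = data(validIdx, :);
testset       = data(testIdx, :);
```

Because each class is shuffled once and cut into consecutive slices, every row lands in exactly one subset, and the test set takes whatever remains after rounding, so nothing is dropped.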
