Random sampling using k-means cluster without replacement

Version 1.0.0 (1.36 KB) by Dipankar
Prepares a training dataset from a categorical sample so that every cluster within each class is well represented
Updated 19 Aug 2021

Random sampling without replacement does not ensure that every cluster in a sample data set is picked up. For example, in the iris dataset, a random sample drawn from a particular class may miss an important cluster. Across all three species of flowers, the number of missed clusters may be high, and a training dataset prepared from such a random selection may yield a poor classification/regression model.
Solution
The MATLAB function randperm() is generally used to create a series of random numbers without replacement, but it does not ensure that index values are drawn from all possible clusters. To identify the clusters, one may use k-means clustering; a certain percentage of values can then be extracted from each cluster to prepare the training dataset.
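The point about randperm() can be checked directly. The following is a minimal sketch, not part of the submitted file: it clusters one class with k-means and counts how many clusters a plain random sample touches. It assumes the Statistics and Machine Learning Toolbox (for kmeans and fisheriris); the variable names, the fixed seed, and the 'Replicates' setting are illustrative choices.

```matlab
% Sketch: how many k-means clusters does plain random sampling cover?
rng(1);                               % fixed seed, for repeatability only
load fisheriris                       % provides meas (150x4) and species
X = meas(1:50,:);                     % rows 1-50 are 'setosa'
f = floor(size(X,1)/7);               % 7 clusters, as in the description
idx = kmeans(X, f, 'Replicates', 5);  % cluster index of every row
n_pick = round(0.7*size(X,1));        % 70% of the class
pick = randperm(size(X,1), n_pick);   % plain sampling without replacement
covered = unique(idx(pick));          % clusters that received at least one pick
fprintf('%d of %d clusters covered\n', numel(covered), f);
```

Running this with different seeds or a smaller n_pick shows how often small clusters are missed altogether.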
close all; clear; clc
%% Load the input data set
load fisheriris
class_labels=unique(species);
class_cl=length(class_labels);
cl=zeros(1,class_cl); %number of records in each class
for i=1:class_cl
m=find(string(species)==char(class_labels(i))); %rows belonging to the i-th class
cl(i)=length(m);
end
iris(:,1:4)=meas;
iris(:,5)=[repmat(1,50,1);repmat(2,50,1);repmat(3,50,1)]; %last column as categorical values 1, 2, 3
k=3; %number of class labels, i.e., 'setosa', 'versicolor', 'virginica'
%
per=0.7; %fraction of samples drawn from each cluster,
% i.e., 70% from each of the clusters of the 3 classes
% ('setosa', 'versicolor', 'virginica')
%
%k-means will create floor(length(class)/7) clusters in each class, i.e.,
%within 'setosa', 'versicolor', and 'virginica'
class_col=5; %class label column no
[s,td]= split_data_kmean(iris,class_col,per,k); %s: sampled row indices, td: training dataset
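The script above builds the label column by hand for 50 records per class. One possible alternative, shown here only as a sketch, derives the numeric labels and the class counts from species itself. It assumes grp2idx from the Statistics and Machine Learning Toolbox; the variable names labels and names are illustrative.

```matlab
load fisheriris                      % provides meas (150x4) and species (150x1 cell)
[labels, names] = grp2idx(species);  % labels: 1, 2, 3; names: class names
cl = accumarray(labels, 1)';         % number of records in each class
iris = [meas, labels];               % features plus numeric class column
class_col = size(iris, 2);           % the label column is the last one
k = numel(names);                    % number of classes
```

This avoids hard-coding the class sizes, so the same lines would work for a dataset whose classes are unequal in size.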
Algorithm: split_data_kmean
Step 1: Inputs: data (for iris, a dataset of 150 records and 5 columns), number of labelled classes = k (for iris the value is 3), fraction of data taken from each cluster = per, column number of the class label = class_col (for iris it is the 5th column)
Step 2: For c = 1 to k /* 3 classes: 1->'setosa', 2->'versicolor', 3->'virginica' */
Step 3: Find the row indices of class number c (c=1 implies setosa) -> i
Step 4: ub = max(i);lb =min(i);
Find the range lb and ub : rng = lb:1:ub;
Step 5: Find the number of rows in data(i,:), divide it by 7, and take the floor
lenc=size(data(i,:),1); f=floor(lenc/7);
Step 6: Use k-means clustering over data(i,:) with f clusters
idx: cluster indices, C: cluster centres
Step 7: Find the range of cluster indices
mn=min(idx);mx=max(idx);
Step 8: For the range of clusters i=mn:mx
Find indices of each cluster with a class
t1 = find(idx(:)==i);
t1 = t1 +(lb-1);
Step 9: Random sampling without repetition from each cluster of the c-th class
trn = randsample(t1,round(per*length(t1)));
Step 10: Extract those rows from the dataset for training dataset
traing_points=data(trn,:);
Step 11: Append the rows into an empty list.
td = [td;traing_points];
Step 12: Append the indices into an empty set
s = [s;trn];
Step 13: End of the for loop over clusters
Step 14: End of the for loop over classes
Step 15: end of algorithm
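
The steps above can be sketched as a single script with a local function. This is a reconstruction from Steps 1 to 15, not the author's published file. The name split_data_kmean_sketch is hypothetical, and the code assumes that the labels in the class column are the integers 1..k, that the rows of each class are stored contiguously (as in the iris example), and that the Statistics and Machine Learning Toolbox is available. Two details differ from the steps and are additions of this sketch: max(1, ...) guards against classes with fewer than 7 rows, and randperm replaces randsample because randsample treats a scalar first argument as a population size, which would misbehave for single-member clusters.

```matlab
% Hypothetical sketch of Steps 1-15; NOT the author's published file.
rng(0);                                      % fixed seed, for repeatability only
load fisheriris
iris = [meas, repelem((1:3)', 50)];          % column 5: class labels 1, 2, 3
[s, td] = split_data_kmean_sketch(iris, 5, 0.7, 3);
test_rows = setdiff((1:size(iris,1))', s);   % rows left over for testing

function [s, td] = split_data_kmean_sketch(data, class_col, per, k)
s  = [];                                     % sampled row indices
td = [];                                     % training rows
for c = 1:k
    i   = find(data(:,class_col) == c);      % Step 3: rows of class c
    lb  = min(i);                            % Step 4: first row of the class
    f   = max(1, floor(numel(i)/7));         % Step 5 (guard is an addition)
    % Step 6: the label column is constant within a class, so it does
    % not influence the clustering.
    idx = kmeans(data(i,:), f);
    for j = min(idx):max(idx)                % Steps 7-8: loop over clusters
        t1   = find(idx == j) + (lb - 1);    % cluster members as dataset rows
        n    = round(per*numel(t1));         % sample size for this cluster
        perm = randperm(numel(t1), n);       % Step 9: no repetition
        trn  = t1(perm(:));
        td   = [td; data(trn,:)];            % Steps 10-11
        s    = [s; trn];                     % Step 12
    end
end
end
```

Because the sample size is rounded per cluster, the total number of training rows depends on the cluster sizes and need not equal exactly 70% of the dataset. Local functions in scripts require R2016b or later.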

Cite As

Dipankar (2024). Random sampling using k-means cluster without replacement (https://www.mathworks.com/matlabcentral/fileexchange/97894-random-sampling-using-k-means-cluster-without-replacement), MATLAB Central File Exchange. Retrieved .

MATLAB Release Compatibility
Created with R2021a
Compatible with any release
Platform Compatibility
Windows macOS Linux
