Random sampling using k-means cluster without replacement

Version 1.0.0 (1.36 KB) by Dipankar

Prepare a training dataset from a labelled (categorical) sample so that every cluster within each class is well represented

Updated 19 Aug 2021

Random sampling without replacement does not ensure that every cluster of a sample data set is picked up. For example, in the iris dataset, a random sample drawn from one class may miss some important cluster of that class. Across all three species of flowers, the number of missed clusters may be high, and a training dataset prepared from such a random selection may yield a poor classification/regression model.

Solution

The MATLAB function randperm() is generally used to create a series of random numbers without replacement. However, this function does not ensure that the index values come from all possible clusters. To identify the clusters, one may apply k-means clustering within each class and then extract a fixed percentage of the records from each cluster to prepare the training dataset.
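The gap described above can be checked with a small experiment. The following is a minimal sketch, not part of the submission: it clusters one iris class with k-means and counts how many of those clusters a plain randperm() draw touches. The sample size nPick and the seed are arbitrary assumptions, and the Statistics and Machine Learning Toolbox is assumed to be available for fisheriris and kmeans.

```matlab
% Sketch: how many k-means clusters of one class does plain random sampling cover?
rng(1);                                   % fix the seed so the run is repeatable
load fisheriris                           % meas: 150x4 measurements, species: 150x1 labels

X = meas(strcmp(species, 'setosa'), :);   % the 50 setosa records
n = size(X, 1);
f = floor(n / 7);                         % number of clusters, same rule as used in this submission
idx = kmeans(X, f);                       % cluster index (1..f) for each record

nPick = 10;                               % hypothetical small sample size (assumption)
p = randperm(n, nPick);                   % 10 distinct row indices, chosen without cluster awareness
covered = numel(unique(idx(p)));          % clusters that received at least one pick

fprintf('randperm picked %d rows covering %d of %d clusters\n', nPick, covered, f);
```

Whenever covered is smaller than f, some clusters of the class are absent from the sample; sampling a fixed fraction from every cluster avoids this by construction.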

close all; clear; clc

%% Load the input data set
load fisheriris                      % provides meas (150x4) and species (150x1 cell array)

class_labels = unique(species);      % 'setosa', 'versicolor', 'virginica'
class_cl = length(class_labels);     % number of classes

cl = zeros(1, class_cl);             % number of records in each class
for i = 1:class_cl
    m = find(strcmp(species, class_labels{i}));   % rows belonging to class i
    cl(i) = length(m);
end

iris(:,1:4) = meas;
% Last column holds the class as categorical values 1, 2, 3
% (the rows of fisheriris are already grouped by species)
iris(:,5) = [repmat(1,cl(1),1); repmat(2,cl(2),1); repmat(3,cl(3),1)];

k = 3;     % number of class labels, i.e. 'setosa', 'versicolor', 'virginica'

per = 0.7; % fraction of samples drawn from each cluster, i.e. 70% from each of
           % the clusters of the 3 classes ('setosa', 'versicolor', 'virginica')

% k-means will create floor(length(class)/7) clusters in each class, i.e.
% within 'setosa', 'versicolor' and 'virginica'

class_col = 5; % column number of the class label

[s,td] = split_data_kmean(iris,class_col,per,k); % s: selected row indices, td: training dataset
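To make the cluster-count rule and the per-cluster fraction concrete, here is a small arithmetic sketch for one class of 50 records. The cluster sizes are hypothetical (the real sizes depend on the k-means run); only the rules floor(n/7) and round(per*size) come from the description.

```matlab
% Sketch: training-set size for one class of 50 records.
n = 50;                                  % records in one iris class
per = 0.7;                               % fraction sampled from each cluster
f = floor(n / 7);                        % clusters per class
cluster_sizes = [10 8 7 7 6 6 6];        % hypothetical sizes (assumption); they sum to n
picked = round(per * cluster_sizes);     % samples drawn from each cluster
fprintf('%d clusters, %d of %d records selected\n', f, sum(picked), n);
```

Because the rounding is applied cluster by cluster, the total number of selected records can differ slightly from per*n for other cluster sizes.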

Algorithm: split_data_kmean

Step 1: Inputs: data (for iris, a dataset of 150 records and 5 columns); number of labelled classes = k (3 for iris); fraction of data sampled from each cluster = per; column number of the class label = class_col (the 5th column for iris)

Step 2: For c = 1 to k /* for iris, k = 3: 1 -> 'setosa', 2 -> 'versicolor', 3 -> 'virginica' */

Step 3: Find the row indices of class c (c = 1 means setosa) -> i

Step 4: ub = max(i);lb =min(i);

Find the range lb and ub : rng = lb:1:ub;

Step 5: Find the number of records in data(i,:), then take the floor of that number divided by 7 as the number of clusters

lenc=size(data(i,:),1); f=floor(lenc/7);

Step 6: Run k-means clustering on data(i,:) with f clusters

idx: cluster indices, C: cluster centres

Step 7: Find the range of cluster indices

mn=min(idx);mx=max(idx);

Step 8: For each cluster i = mn:mx

Find the indices of the members of that cluster within the class

t1 = find(idx(:)==i);

t1 = t1 +(lb-1);

Step 9: Random sampling without repetition from the clusters of the c-th class

trn = randsample(t1,round(per*length(t1)));

Step 10: Extract those rows from the dataset for the training dataset

traing_points=data(trn,:);

Step 11: Append the rows to the training dataset td (initially empty).

td = [td;traing_points];

Step 12: Append the indices to the index set s (initially empty).

s = [s;trn];

Step 13: End of the for loop over clusters (Step 8)

Step 14: End of the for loop over classes (Step 2)

Step 15: end of algorithm
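The steps above can be sketched as a single MATLAB function. This is a hypothetical re-implementation written from the step list, not the author's file; the name split_data_kmean_sketch is made up for illustration. It deviates from the steps in three labelled ways: cluster members are mapped back to data rows through the index vector instead of adding lb-1 (which only works when the rows of a class are contiguous); randperm replaces randsample, because randsample(t1, n) treats a scalar t1 as the population 1:t1 when a cluster has a single member; and k-means is run on the feature columns only, which should give the same clusters since the label column is constant within a class. The Statistics and Machine Learning Toolbox is assumed for kmeans.

```matlab
function [s, td] = split_data_kmean_sketch(data, class_col, per, k)
% Hypothetical sketch of the steps above; not the author's implementation.
% data      : numeric matrix, one record per row
% class_col : column holding the class label as integers 1..k
% per       : fraction of each cluster to sample, e.g. 0.7
% k         : number of labelled classes
% s         : row indices (into data) of the selected records
% td        : the selected rows, i.e. the training dataset
s = [];
td = [];
feat_cols = setdiff(1:size(data, 2), class_col);   % every column except the label
for c = 1:k
    rows = find(data(:, class_col) == c);          % Step 3: rows of class c
    f = max(floor(numel(rows) / 7), 1);            % Step 5 (assumption: at least one cluster)
    idx = kmeans(data(rows, feat_cols), f);        % Step 6: cluster index per record
    for j = 1:f                                    % Step 8: each cluster in turn
        t1 = rows(idx == j);                       % data rows of cluster j
        nTake = round(per * numel(t1));            % how many records to draw
        pick = randperm(numel(t1), nTake);         % Step 9: sample without replacement
        trn = t1(pick(:));                         % selected data rows (column vector)
        td = [td; data(trn, :)];                   %#ok<AGROW> Steps 10-11
        s = [s; trn];                              %#ok<AGROW> Step 12
    end
end
end
```

Saved as split_data_kmean_sketch.m, it can be called with the same argument order as the script, for example [s, td] = split_data_kmean_sketch(iris, 5, 0.7, 3). The rows not listed in s can then serve as a test set, e.g. via setdiff(1:size(iris,1), s).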

Cite As

Dipankar (2024). Random sampling using k-means cluster without replacement (https://www.mathworks.com/matlabcentral/fileexchange/97894-random-sampling-using-k-means-cluster-without-replacement), MATLAB Central File Exchange. Retrieved April 16, 2024.