Find Optimal Number of Clusters using Silhouette Criterion from Scratch in MATLAB

Hello, I hope you are doing well. I am trying to find the optimal number of clusters using evalclusters with k-means and the silhouette criterion.
The built-in command takes a very long time to find the optimal number of clusters, so I am implementing the method from scratch. I have the following code, but the scores obtained by my from-scratch algorithm differ from those of the built-in function.
The dataset and the call to the built-in function are in the following section. evaluation.CriterionValues holds the score for each candidate K.
x =[ [0.1 0.2 0.15 0.2 0.21 ] 1+[0.1 0.2 0.15 0.2 0.21 ]];
y =[ [0.1 0.2 0.15 0.2 0.21 ] 1+[0.1 0.2 0.15 0.2 0.21 ]];
X = [x.' y.'];
dataset_len = size(X,1);
num_kmeans = 6;
%%
evaluation = evalclusters(X,"kmeans","silhouette","KList",1:num_kmeans)
evaluation.CriterionValues
Here is the code to implement this from scratch. array_silhoutte holds the score for each candidate K.
array_silhoutte = zeros(1,num_kmeans);
distance_a = [];
distance_b = [];
for j=1:num_kmeans
    [cluster_assignments,centroids] = kmeans(X,j,'Distance','sqeuclidean','Start','sample');
    %[~,grps_11]=grp2idx(cluster_assignments);
    for i = 1:dataset_len
        distance_a = [];
        distance_b = [];
        current_datapoint = X(i,:);
        for k=1:dataset_len
            if i~=k
                if (cluster_assignments(i)== cluster_assignments(k))
                    dist = pdist2( current_datapoint,X(k,:),'squaredeuclidean') ;
                    distance_a = [distance_a;dist];
                else
                    dist = pdist2( current_datapoint,X(k,:),'squaredeuclidean') ;
                    distance_b=[distance_b;dist];
                end
            end
        end
        Average_a=mean(distance_a);
        Average_b=mean(distance_b);
    end
    array_silhoutte(j) = (Average_b-Average_a)./max(Average_b, Average_a);
end
Can anybody help me make the from-scratch scores match those of the built-in function?
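For reference, the quantity being compared is usually written as below. This is a sketch of the standard silhouette definition, not something stated in the thread: the symbols a_i, b_i, C_i and d are notation introduced here, with d assumed to be the squared Euclidean distance used in the code above, and the final average assumes evalclusters is run with its default settings.

```latex
a_i = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\; j \neq i} d(x_i, x_j), \qquad
b_i = \min_{C \neq C_i} \frac{1}{|C|} \sum_{j \in C} d(x_i, x_j), \qquad
s_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \qquad
\bar{s}(K) = \frac{1}{n} \sum_{i=1}^{n} s_i
```

Here C_i is the cluster containing observation i, so a_i averages over the other members of the same cluster, b_i is taken over the nearest other cluster only, and the score for a given K is the mean of s_i over all n observations.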

Accepted Answer

Marco Riani on 16 Feb 2023
Edited: Marco Riani on 16 Feb 2023
x =[ [0.1 0.2 0.15 0.2 0.21 ] 1+[0.1 0.2 0.15 0.2 0.21 ]];
y =[ [0.1 0.2 0.15 0.2 0.21 ] 1+[0.1 0.2 0.15 0.2 0.21 ]];
X = [x.' y.'];
dataset_len = size(X,1);
num_kmeans = 6;
evaluation = evalclusters(X,"kmeans","silhouette","KList",1:num_kmeans)
evaluation =
  SilhouetteEvaluation with properties:
    NumObservations: 10
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 0.9956 0.8842 0.7731 0.8798 0.9864]
           OptimalK: 2
disp("Criterion values from evalclusters")
Criterion values from evalclusters
disp(evaluation.CriterionValues)
NaN 0.9956 0.8842 0.7731 0.8798 0.9864
array_silhoutte = zeros(1,num_kmeans);
for j=1:num_kmeans
    % [cluster_assignments,centroids] = kmeans(X,j,'Distance','sqeuclidean','Start','sample');
    [cluster_assignments,centroids] = kmeans(X,j,'Replicates',100);
    avgDWithin=zeros(dataset_len,1);
    avgDBetween=Inf(dataset_len,j);
    for i=1:dataset_len
        for jj=1:j
            boo=cluster_assignments==cluster_assignments(i);
            Xsamecluster=X(boo,:);
            if size(Xsamecluster,1)>1
                avgDWithin(i)=sum(sum((X(i,:)-Xsamecluster).^2,2))/(size(Xsamecluster,1)-1);
            end
            boo1= cluster_assignments~=cluster_assignments(i);
            Xdifferentcluster=X(boo1 & cluster_assignments ==jj,:);
            if ~isempty(Xdifferentcluster)
                avgDBetween(i,jj)=mean(sum((X(i,:)-Xdifferentcluster).^2,2));
            end
        end
    end
    % Calculate the silhouette values
    minavgDBetween = min(avgDBetween, [], 2);
    silh = (minavgDBetween - avgDWithin) ./ max(avgDWithin,minavgDBetween);
    array_silhoutte(j) =mean(silh);
end
disp("Criterion values computed manually")
Criterion values computed manually
disp(array_silhoutte)
NaN 0.9956 0.8841 0.7731 0.8798 0.9864
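I slightly rewrote your code and added 'Replicates',100 to the call to kmeans. In the rewritten loop the silhouette value is computed for every observation and then averaged, and the between-cluster term is the smallest average distance to any other cluster rather than the average over all points outside the cluster. Please let me know whether everything is now clear. Note that kmeans does not take into account the correlation among the variables and is not robust to atypical observations, but that is another story.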
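The loop above can also be written with one pass over the clusters instead of one pass per observation. The following is only a minimal sketch under stated assumptions: the labels are fixed by hand to the obvious two-group partition (so the result does not depend on the random start of kmeans), and the names labels, D, a, b and meanSilh are illustrative and do not appear in the code above. It uses implicit expansion, so it assumes R2016b or later.

```matlab
% Sketch: mean silhouette (squared Euclidean) for a FIXED labelling.
% Assumption: labels are set by hand; in the thread they come from kmeans.
v = [0.1 0.2 0.15 0.2 0.21];
x = [v, 1 + v];
X = [x.' x.'];
labels = [ones(5,1); 2*ones(5,1)];   % assumed two-cluster partition

n = size(X, 1);
k = max(labels);                     % assumes labels are integers 1..k

% All pairwise squared Euclidean distances (n-by-n), no toolbox calls.
D = zeros(n);
for d = 1:size(X, 2)
    D = D + (X(:, d) - X(:, d).').^2;
end

a = zeros(n, 1);   % mean distance to the other members of the own cluster
b = Inf(n, 1);     % smallest mean distance to any other cluster
for c = 1:k
    inC = (labels == c);
    nC  = sum(inC);
    % Mean distance of every observation to the members of cluster c.
    meanToC = sum(D(:, inC), 2) / nC;
    % Own cluster: the self-distance is 0, so divide by nC-1 instead of nC.
    if nC > 1
        a(inC) = sum(D(inC, inC), 2) / (nC - 1);
    end
    % Observations outside cluster c: keep the closest cluster seen so far.
    b(~inC) = min(b(~inC), meanToC(~inC));
end

s = (b - a) ./ max(a, b);            % one silhouette value per observation
meanSilh = mean(s);
fprintf('mean silhouette = %.4f\n', meanSilh);
```

With these hand-set labels the result should agree with the 0.9956 reported above for K = 2. If the Statistics and Machine Learning Toolbox is available, comparing against mean(silhouette(X, labels)) is a quick additional check, on the assumption that its default distance is squared Euclidean.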
Best
Marco
