Find Optimal Number of Clusters Using the Silhouette Criterion from Scratch in MATLAB
Hello, I hope you are doing well. I am trying to find the optimal number of clusters using evalclusters with k-means and the silhouette criterion.
The built-in command takes a very long time to find the optimal number of clusters, so I am implementing the method from scratch with the code below. However, the scores from my implementation differ from those of the built-in function.
The dataset and the call to the built-in function are shown below. evaluation.CriterionValues holds the silhouette score for each candidate K.
x =[ [0.1 0.2 0.15 0.2 0.21 ] 1+[0.1 0.2 0.15 0.2 0.21 ]];
y =[ [0.1 0.2 0.15 0.2 0.21 ] 1+[0.1 0.2 0.15 0.2 0.21 ]];
X = [x.' y.'];
dataset_len = size(X,1);
num_kmeans = 6;
%%
evaluation = evalclusters(X,"kmeans","silhouette","KList",1:num_kmeans)
evaluation.CriterionValues
Here is the code that implements this from scratch. array_silhoutte holds the score for each candidate K.
array_silhoutte = zeros(1,num_kmeans);
distance_a = [];
distance_b = [];
for j=1:num_kmeans
    [cluster_assignments,centroids] = kmeans(X,j,'Distance','sqeuclidean','Start','sample');
    %[~,grps_11]=grp2idx(cluster_assignments);
    for i = 1:dataset_len
        distance_a = [];
        distance_b = [];
        current_datapoint = X(i,:);
        for k=1:dataset_len
            if i~=k
                if (cluster_assignments(i)== cluster_assignments(k))
                    dist = pdist2( current_datapoint,X(k,:),'squaredeuclidean') ;
                    distance_a = [distance_a;dist];
                else
                    dist = pdist2( current_datapoint,X(k,:),'squaredeuclidean') ;
                    distance_b=[distance_b;dist];
                end
            end
        end
        Average_a=mean(distance_a);
        Average_b=mean(distance_b);
    end
    array_silhoutte(j) = (Average_b-Average_a)./max(Average_b, Average_a);
end
Can anybody help me make the scores from my from-scratch implementation match those of the built-in function?
Accepted Answer
Marco Riani on 16 Feb 2023 (edited on 16 Feb 2023)
x =[ [0.1 0.2 0.15 0.2 0.21 ] 1+[0.1 0.2 0.15 0.2 0.21 ]];
y =[ [0.1 0.2 0.15 0.2 0.21 ] 1+[0.1 0.2 0.15 0.2 0.21 ]];
X = [x.' y.'];
dataset_len = size(X,1);
num_kmeans = 6;
evaluation = evalclusters(X,"kmeans","silhouette","KList",1:num_kmeans)
disp("Criterion values from evalclusters")
disp(evaluation.CriterionValues)
array_silhoutte = zeros(1,num_kmeans);
for j=1:num_kmeans
    % [cluster_assignments,centroids] = kmeans(X,j,'Distance','sqeuclidean','Start','sample');
    [cluster_assignments,centroids] = kmeans(X,j,'Replicates',100);
    % Per-point average distance to the point's own cluster (stays 0 for singletons)
    avgDWithin=zeros(dataset_len,1);
    % Per-point average distance to each cluster (Inf where not applicable)
    avgDBetween=Inf(dataset_len,j);
    for i=1:dataset_len
        for jj=1:j
            % Points in the same cluster as point i (including i itself)
            boo=cluster_assignments==cluster_assignments(i);
            Xsamecluster=X(boo,:);
            if size(Xsamecluster,1)>1
                % The self-distance is 0, so divide by the cluster size minus 1
                avgDWithin(i)=sum(sum((X(i,:)-Xsamecluster).^2,2))/(size(Xsamecluster,1)-1);
            end
            % Points in cluster jj, provided jj is not the cluster of point i
            boo1= cluster_assignments~=cluster_assignments(i);
            Xdifferentcluster=X(boo1 & cluster_assignments ==jj,:);
            if ~isempty(Xdifferentcluster)
                avgDBetween(i,jj)=mean(sum((X(i,:)-Xdifferentcluster).^2,2));
            end
        end
    end
    % Calculate the silhouette values
    minavgDBetween = min(avgDBetween, [], 2);
    silh = (minavgDBetween - avgDWithin) ./ max(avgDWithin,minavgDBetween);
    % The criterion for this K is the mean silhouette over all points
    array_silhoutte(j) =mean(silh);
end
disp("Criterion values computed manually")
disp(array_silhoutte)
I slightly rewrote your code and added 'Replicates',100 to the call to kmeans. Please let me know whether everything is clear now. Note that kmeans does not take the correlation among the variables into account and is not robust to atypical observations, but that is another story.
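For reference, the quantity the rewritten loop computes can be written out as below. The notation is introduced here only for illustration and does not appear in the code: C(i) is the cluster of point i, C_k is cluster k, n is the number of points, and distances are squared Euclidean. For a point in a singleton cluster the code leaves a_i at 0.

```latex
a_i = \frac{1}{|C(i)| - 1} \sum_{\substack{j \in C(i) \\ j \neq i}} \lVert x_i - x_j \rVert^2,
\qquad
b_i = \min_{C_k \neq C(i)} \frac{1}{|C_k|} \sum_{j \in C_k} \lVert x_i - x_j \rVert^2

s_i = \frac{b_i - a_i}{\max(a_i,\, b_i)},
\qquad
\text{score}(K) = \frac{1}{n} \sum_{i=1}^{n} s_i
```

Compared with this, the code in the question appears to differ in three ways: Average_a and Average_b are overwritten on every pass of the inner loop, so only the last point contributes to array_silhoutte(j); distance_b pools the points of all other clusters instead of taking the nearest other cluster; and kmeans is run once from a random sample start, so the partition itself can change from run to run.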
Best
Marco
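Since the original concern was run time, the same per-point computation can also be done without the nested loops. The following is only a sketch under a few assumptions: squared Euclidean distance, a single fixed K, implicit expansion (R2016b or later), and the Statistics and Machine Learning Toolbox. The variable names (M, S, own, and so on) are illustrative and not taken from the code above. The last lines cross-check the result against the toolbox function silhouette for the same partition.

```matlab
% Sketch: vectorized silhouette criterion for one k-means partition.
% Assumes squared Euclidean distance and implicit expansion (R2016b+).
x = [[0.1 0.2 0.15 0.2 0.21] 1+[0.1 0.2 0.15 0.2 0.21]];
X = [x.' x.'];                            % same dataset as in the question (y equals x)
K = 2;                                    % illustrative choice of K
rng(0);                                   % fix the seed so reruns give the same partition
idx = kmeans(X, K, 'Replicates', 100);

n = size(X, 1);
D = pdist2(X, X, 'squaredeuclidean');     % n-by-n pairwise squared distances
M = double(idx == 1:K);                   % n-by-K cluster membership indicator
counts = sum(M, 1);                       % 1-by-K cluster sizes
S = D * M;                                % S(i,k): summed distance from point i to cluster k

own = sub2ind([n K], (1:n).', idx);       % linear index of each point's own cluster
cnt = counts(idx);
a = S(own) ./ max(cnt(:) - 1, 1);         % within-cluster mean, self excluded
B = S ./ counts;                          % mean distance to every cluster
B(own) = Inf;                             % exclude the point's own cluster
b = min(B, [], 2);                        % nearest other cluster

s = (b - a) ./ max(a, b);                 % per-point silhouette values
manualScore = mean(s);

builtinScore = mean(silhouette(X, idx));  % toolbox values for the same partition
fprintf('manual %.6f, built-in %.6f\n', manualScore, builtinScore);
```

To scan several values of K, this block can be wrapped in a loop over K in the same way as the code above. Comparing against silhouette(X, idx) on the same idx separates differences in the silhouette arithmetic from differences caused by kmeans returning another partition.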