Matching the labels of a clustering with ground truth labels for performance analysis

Question

Samuel L. Polk on 9 Oct 2021

0
Link

Direct link to this question

https://uk.mathworks.com/matlabcentral/answers/1470216-matching-the-labels-of-a-clustering-with-ground-truth-labels-for-performance-analysis

Answered: Shubham on 1 Mar 2024

I'm working on clustering a dataset I have ground truth labels. I want to evaluate the confusion matrix between the predicted and ground truth labels, but the labels assigned by my clustering algorithm are not necessarily the same numerically as those assigned in the ground truth labels.

For example, suppose that I assign labels cHat=[2,2,1,1] using some clustering algorithm, but the true labels are cTrue=[1,1,2,2]. I've achieved perfect performance (splitting the two clusters apart). However, if I construct a confusion matrix before any preprocessing steps, I will get a matrix of the form: C = [[0,2]; [2,0]], implying zero accuracy. How do I preprocess my cHat to better match the labels of cTrue?

I understand that there are other ways to evaluate clustering accuracy (such as the Rand Index, NMI, etc.). However, the metrics that I need to use (Overall Accuracy, Average Accuracy, and Kappa coefficient) rely on the construction of a confusion matrix.

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Sign in to answer this question.

Answer 1

Shubham on 1 Mar 2024

0
Link

Direct link to this answer

https://uk.mathworks.com/matlabcentral/answers/1470216-matching-the-labels-of-a-clustering-with-ground-truth-labels-for-performance-analysis#answer_1419411

Hi Samuel,

When evaluating clustering results with a confusion matrix, the absolute values of the labels are not important; rather, it's the consistency of labeling within the clusters that matters. Since clustering algorithms like k-means arbitrarily assign numerical labels to clusters, these labels often do not match the ground truth labels. To compare the predicted labels with the ground truth, you need to align the labels first.

One common approach to align the labels is to use the Hungarian algorithm, also known as the Kuhn-Munkres algorithm, which can solve the assignment problem in polynomial time. The algorithm finds the best one-to-one mapping between two sets of labels that minimizes the total cost (or maximizes the total similarity).

Here's a theoretical approach to preprocess cHat to better match the labels of cTrue:

Construct a contingency table (similar to a confusion matrix) where each cell (i, j) represents the number of samples in predicted cluster i and true cluster j.
Use the Hungarian algorithm to find the optimal one-to-one mapping between the predicted labels and the true labels based on the contingency table. The goal is to maximize the sum of the diagonal (which represents correctly clustered samples) in the confusion matrix.
According to the mapping obtained from the Hungarian algorithm, relabel the predicted clusters so that they correspond to the true clusters as closely as possible.
With the relabeled predicted clusters, construct the confusion matrix again. This time, the diagonal should reflect the actual number of correctly clustered samples.
Now that you have a properly aligned confusion matrix, you can calculate Overall Accuracy, Average Accuracy, and the Kappa coefficient as needed.

In MATLAB, you can use the matchpairs function from the Optimization Toolbox to perform the Hungarian algorithm.

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Matching the labels of a clustering with ground truth labels for performance analysis

0 Comments
Show -2 older commentsHide -2 older comments

Answers (1)

0 Comments
Show -2 older commentsHide -2 older comments

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

Matching the labels of a clustering with ground truth labels for performance analysis

0 Comments Show -2 older commentsHide -2 older comments

Answers (1)

0 Comments Show -2 older commentsHide -2 older comments

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

0 Comments
Show -2 older commentsHide -2 older comments

0 Comments
Show -2 older commentsHide -2 older comments