Matching the labels of a clustering with ground truth labels for performance analysis

24 views (last 30 days)
I'm working on clustering a dataset I have ground truth labels. I want to evaluate the confusion matrix between the predicted and ground truth labels, but the labels assigned by my clustering algorithm are not necessarily the same numerically as those assigned in the ground truth labels.
For example, suppose that I assign labels cHat=[2,2,1,1] using some clustering algorithm, but the true labels are cTrue=[1,1,2,2]. I've achieved perfect performance (splitting the two clusters apart). However, if I construct a confusion matrix before any preprocessing steps, I will get a matrix of the form: C = [[0,2]; [2,0]], implying zero accuracy. How do I preprocess my cHat to better match the labels of cTrue?
I understand that there are other ways to evaluate clustering accuracy (such as the Rand Index, NMI, etc.). However, the metrics that I need to use (Overall Accuracy, Average Accuracy, and Kappa coefficient) rely on the construction of a confusion matrix.

Answers (1)

Shubham
Shubham on 1 Mar 2024
Hi Samuel,
When evaluating clustering results with a confusion matrix, the absolute values of the labels are not important; rather, it's the consistency of labeling within the clusters that matters. Since clustering algorithms like k-means arbitrarily assign numerical labels to clusters, these labels often do not match the ground truth labels. To compare the predicted labels with the ground truth, you need to align the labels first.
One common approach to align the labels is to use the Hungarian algorithm, also known as the Kuhn-Munkres algorithm, which can solve the assignment problem in polynomial time. The algorithm finds the best one-to-one mapping between two sets of labels that minimizes the total cost (or maximizes the total similarity).
Here's a theoretical approach to preprocess cHat to better match the labels of cTrue:
  1. Construct a contingency table (similar to a confusion matrix) where each cell (i, j) represents the number of samples in predicted cluster i and true cluster j.
  2. Use the Hungarian algorithm to find the optimal one-to-one mapping between the predicted labels and the true labels based on the contingency table. The goal is to maximize the sum of the diagonal (which represents correctly clustered samples) in the confusion matrix.
  3. According to the mapping obtained from the Hungarian algorithm, relabel the predicted clusters so that they correspond to the true clusters as closely as possible.
  4. With the relabeled predicted clusters, construct the confusion matrix again. This time, the diagonal should reflect the actual number of correctly clustered samples.
  5. Now that you have a properly aligned confusion matrix, you can calculate Overall Accuracy, Average Accuracy, and the Kappa coefficient as needed.
In MATLAB, you can use the matchpairs function from the Optimization Toolbox to perform the Hungarian algorithm.

Products


Release

R2021a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!