Clustering

What Is Clustering?

Clustering and its applications explained

Clustering or cluster analysis is an unsupervised learning method used in machine learning and data analysis that organizes your data so that data points in the same group (or cluster) are more similar to each other than to those in other groups. Clustering helps to make sense of large and complex data sets by uncovering patterns and trends or making predictions on unlabeled data.

How Clustering Works

Clustering involves several key steps including data preparation, defining a similarity measure, choosing the right clustering algorithm, and evaluating and refining the clusters.

Key steps in clustering: data preparation, similarity measure definition, clustering algorithm selection, and cluster evaluation.

Clustering works by measuring the similarity between data points and grouping them so that each point is more similar to the points in its own cluster than to the points in any other cluster. The concept of “similarity” varies depending on the context and the data, and it’s a fundamental aspect of unsupervised learning. Various similarity measures can be used, including Euclidean distance, cosine distance, correlation, and probabilistic measures.
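
To make three of these measures concrete, the following Python sketch computes Euclidean distance, cosine distance, and correlation distance for a pair of vectors. The function names and sample vectors are illustrative assumptions rather than part of any library, and correlation distance is taken here to mean one minus the Pearson correlation coefficient.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def correlation_distance(a, b):
    """One minus the Pearson correlation (cosine distance of centered vectors).
    Assumes neither vector is constant, otherwise the norm would be zero."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_distance(centered_a, centered_b)

# Hypothetical sample vectors: q is p scaled by 2.
p = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 6.0]
print(euclidean_distance(p, q))
print(cosine_distance(p, q))
print(correlation_distance(p, q))
```

The two sample vectors point in the same direction, so their cosine and correlation distances are approximately zero even though their Euclidean distance is not. This is one reason the choice of measure depends on the data and the context.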

Scatter plot of two-dimensional data grouped into three clusters, shown in different colors, created using the spectralcluster function. (See MATLAB code.)

Types of Clustering Algorithms

Clustering algorithms fall into two broad groups:

  • Hard clustering: When each data point belongs to only one cluster, such as the popular k-means method
  • Soft clustering: When each data point can belong to more than one cluster, such as in Gaussian mixture models

k-means clustering, which represents each group by its centroid (the average of the group's members), depicted by the stars.

A Gaussian mixture model with two clusters. The contour lines depict cluster membership probabilities, which represent the strength of association with each cluster.
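
The difference between hard and soft assignment can be sketched in a few lines of Python. This is a simplified illustration, not a full Gaussian mixture model: it assumes the cluster centers are already known and that every cluster is an equally weighted spherical Gaussian with the same spread (`sigma`). The function names and sample values are hypothetical.

```python
import math

def hard_assign(point, centers):
    """Hard clustering: return the index of the single nearest center."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))

def soft_assign(point, centers, sigma=1.0):
    """Soft clustering: return one membership weight per center, summing to 1.
    Assumes equal-weight spherical Gaussians that share the same sigma."""
    scores = [
        math.exp(-math.dist(point, c) ** 2 / (2 * sigma ** 2))
        for c in centers
    ]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical centers and query point.
centers = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)
print(hard_assign(point, centers))   # index of the nearest center
print(soft_assign(point, centers))   # degree of membership in each cluster
```

A point that lies halfway between the two centers still receives a single label under hard assignment, while soft assignment reports equal membership in both clusters.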

Many clustering algorithms exist, and each takes a different approach to grouping data. They vary significantly in their mechanics and ideal use cases. The most common types of clustering algorithms used in machine learning are:

  • Hierarchical clustering builds a multilevel hierarchy of clusters by creating a cluster tree.
  • k-means clustering partitions data into k distinct clusters based on the distance to the centroid of a cluster.
  • Gaussian mixture models form clusters as a mixture of multivariate normal density components.
  • Density-based spatial clustering of applications with noise (DBSCAN) groups points that are close to each other in areas of high density and marks points in low-density regions as outliers. It can handle arbitrary nonconvex shapes.
  • Self-organizing maps use neural networks that learn the topology and distribution of the data.
  • Spectral clustering transforms input data into a graph-based representation where the clusters are better separated than in the original feature space. The number of clusters can be estimated by studying the eigenvalues of the graph.
  • Hidden Markov models can be used to discover patterns in sequences, such as genes and proteins in bioinformatics.
  • Fuzzy c-means (FCM) groups data into N clusters, with every data point in the data set belonging to every cluster to a certain degree.
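
As an illustration of one algorithm from this list, here is a minimal Python sketch of k-means using the standard iterative procedure (often called Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its members, and repeat until the assignments stop changing. The function name `kmeans_sketch`, the random initialization, and the sample points are assumptions made for this example. It is not the MATLAB kmeans function.

```python
import math
import random

def kmeans_sketch(points, k, max_iterations=100, seed=0):
    """Partition points into k clusters with Lloyd's algorithm."""
    rng = random.Random(seed)
    # Assumption: initialize the centroids with k distinct points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # Assignments stopped changing, so the algorithm has converged.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # Keep the old centroid if a cluster ends up empty.
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initialization, so compare labels for equality within a group rather than relying on specific label values.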

Clustering for Unsupervised Learning

Unsupervised learning is a type of machine learning used to draw inferences from unlabeled data without human intervention. Clustering is the most common unsupervised learning method. It explores data to find hidden patterns or groupings without any prior knowledge of group labels. Using these groups and patterns, clustering helps to extract useful insights from unlabeled data and reveal inherent structures within it.

Using clustering for image segmentation: the light brown dog in the original photo is separated from the patterned black-and-white tiled floor.

Why Clustering Is Important

Clustering is a significant area of artificial intelligence. It plays an important role in various domains by offering valuable insights into data and uncovering patterns and relationships that are not immediately obvious. In unlabeled data, the relationships between data points are hidden, yet they are needed to reveal useful insights. Clustering helps to discover those relationships and to organize the data into meaningful groups.

By grouping similar items, clustering reduces data complexity so that you can focus on the behavior of the groups rather than getting overwhelmed by individual data points. For this reason, clustering is useful for exploratory data analysis and semisupervised learning. In semisupervised learning, clustering serves as a preprocessing step before supervised learning: it reduces the amount of data that a machine learning model must process and can improve predictive accuracy.

Clustering is also frequently used in applications such as anomaly detection, image segmentation, and pattern recognition. More specifically, clustering can be applied in the following areas to identify patterns and sequences:

  • Clusters can represent the data instead of the raw signal in data compression methods.
  • Clusters indicate regions of images and lidar point clouds in segmentation algorithms.
  • Clustering can assist in identifying outliers or anomalies within a data set.
  • In medical imaging, clustering algorithms can be used to separate images into regions of interest, such as for differentiating between healthy tissue and tumors or segmenting the brain into white matter, gray matter, and cerebrospinal fluid.
  • Clustering is used in geographic information systems (GISs) to analyze satellite imagery or aerial photographs to identify urban sprawl or land use patterns, or to monitor changes in urban areas over time.
  • Genetic clustering and sequence analysis are used in bioinformatics.

Left: Original image of tissue stained in shades of purple with hematoxylin and eosin. Right: MATLAB assigned three clusters to the image, providing a segmentation of the tissue into three classes.
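
Outlier identification, one of the uses listed above, follows naturally from density-based clustering, because points that belong to no dense region are labeled as noise. The Python sketch below is a compact, brute-force version of the DBSCAN procedure. The function name `dbscan_sketch`, the parameter values, and the sample points are assumptions for illustration, and the code is not the MATLAB dbscan function.

```python
import math

def dbscan_sketch(points, eps, min_pts):
    """Label each point with a cluster index, or -1 for noise (outliers)."""
    NOISE = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # May be relabeled later as a border point.
            continue
        cluster += 1  # Point i is a core point, so it starts a new cluster.
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # Border point reached from a core point.
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep growing.
    return labels

# Hypothetical data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan_sketch(points, eps=1.5, min_pts=3)
print(labels)
```

In this sample, the isolated point at (5, 5) has no neighbors within `eps`, so it receives the noise label -1 while the two dense squares form separate clusters.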

Clustering with MATLAB

Using MATLAB® with Statistics and Machine Learning Toolbox™, you can identify patterns and features by applying clustering methods of your choice and dividing your data into groups or clusters. With Image Processing Toolbox™, you can perform clustering on image data.

Data Preparation

For accurate and efficient clustering results, it is vital to preprocess the data and handle missing values and outliers. You can clean and preprocess your data programmatically using built-in functions or interactively using the Data Cleaner app.
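
The kind of cleaning described here can be sketched in plain Python. The example below drops rows that contain missing values and then drops rows in which any value lies more than three scaled median absolute deviations (MAD) from its column median. The function names, the threshold, and the sample rows are assumptions made for illustration. They do not reproduce the Data Cleaner app or any specific MATLAB function.

```python
import math
import statistics

def drop_missing(rows):
    """Keep only rows in which every value is present (not None or NaN)."""
    return [r for r in rows
            if not any(v is None or math.isnan(v) for v in r)]

def drop_outliers(rows, threshold=3.0):
    """Drop rows with any value more than `threshold` scaled MADs
    away from the median of its column."""
    columns = list(zip(*rows))
    medians = [statistics.median(c) for c in columns]
    # The factor 1.4826 makes the MAD comparable to a standard deviation
    # for normally distributed data.
    mads = [1.4826 * statistics.median(abs(v - m) for v in c)
            for c, m in zip(columns, medians)]
    return [r for r in rows
            if all(abs(v - m) <= threshold * d
                   for v, m, d in zip(r, medians, mads))]

# Hypothetical measurements with two incomplete rows and one extreme value.
raw = [(1.0, 10.0), (1.2, None), (0.9, 11.0), (1.1, 9.5),
       (float("nan"), 10.2), (1.0, 10.5), (25.0, 10.1)]
complete = drop_missing(raw)
cleaned = drop_outliers(complete)
print(cleaned)
```

Removing rows is only one option. Depending on the data, filling in missing values or replacing outliers may be preferable to discarding observations.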

Clustering Algorithms

MATLAB supports all popular clustering algorithms, which you can apply with built-in functions, such as the kmeans function. You can use the Cluster Data Live Editor task to interactively perform k-means and hierarchical clustering. Using the task, you can automatically generate MATLAB code for your live script.

You can also perform nearest-neighbors clustering in Simulink by using the KNN Search block. The block accepts a query point and returns the k nearest-neighbor points in the observational data using a nearest-neighbor searcher object.
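
The underlying operation, taking a query point and returning its k nearest neighbors in the observational data, can be sketched as an exhaustive search in Python. This illustrates the idea only. The function name `knn_search` and the sample data are hypothetical, and the code does not model the Simulink block or its searcher object.

```python
import math

def knn_search(data, query, k):
    """Return the indices of the k points in `data` closest to `query`,
    ordered from nearest to farthest (brute-force Euclidean search)."""
    order = sorted(range(len(data)), key=lambda i: math.dist(data[i], query))
    return order[:k]

# Hypothetical observational data and query point.
data = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (1.0, 0.0), (6.0, 5.0)]
nearest = knn_search(data, (0.9, 0.2), k=2)
print(nearest)                      # indices of the two nearest observations
print([data[i] for i in nearest])   # the corresponding points
```

An exhaustive search compares the query with every observation, which is fine for small data sets. Tree-based searchers are a common alternative when the data set is large.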

Left: MATLAB scatter plot of petal width and length measurements from several specimens of three iris species. Right: Petal measurements segmented into three clusters using the Gaussian mixture model (GMM) clustering technique. (See Statistics and Machine Learning example.)

Visualize and Evaluate Clustering Results

When the data does not contain natural divisions that indicate the appropriate number of clusters, you can use different evaluation criteria, such as gap or silhouette, to determine how well the data fits into a particular number of clusters. You can also visualize clusters to inspect clustering results. For example, you can use a dendrogram plot for clustering visualization.

A MATLAB plot created using the silhouette function showing that the data is split into two clusters of equal size. All the points in the two clusters have large silhouette values (0.8 or greater), indicating that the clusters are well separated. (See MATLAB code.)
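
For readers who want to see how a silhouette value is computed, here is a minimal Python sketch. For each point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to the points of any other cluster, and the silhouette value is (b - a) / max(a, b). The function name and sample data are assumptions, Euclidean distance is assumed, and a point that is alone in its cluster is given the value 0 by convention. The code is not the MATLAB silhouette function.

```python
import math

def silhouette_values(points, labels):
    """Return one silhouette value per point (requires at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # Convention assumed for single-point clusters.
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so divide by len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in members) / len(members)
                for other, members in clusters.items() if other != lab)
        values.append((b - a) / max(a, b))
    return values

# Hypothetical data: two compact, well-separated clusters of equal size.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
values = silhouette_values(points, labels)
print(values)
```

Values near 1 indicate that a point sits well inside its own cluster, values near 0 indicate a point on the boundary between clusters, and negative values suggest that a point may be assigned to the wrong cluster.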

Clustering for Images

You can perform image segmentation (using the imsegkmeans function) and volume segmentation (using the imsegkmeans3 function) on images by clustering regions of pixels based on similarities in color or shape. You can create a segmented labeled image using a specific clustering algorithm. For example, in medical imaging you can detect and label pixels in an image or voxels of a 3D volume that represent a tumor in a patient’s brain or other organs. By leveraging MATLAB tools, you can process and analyze images for a wide range of applications, from disease diagnosis to land use classification.
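
The idea of segmenting an image by clustering its pixels can be illustrated with a small Python sketch that groups grayscale intensities with one-dimensional k-means and returns a label for every pixel. This is a toy example under stated assumptions: the function name, the evenly spaced initial centers, and the tiny sample image are hypothetical, and the code is not the imsegkmeans function.

```python
def segment_grayscale(image, k=2, iterations=20):
    """Cluster pixel intensities with 1-D k-means and return a label image."""
    pixels = [v for row in image for v in row]
    lo, hi = min(pixels), max(pixels)
    # Assumption: start with centers spread evenly over the intensity range.
    centers = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    for _ in range(iterations):
        # Assignment step: group each intensity with its nearest center.
        groups = [[] for _ in range(k)]
        for v in pixels:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    # Label every pixel with the index of its nearest final center.
    return [[min(range(k), key=lambda j: abs(v - centers[j])) for v in row]
            for row in image]

# Hypothetical 3-by-4 grayscale image: dark background, bright region.
image = [
    [12, 10, 200, 210],
    [15, 205, 220, 11],
    [14, 13, 198, 12],
]
label_image = segment_grayscale(image, k=2)
for row in label_image:
    print(row)
```

The result is a label image with the same shape as the input, in which each pixel carries the index of its cluster. Clustering on color would follow the same pattern with a vector of channel values per pixel instead of a single intensity.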

Brain tumor detection from an MR image using fuzzy c-means clustering in MATLAB. The four black-and-white panels show the test image, segmented image, tumor detection, and labeled image. (See Fuzzy Logic Toolbox example.)