
incrementalKMeans

Incremental k-means clustering

Since R2025a

    Description

    The incrementalKMeans function creates an incrementalKMeans model object that is suitable for incremental k-means clustering. Unlike the kmeans function, which requires you to provide all the data before computing the cluster assignments, incrementalKMeans allows you to update the clustering model incrementally by supplying chunks of data to the incremental fit function. To perform incremental k-means clustering with a dynamically changing number of clusters, use incrementalDynamicKMeans.

    When you call the incrementalKMeans function, you can specify clustering options, such as the distance metric, the warm-up period, and whether to standardize the training data before fitting the model. After you create an incrementalKMeans object, it is prepared for incremental k-means clustering. For more information, see Incremental k-Means Clustering.

    Creation

    You can create an incrementalKMeans model object in two ways:

    • Call the function directly — Configure incremental k-means clustering options by calling incrementalKMeans directly. This approach is best when you do not have data yet or you want to start incremental k-means clustering immediately. When you call incrementalKMeans, you can specify cluster centroids and cluster counts so that the initial model is warm.

    • Call an incremental learning function — The fit and updateMetrics functions accept a configured incrementalKMeans model object and data as input, and return an incrementalKMeans model object updated with information computed from the input model and data.

    Description

    Mdl = incrementalKMeans(numClusters=k) creates an incremental k-means model object for incremental clustering with a fixed number of clusters k and default model parameters.


    Mdl = incrementalKMeans(centroids=C) creates an incremental k-means model object using the cluster centroids in C.

    Mdl = incrementalKMeans(___,Name=Value) specifies options using one or more name-value arguments in addition to one of the input arguments in the previous syntaxes. For example, Mdl=incrementalKMeans(numClusters=2,Distance="cityblock") creates an incrementalKMeans model object with two clusters using the city block distance metric.


    Input Arguments


    numClusters=k

    Number of clusters, specified as a positive integer. This argument sets the NumClusters property.

    If you specify k:

    • You cannot specify C.

    • Centroids is a k-by-NumPredictors array of NaN values. If NumPredictors=0, the software resizes Centroids when you call the incremental fit function.

    Example: 10

    Data Types: single | double

    centroids=C

    Initial cluster centroids, specified as an n-by-p numeric matrix where each row contains a cluster centroid, and each column contains the predictor values. The software uses C to set the initial values of the Centroids property.

    If you specify C:

    • Centroids contains n rows.

    • Centroids contains the unique rows of C and additional rows of NaN values, if C contains nonunique rows.

    • You cannot specify k or StandardizeData.

    • The software sets NumClusters=n and StandardizeData=false.

    • You cannot specify a nonzero value of NumPredictors. If you specify NumPredictors=0, the software sets NumPredictors equal to p.

    Example: [2 4 5; 1 3 3; 2 5 1]

    Data Types: single | double

    Name-Value Arguments


    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: Mdl = incrementalKMeans(numClusters=13,EstimationPeriod=1000,StandardizeData=true) creates a model object with 13 clusters that standardizes the data using an estimation period of 1000 observations.

    ClusterCounts

    Cluster counts, specified as a vector of positive integers. This argument sets the initial value of the ClusterCounts property. The software updates this property when you call the reset function or the incremental fit function. The incremental fit function uses ClusterCounts to determine the learning rate when it updates the cluster centroids.

    If you specify ClusterCounts=counts when you create Mdl:

    • You must specify C.

    • You cannot specify k.

    • You cannot specify StandardizeData. The software sets StandardizeData=false.

    • counts must be a vector of positive integers with length size(C,1).

    • ClusterCounts is a column vector with length size(C,1).

    • The first m rows of ClusterCounts contain the sum of the counts values for each of the m unique rows of C. If C contains nonunique rows, the remaining rows of ClusterCounts contain zeros.

    If you do not specify ClusterCounts when you create Mdl:

    • ClusterCounts is a k-by-1 vector of zeros if you specify k.

    • ClusterCounts is a size(C,1)-by-1 vector if you specify C. The first m rows of ClusterCounts contain the number of instances of each of the m unique rows in C. The remaining rows of ClusterCounts contain zeros.
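
    The duplicate-row handling described above for C and ClusterCounts can be sketched in base MATLAB. This is only an illustration of the documented result, not the function's internal code. The values of C and counts are hypothetical, and keeping the unique rows in first-occurrence order (the "stable" option) is an assumption.

```matlab
% Hypothetical initial centroids (row 3 duplicates row 1) and their counts
C      = [2 4 5; 1 3 3; 2 4 5];
counts = [10; 20; 30];

% Collapse duplicate rows; "stable" keeps first-occurrence order (assumption)
[U,~,ic] = unique(C,"rows","stable");
m = size(U,1);   % number of unique rows
n = size(C,1);   % number of rows originally supplied

% Pad back to n rows: NaN rows for centroids, zeros for counts
centroidsExpected = [U; NaN(n-m,size(C,2))];
countsExpected    = [accumarray(ic,counts); zeros(n-m,1)];

disp(centroidsExpected)
disp(countsExpected)
```

    In this sketch, the counts of the two identical rows are summed into the first row, and the trailing row holds NaN centroid values with a count of zero.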

    Example: ClusterCounts=[2 4 9 2 5 2 6 7]

    Data Types: single | double

    NumPredictors

    Number of predictors, specified as a nonnegative integer. This argument sets the NumPredictors property.

    • If you specify C when you create Mdl:

      • You can only specify NumPredictors=size(C,2) or NumPredictors=0.

      • The software sets NumPredictors=size(C,2) if you do not specify NumPredictors or specify NumPredictors=0.

    • If you specify k and do not specify NumPredictors when you create Mdl, the software sets NumPredictors=0.

    • If NumPredictors=0, the software infers the number of predictors from the training data and updates NumPredictors when you call the incremental fit function.

    Example: NumPredictors=10

    Data Types: single | double

    Distance

    Distance metric in p-dimensional space used for minimization, where p is the number of predictors in the training data, specified as "sqeuclidean", "cityblock", "cosine", or "correlation". The incrementalKMeans function does not support the Hamming distance metric. This argument sets the Distance property.

    incrementalKMeans computes cluster centroids differently for the supported distance metrics. This table summarizes the available distance metrics. In each formula, x is an observation (that is, a row of X) and c is a centroid (a row vector).

    Distance Metric | Description | Formula

    "sqeuclidean"

    Squared Euclidean distance (default). Each centroid is the mean of the points in the cluster.

    $d(x,c) = (x - c)(x - c)'$

    "cityblock"

    Sum of absolute differences, that is, the L1 distance. Each centroid is the component-wise median of the points in the cluster.

    $d(x,c) = \sum_{j=1}^{p} |x_j - c_j|$

    "cosine"

    One minus the cosine of the included angle between points (treated as vectors). Each centroid is the mean of the points in the cluster, after the points are normalized to unit Euclidean length.

    $d(x,c) = 1 - \dfrac{x c'}{\sqrt{(x x')(c c')}}$

    "correlation"

    One minus the sample correlation between points (treated as sequences of values). Each centroid is the component-wise mean of the points in the cluster, after the points are centered and normalized to zero mean and unit standard deviation.

    $d(x,c) = 1 - \dfrac{(x - \bar{x})(c - \bar{c})'}{\sqrt{(x - \bar{x})(x - \bar{x})'}\,\sqrt{(c - \bar{c})(c - \bar{c})'}}$,

    where

    • $\bar{x} = \frac{1}{p}\left(\sum_{j=1}^{p} x_j\right)\vec{1}_p$

    • $\bar{c} = \frac{1}{p}\left(\sum_{j=1}^{p} c_j\right)\vec{1}_p$

    • $\vec{1}_p$ is a row vector of p ones.

    Example: Distance="cityblock"

    Data Types: char | string
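
    As a quick check of the four distance formulas, the following base MATLAB sketch evaluates each one for a single observation and centroid. The vectors x and c are hypothetical values chosen for illustration; the code only restates the formulas and does not call incrementalKMeans.

```matlab
x = [1 2 3];   % hypothetical observation (row vector)
c = [2 2 5];   % hypothetical centroid (row vector)

% Squared Euclidean distance
dSqEuclidean = (x - c)*(x - c)';

% City block (L1) distance
dCityblock = sum(abs(x - c));

% Cosine distance: one minus the cosine of the angle between x and c
dCosine = 1 - (x*c')/sqrt((x*x')*(c*c'));

% Correlation distance: center each vector by its own mean first
xCentered = x - mean(x);
cCentered = c - mean(c);
dCorrelation = 1 - (xCentered*cCentered')/ ...
    (sqrt(xCentered*xCentered')*sqrt(cCentered*cCentered'));

fprintf("sqeuclidean: %g\ncityblock:   %g\ncosine:      %g\ncorrelation: %g\n", ...
    dSqEuclidean,dCityblock,dCosine,dCorrelation)
```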

    ForgettingFactor

    Forgetting factor for cluster centroid updates, specified as a scalar value from 0 to 1. This argument sets the ForgettingFactor property.

    A forgetting factor value of 0.1 gives more weight to the older data than a forgetting factor value of 0.9. A forgetting factor value of 0 indicates infinite memory, where all the previous observations have equal weight when the incremental fit function updates the cluster centroids.

    Example: ForgettingFactor=0.1

    Data Types: double | single
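
    To build intuition for how a forgetting factor shifts weight between old and new data, here is a minimal sketch of a count-decay update for a single one-dimensional centroid. The update rule, the variable names, and the chunk means are all assumptions made for illustration; this page does not spell out the exact rule that the incremental fit function applies.

```matlab
% Sketch only: an assumed count-decay update, not the documented algorithm
alphas     = [0 0.1 0.9];   % candidate forgetting factors
chunkMeans = [0 0 0 10];    % hypothetical per-chunk means; the last chunk jumps
nPerChunk  = 100;           % observations per chunk

finalCentroid = zeros(size(alphas));
for a = 1:numel(alphas)
    centroid = chunkMeans(1);   % centroid after the first chunk
    count    = nPerChunk;       % effective number of observations so far
    for t = 2:numel(chunkMeans)
        count    = (1 - alphas(a))*count + nPerChunk;  % decay old mass, add new
        rate     = nPerChunk/count;                    % learning rate from counts
        centroid = centroid + rate*(chunkMeans(t) - centroid);
    end
    finalCentroid(a) = centroid;
end
disp(finalCentroid)
```

    Under this assumed rule, a factor of 0 reproduces the plain average of all chunks (equal weight for every observation), and larger factors pull the centroid closer to the most recent chunk.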

    WarmupPeriod

    Number of observations to which the model must be fit before it is warm, specified as a nonnegative integer. This argument sets the WarmupPeriod property.

    When a model is warm, the incremental fit function returns cluster indices, and the incremental updateMetrics function returns performance metrics. When processing observations during the warm-up period, the software ignores observations that contain at least one missing value. If you specify C and ClusterCounts when you create Mdl, and C contains no duplicate rows, then IsWarm is true and the default value of WarmupPeriod is 0. Otherwise, the default value of WarmupPeriod is 1000.

    Note

    IsWarm cannot be true if Centroids contains any NaN values or NumPredictors is 0.

    Example: WarmupPeriod=100

    Data Types: single | double

    Metrics

    Performance metrics to track during incremental learning, specified as "SimplifiedSilhouette". The Metrics property of Mdl stores two forms of each performance metric as variables (columns) of a table, Cumulative and Window, with individual metrics in rows. MetricsWindowSize determines the update frequency of the Window metrics. For more details, see Estimation Period and Simplified Silhouette.

    Example: Metrics="SimplifiedSilhouette"

    Data Types: char | string

    MetricsWindowSize

    Number of observations to use to compute window performance metrics, specified as a positive integer. The default value is 200. This argument sets the MetricsWindowSize property.

    For more details on performance metrics options, see Performance Metrics.

    Example: MetricsWindowSize=100

    Data Types: single | double

    StandardizeData

    Flag to standardize the predictor data, specified as a numeric or logical 0 (false) or 1 (true).

    If you specify StandardizeData=true, the incremental fit function estimates the predictor means Mu and standard deviations Sigma during the estimation period specified by EstimationPeriod, and standardizes the predictor data.

    You cannot specify StandardizeData if you specify C.

    For more information, see Standardize Data.

    Example: StandardizeData=true

    Data Types: single | double | logical

    EstimationPeriod

    Number of observations processed by the incremental model to estimate the predictor means and standard deviations, specified as a nonnegative integer. This argument sets the EstimationPeriod property.

    If you specify StandardizeData=true, the default value is 1000. Otherwise, the default value is 0.

    If you specify EstimationPeriod when you create Mdl:

    • The software sets EstimationPeriod=0 when you specify C or StandardizeData=false.

    • The software uses EstimationPeriod observations to estimate the predictor means (Mu) and standard deviations (Sigma) prior to training the model.

    • The software ignores observations that contain at least one missing value when processing observations during the estimation period.

    For more details, see Estimation Period.

    Example: EstimationPeriod=500

    Data Types: single | double
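
    The interplay of StandardizeData and EstimationPeriod can be sketched in base MATLAB as follows. This is an assumed illustration of the idea (estimate the means and standard deviations from the first EstimationPeriod complete observations, then standardize later data with them), not the function's internal code. The data, the variable names, and the use of the sample standard deviation are assumptions.

```matlab
rng(0,"twister")   % for reproducibility
X = [5 + 2*randn(1000,1), 100 + 30*randn(1000,1)];   % hypothetical stream

estimationPeriod = 500;

% Use only the estimation-period rows, ignoring rows with missing values
Xest = X(1:estimationPeriod,:);
Xest = Xest(~any(isnan(Xest),2),:);

mu    = mean(Xest,1);    % predictor means
sigma = std(Xest,0,1);   % predictor standard deviations (sample std, assumed)

% Standardize the observations that arrive after the estimation period
Z = (X(estimationPeriod+1:end,:) - mu)./sigma;
```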

    Properties


    Training Parameters

    Mu

    This property is read-only.

    Predictor means, represented as a numeric vector.

    • When you create Mdl and specify NumPredictors=0 or StandardizeData=false (the default), then Mu is an empty array [].

    • When you create Mdl and set StandardizeData=true, specify NumPredictors as a positive integer, and specify k, then Mu is initially a 1-by-NumPredictors vector of zeros. Otherwise, Mu is [].

    • When you create Mdl and set StandardizeData=true, and Mu is [] or an array of zeros, then the incremental fit function calculates the predictor variable means using all data points that do not have any missing values. At the end of the estimation period specified by EstimationPeriod, Mu is a NumPredictors-by-1 vector that contains the predictor means.

    You cannot specify Mu directly.

    Data Types: single | double

    Sigma

    This property is read-only.

    Predictor standard deviations, represented as a numeric vector.

    • When you create Mdl and specify NumPredictors=0 or StandardizeData=false (the default), then Sigma is an empty array [].

    • When you create Mdl and set StandardizeData=true, specify NumPredictors as a positive integer, and specify k, then Sigma is initially a 1-by-NumPredictors vector of zeros. Otherwise, Sigma is [].

    • When you create Mdl and set StandardizeData=true, and Sigma is [] or an array of zeros, then the incremental fit function calculates the predictor variable standard deviations using all data points that do not have any missing values. At the end of the estimation period specified by EstimationPeriod, Sigma is a NumPredictors-by-1 vector that contains the predictor standard deviations.

    You cannot specify Sigma directly.

    Data Types: single | double

    EstimationPeriod

    This property is read-only after object creation.

    Number of observations processed by the incremental model to estimate the predictor means and standard deviations, represented as a nonnegative integer. If you specify StandardizeData=true when you create Mdl, the default value is 1000. Otherwise, the default value is 0.

    If EstimationPeriod > 0:

    • The software uses EstimationPeriod observations to estimate the predictor means (Mu) and standard deviations (Sigma) prior to training the model.

    • The software ignores observations that contain at least one missing value when processing observations during the estimation period.

    For more details, see Estimation Period.

    Data Types: single | double

    Distance

    This property is read-only after object creation.

    Distance metric in p-dimensional space used for minimization, where p is the number of variables in the training data, stored as "sqeuclidean", "cityblock", "cosine", or "correlation". For a description of the supported distance metrics, see Distance. The incrementalKMeans function does not support the Hamming distance metric.

    Data Types: string

    ForgettingFactor

    This property is read-only after object creation.

    Forgetting factor for cluster centroid updates, represented as a scalar value from 0 to 1. A forgetting factor value of 0.1 gives more weight to the older data than a forgetting factor value of 0.9. A forgetting factor value of 0 indicates infinite memory, where all the previous observations have equal weight when the incremental fit function updates the cluster centroids.

    Data Types: single | double

    NumTrainingObservations

    This property is read-only.

    Number of observations fit to the incremental model Mdl, represented as a nonnegative numeric scalar. NumTrainingObservations increases when you pass Mdl and training data to the incremental fit function outside of the estimation period. The software resets NumTrainingObservations to 0 when you call the reset function.

    When fitting the model, the software ignores observations that contain at least one missing value.

    You cannot specify NumTrainingObservations directly.

    Data Types: double

    Clustering Parameters

    NumPredictors

    This property is read-only after object creation.

    Number of predictors, represented as a nonnegative integer.

    • If you specify C when you create Mdl and do not specify NumPredictors, or specify NumPredictors=0, the software sets NumPredictors=size(C,2).

    • If you specify k when you create Mdl and do not specify NumPredictors, the initial value of NumPredictors is 0.

    • If NumPredictors=0, the software infers the number of predictors from the training data and updates NumPredictors when you call the incremental fit function.

    Data Types: single | double

    NumClusters

    This property is read-only after object creation.

    Number of clusters, represented as a positive integer. If you do not specify k when you create Mdl, then NumClusters is equal to size(C,1). The software updates NumClusters when you call the reset function or the incremental fit function.

    Data Types: single | double

    Centroids

    This property is read-only after object creation.

    Cluster centroids, represented as a NumClusters-by-NumPredictors numeric matrix where each row contains a cluster centroid, and each column contains the predictor values. The software updates Centroids when you call the reset function or the incremental fit function.

    If you do not specify C when you create Mdl:

    • Centroids is initially a k-by-NumPredictors array of NaN values.

    • When you call the incremental fit function with predictor data X :

      • If NumPredictors=0, the function resizes Centroids to have the same number of columns as X.

      • If Centroids has i rows of NaN values, the function sets their values equal to the first i observations in X.

    Data Types: single | double

    ClusterCounts

    This property is read-only after object creation.

    Cluster counts, represented as a NumClusters-by-1 numeric vector. The software updates ClusterCounts when you call the reset function or the incremental fit function. The incremental fit function uses ClusterCounts to determine the learning rate when it updates the cluster centroids. If ForgettingFactor is 0, then the values of ClusterCounts are 1 + the number of observations assigned to each cluster. Otherwise, the values of ClusterCounts represent the relative sizes of each cluster.

    If you specify ClusterCounts=counts when you create Mdl:

    • ClusterCounts is a column vector with length size(C,1).

    • The first m rows of ClusterCounts contain the sum of the counts values for each unique row of C, if C contains nonunique rows and m unique rows. The remaining rows of ClusterCounts contain zeros.

    If you do not specify ClusterCounts when you create Mdl:

    • ClusterCounts is a k-by-1 vector of zeros, if you specify k.

    • ClusterCounts is a size(C,1)-by-1 vector, if you specify C. The first m rows of ClusterCounts contain ones, where m is the number of unique rows in C. The remaining rows of ClusterCounts contain zeros.

    Data Types: single | double

    Performance Metrics Parameters

    IsWarm

    This property is read-only.

    Flag indicating whether the incremental fit function returns cluster indices and the incremental updateMetrics function returns performance metrics, represented as a numeric or logical 0 (false) or 1 (true).

    IsWarm becomes true after the incremental fit function fits the incremental model to WarmupPeriod observations. However, IsWarm cannot be true if Centroids contains any NaN values or NumPredictors is 0.

    If IsWarm is false:

    • The idx output of fit consists of NaN values.

    • The updateMetrics function stores NaN values in Metrics.

    If Mdl.EstimationPeriod > 0, then during the estimation period:

    • IsWarm is false.

    • The value of NumTrainingObservations is 0.

    • The fit function does not fit the model.

    • The updateMetrics function does not store any values in Metrics.

    You cannot specify IsWarm directly.

    Data Types: single | double | logical
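
    The warm-up behavior described for IsWarm can be observed with a short experiment like the one below. It is a sketch that assumes a release with incrementalKMeans (R2025a or later) and uses hypothetical random data. The comments state what the text above leads you to expect rather than verified output.

```matlab
rng(0,"twister")    % for reproducibility
X = randn(120,2);   % hypothetical stream with no missing values

Mdl = incrementalKMeans(numClusters=2,WarmupPeriod=100);

% First chunk: 60 observations, fewer than WarmupPeriod
[Mdl,idx1] = fit(Mdl,X(1:60,:));
warmAfterChunk1 = Mdl.IsWarm;    % expected to be false per the text
allNaN1 = all(isnan(idx1));      % idx should consist of NaN values

% Second chunk brings the total to 120 observations
[Mdl,idx2] = fit(Mdl,X(61:120,:));
warmAfterChunk2 = Mdl.IsWarm;    % expected to be true once WarmupPeriod is met
```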

    WarmupPeriod

    This property is read-only after object creation.

    Number of observations to which the model must be fit before it is warm, represented as a nonnegative integer. When a model is warm, the incremental fit function returns cluster indices, and the incremental updateMetrics function returns performance metrics. When processing observations during the warm-up period, the software ignores observations that contain at least one missing value. If you specify both C and ClusterCounts when you create Mdl, and C contains no duplicate rows, then IsWarm=true and the default value of WarmupPeriod is 0. Otherwise, the default value of WarmupPeriod is 1000.

    Note

    IsWarm cannot be true if Centroids contains any NaN values or NumPredictors is 0.

    Data Types: single | double

    Metrics

    This property is read-only.

    Model performance metrics updated during incremental learning by updateMetrics, represented as a table with two columns labeled Cumulative and Window.

    • Cumulative — Model performance, as measured by the simplified silhouette metric, from the time the model becomes warm (IsWarm is 1).

    • Window — Model performance, as measured by the simplified silhouette metric, evaluated over all observations within the window specified by the MetricsWindowSize property. The software updates Window after it processes MetricsWindowSize observations.

    The software sets Metrics to NaN when you call the reset function.

    You cannot specify Metrics directly.

    Data Types: table
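
    For orientation, the simplified silhouette of Vendramin et al. compares each observation's distance to its own centroid, a, with its distance to the nearest other centroid, b, and averages (b - a)/max(a,b) over the observations. The base MATLAB sketch below computes that quantity for hypothetical data using the squared Euclidean distance. How the software accumulates the Cumulative and Window values may differ, so treat this as an illustration of the metric only.

```matlab
X   = [0 0; 0 1; 10 10; 10 11];   % hypothetical observations
C   = [0 0.5; 10 10.5];           % hypothetical centroids
idx = [1; 1; 2; 2];               % cluster assignment of each observation

% Squared Euclidean distance from every observation to every centroid
D = zeros(size(X,1),size(C,1));
for k = 1:size(C,1)
    D(:,k) = sum((X - C(k,:)).^2,2);
end

own = sub2ind(size(D),(1:size(X,1))',idx);
a = D(own);             % distance to the assigned centroid
Dother = D;
Dother(own) = Inf;      % mask the assigned centroid
b = min(Dother,[],2);   % distance to the nearest other centroid

s = (b - a)./max(a,b);  % per-observation simplified silhouette
simplifiedSilhouette = mean(s)
```

    Values close to 1 indicate observations that lie much closer to their own centroid than to any other centroid.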

    MetricsWindowSize

    This property is read-only after object creation.

    Number of observations to use to compute window performance metrics, represented as a positive integer. The default value is 200.

    For more details on performance metrics options, see Performance Metrics.

    Data Types: single | double

    Object Functions

    fit               Fit model for incremental k-means clustering
    updateMetrics     Update performance metrics in incremental k-means clustering model given new data
    assignClusters    Assign observations to existing clusters
    reset             Reset incremental k-means clustering model
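
    The four object functions can be exercised together in a few lines. The sketch below assumes a release with incrementalKMeans (R2025a or later), uses hypothetical synthetic data, and assumes that reset takes the model as its only input and returns the reset model.

```matlab
rng(0,"twister")                            % for reproducibility
X = [randn(200,2) + 4; randn(200,2) - 4];   % hypothetical two-cluster data
X = X(randperm(size(X,1)),:);               % shuffle the observations

Mdl = incrementalKMeans(numClusters=2,WarmupPeriod=0);

Mdl = fit(Mdl,X);                        % update centroids with the chunk
Mdl = updateMetrics(Mdl,X);              % update the simplified silhouette
idx = assignClusters(Mdl,[4 4; -4 -4]);  % assign new points to clusters

Mdl = reset(Mdl);                        % assumed syntax; clears learned state
numObsAfterReset = Mdl.NumTrainingObservations;
```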

    Examples


    Create an incremental model for k-means clustering that has two clusters.

    Mdl = incrementalKMeans(numClusters=2)
    Mdl = 
      incrementalKMeans
    
             IsWarm: 0
            Metrics: [1×2 table]
        NumClusters: 2
          Centroids: [2×0 double]
           Distance: "sqeuclidean"
    
    
      Properties, Methods
    
    

    Mdl is an incrementalKMeans model object. All its properties are read-only.

    Load and Preprocess Data

    Load the New York City housing data set.

    load NYCHousing2015.mat

    The data set includes 10 variables with information on the sales of properties in New York City in 2015. Keep only the gross square footage and sale price predictors. Keep all records that have a gross square footage above 100 square feet and a sales price above $1000.

    data = NYCHousing2015(:,{'GROSSSQUAREFEET','SALEPRICE'});
    data = data((data.GROSSSQUAREFEET > 100 & data.SALEPRICE > 1000),:);

    Convert the tabular data into a matrix that contains the logarithm of both predictors.

    X = table2array(log10(data));

    Randomly shuffle the order of the records.

    rng(0,"twister"); % For reproducibility
    X = X(randperm(size(X,1)),:);

    Fit and Plot Incremental Model

    Fit the incremental model Mdl to the data by using the fit function. To simulate a data stream, fit the model in chunks of 500 records at a time. At each iteration:

    • Process 500 observations.

    • Overwrite the previous incremental model with a new one fitted to the incoming records.

    • Update the performance metrics for the model. The default metric for Mdl is SimplifiedSilhouette.

    • Store the cumulative and window metrics to see how they evolve during incremental learning.

    • Compute the cluster assignments of all records seen so far, according to the current model.

    • Plot all records seen so far, and color each record by its cluster assignment.

    • Plot the current centroid location of each cluster.

    In this workflow, the updateMetrics function provides information about the model's clustering performance after it is fit to the incoming data chunk. In other workflows, you might want to evaluate a clustering model's performance on unseen data. In such cases, you can call updateMetrics prior to calling the incremental fit function.

    % Initialize plot properties
    hold on
    h1 = scatter(NaN,NaN,0.3);
    h2 = plot(NaN,NaN,Marker="o", ...
        MarkerFaceColor="k",MarkerEdgeColor="k");
    h3 = plot(NaN,NaN,Marker="^", ...
        MarkerFaceColor="b",MarkerEdgeColor="b");
    colormap(gca,"prism")
    pbaspect([1,1,1])
    xlim([min(X(:,1)),max(X(:,1))]);
    ylim([min(X(:,2)),max(X(:,2))]);
    xlabel("log Gross Square Footage");
    ylabel("log Sales Price in Dollars")
    
    % Incremental fitting and plotting
    n = numel(X(:,1));
    numObsPerChunk = 500;
    nchunk = floor(n/numObsPerChunk);
    sil = array2table(zeros(nchunk,2),VariableNames=["Cumulative" "Window"]);
    
    for j = 1:nchunk
        ibegin = min(n,numObsPerChunk*(j-1) + 1);
        iend = min(n,numObsPerChunk*j);
        idx = ibegin:iend;    
        Mdl = fit(Mdl,X(idx,:));
        Mdl = updateMetrics(Mdl,X(idx,:));
        sil{j,:} = Mdl.Metrics{'SimplifiedSilhouette',:};
        indices = assignClusters(Mdl,X(1:iend,:));
        title("Iteration " + num2str(j))
        set(h1,XData=X(1:iend,1),YData=X(1:iend,2),CData=indices);
        set(h2,Marker="none") % Erase previous centroid markers
        set(h3,Marker="none")
        set(h2,XData=Mdl.Centroids(1,1),YData=Mdl.Centroids(1,2),Marker="o")
        set(h3,XData=Mdl.Centroids(2,1),YData=Mdl.Centroids(2,2),Marker="^")
        pause(0.5);
    end
    
    hold off

    [Figure: axes titled "Iteration 59" with x-axis "log Gross Square Footage" and y-axis "log Sales Price in Dollars", showing the scattered records and the two centroid markers.]

    To view the animated figure, run the example or open the animated GIF FixedNumberofClusters.gif in your web browser.

    At each iteration, the animated plot displays all the observations processed so far as small circles, and colors them according to the cluster assignments of the current model. The black circle indicates the centroid position of cluster 1, and the blue triangle indicates the centroid position of cluster 2.

    Plot the window and cumulative metrics values at each iteration.

    h4 = plot(sil.Variables);
    xlabel("Iteration")
    ylabel("Performance Metric")
    xline(Mdl.WarmupPeriod/numObsPerChunk,'g-.')
    legend(h4,sil.Properties.VariableNames,Location="southeast")

    [Figure: line plot of the Cumulative and Window values, with x-axis "Iteration" and y-axis "Performance Metric", and a vertical constant line at the end of the warm-up period.]

    The updateMetrics function calculates the performance metrics after the end of the warm-up period. The performance metrics rise rapidly from an initial value of 0.81 and approach a value of approximately 0.88 after 10 iterations.
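
    The alternative ordering mentioned earlier in this example, in which updateMetrics runs before fit so that each chunk is scored before the model learns from it, might look like the following sketch. It assumes a release with incrementalKMeans (R2025a or later) and uses hypothetical synthetic data instead of the housing data. The chunk size and warm-up period are illustrative choices.

```matlab
rng(0,"twister")                              % for reproducibility
X = [randn(1500,2) + 3; randn(1500,2) - 3];   % hypothetical two-cluster stream
X = X(randperm(size(X,1)),:);                 % shuffle the observations

Mdl = incrementalKMeans(numClusters=2,WarmupPeriod=500);
numObsPerChunk = 500;
nchunk = floor(size(X,1)/numObsPerChunk);

for j = 1:nchunk
    idx = (numObsPerChunk*(j-1) + 1):(numObsPerChunk*j);
    Mdl = updateMetrics(Mdl,X(idx,:));   % score the chunk before training on it
    Mdl = fit(Mdl,X(idx,:));             % then update the model with the chunk
end

Mdl.Metrics   % display the cumulative and window metrics
```

    With this ordering, the metrics reflect performance on data that the model has not yet been fit to, at the cost of storing no metric values until the model is warm.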

    Create a set of noisy position measurements of two moving objects. Object 1 starts at (x,y) coordinate (-50,0) and moves along the x-axis. Object 2 starts at (x,y) coordinate (0,-40) and moves along the y-axis. The objects move at the same speed.

    Generate numObsPerStep=100 measurements of each object at numSteps=100 individual time steps.

    rng(0,"twister") % For reproducibility
    sigma = 2;  % Measurement noise level
    numObsPerStep = 100;
    numSteps = 100;
    startPosA = [-50,0];
    startPosB = [0,-40];
    X = [];
    for t = 0:numSteps-1
        for i = 1:numObsPerStep
            p = randn(1,4)*sigma;  % Gaussian measurement noise
            X = [X;[[p(1)+t+startPosA(1);p(2)+startPosB(1)], ...
                [p(3)+startPosA(2);p(4)+t+startPosB(2)]]];
        end
    end

    The rows of X contain 2*numObsPerStep*numSteps position measurements. The columns of X contain the x and y coordinates of each measurement, respectively.

    Create Incremental k-Means Clustering Models

    To track the centroids of the moving clusters, create two incremental k-means clustering model objects that each have two clusters and no warm-up period. Specify a forgetting factor value of 0.1 for the first model, and 0.75 for the second model. A lower value of the forgetting factor (which can range from 0 to 1) assigns more weight to older measurements when the incremental fit algorithm calculates new cluster centroids.

    MdlA = incrementalKMeans(numClusters=2,WarmupPeriod=0, ...
        ForgettingFactor=0.1);
    MdlB = incrementalKMeans(numClusters=2,WarmupPeriod=0, ...
        ForgettingFactor=0.75);

    Fit and Plot Incremental Models

    Fit the incremental k-means clustering models to the data by using the fit function. Fit the models in data chunks that consist of the measurements at each time step. At each iteration:

    • Process 2*numObsPerStep observations.

    • Overwrite the previous incremental models with new ones fitted to the incoming measurements.

    • Update the performance metrics for the models. The metric for the models is SimplifiedSilhouette.

    • Store the cumulative and window metrics to see how they evolve during incremental learning.

    • Compute the cluster assignments of the incoming chunk of measurements, according to the current model A.

    • Plot the incoming chunk of measurements, and color each measurement by its cluster assignment according to model A.

    • Plot the current model centroid locations for each cluster.

    • Plot all of the previous measurements using gray points.

    % Initialize plot properties
    hold on
    h1 = scatter(NaN,NaN,0.2,[0.9 0.9 0.9],".");
    h2 = scatter(NaN,NaN,1.5); 
    h3 = plot(NaN,NaN,"^",MarkerSize=6,MarkerEdgeColor="k", ...
        MarkerFaceColor="k"); 
    h4 = plot(NaN,NaN,"square",MarkerSize=6,MarkerEdgeColor="b", ...
        MarkerFaceColor="b");
    colormap(gca,"prism")
    xlim([min(X(:,1)),max(X(:,1))]);
    ylim([min(X(:,2)),max(X(:,2))]);
    xlabel("X");
    ylabel("Y");
    % Incremental fitting and plotting
    n = numel(X(:,1));
    nChunk = 2*numObsPerStep;
    silA = array2table(zeros(numSteps,2), ...
        'VariableNames',["Cumulative" "Window"]);
    silB = array2table(zeros(numSteps,2), ...
        'VariableNames',["Cumulative" "Window"]);
    for j = 1:numSteps
        ibegin = min(n,nChunk*(j-1) + 1);
        iend = min(n,nChunk*j);
        idx = ibegin:iend;    
        [MdlA,indices] = fit(MdlA,X(idx,:));
        MdlA = updateMetrics(MdlA,X(idx,:));
        MdlB = fit(MdlB,X(idx,:));
        MdlB = updateMetrics(MdlB,X(idx,:));
        title("Iteration " + num2str(j))
        silA{j,:} = MdlA.Metrics{'SimplifiedSilhouette',:};
        silB{j,:} = MdlB.Metrics{'SimplifiedSilhouette',:};
        set(h1,XData=X(1:ibegin-1,1),YData=X(1:ibegin-1,2));
        set(h2,XData=X(idx,1),YData=X(idx,2),CData=indices);
        set(h3,Marker="none") % Erase the previous centroid markers
        set(h4,Marker="none")
        set(h3,XData=MdlA.Centroids(:,1),YData=MdlA.Centroids(:,2), ...
            Marker="^");
        set(h4,XData=MdlB.Centroids(:,1),YData=MdlB.Centroids(:,2), ...
            Marker="square");
        pause(0.2);
    end
    hold off

    [Figure: axes titled "Iteration 100" with x-axis "X" and y-axis "Y", showing the position measurements and the centroid markers of both models.]

    [Animated figure: FitMovingCentroids.gif]

    At each iteration, the animated plot displays all of the position measurements processed so far in gray. The incremental fit function tracks the centroid of each object at each iteration. The measurements in the current data chunk are colored according to the cluster assignment of model A. The black upward-pointing triangles and blue squares indicate the fitted cluster centroids of models A and B, respectively.

    Model A does a good job of tracking the true position of each moving object. Because model B has a higher forgetting factor, the fit function assigns the highest weights to the most recent measurements. Therefore, model B does a poorer job of tracking the true positions of the objects.

    Plot the simplified silhouette performance metrics at each iteration.

    h5 = plot([silA.Variables,silB.Variables]);
    xlabel("Iteration")
    ylabel("Simplified Silhouette")
    legend(h5,{"Cumulative A","Window A", ...
        "Cumulative B","Window B"},Location="southwest")

    [Figure: line plot with x-axis "Iteration" and y-axis "Simplified Silhouette", showing four lines: Cumulative A, Window A, Cumulative B, and Window B.]

    The plot shows that the simplified silhouette values of model B are poorer than those of model A. The values of both models dip significantly between iterations 30 and 60, when the two objects are close to each other. As the objects move apart, the window values of both models return to their previous levels.

    Generate a training data set using three distributions.

    rng(0,"twister") % For reproducibility
    X = [randn(100,2)*0.75+ones(100,2);
        randn(100,2)*0.5-ones(100,2);
        randn(100,2)*0.75];

    Train a k-means clustering model on the batch of data using kmeans with the city block distance metric.

    dist = "cityblock";
    [idx,C] = kmeans(X,3,Distance=dist);

    Compute the number of data points in each cluster.

    countTable = tabulate(idx);
    counts = countTable(:,2)
    counts = 3×1
    
        84
       103
       113
    
    

    Plot the clusters and the cluster centroids.

    hold on
    gscatter(X(:,1),X(:,2),idx,"bgm")
    plot(C(:,1),C(:,2),"kx",MarkerSize=10)
    legend("Cluster 1","Cluster 2","Cluster 3","Cluster centroid")
    hold off

    [Figure: scatter plot of Cluster 1, Cluster 2, and Cluster 3, with the cluster centroids marked.]

    Create an incremental model for k-means clustering that uses the same distance metric. Initialize the incremental model object using the centroids and cluster counts from the fitted batch k-means model.

    Mdl = incrementalKMeans(centroids=C,ClusterCounts=counts,Distance=dist);
    details(Mdl)
      incrementalKMeans with properties:
    
                             Mu: []
                          Sigma: []
               EstimationPeriod: 0
                      Centroids: [3×2 double]
                  ClusterCounts: [3×1 double]
                       Distance: "cityblock"
               ForgettingFactor: 0.0500
                    NumClusters: 3
                         IsWarm: 1
        NumTrainingObservations: 0
                  NumPredictors: 2
                   WarmupPeriod: 0
                        Metrics: [1×2 table]
              MetricsWindowSize: 200
    
      Methods, Superclasses
    

    Mdl is an incrementalKMeans model object. All its properties are read-only. Because Mdl is warm, when you pass the model and streaming data to the incremental fit and updateMetrics functions, they return cluster indices and performance metrics, respectively.

    More About


    Tips

    • You can create an incrementalKMeans model object that incorporates the outputs of the kmeans function by using the following code:

      k = 2;
      [idx,C] = kmeans(X,k);
      countTable = tabulate(idx);
      counts = countTable(:,2);
      Mdl = incrementalKMeans(centroids=C,ClusterCounts=counts);

    References

    [1] Lloyd, S. Least Squares Quantization in PCM. IEEE Transactions on Information Theory 28, no. 2 (March 1982): 129–37.

    [2] Sculley, D. Web-Scale k-Means Clustering. In Proceedings of the 19th International Conference on World Wide Web, 1177–78. Raleigh North Carolina USA: ACM, 2010.

    [3] Vendramin, Lucas, Ricardo J.G.B. Campello, and Eduardo R. Hruschka. On the Comparison of Relative Clustering Validity Criteria. In Proceedings of the 2009 SIAM international conference on data mining, 733–744. Society for Industrial and Applied Mathematics, 2009.

    Version History

    Introduced in R2025a