boxchart

Visualize Shapley values using box charts (box plots)

Since R2024a

    Description

    boxchart(explainer) creates a box chart, or box plot, for each predictor in explainer.BlackboxModel.PredictorNames, where explainer is a shapley object. For each predictor, the function displays the Shapley values for the query points in explainer.QueryPoints. The corresponding box plot displays the following: the median, the lower and upper quartiles, any outliers (computed using the interquartile range), and the minimum and maximum values that are not outliers.

    If explainer.BlackboxModel is a classification model, the function displays box plots for class explainer.BlackboxModel.ClassNames(1) by default.
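    The box plot elements listed above can be reproduced by hand for a single predictor. The following is a minimal sketch, not the internal implementation of boxchart: the vector vals is hypothetical data, and the sketch assumes the common convention that an outlier lies more than 1.5 times the interquartile range beyond the box edges.

```matlab
% Hypothetical Shapley values for one predictor (illustrative data only)
vals = [-2.1 -1.0 -0.4 0 0.3 0.9 1.2 6.5];

% Lower quartile, median, and upper quartile
q = quantile(vals,[0.25 0.5 0.75]);

% Interquartile range and the assumed 1.5*IQR outlier fences
iqrVal = q(3) - q(1);
isOut = vals < q(1) - 1.5*iqrVal | vals > q(3) + 1.5*iqrVal;

% Outliers, and the whisker ends (minimum and maximum non-outlier values)
outliers = vals(isOut);
whiskerMin = min(vals(~isOut));
whiskerMax = max(vals(~isOut));
```

    In this sketch, only the value 6.5 falls outside the fences, so the whiskers end at the smallest and largest of the remaining values.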

    boxchart(explainer,Name=Value) specifies additional options using one or more name-value arguments. For example, specify NumImportantPredictors=5 to create box plots for the five predictors with the greatest mean absolute Shapley values (explainer.MeanAbsoluteShapley).

    boxchart(ax,___) displays the box plots in the target axes ax. Specify ax as the first argument in any of the previous syntaxes.

    b = boxchart(___) returns a BoxChart object using any of the input argument combinations in the previous syntaxes. Use b to query or modify the properties (BoxChart Properties) of the object after you create it.
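    The syntaxes above can be combined in one call. The following sketch is illustrative only: it assumes a small regression tree (fitrtree) as a stand-in blackbox model and the first 25 rows of the carbig data as query points, neither of which is prescribed by this page. The face color and marker style are arbitrary choices.

```matlab
% Setup (illustrative): a regression tree stands in for any blackbox model
load carbig
tbl = rmmissing(table(Acceleration,Displacement,Horsepower,Weight,MPG));
mdl = fitrtree(tbl,"MPG");
explainer = shapley(mdl,QueryPoints=tbl(1:25,:));  % multiple query points

% Draw into a specific axes and keep the BoxChart handle
figure
ax = axes;
b = boxchart(ax,explainer,NumImportantPredictors=3);

% Modify the chart after creation through BoxChart properties
b.BoxFaceColor = [0.3 0.5 0.7];
b.MarkerStyle = "x";
```

    Keeping the handle b is useful when you want to adjust the appearance of the chart without recomputing the Shapley values.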

    Examples

    Train a regression model and create a shapley object. Use the fit object function to compute the Shapley values for the specified query points. Then visualize the Shapley values for multiple query points by using the boxchart object function.

    Load the carbig data set, which contains measurements of cars made in the 1970s and early 1980s.

    load carbig

    Create a table containing the predictor variables Acceleration, Cylinders, and so on, as well as the response variable MPG.

    tbl = table(Acceleration,Cylinders,Displacement, ...
        Horsepower,Model_Year,Weight,MPG);

    Removing missing values in a training set helps to reduce memory consumption and speed up training for the fitrkernel function. Remove missing values in tbl.

    tbl = rmmissing(tbl);

    Train a blackbox model of MPG by using the fitrkernel function. Specify the Cylinders and Model_Year variables as categorical predictors. Standardize the remaining predictors.

    rng("default") % For reproducibility
    mdl = fitrkernel(tbl,"MPG",CategoricalPredictors=[2 5], ...
        Standardize=true);

    Create a shapley object. Because mdl does not contain training data, specify the data set tbl.

    explainer = shapley(mdl,tbl)
    explainer = 
                BlackboxModel: [1×1 RegressionKernel]
                  QueryPoints: []
               BlackboxFitted: []
                ShapleyValues: []
                            X: [392×7 table]
        CategoricalPredictors: [2 5]
                       Method: "interventional-kernel"
                    Intercept: 22.7326
                   NumSubsets: 64
    
    

    explainer stores the training data tbl in the X property.

    Compute the Shapley values for all observations in tbl. Speed up computations by using the UseParallel name-value argument, if you have a Parallel Computing Toolbox™ license.

    explainer = fit(explainer,tbl,UseParallel=true)
    Starting parallel pool (parpool) using the 'Processes' profile ...
    10-Jan-2024 09:32:00: Job Queued. Waiting for parallel pool job with ID 2 to start ...
    Connected to parallel pool with 6 workers.
    
    explainer = 
    shapley explainer with the following mean absolute Shapley values:
    
          Predictor       ShapleyValue
        ______________    ____________
    
        "Acceleration"      0.52233   
        "Cylinders"          1.0412   
        "Displacement"      0.80485   
        "Horsepower"         0.7589   
        "Model_Year"        0.82285   
        "Weight"            0.98453   
    
    
      Properties, Methods
    
    
    

    For a regression model, shapley computes Shapley values using the predicted response, and stores them in the ShapleyValues property. Because explainer contains Shapley values for multiple query points, the displayed output shows the mean absolute Shapley values by default.
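    If you are not sure whether a parallel pool is available, you can choose the UseParallel value at run time. The following is a minimal sketch under stated assumptions: it uses a small regression tree and 25 query points purely to keep the run short, and it relies on the MATLAB function canUseParallelPool to decide whether to request parallel computation.

```matlab
% Setup (illustrative): a regression tree stands in for any blackbox model
load carbig
tbl = rmmissing(table(Acceleration,Displacement,Horsepower,Weight,MPG));
mdl = fitrtree(tbl,"MPG");
explainer = shapley(mdl);          % no query points yet

% Request parallel computation only if a parallel pool can be used
useParallel = canUseParallelPool;
explainer = fit(explainer,tbl(1:25,:),UseParallel=useParallel);
```

    With this pattern, the same script runs serially on machines without Parallel Computing Toolbox.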

    Visualize the distribution of the Shapley values by using the boxchart object function.

    boxchart(explainer)

    Figure contains an axes object. The axes object with title Shapley Summary Plot, xlabel Shapley Value, ylabel Predictor contains 2 objects of type boxchart, constantline.

    For each predictor, the function displays a box plot of the Shapley values for the query points. The function determines the order of the predictors by using the mean absolute Shapley values.

    The box plot for the Weight predictor indicates that the Shapley values are distributed symmetrically about the median. The minimum is slightly less than –2, the 25th percentile is approximately –1, the median is approximately 0, the 75th percentile is approximately 1, and the maximum is slightly more than 2.

    Use a data tip to view the Shapley value metrics for the Weight predictor.

    b = boxchart(explainer);
    datatip(b,"DataIndex",5);

    Figure contains an axes object. The axes object with title Shapley Summary Plot, xlabel Shapley Value, ylabel Predictor contains 2 objects of type boxchart, constantline.

    Train a classification model and create a shapley object. Then visualize the Shapley values for multiple query points by using the boxchart object function.

    Load the CreditRating_Historical data set. The data set contains customer IDs and their financial ratios, industry labels, and credit ratings.

    tbl = readtable("CreditRating_Historical.dat");

    Display the first three rows of the table.

    head(tbl,3)
         ID      WC_TA    RE_TA    EBIT_TA    MVE_BVTD    S_TA     Industry    Rating
        _____    _____    _____    _______    ________    _____    ________    ______
    
        62394    0.013    0.104     0.036      0.447      0.142       3        {'BB'}
        48608    0.232    0.335     0.062      1.969      0.281       8        {'A' }
        42444    0.311    0.367     0.074      1.935      0.366       1        {'A' }
    

    Train a blackbox model of credit ratings by using the fitcecoc function. Use the variables from the second through seventh columns in tbl as the predictor variables. A recommended practice is to specify the class names to set the order of the classes.

    blackbox = fitcecoc(tbl,"Rating", ...
        PredictorNames=tbl.Properties.VariableNames(2:7), ...
        CategoricalPredictors="Industry", ...
        ClassNames={'AAA','AA','A','BBB','BB','B','CCC'});

    Create a shapley object that explains the predictions for multiple query points. For faster computation, subsample 10% of the observations from tbl with stratification and use the samples to compute the Shapley values. Specify the sampled observations as the query points.

    rng("default") % For reproducibility
    c = cvpartition(tbl.Rating,"Holdout",0.10);
    sampleTbl = tbl(test(c),:);
    explainer = shapley(blackbox,sampleTbl, ...
        QueryPoints=sampleTbl);

    For a classification model, shapley computes Shapley values using the predicted class scores, and stores them in the ShapleyValues property. Because explainer contains Shapley values for multiple query points, display the mean absolute Shapley values instead.

    explainer.MeanAbsoluteShapley
    ans=6×8 table
        Predictor        AAA           AA            A           BBB           BB            B           CCC   
        __________    _________    __________    _________    __________    _________    _________    _________
    
        "WC_TA"        0.056246      0.034016     0.027208       0.02194     0.041348     0.060144     0.056189
        "RE_TA"          0.1202      0.097136     0.099341      0.094155      0.10629       0.1799      0.25493
        "EBIT_TA"     0.0014694    0.00086978    0.0010461    0.00088111    0.0011695    0.0020823    0.0018035
        "MVE_BVTD"      0.81198       0.79496       1.0804        1.5952       2.0768       2.2893       1.7551
        "S_TA"         0.025692     0.0098722     0.011002       0.01535    0.0015691    0.0075802     0.012961
        "Industry"     0.073842      0.084015     0.066049      0.039714     0.062301      0.12082      0.11111
    
    

    For each predictor and class, the mean absolute Shapley value is the mean of the absolute values of the Shapley values across all query points. For class AA, the MVE_BVTD predictor has a noticeably greater mean absolute Shapley value than the other predictors.
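    The definition above, and the importance ordering that boxchart derives from it, can be sketched with plain arrays. The matrix S below is hypothetical (rows are predictors, columns are query points, for a single class) and is not the storage format of the shapley object.

```matlab
% Hypothetical Shapley values: one row per predictor, one column per query point
S = [ 0.5 -1.5  1.0;    % predictor 1
     -0.2  0.1 -0.3];   % predictor 2

% Mean of the absolute values across query points, per predictor
meanAbs = mean(abs(S),2);

% Predictors ordered from most to least important
[~,order] = sort(meanAbs,"descend");
```

    Here the first predictor has the greater mean absolute value, so it ranks first even though its signed values partly cancel.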

    Visualize the distribution of the Shapley values for class AA by using the boxchart object function.

    boxchart(explainer,ClassName="AA")

    Figure contains an axes object. The axes object with title Shapley Summary Plot, xlabel Shapley Value, ylabel Predictor contains 2 objects of type boxchart, constantline.

    For each predictor, the function displays a box plot of the Shapley values for the query points. The function determines the order of the predictors by using the mean absolute Shapley values.

    For class AA, some of the Shapley values for the MVE_BVTD predictor are outliers. This result suggests that, for a few query points, the predictor greatly affects the class AA predicted score.

    Input Arguments

    explainer

    Object explaining the blackbox model, specified as a shapley object. explainer must contain Shapley values; that is, explainer.ShapleyValues must be nonempty.

    ax

    Axes for the plot, specified as an Axes object. If you do not specify ax, then boxchart creates the plot using the current axes. For more information on creating an Axes object, see axes.

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: boxchart(explainer,NumImportantPredictors=5,JitterOutliers="on") creates a box plot for each of the five predictors with the greatest mean absolute Shapley values, and jitters the outliers in the box plots.

    NumImportantPredictors

    Number of important predictors to plot, specified as a positive integer. The boxchart function plots the Shapley values of the specified number of predictors with the greatest mean absolute Shapley values.

    Example: NumImportantPredictors=5 specifies to plot the five most important predictors. The boxchart function determines the order of importance by using the mean absolute Shapley values.

    Data Types: single | double

    ClassName

    Class label to plot, specified as a numeric scalar, logical scalar, character vector, string scalar, or categorical scalar. The value and data type of the ClassName value must match one of the class names in the ClassNames property of the machine learning model in explainer (explainer.BlackboxModel.ClassNames). Note that the software accepts character vectors, string scalars, and categorical scalars interchangeably.

    This argument is valid only when the machine learning model (BlackboxModel) in explainer is a classification model.

    Example: ClassName="AAA"

    Data Types: single | double | logical | char | string | categorical

    JitterOutliers

    Outlier marker displacement, specified as "on" or "off", or as numeric or logical 1 (true) or 0 (false). A value of "on" is equivalent to true, and "off" is equivalent to false. Therefore, you can use the value of this property as a logical value. The value is stored as an on/off logical value of type matlab.lang.OnOffSwitchState.

    If you specify the JitterOutliers value as "on", then boxchart randomly displaces the outlier markers along the vertical direction to help you distinguish between outliers that have similar Shapley values.

    Example: JitterOutliers="on"

    Data Types: single | double | logical | char | string
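    The name-value arguments described above can be used together. The following sketch is an illustration under stated assumptions: it uses the fisheriris data and a classification tree (fitctree) as a stand-in blackbox model, and the class "versicolor" and the choice of two predictors are arbitrary.

```matlab
% Setup (illustrative): a classification tree stands in for any blackbox model
load fisheriris
mdl = fitctree(meas,species);
explainer = shapley(mdl,QueryPoints=meas(1:5:end,:));  % every fifth observation

% Plot the two most important predictors for one class, with jittered outliers
figure
b = boxchart(explainer,ClassName="versicolor", ...
    NumImportantPredictors=2,JitterOutliers="on");
```

    Because ClassName is valid only for classification models, omit it when the blackbox model is a regression model.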

    More About

    Shapley Values

    In game theory, the Shapley value of a player is the average marginal contribution of the player in a cooperative game. In the context of machine learning prediction, the Shapley value of a feature for a query point explains the contribution of the feature to a prediction (response for regression or score of each class for classification) at the specified query point.

    The Shapley value of a feature for a query point is the contribution of the feature to the deviation from the average prediction. For a query point, the sum of the Shapley values for all features corresponds to the total deviation of the prediction from the average. That is, the sum of the average prediction and the Shapley values for all features corresponds to the prediction for the query point.
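    The additive relationship described above can be written compactly. The symbols are notation introduced here for illustration and do not appear elsewhere on this page: f is the model prediction (the response, or the score of one class), x is the query point, M is the number of features, φ_i(x) is the Shapley value of feature i, and φ_0 is the average prediction.

```latex
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x),
\qquad\text{equivalently}\qquad
\sum_{i=1}^{M} \phi_i(x) = f(x) - \phi_0 .
```

    For instance, with a hypothetical average prediction of 22.7 and three features whose Shapley values are 1.2, -0.5, and 0.3, the Shapley values sum to 1.0, so the prediction for the query point is 23.7.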

    For more details, see Shapley Values for Machine Learning Model.

    Tips

    • Use boxchart when explainer contains Shapley values for many query points.

    Version History

    Introduced in R2024a