Dummy Variables

This topic provides an introduction to dummy variables, describes how the software creates them for classification and regression problems, and shows how you can create dummy variables by using the dummyvar function.

What Are Dummy Variables?

When you perform classification and regression analysis, you often need to include both continuous (quantitative) and categorical (qualitative) predictor variables. Do not include a categorical variable as a numeric array. Numeric arrays have both order and magnitude, whereas a categorical variable can have order (for example, an ordinal variable) but has no magnitude, so a numeric array implies a known “distance” between the categories that does not exist. The appropriate way to include categorical predictors is as dummy variables. To define dummy variables, use indicator variables that have the values 0 and 1.

The software chooses one of four schemes to define dummy variables based on the type of analysis, as described in the next sections. Each section illustrates its scheme with a categorical variable that has three categories: Cool, Cooler, and Coolest.

Full Dummy Variables

Represent the categorical variable with three categories using three dummy variables, one variable for each category.

The 1-by-3 vectors [1 0 0], [0 1 0], and [0 0 1] represent the categories Cool, Cooler, and Coolest, respectively. The vector columns correspond to the dummy variables X₀, X₁, and X₂.

X₀ is a dummy variable that has the value 1 for Cool, and 0 otherwise. X₁ is a dummy variable that has the value 1 for Cooler, and 0 otherwise. X₂ is a dummy variable that has the value 1 for Coolest, and 0 otherwise.
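
As a minimal sketch of the full scheme, the following code applies dummyvar to a small categorical vector. The variable names temp and Xfull and the sample observations are hypothetical, chosen only for illustration.

```matlab
% Hypothetical sample data using the three categories from the text.
temp = categorical({'Cool';'Cooler';'Coolest';'Cooler'});

% The category order sets the column order of the dummy variables.
% The default order is alphabetical: Cool, Cooler, Coolest.
cats = categories(temp);

% One indicator column per category (columns play the roles of X0, X1, X2).
Xfull = dummyvar(temp);
disp(Xfull)
```

Each row of Xfull contains exactly one 1, in the column of the category for that observation.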

Dummy Variables with Reference Group

Represent the categorical variable with three categories using two dummy variables with a reference group.

The 1-by-2 vectors [0 0], [1 0], and [0 1] represent the categories Cool, Cooler, and Coolest, respectively. The vector columns correspond to the dummy variables X₁ and X₂. Cool is the reference group.

You can distinguish Cool, Cooler, and Coolest using only X₁ and X₂, without X₀. Observations for Cool have 0s for both dummy variables. The category represented by all 0s is the reference group.
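
One way to build this coding by hand, sketched here under the assumption that the first category is the reference group, is to create the full set of dummy variables and drop the first column. The names temp, Xfull, and Xref are hypothetical.

```matlab
% Hypothetical sample data using the three categories from the text.
temp = categorical({'Cool';'Cooler';'Coolest';'Cooler'});

% Full dummy variables: one column each for Cool, Cooler, Coolest.
Xfull = dummyvar(temp);

% Drop the first column so that Cool becomes the reference group.
% The remaining columns play the roles of X1 (Cooler) and X2 (Coolest).
Xref = Xfull(:, 2:end);
disp(Xref)
```

Rows for Cool observations would contain only zeros in Xref.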

Dummy Variables for Ordered Categorical Variable

Assume the mathematical ordering of the categories is Cool < Cooler < Coolest. This coding scheme uses 1 and –1 values, and uses more 1s for higher categories, to indicate the ordering.

The 1-by-2 vectors [-1 -1], [1 -1], and [1 1] represent the categories Cool, Cooler, and Coolest, respectively. The vector columns correspond to the dummy variables X₁ and X₂.

X₁ is a dummy variable that has the value 1 for Cooler and Coolest, and –1 for Cool. X₂ is a dummy variable that has the value 1 for Coolest, and –1 otherwise.

You can indicate that a categorical variable has mathematical ordering by using the 'Ordinal' name-value pair argument of the categorical function.
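
The following sketch creates an ordinal categorical vector and reconstructs the coding described above by hand. It illustrates the pattern of 1 and –1 values only and is not necessarily how the fitting functions compute it internally. The names levels, temp, idx, k, and Xord are hypothetical.

```matlab
% Declare the mathematical ordering Cool < Cooler < Coolest.
levels = {'Cool', 'Cooler', 'Coolest'};
temp = categorical({'Cool';'Cooler';'Coolest';'Cooler'}, levels, 'Ordinal', true);

% Integer position of each observation in the ordering (1, 2, or 3).
idx = double(temp);
k = numel(categories(temp));

% Column j is 1 when the observation ranks above level j, and -1 otherwise.
% Implicit expansion compares the n-by-1 vector idx with the row 1:(k-1).
Xord = 2*(idx > (1:k-1)) - 1;
disp(Xord)
```

Higher categories receive more 1s: Cool maps to [-1 -1], Cooler to [1 -1], and Coolest to [1 1].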

Dummy Variables Created with Effects Coding

Effects coding uses 1, 0, and –1 to create dummy variables. Instead of using 0 values to represent a reference group, as in Dummy Variables with Reference Group, effects coding uses –1 to represent the last category.

The 1-by-2 vectors [1 0], [0 1], and [-1 -1] represent the categories Cool, Cooler, and Coolest, respectively. The vector columns correspond to the dummy variables X₁ and X₂.
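
As a sketch, effects coding can be derived by hand from the full dummy variables: keep all columns except the last, and subtract the last column from each of them. The names temp, Xfull, and Xeff are hypothetical.

```matlab
% Hypothetical sample data using the three categories from the text.
temp = categorical({'Cool';'Cooler';'Coolest';'Cooler'});

% Full dummy variables: one column each for Cool, Cooler, Coolest.
Xfull = dummyvar(temp);

% Subtract the last column (Coolest) from the remaining columns.
% Rows for Coolest become [-1 -1]; other rows keep their 0/1 pattern.
Xeff = Xfull(:, 1:end-1) - Xfull(:, end);
disp(Xeff)
```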

Creating Dummy Variables

Automatic Creation of Dummy Variables

Statistics and Machine Learning Toolbox™ offers several classification and regression fitting functions that accept categorical predictors. Some fitting functions create dummy variables to handle categorical predictors.

The following is the default behavior of the fitting functions in identifying categorical predictors.

- If the predictor data is in a table, the functions assume that a variable is categorical if it is a logical vector, categorical vector, character array, string array, or cell array of character vectors. The fitting functions that use decision trees assume ordered categorical vectors to be continuous variables.
- If the predictor data is a matrix, the functions assume all predictors are continuous.

To identify any other predictors as categorical predictors, specify them by using the 'CategoricalPredictors' or 'CategoricalVars' name-value pair argument.
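
For example, the following sketch marks the second column of a numeric predictor matrix as categorical when fitting a classification tree with fitctree. The data is synthetic, and the names x1, x2, X, Y, and mdl are hypothetical.

```matlab
rng(0)  % For reproducibility of the synthetic data

% Synthetic predictors: x1 is continuous, x2 holds integer codes 1 to 3
% that stand in for three categories.
n = 30;
x1 = randn(n, 1);
x2 = randi(3, n, 1);
X = [x1 x2];

% Synthetic logical class labels.
Y = (x1 + (x2 == 3)) > 0.5;

% By default, all columns of a matrix are treated as continuous.
% Identify column 2 as categorical explicitly.
mdl = fitctree(X, Y, 'CategoricalPredictors', 2);
disp(mdl.CategoricalPredictors)
```

The CategoricalPredictors property of the fitted model records which predictors the function treated as categorical.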

The fitting functions handle the identified categorical predictors as follows:

- fitckernel, fitclinear, fitcnet, fitcsvm, fitrgp, fitrkernel, fitrlinear, fitrnet, and fitrsvm use two different schemes to create dummy variables, depending on whether a categorical variable is unordered or ordered.
  - For an unordered categorical variable, the functions use Full Dummy Variables.
  - For an ordered categorical variable, the functions use Dummy Variables for Ordered Categorical Variable.
- Parametric regression fitting functions such as fitlm, fitglm, and fitcox use Dummy Variables with Reference Group. When the functions include the dummy variables, the estimated coefficients of the dummy variables are relative to the reference group. For an example, see Linear Regression with Categorical Predictor.
- fitlme, fitlmematrix, and fitglme allow you to specify the scheme for creating dummy variables by using the 'DummyVarCoding' name-value pair argument. The functions support three schemes: Full Dummy Variables ('DummyVarCoding','full'), Dummy Variables with Reference Group ('DummyVarCoding','reference'), and Dummy Variables Created with Effects Coding ('DummyVarCoding','effects'). Note that these functions do not offer a name-value pair argument for specifying categorical variables.
- fitrm uses Dummy Variables Created with Effects Coding.
- Other fitting functions that accept categorical predictors use algorithms that can handle categorical predictors without creating dummy variables.
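
To see the reference-group behavior of a parametric regression function, the following sketch fits a linear model with fitlm on synthetic data and inspects the coefficient names. The table tbl, its variables temp and y, and the response values are hypothetical.

```matlab
rng(1)  % For reproducibility of the synthetic data

% Synthetic data: four observations in each of the three categories.
temp = categorical(repmat({'Cool';'Cooler';'Coolest'}, 4, 1));
y = double(temp) + 0.1*randn(12, 1);
tbl = table(temp, y);

% fitlm detects the categorical variable in the table and creates
% dummy variables with a reference group.
mdl = fitlm(tbl, 'y ~ temp');
names = mdl.CoefficientNames;
disp(names)
```

The model has an intercept plus one coefficient for each non-reference category, so three categories yield three coefficients in total. Each dummy-variable coefficient estimates the difference from the reference category.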

Manual Creation of Dummy Variables

This example shows how to create your own dummy variable design matrix by using the dummyvar function. This function accepts grouping variables and returns a matrix containing zeros and ones, whose columns are dummy variables for the grouping variables.

Create a column vector of categorical data specifying gender.

gender = categorical({'Male';'Female';'Female';'Male';'Female'});

Create dummy variables for gender.

dv = dummyvar(gender)

dv = 5×2

     0     1
     1     0
     1     0
     0     1
     1     0

dv has five rows corresponding to the number of rows in gender and two columns for the unique groups, Female and Male. Column order corresponds to the order of the levels in gender. For categorical arrays, the default order is ascending alphabetical. You can check the order by using the categories function.

categories(gender)

ans = 2×1 cell
    {'Female'}
    {'Male'  }

To use the dummy variables in a regression model, you must either delete a column (to create a reference group) or fit a regression model with no intercept term. For the gender example, you need only one dummy variable to represent two genders. Notice what happens if you add an intercept term to the complete design matrix dv.

X = [ones(5,1) dv]

X = 5×3

     1     0     1
     1     1     0
     1     1     0
     1     0     1
     1     1     0

rank(X)

ans = 2

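The design matrix with an intercept term is not of full column rank, because the intercept column equals the sum of the two dummy variable columns. As a result, X'X is not invertible and the least-squares coefficients are not uniquely determined. Because of this linear dependence, use only c – 1 indicator variables to represent a categorical variable with c categories in a regression model with an intercept term.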
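
As a sketch of the two remedies mentioned earlier, the following code checks the rank of a design matrix that drops one dummy variable column and of a design matrix that omits the intercept. The names Xref and Xnoint are hypothetical.

```matlab
gender = categorical({'Male';'Female';'Female';'Male';'Female'});
dv = dummyvar(gender);   % Columns: Female, Male

% Remedy 1: keep the intercept and drop the Female column,
% which makes Female the reference group.
Xref = [ones(5,1) dv(:,2)];
rankRef = rank(Xref);

% Remedy 2: keep both dummy variables and omit the intercept.
Xnoint = dv;
rankNoint = rank(Xnoint);

fprintf('rank(Xref) = %d of %d columns\n', rankRef, size(Xref, 2));
fprintf('rank(Xnoint) = %d of %d columns\n', rankNoint, size(Xnoint, 2));
```

In both cases the rank equals the number of columns, so the design matrix has full column rank.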