Clean Messy and Missing Data in Tables
This example shows how to clean and reorganize a table that has messy and missing data values. First, you can identify missing data by using the Import Tool or by using functions such as the summary
and ismissing
functions. You can standardize, fill, or remove missing values by using the standardizeMissing
, fillmissing
, or rmmissing
functions. Then you can clean your table further by reorganizing it. You can rearrange table rows and variables by using functions such as the sortrows
and movevars
functions.
Examine Data in File
Examine the data in the sample comma-separated value (CSV) file, messy.csv
, by using the Import Tool. The tool previews the data and enables you to specify how to import the data. To examine messy.csv
in the Import Tool, after opening this example in MATLAB®, go to the Home tab, click Import Data in the Variable section, and open messy.csv
using the file selection dialog box.
The Import Tool shows that messy.csv
has five columns with text and numeric values.
The file contains many different missing data indicators:
Empty text
Period (.)
NA
NaN
-99
The Import Tool automatically recognizes (but does not visually highlight) some missing data indicators, such as NaN
in numeric columns and empty text in text columns.
The tool highlights other indicators, such as the empty text, period, and NA
, that occur in column B
. These values are not standard missing values. However, nonnumeric values in a numeric column are likely to represent missing values. When you import the data, you can specify that these values should also be treated as missing values.
When numeric data otherwise consists of positive values, a single negative value, such as -99
, can be a flag for missing data. If a number such as -99
represents missing data in your table, then you must specify that it is a missing value when you clean your table.
Import Data as Table
You can import data into the MATLAB workspace from the Import Tool. You can also use the readtable
function to read the data from a file and import it as a table.
Import the data in messy.csv
using the readtable
function. To read text data into table variables that are string arrays, use the TextType
name-value argument. To treat specified nonnumeric values in numeric columns as missing values, use the TreatAsMissing
name-value argument. For columns A
and C
with text data, readtable
imports any empty text as missing strings, which display as <missing>
. For columns B
, D
, and E
with numeric data, readtable
imports empty text as NaN
values, and also imports .
and NA
as NaN
values when you specify them using TreatAsMissing
. However, the values that are -99
remain unchanged because they are numeric.
messyTable = readtable("messy.csv","TextType","string","TreatAsMissing",[".","NA"])
messyTable=21×5 table
A B C D E
______ ____ _________ ____ ____
"afe1" NaN "yes" 3 3
"egh3" NaN "no" 7 7
"wth4" -99 "yes" -99 -99
"atn2" 23.7 <missing> NaN 23.7
"arg1" NaN "yes" 5 NaN
"jre3" 34.6 "yes" 34.6 34.6
"wen9" 234 "yes" 234 234
"ple2" 2 "no" 2 2
"dbo8" 5 "no" 5 5
"oii4" 5 "yes" 5 5
"wnk3" 245 "yes" 245 245
"abk6" 563 "no" 563 563
"pnj5" 463 "no" 463 463
"wnn3" 6 "no" 6 6
"oks9" 23 "yes" 23 23
"wba3" 14 "yes" 14 14
⋮
View Summary of Table
To view a summary of the table, use the summary
function. The summary shows the data type and other descriptive statistics for each table variable. For example, summary
shows the number of missing values in each numeric variable of messyTable
.
summary(messyTable)
messyTable: 21x5 table Variables: A: string B: double C: string D: double E: double Statistics for applicable variables: NumMissing Min Median Max Mean Std A 0 B 3 -99 22.5000 563 90.1056 174.0532 C 1 D 2 -99 14 563 83.8000 171.0755 E 1 -99 21.5000 563 81.5950 166.7100
Find Rows with Missing Values
To find the rows of messyTable
that have at least one missing value, use the ismissing
function. If you have nonstandard missing values in your data, such as -99
, you can specify it along with the standard missing values.
The output of ismissing
is a logical array that identifies the elements of messyTable
that have missing values.
missingElements = ismissing(messyTable,{string(missing),NaN,-99})
missingElements = 21x5 logical array
0 1 0 0 0
0 1 0 0 0
0 1 0 1 1
0 0 1 1 0
0 1 0 0 1
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
⋮
To create a logical vector that identifies rows that have missing values, use the any
function.
rowsWithMissingValues = any(missingElements,2)
rowsWithMissingValues = 21x1 logical array
1
1
1
1
1
0
0
0
0
0
⋮
To index into the table and return only the rows that have missing values, use the logical vector rowsWithMissingValues
.
missingValuesTable = messyTable(rowsWithMissingValues,:)
missingValuesTable=6×5 table
A B C D E
______ ____ _________ ___ ____
"afe1" NaN "yes" 3 3
"egh3" NaN "no" 7 7
"wth4" -99 "yes" -99 -99
"atn2" 23.7 <missing> NaN 23.7
"arg1" NaN "yes" 5 NaN
"gry5" 21 "yes" NaN 21
Fill Missing Values
One strategy for cleaning the missing values in a table is to replace them with more meaningful values. You can replace nonstandard missing values by inserting standard missing values. Then you can fill missing values with adjusted values. For example, you can fill missing values with their nearest neighbors or with the mean value of a table variable.
In this example, -99
is a nonstandard value for indicating a missing value. To replace the instances of -99
with standard missing values, use the standardizeMissing
function. NaN
is the standard missing value for single- and double-precision floating-point numeric arrays.
messyTable = standardizeMissing(messyTable,-99)
messyTable=21×5 table
A B C D E
______ ____ _________ ____ ____
"afe1" NaN "yes" 3 3
"egh3" NaN "no" 7 7
"wth4" NaN "yes" NaN NaN
"atn2" 23.7 <missing> NaN 23.7
"arg1" NaN "yes" 5 NaN
"jre3" 34.6 "yes" 34.6 34.6
"wen9" 234 "yes" 234 234
"ple2" 2 "no" 2 2
"dbo8" 5 "no" 5 5
"oii4" 5 "yes" 5 5
"wnk3" 245 "yes" 245 245
"abk6" 563 "no" 563 563
"pnj5" 463 "no" 463 463
"wnn3" 6 "no" 6 6
"oks9" 23 "yes" 23 23
"wba3" 14 "yes" 14 14
⋮
To fill missing values, use the fillmissing
function. The function provides many methods that fill missing values. For example, fill missing values with their nearest neighbors that are not missing values.
filledTable = fillmissing(messyTable,"nearest")
filledTable=21×5 table
A B C D E
______ ____ _____ ____ ____
"afe1" 23.7 "yes" 3 3
"egh3" 23.7 "no" 7 7
"wth4" 23.7 "yes" 7 23.7
"atn2" 23.7 "yes" 5 23.7
"arg1" 34.6 "yes" 5 34.6
"jre3" 34.6 "yes" 34.6 34.6
"wen9" 234 "yes" 234 234
"ple2" 2 "no" 2 2
"dbo8" 5 "no" 5 5
"oii4" 5 "yes" 5 5
"wnk3" 245 "yes" 245 245
"abk6" 563 "no" 563 563
"pnj5" 463 "no" 463 463
"wnn3" 6 "no" 6 6
"oks9" 23 "yes" 23 23
"wba3" 14 "yes" 14 14
⋮
Remove Rows with Missing Values
Another strategy for cleaning the missing values in a table is to remove the rows that have them.
To remove rows that have missing values, use the rmmissing
function.
remainingTable = rmmissing(messyTable)
remainingTable=15×5 table
A B C D E
______ ____ _____ ____ ____
"jre3" 34.6 "yes" 34.6 34.6
"wen9" 234 "yes" 234 234
"ple2" 2 "no" 2 2
"dbo8" 5 "no" 5 5
"oii4" 5 "yes" 5 5
"wnk3" 245 "yes" 245 245
"abk6" 563 "no" 563 563
"pnj5" 463 "no" 463 463
"wnn3" 6 "no" 6 6
"oks9" 23 "yes" 23 23
"wba3" 14 "yes" 14 14
"pkn4" 2 "no" 2 2
"adw3" 22 "no" 22 22
"poj2" 34.6 "yes" 34.6 34.6
"bas8" 23 "no" 23 23
Sort and Rearrange Table Rows and Variables
Once you have cleaned the missing values in a table, you can organize it in other ways. For example, you can sort the rows of a table by the values in one or more variables.
Sort the rows by the values in the first variable, A
.
sortedTable = sortrows(remainingTable)
sortedTable=15×5 table
A B C D E
______ ____ _____ ____ ____
"abk6" 563 "no" 563 563
"adw3" 22 "no" 22 22
"bas8" 23 "no" 23 23
"dbo8" 5 "no" 5 5
"jre3" 34.6 "yes" 34.6 34.6
"oii4" 5 "yes" 5 5
"oks9" 23 "yes" 23 23
"pkn4" 2 "no" 2 2
"ple2" 2 "no" 2 2
"pnj5" 463 "no" 463 463
"poj2" 34.6 "yes" 34.6 34.6
"wba3" 14 "yes" 14 14
"wen9" 234 "yes" 234 234
"wnk3" 245 "yes" 245 245
"wnn3" 6 "no" 6 6
Sort the rows in descending order by C
, and then sort in ascending order by A
.
sortedBy2Vars = sortrows(remainingTable,["C","A"],["descend","ascend"])
sortedBy2Vars=15×5 table
A B C D E
______ ____ _____ ____ ____
"jre3" 34.6 "yes" 34.6 34.6
"oii4" 5 "yes" 5 5
"oks9" 23 "yes" 23 23
"poj2" 34.6 "yes" 34.6 34.6
"wba3" 14 "yes" 14 14
"wen9" 234 "yes" 234 234
"wnk3" 245 "yes" 245 245
"abk6" 563 "no" 563 563
"adw3" 22 "no" 22 22
"bas8" 23 "no" 23 23
"dbo8" 5 "no" 5 5
"pkn4" 2 "no" 2 2
"ple2" 2 "no" 2 2
"pnj5" 463 "no" 463 463
"wnn3" 6 "no" 6 6
Sorting by C
, the rows are grouped first by "yes"
, followed by "no"
. Then sorting by A
, the rows are listed alphabetically.
To reorder the table so that A
and C
are next to each other, use the movevars
function.
sortedRowsAndMovedVars = movevars(sortedBy2Vars,"C","After","A")
sortedRowsAndMovedVars=15×5 table
A C B D E
______ _____ ____ ____ ____
"jre3" "yes" 34.6 34.6 34.6
"oii4" "yes" 5 5 5
"oks9" "yes" 23 23 23
"poj2" "yes" 34.6 34.6 34.6
"wba3" "yes" 14 14 14
"wen9" "yes" 234 234 234
"wnk3" "yes" 245 245 245
"abk6" "no" 563 563 563
"adw3" "no" 22 22 22
"bas8" "no" 23 23 23
"dbo8" "no" 5 5 5
"pkn4" "no" 2 2 2
"ple2" "no" 2 2 2
"pnj5" "no" 463 463 463
"wnn3" "no" 6 6 6
You can also reorder table variables by indexing. Use smooth parentheses and a variable index that specifies the order of the variables in the output table.
sortedRowsAndMovedVars = sortedBy2Vars(:,["A","C","B","D","E"])
sortedRowsAndMovedVars=15×5 table
A C B D E
______ _____ ____ ____ ____
"jre3" "yes" 34.6 34.6 34.6
"oii4" "yes" 5 5 5
"oks9" "yes" 23 23 23
"poj2" "yes" 34.6 34.6 34.6
"wba3" "yes" 14 14 14
"wen9" "yes" 234 234 234
"wnk3" "yes" 245 245 245
"abk6" "no" 563 563 563
"adw3" "no" 22 22 22
"bas8" "no" 23 23 23
"dbo8" "no" 5 5 5
"pkn4" "no" 2 2 2
"ple2" "no" 2 2 2
"pnj5" "no" 463 463 463
"wnn3" "no" 6 6 6
See Also
table
| ismissing
| standardizeMissing
| fillmissing
| rmmissing
| sortrows
| movevars
| Import Tool
| readtable
| summary