Read and Analyze Large Tabular Text File

This example shows how to create a datastore for a large text file containing tabular data, and then read and process the data one block at a time or one file at a time.

Create a Datastore

Create a datastore from the sample file airlinesmall.csv using the datastore function. When you create the datastore, use the 'TreatAsMissing' name-value argument to specify that the text NA in the data is treated as missing data.

ds = datastore('airlinesmall.csv','TreatAsMissing','NA');

datastore returns a TabularTextDatastore. The datastore function automatically determines the appropriate type of datastore to create based on the file extension.
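
If you prefer not to rely on the file extension, the datastore type can also be requested explicitly. The following is a minimal sketch, not part of the original example; the variable names dsExplicit and dsDirect are illustrative assumptions.

```matlab
% Request a tabular text datastore explicitly instead of relying on
% the file extension. dsExplicit and dsDirect are illustrative names.
dsExplicit = datastore('airlinesmall.csv', ...
    'Type','tabulartext','TreatAsMissing','NA');

% Equivalent sketch using the type-specific creation function.
dsDirect = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');

% Display the class of each datastore to compare them.
class(dsExplicit)
class(dsDirect)
```

Being explicit is mostly useful when a file has an unusual extension, so that automatic detection might not pick the type you expect.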

You can change how the datastore reads the data by modifying its properties. Modify the MissingValue property to specify that missing values are treated as 0.

ds.MissingValue = 0;

In this example, select the variable for the arrival delay, ArrDelay, as the variable of interest.

ds.SelectedVariableNames = 'ArrDelay';
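
To work with more than one variable, SelectedVariableNames also accepts a cell array of names. The sketch below uses a separate, illustrative datastore named dsTwoVars so that ds keeps its single selected variable; it assumes the file contains a DepDelay variable alongside ArrDelay.

```matlab
% dsTwoVars is an illustrative second datastore, so ds is left unchanged.
dsTwoVars = datastore('airlinesmall.csv','TreatAsMissing','NA');

% List every variable name detected in the file.
disp(dsTwoVars.VariableNames)

% Select two variables by passing a cell array of names.
% DepDelay is assumed to be present in airlinesmall.csv.
dsTwoVars.SelectedVariableNames = {'ArrDelay','DepDelay'};
```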

Preview the data using the preview function. This function does not affect the state of the datastore.

data = preview(ds)
data=8×1 table
    ArrDelay
    ________

        8   
        8   
       21   
       13   
        4   
       59   
        3   
       11   

Read Subsets of Data

By default, the read function reads 20000 rows at a time from a TabularTextDatastore. To read a different number of rows in each call to read, modify the ReadSize property of ds.

ds.ReadSize = 15000;

Read subsets of the data from ds using the read function in a while loop. The loop executes until hasdata(ds) returns false.

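sums = [];      % sum of the arrival delays in each block
counts = [];    % number of rows in each block
while hasdata(ds)
    % Read the next block of up to ReadSize rows.
    T = read(ds);

    % Record the partial sum and the row count for this block.
    sums(end+1) = sum(T.ArrDelay);
    counts(end+1) = length(T.ArrDelay);
end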

Compute the average arrival delay.

avgArrivalDelay = sum(sums)/sum(counts)
avgArrivalDelay = 6.9670
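
Because MissingValue is set to 0, missing arrival delays enter each block sum as zeros and still count as rows, which pulls the average toward zero. As a hedged alternative, not part of the original example, the sketch below keeps missing values as NaN and excludes them from both the sum and the count. The names dsNaN, totalDelay, validCount, and avgValidDelay are illustrative.

```matlab
% dsNaN is an illustrative datastore that keeps missing values as NaN.
dsNaN = datastore('airlinesmall.csv','TreatAsMissing','NA');
dsNaN.MissingValue = NaN;
dsNaN.SelectedVariableNames = 'ArrDelay';
dsNaN.ReadSize = 15000;

% Keep running totals instead of growing arrays on every iteration.
totalDelay = 0;
validCount = 0;
while hasdata(dsNaN)
    T = read(dsNaN);

    % Ignore NaN entries in the sum and count only the valid rows.
    totalDelay = totalDelay + sum(T.ArrDelay,'omitnan');
    validCount = validCount + nnz(~isnan(T.ArrDelay));
end

% Average over the flights that actually report an arrival delay.
avgValidDelay = totalDelay/validCount
```

Which average is appropriate depends on whether a missing delay should be read as "no delay" or as "unknown"; this sketch assumes the latter.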

Reset the datastore to allow rereading of the data.

reset(ds)

Read One File at a Time

A datastore can contain multiple files, each with a different number of rows. You can read from the datastore one complete file at a time by setting the ReadSize property to 'file'.

ds.ReadSize = 'file';

When you change the value of ReadSize from a number to 'file' or vice versa, MATLAB resets the datastore.

Read from ds using the read function in a while loop, as before, and compute the average arrival delay.

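sums = [];      % sum of the arrival delays in each file
counts = [];    % number of rows in each file
while hasdata(ds)
    % Read the next complete file.
    T = read(ds);

    % Record the partial sum and the row count for this file.
    sums(end+1) = sum(T.ArrDelay);
    counts(end+1) = length(T.ArrDelay);
end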
avgArrivalDelay = sum(sums)/sum(counts)
avgArrivalDelay = 6.9670
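
When the selected data fits in memory, the loop can be replaced by a single call to readall. This is a minimal sketch under that assumption; dsAll and avgAllAtOnce are illustrative names, and the datastore is configured the same way as ds above.

```matlab
% dsAll is an illustrative datastore configured like ds in this example.
dsAll = datastore('airlinesmall.csv','TreatAsMissing','NA');
dsAll.MissingValue = 0;
dsAll.SelectedVariableNames = 'ArrDelay';

% Read every row of the selected variable in one call.
% This assumes the selected data fits in memory.
T = readall(dsAll);

% The mean over all rows is the total sum divided by the total row count.
avgAllAtOnce = mean(T.ArrDelay)
```

Reading in blocks or by file remains the safer choice when the total size of the data is unknown or larger than available memory.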
