Visualization of Tall Arrays

Visualizing large data sets requires that the data be summarized, binned, or sampled in some way to reduce the number of points plotted on the screen. Some functions, such as histogram and pie, bin the data to reduce its size. Others, such as plot and scatter, use a more complex approach that avoids plotting duplicate pixels on the screen. For problems where the pixel overlap is relevant to the analysis, the binscatter function offers an efficient way to visualize density patterns.

Visualizing tall arrays does not require the use of gather. MATLAB® immediately evaluates and displays visualizations of tall arrays. Currently, you can visualize tall arrays using the functions and methods in this table.

Function | Required Toolboxes | Notes
plot, scatter, binscatter | None | These functions plot in iterations, progressively adding to the plot as more data is read. During the updates, a progress indicator shows the proportion of data that has been plotted. Zooming and panning is supported during the updating process, before the plot is complete. To stop the update process, press the pause button in the progress indicator.
histogram, histogram2 | None |
pie | None | For visualizing categorical data only.
binScatterPlot | Statistics and Machine Learning Toolbox™ | Figure contains a slider to control the brightness and color detail in the image. The slider adjusts the value of the Gamma image correction parameter.
ksdensity | Statistics and Machine Learning Toolbox | Produces a probability density estimate for the data, evaluated at 100 points for univariate data, or 900 points for bivariate data.
datasample | Statistics and Machine Learning Toolbox | datasample enables you to extract a subsample of a tall array in a statistically sound way compared to simple indexing. If the subset of data is small enough to fit in memory, then you can use plotting and fitting functions on the subset that do not directly support tall arrays.
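
As a minimal sketch of the datasample workflow described in the table, the following code samples rows from a tall table and gathers them for an ordinary in-memory plot. The synthetic table tt and the sample size of 500 are assumptions chosen for illustration, and the code requires Statistics and Machine Learning Toolbox. The sketch passes 'Replace',false so that rows are sampled without replacement.

```matlab
% Synthetic data converted to a tall table so that the sketch runs
% without external files (illustration only, not the airline data).
rng(0)                                  % reproducible random numbers
x = randn(1e4,1);
y = 0.8*x + 0.2*randn(1e4,1);
tt = tall(table(x,y));

% Draw 500 rows without replacement, then bring the sample into memory.
s = gather(datasample(tt,500,'Replace',false));

% s is an ordinary table, so functions without tall support can use it.
plot(s.x,s.y,'.')
```

Because s fits in memory, any plotting or fitting function can operate on it, at the cost of working with a sample rather than the full data set.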

Tall Array Plotting Examples

This example shows several different ways you can visualize tall arrays.

Create a datastore for the airlinesmall.csv data set, which contains rows of airline flight data. Select a subset of the table variables to work with and remove rows that contain missing values.

ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
ds.SelectedVariableNames = {'Year','Month','ArrDelay','DepDelay','Origin','Dest'};
T = tall(ds);
T = rmmissing(T)
T =

  Mx6 tall table

    Year    Month    ArrDelay    DepDelay    Origin      Dest  
    ____    _____    ________    ________    _______    _______

    1987     10          8          12       {'LAX'}    {'SJC'}
    1987     10          8           1       {'SJC'}    {'BUR'}
    1987     10         21          20       {'SAN'}    {'SMF'}
    1987     10         13          12       {'BUR'}    {'SJC'}
    1987     10          4          -1       {'SMF'}    {'LAX'}
    1987     10         59          63       {'LAX'}    {'SJC'}
    1987     10          3          -2       {'SAN'}    {'SFO'}
    1987     10         11          -1       {'SEA'}    {'LAX'}
     :        :         :           :           :          :
     :        :         :           :           :          :
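
The height appears as M in the display because tall operations such as rmmissing are deferred, so the number of remaining rows is not known until the expression is evaluated. The following sketch contrasts deferred operations with plotting functions. It uses a small hypothetical table tt rather than the airline data.

```matlab
% Small hypothetical table with missing values (illustration only).
a = [1; NaN; 3; 4];
b = [10; 20; NaN; 40];
tt = tall(table(a,b));

tt = rmmissing(tt);          % deferred: the rows are not removed yet
n = gather(size(tt,1));      % gather forces evaluation of the row count

histogram(tt.a)              % plotting evaluates immediately, without gather
```

Only the call to gather and the call to histogram trigger a pass through the data. The rmmissing call by itself only queues the operation.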

Pie Chart of Flights by Month

Convert the numeric Month variable into a categorical variable that reflects the name of the month. Then plot a pie chart showing how many flights are in the data for each month of the year.

T.Month = categorical(T.Month,1:12,{'Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'})
T =

  Mx6 tall table

    Year    Month    ArrDelay    DepDelay    Origin      Dest  
    ____    _____    ________    ________    _______    _______

    1987     Oct         8          12       {'LAX'}    {'SJC'}
    1987     Oct         8           1       {'SJC'}    {'BUR'}
    1987     Oct        21          20       {'SAN'}    {'SMF'}
    1987     Oct        13          12       {'BUR'}    {'SJC'}
    1987     Oct         4          -1       {'SMF'}    {'LAX'}
    1987     Oct        59          63       {'LAX'}    {'SJC'}
    1987     Oct         3          -2       {'SAN'}    {'SFO'}
    1987     Oct        11          -1       {'SEA'}    {'LAX'}
     :        :         :           :           :          :
     :        :         :           :           :          :
pie(T.Month)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0.93 sec
- Pass 2 of 2: Completed in 0.92 sec
Evaluation completed in 2.4 sec
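
The three-argument categorical syntax used above maps each value in the second argument to the name at the same position in the third argument. The following sketch shows that mapping on a small in-memory vector. The variable names monthNames, m, cats, and n are assumptions for illustration.

```matlab
monthNames = {'Jan','Feb','Mar','Apr','May','Jun', ...
              'Jul','Aug','Sep','Oct','Nov','Dec'};

% Month numbers 10, 10, 1, and 12 become Oct, Oct, Jan, and Dec.
m = categorical([10; 10; 1; 12],1:12,monthNames);

cats = categories(m);   % all 12 month names, in calendar order
n = countcats(m);       % count per category, including months with no data
```

All twelve categories are defined even if some months do not occur in the data, and the category order follows the value set 1:12 rather than alphabetical order.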

Histogram of Delays

Plot a histogram of the arrival delays for each flight in the data. Because the data has a long tail, restrict the range of values that are binned by using the BinLimits name-value pair.

histogram(T.ArrDelay,'BinLimits',[-50 150])
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 1.8 sec
- Pass 2 of 2: Completed in 0.7 sec
Evaluation completed in 2.9 sec
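
BinLimits leaves values outside the limits out of the histogram rather than adding them to the end bins, so it can help to check how much data a given choice of limits excludes. The following sketch does that for a synthetic long-tailed tall column d, which is an assumption standing in for T.ArrDelay.

```matlab
% Synthetic long-tailed data (illustration only, not the airline data).
rng(0)
d = tall([20*randn(1e4,1); 500; 800]);
lims = [-50 150];

% Fraction of values that fall outside the bin limits.
isOutside = d < lims(1) | d > lims(2);
fracOutside = gather(mean(double(isOutside)));

histogram(d,'BinLimits',lims)
```

If fracOutside is large, the histogram hides a substantial part of the distribution, and wider limits might be a better choice.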

Scatter Plot of Delays

Create a scatter plot of the arrival and departure delays. You can expect a strong correlation between these variables, since flights that leave late are also likely to arrive late.

When operating on tall arrays, the plot, scatter, and binscatter functions plot the data in iterations, progressively adding to the plot as more data is read. During the updates, a progress indicator at the top of the plot shows how much data has been plotted. Zooming and panning are supported during the updates, before the plot is complete.

scatter(T.ArrDelay,T.DepDelay)
xlabel('Arrival Delay')
ylabel('Departure Delay')
xlim([-140 1000])
ylim([-140 1000])

The progress bar also includes a Pause/Resume button. Use the button to stop the plot updates early once enough data is displayed.

Fit Trend Line

Use the polyfit and polyval functions to overlay a linear trend line on the plot of arrival and departure delays.

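hold on
p = polyfit(T.ArrDelay,T.DepDelay,1);   % least-squares line: DepDelay as a function of ArrDelay
x = sort(T.ArrDelay,1);                 % sort x so the line is drawn from left to right
yp = polyval(p,x);                      % evaluate the fitted line at each x
plot(x,yp,'r-')
hold off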
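
Because the fitted trend is a straight line, two points are enough to draw it. The following is an alternative sketch, not the approach used above, that evaluates the line only at the smallest and largest x values instead of sorting the full column. The synthetic table tt and its coefficients are assumptions for illustration.

```matlab
% Synthetic linear data in a tall table (illustration only).
rng(0)
x = 100*rand(1e4,1);
y = 0.9*x + 5 + randn(1e4,1);
tt = tall(table(x,y));

scatter(tt.x,tt.y)
hold on

% Gather the fit coefficients and the x range into memory.
p = gather(polyfit(tt.x,tt.y,1));
[xmin,xmax] = gather(min(tt.x),max(tt.x));

% Draw the line through its two end points.
plot([xmin xmax],polyval(p,[xmin xmax]),'r-')
hold off
```

This trades the tall sort for two reductions, min and max, and draws the same line with only two in-memory points.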

Visualize Density

A scatter plot is useful only up to a point. When the points overlap extensively, it becomes hard to read information from the plot. In that case, visualizing the density of the points helps reveal trends.

Use the binscatter function to visualize the density of points in the plot of arrival and departure delays.

binscatter(T.ArrDelay,T.DepDelay,'XLimits',[-100 1000],'YLimits',[-100 1000])
xlim([-100 1000])
ylim([-100 1000])
xlabel('Arrival Delay')
ylabel('Departure Delay')

Adjust the CLim property of the axes so that all bin values greater than 150 are colored the same. This prevents a few bins with very large values from dominating the plot.

ax = gca;
ax.CLim = [0 150];
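
The effect of the color limits can be sketched on a small in-memory matrix. The matrix counts below holds hypothetical bin counts, not values from the airline data. With CLim set to [0 150], every count of 150 or more maps to the last color in the colormap.

```matlab
% Hypothetical bin counts with a few very large values.
counts = [0 5 40; 150 900 12000];

imagesc(counts)
colorbar
ax = gca;
ax.CLim = [0 150];             % counts of 150 or more share the top color

saturated = counts >= 150;     % bins drawn with the top color
```

Without the limit, the single bin with 12000 counts would stretch the color scale so far that the smaller bins become nearly indistinguishable.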
