
Machine Learning for Statistical Arbitrage I: Data Management and Visualization

This example shows techniques for managing, processing, and visualizing large amounts of financial data in MATLAB®. It is part of a series of related examples on machine learning for statistical arbitrage (see Machine Learning Applications).

Working with Big Data

Financial markets, with electronic exchanges such as NASDAQ executing orders on a timescale of milliseconds, generate vast amounts of data. Data streams can be mined for statistical arbitrage opportunities, but traditional methods for processing and storing dynamic analytic information can be overwhelmed by big data. Fortunately, new computational approaches have emerged, and MATLAB has an array of tools for implementing them.

Main computer memory provides high-speed access but limited capacity, whereas external storage offers low-speed access but potentially unlimited capacity. Computation takes place in memory, so the computer loads data from external storage as needed and writes results back to it.

Data Files

This example uses one trading day of NASDAQ exchange data [2] on one security (INTC) in a sample provided by LOBSTER [1] and included with Financial Toolbox™ documentation in the zip file LOBSTER_SampleFile_INTC_2012-06-21_5.zip. Extract the contents of the zip file into your current folder. The expanded files, including two CSV files of data and the text file LOBSTER_SampleFiles_ReadMe.txt, occupy 93.7 MB of disk space.

unzip("LOBSTER_SampleFile_INTC_2012-06-21_5.zip");

The data describes the intraday evolution of the limit order book (LOB), which is the record of market orders (best price), limit orders (designated price), and resulting buys and sells. The data includes the precise time of these events, with orders tracked from arrival until cancellation or execution. At each moment in the trading day, orders on both the buy and sell side of the LOB exist at various levels away from the midprice between the lowest ask (order to sell) and the highest bid (order to buy).

Level 5 data (five levels away from the midprice on either side) is contained in two CSV files. Extract the trading date from the message file name.

MSGFileName = "INTC_2012-06-21_34200000_57600000_message_5.csv";   % Message file (description of data)
LOBFileName = "INTC_2012-06-21_34200000_57600000_orderbook_5.csv"; % Data file

[ticker,rem] = strtok(MSGFileName,'_');
date = strtok(rem,'_'); 
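
As a rough cross-check of the parsing above, the following sketch splits the same file name on underscores with split. The variable names (parts, tickerCheck, dateCheck, sessionStart, sessionEnd) are hypothetical. Reading the third and fourth fields as the session start and end in milliseconds after midnight is an assumption about the LOBSTER naming scheme, not something this example states.

```matlab
% Sketch: split the message file name into its underscore-separated fields.
fileName = "INTC_2012-06-21_34200000_57600000_message_5.csv";
parts = split(fileName,"_");          % 6-by-1 string array

tickerCheck = parts(1);               % "INTC"
dateCheck   = parts(2);               % "2012-06-21"

% Assumption: fields 3 and 4 are milliseconds after midnight.
sessionStart = milliseconds(str2double(parts(3)));
sessionEnd   = milliseconds(str2double(parts(4)));
sessionStart.Format = 'hh:mm:ss';
sessionEnd.Format   = 'hh:mm:ss';
disp([sessionStart sessionEnd])
```

Under that assumption, the two durations correspond to a 09:30 to 16:00 trading session.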

Data Storage

Daily data streams accumulate and need to be stored. A datastore is a repository for collections of data that are too big to fit in memory.

Use tabularTextDatastore to create datastores for the message and data files. Because the files contain data with different formats, create the datastores separately. Ignore generic column headers (for example, VarName1) by setting the 'ReadVariableNames' name-value argument to false. Replace the headers with descriptive variable names obtained from LOBSTER_SampleFiles_ReadMe.txt. Set the 'ReadSize' name-value argument to 'file' to allow similarly formatted files to be appended to existing datastores at the end of each trading day.

DSMSG = tabularTextDatastore(MSGFileName,'ReadVariableNames',false,'ReadSize','file');
DSMSG.VariableNames = ["Time","Type","OrderID","Size","Price","Direction"];

DSLOB = tabularTextDatastore(LOBFileName,'ReadVariableNames',false,'ReadSize','file');
DSLOB.VariableNames = ["AskPrice1","AskSize1","BidPrice1","BidSize1",...
                       "AskPrice2","AskSize2","BidPrice2","BidSize2",...
                       "AskPrice3","AskSize3","BidPrice3","BidSize3",...
                       "AskPrice4","AskSize4","BidPrice4","BidSize4",...
                       "AskPrice5","AskSize5","BidPrice5","BidSize5"];
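
For files with more levels, typing the names by hand becomes error-prone. The following sketch builds the same 20 names in a loop. The names numLevels, fields, and lobNames are hypothetical, and the ask-then-bid ordering within each level simply mirrors the list above.

```matlab
% Sketch: generate order book variable names for a given number of levels.
numLevels = 5;                                   % levels in the sample file
fields = ["AskPrice","AskSize","BidPrice","BidSize"];

lobNames = strings(1,numel(fields)*numLevels);
for level = 1:numLevels
    idx = (level-1)*numel(fields) + (1:numel(fields));
    lobNames(idx) = fields + level;              % for example, "AskPrice1"
end

disp(lobNames(1:4))
```

Assigning lobNames to DSLOB.VariableNames would then be equivalent to the explicit list.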

Create a combined datastore by selecting Time and the level 3 data.

TimeVariable = "Time";
DSMSG.SelectedVariableNames = TimeVariable;

LOB3Variables = ["AskPrice1","AskSize1","BidPrice1","BidSize1",...
                 "AskPrice2","AskSize2","BidPrice2","BidSize2",...
                 "AskPrice3","AskSize3","BidPrice3","BidSize3"];
DSLOB.SelectedVariableNames = LOB3Variables;
                               
DS = combine(DSMSG,DSLOB);         

You can preview the first few rows in the combined datastore without loading data into memory.

DSPreview = preview(DS);
LOBPreview = DSPreview(:,1:5)
LOBPreview=8×5 table
    Time     AskPrice1    AskSize1    BidPrice1    BidSize1
    _____    _________    ________    _________    ________

    34200    2.752e+05       66       2.751e+05      400   
    34200    2.752e+05      166       2.751e+05      400   
    34200    2.752e+05      166       2.751e+05      400   
    34200    2.752e+05      166       2.751e+05      400   
    34200    2.752e+05      166       2.751e+05      300   
    34200    2.752e+05      166       2.751e+05      300   
    34200    2.752e+05      166       2.751e+05      300   
    34200    2.752e+05      166       2.751e+05      300   

The preview shows asks and bids at the touch, meaning the level 1 data, which is closest to the midprice. Time units are seconds after midnight, price units are dollar amounts times 10,000, and size units are the number of shares (see LOBSTER_SampleFiles_ReadMe.txt).
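
To make the units concrete, the following sketch converts one raw price and one raw time stamp to dollars and clock time. The raw price is a hypothetical value consistent with the rounded display in the preview, and the variable names are illustrative only.

```matlab
% Sketch: convert raw LOBSTER units to dollars and time of day.
rawPrice = 275200;                 % hypothetical raw price (dollars x 10,000)
rawTime  = 34200;                  % seconds after midnight, as in the preview

priceDollars = rawPrice/10000;     % price in dollars
clockTime = seconds(rawTime);      % duration since midnight
clockTime.Format = 'hh:mm:ss';     % display as hours:minutes:seconds

fprintf("Price: $%.2f at %s\n",priceDollars,string(clockTime))
```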

Tall Arrays and Timetables

Tall arrays work with out-of-memory data backed by a datastore using the MapReduce technique (see Tall Arrays for Out-of-Memory Data). When you use MapReduce, tall arrays remain unevaluated until you execute specific computations that use the data.

Set the execution environment for MapReduce to the local MATLAB session, instead of using Parallel Computing Toolbox™, by calling mapreducer(0). Then, create a tall array from the datastore DS by using tall. Preview the data in the tall array.

mapreducer(0)
DT = tall(DS);

DTPreview = DT(:,1:5)
DTPreview =

  Mx5 tall table

    Time     AskPrice1    AskSize1    BidPrice1    BidSize1
    _____    _________    ________    _________    ________

    34200    2.752e+05       66       2.751e+05      400   
    34200    2.752e+05      166       2.751e+05      400   
    34200    2.752e+05      166       2.751e+05      400   
    34200    2.752e+05      166       2.751e+05      400   
    34200    2.752e+05      166       2.751e+05      300   
    34200    2.752e+05      166       2.751e+05      300   
    34200    2.752e+05      166       2.751e+05      300   
    34200    2.752e+05      166       2.751e+05      300   
      :          :           :            :           :
      :          :           :            :           :
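
The deferred behavior can be seen on a small scale with the following sketch, which builds a tall array from a short in-memory vector instead of a datastore. The data and variable names are hypothetical.

```matlab
% Sketch: a tall expression stays unevaluated until gather is called.
mapreducer(0);                     % run in the local MATLAB session
x = tall((1:10)');                 % small in-memory array, for illustration
m = mean(x);                       % queued, not yet computed

disp(class(m))                     % the result is still a tall array
mValue = gather(m);                % triggers the computation
disp(mValue)
```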

Timetables allow you to perform operations specific to time series (see Create Timetables). Because the LOB data consists of concurrent time series, convert DT to a tall timetable.

DT.Time = seconds(DT.Time); % Cast time as a duration from midnight.
DTT = table2timetable(DT);

DTTPreview = DTT(:,1:4)
DTTPreview =

  Mx4 tall timetable

      Time       AskPrice1    AskSize1    BidPrice1    BidSize1
    _________    _________    ________    _________    ________

    34200 sec    2.752e+05       66       2.751e+05      400   
    34200 sec    2.752e+05      166       2.751e+05      400   
    34200 sec    2.752e+05      166       2.751e+05      400   
    34200 sec    2.752e+05      166       2.751e+05      400   
    34200 sec    2.752e+05      166       2.751e+05      300   
    34200 sec    2.752e+05      166       2.751e+05      300   
    34200 sec    2.752e+05      166       2.751e+05      300   
    34200 sec    2.752e+05      166       2.751e+05      300   
        :            :           :            :           :
        :            :           :            :           :
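
The following sketch repeats the conversion on a small in-memory table with hypothetical values. It shows why the timetable has one variable fewer than the table: table2timetable uses the first duration (or datetime) variable as the row times.

```matlab
% Sketch: convert a small in-memory table to a timetable.
T = table([34200; 34200.5; 34201],[275200; 275200; 275300], ...
    'VariableNames',{'Time','AskPrice1'});   % hypothetical values
T.Time = seconds(T.Time);                    % numeric seconds -> duration
TT = table2timetable(T);                     % Time becomes the row times

disp(width(T))    % number of variables in the table
disp(width(TT))   % one fewer, because Time now labels the rows
```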

Display all variables in the MATLAB workspace.

whos
  Name               Size            Bytes  Class                                       Attributes

  DS                 1x1                 8  matlab.io.datastore.CombinedDatastore                 
  DSLOB              1x1                 8  matlab.io.datastore.TabularTextDatastore              
  DSMSG              1x1                 8  matlab.io.datastore.TabularTextDatastore              
  DSPreview          8x13             4899  table                                                 
  DT                 Mx13             5292  tall                                                  
  DTPreview          Mx5              2926  tall                                                  
  DTT                Mx12             5056  tall                                                  
  DTTPreview         Mx4              2704  tall                                                  
  LOB3Variables      1x12              796  string                                                
  LOBFileName        1x1               250  string                                                
  LOBPreview         8x5              2331  table                                                 
  MSGFileName        1x1               246  string                                                
  TimeVariable       1x1               166  string                                                
  date               1x1               172  string                                                
  rem                1x1               238  string                                                
  ticker             1x1               166  string                                                

Because all the data is in the datastore, the workspace uses little memory.

Preprocess and Evaluate Data

Tall arrays allow preprocessing, or queuing, of computations before they are evaluated, which improves memory management in the workspace.

Midprice S and imbalance index I are used to model LOB dynamics. To queue their computations, define them, and the time base, in terms of DTT.

timeBase = DTT.Time;
MidPrice = (DTT.BidPrice1 + DTT.AskPrice1)/2;

% LOB level 3 imbalance index:

lambda  = 0.5; % Hyperparameter
weights = exp(-(lambda)*[0 1 2]);
VAsk = weights(1)*DTT.AskSize1 + weights(2)*DTT.AskSize2 + weights(3)*DTT.AskSize3;
VBid = weights(1)*DTT.BidSize1 + weights(2)*DTT.BidSize2 + weights(3)*DTT.BidSize3;
ImbalanceIndex = (VBid-VAsk)./(VBid+VAsk);

The imbalance index is the normalized difference between the weighted bid and ask volumes on either side of the midprice [3], so it takes values between -1 and 1. The weights decay exponentially with the level, at a rate set by lambda. The imbalance index is a potential indicator of future price movements. The variable lambda is a hyperparameter, which is a parameter specified before training rather than estimated by the machine learning algorithm. A hyperparameter can influence the performance of the model. Feature engineering is the process of choosing domain-specific hyperparameters to use in machine learning algorithms. You can tune hyperparameters to optimize a trading strategy.
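
As a worked illustration, the following sketch evaluates the same formula for a single tick. The bid and ask sizes are hypothetical, not values from the data set.

```matlab
% Sketch: level 3 imbalance index for one hypothetical tick.
bidSizes = [400 100 50];             % hypothetical sizes at levels 1 to 3
askSizes = [100 200 300];
lambda   = 0.5;                      % same hyperparameter value as above

weights = exp(-lambda*(0:2));        % weights decay with the level
VBid = sum(weights.*bidSizes);
VAsk = sum(weights.*askSizes);
I0 = (VBid - VAsk)/(VBid + VAsk);    % positive when bids outweigh asks

fprintf("Imbalance index: %.4f\n",I0)
```

With lambda = 0, all three levels count equally. As lambda grows, the index depends increasingly on the level 1 sizes alone.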

To bring preprocessed expressions into memory and evaluate them, use the gather function. Postponing the computation until gather is called is known as deferred evaluation.

[t,S,I] = gather(timeBase,MidPrice,ImbalanceIndex);
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 1.4 sec
Evaluation completed in 1.5 sec

A single call to gather evaluates multiple preprocessed expressions with a single pass through the datastore.
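
The following sketch shows the same pattern on a small tall array built from hypothetical in-memory data: three queued results are requested in one call to gather.

```matlab
% Sketch: gather several queued results in one call.
x = tall([3; 1; 4; 1; 5; 9; 2; 6]);   % hypothetical data
lo  = min(x);                         % queued
hi  = max(x);                         % queued
avg = mean(x);                        % queued

[loValue,hiValue,avgValue] = gather(lo,hi,avg);
fprintf("min %g, max %g, mean %g\n",loValue,hiValue,avgValue)
```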

Determine the sample size, which is the number of ticks, or updates, in the data.

numTicks = length(t)
numTicks = 
581030

The daily LOB data contains 581,030 ticks.

Checkpoint Data

You can save both unevaluated and evaluated data to external storage for later use.

Prepend the time base with the date, and cast the result as a datetime array. Save the resulting datetime array, MidPrice, and ImbalanceIndex to a MAT-file in a specified location.

dateTimeBase = datetime(date) + timeBase; 
Today = timetable(dateTimeBase,MidPrice,ImbalanceIndex)
Today =

  581,030x2 tall timetable

        dateTimeBase         MidPrice     ImbalanceIndex
    ____________________    __________    ______________

    21-Jun-2012 09:30:00    2.7515e+05         -0.205   
    21-Jun-2012 09:30:00    2.7515e+05       -0.26006   
    21-Jun-2012 09:30:00    2.7515e+05       -0.26006   
    21-Jun-2012 09:30:00    2.7515e+05      -0.086772   
    21-Jun-2012 09:30:00    2.7515e+05       -0.15581   
    21-Jun-2012 09:30:00    2.7515e+05       -0.35382   
    21-Jun-2012 09:30:00    2.7515e+05       -0.19084   
    21-Jun-2012 09:30:00    2.7515e+05       -0.19084   
             :                  :               :
             :                  :               :

location = fullfile(pwd,"ExchangeData",ticker,date);
write(location,Today,'FileType','mat')
Writing tall data to folder /tmp/Bdoc24b_2679053_3457411/tp3404b24c/finance-ex97702880/ExchangeData/INTC/2012-06-21
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 2 sec
Evaluation completed in 2.3 sec

The file is written once, at the end of each trading day. The code saves the data to a file in a date-stamped folder. The series of ExchangeData subfolders serves as a historical data repository.
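
The first step above, adding the time base to the date, relies on datetime plus duration arithmetic. The following sketch shows it on a few hypothetical tick times. The variable names are illustrative, and the explicit 'InputFormat' is a precaution, not something the example requires.

```matlab
% Sketch: add a time-of-day duration to a date to get full time stamps.
tradeDate = datetime("2012-06-21",'InputFormat','yyyy-MM-dd');
timeOfDay = seconds([34200; 34200.5; 57600]);   % hypothetical tick times

timestamps = tradeDate + timeOfDay;             % datetime array
disp(timestamps)
```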

Alternatively, you can save workspace variables evaluated with gather directly to a MAT-file in the current folder.

save("LOBVars.mat","t","S","I")

In preparation for model validation later on, evaluate and add market order prices to the same file.

[MOBid,MOAsk] = gather(DTT.BidPrice1,DTT.AskPrice1);
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 1.2 sec
Evaluation completed in 1.2 sec

save("LOBVars.mat","MOBid","MOAsk","-append")
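
To confirm what a MAT-file contains after an append, you can list its variables without loading them. The following sketch uses a temporary file with a hypothetical name and hypothetical variables, so that it does not touch LOBVars.mat.

```matlab
% Sketch: create a MAT-file, append to it, and list its contents.
matFile = fullfile(tempdir,'LOBVarsDemo.mat');   % hypothetical file name
a = 1:3;
save(matFile,'a')                                % create the file
b = 4:6;
save(matFile,'b','-append')                      % add a second variable

info = whos('-file',matFile);                    % inspect without loading
names = {info.name};
disp(names)
delete(matFile)                                  % clean up the demo file
```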

The remainder of this example uses only the unevaluated tall timetable DTT. Clear other variables from the workspace.

clearvars -except DTT 
whos
  Name            Size            Bytes  Class    Attributes

  DTT       581,030x12             5056  tall               

Data Visualization

To visualize large amounts of data, you must summarize, bin, or sample the data in some way to reduce the number of points plotted on the screen.
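
As one illustration of binning, the following sketch averages hypothetical tick data into one-minute bins with retime. This example does not itself use retime, and the data and variable names are made up for the sketch.

```matlab
% Sketch: bin tick data into one-minute averages with retime.
Time   = seconds(34200 + (0:0.5:179.5))';   % hypothetical ticks, 0.5 sec apart
Shares = (1:numel(Time))';                  % hypothetical sizes
ticks  = timetable(Time,Shares);

binned = retime(ticks,'minutely','mean');   % one row per minute
disp(height(ticks))                         % number of points before binning
disp(binned)
```

Plotting binned instead of ticks reduces the number of points at the cost of intraminute detail.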

LOB Snapshot

One method of visualization is to evaluate only a selected subsample of the data. Create a snapshot of the LOB at a specific time of day (11 AM).

sampleTimeTarget = seconds(11*60*60);               % Seconds after midnight
sampleTimes = withtol(sampleTimeTarget,seconds(1)); % 1 second tolerance
sampleLOB = DTT(sampleTimes,:);

numTimes = gather(size(sampleLOB,1))
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 1.4 sec
Evaluation completed in 1.4 sec
numTimes = 
23

There are 23 ticks within one second of 11 AM. For the snapshot, use the middle tick of this sample.

sampleLOB = sampleLOB(round(numTimes/2),:);
sampleTime = sampleLOB.Time;

sampleBidPrices = [sampleLOB.BidPrice1,sampleLOB.BidPrice2,sampleLOB.BidPrice3];
sampleBidSizes  = [sampleLOB.BidSize1,sampleLOB.BidSize2,sampleLOB.BidSize3];
sampleAskPrices = [sampleLOB.AskPrice1,sampleLOB.AskPrice2,sampleLOB.AskPrice3];
sampleAskSizes  = [sampleLOB.AskSize1,sampleLOB.AskSize2,sampleLOB.AskSize3];

[sampleTime,sampleBidPrices,sampleBidSizes,sampleAskPrices,sampleAskSizes] = ...
    gather(sampleTime,sampleBidPrices,sampleBidSizes,sampleAskPrices,sampleAskSizes);
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 1.2 sec
- Pass 2 of 2: Completed in 1.2 sec
Evaluation completed in 2.7 sec
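
The withtol selection used above can be tried on a small in-memory timetable. In the following sketch, the tick times and prices are hypothetical. Only the rows within one second of 11 AM (39,600 seconds after midnight) are kept.

```matlab
% Sketch: select timetable rows within a tolerance of a target time.
Time  = seconds([39598; 39599.5; 39600.2; 39601.5; 39605]);  % hypothetical
Price = [275100; 275200; 275200; 275300; 275300];            % hypothetical
demoTT = timetable(Time,Price);

target   = seconds(11*60*60);                 % 11 AM as seconds after midnight
nearRows = withtol(target,seconds(1));        % 1 second tolerance
subsample = demoTT(nearRows,:);

disp(subsample)
```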

Visualize the limited data sample returned by gather by using bar.

figure
hold on

bar((sampleBidPrices/10000),sampleBidSizes,'r')
bar((sampleAskPrices/10000),sampleAskSizes,'g')
hold off

xlabel("Price (Dollars)")
ylabel("Number of Shares")
legend(["Bid","Ask"],'Location','North')
title(strcat("Level 3 Limit Order Book: ",datestr(sampleTime,"HH:MM:SS")))

Figure: Level 3 Limit Order Book: 11:00:00. Bar chart of Number of Shares versus Price (Dollars), with bars for Bid and Ask.

Depth of Market

Some visualization functions work directly with tall arrays and do not require the use of gather (see Visualization of Tall Arrays). The functions automatically sample data to decrease pixel density. Visualize the level 3 intraday depth of market, which shows the time evolution of liquidity, by using plot with the tall timetable DTT.

figure
hold on

plot(DTT.Time,-DTT.BidSize1,'Color',[1.0 0 0],'LineWidth',2)
plot(DTT.Time,-DTT.BidSize2,'Color',[0.8 0 0],'LineWidth',2)
plot(DTT.Time,-DTT.BidSize3,'Color',[0.6 0 0],'LineWidth',2)

plot(DTT.Time,DTT.AskSize1,'Color',[0 1.0 0],'LineWidth',2)
plot(DTT.Time,DTT.AskSize2,'Color',[0 0.8 0],'LineWidth',2)
plot(DTT.Time,DTT.AskSize3,'Color',[0 0.6 0],'LineWidth',2)

hold off

xlabel("Time")
ylabel("Number of Shares")
title("Depth of Market: Intraday Evolution")
legend(["Bid1","Bid2","Bid3","Ask1","Ask2","Ask3"],'Location','NorthOutside','Orientation','Horizontal');

Figure: Depth of Market: Intraday Evolution. Line plot of Number of Shares versus Time, with lines for Bid1, Bid2, Bid3, Ask1, Ask2, and Ask3.

To display details, limit the time interval.

xlim(seconds([45000 45060]))
ylim([-35000 35000])
title("Depth of Market: One Minute")

Figure: Depth of Market: One Minute. Line plot of Number of Shares versus Time, with lines for Bid1, Bid2, Bid3, Ask1, Ask2, and Ask3.

Summary

This example introduces the basics of working with big data, both in and out of memory. It shows how to set up, combine, and update external datastores, then create tall arrays for preprocessing data without allocating variables in the MATLAB workspace. The gather function transfers data into the workspace for computation and further analysis. The example shows how to visualize the data through data sampling or by MATLAB plotting functions that work directly with out-of-memory data.

References

[1] LOBSTER Limit Order Book Data. Berlin: frischedaten UG (haftungsbeschränkt).

[2] NASDAQ Historical TotalView-ITCH Data. New York: The Nasdaq, Inc.

[3] Rubisov, Anton D. "Statistical Arbitrage Using Limit Order Book Imbalance." Master's thesis, University of Toronto, 2015.
