Principal Component Analysis - return data (stock market data)

First, let me explain what I want to do.
I have data on returns of 262 stocks for 299 days in one year.
I want to run a factor model that takes the following form:
r(i,t) = y(0,i) + beta(i)*F(t) + e(i,t)
where t denotes a daily observation (edit: there are 299 days) and i denotes a stock. After the regression I want to calculate the standard deviation of the residual e for every stock.
F(t) in this regression should be the first 5 principal components of the cross section of returns in this year.
I spent the last two days reading about principal component analysis. I think I understand the basic idea, but I am having difficulty applying it.
So I loaded the data into MATLAB and executed the following code:
coeff = pca(Data,'NumComponents',5)
This returns a 262 x 5 matrix.
So there are 5 columns because I specified the number of components to be 5, right?
But why do I get 5 different components for every stock?
At first I thought I would need only 1 row and 5 columns. But when I look at the regression and see that F has the subscript t, I think I need five different component values for every day, or am I wrong? And how do I get them?
  3 Comments
David Schaefer on 15 Oct 2019
What I forgot to mention: I have 299 daily observations.
So for my factor model:
Could transposing the data before applying the code be a solution? That way I would get a 299 x 5 matrix and could then run the regression, because in my factor model F has the subscript t for daily observations.
Adam on 15 Oct 2019
Edited: Adam on 15 Oct 2019
The number of observations should not be a factor. The observations just determine what the eigenvectors actually are and how accurately they will measure what you want (more observations should give greater accuracy as a model of your data), but the eigenvectors themselves will have the dimensionality of your inputs.
Each of your input observations lives in 262-dimensional space, i.e. it has 262 components. These are all 'axis-aligned' along each of those components. The eigenvectors you get simply re-orient within that 262-dimensional space to give new axes that follow the multi-dimensional shape of your data rather than following each of the original components.
You can then project your data onto the eigenvectors and use these projections instead of the original dimensions. Because the eigenvectors follow the directions of greatest variance in your data, you can throw away 257 of them (well, you chose to keep just 5 at least): the 5 you keep describe your data better than 5 of the original dimensions would.
You should also look at the other outputs from the pca function though. The explained output will tell you how much of the data variation is captured by those first 5 principal components.
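As a rough sketch of what those other outputs look like, the snippet below runs pca on synthetic data with the same shape as in the question (299 days by 262 stocks). The random matrix and the variable names nDays, nStocks and k are placeholders introduced here for illustration, not the original data.

```matlab
% Synthetic stand-in for the 299-by-262 return matrix (rows = days, columns = stocks).
rng(0);                                  % fixed seed so the sketch is reproducible
nDays = 299; nStocks = 262; k = 5;
Data = 0.01 * randn(nDays, nStocks);     % placeholder returns, not real market data

% coeff:     loadings, one row per stock, one column per component
% score:     value of each component on each day
% explained: percentage of total variance captured by each component
[coeff, score, ~, ~, explained] = pca(Data, 'NumComponents', k);

disp(size(coeff));                       % nStocks-by-k
disp(size(score));                       % nDays-by-k
fprintf('First %d components explain %.1f%% of the variance\n', ...
        k, sum(explained(1:k)));
```

With pure noise as input the first 5 components explain only a small share of the variance; on real returns with a common market factor the share would typically be much larger.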


Answers (1)

the cyclist on 16 Oct 2019
You might want to check out my tutorial-style PCA answer here.
I think one helpful way to think about your 262x5 output is that your 262 stocks are the entire "market", and your 5 columns are stock "indices", designed to capture a fraction (ideally a large fraction) of the variability of the market.
Each index -- defined by a column of the coeff output -- is a principal component, defined by a linear combination of all the stocks. So if the first column of coeff is
coeff(:,1) = [0.03;
0.02;
0.06;
...
...]
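that is telling you that the first "index" (i.e. first principal component) puts a weight of 0.03 on stock 1, 0.02 on stock 2, 0.06 on stock 3, and so on. (Each column of coeff has unit length, so these are weights rather than percentages that sum to 100%.)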
So, to capture what the market was doing on the 299 market days (as captured by the 5 indices), I believe you just need
Data * coeff
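which is a (299x262) * (262x5) = (299x5) matrix. That matrix holds the 299 daily returns of the 5 indices. Note that pca centers each column of Data by default, so its second output, score, equals (Data - mean(Data)) * coeff and differs from Data * coeff only by a constant offset in each column.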
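To connect this back to the factor model in the question, here is a minimal sketch that builds the five daily factor series and then regresses every stock on them, assuming ordinary least squares with an intercept. The data are synthetic placeholders, and the names F, X, B, E and residStd are chosen here for illustration.

```matlab
rng(1);                                  % fixed seed so the sketch is reproducible
nDays = 299; nStocks = 262; k = 5;
Data = 0.01 * randn(nDays, nStocks);     % placeholder returns, not real market data

[coeff, score] = pca(Data, 'NumComponents', k);

% Daily factor returns F(t): projection of the centered data onto the loadings.
% This matches the score output because pca centers each column by default.
F = (Data - mean(Data, 1)) * coeff;      % nDays-by-k

% Regress all stocks at once: column i of B holds [y(0,i); beta(i)].
X = [ones(nDays, 1), F];                 % intercept plus k factors
B = X \ Data;                            % (k+1)-by-nStocks least-squares coefficients
E = Data - X * B;                        % residuals e(i,t), nDays-by-nStocks

% Residual standard deviation per stock (normalized by nDays-1,
% not adjusted for the k+1 estimated coefficients).
residStd = std(E, 0, 1);                 % 1-by-nStocks
```

Because the regression includes an intercept, using Data * coeff instead of the centered projection would change only the intercepts y(0,i); the betas and residuals would be the same.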
