Using PCA to filter outliers in MATLAB
A bit of a silly question (I suspect). I am using MATLAB to preprocess some bulk RNA-seq data, and I am calling pca on my normalized counts as an initial way to identify outliers that I didn't automatically reject. There are indeed some clear outliers, but I am not sure how to use the score matrix output by pca to filter my original data, because it is now in descending order of explained variance. How can I use the pca scores to get back to the indices of the problematic samples? Thank you so much! Info below:
[coeff,score,latent,tsquared,explained] = pca(normCounts);
size(normCounts)   % 136 samples with 24427 measured genes
ans =
   136   24427
2 Comments
Mathieu NOE
on 21 May 2024
It would help if you could share some code and data.
Cooper
on 21 May 2024
Accepted Answer
More Answers (1)
the cyclist
on 21 May 2024
I frankly did not read all your code, but you might also be interested in this tutorial-style answer I wrote about using MATLAB's pca function. In particular, if you are trying to remove observations that are outliers along the principal component axes, you should be able to
- apply pca
- apply rmoutliers on the variable I call dataInPrincipalComponentSpace
- transform that variable back to the original space (see the sketch just below)
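In code, that might look something like this sketch (my illustration of the steps above, assuming the pca call from the question; rmoutliers here uses its default median/MAD detection, and with 135 components you may prefer to run it on just the first few columns of score rather than all of them):
[coeff,score,latent,tsquared,explained,mu] = pca(normCounts);

% "score" is the data in principal component space. Its COLUMNS are
% sorted by explained variance, but its ROWS stay in sample order.
dataInPrincipalComponentSpace = score;

% Remove rows containing outliers along the principal component axes.
% The second output flags which rows were removed.
[cleanScores,isOutlier] = rmoutliers(dataInPrincipalComponentSpace);

% Because row order is preserved, the same flags index the original samples.
outlierSampleIdx = find(isOutlier);

% Transform back to the original space (pca centers the data, so add the
% estimated mean, mu, back in).
cleanCounts = cleanScores*coeff' + mu;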
5 Comments
John D'Errico
on 21 May 2024 (edited)
I'm not sure PCA is a great tool for removing outliers. A fundamental assumption in PCA is that the data are normally distributed in multiple dimensions, and outliers in the data completely wreck that assumption.
At most, you might want to use a remove-one-at-a-time scheme, where you choose the clearest outlier, remove it, then completely redo the PCA.
Personally, I might rather consider a scheme where I'd look at the influence each data point has on the resulting covariance matrix. I'm pretty sure you could find a measure of the distance between two populations, based on their means and covariance matrices. But even that carries an implicit presumption of multivariate normality, and that in itself can easily be a flaw. For example, consider some data which happens to be lognormally distributed in multiple dimensions. If you looked only at the outliers a PCA would identify, you could easily mislabel some points as outliers, when after logging the data, the outliers would clearly be a different set of points.
Cooper
on 21 May 2024
John D'Errico
on 21 May 2024 (edited)
I was kind of vague, partly because I'm not really a true sadistician. I hope I spelled that correctly. ;-)
But there are two issues I see. One is the question of lack of normality. For example, consider the set:
X = exp(randn(1000,3));
X is lognormally distributed. And I am sure any PCA on X would identify some of those points as outliers. But if you consider the log of that sample, you would recognize there are no outliers at all. Everything is as perfectly well behaved as possible in that domain. Even worse, suppose I added in one new point:
X2 = [X;exp([-8 -8 -8])];
I would conjecture that this new point will not even remotely be flagged as an outlier by any PCA-based scheme, but back in log space, that last point is out in the weeds. The problem in all of this is that PCA implicitly presumes normality, and when you lack normality, PCA will fail to perform well. Worse, any data that contains outliers explicitly fails the presumption of normally distributed data.
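A quick way to see this numerically (my own rough sketch, nothing rigorous):
rng default
X2 = [exp(randn(1000,3)); exp([-8 -8 -8])];   % the augmented sample

% In raw (or principal component) coordinates, the appended point
% sits comfortably inside the cloud:
[~,scoreX2] = pca(X2);
norm(scoreX2(end,:))     % comparable to a typical point's score norm

% After logging, it is roughly 8 standard deviations out in every dimension:
Z = zscore(log(X2));
Z(end,:)                 % approximately [-8 -8 -8]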
The second issue I was thinking of is how a PCA will treat data with an outlier in it. Because an outlier distorts the covariance estimate, all of the scores will be strange, and the outlier itself can end up masked. I think I need to come up with an example of what I mean, but my proposal below would seem to eliminate that issue anyway.
Anyway, if you want to do this, I would work in terms of a leave-one-out Mahalanobis distance, instead of using a PCA. Thus for each point, compute the mean and covariance matrix of the set, leaving that point out. Now compute the distance for the point you left out. Repeat for every point in your set. Choose the one with the largest distance measure, IF that distance is sufficiently large in terms of Mahalanobis distance. Drop it from your set. Now lather, rinse and repeat, until none of the points are sufficiently far out to be considered an outlier by this scheme. A sketch follows below.
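One way to code that up (a sketch; the chi-square cutoff is my reading of "sufficiently large", which itself assumes multivariate normality, and the function name is made up for illustration). Note the covariance matrix must be full rank, so with 24427 genes and 136 samples you would apply this in a reduced space, e.g. to the leading PC scores, not to the raw counts:
function [Xclean,dropped] = looMahalOutliers(X,alpha)
% Leave-one-out Mahalanobis outlier removal, as described above.
if nargin < 2, alpha = 0.001; end
cutoff = chi2inv(1-alpha,size(X,2));   % squared-distance threshold
keep = true(size(X,1),1);
dropped = [];
while true
    idx = find(keep);
    d2 = zeros(numel(idx),1);
    for k = 1:numel(idx)
        rest = idx([1:k-1, k+1:end]);  % leave point k out
        mu = mean(X(rest,:),1);
        S  = cov(X(rest,:));
        r  = X(idx(k),:) - mu;
        d2(k) = (r/S)*r';              % squared Mahalanobis distance
    end
    [worst,kw] = max(d2);
    if worst <= cutoff                 % nothing is far enough out: stop
        break
    end
    dropped(end+1) = idx(kw);          %#ok<AGROW>
    keep(idx(kw)) = false;             % drop the worst point and repeat
end
Xclean = X(keep,:);
end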
Will that scheme fail for the lognormal example I showed before? Of course.
Finally, does rmoutliers do something like this already? I'd need to check on that, but I don't see that it does, at least not at a glance.
the cyclist
on 21 May 2024
The transformation to, and back from, the principal components will not change the order of the observations. It will only change the coordinate system.
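For example (assuming the pca call from the question):
[coeff,score,~,~,~,mu] = pca(normCounts);
roundTrip = score*coeff' + mu;               % back to original coordinates
max(abs(roundTrip - normCounts),[],'all')    % ~0 up to round-off; rows never moved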
The cautions you received from @John D'Errico are valid (of course). I took the "low road" of just telling you how to accomplish what you asked, rather than the high road of giving you the many things to think about in what you are doing. I have a 90-page book on how to detect and handle outliers, and it's just difficult for me to know where to draw the line on what to share here. :-)
I think the most important short lesson is to ask yourself why you are dropping an observation. The simple fact that it lies away from the bulk of the distribution -- without any hypothesis about why this may have happened -- is not a very good reason to do so.