Using PCA to filter outliers in MATLAB
A bit of a silly question (I suspect). I am using MATLAB to preprocess some bulk RNA-seq data, and I am calling pca on my normalized counts as an initial way to identify outliers that I didn't automatically reject. There are indeed some clear outliers, but I am not sure how to use the score matrix output by pca to filter my original data, because it is now in descending order of explained variance. How can I use the pca scores to get back to the indices of the problematic samples? Thank you so much! Info below:
[coeff,score,latent,tsquared,explained] = pca(normCounts);
size(normCounts)   % 136 samples with 24427 measured genes
ans =
   136   24427
2 Comments
Mathieu NOE
on 21 May 2024
It would help if you could share some code and data.
Cooper
on 21 May 2024
Accepted Answer
More Answers (1)
the cyclist
on 21 May 2024
I frankly did not read all your code, but you might also be interested in this tutorial-style answer I wrote about using MATLAB's pca function. In particular, if you are trying to remove observations that are outliers along the principal component axes, you should be able to
- apply pca
- apply rmoutliers on the variable I call dataInPrincipalComponentSpace
- transform that variable back to the original space (see the sketch just below)
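In code, that might look something like this sketch (my illustration of the steps above, assuming the pca call from the question; rmoutliers here uses its default median/MAD detection, and with 135 components you may prefer to run it on just the first few columns of score rather than all of them):
[coeff,score,latent,tsquared,explained,mu] = pca(normCounts);

% "score" is the data in principal component space. Its COLUMNS are
% sorted by explained variance, but its ROWS stay in sample order.
dataInPrincipalComponentSpace = score;

% Remove rows containing outliers along the principal component axes.
% The second output flags which rows were removed.
[cleanScores,isOutlier] = rmoutliers(dataInPrincipalComponentSpace);

% Because row order is preserved, the same flags index the original samples.
outlierSampleIdx = find(isOutlier);

% Transform back to the original space (pca centers the data, so add the
% estimated mean, mu, back in).
cleanCounts = cleanScores*coeff' + mu;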
5 Comments
John D'Errico
on 21 May 2024 (edited)
I'm not sure PCA is a great tool for removing outliers. A fundamental assumption in PCA is that the data are normally distributed in multiple dimensions, and outliers in the data completely wreck that assumption.
At most, you might want to use a remove-one-at-a-time scheme, where you choose the clearest outlier, remove it, then completely redo the PCA.
Personally, I might rather consider a scheme where I'd look at the influence each data point has on the resulting covariance matrix. I'm pretty sure you could find a measure of the distance between two populations, based on their means and covariance matrices. But even that carries an implicit presumption of multivariate normality, and that in itself can easily be a flaw. For example, consider some data which happens to be lognormally distributed in multiple dimensions. If you looked only at the outliers a PCA would identify, you could easily mislabel some points as outliers, when after logging the data, the outliers would clearly be a different set of points.
Cooper
on 21 May 2024
John D'Errico
on 21 May 2024 (edited)
I was kind of vague, partly because I'm not really a true sadistician. I hope I spelled that correctly. ;-)
But there are two issues I see. One is the question of lack of normality. For example, consider the set:
X = exp(randn(1000,3));
X is lognormally distributed. And I am sure any PCA on X would identify some of those points as outliers. But if you consider the log of that sample, you would recognize there are no outliers at all. Everything is as perfectly well behaved as possible in that domain. Even worse, suppose I added in one new point:
X2 = [X;exp([-8 -8 -8])];
I would conjecture that this new point will not even remotely be flagged as an outlier by any PCA-based scheme, but back in log space, that last point is out in the weeds. The problem in all of this is that PCA implicitly presumes normality, and when you lack normality, PCA will fail to perform well. Worse, any data that contains outliers explicitly fails the presumption of normally distributed data.
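A quick way to see this numerically (my own rough sketch, nothing rigorous):
rng default
X2 = [exp(randn(1000,3)); exp([-8 -8 -8])];   % the augmented sample

% In raw (or principal component) coordinates, the appended point
% sits comfortably inside the cloud:
[~,scoreX2] = pca(X2);
norm(scoreX2(end,:))     % comparable to a typical point's score norm

% After logging, it is roughly 8 standard deviations out in every dimension:
Z = zscore(log(X2));
Z(end,:)                 % approximately [-8 -8 -8]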
The second issue I was thinking of is how a PCA will treat data with an outlier in it. Because an outlier distorts the covariance estimate, all of the scores will be strange, and the outlier itself can end up masked. I think I need to come up with an example of what I mean, but my proposal below would seem to eliminate that issue anyway.
Anyway, if you want to do this, I would work in terms of a leave-one-out Mahalanobis distance, instead of using a PCA. Thus for each point, compute the mean and covariance matrix of the set, leaving that point out. Now compute the distance for the point you left out. Repeat for every point in your set. Choose the one with the largest distance measure, IF that distance is sufficiently large in terms of Mahalanobis distance. Drop it from your set. Now lather, rinse and repeat, until none of the points are sufficiently far out to be considered an outlier by this scheme. A sketch follows below.
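One way to code that up (a sketch; the chi-square cutoff is my reading of "sufficiently large", which itself assumes multivariate normality, and the function name is made up for illustration). Note the covariance matrix must be full rank, so with 24427 genes and 136 samples you would apply this in a reduced space, e.g. to the leading PC scores, not to the raw counts:
function [Xclean,dropped] = looMahalOutliers(X,alpha)
% Leave-one-out Mahalanobis outlier removal, as described above.
if nargin < 2, alpha = 0.001; end
cutoff = chi2inv(1-alpha,size(X,2));   % squared-distance threshold
keep = true(size(X,1),1);
dropped = [];
while true
    idx = find(keep);
    d2 = zeros(numel(idx),1);
    for k = 1:numel(idx)
        rest = idx([1:k-1, k+1:end]);  % leave point k out
        mu = mean(X(rest,:),1);
        S  = cov(X(rest,:));
        r  = X(idx(k),:) - mu;
        d2(k) = (r/S)*r';              % squared Mahalanobis distance
    end
    [worst,kw] = max(d2);
    if worst <= cutoff                 % nothing is far enough out: stop
        break
    end
    dropped(end+1) = idx(kw);          %#ok<AGROW>
    keep(idx(kw)) = false;             % drop the worst point and repeat
end
Xclean = X(keep,:);
end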
Will that scheme fail for the lognormal example I showed before? Of course.
Finally, does rmoutliers do something like this already? I'd need to check on that, but I don't see that it does, at least not at a glance.
the cyclist
on 21 May 2024
The transformation to, and back from, the principal components will not change the order of the observations. It will only change the coordinate system.
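For example (assuming the pca call from the question):
[coeff,score,~,~,~,mu] = pca(normCounts);
roundTrip = score*coeff' + mu;               % back to original coordinates
max(abs(roundTrip - normCounts),[],'all')    % ~0 up to round-off; rows never moved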
The cautions you received from @John D'Errico are valid (of course). I took the "low road" of just telling you how to accomplish what you asked, rather than the high road of giving you the many things to think about in what you are doing. I have a 90-page book on how to detect and handle outliers, and it's just difficult for me to know where to draw the line on what to share here. :-)
I think the most important short lesson is to ask yourself why you are dropping an observation. The simple fact that it lies away from the bulk of the distribution -- without any hypothesis about why this may have happened -- is not a very good reason to do so.