Can I add 95% confidence ellipses around groups of data in a pca plot (biplot)?

I am plotting the results of principle component analysis using biplot. I am wondering if there is a way (in matlab) to add confidence ellipses around the groups of data. Maybe I have to use something instead of biplot?
In the example below I would want to draw an ellipse around the data for each site (green and blue points)
Here's how I am currently plotting the data:
[coeff,score,latent,tsquared,explained] = pca(X);
f1=figure(1)
h= biplot(coeff(:,1:2),'scores',score(:,1:2),'color','k','marker','.','markersize',17,'varlabels',...
{'DO','O2sat','pH','Temp','Sal','Depth','PAR','|Velocity|'});
%color by site
hID = get(h,'tag'); %identify handle
hPt = h(strcmp(hID,'obsmarker')); %isolate handles to scatter points
grp = findgroups(site);
grpID = 1:max(grp);
clrMap = winter(length(unique(grp)));
for i = 1:max(grp)
set(hPt(grp==i), 'Color', clrMap(i,:), 'DisplayName', sprintf('MSP%d', grpID(i)))
end
title('Full Deployment (12h)');
set(gca,'fontsize',18)
xlabel('Component 1 (55.26%)') % how can I add percent ('%s %', explained(1))')
ylabel('Component 2 (21.87%)')
[~, unqIdx] = unique(grp);
legend(hPt(unqIdx))

 Accepted Answer

If you know the center of the clusters and the CI's along the x and y axes for each cluster, you can use this function to plot the ellipses
A comment below that answer points to an alternative solution as well.
Also check the file exchange.
Addendum
The block of code below implements the ellipses outlined in the first link above with your data which is attached. I made additional changes from your version to clean some stuff up a bit.
Stars mark the center of each cluster using the mean (consider using the median instead since the data are not normally distributed). Major and minor axes of the ellipses represent the 95% CI using the percentile method which is a good method given that the data are not normally distributed.
load('data.mat')
[coeff,score,latent,tsquared,explained] = pca(X);
f1=figure();
h= biplot(coeff(:,1:2),'scores',score(:,1:2),'color','k','marker','.','markersize',17,'varlabels',...
{'DO','O2sat','pH','Temp','Sal','Depth','PAR','|Velocity|'});
hold on
%color by site
hID = get(h,'tag'); %identify handle
hPt = h(strcmp(hID,'obsmarker')); %isolate handles to scatter points
[grp, grpID] = findgroups(site);
clrMap = winter(numel(grpID));
p = 95; % CI level
for i = 1:max(grpID)
set(hPt(grp==i), 'Color', clrMap(i,:), 'DisplayName', sprintf('MSP%d', grpID(i)))
% Compute centers (means)
allX = arrayfun(@(hh)hh.XData(1), hPt(grp==i));
allY = arrayfun(@(hh)hh.YData(1), hPt(grp==i));
centers(1) = mean(allX); % x mean
centers(2) = mean(allY); % y mean
% Plot centers, do they make sense?
plot(centers(1), centers(2), 'rp', 'MarkerFaceColor', clrMap(i,:), 'MarkerSize', 20, 'LineWidth', 1)
% Compute 95% CI using percentile method
CIx = prctile(allX, [(100-p)/2, p+(100-p)/2]); % x CI [left, right]
CIy = prctile(allY, [(100-p)/2, p+(100-p)/2]); % y CI [lower, upper]
CIrng(1) = CIx(2)-CIx(1); % CI range (x)
CIrng(2) = CIy(2)-CIy(1); % CI range (y)
% Draw ellipses
llc = [CIx(1), CIy(1)]; % (x,y) lower left corners
rectangle('Position',[llc,CIrng],'Curvature',[1,1], 'EdgeColor', clrMap(i,:));
end
title('Full Deployment (12h)');
set(gca,'fontsize',18)
xlabel('Component 1 (55.26%)') % how can I add percent ('%s %', explained(1))')
ylabel('Component 2 (21.87%)')
[~, unqIdx] = unique(grp);
legend(hPt(unqIdx))

5 Comments

Thank you Adam. I am having trouble isolating the data for each group to get the centers and CIs. I ran all my data together for the pca then I grouped it based on the group ID for each point. It seems like I should be able to save each subgroup by indexing the score results of the pca but that indexing doesn't appear to be working correctly. Shouldn't I be able to save each group of scores for separate plotting by doing something like this:
Ind1 is an index for the observations from the first site and Ind2 is an index for the observations from the second site2.
first site scores:
scores_1 = score(Ind1,1:2)
scores_2 = score(Ind2, 1:2)
I've attached the data if that helps. site is the site ID for each observation and X is the data that I feed into the pca:
[coeff,score,latent,tsquared,explained] = pca(X);
I tried adding the following code. I think I got close to a solution but for some reason everything is huge. Can anyone spot what I am doing wrong?
%divide scores by site (first have of observations are site 1 and second half are site 2)
score1 = score(1:34,1:2);
x1 = score1(:,1); y1 = score1(1,:);
score2 = score(35:end,1:2);
x2 = score2(:,1); y2 = score2(1,:);
%find centers for data from each site
meanx1 = mean(x1);
meany1 = mean(y1);
meanx2 = mean(x2);
meany2 = mean(y2);
%confidence intervals
% x is a vector, matrix, or any numeric array of data. NaNs are ignored.
% p is the confidence level (ie, 95 for 95% CI)
% The output is 1x2 vector showing the [lower,upper] interval values.
p=95;
%find upper and lower bounds for 95% confidence in x and y directions
% CIFcn_x1 = @(x1,p)prctile(x1,abs([0,100]-(100-p)/2));
CIFcn_x1 = prctile(x1,abs([0,100]-(100-p)/2));
CIFcn_y1 = prctile(y1,abs([0,100]-(100-p)/2));
CIFcn_x2 = prctile(x2,abs([0,100]-(100-p)/2));
CIFcn_y2 = prctile(y2,abs([0,100]-(100-p)/2));
CI_x1 = abs((CIFcn_x1(1)-CIFcn_x1(2))/2);
CI_y1 = abs((CIFcn_y1(1)-CIFcn_y1(2))/2);
CI_x2 = abs((CIFcn_x2(1)-CIFcn_x2(2))/2);
CI_y2 = abs((CIFcn_y2(1)-CIFcn_y2(2))/2);
mean1 = [meanx1,meany1]
std1 = [CI_x1,CI_y1]
mean2 = [meanx2,meany2]
std2 = [CI_x2,CI_y2]
%add ellipses
E = plotEllipses(mean1,std1);
E.FaceColor = [0 0 1 .3];
E.EdgeColor = clrMap(1,:);
E.LineWidth = 2;
E = plotEllipses(mean2,std2);
E.FaceColor = [0 1 0 .3];
E.EdgeColor = clrMap(2,:);
E.LineWidth = 2;
Thank you so much Adam! This is so helpful!

Sign in to comment.

More Answers (0)

Categories

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!