Correlation of two variables over time: can this happen?

I have two variables, x1 and x2, measured between January 1 and December 31. When I calculate the correlation between January and June, it is positive. When I calculate the correlation between July and December, it is positive. But the correlation between January and December is negative. Can this happen?

1 Comment

Yes. However, you probably want to perform a statistical test to determine if the fluctuating correlation is significant.
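In MATLAB, [R,P] = corrcoef(x1,x2) returns p-values alongside the correlation matrix. A minimal sketch of the same idea in Python (using NumPy and SciPy; the data are the accepted answer's toy values with the noise term omitted):

```python
import numpy as np
from scipy import stats

# Toy data: each half is positively correlated, the whole series is not.
x1 = np.array([1, 2, 3, 4, 6, 7, 8, 9], dtype=float)
x2 = np.array([1, 2, 3, 4, -6, -5, -4, -3], dtype=float)

r_first, p_first = stats.pearsonr(x1[:4], x2[:4])    # first half
r_second, p_second = stats.pearsonr(x1[4:], x2[4:])  # second half
r_full, p_full = stats.pearsonr(x1, x2)              # whole series

print(f"first half:  r = {r_first:+.3f}")
print(f"second half: r = {r_second:+.3f}")
print(f"full series: r = {r_full:+.3f}, p = {p_full:.4f}")
```

With these values both halves are positively correlated while the full series is negatively correlated, and the p-value tells you whether each correlation is distinguishable from zero at your chosen significance level.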


 Accepted Answer

Yes. This is known as Simpson's Paradox. Here is an example:
rng default
x1 = [1 2 3 4 6 7 8 9]';
x2 = [1 2 3 4 -6 -5 -4 -3]' + 0.8*rand(8,1);
% Correlation of first half
corrcoef(x1(1:4),x2(1:4))
% Correlation of second half
corrcoef(x1(5:8),x2(5:8))
% Correlation of entire vector
corrcoef(x1,x2)
% Plot it
figure
scatter(x1,x2)
You can see that x1 and x2 are positively correlated within the first half and within the second half, but the trend over the entire vector is negative.

7 Comments

Thank you. Then what would be an interpretation of the relationship between x and y?
As also mentioned above, this is a clear example of Simpson's paradox: you cannot (or rather, you can, but better not) interpret the association in this model, irrespective of the statistical method you use, without taking the effect of confounding variables into account. As a simple example, you can think of group 1 and group 2 (the two clusters in the scatter plot above) as a confounder you should adjust for.
rng default
x = [1 2 3 4 6 7 8 9]'; % predictor
y = [1 2 3 4 -6 -5 -4 -3]' + 0.8*rand(8,1); % dependent
conf = [ones(4, 1); ones(4, 1) + 1]; % confounding variable
% model 1: without adjusting for this confounder
fitlm(x, y)
Linear regression model:
    y ~ 1 + x1

Estimated Coefficients:
                   Estimate      SE        tStat      pValue
                   ________    _______    _______    ________
    (Intercept)     4.6512     2.1124      2.2019    0.069923
    x1             -1.0439     0.37054    -2.8173    0.030462
% model 2: after adjusting for the confounder
fitlm([x, conf], y)
Linear regression model:
    y ~ 1 + x1 + x2

Estimated Coefficients:
                   Estimate       SE        tStat       pValue
                   ________    ________    _______    __________
    (Intercept)     12.737     0.38027      33.496    4.4587e-07
    x1              0.97767    0.087819     11.133    0.00010197
    x2             -12.129     0.48101     -25.217    1.8305e-06
Judea Pearl would emphasize that this "paradox" cannot be resolved using only the data. The correct interpretation will rely on understanding the causal mechanism or generative process that led to these data. (I don't expect there is an ELI5 explanation.)
Assuming that the data are meaningful, neither the positive nor the negative correlation is "wrong". They just describe different aspects of the data. For example, suppose in my example the variables represent something like
  • x1 = amount of fertilizer used on a field
  • x2 = crop yield
(Doesn't really work with the negative values I used, but ignore that.)
And maybe the cluster of 4 points on the left is from spring, and the cluster of 4 points on the right is from autumn.
The interpretation could be that greater use of fertilizer yields greater crop yield ... but that there is a factor related to the season that means there is lower yield in autumn.
It is not possible to interpret the data themselves, or determine whether the partitioned or aggregated data are more relevant, without a conceptual model of what is going on.
Valid point. So I must further emphasize adjusting for a confounding factor that is a cause of both x1 and x2. And otherwise the problem should be approached carefully (e.g. in the case of a mediator).
To be clear, my comment was directed more at the OP than as a reply to your comment, @IveIve
It was useful anyway!
Yes, I now understand the confounding factor t.
x1 and x2 have a positive correlation between date 1 and date 2.
x1 and x2 have a positive correlation between date 2 and date 3.
x1 and x2 have a negative correlation between date 1 and date 3.
Regression of x2 on x1 between date 1 and date 2 has a positive, statistically significant coefficient.
Regression of x2 on x1 between date 2 and date 3 has a positive, statistically significant coefficient.
Regression of x2 on x1 between date 1 and date 3 has a negative, statistically significant coefficient.
Regression of x2 on x1 and time between date 1 and date 2 has positive, statistically significant coefficients for both x1 and time.
Regression of x2 on x1 and time between date 2 and date 3 has positive, statistically significant coefficients for both x1 and time.
Regression of x2 on x1 and time between date 1 and date 3 has positive, statistically significant coefficients for both x1 and time.
How can this happen?


More Answers (1)

Matt Gaidica
Matt Gaidica on 17 Dec 2020
Edited: Matt Gaidica on 17 Dec 2020
Sure, it can happen. What's your concern about it?

4 Comments

"Analysis reveals time to be the confounding variable: plotting both price and demand against time reveals the expected negative correlation over various periods, which then reverses to become positive if the influence of time is ignored by simply plotting demand against price." (https://en.wikipedia.org/wiki/Simpson%27s_paradox) Controlling for time resolves the issue.
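To make that concrete, here is a minimal sketch (a Python translation of the thread's toy data, noise omitted, with an assumed period indicator marking the first vs. second half of the year): the pooled slope of x2 on x1 is negative, but it turns positive once the time period is included in the regression.

```python
import numpy as np

x1 = np.array([1, 2, 3, 4, 6, 7, 8, 9], dtype=float)
x2 = np.array([1, 2, 3, 4, -6, -5, -4, -3], dtype=float)
period = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)  # time confounder

# Model 1: x2 ~ 1 + x1 (ordinary least squares via lstsq)
X1 = np.column_stack([np.ones_like(x1), x1])
b1, *_ = np.linalg.lstsq(X1, x2, rcond=None)

# Model 2: x2 ~ 1 + x1 + period
X2 = np.column_stack([np.ones_like(x1), x1, period])
b2, *_ = np.linalg.lstsq(X2, x2, rcond=None)

print("slope of x1, ignoring time:     ", b1[1])  # negative
print("slope of x1, controlling time:  ", b2[1])  # positive
```

This mirrors the fitlm comparison in the accepted-answer thread: the sign of the x1 coefficient flips once the confounder enters the model.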
Thank you. Then, what would be the ELI5 (Explain Like I am Five) explanation?
For a thorough explanation see the Wikipedia link in the Cyclist's answer below. (He beat me to it, posting this fun paradox.)
It would be interesting for us to know what real-world measurements x1 and x2 are, so we can see a real-world situation that gives rise to this paradox. What do x1 and x2 represent?
I am happy to explain in order to get comments, but I am not sure whether I want to discuss the problem on a public bulletin board. So if there is a suggestion for another venue, I will be happy to listen.


Asked: on 17 Dec 2020
Edited: on 17 Dec 2020
