fitlm returns pvalues equal to NaN without zscoring
I cannot understand why fitlm returns pValues equal to NaN when I use the variables without z-scoring, whereas this does not happen with the z-scored variables. Below is the code I used:
% z-score the variables (ignoring NaNs) and transpose to column vectors
medians_var1_scored = nanzscore(medians_var1);
medians_var2_scored = nanzscore(medians_var2);
medians_var1_scored_tr = medians_var1_scored';
medians_var2_scored_tr = medians_var2_scored';
tbl = table(medians_var1_scored_tr, medians_var2_scored_tr, 'VariableNames', ...
    {'var1','var2'});
% build your model
mdl = fitlm(tbl, 'var1 ~ var2', 'RobustOpts', 'on')
which gives this result:
mdl =
Linear regression model (robust fit):
var1 ~ 1 + var2
Estimated Coefficients:
Estimate SE tStat pValue
_________ ________ ________ _______
(Intercept) 0.028674 0.094595 0.30312 0.76234
var2 -0.072919 0.094998 -0.76758 0.44429
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 1.03
R-squared: 0.00584, Adjusted R-Squared: -0.00273
F-statistic vs. constant model: 0.681, p-value = 0.411
If instead I use the original variables, I do:
tbl=table(medians_var1',medians_var2','VariableNames', ...
{'var1','var2'});
%build your model
mdl=fitlm(tbl,'var1 ~ var2','RobustOpts','on')
and obtain:
mdl =
Linear regression model (robust fit):
var1 ~ 1 + var2
Estimated Coefficients:
Estimate SE tStat pValue
__________ __________ ______ __________
(Intercept) 0 0 NaN NaN
var2 8.6386e-13 2.1087e-14 40.966 7.8796e-71
Number of observations: 118, Error degrees of freedom: 117
Root Mean Squared Error: 31.4
R-squared: 0.263, Adjusted R-Squared: 0.263
F-statistic vs. constant model: Inf, p-value = NaN
I can't understand why this happens. You can find attached both the original var1 and var2 variables and the z-scored ones. If I plot both, I obtain the same plot (just rescaled), so there shouldn't be a problem in the nanzscore function, which is hand-written.
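For reference, nanzscore is meant to be nothing more than a NaN-aware z-score along the lines of the sketch below (a minimal sketch; the attached implementation may differ in detail):
function z = nanzscore(x)
% z-score that ignores NaNs: subtract the NaN-omitting mean, divide by the NaN-omitting std
z = (x - mean(x,'omitnan')) ./ std(x,0,'omitnan');
end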
Answers (1)
dpb
d=dir('median*.mat'); % load the files w/o having to type a zillion characters...
for i=1:numel(d)
load(d(i).name)
disp(d(i).name) % figure out which is var1, var2??? show which file
whos('-file',d(i).name) % contains which variable...
end
whos
es=medians_energy_scored_tr; ds=medians_dwi_scored_tr; % shorten names
e=medians_energy_vec.'; d=medians_micro_parameter_vec.';
nnz(isfinite(es)), nnz(isfinite(ds)) % is there a NaN lurking, maybe?
nnz(isfinite(e)), nnz(isfinite(d))
[min(es) max(es) min(ds) max(ds)] % see what the magnitude of variables are...
[min(d) max(d)], [min(e) max(e)]
Oh... there's likely the problem; d (medians_micro_parameter_vec) is huge in magnitude, so the unscaled calculation overflows when computing the sum terms in X'X. The scaling brings everything down to roughly a +/-3 range if the data are roughly normally distributed.
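To make the numerical point concrete, here is a quick sketch with synthetic values of a similar magnitude (the 1e13 scale is an assumption inferred from the fitted slope, not the attached data):
x  = 1e13*(1 + rand(118,1));      % synthetic predictor on the order of 1e13, like d
X  = [ones(size(x)) x];           % design matrix with an intercept column
cond(X'*X)                        % astronomically large -- numerically singular in double precision
Xs = [ones(size(x)) zscore(x)];   % same data with the predictor standardized
cond(Xs'*Xs)                      % small -- well conditioned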
Always examine data before blindly fitting...
plot(ds,es,'x')
figure
plot(d,e,'x')
These appear to have just been randomly generated values---why such a large magnitude was used is anybody's guess here...but there's the problem.
We'll just fit the same data, still large but on the order of 1E4 instead of 1E14...
fitlm(d/1E10,e,'RobustOpts','on')
The problem is simply that the magnitude of the independent variable is so large that the sums overflow during the computation; as the above shows, simply reducing the absolute magnitude of the x variable is sufficient; there isn't necessarily any "magic" in z-scoring.
First we checked to be sure there weren't any NaNs causing the issue that nanzscore() was silently taking care of for us...
If we then adjust by the mean and scale by a power of 10 instead of by std(), we get roughly the same result of about a zero intercept; note the slope is the same, just scaled by the additional factor of 1E4 we introduced, and the pValue is identical.
fitlm((d-mean(d))/1E14,e,'RobustOpts','on')
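A quick way to convince yourself that rescaling x only rescales the slope and leaves the inference untouched (a sketch with synthetic data, not the attached variables):
rng(0)                            % synthetic example for illustration only
x = 1e4*randn(100,1);
y = 3*x + 1e3*randn(100,1);
m1 = fitlm(x,    y);
m2 = fitlm(x/10, y);
[m1.Coefficients.Estimate(2) m2.Coefficients.Estimate(2)]   % slope vs. 10x the slope
[m1.Coefficients.pValue(2)   m2.Coefficients.pValue(2)]     % identical pValues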
And just for good measure, we'll use standardized variables...
fitlm(zscore(d),zscore(e),'RobustOpts','on')
Keeping intermediate calculations within the range of floating-point precision is one of the prime uses of z-scaling in regression, besides putting variables of differing magnitudes on the same scale for comparing sensitivities.
1 Comment
dpb
d=dir('median*.mat'); % load the files w/o having to type a zillion characters...
for i=1:numel(d)
load(d(i).name)
end
es=medians_energy_scored_tr; ds=medians_dwi_scored_tr; % shorten names
e=medians_energy_vec.'; d=medians_micro_parameter_vec.';
b=polyfit(d,e,1) % compare polyfit() w/o standardization
polyfit() discovers the problem, too.
[b,~,mu]=polyfit(d,e,1) % and with polyfit's internal centering and scaling (three-output syntax)
fitlm(zscore(d),e)
fitlm and polyfit agree with each other with OLS and with standardization of the independent variable, as one would expect.
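Note that with the three-output syntax the polyfit coefficients refer to the centered and scaled variable (x - mu(1))/mu(2); to map them back to the original x scale (a sketch using the b and mu returned above):
slope_orig     = b(1)/mu(2);                 % slope in terms of the original x
intercept_orig = b(2) - b(1)*mu(1)/mu(2);    % intercept in terms of the original x
[slope_orig intercept_orig]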
histogram(d,20)
While the x values are all distinct, we can observe that they are clustered in a couple of regions. It might be better numerically (although a different problem, of course) if they were distributed uniformly... let's see what might happen then...
x=linspace(min(d),max(d),numel(d));
polyfit(x,e,1)
fitlm(x,e)
Well, that's not enough on its own; let's go back to just making the magnitude smaller that we did before...
polyfit(x/1E10,e,1)
and voila! -- at least the coefficients can again be estimated once the values don't completely swamp precision.