fitlm returns p-values equal to NaN without z-scoring

33 views (last 30 days)
Manuela
Manuela about 17 hours ago
Commented: dpb about 6 hours ago
I cannot understand why fitlm returns p-values equal to NaN when I use the variables without z-scoring, while this does not happen with the z-scored variables. Below is the code I used:
medians_var1_scored = nanzscore(medians_var1);
medians_var2_scored = nanzscore(medians_var2);
medians_var1_scored_tr = medians_var1_scored';
medians_var2_scored_tr = medians_var2_scored';
tbl=table(medians_var1_scored_tr,medians_var2_scored_tr,'VariableNames', ...
{'var1','var2'});
%build your model
mdl=fitlm(tbl,'var1 ~ var2','RobustOpts','on')
which gives this result:
mdl =
Linear regression model (robust fit):
var1 ~ 1 + var2
Estimated Coefficients:
Estimate SE tStat pValue
_________ ________ ________ _______
(Intercept) 0.028674 0.094595 0.30312 0.76234
var2 -0.072919 0.094998 -0.76758 0.44429
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 1.03
R-squared: 0.00584, Adjusted R-Squared: -0.00273
F-statistic vs. constant model: 0.681, p-value = 0.411
If instead I use the original variables, I do:
tbl=table(medians_var1',medians_var2','VariableNames', ...
{'var1','var2'});
%build your model
mdl=fitlm(tbl,'var1 ~ var2','RobustOpts','on')
and obtain:
mdl =
Linear regression model (robust fit):
var1 ~ 1 + var2
Estimated Coefficients:
Estimate SE tStat pValue
__________ __________ ______ __________
(Intercept) 0 0 NaN NaN
var2 8.6386e-13 2.1087e-14 40.966 7.8796e-71
Number of observations: 118, Error degrees of freedom: 117
Root Mean Squared Error: 31.4
R-squared: 0.263, Adjusted R-Squared: 0.263
F-statistic vs. constant model: Inf, p-value = NaN
I can't understand why this happens. Attached you can find both the original var1 and var2 variables and the z-scored ones. If I plot both sets, I obtain the same plot (just rescaled), so there shouldn't be a problem in the hand-written nanzscore function.

Answers (1)

dpb
dpb about 8 hours ago
Edited: dpb about 7 hours ago
d=dir('median*.mat'); % load the files w/o having to type a zillion characters...
for i=1:numel(d)
load(d(i).name)
disp(d(i).name) % figure out which is var1, var2??? show which file
whos('-file',d(i).name) % contains which variable...
end
medians_var1.mat
  Name                Size   Bytes  Class   Attributes
  medians_energy_vec  1x118  944    double
medians_var1_scored_tr.mat
  Name                      Size   Bytes  Class   Attributes
  medians_energy_scored_tr  118x1  944    double
medians_var2.mat
  Name                         Size   Bytes  Class   Attributes
  medians_micro_parameter_vec  1x118  944    double
medians_var2_scored_tr.mat
  Name                   Size   Bytes  Class   Attributes
  medians_dwi_scored_tr  118x1  944    double
whos
  Name                         Size   Bytes  Class   Attributes
  ans                          1x40   80     char
  d                            4x1    3555   struct
  i                            1x1    8      double
  medians_dwi_scored_tr        118x1  944    double
  medians_energy_scored_tr     118x1  944    double
  medians_energy_vec           1x118  944    double
  medians_micro_parameter_vec  1x118  944    double
es=medians_energy_scored_tr; ds=medians_dwi_scored_tr; % shorten names
e=medians_energy_vec.'; d=medians_micro_parameter_vec.';
nnz(isfinite(es)), nnz(isfinite(ds)) % is there a NaN lurking, maybe?
ans = 118
ans = 118
nnz(isfinite(e)), nnz(isfinite(d))
ans = 118
ans = 118
[min(es) max(es) min(ds) max(ds)] % see what the magnitude of variables are...
ans = 1×4
-2.8784 2.2034 -2.8010 1.8310
[min(d) max(d)], [min(e) max(e)]
ans = 1×2
1.0e+14 *
0.7538    1.7465
ans = 1×2
47.1018 175.4894
Oh....there's likely the problem; d (medians_micro_parameter_vec) is huge in magnitude (on the order of 1e14), so the unscaled fit is numerically rank-deficient: next to values that large, the intercept column of ones is indistinguishable from zero, and the sums formed internally for X'X lose all precision. The scaling brings everything down to roughly the +/-3 range expected for normally distributed data.
Always examine data before blindly fitting...
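To make the conditioning concrete, a quick check (a sketch, using the shortened variable names above) is to look at the design matrix fitlm builds for y ~ 1 + x:

```matlab
% With x ~ O(1e14), the [intercept, x] design matrix is numerically
% rank-deficient: the column of ones is negligible next to x.
X = [ones(numel(d),1) d];
cond(X)    % enormous here -- comparable to 1/eps for these data
rank(X)    % drops below 2 at the default tolerance
```

This is exactly what the "rank deficient to within machine precision" warning later in this thread is reporting.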
plot(ds,es,'x')
figure
plot(d,e,'x')
These appear to have just been randomly generated values---why such a large magnitude was used is anybody's guess here...but there's the problem.
We'll just fit the same data, still large but on the order of 1E4 instead of 1E14...
fitlm(d/1E10,e,'RobustOpts','on')
ans =
Linear regression model (robust fit):
y ~ 1 + x1
Estimated Coefficients:
             Estimate    SE         tStat     pValue
             __________  _________  ________  __________
(Intercept)  132.19      15.351     8.6108    4.1607e-14
x1           -0.0008596  0.0011199  -0.76758  0.44429
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 26
R-squared: 0.00584, Adjusted R-Squared: -0.00273
F-statistic vs. constant model: 0.681, p-value = 0.411
The problem is simply that the magnitude of the independent variable is so large that precision is lost in the sums computed internally. As the above shows, merely reducing the absolute magnitude of the x variable is sufficient; there isn't necessarily any "magic" in z-scoring itself.
First we checked to be sure there weren't any NaN values causing the issue that the nanzscore() function was silently taking care of for us...
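(For reference, the hand-written nanzscore isn't shown in the post; it is presumably something along these lines, where the name and the n-1 normalization are assumptions:)

```matlab
function z = nanzscore(x)
% z-score x while ignoring NaN entries (sketch of what the OP's
% helper presumably does; the actual implementation isn't shown)
mu = mean(x,'omitnan');
sd = std(x,0,'omitnan');   % sample standard deviation (n-1)
z  = (x - mu) ./ sd;       % NaNs in x stay NaN in z
end
```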
If we then adjust by the mean but scale only by a power of 10 instead of std(), we get roughly the same result of about a zero intercept; note the slope matches the earlier fit apart from the extra factor of 1E4 in the scaling, and the pValue is identical.
fitlm((d-mean(d))/1E14,e,'RobustOpts','on')
ans =
Linear regression model (robust fit):
y ~ 1 + x1
Estimated Coefficients:
             Estimate  SE      tStat     pValue
             ________  ______  ________  __________
(Intercept)  120.55    2.3898  50.441    9.3494e-81
x1           -8.596    11.199  -0.76758  0.44429
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 26
R-squared: 0.00584, Adjusted R-Squared: -0.00273
F-statistic vs. constant model: 0.681, p-value = 0.411
And just for good measure, we'll use standardized variables...
fitlm(zscore(d),zscore(e),'RobustOpts','on')
ans =
Linear regression model (robust fit):
y ~ 1 + x1
Estimated Coefficients:
             Estimate   SE        tStat     pValue
             _________  ________  ________  _______
(Intercept)  0.028674   0.094595  0.30312   0.76234
x1           -0.072919  0.094998  -0.76758  0.44429
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 1.03
R-squared: 0.00584, Adjusted R-Squared: -0.00273
F-statistic vs. constant model: 0.681, p-value = 0.411
Keeping intermediate calculations within the range of floating-point precision is one of the prime uses of z-scaling in regression, besides putting variables of differing magnitudes on the same scale so their sensitivities can be compared.
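If one does standardize x purely for numerics, the coefficients are easily mapped back to the original units afterward (a sketch; mdl.Coefficients.Estimate is the standard LinearModel property):

```matlab
% Fit on standardized x, then undo the scaling in the coefficients.
mu = mean(d);  sd = std(d);
mdl = fitlm((d-mu)/sd, e, 'RobustOpts','on');   % well-conditioned fit
bz  = mdl.Coefficients.Estimate;   % [intercept; slope] in scaled units
slope_orig     = bz(2)/sd;              % slope per original unit of d
intercept_orig = bz(1) - bz(2)*mu/sd;   % intercept in original units
```

The t-statistics and p-values are unaffected by this linear rescaling, which is why they match across all the fits above.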
  1 Comment
dpb
dpb about 1 hour ago
d=dir('median*.mat'); % load the files w/o having to type a zillion characters...
for i=1:numel(d)
load(d(i).name)
end
es=medians_energy_scored_tr; ds=medians_dwi_scored_tr; % shorten names
e=medians_energy_vec.'; d=medians_micro_parameter_vec.';
b=polyfit(d,e,1) % compare polyfit() w/o standardization
Warning: Polynomial is badly conditioned. Add points with distinct X values, reduce the degree of the polynomial, or try centering and scaling as described in HELP POLYFIT.
b = 1×2
-0.0000 135.6279
polyfit() discovers the problem, too.
[b,~,mu]=polyfit(d,e,1) % and with internally scaled
b = 1×2
-2.5016 119.8224
mu = 2×1
1.0e+14 *
1.3541
0.2143
fitlm(zscore(d),e)
ans =
Linear regression model:
y ~ 1 + x1
Estimated Coefficients:
             Estimate  SE      tStat    pValue
             ________  ______  _______  __________
(Intercept)  119.82    2.3243  51.553   8.3006e-82
x1           -2.5016   2.3342  -1.0717  0.28607
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 25.2
R-squared: 0.0098, Adjusted R-Squared: 0.00127
F-statistic vs. constant model: 1.15, p-value = 0.286
fitlm and polyfit agree with each other under OLS with standardization of the independent variable, as one would expect.
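One practical note on the three-output polyfit form: predictions must be evaluated in the same centered/scaled coordinates, which polyval handles when mu is passed back in:

```matlab
[b,S,mu] = polyfit(d,e,1);   % mu(1) = mean(d), mu(2) = std(d)
yhat = polyval(b,d,S,mu);    % evaluates the fit at (d - mu(1))/mu(2)
```

Calling polyval(b,d) without mu would apply the scaled-coordinate coefficients to the raw 1e14-magnitude data and give nonsense.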
histogram(d,20)
While the x values are all distinct, we can observe that they are clustered in a couple of regions. It might be better numerically (although a different problem, of course) if they were distributed uniformly. Let's see what might happen then...
x=linspace(min(d),max(d),numel(d));
polyfit(x,e,1)
Warning: Polynomial is badly conditioned. Add points with distinct X values, reduce the degree of the polynomial, or try centering and scaling as described in HELP POLYFIT.
ans = 1×2
-0.0000 155.9119
fitlm(x,e)
Warning: Regression design matrix is rank deficient to within machine precision.
ans =
Linear regression model:
y ~ 1 + x1
Estimated Coefficients:
             Estimate   SE          tStat   pValue
             _________  __________  ______  _________
(Intercept)  0          0           NaN     NaN
x1           8.952e-13  3.0539e-14  29.313  1.842e-55
Number of observations: 118, Error degrees of freedom: 117
Root Mean Squared Error: 42.6
R-squared: -1.84, Adjusted R-Squared: -1.84
F-statistic vs. constant model: -Inf, p-value = NaN
Well, spreading the points out isn't enough on its own; let's go back to simply making the magnitude smaller, as we did before...
polyfit(x/1E10,e,1)
ans = 1×2
-0.0029 155.9119
and Voila! -- at least the coefficients can again be estimated once the values don't completely swamp the precision.
