fitlm returns pvalues equal to NaN without zscoring
I cannot understand why fitlm returns p-values equal to NaN when I use the variables without z-scoring, whereas this does not happen with the z-scored variables. Below is the code I used:
medians_var1_scored = nanzscore(medians_var1);
medians_var2_scored = nanzscore(medians_var2);
medians_var1_scored_tr = medians_var1_scored';
medians_var2_scored_tr = medians_var2_scored';
tbl=table(medians_var1_scored_tr,medians_var2_scored_tr,'VariableNames', ...
{'var1','var2'});
%build your model
mdl=fitlm(tbl,'var1 ~ var2','RobustOpts','on')
which gives this result:
mdl =
Linear regression model (robust fit):
var1 ~ 1 + var2
Estimated Coefficients:
Estimate SE tStat pValue
_________ ________ ________ _______
(Intercept) 0.028674 0.094595 0.30312 0.76234
var2 -0.072919 0.094998 -0.76758 0.44429
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 1.03
R-squared: 0.00584, Adjusted R-Squared: -0.00273
F-statistic vs. constant model: 0.681, p-value = 0.411
If instead I use the original variables, I do:
tbl=table(medians_var1',medians_var2','VariableNames', ...
{'var1','var2'});
%build your model
mdl=fitlm(tbl,'var1 ~ var2','RobustOpts','on')
and obtain:
mdl =
Linear regression model (robust fit):
var1 ~ 1 + var2
Estimated Coefficients:
Estimate SE tStat pValue
__________ __________ ______ __________
(Intercept) 0 0 NaN NaN
var2 8.6386e-13 2.1087e-14 40.966 7.8796e-71
Number of observations: 118, Error degrees of freedom: 117
Root Mean Squared Error: 31.4
R-squared: 0.263, Adjusted R-Squared: 0.263
F-statistic vs. constant model: Inf, p-value = NaN
I can't understand why this happens. You can find attached both the original var1 and var2 variables and the z-scored ones. If I plot both, I obtain the same plot (just rescaled), so there shouldn't be a problem in the hand-written nanzscore function.
1 Comment
Torsten
on 31 Dec 2025
You forgot to include "nanzscore".
Accepted Answer
d=dir('median*.mat'); % load the files w/o having to type a zillion characters...
for i=1:numel(d)
load(d(i).name)
disp(d(i).name) % figure out which is var1, var2??? show which file
whos('-file',d(i).name) % contains which variable...
end
medians_var1.mat
Name Size Bytes Class Attributes
medians_energy_vec 1x118 944 double
medians_var1_scored_tr.mat
Name Size Bytes Class Attributes
medians_energy_scored_tr 118x1 944 double
medians_var2.mat
Name Size Bytes Class Attributes
medians_micro_parameter_vec 1x118 944 double
medians_var2_scored_tr.mat
Name Size Bytes Class Attributes
medians_dwi_scored_tr 118x1 944 double
whos
Name Size Bytes Class Attributes
ans 1x40 80 char
d 4x1 3555 struct
i 1x1 8 double
medians_dwi_scored_tr 118x1 944 double
medians_energy_scored_tr 118x1 944 double
medians_energy_vec 1x118 944 double
medians_micro_parameter_vec 1x118 944 double
es=medians_energy_scored_tr; ds=medians_dwi_scored_tr; % shorten names
e=medians_energy_vec.'; d=medians_micro_parameter_vec.';
[all(isfinite(es)), all(isfinite(ds))] % is there a NaN lurking, maybe?
ans = 1×2 logical array
1 1
[all(isfinite(e)), all(isfinite(d))]
ans = 1×2 logical array
1 1
[min(es) max(es) min(ds) max(ds)] % see what the magnitude of variables are...
ans = 1×4
-2.8784 2.2034 -2.8010 1.8310
[min(d) max(d)], [min(e) max(e)]
ans = 1×2
1.0e+14 *
0.7538 1.7465
ans = 1×2
47.1018 175.4894
Oh... there's likely the problem: d (medians_micro_parameter_vec) is huge in magnitude, so the unscaled calculation loses precision when computing the sum terms in X'X. The scaling brings everything down to roughly the +/-3 range expected with normal scaling.
Always examine data before blindly fitting...
plot(ds,es,'x')

figure
plot(d,e,'x')

These appear to be just randomly generated values; why such a large magnitude was used is anybody's guess here, but there's the problem.
We'll fit the same data, still large in magnitude but about 1E4 instead of 1E14...
fitlm(d/1E10,e,'RobustOpts','on')
ans =
Linear regression model (robust fit):
y ~ 1 + x1
Estimated Coefficients:
Estimate SE tStat pValue
__________ _________ ________ __________
(Intercept) 132.19 15.351 8.6108 4.1607e-14
x1 -0.0008596 0.0011199 -0.76758 0.44429
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 26
R-squared: 0.00584, Adjusted R-Squared: -0.00273
F-statistic vs. constant model: 0.681, p-value = 0.411
The problem is simply that the magnitude of the independent variable is so large that the sums computed along the way lose precision, leading to an apparently singular design matrix. As the above shows, simply reducing the absolute magnitude of the x variable is sufficient; there isn't necessarily any "magic" in z-scoring.
First we checked to be sure there weren't any NaNs causing the issue that the nanzscore() function was silently taking care of for us...
If we then adjust by the mean and scale, but only by a power of 10 instead of std(), we get roughly the same result of about a zero intercept. Note the slope is the same, only scaled by the additional factor of 1E4 we introduced, and the pValue is also identical.
fitlm((d-mean(d))/1E14,e,'RobustOpts','on')
ans =
Linear regression model (robust fit):
y ~ 1 + x1
Estimated Coefficients:
Estimate SE tStat pValue
________ ______ ________ __________
(Intercept) 120.55 2.3898 50.441 9.3494e-81
x1 -8.596 11.199 -0.76758 0.44429
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 26
R-squared: 0.00584, Adjusted R-Squared: -0.00273
F-statistic vs. constant model: 0.681, p-value = 0.411
And just for good measure, we'll use standardized variables...
fitlm(zscore(d),zscore(e),'RobustOpts','on')
ans =
Linear regression model (robust fit):
y ~ 1 + x1
Estimated Coefficients:
Estimate SE tStat pValue
_________ ________ ________ _______
(Intercept) 0.028674 0.094595 0.30312 0.76234
x1 -0.072919 0.094998 -0.76758 0.44429
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 1.03
R-squared: 0.00584, Adjusted R-Squared: -0.00273
F-statistic vs. constant model: 0.681, p-value = 0.411
Keeping the intermediate calculations of the design matrix roughly on the same order as the constant column of ones is one of the prime uses of z-scaling in regression, besides putting variables of differing magnitudes on the same scale for comparison of sensitivities. When the magnitude of one or more columns becomes excessively large by comparison, the model design matrix looks more and more ill-conditioned until the numerics finally fail.
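The effect described above can be reproduced with a quick synthetic sketch (made-up values, not the poster's attached data): compare the condition number of X'X for a regressor of magnitude ~1E14 against the same regressor standardized.

```matlab
% Sketch with synthetic data: a huge-magnitude regressor vs. the column of ones
rng(0)                                    % reproducible random values
n = 118;
x = 1E14*(1 + rand(n,1));                 % regressor of magnitude ~1E14
o = ones(n,1);                            % intercept column

X = [x o];
condRaw = cond(X.'*X)                     % astronomically ill-conditioned

Xz = [zscore(x) o];
condScaled = cond(Xz.'*Xz)               % near 1 after standardization
```

With the mean removed, the cross term zscore(x).'*o is identically zero, so X'X is (almost) diagonal, which is why the condition number collapses to nearly 1.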
15 Comments
dpb
on 31 Dec 2025
d=dir('median*.mat'); % load the files w/o having to type a zillion characters...
for i=1:numel(d)
load(d(i).name)
end
es=medians_energy_scored_tr; ds=medians_dwi_scored_tr; % shorten names
e=medians_energy_vec.'; d=medians_micro_parameter_vec.';
b=polyfit(d,e,1) % compare polyfit() w/o standardization
Warning: Polynomial is badly conditioned. Add points with distinct X values, reduce the degree of the polynomial, or try centering and scaling as described in HELP POLYFIT.
b = 1×2
-0.0000 135.6279
polyfit() discovers the problem, too.
[b,~,mu]=polyfit(d,e,1) % and with internally scaled
b = 1×2
-2.5016 119.8224
mu = 2×1
1.0e+14 *
1.3541
0.2143
fitlm(zscore(d),e)
ans =
Linear regression model:
y ~ 1 + x1
Estimated Coefficients:
Estimate SE tStat pValue
________ ______ _______ __________
(Intercept) 119.82 2.3243 51.553 8.3006e-82
x1 -2.5016 2.3342 -1.0717 0.28607
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 25.2
R-squared: 0.0098, Adjusted R-Squared: 0.00127
F-statistic vs. constant model: 1.15, p-value = 0.286
fitlm and polyfit agree with each other using OLS and standardization of the independent variable, as one would expect.
histogram(d,20)

While the x values are all distinct, we can observe that they are clustered in a couple of regions. It might be better numerically (although a different problem, of course) if they were distributed uniformly... let's see what might happen then...
x=linspace(min(d),max(d),numel(d));
polyfit(x,e,1)
Warning: Polynomial is badly conditioned. Add points with distinct X values, reduce the degree of the polynomial, or try centering and scaling as described in HELP POLYFIT.
ans = 1×2
-0.0000 155.9119
fitlm(x,e)
Warning: Regression design matrix is rank deficient to within machine precision.
ans =
Linear regression model:
y ~ 1 + x1
Estimated Coefficients:
Estimate SE tStat pValue
_________ __________ ______ _________
(Intercept) 0 0 NaN NaN
x1 8.952e-13 3.0539e-14 29.313 1.842e-55
Number of observations: 118, Error degrees of freedom: 117
Root Mean Squared Error: 42.6
R-squared: -1.84, Adjusted R-Squared: -1.84
F-statistic vs. constant model: -Inf, p-value = NaN
Well, that's not enough on its own; let's go back to just making the magnitude smaller, as we did before...
polyfit(x/1E10,e,1)
ans = 1×2
-0.0029 155.9119
and voila! -- at least the coefficients can again be estimated once the values don't completely swamp the precision.
dpb
on 1 Jan 2026
Just one more demonstration...
d=dir('median*.mat'); % load the files w/o having to type a zillion characters...
for i=1:numel(d)
load(d(i).name)
end
%es=medians_energy_scored_tr; ds=medians_dwi_scored_tr; % shorten names
e=medians_energy_vec.'; d=medians_micro_parameter_vec.';
o=ones(size(e));
for n=0:14
x=[d/10^n o]; % the design columns; x term scaled by powers of 10
X=x.'*x;
fprintf('Divisor: %2d, Condition Number: %.4e\n',n, cond(X))
end
Divisor: 0, Condition Number: 7.7533e+29
Divisor: 1, Condition Number: 7.7533e+27
Divisor: 2, Condition Number: 7.7533e+25
Divisor: 3, Condition Number: 7.7533e+23
Divisor: 4, Condition Number: 7.7533e+21
Divisor: 5, Condition Number: 7.7533e+19
Divisor: 6, Condition Number: 7.7533e+17
Divisor: 7, Condition Number: 7.7533e+15
Divisor: 8, Condition Number: 7.7533e+13
Divisor: 9, Condition Number: 7.7533e+11
Divisor: 10, Condition Number: 7.7533e+09
Divisor: 11, Condition Number: 7.7533e+07
Divisor: 12, Condition Number: 7.7541e+05
Divisor: 13, Condition Number: 7.8341e+03
Divisor: 14, Condition Number: 1.8001e+02
x=[zscore(d) o]; % now do the z score to compare
X=x.'*x;
fprintf('Standardized, Condition Number: %.2e\n', cond(X))
Standardized, Condition Number: 1.01e+00
This illustrates what happens to the design matrix condition number as the magnitude difference between the constant column of ones and the x values is reduced. Initially, since we're dividing by a factor of 10 and x enters X'X quadratically, the condition number drops by almost exactly a factor of 100; once the difference is down to about 8000, differences in the data values themselves begin to show up in the higher significant digits beyond the pure scaling.
When we use actual z-scoring at the end, using the data itself to remove the mean and standardize has the remarkable property of bringing the condition number down to almost exactly 1.
Manuela
on 1 Jan 2026
Hi @dpb... thanks for the detailed explanation! However, I have a question. Sorry for my ignorance, but when is the term X=x.'*x computed in fitlm? What is it in a linear regression? You say it's the cause of the trouble, but I don't understand what it is...
x.'*x arises in the matrix form of the O(rdinary) L(east) S(quares) calculation -- hence, in one form or another, the solution for the coefficients depends on inverting (or otherwise solving with) the design matrix. For an Nth-order polynomial, the base X is the column-wise matrix
[x.^n x.^(n-1) ... x.^2 x 1]
where x is the vector of independent-variable values and the column of ones is there for the intercept term. (You can solve for a no-intercept model by omitting it.) To compute the sums of squares, one multiplies the transpose by X. In your case of a first-order polynomial, you have only the [x 1] columns, as used in the examples.
The end result is
X'X B = X'y known as the "normal" equations.
where B is the array of coefficients and y the dependent observations vector. So, to solve for B
inverse(X'X)(X'X) B = inverse(X'X) X'y
I B = inverse(X'X) X'y where I is the identity matrix so
B = inverse(X'X) X'y is the solution for the coefficients, B
That's why the X'X matrix is key; if it is ill-conditioned, one has trouble -- which is why I showed the resulting condition number as a function of the magnitude of x to illustrate.
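As a sketch of the derivation above (with synthetic x and y, not your data), the normal-equation solution written out by hand matches what polyfit returns when the numerics are healthy:

```matlab
% Synthetic first-order example: y ~ 3*x + 2 plus noise
rng(1)
x = (1:10).';
y = 3*x + 2 + 0.1*randn(10,1);

X = [x ones(size(x))];        % design matrix [x 1] for a first-order fit
B = (X.'*X)\(X.'*y)           % normal-equation solution: [slope; intercept]
polyfit(x,y,1)                % same coefficients, highest power first
```

In practice one solves B = X\y directly rather than forming X'X, precisely because forming X'X squares the condition number; that design choice is why backslash survives data that the explicit normal equations do not.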
I don't see it written expressly in the MATLAB doc, but <here's a brief development>; note specifically they arrive at eq (10) at the top of page 3.
Just out of curiosity, why the magnitude of the variable in your example? What does it represent? If it is really of that magnitude, your choices for estimation are likely to be to either transform the units to a set that are much smaller in magnitude or use the standardization technique.
Manuela
on 2 Jan 2026
Many thanks!! It's clearer now!! So, if the coefficient B is computed through X'y/(X'X), is it like having a ratio Inf/Inf = NaN? The independent variable has a high magnitude because it represents the number of cells in a cubic meter of tissue, whereas the dependent variable is the energy consumed by those cells; hence the quantities have very different magnitudes. My supervisors asked me to report the GLM results using the original variable values, i.e. without z-scoring, so that the beta coefficients can be compared with the values reported in the plot (as we are going to plot one variable against the other without z-scoring). But at this point I think I have to show the GLM results with z-scoring; I don't have another choice...
dpb
on 2 Jan 2026
Not exactly, but a similar idea: if the X values in two columns are identical or exact linear combinations of each other, then the system is identically singular -- the values themselves are all finite, but the combination makes the system indeterminate. In this case the columns aren't actually perfectly correlated, but the large magnitude destroys the precision needed to distinguish many of the values, so the system looks more and more as if they really are linearly related as the magnitude grows.
As for reporting, you can compute the results using standardized variables but then transform the coefficients back to the original units...it's just algebra.
z = (x-u)/s --> z s = x - u --> x = z s + u
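A sketch of that back-transformation with synthetic data (variable names here are made up): with only x standardized, the fitted model is y = a + b*(x-u)/s, so the slope in original units is b/s and the intercept is a - b*u/s.

```matlab
% Synthetic data standing in for the real variables
rng(2)
x = 100 + 15*rand(50,1);
y = 2*x + 5 + randn(50,1);

u = mean(x);  s = std(x);
mdlz = fitlm((x-u)/s, y);               % fit on standardized x
bz = mdlz.Coefficients.Estimate;        % [intercept; slope] in z units

slope_orig     = bz(2)/s                % slope back in original units
intercept_orig = bz(1) - bz(2)*u/s      % intercept back in original units
```

Since standardization is just a linear rescaling of x, the t-statistic and p-value of the slope are unchanged; only the coefficient and its standard error are rescaled.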
And can I consider the t-stat and p-value of the coefficients derived from the standardized variables?
Yes, it's the same model, just using the base variable instead of the standardized.
You could count the cells in terms of 1E9 or 1E12 cells/cu-m instead and see the same t-values.
d=dir('median*.mat'); % load the files w/o having to type a zillion characters...
for i=1:numel(d)
load(d(i).name)
end
%es=medians_energy_scored_tr; ds=medians_dwi_scored_tr; % shorten names
e=medians_energy_vec.'; d=medians_micro_parameter_vec.';
fitlm(d,e)
Warning: Regression design matrix is rank deficient to within machine precision.
ans =
Linear regression model:
y ~ 1 + x1
Estimated Coefficients:
Estimate SE tStat pValue
__________ __________ ______ __________
(Intercept) 0 0 NaN NaN
x1 8.6062e-13 2.2087e-14 38.966 1.7747e-68
Number of observations: 118, Error degrees of freedom: 117
Root Mean Squared Error: 32.9
R-squared: -0.695, Adjusted R-Squared: -0.695
F-statistic vs. constant model: -Inf, p-value = NaN
fitlm(d/1E9,e)
ans =
Linear regression model:
y ~ 1 + x1
Estimated Coefficients:
Estimate SE tStat pValue
___________ __________ _______ __________
(Intercept) 135.63 14.93 9.0843 3.3091e-15
x1 -0.00011673 0.00010891 -1.0717 0.28607
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 25.2
R-squared: 0.0098, Adjusted R-Squared: 0.00127
F-statistic vs. constant model: 1.15, p-value = 0.286
fitlm(d/1E12,e)
ans =
Linear regression model:
y ~ 1 + x1
Estimated Coefficients:
Estimate SE tStat pValue
________ _______ _______ __________
(Intercept) 135.63 14.93 9.0843 3.3091e-15
x1 -0.11673 0.10891 -1.0717 0.28607
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 25.2
R-squared: 0.0098, Adjusted R-Squared: 0.00127
F-statistic vs. constant model: 1.15, p-value = 0.286
fitlm(zscore(d),e)
ans =
Linear regression model:
y ~ 1 + x1
Estimated Coefficients:
Estimate SE tStat pValue
________ ______ _______ __________
(Intercept) 119.82 2.3243 51.553 8.3006e-82
x1 -2.5016 2.3342 -1.0717 0.28607
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 25.2
R-squared: 0.0098, Adjusted R-Squared: 0.00127
F-statistic vs. constant model: 1.15, p-value = 0.286
Do you have a real dataset you could attach? One presumes from the plots that these are just random values over the expected ranges?
Manuela
on 2 Jan 2026
Ah ok, thanks! I am going to check... But why should one presume that these are just random values over the expected ranges? The one I attached is the real dataset.
The essentially zero correlation between the two variables is why I made that presumption, as is the fact that the plots look basically like a random scattering of points in the 2D space.
What is the null hypothesis here? An adjusted R-squared of only 0.001 pretty much shows the two are virtually uncorrelated. The slope is significant only at about the 71% level; it is -0.12 +/- 0.11 -- the standard error of the estimate is virtually the same as the estimate itself.
Manuela
on 2 Jan 2026
Thanks for all this extra statistics! But the fact that two variables are not correlated doesn't mean they are random values. Are all variables in this world all correlated ?
No, but one normally uses regression to predict the response variable from the independent one; if the total R-squared shows the model can explain only about 1% of the total variance, it's certainly going to be of little use for that purpose.
It just doesn't seem like an appropriate analysis for the data unless the null hypothesis is that they aren't closely related, and the slope being virtually indistinguishable from zero statistically makes one wonder what the underlying objective is.
IOW, I'm wondering whether there might be a more appropriate statistical approach than regression here, depending upon what the null hypothesis really is. Only you/your lead researcher can answer that, of course, and it is getting far off the forum topic of MATLAB-specific functionality (not that we aren't quite a ways departed already <vbg>).
Manuela
on 2 Jan 2026
yes the objective is to show that they aren't related, so I don't want to quantitatively predict anything from this model
Manuela
on 2 Jan 2026
Don't worry :) thanks for the very exhaustive answer!
In that case I'd suggest corr is more appropriate. Use the two-output form to get the estimated p-value. This avoids the matrix inversion of the normal equations in the regression calculation entirely.
d=dir('median*.mat'); % load the files w/o having to type a zillion characters...
for i=1:numel(d)
load(d(i).name)
end
%es=medians_energy_scored_tr; ds=medians_dwi_scored_tr; % shorten names
e=medians_energy_vec.'; d=medians_micro_parameter_vec.';
[c,p]=corr([d,e],'Type','Kendall');
[c(2,1) p(2,1)] % corr returns the full correlation matrix, not just the 2-column comparison
ans = 1×2
-0.0204 0.7447
I used Kendall's tau instead of the default because the earlier histogram indicated the data are not normally distributed, owing to what appeared to be multiple peaks in the distribution. You'll undoubtedly want to do some distribution testing first to see what sort of assumptions can be justified.
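One possible sketch of such a check (lillietest is from the Statistics and Machine Learning Toolbox; the data here are synthetic stand-ins, not the attached variables):

```matlab
% Lilliefors test for normality on a stand-in variable
rng(3)
x = randn(118,1);            % replace with the real variable of interest
h = lillietest(x)            % h==0: cannot reject normality at the 5% level
histogram(x,20)              % visual check for multiple modes or skew
```

If normality is rejected, a rank-based measure such as Kendall's tau or Spearman's rho (via corr's 'Type' option) is the safer choice.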