fitlm returns pvalues equal to NaN without zscoring
I cannot understand why fitlm returns pValues equal to NaN when I use the variables without z-scoring, whereas this does not happen with the z-scored variables. Below is the code I used:
% z-score the variables (ignoring NaNs) and transpose to column vectors
medians_var1_scored = nanzscore(medians_var1);
medians_var2_scored = nanzscore(medians_var2);
medians_var1_scored_tr = medians_var1_scored';
medians_var2_scored_tr = medians_var2_scored';
tbl = table(medians_var1_scored_tr, medians_var2_scored_tr, 'VariableNames', ...
    {'var1','var2'});
% build your model
mdl = fitlm(tbl, 'var1 ~ var2', 'RobustOpts', 'on')
which gives this result:
mdl =
Linear regression model (robust fit):
var1 ~ 1 + var2
Estimated Coefficients:
Estimate SE tStat pValue
_________ ________ ________ _______
(Intercept) 0.028674 0.094595 0.30312 0.76234
var2 -0.072919 0.094998 -0.76758 0.44429
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 1.03
R-squared: 0.00584, Adjusted R-Squared: -0.00273
F-statistic vs. constant model: 0.681, p-value = 0.411
If instead I use the original variables, I do:
tbl=table(medians_var1',medians_var2','VariableNames', ...
{'var1','var2'});
%build your model
mdl=fitlm(tbl,'var1 ~ var2','RobustOpts','on')
and obtain:
mdl =
Linear regression model (robust fit):
var1 ~ 1 + var2
Estimated Coefficients:
Estimate SE tStat pValue
__________ __________ ______ __________
(Intercept) 0 0 NaN NaN
var2 8.6386e-13 2.1087e-14 40.966 7.8796e-71
Number of observations: 118, Error degrees of freedom: 117
Root Mean Squared Error: 31.4
R-squared: 0.263, Adjusted R-Squared: 0.263
F-statistic vs. constant model: Inf, p-value = NaN
I can't understand why this happens. You can find attached both the original var1 and var2 variables and the z-scored ones. If I plot both, I obtain the same plot (just rescaled), so there shouldn't be a problem in the nanzscore function, which is hand-written.
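For reference, nanzscore is meant to be nothing more than a NaN-aware z-score along the lines of the sketch below (a minimal sketch; the attached implementation may differ in detail):
function z = nanzscore(x)
% z-score that ignores NaNs: subtract the NaN-omitting mean, divide by the NaN-omitting std
z = (x - mean(x,'omitnan')) ./ std(x,0,'omitnan');
end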
Answers (1)
dpb
d=dir('median*.mat'); % load the files w/o having to type a zillion characters...
for i=1:numel(d)
load(d(i).name)
disp(d(i).name) % figure out which is var1, var2??? show which file
whos('-file',d(i).name) % contains which variable...
end
whos
es=medians_energy_scored_tr; ds=medians_dwi_scored_tr; % shorten names
e=medians_energy_vec.'; d=medians_micro_parameter_vec.';
nnz(isfinite(es)), nnz(isfinite(ds)) % is there a NaN lurking, maybe?
nnz(isfinite(e)), nnz(isfinite(d))
[min(es) max(es) min(ds) max(ds)] % see what the magnitude of variables are...
[min(d) max(d)], [min(e) max(e)]
Oh... there's likely the problem; d (medians_micro_parameter_vec) is huge in magnitude, so the unscaled calculation overflows when computing the sum terms in X'X. The scaling brings everything down to roughly a +/-3 range if the data are roughly normally distributed.
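To make the numerical point concrete, here is a quick sketch with synthetic values of a similar magnitude (the 1e13 scale is an assumption inferred from the fitted slope, not the attached data):
x  = 1e13*(1 + rand(118,1));      % synthetic predictor on the order of 1e13, like d
X  = [ones(size(x)) x];           % design matrix with an intercept column
cond(X'*X)                        % astronomically large -- numerically singular in double precision
Xs = [ones(size(x)) zscore(x)];   % same data with the predictor standardized
cond(Xs'*Xs)                      % small -- well conditioned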
Always examine data before blindly fitting...
plot(ds,es,'x')
figure
plot(d,e,'x')
These appear to have just been randomly generated values---why such a large magnitude was used is anybody's guess here...but there's the problem.
We'll just fit the same data, still large but on the order of 1E4 instead of 1E14...
fitlm(d/1E10,e,'RobustOpts','on')
The problem is simply that the magnitude of the independent variable is so large that the sums overflow during the computation; as the above shows, simply reducing the absolute magnitude of the x variable is sufficient; there isn't necessarily any "magic" in z-scoring.
First we checked to be sure there weren't any NaNs causing the issue that nanzscore() was silently taking care of for us...
If we then adjust by the mean and scale by a power of 10 instead of by std(), we get roughly the same result of about a zero intercept; note the slope is the same, just scaled by the additional factor of 1E4 we introduced, and the pValue is identical.
fitlm((d-mean(d))/1E14,e,'RobustOpts','on')
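A quick way to convince yourself that rescaling x only rescales the slope and leaves the inference untouched (a sketch with synthetic data, not the attached variables):
rng(0)                            % synthetic example for illustration only
x = 1e4*randn(100,1);
y = 3*x + 1e3*randn(100,1);
m1 = fitlm(x,    y);
m2 = fitlm(x/10, y);
[m1.Coefficients.Estimate(2) m2.Coefficients.Estimate(2)]   % slope vs. 10x the slope
[m1.Coefficients.pValue(2)   m2.Coefficients.pValue(2)]     % identical pValues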
And just for good measure, we'll use standardized variables...
fitlm(zscore(d),zscore(e),'RobustOpts','on')
Keeping intermediate calculations within the range of floating-point precision is one of the prime uses of z-scaling in regression, besides putting variables of differing magnitudes on the same scale for comparing sensitivities.
1 Comment
dpb
d=dir('median*.mat'); % load the files w/o having to type a zillion characters...
for i=1:numel(d)
load(d(i).name)
end
es=medians_energy_scored_tr; ds=medians_dwi_scored_tr; % shorten names
e=medians_energy_vec.'; d=medians_micro_parameter_vec.';
b=polyfit(d,e,1) % compare polyfit() w/o standardization
polyfit() discovers the problem, too.
[b,~,mu]=polyfit(d,e,1) % and with polyfit's internal centering and scaling (three-output syntax)
fitlm(zscore(d),e)
fitlm and polyfit agree with each other with OLS and with standardization of the independent variable, as one would expect.
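Note that with the three-output syntax the polyfit coefficients refer to the centered and scaled variable (x - mu(1))/mu(2); to map them back to the original x scale (a sketch using the b and mu returned above):
slope_orig     = b(1)/mu(2);                 % slope in terms of the original x
intercept_orig = b(2) - b(1)*mu(1)/mu(2);    % intercept in terms of the original x
[slope_orig intercept_orig]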
histogram(d,20)
While the x values are all distinct, we can observe that they are clustered in a couple of regions. It might be better numerically (although a different problem, of course) if they were distributed uniformly... let's see what might happen then...
x=linspace(min(d),max(d),numel(d));
polyfit(x,e,1)
fitlm(x,e)
Well, that's not enough on its own; let's go back to just making the magnitude smaller that we did before...
polyfit(x/1E10,e,1)
and voila! -- at least the coefficients can again be estimated once the values don't completely swamp precision.