can I pass these nonlinear constraints to lsqnonlin?

Let f(x,y) denote a function of two variables and E = (E_1, ..., E_n) the parameters of the optimization problem
which I want to solve with lsqnonlin. To calculate g^sim, f must be convex at all iterations. Otherwise, it is purely a matter of luck that the identification works.
My idea is to enforce the convexity by requiring the Hessian of f to be positive definite. The determinant of the Hessian is given by (please correct me if I am wrong)
det(H) = f_xx * f_yy - (f_xy)^2
Given that, I would implement the constraints
c = -(f_xx * f_yy - (f_xy)^2) % c <= 0 enforces det(H) >= 0
in a function
function [c,ceq] = mycon(x)
ceq = [];
c = ... % the above formula
end
I evaluate the above equation at n points in the {x,y} space, i.e., c is a vector with n entries.
Can I implement those constraints as nonlinear constraints or do I make things too complicated?
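For what it's worth, a sketch of how such a constraint vector could be assembled on a grid of sample points (Python/NumPy used for illustration; `hessian_constraints` and the sample values are hypothetical, and it assumes the second derivatives f_xx, f_yy, f_xy of the parametrized f can be evaluated):

```python
import numpy as np

def hessian_constraints(fxx, fyy, fxy):
    """Build an inequality-constraint vector c (feasible when c <= 0) that
    enforces positive definiteness of the 2x2 Hessian at each sample point.
    By Sylvester's criterion this needs BOTH f_xx > 0 and det(H) > 0."""
    det = fxx * fyy - fxy**2          # determinant of the 2x2 Hessian
    # fmincon-style convention: constraints are satisfied when c <= 0
    return np.concatenate([-det, -fxx])

# Illustrative check on f(x,y) = x^2 + y^2 (convex everywhere):
n = 5
fxx = np.full(n, 2.0); fyy = np.full(n, 2.0); fxy = np.zeros(n)
c = hessian_constraints(fxx, fyy, fxy)
print(np.all(c <= 0))
```

The `-fxx` block is included because det(H) > 0 alone does not certify positive definiteness of a 2x2 matrix.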

1 Comment

To calculate sim , f must be convex at all iterations.
Convex as a function of (x,y), I assume you mean. It is already a convex function of E.


 Accepted Answer

Matt J
Matt J on 15 Jun 2023
Edited: Matt J on 15 Jun 2023
Can I implement those constraints as nonlinear constraints
You can, but you have a few problems:
(1) Nonlinear constraints cannot be enforced at all iterations. In fact, only simple bound constraints can be.
(2) det(A)>0 is not a sufficient condition for the positive definiteness of A. By Sylvester's criterion, you need to ensure f_xx > 0 as well, although that will be a simpler, linear constraint in E.
(3) The constraints need to be satisfied for all (x,y) in whatever domain f(x,y) is defined on. In theory, that gives you an uncountably infinite number of constraints. To be practical, you could relax the constraint, imposing it on some discrete grid of points (x_j, y_j), but in theory, you could not be sure that f(x,y) is convex everywhere between these points.
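Point (2) is easy to see numerically (Python used for illustration): a diagonal matrix with two negative entries has positive determinant but is negative definite:

```python
import numpy as np

# A 2x2 matrix with positive determinant that is NOT positive definite:
H = np.array([[-1.0, 0.0],
              [0.0, -2.0]])
print(np.linalg.det(H))          # positive (2.0), yet...
print(np.linalg.eigvalsh(H))     # ...both eigenvalues are negative
# Sylvester's criterion catches this: the leading 1x1 minor H[0,0] is negative.
```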
or do I make things too complicated?
Maybe. You would need to tell us more about the properties of the N_i(x,y) functions and what g^sim() is doing. If the N_i(x,y) functions are all convex individually, then it would be sufficient (although not necessary) to impose the much simpler constraints E_i>=0.

44 Comments

Thank you!
The linear constraint f_xx > 0 can be easily implemented in A*x<b.
" Maybe. You would need to tell us more about the properties of the N_i(x,y) functions and what g^sim() is doing. If the N_i(x,y) functions are all convex, then it would be sufficient (although not necessary) to impose the much simpler constraints E_i>=0. "
Think about the N_i(x,y) as interpolation functions (piecewise polynomials). Attached is a picture where n=2 and f is a function just of x: as you can see, the N_i look like hat functions, they are 1 at one support point and zero at all others. The support points x_i are the points where the parameters E_i are defined, i.e. f(x_i) = E_i.
In my case, they are a little bit smoother, so that I can take second derivatives of them, but the shape is like in the picture, which means that they are not convex. E_i >= 0 are lower bounds of my optimization, but this does not really help given the shape of the N_i, right?
No, it doesn't help. It sounds like it might be the wrong way to parametrize f(x,y) if f is supposed to be convex.
This way of (local) interpolation, where I have one parameter for each support point, gives me control over where I want to have more parameters,...
"It sounds like it might be the wrong way to parametrize f(x,y) if f is supposed to be convex."
Finding this function is a main challenge in the realm of my physics. So the idea is then to approximate it like above and find the nodal values (parameters) based on experiments.
Why do you think it might be the wrong way if f is supposed to be convex?
Why do you think it might be the wrong way if f is supposed to be convex?
Because a linear combination of non-convex basis functions will rarely give a convex function...
If there is a reason, based in physics, that f needs to be convex, the proper way to model f would probably come from that.
I have to think about that. Doing this local approximation of f is the novelty of my work, so I should research this in more depth.
As for your point (1):
I am aware that only bound constraints can be enforced at all iterations. We already talked about this issue in https://de.mathworks.com/matlabcentral/answers/1921445-fmincon-any-way-to-enforce-linear-inequality-constraints-at-intermediate-iterations?s_tid=prof_contriblnk
There, you suggested to (i) have the objective return NaN in case of evaluation failures, or to (ii) project the current point onto the set of feasible points that satisfy A*x<=b using lsqlin.
Returning NaN leads to exitflag=2 most of the time (although I have StepTolerance=1e-12), and projecting using lsqlin makes most of my constraints active, which is also not the best way to go. It would be better to have them inactive.
What really helped me is to multiply my constraints by a large number (A*x<=b becomes (1e4*A)*x<=b). Just passing A and b to fmincon or lsqnonlin, the constraints were basically violated at all iterations. At the linked question above, you wrote
"it violates the theoretical assumptions of fmincon, and probably all the Optimization Toolbox solvers as well, when the domain of your objective function and constraints is not an open subset of R^n. If you forbid evaluation outside the closed set defined by your inequality constraints, that is essentially the situation you are creating."
Can you comment a little bit more on this? In the end, is it not the same situation that I also create when (i) returning NaN or (ii) projecting the current point onto the feasible set?
I have to think about that. Doing this local approximation of f is the novelty of my work, so I should research this in more depth.
Might you be able to choose basis functions that are convex? Then you would still have some semi-local control over f through the E coefficients. Beyond that, it doesn't really make sense to be controlling f locally when f must satisfy a global property like convexity. A function generally cannot be perturbed locally without destroying convexity.
Can you comment a little bit more on this?
The theory behind fmincon's algorithms was derived under the assumption that the objective and constraint functions are continuously differentiable over the domain D where you might want the algorithm to search. In order to be differentiable over D, it is necessary that these functions be defined on an open superset of D.
A 1D function defined only for x >= 0 is an example that violates this, assuming the search domain is D = {x | x >= 0}.
Since D is closed and f is not defined throughout any open superset of D, it is not possible to evaluate the derivative of f at x=0. Note that even though f has a right-hand derivative at x=0, it lacks a left-hand derivative, and therefore is not differentiable there.
When f is undefined outside a closed domain D, there are also practical difficulties for finite differencing. If the solution xopt is on the boundary of D, and if you are approximating the derivatives using central finite differences, you have to take the finite differences with smaller and smaller step sizes as xopt is approached. Otherwise, the finite differencing will eventually, at some iteration, have to reach for x that are outside of D where f is undefined. On the other hand, if you let the step size shrink arbitrarily, it will eventually fall below floating point precision and the derivative estimation will fail.
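To illustrate the boundary finite-differencing issue concretely (Python for illustration; sqrt is just a stand-in for a function undefined outside a closed domain):

```python
import math

def f(x):
    # f is undefined (NaN) outside the closed domain D = {x | x >= 0}
    return math.sqrt(x) if x >= 0 else float("nan")

h = 1e-6
central = (f(h) - f(-h)) / (2 * h)    # reaches outside D -> NaN
forward = (f(h) - f(0.0)) / h         # stays inside D, but one-sided
print(central, forward)
```

The central difference at the boundary point x = 0 is forced to evaluate f outside D and fails; only a one-sided difference survives.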
In the end, is it not the same situation that I also create when (i) returning NaN
Yes, but the common situations where people return NaN from the objective and constraints are problems where the domain D is naturally open. For example f(x)=x-log(x) has an open search domain D={x|x>0}, and therefore an open superset of D is D itself. Therefore, setting f(x)=NaN for x<=0 doesn't give the above difficulties.
or (ii) projecting the current point onto the feasible set?
No. Let D be a non-empty search domain whose points x satisfy:
(i) f(x) is defined, and
(ii) x is feasible.
Also, let P(x) be a projection from any x in R^n onto D. Then f(P(x)) will be defined throughout R^n due to (i).
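A toy illustration of case (ii) (Python; delta and the example f are hypothetical choices):

```python
import math

delta = 1e-6                      # small margin keeping P(x) strictly inside x > 0

def f(x):
    return x - math.log(x)        # only defined for x > 0

def P(x):
    # projection of a scalar onto D = {x | x >= delta}
    return max(x, delta)

# f(P(x)) is now defined everywhere, even for infeasible trial points:
print(f(P(-5.0)))                 # finite, evaluated at the projected point
print(f(P(2.0)))                  # interior points are left untouched
```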
"The theory behind fmincon's algorithms was derived under the assumption that the objective and constraint functions are continuously differentiable over the domain D where you might want the algorithm to search. In order to be differentiable over D, it is necessary that these functions be defined on an open superset of D."
I think I got the gist of your examples. So you would not multiply the A matrix by a large number to have fmincon pay attention to the linear constraints? In my case, I return the Jacobian in the objective function and there are no finite-difference calculations involved. I cannot really grasp what practical implications (1e4*A) has and when this might lead to problems.
"Might you be able to choose basis functions that are convex? Then you would still have some semi-local control over f through the E coefficients. Beyond that, it doesn't really make sense to be controlling f locally when f must satisfy a global property like convexity. A function generally cannot be perturbed locally withour destroying convexity."
I agree with you that controlling f(x,y) locally with finite element basis functions is not the best way to go. In 1d, f=f(x), it worked well with local basis functions, but in 2d, f=f(x,y), I have the issue that the {x,y} space is not "sampled" uniformly when calculating g^sim. Consider a grid with n uniformly spaced support points: f is mainly evaluated along the grid diagonal since there is a correlation between x and y. Then, of course, I cannot calibrate the parameters E_i far off the diagonal, which is the main issue I currently face.
Maybe semi-local control of f is a workaround. Can you explain what "semi-local control over f through the E coefficients" means? And to answer your question, my basis functions do not necessarily have to be convex. What are convex basis functions that you think might be appropriate?
Maybe semi-local control of f is a workaround. Can you explain what "semi-local control over f through the E coefficients" means?
Imagine your basis functions were Gaussian lobes. Changing a lobe's coefficient would change f(x,y) most strongly near the center of the lobe, but would still affect f(x,y) further away somewhat.
What are convex basis functions that you think might be appropriate?
No way for me to know. The physics of the problem and what it says about how f() should look have not been described to us.
SA-W
SA-W on 16 Jun 2023
Edited: SA-W on 16 Jun 2023
"Imagine your basis functions were Gaussian lobes."
Gaussian lobes are Gaussian basis functions or did you refer to something else?
"No way for me to know. The physics of the problem and what it says about how f() should look have not been described to us."
Let me try to give you a few more details, maybe you can give a suggestion then that allow me to explore new things...
The pde that I am solving is div(P) = 0, which is the balance of linear momentum (Newton's law). It is a purely mechanical boundary-value problem and I solve it by means of the finite-element method. P is a second-order tensor (stress tensor, 3x3 matrix) which can be derived from the strain energy density f by differentiation with respect to C, where C is an SPD second-order tensor (strain tensor, 3x3 matrix). f is a scalar-valued function (strain energy density) characterizing the constitutive behavior of the material under consideration. If we assume the material to be isotropic (same response regardless of the loading direction), then we can express f in terms of (some of) the principal invariants of C, i.e., f(C) = f(I_1(C), I_2(C)). Renaming the invariants to x and y, we end up with f(x,y) as I defined it in my question. From the physics we know that f must vanish for zero deformation (C = I), where it should also attain its global minimum. f should have no other stationary points. Also, we know that x > 0 and y > 0 based on the properties of C, and f >= 0 since energy cannot be negative. Also, f should go to infinity if either x or y approaches infinity, and f is ideally convex. However, reasonable upper bounds on x and y exist for most physical applications. Those are the main properties we know about f.
What we usually do in material modeling is to prescribe the global shape of f up front
and fit a quite small number of material parameters to some measurements. However, this approximation a priori assumes a particular shape of f, and it turns out that this global approximation is not really satisfactory, which is why people try to find better approximations of f to make it a good model for nearly all loading cases of the above boundary-value problem. My idea is/was to discretize f (locally) over the space of invariants (x and y), distribute some support points in that space, and find the values of f at those support points (the E_i) with optimization. This works excellently in a one-dimensional invariant space but, as I told you, not so well in a two-dimensional invariant space, because the principal invariants of an SPD tensor show some correlation. In other words, there are some support points (x_a, y_a) that I do not really "touch" when I solve my boundary-value problem, in particular the ones at the boundary. I highly doubt it works to identify the parameters at those support points with a purely local approximation.
As you said, maybe doing a semi-local approximation helps in a two-dimensional invariant space. What I can think of is to keep identifying some of the E_i (the ones at the "most activated" support points) with optimization, and computing the E_i at the boundary based on the optimized E_i and the convexity property.
Can this be achieved by a special class of (convex) basis functions? I am open to any input in any direction!
The pde that I am solving is div(P)=0.
That seems to be the same as Laplacian(f)=0.
Also, $f$ should go to infinity if either x or y approach infinity and is ideally convex.
You've restated here that f should be convex, but still haven't explained why. You also haven't explained why this convexity needs to be enforced at all iterations.

" That seems to be the same as Laplacian(f)=0."

But div(P)=0 involves the divergence of a second-order tensor, not the Laplacian of the scalar f.

" You've restated here that f should be convex, but still haven't explained why. You also haven't explained why this convexity needs to be enforced at all iterations."

f(C) should be polyconvex, which is often weakened by requiring f(x,y) to be convex, where x and y are the first two principal invariants of C. I solve the pde with finite elements. Since it is nonlinear in the displacement field U, a Newton scheme is applied where the increments are obtained by solving the linear system K*deltaU = -R

If f(x,y) is convex, it can be guaranteed that the pde has a unique solution and that K is PSD. We cannot guarantee that if f(x,y) is not convex. If the convexity is not enforced at all iterations, I often see that K is rank-deficient, i.e. I cannot solve the pde.

f(C) should be polyconvex, which is often weakened by requiring f(x,y) to be convex,
Strengthened, I think you mean.
If f(x,y) is convex, it can be guaranteed that the pde has a unique solution and that K is PSD. We cannot guarantee that if f(x,y) is not convex. If the convexity is not enforced at all iterations, I often see that K is rank-deficient, i.e. I cannot solve the pde.
Ah well. We've already discussed what I think about solving PDEs inside objective functions...
One question. Is the converse true? If you have an f that solves the PDE, will it necessarily be convex or polyconvex?
SA-W
SA-W on 16 Jun 2023
Edited: SA-W on 16 Jun 2023

Strengthened, I think you mean.

Yes.

Ah well. We've already discussed what I think about solving PDEs inside objective functions... One question. Is the converse true? If you have an f that solves the PDEs will necessarily be convex or polyconvex?

It is on my list for the future to implement the pde as a nonlinear constraint.

No. If f solves the pde, f is not necessarily convex/polyconvex. Does this help?

Also, was the information that I provided helpful for suggesting a different set of basis functions / a different approximation of f? You mentioned that there are also convex basis functions and semi-local approximation strategies.

You often have great ideas, I really hope also here once again :)

I could suggest things like shifted exponentials, e.g.
N_ij(x,y) = E_ij*exp(-b_ij*(x-i) - c_ij*(y-j)) % convex for E_ij>=0
% E_ij, b_ij, c_ij are unknown
but really I have no idea whether it would be any better than the traditional models for f(x,y) that you've already ruled out.
Is
f(x,y) = E_ij*exp(-b_ij*(x-i) - c_ij(y-j))
, i.e., is there only one N_ij and in total three parameters, E_ij, b_ij, c_ij, or would I have more of those N_ij's?
No, I was implying that you would have f(x,y) = sum_ij N_ij(x,y).
Possibly better:
N_ij(x,y) = 1/( a_ij*(x-i) + b_ij*(y-j) + c_ij )^2
If I have a 3x3 grid with nine support points, i and j run from one to three?
If so, there are 9*3=27 parameters. The E_ij represent still the nodal values at the support point z_ij = (x_i, y_j). But what is the meaning of the b_ij and c_ij? Would you determine them along with the E_ij from the optimization or more heuristically based on the "importance" of the support point z_ij?
If I have a 3x3 grid with nine support points, i and j run from one to three?
Yes.
But I don't like the shifted exponential model anymore. A shifted exponential is equivalent to a non-shifted exponential, E*exp(t-t0)=(E*exp(-t0))*exp(t), so I'm not sure it will be a very flexible basis.
That's why I suggested,
N_ij(x,y) = 1/( a_ij*(x-i) + b_ij*(y-j) + c_ij )^2
in my last comment. In this case, the nodal values will be E_ij=1/c_ij^2. The a_ij and b_ij are just additional optimization parameters to give the model some additional flexibility.
N_ij(x,y) = 1/( a_ij*(x-i) + b_ij*(y-j) + c_ij )^2
To have f(x,y) convex, are the bounds E_ij>=0 sufficient? Do I have to impose constraints on a_ij and b_ij as well?
You need to ensure that the denominator is > 0 for all x,y in the domain of f. However, because of the monotonicity of this choice for N_ij(x,y), it would be sufficient just to impose this on the boundary of the box [xmin, xmax] x [ymin, ymax].
Another possible choice for the basis is,
N_ij(x,y) = E_ij*log( 1 + exp(a_ij*(x-i)) + exp(b_ij*(y-j)) )/log(3)
This would be inherently positive and convex provided only that all E_ij>=0. The nodal values would be E_ij again.
You could embellish it still further
N_ij(x,y) = E_ij*log( 1 + A_ij*exp(a_ij*(x-i)) + B_ij*exp(b_ij*(y-j)) )/log(3)
but now you would need constraints A_ij>=0, B_ij>=0 as well...
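One cheap way to sanity-check the claimed convexity numerically (Python sketch; all parameter values below are arbitrary) is the midpoint inequality N((p+q)/2) <= (N(p)+N(q))/2, which every convex function satisfies:

```python
import math, random

def N(x, y, E=2.0, a=1.3, b=0.7, xi=1.0, yj=2.0):
    # softplus-style basis from the thread; should be convex for E >= 0
    return E * math.log(1 + math.exp(a*(x - xi)) + math.exp(b*(y - yj))) / math.log(3)

random.seed(0)
ok = True
for _ in range(1000):
    p = (random.uniform(-5, 5), random.uniform(-5, 5))
    q = (random.uniform(-5, 5), random.uniform(-5, 5))
    mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    # midpoint convexity, with a little floating-point slack
    ok &= N(*mid) <= (N(*p) + N(*q)) / 2 + 1e-12
print(ok)
```

This checks convexity only at sampled point pairs, of course; the analytical reason it holds is that log(1 + e^u + e^v) is a log-sum-exp, which is convex, composed with affine maps of (x,y).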
Interesting.
Are the basis you provided so far interpolating, i.e. is the sum of all N_ij(x,y) = 1 for all x,y in the domain of f?
Also I was wondering about the meaning of the terms (x-i) and (y-j) in the N_ij's. Say I evaluate N_ij at the support point (x_1,y_1),
N_11(x_1,y_1) = E_11*log( 1 + exp(a_11*(x_1-1)) + exp(b_11*(y_1-1)) )/log(3),
I would expect N_11 to be roughly E_11, but the terms (x_1-1) and (y_1-1) can be arbitrary.
Also I was wondering about the meaning of the terms (x-i) and (y-j) in the N_ij's. Say I evaluate N_ij at the support point (x_1,y_1).
For simplicity, I was assuming the nodal points were the integers. More generally, you would have,
N_ij(x,y) = E_ij*log( 1 + A_ij*exp(a_ij*(x-xi)) + B_ij*exp(b_ij*(y-yj)) )/log(3)
Are the basis you provided so far interpolating, i.e. is the sum of all N_ij(x,y) = 1 for all x,y in the domain of f?
No, and more generally you wouldn't have f(xi,yj) = E_ij. You would have to solve a potentially non-sparse linear system to find coefficients E_ij giving a certain set of f(xi,yj) values. That means that the number of basis functions you could have might have to be somewhat modest.

You would have to solve a potentially non-sparse linear system to find coefficients E_ij giving a certain set of f(xi,yj) values. That means that the number of basis functions you could have might have to be somewhat modest.

I thought the approximation f = sum N_ij implies that we have as many basis functions as there are nodal points. So what do you mean by the "number of basis functions" having to be modest?

You don't necessarily need to situate a basis function at every nodal point. You can situate them at every 10th or 15th point, for example. The greater the density of basis functions, the more flexible the range of f that you can span, but the computational expense will also increase.
SA-W
SA-W on 17 Jun 2023
Edited: SA-W on 17 Jun 2023
You don't necessarily need to situate a basis function at every nodal point.
Above is an example where the N_i are the local finite element basis functions that I use so far. The blue dots are the sampled points when I solve the pde. You can clearly see the correlation between x and y, which makes it nearly impossible to identify all E_i. Replacing my N_i with your proposed
N_ij(x,y) = E_ij*log( 1 + exp(a_ij*(x-i)) + exp(b_ij*(y-j)) )/log(3)
will also not remedy this issue I think.
But what I would now do is to situate three basis functions at the support points building the diagonal, i.e., f = sum_{i=1}^{3} N_i(x,y),
with
N_i(x,y) = E_i*log( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )/log(3)
In words, I would only situate a basis function at the most frequently sampled points (here, the three points along the diagonal). Do you think this makes sense? I think I would no longer have the problem that most of the E_i are not sampled when solving the pde.
It makes intuitive sense, but if there is a way you can construct a realistic case where f(x,y) is known, it would be wise to test whether these basis functions can produce a good surface fit to f(). I have nothing to support the notion that the basis,
N_i(x,y) = E_i*log( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )/log(3)
is a good model for f(), only that it has the convexity and positivity properties that are desired.
"It makes intuitive sense, but if there is a way you can construct a realistic case where f(x,y) is known, it would be wise to test whether these basis functions can produce a good surface fit to f()."
That is actually what I am doing right now: constructing an approximation f = \sum N_i E_i where the E_i are constructed from a known f*(x,y). Then I do a pure re-identification...
If I do now instead
f = \sum N_i
where
N_i = E_i*log( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )/log(3)
would you determine the E_i as f*(x_i,y_i) = E_i, also in the case that there is not an N_i situated at every support point?
f*(x_i,y_i) = E_i
This will never be true if you have more than one basis function.
Yes, my question was sloppily formulated:
Given f*(x_i,y_i) = E*_i from a "real" function f*(x,y),
how would you particularize E_i, a_i, and b_i to determine N_i (hence, f = sum N_i) to approximate f*?
E_i = E*_i, a_i = b_i = 1.0 ?
If this is a good approximation or not is, as you said, a different question.
Well, that's the optimization problem. The E_i, a_i, and b_i are the unknowns in the surface fit.
Yes, but what I currently do, and probably will for several more months, is re-identification; that is, I know the reference (exact) values from a known surface f*(x,y). Let's call the reference values E*_i, a*_i, b*_i.
With the interpolating, local finite element basis functions (f = sum N_i E_i), it is obvious to set E*_i = f*(x_i,y_i). But what would be your logic to determine the reference values in case of
N_i = E_i*log( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )/log(3) ?
As I said, you have to obtain them from a surface fit.
Yes, but what I currently do, and probably will for several more months, is re-identification; that is, I know the reference (exact) values from a known surface f*(x,y). Let's call the reference values E*_i, a*_i, b*_i.
I don't see how that should change my answer. You have a least squares cost function where E_i, a_i, and b_i are the unknowns and f* is known input data.
When you minimize that cost function, the solution will be E*_i, a*_i, b*_i.
I see -- solving a separate optimization problem is the trick to get the (best) reference values E*_i, a*_i, b*_i...
Why is your summation index i running over the number of parameters? I mean I could sample f* at 1000 points although I just have 9 parameters, for instance.
SA-W
SA-W on 19 Jun 2023
Edited: SA-W on 19 Jun 2023

Let me finally clarify a few points regarding

N_i(x,y) = E_i*log( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )/log(3)

1. The purpose of the factor 1/log(3) is to make sure that N_i(x_i,y_i)=E_i? Is that necessary at all, given that the basis is not interpolating?

2. What is the purpose of the log(...), i.e., why not simply

N_i(x,y) = E_i*( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )
The purpose of the factor 1/log(3) is to make sure that N_i(x_i,y_i)=E_i? Is that necessary at all, given that the basis is not interpolating?
No, it's not necessary.
2. What is the purpose of the log(...), i.e., why not simply
(1) That would make the basis function separable in x and y. Not sure that would be a good basis unless f(x,y) is also separable in x and y.
(2) Exponentials overflow to infinity and underflow to zero really fast. Incidentally, regarding
N_i(x,y) = E_i*log( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )
you will want to use log1p for small values of the operands, a_i*(x-x_i)<=33 and b_i*(y-y_i)<=33, and you will need to use an appropriate Taylor approximation when either of the exponential terms exp(a_i*(x-x_i)), exp(b_i*(y-y_i)) overflows.
SA-W
SA-W on 19 Jun 2023
Edited: SA-W on 19 Jun 2023

(1)That would make the basis function separable in x,y. Not sure that would be a good basis unless f(x,y) is also separable in x and y

Most of the f(x,y) that I am working on are indeed of the form f(x,y)=h(x)+g(y). Given that, you think

N_i(x,y) = E_i*log( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )

is appropriate?

you will want to use log1p for small values of the operands, a_i*(x-x_i)<=33 and b_i*(y-y_i)<=33, and you will need to use an appropriate Taylor approximation when either of the exponential terms exp(a_i*(x-x_i)), exp(b_i*(y-y_i)) overflows.

I think x_i, y_i, E_i <= 10 are reasonable upper bounds, so overflow should not be a matter of concern. Why in particular 33 (a_i*(x-x_i)<=33)? The examples in your link use log1p(x) for quite small values of x.

Most of the f(x,y) that I am working on are indeed of the form f(x,y)=h(x)+g(y).
In that case, the basis decomposition I would probably use is,
N_i(x) = C_i*log( 1 + exp(a_i*(x-x_i)) ) = C_i*softplus(a_i*(x-x_i))
M_j(y) = D_j*log( 1 + exp(b_j*(y-y_j)) ) = D_j*softplus(b_j*(y-y_j))
f(x,y) = sum_i N_i(x) + sum_j M_j(y)
Also, I happen to have numerically stable code for the softplus(z)=log(1+exp(z)) function that I can give you (below).
I think x_i,y_i,E_i<=10 are reasonable upper bounds, so overflow should not be a matter of concern.
I don't see why that removes the concern unless you can also bound the a_i and b_i values.
Why in particular 33 (a_i*(x-x_i)<=33)?
I believe z>=33 is the threshold past which f(z)=log(1+exp(z)) and f(z)=z can't be distinguished in double float precision.
function y=softplus(x)
%Accurate implementation of log(1+exp(x))
idx=x<=33;
y=x;
y(idx)=log1p( exp(x(idx)) );
end
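For reference, the same guard translated to Python/NumPy, with a quick demonstration that the naive formula overflows where the guarded one does not:

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x)): past ~33 the two expressions are
    indistinguishable in double precision, so just return x there."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    idx = x <= 33
    y[idx] = np.log1p(np.exp(x[idx]))
    return y

with np.errstate(over="ignore"):
    naive = np.log1p(np.exp(np.array([800.0])))   # exp overflows -> inf
print(naive, softplus([800.0, 0.0]))
```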
Thanks for the softplus function.
In that case, the basis decomposition I would probably use is,
I just plotted N_1(x) = 1*log( 1 + exp(a_i*(x-x_1))) = log( 1 + exp(a_i*(x-4)))
for a_i=0.1, a_i=1, and a_i=10. In either case, N_1 is quite linear, in particular beyond x_1 = 4.0. So the basis seems to be quite linear over the domain of f(x,y). Do you think this is as flexible a basis as you previously said?
I also mentioned the possibility of adding an additional parameter on the exponentials.
N=@(x) 10*log( 1 + A*exp(a*(x-5)));
With the right selection of A, it doesn't seem too bad
A=0.1;
avalues=linspace(-0.5,0.5,9);
for a=avalues
N=@(x) 10*log( 1 + A*exp(a*(x-5)));
hold on
fplot(N,[0,10])
hold off
end; legend("a="+avalues,'Location','north')
I also mentioned the possibility of adding an additional parameter on the exponentials.
Yes, but the additional parameters require additional constraints and, more importantly, increase the number of parameters. So it is probably a trade-off: if I do not situate a basis function at every support point, I have to introduce the additional parameters on the exponentials to make the basis more flexible. If I have a higher density of basis functions, I can probably set them to one. Does that make intuitive sense?


More Answers (1)

Most of the f(x,y) that I am working on are indeed of the form f(x,y)=h(x)+g(y).
If this is true, then ensuring the convexity of f(x,y) is the same as ensuring the convexity of h(x) and g(y) as 1D functions, which is much simpler. If you use 1D linear interpolating basis functions,
h=@(x) interp1([E(1),E(2),E(3),...,E(n)] ,x)
then you can ensure the convexity of h just by ensuring that the second-order finite differences are nonnegative, which is a simple linear constraint on the E(i),
E(i+1) - 2*E(i) + E(i-1) >= 0
No need for nonlinear constraints at all. Moreover, if you make the change of variables
E=cumsum([C2,cumsum([C1,D])]);
where D(j), C1, and C2 are the new set of unknowns, then the constraints on D required for convexity are simple positivity bounds D(j)>=0. As you'll recall, bound constraints can indeed be enforced at all iterations, which you said is what you wanted.
D=rand(1,20);
C1=-5;
C2=0;
E=cumsum([C2,cumsum([C1,D])]);
fplot( @(x) interp1( E,x ) ,[1, numel(E)] ); xlabel x; ylabel h(x)
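The same construction can be checked numerically in a few lines (NumPy used here for illustration): with D >= 0, the resulting E has nonnegative second differences (a convex sequence), and diff(E,2) inverts the transformation:

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.random(20)                 # D >= 0: the discrete "curvatures"
C1, C2 = -5.0, 0.0                 # the two constants of integration

# E = cumsum([C2, cumsum([C1, D])]) -- a discrete double integration of D
inner = np.cumsum(np.concatenate(([C1], D)))
E = np.cumsum(np.concatenate(([C2], inner)))

print(np.all(np.diff(E, 2) >= 0))          # E is a convex sequence
print(np.allclose(np.diff(E, 2), D))       # second differences recover D
```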

89 Comments

Even with cubic spline basis functions, which are not generally shape-preserving, I think you have a pretty good bet of ensuring convexity, which may be important for the smoothness of h and g:
D=rand(1,20);
C1=-5;
C2=0;
E=cumsum([C2,cumsum([C1,D])]);
fplot( @(x) interp1( E,x ,'spline') ,[1, numel(E)] ); xlabel x; ylabel h(x)
ensure the convexity of h just by ensuring that the second-order finite differences are nonnegative
This is what I am doing at the moment if f=f(x). But
E(i+1) - 2*E(i) + E(i-1) >= 0
assumes that the spacing between the support points is constant, right? Also, for the E(i) at the boundary, one can only impose constraints E(i-1)<=E(i).
E=cumsum(cumsum(D)+C);
Interesting. Why do you add a constant C here?
assumes that the spacing between the support points is constant, right?
Yes, there would be different weights on the E(i) if the spacing is not uniform, but it shouldn't invalidate the general approach.
Why do you add a constant C here?
When you double-integrate something, there are two constants of integration, so in theory we should have two constants,
E=cumsum([C2,cumsum([C1,D])]);
I was not sure that we needed C2, but now I think we do.
Also for the E(i) at the boundary, one can only impose constraints E(i-1)<=E(i).
If you make your E-grid a little bit larger than your f-grid, then that should not be a problem.
If you make your E-grid a little bit larger than your f-grid, then that should not be a problem.
How do you distinguish between the E-grid and the f-grid? If we have f=f(x) and there are n support points x_i, there are also n parameters E_i. If I have n+1 support points, there are n+1 parameters and I gained nothing.
When you double integrate something there are two constants of integration
This transformation is not clear to me yet. How does the inverse relation D(E) look like, i.e., how do I transform the "real" parameters into D?
If we have f=f(x) and there are n support points x_i, there are also n parameters E_i.
f(x) is a continuous function on some interval [fmin, fmax]. You have a parametric model where the s_i are known shifts which can be thought of as the "locations" of the basis functions. The function F(x) is also continuous and defined on some possibly larger domain [Fmin,Fmax]. You want these functions to agree at certain discrete sample points,
and to choose the unknown E_i to achieve this agreement as closely as possible.
You seem to be assuming that for every basis function location s_i there needs to be corresponding x_j=s_i, but there is no reason to have that assumption. There is no necessary relationship between the s_i and x_j at all other than that they should be chosen to ensure,
Additionally, there are certain mathematical simplifications and conveniences if you choose the s_i so that
how do I transform the "real" parameters into D?
The inverse of,
E=cumsum([C2,cumsum([C1,D])])
is
D=diff(E,2)
For example,
D=rand(1,8)
D = 1×8
0.1401 0.6240 0.2104 0.2853 0.7782 0.2184 0.6423 0.4381
C1=rand; C2=rand;
E=cumsum([C2,cumsum([C1,D])]);
D2=diff(E,2)
D2 = 1×8
0.1401 0.6240 0.2104 0.2853 0.7782 0.2184 0.6423 0.4381
SA-W
SA-W on 20 Jun 2023
Edited: SA-W on 20 Jun 2023
The inverse of,
E=cumsum([C2,cumsum([C1,D])])
is
D=diff(E,2)
Ah. But this relation between E and D only holds if E is constructed as E=cumsum([C2,cumsum([C1,D])]), i.e., there is a given D.
But in practice, D is not given but E (say, E is the start vector). So how would I implement this?
No, D=diff(E,2) is always valid.
To recover E from D, however, you would need to save the relevant C1 and C2. Example:
E=rand(1,5) %start with E
E = 1×5
0.1769 0.4293 0.3007 0.6656 0.9979
C1=E(2)-E(1); %transform to {C1,C2,D} space
C2=E(1);
D=diff(E,2);
Now, we recover E
Erec=cumsum([C2,cumsum([C1,D])]) %recovered E
Erec = 1×5
0.1769 0.4293 0.3007 0.6656 0.9979
Nice. Another strategy I can give a try. Before the transformation we had F(E) = sum_i N_i E_i. diff(E,2) has two fewer components than E, but with {C1,C2} there are again n components. But what does F(D) look like?
It looks convex as I showed in my graph above.
I meant what F(D) looks like in mathematical terms. Is F(D) = \sum_i N_i D_i ?
It would just be F = \sum_i N_i E_i again, with the E_i expressed through the new parameters via E=cumsum([C2,cumsum([C1,D])]).
Hm, sorry that I have to ask again. We start with the E parameters and transform them into {C1, C2, D} using D=diff(E,2). Then, we solve the optimization problem in the {C1, C2, D} space and finally transform back to the E parameter space using E=cumsum([C2,cumsum([C1,D])]).
Or does my pde solver still "see" the E parameters at every iteration, just expressed differently? Originally, we had E_i = E_i, now we have E_i=E_i(D,C1,C2).
now we have E_i=E_i(D,C1,C2).
Yes, that.
If my objective function originally returned gSim(E) and J(E), it now has to return (dJ/dE)*(dE/dZ) and (dgSim/dE)*(dE/dZ), where Z={D,C1,C2}.
In both cases dE/dZ is required with E=cumsum([C2,cumsum([C1,D])]). Correct? And is dE/dZ continously differentiable?
In both cases dE/dZ is required with E=cumsum([C2,cumsum([C1,D])]). Correct? And is dE/dZ continously differentiable?
Yes, and in fact because E=M*[D(:);C1;C2] for some constant matrix M, then dE/dZ will just be M.'
You can find M either analytically, or using func2mat in this FEX download,
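If you don't want the FEX download, M can also be built numerically by pushing the columns of an identity matrix through the transform, since the map Z=[D(:);C1;C2] -> E is linear. A sketch with an illustrative size n=5:

```matlab
n = 5;                                % number of E coefficients (illustrative)
% Transform Z = [D(:); C1; C2] -> E, matching E=cumsum([C2,cumsum([C1,D])])
tform = @(Z) cumsum([Z(n), cumsum([Z(n-1), Z(1:n-2).'])]).';
M = zeros(n);
I = eye(n);
for k = 1:n
    M(:,k) = tform(I(:,k));           % k-th column = image of k-th unit vector
end
% Because the transform is linear, E = M*Z exactly:
Z = [0.1; 0.4; 0.2; -1; 3];           % [D1;D2;D3;C1;C2]
E = M*Z;
err = norm(E - tform(Z));             % ~0
```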
Additionally, there are certain mathematical simplifications and conveniences if you choose the s_i so that
Let me briefly come back to this argument: what simplifications and conveniences did you mean that I may take advantage of?
Mainly, if your basis functions are simple linear interpolating kernels, the advantage would be that F(x_j)=E_j, i.e. you don't have to interpolate to obtain the values of F(x_j).
Mainly, if your basis functions are simple linear interpolating kernels, the advantage would be that F(x_j)=E_j, i.e. you don't have to interpolate to obtain the values of F(x_j).
Yes, but you said this "advantage" comes from choosing
Is it not true even if x_j = s_i, i.e., for each basis function location s_i we have F(x_j)=f(x_j)=E_j ?
Is it not true even if x_j = s_i,
Yes, that's a special case of .
Say, is defined on the domain and on the domain as indicated by the two dashed lines in the picture, are the linear interpolating kernels. Clearly, we have . This is the terminology you introduced some comments before.
You said, I can make the E-grid a little larger than the f-grid; probably, you meant what I sketched above. But how can that work with linear basis functions? Here, we have , but how would I determine and . We said is not defined there.
You said, I can make the E-grid a little larger than the f-grid
Yes, but I think it's moot at this point. I think we've determined that you can make the grids the same size, and the convexity conditions E(i+1)-2*E(i)+E(i-1)>=0 do not have to be satisfied at all i. It is sufficient for this to hold for i=2,...,n-1
But how can that work with linear basis functions? Here, we have , but how would I determine and . We said f(x) is not defined there.
One way would be to not use linear interpolating basis functions. We talked about basis functions like
E_i log(1+A_i*exp(a_i*(x-s_i))), i=1...n
which do not have a compact footprint. Instead, each basis function has a footprint over the entire domain of x. The value F(x0) at any x0 is therefore influenced by all n basis functions, so there is reason to think that all 3*n parameters could be estimated, assuming you have 3*n sample points f(x_j), wherever they happen to be located. Although the more spread out the x_j are, the better conditioned the estimation problem is likely to be.
Another approach would be to add a penalty function on the roughness of the coefficients E_i to your objective function, e.g.,
Penalty(x)= smallweight * sum_i abs(E(i)-E(i+1)) , i=1...n-1
Certainly if you have linear interpolating basis functions, we expect the E(i) to step smoothly from one i to the next, and the above penalty would enforce that, even without having any samples f(x_j) that provide information on a particular E_i.
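A sketch of what such a penalized residual could look like for lsqnonlin; dataResidual and smallweight are illustrative placeholders (lsqnonlin squares the entries, so diff(E) here plays the role of the abs-difference penalty above):

```matlab
% Placeholder for the data-fitting part, gsim(E) - gexp:
dataResidual = @(E) E(:) - [1; 1.2; 1.6; 2.2; 3.0];
smallweight = 1e-2;                   % penalty weight (to be tuned)

% Augmented residual: lsqnonlin minimizes the sum of squares of all
% entries, so the appended first differences discourage the E(i)
% from jumping between neighboring nodes.
resfun = @(E) [dataResidual(E); smallweight*diff(E(:))];

r = resfun(ones(5,1));                % 5 data terms + 4 penalty terms
% E = lsqnonlin(resfun, E0) would then fit with the roughness penalty.
```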

the convexity conditions E(i+1)-2*E(i)+E(i-1)>=0 do not have to be satisfied at all i. It is sufficient for this to hold for i=2,...,n-1

But if we just have E(n)>=E(n-1), convexity at the boundary cannot be guaranteed. Why do you think the constraints for i=2,...,n-1 are sufficient?

So there is reason to think that all 3*n parameters could be estimated, assuming you have 3*n sample points

So I should have at least 3n sample points to determine the 3n parameters? Did you mean that?

E(n)>=E(n-1)
Don't know where you got this. When you look at E(i+1)-2*E(i)+E(i-1)>=0 at i=n-1 you get
E(n)-2*E(n-1)+E(n-2)>=0
Since i only goes from 2 to n-1, there is no other constraint involving E(n) that I've proposed.
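In matrix form (a sketch, assuming n coefficients), the conditions E(i+1)-2*E(i)+E(i-1)>=0 for i=2,...,n-1 become the linear constraints A*E<=b that lsqnonlin/fmincon expect:

```matlab
n = 8;                                % number of coefficients (illustrative)
S = diff(eye(n), 2);                  % (n-2)-by-n second-difference operator
% Convexity is S*E >= 0; the solvers want A*E <= b, so negate:
A = -S;
b = zeros(n-2, 1);

E = ((1:n).').^2;                     % a discretely convex test vector
ok = all(A*E <= b);                   % true: constraints are satisfied
```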
So I should have at least 3n sample points to determine the 3n parameters? Did you mean that?
Yes. The number of equations must be greater than or equal to the number of unknowns. Although if somehow you had additional equality constraints on the unknown parameters that don't involve f(x_j), those would count as well.

There is no other constraint involving E(n) that I've proposed.

Indeed, you did not. But I thought E(n)>=E(n-1) would be necessary in addition to ensure convexity. Do I not need this constraint?

May it be that E(n)-E(n-1)>=0 along with E(n)-2E(n-1)+E(n-2)>=0 makes f non-convex?

But I thought E(n)>=E(n-1) would be necessary in addition to ensure convexity.
Definitely not. Here's a counter-example. E is linear and therefore convex, but it is strictly decreasing, so E(n)>=E(n-1) does not hold,
E=8:-1:1
E = 1×8
8 7 6 5 4 3 2 1
Makes sense.
So may it be that E(n)-E(n-1)>=0 along with E(n)-2E(n-1)+E(n-2)>=0 makes f non-convex?
No, because an upward sloping line,
E=1:8
is convex and satisfies E(n)-E(n-1)>0 along with E(n)-2E(n-1)+E(n-2)>=0.
So if I have a quadratic like function as you plotted it in this answer and I implement the four constraints
E(n)-E(n-1)>=0
E(n)-2E(n-1)+E(n-2)>=0
E(1)-E(2)>=0
E(3)-2E(2)+E(1)>=0
the constraints
E(n)-E(n-1)>=0
E(1)-E(2)>=0
would be redundant, but are not in contradiction to the two other constraints? That is, I could keep them in theory?
They are not redundant. They narrow the range of functions that you are modeling. It's just that they have nothing to do with convexity. You could and probably should keep them if you know the function you are trying to model satisfies those extra constraints.
I see.
I solve
N_i(x,y) = E_i*log( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )
with lsqnonlin.
Should this optimization problem have a unique minimum (one global minimum)? I.e., all start vectors converge to the same point {E,a,b}?
No, there's no obvious guarantee of that. For fixed a_i,b_i the function becomes convex quadratic with respect to E_i but that's all that can be said.
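The reason, in a 1-D simplification with made-up numbers: once the a_i (and locations s_i) are fixed, the basis values form a constant matrix Phi, the model is linear in E, and the least-squares Hessian 2*Phi'*Phi is positive semidefinite:

```matlab
a = [1, 2, 0.5];  s = [0, 1, 2];      % fixed shape parameters (illustrative)
x = linspace(-1, 3, 25).';            % sample locations
% Phi(j,i) = log(1 + exp(a(i)*(x(j)-s(i)))): constant once a,s are fixed
Phi = log(1 + exp(x*a - ones(25,1)*(a.*s)));
% gsim = Phi*E is linear in E, so ||Phi*E - gexp||^2 is convex quadratic;
% its Hessian is 2*Phi.'*Phi, positive semidefinite by construction:
H = 2*(Phi.'*Phi);
mineig = min(eig(H));                 % >= 0 up to round-off
```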
SA-W
SA-W on 22 Jun 2023
Edited: SA-W on 22 Jun 2023
So if I fixed a_i, b_i (which would only make sense for testing), there should be a unique minimum?
And, E_i >=0 are the only constraints, right?
There would be a unique minimum value. A convex quadratic function can have non-unique global minima if the Hessian is singular.
I have observed that the resnorm values are quite small (around 1e-3). Lsqnonlin stops then after 900 iterations (10 parameters, 25 equations). I have function and optimality tolerance equal to 1e-9.
Would you change those settings? I could increase the number of allowed iterations, but resnorm decreases very slowly after a first drop in the first two or three iterations.
I usually set all the tolerances to 1e-12. Whether resnorm of 1e-3 is to be considered small, I have no idea. It depends on the scale of f.
What do you mean by scale?
I mean if I want to approximate a non-convex function f with a convex function F, there would clearly be a large resnorm.
I mean, how big the f_i are. If they are all on the order of 1e-10, then a resnorm of 1e-3 may be quite large in comparison.
The f_i are between 1 and 2 in my case. So resnorm=1e-3 should be good right?
I would say so, but a co-plot of the surface fit along with f would make it more obvious.
fplot() only works for functions f(x). Is there no 2d equivalent?
So how would you plot F(x,y) = sum_i N_i E_i ? surf() does not seem appropriate, I think.
surf or fsurf would be appropriate.
Thank you for all the help, I do not wanna waste your time further!
There are a lot of things I can give a try, maybe I will come back to this chat here if I have follow-up questions.
Currently, I try to setup a c++ interface to matlab, see this question
In case you also have knowledge in that field, I would appreciate your help again!
I would appreciate your opinion regarding my new results when approximating f(x,y) = g(x) + h(y).
g(x) is the above plot and h(y) the plot below, each of which is approximated with five parameters and linear basis functions. If the objective function is
||g^sim(f(x,y)) - g^exp ||^2
"No noise" means that I solve the pde with the reference paramters (green curve) to obtain g^exp, and "1% noise" means that I add random noise to g^exp with a standard deviation of 1%*max(g^exp) to emulate real experimental data.
As you can see, in the noiseless case I have perfect fitting results, not so in the case with noise.
As I told you, x and y are somehow correlated at some intervals in the {xy} space, see the plot below where I obtained g^exp using a coupled approximation to f(x,y) with nine parameters.
Do you think the correlation between x and y is the reason why the identification fails for the noisy case above? Here, h(y)=sqrt(y) is concave in the example above, which is why I implemented E(i+1)-2*E(i)+E(i-1) <= 0, i=2,...,4 and E(4)<=E(5).
I mean, due to correlation, the columns of the Jacobian associated with x and y are nearly identical and I was sceptical that the decoupled approximation can work at all under those circumstances. But it clearly worked in the noiseless case.
Too much is hidden. I would guess that you used poorer initial guesses in the latter case.
The initial functions g_0(x) and h_0(y) were the same in both cases.
Too much is hidden. Maybe some noise sensitivity in your PDE solver. Make sure you're following the guidelines here:
What happens when you take the solution from the noise-free case and initialize with that?
I think I follow all the guidelines there (analytical Jacobian, bounds, no stochastic objective, returning NaN at failures, ...).
But you are right, the entire fitting procedure seems to be quite sensitive with respect to noise in the data.
The solution from the noise-free case is the exact solution that I am looking for. So initializing there does not change the initial values. If I neglect h(y), i.e., only have the five parameters associated with g(x), even the noisy case gives nearly perfect fitting results with a final sum of squares = 0.46. Interestingly, the sum of squares is also 0.46 in the case above where I have h(y) included and five parameters more.
So initializing there does not change the initial values.
What I meant was: what happens if you take the noise-free solution and use that to initialize the noisy estimation?
Sorry that I have to ask again: What do you mean by noise-free solution (g^exp ?) and noisy estimation (g^exp + noise ?) ?
Take the solution for the unknown parameters that you get when you run the problem with no-noise and use that as your initial guess x0 when you run with noise.
As I said, the solution in the noise-less case are the parameters that I used to calculate g^exp. The final sum of squares in that case is zero. So why would you pass this solution to the simulation with noisy data?
To see if the optimization pulls you strongly away from the ideal solution when there is only mild noise.
Yes, if I start from the reference solution, I also end up with the same false solution that I plotted.
Intriguing. But how well fitted is gexp in these cases? It would be good to see gsim co-plotted with gexp in the noisy case.
gsim and gexp (noisy) are a collection of displacement vectors at 10 different times (10*146=1460 -> gsim, gexp \in R^1460). Since the sum of squares is 0.46, there are of course differences visible between them. However, the displacements are in the range [0,25], so 0.46 is not that big in my opinion.
Then, it sounds like nothing is really wrong, although possibly gsim is an ill-conditioned function of f. I.e., lots of different f can give nearly the same gsim.
possibly gsim is an ill-conditioned function of f
It probably is. Do you think the correlation between x and y when approximating f(x,y) = h(x) + g(y) might be a problem that makes it difficult to reidentify the reference h(x) and g(y)?
Here, h(x) is overestimated and g(y) is underestimated, so their sum probably is somehow close to the correct f.
I have no way of knowing. Whatever process is responsible for the ill-conditioning it seems to be in the PDE solution step. Perhaps more boundary conditions or other constraints need to be applied.
Yes, I have to figure that out.
Let me please briefly address the following issue: for nearly all my initial guesses, I end up with exitflag=2, although I have StepTolerance=1e-12. What are typical reasons why the parameter updates become so small? All that I pass to fmincon/lsqnonlin are the Jacobian (lsqnonlin) or gradient (fmincon) and the components of the objective function (lsqnonlin) or the sum of squares (fmincon). Sometimes exitflag=2 is indeed the optimum I am looking for, sometimes not. This is a problem I ignored so far, but I think I should at least try to understand why I get this exitflag.
From the documentation, exitflag=2 means "Change in x is less than the specified tolerance, or Jacobian at x is undefined."
Jacobian at x is undefined. It would be easy for you to use assert() to trap the case where you have NaN's or inf in your Jacobian.
Change in x is less than the specified tolerance: If the objective is very flat at the current iteration, the optimizer can be fooled into thinking it is close to a minimum, even if it is not. This can result in it taking prematurely small steps or stopping outright. This can happen even if you get an exitflag=1. In the example below, the minimum at x=1 is reached with an error of more than 11%, which some might consider large:
opts=optimoptions('lsqnonlin','StepTolerance',1e-12);
[x,fval,res,exitflag]=lsqnonlin(@(x) (x-1).^4, 1.2,[],[],opts)
Local minimum found. Optimization completed because the size of the gradient is less than the value of the optimality tolerance.
x = 1.1125
fval = 2.5658e-08
res = 1.6018e-04
exitflag = 1
However, premature stopping doesn't seem to be the reason for inaccuracies in g and h. That's why I asked you to initialize using the ground truth parameters. We would expect ground truth to be fairly close to the minimum and the objective to be getting pretty flat in that general vicinity. Yet, the optimizer strongly pulls away from it.
I have very tight tolerances:
opts=optimoptions('lsqnonlin','StepTolerance',1e-12, 'FunctionTolerance', 1e-12, 'OptimalityTolerance', 1e-9);
With these tolerances, the error in your example decreases further to around 3%, which, of course, may be still considered large.
We would expect ground truth to be fairly close to the minimum and the objective to be getting pretty flat in that general vicinity. Yet, the optimizer strongly pulls away from it.
If you consider noise with 1%*max(gexp) standard deviation small, I agree that the minimum should be close to ground truth. But why would you expect the objective to be getting flat in that general vicinity? If I think of minimizing f(x)=x^2, the objective is far from being flat at the minimum.
Also, if my objective were really flat around ground truth, would this not explain the inaccuracies in g and h? If I think of a nearly flat valley, there are many parameter combinations giving the same objective function value.
With these tolerances, the error in your example decreases further to around 3%, which, of course, may be still considered large.
But remember also that we have no idea precisely what the Taylor expansion of your actual function looks like at the minimum, and whether your tolerances are appropriate to it. I can modify my example as below to get poor accuracy again, even with your actual tolerances.
opts=optimoptions('lsqnonlin','StepTolerance',1e-12, 'FunctionTolerance', 1e-12, 'OptimalityTolerance', 1e-9);
[x,fval,res]=lsqnonlin(@(x) (x-1).^6, 1.2,[],[],opts)
Local minimum found. Optimization completed because the size of the gradient is less than the value of the optimality tolerance.
x = 1.1157
fval = 5.7788e-12
res = 2.4039e-06
If I think of minimizing f(x)=x^2, the objective is far from being flat at the minimum.
When I say "flat", I mean that qualitatively speaking the gradients are getting small. Even in my example with (x-1)^4, the objective is not perfectly flat anywhere, but you can still see that it is flat enough to cause early termination.

I can modify my example as below to get poor accuracy again, even with your actual tolerances.

As you said, setting the tolerances to 1e-12 is probably a good choice. I think I read in a different chat that one should not go below 1e-14. Of course, as your example shows, one could construct cases where this still may not be sufficient.

Even in my example with (x-1)^4, the objective is not perfectly flat anywhere, but you can still see that it is flat enough to cause early termination.

The fact that the optimizer pulls away from ground truth, along with exitflag=2, suggests that my objective is "quite" flat around ground truth, right?

Also, if my objective were really flat around ground truth, would this not explain the inaccuracies in g and h? If I think of a nearly flat valley, there are many parameter combinations giving the same objective function value.
If your objective is nearly flat in a broad area around ground truth, it would explain why the function is sensitive to noise. The addition of noise can "tilt" the floor of the valley so that water runs downhill away from ground truth. We do know that once you added the 1% noise, you tilted the valley away from ground truth significantly, because the optimizer pulled strongly away from ground truth when initialized there.
One thing you could try is to add curvature penalty terms to your objective function. For lsqnonlin, this would be equivalent to extending your residual vector,
residual =[residual; smallnumber*[D_g;D_h]]
where D_g and D_h are the D parameters (from our earlier discussion) of g and h. This will discourage g and h from bending more than necessary.
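A sketch of the augmented residual inside the objective function, with the E-to-D transform done per call (the coefficient values and the zero data residual are placeholders):

```matlab
% E = [G(:); H(:)] holds the five coefficients of g and of h.
E = [1; 1.5; 2.3; 3.4; 4.8;  2; 1.4; 1.0; 0.8; 0.7];   % illustrative values
G = E(1:5);  H = E(6:10);
smallnumber = 1e-2;                   % curvature-penalty weight (to be tuned)

D_g = diff(G, 2);                     % "D" parameters (second differences) of g
D_h = diff(H, 2);                     % and of h

residual = zeros(25, 1);              % placeholder for gsim(E) - gexp
residual = [residual; smallnumber*[D_g; D_h]];   % append curvature terms
```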

where D_g and D_h are the D parameters (from our earlier discussion) of g and h

I discretized g and h with five parameters each. Do you mean D_g and D_h are the solution that I got in the noisy case? Would the associated Jacobian not be zero then if D_g and D_h are constant numbers?

I discretized g and h with five parameters each
If you implemented my original suggestion, the 5 parameters would be [C1,C2,D1,D2,D3] which parameterize the basis coefficients E(i) according to,
E=cumsum([C2,cumsum([C1,D])]);
Would the associated Jacobian not be zero then if D_g and D_h are const numbers?
In light of the above, you should see that they would not be constant. D_g and D_h each contain 3 of the 5 unknown parameters that are being iteratively estimated.
If you implemented my original suggestion, the 5 parameters would be [C1,C2,D1,D2,D3] which parameterize the basis coefficients E(i)
I am still working with the E(i) and implemented the constraints E(i+1)-2*E(i)+E(i-1)>=0.
So if g(x) = \sum_i N_i G_i and h(y) = \sum_i N_i H_i, my parameters are [G_1, G_2, G_3, G_4, G_5] and [H_1, H_2, H_3, H_4, H_5].
In light of this,
residual =[residual; smallnumber*[D_g;D_h]]
translates into
residual =[residual; smallnumber*[G_1; ... ;G_5; H_1;...H_5]].
The 10x10 Jacobian matrix associated with this additional block would be the identity matrix?
I am still working with the E(i) and implemented the constraints E(i+1)-2*E(i)+E(i-1)>=0.
How would you be able to do that? It does not enforce convexity at all iterations, which means your PDE solver would fail, creating unrecoverable problems. That was the original point of the question.
In any case, I really did intend the new residual terms be smallnumber*[D_g;D_h], so you should transform your E parameters to D space as we discussed earlier.
How would you be able to do that? It does not enforce convexity at all iterations,
I multiply the A matrix by 1e4, which leads to fulfillment of convexity at all iterations. I know that this is not good practice (you also pointed out practical problems that may occur); however, I have to stick to that framework to complete an article. Once this is done, I can explore new things and move to the transformed parameters in D space.
In any case, I really did intend the new residual terms be smallnumber*[D_g;D_h], so you should transform your E parameters to D space as we discussed earlier.
Can the additional residual components smallnumber*[D_g;D_h] not be written down if the parameters are in the E-space?
I multiply the A matrix by 1e4, which leads to a fulfillment of convexity at all iterations.
If so, that could be another reason why your StepTolerance stopping threshold is being met prematurely, resulting in exitflag=2.
Can the additional residual components smallnumber*[D_g;D_h] not be written down if the parameters are in the E-space?
Yes, if lsqnonlin is handing an E vector to your objective function code, you can transform from E to D within each call to your objective function code and use D for whatever computations are needed.
If so, that could be another reason why your StepTolerance stopping threshold is being met prematurely, resulting in exitflag=2.
Indeed, if I forego the linear constraints, which surprisingly works for quite a small number of initial guesses, I have exitflag=3 with FunctionTolerance=1e-12. Reducing it to, say, 1e-6 sometimes results in exitflag=1.
But why is the higher weighting of the constraints (1e4*A) a constellation that can lead to exitflag=2? I know that this somehow restricts the domain of the parameters. But exitflag=2 makes sense to me if I think of a nearly flat objective as discussed, but not if the search domain is restricted.
Yes, if lsqnonlin is handing an E vector to your objective function code, you can transform from E to D within each call to your objective function code and use D for whatever computations are needed.
But I mean if I do not make the transformation to D-space, but keep the E vector. In that case, what would the curvature penalty term look like (if possible)?
But why is the higher weighting of the constraints (1e4*A) a constellation that can lead to exitflag=2?
It's an algorithm-dependent thing, but basically if you force the algorithm to prioritize the constraints over progress in reducing the objective function, it may need to take much smaller steps than it otherwise would, and this can trigger the StepTolerance stopping criterion.
But I mean if I do not make the transformation to D-space, but keep the E vector. In that case, what would the curvature penalty term look like (if possible)?
It is not possible. I think you want me to tell you that you can do this:
residual =[residual; smallnumber*[diff(G,2);diff(H,2)]]
This is an equivalent way to implement my suggestion and, in your mind, this may be different from transforming to D-space. It is not, though.
residual =[residual; smallnumber*[diff(G,2);diff(H,2)]]
This is an equivalent way to implement my suggestion
I just realized that diff(G,2) and diff(H,2) represent
G(i+1)-2*G(i)+G(i-1)>=0, i=2,3,4
H(i+1)-2*H(i)+H(i-1)>=0, i=2,3,4
which is what I have currently implemented as linear constraints in A and b, the only difference is that instead of "smallnumber" I have "1e4".
So why do you think
"This will discourage g and h from bending more than necessary."
as a possible workaround to the "pulling away from ground truth in the noisy case" issue?
I just realized that diff(G,2) and diff(H,2) represent...
They represent the left hand side of these inequalities, yes.
which is what I have currently implemented as linear constraints in A and b
In A, yes. b is unrelated. And so if you wished, you could express the new residual terms as
residual =[residual; smallnumber*[A_g*G;A_h*H]]
where A_h and A_g are some appropriate subset of the rows of A. Also, for computing the Jacobian of these new terms, it would just be J=smallnumber*blkdiag(A_g,A_h).
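As a check on the appended block (illustrative sizes; A_g and A_h built here as plain second-difference operators), the new residual terms are linear in [G;H], so their Jacobian is the constant block-diagonal matrix:

```matlab
n = 5;                                % coefficients per function (illustrative)
A_g = diff(eye(n), 2);                % second-difference rows acting on G
A_h = diff(eye(n), 2);                % and on H
smallnumber = 1e-2;

E = [1; 2; 4; 7; 11;  3; 2; 1.5; 1.2; 1.1];  % E = [G;H], made-up values
G = E(1:n);  H = E(n+1:end);

r_new = smallnumber*[A_g*G; A_h*H];   % appended residual terms
J_new = smallnumber*blkdiag(A_g, A_h);% their (2n-4)-by-2n Jacobian
err = norm(r_new - J_new*E);          % ~0: r_new is linear with Jacobian J_new
```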
So why do you think "This will discourage g and h from bending more than necessary."
lsqnonlin is trying to make all the residuals as small as possible. By appending the second derivatives to the residuals, it will try to keep them small, which is the same as saying g and h will be inhibited from curving too much.
In A, yes. b is unrelated.
but the right side, b, is zero.
residual =[residual; smallnumber*[A_g*G;A_h*H]]
where A_h and A_g are some appropriate subset of the rows of A. Also, for computng the Jacobian of these new terms, it would just be J=[A_g;A_h].
So far, I have 1e4*A*E<=0, where A = blkdiag(A_g, A_h) and E=[G(:);H(:)]. Instead of passing these linear constraints to lsqnonlin, you want to attach them to the residual. Why do you think this is different and possibly better? I probably would have to set smallnumber=1e4 to make sure lsqnonlin pays attention to the additional residuals.
Instead of passing these linear constraints to lsqnonlin, you want to attach them to the residual.
No, you would still keep the constraints as they are. They are not related to what we are doing with the residuals.
The constraints are forcing the second derivatives to be non-negative.
Conversely, the residual terms are not being used to control whether the second derivatives are positive or negative in sign. Rather, they are trying to keep the second derivatives small in magnitude. Moreover, the algorithm will normally not push the residuals to zero precisely, so the second derivatives will merely be encouraged to be small in magnitude rather than required to have some exact value, as with constraints. Finally, the residuals can be prioritized by weighting them differently, unlike the constraints. By varying the penalty weight factor smallnumber, we can put different priority on those residuals as compared to the data-fitting residual terms gsim-gexp.

Makes totally sense what you say. One point:

so the second derivatives will merely be encouraged to be small in magnitude rather required to have some exact value, as with constraints.

My motivation to multiply A by 1e4 was that some components of 1e3*A*E were still greater than zero, so I increased the exponent by one. What I want to say is that with the factor c in c*A*E, I influence the sign and magnitude simultaneously, right?

And is it really the case that this factor somehow sets the derivatives to an exact value (...required to have some exact value, as with constraints)? I mean, is there not also a merit function internally whose goal is to decrease the constraint violations?

What I want to say is that with c, c*A*E, I do influence the sign and magnitude simultaneously, right?
It's hard to know what's really happening, since I don't know which of lsqnonlin's algorithms you are running. It appears that you are just amplifying the weight of the constraint violations, forcing the update steps to give more consideration to the constraints than to the residuals.
And is it really that this factor somehow sets the derivatives to an exact value
No. But if your constraints are not met within some tolerance, lsqnonlin won't stop iterating or issue exitflag=1. Conversely, lsqnonlin doesn't apply any stopping tolerance thresholds on the magnitude of the residuals that is achieved.

It's hard to know what's really happening, since I don't know which of lsqnonlin's algorithms you are running.

I use lsqnonlin's interior-point algorithm which, if I am not mistaken, calls fmincon's interior-point algorithm internally. Given that, do you know what the scaling of the A matrix affects qualitatively? Just magnitude or sign, or both?

do you know what the scaling of the A matrix affects qualitatively? Just magnitude or sign, or both?
I'm sure it doesn't change the constraints that the solver is trying to satisfy at all. It just changes the path taken to the solution.
I see. I will append the residual vector by the constraints once my solver is available and report again then.
I did not fully understand your comment,
It's an algorithm-dependent thing, but basically if you force the algorithm to prioritize the constraints over progress in reducing the objective function, it may need to take much smaller steps than it otherwise would, and this can trigger the StepTolerance stopping criterion.
regarding my question
But why is the higher weighting of the constraints (1e4*A) a constellation that can lead to exitflag=2
Say the algorithm takes some smaller steps to fulfill the constraints (due to 1e4), which are currently violated. Once they are satisfied, the algorithm can devote itself again to reducing the objective function and may take larger steps again. What I want to say is that, just to fulfill the constraints, a step tolerance of 1e-12 should not be triggered.
Most of the time when I have exitflag=2, the following happens: the algorithm takes some regular steps, but from a certain iteration on, the sum of squares remains static (far away from ground truth) and the parameters change only in digits far behind the decimal point:
Iteration 50: resnorm=234, step size = 1e-3,
Iteration 51: resnorm=234, step size = 1e-4,
...
Iteration 58: resnorm=234, step size = 1e-11,
--> step tolerance is passed, algorithm terminates
As I said, this happens at points far away from the minimum. Do you think this can also be traced back to the weighting of A? If so, this is not tangible to me at the moment: if constraints at iteration 50 were violated, let the algorithm do some iterations with maybe smaller steps, but then proceed again with a bigger step size.
Maybe the best thing is to consider the main update step that the interior-point algorithm uses. This is described in Equation 38 of this doc page.
The Jg matrix is the Jacobian of the inequality constraints, which in this case, because the constraints are linear, is constant. If you make this constant matrix too large, the contributions to the equation coefficients from the objective function gradient and Hessian are very small in comparison. In this case, your steps will always be guided more by the constraints than by the objective.

Therefore, your steps will always be guided more by the constraints than the objective.

This makes sense to me qualitatively, but I do not see why it translates into steps small enough to trigger the step tolerance?

Well, my intuition was that if the update step is (approximately) indifferent to the objective function, then you can imagine it may take steps as if the objective function were some trivial flat value with gradient zero everywhere. In other words, it would have no reason to take large steps once the constraints were satisfied. But that was just my intuition...
I cannot come up with an example to show how scaling up the constraints can force an exitflag=2, but I can show one (below) which demonstrates that scaling the constraints can slow convergence when the objective is ill-conditioned:
load data
opts=optimoptions('fmincon','Algorithm','interior-point','Display','none',...
'MaxFunEvals',inf);
%%%Optimization 1
[E1,f1,exitflag1,out1]=fmincon(f,E0,A,b,[],[],[],[],[],opts);
%%%%Optimization 2: scaled constraints
s=1e4;
opts.MaxIterations=out1.iterations;
[E2,f2,exitflag2]=fmincon(f,E0,s*A,s*b,[],[],[],[],[],opts);
f1,f2
f1 = 7.2114
f2 = 6.3699e+04
Well, my intuition was that if the update step is (approximately) indifferent to the objective function, then you can imagine it may take steps as if the objective function were some trivial flat value with gradient zero everywhere. In other words, it would have no reason to take large steps once the constraints were satisfied. But that was just my intuition...
Thanks for the demo. Well, sounds plausible to me. I just varied the scale factor of the constraints between 1e0 and 1e4 for an initial guess that was successful with 1e4. In each case, the optimizer returned exitflag=2 and sometimes the optimized parameters were wrong. Long story short, one would have to narrow that down much more to better understand what's going on.
What I described before, namely that the optimizer does not reduce the sum of squares from a certain iteration on, but decreases the step size even further like
Iteration 50: resnorm=234, step size = 1e-3,
Iteration 51: resnorm=234, step size = 1e-4,
...
Iteration 58: resnorm=234, step size = 1e-11,
do you have an intuition for what might be the problem here? As I said, this happens at points far from the ground truth. Could it be that the optimizer reached a region in the parameter space where the objective is flat, and the scaling of the constraints amplifies this even more?
I'm not convinced that resnorm is not changing, but maybe it's only doing so to decimal places that are beyond the precision of the display.
Or maybe you've landed in a region where the resnorm is unchanging, but the constraints have not yet been satisfied. So, more iterations are necessary.
SA-W
SA-W on 29 Jun 2023
Edited: SA-W on 29 Jun 2023
I'm not convinced that resnorm is not changing, but maybe it's only doing so to decimal places that are beyond the precision of the display.
Yes, resnorm still changes but at the fifth/sixth digit after the decimal point only.
Or maybe you've landed in a region where the resnorm is unchanging, but the constraints have not yet been satisfied.
What could be a region where the resnorm is unchanging? One where, say, one parameter increases the objective value and another parameter cancels this out again?
So, more iterations are necessary.
But in those iterations (to satisfy the constraints), I would expect the parameters to change too, whereas they change only at the 7th, 8th, 9th digit ... after the decimal point.
What could be a region where the resnorm is unchanging?
For example,
fmincon(@(x) 0 ,[5,1],[1,1],1,[],[],[0,0],[],[],optimoptions('fmincon','Display','iter'))
                                            First-order      Norm of
 Iter F-count            f(x)  Feasibility   optimality         step
    0       3    0.000000e+00    5.000e+00    0.000e+00
    1       7    0.000000e+00    4.829e+00    0.000e+00    1.211e-01
    2      10    0.000000e+00    2.691e+00    0.000e+00    1.527e+00
    3      13    0.000000e+00    0.000e+00    0.000e+00    2.934e+00
Local minimum found that satisfies the constraints.
Optimization completed because the objective function is non-decreasing in feasible directions, to within the value of the optimality tolerance, and constraints are satisfied to within the value of the constraint tolerance.
ans = 1×2
0.7572 0.1630
which however happens only at the 7th, 8th, 9th digit ... after the decimal point.
Why expect larger changes? You have a StepTolerance of 1e-12.
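As a practical follow-up (a sketch, with hypothetical tolerance values): if parameter changes beyond roughly 1e-6 are not meaningful for this problem, raising StepTolerance from the 1e-12 used in the thread makes the solver terminate as soon as steps shrink below that level, instead of grinding through many more tiny iterations:

```matlab
% Looser step tolerance than the 1e-12 used above, so the run stops
% once steps become smaller than is meaningful for the problem.
opts = optimoptions('fmincon','Algorithm','interior-point', ...
    'StepTolerance',1e-6, ...   % hypothetical value; tune to your problem
    'Display','iter');
% [E,fval,exitflag] = fmincon(f,E0,A,b,[],[],[],[],[],opts);
```

This does not fix the underlying scaling issue, but it distinguishes "converged" from "stalled in tiny steps" more cheaply.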


Asked: on 15 Jun 2023
Commented: on 30 Jun 2023