can I pass these nonlinear constraints to lsqnonlin?
Let f(x,y) denote a function of two variables and E the parameters of the optimization problem, which I want to solve with lsqnonlin. To calculate gsim, f must be convex at all iterations. Otherwise, it is purely a matter of luck that the identification works.
My idea is to enforce the convexity by enforcing the Hessian of f to be positive definite. The determinant of the Hessian is given by (please correct me if I am wrong)
det(H) = f_xx*f_yy - f_xy^2
Given that, I would implement the constraint
c = -(f_xx*f_yy - f_xy^2)   % so that c <= 0 enforces det(H) >= 0
in a function
function [c,ceq] = mycon(x)
ceq = [];
c = ... % the above formula
end
I evaluate the above equation at n points in the {x,y} space, i.e., c is a vector with n entries.
Can I implement those constraints as nonlinear constraints or do I make things too complicated?
1 Comment
Matt J
on 15 Jun 2023
To calculate gsim, f must be convex at all iterations.
Convex as a function of (x,y), I assume you mean. It is already a convex function of E.
Accepted Answer
More Answers (1)
Most of the f(x,y) that I am working on are indeed of the form f(x,y)=h(x)+g(y).
If this is true, then ensuring the convexity of f(x,y) is the same as ensuring the convexity of h(x) and g(y) as 1D functions, which is much simpler. If you use 1D linear interpolating basis functions,
h=@(x) interp1([E(1),E(2),E(3),...,E(n)], x)
then you can ensure the convexity of h just by ensuring that the second-order finite differences are nonnegative, which is a simple linear constraint on the E(i),
E(i+1)-2*E(i)+E(i-1)>=0
No need for nonlinear constraints at all. Moreover, if you make the change of variables
E=cumsum([C2,cumsum([C1,D])]);
where D(j), C1, and C2 are the new set of unknowns, then the constraints on D required for convexity are simple positivity bounds D(j)>=0. As you'll recall, bound constraints can indeed be enforced at all iterations, which you said is what you wanted.
D=rand(1,20);
C1=-5;
C2=0;
E=cumsum([C2,cumsum([C1,D])]);
fplot( @(x) interp1( E,x ) ,[1, numel(E)] ); xlabel x; ylabel h(x)
89 Comments
Even with cubic spline basis functions, which are not generally shape preserving, I think you have a pretty good chance of ensuring convexity; the splines may be important for the smoothness of h and g,
D=rand(1,20);
C1=-5;
C2=0;
E=cumsum([C2,cumsum([C1,D])]);
fplot( @(x) interp1( E,x ,'spline') ,[1, numel(E)] ); xlabel x; ylabel h(x)
SA-W
on 20 Jun 2023
assumes that the spacing between the support points is constant, right?
Yes, there would be different weights on the E(i) if the spacing is not uniform, but it shouldn't invalidate the general approach.
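For a nonuniform grid, those "different weights" amount to requiring nondecreasing chord slopes. A minimal sketch (the grid x and coefficients E below are hypothetical placeholders):

```matlab
% Convexity of a piecewise-linear interpolant on a nonuniform grid:
% the chord slope on each interval must be nondecreasing.
x = [0 0.5 1.5 2 4];               % hypothetical nonuniform support points
E = [4 2.2 1.5 1.6 4];             % hypothetical coefficients
slopes = diff(E)./diff(x);         % slope of each linear piece
isConvex = all(diff(slopes) >= 0)  % reduces to E(i+1)-2*E(i)+E(i-1)>=0 when uniform
```

Each condition diff(slopes)>=0 is still linear in the E(i), just with spacing-dependent weights.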
Why do you add a constant C here?
When you double-integrate something, there are two constants of integration, so in theory we should have two constants here,
E=cumsum([C2,cumsum([C1,D])]);
I was not sure that we needed C2, but now I think we do.
Also for the E(i) at the boundary, one can only impose constraints E(i-1)<=E(i).
If you make your E-grid a little bit larger than your f-grid, then that should not be a problem.
SA-W
on 20 Jun 2023
If we have f=f(x) and there are n support points x_i, there are also n parameters E_i.
f(x) is a continuous function on some interval [fmin, fmax]. You have a parametric model
where the s_i are known shifts which can be thought of as the "locations" of the basis functions. The function F(x) is also continuous and defined on some possibly larger domain [Fmin,Fmax]. You want these functions to agree at certain discrete sample points,
and to choose the unknown E_i to achieve this agreement as closely as possible.
You seem to be assuming that for every basis function location s_i there needs to be corresponding x_j=s_i, but there is no reason to have that assumption. There is no necessary relationship between the s_i and x_j at all other than that they should be chosen to ensure,

Additionally, there are certain mathematical simplifications and conveniences if you choose the s_i so that
how do I transform the "real" parameters into D?
The inverse of,
E=cumsum([C2,cumsum([C1,D])])
is
D=diff(E,2)
For example,
D=rand(1,8)
C1=rand; C2=rand;
E=cumsum([C2,cumsum([C1,D])]);
D2=diff(E,2)
No, D=diff(E,2) is always valid.
To recover E from D, however, you would need to save the relevant C1 and C2. Example:
E=rand(1,5) %start with E
C1=E(2)-E(1); %transform to {C1,C2,D} space
C2=E(1);
D=diff(E,2);
Now, we recover E
Erec=cumsum([C2,cumsum([C1,D])]) %recovered E
SA-W
on 20 Jun 2023
Matt J
on 20 Jun 2023
It looks convex as I showed in my graph above.
SA-W
on 20 Jun 2023
SA-W
on 20 Jun 2023
Matt J
on 20 Jun 2023
now we have E_i=E_i(D,C1,C2).
Yes, that.
SA-W
on 20 Jun 2023
In both cases dE/dZ is required, with E=cumsum([C2,cumsum([C1,D])]). Correct? And is dE/dZ continuously differentiable?
Yes, and in fact because E=M*[D(:);C1;C2] for some constant matrix M, dE/dZ will just be M.'
You can find M either analytically, or using func2mat in this FEX download,
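Alternatively (a sketch, with a hypothetical size n), M can be recovered numerically by applying the linear map Z=[D(:);C1;C2] -> E to the columns of an identity matrix, which is essentially what func2mat does:

```matlab
n = 5;                                   % hypothetical number of D parameters
Emap = @(Z) cumsum([Z(n+2), cumsum([Z(n+1), Z(1:n).'])]).';  % Z=[D;C1;C2] -> E
I = eye(n+2);
M = zeros(n+2);
for k = 1:n+2
    M(:,k) = Emap(I(:,k));               % image of each unit vector = column of M
end
Z = rand(n+2,1);
err = norm(M*Z - Emap(Z))                % ~0: the map is linear, so E = M*Z exactly
```

Since M is constant, dE/dZ is trivially continuously differentiable.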
SA-W
on 21 Jun 2023
Matt J
on 21 Jun 2023
Mainly, if your basis functions are simple linear interpolating kernels, the advantage would be that F(x_j)=E_j, i.e. you don't have to interpolate to obtain the values of F(x_j).
SA-W
on 21 Jun 2023
Matt J
on 21 Jun 2023
Is it not true even if x_j = s_i,
Yes, that's a special case of
.
SA-W
on 21 Jun 2023
Matt J
on 21 Jun 2023
You said, I can make the E-grid a little larger than the f-grid
Yes, but I think it's moot at this point. I think we've determined that you can make the grids the same size, and the convexity conditions E(i+1)-2*E(i)+E(i-1)>=0 do not have to be satisfied at all i. It is sufficient for this to hold for i=2,...,n-1.
But how can that work with linear basis functions? Here, we have , but how would I determine and . We said f(x) is not defined there.
One way would be to not use linear interpolating basis functions. We talked about basis functions like
E_i log(1+A_i*exp(a_i*(x-s_i))), i=1...n
which do not have a compact footprint. Instead, each basis function has a footprint over the entire domain of x. The value F(x0) at any x0 is therefore influenced by all n basis functions, so there is reason to think that all 3*n parameters could be estimated, assuming you have 3*n sample points f(x_j), wherever they happen to be located. Although the more spread out the x_j, the better conditioned the estimation problem is likely to be.
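A sketch of evaluating F with such globally-supported basis functions (all values below are hypothetical placeholders):

```matlab
s = linspace(0, 1, 4);                 % hypothetical basis locations s_i
E = [0.5 1.2 0.8 0.3];                 % hypothetical weights E_i
A = ones(1,4);  a = 2*ones(1,4);       % hypothetical shape parameters A_i, a_i
F = @(x) sum(E .* log(1 + A.*exp(a.*(x(:) - s))), 2);  % each term spans all x
F(0.7)                                 % all n basis functions contribute here
```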
Another approach would be to add a penalty function on the roughness of the coefficients E_i to your objective function, e.g.,
Penalty(x)= smallweight * sum_i abs(E(i)-E(i+1)) , i=1...n-1
Certainly if you have linear interpolating basis functions, we expect the E(i) to step smoothly from one i to the next, and the above penalty would enforce that, even without having any samples f(x_j) that provide information on a particular E_i.
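A sketch of how such a penalty could be appended inside the lsqnonlin objective (the forward model gsim and the weight are hypothetical; note lsqnonlin squares residual entries, so diff(E) here implements a sum-of-squared-differences version of the penalty above):

```matlab
function r = residualWithPenalty(E, gexp)
    % data-fitting part: gsim is a hypothetical forward model
    r = gsim(E) - gexp;
    smallweight = 1e-3;                   % hypothetical penalty weight
    r = [r(:); smallweight*diff(E(:))];   % roughness penalty on neighboring E(i)
end
```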
SA-W
on 21 Jun 2023
E(n)>=E(n-1)
Don't know where you got this. When you look at E(i+1)-2*E(i)+E(i-1)>=0 at i=n-1 you get
E(n)-2*E(n-1)+E(n-2)>=0
Since i only goes from 2 to n-1, there is no other constraint involving E(n) that I've proposed.
So I should have at least 3n sample points to determine the 3n parameters? Did you mean that?
Yes. The number of equations must be greater than or equal to the number of unknowns. Although if somehow you had additional equality constraints on the unknown parameters that don't involve f(x_j), those would count as well.
SA-W
on 21 Jun 2023
But I thought E(n)>=E(n-1) would be necessary in addition to ensure convexity.
Definitely not. Here's a counter-example. E is linear and therefore convex, but it is strictly decreasing, so E(n)>=E(n-1) does not hold,
E=8:-1:1
SA-W
on 21 Jun 2023
No, because an upward sloping line,
E=1:8
is convex and satisfies E(n)-E(n-1)>0 along with E(n)-2*E(n-1)+E(n-2)>=0.
SA-W
on 22 Jun 2023
They are not redundant. They narrow the range of functions that you are modeling. It's just that they have nothing to do with convexity. You could and probably should keep them if you know the function you are trying to model satisfies those extra constraints.
SA-W
on 22 Jun 2023
Matt J
on 22 Jun 2023
There would be a unique minimum value. A convex quadratic function can have non-unique global minima if the Hessian is singular.
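A minimal illustration: f(x,y) = x^2 is convex, but its Hessian is singular, and every point (0,y) is a global minimizer, so the minimizer is non-unique while the minimum value 0 is unique:

```matlab
H = [2 0; 0 0];             % Hessian of f(x,y) = x^2: positive semidefinite, singular
eig(H)                      % eigenvalues 2 and 0
f = @(x,y) x.^2;
[f(0,-3), f(0,0), f(0,7)]   % all equal 0: non-unique global minimizers
```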
SA-W
on 22 Jun 2023
SA-W
on 22 Jun 2023
SA-W
on 22 Jun 2023
Matt J
on 22 Jun 2023
I would say so, but a co-plot of the surface fit along with f would make it more obvious.
SA-W
on 22 Jun 2023
Matt J
on 23 Jun 2023
surf or fsurf would be appropriate.
SA-W
on 23 Jun 2023
Matt J
on 26 Jun 2023
Too much is hidden. I would guess that you used poorer initial guesses in the latter case.
SA-W
on 26 Jun 2023
Too much is hidden. Maybe some noise sensitivity in your PDE solver. Make sure you're following the guidelines here:
What happens when you take the solution from the noise-free case and initialize with that?
SA-W
on 26 Jun 2023
Matt J
on 26 Jun 2023
So initializing there does not change the initial values.
What I meant was. What happens if you take the noise-free solution and use that to initialize the noisy estimation?
SA-W
on 26 Jun 2023
SA-W
on 26 Jun 2023
SA-W
on 27 Jun 2023
Matt J
on 27 Jun 2023
Intriguing. But how well fitted is gexp in these cases? It would be good to see gsim co-plotted with gexp in the noisy case.
SA-W
on 27 Jun 2023
Matt J
on 27 Jun 2023
Then, it sounds like nothing is really wrong, although possibly gsim is an ill-conditioned function of f. I.e., lots of different f can give nearly the same gsim.
SA-W
on 27 Jun 2023
SA-W
on 27 Jun 2023
From the documentation, exitflag=2 means "Change in x is less than the specified tolerance, or Jacobian at x is undefined."
Jacobian at x is undefined. It would be easy for you to use assert() to trap the case where you have NaN's or inf in your Jacobian.
Change in x is less than the specified tolerance: If the objective is very flat at the current iteration, the optimizer can be fooled into thinking it is close to a minimum, even if it is not. This can result in it taking prematurely small steps or stopping outright. This can happen even if you get an exitflag=1. In the example below, the minimum at x=1 is reached with an error of more than 11%, which some might consider large:
opts=optimoptions('lsqnonlin','StepTolerance',1e-12);
[x,fval,res,exitflag]=lsqnonlin(@(x) (x-1).^4, 1.2,[],[],opts)
However, premature stopping doesn't seem to be the reason for inaccuracies in g and h. That's why I asked you to initialize using the ground truth parameters. We would expect ground truth to be fairly close to the minimum and the objective to be getting pretty flat in that general vicinity. Yet, the optimizer strongly pulls away from it.
With these tolerances, the error in your example decreases further to around 3%, which, of course, may be still considered large.
But remember also that we have no idea precisely what the Taylor expansion of your actual function looks like at the minimum, and whether your tolerances are appropriate to it. I can modify my example as below to get poor accuracy again, even with your actual tolerances.
opts=optimoptions('lsqnonlin','StepTolerance',1e-12, 'FunctionTolerance', 1e-12, 'OptimalityTolerance', 1e-9);
[x,fval,res,exitflag]=lsqnonlin(@(x) (x-1).^6, 1.2,[],[],opts)
If I think of minimizing f(x)=x^2, the objective is far from being flat at the minimum.
When I say "flat", I mean that qualitatively speaking the gradients are getting small. Even in my example with (x-1)^4, the objective is not perfectly flat anywhere, but you can still see that it is flat enough to cause early termination.
SA-W
on 27 Jun 2023
Also, if my objective were really flat around ground truth, would this not explain the inaccuracies in g and h? If I think of a nearly flat valley, there are many parameter combinations giving the same objective function value.
If your objective is nearly flat in a broad area around ground truth, it would explain why the function is sensitive to noise. The addition of noise can "tilt" the floor of the valley so that water runs downhill away from ground truth. We do know that once you added the 1% noise, you tilted the valley away from ground truth significantly, because the optimizer pulled strongly away from ground truth when initialized there.
One thing you could try is to add curvature penalty terms to your objective function. For lsqnonlin, this would be equivalent to extending your residual vector,
residual =[residual; smallnumber*[D_g;D_h]]
where D_g and D_h are the D parameters (from our earlier discussion) of g and h. This will discourage g and h from bending more than necessary.
SA-W
on 27 Jun 2023
I discretized g and h with five parameters each
If you implemented my original suggestion, the 5 parameters would be [C1,C2,D1,D2,D3] which parameterize the basis coefficients E(i) according to,
E=cumsum([C2,cumsum([C1,D])]);
Would the associated Jacobian not be zero then if D_g and D_h are const numbers?
In light of the above, you should see that they would not be constant. D_g and D_h each contain 3 of the 5 unknown parameters that are being iteratively estimated.
SA-W
on 28 Jun 2023
I am still working with the E(i) and implemented the constraints E(i+1)-2*E(i)+E(i-1)>=0.
How would you be able to do that? It does not enforce convexity at all iterations, which means your PDE solver would fail, creating unrecoverable problems. That was the original point of the question.
In any case, I really did intend the new residual terms be smallnumber*[D_g;D_h], so you should transform your E parameters to D space as we discussed earlier.
SA-W
on 28 Jun 2023
I multiply the A matrix by 1e4, which leads to a fulfillment of convexity at all iterations.
If so, that could be another reason why your StepTolerance stopping threshold is being met prematurely, resulting in exitflag=2.
Can the additional residual components smallnumber*[D_g;D_h] not be written down if the parameters are in the E-space?
Yes, if lsqnonlin is handing an E vector to your objective function code, you can transform from E to D within each call to your objective function code and use D for whatever computations are needed.
SA-W
on 28 Jun 2023
But why is the higher weighting of the constraints (1e4*A) a constellation that can lead to exitflag=2?
It's an algorithm-dependent thing, but basically if you force the algorithm to prioritize the constraints over progress in reducing the objective function, it may need to take much smaller steps than it otherwise would, and this can trigger the StepTolerance stopping criterion.
But I mean if I do not make the transformation to D-space, but keep the E vector. In that case, how would the curvature penalty term look like (if possible)?
It is not possible. I think you want me to tell you that you can do this:
residual =[residual; smallnumber*[diff(G,2);diff(H,2)]]
This is an equivalent way to implement my suggestion and, in your mind, this may be different from transforming to D-space. It is not, though.
I just realized that diff(G,2) and diff(H,2) represent...
They represent the left hand side of these inequalities, yes.
which is what I have currently implemented as linear constraints in A and b
In A, yes. b is unrelated. And so if you wished you can express the new residual terms as
residual =[residual; smallnumber*[A_g*G;A_h*H]]
where A_h and A_g are some appropriate subset of the rows of A. Also, for computing the Jacobian of these new terms, it would just be J=[A_g;A_h].
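For concreteness, A_g can be built explicitly as a second-difference operator; a sketch with hypothetical sizes (spdiags with constant diagonals is safe here):

```matlab
n  = 5;                                 % hypothetical number of coefficients in G
Ag = full(spdiags(ones(n-2,1)*[1 -2 1], 0:2, n-2, n));  % row i: G(i)-2*G(i+1)+G(i+2)
G  = (1:n).^2;                          % hypothetical convex coefficient vector
d2 = Ag*G(:)                            % equals diff(G(:),2)
```

With the appended residual smallnumber*Ag*G, the corresponding Jacobian block is constant (smallnumber*Ag, if the weight is included in the residual terms).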
So why do you think "This will discourage g and h from bending more than necessary."
lsqnonlin is trying to make all the residuals as small as possible. By appending the second derivatives to the residuals, it will try to keep them small, which is the same as saying g and h will be inhibited from curving too much.
SA-W
on 28 Jun 2023
Instead of passing these linear constraints to lsqnonlin, you want to attach them to the residual.
No, you would still keep the constraints as they are. They are not related to what we are doing with the residuals.
The constraints are forcing the derivatives to be non-negative.
Conversely, the residual terms are not being used to control whether the second derivatives are positive or negative in sign. Rather, they are trying to keep the second derivatives small in magnitude. Moreover, the algorithm will normally not push the residuals to zero precisely, so the second derivatives will merely be encouraged to be small in magnitude rather than required to have some exact value, as with constraints. Finally, the residuals can be prioritized by weighting them differently, unlike the constraints. By varying the penalty weight factor smallnumber, we can put different priority on those residuals as compared to the data-fitting residual terms gsim-gexp.
SA-W
on 28 Jun 2023
What I want to say is that with c*A*E, I do influence the sign and magnitude simultaneously, right?
It's hard to know what's really happening, since I don't know which of lsqnonlin's algorithms you are running. It appears that you are just amplifying the weight of the constraint violations, forcing the update steps to give more consideration to the constraints than to the residuals.
And is it really that this factor somehow sets the derivatives to an exact value
No. But if your constraints are not met within some tolerance, lsqnonlin won't stop iterating or issue exitflag=1. Conversely, lsqnonlin doesn't apply any stopping tolerance thresholds on the magnitude of the residuals that is achieved.
SA-W
on 28 Jun 2023
Matt J
on 28 Jun 2023
do you know what the scaling of the A matrix affects qualitatively? Just magnitude or sign, or both?
I'm sure it doesn't change the constraints that the solver is trying to satisfy at all. It just changes the path taken to the solution.
SA-W
on 29 Jun 2023
Maybe the best thing is to consider the main update step that the interior-point algorithm uses. This is described in Equation 38 of this doc page.

The Jg matrix is the Jacobian of the inequality constraints, which in this case, because the constraints are linear, is a constant. If you make this constant too large, the contribution to the equation coefficients from the objective function gradient and Hessian are very small in comparison. In this case, your steps will always be guided more by the constraints than the objective.
SA-W
on 29 Jun 2023
Well, my intuition was that if the update step is (approximately) indifferent to the objective function, then you can imagine it may take steps as if the objective function were some trivial flat value with gradient zero everywhere. In other words, it would have no reason to take large steps once the constraints were satisfied. But that was just my intuition...
I cannot come up with an example to show how scaling up the constraints can force an exitflag=2, but I can show one (below) which demonstrates that scaling the constraints can slow convergence when the objective is ill-conditioned:
load data
opts=optimoptions('fmincon','Algorithm','interior-point','Display','none',...
'MaxFunEvals',inf);
%%%Optimization 1
[E1,f1,exitflag1,out1]=fmincon(f,E0,A,b,[],[],[],[],[],opts);
%%%%Optimization 2: scaled constraints
s=1e4;
opts.MaxIterations=out1.iterations;
[E2,f2,exitflag2]=fmincon(f,E0,s*A,s*b,[],[],[],[],[],opts);
f1,f2
SA-W
on 29 Jun 2023
I'm not convinced that resnorm is not changing, but maybe it's only doing so to decimal places that are beyond the precision of the display.
Or maybe you've landed in a region where the resnorm is unchanging, but the constraints have not yet been satisfied. So, more iterations are necessary.
What could be a region where the resnorm is unchanging?
For example,
fmincon(@(x) 0 ,[5,1],[1,1],1,[],[],[0,0],[],[],optimoptions('fmincon','Display','iter'))
which however happens only at the 7th, 8th, 9th digit ... after the decimal point.
Why expect larger changes? You have a StepTolerance of 1e-12.