can I pass these nonlinear constraints to lsqnonlin?

Let f(x,y) denote a function of two variables and E = (E_1, ..., E_n) the parameters of the optimization problem
which I want to solve with lsqnonlin. To calculate g^sim, f must be convex at all iterations. Otherwise, it is purely a matter of luck that the identification works.
My idea is to enforce the convexity by requiring the Hessian of f to be positive definite. The determinant of the Hessian is given by (please correct me if I am wrong)
det(H) = f_xx * f_yy - (f_xy)^2
Given that, I would implement the constraints
c = -(f_xx * f_yy - (f_xy)^2) % c <= 0 enforces det(H) >= 0
in a function
function [c,ceq] = mycon(x)
ceq = [];
c = ... % the above formula
end
I evaluate the above equation at n points in the {x,y} space, i.e., c is a vector with n entries.
Can I implement those constraints as nonlinear constraints or do I make things too complicated?
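For what it's worth, a sketch of how such a constraint vector could be assembled on a grid of sample points (Python/NumPy used for illustration; `hessian_constraints` and the sample values are hypothetical, and it assumes the second derivatives f_xx, f_yy, f_xy of the parametrized f can be evaluated):

```python
import numpy as np

def hessian_constraints(fxx, fyy, fxy):
    """Build an inequality-constraint vector c (feasible when c <= 0) that
    enforces positive definiteness of the 2x2 Hessian at each sample point.
    By Sylvester's criterion this needs BOTH f_xx > 0 and det(H) > 0."""
    det = fxx * fyy - fxy**2          # determinant of the 2x2 Hessian
    # fmincon-style convention: constraints are satisfied when c <= 0
    return np.concatenate([-det, -fxx])

# Illustrative check on f(x,y) = x^2 + y^2 (convex everywhere):
n = 5
fxx = np.full(n, 2.0); fyy = np.full(n, 2.0); fxy = np.zeros(n)
c = hessian_constraints(fxx, fyy, fxy)
print(np.all(c <= 0))
```

The `-fxx` block is included because det(H) > 0 alone does not certify positive definiteness of a 2x2 matrix.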

1 Comment

To calculate sim , f must be convex at all iterations.
Convex as a function of (x,y), I assume you mean. It is already a convex function of E.


 Accepted Answer

Matt J
Matt J on 15 Jun 2023
Edited: Matt J on 15 Jun 2023
Can I implement those constraints as nonlinear constraints
You can, but you have a few problems:
(1) Nonlinear constraints cannot be enforced at all iterations. In fact, only simple bound constraints can be.
(2) det(A)>0 is not a sufficient condition for the positive definiteness of A. By Sylvester's criterion, you need to ensure f_xx > 0 as well, although that will be a simpler, linear constraint in E.
(3) The constraints need to be satisfied for all (x,y) in whatever domain f(x,y) is defined on. In theory, that gives you an uncountably infinite number of constraints. To be practical, you could relax the constraint, imposing it on some discrete grid of points (x_j, y_j), but in theory, you could not be sure that f(x,y) is convex everywhere between these points.
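Point (2) is easy to see numerically (Python used for illustration): a diagonal matrix with two negative entries has positive determinant but is negative definite:

```python
import numpy as np

# A 2x2 matrix with positive determinant that is NOT positive definite:
H = np.array([[-1.0, 0.0],
              [0.0, -2.0]])
print(np.linalg.det(H))          # positive (2.0), yet...
print(np.linalg.eigvalsh(H))     # ...both eigenvalues are negative
# Sylvester's criterion catches this: the leading 1x1 minor H[0,0] is negative.
```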
or do I make things too complicated?
Maybe. You would need to tell us more about the properties of the N_i(x,y) functions and what g^sim() is doing. If the N_i(x,y) functions are all convex individually, then it would be sufficient (although not necessary) to impose the much simpler constraints E_i>=0.

44 Comments

Thank you!
The linear constraint f_xx > 0 can be easily implemented in A*x<b.
" Maybe. You would need to tell us more about the properties of the N_i(x,y) functions and what g^sim() is doing. If the N_i(x,y) functions are all convex, then it would be sufficient (although not necessary) to impose the much simpler constraints E_i>=0. "
Think about the N_i(x,y) as interpolation functions (piecewise polynomials). Attached is a picture where n=2 and f is a function just of x: as you can see, the N_i look like hat functions, they are 1 at one support point and zero at all others. The support points x_i are the points where the parameters E_i are defined, i.e. f(x_i) = E_i.
In my case, they are a little bit smoother, so that I can take second derivatives of them, but the shape is like in the picture, which means that they are not convex. E_i >= 0 are lower bounds of my optimization, but this does not really help given the shape of the N_i, right?
No, it doesn't help. It sounds like it might be the wrong way to parametrize f(x,y) if f is supposed to be convex.
This way of (local) interpolation, where I have one parameter for each support point, gives me control over where I want to have more parameters,...
"It sounds like it might be the wrong way to parametrize f(x,y) if f is supposed to be convex."
Finding this function is a main challenge in the realm of my physics. So the idea is then to approximate it like above and find the nodal values (parameters) based on experiments.
Why do you think it might be the wrong way if f is supposed to be convex?
Why do you think it might be the wrong way if f is supposed to be convex?
Because a linear combination of non-convex basis functions will rarely give a convex function...
If there is a reason, based in physics, that f needs to be convex, the proper way to model f would probably come from that.
I have to think about that. Doing this local approximation of f is the novelty of my work, so I should research this in more depth.
As for your point (1):
I am aware that only bound constraints can be enforced at all iterations. We already talked about this issue in https://de.mathworks.com/matlabcentral/answers/1921445-fmincon-any-way-to-enforce-linear-inequality-constraints-at-intermediate-iterations?s_tid=prof_contriblnk
There, you suggested to (i) have the objective return NaN in case of evaluation failures, or to (ii) project the current point onto the set of feasible points that satisfy A*x<=b using lsqlin.
Returning NaN leads to exitflag=2 most of the time (although I have StepTolerance=1e-12), and projecting using lsqlin makes most of my constraints active, which is also not the best way to go. It would be better to have them inactive.
What really helped me is to multiply my constraints by a large number (A*x<=b becomes (1e4*A)*x<=b). Just passing A and b to fmincon or lsqnonlin, the constraints were basically violated at all iterations. At the linked question above, you wrote
"it violates the theoretical assumptions of fmincon, and probably all the Optimization Toolbox solvers as well, when the domain of your objective function and constraints is not an open subset of R^n. If you forbid evaluation outside the closed set defined by your inequality constraints, that is essentially the situation you are creating."
Can you comment a little bit more on this? In the end, is it not the same situation that I also create when (i) returning NaN or (ii) projecting the current point onto the feasible set?
I have to think about that. Doing this local approximation of f is the novelty of my work, so I should research this in more depth.
Might you be able to choose basis functions that are convex? Then you would still have some semi-local control over f through the E coefficients. Beyond that, it doesn't really make sense to be controlling f locally when f must satisfy a global property like convexity. A function generally cannot be perturbed locally without destroying convexity.
Can you comment a little bit more on this?
The theory behind fmincon's algorithms was derived under the assumption that the objective and constraint functions are continuously differentiable over the domain D where you might want the algorithm to search. In order to be differentiable over D, it is necessary that these functions be defined on an open superset of D.
A 1D function defined only for x >= 0 is an example that violates this, assuming the search domain is D = {x | x >= 0}.
Since D is closed and f is not defined throughout any open superset of D, it is not possible to evaluate the derivative of f at x=0. Note that even though f has a right-hand derivative at x=0, it lacks a left-hand derivative, and therefore is not differentiable there.
When f is undefined outside a closed domain D, there are also practical difficulties for finite differencing. If the solution xopt is on the boundary of D, and if you are approximating the derivatives using central finite differences, you have to take the finite differences with smaller and smaller step sizes as xopt is approached. Otherwise, the finite differencing will eventually, at some iteration, have to reach for x that are outside of D where f is undefined. On the other hand, if you let the step size shrink arbitrarily, it will eventually fall below floating point precision and the derivative estimation will fail.
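To illustrate the boundary finite-differencing issue concretely (Python for illustration; sqrt is just a stand-in for a function undefined outside a closed domain):

```python
import math

def f(x):
    # f is undefined (NaN) outside the closed domain D = {x | x >= 0}
    return math.sqrt(x) if x >= 0 else float("nan")

h = 1e-6
central = (f(h) - f(-h)) / (2 * h)    # reaches outside D -> NaN
forward = (f(h) - f(0.0)) / h         # stays inside D, but one-sided
print(central, forward)
```

The central difference at the boundary point x = 0 is forced to evaluate f outside D and fails; only a one-sided difference survives.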
In the end, is it not the same situation that I also create when (i) returning NaN
Yes, but the common situations where people return NaN from the objective and constraints are problems where the domain D is naturally open. For example f(x)=x-log(x) has an open search domain D={x|x>0}, and therefore an open superset of D is D itself. Therefore, setting f(x)=NaN for x<=0 doesn't give the above difficulties.
or (ii) projecting the current point onto the feasible set?
No. Let D be a non-empty search domain whose points x satisfy:
(i) f(x) is defined, and
(ii) x is feasible.
Also, let P(x) be a projection from any x in R^n onto D. Then f(P(x)) will be defined throughout R^n due to (i).
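A toy illustration of case (ii) (Python; delta and the example f are hypothetical choices):

```python
import math

delta = 1e-6                      # small margin keeping P(x) strictly inside x > 0

def f(x):
    return x - math.log(x)        # only defined for x > 0

def P(x):
    # projection of a scalar onto D = {x | x >= delta}
    return max(x, delta)

# f(P(x)) is now defined everywhere, even for infeasible trial points:
print(f(P(-5.0)))                 # finite, evaluated at the projected point
print(f(P(2.0)))                  # interior points are left untouched
```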
"The theory behind fmincon's algorithms was derived under the assumption that the objective and constraint functions are continuously differentiable over the domain D where you might want the algorithm to search. In order to be differentiable over D, it is necessary that these functions be defined on an open superset of D."
I think I got the gist of your examples. So you would not multiply the A matrix by a large number to have fmincon pay attention to the linear constraints? In my case, I return the Jacobian in the objective function and there are no finite-difference calculations involved. I cannot really grasp what practical implications (1e4*A) has and when this might lead to problems.
"Might you be able to choose basis functions that are convex? Then you would still have some semi-local control over f through the E coefficients. Beyond that, it doesn't really make sense to be controlling f locally when f must satisfy a global property like convexity. A function generally cannot be perturbed locally withour destroying convexity."
I agree with you that controlling f(x,y) locally with finite element basis functions is not the best way to go. In 1d, f=f(x), it worked well with local basis functions, but in 2d, f=f(x,y), I have the issue that the {x,y} space is not "sampled" uniformly when calculating g^sim. Consider a grid with n uniformly spaced support points: f is mainly evaluated along the grid diagonal since there is a correlation between x and y. Then, of course, I cannot calibrate the parameters E_i far off the diagonal, which is the main issue I currently face.
Maybe semi-local control of f is a workaround. Can you explain what "semi-local control over f through the E coefficients" means? And to answer your question, my basis functions do not necessarily have to be convex. What are convex basis functions that you think might be appropriate?
Maybe semi-local control of f is a workaround. Can you explain what "semi-local control over f through the E coefficients" means?
Imagine your basis functions were Gaussian lobes. Changing a lobe's coefficient would change f(x,y) most strongly near the center of the lobe, but would still affect f(x,y) further away somewhat.
What are convex basis functions that you think might be appropriate?
No way for me to know. The physics of the problem and what it says about how f() should look have not been described to us.
SA-W
SA-W on 16 Jun 2023
Edited: SA-W on 16 Jun 2023
"Imagine your basis functions were Gaussian lobes."
Gaussian lobes are Gaussian basis functions or did you refer to something else?
"No way for me to know. The physics of the problem and what it says about how f() should look have not been described to us."
Let me try to give you a few more details, maybe you can give a suggestion then that allow me to explore new things...
The pde that I am solving is div(P) = 0, which is the balance of linear momentum (Newton's law). It is a purely mechanical boundary-value problem and I solve it by means of the finite-element method. P is a second-order tensor (stress tensor, 3x3 matrix) which can be derived from the strain energy density f by differentiation with respect to C, where C is an SPD second-order tensor (strain tensor, 3x3 matrix). f is a scalar-valued function (strain energy density) characterizing the constitutive behavior of the material under consideration. If we assume the material to be isotropic (same response regardless of the loading direction), then we can express f in terms of (some of) the principal invariants of C, i.e., f(C) = f(I_1(C), I_2(C)). Renaming the invariants to x and y, we end up with f(x,y) as I defined it in my question. From the physics we know that f must vanish for zero deformation (C = I), where it should also attain its global minimum. f should have no other stationary points. Also, we know that x > 0 and y > 0 based on the properties of C, and f >= 0 since energy cannot be negative. Also, f should go to infinity if either x or y approaches infinity, and f is ideally convex. However, reasonable upper bounds on x and y exist for most physical applications. Those are the main properties we know about f.
What we usually do in material modeling is to prescribe the global shape of f up front
and fit a quite small number of material parameters to some measurements. However, this approximation a priori assumes a particular shape of f, and it turns out that this global approximation is not really satisfactory, which is why people try to find better approximations of f to make it a good model for nearly all loading cases of the above boundary-value problem. My idea is/was to discretize f (locally) over the space of invariants (x and y), distribute some support points in that space, and find the values of f at those support points (the E_i) with optimization. This works excellently in a one-dimensional invariant space but, as I told you, not so well in a two-dimensional invariant space, because the principal invariants of an SPD tensor show some correlation. In other words, there are some support points (x_a, y_a) that I do not really "touch" when I solve my boundary-value problem, in particular the ones at the boundary. I highly doubt it works to identify the parameters at those support points with a purely local approximation.
As you said, maybe doing a semi-local approximation helps in a two-dimensional invariant space. What I can think of is to keep identifying some of the E_i (the ones at the "most activated" support points) with optimization, and computing the E_i at the boundary based on the optimized E_i and the convexity property.
Can this be achieved by a special class of (convex) basis functions? I am open to any input in any direction!
The pde that I am solving is div(P)=0.
That seems to be the same as Laplacian(f)=0.
Also, $f$ should go to infinity if either x or y approach infinity and is ideally convex.
You've restated here that f should be convex, but still haven't explained why. You also haven't explained why this convexity needs to be enforced at all iterations.

" That seems to be the same as Laplacian(f)=0."

But div(P)=0 involves the divergence of a second-order tensor, not the Laplacian of the scalar f.

" You've restated here that f should be convex, but still haven't explained why. You also haven't explained why this convexity needs to be enforced at all iterations."

f(C) should be polyconvex, which is often weakened by requiring f(x,y) to be convex, where x and y are the first two principal invariants of C. I solve the pde with finite elements. Since it is nonlinear in the displacement field U, a Newton scheme is applied where the increments are obtained by solving the linear system K*deltaU = -R

If f(x,y) is convex, it can be guaranteed that the pde has a unique solution and that K is PSD. We cannot guarantee that if f(x,y) is not convex. If the convexity is not enforced at all iterations, I often see that K is rank-deficient, i.e. I cannot solve the pde.

f(C) should be polyconvex, which is often weakened by requiring f(x,y) to be convex,
Strengthened, I think you mean.
If f(x,y) is convex, it can be guaranteed that the pde has a unique solution and that K is PSD. We cannot guarantee that if f(x,y) is not convex. If the convexity is not enforced at all iterations, I often see that K is rank-deficient, i.e. I cannot solve the pde.
Ah well. We've already discussed what I think about solving PDEs inside objective functions...
One question. Is the converse true? If you have an f that solves the PDE, will it necessarily be convex or polyconvex?
SA-W
SA-W on 16 Jun 2023
Edited: SA-W on 16 Jun 2023

Strengthened, I think you mean.

Yes.

Ah well. We've already discussed what I think about solving PDEs inside objective functions... One question. Is the converse true? If you have an f that solves the PDEs will necessarily be convex or polyconvex?

It is on my list for the future to implement the pde as a nonlinear constraint.

No. If f solves the pde, f is not necessarily convex/polyconvex. Does this help?

Also, was the information that I provided helpful for suggesting a different set of basis functions / a different approximation of f? You mentioned that there are also convex basis functions and semi-local approximation strategies.

You often have great ideas, I really hope also here once again :)

I could suggest things like shifted exponentials, e.g.
N_ij(x,y) = E_ij*exp(-b_ij*(x-i) - c_ij*(y-j)) % convex for E_ij>=0
% E_ij, b_ij, c_ij are unknown
but really I have no idea whether it would be any better than the traditional models for f(x,y) that you've already ruled out.
Is
f(x,y) = E_ij*exp(-b_ij*(x-i) - c_ij(y-j))
, i.e., is there only one N_ij and in total three parameters, E_ij, b_ij, c_ij, or would I have more of those N_ij's?
No, I was implying that you would have f(x,y) = sum_ij N_ij(x,y).
Possibly better:
N_ij(x,y) = 1/( a_ij*(x-i) + b_ij*(y-j) + c_ij )^2
If I have a 3x3 grid with nine support points, i and j run from one to three?
If so, there are 9*3=27 parameters. The E_ij represent still the nodal values at the support point z_ij = (x_i, y_j). But what is the meaning of the b_ij and c_ij? Would you determine them along with the E_ij from the optimization or more heuristically based on the "importance" of the support point z_ij?
If I have a 3x3 grid with nine support points, i and j run from one to three?
Yes.
But I don't like the shifted exponential model anymore. A shifted exponential is equivalent to a non-shifted exponential, E*exp(t-t0)=(E*exp(-t0))*exp(t), so I'm not sure it will be a very flexible basis.
That's why I suggested,
N_ij(x,y) = 1/( a_ij*(x-i) + b_ij*(y-j) + c_ij )^2
in my last comment. In this case, the nodal values will be E_ij=1/c_ij^2. The a_ij and b_ij are just additional optimization parameters to give the model some additional flexibility.
N_ij(x,y) = 1/( a_ij*(x-i) + b_ij*(y-j) + c_ij )^2
To have f(x,y) convex, are the bounds E_ij>=0 sufficient? Do I have to impose constraints on a_ij and b_ij as well?
You need to ensure that the denominator is > 0 for all x,y in the domain of f. However, because of the monotonicity of this choice for N_ij(x,y), it would be sufficient just to impose this on the boundary of the box [xmin, xmax] x [ymin, ymax].
Another possible choice for the basis is,
N_ij(x,y) = E_ij*log( 1 + exp(a_ij*(x-i)) + exp(b_ij*(y-j)) )/log(3)
This would be inherently positive and convex provided only that all E_ij>=0. The nodal values would be E_ij again.
You could embellish it still further
N_ij(x,y) = E_ij*log( 1 + A_ij*exp(a_ij*(x-i)) + B_ij*exp(b_ij*(y-j)) )/log(3)
but now you would need constraints A_ij>=0, B_ij>=0 as well...
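One cheap way to sanity-check the claimed convexity numerically (Python sketch; all parameter values below are arbitrary) is the midpoint inequality N((p+q)/2) <= (N(p)+N(q))/2, which every convex function satisfies:

```python
import math, random

def N(x, y, E=2.0, a=1.3, b=0.7, xi=1.0, yj=2.0):
    # softplus-style basis from the thread; should be convex for E >= 0
    return E * math.log(1 + math.exp(a*(x - xi)) + math.exp(b*(y - yj))) / math.log(3)

random.seed(0)
ok = True
for _ in range(1000):
    p = (random.uniform(-5, 5), random.uniform(-5, 5))
    q = (random.uniform(-5, 5), random.uniform(-5, 5))
    mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    # midpoint convexity, with a little floating-point slack
    ok &= N(*mid) <= (N(*p) + N(*q)) / 2 + 1e-12
print(ok)
```

This checks convexity only at sampled point pairs, of course; the analytical reason it holds is that log(1 + e^u + e^v) is a log-sum-exp, which is convex, composed with affine maps of (x,y).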
Interesting.
Are the basis you provided so far interpolating, i.e. is the sum of all N_ij(x,y) = 1 for all x,y in the domain of f?
Also I was wondering about the meaning of the terms (x-i) and (y-j) in the N_ij's. Say I evaluate N_ij at the support point (x_1,y_1),
N_11(x_1,y_1) = E_11*log( 1 + exp(a_11*(x_1-1)) + exp(b_11*(y_1-1)) )/log(3),
I would expect N_11 to be roughly E_11, but the terms (x_1-1) and (y_1-1) can be arbitrary.
Also I was wondering about the meaning of the terms (x-i) and (y-j) in the N_ij's. Say I evaluate N_ij at the support point (x_1,y_1).
For simplicity, I was assuming the nodal points were the integers. More generally, you would have,
N_ij(x,y) = E_ij*log( 1 + A_ij*exp(a_ij*(x-xi)) + B_ij*exp(b_ij*(y-yj)) )/log(3)
Are the basis you provided so far interpolating, i.e. is the sum of all N_ij(x,y) = 1 for all x,y in the domain of f?
No, and more generally you wouldn't have f(xi,yj) = E_ij. You would have to solve a potentially non-sparse linear system to find coefficients E_ij giving a certain set of f(xi,yj) values. That means that the number of basis functions you could have might have to be somewhat modest.

You would have to solve a potentially non-sparse linear system to find coefficients E_ij giving a certain set of f(xi,yj) values. That means that the number of basis functions you could have might have to be somewhat modest.

I thought the approximation f = sum N_ij implies that we have as many basis functions as there are nodal points. So what do you mean by the "number of basis functions" having to be modest?

You don't necessarily need to situate a basis function at every nodal point. You can situate them at every 10th or 15th point, for example. The greater the density of basis functions, the more flexible the range of f that you can span, but the computational expense will also increase.
SA-W
SA-W on 17 Jun 2023
Edited: SA-W on 17 Jun 2023
You don't necessarily need to situate a basis function at every nodal point.
Above is an example where the N_i are the local finite element basis functions that I use so far. The blue dots are the sampled points when I solve the pde. You can clearly see the correlation between x and y, which makes it nearly impossible to identify all E_i. Replacing my N_i with your proposed
N_ij(x,y) = E_ij*log( 1 + exp(a_ij*(x-i)) + exp(b_ij*(y-j)) )/log(3)
will also not remedy this issue I think.
But what I would now do is to situate three basis functions at the support points building the diagonal, i.e., f = sum_{i=1}^{3} N_i(x,y),
with
N_i(x,y) = E_i*log( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )/log(3)
In words, I would only situate a basis function at the most frequently sampled points (here, the three points along the diagonal). Do you think this makes sense? I think I would no longer have the problem that most of the E_i are not sampled when solving the pde.
It makes intuitive sense, but if there is a way you can construct a realistic case where f(x,y) is known, it would be wise to test whether these basis functions can produce a good surface fit to f(). I have nothing to support the notion that the basis,
N_i(x,y) = E_i*log( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )/log(3)
is a good model for f(), only that it has the convexity and positivity properties that are desired.
"It makes intuitive sense, but if there is a way you can construct a realistic case where f(x,y) is known, it would be wise to test whether these basis functions can produce a good surface fit to f()."
That is actually what I am doing right now: constructing an approximation f = \sum N_i E_i where the E_i are constructed from a known f*(x,y). Then I do a pure re-identification...
If I do now instead
f = \sum N_i
where
N_i = E_i*log( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )/log(3)
would you determine the E_i as f*(x_i,y_i) = E_i, also in the case that there is not an N_i situated at every support point?
f*(x_i,y_i) = E_i
This will never be true if you have more than one basis function.
Yes, my question was sloppily formulated:
Given f*(x_i,y_i) = E*_i from a "real" function f*(x,y),
how would you particularize E_i, a_i, and b_i to determine N_i (hence, f = sum N_i) to approximate f*?
E_i = E*_i, a_i = b_i = 1.0 ?
If this is a good approximation or not is, as you said, a different question.
Well, that's the optimization problem. The E_i, a_i, and b_i are the unknowns in the surface fit.
Yes, but what I currently do, and probably will for several more months, is re-identification; that is, I know the reference (exact) values from a known surface f*(x,y). Let's call the reference values E*_i, a*_i, b*_i.
With the interpolating, local finite element basis functions (f = sum N_i E_i), it is obvious to set E*_i = f*(x_i,y_i). But what would be your logic to determine the reference values in case of
N_i = E_i*log( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )/log(3) ?
As I said, you have to obtain them from a surface fit.
Yes, but what I currently do, and probably will for several more months, is re-identification; that is, I know the reference (exact) values from a known surface f*(x,y). Let's call the reference values E*_i, a*_i, b*_i.
I don't see how that should change my answer. You have a least squares cost function where E_i, a_i, and b_i are the unknowns and f* is known input data.
When you minimize that cost function, the solution will be E*_i, a*_i, b*_i.
I see -- solving a separate optimization problem is the trick to get the (best) reference values E*_i, a*_i, b*_i...
Why is your summation index i running over the number of parameters? I mean I could sample f* at 1000 points although I just have 9 parameters, for instance.
SA-W
SA-W on 19 Jun 2023
Edited: SA-W on 19 Jun 2023

Let me finally clarify a few points regarding

N_i(x,y) = E_i*log( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )/log(3)

1. The purpose of the factor 1/log(3) is to make sure that N_i(x_i,y_i)=E_i? Is that necessary at all, given that the basis is not interpolating?

2. What is the purpose of the log(...), i.e., why not simply

N_i(x,y) = E_i*( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )
The purpose of the factor 1/log(3) is to make sure that N_i(x_i,y_i)=E_i? Is that necessary at all, given that the basis is not interpolating?
No, it's not necessary.
2. What is the purpose of the log(...), i.e., why not simply
(1) That would make the basis function separable in x and y. Not sure that would be a good basis unless f(x,y) is also separable in x and y.
(2) Exponentials overflow to infinity and underflow to zero really fast. Incidentally, regarding
N_i(x,y) = E_i*log( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )
you will want to use log1p for small values of the operands, a_i*(x-x_i)<=33 and b_i*(y-y_i)<=33, and you will need to use an appropriate Taylor approximation when either of the exponential terms exp(a_i*(x-x_i)), exp(b_i*(y-y_i)) overflows.
SA-W
SA-W on 19 Jun 2023
Edited: SA-W on 19 Jun 2023

(1)That would make the basis function separable in x,y. Not sure that would be a good basis unless f(x,y) is also separable in x and y

Most of the f(x,y) that I am working on are indeed of the form f(x,y)=h(x)+g(y). Given that, you think

N_i(x,y) = E_i*log( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )

is appropriate?

you will want to use log1p for small values of the operands, a_i*(x-x_i)<=33 and b_i*(y-y_i)<=33, and you will need to use an appropriate Taylor approximation when either of the exponential terms exp(a_i*(x-x_i)), exp(b_i*(y-y_i)) overflows.

I think x_i, y_i, E_i <= 10 are reasonable upper bounds, so overflow should not be a matter of concern. Why in particular 33 (a_i*(x-x_i)<=33)? The examples in your link use log1p(x) for quite small values of x.

Most of the f(x,y) that I am working on are indeed of the form f(x,y)=h(x)+g(y).
In that case, the basis decomposition I would probably use is,
N_i(x) = C_i*log( 1 + exp(a_i*(x-x_i)) ) = C_i*softplus(a_i*(x-x_i))
M_j(y) = D_j*log( 1 + exp(b_j*(y-y_j)) ) = D_j*softplus(b_j*(y-y_j))
f(x,y) = sum_i N_i(x) + sum_j M_j(y)
Also, I happen to have numerically stable code for the softplus(z)=log(1+exp(z)) function that I can give you (below).
I think x_i,y_i,E_i<=10 are reasonable upper bounds, so overflow should not be a matter of concern.
I don't see why that removes the concern unless you can also bound the a_i and b_i values.
Why in particular 33 (a_i*(x-x_i)<=33)?
I believe z>=33 is the threshold past which f(z)=log(1+exp(z)) and f(z)=z can't be distinguished in double float precision.
function y=softplus(x)
%Accurate implementation of log(1+exp(x))
idx=x<=33;
y=x;
y(idx)=log1p( exp(x(idx)) );
end
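For reference, the same guard translated to Python/NumPy, with a quick demonstration that the naive formula overflows where the guarded one does not:

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x)): past ~33 the two expressions are
    indistinguishable in double precision, so just return x there."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    idx = x <= 33
    y[idx] = np.log1p(np.exp(x[idx]))
    return y

with np.errstate(over="ignore"):
    naive = np.log1p(np.exp(np.array([800.0])))   # exp overflows -> inf
print(naive, softplus([800.0, 0.0]))
```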
Thanks for the softplus function.
In that case, the basis decomposition I would probably use is,
I just plotted N_1(x) = 1*log( 1 + exp(a_i*(x-x_1))) = log( 1 + exp(a_i*(x-4)))
for a_i=0.1, a_i=1, and a_i=10. In either case, N_1 is quite linear, in particular beyond x_1 = 4.0. So the basis seems to be quite linear over the domain of f(x,y). Do you think this is as flexible a basis as you previously said?
I also mentioned the possibility of adding an additional parameter on the exponentials.
N=@(x) 10*log( 1 + A*exp(a*(x-5)));
With the right selection of A, it doesn't seem too bad
A=0.1;
avalues=linspace(-0.5,0.5,9);
for a=avalues
N=@(x) 10*log( 1 + A*exp(a*(x-5)));
hold on
fplot(N,[0,10])
hold off
end; legend("a="+avalues,'Location','north')
I also mentioned the possibility of adding an additional parameter on the exponentials.
Yes, but the additional parameters require additional constraints and, more importantly, increase the number of parameters. So it is probably a trade-off: if I do not situate a basis function at every support point, I have to introduce the additional parameters on the exponentials to make the basis more flexible. If I have a higher density of basis functions, I can probably set them to one. Does that make intuitive sense?


More Answers (1)

Most of the f(x,y) that I am working on are indeed of the form f(x,y)=h(x)+g(y).
If this is true, then ensuring the convexity of f(x,y) is the same as ensuring the convexity of h(x) and g(y) as 1D functions, which is much simpler. If you use 1D linear interpolating basis functions,
h=@(x) interp1([E(1),E(2),E(3),...,E(n)] ,x)
then you can ensure the convexity of h just by ensuring that the second-order finite differences are nonnegative, which is a simple linear constraint on the E(i),
E(i+1) - 2*E(i) + E(i-1) >= 0
No need for nonlinear constraints at all. Moreover, if you make the change of variables
E=cumsum([C2,cumsum([C1,D])]);
where D(j), C1, and C2 are the new set of unknowns, then the constraints on D required for convexity are simple positivity bounds D(j)>=0. As you'll recall, bound constraints can indeed be enforced at all iterations, which you said is what you wanted.
D=rand(1,20);
C1=-5;
C2=0;
E=cumsum([C2,cumsum([C1,D])]);
fplot( @(x) interp1( E,x ) ,[1, numel(E)] ); xlabel x; ylabel h(x)
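The same construction can be checked numerically in a few lines (NumPy used here for illustration): with D >= 0, the resulting E has nonnegative second differences (a convex sequence), and diff(E,2) inverts the transformation:

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.random(20)                 # D >= 0: the discrete "curvatures"
C1, C2 = -5.0, 0.0                 # the two constants of integration

# E = cumsum([C2, cumsum([C1, D])]) -- a discrete double integration of D
inner = np.cumsum(np.concatenate(([C1], D)))
E = np.cumsum(np.concatenate(([C2], inner)))

print(np.all(np.diff(E, 2) >= 0))          # E is a convex sequence
print(np.allclose(np.diff(E, 2), D))       # second differences recover D
```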

89 Comments

Even with cubic spline basis functions, which are not generally shape-preserving, I think you have a pretty good bet of ensuring convexity, which may be important for the smoothness of h and g:
D=rand(1,20);
C1=-5;
C2=0;
E=cumsum([C2,cumsum([C1,D])]);
fplot( @(x) interp1( E,x ,'spline') ,[1, numel(E)] ); xlabel x; ylabel h(x)
ensure the convexity of h just by ensuring that the second-order finite differences are nonnegative
This is what I am doing at the moment if f=f(x). But
E(i+1) - 2*E(i) + E(i-1) >= 0
assumes that the spacing between the support points is constant, right? Also, for the E(i) at the boundary, one can only impose constraints E(i-1)<=E(i).
E=cumsum(cumsum(D)+C);
Interesting. Why do you add a constant C here?
assumes that the spacing between the support points is constant, right?
Yes, there would be different weights on the E(i) if the spacing is not uniform, but it shouldn't invalidate the general approach.
Why do you add a constant C here?
When you double-integrate something, there are two constants of integration, so in theory we should have two constants,
E=cumsum([C2,cumsum([C1,D])]);
I was not sure that we needed C2, but now I think we do.
Also for the E(i) at the boundary, one can only impose constraints E(i-1)<=E(i).
If you make your E-grid a little bit larger than your f-grid, then that should not be a problem.
If you make your E-grid a little bit larger than your f-grid, then that should not be a problem.
How do you distinguish between the E-grid and the f-grid? If we have f=f(x) and there are n support points x_i, there are also n parameters E_i. If I have n+1 support points, there are n+1 parameters and I gained nothing.
When you double integrate something there are two constants of integration
This transformation is not clear to me yet. How does the inverse relation D(E) look like, i.e., how do I transform the "real" parameters into D?
If we have f=f(x) and there are n support points x_i, there are also n parameters E_i.
f(x) is a continuous function on some interval [fmin, fmax]. You have a parametric model where the s_i are known shifts which can be thought of as the "locations" of the basis functions. The function F(x) is also continuous and defined on some possibly larger domain [Fmin,Fmax]. You want these functions to agree at certain discrete sample points,
and to choose the unknown E_i to achieve this agreement as closely as possible.
You seem to be assuming that for every basis function location s_i there needs to be corresponding x_j=s_i, but there is no reason to have that assumption. There is no necessary relationship between the s_i and x_j at all other than that they should be chosen to ensure,
Additionally, there are certain mathematical simplifications and conveniences if you choose the s_i so that
how do I transform the "real" parameters into D?
The inverse of,
E=cumsum([C2,cumsum([C1,D])])
is
D=diff(E,2)
For example,
D=rand(1,8)
D = 1×8
0.1401 0.6240 0.2104 0.2853 0.7782 0.2184 0.6423 0.4381
C1=rand; C2=rand;
E=cumsum([C2,cumsum([C1,D])]);
D2=diff(E,2)
D2 = 1×8
0.1401 0.6240 0.2104 0.2853 0.7782 0.2184 0.6423 0.4381
SA-W
SA-W on 20 Jun 2023
Edited: SA-W on 20 Jun 2023
The inverse of,
E=cumsum([C2,cumsum([C1,D])])
is
D=diff(E,2)
Ah. But this relation between E and D only holds if E is constructed as E=cumsum([C2,cumsum([C1,D])]), i.e., there is a given D.
But in practice, D is not given but E (say, E is the start vector). So how would I implement this?
No, D=diff(E,2) is always valid.
To recover E from D, however, you would need to save the relevant C1 and C2. Example:
E=rand(1,5) %start with E
E = 1×5
0.1769 0.4293 0.3007 0.6656 0.9979
C1=E(2)-E(1); %transform to {C1,C2,D} space
C2=E(1);
D=diff(E,2);
Now, we recover E
Erec=cumsum([C2,cumsum([C1,D])]) %recovered E
Erec = 1×5
0.1769 0.4293 0.3007 0.6656 0.9979
Nice. Another strategy I can give a try. Before the transformation we had F(E) = sum_i N_i E_i. diff(E,2) has two fewer components than E, but with {C1,C2} there are again n components. But what does F(D) look like?
It looks convex as I showed in my graph above.
I meant what F(D) looks like in mathematical terms. Is F(D) = \sum_i N_i D_i ?
It would just be F = \sum_i N_i E_i again, with the E_i expressed through the new parameters via E=cumsum([C2,cumsum([C1,D])]).
Hm, sorry that I have to ask again. We start with the E parameters and transform them into {C1, C2, D} using D=diff(E,2). Then, we solve the optimization problem in the {C1, C2, D} space and finally transform back to the E parameter space using E=cumsum([C2,cumsum([C1,D])]).
Or does my pde solver still "see" the E parameters at every iteration, just expressed differently? Originally, we had E_i = E_i, now we have E_i=E_i(D,C1,C2).
now we have E_i=E_i(D,C1,C2).
Yes, that.
If my objective function originally returned gSim(E) and J(E), it now has to return (dJ/dE)*(dE/dZ) and (dgSim/dE)*(dE/dZ), where Z={D,C1,C2}.
In both cases dE/dZ is required with E=cumsum([C2,cumsum([C1,D])]). Correct? And is dE/dZ continously differentiable?
In both cases dE/dZ is required with E=cumsum([C2,cumsum([C1,D])]). Correct? And is dE/dZ continously differentiable?
Yes, and in fact because E=M*[D(:);C1;C2] for some constant matrix M, then dE/dZ will just be M.'
You can find M either analytically, or using func2mat in this FEX download,
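If you don't want the FEX download, M can also be built numerically by pushing the columns of an identity matrix through the transform, since the map Z=[D(:);C1;C2] -> E is linear. A sketch with an illustrative size n=5:

```matlab
n = 5;                                % number of E coefficients (illustrative)
% Transform Z = [D(:); C1; C2] -> E, matching E=cumsum([C2,cumsum([C1,D])])
tform = @(Z) cumsum([Z(n), cumsum([Z(n-1), Z(1:n-2).'])]).';
M = zeros(n);
I = eye(n);
for k = 1:n
    M(:,k) = tform(I(:,k));           % k-th column = image of k-th unit vector
end
% Because the transform is linear, E = M*Z exactly:
Z = [0.1; 0.4; 0.2; -1; 3];           % [D1;D2;D3;C1;C2]
E = M*Z;
err = norm(E - tform(Z));             % ~0
```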
Additionally, there are certain mathematical simplifications and conveniences if you choose the s_i so that
Let me briefly come back to this argument: what simplifications and conveniences did you mean that I may take advantage of?
Mainly, if your basis functions are simple linear interpolating kernels, the advantage would be that F(x_j)=E_j, i.e. you don't have to interpolate to obtain the values of F(x_j).
Mainly, if your basis functions are simple linear interpolating kernels, the advantage would be that F(x_j)=E_j, i.e. you don't have to interpolate to obtain the values of F(x_j).
Yes, but you said this "advantage" comes from choosing
Is it not true even if x_j = s_i, i.e., for each basis function location s_i we have F(x_j)=f(x_j)=E_j ?
Is it not true even if x_j = s_i,
Yes, that's a special case of .
Say, is defined on the domain and on the domain as indicated by the two dashed lines in the picture, are the linear interpolating kernels. Clearly, we have . This is the terminology you introduced some comments before.
You said, I can make the E-grid a little larger than the f-grid; probably, you meant what I sketched above. But how can that work with linear basis functions? Here, we have , but how would I determine and . We said is not defined there.
You said, I can make the E-grid a little larger than the f-grid
Yes, but I think it's moot at this point. I think we've determined that you can make the grids the same size, and the convexity conditions E(i+1)-2*E(i)+E(i-1)>=0 do not have to be satisfied at all i. It is sufficient for this to hold for i=2,...,n-1
But how can that work with linear basis functions? Here, we have , but how would I determine and . We said f(x) is not defined there.
One way would be to not use linear interpolating basis functions. We talked about basis functions like
E_i log(1+A_i*exp(a_i*(x-s_i))), i=1...n
which do not have a compact footprint. Instead, each basis function has a footprint over the entire domain of x. The value F(x0) at any x0 is therefore influenced by all n basis functions, so there is reason to think that all 3*n parameters could be estimated, assuming you have 3*n sample points f(x_j), wherever they happen to be located. Although the more spread out the x_j are, the better conditioned the estimation problem is likely to be.
Another approach would be to add a penalty function on the roughness of the coefficients E_i to your objective function, e.g.,
Penalty(x)= smallweight * sum_i abs(E(i)-E(i+1)) , i=1...n-1
Certainly if you have linear interpolating basis functions, we expect the E(i) to step smoothly from one i to the next, and the above penalty would enforce that, even without having any samples f(x_j) that provide information on a particular E_i.
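A sketch of what such a penalized residual could look like for lsqnonlin; dataResidual and smallweight are illustrative placeholders (lsqnonlin squares the entries, so diff(E) here plays the role of the abs-difference penalty above):

```matlab
% Placeholder for the data-fitting part, gsim(E) - gexp:
dataResidual = @(E) E(:) - [1; 1.2; 1.6; 2.2; 3.0];
smallweight = 1e-2;                   % penalty weight (to be tuned)

% Augmented residual: lsqnonlin minimizes the sum of squares of all
% entries, so the appended first differences discourage the E(i)
% from jumping between neighboring nodes.
resfun = @(E) [dataResidual(E); smallweight*diff(E(:))];

r = resfun(ones(5,1));                % 5 data terms + 4 penalty terms
% E = lsqnonlin(resfun, E0) would then fit with the roughness penalty.
```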

the convexity conditions E(i+1)-2*E(i)+E(i-1)>=0 do not have to be satisfied at all i. It is sufficient for this to hold for i=2,...,n-1

But if we just have E(n)>=E(n-1), convexity at the boundary cannot be guaranteed. Why do you think the constraints for i=2,...,n-1 are sufficient?

So there is reason to think that all 3*n parameters could be estimated, assuming you have 3*n sample points

So I should have at least 3n sample points to determine the 3n parameters? Did you mean that?

E(n)>=E(n-1)
Don't know where you got this. When you look at E(i+1)-2*E(i)+E(i-1)>=0 at i=n-1 you get
E(n)-2*E(n-1)+E(n-2)>=0
Since i only goes from 2 to n-1, there is no other constraint involving E(n) that I've proposed.
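In matrix form (a sketch, assuming n coefficients), the conditions E(i+1)-2*E(i)+E(i-1)>=0 for i=2,...,n-1 become the linear constraints A*E<=b that lsqnonlin/fmincon expect:

```matlab
n = 8;                                % number of coefficients (illustrative)
S = diff(eye(n), 2);                  % (n-2)-by-n second-difference operator
% Convexity is S*E >= 0; the solvers want A*E <= b, so negate:
A = -S;
b = zeros(n-2, 1);

E = ((1:n).').^2;                     % a discretely convex test vector
ok = all(A*E <= b);                   % true: constraints are satisfied
```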
So I should have at least 3n sample points to determine the 3n parameters? Did you mean that?
Yes. The number of equations must be greater than or equal to the number of unknowns. Although if somehow you had additional equality constraints on the unknown parameters that don't involve f(x_j), those would count as well.

There is no other constraint involving E(n) that I've proposed.

Indeed, you did not. But I thought E(n)>=E(n-1) would be necessary in addition to ensure convexity. Do I not need this constraint?

May it be that E(n)-E(n-1)>=0 along with E(n)-2E(n-1)+E(n-2)>=0 makes f non-convex?

But I thought E(n)>=E(n-1) would be necessary in addition to ensure convexity.
Definitely not. Here's a counter-example. E is linear and therefore convex, but it is strictly decreasing, so E(n)>=E(n-1) does not hold,
E=8:-1:1
E = 1×8
8 7 6 5 4 3 2 1
Makes sense.
So may it be that E(n)-E(n-1)>=0 along with E(n)-2E(n-1)+E(n-2)>=0 makes f non-convex?
No, because an upward sloping line,
E=1:8
is convex and satisfies E(n)-E(n-1)>0 along with E(n)-2E(n-1)+E(n-2)>=0.
So if I have a quadratic like function as you plotted it in this answer and I implement the four constraints
E(n)-E(n-1)>=0
E(n)-2E(n-1)+E(n-2)>=0
E(1)-E(2)>=0
E(3)-2E(2)+E(1)>=0
the constraints
E(n)-E(n-1)>=0
E(1)-E(2)>=0
would be redundant, but are not in contradiction to the two other constraints? That is, I could keep them in theory?
They are not redundant. They narrow the range of functions that you are modeling. It's just that they have nothing to do with convexity. You could and probably should keep them if you know the function you are trying to model satisfies those extra constraints.
I see.
I solve
N_i(x,y) = E_i*log( 1 + exp(a_i*(x-x_i)) + exp(b_i*(y-y_i)) )
with lsqnonlin.
Should this optimization problem have a unique minimum (one global minimum)? I.e., all start vectors converge to the same point {E,a,b}?
No, there's no obvious guarantee of that. For fixed a_i,b_i the function becomes convex quadratic with respect to E_i but that's all that can be said.
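The reason, in a 1-D simplification with made-up numbers: once the a_i (and locations s_i) are fixed, the basis values form a constant matrix Phi, the model is linear in E, and the least-squares Hessian 2*Phi'*Phi is positive semidefinite:

```matlab
a = [1, 2, 0.5];  s = [0, 1, 2];      % fixed shape parameters (illustrative)
x = linspace(-1, 3, 25).';            % sample locations
% Phi(j,i) = log(1 + exp(a(i)*(x(j)-s(i)))): constant once a,s are fixed
Phi = log(1 + exp(x*a - ones(25,1)*(a.*s)));
% gsim = Phi*E is linear in E, so ||Phi*E - gexp||^2 is convex quadratic;
% its Hessian is 2*Phi.'*Phi, positive semidefinite by construction:
H = 2*(Phi.'*Phi);
mineig = min(eig(H));                 % >= 0 up to round-off
```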
SA-W
SA-W on 22 Jun 2023
Edited: SA-W on 22 Jun 2023
So if I fixed a_i, b_i (which would only make sense for testing), there should be a unique minimum?
And, E_i >=0 are the only constraints, right?
There would be a unique minimum value. A convex quadratic function can have non-unique global minima if the Hessian is singular.
I have observed that the resnorm values are quite small (around 1e-3). Lsqnonlin stops then after 900 iterations (10 parameters, 25 equations). I have function and optimality tolerance equal to 1e-9.
Would you change those settings? I could increase the number of allowed iterations, but resnorm decreases very slowly after a first drop in the first two or three iterations.
I usually set all the tolerances to 1e-12. Whether resnorm of 1e-3 is to be considered small, I have no idea. It depends on the scale of f.
What do you mean by scale?
I mean if I want to approximate a non-convex function f with a convex function F, there would clearly be a large resnorm.
I mean, how big the f_i are. If they are all on the order of 1e-10, then a resnorm of 1e-3 may be quite large in comparison.
The f_i are between 1 and 2 in my case. So resnorm=1e-3 should be good right?
I would say so, but a co-plot of the surface fit along with f would make it more obvious.
fplot() only works for functions f(x). Is there no 2d equivalent?
So how would you plot F(x,y) = sum_i N_i E_i ? surf() does not seem appropriate, I think.
surf or fsurf would be appropriate.
Thank you for all the help, I do not wanna waste your time further!
There are a lot of things I can give a try, maybe I will come back to this chat here if I have follow-up questions.
Currently, I try to setup a c++ interface to matlab, see this question
In case you also have knowledge in that field, I would appreciate your help again!
I would appreciate your opinion regarding my new results when approximating f(x,y) = g(x) + h(y).
g(x) is the above plot and h(y) the plot below, each of which is approximated with five parameters and linear basis functions. If the objective function is
||g^sim(f(x,y)) - g^exp ||^2
"No noise" means that I solve the pde with the reference paramters (green curve) to obtain g^exp, and "1% noise" means that I add random noise to g^exp with a standard deviation of 1%*max(g^exp) to emulate real experimental data.
As you can see, in the noiseless case I have perfect fitting results, not so in the case with noise.
As I told you, x and y are somehow correlated at some intervals in the {xy} space, see the plot below where I obtained g^exp using a coupled approximation to f(x,y) with nine parameters.
Do you think the correlation between x and y is the reason why the identification fails for the noisy case above? Here, h(y)=sqrt(y) is concave in the example above, which is why I implemented E(i+1)-2*E(i)+E(i-1) <= 0, i=2,...,4 and E(4)<=E(5).
I mean, due to correlation, the columns of the Jacobian associated with x and y are nearly identical and I was sceptical that the decoupled approximation can work at all under those circumstances. But it clearly worked in the noiseless case.
Too much is hidden. I would guess that you used poorer initial guesses in the latter case.
The initial functions g_0(x) and h_0(y) were the same in both cases.
Too much is hidden. Maybe some noise sensitivity in your PDE solver. Make sure you're following the guidelines here:
What happens when you take the solution from the noise-free case and initialize with that?
I think I follow all the guidelines there (analytical Jacobian, bounds, no stochastic objective, returning NaN at failures, ...).
But you are right, the entire fitting procedure seems to be quite sensitive with respect to noise in the data.
The solution from the noise-free case is the exact solution that I am looking for. So initializing there does not change the initial values. If I neglect h(y), i.e., only have the five parameters associated with g(x), even the noisy case gives nearly perfect fitting results with a final sum of squares = 0.46. Interestingly, the sum of squares is also 0.46 in the case above where I have h(y) included and five parameters more.
So initializing there does not change the initial values.
What I meant was: what happens if you take the noise-free solution and use that to initialize the noisy estimation?
Sorry that I have to ask again: What do you mean by noise-free solution (g^exp ?) and noisy estimation (g^exp + noise ?) ?
Take the solution for the unknown parameters that you get when you run the problem with no-noise and use that as your initial guess x0 when you run with noise.
As I said, the solution in the noise-less case are the parameters that I used to calculate g^exp. The final sum of squares in that case is zero. So why would you pass this solution to the simulation with noisy data?
To see if the optimization pulls you strongly away from the ideal solution when there is only mild noise.
Yes, if I start from the reference solution, I also end up with the same false solution that I plotted.
Intriguing. But how well fitted is gexp in these cases? It would be good to see gsim co-plotted with gexp in the noisy case.
gsim and gexp (noisy) are a collection of displacement vectors at 10 different times (10*146=1460 -> gsim, gexp \in R^1460). Since the sum of squares is 0.46, there are of course differences visible between them. However, the displacements are in the range [0,25], so 0.46 is not that big in my opinion.
Then, it sounds like nothing is really wrong, although possibly gsim is an ill-conditioned function of f. I.e., lots of different f can give nearly the same gsim.
possibly gsim is an ill-conditioned function of f
It probably is. Do you think the correlation between x and y when approximating f(x,y) = h(x) + g(y) might be a problem that makes it difficult to reidentify the reference h(x) and g(y)?
Here, h(x) is overestimated and g(y) is underestimated, so their sum probably is somehow close to the correct f.
I have no way of knowing. Whatever process is responsible for the ill-conditioning it seems to be in the PDE solution step. Perhaps more boundary conditions or other constraints need to be applied.
Yes, I have to figure that out.
Let me please briefly address the following issue: for nearly all my initial guesses, I end up with exitflag=2, although I have StepTolerance=1e-12. What are typical reasons why the parameter updates become so small? All that I pass to fmincon/lsqnonlin are the Jacobian (lsqnonlin) or gradient (fmincon) and the components of the objective function (lsqnonlin) or the sum of squares (fmincon). Sometimes exitflag=2 is indeed the optimum I am looking for, sometimes not. This is a problem I ignored so far, but I think I should at least try to understand why I get this exitflag.
From the documentation, exitflag=2 means "Change in x is less than the specified tolerance, or Jacobian at x is undefined."
Jacobian at x is undefined. It would be easy for you to use assert() to trap the case where you have NaN's or inf in your Jacobian.
Change in x is less than the specified tolerance: If the objective is very flat at the current iteration, the optimizer can be fooled into thinking it is close to a minimum, even if it is not. This can result in it taking prematurely small steps or stopping outright. This can happen even if you get an exitflag=1. In the example below, the minimum at x=1 is reached with an error of more than 11%, which some might consider large:
opts=optimoptions('lsqnonlin','StepTolerance',1e-12);
[x,fval,res,exitflag]=lsqnonlin(@(x) (x-1).^4, 1.2,[],[],opts)
Local minimum found. Optimization completed because the size of the gradient is less than the value of the optimality tolerance.
x = 1.1125
fval = 2.5658e-08
res = 1.6018e-04
exitflag = 1
However, premature stopping doesn't seem to be the reason for inaccuracies in g and h. That's why I asked you to initialize using the ground truth parameters. We would expect ground truth to be fairly close to the minimum and the objective to be getting pretty flat in that general vicinity. Yet, the optimizer strongly pulls away from it.
I have very tight tolerances:
opts=optimoptions('lsqnonlin','StepTolerance',1e-12, 'FunctionTolerance', 1e-12, 'OptimalityTolerance', 1e-9);
With these tolerances, the error in your example decreases further to around 3%, which, of course, may be still considered large.
We would expect ground truth to be fairly close to the minimum and the objective to be getting pretty flat in that general vicinity. Yet, the optimizer strongly pulls away from it.
If you consider noise with 1%*max(gexp) standard deviation small, I agree that the minimum should be close to ground truth. But why would you expect the objective to be getting flat in that general vicinity? If I think of minimizing f(x)=x^2, the objective is far from being flat at the minimum.
Also, if my objective were really flat around ground truth, would this not explain the inaccuracies in g and h? If I think of a nearly flat valley, there are many parameter combinations giving the same objective function value.
With these tolerances, the error in your example decreases further to around 3%, which, of course, may be still considered large.
But remember also that we have no idea precisely what the Taylor expansion of your actual function looks like at the minimum, and whether your tolerances are appropriate to it. I can modify my example as below to get poor accuracy again, even with your actual tolerances.
opts=optimoptions('lsqnonlin','StepTolerance',1e-12, 'FunctionTolerance', 1e-12, 'OptimalityTolerance', 1e-9);
[x,fval,res]=lsqnonlin(@(x) (x-1).^6, 1.2,[],[],opts)
Local minimum found. Optimization completed because the size of the gradient is less than the value of the optimality tolerance.
x = 1.1157
fval = 5.7788e-12
res = 2.4039e-06
If I think of minimizing f(x)=x^2, the objective is far from being flat at the minimum.
When I say "flat", I mean that qualitatively speaking the gradients are getting small. Even in my example with (x-1)^4, the objective is not perfectly flat anywhere, but you can still see that it is flat enough to cause early termination.

I can modify my example as below to get poor accuracy again, even with your actual tolerances.

As you said, setting the tolerances to 1e-12 is probably a good choice. I think I read in a different chat that one should not go below 1e-14. Of course, as your example shows, one could construct cases where this still may not be sufficient.

Even in my example with (x-1)^4, the objective is not perfectly flat anywhere, but you can still see that it is flat enough to cause early termination.

The fact that the optimizer pulls away from ground truth, along with exitflag=2, suggests that my objective is "quite" flat around ground truth, right?

Also, if my objective were really flat around ground truth, would this not explain the inaccuracies in g and h? If I think of a nearly flat valley, there are many parameter combinations giving the same objective function value.
If your objective is nearly flat in a broad area around ground truth, it would explain why the function is sensitive to noise. The addition of noise can "tilt" the floor of the valley so that water runs downhill away from ground truth. We do know that once you added the 1% noise, you tilted the valley away from ground truth significantly, because the optimizer pulled strongly away from ground truth when initialized there.
One thing you could try is to add curvature penalty terms to your objective function. For lsqnonlin, this would be equivalent to extending your residual vector,
residual =[residual; smallnumber*[D_g;D_h]]
where D_g and D_h are the D parameters (from our earlier discussion) of g and h. This will discourage g and h from bending more than necessary.
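A sketch of the augmented residual inside the objective function, with the E-to-D transform done per call (the coefficient values and the zero data residual are placeholders):

```matlab
% E = [G(:); H(:)] holds the five coefficients of g and of h.
E = [1; 1.5; 2.3; 3.4; 4.8;  2; 1.4; 1.0; 0.8; 0.7];   % illustrative values
G = E(1:5);  H = E(6:10);
smallnumber = 1e-2;                   % curvature-penalty weight (to be tuned)

D_g = diff(G, 2);                     % "D" parameters (second differences) of g
D_h = diff(H, 2);                     % and of h

residual = zeros(25, 1);              % placeholder for gsim(E) - gexp
residual = [residual; smallnumber*[D_g; D_h]];   % append curvature terms
```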

where D_g and D_h are the D parameters (from our earlier discussion) of g and h

I discretized g and h with five parameters each. Do you mean D_g and D_h are the solution that I got in the noisy case? Would the associated Jacobian not be zero then if D_g and D_h are constant numbers?

I discretized g and h with five parameters each
If you implemented my original suggestion, the 5 parameters would be [C1,C2,D1,D2,D3] which parameterize the basis coefficients E(i) according to,
E=cumsum([C2,cumsum([C1,D])]);
Would the associated Jacobian not be zero then if D_g and D_h are const numbers?
In light of the above, you should see that they would not be constant. D_g and D_h each contain 3 of the 5 unknown parameters that are being iteratively estimated.
If you implemented my original suggestion, the 5 parameters would be [C1,C2,D1,D2,D3] which parameterize the basis coefficients E(i)
I am still working with the E(i) and implemented the constraints E(i+1)-2*E(i)+E(i-1)>=0.
So if g(x) = \sum_i N_i G_i and h(y) = \sum_i N_i H_i, my parameters are [G_1, G_2, G_3, G_4, G_5] and [H_1, H_2, H_3, H_4, H_5].
In light of this,
residual =[residual; smallnumber*[D_g;D_h]]
translates into
residual =[residual; smallnumber*[G_1; ... ;G_5; H_1;...H_5]].
The 10x10 Jacobian matrix associated with this additional block would be the identity matrix?
I am still working with the E(i) and implemented the constraints E(i+1)-2*E(i)+E(i-1)>=0.
How would you be able to do that? It does not enforce convexity at all iterations, which means your PDE solver would fail, creating unrecoverable problems. That was the original point of the question.
In any case, I really did intend the new residual terms be smallnumber*[D_g;D_h], so you should transform your E parameters to D space as we discussed earlier.
How would you be able to do that? It does not enforce convexity at all iterations,
I multiply the A matrix by 1e4, which leads to fulfillment of convexity at all iterations. I know that this is not good practice (you also pointed out practical problems that may occur); however, I have to stick to that framework to complete an article. Once this is done, I can explore new things and move to the transformed parameters in D space.
In any case, I really did intend the new residual terms be smallnumber*[D_g;D_h], so you should transform your E parameters to D space as we discussed earlier.
Can the additional residual components smallnumber*[D_g;D_h] not be written down if the parameters are in the E-space?
I multiply the A matrix by 1e4, which leads to a fulfillment of convexity at all iterations.
If so, that could be another reason why your StepTolerance stopping threshold is being met prematurely, resulting in exitflag=2.
Can the additional residual components smallnumber*[D_g;D_h] not be written down if the parameters are in the E-space?
Yes, if lsqnonlin is handing an E vector to your objective function code, you can transform from E to D within each call to your objective function code and use D for whatever computations are needed.
If so, that could be another reason why your StepTolerance stopping threshold is being met prematurely, resulting in exitflag=2.
Indeed, if I forego the linear constraints, which surprisingly works for quite a small number of initial guesses, I have exitflag=3 with FunctionTolerance=1e-12. Reducing it to, say, 1e-6 sometimes results in exitflag=1.
But why is the higher weighting of the constraints (1e4*A) a constellation that can lead to exitflag=2? I know that this somehow restricts the domain of the parameters. But exitflag=2 makes sense to me if I think of a nearly flat objective as discussed, but not if the search domain is restricted.
Yes, if lsqnonlin is handing an E vector to your objective function code, you can transform from E to D within each call to your objective function code and use D for whatever computations are needed.
But I mean if I do not make the transformation to D-space, but keep the E vector. In that case, what would the curvature penalty term look like (if possible)?
But why is the higher weighting of the constraints (1e4*A) a constellation that can lead to exitflag=2?
It's an algorithm-dependent thing, but basically if you force the algorithm to prioritize the constraints over progress in reducing the objective function, it may need to take much smaller steps than it otherwise would, and this can trigger the StepTolerance stopping criterion.
But I mean if I do not make the transformation to D-space, but keep the E vector. In that case, what would the curvature penalty term look like (if possible)?
It is not possible. I think you want me to tell you that you can do this:
residual =[residual; smallnumber*[diff(G,2);diff(H,2)]]
This is an equivalent way to implement my suggestion and, in your mind, this may be different from transforming to D-space. It is not, though.
residual =[residual; smallnumber*[diff(G,2);diff(H,2)]]
This is an equivalent way to implement my suggestion
I just realized that diff(G,2) and diff(H,2) represent
G(i+1)-2*G(i)+G(i-1)>=0, i=2,3,4
H(i+1)-2*H(i)+H(i-1)>=0, i=2,3,4
which is what I have currently implemented as linear constraints in A and b, the only difference is that instead of "smallnumber" I have "1e4".
So why do you think
"This will discourage g and h from bending more than necessary."
as a possible workaround to the "pulling away from ground truth in the noisy case" issue?
I just realized that diff(G,2) and diff(H,2) represent...
They represent the left hand side of these inequalities, yes.
which is what I have currently implemented as linear constraints in A and b
In A, yes. b is unrelated. And so if you wished, you could express the new residual terms as
residual =[residual; smallnumber*[A_g*G;A_h*H]]
where A_h and A_g are some appropriate subset of the rows of A. Also, for computing the Jacobian of these new terms, it would just be J=smallnumber*blkdiag(A_g,A_h).
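As a check on the appended block (illustrative sizes; A_g and A_h built here as plain second-difference operators), the new residual terms are linear in [G;H], so their Jacobian is the constant block-diagonal matrix:

```matlab
n = 5;                                % coefficients per function (illustrative)
A_g = diff(eye(n), 2);                % second-difference rows acting on G
A_h = diff(eye(n), 2);                % and on H
smallnumber = 1e-2;

E = [1; 2; 4; 7; 11;  3; 2; 1.5; 1.2; 1.1];  % E = [G;H], made-up values
G = E(1:n);  H = E(n+1:end);

r_new = smallnumber*[A_g*G; A_h*H];   % appended residual terms
J_new = smallnumber*blkdiag(A_g, A_h);% their (2n-4)-by-2n Jacobian
err = norm(r_new - J_new*E);          % ~0: r_new is linear with Jacobian J_new
```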
So why do you think "This will discourage g and h from bending more than necessary."
lsqnonlin is trying to make all the residuals as small as possible. By appending the second derivatives to the residuals, it will try to keep them small, which is the same as saying g and h will be inhibited from curving too much.
In A, yes. b is unrelated.
but the right side, b, is zero.
residual =[residual; smallnumber*[A_g*G;A_h*H]]
where A_h and A_g are some appropriate subset of the rows of A. Also, for computng the Jacobian of these new terms, it would just be J=[A_g;A_h].
So far, I have 1e4*A*E<=0, where A = blkdiag(A_g, A_h) and E=[G(:);H(:)]. Instead of passing these linear constraints to lsqnonlin, you want to attach them to the residual. Why do you think this is different and possibly better? I probably would have to set smallnumber=1e4 to make sure lsqnonlin pays attention to the additional residuals.
Instead of passing these linear constraints to lsqnonlin, you want to attach them to the residual.
No, you would still keep the constraints as they are. They are not related to what we are doing with the residuals.
The constraints are forcing the second derivatives to be non-negative.
Conversely, the residual terms are not being used to control whether the second derivatives are positive or negative in sign. Rather, they are trying to keep the second derivatives small in magnitude. Moreover, the algorithm will normally not push the residuals to zero precisely, so the second derivatives will merely be encouraged to be small in magnitude rather than required to have some exact value, as with constraints. Finally, the residuals can be prioritized by weighting them differently, unlike the constraints. By varying the penalty weight factor smallnumber, we can put different priority on those residuals as compared to the data-fitting residual terms gsim-gexp.

Makes totally sense what you say. One point:

so the second derivatives will merely be encouraged to be small in magnitude rather required to have some exact value, as with constraints.

My motivation to multiply A by 1e4 was that some components of 1e3*A*E were still greater than zero, so I increased the exponent by one. What I want to say is that with the factor c in c*A*E, I influence the sign and magnitude simultaneously, right?

And is it really the case that this factor somehow sets the derivatives to an exact value (...required to have some exact value, as with constraints)? I mean, is there not also a merit function internally whose goal is to decrease the constraint violations?

What I want to say is that with c, c*A*E, I do influence the sign and magnitude simultaneously, right?
It's hard to know what's really happening, since I don't know which of lsqnonlin's algorithms you are running. It appears that you are just amplifying the weight of the constraint violations, forcing the update steps to give more consideration to the constraints than to the residuals.
And is it really that this factor somehow sets the derivatives to an exact value
No. But if your constraints are not met within some tolerance, lsqnonlin won't stop iterating or issue exitflag=1. Conversely, lsqnonlin doesn't apply any stopping tolerance thresholds on the magnitude of the residuals that is achieved.

It's hard to know what's really happening, since I don't know which of lsqnonlin's algorithms you are running.

I use lsqnonlin's interior-point algorithm which, if I am not mistaken, calls fmincon's interior-point algorithm internally. Given that, do you know what the scaling of the A matrix affects qualitatively? Just magnitude or sign, or both?

do you know what the scaling of the A matrix affects qualitatively? Just magnitude or sign, or both?
I'm sure it doesn't change the constraints that the solver is trying to satisfy at all. It just changes the path taken to the solution.
I see. I will append the residual vector by the constraints once my solver is available and report again then.
I did not fully understand your comment,
It's an algorithm-dependent thing, but basically if you force the algorithm to prioritize the constraints over progress in reducing the objective function, it may need to take much smaller steps than it otherwise would, and this can trigger the StepTolerance stopping criterion.
regarding my question
But why is the higher weighting of the constraints (1e4*A) a constellation that can lead to exitflag=2
Say the algorithm takes some smaller steps to fulfill the constraints (due to 1e4), which are currently violated. Once they are satisfied, the algorithm can devote itself again to reducing the objective function and may take larger steps again. What I want to say is that, just to fulfill the constraints, a step tolerance of 1e-12 should not be triggered.
Most of the time when I have exitflag=2, the following happens: the algorithm takes some regular steps, but from a certain iteration on, the sum of squares remains static (far away from ground truth) and the parameters change only in digits far behind the decimal point:
Iteration 50: resnorm=234, step size = 1e-3,
Iteration 51: resnorm=234, step size = 1e-4,
...
Iteration 58: resnorm=234, step size = 1e-11,
--> step tolerance is passed, algorithm terminates
As I said, this happens at points far away from the minimum. Do you think this can also be traced back to the weighting of A? If so, this is not tangible to me at the moment: if constraints at iteration 50 were violated, let the algorithm do some iterations with maybe smaller steps, but then proceed again with a bigger step size.
Maybe the best thing is to consider the main update step that the interior-point algorithm uses. This is described in Equation 38 of this doc page.
The Jg matrix is the Jacobian of the inequality constraints, which in this case, because the constraints are linear, is constant. If you make this constant matrix too large, the contributions to the equation coefficients from the objective function gradient and Hessian are very small in comparison. In this case, your steps will always be guided more by the constraints than by the objective.

Therefore, your steps will always be guided more by the constraints than the objective.

This makes sense to me qualitatively, but I do not see why it translates into steps small enough to trigger the step tolerance?

Well, my intuition was that if the update step is (approximately) indifferent to the objective function, then you can imagine it may take steps as if the objective function were some trivial flat value with gradient zero everywhere. In other words, it would have no reason to take large steps once the constraints were satisfied. But that was just my intuition...
I cannot come up with an example to show how scaling up the constraints can force an exitflag=2, but I can show one (below) which demonstrates that scaling the constraints can slow convergence when the objective is ill-conditioned:
load data
opts=optimoptions('fmincon','Algorithm','interior-point','Display','none',...
'MaxFunEvals',inf);
%%%Optimization 1
[E1,f1,exitflag1,out1]=fmincon(f,E0,A,b,[],[],[],[],[],opts);
%%%%Optimization 2: scaled constraints
s=1e4;
opts.MaxIterations=out1.iterations;
[E2,f2,exitflag2]=fmincon(f,E0,s*A,s*b,[],[],[],[],[],opts);
f1,f2
f1 = 7.2114
f2 = 6.3699e+04
Well, my intuition was that if the update step is (approximately) indifferent to the objective function, then you can imagine it may take steps as if the objective function were some trivial flat value with gradient zero everywhere. In other words, it would have no reason to take large steps once the constraints were satisfied. But that was just my intuition...
Thanks for the demo. Well, sounds plausible to me. I just varied the scale factor of the constraints between 1e0 and 1e4 for an initial guess that was successful with 1e4. In each case, the optimizer returned exitflag=2 and sometimes the optimized parameters were wrong. Long story short, one would have to narrow that down much more to better understand what's going on.
What I described before, namely that the optimizer does not reduce the sum of squares from a certain iteration on, but decreases the step size even further like
Iteration 50: resnorm=234, step size = 1e-3,
Iteration 51: resnorm=234, step size = 1e-4,
...
Iteration 58: resnorm=234, step size = 1e-11,
do you have an intuition for what might be the problem here? As I said, this happens at points far from the ground truth. Could it be that the optimizer reached a region in the parameter space where the objective is flat, and the scaling of the constraints amplifies this even more?
I'm not convinced that resnorm is not changing, but maybe it's only doing so to decimal places that are beyond the precision of the display.
Or maybe you've landed in a region where the resnorm is unchanging, but the constraints have not yet been satisfied. So, more iterations are necessary.
SA-W
SA-W on 29 Jun 2023
Edited: SA-W on 29 Jun 2023
I'm not convinced that resnorm is not changing, but maybe it's only doing so to decimal places that are beyond the precision of the display.
Yes, resnorm still changes but at the fifth/sixth digit after the decimal point only.
Or maybe you've landed in a region where the resnorm is unchanging, but the constraints have not yet been satisfied.
What could be a region where the resnorm is unchanging? One where, say, one parameter increases the objective value and another parameter cancels this out again?
So, more iterations are necessary.
But in those iterations (to satisfy the constraints), I would expect the parameters to change too, whereas they change only at the 7th, 8th, 9th digit ... after the decimal point.
What could be a region where the resnorm is unchanging?
For example,
fmincon(@(x) 0 ,[5,1],[1,1],1,[],[],[0,0],[],[],optimoptions('fmincon','Display','iter'))
                                            First-order      Norm of
 Iter F-count            f(x)  Feasibility   optimality         step
    0       3    0.000000e+00    5.000e+00    0.000e+00
    1       7    0.000000e+00    4.829e+00    0.000e+00    1.211e-01
    2      10    0.000000e+00    2.691e+00    0.000e+00    1.527e+00
    3      13    0.000000e+00    0.000e+00    0.000e+00    2.934e+00
Local minimum found that satisfies the constraints.
Optimization completed because the objective function is non-decreasing in feasible directions, to within the value of the optimality tolerance, and constraints are satisfied to within the value of the constraint tolerance.
ans = 1×2
0.7572 0.1630
which however happens only at the 7th, 8th, 9th digit ... after the decimal point.
Why expect larger changes? You have a StepTolerance of 1e-12.
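As a practical follow-up (a sketch, with hypothetical tolerance values): if parameter changes beyond roughly 1e-6 are not meaningful for this problem, raising StepTolerance from the 1e-12 used in the thread makes the solver terminate as soon as steps shrink below that level, instead of grinding through many more tiny iterations:

```matlab
% Looser step tolerance than the 1e-12 used above, so the run stops
% once steps become smaller than is meaningful for the problem.
opts = optimoptions('fmincon','Algorithm','interior-point', ...
    'StepTolerance',1e-6, ...   % hypothetical value; tune to your problem
    'Display','iter');
% [E,fval,exitflag] = fmincon(f,E0,A,b,[],[],[],[],[],opts);
```

This does not fix the underlying scaling issue, but it distinguishes "converged" from "stalled in tiny steps" more cheaply.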


Asked: on 15 Jun 2023
Commented: on 30 Jun 2023