Review
Recall that the general form of the optimization problem is
$$\min_{x\in\R^n} f(x)\quad\text{subject to}\quad c_i(x)=0,\ i\in\mathcal{E};\qquad c_i(x)\ge 0,\ i\in\mathcal{I}.$$
In unconstrained optimization, we minimize an objective function that depends on real variables, with no restrictions at all on the values of these variables. The mathematical formulation is
$$\min_{x\in\R^n} f(x),$$
where $\ x\in\R^n\ $ is a real vector with $\ n\ge 1\ $ components and $\ f:\R^n\rightarrow\R\ $ is a smooth function.
What Is A Solution
global minimizer
We say a point $\ x^{*}\ $ is a global minimizer if
$$f(x^{*})\le f(x)\quad\text{for all}\ x\in\R^n.$$
local minimizer
A local minimizer, by contrast, achieves the smallest value of f only within its neighborhood. Formally, we say a point $\ x^{*}\ $ is a local minimizer if there is a neighborhood $\ \mathcal{N}\ $ of $\ x^{*}\ $ such that
$$f(x^{*})\le f(x)\quad\text{for all}\ x\in\mathcal{N}.$$
A point that satisfies this definition is sometimes called a weak local minimizer. A strict local minimizer is then the outright winner in its neighborhood. Formally, a point $\ x^{*}\ $ is a strict local minimizer if there is a neighborhood $\ \mathcal{N}\ $ of $\ x^{*}\ $ such that
$$f(x^{*})\lt f(x)\quad\text{for all}\ x\in\mathcal{N}\ \text{with}\ x\neq x^{*}.$$
The last type of minimizer is the isolated minimizer: a point $\ x^{*}\ $ for which there is a neighborhood $\ \mathcal{N}\ $ of $\ x^{*}\ $ such that $\ x^{*}\ $ is the only local minimizer in $\ \mathcal{N}$.
All isolated local minimizers are strict, but the converse does not hold.
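A classical one-dimensional example of a strict but non-isolated minimizer, from Nocedal and Wright, is the function
$$f(x)=x^4\cos(1/x)+2x^4,\qquad f(0)=0,$$
which is twice continuously differentiable and has a strict local minimizer at $\ x^{*}=0$; yet it also has strict local minimizers at infinitely many nearby points $\ x_j$, with $\ x_j\rightarrow 0\ $ as $\ j\rightarrow\infty$, so $\ x^{*}=0\ $ is not isolated.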
Recognizing A Local Minimum
It might seem that we have to check every point to make sure none of them attains a smaller function value than $\ f(x^{*})$. Fortunately, if the objective function is twice continuously differentiable, we may be able to tell whether $\ x^{*}\ $ is a local minimizer by examining just the gradient $\ \triangledown f(x^{*})\ $ and the Hessian matrix $\ \triangledown^2 f(x^{*})$.
Theorem (Taylor’s Theorem)
Suppose that $\ f:\R^n\rightarrow\R\ $ is continuously differentiable and that $p \in \R^n$. Then we have that
$$f(x+p)=f(x)+\triangledown f(x+tp)^{T}p$$
for some $t\in (0,1)$. Moreover, if f is twice continuously differentiable, we have that
$$\triangledown f(x+p)=\triangledown f(x)+\int_{0}^{1}\triangledown^{2}f(x+tp)\,p\ dt,$$
and that
$$f(x+p)=f(x)+\triangledown f(x)^{T}p+\frac{1}{2}p^{T}\triangledown^{2}f(x+tp)\,p$$
for some $t\in (0,1)$.
Theorem (First-Order Necessary Conditions)
If $\ x^{*}\ $ is a local minimizer and f is continuously differentiable in an open neighborhood of $\ x^{*}$, then $\ \triangledown f(x^{*})=0$.
Theorem (Second-Order Necessary Conditions)
If $\ x^{*}\ $ is a local minimizer of f and $\ \triangledown^2f\ $ exists and is continuous in an open neighborhood of $\ x^{*}$, then $\ \triangledown f(x^{*})=0\ $ and $\ \triangledown^2 f(x^{*})\ $ is positive semidefinite.
Theorem (Second-Order Sufficient Conditions)
Suppose that $\ \triangledown^2f\ $ is continuous in an open neighborhood of $\ x^{*}$, and that $\ \triangledown f(x^{*})=0\ $ and $\ \triangledown^2 f(x^{*})\ $ is positive definite. Then $\ x^{*}\ $ is a strict local minimizer of f.
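The sufficient conditions can be verified numerically. The sketch below, using a hypothetical quadratic objective $f(x) = (x_1-1)^2 + 4x_2^2$, checks that the gradient vanishes and that the Hessian is positive definite (all eigenvalues strictly positive) at a candidate point:

```python
import numpy as np

# Hypothetical example: f(x) = (x1 - 1)^2 + 4*x2^2, minimized at (1, 0).

def f(x):
    return (x[0] - 1.0) ** 2 + 4.0 * x[1] ** 2

def grad(x):
    return np.array([2.0 * (x[0] - 1.0), 8.0 * x[1]])

def hess(x):
    return np.array([[2.0, 0.0],
                     [0.0, 8.0]])

x_star = np.array([1.0, 0.0])

# First-order necessary condition: the gradient vanishes at x_star.
print(np.allclose(grad(x_star), 0.0))

# Second-order sufficient condition: the Hessian is positive definite,
# i.e. its smallest eigenvalue is strictly positive.
print(np.linalg.eigvalsh(hess(x_star)).min() > 0)
```

Together these confirm (up to floating-point tolerance) that the candidate point is a strict local minimizer.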
Theorem (Local Minimizers of Convex Functions Are Global)
When f is convex, any local minimizer $\ x^{*}\ $ is a global minimizer of f. If in addition f is differentiable, then any stationary point $\ x^{*}\ $ is a global minimizer of f.
Overview of Algorithm
Our idea is simple. Beginning at a starting point $\ x_0$, optimization algorithms generate a sequence of iterates $\ \{x_k\}_{k=0}^{\infty}\ $ that terminates when either no more progress can be made or a solution point appears to have been approximated by the current point $\ x_k\ $ with sufficient accuracy.
Intuitively, when we generate such a sequence of iterates, we hope that the corresponding sequence of function values $\ \{f(x_k)\}_{k=0}^{\infty}\ $ decreases (for a minimization problem; it should increase for a maximization problem), or at least that the objective value decreases after every m steps of iteration, that is, $\ f(x_k)\lt f(x_{k-m})\ $ for some integer $\ m>0$.
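A minimal sketch of such an iterate sequence is steepest descent with a fixed step length on a hypothetical quadratic $f(x) = x_1^2 + 10x_2^2$; here the function values decrease monotonically:

```python
import numpy as np

# Hypothetical example objective and its gradient.
def f(x):
    return x[0] ** 2 + 10.0 * x[1] ** 2

def grad(x):
    return np.array([2.0 * x[0], 20.0 * x[1]])

x = np.array([5.0, 1.0])   # starting point x_0
alpha = 0.05               # fixed step length, small enough for this Hessian
values = [f(x)]
for k in range(50):
    x = x - alpha * grad(x)   # x_{k+1} = x_k - alpha * grad f(x_k)
    values.append(f(x))

# For this step length the sequence {f(x_k)} is strictly decreasing.
print(all(a > b for a, b in zip(values, values[1:])))
```

A fixed step length only works when it is small relative to the curvature; the line search strategy below chooses the step length adaptively instead.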
There are two fundamental strategies for moving from the current point $\ x_k\ $ to a new iterate $\ x_{k+1}$. Most of the algorithms described in this book follow one of these approaches.
two strategies: line search and trust region
line search
In the line search strategy, the algorithm chooses a direction $\ p_k\ $ and searches along this direction from the current iterate $\ x_k\ $ for a new iterate with a lower function value. The distance to move along $\ p_k\ $ can be found by approximately solving the following one-dimensional minimization problem to find a step length $\ \alpha$:
$$\min_{\alpha>0} f(x_k+\alpha p_k).$$
By solving the problem above exactly, we would derive the maximum benefit from the direction $\ p_k$, but an exact minimization may be expensive. Instead, the line search algorithm generates a limited number of trial step lengths until it finds one that loosely approximates the minimum of the problem above.
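One common inexact line search of this kind is backtracking under an Armijo (sufficient-decrease) condition; the sketch below is illustrative, with hypothetical function names and constants:

```python
import numpy as np

def backtracking(f, grad_fk, xk, pk, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink the trial step until the Armijo sufficient-decrease
    condition f(xk + a*pk) <= f(xk) + c*a*(grad_fk^T pk) holds."""
    alpha = alpha0
    fk = f(xk)
    slope = grad_fk @ pk          # directional derivative; negative for descent
    while f(xk + alpha * pk) > fk + c * alpha * slope:
        alpha *= rho              # reduce the trial step length
    return alpha

# Hypothetical example: f(x) = x1^2 + 10*x2^2, steepest-descent direction.
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
g = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])

xk = np.array([1.0, 1.0])
pk = -g(xk)                       # a descent direction
alpha = backtracking(f, g(xk), xk, pk)
print(f(xk + alpha * pk) < f(xk))  # the accepted step decreases f
```

Only a handful of trial step lengths are generated, which matches the point above: the accepted step loosely approximates the one-dimensional minimizer rather than solving for it exactly.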
trust region
In the second algorithmic strategy, known as trust region, the information gathered about f is used to construct a model function $\ m_k\ $ whose behavior near the current point $\ x_k\ $ is similar to that of the actual objective function f. Because the model $\ m_k\ $ may not be a good approximation of f when x is far from $\ x_k$, we restrict the search for a minimizer of $\ m_k\ $ to some region around $\ x_k$. In other words, we find the candidate step p by approximately solving the following subproblem:
$$\min_{p}\ m_k(x_k+p),\quad\text{where}\ x_k+p\ \text{lies inside the trust region}.$$
If the candidate solution does not produce a sufficient decrease in f, we conclude that the trust region is too large, so we shrink it and re-solve the problem.
The model $\ m_k\ $ in the problem above is usually defined to be a quadratic function of the form
$$m_k(x_k+p)=f_k+p^{T}\triangledown f_k+\frac{1}{2}p^{T}B_kp,$$
where $\ f_k$, $\triangledown f_k$, and $B_k\ $ are a scalar, a vector, and a matrix, respectively. The matrix $\ B_k\ $ is either the Hessian $\ \triangledown^2f_k\ $ or some approximation to it.
Reference
Nocedal, J., & Wright, S. J. *Numerical Optimization* (2nd ed.). Springer.