Convergence Rates for Hestenes’ Gram–Schmidt Conjugate Direction Method without Derivatives in Numerical Optimization
Stein, Ivie, Jr.; Raihen, Md Nurul
2023-03-24
Article

Convergence Rates for Hestenes’ Gram–Schmidt Conjugate Direction Method without Derivatives in Numerical Optimization

Ivie Stein, Jr. 1 and Md Nurul Raihen 2,*

1 Department of Mathematics and Statistics, The University of Toledo, Toledo, OH 43606, USA
2 Department of Mathematics and Statistics, Stephen F. Austin State University, Nacogdoches, TX 75962, USA
* Correspondence: nurul.raihen@sfasu.edu; Tel.: +1-313-378-7353

Abstract: In this work, we studied convergence rates, using the quotient convergence factors and root convergence factors described by Ortega and Rheinboldt, for Hestenes’ Gram–Schmidt conjugate direction method without derivatives. We performed computations in order to compare this conjugate direction method, for minimizing a nonquadratic function f, with Newton’s method, for solving ∇f = 0. Our primary purpose was to implement Hestenes’ CGS method with no derivatives and determine convergence rates.

Keywords: conjugate direction method; Gram–Schmidt conjugate process without derivatives; convergence rates; numerical optimization

Citation: Stein, I., Jr.; Raihen, M.N. Convergence Rates for Hestenes’ Gram–Schmidt Conjugate Direction Method without Derivatives in Numerical Optimization. AppliedMath 2022, 3, 268–285. https://doi.org/10.3390/appliedmath3020015

Academic Editor: Alexander Ayriyan

Received: 27 December 2022; Revised: 21 February 2023; Accepted: 27 February 2023; Published: 24 March 2023

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

The conjugate gradient (CG) and conjugate direction (CD) methods have been extended to the optimization of nonquadratic functions by several authors. Fletcher and Reeves [1] gave a direct extension of the conjugate gradient (CG) method. An approach to conjugate direction (CD) methods using only function values was developed by Powell [2]. Davidon [3] developed a variable metric algorithm, which was later modified by Fletcher and Powell [4]. According to Davidon [3], variable metric methods are considered to be very effective techniques for optimizing a nonquadratic function.

In 1952, Hestenes and Stiefel [5] developed conjugate direction (CD) methods for minimizing a quadratic function defined on a finite-dimensional space. One of their objectives was to find efficient computational methods for solving large systems of linear equations. In 1964, Fletcher and Reeves [1] extended the conjugate gradient (CG) method of Hestenes and Stiefel [5] to nonquadratic functions. The method presented here is related to those described by G.S. Smith [6], M.J.D. Powell [2] and W.I. Zangwill [7]. The method of Smith is also described by Fletcher [8] (pp. 9–10), Brent [9] (p. 124) and Hestenes [10] (p. 210). In addition, Nocedal [11] explored the possibility of nonlinear conjugate gradient methods converging without restarts and with the use of practical line searches. In the field of numerical optimization, a number of additional authors, including Kelley [12] and Zang and Li [13], among others, investigated a wide range of approaches to the use of conjugate direction methods.

The primary purpose of this work is to implement Hestenes’ Gram–Schmidt conjugate direction method without derivatives, which uses function values with no line searches. We will refer to this method as the GSCD method; Hestenes refers to it as the CGS method. We illustrate the procedure numerically, computing asymptotic constants and the quotient convergence factors of Ortega and Rheinboldt [14]. In reference to Hestenes [10] (p. 202), where he states that the CGS method has Newton’s algorithm as its limit, Russak [15] shows that n-step superlinear convergence is possible. We verify numerically that the GSCD procedure converges quadratically under appropriate conditions.
As for notation, we use capital letters, such as A, B, C, ..., to denote matrices and lower case letters, such as a, b, c, ..., for scalars. A^T denotes the transpose of the matrix A. If F is a real-valued differentiable function of n real variables, we denote its gradient at x by F′(x) and the Hessian of F at x by F″(x). We use subscripts to distinguish vectors and superscripts to denote components when these distinctions are made together, for example, x_k = (x_k^1, ..., x_k^n).

The method of steepest descent is due to Cauchy [16]. It is one of the oldest and most obvious ways to find a minimum point of a function f. There are two versions of steepest descent. The one due to Cauchy, which we call an iterative method, uses line searches; another, described by Eells [17] (Equation (10), p. 783), uses a differential equation of steepest descent. In Section 4.5 we describe another version of the differential equation of steepest descent. However, numerically, both have flaws. The iterative method is generally quite slow, as shown by Rosenbrock [18] with his banana valley function. Newton’s method applied to ∇f = 0, where f is the function to be minimized, is another approach for finding a minimum of the function f. Newton’s method has rapid convergence, but it is costly because of derivative evaluations. Hestenes’ CGS method without derivatives [10] (p. 202) has Newton’s method as its limit as s → 0.

If the minimizing function is strongly convex quadratic and the line search is exact, then, in theory, all choices for the search direction in standard conjugate gradient algorithms are equivalent. However, for nonquadratic functions, each choice of the search direction leads to standard conjugate gradient algorithms with very different performances [19].
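The contrast drawn above, a slow iterative steepest descent versus a rapidly convergent but derivative-hungry Newton iteration, can be illustrated numerically. The sketch below (all function and variable names are ours, not the authors’ program) runs both methods on Rosenbrock’s banana valley function from the standard starting point (−1.2, 1):

```python
import numpy as np

# Rosenbrock's banana valley function with analytic gradient and Hessian.
def f(v):
    x, y = v
    return 100.0 * (y - x**2) ** 2 + (1.0 - x) ** 2

def grad(v):
    x, y = v
    return np.array([-400.0 * x * (y - x**2) + 2.0 * (x - 1.0),
                     200.0 * (y - x**2)])

def hess(v):
    x, y = v
    return np.array([[1200.0 * x**2 - 400.0 * y + 2.0, -400.0 * x],
                     [-400.0 * x, 200.0]])

x0 = np.array([-1.2, 1.0])

# Newton's method applied to grad f = 0: rapid convergence.
xn = x0.copy()
newton_iters = 0
while np.linalg.norm(grad(xn)) > 1e-10 and newton_iters < 50:
    xn = xn - np.linalg.solve(hess(xn), grad(xn))
    newton_iters += 1

# Steepest descent with a backtracking (Armijo) line search,
# given the same iteration budget: very slow in the valley.
xs = x0.copy()
for _ in range(newton_iters):
    g = grad(xs)
    a = 1.0
    while f(xs - a * g) > f(xs) - 1e-4 * a * (g @ g):
        a *= 0.5
    xs = xs - a * g
```

In a typical run, Newton’s method lands on the minimizer (1, 1) in well under fifty iterations, while steepest descent with the same iteration budget is still far up the valley.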
In this article, we investigate quotient convergence factors and root convergence factors. We computationally compare the conjugate Gram–Schmidt direction method with Newton’s method. There are other types of convergence for the conjugate gradient, the conjugate direction, the gradient method, Newton’s method and the steepest descent method, such as superlinear convergence [20–22], Wall [23] root convergence and Ostrowski convergence factors [24], but, for the purposes of this research, quotient convergence is the most appropriate for establishing quadratic convergence.

In this article, the well-known conjugate direction algorithm for minimizing a quadratic function is modified to become an algorithm for minimizing a nonquadratic function, in the manner described in Section 2. The algorithm uses the gradient estimates and Hessian matrix estimates described in Section 3. In Section 4, a test example for minimizing a nonquadratic function by the developed conjugate direction algorithm without derivatives is analyzed. The advantage of this approach compared to Newton’s method is efficiency. The proposed approach is justified in sufficient detail. The results obtained are of theoretical and practical interest for specialists in the theory and methods of optimization.

2. Methodology

In this section, we present a class of CD algorithms for minimizing functions defined on an n-dimensional space. The reader is directed to Stein [25] and Hestenes [10] (pp. 135–137 and pp. 199–202, respectively) for more details.

2.1. The Method of CD

Let A be a positive definite real symmetric n × n matrix, let k be a constant n-vector and let c be a fixed real scalar. Throughout this section, F denotes the function defined on Euclidean n-space E_n by

F(x) = (1/2) x^T A x − k^T x + c,   (1)

where x is in E_n.

Suppose 1 ≤ m ≤ n. Let S_m be the linear subspace spanned by the set {p_1, ..., p_m} of m linearly independent and, hence, nonzero vectors. Let x_1 be any vector in E_n.
Then, the m-dimensional plane P_m through x_1 obtained by translating the subspace S_m is defined by

P_m = { x : x = x_1 + a_1 p_1 + ... + a_m p_m, a_i ∈ R (i = 1, ..., m) }.   (2)

Two vectors, p and q, in E_n are said to be A-orthogonal or conjugate if p^T A q = 0. A set {p_1, ..., p_m} of nonzero vectors in E_n is said to be mutually A-orthogonal or mutually conjugate if

p_i^T A p_j = 0 for i ≠ j (i, j = 1, ..., m).

Theorem 1 ([25]). Let S_m be a subspace of E_n, where {p_1, ..., p_m} is a basis for S_m, 1 ≤ m ≤ n. Further assume that p_1, ..., p_m is a mutually A-orthogonal set of vectors. Let x_1 be any vector in E_n. Let x be in P_m. Then, the following conditions are equivalent:

1. x minimizes F on P_m.
2. F′(x), the gradient of F at x, is orthogonal to the subspace S_m.
3. x = x_1 + a_1 p_1 + ... + a_m p_m, where a_i = c_i/d_i, c_i = −p_i^T F′(x_1), d_i = p_i^T A p_i, i = 1, ..., m.

Let x_i = x_1 + a_1 p_1 + ... + a_{i−1} p_{i−1}, i = 1, ..., m. Then the quantity c_i defined in (3) above is also given by c_i = −p_i^T F′(x_i), i = 1, ..., m.

Moreover, there is a unique vector x_0 in the m-dimensional plane P_m through x_1 translating S_m such that x_0 minimizes the function F given by (1) on P_m.

Proof. First, we are going to show that F has at least one minimizing vector in P_m. Let p be any vector in P_m and let M = F(p). Since A is positive definite, there is an R > 0 such that ||x|| > R implies F(x) > M. Hence, F(x) ≤ M implies ||x|| ≤ R. Since C = {x : ||x|| ≤ R} ∩ P_m is a compact set in E_n on which F is continuous, F has at least one minimizing vector p_0 in the compact set C. Outside this compact set C, F assumes only larger values. Thus, p_0 is a minimizing vector for F in P_m.

To show that (1) implies (2), assume that x minimizes F on P_m. Then,

p_j^T F′(x) = (d/da) F(x + a p_j) |_{a=0} = 0,   (3)

for j = 1, ..., m. Thus,

p_j^T F′(x) = 0 (j = 1, ..., m).   (4)
So F′(x) is orthogonal to every vector in the basis of S_m and, hence, is orthogonal to S_m.

To show that (2) implies (1), suppose that F′(x) is orthogonal to S_m. Let v be any vector in P_m. We are going to show that F(v) > F(x) unless v = x. By Taylor’s theorem we have the following:

F(v) = F(x) + (v − x)^T F′(x) + (1/2)(v − x)^T A (v − x).   (5)

Since (v − x) is a vector in S_m, then (v − x)^T F′(x) = 0. In addition, (v − x)^T A (v − x) > 0 unless v = x, because A is positive definite. Thus,

F(v) > F(x) unless v = x.   (6)

Hence, x is a unique absolute minimum for F in P_m.

Now we can prove that there is a unique vector x_0 in P_m minimizing F on P_m. Earlier we established that there is at least one minimizing vector p_0 for F in P_m. Since (1) implies (2), then F′(p_0) is orthogonal to S_m. From the proof of (2) implies (1), it now follows that p_0 is a unique absolute minimum for F in P_m.

To show that (2) implies (3), let x = x_1 + a_1 p_1 + ... + a_m p_m, since x is in P_m, and assume that F′(x) is orthogonal to the subspace S_m. We are going to show that a_i = c_i/d_i, where c_i = −p_i^T F′(x_1), d_i = p_i^T A p_i, i = 1, ..., m. Note that Ax = Ax_1 + a_1 A p_1 + ... + a_m A p_m. In addition, Ax − k = Ax_1 − k + a_1 A p_1 + ... + a_m A p_m. Since F′(x) = Ax − k, then

F′(x) = F′(x_1) + a_1 A p_1 + ... + a_m A p_m.

For i = 1, ..., m, we have

p_i^T F′(x) = p_i^T F′(x_1) + a_1 p_i^T A p_1 + ... + a_m p_i^T A p_m.

Since {p_1, ..., p_m} is a mutually A-orthogonal set of vectors, then p_i^T F′(x) = p_i^T F′(x_1) + a_i p_i^T A p_i, i = 1, ..., m. Since F′(x) is orthogonal to the subspace S_m, then p_i^T F′(x) = 0, i = 1, ..., m. Thus, a_i p_i^T A p_i = −p_i^T F′(x_1). Since p_i ≠ 0, i = 1, ..., m, and A is positive definite, then p_i^T A p_i ≠ 0, i = 1, ..., m. If we let c_i = −p_i^T F′(x_1) and d_i = p_i^T A p_i, then

a_i = c_i/d_i, i = 1, ..., m.   (7)

To show that (3) implies (2), we can use what was established in the previous proof.
An indication of this is provided below. Suppose that x = x_1 + a_1 p_1 + ... + a_m p_m, where a_i = c_i/d_i, c_i = −p_i^T F′(x_1), d_i = p_i^T A p_i, i = 1, ..., m. We want to show that p_i^T F′(x) = 0, i = 1, ..., m. Since

p_i^T F′(x) = p_i^T F′(x_1) + a_i p_i^T A p_i,

and a_i = −p_i^T F′(x_1)/(p_i^T A p_i), we have p_i^T F′(x) = 0, i = 1, ..., m. Hence, F′(x) is orthogonal to S_m. Thus, (1)–(3) are equivalent.

Now we are going to show that the quantity c_i defined by c_i = −p_i^T F′(x_1), i = 1, ..., m, in (3) is also given by c_i = −p_i^T F′(x_i), i = 1, ..., m. Since x_{i+1} = x_i + a_i p_i, i = 1, ..., (m − 1), then

Ax_{i+1} = Ax_i + a_i A p_i,
Ax_{i+1} − k = Ax_i − k + a_i A p_i,
F′(x_{i+1}) = F′(x_i) + a_i A p_i, i = 1, ..., (m − 1).

Thus,

F′(x_{i+1}) = F′(x_1) + a_1 A p_1 + ... + a_i A p_i, i = 1, ..., (m − 1),   (8)

and, by conjugacy of {p_1, ..., p_m}, we have

p_i^T F′(x_i) = p_i^T [F′(x_1) + a_1 A p_1 + ... + a_{i−1} A p_{i−1}] = p_i^T F′(x_1), i = 1, ..., m.   (9)

Hence,

p_i^T F′(x_i) = p_i^T F′(x_1), i = 1, ..., m.   (10)

This completes the proof of the theorem.

2.2. A Class of Minimization Algorithms

Now, we shall describe a class of minimization algorithms known as the method of CDs. The significance of the formulas given in (3) of Theorem 1 is indicated below.

Suppose {p_1, ..., p_m}, 1 ≤ m ≤ n, is a conjugate set of nonzero vectors and that P_m is the m-dimensional plane through x_1 obtained by translating the subspace S_m generated by {p_1, ..., p_m}. Then, the minimum of F given by (1) on P_m is attained at x_0, which we will call x_{m+1}, where x_{m+1} = x_1 + a_1 p_1 + ... + a_m p_m; refer to Theorem 1. Now we assume that p_{m+1} is a nonzero vector that has been constructed to be conjugate to p_i, i = 1, ..., m, and let P_{m+1} denote the (m + 1)-dimensional plane through x_1 obtained by translating the subspace S_{m+1} generated by {p_1, ..., p_m, p_{m+1}}.
It turns out that it is not necessary to solve a new (m + 1)-dimensional minimization problem to determine the minimizing vector x_{m+2} on P_{m+1}.

The minimizing vector x_{m+2} on P_{m+1} is obtained by a one-dimensional minimization of F about the vector x_{m+1} in the direction p_{m+1}. This follows directly from the following formulas found in Theorem 1:

x_{m+2} = x_{m+1} + a_{m+1} p_{m+1},

and

a_{m+1} = c_{m+1}/d_{m+1}, c_{m+1} = −p_{m+1}^T F′(x_{m+1}), d_{m+1} = p_{m+1}^T A p_{m+1}.

Note that a_{m+1} depends upon x_{m+1} and p_{m+1} and explicitly involves no other x or p terms. Thus, the minimizing vector x_{m+1} on P_m results from m consecutive one-dimensional minimizations starting at x_1 and proceeding along the CDs p_1, ..., p_m successively. The ways of obtaining a mutually conjugate set {p_1, ..., p_m} of vectors are not specified in general. Thus, the method of CDs is really a class of algorithms, where a specific algorithm depends upon the choice of {p_1, ..., p_m}. In practice, the vector p_{k+1} needed for the (k + 1)th iteration in finding x_{k+2}, k = 1, ..., m, is usually constructed from information obtained at the kth iteration.

The following class of algorithms is referred to as the method of CDs: for k = 1, ..., n, we find

x_{k+1} = x_k + a_k p_k,
a_k = c_k/d_k, c_k = −p_k^T F′(x_1), d_k = p_k^T A p_k.

Alternatively, c_k may be given by c_k = −p_k^T F′(x_k). If F′(x_m) = 0 for some 1 ≤ m ≤ n, then the algorithm terminates and x_m minimizes F on E_n. Furthermore, any such algorithm terminates in n steps or less, since F is quadratic.

2.3. Special Inner Product and the Gram–Schmidt Process

Let A be a positive definite symmetric n × n matrix. Define a special inner product (x, y) by (x, y) = x^T A y, where x and y are column vectors. Let u_1 = (1, 0, ..., 0), u_2 = (0, 1, 0, ..., 0), ..., u_n = (0, 0, ..., 0, 1). Then, using the special inner product above, we apply the Gram–Schmidt process to the linearly independent vectors u_1, u_2, ..., u_n to obtain a set of mutually A-orthogonal vectors p_1, p_2, ..., p_n, where the property of A-orthogonality is relative to the special inner product, as performed by Hestenes and Stiefel [5] (p. 425).

3. Results

A brief description of the CG method is given below using a quadratic function:

F(x) = (1/2) x^T A x − k^T x + c.

The CG method is the CD method described previously, with the first CD being in the direction of the negative gradient of the function F. The remaining CDs can be determined in a variety of ways; the CG procedure described by Hestenes [10] is given below.

3.1. CG—Algorithms for Nonquadratic Approximations

One can apply the CG method to the quadratic function in z, namely F(z), to obtain a minimum of F(z). Let f be a function of n variables; then

F(z) = f(x_1) + (f′(x_1))^T z + (1/2) z^T f″(x_1) z.

Assume that the Hessian matrix f″(x_1) is positive definite symmetric, which implies that F(z) has a unique minimum z_min. Then,

∇F(z) = f′(x_1) + f″(x_1) z.

Applying Newton’s method to ∇F(z) = 0, we get

f′(x_1) + f″(x_1) z = 0,
(f″(x_1))^(−1) f′(x_1) + z = 0 (after multiplying by (f″(x_1))^(−1)),
z̄_min = −(f″(x_1))^(−1) f′(x_1).

Remark 1. We solved ∇F(z̄) = 0 directly to obtain min F(z).

In general, Newton’s method is used to solve f(z̄) = 0 for z̄. It is given by

z_{n+1} = z_n − J^(−1) f(z_n), n = 0, 1, 2, ...,

where z_0 is an initial guess and J is the Jacobian matrix, i.e., the n × n matrix whose (i, j) entry is ∂f^(i)(z_n)/∂z^(j).

Now, we apply Newton’s method by taking f to be ∇F and assuming that F and its second partial derivatives are continuous. So, one can apply Newton’s method to ∇F(z) = 0, with z_1 = 0 as the initial point, to obtain the minimum point z̄_min of F, where

z_{n+1} = z_n − J^(−1)(∇F(z_n)) = z_n − (f″(x_1))^(−1)(∇F(z_n)).

Then,

z_2 = z_1 − (f″(x_1))^(−1)(∇F(z_1)) = 0 − (f″(x_1))^(−1)(∇F(0)),

where we take z_1 = 0.
Since

∇F(z̄) = f′(x_1) + f″(x_1) z̄,
∇F(0) = f′(x_1) + f″(x_1) · 0,

we have ∇F(0) = f′(x_1). Then,

z_2 = 0 − (f″(x_1))^(−1) f′(x_1).

For convenience in exposition, we include formulas below from Hestenes [10] (pp. 136–137 and pp. 199–202).

Here, the first step of Newton’s method is applied to ∇F(z) = 0, and z_2 also turns out to be the only minimum point of F(z) (a quadratic function with a positive definite symmetric second-order term), i.e.,

z_2 = −(f″(x_1))^(−1) f′(x_1),

which satisfies ∇F(z_2) = 0. Therefore, Newton’s method terminates in one iteration [10].

The initial formulas for b_k and c_k given in Algorithm 1 imply the basic CG relations

p_k^T r_{k+1} = 0, s_k^T p_{k+1} = 0.

Algorithm 1 CG algorithm
Step 1: Select an initial point x_1. Set r_1 = −f′(x_1), p_1 = r_1, z_1 = 0.
for k = 1, ..., n do perform the following iteration:
Step 2: s_k = f″(x_1) p_k,
Step 3: a_k = c_k/d_k, d_k = p_k^T s_k, c_k = p_k^T r_1 or c_k = p_k^T r_k,
Step 4: z_{k+1} = z_k + a_k p_k, r_{k+1} = r_k − a_k s_k,
Step 5: p_{k+1} = r_{k+1} + b_k p_k, b_k = −(s_k^T r_{k+1})/d_k or b_k = |r_{k+1}|^2/|r_k|^2.
end for
Step 6: When k = n, consider the next estimate of the minimum point x_0 of f to be the point x̄_1 = x_1 + z_{n+1}. Then choose x̄_1 as the final estimate if |f′(x̄_1)| is sufficiently small. Otherwise, reset x_1 = x̄_1 and repeat the CG cycle (Step 1)–(Step 5).

The CG cycle can terminate prematurely at the mth step if r_{m+1} = 0. In this case, we replace x_1 by x̄_1 = x_1 + z_{m+1} and restart the algorithm.

If we take A = f″(x_1), where A is positive definite symmetric, then we establish the formula

(f″(x_1))^(−1) = Σ_{k=1}^{n} (p_k p_k^T)/d_k

for the inverse of f″(x_1). Since Step 2 implies that s_k = f″(x_1) p_k, then, in Algorithm 1, we find

lim_{s→0} [f′(x_1 + s p_k) − f′(x_1)]/s = f″(x_1) p_k.

We obtain the difference quotient by rewriting the vector s_k in Algorithm 1 (see Hestenes [10]). Therefore, without computing the second derivative, we find

s_k = [f′(x_1 + s p_k) − f′(x_1)]/s.
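The formula for the inverse Hessian and the n-step termination of the CG cycle can be checked on a quadratic, where s_k = A p_k is available exactly. The following sketch (assuming NumPy; the variable names are ours, not the paper’s) runs one cycle of Algorithm 1:

```python
import numpy as np

# One cycle of the CG algorithm (Algorithm 1) on a quadratic
# F(x) = (1/2) x^T A x - k^T x, where s_k = A p_k is exact.
rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)           # positive definite symmetric
kvec = rng.standard_normal(n)

x1 = rng.standard_normal(n)
r = kvec - A @ x1                     # r_1 = -F'(x_1)
p = r.copy()
z = np.zeros(n)
H = np.zeros((n, n))                  # accumulates sum of p_k p_k^T / d_k

for _ in range(n):
    s = A @ p                         # Step 2 (exact Hessian-vector product)
    d = p @ s
    c = p @ r                         # c_k = p_k^T r_k
    a = c / d
    z = z + a * p                     # Step 4
    r_new = r - a * s
    H += np.outer(p, p) / d
    b = -(s @ r_new) / d              # Step 5
    p = r_new + b * p
    r = r_new

x_bar = x1 + z                        # Step 6: estimate of the minimizer
```

After n steps, x̄_1 coincides with the exact minimizer A^(−1)k, and the accumulated matrix H equals A^(−1), in agreement with the displayed formula.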
In view of the development of Algorithms 1 and 2, each cycle of n steps is clearly comparable to one Newton step. Thus, we replace c_k = p_k^T r_k by c_k = p_k^T r_1 and obtain the following relation

z_{n+1} = Σ_{k=1}^{n} (c_k/d_k) p_k = Σ_{k=1}^{n} (p_k p_k^T r_1)/d_k = H(x_1, s)(r_1) = −H(x_1, s) f′(x_1),

where

H(x_1, s) = Σ_{k=1}^{n} (p_k p_k^T)/d_k, r_1 = −f′(x_1).

Algorithm 2 CG algorithm without derivatives
Step 1: Initially select x_1 and choose a positive constant s. Set z_1 = 0, r_1 = −f′(x_1), p_1 = r_1.
for k = 1, ..., n do perform the following iteration:
Step 2: s_k = [f′(x_1 + σ_k p_k) − f′(x_1)]/σ_k, σ_k = s/|p_k|,
Step 3: a_k = c_k/d_k, d_k = p_k^T s_k, c_k = p_k^T r_1,
Step 4: z_{k+1} = z_k + a_k p_k, r_{k+1} = r_k − a_k s_k,
Step 5: p_{k+1} = r_{k+1} + b_k p_k, b_k = −(s_k^T r_{k+1})/d_k.
end for
Step 6: When k = n, then x̄_1 = x_1 + z_{n+1} is taken to be the next estimate of the minimum point x_0 of f. Then accept x̄_1 as the final estimate of x_0 if |f′(x̄_1)| is sufficiently small. Otherwise, reset x_1 = x̄_1 and repeat the CG cycle (Step 1)–(Step 5).

The new initial point x̄_1 = x_1 + z_{n+1} generated by one cycle of the modified Algorithm 2 is, therefore, given by the Newton-type formula

x̄_1 = x_1 − H(x_1, s) f′(x_1).

So, we have lim_{s→0} H(x_1, s) = (f″(x_1))^(−1). The above algorithm approximates the Newton algorithm

x̄_1 = x_1 − (f″(x_1))^(−1) f′(x_1)

and has this algorithm as its limit as s → 0. Therefore, Algorithm 2 will have nearly identical convergence features to Newton’s algorithm if s is suitably reduced at the end of each cycle.

3.2. Conjugate Gram–Schmidt (CGS)—Algorithms for Nonquadratic Functions

With an appropriate initial point x_1, we can derive the algorithm described by Hestenes [10] (p. 135), which relates Newton’s method to a CGS algorithm. Since [10]

lim_{s→0} [f′(x_1 + s p_k) − f′(x_1)]/s = f″(x_1) p_k,   (11)

we can approximate the vector f″(x_1) p_k by the vector

s_k = [f′(x_1 + s p_k) − f′(x_1)]/s,   (12)

with a small value of s.
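The quality of the approximation (12) is easy to check numerically. The sketch below (our own names; forward differences exactly as in (12)) compares the difference quotient with the exact Hessian-vector product f″(x_1)p for Rosenbrock’s function:

```python
import numpy as np

# Check the forward-difference approximation (12) of f''(x_1) p
# against the analytic Hessian of Rosenbrock's function.
def grad(v):
    x, y = v
    return np.array([-400.0 * x * (y - x**2) + 2.0 * (x - 1.0),
                     200.0 * (y - x**2)])

def hess(v):
    x, y = v
    return np.array([[1200.0 * x**2 - 400.0 * y + 2.0, -400.0 * x],
                     [-400.0 * x, 200.0]])

x1 = np.array([-1.2, 1.0])
p = np.array([1.0, 1.0]) / np.sqrt(2.0)   # a unit direction

s = 1e-7
s_k = (grad(x1 + s * p) - grad(x1)) / s   # the vector (12)
exact = hess(x1) @ p                       # the limit (11)
rel_err = np.linalg.norm(s_k - exact) / np.linalg.norm(exact)
```

For s = 10^(−7) the relative error is tiny, consistent with the limit (11).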
Then, we obtain the following modification of Newton’s algorithm, the CGS algorithm (see Hestenes [10]). If, in Step 2 of Algorithm 3, we substitute s_k with the formula

s_k = f″(x_1) p_k

and repeat the CGS algorithm, then we obtain Newton’s algorithm.

Algorithm 3 CGS algorithm
Step 1: Select a point x_1, a small positive constant s > 0 and n linearly independent vectors u_1, ..., u_n; set z_1 = 0, r_1 = −f′(x_1), p_1 = u_1.
for k = 1, ..., n, having obtained z_k, r_k and p_k, do perform the following iteration:
Step 2: s_k = [f′(x_1 + σ_k p_k) − f′(x_1)]/σ_k, σ_k = s/|p_k|,
Step 3: d_k = p_k^T s_k, c_k = p_k^T r_1, a_k = c_k/d_k,
Step 4: z_{k+1} = z_k + a_k p_k,
Step 5: b_{k+1,j} = (s_j^T u_{k+1})/d_j (j = 1, ..., k),
Step 6: p_{k+1} = u_{k+1} − b_{k+1,1} p_1 − ... − b_{k+1,k} p_k.
end for
Step 7: When z_{n+1} has been computed, the cycle is terminated. Then choose x̄_1 = x_1 + z_{n+1} as the final estimate if |f′(x̄_1)| is sufficiently small. Otherwise, reset x_1 = x̄_1 and repeat the CGS cycle (Step 1)–(Step 6).

In view of (11), for small s > 0, the CGS Algorithm 3 is a good approximation of Newton’s algorithm and has Newton’s algorithm as its limit as s → 0. A simple modification of Algorithm 3 is obtained by replacing the formulas in Step 2 and Step 3 as follows, as described in Hestenes [10]:

s_k = [f′(x_k + σ_k p_k) − f′(x_k)]/σ_k, σ_k = s/|p_k|,
x_k = x_1 + z_k, d_k = p_k^T s_k, c_k = −p_k^T f′(x_k), a_k = c_k/d_k.

A CGS algorithm for nonquadratic functions is obtained from the following relations, where the ratios

c(s) = [f(x − s p) − f(x + s p)]/(2s),
d(s) = [f(x − s p) − 2 f(x) + f(x + s p)]/s^2,

have the properties

lim_{s→0} c(s) = −p^T f′(x), lim_{s→0} d(s) = p^T f″(x) p,

and p is a nonzero vector. Moreover, for a given vector u ≠ 0, the ratio

c(a, s) = [f(x + a u − s p) − f(x + a u + s p)]/(2s)

has the property that

lim_{a→0} lim_{s→0} [c(s) − c(a, s)]/a = u^T f″(x) p.

The details are as follows. Suppose p_1, p_2, ..., p_n is an A-orthogonal basis that spans the same vector space as that spanned by the linearly independent vectors u_1, u_2, ..., u_n. The inner product (x, y) is defined by x^T A y, where A is a positive definite symmetric matrix. Then, the Gram–Schmidt process works as follows:

p̄_1 = u_1, p_1 = p̄_1/|p̄_1|,
p̄_2 = u_2 − [(u_2, p_1)/(p_1, p_1)] p_1, p_2 = p̄_2/|p̄_2|,
p̄_3 = u_3 − [(u_3, p_1)/(p_1, p_1)] p_1 − [(u_3, p_2)/(p_2, p_2)] p_2, p_3 = p̄_3/|p̄_3|,

or, written in terms of A,

p̄_3 = u_3 − [(p_1^T A u_3)/(p_1^T A p_1)] p_1 − [(p_2^T A u_3)/(p_2^T A p_2)] p_2,

and, in general,

p̄_{k+1} = u_{k+1} − [(p_1^T A u_{k+1})/(p_1^T A p_1)] p_1 − ... − [(p_k^T A u_{k+1})/(p_k^T A p_k)] p_k, p_{k+1} = p̄_{k+1}/|p̄_{k+1}|.

Take A = f″(x_1); then

p̄_{k+1} = u_{k+1} − [(p_1^T f″(x_1) u_{k+1})/(p_1^T f″(x_1) p_1)] p_1 − ... − [(p_k^T f″(x_1) u_{k+1})/(p_k^T f″(x_1) p_k)] p_k.

We already showed that p_k^T A p_k = d_k, that is, p_k^T f″(x_1) p_k = d_k. Then,

p̄_{k+1} = u_{k+1} − [(p_1^T f″(x_1) u_{k+1})/d_1] p_1 − ... − [(p_k^T f″(x_1) u_{k+1})/d_k] p_k.

We also know that f″(x_1) p_k is approximated by s_k. Therefore,

p̄_{k+1} = u_{k+1} − [(s_1^T u_{k+1})/d_1] p_1 − ... − [(s_k^T u_{k+1})/d_k] p_k,

that is,

p̄_{k+1} = u_{k+1} − b_{k+1,1} p_1 − ... − b_{k+1,k} p_k, with p_{k+1} = p̄_{k+1}/|p̄_{k+1}|.

Now, using function values only, a conjugate Gram–Schmidt process without derivatives is described by Hestenes [10] as follows, as the CGS routine without derivatives (Algorithm 4):

Algorithm 4 CGS algorithm without derivatives
Step 1: Select an initial point x_1, a small s > 0 and a set of linearly independent unit vectors u_1, ..., u_n; set z_1 = 0, p_1 = u_1, a = 2s, g_0 = 0; compute f(x_1).
for k = 1, ..., n, having obtained z_k, p_1, ..., p_k and g_{k−1}, do perform the following iteration:
Step 2: d_k = [f(x_1 − s p_k) − 2 f(x_1) + f(x_1 + s p_k)]/s^2,
Step 3: c_k = [f(x_1 − s p_k) − f(x_1 + s p_k)]/(2s),
Step 4: g_k = max[g_{k−1}, |c_k|],
Step 5: a_k = c_k/d_k, z_{k+1} = z_k + a_k p_k,
Step 6: p_{k+1} = u_{k+1} − b_{k+1,1} p_1 − ... − b_{k+1,k} p_k.
end for
Step 7: When z_{n+1} has been computed, the cycle is terminated.
Then choose x̄_1 = x_1 + z_{n+1} as the next estimate: if |f′(x̄_1)| is sufficiently small, x̄_1 is accepted as the minimum of f. Otherwise, reset x_1 = x̄_1 and repeat the CGS cycle (Step 1)–(Step 6) with the initial condition g_0 = 0.

In addition, the conjugate Gram–Schmidt method without derivatives is described by Dennemeyer and Mookini [26]. In their program, they used notation different from Hestenes’, but they implemented the same procedure.

Initial step: select an initial point x_1, a small s > 0 and a set of linearly independent vectors u_1, ..., u_n; set h_1 = 0, p_1 = u_1, a = 2s, g_0 = 0 and compute f(x_1).

Iterative steps: given x_1, p_1, ..., p_k, h_k, compute

d_k = [f(x_1 − s p_k) − 2 f(x_1) + f(x_1 + s p_k)]/s^2,
c_k = [f(x_1 − s p_k) − f(x_1 + s p_k)]/(2s),
g_k = max[g_{k−1}, |c_k|], a_k = c_k/d_k, h_{k+1} = h_k + a_k p_k;

for j = 1, ..., k compute

c_{k+1,j} = [f(x_1 + a u_{k+1} − s p_j) − f(x_1 + a u_{k+1} + s p_j)]/(2s),
a_{k+1,j} = c_{k+1,j}/d_j, b_{k+1,j} = (a_{k+1,j} − a_j)/a;

then,

p_{k+1} = u_{k+1} + Σ_{j=1}^{k} b_{k+1,j} p_j.

Terminate when h_{n+1} is obtained, and set x_{n+1} = x_1 + h_{n+1}. If the value g_n is small enough, x_{n+1} is the minimum point of f. Otherwise, set x_1 = x_{n+1} and repeat the program.

The term g_k is used to terminate the algorithm because the gradient is not explicitly computed. Another termination method would be to test whether max_k |a_k| < ε, where ε is chosen beforehand. Both of these tests were used on the computer by Dennemeyer and Mookini [26], and the results were comparable.

4. Discussion

In this section, we present a computation to illustrate convergence rates, as well as the relationship between that computation and Newton’s method. Two of the most important questions in the study of iterative processes are the following: (a) when do the iterations converge; and (b) how fast is the convergence. We introduce the idea of rates of convergence, as described by Ortega and Rheinboldt [14].
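On a quadratic function the central differences above are exact up to rounding, so one cycle of the derivative-free routine should land on the minimizer, just as one Newton step would. The sketch below follows our reading of the Dennemeyer–Mookini recursion; the variable names, and the exact placement of the offset a·u_{k+1} in the b_{k+1,j} update, are our reconstruction, not the authors’ program:

```python
import numpy as np

# One cycle of the derivative-free CGS routine on a quadratic
# f(x) = (1/2) x^T A x - b^T x, using function values only.
# This is a sketch under our reading of the recursion above.
rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # positive definite symmetric
bvec = rng.standard_normal(n)

def f(x):
    return 0.5 * x @ A @ x - bvec @ x

x1 = rng.standard_normal(n)
s, a_off = 1e-3, 1e-3                  # difference steps s and a (our choices)
u = np.eye(n)                          # unit vectors u_1, ..., u_n
P = [u[0]]                             # p_1 = u_1
h = np.zeros(n)
ds, avals = [], []

for k in range(n):
    p = P[k]
    d = (f(x1 - s * p) - 2.0 * f(x1) + f(x1 + s * p)) / s**2  # ~ p^T A p
    c = (f(x1 - s * p) - f(x1 + s * p)) / (2.0 * s)           # ~ -p^T grad f(x1)
    a = c / d
    ds.append(d); avals.append(a)
    h = h + a * p
    if k + 1 < n:                       # build p_{k+1} from u_{k+1}
        u_next = u[k + 1]
        p_next = u_next.copy()
        for j in range(k + 1):
            pj = P[j]
            c_aj = (f(x1 + a_off * u_next - s * pj)
                    - f(x1 + a_off * u_next + s * pj)) / (2.0 * s)
            a_aj = c_aj / ds[j]
            b_j = (a_aj - avals[j]) / a_off   # ~ -(u_next^T A p_j)/d_j
            p_next = p_next + b_j * pj
        P.append(p_next)

x_min = x1 + h      # one full cycle ~ one Newton step (exact on a quadratic)
```

In a run, the constructed directions come out mutually A-orthogonal and x_1 + h agrees with the true minimizer A^(−1)b to rounding-level accuracy.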
4.1. Rates of Convergence

A precise formulation of the asymptotic rate of convergence of a sequence x_k converging to x* is motivated by the fact that estimates of the form

||x_{k+1} − x*|| ≤ C ||x_k − x*||^p,   (13)

for all k = 1, 2, ..., often arise naturally in the study of certain iterative processes.

Definition 1. Let {x_k} be a sequence of points in R^n that converges to a point x*. Let 1 ≤ p < ∞. Ortega and Rheinboldt [14] define the quantities

Q_p{x_k} =
  0, if x_k = x* for all but finitely many k,
  lim sup_{k→∞} ||x_{k+1} − x*||/||x_k − x*||^p, if x_k ≠ x* for all but finitely many k,
  +∞, otherwise,

and refer to them as quotient convergence factors, or Q-factors for short.

Definition 2. Let C(I, x*) denote the set of all sequences having a limit of x* that are generated by an iterative process I. Then

Q_p(I, x*) = sup{ Q_p{x_k} : {x_k} ∈ C(I, x*) }, 1 ≤ p < +∞,

are the Q-factors of I at x* with respect to the norm in which the Q_p{x_k} are computed.

Note that if Q_p{x_k} < +∞ for some p with 1 ≤ p < ∞, then, for any ε > 0, there is a positive integer K such that (13) above holds with C = Q_p{x_k} + ε for all k ≥ K. If 0 < Q_p{x_k} < ∞, then we say that x_k converges to x* with Q-order of convergence p, and if Q_p{x_k} = 0 for some fixed p satisfying 1 ≤ p < ∞, then we say that x_k has superconvergence of Q-order p to x*. For example, if 0 < Q_p{x_k} < +∞ when p = 1, then we also have 0 < C < 1 in (13), and we say that {x_k} converges to x* linearly. In addition, if Q_p{x_k} = 0 when p = 1, we say that {x_k} converges to x* superlinearly.

Definition 3. Another method of describing the convergence rate involves the root convergence factors (see [14]):

R_p{x_k} =
  lim sup_{k→∞} ||x_k − x*||^(1/k), if p = 1,
  lim sup_{k→∞} ||x_k − x*||^(1/p^k), if p > 1.
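As a concrete illustration of Definition 1 with p = 2 (our own toy example, not one from the paper), the quotients for Newton’s method on g(x) = x^2 − 2 settle near |g″(x*)/(2g′(x*))| = 1/(2√2):

```python
import math

# Estimate the Q-factor with p = 2 for Newton's method on g(x) = x^2 - 2,
# whose root is sqrt(2).  The quotients |x_{k+1} - x*| / |x_k - x*|^2
# approach |g''(x*) / (2 g'(x*))| = 1 / (2 sqrt(2)).
root = math.sqrt(2.0)
x = 1.5
iterates = [x]
for _ in range(3):                      # stop before rounding error dominates
    x = x - (x * x - 2.0) / (2.0 * x)   # Newton step
    iterates.append(x)

errs = [abs(t - root) for t in iterates]
quotients = [errs[k + 1] / errs[k] ** 2 for k in range(len(errs) - 1)]
```

The last quotient is already within about 10^(−6) of 1/(2√2) ≈ 0.35355, a finite Q_2 factor, which is exactly the signature of quadratic convergence.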
4.2. Acceleration

One acceleration procedure is the following: first, apply n CD steps to an initial point x_1 to obtain a point x_{n+1} = y_1; then, take x_{n+1} to be a new initial point and apply n CD steps again to obtain another x_{n+1} = y_2; finally, check for acceleration by evaluating Q = F(y_2 + (y_2 − y_1)). If Q < F(y_2), then we accelerate by taking [y_2 + (y_2 − y_1)] as our initial point; if Q > F(y_2), then take y_2 as the new initial point. After two more applications of the CD method, we check for acceleration again. The procedure continues in this manner [25].

4.3. Test Function

4.3.1. Rosenbrock’s Banana Valley Function

We carry out the following computations for Rosenbrock’s banana valley function (n = 2). This function possesses a steep-sided valley that is nearly parabolic in shape. First, we determine values in the domain of Rosenbrock’s function for which its Hessian matrix is positive definite symmetric. The Rosenbrock banana valley function is non-negative, i.e.,

f(x, y) = 100(y − x^2)^2 + (x − 1)^2 ≥ 0.

Then we have

f_x = 200(y − x^2)(−2x) + 2(x − 1) = −400x(y − x^2) + 2(x − 1),
f_xx = −400(y − x^2) − 400x(−2x) + 2 = −400y + 400x^2 + 800x^2 + 2 = 1200x^2 − 400y + 2,
f_xy = −400x, f_y = 200(y − x^2), f_yy = 200.

Therefore, the Hessian matrix is positive definite symmetric if and only if Sylvester’s criterion holds:

1200x^2 − 400y + 2 > 0 and (200)(1200x^2 − 400y + 2) − 160000x^2 > 0,

which implies that 1200x^2 + 2 > 400y, i.e., y < 3x^2 + 1/200, and

1200x^2 − 400y + 2 − 800x^2 > 0 ⟺ 400x^2 + 2 > 400y ⟺ y < x^2 + 1/200.

So, the Hessian matrix is positive definite symmetric if and only if y < x^2 + 1/200. Figure 1 shows the maximal convex level set on which the Hessian is positive definite symmetric in the interior for Rosenbrock’s banana valley function.

4.3.2. Kantorovich’s Function

The following function

F(x, y) = (3x^2 y + y^2 − 1)^2 + (x^4 + xy^3 − 1)^2,

which is non-negative, i.e., F(x_1, x_2) ≥ 0, is called Kantorovich’s function.
Calculating the Hessian matrix for Kantorovich’s function, we find that

F_xx = 72x^2 y^2 + 12(3x^2 y + y^2 − 1)y + 2(4x^3 + y^3)^2 + 24(x^4 + xy^3 − 1)x^2,
F_xy = 12(3x^2 + 2y)xy + 12(3x^2 y + y^2 − 1)x + 6xy^2 (4x^3 + y^3) + 6(x^4 + xy^3 − 1)y^2,
F_yy = 2(3x^2 + 2y)^2 + 12x^2 y + 4y^2 − 4 + 18x^2 y^4 + 12(x^4 + xy^3 − 1)xy.

Minimizing this function is equivalent to solving the nonlinear system of equations 3x^2 y + y^2 − 1 = 0, x^4 + xy^3 − 1 = 0. For the initial point (0.98, 0.32), we obtain the minimum point at (0.992779, 0.306440) [25].

Figure 1. Maximal convex level set for Rosenbrock’s banana valley function.

4.4. Numerical Computation

The goal of this numerical computation is to provide a system of iterative approaches for finding these extreme points [10]. A significant point is that a Newton step can be replaced by a CD sequence of n linear minimizations in n appropriately chosen directions. It is important to keep in mind that a function behaves like a quadratic function in the neighborhood of a nondegenerate minimum point.

Conjugacy can be thought of as a generalization of the concept of orthogonality. Conjugate direction methods substitute conjugate bases for orthogonal bases in the foundational structure. The formulas for determining the minimum point of a quadratic function can be reduced to their simplest forms by following the CD technique. The conjugate direction algorithms for minimizing a quadratic function discussed in the current work were initially presented by Hestenes and Stiefel in 1952 [5]. Davidon [3] and Fletcher and Powell [4] are best known for the modifications and additions that they made to these methods, although numerous other authors also contributed changes.

The iterative methods described above apply to many problems.
They are used in least squares fitting, in solving linear and nonlinear systems of equations, and in optimization problems with and without constraints [25]. The computational performance and numerical results of these techniques, and comparisons among them, have received a significant amount of attention in recent years, particularly for unconstrained optimization problems and large-scale applications [19,27].

The Rosenbrock function of two variables, considered in Section 4.3, was introduced by Rosenbrock [18] as a simple test function for minimization algorithms. We chose (x_1, y_1) = (-1.2, 1) as the initial point. We applied algorithm (4.4a)-(4.4f) with σ = 0.1 × 10^{-120}, using 400-digit accuracy. Algorithm (4) is basically Newton's algorithm. The final estimate of (x_0, y_0) has more than 150-digit accuracy. The successive values 0.8574..., 0.0274..., 0.2433..., 0.0030..., 0.2000..., 0.0030..., 0.2000..., ... of the quotients that lead to the quotient convergence factor oscillate. The lim sup of these quotients gives the quotient convergence factor, which indicates quadratic convergence. The lim sup is 0.2000....

For σ = 0.1 × 10^{-120}, ρ = 0.2 × 10^{-120}, ε = 0.1 × 10^{-60} and the initial values, we obtained the following computations for Rosenbrock's function f using the Gram-Schmidt conjugate direction method without derivatives (the CGS method, no derivatives) and Newton's method applied to ∇f = 0 (see [28]). For additional information regarding the programming, please refer to the supplementary material.

4.5. Differential Equations of Steepest Descent

The following equations are known as the differential equations of steepest descent:

dx(t)/dt = -∇F(x(t)),   (14)

and

dx(t)/dt = -∇F(x(t)) / ||∇F(x(t))||.   (15)

The solution to either differential equation of steepest descent with initial condition x_1(0) = -1.2, x_2(0) = 1.0 is shown in Figure 2; one can refer to Equation (10), p. 783, in Eells [17].
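The computations described above can be sketched in double precision. The program below is an illustrative sketch, not the paper's 400-digit implementation: the step size, iteration counts, and error norm are assumptions. It integrates Equation (14) with explicit Euler and runs Newton's method on ∇f = 0 for Rosenbrock's function from (-1.2, 1), printing the quotients e_{k+1}/e_k^2 whose lim sup gives the quotient convergence factor:

```python
# Steepest-descent gradient flow (explicit Euler on dx/dt = -grad f) versus
# Newton's method on grad f = 0, for f(x, y) = 100(y - x^2)^2 + (x - 1)^2.

def grad(x, y):
    return (-400 * x * (y - x**2) + 2 * (x - 1), 200 * (y - x**2))

def hess(x, y):
    return ((1200 * x**2 - 400 * y + 2, -400 * x), (-400 * x, 200.0))

def euler_descent(x, y, h=5e-4, steps=20000):
    # explicit Euler on Equation (14); h is small enough for stability here
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - h * gx, y - h * gy
    return x, y

def newton_step(x, y):
    gx, gy = grad(x, y)
    (a, b), (c, d) = hess(x, y)
    det = a * d - b * c
    # solve H s = -g for the Newton step s (2x2 Cramer's rule)
    return x + (-gx * d + gy * b) / det, y + (c * gx - a * gy) / det

def newton_errors(x, y, tol=1e-12, kmax=50):
    # distances e_k from the iterates to the minimizer (1, 1)
    errs = [((x - 1)**2 + (y - 1)**2) ** 0.5]
    while errs[-1] > tol and len(errs) < kmax:
        x, y = newton_step(x, y)
        errs.append(((x - 1)**2 + (y - 1)**2) ** 0.5)
    return errs

if __name__ == "__main__":
    print("Euler on (14):", euler_descent(-1.2, 1.0))  # still short of (1, 1)
    errs = newton_errors(-1.2, 1.0)
    print("Newton iterations:", len(errs) - 1)
    print("quotients e_{k+1}/e_k^2:",
          [errs[k + 1] / errs[k]**2 for k in range(len(errs) - 1)])
```

Twenty thousand Euler steps crawl down the valley without reaching the minimizer, while Newton's method converges to (1, 1) in a handful of iterations and the quotients settle to an O(1) constant, the quadratic-convergence behavior described above.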
For Equation (14), the solution does not reach the minimum for any finite value of t. For Equation (15), the solution approaches the minimum, but the right-hand side blows up at the minimum, where ∇F(x) = 0. From a numerical point of view, the differential equation approach therefore has to be used with caution. Rosenbrock [18] pointed out that the iterative method of steepest descent with line searches is not effective in steep valleys. The iterative method was introduced by Cauchy [16]. In summary, the method of steepest descent is not effective and does not compare with Hestenes' CGS method with no derivatives, which is almost numerically equivalent to Newton's method applied to grad(f) = 0, where f is the function to be minimized.

Below are level curves of Rosenbrock's banana valley function. We used this function to compare Hestenes' CGS method, Newton's method, and the steepest descent methods. In Figure 2, the level curves of Rosenbrock's banana valley function show that the minimizer is at (1, 1). Level curves are plotted for function values 4.0, 4.1, 4.25, 4.5 in Figure 3. For steepest descent, both the iterative method and the ODE approach are illustrated. The curve y = x^2 appears to parallel the valley floor in the graph. We use the CGS method for computation. Rosenbrock's banana valley function,

F(x_1, x_2) = (1 - x_1)^2 + 100(x_2 - x_1^2)^2,

attains its minimum at (1, 1). This example provided us with the geometric illustrations in Figure 2. For the specific algorithms, please refer to Section 3, where the Gram-Schmidt conjugate direction method and Newton's method are stated so that the two methods can be compared alongside one another. The outcomes of the numerical experiments performed on the standard test function using the CGS method are reported above. Based on these data, it is clear that this particular implementation of the CGS method is quite effective.

Figure 2. Level curves of Rosenbrock's banana valley function.

Figure 3.
Curve of steepest descent and level curves for Rosenbrock's banana valley function.

5. Conclusions

In this paper, we introduced a class of CD algorithms that, for small values of n, provide effective minimization methods. As n grows, however, the algorithms become more and more costly to run. The computations above showed that the CGS algorithm without derivatives can generate Newton's method. Since the Hessian matrix of Rosenbrock's function is positive definite symmetric where Sylvester's criterion holds, the CGS method converges if we begin anywhere in the closed convex set in the neighborhood of a minimum.

Using quotient convergence factors, one can see that for Rosenbrock's function the sequence converged quadratically. In particular, the numerical computation on p. 21 revealed that the quotients oscillated between 0.20000 and 0.00307, so the quotient convergence factor of Ortega and Rheinboldt [14] was approximately Q_2{x_k} = 0.20000, which indicates quadratic convergence. The results agreed for Newton's method.

Moreover, the CGS algorithm uses function evaluations and difference quotients for gradient and Hessian evaluations; it requires neither accurate gradient evaluations nor accurate function minimizations. This approach is the most efficient algorithm discussed in this study; yet, it is extremely sensitive both to the choice of σ used for difference quotients and to the choice of ρ used for scaling.

The Gram-Schmidt conjugate direction method without derivatives has been used quite successfully in a variety of applications, including radar designs by Norman Olsen [27] in developing corporate feed systems for antennas and aperture distributions for antenna arrays.
He tweaked the parameters σ and ρ in our GSCD computer programs to obtain successful radar designs.

Supplementary Materials: Supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/appliedmath3020015/s1.

Author Contributions: Conceptualization, I.S.J. and M.N.R.; methodology, I.S.J.; software, I.S.J.; validation, M.N.R. and I.S.J.; formal analysis, M.N.R.; investigation, M.N.R.; resources, I.S.J.; data curation, I.S.J.; writing—original draft preparation, I.S.J.; writing—review and editing, M.N.R.; visualization, M.N.R.; supervision, I.S.J.; project administration, I.S.J. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: Not applicable.

Conflicts of Interest: The authors declare no conflict of interest.

Acronyms: CD conjugate direction; CG conjugate gradient; CGS conjugate Gram–Schmidt; GSCD Gram–Schmidt conjugate direction.

References
1. Fletcher, R.; Reeves, C. Function minimization by conjugate gradients. Comput. J. 1964, 7, 149–154. [CrossRef]
2. Powell, M. An Efficient Method of Finding the Minimum of a Function of Several Variables without Calculating Derivatives. Comput. J. 1964, 7, 155–162. [CrossRef]
3. Davidon, W.C. Variable Metric Method for Minimization; A.E.C. Research and Development Report ANL-5990 (Revision 2); Argonne National Lab.: Lemont, IL, USA, 1959.
4. Fletcher, R.; Powell, M. A rapidly convergent descent method for minimization. Comput. J. 1963, 6, 163–168. [CrossRef]
5. Hestenes, M.R.; Stiefel, E. Methods of Conjugate Gradients for Solving Linear Systems. J. Res. Natl. Bur. Stand. 1952, 49, 409–436. [CrossRef]
6. Smith, C.S. The Automatic Computation of Maximum Likelihood Estimates; Scientific Department, National Coal Board: Bretby, UK, 1962; pp. 7.1–7.3.
7. Zangwill, W.I.
Minimizing a Function without Calculating Derivatives. Comput. J. 1967, 10, 293–296. [CrossRef]
8. Fletcher, R. A Review of Methods for Unconstrained Optimization; Academic Press: New York, NY, USA, 1969; pp. 1–12.
9. Brent, R.P. Algorithms for Minimizing without Derivatives; Prentice-Hall: Englewood Cliffs, NJ, USA, 1973.
10. Hestenes, M.R. Conjugate Direction Methods in Optimization; Springer: New York, NY, USA, 1980.
11. Nocedal, J.; Wright, S.J. Conjugate gradient methods. In Numerical Optimization; Springer: New York, NY, USA, 2006; pp. 101–134.
12. Kelley, C.T. Iterative Methods for Optimization; Society for Industrial and Applied Mathematics (SIAM): Philadelphia, PA, USA, 1999.
13. Zhang, L. An improved Wei–Yao–Liu nonlinear conjugate gradient method for optimization computation. Appl. Math. Comput. 2009, 215, 2269–2274. [CrossRef]
14. Ortega, J.; Rheinboldt, W.C. Iterative Solution of Nonlinear Equations in Several Variables; Academic Press: New York, NY, USA, 1970.
15. Russak, I.B. Convergence of the conjugate Gram–Schmidt method. J. Optim. Theory Appl. 1981, 33, 163–173. [CrossRef]
16. Cauchy, A. Méthode générale pour la résolution des systèmes d'équations simultanées. C. R. Hebd. Séances Acad. Sci. 1847, 25, 536–538.
17. Eells, J. A setting for global analysis. Am. Math. Soc. Bull. 1966, 72, 751–807. [CrossRef]
18. Rosenbrock, H.H. An automatic method for finding the greatest or least value of a function. Comput. J. 1960, 3, 175–184. [CrossRef]
19. Andrei, N. Nonlinear Conjugate Gradient Methods for Unconstrained Optimization; Springer Optimization and Its Applications; Springer Nature Switzerland AG: Cham, Switzerland, 2021; ISBN 978-3-030-42952-2.
20. Jakovlev, M. On the solution of nonlinear equations by iterations. Dokl. Akad. Nauk SSSR 1964, 156, 522–524.
21. Jakovlev, M. On the solution of nonlinear equations by an iteration method. Sibirskii Matematicheskii Zhurnal 1964, 5, 1428–1430. (In Russian)
22. Jakovlev, M.
The solution of systems of nonlinear equations by a method of differentiation with respect to a parameter. USSR Comput. Math. Math. Phys. 1964, 4, 198–203. (In Russian) [CrossRef]
23. Wall, D. The order of an iteration formula. Math. Comp. 1956, 10, 167–168. [CrossRef]
24. Ostrowski, A. Solution of Equations and Systems of Equations, 2nd ed.; Academic Press: New York, NY, USA, 1960.
25. Stein, I., Jr. Conjugate Direction Algorithms in Numerical Analysis and Optimization: Final Report; U.S. Army Research Office, DAHC 04-74-G-0006, National Science Foundation GP-40175, and University of Toledo Faculty Research Grant; University of Toledo: Toledo, OH, USA, 1975.
26. Dennemeyer, R.F.; Mookini, E.H. CGS Algorithms for Unconstrained Minimization of Functions. J. Optim. Theory Appl. 1975, 16, 67–85. [CrossRef]
27. Olsen, N.C. (Consultant, Lockheed, Palmdale, CA, USA) Private communication to Ivie Stein Jr., 2005.
28. Raihen, N. Convergence Rates for Hestenes' Gram–Schmidt Conjugate Direction Method without Derivatives in Numerical Optimization. Master's Thesis in Mathematics, University of Toledo, Toledo, OH, USA, 2017.

Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.