ON THE OVERFLY ALGORITHM IN DEEP LEARNING OF NEURAL NETWORKS

ALEXEI TSYGVINTSEV

Abstract. In this paper we investigate the supervised backpropagation training of multilayer neural networks from a dynamical systems point of view. We discuss some links with the qualitative theory of differential equations and introduce the overfly algorithm to tackle the local minima problem. Our approach is based on the existence of first integrals of the generalised gradient system with built-in dissipation.

Key words and phrases. deep learning, neural networks, dynamical systems, gradient descent.

arXiv:1807.10668v6 [cs.LG] 29 Dec 2018

1. Introduction. The dynamics of gradient flow. Neural networks and backpropagation.

Let $F : U \to \mathbb{R}$ be a smooth function in some open domain $U \subset \mathbb{R}^n$. We equip $U$ with the topology induced by the standard Euclidean norm $\|\cdot\|$ defined by the canonical scalar product $\langle x, y \rangle = \sum_i x_i y_i$. The gradient vector field defined in $U$ by $F$ is given by

$$V(x) = -\nabla F = -\left(\frac{\partial F}{\partial x_1}, \dots, \frac{\partial F}{\partial x_n}\right)^T,$$

where $x = (x_1, \dots, x_n)^T$ are canonical coordinates in $U$. The critical points of $F$ are the solutions of $V(x) = 0$, $x \in U$. Let $K$ be the set of all critical points of $F$ in $U$ (which can be unbounded and/or contain non-isolated points). The following theorem [10], [19] is a classical result describing the asymptotic behaviour of solutions of the gradient differential system:

$$x' = V(x), \quad x \in U. \tag{1.1}$$

Theorem 1.1. Let $x_0 \in U$ be the initial condition of (1.1). Then every solution $t \mapsto x(t)$, $x(0) = x_0$, either leaves all compact subsets of $U$ or approaches the critical set $K$ as $t \to +\infty$, i.e.

$$\lim_{t \to +\infty} \inf_{y \in K} \|x(t) - y\| = 0. \tag{1.2}$$

In particular, at regular points, the trajectories of (1.1) cross the level surfaces of $F$ orthogonally, and isolated minima of $F$ (which is a Lyapunov function [14]) are asymptotically stable equilibrium points. Under the additional analyticity condition the above convergence result can be made stronger:

Theorem 1.2 (Absil, Kurdyka [3]). Let $F$ be real analytic in $U$. Then $y \in K$ is a local minimum of $F$ iff it is an asymptotically stable equilibrium point of (1.1).

It should be noticed that the gradient system (1.1) cannot have any non-constant periodic or recurrent solutions, homoclinic orbits or heteroclinic cycles. Thus, trajectories of gradient dynamical systems have quite simple asymptotic behaviour. Nevertheless, locating the basin of attraction of any equilibrium point (stable or saddle) belonging to $K$ is a non-trivial problem.

Supervised machine learning in multi-layered neural networks can be considered as an application of the gradient descent method to a non-convex optimization problem. The corresponding cost (or error) functions are of the general form

$$E = \frac{1}{2} \sum_i \left(p_i - f(W, A^i)\right)^2, \tag{1.3}$$

with data set $(A^i, p_i)$ and a certain highly non-linear function $f$ containing the weights $W$. The main problem of machine learning is to minimize the cost function $E$ by a suitable choice of weights $W$. A gradient method, described above and called backpropagation in the context of neural network training, can get stuck in local minima or take a very long time to optimize $E$. This is due to the fact that general properties of the cost surface are usually unknown and only trial and error numerical methods are available (see [4], [12], [16], [9], [17], [18], [5]). No theoretical approach is known to provide the exact initial weights in backpropagation with guaranteed convergence to the global minima of $E$. One of the most powerful techniques used in backpropagation is adaptive learning rate selection [8], where the step size of the iterations is gradually raised in order to escape a local minimum. Another approach is based on random initialization [15] of weights, in the hope of selecting them close to the values that give the global minimum of the cost function.
The deterministic approach, called global descent, was proposed in [7], where optimization was formulated in terms of the flow of a special deterministic dynamical system.

The present work seeks to integrate ideas from the theory of ordinary differential equations to enrich the theoretical framework and assist in better understanding the nature of convergence in the training of multi-layered neural networks. The principal contribution is to propose a natural extension of the classical gradient descent method by adding new degrees of freedom and reformulating the problem in a new extended phase space of higher dimension. We argue that this brings a deeper insight into the convergence problem, since the new equations become simpler algebraically and admit a family of known first integrals. While this proposal may seem radical, we believe that it offers a number of advantages on both the theoretical and numerical levels, as our experiments clearly show. Common sense suggests that embedding the dynamics of a gradient flow in a more general phase space of a new, more general dynamical system is always advantageous, since it can bring new possibilities to improve the convergence and escape local minima by embedding the cost surface into the higher dimensional phase space.

The study is divided into three parts. In Section 2 we begin by recalling how the gradient descent method is applied to train the simplest possible neural network with only an output layer. That corresponds to the conventional backpropagation algorithm, known for its simplicity and frequently used in deep learning. Next we introduce a natural extension of the gradient system, obtained by replacing the weights of individual neurones within the output layer by their nonlinear outputs.
That brings more complexity to the iterative method, since the number of parameters rises considerably, but at the same time the training data becomes built into the network in a quite natural way. The generalised gradient system so obtained is later converted to the observer one (see [6]). The aim is to turn the constant level of known first integrals into the attractor set. We will explain how the Euler iterative method, applied to the observer system and called the overfly algorithm, is involved in achieving convergence to the global minimum of the cost function. Sections 3 and 4 discuss the applications of this algorithm to the training of 1-layer and multilayer networks. The objective is to put forward an explanation of how to expand the backpropagation algorithm to its overfly version by modifying the weight updating procedure only for the first layer of the network. In Section 5 we provide concrete numerical examples to illustrate the efficacy of the overfly algorithm in the training of some particular neural networks.

2. Neural network without hidden layers

In this section we give an elementary algebraic description of the simplest neural network with no hidden layer, also called a perceptron (see [11]). We define the sigmoid function

$$\sigma(t) = \frac{1}{1 + e^{-t}}, \quad t \in \mathbb{R}, \tag{2.1}$$

as a particular solution of the logistic algebraic differential equation:

$$\sigma'(t) = \sigma(t)(1 - \sigma(t)). \tag{2.2}$$

In particular, $\sigma : \mathbb{R} \to (0, 1)$ is an increasing map that converges rapidly as $t \to \pm\infty$.

Let $X \in \mathbb{R}^n$ and $A \in \mathbb{R}^n$ be two vectors called respectively the weight and input vectors. The analytic map $f_X : \mathbb{R}^n \to (0, 1)$ defined by

$$f_X : A \mapsto \sigma(\langle A, X \rangle), \tag{2.3}$$

is called a no hidden layer neural network. Let

$$(A^i, p_i), \quad 1 \le i \le N, \tag{2.4}$$

be the training set of (2.3) containing $N$ input data vectors $A^i \in \mathbb{R}^n$ and corresponding scalar output values $p_i \in (0, 1)$. We want to determine the weight vector $X$ so that the $N$ values $f_X(A^i)$ match the outputs $p_i$ as well as possible.
That can be achieved by minimising the so-called cost function

$$E(X) = \frac{1}{2} \sum_{k=1}^{N} \left(p_k - f_X(A^k)\right)^2, \tag{2.5}$$

or, after the substitution of (2.3):

$$E(X) = \frac{1}{2} \sum_{k=1}^{N} \left(p_k - \sigma(\langle A^k, X \rangle)\right)^2. \tag{2.6}$$

In general, $E : \mathbb{R}^n \to [0, N/2)$ is neither coercive nor necessarily convex. To apply the gradient descent method one considers the following system of differential equations

$$X' = -\nabla E(X). \tag{2.7}$$

Since $E$ is always decreasing along the trajectories of (2.7), it is natural to solve it starting from some initial point $X_0 \in \mathbb{R}^n$ and use $X(t)$, $X(0) = X_0$, to minimise $E$. The solution $X$ can converge (in the ideal case) to the global minimum of $E$ or, in the less favourable case, $\|X(t)\| \to +\infty$ or $X$ converges to local minima or saddle points.

The backpropagation method [11] for a neural network can be viewed as the Euler numerical method [13] for solving the gradient system (2.7). Here one approximates the time derivative by its discrete version

$$X'(t) \approx \frac{X(t + h) - X(t)}{h}, \tag{2.8}$$

for some small step $h > 0$, so that the approximate solution $\bar{X}_k \approx X(t_k)$ of (2.7) at time $t_k = kh$ can be obtained by iterations:

$$\bar{X}_{k+1} = \bar{X}_k - h \nabla E(\bar{X}_k), \quad \bar{X}_0 = X_0, \quad k \ge 0. \tag{2.9}$$

We write (2.7) in a simpler algebraic form by introducing the additional variables

$$M_i = \sigma(\langle A^i, X \rangle), \quad i = 1, \dots, N, \tag{2.10}$$

representing the nonlinear outputs of the network for the $N$ given inputs $A^i$ of the training set. Using the equations (2.7) to compute the derivatives $M_i'$, one obtains the following system of $N$ differential equations

$$M_i' = M_i(1 - M_i) \sum_{j=1}^{N} (p_j - M_j) M_j (1 - M_j) G_{i,j}, \tag{2.11}$$

with $G = (G_{i,j})$, $G_{i,j} = \langle A^i, A^j \rangle$, the $N \times N$ symmetric Gram matrix. We call (2.11) the generalised gradient system.

Let $D$ be the $n \times N$ matrix defined by $D = (A^1, \dots, A^N)$. Then $G = D^T D$ and, as known from elementary linear algebra, $\mathrm{rank}(G) = \mathrm{rank}(D)$ and $\mathrm{Ker}(G) = \mathrm{Ker}(D)$.
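The passage from (2.7) to (2.11) can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated training data; the arrays `D`, `p`, `X`, the seed and all helper names are illustrative and not taken from the paper. It compares the chain-rule derivative of $M_i = \sigma(\langle A^i, X \rangle)$ along (2.7) with the right-hand side of (2.11), and runs the Euler iteration (2.9).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cost(X, D, p):
    """E(X) = 1/2 * sum_k (p_k - sigma(<A^k, X>))^2; the columns of D are the A^k."""
    M = sigmoid(D.T @ X)
    return 0.5 * np.sum((p - M) ** 2)

def grad_cost(X, D, p):
    """Gradient of E with respect to the weight vector X."""
    M = sigmoid(D.T @ X)
    return -D @ ((p - M) * M * (1.0 - M))

def generalised_rhs(M, G, p):
    """Right-hand side of (2.11), written in the variables M_i only."""
    return M * (1.0 - M) * (G @ ((p - M) * M * (1.0 - M)))

def euler_backprop(X0, D, p, h, steps):
    """Euler iteration (2.9): X_{k+1} = X_k - h * grad E(X_k)."""
    X = X0.copy()
    for _ in range(steps):
        X = X - h * grad_cost(X, D, p)
    return X

# Illustrative data (assumption): n = 3 weights, N = 6 training vectors.
rng = np.random.default_rng(0)
n, N = 3, 6
D = rng.uniform(-1.0, 1.0, size=(n, N))   # D = (A^1, ..., A^N)
p = rng.uniform(0.1, 0.9, size=N)         # targets in (0, 1)
X = rng.uniform(-1.0, 1.0, size=n)
G = D.T @ D                               # Gram matrix G_ij = <A^i, A^j>

M = sigmoid(D.T @ X)
# Chain rule: M_i' = M_i (1 - M_i) <A^i, X'> with X' = -grad E(X).
chain_rule = M * (1.0 - M) * (D.T @ (-grad_cost(X, D, p)))
max_gap = np.max(np.abs(chain_rule - generalised_rhs(M, G, p)))

X_trained = euler_backprop(X, D, p, h=0.1, steps=200)
print(max_gap, cost(X, D, p), cost(X_trained, D, p))
```

The point of the check is that the weights $X$ enter (2.11) only through the outputs $M_i$ and the Gram matrix $G$, which is what allows the system to be studied in the $M$ variables alone.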
Since the number of training vectors $N$ usually exceeds the total number of weights $n$ of the network, we can assume that $N > n$. Thus, since $\mathrm{rank}(G) \le n$, we have $\dim(\mathrm{Ker}(G)) \ge N - n > 0$.

Let $C = (C_1, \dots, C_N)^T \in \mathrm{Ker}(G)$ be a non-zero vector from the null space of $G$ and $I_N = (0, 1)^N = (0, 1) \times \dots \times (0, 1) \subset \mathbb{R}^N$. As seen from the equations (2.11), $I_N$ is invariant under the flow of the system. Indeed, $M_i = 0$ and $M_i = 1$ are invariant hypersurfaces.

Theorem 2.1. The function

$$I_C(M) = \sum_{k=1}^{N} C_k \ln \frac{M_k}{1 - M_k}, \quad M = (M_1, \dots, M_N) \in I_N, \tag{2.12}$$

is a real analytic first integral of the system (2.11). There exist $p = \dim(\mathrm{Ker}(D)) = N - \mathrm{rank}(D) > 0$ functionally independent first integrals of the above form.

Proof. The first statement can be checked straightforwardly by differentiating (2.12) using (2.11). We notice that if $0 < M_i < 1$ then $M_i/(1 - M_i) > 0$. Thus, $I_C$ is real analytic. The linearity in $C$ and the functional independence of the $I_C$, $C \in \mathrm{Ker}(D)$, follow directly from the definition (2.12).

In the rest of the paper we will always assume that $\mathrm{rank}(D) = n$, i.e. the set $D$ contains sufficiently many independent vectors.

Let $C^1, \dots, C^p$, $p = N - n$, be a basis of $\mathrm{Ker}(D)$. Using the vector notation

$$F(M) = \left(\ln \frac{M_1}{1 - M_1}, \dots, \ln \frac{M_N}{1 - M_N}\right)^T, \quad M = (M_1, \dots, M_N), \tag{2.13}$$

the family of the first integrals given by Theorem 2.1 can be written simply as

$$I_{C^i}(M) = \langle C^i, F(M) \rangle, \quad i = 1, \dots, p. \tag{2.14}$$

Let $H : I_N \to \mathbb{R}^p$, $I_N = (0, 1)^N \subset \mathbb{R}^N$, be the map defined by

$$H(M) = \left(I_{C^1}(M), \dots, I_{C^p}(M)\right)^T. \tag{2.15}$$

Lemma 2.1. $H : I_N \to \mathbb{R}^p$ is a submersion.

Proof. This follows directly from the fact that $C^1, \dots, C^p$ are linearly independent vectors and (2.14).

Thus, for all $y \in \mathbb{R}^p$ the set $\Gamma_y = I_N \cap H^{-1}(y)$ is an $n$-dimensional invariant manifold for the system (2.11).

Lemma 2.2. $\Gamma_0$ is diffeomorphic to $\mathbb{R}^n$.

Proof. Let $X \in \mathbb{R}^n$. We define the map $\Phi : \mathbb{R}^n \to \mathbb{R}^N$ by

$$\Phi(X) = \left(\sigma(\langle A^1, X \rangle), \dots, \sigma(\langle A^N, X \rangle)\right)^T. \tag{2.16}$$

Then $I_{C^i}(\Phi(X)) = \sum_{j=1}^{N} C_{ij} \langle A^j, X \rangle = \left\langle \sum_{j=1}^{N} C_{ij} A^j, X \right\rangle = 0$ and so $\Phi : \mathbb{R}^n \to \Gamma_0$.

To show that $\Phi$ is invertible, let us fix $M \in \Gamma_0$. Since $\sigma : \mathbb{R} \to (0, 1)$ is one-to-one, there exists a unique vector $Z = (Z_1, \dots, Z_N)^T \in \mathbb{R}^N$ such that $M_i = \sigma(Z_i)$ for $i = 1, \dots, N$ and

$$\langle C^i, Z \rangle = 0, \quad i = 1, \dots, p, \tag{2.17}$$

because $F(M) = Z$ by substitution into (2.13).

We are now looking for the solution $X \in \mathbb{R}^n$ of the linear system $\langle A^i, X \rangle = Z_i$, $i = 1, \dots, N$, which can be written in vector form as $D^T X = Z$. The linear map $\varphi : \mathbb{R}^n \to \mathbb{R}^N$, $\varphi(X) = D^T X$, has $\mathrm{rank}(\varphi) = n$. Moreover, $\mathrm{Im}(\varphi) = \mathrm{Ker}(D)^{\perp}$, where orthogonality is defined by the scalar product $\langle \cdot, \cdot \rangle$. Indeed, $\mathrm{Im}(\varphi) \subset \mathrm{Ker}(D)^{\perp}$ by direct verification, and $\dim(\mathrm{Im}(\varphi)) = \dim(\mathrm{Ker}(D)^{\perp})$ by the rank-nullity theorem. Hence, the map $\varphi : \mathbb{R}^n \to \mathrm{Ker}(D)^{\perp}$ is a linear bijection and the linear equation $D^T X = Z \iff \varphi(X) = Z$ admits a unique solution $X$, since $Z \in \mathrm{Ker}(D)^{\perp}$ as follows from (2.17). The proof is done.

The system (2.11) can be written in vector form as $M' = V(M)$, where $V$ is a complete vector field in $I_N$ ($I_N$ is a bounded open invariant set). Let $\epsilon > 0$ and

$$U_\epsilon = \{M \in I_N : r(M) = \|H(M)\| \le \epsilon\}, \tag{2.18}$$

be the $\epsilon$-neighbourhood of $\Gamma_0$. Together with (2.11), consider the following observer system

$$M' = W(M) = V(M) + P(M), \quad M \in I_N, \tag{2.19}$$

where, for a constant $k > 0$,

$$P(M) = -k\, \Pi(M) R F(M), \quad R = \Theta \tilde{R}^{-1} \Theta^t, \quad \tilde{R} = \Theta^t \Theta. \tag{2.20}$$

Here, $\Theta = (C^1, \dots, C^p)$ is the $N \times p$ matrix whose columns are the basis vectors of $\mathrm{Ker}(D)$, and

$$\Pi(M) = \mathrm{diag}\left(M_1(1 - M_1), \dots, M_N(1 - M_N)\right). \tag{2.21}$$

The matrix $\tilde{R}$ is invertible and positive definite since $\mathrm{rank}(\Theta) = N - n$. Thus, the vector field $P$ is well defined in $I_N$.

Theorem 2.2. Let $M_0 \in I_N$ and let $t \mapsto M(t)$ be the solution of the observer system (2.19) with the initial condition $M(0) = M_0$. Then

$$r(M(t)) = r(M_0) e^{-kt}, \quad t \ge 0, \tag{2.22}$$

with $r$ defined in (2.18). In particular, $\lim_{t \to +\infty} r(M(t)) = 0$ and $U_\epsilon$ is an invariant set containing $\Gamma_0$ as attractor.

Proof.
Firstly, we write the map $H$ introduced in (2.15) in the compact matrix form $H(M) = \Theta^t F(M)$. We now follow the idea of the proof of the Main Lemma from [6], p. 377, and differentiate $r^2$ with respect to time along the solution of (2.19). Since $\Theta^t \Pi^{-1} V = 0$ by Theorem 2.1 and $\Theta^t R = \Theta^t$, one has $\frac{d}{dt} H(M(t)) = -k H(M(t))$, which gives a simple differential equation:

$$\frac{d r^2(M(t))}{dt} = -2k\, r^2(M(t)), \quad r^2(M(0)) = r(M_0)^2, \tag{2.23}$$

which can be easily solved to get (2.22).

We notice that our choice of the term $P$ in (2.19) is different from the one proposed in [6].

Lemma 2.3. The function

$$E(M) = \frac{1}{2} \sum_{i=1}^{N} (p_i - M_i)^2, \tag{2.24}$$

is a Lyapunov function and satisfies $\frac{dE(M(t))}{dt} \le 0$ for every solution $t \mapsto M(t)$, $M(0) \in I_N$, of (2.11).

Proof. It is sufficient to differentiate $E$ and to use the positiveness of the Gram matrix $G = D^T D$. Indeed, with $u_i = (p_i - M_i) M_i (1 - M_i)$, the system (2.11) gives $\frac{dE}{dt} = -u^T G u \le 0$.

Now we shall explain the role of the observer system (2.19) in the problem of minimisation of the cost function (2.5).

Firstly, while using the standard gradient descent method, instead of dealing with the system (2.7), one can solve the observer equations (2.19) with some initial condition $M(0) \in \Gamma_0$ and then use Lemma 2.2 to compute the $X$ corresponding to $M(t)$ for some sufficiently large $t > 0$. It is well known that applying the Euler method (2.8) to solve (2.7), i.e. following the conventional backpropagation algorithm, leads to the accumulation of a global error proportional to the step size $h$. At the same time, the numerical integration of the observer system (2.19) is, owing to the existence of the attractor set $\Gamma_0$, much more stable numerically, since the solution is attracted by the integral manifold $\Gamma_0$ (see [6] for more details and examples).

The second improvement brought by the observer system (2.19) is more promising. Imagine we start the integration of (2.19) with the perturbed initial condition $M(0) \in U_\epsilon$, $M(0) \notin \Gamma_0$, for some $\epsilon > 0$. Then, according to Theorem 2.2, $M(t) \to \Gamma_0$ as $t \to +\infty$ and, as follows from Lemma 2.3, $t \mapsto E(M(t))$ will be a decreasing function of $t > 0$ in a neighbourhood of $\Gamma_0$, since $P = 0$ on $\Gamma_0$.
That can be seen as a coexistence of the local dynamics of the observer system in $U_\epsilon$, pushing $M$ to the equilibrium point $M_i = p_i$, $i = 1, \dots, N$, of (2.7), and the dynamics of the gradient system (2.7) on $\Gamma_0$, forcing $M$ to approach the set of critical points (see Figure 3). One can suggest that this kind of double dynamics considerably increases the chances of convergence to the global minimum of the cost function (2.5).

We call overfly the training of the neural network (2.3) done by solving the observer system (2.19) with the help of the first-order Euler method, starting from some initial point $M(0) \in U_\epsilon \setminus \Gamma_0$.

3. The 1-hidden layer network case

In this section we describe the generalised gradient system of differential equations appearing in the supervised backpropagation training of a 1-hidden layer network. As in the previous section, let $A \in \mathbb{R}^n$ belong to the training set (2.4). Let $Y^1, \dots, Y^m \in \mathbb{R}^n$ be the $m$ weight vectors of the hidden layer and let $X \in \mathbb{R}^m$ be the weight vector of the output layer. The 1-hidden layer neural network is a real analytic map $f_{Y,X} : \mathbb{R}^n \to (0, 1)$ defined as follows

$$f_{Y,X}(A) = \sigma(\langle \pi_Y(A), X \rangle), \tag{3.1}$$

where $\pi_Y(A) = \left(\sigma(\langle A, Y^1 \rangle), \dots, \sigma(\langle A, Y^m \rangle)\right)^T$ are the outputs of the first layer. We want to minimise the same cost function

$$E(Y, X) = \frac{1}{2} \sum_{i=1}^{N} \left(p_i - f_{Y,X}(A^i)\right)^2, \tag{3.2}$$

where $(A^i, p_i)$, $i = 1, \dots, N$, is the training set. To solve the optimisation problem one can define the gradient system analogous to (2.7) with respect to the vector variables $Y$ and $X$:

$$(Y^i)' = -\nabla_{Y^i} E, \quad X' = -\nabla_X E, \quad 1 \le i \le m. \tag{3.3}$$

Let us introduce the following scalar variables:

$$\Omega_{jk} = \sigma(\langle A^j, Y^k \rangle). \tag{3.4}$$

The function (3.2), expressed in the new variables, takes the following form

$$E(\Omega, X) = \frac{1}{2} \sum_{i=1}^{N} \left(p_i - \sigma(\langle \Omega^i, X \rangle)\right)^2, \quad \Omega^i = (\Omega_{i1}, \dots, \Omega_{im})^T. \tag{3.5}$$

The differential equations describing the generalised gradient system for the neural network (3.1) are obtained by differentiating (3.4) with the help of (3.3):

$$\begin{aligned}
\Omega_{ik}' &= m_{ik}(\Omega, X) = \Omega_{ik}(1 - \Omega_{ik}) X_k \sum_{j=1}^{N} (p_j - \omega_j)\, \omega_j (1 - \omega_j)\, \Omega_{jk}(1 - \Omega_{jk})\, G_{ij}, \\
X' &= -\nabla_X E = \sum_{i=1}^{N} (p_i - \omega_i)\, \omega_i (1 - \omega_i)\, \Omega^i, \quad \omega_i = \sigma(\langle \Omega^i, X \rangle),
\end{aligned} \tag{3.6}$$

where $G_{ij} = \langle A^i, A^j \rangle$ is the Gram matrix defined by the training set (2.4).

The next theorem is a generalisation of Theorem 2.1. Let $r = \dim(\mathrm{Ker}(G))$ and $\mathrm{Ker}(G) = \mathrm{Span}(C^1, \dots, C^r)$, $C^j = (C_{j1}, \dots, C_{jN})^T$.

Theorem 3.1. The generalised gradient system (3.6) admits $rm$ functionally independent first integrals

$$I_{C^j,k}(\Omega) = \sum_{i=1}^{N} C_{ji} \ln \frac{\Omega_{ik}}{1 - \Omega_{ik}}, \quad 1 \le j \le r, \quad 1 \le k \le m. \tag{3.7}$$

The cost function $E$ defined by (3.5) is a Lyapunov function for (3.6).

Proof. One verifies directly that $I_{C^j,k}$ is a first integral of (3.6) by simple differentiation. A rather tedious but elementary calculation shows that $\frac{d}{dt} E(\Omega(t), X(t)) \le 0$ along the solutions of (3.6) (see also Theorem 4.1 for the general proof).

The observer system, analogous to (2.19), written for the generalised gradient system (3.6), can be obtained straightforwardly by replacing the first equation of (3.6) with

$$\Omega_{ik}' = U_{ik}(\Omega, X) + P_{ik}(\Omega), \quad X' = -\nabla_X E, \quad 1 \le i \le N, \quad 1 \le k \le m, \tag{3.8}$$

where $U_{ik} = m_{ik}$ is the right-hand side of the first equation of (3.6) and the additional term $P$ is defined, similarly to (2.20), with the help of the first integrals given by Theorem 3.1.

Indeed, let $K = (K_{ij})_{1 \le i \le N,\, 1 \le j \le m}$ and $S = (S_{ij})_{1 \le i \le N,\, 1 \le j \le m}$ be two matrices defined by

$$K_{ij} = \Omega_{ij}(1 - \Omega_{ij}), \quad S_{ij} = \ln \frac{\Omega_{ij}}{1 - \Omega_{ij}}. \tag{3.9}$$

To prove a result similar to Theorem 2.2 one can define $P$ in (3.8) as follows

$$P = -k\, K \circ (R S), \tag{3.10}$$

where the constant matrix $R$ is the same as in (2.20) and "$\circ$" is the Hadamard (entrywise) matrix product. Indeed, the first integrals defined by (3.7) can be written in matrix form: $H(\Omega) = \Theta^t S(\Omega)$.
Then, differentiating $r^2(\Omega(t)) = \|H(\Omega(t))\|^2$, where $\|\cdot\|$ is the Frobenius matrix norm, along a solution $t \mapsto \Omega(t)$ of (3.8), one gets

$$\frac{d r^2(\Omega(t))}{dt} = -2k\, r^2(\Omega(t)), \tag{3.11}$$

and so

$$r(\Omega(t)) = r(\Omega_0) e^{-kt}, \quad t \ge 0. \tag{3.12}$$

The practical implementation of the overfly algorithm in the 1-layer case is analogous to the one described in Section 2. Instead of modifying the weights $Y$ of the first layer at every step of the gradient descent, one updates the values of $\Omega_{ik}$ and $X$ by applying the Euler method to solve the observer equations (3.8). For the sake of simplicity, we provide below the explicit matrix form of the system (3.8), which is better adapted to numerical implementations. We introduce the following diagonal matrices:

$$\hat{P}_\omega = \mathrm{diag}\left((p_1 - \omega_1)\omega_1(1 - \omega_1), \dots, (p_N - \omega_N)\omega_N(1 - \omega_N)\right), \quad \hat{X} = \mathrm{diag}(X_1, \dots, X_m), \tag{3.13}$$

and the $N$-vector

$$\tilde{P}_\omega = \left((p_1 - \omega_1)\omega_1(1 - \omega_1), \dots, (p_N - \omega_N)\omega_N(1 - \omega_N)\right)^T. \tag{3.14}$$

Let $X = (X_1, \dots, X_m)^T$ be the $m$-vector of the output layer and let $\Omega = (\Omega_{ik})$ be viewed as an $N \times m$ matrix. The observer system (3.8) can be written in the following compact form

$$\begin{aligned}
\Omega' &= K \circ \left(G \hat{P}_\omega K \hat{X} - k R S\right), \\
X' &= \Omega^T \tilde{P}_\omega,
\end{aligned} \tag{3.15}$$

where $K = \Omega - \Omega \circ \Omega$.

4. General multilayer case

We want to analyse a general multilayer neural network with the architecture $n - l - \dots - 1$. Here $n$ is the number of inputs and $l$ is the number of neurones in the very first layer. The network has only one output and in every layer the same sigmoid function (2.1) is used. The training set is defined by (2.4). Let $Y^i \in \mathbb{R}^n$, $1 \le i \le l$, be the weight vectors of the $l$ neurones of the first layer. We denote by $Z$ the weights of the other layers of the network. Let $A \in \mathbb{R}^n$ be the input vector. The generic multilayer neural network can be written as the composition of two maps:

$$f_{Y,Z}(A) = \Phi_Z \circ \pi_Y(A), \tag{4.1}$$

where $\Phi_Z : \mathbb{R}^l \to (0, 1)$, $\pi = (\pi_1, \dots, \pi_l)^T \mapsto \Phi_Z(\pi)$, is defined jointly by all layers different from the first one and

$$\pi_Y(A) = \left(\sigma(\langle A, Y^1 \rangle), \dots, \sigma(\langle A, Y^l \rangle)\right)^T, \tag{4.2}$$

is the output vector of the first layer. Using the chain rule one obtains for every $k = 1, \dots, l$:

$$\frac{\partial f_{Y,Z}}{\partial Y_{ki}} = \left\langle \nabla \Phi_Z, \frac{\partial \pi_Y}{\partial Y_{ki}} \right\rangle, \quad i = 1, \dots, n, \tag{4.3}$$

where, according to (4.2),

$$\frac{\partial \pi_Y}{\partial Y_{ki}} = \sigma(\langle A, Y^k \rangle)\left(1 - \sigma(\langle A, Y^k \rangle)\right)(0, \dots, A_i, \dots, 0)^T, \tag{4.4}$$

with $A_i$ in the $k$-th position. Thus, combining (4.3) and (4.4) we obtain:

$$\frac{\partial f_{Y,Z}}{\partial Y_{ki}} = \frac{\partial \Phi_Z}{\partial \pi_k}\, \sigma(\langle A, Y^k \rangle)\left(1 - \sigma(\langle A, Y^k \rangle)\right) A_i. \tag{4.5}$$

We can now compute the partial derivatives of the cost function

$$E(Y, Z) = \frac{1}{2} \sum_{j=1}^{N} \left(p_j - f_{Y,Z}(A^j)\right)^2, \tag{4.6}$$

with respect to the weights of the first layer:

$$\frac{\partial E}{\partial Y^k} = -\sum_{j=1}^{N} \left(p_j - f_{Y,Z}(A^j)\right) \sigma(\langle A^j, Y^k \rangle)\left(1 - \sigma(\langle A^j, Y^k \rangle)\right) \frac{\partial \Phi_Z}{\partial \pi_k}(\pi_Y(A^j))\, A^j. \tag{4.7}$$

The equation of the gradient system corresponding to the weight vector $Y^k$ can be written as

$$(Y^k)' = -\nabla_{Y^k} E = -\frac{\partial E}{\partial Y^k}. \tag{4.8}$$

Introducing the variables

$$\Omega_{pk} = \sigma(\langle A^p, Y^k \rangle), \tag{4.9}$$

called the splitting weights, whose derivatives can be found with the help of (4.8), we deduce from (4.7) the following differential equations

$$\Omega_{pk}' = \Omega_{pk}(1 - \Omega_{pk}) \sum_{i=1}^{N} \left(p_i - \Phi_Z(\Omega^i)\right) \Omega_{ik}(1 - \Omega_{ik}) \frac{\partial \Phi_Z}{\partial \pi_k}(\Omega^i)\, G_{ip}, \tag{4.10}$$

where $\Omega^i = (\Omega_{i1}, \dots, \Omega_{il})^T$.

The above equations can also be written as

$$\Omega_{pk}' = N_{pk}(\Omega, Z), \quad 1 \le p \le N, \quad 1 \le k \le l. \tag{4.11}$$

Indeed, $f_{Y,Z}(A^j)$ and $\frac{\partial \Phi_Z}{\partial \pi}(\pi_Y(A^j))$ are functions of $\Omega$ and $Z$ only. Moreover, the same holds for the cost function $E$ defined in (4.6) and its gradient $\nabla_Z E = \partial E / \partial Z$: they can be written as functions of the variables $\Omega$ and $Z$.

Let $r = \dim(\mathrm{Ker}(G))$ be the dimension of the null space of the Gram matrix $G_{i,j} = \langle A^i, A^j \rangle$ and $\mathrm{Ker}(G) = \mathrm{Span}(C^1, \dots, C^r)$. We denote $C^i = (C_{i1}, \dots, C_{iN})^T$.

Theorem 4.1. Let

$$\Omega' = N(\Omega, Z), \quad Z' = -\nabla_Z E(\Omega, Z), \tag{4.12}$$

be the generalised gradient system written for the multilayer network (4.1) with the training set $(A^i, p_i)$, $i = 1, \dots, N$. Then (4.12) admits $rl$ independent first integrals of the form

$$I_{C^j,k}(\Omega) = \sum_{i=1}^{N} C_{ji} \ln \frac{\Omega_{ik}}{1 - \Omega_{ik}}, \quad \Omega_{ik} = \sigma(\langle A^i, Y^k \rangle). \tag{4.13}$$

The cost function (4.6), $E = E(\Omega, Z)$, is a Lyapunov function for (4.12).

Proof. It is straightforward to verify that the $I_{C^j,k}$ are functionally independent first integrals of (4.12). According to (4.1), (4.2) and (4.9), the cost function (4.6), written in the variables $\Omega$, $Z$, is given by

$$E(\Omega, Z) = \frac{1}{2} \sum_{i=1}^{N} \left(p_i - \Phi_Z(\Omega^i)\right)^2, \quad \Omega^i = (\Omega_{i1}, \dots, \Omega_{il})^T. \tag{4.14}$$

Let $t \mapsto (\Omega(t), Z(t))$ be a solution of (4.12). Then

$$\frac{d}{dt} E(\Omega(t), Z(t)) = \left\langle \frac{\partial E}{\partial \Omega}, N \right\rangle_\Omega - \left\langle \frac{\partial E}{\partial Z}, \nabla_Z E \right\rangle_Z = \left\langle \frac{\partial E}{\partial \Omega}, N \right\rangle_\Omega - \|\nabla_Z E\|^2, \tag{4.15}$$

where $\langle \cdot, \cdot \rangle_\Omega$ and $\langle \cdot, \cdot \rangle_Z$ are the standard scalar products defined respectively in the spaces $\mathbb{R}^a$ and $\mathbb{R}^b$, where $a = Nl$ is the total number of splitting weights $\Omega_{pk}$ and $b$ is the total number of weights $Z$ of the neural network (4.1). One writes with the help of (4.11):

$$\left\langle \frac{\partial E}{\partial \Omega}, N \right\rangle_\Omega = \sum_{i=1}^{N} \sum_{k=1}^{l} \frac{\partial E}{\partial \Omega_{ik}} N_{ik} = -\sum_{k=1}^{l} \sum_{i=1}^{N} \sum_{j=1}^{N} T_{ik}\, G_{ij}\, T_{jk}, \tag{4.16}$$

where $T_{ik} = \left(p_i - \Phi_Z(\Omega^i)\right) \frac{\partial \Phi_Z}{\partial \pi_k}(\Omega^i)\, \Omega_{ik}(1 - \Omega_{ik})$. Since $G$ is a positive semidefinite matrix, the last equality implies $\left\langle \frac{\partial E}{\partial \Omega}, N \right\rangle_\Omega \le 0$. Together with (4.15) this shows that $E$ is a Lyapunov function of (4.12).

The observer system, defined by analogy with (2.19) for the generalised gradient system (4.12), can be written in the following form

$$\Omega' = N(\Omega, Z) + P(\Omega), \quad Z' = -\nabla_Z E(\Omega, Z), \tag{4.17}$$

where the vector field $P$, called the dissipation term, is defined by the first integrals (4.13) and given by the same formula (3.10).

The overfly algorithm for neural network training, already described in the previous sections, can be easily adapted to the general multilayer case. The only difference from the conventional backpropagation applied to the network (4.1) consists in replacing the weights $Y_{ij}$ of the first layer by the splitting weights $\Omega_{pk}$, while continuing to update the weights $Z$ of the other layers according to the usual backpropagation algorithm.
At each iteration step, the evolution of the parameters $\Omega_{pk}$, $Z$ is governed by the Euler discretisation of the observer system (4.17).

5. Conclusion and numerical results

In this section we compare the usual backpropagation and the overfly methods for some particular neural networks.

We start with a simple no hidden layer case (2.3). We put $n = 1$ and $X = x \in \mathbb{R}$. Let $N = 5$ and let the input values be defined by

$$T = [79/100,\ -9/20,\ 7/10,\ -9/50,\ -19/25], \tag{5.1}$$

with the corresponding output vector $p$:

$$p = [-1/20,\ -21/25,\ -11/100,\ 61/100,\ -83/100]. \tag{5.2}$$

The couple $(T, p)$ defines the training set (2.4). Analysing the equation $E'(x) = 0$, with $E$ defined in (2.5), one calculates, with the help of the RootFinding routine of Maple 10, two local minima $A$ and $B$ (see Figure 1) of the cost function $E$ at the points $x_A = 2.510$, $E(x_A) = 1.967$ and $x_B = 6.067$, $E(x_B) = 1.966$, with $B$ being the global minimum of $E$.

The gradient system (2.7) was solved using the Euler method (2.9) with $h = 1$ and the initial point $x(0) = 3$. After $d = 3000$ iterations one obtains $x_d = x_A = 2.510$ with $E(x_d) = 1.967$, and the backpropagation network converges to the local minimum $A$.

To calculate the vector $M(0)$ corresponding to $x(0)$, one can apply Lemma 2.2 to find

$$M(0) = [0.879,\ 0.244,\ 0.853,\ 0.389,\ 0.129]. \tag{5.3}$$

Now, following the overfly approach, we consider the observer system (2.19) with $k = 0.002$ and initial conditions $M(0) + \tilde{M}$, with the perturbation vector $\tilde{M}$ defined by

$$\tilde{M} = [0.01,\ 0.01,\ 0.01,\ 0.01,\ 0.01]. \tag{5.4}$$

The Euler method, applied to (2.19) with $h = 1$, provides after $\delta = 3000$ iterations the value $x = \tilde{x}_\delta = 6.085$ with $E(\tilde{x}_\delta) = 1.966$. Since $\tilde{x}_\delta$ is sufficiently close to $x_B$, we conclude that the overfly network converges to the global minimum $B$ rather than to the local one $A$. So, the benefits of the overfly training are immediately visible.

We have tested numerically the overfly method for a $4 - 2 - 1$ neural network (3.1). It has 4 inputs and 1 hidden layer with 2 neurones ($n = 4$, $m = 2$).
Both the hidden and the output layers have biases. The input data set has $N = 10$ entries arranged into the following $4 \times 10$ matrix $A = [A^1, \dots, A^{10}]$:

$$A = \begin{pmatrix}
0.234 & -0.316 & -0.746 & 0.064 & 0.124 & 0.894 & -0.786 & -0.076 & 1.044 & -0.436 \\
-0.385 & -0.835 & 0.015 & 0.365 & -0.935 & 0.135 & 0.335 & 0.505 & 0.495 & 0.305 \\
0.764 & 0.594 & 0.684 & -0.946 & 0.024 & -0.196 & -0.596 & 0.534 & -0.436 & -0.426 \\
-1.014 & -0.074 & 0.346 & 0.876 & -0.354 & -0.184 & -0.174 & -0.254 & 0.266 & 0.566
\end{pmatrix} \tag{5.5}$$

The columns of $A$ were chosen randomly; each row of $A$ has zero mean. The output target vector $p \in \mathbb{R}^{10}$ is of the form

$$[0.301,\ 0.30001,\ 0.30002,\ 0.30013,\ 0.30004,\ 0.30005,\ 0.30006,\ 0.30007,\ 0.30008,\ 0.30009], \tag{5.6}$$

and corresponds to a highly deviated data set. In particular:

$$\frac{p_1 - p_2}{p_3 - p_2} = 99 \quad \text{and} \quad \frac{p_6 - p_5}{p_5 - p_4} = 1. \tag{5.7}$$

Firstly, the standard $4 - 2 - 1$ neural network (3.1) was trained on the above data set using the usual backpropagation method (BM), with weights $Y$ and $X$ chosen randomly in the interval $[-1, 1]$. The number of iterations was $d = 1500$ with the step size $h = 0.1$. Then, the overfly algorithm was applied, as described in Section 3, with randomly chosen initial splitting weights $\Omega_{ij} \in (0, 1)$, the same $X$ and the dissipation parameter $k = 0.01$. The observer system (3.15) was solved by the Euler method with the same step size $h = 0.1$ and the same iteration number $d = 1500$. At each iteration we computed the cost function value for both methods: using the formula (3.2) for BM and the expression (3.5) for the overfly method (OM). The final cost value after $d$ iterations was $E_{BM} = 0.588 \cdot 10^{-3}$ for BM and $E_{OM} = 3.499 \cdot 10^{-7}$ for OM, with the ratio $E_{BM}/E_{OM} \approx 146$. Thus the overfly algorithm significantly outperforms the conventional backpropagation for this particular problem. Figure 2 contains graphs of both cost functions in logarithmic scale. We notice that our example is quite generic, since our numerical experiments show that statistically OM gives more precise results than BM for large deviation output data sets.
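The right-hand side of (3.15) for data of this shape can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the reported experiment: it assumes NumPy, omits the biases, takes the hat and tilde quantities of (3.13) and (3.14) as the diagonal matrix and the vector with entries $(p_i - \omega_i)\omega_i(1 - \omega_i)$, and uses a hypothetical seed and initial range for $\Omega$ and $X$. It makes no attempt to reproduce the cost values quoted above.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def kernel_projector(G, tol=1e-10):
    """Theta: orthonormal basis of Ker(G); R = Theta (Theta^t Theta)^{-1} Theta^t."""
    w, V = np.linalg.eigh(G)                   # G is symmetric positive semidefinite
    Theta = V[:, w < tol * w.max()]
    R = Theta @ np.linalg.solve(Theta.T @ Theta, Theta.T)
    return Theta, R

def observer_rhs(Omega, X, G, R, p, k):
    """Right-hand sides of (3.15), split into gradient and dissipation parts."""
    K = Omega - Omega * Omega                  # entries Omega_ik (1 - Omega_ik)
    S = np.log(Omega / (1.0 - Omega))          # entries ln(Omega_ik / (1 - Omega_ik))
    omega = sigmoid(Omega @ X)                 # outputs omega_i = sigma(<Omega^i, X>)
    P_vec = (p - omega) * omega * (1.0 - omega)
    grad_part = K * (G @ (P_vec[:, None] * K * X[None, :]))
    diss_part = -k * K * (R @ S)
    return grad_part, diss_part, Omega.T @ P_vec

def cost(Omega, X, p):
    """Cost (3.5) in the variables Omega, X."""
    return 0.5 * np.sum((p - sigmoid(Omega @ X)) ** 2)

def overfly(Omega, X, G, R, p, k, h, steps):
    """Euler iteration for the observer system (3.15)."""
    for _ in range(steps):
        g, d, dX = observer_rhs(Omega, X, G, R, p, k)
        Omega, X = Omega + h * (g + d), X + h * dX
    return Omega, X

A = np.array([  # input matrix (5.5), one training vector per column
    [0.234, -0.316, -0.746, 0.064, 0.124, 0.894, -0.786, -0.076, 1.044, -0.436],
    [-0.385, -0.835, 0.015, 0.365, -0.935, 0.135, 0.335, 0.505, 0.495, 0.305],
    [0.764, 0.594, 0.684, -0.946, 0.024, -0.196, -0.596, 0.534, -0.436, -0.426],
    [-1.014, -0.074, 0.346, 0.876, -0.354, -0.184, -0.174, -0.254, 0.266, 0.566],
])
p = np.array([0.301, 0.30001, 0.30002, 0.30013, 0.30004,
              0.30005, 0.30006, 0.30007, 0.30008, 0.30009])   # targets (5.6)
G = A.T @ A
Theta, R = kernel_projector(G)

rng = np.random.default_rng(1)                       # hypothetical seed
Omega0 = rng.uniform(0.2, 0.8, size=(10, 2))         # assumed initial splitting weights
X0 = rng.uniform(-1.0, 1.0, size=2)
Omega_d, X_d = overfly(Omega0, X0, G, R, p, k=0.01, h=0.1, steps=1500)
print(cost(Omega0, X0, p), cost(Omega_d, X_d, p))
```

Splitting the right-hand side into a gradient part and a dissipation part makes the structure of Section 3 visible: the gradient part leaves the first integrals (3.7) unchanged, while the dissipation part contracts them at rate $k$.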
We notice that there is an obvious resemblance between the conventional backpropagation and the overfly approaches. Below we summarise briefly the principal steps of the proposed method.

Step 1: Splitting. Assuming that the training data $(A^i, p_i)$ is given, firstly, it is necessary to compute the generating vectors of the null space of the matrix $D = (A^1, \dots, A^N)$, i.e. determine $\mathrm{Ker}(G)$. Secondly, one introduces $Nl$ splitting weights (4.9) to replace the $nl$ weights of the neurones of the first layer. In practice, the number $N$ of training examples can be considerably larger than the input size $n$ of the network, so the splitting brings additional parameters to be stored in memory.

Step 2: Dissipation. Using the vectors spanning $\mathrm{Ker}(G)$ one creates a procedure computing the dissipation term $P$ defined by (2.20). The matrix inversion in (2.20) can be done, in the beginning, using the conjugate gradient algorithm [2], i.e. in an iterative way. Indeed, the matrix $\tilde{R}$ is symmetric and positive definite.

Step 3: Generalised gradient - observer. The first-order Euler iterative method is applied next to solve the observer system (4.17). The optimal choice of the step $h$ and the constant $k$ depends on the concrete problem. We suggest first running the usual backpropagation (i.e. choosing the initial value $\Omega \in \Gamma_0$) and then trying to improve the result using several choices of initial values for $\Omega_{ik} \in (0, 1)$ and of $k > 0$ in the overfly training. If $k = 0$, then no dissipation term is present and, starting with $\Omega \notin \Gamma_0$, the method can provide only an approximation of the neural network weights. But it is still worth trying: if the initial values of $\Omega$ are sufficiently close to $\Gamma_0$ they will stay near $\Gamma_0$ (the first integrals (4.13) are conserved) and the complexity of the algorithm is greatly reduced, since no dissipation is added at every iteration (no need to compute $P$ in (4.17) at every step).
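The three steps can be sketched for the no hidden layer case of Section 2, using the one-dimensional data (5.1)-(5.4). This is a minimal illustration assuming NumPy, not the author's implementation: the null space is computed with an SVD, the inverse in (2.20) is replaced by a direct linear solve instead of the conjugate gradient iteration, and the final recovery of a scalar weight from $M$ by least squares is an assumption made here for illustration. All function names are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit(M):
    """F(M) from (2.13), applied componentwise."""
    return np.log(M / (1.0 - M))

def kernel_basis(D, tol=1e-10):
    """Step 1: orthonormal basis Theta (N x p) of Ker(D) = Ker(G), from the SVD of D."""
    _, s, Vt = np.linalg.svd(D, full_matrices=True)
    rank = int(np.sum(s > tol * s.max()))
    return Vt[rank:].T

def dissipation(M, Theta, k):
    """Step 2: P(M) = -k Pi(M) R F(M) with R = Theta (Theta^t Theta)^{-1} Theta^t."""
    RF = Theta @ np.linalg.solve(Theta.T @ Theta, Theta.T @ logit(M))
    return -k * M * (1.0 - M) * RF

def gradient_field(M, G, p):
    """V(M): right-hand side of the generalised gradient system (2.11)."""
    return M * (1.0 - M) * (G @ ((p - M) * M * (1.0 - M)))

def overfly(M0, G, Theta, p, k, h, steps):
    """Step 3: Euler iteration for the observer system M' = V(M) + P(M)."""
    M = M0.copy()
    for _ in range(steps):
        M = M + h * (gradient_field(M, G, p) + dissipation(M, Theta, k))
    return M

def cost(M, p):
    """Cost (2.24) in the variables M."""
    return 0.5 * np.sum((p - M) ** 2)

T = np.array([0.79, -0.45, 0.70, -0.18, -0.76])     # inputs (5.1)
p = np.array([-0.05, -0.84, -0.11, 0.61, -0.83])    # targets (5.2)
D = T.reshape(1, -1)                                # n = 1, N = 5
G = D.T @ D
Theta = kernel_basis(D)

M0 = np.array([0.879, 0.244, 0.853, 0.389, 0.129]) + 0.01   # (5.3) plus (5.4)
M_end = overfly(M0, G, Theta, p, k=0.002, h=1.0, steps=3000)

# Assumption: recover a scalar weight by least squares from logit(M) ~ T * x.
x_end = float(T @ logit(M_end) / (T @ T))
print(x_end, cost(M_end, p))
```

Setting `k=0` in `overfly` switches the dissipation off, which corresponds to the cheaper variant described in Step 3.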
Thus, the neural network can be trained in alternation, with dissipation switched on and off. We also note that the proposed method can easily be adapted to take biases into account by introducing additional bias nodes. Clearly, further research and more numerical evidence are necessary to confirm the benefits of the overfly algorithm. The results of our study suggest a number of new avenues for research and numerical experiments.

Acknowledgments. The study was supported by the PEPS project Sigmapad, Intelligence Artificielle et Apprentissage Automatique.

References

[1] Atakulreka A., Sutivong D., Avoiding Local Minima in Feedforward Neural Networks by Simultaneous Learning, Advances in Artificial Intelligence, Lecture Notes in Computer Science, vol 4830, 2007
[2] Avriel M., Nonlinear Programming: Analysis and Methods, Dover Publishing, 2003
[3] Absil P.-A., Kurdyka K., On the stable equilibrium points of gradient systems, Systems & Control Letters, vol. 55, no. 7, pp. 573–577, 2006
[4] Brierton J. L., Techniques for avoiding local minima in gradient-descent-based ID algorithms, Proc. SPIE 3066, Radar Sensor Technology II, 1997
[5] Burse K., Manoria M., Kirar V.P.S., Improved Back Propagation Algorithm to Avoid Local Minima in Multiplicative Neuron Model, Communications in Computer and Information Science, vol 147
[6] Busvelle E., Kharab R., Maciejewski A. J., Strelcyn J.-M., Numerical integration of differential equations in the presence of first integrals: observer method, Appl. Math., vol. 22, no. 3, pp. 373–418, 1994
[7] Cetin B.C., Burdick J.W., Barhen J., Global Descent Replaces Gradient Descent to Avoid Local Minima Problem in Learning with Artificial Neural Networks, IEEE International Conference on Neural Networks, vol. 2, pp. 836–842, 1993
[8] Chien-Cheng Yu, Bin-Da Liu, A backpropagation algorithm with adaptive learning rate and momentum coefficient, Proceedings of the 2002 International Joint Conference on Neural Networks, IJCNN'02, Honolulu, HI, USA, vol. 2, pp. 1218–1223, 2002
[9] Fukuoka Y., Matsuki H., Minamitani H., Ishida A., A modified back-propagation method to avoid false local minima, Neural Networks: the Official Journal of the International Neural Network Society, vol. 11, no. 6, pp. 1059–1072, 1998
[10] Hirsch M.W., Smale S., Differential equations, dynamical systems, and linear algebra, New York: Academic Press, 1974
[11] Gallant S. I., Perceptron-based learning algorithms, IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 179–191, 1990
[12] Gori M., Tesi A., On the problem of local minima in backpropagation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 1, 1992
[13] Hairer E., Nørsett S.P., Wanner G., Solving Ordinary Differential Equations I: Nonstiff Problems, Springer Series in Computational Mathematics, 2nd ed., 1993
[14] La Salle J., Lefschetz S., Stability by Liapunov's Direct Method: With Applications, New York: Academic Press, 1961
[15] Pavelka A., Procházka A., Algorithms for initialization of neural network weights random numbers in matlab, Proc. Control Eng., vol. 2, pp. 453–459, 2004
[16] Sprinkhuizen-Kuyper I.G., Boers E.J.W., The local minima of the error surface of the 2-2-1 XOR network, Annals of Mathematics and Artificial Intelligence, vol. 25, pp. 107–136, 1999
[17] Sontag E.D., Sussmann H.J., Backpropagation Can Give Rise to Spurious Local Minima Even for Networks without Hidden Layers, Complex Systems, vol. 3, pp. 91–106, 1989
[18] Nawi N.M., Khan A., Rehman M.Z., A New Back-Propagation Neural Network Optimized with Cuckoo Search Algorithm, Lecture Notes in Computer Science, vol 7971, 2013
[19] Wiggins S., Introduction to Applied Nonlinear Dynamical Systems and Chaos, Texts in Applied Mathematics, vol 2, Springer, New York, NY

U.M.P.A, École Normale Supérieure de Lyon, 46, allée d'Italie, F69364 Lyon Cedex 07
E-mail address: atsygvin@umpa.ens-lyon.fr

Figure 1. The graph of the cost function E for the training set (5.1), (5.2)
Figure 2. 4-2-1 neural network, testing performance of overfly and backpropagation for the data set (5.5), (5.6)
Figure 3.
Statistics – arXiv (Cornell University)
Published: Jul 27, 2018