References

M. Fornasier, H. Huang, L. Pareschi, P. Sünnen (2020). Consensus-based optimization on hypersurfaces: well-posedness and mean-field limit. Mathematical Models and Methods in Applied Sciences.
A.-S. Sznitman (1991). Topics in propagation of chaos. In École d'été de probabilités de Saint-Flour XIX—1989.
Y. LeCun, L. Bottou, Y. Bengio, P. Haffner (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86.
P.-E. Jabin, Z. Wang (2017). Mean field limit for stochastic particle systems.
S. Grassi, H. Huang, L. Pareschi, J. Qiu (2021). Mean-field particle swarm optimization. arXiv:2108.00393.
C. Cipriani, H. Huang, J. Qiu (2021). Zero-inertia limit: from particle swarm optimization to consensus-based optimization. SIAM Journal on Mathematical Analysis, 54.
P. D. Miller (2006). Applied Asymptotic Analysis. Graduate Studies in Mathematics, vol. 75.
M. Schmitt, R. Wanka (2015). Particle swarm optimization almost surely finds local optima. Theoretical Computer Science, 561.
(2010). MNIST handwritten digit database.
A. Dembo, O. Zeitouni (1998). Large Deviations Techniques and Applications.
J. A. Carrillo, S. Jin, L. Li, Y. Zhu (2019). A consensus-based global optimization method for high dimensional machine learning problems. ESAIM: Control, Optimisation and Calculus of Variations.
F. van den Bergh, A. P. Engelbrecht (2010). A convergence proof for the particle swarm optimiser. Fundamenta Informaticae, 105.
D. J. Higham (2001). An algorithmic introduction to numerical simulation of stochastic differential equations. SIAM Review, 43.
E. Aarts, J. Korst (1990). Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing.
B. Øksendal (2003). Stochastic Differential Equations: An Introduction with Applications.
Y. Zhang, S. Wang, G. Ji (2015). A comprehensive survey on particle swarm optimization algorithm and its applications. Mathematical Problems in Engineering, 2015.
Z. Ding, S. Chen, Q. Li, S. J. Wright (2021). On the global convergence of gradient descent for multi-layer ResNets in the mean-field regime. arXiv:2110.02926.
S. Jin, L. Li, J.-G. Liu (2020). Random batch methods (RBM) for interacting particle systems. Journal of Computational Physics, 400.
J. H. Holland (1992). Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence.
M. Fornasier, H. Huang, L. Pareschi, P. Sünnen (2020). Consensus-based optimization on the sphere II: convergence to global minimizers and machine learning. Journal of Machine Learning Research, 22.
(2002). Ergodicity of the finite dimensional approximation of the 3D Navier–Stokes equations forced by a degenerate noise. Preprint.
D. Revuz, M. Yor (1990). Continuous Martingales and Brownian Motion. Grundlehren der Mathematischen Wissenschaften, vol. 293.
(1992). Recuit simulé sur R^n: étude de l'évolution de l'énergie libre.
E. Özcan, C. K. Mohan (1998). Analysis of a simple particle swarm optimization system.
S.-W. Lin, K.-C. Ying, S.-C. Chen, Z.-J. Lee (2008). Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Systems with Applications, 35.
H. Robbins, S. Monro (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22.
M. Fornasier, T. Klock, K. Riedl (2021). Convergence of anisotropic consensus-based optimization in mean-field law. arXiv:2111.08136.
S. Grassi, L. Pareschi (2020). From particle swarm optimization to consensus based optimization: stochastic modeling and mean-field limit. arXiv:2012.05613.
(1963). The convergence of the random search method in the extremal control of a many-parameter system.
C. Totzeck, M.-T. Wolfram (2020). Consensus-based global optimization with personal best. Mathematical Biosciences and Engineering, 17(5).
M. Fornasier, T. Klock, K. Riedl (2021). Consensus-based optimization methods converge globally in mean-field law. arXiv:2103.15130.
H. Huang, J.-G. Liu, P. Pickl (2020). On the mean-field limit for the Vlasov–Poisson–Fokker–Planck system. Journal of Statistical Physics, 181.
S. Mei, A. Montanari, P.-M. Nguyen (2018). A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115.
G. Royer (2007). An Initiation to Logarithmic Sobolev Inequalities. Translated by D. Babbitt.
C. Witt (2011). Theory of particle swarm optimization.
J. A. Carrillo, Y.-P. Choi, C. Totzeck, O. Tse (2016). An analytical framework for a consensus-based global optimization method. arXiv preprint.
F. Bolley, J. A. Cañizo, J. A. Carrillo (2010). Stochastic mean-field limit: non-Lipschitz forces and swarming. arXiv preprint.
T. Bäck (1997). Evolutionary computation: toward a new philosophy of machine intelligence. Complexity, 2.
V. Bruned, A. Mas, S. Wlodarczyk (2018). Weak convergence of particle swarm optimization. arXiv preprint.
D. Williams (1976). Stochastic differential equations: theory and applications. Bulletin of the London Mathematical Society, 8.
R. Pinnau, C. Totzeck, O. Tse, S. Martin (2016). A consensus-based model for global optimization and its mean-field limit. arXiv preprint.
J. Kennedy (1997). The particle swarm: social adaptation of knowledge. In Proceedings of the 1997 IEEE International Conference on Evolutionary Computation (ICEC '97).
M. Clerc, J. Kennedy (2002). The particle swarm: explosion, stability, and convergence in a multidimensional complex space. IEEE Transactions on Evolutionary Computation, 6.
E. Platen (1999). An introduction to numerical methods for stochastic differential equations. Acta Numerica, 8.
I. Karatzas, S. Shreve (2019). Stochastic differential equations.
K. Riedl (2022). Leveraging memory effects and gradient information in consensus-based optimization: on global convergence in mean-field law. arXiv:2211.12184.
R. Poli, J. Kennedy, T. Blackwell (2007). Particle swarm optimization: an overview. Swarm Intelligence, 1.
B. K. Panigrahi, Y. Shi, M.-H. Lim (2011). Handbook of Swarm Intelligence: Concepts, Principles and Applications.
Q. Yuan, G. Yin (2015). Analyzing convergence and rates of convergence of particle swarm optimization algorithms using stochastic approximation methods. IEEE Transactions on Automatic Control, 60.
H. Huang (2021). A note on the mean-field limit for the particle swarm optimization. Applied Mathematics Letters, 117.
J. A. Foster (2001). Evolutionary computation. Nature Reviews Genetics, 2(6).
H. Huang, J. Qiu (2021). On the mean-field limit for the consensus-based optimization. Mathematical Methods in the Applied Sciences, 45.
F. van den Bergh, A. P. Engelbrecht (2002). An analysis of particle swarm optimizers.
L. P. Kadanoff (2009). More is the same; phase transitions and mean field theories. Journal of Statistical Physics, 137.
H. J. Kushner, G. G. Yin (2003). Stochastic Approximation and Recursive Algorithms and Applications.
M. Fornasier, H. Huang, L. Pareschi, P. Sünnen (2021). Anisotropic diffusion in consensus-based optimization on the sphere. SIAM Journal on Optimization, 32.
R. Azencott (1992). Simulated Annealing: Parallelization Techniques.
R. Poli (2009). Mean and variance of the sampling distribution of particle swarm optimizers during stagnation. IEEE Transactions on Evolutionary Computation, 13.
W. Tang, X. Y. Zhou (2022). Tail probability estimates of continuous-time simulated annealing processes.
In this paper we provide a rigorous convergence analysis for the renowned particle swarm optimization method by using tools from stochastic calculus and the analysis of partial differential equations. Based on a continuous-time formulation of the particle dynamics as a system of stochastic differential equations, we establish convergence to a global minimizer of a possibly nonconvex and nonsmooth objective function in two steps. First, we prove consensus formation of an associated mean-field dynamics by analyzing the time-evolution of the variance of the particle distribution, which acts as Lyapunov function of the dynamics. We then show that this consensus is close to a global minimizer by employing the asymptotic Laplace principle and a tractability condition on the energy landscape of the objective function. These results allow for the usage of memory mechanisms, and hold for a rich class of objectives provided certain conditions of well-preparation of the hyperparameters and the initial datum. In a second step, at least for the case without memory effects, we provide a quantitative result about the mean-field approximation of particle swarm optimization, which specifies the convergence of the interacting particle system to the associated mean-field limit. Combining these two results allows for global convergence guarantees of the numerical particle swarm optimization method with provable polynomial complexity.
To demonstrate the applicability of the method we propose an efficient and parallelizable implementation, which is tested in particular on a competitive and well-understood high-dimensional benchmark problem in machine learning.

Konstantin Riedl (konstantin.riedl@ma.tum.de), Department of Mathematics, School of Computation, Information and Technology, Technical University of Munich, Munich, Germany, and Munich Center for Machine Learning, Munich, Germany; Hui Huang (hui.huang@uni-graz.at), Institute of Mathematics and Scientific Computing, University of Graz, Graz, Austria; Jinniao Qiu (jinniao.qiu@ucalgary.ca), Department of Mathematics and Statistics, University of Calgary, Calgary, Canada.

Applied Mathematics & Optimization (2023) 88:30

Keywords: Global derivative-free optimization · High-dimensional nonconvex optimization · Metaheuristics · Particle swarm optimization · Mean-field limit · Vlasov–Fokker–Planck equations

Mathematics Subject Classification: 65C35 · 65K10 · 90C26 · 90C56 · 35Q90 · 35Q83

1 Introduction

In nature, collective behavior and self-organization allow complicated global patterns to emerge from simple interaction rules and random fluctuations. Inspired by the fascinating capabilities of swarm intelligence, large multi-agent systems are employed as a tool for solving challenging problems in applied mathematics. One classical task arising throughout science is concerned with the global optimization of a problem-dependent, possibly nonconvex and nonsmooth objective function E : R^d → R, i.e., the search for a global minimizer

$$x^* \in \operatorname*{arg\,min}_{x\in\mathbb{R}^d} \mathcal{E}(x). \tag{1.1}$$

A popular class of methods with a long history of achieving state-of-the-art performance on such problems are metaheuristics [14].
They orchestrate an interplay between local and global improvement procedures, consider memory mechanisms and selection strategies, and combine random and deterministic decisions, to create a process capable of escaping local optima and performing a robust search of the solution space in order to find a global optimizer. Initiated by seminal works on stochastic approximation [49] and random search [46], a big variety of such mechanisms has been introduced, analyzed and applied to numerous real-world problems. A non-exclusive list of representatives includes evolutionary programming [15], genetic algorithms [24], simulated annealing [1], and particle swarm optimization [31]. Despite their tremendous empirical success, it is very difficult to provide a theoretical convergence analysis to global minimizers, mostly due to their stochastic nature and the appearance of memory effects. Simulated annealing, however, is actually theoretically well-studied, see, e.g., the works [3, 37] as well as the recent survey [53] and references therein.

In this paper we study particle swarm optimization (PSO), which was initially introduced by Kennedy and Eberhart in the 90s [30, 31] and is now widely recognized as an efficient method for tackling complex optimization problems [35, 45]. Originally, PSO solves (1.1) by considering a group of finitely many particles, which explore the energy landscape of E. Each agent experiences a force towards its own personal (historical) best position as well as towards the global best position communicated in the swarm. We refer to the ability of each particle to remember the best position it has occupied in the past as a memory mechanism. Although these interaction rules are seemingly simple, a complete numerical analysis of PSO is still lacking; see, e.g., [41, 55, 57] and references therein.
Recently, however, by introducing a continuous description of PSO based on a system of stochastic differential equations (SDEs), the authors of [22] have paved the way for a rigorous mathematical analysis using tools from stochastic calculus and the analysis of partial differential equations (PDEs). In order to explore the domain and to form a global consensus about the minimizer x* as time passes, the formulation of PSO proposed by the authors of [22] uses N particles, described by triplets ((X_t^i, Y_t^i, V_t^i))_{t≥0}, i = 1, …, N, with X_t^i and V_t^i denoting the position and velocity, and Y_t^i being a regularized version of the local (historical) best position, also referred to as memory, of the i-th agent at time t. The particles, formally stochastic processes, are initialized independently according to some common distribution f_0 ∈ P(R^{3d}). In its most general form the PSO dynamics is given by the system of SDEs, expressed in Itô form as

$$dX_t^i = V_t^i\,dt, \tag{1.2a}$$
$$dY_t^i = \kappa\big(X_t^i - Y_t^i\big)\,S^{\beta,\theta}\big(X_t^i, Y_t^i\big)\,dt, \tag{1.2b}$$
$$m\,dV_t^i = -\gamma V_t^i\,dt + \lambda_1\big(Y_t^i - X_t^i\big)\,dt + \lambda_2\big(y_\alpha(\rho_{Y,t}^N) - X_t^i\big)\,dt + \sigma_1 D\big(Y_t^i - X_t^i\big)\,dB_t^{1,i} + \sigma_2 D\big(y_\alpha(\rho_{Y,t}^N) - X_t^i\big)\,dB_t^{2,i}, \tag{1.2c}$$

where α, β, θ, κ, γ, m, λ_1, λ_2, σ_1, σ_2 ≥ 0 are user-specified parameters. The change of the velocity in (1.2c) is subject to five forces. The first term on the right-hand side models friction with a coefficient commonly chosen as γ = 1 − m ≥ 0, where m > 0 denotes the inertia weight. The subsequent term can be regarded as the drift towards the local best position of the i-th particle, which it has memorized in the state variable Y_t^i. A continuous-time approximation of its evolution is given by Equation (1.2b). It involves the operator S^{β,θ}, given by

$$S^{\beta,\theta}(x, y) = 1 + \theta + \tanh\big(\beta\,(\mathcal{E}(y) - \mathcal{E}(x))\big)$$

for 0 ≤ θ ≪ 1 and β ≫ 1, which converges to the Heaviside function as θ → 0 and β → ∞.
The concept behind Equation (1.2b) becomes apparent when it is discretized, see Remark 1. For an alternative implementation of the local best position we refer to [54].

Remark 1 A time-discretization of (1.2b) with κ = 1/(2Δt), θ = 0 and β = ∞ yields the update rule

$$Y_{(k+1)\Delta t}^i = \begin{cases} Y_{k\Delta t}^i, & \text{if } \mathcal{E}\big(X_{(k+1)\Delta t}^i\big) \ge \mathcal{E}\big(Y_{k\Delta t}^i\big),\\ X_{(k+1)\Delta t}^i, & \text{if } \mathcal{E}\big(X_{(k+1)\Delta t}^i\big) < \mathcal{E}\big(Y_{k\Delta t}^i\big), \end{cases} \tag{1.3}$$

meaning that the i-th particle stores in Y^i the best position which it has seen up to the k-th iteration. This explains the name local (historical) best position and restores the original definition from the work [31].

The last deterministic term imposes a drift towards the momentaneous consensus point y_α(ρ_{Y,t}^N), given by

$$y_\alpha\big(\rho_{Y,t}^N\big) := \int_{\mathbb{R}^d} y\,\frac{\omega_\alpha(y)}{\|\omega_\alpha\|_{L^1(\rho_{Y,t}^N)}}\,d\rho_{Y,t}^N(y), \quad\text{with}\quad \omega_\alpha(y) := \exp(-\alpha\mathcal{E}(y)), \tag{1.4}$$

where ρ_{Y,t}^N denotes the empirical measure ρ_{Y,t}^N := (1/N) Σ_{i=1}^N δ_{Y_t^i} of the particles' local best positions. The choice of the weight ω_α in (1.4) comes from the well-known Laplace principle [12, 38], a classical asymptotic argument for integrals stating that for any probability measure ϱ ∈ P(R^d) it holds

$$\lim_{\alpha\to\infty}\left(-\frac{1}{\alpha}\log\int_{\mathbb{R}^d}\omega_\alpha(y)\,d\varrho(y)\right) = \inf_{y\,\in\,\operatorname{supp}(\varrho)}\mathcal{E}(y). \tag{1.5}$$

Based thereon, y_α(ρ_{Y,t}^N) is expected to be a rough estimate for a global minimizer x*, which improves as α → ∞ and as larger regions of the domain are explored. To feature the latter, the two remaining terms in (1.2c), each associated with a drift term, are diffusion terms injecting randomness into the dynamics through independent standard Brownian motions (B_t^{1,i})_{t≥0} and (B_t^{2,i})_{t≥0}, i = 1, …, N.
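To build intuition for (1.4) and (1.5), the following minimal sketch (our own illustration, not part of the paper; the toy objective and all parameter values are arbitrary choices) computes the consensus point from a sample of positions and numerically observes the Laplace principle. Subtracting the minimal sampled energy before exponentiating is the usual log-sum-exp stabilization and leaves the weighted average unchanged, since the shift cancels in the ratio.

```python
import numpy as np

def consensus_point(Y, alpha, objective):
    """Weighted average of positions Y (shape (N, d)) with weights
    omega_alpha(y) = exp(-alpha * E(y)), cf. (1.4); shifting by the
    minimal energy cancels in the ratio and avoids underflow."""
    energies = objective(Y)
    weights = np.exp(-alpha * (energies - energies.min()))
    return weights @ Y / weights.sum()

rng = np.random.default_rng(0)
objective = lambda Y: np.sum(Y**2, axis=-1)   # toy objective with minimum 0 at the origin
Y = rng.uniform(-3, 3, size=(5000, 2))

# Laplace principle (1.5): -1/alpha * log(mean of omega_alpha) approaches the
# smallest energy on the support of the empirical measure as alpha grows
for alpha in (1.0, 10.0, 100.0):
    laplace_value = -np.log(np.mean(np.exp(-alpha * objective(Y)))) / alpha
    print(alpha, laplace_value, consensus_point(Y, alpha, objective))
```

For large α the consensus point concentrates near the best currently sampled position, in line with the interpretation of y_α as a rough estimate of x*.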
The two commonly studied diffusion types for similar methods are isotropic [8, 18, 42] and anisotropic [9, 19] diffusion with

$$D(y - x) = \begin{cases} |y - x|\,\mathrm{Id}, & \text{for isotropic diffusion},\\ \operatorname{diag}(y - x), & \text{for anisotropic diffusion}, \end{cases} \tag{1.6}$$

where Id ∈ R^{d×d} is the identity matrix and diag : R^d → R^{d×d} is the operator mapping a vector onto a diagonal matrix with the vector as its diagonal. Intuitively, this scaling encourages agents far from their own local best position or from the globally computed consensus point to explore larger regions, whereas agents that are already close only enhance their position locally. As the coordinate-dependent scaling of anisotropic diffusion has been proven to be highly beneficial for high-dimensional problems [9, 17], in what follows we limit our analysis to this case. An illustration of the formerly described PSO dynamics (1.2) is given in Fig. 1.

A theoretical convergence analysis of PSO is possible either on the microscopic level (1.2) or by analyzing the macroscopic behavior of the particle density through a mean-field limit, which usually admits more powerful analysis tools. In the large-particle limit an individual particle is no longer influenced by individual particles but only by the average behavior of all particles. As shown in [21, Section 3.2], the empirical particle measure f_t^N := (1/N) Σ_{i=1}^N δ_{(X_t^i, Y_t^i, V_t^i)} converges in law to the deterministic agent distribution f ∈ C([0, T], P(R^{3d})), which weakly satisfies a nonlinear Vlasov–Fokker–Planck equation.

Fig. 1 An illustration of the PSO dynamics. Agents with positions X^1, …, X^N (yellow dots with their trajectories) explore the energy landscape of E in search of the global minimizer x* (green star). The dynamics of each particle is governed by five terms. A local drift term (light blue arrow) imposes a force towards its local best position Y^i (indicated by a circle).
A global drift term (dark blue arrow) drags the agent towards the momentaneous consensus point y_α(ρ_{Y,t}^N) (orange circle), computed as a weighted average (weights visualized through color opacity) of the particles' local best positions. Friction (purple arrow) counteracts inertia. The two remaining terms are diffusion terms (light and dark green arrows), each associated with a respective drift term.

The limiting law f_t weakly satisfies the nonlinear Vlasov–Fokker–Planck equation

$$\partial_t f_t + v\cdot\nabla_x f_t + \nabla_y\cdot\big(\kappa(x - y)\,S^{\beta,\theta}(x, y)\,f_t\big) = \nabla_v\cdot\Big(\frac{\gamma}{m}\,v f_t + \frac{\lambda_1}{m}(x - y)\,f_t + \frac{\lambda_2}{m}\big(x - y_\alpha(\rho_{Y,t})\big)\,f_t + \Big(\frac{\sigma_1^2}{2m^2}\,D(x - y)^2 + \frac{\sigma_2^2}{2m^2}\,D\big(x - y_\alpha(\rho_{Y,t})\big)^2\Big)\nabla_v f_t\Big) \tag{1.7}$$

with initial datum f_0. The mean-field limit results [6, 25–27, 52] ensure that the particle system (1.2) is well-approximated by the following self-consistent mean-field McKean process

$$d\overline{X}_t = \overline{V}_t\,dt, \tag{1.8a}$$
$$d\overline{Y}_t = \kappa\big(\overline{X}_t - \overline{Y}_t\big)\,S^{\beta,\theta}\big(\overline{X}_t, \overline{Y}_t\big)\,dt, \tag{1.8b}$$
$$m\,d\overline{V}_t = -\gamma\overline{V}_t\,dt + \lambda_1\big(\overline{Y}_t - \overline{X}_t\big)\,dt + \lambda_2\big(y_\alpha(\rho_{Y,t}) - \overline{X}_t\big)\,dt + \sigma_1 D\big(\overline{Y}_t - \overline{X}_t\big)\,dB_t^1 + \sigma_2 D\big(y_\alpha(\rho_{Y,t}) - \overline{X}_t\big)\,dB_t^2, \tag{1.8c}$$

with initial datum (X̄_0, Ȳ_0, V̄_0) ∼ f_0 and the marginal law ρ_{Y,t} of Ȳ_t given by

$$\rho_Y(t,\,\cdot\,) = \iint_{\mathbb{R}^d\times\mathbb{R}^d} df_t(x,\,\cdot\,,v).$$

Here, f_t denotes the distribution of (X̄_t, Ȳ_t, V̄_t). This makes (1.7) and (1.8) nonlinear.

1.1 Contribution

In view of the versatility, efficiency, and wide applicability of PSO combined with its long historical tradition, a mathematical analysis of the finite particle system (1.2) is of considerable interest. In this work we advance the theoretical understanding of the method and contribute to the completion of a full numerical analysis of PSO by rigorously proving the convergence of PSO with memory effects to global minimizers using mean-field techniques. More precisely, under mild regularity assumptions on the objective E and a well-preparation condition about the initialization f_0, we analyze the behavior of the particle distribution f solving the mean-field dynamics (1.8).
At first, it is shown that concentration is achieved at some x̃ in the sense that the marginal law with respect to the local best position, ρ_{Y,t}, converges narrowly to a Dirac delta δ_x̃ as t → ∞. Consecutively, we argue that, for an appropriate choice of the parameters, in particular α ≫ 1, which may depend on the dimension d, E(x̃) can be made arbitrarily close to the minimal value E̲ := inf_{x∈R^d} E(x). A suitable tractability condition on the objective E eventually ensures that x̃ is close to a global minimizer. Similar mean-field convergence results are obtained for the case without memory effects. In this setting we are moreover able to establish the convergence of the interacting N-particle dynamics to its mean-field limit with a dimension-independent rate, which allows us to obtain a so far unique holistic and quantitative convergence statement for PSO. As the mean-field approximation result does not suffer from the curse of dimensionality, we in particular prove that the numerical PSO method has polynomial complexity. With these new results we solve the theoretical open problem about the convergence of PSO posed in [22]. Furthermore, we propose an efficient and parallelizable implementation, which is particularly suited for machine learning problems by integrating modern machine learning techniques such as random mini-batch ideas as well as traditional metaheuristic-inspired techniques from genetic programming and simulated annealing.

1.2 Prior Arts

The convergence of PSO algorithms has been investigated by many scholars since its introduction, which has led to several variations allowing to establish desirable properties such as consensus formation or convergence to optimal solutions. While the matter of consensus is well-studied, see, e.g., [11, 40] or more recently [56], where the authors employ stochastic approximation methods [32], only few general theoretical statements regarding the properties of the found consensus are available.
Both the existence of a large number of variations of the algorithm and the lack of a rigorous global convergence analysis are attributed, amongst other things such as the stochasticity and the usage of memory mechanisms, to the phenomenon of premature convergence of basic PSO [31], which was observed in [4, 5] and remedied by proposing a modified version, called guaranteed convergence PSO. Nevertheless, this adaptation only allows to prove convergence to local optima. In order to obtain therefrom a stochastic global search algorithm, the authors suggest to add purely stochastic particles to the swarm, which trivially makes the method capable of detecting a global optimizer, but entails a computational time which coincides with the time required to examine every location in the search space. Other works consider certain notions of weak convergence [7] or provide probabilistic guarantees of finding locally optimal solutions, meaning that eventually all particles are located almost surely at a local optimum of the objective function [51]. In [44], similarly to our work, the expected behavior of the particles is investigated. All of the formerly mentioned results, though, are obtained through the analysis of the particles' trajectories generated by a time-discretized algorithm as in [21, Equation (6.3)]. The present paper takes a different point of view by studying the continuous-time description of the PSO model (1.2) through the lens of the mean-field approximation (1.7). Analyzing the macroscopic behavior of a system through a mean-field limit instead of investigating the microscopic particle dynamics has its origins in statistical mechanics [29], where interactions between particles are approximated by an averaged influence.
By eliminating the correlation between the particles, a many-body problem can be reduced to a one-body problem, which is usually much easier to solve while still giving an understanding of the mechanisms at play by describing the average behavior of the particles. These ideas, for instance, are also used to study the collective behavior of animals when forming large-scale patterns through self-organization by analyzing an associated kinetic PDE [6]. In very recent works, this perspective of analysis has also been taken to demystify the training process of neural networks, see, e.g., [13, 36], where a mean-field approximation is utilized to formulate risk minimization by stochastic gradient descent (SGD) in terms of a gradient-flow PDE, which allows for a rigorous mathematical analysis.

The analysis technique we use follows this line of work on self-organization. It is inspired by [8, 9], where a variance-based analysis approach has been developed for consensus-based optimization (CBO), which follows the guiding principles of metaheuristics and in particular resembles PSO but is of much simpler nature and therefore easier to analyze. In comparison to Equation (1.2), CBO methods are described by a system of first-order SDEs [8, Equation (1.1)] and do not contain memory mechanisms, which are responsible for both a significantly more challenging mathematical modeling and convergence analysis.

1.3 Organization

Sections 2 and 3 are dedicated to the analysis of PSO without and with memory mechanisms, respectively. After providing details about the well-posedness of the mean-field dynamics, we present and discuss the main result about the convergence of the mean-field dynamics to a global minimizer of the objective function. In Sect. 4 we then state a quantitative result about the mean-field approximation for PSO without memory effects, which enables us to obtain a holistic convergence statement of the numerical PSO method.
Eventually, a computationally efficient implementation of PSO is proposed in Sect. 5, before Sect. 6 concludes the paper. For the sake of reproducible research, in the GitHub repository https://github.com/KonstantinRiedl/PSOAnalysis we provide the Matlab code implementing the PSO algorithm analyzed in this work.

2 Mean-Field Analysis of PSO Without Memory Effects

Before providing a theoretical analysis of the mean-field PSO dynamics (1.7) and (1.8), in this section we investigate a reduced version, which does not involve memory mechanisms. Its multi-particle formulation was proposed in [22, Section 3.1] and reads

$$dX_t^i = V_t^i\,dt, \tag{2.1a}$$
$$m\,dV_t^i = -\gamma V_t^i\,dt + \lambda\big(x_\alpha(\rho_{X,t}^N) - X_t^i\big)\,dt + \sigma D\big(x_\alpha(\rho_{X,t}^N) - X_t^i\big)\,dB_t^i. \tag{2.1b}$$

Compared to the full model, each particle is characterized only by its position X_t^i and velocity V_t^i. The forces acting on a particle, i.e., influencing its velocity in Equation (2.1b), are friction, acceleration through the consensus drift, and diffusion as in (1.6) with independent standard Brownian motions (B_t^i)_{t≥0}, i = 1, …, N. The consensus point x_α(ρ_{X,t}^N) is directly computed from the current positions of the particles according to

$$x_\alpha\big(\rho_{X,t}^N\big) := \int_{\mathbb{R}^d} x\,\frac{\omega_\alpha(x)}{\|\omega_\alpha\|_{L^1(\rho_{X,t}^N)}}\,d\rho_{X,t}^N(x), \tag{2.2}$$

where ρ_{X,t}^N denotes the empirical measure ρ_{X,t}^N := (1/N) Σ_{i=1}^N δ_{X_t^i} of the particles' positions. Independent and identically distributed initial data (X_0^i, V_0^i) ∼ f_0, i = 1, …, N, with f_0 ∈ P(R^{2d}) complement (2.1). Similar to the particle system (1.2), as N → ∞, the mean-field dynamics of (2.1) is described by the nonlinear self-consistent McKean process

$$d\overline{X}_t = \overline{V}_t\,dt, \tag{2.3a}$$
$$m\,d\overline{V}_t = -\gamma\overline{V}_t\,dt + \lambda\big(x_\alpha(\rho_{X,t}) - \overline{X}_t\big)\,dt + \sigma D\big(x_\alpha(\rho_{X,t}) - \overline{X}_t\big)\,dB_t, \tag{2.3b}$$

with initial datum (X̄_0, V̄_0) ∼ f_0 and the marginal law ρ_{X,t} of X̄_t given by ρ_X(t, ·) = ∫_{R^d} df_t(·, v).
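For illustration, the memory-free system (2.1) can be discretized with a simple Euler–Maruyama scheme (cf. [23]). The following sketch is our own, with illustrative parameter values and a toy quadratic objective rather than the tuned settings used later in the paper; it uses the anisotropic diffusion from (1.6), for which D(z) dB amounts to an elementwise product.

```python
import numpy as np

def pso_without_memory(objective, d=2, N=100, m=0.2, gamma=0.8, lam=1.0,
                       sigma=0.7, alpha=50.0, dt=0.01, steps=2000, seed=1):
    """Euler-Maruyama discretization of the memory-free PSO dynamics (2.1)
    with anisotropic diffusion; returns the final consensus point."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-3, 3, size=(N, d))      # positions X^i
    V = np.zeros((N, d))                     # velocities V^i
    for _ in range(steps):
        energies = objective(X)
        weights = np.exp(-alpha * (energies - energies.min()))
        x_alpha = weights @ X / weights.sum()            # consensus point (2.2)
        dB = rng.normal(0.0, np.sqrt(dt), size=(N, d))   # Brownian increments
        # m dV = -gamma V dt + lam (x_alpha - X) dt + sigma D(x_alpha - X) dB
        V += (-gamma * V * dt + lam * (x_alpha - X) * dt
              + sigma * (x_alpha - X) * dB) / m
        X += V * dt
    return x_alpha

objective = lambda X: np.sum(X**2, axis=-1)   # toy objective, global minimizer at 0
print(pso_without_memory(objective))
```

With these (hypothetical) parameter choices the swarm concentrates near the global minimizer of the toy objective; in practice the hyperparameters must satisfy the well-preparation conditions discussed below.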
A direct application of the Itô–Doeblin formula shows that the law f ∈ C([0, T], P(R^{2d})) is a weak solution to the nonlinear Vlasov–Fokker–Planck equation

$$\partial_t f_t + v\cdot\nabla_x f_t = \nabla_v\cdot\Big(\frac{\gamma}{m}\,v f_t + \frac{\lambda}{m}\big(x - x_\alpha(\rho_{X,t})\big)\,f_t + \frac{\sigma^2}{2m^2}\,D\big(x - x_\alpha(\rho_{X,t})\big)^2\,\nabla_v f_t\Big) \tag{2.4}$$

with initial datum f_0.

Remark 2 A separate theoretical analysis of the dynamics (2.1) is necessary as it cannot be derived from (1.2) in a way that also allows the proof technique to be adopted in a straightforward manner. This can be seen from subtle differences in the proofs of Theorems 2 and 4; see in particular Lemma 3.

It is also worth noting that Equation (2.1) bears a certain resemblance to CBO [8, 9, 18, 19, 42], whereas (1.8) resembles [48]. Indeed, as made rigorous in [10], CBO methods can be derived from PSO in the small inertia limit m → 0, or equivalently γ → 1. Nevertheless, analyzing the convergence of CBO directly permits sharper bounds when compared to utilizing the results obtained in our work together with [10, Theorem 2.4].

Before turning towards the well-posedness of the mean-field dynamics (2.3) and presenting the main result of this section about the convergence to the global minimizer x*, let us introduce the class of objective functions E considered in the theoretical part of this work. We remark that the assumptions made in what follows coincide with the ones of [8, 9] as well as several subsequent works in this direction.
Assumption 1 Throughout the paper we are interested in objective functions E : R^d → R, for which

A1 there exists x* ∈ R^d such that E(x*) = inf_{x∈R^d} E(x) =: E̲,
A2 there exists some constant L > 0 such that |E(x) − E(x′)| ≤ L (|x| + |x′|) |x − x′| for all x, x′ ∈ R^d,
A3 either E̅ := sup_{x∈R^d} E(x) < ∞, or there exist constants c_E, R > 0 such that E(x) − E̲ ≥ c_E |x|² for all x ∈ R^d with |x| ≥ R,
A4 E ∈ C²(R^d) with ‖∇²E‖_∞ ≤ C_E for some constant C_E > 0,
A5 there exist η > 0 and ν ∈ (0, ∞) such that for any x ∈ R^d there exists a global minimizer x* of E (which may depend on x) such that |x − x*| ≤ (E(x) − E̲)^ν / η.

Assumption A1 just states that the objective function E attains its infimum E̲ at some x* ∈ R^d, which may not necessarily be unique. Assumption A2 describes the local Lipschitz-continuity of E, entailing in particular that the objective has at most quadratic growth at infinity. Assumption A3, on the other hand, requires E to be either bounded or of at least quadratic growth in the farfield. Together, A2 and A3 allow to obtain the well-posedness of the PSO model. Assumption A4 is a regularity assumption about E, which is required only for the theoretical analysis. The quadratic growth nature of Assumptions A2–A4 in the farfield may bear a certain resemblance to log-Sobolev inequalities [50], which are pivotal in the convergence analysis of simulated annealing, see [53] for further details. Unlike simulated annealing, however, the PSO method is a zero-order method, for which we do not need gradient information about the objective function in the numerical application. Assumption A5 should be interpreted as a tractability condition on the landscape of E, which ensures that achieving an objective value of approximately E̲ guarantees closeness to a global minimizer x* and thus eliminates cases of almost-optimal valleys in the energy landscape far away from any globally minimizing argument.
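For intuition on A5: the quadratic objective E(x) = |x|², with unique global minimizer x* = 0 and E̲ = 0, satisfies the inverse continuity property with ν = 1/2 and η = 1, since |x − x*| = (E(x) − E̲)^{1/2} exactly. A quick numerical check of this toy case (our own example, not from the paper):

```python
import numpy as np

# A5 for E(x) = |x|^2: |x - x*| <= (E(x) - E_min)^nu / eta with x* = 0, nu = 1/2, eta = 1
rng = np.random.default_rng(0)
for x in rng.uniform(-10, 10, size=(100, 3)):
    energy_gap = np.sum(x**2) - 0.0                      # E(x) - E_min
    assert np.linalg.norm(x) <= energy_gap**0.5 + 1e-12  # holds with equality here
print("A5 holds for the quadratic objective with nu = 1/2, eta = 1")
```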
Such an assumption is therefore also referred to as an inverse continuity property. It shall be emphasized that objectives with multiple global minima of identical quality are not excluded.

2.1 Well-Posedness of PSO without Memory Effects

Let us recite a well-posedness result about the mean-field PSO dynamics (2.3) and the associated Vlasov–Fokker–Planck equation (2.4). Its proof is analogous to the one provided for Theorem 3 for the full dynamics (1.8), which is based on the Leray–Schauder fixed point theorem.

Theorem 1 Let E satisfy Assumptions A1–A3. Moreover, let m, γ, λ, σ, α, T > 0. If (X̄₀, V̄₀) is distributed according to f₀ ∈ P₄(R^{2d}), then the nonlinear SDE (2.3) admits a unique strong solution up to time T with the paths of the process (X̄, V̄) valued in C([0,T], R^d) × C([0,T], R^d). The associated law f has regularity C([0,T], P₄(R^{2d})) and is a weak solution to the Vlasov–Fokker–Planck equation (2.4). In particular,

sup_{t∈[0,T]} E[|X̄_t|⁴ + |V̄_t|⁴] ≤ (1 + 2E[|X̄₀|⁴ + |V̄₀|⁴]) e^{CT}   (2.5)

for some constant C > 0 depending only on m, γ, λ, σ, α, c_E, R, and L.

2.2 Convergence of PSO without Memory Effects to a Global Minimizer

A successful application of the PSO dynamics rests on the premise that the particles form consensus about a certain position x̃. In particular, in the mean-field limit one expects that the distribution of a particle's position ρ_{X,t} converges to a Dirac delta δ_{x̃}. This entails that the variance in the position E[|X̄_t − E[X̄_t]|²] and the second-order moment of the velocity E[|V̄_t|²] of the averaged particle vanish. As we show in what follows, both functionals indeed decay exponentially fast in time. Motivated by these expectations, we define the functional

H(t) := (γ/(2m))² |X̄_t − E[X̄_t]|² + |V̄_t|² + (γ/(2m)) ⟨X̄_t − E[X̄_t], V̄_t⟩,   (2.6)

which we analyze in the remainder of this section. Its last term is required for technical reasons.
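To make the consensus mechanism concrete, the following is a minimal Euler–Maruyama sketch of the interacting N-particle analogue of the mean-field dynamics (2.3) (cf. system (2.1)). It is an illustration, not the paper's algorithm: all parameter values and the toy objective are hypothetical, and we assume D(·) denotes the diagonal matrix of its argument, so the noise acts componentwise.

```python
import numpy as np

rng = np.random.default_rng(0)

def consensus_point(X, E_vals, alpha):
    # x_alpha(rho^N) = sum_i X_i exp(-alpha E(X_i)) / sum_i exp(-alpha E(X_i)),
    # computed with a shifted exponential for numerical stability
    w = np.exp(-alpha * (E_vals - E_vals.min()))
    return (w[:, None] * X).sum(axis=0) / w.sum()

def run_pso(E, d=2, N=200, m=0.1, gamma=1.0, lam=1.0, sigma=0.1,
            alpha=50.0, dt=0.01, steps=2000):
    X = rng.normal(1.0, 1.0, size=(N, d))   # positions
    V = np.zeros((N, d))                    # velocities
    for _ in range(steps):
        xa = consensus_point(X, E(X), alpha)
        dB = rng.normal(0.0, np.sqrt(dt), size=(N, d))
        # m dV = -gamma V dt + lambda (x_alpha - X) dt + sigma D(x_alpha - X) dB
        V += (-(gamma / m) * V + (lam / m) * (xa - X)) * dt \
             + (sigma / m) * (xa - X) * dB  # componentwise (anisotropic) diffusion
        X += V * dt
    return X

E = lambda X: (X**2).sum(axis=1)            # toy objective, minimizer at the origin
X_final = run_pso(E)
print(X_final.mean(axis=0), X_final.var(axis=0).sum())
```

In line with the discussion above, the empirical position variance collapses as the particles reach consensus near the weighted mean x_α, while the velocities are damped to zero; this is precisely the behavior the functional H(t) defined next is designed to quantify.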
However, by proving the decay of E[H(t)], which acts as a Lyapunov function of the dynamics, one immediately obtains the same for E[|X̄_t − E[X̄_t]|² + |V̄_t|²] as a consequence of the equivalence established in Lemma 1, which follows from Young's inequality.

Lemma 1 The functional H(t) is equivalent to |X̄_t − E[X̄_t]|² + |V̄_t|² in the sense that

(1/2)(γ/(2m))² |X̄_t − E[X̄_t]|² + (1/2)|V̄_t|² ≤ H(t) ≤ (3/2)((γ/(2m))² + 1)(|X̄_t − E[X̄_t]|² + |V̄_t|²).   (2.7)

We now derive an evolution inequality for the Lyapunov function E[H(t)].

Lemma 2 Let E satisfy Assumptions A1–A3 and let (X̄_t, V̄_t)_{t≥0} be a solution to the nonlinear SDE (2.3). Then E[H(t)] with H as defined in (2.6) satisfies

d/dt E[H(t)] ≤ −(γ/m) E[|V̄_t|²] − ( λγ/(2m²) − (2λ²/(γm) + σ²/m²) · 2e^{−αE̲}/E[exp(−αE(X̄_t))] ) E[|X̄_t − E[X̄_t]|²].   (2.8)

Proof Let us write δX̄_t := X̄_t − E[X̄_t] for short and note that the integration by parts formula gives

d/dt E[|δX̄_t|²] = 2E[⟨δX̄_t, V̄_t⟩].   (2.9)

Observe that, in what follows, the appearing stochastic integrals have vanishing expectations as a consequence of the regularity f ∈ C([0,T], P₄(R^{2d})) obtained in Theorem 1. This is due to [39, Theorem 3.2.1(iii), Definition 3.1.4(iii)], which states that a stochastic integral vanishes in expectation if the associated second moment is integrable; the latter condition is in fact sufficient for the stochastic integral to be a martingale. Applying the Itô–Doeblin formula and Young's inequality yields

d/dt E[|V̄_t|²] = −(2γ/m) E[|V̄_t|²] + (2λ/m) E[⟨V̄_t, x_α(ρ_{X,t}) − X̄_t⟩] + (σ²/m²) E[|x_α(ρ_{X,t}) − X̄_t|²]
≤ −(2γ/m − λ/(εm)) E[|V̄_t|²] + (ελ/m + σ²/m²) E[|x_α(ρ_{X,t}) − X̄_t|²], ∀ ε > 0.   (2.10)
t α X ,t t m εm m m Again by employing the Itô-Doeblin formula we obtain d γ λ E[ δ X , V ]= E[|V | ]− E[V ] − E[ δ X , V ]+ E[ δ X , x (ρ ) − X ] t t t t t t t α X ,t t dt m m γ d λ λ 2 2 2 ≤ E[|V | ]− E[|δ X | ]+ E[ δ X , x (ρ ) − E[ X ] ]− E[|δ X | ] (2.11) t t t α X ,t t t 2m dt m m γ d λ 2 2 2 = E[|V | ]− E[|δ X | ]− E[|δ X | ], t t t 2m dt m 123 30 Page 12 of 44 Applied Mathematics & Optimization (2023) 88:30 where we used the identity (2.9) and the fact that E[ δ X , x (ρ ) − E[ X ] ]= 0in t α X ,t t the last two steps. We now rearrange inequality (2.11) to get γ d d λ 2 2 2 E[|δ X | ]+ E[ δ X , V ]≤ E[|V | ]− E[|δ X | ], t t t t t 2m dt dt m which, in combination with (2.10), allows to show d 3γ λ λγ 2 2 E[H(t )]≤− − E[|V | ]− E[|δ X | ] t t dt 2m εm 2m ελ σ + + E[| X − x (ρ )| ]. (2.12) t α X ,t m m In order to upper bound E[| X − x (ρ )| ], an application of Jensen’s inequality t α X ,t yields x − x ω (x ) dρ (x )dρ (x ) 2d X ,t X ,t 2 R E[| X − x (ρ )| ]≤ t α X ,t ω (x ) dρ (x ) d X ,t E[|δ X | ] −αE ≤ 2e . (2.13) E[exp(−αE ( X ))] By choosing ε = (2λ)/γ in (2.12) and utilizing the estimate (2.13), we obtain (2.8) as desired. Remark 3 To obtain exponential decay of E[H(t )] it is necessary to ensure the negativ- ity of the prefactor of E[| X − E[ X ]| ] in Inequality (2.8) by choosing the parameters t t of the PSO method in a suitable manner. This may be achieved by choosing for any fixed time t,given α and arbitrary σ, γ > 0, X 2 2 X λ> 4 D σ /γ and subsequently m <γ /(8 D λ), (2.14) t t X −αE where we abbreviate D = 2e /E[exp(−αE ( X ))]. In order to be able to choose the parameters in Remark 3 once at the beginning of the algorithm instead of at every time step t, we need to be able to control the time- evolution of E[exp(−αE ( X ))]. We therefore study its time-derivative in the following lemma. Lemma 3 Let E satisfy Assumptions A1–A4 and let ( X , V ) be the solution to the t t t ≥0 nonlinear SDE (2.3). 
Then it holds that

d²/dt² (E[exp(−αE(X̄_t))])² ≥ −(γ/m) d/dt (E[exp(−αE(X̄_t))])² − 4αe^{−2αE̲} C_E (1 + 8λm/γ²) E[H(t)].   (2.15)

Proof We first note that

(1/2) d²/dt² (E[exp(−αE(X̄_t))])² = (d/dt E[exp(−αE(X̄_t))])² + E[exp(−αE(X̄_t))] d²/dt² E[exp(−αE(X̄_t))] ≥ E[exp(−αE(X̄_t))] d²/dt² E[exp(−αE(X̄_t))],   (2.16)

leaving the second time derivative of E[exp(−αE(X̄_t))] to be bounded from below. To do so, we start with its first derivative. Applying the Itô–Doeblin formula and noting that the stochastic integrals have vanishing expectations as a consequence of [39, Theorem 3.2.1(iii), Definition 3.1.4(iii)] combined with the regularity f ∈ C([0,T], P₄(R^{2d})) obtained in Theorem 1, we have

d/dt E[exp(−αE(X̄_t))] = −α E[exp(−αE(X̄_t)) ⟨∇E(X̄_t), V̄_t⟩],   (2.17)

where another application of the Itô–Doeblin formula to the integrand allows us to write the right-hand side as its initial value plus a time integral, justifying a further differentiation. Differentiating both sides of (2.17) with respect to the time t yields

d²/dt² E[exp(−αE(X̄_t))] = α² E[exp(−αE(X̄_t)) |⟨∇E(X̄_t), V̄_t⟩|²] − α E[exp(−αE(X̄_t)) ⟨V̄_t, ∇²E(X̄_t) V̄_t⟩] + (αγ/m) E[exp(−αE(X̄_t)) ⟨∇E(X̄_t), V̄_t⟩] − (αλ/m) E[exp(−αE(X̄_t)) ⟨∇E(X̄_t), x_α(ρ_{X,t}) − X̄_t⟩]
≥ −(γ/m) d/dt E[exp(−αE(X̄_t))] − T₁ − T₂,   (2.18)

where we dropped the first, nonnegative term, employed (2.17) in the last step, and abbreviated T₁ := α E[exp(−αE(X̄_t)) ⟨V̄_t, ∇²E(X̄_t) V̄_t⟩] and T₂ := (αλ/m) E[exp(−αE(X̄_t)) ⟨∇E(X̄_t), x_α(ρ_{X,t}) − X̄_t⟩]. It remains to upper bound the terms T₁ and T₂. Making use of Assumptions A1 and A4, we immediately obtain

T₁ ≤ α E[exp(−αE(X̄_t)) ‖∇²E‖_∞ |V̄_t|²] ≤ α e^{−αE̲} C_E E[|V̄_t|²].   (2.19)
For T₂, again under Assumptions A1 and A4, we first note that

T₂ = −(αλ/m) E[exp(−αE(X̄_t)) ⟨∇E(X̄_t) − ∇E(x_α(ρ_{X,t})), X̄_t − x_α(ρ_{X,t})⟩] ≤ (αλ/m) e^{−αE̲} C_E E[|X̄_t − x_α(ρ_{X,t})|²],   (2.20)

where the equality is a consequence of E[exp(−αE(X̄_t)) ⟨∇E(x_α(ρ_{X,t})), X̄_t − x_α(ρ_{X,t})⟩] = 0, which follows from the definition of x_α(ρ_{X,t}). Bounding E[|X̄_t − x_α(ρ_{X,t})|²] as in (2.13) we can further bound (2.20) as

T₂ ≤ (αλ/m) e^{−αE̲} C_E E[|X̄_t − x_α(ρ_{X,t})|²] ≤ (2αλ/m) e^{−2αE̲} C_E E[|X̄_t − E[X̄_t]|²] / E[exp(−αE(X̄_t))].   (2.21)

Collecting the estimates (2.19) and (2.21) within (2.18) and inserting the result into (2.16) give

(1/2) d²/dt² (E[exp(−αE(X̄_t))])² ≥ −(γ/(2m)) d/dt (E[exp(−αE(X̄_t))])² − α e^{−2αE̲} C_E ( E[|V̄_t|²] + (2λ/m) E[|X̄_t − E[X̄_t]|²] ),

which yields the statement after employing the lower bound of (2.7) as in Lemma 1. □

We are now ready to state and prove the main result about the convergence of the mean-field PSO dynamics (2.3) without memory mechanisms to the global minimizer x*.

Theorem 2 Let E satisfy Assumptions A1–A4 and let (X̄_t, V̄_t)_{t≥0} be a solution to the nonlinear SDE (2.3). Moreover, let us assume the well-preparation of the initial datum X̄₀ and V̄₀ in the sense that

P1 μ > 0 with

μ := λγ/(2m²) − (2λ²/(γm) + σ²/m²) · 4e^{−αE̲}/E[exp(−αE(X̄₀))],

P2 it holds

(mα/(2γ)) ( E[exp(−αE(X̄₀)) ⟨∇E(X̄₀), V̄₀⟩] )₊ / E[exp(−αE(X̄₀))] + ( αC_E/(χ(γ/m − χ)) ) (1 + 8mλ/γ²) E[H(0)] / E[exp(−α(E(X̄₀) − E̲))]² < 3/16,

with (x)₊ = max{x, 0} for x ∈ R denoting the positive part and where

χ := 2 min{γ/m, μ} / (3((γ/(2m))² + 1)).

Then E[H(t)] with H as defined in Equation (2.6) converges exponentially fast with rate χ to 0 as t → ∞. Moreover, there exists some x̃, which may depend on α and f₀, such that E[X̄_t] → x̃ and x_α(ρ_{X,t}) → x̃ exponentially fast with rate χ/2 as t → ∞.
Eventually, for any given accuracy ε > 0, there exists α₀ > 0, which may depend on the dimension d, such that for all α > α₀, x̃ satisfies

E(x̃) − E̲ ≤ ε.

If E additionally satisfies Assumption A5, we have |x̃ − x*| ≤ ε^ν/η.

Remark 4 As suggested in Remark 3, Theorem 2 traces the evolution of E[exp(−αE(X̄_t))] back to its initial state by employing Lemma 3. This allows one to fix all parameters of PSO at initialization time: by replacing D_t^X with 2D₀^X in (2.14), the well-preparation of the parameters as in Condition P1 can be ensured. Condition P2 requires the well-preparation of the initialization in the sense that the initial datum f₀ is both well-concentrated and, to a certain extent, not too far from an optimal value. While this might have a locality flavor, the condition is generally fulfilled in practical applications. Moreover, for CBO methods there is recent work where such assumptions about the initial datum are reduced to the absolute minimum [18, 19].

Remark 5 The choice of the parameter α necessary in Theorem 2 may be affected by the dimensionality d of the optimization problem at hand. By establishing a quantitative nonasymptotic Laplace principle, this dependence is made explicit in the works [18, Proposition 18] and [19, Proposition 1], where the authors show that α may be required to grow linearly in d; see [18, Remark 21].

Proof of Theorem 2 Let us define the time horizon

T := inf{ t ≥ 0 : E[exp(−αE(X̄_t))] < (1/2) E[exp(−αE(X̄₀))] }, with inf ∅ = ∞.

Obviously, by continuity, T > 0. We claim that T = ∞, which we prove by contradiction in the following. Therefore, assume T < ∞. Then, for t ∈ [0, T], we have

λγ/(2m²) − (2λ²/(γm) + σ²/m²) · 2e^{−αE̲}/E[exp(−αE(X̄_t))] ≥ λγ/(2m²) − (2λ²/(γm) + σ²/m²) · 4e^{−αE̲}/E[exp(−αE(X̄₀))] = μ > 0,

where the positivity of μ is due to the well-preparation condition P1 of the initialization.
Lemma 2 then provides an upper bound for the time derivative of the functional E[H(t)],

d/dt E[H(t)] ≤ −(γ/m) E[|V̄_t|²] − μ E[|X̄_t − E[X̄_t]|²]
≤ −min{γ/m, μ} ( E[|X̄_t − E[X̄_t]|²] + E[|V̄_t|²] )
≤ −( 2 min{γ/m, μ} / (3((γ/(2m))² + 1)) ) E[H(t)] =: −χ E[H(t)],   (2.22)

where we made use of the upper bound of (2.7) as in Lemma 1 in the last inequality. The rate χ is defined implicitly and it is straightforward to check that χ < γ/m. Grönwall's inequality implies

E[H(t)] ≤ E[H(0)] exp(−χt).   (2.23)

Let us now investigate the evolution of the functional X(t) := (E[exp(−αE(X̄_t))])². First note that

X′(0) := d/dt X(t)|_{t=0} = −2α E[exp(−αE(X̄₀))] E[exp(−αE(X̄₀)) ⟨∇E(X̄₀), V̄₀⟩].

Then, an application of Grönwall's inequality to Equation (2.15) from Lemma 3 and using the explicit bound of E[H(t)] from (2.23) yield

d/dt X(t) ≥ X′(0) exp(−(γ/m)t) − 4αe^{−2αE̲} C_E (1 + 8λm/γ²) ∫₀ᵗ E[H(s)] exp(−(γ/m)(t − s)) ds
≥ X′(0) exp(−(γ/m)t) − 4αe^{−2αE̲} C_E (1 + 8λm/γ²) E[H(0)] (exp(−χt) − exp(−(γ/m)t)) / (γ/m − χ)
≥ X′(0) exp(−(γ/m)t) − 4αe^{−2αE̲} C_E (1 + 8λm/γ²) E[H(0)] exp(−χt) / (γ/m − χ),

which, in turn, implies

X(t) ≥ X(0) − (m/γ)(X′(0))₋ − 4αe^{−2αE̲} C_E (1 + 8λm/γ²) E[H(0)] / (χ(γ/m − χ))

after discarding the positive parts. Recalling the definition of X and employing the second well-preparation condition P2, we can deduce that for all t ∈ [0, T] it holds

(E[exp(−αE(X̄_t))])² ≥ (E[exp(−αE(X̄₀))])² − (2mα/γ) E[exp(−αE(X̄₀))] ( E[exp(−αE(X̄₀)) ⟨∇E(X̄₀), V̄₀⟩] )₊ − 4αe^{−2αE̲} C_E (1 + 8λm/γ²) E[H(0)] / (χ(γ/m − χ)) > (1/4)(E[exp(−αE(X̄₀))])²,

which entails that there exists δ > 0 such that E[exp(−αE(X̄_t))] ≥ E[exp(−αE(X̄₀))]/2 holds in [T, T + δ] as well, contradicting the definition of T and therefore showing the claim T = ∞. As a consequence of (2.23) we have

E[H(t)] ≤ E[H(0)] exp(−χt) and E[exp(−αE(X̄_t))] ≥ (1/2) E[exp(−αE(X̄₀))]   (2.24)

for all t ≥ 0.
In particular, by means of Lemma 1, for a suitable generic constant C > 0, we infer

E[|X̄_t − E[X̄_t]|²] ≤ C exp(−χt), E[|V̄_t|²] ≤ C exp(−χt), and E[|X̄_t − x_α(ρ_{X,t})|²] ≤ C exp(−χt),   (2.25)

where the last inequality uses the fact (2.13). Moreover, with Jensen's inequality,

|d/dt E[X̄_t]| ≤ E[|V̄_t|] ≤ C exp(−χt/2) → 0 as t → ∞,

showing that E[X̄_t] → x̃ for some x̃ ∈ R^d, which may depend on α and f₀. According to (2.25), X̄_t → x̃ in mean-square and x_α(ρ_{X,t}) → x̃, since

|x_α(ρ_{X,t}) − x̃|² ≤ 3E[|x_α(ρ_{X,t}) − X̄_t|²] + 3E[|X̄_t − E[X̄_t]|²] + 3|E[X̄_t] − x̃|² → 0 as t → ∞.

Eventually, by continuity of the objective function E and by the dominated convergence theorem, we conclude that E[exp(−αE(X̄_t))] → e^{−αE(x̃)} as t → ∞. Using this when taking the limit t → ∞ in the second bound of (2.24) after applying the logarithm and multiplying both sides by −1/α, we obtain

E(x̃) = lim_{t→∞} ( −(1/α) log E[exp(−αE(X̄_t))] ) ≤ −(1/α) log E[exp(−αE(X̄₀))] + (1/α) log 2.   (2.26)

The Laplace principle (1.5), on the other hand, allows us to choose α̂ ≫ 1 large enough such that for given ε > 0 it holds −(1/α) log E[exp(−αE(X̄₀))] − E̲ < ε/2 for any α ≥ α̂. Together with (2.26), this establishes 0 ≤ E(x̃) − E̲ ≤ ε/2 + (log 2)/α ≤ ε for α ≥ max{α̂, (2 log 2)/ε}. Finally, under the inverse continuity property A5 we additionally have |x̃ − x*| ≤ (E(x̃) − E̲)^ν/η ≤ ε^ν/η, concluding the proof. □

3 Mean-Field Analysis of PSO with Memory Effects

Let us now turn back to the PSO dynamics (1.2) described in the introduction. The fundamental difference to what was analyzed in the preceding section is the presence of a personal memory of each particle, encoded through the additional state variable Y. It can be thought of as an approximation to the in-time best position arg min_{τ≤t} E(X_τ) seen by the respective particle; its dynamics is encoded in Equation (1.2b). In this section we analyze (1.2) in the large particle limit, i.e., through its mean-field limit (1.8).
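Before turning to the analysis, the regularized personal-best update underlying (1.2b) can be sketched numerically. As a hypothetical but representative choice of the smooth switch, we take S^{β,θ}(x, y) = 1 + θ + tanh(β(E(y) − E(x))), which is consistent with the bound θ < S^{β,θ} < 2 + θ used below; all parameter values and the prescribed position path are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def E(x):
    return x**2

def S(x, y, beta=1000.0, theta=1e-4):
    # smooth switch: close to 2 + theta when x improves on y (E(x) < E(y)),
    # close to theta when the current position is worse than the memory
    return 1.0 + theta + np.tanh(beta * (E(y) - E(x)))

def track_personal_best(xs, kappa=50.0, dt=0.01):
    # integrate dY = kappa (X - Y) S^{beta,theta}(X, Y) dt along a given path X
    y = xs[0]
    for x in xs[1:]:
        y += kappa * (x - y) * S(x, y) * dt
    return y

# position path that descends to x = 0.1 (the best point seen) and then moves away
path = np.concatenate([np.linspace(2.0, 0.1, 200), np.linspace(0.1, 1.5, 200)])
y_final = track_personal_best(path)
print(y_final)
```

The memory variable follows the position closely while the objective keeps improving (the switch is near 2 + θ) and then essentially freezes near the best visited point once the position moves uphill again (the switch drops to about θ), mimicking a smoothed running arg-min.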
3.1 Well-Posedness of PSO with Memory Effects

Thanks to a sufficiently regularized implementation of the local best position Y, we can show the well-posedness of the mean-field PSO dynamics (1.8) and, respectively, of the associated Vlasov–Fokker–Planck equation (1.7). As regards uniqueness, it does not seem straightforward to extend the standard proof technique to the present setting due to the way the memory effects are implemented in (1.2b) and (1.8b). Therefore, in what follows, we merely prove existence of solutions and leave the development of a suitably modified proof technique for future research; see also Remark 8.

Theorem 3 Let E satisfy Assumptions A1–A3. Moreover, let m, γ, λ₁, λ₂, σ₁, σ₂, α, β, θ, κ, T > 0. If (X̄₀, Ȳ₀, V̄₀) is distributed according to f₀ ∈ P₄(R^{3d}), then the nonlinear SDE (1.8) admits a strong solution up to time T with C([0,T], R^d) × C([0,T], R^d) × C([0,T], R^d)-valued paths. The associated law f has regularity C([0,T], P₄(R^{3d})) and is a weak solution to the Vlasov–Fokker–Planck equation (1.7). In particular,

sup_{t∈[0,T]} E[|X̄_t|⁴ + |Ȳ_t|⁴ + |V̄_t|⁴] ≤ (1 + 3E[|X̄₀|⁴ + |Ȳ₀|⁴ + |V̄₀|⁴]) e^{CT}   (3.1)

for some constant C > 0 depending only on m, γ, λ₁, λ₂, σ₁, σ₂, α, β, θ, κ, c_E, R and L.

Proof sketch The proof follows the steps taken in [8, Theorems 3.1, 3.2].

Step 1: For a given function u ∈ C([0,T], R^d) and an initial measure f₀ ∈ P₄(R^{3d}), according to standard SDE theory [2, Chapter 6], we can uniquely solve the auxiliary SDE

dX_t = V_t dt,
dY_t = κ (X_t − Y_t) S^{β,θ}(X_t, Y_t) dt,
m dV_t = −γ V_t dt + λ₁ (Y_t − X_t) dt + λ₂ (u_t − X_t) dt + σ₁ D(Y_t − X_t) dB_t¹ + σ₂ D(u_t − X_t) dB_t²,

with initial condition (X₀, Y₀, V₀) ∼ f₀ since, due to the smoothness of S^{β,θ} and Assumptions A2 and A3, the coefficients are locally Lipschitz continuous and have at most linear growth. This induces f_t = Law(X_t, Y_t, V_t).
Moreover, the regularity of f₀ ∈ P₄(R^{3d}) allows for a moment estimate of the form (3.1) and thus f ∈ C([0,T], P₄(R^{3d})); see, e.g., [2, Chapter 7]. In what follows, ρ_Y denotes the spatial local best marginal of f, i.e., ρ_Y(t, ·) = ∬_{R^{2d}} df_t(x, ·, v).

Step 2: Let us now define, for some constant C > 0, the test function space

C²_*(R^{3d}) := { φ ∈ C²(R^{3d}) : |∇φ| ≤ C(1 + |x| + |y| + |v|) and sup_{k=1,...,d} ‖∂²_{v_k v_k} φ‖_∞ < ∞ }.   (3.2)

For some φ ∈ C²_*(R^{3d}), by the Itô–Doeblin formula, we derive

dφ = ∇_x φ · V_t dt + κ ∇_y φ · (X_t − Y_t) S^{β,θ}(X_t, Y_t) dt + ∇_v φ · ( −(γ/m) V_t + (λ₁/m)(Y_t − X_t) + (λ₂/m)(u_t − X_t) ) dt + Σ_{k=1}^d ∂²_{v_k v_k} φ ( (σ₁²/(2m²))(Y_t − X_t)_k² + (σ₂²/(2m²))(u_t − X_t)_k² ) dt + ∇_v φ · ( (σ₁/m) D(Y_t − X_t) dB_t¹ + (σ₂/m) D(u_t − X_t) dB_t² ),

where we write φ for φ(X_t, Y_t, V_t). After taking the expectation, applying Fubini's theorem and observing that the stochastic integrals vanish in expectation due to the definition of the test function space C²_*(R^{3d}) and the regularity (3.1), we observe that f ∈ C([0,T], P₄(R^{3d})) satisfies the Vlasov–Fokker–Planck equation

d/dt ∫_{R^{3d}} φ df_t = ∫_{R^{3d}} v·∇_x φ df_t + κ ∫_{R^{3d}} (x − y) S^{β,θ}(x, y)·∇_y φ df_t − ∫_{R^{3d}} ( (γ/m) v + (λ₁/m)(x − y) + (λ₂/m)(x − u_t) )·∇_v φ df_t + Σ_{k=1}^d ∫_{R^{3d}} ( (σ₁²/(2m²))(x − y)_k² + (σ₂²/(2m²))(x − u_t)_k² ) ∂²_{v_k v_k} φ df_t.   (3.3)

Step 3: Setting Tu := y_α(ρ_Y) ∈ C([0,T], R^d) provides the self-mapping property of the map

T : C([0,T], R^d) → C([0,T], R^d), u ↦ Tu = y_α(ρ_Y),

which is compact as a consequence of the stability estimate |y_α(ρ_{Y,t}) − y_α(ρ_{Y,s})| ≲ W₂(ρ_{Y,t}, ρ_{Y,s}) for ρ_{Y,t}, ρ_{Y,s} ∈ P₄(R^d), see, e.g., [8, Lemma 3.2], and the 1/2-Hölder continuity of the Wasserstein-2 distance t ↦ W₂(ρ_{Y,t}, ρ_{Y,s}).

Step 4: Then, for u = ϑTu with ϑ ∈ [0, 1], there exists f ∈ C([0,T], P₄(R^{3d})) satisfying (3.3) with marginal ρ_Y such that u_t = ϑ y_α(ρ_{Y,t}).
For such u, a uniform bound can be obtained as a consequence of Assumption A3. An application of the Leray–Schauder fixed point theorem provides a solution to (1.8). □

3.2 Convergence of PSO with Memory Effects to a Global Minimizer

Analogously to Sect. 2.2 we define a functional H(t), which is analyzed in this section to eventually prove its exponential decay and thereby consensus formation at some x̃ close to the global minimizer x*. In addition to the requirements that the variance E[|X̄_t − E[X̄_t]|²] in the position and the second-order moment of the velocity E[|V̄_t|²] of the averaged particle vanish, we also expect that the particle's position X̄_t aligns with its personal best position Ȳ_t over time, meaning that E[|X̄_t − Ȳ_t|²] decays to zero. This motivates the definition

H(t) := (γ/(2m))² |X̄_t − E[X̄_t]|² + (3/2)|V̄_t|² + (1/2)(3λ₁/m + (γ/m)²)|X̄_t − Ȳ_t|² + (γ/(2m)) ⟨X̄_t − E[X̄_t], V̄_t⟩ + (γ/m) ⟨X̄_t − Ȳ_t, V̄_t⟩,   (3.4)

whose last two terms are required for technical reasons. Again, by the equivalence established in the following Lemma 4, proving the decay of the Lyapunov function E[H(t)] directly entails the decay of E[|X̄_t − E[X̄_t]|² + |V̄_t|² + |X̄_t − Ȳ_t|²] with the same rate.

Lemma 4 The functional H(t) is equivalent to |X̄_t − E[X̄_t]|² + |V̄_t|² + |X̄_t − Ȳ_t|² in the sense that

(1/2)(γ/(2m))² |X̄_t − E[X̄_t]|² + (1/2)|V̄_t|² + (3λ₁/(2m))|X̄_t − Ȳ_t|² ≤ H(t) ≤ (5/2)((γ/(2m))² + 1 + 3λ₁/m + 2(γ/m)²)(|X̄_t − E[X̄_t]|² + |V̄_t|² + |X̄_t − Ȳ_t|²).   (3.5)

We now derive an evolution inequality for the Lyapunov function E[H(t)].

Lemma 5 Let E satisfy Assumptions A1–A3 and let (X̄_t, Ȳ_t, V̄_t)_{t≥0} be a solution to the nonlinear SDE (1.8).
Then E[H(t)] with H as defined in (3.4) satisfies

d/dt E[H(t)] ≤ −(γ/(2m)) E[|V̄_t|²]
− ( (λ₁+2λ₂)γ/(4m²) − ( 9λ₂²/(γm) + 3σ₂²/m² + λ₁γ/(4m²) + λ₂²γ/(2m²λ₁) ) · 6e^{−αE̲}/E[exp(−αE(Ȳ_t))] ) E[|X̄_t − E[X̄_t]|²]
− ( (λ₁+λ₂)γ/m² + κθ(3λ₁/m + (γ/m)²) − 8κ²γ/m − λ₁γ/(2m²) − 3σ₁²/(2m²) − 9λ₂²/(γm) − 3σ₂²/m² − ( 9λ₂²/(γm) + 3σ₂²/m² + λ₁γ/(4m²) + λ₂²γ/(2m²λ₁) ) · 12e^{−αE̲}/E[exp(−αE(Ȳ_t))] ) E[|X̄_t − Ȳ_t|²].   (3.6)

Proof Let us write δX̄_t := X̄_t − E[X̄_t] for short and note that the integration by parts formula gives

d/dt E[|δX̄_t|²] = 2E[⟨δX̄_t, V̄_t⟩].   (3.7)

Observe that the stochastic integrals have vanishing expectations as a consequence of [39, Theorem 3.2.1(iii), Definition 3.1.4(iii)] combined with the regularity f ∈ C([0,T], P₄(R^{3d})) obtained in Theorem 3. An application of the Itô–Doeblin formula and Young's inequality yields, for all ε > 0,

d/dt E[|V̄_t|²] = −(2γ/m) E[|V̄_t|²] + (2λ₁/m) E[⟨V̄_t, Ȳ_t − X̄_t⟩] + (2λ₂/m) E[⟨V̄_t, y_α(ρ_{Y,t}) − X̄_t⟩] + (σ₁²/m²) E[|Ȳ_t − X̄_t|²] + (σ₂²/m²) E[|y_α(ρ_{Y,t}) − X̄_t|²]
≤ −(2γ/m − λ₂/(εm)) E[|V̄_t|²] + (σ₁²/m²) E[|Ȳ_t − X̄_t|²] + (ελ₂/m + σ₂²/m²) E[|y_α(ρ_{Y,t}) − X̄_t|²] − (2λ₁/m) E[⟨V̄_t, X̄_t − Ȳ_t⟩].   (3.8)

Again by employing the Itô–Doeblin formula we obtain

d/dt E[⟨δX̄_t, V̄_t⟩] = E[|V̄_t|²] − |E[V̄_t]|² − (γ/m) E[⟨δX̄_t, V̄_t⟩] + (λ₁/m) E[⟨δX̄_t, Ȳ_t − X̄_t⟩] + (λ₂/m) E[⟨δX̄_t, y_α(ρ_{Y,t}) − X̄_t⟩]
≤ E[|V̄_t|²] − (γ/(2m)) d/dt E[|δX̄_t|²] + (λ₁/m) E[⟨δX̄_t, Ȳ_t − y_α(ρ_{Y,t})⟩] − ((λ₁+λ₂)/m) E[|δX̄_t|²]
≤ E[|V̄_t|²] − (γ/(2m)) d/dt E[|δX̄_t|²] − ((λ₁+2λ₂)/(2m)) E[|δX̄_t|²] + (λ₁/(2m)) E[|Ȳ_t − y_α(ρ_{Y,t})|²],

where, for the second line, we used the identity (3.7) and that E[⟨δX̄_t, C⟩] = 0 whenever C ∈ R^d is constant, allowing us to expand the expression as done.
We now rearrange the previous inequality to get

(γ/(2m)) d/dt E[|δX̄_t|²] + d/dt E[⟨δX̄_t, V̄_t⟩] ≤ E[|V̄_t|²] − ((λ₁+2λ₂)/(2m)) E[|δX̄_t|²] + (λ₁/(2m)) E[|Ȳ_t − y_α(ρ_{Y,t})|²].   (3.9)

Next, using the Itô–Doeblin formula, we compute

d/dt E[|X̄_t − Ȳ_t|²] = 2E[⟨X̄_t − Ȳ_t, V̄_t − κ(X̄_t − Ȳ_t) S^{β,θ}(X̄_t, Ȳ_t)⟩] ≤ 2E[⟨X̄_t − Ȳ_t, V̄_t⟩] − 2κθ E[|X̄_t − Ȳ_t|²],   (3.10)

where the last step follows from the fact that θ < S^{β,θ}(X̄_t, Ȳ_t) < 2 + θ < 4. And lastly, the Itô–Doeblin formula and Young's inequality allow us to bound

d/dt E[⟨X̄_t − Ȳ_t, V̄_t⟩] = −(γ/m) E[⟨X̄_t − Ȳ_t, V̄_t⟩] − ((λ₁+λ₂)/m) E[|X̄_t − Ȳ_t|²] + (λ₂/m) E[⟨X̄_t − Ȳ_t, y_α(ρ_{Y,t}) − Ȳ_t⟩] + E[⟨V̄_t − κ(X̄_t − Ȳ_t) S^{β,θ}(X̄_t, Ȳ_t), V̄_t⟩]
≤ −((λ₁+λ₂)/m − 8κ² − λ₁/(2m)) E[|X̄_t − Ȳ_t|²] + (3/2) E[|V̄_t|²] + (λ₂²/(2mλ₁)) E[|y_α(ρ_{Y,t}) − Ȳ_t|²] − (γ/m) E[⟨X̄_t − Ȳ_t, V̄_t⟩].   (3.11)

We now collect the bounds (3.8), (3.9), (3.10), and (3.11), together with E[|y_α(ρ_{Y,t}) − X̄_t|²] ≤ 2E[|X̄_t − Ȳ_t|²] + 2E[|y_α(ρ_{Y,t}) − Ȳ_t|²], to show

d/dt E[H(t)] ≤ −(γ/m − 3λ₂/(2εm)) E[|V̄_t|²] − ((λ₁+2λ₂)γ/(4m²)) E[|δX̄_t|²]
− ( (λ₁+λ₂)γ/m² + κθ(3λ₁/m + (γ/m)²) − 8κ²γ/m − λ₁γ/(2m²) − 3σ₁²/(2m²) − 3(ελ₂/m + σ₂²/m²) ) E[|X̄_t − Ȳ_t|²]
+ ( 3(ελ₂/m + σ₂²/m²) + λ₁γ/(4m²) + λ₂²γ/(2m²λ₁) ) E[|y_α(ρ_{Y,t}) − Ȳ_t|²].
Recalling the computation (2.13) yields the bound

E[|Ȳ_t − y_α(ρ_{Y,t})|²] ≤ 2e^{−αE̲} E[|δȲ_t|²] / E[exp(−αE(Ȳ_t))] ≤ 2e^{−αE̲} ( 6E[|Ȳ_t − X̄_t|²] + 3E[|δX̄_t|²] ) / E[exp(−αE(Ȳ_t))],   (3.12)

where we inserted ±X̄_t and ±E[X̄_t] in the second step and used that (a + b + c)² ≤ 3(a² + b² + c²) as well as Jensen's inequality. Combining the last two bounds and choosing ε = 3λ₂/γ, we obtain (3.6) as desired. □

Remark 6 The exponential decay of E[H(t)] is obtained by choosing the parameters of PSO in a manner which ensures the negativity of the prefactors of E[|X̄_t − E[X̄_t]|²] and E[|X̄_t − Ȳ_t|²] in Inequality (3.6). This may be achieved by choosing, for any fixed time t, given α and arbitrary θ, σ₁, σ₂, γ > 0,

λ₁ > 3σ₁² D_t^Y/(2γ), λ₂ > 6 max{ λ₁/4, (1 + D_t^Y) σ₂²/γ }, κ > 3λ₂ (1 + D_t^Y)/(γθ),

and subsequently m < min{ γθλ₁/(16κ²), γ²/(18 D_t^Y λ₂) }, where we abbreviate D_t^Y := 12e^{−αE̲}/E[exp(−αE(Ȳ_t))].

In our main theorem on the convergence of the PSO dynamics with memory mechanisms to the global minimizer x* we again ensure that the parameters can be chosen once at initialization time.

Theorem 4 Let E satisfy Assumptions A1–A4 and let (X̄_t, Ȳ_t, V̄_t)_{t≥0} be a solution to the nonlinear SDE (1.8). Moreover, let us assume the well-preparation of the initial datum X̄₀, Ȳ₀ and V̄₀ in the sense that

P1 μ₁ > 0 with

μ₁ := (λ₁+2λ₂)γ/(4m²) − ( 9λ₂²/(γm) + 3σ₂²/m² + λ₁γ/(4m²) + λ₂²γ/(2m²λ₁) ) · 12e^{−αE̲}/E[exp(−αE(Ȳ₀))],

P2 μ₂ > 0 with

μ₂ := (λ₁+λ₂)γ/m² + κθ(3λ₁/m + (γ/m)²) − 8κ²γ/m − λ₁γ/(2m²) − 3σ₁²/(2m²) − 9λ₂²/(γm) − 3σ₂²/m² − ( 9λ₂²/(γm) + 3σ₂²/m² + λ₁γ/(4m²) + λ₂²γ/(2m²λ₁) ) · 24e^{−αE̲}/E[exp(−αE(Ȳ₀))],

P3 it holds

(4ακ/χ)(C_E + 2α)(2m/(3λ₁)) E[H(0)] + κ( (4/χ) E[|∇E(X̄₀)|²] + (16C_E²/χ³) E[H(0)] ) < (1/2) E[exp(−α(E(Ȳ₀) − E̲))],

where

χ := 2 min{γ/(2m), μ₁, μ₂} / (5((γ/(2m))² + 1 + 3λ₁/m + 2(γ/m)²)).

Then E[H(t)] with H as defined in Equation (3.4) converges exponentially fast with rate χ to 0 as t → ∞.
Moreover, there exists some x̃, which may depend on α and f₀, such that E[X̄_t] → x̃ and y_α(ρ_{Y,t}) → x̃ exponentially fast with rate χ/2 as t → ∞. Eventually, for any given accuracy ε > 0, there exists α₀ > 0, which may depend on the dimension d, such that for all α > α₀, x̃ satisfies

E(x̃) − E̲ ≤ ε.

If E additionally satisfies Assumption A5, we additionally have |x̃ − x*| ≤ ε^ν/η.

Remark 7 By replacing D_t^Y with 2D₀^Y in the parameter choices of Remark 6, the well-preparation of the parameters as in Conditions P1 and P2 can be ensured. In analogy to Remark 4, Condition P3 guarantees the well-preparation of the initialization.

Proof of Theorem 4 Let us define the time horizon

T := inf{ t ≥ 0 : E[exp(−αE(Ȳ_t))] < (1/2) E[exp(−αE(Ȳ₀))] }, with inf ∅ = ∞.

Obviously, by continuity, T > 0. We claim that T = ∞, which we prove by contradiction in the following. Therefore, assume T < ∞. Then, for t ∈ [0, T], noting that E[exp(−αE(Ȳ_t))] ≥ E[exp(−αE(Ȳ₀))]/2, we observe that the prefactors of E[|X̄_t − E[X̄_t]|²] and E[|X̄_t − Ȳ_t|²] in Lemma 5 are bounded from above by −μ₁ and −μ₂, respectively. Lemma 5 then provides an upper bound for the time derivative of the functional E[H(t)],

d/dt E[H(t)] ≤ −(γ/(2m)) E[|V̄_t|²] − μ₁ E[|X̄_t − E[X̄_t]|²] − μ₂ E[|X̄_t − Ȳ_t|²]
≤ −min{γ/(2m), μ₁, μ₂} ( E[|X̄_t − E[X̄_t]|²] + E[|V̄_t|²] + E[|X̄_t − Ȳ_t|²] )
≤ −( 2 min{γ/(2m), μ₁, μ₂} / (5((γ/(2m))² + 1 + 3λ₁/m + 2(γ/m)²)) ) E[H(t)] =: −χ E[H(t)],   (3.13)

where we made use of the upper bound of (3.5) as in Lemma 4 in the last inequality. The rate χ is defined implicitly and it is straightforward to check that 0 < χ < γ/m, where the positivity of χ follows from the well-preparation conditions P1 and P2 of the initialization. Grönwall's inequality implies

E[H(t)] ≤ E[H(0)] exp(−χt).   (3.14)
We now investigate the evolution of the functional Y(t) := E[exp(−αE(Ȳ_t))]. The Itô–Doeblin formula yields

d/dt Y(t) = −ακ E[exp(−αE(Ȳ_t)) ⟨∇E(Ȳ_t), (X̄_t − Ȳ_t) S^{β,θ}(X̄_t, Ȳ_t)⟩]
= −ακ E[exp(−αE(Ȳ_t)) ⟨∇E(Ȳ_t) − ∇E(X̄_t), (X̄_t − Ȳ_t) S^{β,θ}(X̄_t, Ȳ_t)⟩] − ακ E[exp(−αE(Ȳ_t)) ⟨∇E(X̄_t), (X̄_t − Ȳ_t) S^{β,θ}(X̄_t, Ȳ_t)⟩]
≥ −4ακ e^{−αE̲} C_E E[|X̄_t − Ȳ_t|²] − 4ακ e^{−αE̲} E[|∇E(X̄_t)||X̄_t − Ȳ_t|],   (3.15)

where the last step follows from the Cauchy–Schwarz inequality and uses Assumption A4 as well as S^{β,θ}(X̄_t, Ȳ_t) < 4. Now firstly notice that, by Young's inequality,

E[|∇E(X̄_t)||X̄_t − Ȳ_t|] ≤ α e^{(χ/2)t} E[|X̄_t − Ȳ_t|²] + (1/(4α)) e^{−(χ/2)t} E[|∇E(X̄_t)|²].

Secondly, using again Assumption A4 in the first inequality, we have

E[|∇E(X̄_t)|²] = E[|∇E(X̄₀) + ∫₀ᵗ ∇²E(X̄_s) V̄_s ds|²] ≤ 2E[|∇E(X̄₀)|²] + 2C_E² t ∫₀ᵗ E[|V̄_s|²] ds ≤ 2E[|∇E(X̄₀)|²] + 4C_E² t ∫₀ᵗ E[H(s)] ds ≤ 2E[|∇E(X̄₀)|²] + (4C_E² t/χ) E[H(0)] (1 − exp(−χt)),

where the next-to-last step uses the explicit bound in (3.14). Using the two latter observations together with the fact that E[|X̄_t − Ȳ_t|²] ≤ (2m/(3λ₁)) E[H(t)], we can continue (3.15) as follows:

d/dt Y(t) ≥ −4ακ e^{−αE̲} ( C_E + α e^{(χ/2)t} )(2m/(3λ₁)) E[H(0)] exp(−χt) − κ e^{−αE̲} e^{−(χ/2)t} ( 2E[|∇E(X̄₀)|²] + (4C_E² t/χ) E[H(0)] ).   (3.16)

By integrating (3.16) we obtain, for all t ∈ [0, T],

Y(t) ≥ Y(0) − (4ακ/χ) e^{−αE̲} (C_E + 2α)(2m/(3λ₁)) E[H(0)] − κ e^{−αE̲} ( (4/χ) E[|∇E(X̄₀)|²] + (16C_E²/χ³) E[H(0)] ).
α χ χ Recalling the definition of Y and employing Condition P3, we can deduce that for all t ∈[0, T ] it holds C 2α 2m −αE E[exp(−αE (Y ))]≥ E[exp(−αE (Y ))]− 4ακ e + E[H(0)] t 0 χ χ 3λ 16C 4 2 −αE 2 E − κ e 2E[|∇E ( X )| ] + E[H(0)] α χ χ > E[exp(−αE (Y ))], which entails that there existsδ> 0 such that E[exp(−αE (Y ))]≥ E[exp(−αE (Y ))]/ t 0 2in [T , T + δ] as well, contradicting the definition of T and therefore showing the claim T =∞. As a consequence of (3.14)wehave E[H(t )]≤ E[H(0)] exp(−χ t ) and E[exp(−αE (Y ))]≥ E[exp(−αE (Y ))] t 0 (3.17) for all t ≥ 0. In particular, by means of Lemma 4, for a suitable generic constant C > 0, we infer 2 2 E[| X − E[ X ]| ]≤ C exp(−χ t ), E[|V | ]≤ C exp(−χ t ), t t t and E[| X − Y | ]≤ C exp(−χ t ). (3.18) t t 123 30 Page 28 of 44 Applied Mathematics & Optimization (2023) 88:30 Moreover, with Jensen’s inequality, E[ X ] ≤ E[|V |] ≤ C exp (−χ t /2) →0as t →∞, t t dt showing that E[ X ]→ x for some x ∈ R , which may depend on α and f . According t 0 to (3.18), X → x as well as Y → x in mean-square. Moreover, by reusing the t t inequality (3.12) we get 2 2 6E[|Y − X | ]+ 3E[| X − E X | ] t t t t 2 −αE E[|Y − y (ρ )| ]≤ 4e ≤ C exp(−χ t ) t α Y ,t E[exp(−αE (Y ))] showing y (ρ ) → x, since α Y ,t 2 2 2 | y (ρ ) − x | ≤ 4E[| y (ρ ) − Y | ]+ 4E[|Y − X | ] α Y ,t α Y ,t t t t 2 2 +4E[| X − E X | ]+ 4|E X − x | →0as t →∞. t t t The remainder of the proof follows the lines of the proof of Theorem 2, replacing merely X with Y . t t 4 A Holistic Convergence Statement of PSO Without Memory Effects In Sects. 2 and 3 we analyzed the macroscopic behavior of PSO without and with mem- ory effects in the mean-field regime. For this purpose we introduced the with (1.2) and (2.1) associated self-consistent mono-particle processes (1.8) and (2.3), for which we then established convergence guarantees under the in Theorems 2 and 4 specified assumptions. 
bilities of the numerically implemented PSO method, a quantitative estimate on the approximation quality of the interacting particle system by the corresponding mean-field dynamics is necessary.

4.1 On the Mean-Field Approximation of PSO Without Memory Effects

The following theorem provides a probabilistic quantitative estimate on the mean-field approximation for PSO without memory effects. Notably, the result does not suffer from the curse of dimensionality.

Theorem 5 Let \(T > 0\), \(f_0 \in \mathcal{P}_4(\mathbb{R}^{2d})\) and let \(N \in \mathbb{N}\) be fixed. Moreover, let \(E\) obey Assumptions A1–A4. We denote by \(((X^i_t, V^i_t)_{t\ge 0})_{i=1,\dots,N}\) the solution to system (2.1) and let \(((\overline{X}^i_t, \overline{V}^i_t)_{t\ge 0})_{i=1,\dots,N}\) be N independent copies of the solution to the mean-field dynamics (2.3). Then it holds

\[
P(\Omega_M) = P\bigg(\sup_{t\in[0,T]} \frac{1}{N}\sum_{i=1}^N \max\big\{|X^i_t|^4 + |V^i_t|^4,\; |\overline{X}^i_t|^4 + |\overline{V}^i_t|^4\big\} \le M\bigg) \ge 1 - \frac{2K}{M}, \tag{4.1}
\]

where \(K = K(\gamma/m, \lambda/m, \sigma/m, T, E)\) is a constant, which is in particular independent of N and d.

Furthermore, if the processes share the initial data as well as the Brownian motion paths \((B^i_t)_{t\ge 0}\) for all \(i = 1,\dots,N\), then we have a probabilistic mean-field approximation of the form

\[
\max_{i=1,\dots,N}\; \sup_{t\in[0,T]} E\big[\,|X^i_t - \overline{X}^i_t|^2 + |V^i_t - \overline{V}^i_t|^2 \,\big|\, \Omega_M\big] \le C_{\mathrm{MFA}}\, N^{-1} \tag{4.2}
\]

with a constant \(C_{\mathrm{MFA}} = C_{\mathrm{MFA}}(\alpha, \gamma/m, \lambda/m, \sigma/m, T, E, K, M)\), which is in particular independent of N and d.

Proof The proof is based on the arguments of [18, Section 3.3] about the mean-field approximation of CBO. First we compute a bound for \(E[\sup_{t\in[0,T]} \frac{1}{N}\sum_{i=1}^N \max\{|X^i_t|^4 + |V^i_t|^4, |\overline{X}^i_t|^4 + |\overline{V}^i_t|^4\}]\), which is then used to derive a mean-field approximation for PSO conditioned on the set of uniformly bounded processes.
Step 1: Using standard inequalities and Jensen's inequality allows to derive the bound

\[
E\Big[\sup_{t\in[0,T]} |X^i_t|^4\Big] \lesssim E[|X^i_0|^4] + E\bigg[\sup_{t\in[0,T]} \Big|\int_0^t V^i_s\, ds\Big|^4\bigg] \le C\Big(E[|X^i_0|^4] + \int_0^T E[|V^i_s|^4]\, ds\Big) \tag{4.3}
\]

with \(C = C(T)\). For the velocities \(V^i\) we first note that

\[
\begin{aligned}
E\Big[\sup_{t\in[0,T]} |V^i_t|^4\Big]
&\lesssim E[|V^i_0|^4] + E\bigg[\sup_{t\in[0,T]} \Big|\int_0^t V^i_s\, ds\Big|^4\bigg] + E\bigg[\sup_{t\in[0,T]} \Big|\int_0^t \big(x_\alpha(\rho^N_{X,s}) - X^i_s\big)\, ds\Big|^4\bigg] \\
&\quad + E\bigg[\sup_{t\in[0,T]} \Big|\int_0^t D\big(x_\alpha(\rho^N_{X,s}) - X^i_s\big)\, dB^i_s\Big|^4\bigg].
\end{aligned}
\tag{4.4}
\]

While the two middle terms on the right-hand side of (4.4) can be controlled as before by applying Jensen's inequality, the last term is treated as follows. Since \(\int_0^t D(x_\alpha(\rho^N_{X,s}) - X^i_s)\, dB^i_s\) is a martingale we can apply the Burkholder–Davis–Gundy inequality [47, Chapter IV, Theorem 4.1], which gives

\[
E\bigg[\sup_{t\in[0,T]} \Big|\int_0^t D\big(x_\alpha(\rho^N_{X,s}) - X^i_s\big)\, dB^i_s\Big|^4\bigg] \lesssim \sup_{t\in[0,T]} E\bigg[\Big(\int_0^t \big|x_\alpha(\rho^N_{X,s}) - X^i_s\big|^2\, ds\Big)^2\bigg] \le C \int_0^T E\big[\big|x_\alpha(\rho^N_{X,s}) - X^i_s\big|^4\big]\, ds, \tag{4.5}
\]

where the latter step is again due to Jensen's inequality and with a constant \(C = C(T)\). Utilizing these bounds allows to continue the inequality in (4.4) and to obtain

\[
E\Big[\sup_{t\in[0,T]} |V^i_t|^4\Big] \le C\Big(E[|V^i_0|^4] + \int_0^T E\big[|X^i_s|^4 + |x_\alpha(\rho^N_{X,s})|^4 + |V^i_s|^4\big]\, ds\Big) \tag{4.6}
\]

with \(C = C(\gamma/m, \lambda/m, \sigma/m, T)\). Since according to [8, Lemma 3.3] it holds

\[
|x_\alpha(\rho^N_{X,s})|^2 \le \int |x|^2\, \frac{\omega^E_\alpha(x)}{\|\omega^E_\alpha\|_{L^1(\rho^N_{X,s})}}\, d\rho^N_{X,s}(x) \le b_1 + b_2 \int |x|^2\, d\rho^N_{X,s}(x) = b_1 + b_2\, \frac{1}{N}\sum_{i=1}^N |X^i_s|^2
\]

with \(b_1 = 0\) and \(b_2 = e^{\alpha(\overline{E} - \underline{E})}\) in the case that \(E\) is bounded, and

\[
b_1 = R^2 + b_2^2 \quad \text{and} \quad b_2 = \frac{2L_E \max\{1, |x^*|^2\}}{c_E}\Big(1 + \frac{1}{\alpha c_E R^2}\Big)
\]

in the case that \(E\) satisfies the coercivity assumption A3, we eventually obtain the upper bound

\[
E\Big[\sup_{t\in[0,T]} |V^i_t|^4\Big] \le C\bigg(1 + E[|V^i_0|^4] + \int_0^T E\Big[|X^i_s|^4 + \frac{1}{N}\sum_{j=1}^N |X^j_s|^4 + |V^i_s|^4\Big]\, ds\bigg) \tag{4.7}
\]

with \(C = C(\gamma/m, \lambda/m, \sigma/m, T, b_1, b_2)\).
Adding up (4.3) and (4.7) yields

\[
E\Big[\sup_{t\in[0,T]} \big(|X^i_t|^4 + |V^i_t|^4\big)\Big] \le C\bigg(1 + E[|X^i_0|^4 + |V^i_0|^4] + \int_0^T E\Big[|X^i_s|^4 + \frac{1}{N}\sum_{j=1}^N |X^j_s|^4 + |V^i_s|^4\Big]\, ds\bigg), \tag{4.8}
\]

which, averaged over i, allows to derive the bound

\[
\frac{1}{N}\sum_{i=1}^N E\Big[\sup_{t\in[0,T]} \big(|X^i_t|^4 + |V^i_t|^4\big)\Big] \le C\bigg(1 + \frac{1}{N}\sum_{i=1}^N E[|X^i_0|^4 + |V^i_0|^4] + \int_0^T \frac{1}{N}\sum_{i=1}^N E\big[|X^i_s|^4 + |V^i_s|^4\big]\, ds\bigg). \tag{4.9}
\]

An application of Grönwall's inequality ensures that \(\frac{1}{N}\sum_{i=1}^N E[\sup_{t\in[0,T]} (|X^i_t|^4 + |V^i_t|^4)]\) is bounded independently of N by some constant \(K = K(\gamma/m, \lambda/m, \sigma/m, T, b_1, b_2)\). Note that the constant K does in particular not depend on N or d. With identical arguments for the processes \((\overline{X}^i_t, \overline{V}^i_t)\) an analogous bound can be obtained for \(\frac{1}{N}\sum_{i=1}^N E[\sup_{t\in[0,T]} (|\overline{X}^i_t|^4 + |\overline{V}^i_t|^4)]\). The first claim of the statement now follows from Markov's inequality.

Step 2: We define the cutoff function

\[
I_M(t) =
\begin{cases}
1, & \text{if } \displaystyle\frac{1}{N}\sum_{i=1}^N \max\big\{|X^i_s|^4 + |V^i_s|^4,\; |\overline{X}^i_s|^4 + |\overline{V}^i_s|^4\big\} \le M \text{ for all } s \in [0, t], \\
0, & \text{else,}
\end{cases}
\tag{4.10}
\]

which is a random variable adapted to the natural filtration and satisfying \(\mathbb{1}_{\Omega_M} \le I_M(t)\) pointwise for all \(t \in [0, T]\) as well as \(I_M(t) = I_M(t) I_M(s)\) for all \(s \in [0, t]\). Firstly, for the positions, by using standard inequalities and Jensen's inequality, we obtain the bound

\[
E\big[|X^i_t - \overline{X}^i_t|^2 I_M(t)\big] \lesssim E[|X^i_0 - \overline{X}^i_0|^2] + E\bigg[\Big|\int_0^t \big(V^i_s - \overline{V}^i_s\big) I_M(s)\, ds\Big|^2\bigg] \le C\Big(E[|X^i_0 - \overline{X}^i_0|^2] + \int_0^t E\big[|V^i_s - \overline{V}^i_s|^2 I_M(s)\big]\, ds\Big) \tag{4.11}
\]

with \(C = C(T)\). Secondly, for the velocities we have

\[
\begin{aligned}
E\big[|V^i_t - \overline{V}^i_t|^2 I_M(t)\big]
&\lesssim E[|V^i_0 - \overline{V}^i_0|^2] + E\bigg[\Big|\int_0^t \big(V^i_s - \overline{V}^i_s\big) I_M(s)\, ds\Big|^2\bigg] \\
&\quad + E\bigg[\Big|\int_0^t \Big(\big(x_\alpha(\rho^N_{X,s}) - X^i_s\big) - \big(x_\alpha(\overline{\rho}_{X,s}) - \overline{X}^i_s\big)\Big) I_M(s)\, ds\Big|^2\bigg] \\
&\quad + E\bigg[\Big|\int_0^t \Big(\big|x_\alpha(\rho^N_{X,s}) - X^i_s\big| - \big|x_\alpha(\overline{\rho}_{X,s}) - \overline{X}^i_s\big|\Big) I_M(s)\, dB^i_s\Big|^2\bigg] \\
&\le C\bigg(E[|V^i_0 - \overline{V}^i_0|^2] + \int_0^t E\big[|V^i_s - \overline{V}^i_s|^2 I_M(s)\big]\, ds \\
&\qquad\quad + \int_0^t E\Big[\big(\big|x_\alpha(\rho^N_{X,s}) - x_\alpha(\overline{\rho}_{X,s})\big|^2 + \big|X^i_s - \overline{X}^i_s\big|^2\big) I_M(s)\Big]\, ds\bigg)
\end{aligned}
\tag{4.12}
\]

with \(C = C(\gamma/m, \lambda/m, \sigma/m, T)\).
In the first step of (4.12) we used that the processes \((X^i_t, V^i_t)\) and \((\overline{X}^i_t, \overline{V}^i_t)\) share the Brownian motion paths, and in the second both the Itô isometry and Jensen's inequality. In order to conclude, it remains to control the term \(E[|x_\alpha(\rho^N_{X,s}) - x_\alpha(\overline{\rho}_{X,s})|^2 I_M(s)]\). To do so, in analogy to the definition of \(\rho^N_{X,s}\), let us denote by \(\overline{\rho}^N_{X,s}\) the empirical measure associated with the processes \(\overline{X}^i\), i.e., \(\overline{\rho}^N_{X,s} := \frac{1}{N}\sum_{i=1}^N \delta_{\overline{X}^i_s}\). Then, by following the proofs of [8, Lemma 3.2] and [16, Lemma 3.1], and exploiting the boundedness ensured by the multiplication with the random variable \(I_M(s)\), we obtain

\[
\begin{aligned}
E\big[\big|x_\alpha(\rho^N_{X,s}) - x_\alpha(\overline{\rho}_{X,s})\big|^2 I_M(s)\big]
&\lesssim E\big[\big|x_\alpha(\rho^N_{X,s}) - x_\alpha(\overline{\rho}^N_{X,s})\big|^2 I_M(s)\big] + E\big[\big|x_\alpha(\overline{\rho}^N_{X,s}) - x_\alpha(\overline{\rho}_{X,s})\big|^2 I_M(s)\big] \\
&\le C\Big(\frac{1}{N}\sum_{i=1}^N E\big[|X^i_s - \overline{X}^i_s|^2 I_M(s)\big] + N^{-1}\Big) \\
&\le C\Big(\max_{i=1,\dots,N} E\big[|X^i_s - \overline{X}^i_s|^2 I_M(s)\big] + N^{-1}\Big)
\end{aligned}
\]

with \(C = C(\alpha, L_E, c_E, |x^*|, M, b_1, b_2)\). Inserting the latter into (4.12), and adding up (4.11) and (4.12) yields

\[
E\Big[\big(|X^i_t - \overline{X}^i_t|^2 + |V^i_t - \overline{V}^i_t|^2\big) I_M(t)\Big] \le C \int_0^t \Big(E\big[\big(|X^i_s - \overline{X}^i_s|^2 + |V^i_s - \overline{V}^i_s|^2\big) I_M(s)\big] + \max_{j=1,\dots,N} E\big[|X^j_s - \overline{X}^j_s|^2 I_M(s)\big] + N^{-1}\Big)\, ds \tag{4.13}
\]

with \(C = C(\alpha, \gamma/m, \lambda/m, \sigma/m, T, L_E, c_E, |x^*|, M, b_1, b_2)\) and where we used that the processes \((X^i_t, V^i_t)\) and \((\overline{X}^i_t, \overline{V}^i_t)\) share the initial conditions. Lastly, by taking the maximum over i on both sides we get

\[
\max_{i=1,\dots,N} E\Big[\big(|X^i_t - \overline{X}^i_t|^2 + |V^i_t - \overline{V}^i_t|^2\big) I_M(t)\Big] \le C \int_0^t \Big(\max_{j=1,\dots,N} E\big[\big(|X^j_s - \overline{X}^j_s|^2 + |V^j_s - \overline{V}^j_s|^2\big) I_M(s)\big] + N^{-1}\Big)\, ds \tag{4.14}
\]

with the C from before. After recalling the definition of the conditional expectation, an application of Grönwall's inequality concludes the proof.
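A central device in the proof above is the synchronous coupling: the interacting system and its mean-field copies are driven by the same Brownian motion paths, so the noise cancels in their difference before Grönwall's inequality is applied. The following Python sketch illustrates this mechanism on a toy Ornstein–Uhlenbeck dynamics (our own simplified stand-in, not the PSO systems (2.1)/(2.3)): two solutions driven by identical Gaussian increments contract toward each other deterministically.

```python
import numpy as np

rng = np.random.default_rng(0)
dt, n_steps = 0.01, 1000
x, y = 1.0, -0.5          # two initial conditions for the SAME dynamics
diffs = []
for _ in range(n_steps):
    db = rng.normal(scale=np.sqrt(dt))   # ONE Brownian increment, used by BOTH
    x += -x * dt + 0.5 * db
    y += -y * dt + 0.5 * db
    diffs.append(abs(x - y))

# With shared noise the difference obeys d(x - y) = -(x - y) dt, i.e. it
# contracts like exp(-t); with independent noise it would keep fluctuating.
print(diffs[0], diffs[-1])
```

With independent increments for the two processes, the pathwise difference would instead fluctuate on the scale of the noise, and no Grönwall-type estimate on the coupled difference would be available.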
Remark 8 While the first part of Theorem 5 about the uniform-in-time boundedness of the empirical measures holds mutatis mutandis for the PSO dynamics with memory effects (1.2) and (1.8), it does not seem straightforward to obtain the second part in this setting, due to the way the memory effects are implemented in (1.2b) and (1.8b). As a matter of fact, this is due to exactly the same technical reasons why we lack a uniqueness statement in Sect. 3.1. We therefore leave the investigation of this extension to future research, in particular with regard to the question whether a suitably modified proof technique or other implementations of memory effects resolve this issue.

4.2 Convergence of PSO Without Memory Effects in Probability

Combining Theorem 5 with the convergence analysis of the mean-field dynamics (2.3) as described in Theorem 2, as well as a classical result about the numerical approximation of SDEs, allows to obtain convergence guarantees with provable polynomial complexity for the numerical PSO method, as stated in Theorem 6 below. Let us, for the reader's convenience, recall from [21, Section 6] that a possible discretized version of the interacting particle system (2.1) is given by

\[
X^i_{(k+1)\Delta t} = X^i_{k\Delta t} + \Delta t\, V^i_{(k+1)\Delta t}, \tag{4.15a}
\]
\[
V^i_{(k+1)\Delta t} = \frac{m}{m + \Delta t\, \gamma}\, V^i_{k\Delta t} + \frac{\Delta t\, \lambda}{m + \Delta t\, \gamma}\, \big(x_\alpha(\rho^N_{X,k\Delta t}) - X^i_{k\Delta t}\big) + \frac{\sqrt{\Delta t}\, \sigma}{m + \Delta t\, \gamma}\, D\big(x_\alpha(\rho^N_{X,k\Delta t}) - X^i_{k\Delta t}\big)\, B^i_{k\Delta t} \tag{4.15b}
\]

for \(k = 0, \dots, K - 1\), and where \((B^i_{k\Delta t})_{k=0,\dots,K-1,\, i=1,\dots,N}\) are independent, identically distributed standard Gaussian random vectors in \(\mathbb{R}^d\).

Theorem 6 Let \(\varepsilon_{\mathrm{total}} > 0\) and \(\delta \in (0, 1/2)\). Then, under the assumptions of Theorems 2 and 5, it holds for the discretized PSO dynamics (4.15) that

\[
\bigg|\frac{1}{N}\sum_{i=1}^N X^i_{K\Delta t} - x^*\bigg|^2 \le \varepsilon_{\mathrm{total}} \tag{4.16}
\]

with probability larger than \(1 - \big(\delta + \varepsilon_{\mathrm{total}}^{-1}\big(C_{\mathrm{NA}}(\Delta t)^{2m} + C_{\mathrm{MFA}} N^{-1} + C_{\mathrm{LLN}} N^{-1} + \varepsilon + \varepsilon^{2\nu}/\eta^2\big)\big)\). Here, m denotes the order of accuracy of the used discretization scheme.
Moreover, besides problem-dependent factors and the parameters of the method, the dependence of the constants is as follows. \(C_{\mathrm{NA}}\) depends linearly on d and N, and exponentially on T. \(C_{\mathrm{MFA}}\) depends exponentially on \(\alpha\), T and \(\delta^{-1}\). \(C_{\mathrm{LLN}}\) depends on the moment bound from Theorem 1. Lastly, the accuracies \(\varepsilon\) and \(\varepsilon^{2\nu}/\eta^2\) are chosen according to Theorem 2.

Remark 9 It is worth emphasizing at this point that the time horizon T in Theorem 6 scales as \(\mathcal{O}(\log(\varepsilon^{-1})/\chi)\), and therefore logarithmically in the desired accuracy \(\varepsilon\), as a result of Theorem 2; see also the proof below. This ensures that the constants \(C_{\mathrm{NA}}\) and \(C_{\mathrm{MFA}}\) appearing implicitly in the bound (4.16) do not lead to an unfeasible numerical method by requiring extremely small time step sizes \(\Delta t\) and an exceedingly large number of particles N.

Proof of Theorem 6 The overall error can be decomposed as

\[
\begin{aligned}
E\bigg[\Big|\frac{1}{N}\sum_{i=1}^N X^i_{K\Delta t} - x^*\Big|^2 \,\Big|\, \Omega_M\bigg]
&\lesssim E\bigg[\Big|\frac{1}{N}\sum_{i=1}^N \big(X^i_{K\Delta t} - X^i_T\big)\Big|^2 \,\Big|\, \Omega_M\bigg] + E\bigg[\Big|\frac{1}{N}\sum_{i=1}^N \big(X^i_T - \overline{X}^i_T\big)\Big|^2 \,\Big|\, \Omega_M\bigg] \\
&\quad + E\bigg[\Big|\frac{1}{N}\sum_{i=1}^N \overline{X}^i_T - E[\overline{X}_T]\Big|^2\bigg] + \big|E[\overline{X}_T] - \tilde{x}\big|^2 + |\tilde{x} - x^*|^2,
\end{aligned}
\tag{4.17}
\]

where we used that \(P(\Omega_M) \ge (1 - \delta) \ge 1/2\). By means of a classical result about the convergence of numerical schemes for SDEs [43], the first term in (4.17) can be bounded by \(C_{\mathrm{NA}}(\Delta t)^{2m}\). For the second term, Theorem 5 gives the estimate \(C_{\mathrm{MFA}} N^{-1}\). The third term can be bounded by \(C_{\mathrm{LLN}} N^{-1}\) as a consequence of the law of large numbers. Eventually, Theorem 2 allows us to choose \(T = \mathcal{O}(\log(\varepsilon^{-1})/\chi)\) sufficiently large to reach any prescribed accuracy \(\varepsilon\) for the next-to-last term, as well as \(\varepsilon^{2\nu}/\eta^2\) for the last term by a suitable choice of \(\alpha\). With these individual bounds we obtain

\[
E\bigg[\Big|\frac{1}{N}\sum_{i=1}^N X^i_{K\Delta t} - x^*\Big|^2 \,\Big|\, \Omega_M\bigg] \le C_{\mathrm{NA}}(\Delta t)^{2m} + C_{\mathrm{MFA}} N^{-1} + C_{\mathrm{LLN}} N^{-1} + \varepsilon + \varepsilon^{2\nu}/\eta^2. \tag{4.18}
\]

It now remains to estimate the probability of the set \(K^N_{\varepsilon_{\mathrm{total}}} \subset \Omega\), where Inequality (4.16) does not hold.
By utilizing the conditional version of Markov's inequality together with the formerly established bound (4.18), we have

\[
P\big(K^N_{\varepsilon_{\mathrm{total}}}\big) = P\big(K^N_{\varepsilon_{\mathrm{total}}} \cap \Omega_M\big) + P\big(K^N_{\varepsilon_{\mathrm{total}}} \cap \Omega_M^c\big) \le P\big(K^N_{\varepsilon_{\mathrm{total}}} \,\big|\, \Omega_M\big)\, P(\Omega_M) + P\big(\Omega_M^c\big) \le \frac{C_{\mathrm{NA}}(\Delta t)^{2m} + C_{\mathrm{MFA}} N^{-1} + C_{\mathrm{LLN}} N^{-1} + \varepsilon + \varepsilon^{2\nu}/\eta^2}{\varepsilon_{\mathrm{total}}} + \delta \tag{4.19}
\]

for a sufficiently large choice of M in (4.1). A result in this spirit was first presented for CBO in [18, Theorem 14] and is hereby extended to PSO.

5 Implementation of PSO and Numerical Results

The purpose of this section is twofold. At first, an efficient implementation of PSO is provided, which is particularly suited for high-dimensional optimization problems arising in machine learning. Its performance is then evaluated on a standard benchmark problem, where we investigate the influence of the parameters, and on the training of a neural network classifier for handwritten digits. Furthermore, several relevant implementational aspects are discussed, including the computational complexity and scalability, modifications inspired by simulated annealing and evolutionary algorithms, and the numerical stability of the method.

5.1 An Efficient Implementation of PSO

Let us stress that PSO is an extremely versatile, flexible and customizable optimization method, which can be regarded as a black-box optimizer. As a zero-order method it does not rely on gradient information and can even be applied to discontinuous objectives, making it superior to first-order optimization methods in cases where derivatives are computationally infeasible. However, also in machine learning applications, where gradient-based optimizers are considered the state of the art, PSO may be of particular interest in view of vanishing or exploding gradient phenomena. Typical objective functions appearing in machine learning are of the form

\[
E(x) = \frac{1}{M} \sum_{j=1}^{M} E_j(x), \tag{5.1}
\]

where \(E_j\) is usually the loss of the jth training sample.
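To make the finite-sum structure (5.1) and the mini-batch idea used below concrete, here is a small illustrative sketch with synthetic data and a made-up per-sample loss \(E_j(x) = |x - \text{data}_j|^2\) (none of this is from the paper): a random batch gives an unbiased and much cheaper estimate of the full objective.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))        # M = 1000 synthetic "training samples"

def E_full(x):
    """E(x) = (1/M) * sum_j E_j(x) with per-sample loss E_j(x) = |x - data_j|^2."""
    return ((x - data) ** 2).sum(axis=1).mean()

def E_batch(x, batch):
    """Same objective restricted to a random batch of samples, as in (5.2)."""
    return ((x - data[batch]) ** 2).sum(axis=1).mean()

x = np.array([0.5, -0.2])
batch = rng.choice(len(data), size=60, replace=False)   # batch size n_E = 60
print(E_full(x), E_batch(x, batch))   # the batch value fluctuates around the full one
```

Evaluating `E_batch` touches only 60 of the 1000 samples, which is exactly the cost reduction Algorithm 1 exploits once per consensus-point computation.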
In order to run the scheme (1.2), frequent evaluations of \(E\) are necessary, which may be computationally intense or even prohibitive in some applications.

Computational complexity: Inspired by mini-batch gradient descent, the authors of [28] developed a random batch method for interacting particle systems, which was employed for CBO in [9]. In the same spirit, we present with Algorithm 1 a computationally efficient implementation of PSO.

Algorithm 1 Particle swarm optimization (PSO)
Input: Objective \(E\) as in (5.1), time horizon T or number of epochs #epochs, discrete time step size \(\Delta t\), batch sizes \(n_E\) and \(n_N\), parameters \(m, \gamma, \lambda_1, \lambda_2, \sigma_1, \sigma_2, \alpha, \beta, \theta\) and \(\kappa\), number of particles N, initialization \(f_0\)
Output: Approximation \(y_\alpha(\rho^N_{Y,T})\) of the global minimizer \(x^*\) of \(E\)
1: Generate the particles' initial positions and velocities \((X^i_0, V^i_0)_{i=1,\dots,N}\) according to a common initial law \(f_0\). Initialize the local best positions \(Y^i_0 = X^i_0\).
2: Ensure that \(n_E\) divides M and \(n_N\) divides N.
3: Convert T into #epochs or vice versa via \(T = \#\text{epochs}\, (M/n_E)(N/n_N)\, \Delta t\). Set \(k = 0\) and epoch = 1.
4: while epoch ≤ #epochs and stopping criterion not fulfilled
5:   Partition \(\{1,\dots,M\}\) into batches \(B^1_k, \dots, B^{M/n_E}_k\) of batch size \(n_E\).
6:   for \(b = 1, \dots, M/n_E\)
7:     Define the objective function on this batch as \(E_{\text{batch}}(x) = \frac{1}{n_E} \sum_{j \in B^b_k} E_j(x)\). (5.2)
8:     Partition the particles, i.e., the set \(\{1,\dots,N\}\), into disjoint sets \(P^1_k, \dots, P^{N/n_N}_k\) of size \(n_N\).
9:     for \(n = 1, \dots, N/n_N\)
10:      Compute the consensus point \(y_\alpha(\widehat{\rho}^{\,n_N}_{Y,k\Delta t})\) according to Equation (1.4) with objective \(E_{\text{batch}}\) from the particles in \(P^n_k\), i.e., with the empirical measure \(\widehat{\rho}^{\,n_N}_{Y,k\Delta t} = \frac{1}{n_N} \sum_{i \in P^n_k} \delta_{Y^i_{k\Delta t}}\).
11:      Update either all particles (full update) or only the particles in the current batch \(P^n_k\) (partial update) according to a discretized version of the PSO dynamics (1.2).
12:      if \(k > 0\) and \(|y_\alpha(\widehat{\rho}^{\,n_N}_{Y,k\Delta t}) - y_\alpha(\widehat{\rho}^{\,n_N}_{Y,(k-1)\Delta t})|\) is too small despite the stopping criterion not being fulfilled
13:        Perform an independent Brownian motion for the positions or velocities of all particles.
14:      end if
15:      Set \(k = k + 1\).
16:    end for
17:  end for
18:  Check the stopping criterion and break if fulfilled. If not, employ the optional strategies described at the end of Section 5.1, set epoch = epoch + 1 and continue.
19: end while
20: Compute the consensus point \(y_\alpha(\rho^N_{Y,T})\) according to Equation (1.4) with objective \(E\) from all particles, i.e., with \(\rho^N_{Y,T} = \frac{1}{N}\sum_{i=1}^N \delta_{Y^i_T}\).

The mini-batch idea is present on two different levels. In line 7, the objective is defined with respect to a batch of the training data of size \(n_E\), meaning that only a subsample of the data is considered. One epoch is completed after each data sample has been seen exactly once, i.e., after \(M/n_E\) steps. At each step the consensus point \(y_\alpha\) has to be computed, for which \(E_{\text{batch}}\) needs to be evaluated for N particles. This still constitutes the most significant computational effort. However, the mini-batch idea can be leveraged a second time. In the for loop in line 9 we partition the particles into sets of size \(n_N\) and perform the updates of line 11 only for the \(n_N\) particles in the respective subset. Since this is embarrassingly parallel, a parallel machine can be deployed to reduce the runtime by up to a factor of p (the number of available processors). While this is referred to as a partial update, alternatively, on a sequential architecture, a full update can be made at every iteration, requiring all N particles to be updated in line 11. Apart from lowering the required computing resources tremendously, these mini-batch ideas actually improve the stability of the method and the capability of finding good optima by introducing more stochasticity into the algorithm.
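A minimal Python sketch of the particle update invoked in line 11 is given below, assuming PSO without memory effects, isotropic diffusion \(D(z) = |z|\,\mathrm{Id}\), and the semi-implicit discretization (4.15); the helper names (`consensus`, `pso_step`) and the toy quadratic objective are ours, not from the paper.

```python
import numpy as np

def consensus(X, E_vals, alpha):
    """Consensus point (1.4): average of the particles weighted by exp(-alpha*E)."""
    w = np.exp(-alpha * (E_vals - E_vals.min()))   # shift by the minimum for stability
    return (w[:, None] * X).sum(axis=0) / w.sum()

def pso_step(X, V, E, rng, dt=0.01, m=0.2, gamma=0.8, lam=1.0, sigma=0.2, alpha=50.0):
    """One iteration of the semi-implicit scheme (4.15), PSO without memory,
    with isotropic diffusion D(z) = |z| Id."""
    ya = consensus(X, E(X), alpha)
    dist = np.linalg.norm(ya - X, axis=1, keepdims=True)
    B = rng.normal(size=X.shape)                   # standard Gaussian vectors B^i
    V = (m * V + dt * lam * (ya - X) + np.sqrt(dt) * sigma * dist * B) / (m + dt * gamma)
    X = X + dt * V                                 # (4.15a): uses the updated velocity
    return X, V

E = lambda X: (X ** 2).sum(axis=1)                 # toy smooth objective, minimizer at 0
rng = np.random.default_rng(0)
X = rng.normal(1.0, 1.0, size=(50, 2))             # N = 50 particles in d = 2
V = rng.normal(size=(50, 2))
for _ in range(3000):
    X, V = pso_step(X, V, E, rng)
# the swarm has collapsed onto a consensus point close to the minimizer
print(X.std(axis=0).max(), np.linalg.norm(X.mean(axis=0)))
```

The friction parameter follows the choice \(\gamma = 1 - m\) mentioned in Sect. 5.1; the remaining parameter values are illustrative only.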
Concerning the additional computational complexity due to the usage of memory effects, let us point out that, except for the required storage of the local (historical) best positions and their objective values, the update rule (1.3) in combination with the partial update allows to include such mechanisms at no additional cost by keeping track of the objective values of the local best positions. In this case, only one function call of each \(E_{\text{batch}}\) per epoch and per particle is necessary, which is optimal and coincides with PSO without memory effects or CBO. A different realization of (1.2b) might result in a higher cost.

Implementational aspects: A discretization of the SDE (1.2) in line 11 can be obtained for instance from a simple Euler–Maruyama or semi-implicit scheme [23, 43], see, e.g., [21, Equation (6.3)]. In our numerical experiments below, Equation (1.3) is used for updating the local best position, which corresponds to \(\kappa = 1/(2\Delta t)\), \(\theta = 0\), and \(\beta = \infty\). Furthermore, the friction parameter is set according to \(\gamma = 1 - m\), which is a typical choice in the literature. Let us also remark that a numerically stable computation of the consensus point in lines 10 and 20 for \(\alpha \gg 1\) can be obtained by replacing \(E_{\text{batch}}\) with \(E_{\text{batch}} - \underline{E}\), where \(\underline{E} := \min_{i \in P^n_k} E_{\text{batch}}(Y^i)\).

Cooling and evolutionary strategies: The PSO algorithm can be divided into two phases: an exploration phase, where the energy landscape is searched coarsely, and a determination phase, where the final output is identified. While the former benefits from small \(\alpha\) and large diffusion parameters, in the latter, \(\alpha \gg 1\) guarantees the selection of the best solution. A cooling strategy inspired by simulated annealing allows to start with moderate \(\alpha\) and relatively large diffusion parameters \(\sigma_1, \sigma_2\). After each epoch, \(\alpha\) is multiplied by 2, while the diffusion parameters follow the schedule \(\sigma_{\text{epoch}} = \sigma_{\text{initial}} / \log(\text{epoch} + 2)\) for \(\sigma \in \{\sigma_1, \sigma_2\}\). Such a strategy was proposed in [9, Section 4] for CBO.
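The stabilization by subtracting the batch minimum can be demonstrated directly (hypothetical particle data; our own sketch). Since the common factor \(e^{\alpha \underline{E}}\) cancels between numerator and denominator of (1.4), the shift leaves the consensus point unchanged while preventing the exponential weights from underflowing to zero for \(\alpha \gg 1\).

```python
import numpy as np

Y = np.array([[0.1, 0.2], [1.0, -1.0], [3.0, 2.5]])  # hypothetical local best positions
E_vals = (Y ** 2).sum(axis=1)                        # objective values of the particles
alpha = 1e5                                          # alpha >> 1, determination phase

naive = np.exp(-alpha * E_vals)                      # every entry underflows to 0.0 -> 0/0
stable = np.exp(-alpha * (E_vals - E_vals.min()))    # best particle gets weight 1

print(naive.sum())                                   # 0.0
y_alpha = (stable[:, None] * Y).sum(axis=0) / stable.sum()
print(y_alpha)                                       # ~ the best particle [0.1, 0.2]
```

For such large \(\alpha\) the stabilized consensus point simply reproduces the current best particle, which is the intended behavior in the determination phase.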
In order to further reduce the computational complexity, the provable decay of the variance suggests to decrease the number of agents by discarding particles in accordance with the empirical variance. A possible schedule for the number of agents proposed in [20, Section 2.2] is to set

\[
N_{\text{epoch}+1} = \bigg\lceil N_{\text{epoch}} \Big( (1 - \mu) + \mu\, \frac{\Sigma_{\text{epoch}}}{\widehat{\Sigma}_{\text{epoch}}} \Big) \bigg\rceil
\]

for \(\mu \in [0, 1]\), where \(\widehat{\Sigma}_{\text{epoch}}\) and \(\Sigma_{\text{epoch}}\) denote the empirical variances of the \(N_{\text{epoch}}\) particles at the beginning and at the end of the current epoch.

5.2 Numerical Experiments for the Rastrigin Function

Before turning to high-dimensional optimization problems, let us discuss the parameter choices of PSO in moderate dimensions (d = 20) at the example of the well-known Rastrigin benchmark function \(E(v) = \sum_{k=1}^d v_k^2 + \frac{5}{2}\big(1 - \cos(2\pi v_k)\big)\), which meets the requirements of Assumption 1 despite being highly non-convex with many spurious local optima. To narrow down the number of tunable parameters, we let \(\gamma = 1 - m\), choose \(\alpha = 100\), \(N = 100\), and update the local best position (if present) according to Equation (1.3), i.e., \(\kappa = 1/(2\Delta t)\), \(\theta = 0\), and \(\beta = \infty\).

Fig. 2 Phase transition diagrams comparing PSO without and with memory effects for different inertia parameters m and noise coefficients \(\sigma\) (PSO without memory) and \(\sigma_2\) (PSO with memory): (a) PSO without memory; (b) PSO with memory but no local best drift (\(\lambda_1 = 0\), \(\sigma_1 = 0\)); (c) PSO with memory and local best drift (\(\lambda_1 = 0.4\), \(\sigma_1 = 0.4\,\sigma_2\)). The empirical success probability is computed from 25 runs and depicted by color from zero (blue) to one (yellow).

Fig. 3 Architectures of the NNs used in the experiments of Sect. 5.3, cf. [19, Section 4]: (a) a dense shallow NN (input layer, flatten layer, dense layer, output layer); (b) a CNN with two convolutional layers (4 kernels of size 5 × 5 and 3 kernels of size 5 × 5, both with zero padding), each followed by a max pooling layer (kernel size 2 × 2, stride 2), and a flatten and dense layer.

We moreover let \(\lambda_2 = 1\) (or \(\lambda = 1\) for PSO without memory) and \(\Delta t = 0.01\), which are such that the algorithm either finds consensus or explodes within the time horizon T = 100 in all instances. For simplicity we assume that \(\sigma_1 = \lambda_1 \sigma_2\). The algorithm is initialized with positions distributed according to \(\mathcal{N}\big((2,\dots,2), 4\,\mathrm{Id}\big)\) and velocities according to \(\mathcal{N}\big((0,\dots,0), \mathrm{Id}\big)\). In Fig. 2 we depict the phase diagram comparing the success probability of PSO for different choices of the inertia parameter m and the diffusion parameter \(\sigma\) or \(\sigma_2\), respectively. We observe that for fixed m there is a noise threshold above which the dynamics explodes. While smaller m permit a larger flexibility in the used noise, they require an individual minimal noise level. Further numerical experiments suggest, however, that increasing the number of particles N allows for a lower minimal noise level. There are subtle differences between PSO without and with memory, but they are not decisive in applications, as also confirmed by the numerical experiments in Sect. 5.3, [22, Section 5.3], as well as the survey paper [21, Section 6.3].
5.3 A Machine Learning Application

We now showcase the practicability of PSO as implemented in Algorithm 1 at the example of a very competitive high-dimensional benchmark problem in machine learning, the classification of handwritten digits. In what follows we train a shallow and a convolutional NN (CNN) classifier for the MNIST dataset [34]. Let us point out that it is not our objective to challenge the state of the art by employing the most sophisticated model (deep CNNs achieve near-human performance of more than 99.5% accuracy). Instead, we want to demonstrate that PSO reaches results comparable to SGD with backpropagation, while at the same time relying exclusively on the evaluation of \(E\). In our experiment we use NNs with architectures as depicted in Fig. 3. The input is a black-and-white image represented by a (28 × 28)-dimensional matrix with entries between 0 and 1. For the shallow NN (see Fig. 3a), the flattened image is passed through a dense layer \(\mathrm{ReLU}(W \cdot + b)\) with trainable weights \(W \in \mathbb{R}^{10 \times 784}\) and bias \(b \in \mathbb{R}^{10}\). Our CNN (see Fig. 3b) is similar to LeNet-1, cf. [33, Section III.C.7]. Each dense or convolution layer has a ReLU activation and is followed by a batch normalization layer to speed up the training process. Eventually, the final layers of both NNs apply a softmax activation function, allowing to interpret the 10-dimensional output vector as a probability distribution over the digits.

Fig. 4 Comparison of the performances of a shallow (dashed lines) and a convolutional (solid lines) NN with architectures as described in Fig. 3, when trained with PSO as in Algorithm 1. Depicted are the accuracies on a test dataset (orange lines) and the values of the objective function \(E\) (blue lines) evaluated on a random sample of the training set of size 10000. (a) PSO for three different memory settings: without memory (lightest lines); with memory but without local best drift, i.e., \(\lambda_1 = 0\), \(\sigma_1 = 0\) (lines with intermediate opacity); with memory and with local best drift \(\lambda_1 = 0.4\), \(\sigma_1 = \lambda_1 \sigma_2\) (darkest lines). (b) PSO with memory but without local best drift for three different inertia parameters \(m \in \{0.1, 0.2, 0.4\}\) (increasing opacity for larger m). Note that, for reference, the lines with intermediate opacity in (b) coincide with the ones of (a).

We denote by \(\theta\) the trainable parameters of the NNs, of which there are 7850 for the shallow NN and 2112 for the CNN. They are learned by minimizing

\[
E(\theta) = \frac{1}{M} \sum_{j=1}^{M} \ell\big(f_\theta(x^j), y^j\big),
\]

where \(f_\theta\) denotes the forward pass of the NN, \((x^j, y^j)\) the jth image-label tuple, and \(\ell\) the categorical crossentropy loss \(\ell(\hat{y}, y) = -\sum_{k=0}^{9} y_k \log(\hat{y}_k)\). The performance is measured by counting the number of successful predictions on a test set. We use a train-test split of 60000 training and 10000 test images. For our experiments we choose \(\lambda_2 = 1\), \((\sigma_2)_{\text{initial}} = 0.4\), \(\alpha_{\text{initial}} = 50\), \(\Delta t = 0.1\), and update the local best position according to Equation (1.3). We use N = 100 agents, which are initialized according to \(\mathcal{N}\big((0,\dots,0), \mathrm{Id}\big)\) in position and velocity. The mini-batch sizes are \(n_E = 60\) and \(n_N = 100\) (consequently, a full update is performed in line 11), and a cooling strategy is used in line 18. Figure 4a reports the performances for different memory settings and fixed m = 0.2, whereas Fig. 4b depicts the results for different inertia parameters m in the case of PSO with memory but no memory drift. For the shallow NN, we obtain a test accuracy of above 89%, whereas the CNN achieves almost 97%. To put those numbers into perspective, when trained with SGD, a similar performance for the shallow NN, see [9, Figure 7], and a benchmark accuracy of 98.3% for a comparable CNN, cf. [33, Figure 9], are reached. As can be seen from Fig. 4a, the usage of the local best positions when computing the consensus point significantly improves the performance.
The advantage of having an additional drift towards the local best position is less pronounced. Regarding the inertia parameter m in Fig. 4b, our numerical results suggest that larger m yield faster convergence.

6 Conclusions

In this paper we prove the convergence of PSO without and with memory effects to a global minimizer of a possibly nonconvex and nonsmooth objective function in the mean-field sense. Our analysis holds under a suitable well-preparation condition about the initialization and comprises a rich class of objectives which in particular includes functions with multiple global minimizers. For PSO without memory effects we furthermore quantify how well the mean-field dynamics approximates the interacting finite particle dynamics, which is implemented for numerical experiments. Since in particular the latter result does not suffer from the curse of dimensionality, we thereby prove that the numerical PSO method has polynomial complexity. With this we contribute to the completion of a mathematically rigorous understanding of PSO. Furthermore, we propose a computationally efficient and parallelizable implementation and showcase its practicability by training a CNN reaching a performance comparable to stochastic gradient descent.

It remains an open problem to extend the mean-field approximation result to the variant of PSO with memory effects or, alternatively, to devise an implementation of such effects compatible with the used proof technique. Moreover, we also leave a more thorough understanding of the influence of the parameters as well as the influence of memory effects to future, more experimental research.

Finally, we believe that the analysis framework of this and prior works on CBO [8, 18, 42] motivates the investigation of other renowned metaheuristic algorithms through the lens of a mean-field limit.

Funding Open Access funding enabled and organized by Projekt DEAL. H.H.
is partially supported by the Pacific Institute for the Mathematical Sciences (PIMS) postdoctoral fellowship. J.Q. is partially supported by the National Science and Engineering Research Council of Canada (NSERC) and by the start-up funds from the University of Calgary. K.R. acknowledges the financial support from the Technical University of Munich – Institute for Ethics in Artificial Intelligence (IEAI). The authors gratefully acknowledge the compute and data resources provided by the Leibniz Supercomputing Centre (LRZ). Declarations Competing interest The authors declare that they have no conflict of interest. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. References 1. Aarts, E., Korst, J.: Simulated annealing and Boltzmann machines. A stochastic approach to combi- natorial optimization and neural computing. Wiley-Interscience Series in Discrete Mathematics and Optimization. Wiley , Chichester (1989) 2. Arnold, L.: Stochastic differential equations: Theory and applications. Wiley-Interscience [John Wiley & Sons], New York-London-Sydney (1974). Translated from the German 3. Azencott, R. (ed.): Simulated annealing: Parallelization techniques. Wiley (1992) 4. 
van den Bergh, F.: An analysis of particle swarm optimizers. Ph.D. thesis, University of Pretoria (2007) 5. van den Bergh, F., Engelbrecht, A.P.: A convergence proof for the particle swarm optimiser. Fund. Inform. 105(4), 341–374 (2010) 6. Bolley, F., Cañizo, J.A., Carrillo, J.A.: Stochastic mean-field limit: non-Lipschitz forces and swarming. Math. Models Methods Appl. Sci. 21(11), 2179–2210 (2011) 7. Bruned, V., Mas, A., Wlodarczyk, S.: Weak convergence of particle swarm optimization. arXiv preprint arXiv:1811.04924 (2018) 8. Carrillo, J.A., Choi, Y.P., Totzeck, C., Tse, O.: An analytical framework for consensus-based global optimization method. Math. Models Methods Appl. Sci. 28(6), 1037–1066 (2018) 9. Carrillo, J.A., Jin, S., Li, L., Zhu, Y.: A consensus-based global optimization method for high dimensional machine learning problems. ESAIM Control Optim. Calc. Var. 27(suppl.), Paper No. S5, 1–22 (2021) 10. Cipriani, C., Huang, H., Qiu, J.: Zero-inertia limit: from particle swarm optimization to consensus-based optimization. SIAM J. Math. Anal. 54(3), 3091–3121 (2022) 11. Clerc, M., Kennedy, J.: The particle swarm—explosion, stability, and convergence in a multidimensional complex space. IEEE Trans. Evol. Comput. 6(1), 58–73 (2002). https://doi.org/10.1109/4235. 12. Dembo, A., Zeitouni, O.: Large deviations techniques and applications, Applications of Mathematics (New York), vol. 38, 2nd edn. Springer, New York (1998) 13. Ding, Z., Chen, S., Li, Q., Wright, S.: On the global convergence of gradient descent for multi-layer resnets in the mean-field regime. arXiv preprint arXiv:2110.02926 (2021) 14. Dréo, J., Pétrowski, A., Siarry, P., Taillard, E.: Metaheuristics for hard optimization. Springer, Berlin (2006). Methods and case studies, Translated from the 2003 French original by Amitava Chatterjee 15. Fogel, D.B.: Evolutionary Computation.
Toward a New Philosophy of Machine Intelligence, 2nd edn. IEEE Press, Piscataway (2000) 16. Fornasier, M., Huang, H., Pareschi, L., Sünnen, P.: Consensus-based optimization on hypersurfaces: well-posedness and mean-field limit. Math. Models Methods Appl. Sci. 30(14), 2725–2751 (2020) 17. Fornasier, M., Huang, H., Pareschi, L., Sünnen, P.: Anisotropic diffusion in consensus-based optimiza- tion on the sphere. SIAM J. Optim. 32(3), 1984–2012 (2022) 18. Fornasier, M., Klock, T., Riedl, K.: Consensus-based optimization methods converge globally. arXiv preprint arXiv:2103.15130 (2021) 19. Fornasier, M., Klock, T., Riedl, K.: Convergence of anisotropic consensus-based optimization in mean- field law. In: J.L. Jiménez Laredo, J.I. Hidalgo, K.O. Babaagba (eds.) Applications of Evolutionary Computation, pp. 738–754. Springer, Cham (2022) 20. Fornasier, M., Pareschi, L., Huang, H., Sünnen, P.: Consensus-based optimization on the sphere: convergence to global minimizers and machine learning. J. Mach. Learn. Res. 22(237), 1–55 (2021) 21. Grassi, S., Huang, H., Pareschi, L., Qiu, J.: Mean-field particle swarm optimization. arXiv preprint arXiv:2108.00393 (2021) 22. Grassi, S., Pareschi, L.: From particle swarm optimization to consensus based optimization: stochastic modeling and mean-field limit. Math. Models Methods Appl. Sci. 31(8), 1625–1657 (2021) 23. Higham, D.J.: An algorithmic introduction to numerical simulation of stochastic differential equations. SIAM Rev. 43(3), 525–546 (2001) 24. Holland, J.H.: Adaptation in natural and artificial systems. An introductory analysis with applications to biology, control, and artificial intelligence. University of Michigan Press, Ann Arbor, Mich. (1975) 25. Huang, H., Liu, J.G., Pickl, P.: On the mean-field limit for the Vlasov-Poisson-Fokker-Planck system. J. Stat. Phys. 181(5), 1915–1965 (2020) 26. Huang, H., Qiu, J.: On the mean-field limit for the consensus-based optimization. Mathematical Meth- ods in the Applied Sciences, pp. 
1–18 (2022) 27. Jabin, P.E., Wang, Z.: Mean field limit for stochastic particle systems. In: Active particles. Vol. 1. Advances in theory, models, and applications, Model. Simul. Sci. Eng. Technol., pp. 379–402. Birkhäuser/Springer, Cham (2017) 28. Jin, S., Li, L., Liu, J.G.: Random batch methods (RBM) for interacting particle systems. J. Comput. Phys. 400(108877), 1–30 (2020) 29. Kadanoff, L.P.: More is the same; phase transitions and mean field theories. J. Stat. Phys. 137(5–6), 777–797 (2009) 30. Kennedy, J.: The particle swarm: social adaptation of knowledge. In: Proceedings of 1997 IEEE International Conference on Evolutionary Computation, pp. 303–308. IEEE (1997). 10.1109/ICEC.1997.592326 31. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of ICNN’95 - International Conference on Neural Networks, vol. 4, pp. 1942–1948. IEEE (1995). 10.1109/ICNN.1995.488968 32. Kushner, H., Yin, G.G.: Stochastic approximation and recursive algorithms and applications, vol. 35. Springer Science & Business Media (2003) 33. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998) 34. LeCun, Y., Cortes, C., Burges, C.: MNIST handwritten digit database (2010). http://yann.lecun.com/ exdb/mnist/ 35. Lin, S.W., Ying, K.C., Chen, S.C., Lee, Z.J.: Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Syst. Appl. 35(4), 1817–1824 (2008) 36. Mei, S., Montanari, A., Nguyen, P.M.: A mean field view of the landscape of two-layer neural networks. Proc. Natl. Acad. Sci. U.S.A. 115(33), E7665–E7671 (2018) 37. Miclo, L.: Recuit simulé sur R . étude de l’évolution de l’énergie libre. In: Annales de l’IHP Probabilités et statistiques, vol. 28, pp. 235–266 (1992) 38. Miller, P.D.: Applied Asymptotic Analysis, Graduate Studies in Mathematics, vol. 75. American Math- ematical Society, Providence, RI (2006) 39. 
Øksendal, B.: Stochastic Differential Equations: An Introduction with Applications, 6th edn. Springer, Berlin (2003) 40. Özcan, E., Mohan, C.K.: Analysis of a simple particle swarm optimization system (1998) 123 30 Page 44 of 44 Applied Mathematics & Optimization (2023) 88:30 41. Panigrahi, B.K., Shi, Y., Lim, M.H.: Handbook of swarm intelligence: concepts, principles and appli- cations, vol. 8. Springer Science & Business Media (2011) 42. Pinnau, R., Totzeck, C., Tse, O., Martin, S.: A consensus-based model for global optimization and its mean-field limit. Math. Models Methods Appl. Sci. 27(1), 183–204 (2017) 43. Platen, E.: An introduction to numerical methods for stochastic differential equations. In: Acta numer- ica, Acta Numer., vol. 8, pp. 197–246. Cambridge University Press, Cambridge (1999) 44. Poli, R.: Mean and variance of the sampling distribution of particle swarm optimizers during stagnation. IEEE Trans. Evol. Comput. 13(4), 712–721 (2009) 45. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization. Swarm Intell 1(1), 33–57 (2007). https://doi.org/10.1007/s11721-007-0002-0 46. Rastrigin, L.A.: The convergence of the random search method in the external control of many- parameter system. Autom. Remote Control 24, 1337–1342 (1963) 47. Revuz, D., Yor, M.: Continuous martingales and Brownian motion, Grundlehren der mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], vol. 293, third edn. Springer- Verlag, Berlin (1999) 48. Riedl, K.: Leveraging memory effects and gradient information in consensus-based optimization: On global convergence in mean-field law. arXiv:2211.12184 (2022) 49. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951) 50. Royer, G.: An initiation to logarithmic Sobolev inequalities. 5. American Mathematical Soc. (2007) 51. Schmitt, M., Wanka, R.: Particle swarm optimization almost surely finds local optima. Theor. Comput. Sci. 561(Part A), 57–72 (2015) 52. 
Sznitman, A.S.: Topics in propagation of chaos. In: École d’Été de Probabilités de Saint-Flour XIX – 1989, Lecture Notes in Math., vol. 1464, pp. 165–251. Springer, Berlin (1991) 53. Tang, W., Zhou, X.Y.: Tail probability estimates of continuous-time simulated annealing processes. Numerical Algebra, Control and Optimization (2022) 54. Totzeck, C., Wolfram, M.T.: Consensus-based global optimization with personal best. Math. Biosci. Eng. 17(5), 6026–6044 (2020) 55. Witt, C.: Theory of particle swarm optimization. In: Theory of randomized search heuristics, Ser. Theor. Comput. Sci., vol. 1, pp. 197–223. World Sci. Publ., Hackensack, NJ (2011) 56. Yuan, Q., Yin, G.: Analyzing convergence and rates of convergence of particle swarm optimization algorithms using stochastic approximation methods. IEEE Trans. Autom. Control 60(7), 1760–1773 (2014) 57. Zhang, Y., Wang, S., Ji, G.: A comprehensive survey on particle swarm optimization algorithm and its applications. Math. Probl. Eng. 931256, 1–38 (2015) Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Applied Mathematics & Optimization – Springer Journals
Published: Oct 1, 2023
Keywords: Global derivative-free optimization; High-dimensional nonconvex optimization; Metaheuristics; Particle swarm optimization; Mean-field limit; Vlasov-Fokker-Planck equations; 65C35; 65K10; 90C26; 90C56; 35Q90; 35Q83