Optimal Rates for the Regularized Learning Algorithms under General Source Condition

ORIGINAL RESEARCH, published: 27 March 2017, doi: 10.3389/fams.2017.00003

Abhishake Rastogi* and Sivananthan Sampath
Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India

Edited by: Yiming Ying, University at Albany, SUNY, USA
Reviewed by: Xin Guo, Hong Kong Polytechnic University, Hong Kong; Ernesto De Vito, University of Genoa, Italy
*Correspondence: Abhishake Rastogi, abhishekrastogi2012@gmail.com
Specialty section: This article was submitted to Mathematics of Computation and Data Science, a section of the journal Frontiers in Applied Mathematics and Statistics.
Received: 02 November 2016; Accepted: 09 March 2017; Published: 27 March 2017
Citation: Rastogi A and Sampath S (2017) Optimal Rates for the Regularized Learning Algorithms under General Source Condition. Front. Appl. Math. Stat. 3:3. doi: 10.3389/fams.2017.00003

We consider learning algorithms under a general source condition with polynomial decay of the eigenvalues of the integral operator, in the vector-valued function setting. We discuss the upper convergence rates of the Tikhonov regularizer under a general source condition corresponding to a monotone increasing index function. The convergence issues are studied for general regularization schemes using the concept of operator monotone index functions, in the minimax setting. Further, we also address the minimum possible error for any learning algorithm.

Keywords: learning theory, general source condition, vector-valued RKHS, error estimate, optimal rates
Mathematics Subject Classification 2010: 68T05, 68Q32

1. INTRODUCTION

Learning theory [1-3] aims to learn the relation between inputs and outputs based on finitely many random samples. We require some underlying space in which to search for the relation function; from experience we have some idea about this underlying space, which is called the hypothesis space. Learning algorithms try to infer the best estimator over the hypothesis space such that f(x) gives the maximum information about the output variable y for any unseen input x. The given samples {(x_i, y_i)}_{i=1}^m are not exact, in the sense that for the underlying relation function f(x_i) ≠ y_i but f(x_i) ≈ y_i. We assume that the uncertainty follows the probability distribution ρ on the sample space X × Y, and the underlying function (called the regression function) for the probability distribution ρ is given by

f_ρ(x) = ∫_Y y dρ(y|x),   x ∈ X,

where ρ(y|x) is the conditional probability measure for given x. The problem of obtaining an estimator from examples is ill-posed. Therefore, we apply regularization schemes [4-7] to stabilize the problem. Various regularization schemes have been studied for inverse problems. In the context of learning theory [2, 3, 8-10], the square-loss regularization (Tikhonov regularization) is widely considered to obtain the regularized estimator [9, 11-16]. Gerfo et al. [6] introduced general regularization in learning theory and provided error bounds under Hölder's source condition [5]. Bauer et al. [4] discussed the convergence issues for general regularization under a general source condition [17] by removing the Lipschitz condition on the regularization considered in Gerfo et al. [6]. Caponnetto and De Vito [12] discussed the square-loss regularization under the polynomial decay of the eigenvalues of the integral operator L_K with Hölder's source condition.
For the inverse statistical learning problem, Blanchard and Mücke [18] analyzed the convergence rates for general regularization schemes under Hölder's source condition in the scalar-valued function setting. Here we discuss the convergence issues of general regularization schemes under a general source condition and the polynomial decay of the eigenvalues of the integral operator, in the vector-valued framework. We present the minimax upper convergence rates for Tikhonov regularization under the general source condition Ω_{φ,R}, for a monotone increasing index function φ. For general regularization, the minimax rates are obtained using operator monotone index functions φ. The concept of effective dimension [19, 20] is exploited to achieve the convergence rates; the effective dimension plays an important role in the choice of the regularization parameter. We also discuss the lower convergence rates for any learning algorithm under the smoothness conditions. We present the results in the vector-valued function setting; therefore, in particular, they can be applied to multi-task learning problems.

The structure of the paper is as follows. In the second section, we introduce some basic assumptions and notations for supervised learning problems. In Section 3, we present the upper and lower convergence rates under the smoothness conditions in the minimax setting.

2. LEARNING FROM EXAMPLES: NOTATIONS AND ASSUMPTIONS

In the learning theory framework [2, 3, 8-10], the sample space Z = X × Y consists of two spaces: the input space X (a locally compact second countable Hausdorff space) and the output space (Y, ⟨·,·⟩_Y) (a real separable Hilbert space). The input space X and the output space Y are related by some unknown probability distribution ρ on Z. The probability measure can be split as ρ(x, y) = ρ(y|x) ρ_X(x), where ρ(y|x) is the conditional probability measure of y given x and ρ_X is the marginal probability measure on X. The only available information is the random i.i.d. sample z = ((x_1, y_1), ..., (x_m, y_m)) drawn according to the probability measure ρ. Given the training set z, learning theory aims to develop an algorithm which provides an estimator f_z : X → Y such that f_z(x) predicts the output variable y for any given input x. The goodness of the estimator can be measured by the generalization error of a function f, defined as

E(f) := E_ρ(f) = ∫_Z V(f(x), y) dρ(x, y),   (1)

where V : Y × Y → R is the loss function. The minimizer of E(f) for the square loss function V(f(x), y) = ||f(x) − y||_Y² is given by

f_ρ(x) := ∫_Y y dρ(y|x),   (2)

where f_ρ is called the regression function. The regression function f_ρ belongs to the space of square integrable functions provided that

∫_Z ||y||_Y² dρ(x, y) < ∞.   (3)

We search for the minimizer of the generalization error over a hypothesis space H,

f_H := argmin_{f ∈ H} { ∫_Z ||f(x) − y||_Y² dρ(x, y) },   (4)

where f_H is called the target function. In case f_ρ ∈ H, f_H becomes the regression function f_ρ.

Because of the inaccessibility of the probability distribution ρ, we minimize the regularized empirical estimate of the generalization error over the hypothesis space H,

f_{z,λ} := argmin_{f ∈ H} { (1/m) Σ_{i=1}^m ||f(x_i) − y_i||_Y² + λ ||f||_H² },   (5)

where λ is the positive regularization parameter. Regularization schemes [4-7, 10] are used to incorporate various features in the solution, such as boundedness, monotonicity, and smoothness. In order to optimize the vector-valued regularization functional, one of the main problems is to choose the appropriate hypothesis space, which is assumed to be a source that provides the estimator.
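For illustration, a minimal numerical sketch of the regularized estimator (5) in the scalar-valued case Y = R: by the representer theorem the minimizer reduces to a finite linear system in the kernel matrix. The Gaussian kernel, data, and parameter values below are illustrative assumptions, not choices made in this paper.

```python
# Minimal sketch of the Tikhonov-regularized estimator (5) for Y = R.
import numpy as np

def gaussian_kernel(X1, X2, width=0.2):
    """Gram matrix k(x, t) = exp(-(x - t)^2 / (2 * width^2))."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-d2 / (2.0 * width ** 2))

def tikhonov_estimator(x, y, lam, width=0.2):
    """Return f_{z,lambda} as a callable.  Writing f = sum_i c_i K_{x_i},
    the minimizer of (1/m) sum_i (f(x_i) - y_i)^2 + lam ||f||_H^2 solves
    (K/m + lam I) c = y/m."""
    m = len(x)
    K = gaussian_kernel(x, x, width)
    c = np.linalg.solve(K / m + lam * np.eye(m), y / m)
    return lambda t: gaussian_kernel(np.atleast_1d(t), x, width) @ c

rng = np.random.default_rng(0)
m = 200
x = rng.uniform(0.0, 1.0, m)                               # inputs from rho_X
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(m)   # noisy outputs
f_z = tikhonov_estimator(x, y, lam=1e-3)
print(np.round(f_z(np.linspace(0.0, 1.0, 5)), 3))          # estimator values
```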
2.1. Reproducing Kernel Hilbert Space as a Hypothesis Space

Definition 2.1. (Vector-valued reproducing kernel Hilbert space) For a non-empty set X and a real Hilbert space (Y, ⟨·,·⟩_Y), the Hilbert space (H, ⟨·,·⟩_H) of functions from X to Y is called a reproducing kernel Hilbert space if for any x ∈ X and y ∈ Y the linear functional which maps f ∈ H to ⟨y, f(x)⟩_Y is continuous.

By the Riesz lemma [21], for every x ∈ X and y ∈ Y there exists a linear operator K_x : Y → H such that

⟨y, f(x)⟩_Y = ⟨K_x y, f⟩_H,   ∀ f ∈ H.

Therefore, the adjoint operator K_x* : H → Y is given by K_x* f = f(x). Through the linear operator K_x : Y → H we define the linear operator K(x, t) : Y → Y,

K(x, t) y := (K_t y)(x).

From Proposition 2.1 of [22], the linear operator K(x, t) ∈ L(Y) (the set of bounded linear operators on Y), K(x, t)* = K(t, x), and K(x, x) is a non-negative bounded linear operator. For any m ∈ N, {x_i : 1 ≤ i ≤ m} ⊂ X and {y_i : 1 ≤ i ≤ m} ⊂ Y, we have Σ_{i,j=1}^m ⟨y_i, K(x_i, x_j) y_j⟩_Y ≥ 0. The operator-valued function K : X × X → L(Y) is called the kernel.

There is a one-to-one correspondence between kernels and reproducing kernel Hilbert spaces [22, 23]. So the reproducing kernel Hilbert space corresponding to a kernel K can be denoted as H_K, and the norm in this space as ||·||_{H_K}. In the following we suppress K, simply writing H for the reproducing kernel Hilbert space and ||·||_H for its norm. Throughout the paper we assume that the reproducing kernel Hilbert space H is separable and such that

(i) K_x : Y → H is a Hilbert-Schmidt operator for all x ∈ X, and κ² := sup_{x∈X} Tr(K_x* K_x) < ∞;
(ii) the real-valued function on X × X defined by (x, t) ↦ ⟨K_t v, K_x w⟩_H is measurable for all v, w ∈ Y.

By the representation theorem [22], the solution of the penalized regularization problem (5) is of the form

f_{z,λ} = Σ_{i=1}^m K_{x_i} c_i,   for (c_1, ..., c_m) ∈ Y^m.

Definition 2.2. Let H be a separable Hilbert space and {e_k}_{k=1}^∞ an orthonormal basis of H. Then for any positive operator A ∈ L(H) we define Tr(A) = Σ_{k=1}^∞ ⟨A e_k, e_k⟩. It is well known that the number Tr(A) is independent of the choice of the orthonormal basis.

Definition 2.3. An operator A ∈ L(H) is called a Hilbert-Schmidt operator if Tr(A*A) < ∞. The family of all Hilbert-Schmidt operators is denoted by L_2(H). For A ∈ L_2(H), we define Tr(A) = Σ_{k=1}^∞ ⟨A e_k, e_k⟩ for an orthonormal basis {e_k}_{k=1}^∞ of H. It is well known that L_2(H) is a separable Hilbert space with the inner product ⟨A, B⟩_{L_2(H)} = Tr(B*A), and its norm satisfies

||A||_{L(H)} ≤ ||A||_{L_2(H)} ≤ Tr(|A|),

where |A| = √(A*A) and ||·||_{L(H)} is the operator norm (for more details see [24]).

For the positive trace class operator K_x K_x*, we have

||K_x K_x*||_{L(H)} ≤ ||K_x K_x*||_{L_2(H)} ≤ Tr(K_x K_x*) ≤ κ².

Given the ordered set x = (x_1, ..., x_m) ∈ X^m, the sampling operator S_x : H → Y^m is defined by S_x(f) = (f(x_1), ..., f(x_m)), and its adjoint S_x* : Y^m → H is given by S_x* y = (1/m) Σ_{i=1}^m K_{x_i} y_i for all y = (y_1, ..., y_m) ∈ Y^m. The regularization scheme (5) can be expressed as

f_{z,λ} = argmin_{f ∈ H} { ||S_x f − y||_m² + λ ||f||_H² },   (6)

where ||y||_m² = (1/m) Σ_{i=1}^m ||y_i||_Y². We obtain the explicit expression for f_{z,λ} by taking the functional derivative of the above expression over the RKHS H.

Theorem 2.1. For a positive choice of λ, the functional (6) has the unique minimizer

f_{z,λ} = (S_x* S_x + λ I)^{-1} S_x* y.   (7)
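A sketch of the closed form (7) in the genuinely vector-valued (multi-task) setting, assuming a separable operator-valued kernel K(x, t) = k(x, t) B with a fixed positive semi-definite matrix B, a common multi-task choice that is not prescribed by the paper. In its equivalent kernel-matrix form, the coefficients of f_{z,λ} = Σ_i K_{x_i} c_i solve (G + mλI) c = y with G the block Gram matrix.

```python
# Sketch of (7) for a separable operator-valued kernel K(x, t) = k(x, t) B.
import numpy as np

def scalar_kernel(X1, X2, width=0.3):
    return np.exp(-((X1[:, None] - X2[None, :]) ** 2) / (2 * width ** 2))

def vector_valued_krr(x, Y, B, lam, width=0.3):
    """Coefficients c_i in f_{z,lambda} = sum_i K_{x_i} c_i.  The block Gram
    matrix is kron(k(x, x), B), and (G + m*lam*I) c = y is the kernel-matrix
    form of f_{z,lambda} = (S_x^* S_x + lam I)^{-1} S_x^* y."""
    m, d = Y.shape
    G = np.kron(scalar_kernel(x, x, width), B)          # (m*d, m*d) block Gram
    c = np.linalg.solve(G + m * lam * np.eye(m * d), Y.reshape(-1))
    return c.reshape(m, d)

def predict(t, x, c, B, width=0.3):
    """f_{z,lambda}(t) = sum_i k(t, x_i) B c_i, a vector in Y = R^d."""
    k = scalar_kernel(np.atleast_1d(t), x, width)       # (n, m)
    return k @ (c @ B.T)                                # (n, d)

rng = np.random.default_rng(1)
m, d = 100, 2
x = rng.uniform(0, 1, m)
Y = np.stack([np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)], axis=1)
Y += 0.05 * rng.standard_normal((m, d))
B = np.array([[1.0, 0.4], [0.4, 1.0]])                  # couples the two tasks
c = vector_valued_krr(x, Y, B, lam=1e-3)
print(np.round(predict(np.array([0.0, 0.25, 0.5]), x, c, B), 3))
```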
Define f_λ as the minimizer of the optimization functional

f_λ := argmin_{f ∈ H} { ∫_Z ||f(x) − y||_Y² dρ(x, y) + λ ||f||_H² }.   (8)

Using the fact that E(f) = ||L_K^{1/2}(f − f_H)||_H² + E(f_H), we get the expression for f_λ,

f_λ = (L_K + λ I)^{-1} L_K f_H,   (9)

where the integral operator L_K : L²_{ρ_X} → L²_{ρ_X} is a self-adjoint, non-negative, compact operator defined as

L_K(f)(x) := ∫_X K(x, t) f(t) dρ_X(t),   x ∈ X.

The integral operator L_K can also be defined as a self-adjoint operator on H. We use the same notation L_K for both operators, defined on different domains. It is well known that L_K^{1/2} is an isometry from the space of square integrable functions to the reproducing kernel Hilbert space.

In order to achieve uniform convergence rates for learning algorithms, we need some prior assumptions on the probability measure ρ. Following the notion of Bauer et al. [4] and Caponnetto and De Vito [12], we consider the class of probability measures P_φ satisfying the assumptions:

(i) For the probability measure ρ on X × Y,

∫_Z ||y||_Y² dρ(x, y) < ∞.   (10)

(ii) The minimizer f_H of the generalization error (4) over the hypothesis space H exists.

(iii) There exist constants M, Σ such that for almost all x ∈ X,

∫_Y ( e^{||y − f_H(x)||_Y / M} − ||y − f_H(x)||_Y / M − 1 ) dρ(y|x) ≤ Σ² / (2M²).   (11)

(iv) The target function f_H belongs to the class Ω_{φ,R} with

Ω_{φ,R} := { f ∈ H : f = φ(L_K) g and ||g||_H ≤ R },   (12)

where φ is a continuous increasing index function defined on the interval [0, κ²] with φ(0) = 0. This condition is usually referred to as the general source condition [17].

In addition, we consider the set of probability measures P_{φ,b} which satisfies the conditions (i)-(iv) and whose eigenvalues t_n of the integral operator L_K follow the polynomial decay: for fixed positive constants α, β and b > 1,

α n^{-b} ≤ t_n ≤ β n^{-b}   ∀ n ∈ N.   (13)

Under the polynomial decay of the eigenvalues, the effective dimension N(λ), which measures the complexity of the RKHS, can be estimated from Proposition 3 of [12] as

N(λ) := Tr( (L_K + λ I)^{-1} L_K ) ≤ (βb / (b − 1)) λ^{-1/b},   for b > 1,   (14)

and without the polynomial decay condition (13) we have

N(λ) ≤ ||(L_K + λ I)^{-1}||_{L(H)} Tr(L_K) ≤ κ² / λ.

Turning to the convergence analysis, we discuss the convergence issues for the learning algorithms (z → f_z ∈ H) in the probabilistic sense, through exponential tail inequalities of the form

Prob_z { ||f_z − f_ρ||_ρ ≤ ε(m) log(2/η) } ≥ 1 − η

for all 0 < η ≤ 1, where ε(m) is a positive decreasing function of m. Using these probabilistic estimates we can obtain error estimates in expectation by integration of the tail inequalities. The basic concentration tool used below (Proposition 3.1, based on the results of Pinelis and Sakhanenko [26]) states that for a random variable ξ with values in a real separable Hilbert space H satisfying E||ξ − E(ξ)||_H^n ≤ (1/2) n! S² Q^{n−2} for all n ≥ 2 (in particular, whenever ||ξ(ω)||_H ≤ Q and E(||ξ(ω)||_H²) ≤ S²), one has, for any 0 < η < 1 and all m ∈ N,

Prob { (ω_1, ..., ω_m) ∈ Ω^m : ||(1/m) Σ_{i=1}^m [ξ(ω_i) − E(ξ(ω_i))]||_H ≤ 2 (Q/m + S/√m) log(2/η) } ≥ 1 − η.

We estimate the error bounds for the regularized estimators by measuring the effect of random sampling and the complexity of f_H.
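A quick Monte Carlo sanity check of the tail inequality just quoted, here for bounded random vectors in R³; the distribution and the constants Q, S are illustrative assumptions, not quantities from the analysis above.

```python
# Monte Carlo check of the Pinelis-Sakhanenko-type bound for bounded vectors.
import numpy as np

rng = np.random.default_rng(2)
dim, m, trials, eta = 3, 50, 2000, 0.1
# xi uniform on [-1, 1]^3: ||xi|| <= Q = sqrt(3) and E||xi||^2 = 1 <= S^2 = 1.
Q, S = np.sqrt(dim), 1.0
mean = np.zeros(dim)                       # E(xi) = 0 for this distribution
bound = 2.0 * (Q / m + S / np.sqrt(m)) * np.log(2.0 / eta)

deviations = np.empty(trials)
for t in range(trials):
    xi = rng.uniform(-1.0, 1.0, size=(m, dim))
    deviations[t] = np.linalg.norm(xi.mean(axis=0) - mean)

coverage = np.mean(deviations <= bound)
print(f"bound = {bound:.3f}, empirical coverage = {coverage:.3f} (>= {1 - eta})")
```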
The quantities described in Proposition 3.2 express E ||f − f || = Prob ||f − f || > t dt H z z ρ ρ z z ρ ρ the probabilistic estimates of the perturbation measure due to random sampling. The expressions of Proposition 3.3 describe t the complexity of the target function f which are usually ≤ exp − dt = ε(m), referred to as the approximation errors. The approximation ε(m) errors are independent of the samples z. Proposition 3.2. Let z be i.i.d. samples drawn according to the 1/2 where ||f || = ||f || 2 = { ||f (x)|| dρ (x)} and E (ξ) = ρ X z X Y probability measure ρ satisfying the assumptions (10), (11) and R r ξdρ(z )... dρ(z ). ∗ 1 m κ = sup Tr(K K ). Then for all 0 < η < 1, we have Z x x∈X −1/2 ∗ ∗ 3. CONVERGENCE ANALYSIS ||(L + λI) {S y − S S f }|| K x H H x x   κM 6 N (λ) 4 In this section, we analyze the convergence issues of the   ≤ 2 √ + log (16) learning algorithms on reproducing kernel Hilbert space under m η m λ the smoothness priors in the supervised learning framework. We discuss the upper and lower convergence rates for vector- and valued estimators in the standard minimax setting. Therefore, the 2 2 estimates can be utilized particularly for scalar-valued functions κ κ 4 ||S S − L || ≤ 2 + √ log . (17) x K L (H) and multi-task learning algorithms. m η with the confidence 1 − η. 3.1. Upper Rates for Tikhonov The proof of the first expression is the content of the step 3.2 Regularization of Theorem 4 [12] while the proof of the second expression can In General, we consider Tikhonov regularization in learning be obtained from Theorem 2 in De Vito et al. [25]. theory. Tikhonov regularization is briefly discussed in the literature [7, 9, 10, 25]. The error estimates for Tikhonov Proposition 3.3. Suppose f ∈  . Then, H φ,R regularization are discussed theoretically under Hölder’s source √ √ condition [12, 15, 16]. We establish the error estimates for (i) Under the assumption that φ(t) t and t/φ(t) are non- Tikhonov regularization scheme under general source condition decreasing functions, we have f ∈  for some continuous increasing index function φ and H φ,R the polynomial decay of the eigenvalues of the integral operator ||f − f || ≤ Rφ(λ) λ. (18) λ H ρ L . (ii) Under the assumption that φ(t) and t/φ(t) are non-decreasing In order to estimate the error bounds, we consider the following inequality used in the papers [4, 12] which is based on functions, we have the results of Pinelis and Sakhanenko [26]. ||f − f || ≤ Rκφ(λ) (19) λ H ρ Proposition 3.1. Let ξ be a random variable on the probability and space (,B, P) with values in real separable Hilbert space H. If there exist two constants Q and S satisfying ||f − f || ≤ Rφ(λ). (20) λ H H Under the source condition f ∈  , the proposition can be H φ,R n 2 n−2 E ||ξ − E(ξ)|| ≤ n!S Q ∀n ≥ 2, (15) proved using the ideas of Theorem 10 [4]. Frontiers in Applied Mathematics and Statistics | www.frontiersin.org 4 March 2017 | Volume 3 | Article 3 Rastogi and Sampath Optimal Rates for the Regularized Learning Algorithms Theorem 3.1. Let z be i.i.d. samples drawn according to the From the condition (21) we get with confidence 1 − η/2, probability measure ρ ∈ P where φ is the index function −1 ∗ satisfying the conditions that φ(t), t/φ(t) are non-decreasing ||(L + λI) (S S − L )|| ≤ . (25) K x K L(H) functions. 
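To make the approximation-error bounds (18) and (20) of Proposition 3.3 concrete, a small diagonal (sequence-space) sketch in which L_K = diag(t_n) with t_n = n^{-b}, φ(t) = t^r (Hölder's source condition), and f_H = φ(L_K) g. The values of b, r, R and the truncation level are illustrative assumptions used only to visualize the inequalities.

```python
# Diagonal sketch of the approximation error f_lambda - f_H = -lam (L_K + lam)^{-1} f_H.
import numpy as np

N, b, r, R = 5000, 2.0, 0.5, 1.0
t = np.arange(1, N + 1) ** (-b)                 # eigenvalues of L_K
phi = lambda s: s ** r                          # Hoelder index function
rng = np.random.default_rng(3)
g = rng.standard_normal(N)
g *= R / np.linalg.norm(g)                      # ||g||_H = R
fH = phi(t) * g                                 # coefficients of f_H = phi(L_K) g

for lam in [1e-1, 1e-2, 1e-3]:
    diff = (t / (t + lam) - 1.0) * fH           # f_lambda - f_H, coefficientwise
    err_H = np.linalg.norm(diff)                          # RKHS norm
    err_rho = np.linalg.norm(np.sqrt(t) * diff)           # L^2_{rho_X} norm (isometry)
    print(f"lam={lam:.0e}  H-norm {err_H:.2e} <= {R * phi(lam):.2e}",
          f"  rho-norm {err_rho:.2e} <= {R * phi(lam) * np.sqrt(lam):.2e}")
```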
Then for all 0 < η < 1, with confidence 1 − η, for the regularized estimator f (7) the following upper bound holds: z,λ Consequently, using (25) in the inequality (24) we obtain with   probability 1 − η/2,   2κM 46 N (λ) 4 ||f − f || ≤ 2 Rφ(λ) + + log ∗ −1 1/2 z,λ H H I = ||(S S + λI) (L + λI) || 1 x K L(H)  mλ mλ  η x −1/2 ≤ 2||(L + λI) || ≤ √ . (26) K L(H) provided that From Proposition 3.2 we have with confidence 1 − η/2, mλ ≥ 8κ log(4/η). (21) 2 2 κ κ 4 Proof. The error of regularized solution f can be estimated in z,λ ||S S − L || ≤ 2 + √ log . x K L(H) terms of the sample error and the approximation error as follows: m m η Again from the condition (21) we get with probability 1 − η/2, ||f − f || ≤ ||f − f || + ||f − f || . (22) z,λ H H z,λ λ H λ H H Now f − f can be expressed as z,λ λ I = ||S S − L || ≤ . (27) 3 x K L(H) ∗ −1 ∗ ∗ f − f = (S S + λI) {S y − S S f − λf }. z,λ λ x x λ λ x x x Therefore, the inequality (23) together with (16), (20), (26), (27) provides the desired bound. −1 Then f = (L + λI) L f implies λ K K H The following theorem discuss the error estimates in L - L f = L f + λf . K H K λ λ norm. The proof is similar to the above theorem. Therefore, Theorem 3.2. Let z be i.i.d. samples drawn according to the probability measure ρ ∈ P and f is the regularized solution (7) φ z,λ ∗ −1 ∗ ∗ f − f = (S S + λI) {S y − S S f − L (f − f )}. z,λ λ x x λ K H λ x x x corresponding to Tikhonov regularization. Then for all 0 < η < 1, with confidence 1 − η, the following upper bounds holds: Employing RKHS-norm we get, (i) Under the assumption that φ(t), t/φ(t) are non-decreasing ∗ −1 ∗ ∗ functions, ||f − f || ≤ ||(S S + λI) {S y − S S f z,λ λ H x x H x x x   + (S S − L )(f − f )}|| (23) x K H λ H   2κM 46 N (λ) ≤ I I + I ||f − f || /λ, 1 2 3 λ H H ||f − f || ≤ 2 Rφ(λ) λ + √ + z,λ H ρ  m  m λ ∗ −1 1/2 where I = ||(S S + λI) (L + λI) || , I = ||(L + 1 x K L(H) 2 K −1/2 ∗ ∗ ∗ λI) (S y − S S f )|| and I = ||S S − L || . x H H 3 x K L(H) x x x log The estimates of I , I can be obtained from Proposition 3.2 2 3 and the only task is to bound I . For this we consider (ii) Under the assumption that φ(t), t/φ(t) are non-decreasing ∗ −1 1/2 −1 ∗ −1 functions, (S S + λI) (L + λI) = {I − (L + λI) (L − S S )} x K K K x x x   −1/2 (L + λI)   4κM 166 N (λ) ||f − f || ≤ R(κ + λ)φ(λ) + √ + z,λ H ρ  m  which implies m λ ∞ 4 log −1 ∗ n −1/2 I ≤ ||(L + λI) (L − S S )|| ||(L + λI) || 1 K K x K L(H) η x L(H) n=0 (24) provided that −1 ∗ provided that ||(L +λI) (L − S S )|| < 1. To verify this K K x √ L(H) mλ ≥ 8κ log(4/η). (28) condition, we consider We derive the convergence rates of Tikhonov regularizer based −1 ∗ ||(L + λI) (S S − L )|| ≤ I /λ. K x K L(H) 3 on data-driven strategy of the parameter choice of λ for the class of probability measure P . φ,b Now using Proposition 3.2 we get with confidence 1 − η/2, Theorem 3.3. Under the same assumptions of Theorem 3.2 and 4κ 4 −1 ∗ hypothesis (13), the convergence of the estimator f (7) to the z,λ ||(L + λI) (S S − L )|| ≤ √ log . K x K L(H) mλ η target function f can be described as: Frontiers in Applied Mathematics and Statistics | www.frontiersin.org 5 March 2017 | Volume 3 | Article 3 Rastogi and Sampath Optimal Rates for the Regularized Learning Algorithms 2b (i) If φ(t) and t/φ(t) are non-decreasing functions. Then under (ii) Suppose 2(t) = t φ(t). Then the condition (28) follows that −1 −1/2 the parameter choice λ ∈ (0, 1], λ = 9 (m ) where 2 2 1 1 √ 8κ log(4/η) 8κ 2 2b 9(t) = t φ(t), we have mλ ≥ √ ≥ √ . 
λ λ ( ) −1 −1/2 1/2 ||f − f || ≤ C(9 (m )) φ z,λ H ρ −1 −1/2 Hence under the parameter choice λ ∈ (0, 1], λ = 2 (m ) Prob ≥ 1 − η −1 −1/2 4 (9 (m )) log we have 1 1 2 2b 1 λ λ φ(λ) φ(λ) and √ ≤ √ ≤ ≤ . 2 2 2 8κ 8κ 8κ m m λ lim lim sup sup Prob ||f − f || z z,λ H ρ From Theorem 3.2 and Equation (14), it follows that with τ→∞ m→∞ ρ∈P φ,b confidence 1 − η, −1 −1/2 1/2 −1 −1/2 > τ(9 (m )) φ(9 (m )) = 0, ′ −1 −1/2 ||f − f || ≤ C φ(2 (m )) log , (31) z,λ H ρ (ii) If φ(t) and t/φ(t) are non-decreasing functions. Then under −1 −1/2 the parameter choice λ ∈ (0, 1], λ = 2 (m ) where 1 2 where C : = R(κ + 1) + M/2κ + 4 βb6 /(b − 1). 2b 2(t) = t φ(t), we have ′ 4 Now defining τ : = C log gives ′ −1 −1/2 Prob ||f − f || ≤ C φ(2 (m )) log ≥ 1−η z z,λ H ρ −τ/C η η = η = 4e . The estimate (31) can be reexpressed as and −1 −1/2 Prob ||f − f || > τφ(2 (m )) ≤ η . (32) z z,λ H ρ τ lim lim sup sup Prob ||f − f || z z,λ H ρ τ→∞ m→∞ ρ∈P φ,b Then from Equations (30) and (32) our conclusions follow. −1 −1/2 > τφ(2 (m )) = 0. Theorem 3.4. Under the same assumptions of Theorem 3.1 and 1 1 hypothesis (13) with the parameter choice λ ∈ (0, 1], λ = 2 2b Proof. (i) Let 9(t) = t φ(t). Then it follows, 1 1 −1 −1/2 2 2b 9 (m ) where 9(t) = t φ(t), the convergence of the 2 estimator f (7) to the target function f can be described as z,λ H 9(t) t lim √ = lim = 0. −1 t→0 t→0 9 (t) t 4 −1 −1/2 Prob ||f − f || ≤ Cφ(9 (m )) log ≥ 1 − η z z,λ H H −1 −1/2 Under the parameter choice λ = 9 (m ) we have, and lim mλ = ∞. −1 −1/2 m→∞ lim lim sup sup Prob ||f − f || > τφ(9 (m )) z z,λ H H τ→∞ m→∞ ρ∈P φ,b Therefore, for sufficiently large m, = 0. The proof of the theorem follows the same steps as of Theorem 2b 1 λ φ(λ) 1 2b = √ ≤ λ φ(λ). 3.3 (i). We obtain the following corollary as a consequence of mλ mλ Theorem 3.3, 3.4. Under the fact λ ≤ 1 from Theorem 3.2 and Equation (14) Corollary 3.1. Under the same assumptions of Theorem 3.3, 3.4 follows that with confidence 1 − η, for Tikhonov regularization with Hölder’s source condition f ∈  , φ(t) = t , for all 0 < η < 1, with confidence 1 − η, for the φ,R 4 b −1 −1/2 1/2 −1 −1/2 − ||f − f || ≤ C(9 (m )) φ(9 (m )) log , 2br+b+1 z,λ H ρ parameter choice λ = m , we have (29) br p 4 2br+b+1 ||f − f || ≤ Cm log for 0 ≤ r ≤ 1, z,λ H H where C = 2R + 4κM + 4 βb6 /(b − 1). Now defining τ : = C log gives 2br+b 4 1 4br+2b+2 ||f − f || ≤ Cm log for 0 ≤ r ≤ −τ/C z,λ H ρ η = η = 4e . τ η 2 2br+1 The estimate (29) can be reexpressed as and for the parameter choice λ = m , we have br −1 −1/2 1/2 −1 −1/2 4 ′ − Prob {||f − f || > τ(9 (m )) φ(9 (m ))} ≤ η . z z,λ H ρ τ 2br+1 ||f − f || ≤ C m log for 0 ≤ r ≤ 1. z,λ H ρ (30) Frontiers in Applied Mathematics and Statistics | www.frontiersin.org 6 March 2017 | Volume 3 | Article 3 Rastogi and Sampath Optimal Rates for the Regularized Learning Algorithms 3.2. Upper Rates for General Definition 3.3. A function φ : [0, d] → [0, ∞) is said to be operator monotone index function if φ (0) = 0 and for every non- Regularization Schemes negative pair of self-adjoint operators A, B such that ||A||, ||B|| ≤ d Bauer et al. [4] discussed the error estimates for general and A ≤ B we have φ (A) ≤ φ (B). 1 1 regularization schemes under general source condition. Here we study the convergence issues for general regularization We consider the class of operator monotone index functions: schemes under general source condition and the polynomial decay of the eigenvalues of the integral operator L . 
We define F = {φ : [0,κ ] → [0, ∞) operator monotone, μ 1 the regularization in learning theory framework similar to φ (0) = 0,φ (κ ) ≤ μ}. 1 1 considered for ill-posed inverse problems (See Section 3.1 [4]). For the above class of operator monotone functions from Definition 3.1. A family of functions g : [0,κ ] → R, 0 < 2 Theorem 1 [4], given φ ∈ F there exists c such that 1 μ φ λ ≤ κ , is said to be the regularization if it satisfies the following 1 conditions: ∗ ∗ ||φ (S S ) − φ (L )|| ≤ c φ (||S S − L || ). 1 x 1 K φ 1 x K x L(H) 1 x L(H) • ∃D : sup |σ g (σ )| ≤ D. σ∈(0,κ ] ∗ Here we observe that the rate of convergence of φ (S S ) to 1 x • ∃B : sup |g (σ )| ≤ . λ φ (L ) is slower than the convergence rate of S S to L . 1 K x K σ∈(0,κ ] Therefore, we consider the following class of index functions: • ∃γ : sup |1 − g (σ )σ| ≤ γ . σ∈(0,κ ] F = {φ = φ φ : φ ∈ F ,φ : [0,κ ] • The maximal p satisfying the condition: 2 1 1 μ 2 → [0, ∞) non-decreasing Lipschitz,φ (0) = 0}. p p sup |1 − g (σ )σ|σ ≤ γ λ λ p σ∈(0,κ ] The splitting of φ = φ φ is not unique. So we can take φ 2 1 2 as a Lipschitz function with Lipschitz constant 1. Now using is called the qualification of the regularization g , where γ does λ p Corollary 1.2.2 [27] we get not depend on λ. ∗ ∗ ||φ (S S ) − φ (L )|| ≤ ||S S − L || . 2 x 2 K L (H) x K L (H) x 2 x 2 The properties of general regularization are satisfied by the General source condition f ∈  corresponding to index H φ,R large class of learning algorithms which are essentially all the class functions F covers wide range of source conditions as linear regularization schemes. We refer to Section 2.2 [10] Hölder’s source condition φ(t) = t , logarithm source condition for brief discussion of the regularization schemes. Here we −ν p 1 φ(t) = t log . Following the analysis of Bauer et al. [4] we consider general regularized solution corresponding to the above t develop the error estimates of general regularization for the index regularization: class function F under the suitable priors on the probability ∗ ∗ measure ρ. f = g (S S )S y. (33) z,λ λ x x x Theorem 3.5. Let z be i.i.d. samples drawn according to the Here we are discussing the connection between the qualification probability measure ρ ∈ P . Suppose f is the regularized φ z,λ of the regularization and general source condition [17]. solution (33) corresponding to general regularization and the Definition 3.2. The qualification p covers the index function φ if qualification of the regularization covers φ. Then for all 0 < η < 1, t 2 with confidence 1 − η, the following upper bound holds: the function t → on t ∈ (0,κ ] is non-decreasing. φ(t)   4Rμγκ 2 2ν κM The following result is a restatement of Proposition 3 [17].  1  Rc (1 + c )φ(λ) + + g φ mλ ||f − f || ≤ z,λ H H 8ν 6 N (λ)   Proposition 3.4. Suppose φ is a non-decreasing index function mλ and the qualification of the regularization g covers φ. Then log sup |1 − g (σ )σ|φ(σ ) ≤ c φ(λ), c = max(γ ,γ ). λ g g p σ∈(0,κ ] provided that Generally, the index function φ is not stable with respect to 2 mλ ≥ 8κ log(4/η). (34) perturbation in the integral operator L . In practice, we are only accessible to the perturbed empirical operator S S but the source Proof. We consider the error expression for general regularized condition can be expressed in terms of L only. So we want solution (33), to control the difference φ(L ) − φ(S S ). 
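As a concrete illustration of Definition 3.1 and of the general regularized solution (33), a sketch of three classical filter functions g_λ applied through an eigendecomposition; the matrix T below merely stands in for the empirical operator S_x* S_x and is an illustrative assumption, not an object constructed in the paper.

```python
# Three classical spectral filters realizing the general regularization (33).
import numpy as np

def g_tikhonov(sigma, lam):
    # Tikhonov filter; qualification 1
    return 1.0 / (sigma + lam)

def g_cutoff(sigma, lam):
    # spectral cut-off (truncated eigendecomposition); higher qualification
    return np.where(sigma >= lam, 1.0 / np.maximum(sigma, lam), 0.0)

def g_landweber(sigma, lam):
    # Landweber iteration with roughly 1/lam unit steps
    # (assumes the spectrum lies in (0, 1])
    k = int(np.ceil(1.0 / lam))
    return (1.0 - (1.0 - sigma) ** k) / np.maximum(sigma, 1e-15)

def general_regularizer(T, u, lam, g):
    """Return g_lambda(T) u for a symmetric PSD matrix T, mimicking
    f_{z,lambda} = g_lambda(S_x^* S_x) S_x^* y in (33)."""
    w, V = np.linalg.eigh(T)
    w = np.clip(w, 0.0, None)
    return V @ (g(w, lam) * (V.T @ u))

rng = np.random.default_rng(4)
n = 30
Qmat, _ = np.linalg.qr(rng.standard_normal((n, n)))
T = Qmat @ np.diag(np.arange(1, n + 1, dtype=float) ** -2.0) @ Qmat.T
u = rng.standard_normal(n)                      # plays the role of S_x^* y
for name, g in [("tikhonov", g_tikhonov), ("cutoff", g_cutoff),
                ("landweber", g_landweber)]:
    print(name, np.round(np.linalg.norm(general_regularizer(T, u, 1e-2, g)), 3))
```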
In order to obtain K x ∗ ∗ ∗ ∗ the error estimates for general regularization, we further restrict f − f = g (S S )(S y − S S f ) − r (S S )f , (35) z,λ H λ x x H λ x H x x x x the index functions to operator monotone functions which is defined as where r (σ ) = 1 − g (σ )σ . λ λ Frontiers in Applied Mathematics and Statistics | www.frontiersin.org 7 March 2017 | Volume 3 | Article 3 Rastogi and Sampath Optimal Rates for the Regularized Learning Algorithms Now the first term can be expressed as Now we consider the second term, ∗ ∗ ∗ ∗ ∗ 1/2 ∗ ∗ ∗ ∗ g (S S )(S y − S S f ) = g (S S )(S S + λI) r (S S )f = r (S S )φ(L )v = r (S S )φ(S S )v λ x H λ x K λ x x λ x x H λ x x x x x x x x x x x ∗ ∗ ∗ ∗ −1/2 1/2 +r (S S )φ (S S )(φ (L ) − φ (S S ))v (S S + λI) (L + λI) x K λ x 2 x 1 K 1 x x x x x −1/2 ∗ ∗ ∗ ∗ (L + λI) (S y − S S f ). +r (S S )(φ (L ) − φ (S S ))φ (L )v. λ x 2 K 2 x 1 K K x H x x x x Employing RKHS-norm we get On applying RKHS-norm we get, ∗ ∗ ∗ ∗ ||r (S S )f || ≤ Rc φ(λ) + Rc c φ (λ)φ λ x H H g g φ 2 1 ||g (S S )(S y − S S f )|| ≤ I I ||g (S S ) 1 λ x x H H 2 5 λ x x x x x x ∗ ∗ ∗ 1/2 (||L − S S || ) + Rμγ ||L − S S || . (S S + λI) || , (36) K x L(H) K x L (H) x x 2 x L(H) −1/2 ∗ ∗ ∗ Here we used the fact that if the qualification of the regularization where I = ||(L +λI) (S y−S S f )|| and I = ||(S S + 2 K x H H 5 x x x x −1/2 1/2 covers φ = φ φ , then the qualification also covers φ and φ 1 2 1 2 λI) (L + λI) || . K L(H) both separately. The estimate of I can be obtained from the first estimate of From Equations (17) and (34) we have with probability 1 − Proposition 3.2 and from the second estimate of Proposition 3.2 η/2, with the condition (34) we obtain with probability 1 − η/2, 4κ 4 −1/2 ∗ −1/2 ||(L + λI) (L − S S )(L + λI) || ||S S − L || ≤ √ log ≤ λ/2. (41) K K x K L(H) x K L(H) x x 1 4κ 4 1 ≤ ||S S − L || ≤ √ log ≤ . x K x L(H) Therefore, with probability 1 − η/2, λ mλ η 2 4Rμγκ 4 which implies that with confidence 1 − η/2, ||r (S S )f || ≤ Rc (1+c )φ(λ)+ √ log . (42) λ x H H g φ x 1 m η ∗ −1/2 1/2 I = ||(S S + λI) (L + λI) || 5 x K x L(H) Combining the bounds (40) and (42) we get the desired 1/2 1/2 ∗ −1 1/2 = ||(L + λI) (S S + λI) (L + λI) || K x K L(H) result. −1/2 ∗ = ||{I − (L + λI) (L − S S ) K K x Theorem 3.6. Let z be i.i.d. samples drawn according to the 1/2 −1/2 −1 (L + λI) } || L(H) probability measure ρ ∈ P and f is the regularized solution φ z,λ (33) corresponding to general regularization. Then for all 0 < η < ≤ 2. (37) 1, with confidence 1 − η, the following upper bounds holds: From the properties of the regularization we have, (i) If the qualification of the regularization covers φ, ∗ ∗ 1/2  √  ||g (S S )(S S ) || ≤ sup |g (σ ) σ| λ x x L(H) λ x x  Rc (1 + c )(κ + λ)φ(λ)  g φ   2 √ 0<σ≤κ   4Rμγκ (κ+ λ) 2 2ν κM 4 ! √ √ r + + 1/2 ||f − f || ≤ log , z,λ H ρ m λ   BD η  2 2  8ν 6 N (λ)   = sup |g (σ )σ| sup |g (σ )| ≤ . (38) λ λ λ m 2 2 0<σ≤κ 0<σ≤κ Hence it follows, (ii) If the qualification of the regularization covers φ(t) t,  √  ∗ ∗ 1/2 1/2 √ 2 ||g (S S )(S S + λI) || ≤ sup |g (σ )(σ + λ) | λ x x λ 4Rμ(γ +c )κ λ L(H) g x x   2Rc (1 + c )φ(λ) λ + g φ 0<σ≤κ m ||f − f || ≤ √ z,λ H ρ √ 2 ν 8ν 6 N (λ) 1  2 2ν κM  2 2 + + ≤ sup |g (σ ) σ| + λ sup |g (σ )| ≤ √ , (39) λ λ m λ 2 2 λ 0<σ≤κ 0<σ≤κ log where ν : = B + BD. Therefore, using (16), (37) and (39) in Equation (36) we provided that conclude that with probability 1 − η, mλ ≥ 8κ log(4/η). (43)     κM 6 N (λ) ∗ ∗ ∗ ||g (S S )(S y − S S f )|| ≤ 2 2ν + λ x x H H 1 x x x Proof. 
Here we establish L -norm estimate for the error   mλ mλ expression: log . (40) ∗ ∗ ∗ ∗ f − f = g (S S )(S y − S S f ) − r (S S )f . z,λ H λ x x H λ x H η x x x x Frontiers in Applied Mathematics and Statistics | www.frontiersin.org 8 March 2017 | Volume 3 | Article 3 Rastogi and Sampath Optimal Rates for the Regularized Learning Algorithms On applying L -norm in the first term we get, Therefore, using Equation (17) we obtain with probability 1 − η/2, 1/2 ∗ ∗ ∗ ∗ ||g (S S )(S y − S S f )|| ≤ I I ||L g (S S ) λ x x H ρ 2 5 λ x x x x K x ∗ 1/2 ||r (S S )f || ≤ (κ + λ) λ x H ρ (S S + λI) || , (44) x x L(H) 4Rμγκ 4 −1/2 ∗ ∗ ∗ Rc (1 + c )φ(λ) + √ log . (47) where I = ||(L +λI) (S y−S S f )|| and I = ||(S S + 2 K x H H 5 x g φ x x x 1 m η −1/2 1/2 λI) (L + λI) || . K L(H) The estimates of I and I can be obtained from Proposition 2 5 Case 2. If the qualification of the regularization covers φ(t) t, 3.2 and Theorem 3.5 respectively. Now we consider we get with probability 1 − η/2, 1/2 1/2 ∗ ∗ 1/2 ||L g (S S )(S S + λI) || ≤ ||L λ x x L(H) x x √ K K ||r (S S )f || ≤ 2Rc (1 + c )φ(λ) λ ∗ 1/2 ∗ ∗ 1/2 λ x H ρ g φ x 1 −(S S ) || ||g (S S )(S S + λI) x L(H) λ x x x x x r ∗ 1/2 ∗ λ 4 || + ||(S S ) g (S S ) x λ x L(H) x x +4Rμ(γ + c )κ log . (48) m η ∗ 1/2 (S S + λI) || . x L(H) Combining the error estimates (46), (47) and (48) we get the Since φ(t) = t is operator monotone function. Therefore, from desired results. Equation (41) with probability 1 − η/2, we get 1/2 ∗ 1/2 ∗ 1/2 We discuss the convergence rates of general regularizer based ||L − (S S ) || ≤ (||L − S S || ) ≤ λ. x L(H) K x L(H) K x x on data-driven strategy of the parameter choice of λ for the class Then using the properties of the regularization and Equation (38) of probability measure P . The proof of Theorem 3.7, 3.8 are φ,b similar to Theorem 3.3. we conclude that with probability 1 − η/2, 1/2 ∗ ∗ 1/2 Theorem 3.7. Under the same assumptions of Theorem 3.5 and || L g (S S )(S S + λI) || λ x x L(H) K x x hypothesis (13) with the parameter choice λ ∈ (0, 1], λ = 1/2 2 1/2 ≤ λ sup |g (σ )(σ + λ) | + sup |g (σ )(σ + λσ ) | 1 1 λ λ −1 −1/2 + 2b 2 2 9 (m ) where 9(t) = t φ(t), the convergence of the 0<σ≤κ 0<σ≤κ estimator f (33) to the target function f can be described as z,λ H ≤ sup |g (σ )σ| + λ sup |g (σ )| + 2 λ sup |g (σ ) σ| λ λ λ 2 2 2 0<σ≤κ 0<σ≤κ 0<σ≤κ √ 4 −1 −1/2 Prob ||f − f || ≤ Cφ(9 (m )) log ≥ 1 − η, ≤ B + D + 2 BD = ν (let). (45) z z,λ H H From Equations (44) with Equations (16), (37), and (45) we where C = Rc (1 + c ) + 4Rμγκ + 2 2ν κM + g φ 1 obtain with probability 1 − η, 1   8βbν 6 /(b − 1) and   κM 6 N (λ) ∗ ∗ ∗ ||g (S S )(S y − S S f )|| ≤ 2 2ν √ + λ x x H ρ 2 x x x   −1 −1/2 m λ lim lim sup sup Prob ||f − f || > τφ(9 (m )) z z,λ H H τ→∞ m→∞ ρ∈P φ,b = 0. log . (46) Theorem 3.8. Under the same assumptions of Theorem 3.6 and The second term can be expressed as hypothesis (13), the convergence of the estimator f (33) to the z,λ 1/2 ∗ ∗ 1/2 ∗ || r (S S )f || ≤ ||L − (S S ) || ||r (S S )f || target function f can be described as λ x H ρ x L(H) λ x H H x K x x ∗ 1/2 ∗ +||(S S ) r (S S )f || x λ x H H (i) If the qualification of the regularization covers φ. 
Then under x x −1 −1/2 1/2 ∗ ∗ the parameter choice λ ∈ (0, 1], λ = 2 (m ) where ≤ ||L − S S || ||r (S S )f || K x λ x H H x x L(H) 1 2b 2(t) = t φ(t), we have ∗ ∗ 1/2 ∗ +||r (S S )(S S ) φ(S S )v|| λ x x x H x x x ∗ ∗ 1/2 ∗ ∗ +||r (S S )(S S ) φ (S S )(φ (S S ) − φ (L ))v|| λ x x 2 x 1 x 1 K H 4 x x x x −1 −1/2 Prob ||f − f || ≤C φ(2 (m )) log ≥1−η, z z,λ H ρ 1 ∗ ∗ 1/2 ∗ +||r (S S )(S S ) (φ (S S ) − φ (L ))φ (L )v|| . η λ x x 2 x 2 K 1 K H x x x Here two cases arises: 2 where C = Rc (1 + c )(κ + 1) + 4Rμγκ (κ + 1) + 1 g φ ν M/2 2κ + 8βbν 6 /(b − 1) and Case 1. If the qualification of the regularization covers φ. Then we get with confidence 1 − η/2, −1 −1/2 √ lim lim sup sup Prob ||f − f || > τφ(2 (m )) z z,λ H ρ τ→∞ m→∞ ρ∈P ||r (S S )f || ≤ (κ + λ) Rc (1 + c )φ(λ) λ x H ρ g φ φ,b x 1 = 0, +Rμγ ||S S − L || . x K L (H) x 2 Frontiers in Applied Mathematics and Statistics | www.frontiersin.org 9 March 2017 | Volume 3 | Article 3 Rastogi and Sampath Optimal Rates for the Regularized Learning Algorithms (ii) If the qualification of the regularization covers φ(t) t. Then Remark 3.3. For the real valued functions and multi-task −1 −1/2 m under the parameter choice λ ∈ (0, 1], λ = 9 (m ) algorithms (the output space Y ⊂ R for some m ∈ N) we 1 1 can obtain the error estimates from our analysis without imposing 2 2b where 9(t) = t φ(t), we have any condition on the conditional probability measure (11) for the −1 −1/2 1/2 −1 −1/2 bounded output space Y. Prob ||f − f || ≤ C (9 (m )) φ(9 (m )) z z,λ H ρ 2 Remark 3.4. We can address the convergence issues of binary log ≥ 1 − η, η classification problem [28] using our error estimates as similar to discussed in Section 3.3 [4] and Section 5 [16]. where C = 2Rc (1 + c ) + 4Rμ(γ + c )κ + 2 2ν κM + 2 g φ g 2 3.3. Lower Rates for General Learning 8βbν 6 /(b − 1) and Algorithms In this section, we discuss the estimates of minimum possible lim lim sup sup Prob ||f − f || z z,λ H ρ error over a subclass of the probability measures P τ→∞ φ,b m→∞ ρ∈P φ,b parameterized by suitable functions f ∈ H. Throughout this −1 −1/2 1/2 −1 −1/2 > τ(9 (m )) φ(9 (m )) = 0. section we assume that Y is finite-dimensional. Let {v } be a basis of Y and f ∈  . Then we parameterize j φ,R j=1 We obtain the following corollary as a consequence of Theorem the probability measure based on the function f , 3.7, 3.8. Corollary 3.2. Under the same assumptions of Theorem 3.7, 3.8 ρ (x, y): = a (x)δ + b (x)δ ν(x), (49) f j y+dLv j y−dLv j j 2dL for general regularization of the qualification p with Hölder’s source j=1 condition f ∈  , φ(t) = t , for all 0 < η < 1, with confidence H φ,R − where a (x) = L − hf , K v i , b (x) = L + hf , K v i , L = j x j H j x j H 2br+b+1 1 − η, for the parameter choice λ = m , we have 4κφ(κ )R and δ denotes the Dirac measure with unit mass at ξ. It is easy to observe that the marginal distribution of ρ over X br e 2br+b+1 ||f − f || ≤ Cm log for 0 ≤ r ≤ p, is ν and the regression function for the probability measure ρ is z,λ H H f (see Proposition 4 [12]). In addition to this, for the conditional probability measure ρ (y|x) we have, 2br+b 4 1 e 4br+2b+2 ||f − f || ≤ C m log for 0 ≤ r ≤ p − ||y − f (x)|| z,λ H ρ 2 Y ||y−f (x)|| /M η 2 e − − 1 dρ (y|x) i−2 2 2br+1 (dL + ||f (x)|| ) 6 and for the parameter choice λ = m , we have 2 2 2 ≤ d L − ||f (x)|| ≤ i 2 M i! 2M i=2 br e 2br+1 ||f − f || ≤ C m log for 0 ≤ r ≤ p. z,λ H ρ 1 provided that dL + L/4 ≤ M and 2dL ≤ 6. Remark 3.1. 
It is important to observe from Corollary 3.1, 3.2 that using the concept of operator monotonicity of index function We assume that the eigenvalues of the integral operator L we are able to achieve the same error estimates for general follow the polynomial decay (13) for the marginal probability regularization as of Tikhonov regularization up to a constant measure ν. Then we conclude that the probability measure ρ multiple. parameterized by f belongs to the class P . φ,b The concept of information theory such as the Kullback- Remark 3.2. (Related work) Corollary 3.1 provides the order of Leibler information and Fano inequalities (Lemma 3.3 [29]) convergence same as of Theorem 1 [12] for Tikhonov regularization are the main ingredients in the analysis of lower bounds. In under the Hölder’s source condition f ∈  for φ(t) = H φ,R r 1 the literature [12, 29], the closeness of probability measures t ≤ r ≤ 1 and the polynomial decay of the eigenvalues is described through Kullback-Leibler information: Given two (13). Blanchard and Mücke [18] addressed the convergence rates probability measures ρ and ρ , it is defined as 1 2 for inverse statistical learning problem for general regularization under the Hölder’s source condition with the assumption f ∈ H. In particular, the upper convergence rates discussed in Blanchard K(ρ ,ρ ): = log(g(z))dρ (z), 1 2 1 and Mücke [18] agree with Corollary 3.2 for considered learning problem which is referred as direct learning problem in Blanchard where g is the density of ρ with respect to ρ , that is, ρ (E) = 1 2 1 and Mücke[18]. Under the fact N (λ) ≤ from Theorem 3.5, 3.6 g(z)dρ (z) for all measurable sets E. λ E we obtain the similar estimates as of Theorem 10 [4] for general Following the analysis of Caponnetto and De Vito [12] and regularization schemes without the polynomial decay condition of DeVore et al. [29] we establish the lower rates of accuracy that the eigenvalues (13). can be attained by any learning algorithm. Frontiers in Applied Mathematics and Statistics | www.frontiersin.org 10 March 2017 | Volume 3 | Article 3 Rastogi and Sampath Optimal Rates for the Regularized Learning Algorithms 1 ε ℓ To estimate the lower rates of learning algorithms, we generate where σ = (σ ,... ,σ ) ∈ {−1, +1} . Then from Equation i i N -functions belonging to  for given ε > 0 such that (53), (51) we get, ε φ,R (54) holds. Then we construct the probability measures ρ ∈ P i φ,b from Equation (49), parameterized by these functions f ’s (1 ≤ ε ≤ ||f − f || , for all 1 ≤ i, j ≤ N , i 6= j. (53) i i j H ε i ≤ N ). On applying Lemma 3.3 [29], we obtain the lower convergence rates using Kullback-Leibler information. For 1 ≤ i, j ≤ N , we have Theorem 3.9. Let z be i.i.d. samples drawn according to the 2 n−ℓ n−ℓ ε ε 2ℓ 2ℓ ε ε βε σ − σ 2 X X i j 4βε probability measure ρ ∈ P under the hypothesis dim(Y) = d < φ,b 2 ||f − f || ≤ ≤ i j 2 b b ∞. Then for any learning algorithm (z → f ∈ H) there exists L (X) z ν ℓ n ℓ n ε ε n=ℓ +1 n=ℓ +1 ε ε a probability measure ρ ∈ P and f ∈ H such that for all ∗ ρ φ,b ∗ 2 2ℓ 2 4βε 1 ε 0 < ε < ε , f can be approximated as o z ≤ dx = c , (54) b b ℓ x ℓ   ε ℓ  ℓ  ε cmε 48 b Prob ||f − f || > ε/2 ≥ min ,ϑe 4β z z ρ H ′ 1 −ℓ /24 where c = 1 − .  1 + e  b−1 (b−1) We define the sets, 64β 1 −3/e n o where ϑ = e , c = 1 − and ℓ = 2 ε b−1 15(b−1)dL A = z : ||f − f || < , for 1 ≤ i ≤ N . i z i H ε 1/b 1 α −1 2 φ (ε/R) It is clear from Equation (53) that A ’s are disjoint sets. On applying Lemma 3.3 [29] with probability measures ρ , 1 ≤ i ≤ Proof. 
For given ε > 0, we define N , we obtain that either 2ℓ n−ℓ εσ e g = √ , m c ℓφ(t ) p: = max ρ (A ) ≥ (55) n=ℓ+1 f i 1≤i≤N N + 1 ε ε 1 ℓ ℓ where σ = (σ ,... ,σ ) ∈ {−1, +1} , t ’s are the eigenvalues of n or the integral operator L , e ’s are the eigenvectors of the integral K n operator L and the orthonormal basis of RKHS H. Under the 1 m m min K(ρ ,ρ ) ≥ 9 (p), (56) b ε f f i j decay condition on the eigenvalues α ≤ n t , we get n 1≤j≤N N i=1,i6=j 2ℓ 2ℓ 2 2 2 X X ε ε ε 1−p N −p where 9 (p) = log(N ) + (1 − p) log − p log . ||g|| = ≤ ≤ . N ε H p p ℓφ (t ) α α 2 2 ℓφ φ n=ℓ+1 n=ℓ+1 b b b Further, n 2 ℓ Hence f = φ(L )g ∈  provided that ||g|| ≤ R or K φ,R H 9 (p) ≥ (1 − p) log(N ) + (1 − p) log(1 − p) − log(p) N ε equivalently, +2p log(p) ≥ − log(p) + log( N ) − 3/e. (57) 1/b 1 α ℓ ≤ . (50) Since minimum value of x log(x) is −1/e on [0, 1]. −1 2 φ (ε/R) m m For the joint probability measures ρ , ρ (ρ ,ρ ∈ P , 1 ≤ f f φ,b i j f f i j 1/b 1 α i, j ≤ N ) from Proposition 4 [12] and the Equation (54) we get, For ℓ = ℓ = , choose ε such that ℓ > 16. ε o ε −1 2 o φ (ε/R) Then according to Proposition 6 [12], for every positive ε < 16m cmε m m 2 ε K(ρ ,ρ ) = mK(ρ ,ρ ) ≤ ||f − f || ≤ , ε (ℓ > ℓ ) there exists N ∈ N and σ ,... ,σ ∈ {−1, +1} f f i j 2 f f i j o ε ε ε 1 N 2 o ε i j b L (X) 15dL ℓ such that (58) ′ 2 where c = 16c /15dL . n n 2 Therefore, Equations (55), (56), together with Equations (57) (σ − σ ) ≥ ℓ , for all 1 ≤ i, j ≤ N , i 6= j (51) ε ε i j and (58) implies n=1 n o and p: = max Prob z : ||f − f || > z i H 1≤i≤N 2 ℓ /24 N ≥ e . (52) ( ) 3 cmε − − ε e b ≥ min , N e . Now we suppose f = φ(L )g and for ε > 0, i K i N + 1 2ℓ ε n−ℓ εσ e From Equation (52) for the probability measure ρ such that g = √ , for i = 1,... , N , i ε m c ℓ φ(t ) ε n p = ρ (A ) follows the result. n=ℓ +1 ∗ i Frontiers in Applied Mathematics and Statistics | www.frontiersin.org 11 March 2017 | Volume 3 | Article 3 Rastogi and Sampath Optimal Rates for the Regularized Learning Algorithms The lower estimates in L -norm can be obtained similar to Theorem 3.12. Under the same assumptions of Theorem 3.9 for 1 1 above theorem. 2 2b 9(t) = t φ(t), the estimator f corresponding to any learning algorithm converges to the regression function f with the following Theorem 3.10. Let z be i.i.d. samples drawn according to the lower rate: probability measure ρ ∈ P under the hypothesis dim(Y) = d < φ,b n o ∞. Then for any learning algorithm (z → f ∈ H) there exists l −1 −1/2 lim lim inf inf sup Prob ||f − f || > τφ 9 (m ) z ρ H a probability measure ρ ∈ P and f ∈ H such that for all ∗ φ,b ρ ∗ τ→0 m→∞ l∈A ρ∈P φ,b 0 < ε < ε , f can be approximated as o z = 1. n o Prob ||f − f || 2 > ε/2 z z ρ L (X) We obtain the following corollary as a consequence of ℓ 64mε − Theorem 3.11, 3.12. 48 2 15dL ≥ min ,ϑe −ℓ /24 1 + e Corollary 3.3. For any learning algorithm under Hölder’s source 1/b √ condition f ∈  , φ(t) = t and the polynomial decay ρ φ,R −3/e α where ϑ = e , ℓ = and ψ(t) = tφ(t). ε −1 ψ (ε/R) condition (13) for b > 1, the lower convergence rates can be described as Theorem 3.11. 
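Since the upper rates of Corollary 3.1 and the lower rates of this section are stated to coincide under the parameter choice λ = Ψ^{-1}(m^{-1/2}) with Ψ(t) = t^{1/2 + 1/(2b)} φ(t), a short numerical check of the exponents in the Hölder case φ(t) = t^r; the values of b, r, and m are illustrative assumptions.

```python
# Check that Psi^{-1}(m^{-1/2}) reproduces the Hoelder-case rate exponents.
import numpy as np

b, r = 2.0, 0.5
psi_exponent = 0.5 + 1.0 / (2.0 * b) + r           # Psi(t) = t^(1/2 + 1/(2b) + r)
for m in [10**3, 10**5, 10**7]:
    lam = (m ** -0.5) ** (1.0 / psi_exponent)       # lambda = Psi^{-1}(m^{-1/2})
    closed_form = m ** (-b / (2 * b * r + b + 1))   # lambda = m^{-b/(2br+b+1)}
    rkhs_rate = lam ** r                            # phi(lambda): RKHS-norm rate
    l2_rate = lam ** (r + 0.5)                      # sqrt(lambda) phi(lambda)
    print(f"m={m:.0e}  lambda={lam:.3e} (closed form {closed_form:.3e})",
          f"H-rate m^{np.log(rkhs_rate) / np.log(m):+.3f}",
          f"L2-rate m^{np.log(l2_rate) / np.log(m):+.3f}")
# Expected exponents: -br/(2br+b+1) = -0.2 and -(2br+b)/(4br+2b+2) = -0.4.
```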
Under the same assumptions of Theorem 3.10 n o 1 1 2br+b 1/2 − 2 2b for ψ(t) = t φ(t) and 9(t) = t φ(t), the estimator f 4br+2b+2 z lim lim inf inf sup Prob ||f − f || 2 > τm z ρ L (X) m→∞ τ→0 ν l∈A ρ∈P corresponding to any learning algorithm converges to the regression φ,b function f with the following lower rate: = 1 lim lim inf inf sup Prob ||f − f || 2 > τψ and z ρ L (X) τ→0 m→∞ ν l∈A ρ∈P φ,b n o br −1 −1/2 − 2br+b+1 9 (m ) = 1, lim lim inf inf sup Prob ||f − f || > τm = 1. z ρ H m→∞ τ→0 l∈A ρ∈P φ,b where A denotes the set of all learning algorithms l : z → f . If the minimax lower rate coincides with the upper convergence 1/b rate for λ = λ . Then the choice of parameter is said to be Proof. Under the condition ℓ = from ε −1 ψ (ε/R) −1 −1/2 optimal. For the parameter choice λ = 9 (m ), Theorem Theorem 3.10 we get, 3.3 and Theorem 3.8 share the upper convergence rate with the n o lower convergence rate of Theorem 3.11 in L -norm. For the Prob ||f − f || 2 > same parameter choice, Theorem 3.4 and Theorem 3.7 share z z ρ L (X)   the upper convergence rate with the lower convergence rate 1/b   1 α 64mε 1 of Theorem 3.12 in RKHS-norm. Therefore, the choice of the 48 −1 2 ψ (ε/R) 15dL 1 − ≥ min ,ϑe e . −ℓ /24 1+e parameter is optimal.   It is important to observe that we get the same convergence rates for b = 1. −1 −1/2 Choosing ε = τRψ(9 (m )), we obtain 3.4. Individual Lower Rates −1 −1/2 Prob ||f − f || 2 > τ ψ(9 (m )) z z ρ In this section, we discuss the individual minimax lower rates L (X) ν 2 that describe the behavior of the error for the class of probability 1 1 −1 −1/2 −1/b − c(9 (m )) measure P as the sample size m grows. ≥ min ,ϑe e , φ,b −ℓ /24 1 + e Definition 3.4. A sequence of positive numbers a (n ∈ N) is √ 1 1/b 2 2 α 64R τ 5dLα 2b called the individual lower rate of convergence for the class of where c = − > 0 for 0 < τ < min , 1 . 48 15dL 32R probability measure P, if Now as m goes to ∞, ε → 0 and ℓ → ∞. Therefore, for   c > 0 we conclude that l 2 E ||f − f || z H   inf sup lim sup > 0, l −1 −1/2 l∈A ρ∈P m→∞ m lim inf inf sup Prob ||f − f || 2 > τ ψ(9 (m )) z ρ L (X) m→∞ ν 2 l∈A ρ∈P φ,b = 1. where A denotes the set of all learning algorithms l : z 7→ f . Theorem 3.13. Let z be i.i.d. samples drawn according to the −1 −1/2 Choosing ε = τRφ(9 (m )) we get the following probability measure P where φ is the index function satisfying φ,b r r 1 2 convergence rate from Theorem 3.9. the conditions that φ(t)/t , t /φ(t) are non-decreasing functions Frontiers in Applied Mathematics and Statistics | www.frontiersin.org 12 March 2017 | Volume 3 | Article 3 Rastogi and Sampath Optimal Rates for the Regularized Learning Algorithms and dim(Y) = d < ∞. Then for every ε > 0, the following lower and dim(Y) = d < ∞. Then for every ε > 0, the following lower bound holds: bound holds:     l 2 l 2 E ||f − f || z H E ||f − f || z H z H z 2     L (X) inf sup lim sup > 0.   −(bc −b+ε)/(bc +ε+1) inf sup lim sup > 0, 2 1 l∈A m m→∞   ρ∈P −(bc +ε)/(bc +ε+1) φ,b 2 1 l∈A m ρ∈P m→∞ φ,b 4. CONCLUSION where c = 2r + 1 and c = 2r + 1. 1 1 2 2 In our analysis we derive the upper and lower convergence We consider the class of probability measures such that the target rates over the wide class of probability measures considering ∞ ∞ function f is parameterized by s = (s ) ∈ {−1, +1} . H n n=1 general source condition in vector-valued setting. In particular, Suppose for ε > 0, our minimax rates can be used for the scalar-valued functions ! and multi-task learning problems. 
The lower convergence rates ε α φ(α/n ) coincide with the upper convergence rates for the optimal −(ε+1)/2 g = s R n e , n n ε + 1 n t φ(t ) parameter choice based on smoothness parameters b,φ. We can n=1 also develop various parameter choice rules such as balancing ∞ ∞ principle [31], quasi-optimality principle [32], discrepancy where s = (s ) ∈ {−1, +1} , t ’s are the eigenvalues n n n=1 principle [33] for the regularized solutions provided in our of the integral operator L , e ’s are the eigenvectors of the K n analysis. integral operator L and the orthonormal basis of RKHS H. Then the target function f = φ(L )g satisfies the general H K source condition. We assume that the conditional probability AUTHOR CONTRIBUTIONS measure ρ(y|x) follows the normal distribution centered at All authors listed, have made substantial, direct and intellectual f and the marginal probability measure ρ = ν. Now H X contribution to the work, and approved it for publication. we can derive the individual lower rates over the considered class of probability measures from the ideas of the literature [12, 30]. ACKNOWLEDGMENTS Theorem 3.14. Let z be i.i.d. samples drawn according to the The authors are grateful to the reviewers for their helpful probability measure P where φ is the index function satisfying comments and pointing out a subtle error that led to improve φ,b r r 1 2 the conditions that φ(t)/t , t /φ(t) are non-decreasing functions the quality of the paper. REFERENCES 11. Abhishake Sivananthan S. Multi-penalty regularization in learning theory. J Complex. (2016) 36:141–65. doi: 10.1016/j.jco.2016.05.003 1. Cucker F, Smale S. On the mathematical foundations of learning. 12. Caponnetto A, De Vito E. Optimal rates for the regularized least-squares Bull Am Math Soc. (2002) 39:1–49. doi: 10.1090/S0273-0979-01- algorithm. Found Comput Math. (2007) 7:331–68. doi: 10.1007/s10208- 00923-5 006-0196-8 2. Evgeniou T, Pontil M, Poggio T. Regularization networks and support vector 13. Smale S, Zhou DX. Estimating the approximation error in learning theory. machines. Adv Comput Math. (2000) 13:1–50. doi: 10.1023/A:1018946025316 Anal Appl. (2003) 1:17–41. doi: 10.1142/S0219530503000089 3. Vapnik VN, Vapnik V. Statistical Learning Theory. New York, NY: Wiley 14. Smale S, Zhou DX. Shannon sampling and function reconstruction from (1998). point values. Bull Am Math Soc. (2004) 41:279–306. doi: 10.1090/S0273-0979- 4. Bauer F, Pereverzev S, Rosasco L. On regularization algorithms in 04-01025-0 learning theory. J Complex. (2007) 23:52–72. doi: 10.1016/j.jco.2006. 15. Smale S, Zhou DX. Shannon sampling II: connections to learning theory. Appl 07.001 Comput Harmon Anal. (2005) 19:285–302. doi: 10.1016/j.acha.2005.03.001 5. Engl HW, Hanke M, Neubauer A. Regularization of Inverse Problems. 16. Smale S, Zhou DX. Learning theory estimates via integral operators and Dordrecht: Kluwer Academic Publishers Group (1996). their approximations. Constr Approx. (2007) 26:153–72. doi: 10.1007/s00365- 6. Gerfo LL, Rosasco L, Odone F, De Vito E, Verri A. Spectral 006-0659-y algorithms for supervised learning. Neural Comput. (2008) 20:1873–97. 17. Mathé P, Pereverzev SV. Geometry of linear ill-posed problems in variable doi: 10.1162/neco.2008.05-07-517 Hilbert scales. Inverse Probl. (2003) 19:789–803. doi: 10.1088/0266-5611/ 7. Tikhonov AN, Arsenin VY. Solutions of Ill-Posed Problems. Washington, DC: 19/3/319 W. H. Winston (1977). 18. Blanchard G, Mücke N. Optimal rates for regularization of statistical inverse 8. 
Bousquet O, Boucheron S, Lugosi G. Introduction to statistical learning learning problems. arXiv:1604.04054 (2016). theory. In: Bousquet O, von Luxburg U, Ratsch G editors. Advanced Lectures 19. Mendelson S. On the performance of kernel classes. J Mach Learn Res. (2003) on Machine Learning, Volume 3176 of Lecture Notes in Computer Science. 4:759–71. Berlin; Heidelberg: Springer (2004). pp. 169–207. 20. Zhang T. Effective dimension and generalization of kernel learning. In: Thrun 9. Cucker F, Zhou DX. Learning Theory: An Approximation Theory Viewpoint. S, Becker S, Obermayer K. editors. Advances in Neural Information Processing Cambridge, UK: Cambridge Monographs on Applied and Computational Systems. Cambridge, MA: MIT Press, (2003). pp. 454–61. Mathematics, Cambridge University Press (2007). 21. Akhiezer NI, Glazman IM. Theory of Linear Operators in Hilbert Space, 10. Lu S, Pereverzev S. Regularization Theory for Ill-posed Problems: Selected Translated from the Russian and with a preface by Merlynd Nestell. New York, Topics, Berlin: DeGruyter (2013). NY: Dover Publications Inc (1993). Frontiers in Applied Mathematics and Statistics | www.frontiersin.org 13 March 2017 | Volume 3 | Article 3 Rastogi and Sampath Optimal Rates for the Regularized Learning Algorithms 22. Micchelli CA, Pontil M. On learning vector-valued functions. Neural Comput. 31. De Vito E, Pereverzyev S, Rosasco L. Adaptive kernel methods using (2005) 17:177–204. doi: 10.1162/0899766052530802 the balancing principle. Found Comput Math. (2010) 10:455–79. 23. Aronszajn N. Theory of reproducing kernels. Trans Am Math Soc. (1950) doi: 10.1007/s10208-010-9064-2 68:337–404. doi: 10.1090/S0002-9947-1950-0051437-7 32. Bauer F, Reiss M. Regularization independent of the noise level: 24. Reed M, Simon B. Functional Analysis, Vol. 1, San Diego, CA: Academic Press an analysis of quasi-optimality. Inverse Prob. (2008) 24:055009. (1980). doi: 10.1088/0266-5611/24/5/055009 25. De Vito E, Rosasco L, Caponnetto A, De Giovannini U, Odone F. Learning 33. Lu S, Pereverzev SV, Tautenhahn U. A model function method from examples as an inverse problem. J Mach Learn Res. (2005) 6:883–904. in regularized total least squares. Appl Anal. (2010) 89:1693–703. 26. Pinelis IF, Sakhanenko AI. Remarks on inequalities for the probabilities of doi: 10.1080/00036811.2010.492502 large deviations. Theory Prob Appl. (1985) 30:127–31. doi: 10.1137/1130013 27. Peller VV. Multiple operator integrals in perturbation theory. Bull Math Sci. Conflict of Interest Statement: The authors declare that the research was (2016) 6:15–88. doi: 10.1007/s13373-015-0073-y conducted in the absence of any commercial or financial relationships that could 28. Boucheron S, Bousquet O, Lugosi G. Theory of classification: a survey of some be construed as a potential conflict of interest. recent advances. ESAIM: Prob Stat. (2005) 9:323–75. doi: 10.1051/ps:2005018 29. DeVore R, Kerkyacharian G, Picard D, Temlyakov V. Approximation Copyright © 2017 Rastogi and Sampath. This is an open-access article distributed methods for supervised learning. Found Comput Math. (2006) 6:3–58. under the terms of the Creative Commons Attribution License (CC BY). The use, doi: 10.1007/s10208-004-0158-6 distribution or reproduction in other forums is permitted, provided the original 30. Györfi L, Kohler M, Krzyzak A, Walk H. A Distribution-Free Theory of author(s) or licensor are credited and that the original publication in this journal Nonparametric Regression. 
New York, NY: Springer Series in Statistics, is cited, in accordance with accepted academic practice. No use, distribution or Springer-Verlag (2002). reproduction is permitted which does not comply with these terms. Frontiers in Applied Mathematics and Statistics | www.frontiersin.org 14 March 2017 | Volume 3 | Article 3 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Frontiers in Applied Mathematics and Statistics Unpaywall

Optimal Rates for the Regularized Learning Algorithms under General Source Condition

Frontiers in Applied Mathematics and StatisticsMar 27, 2017

Loading next page...
 
/lp/unpaywall/optimal-rates-for-the-regularized-learning-algorithms-under-general-sQCaqH7pMV

References

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

Publisher
Unpaywall
ISSN
2297-4687
DOI
10.3389/fams.2017.00003
Publisher site
See Article on Publisher Site

Abstract

ORIGINAL RESEARCH published: 27 March 2017 doi: 10.3389/fams.2017.00003 Optimal Rates for the Regularized Learning Algorithms under General Source Condition Abhishake Rastogi* and Sivananthan Sampath Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India We consider the learning algorithms under general source condition with the polynomial decay of the eigenvalues of the integral operator in vector-valued function setting. We discuss the upper convergence rates of Tikhonov regularizer under general source condition corresponding to increasing monotone index function. The convergence issues are studied for general regularization schemes by using the concept of operator monotone index functions in minimax setting. Further we also address the minimum possible error for any learning algorithm. Keywords: learning theory, general source condition, vector-valued RKHS, error estimate, optimal rates Mathematics Subject Classification 2010: 68T05, 68Q32 Edited by: 1. INTRODUCTION Yiming Ying, University at Albany, SUNY, USA Learning theory [1–3] aims to learn the relation between the inputs and outputs based on finite Reviewed by: random samples. We require some underlying space to search the relation function. From the Xin Guo, experiences we have some idea about the underlying space which is called hypothesis space. The Hong Kong Polytechnic University, Hong Kong Learning algorithms tries to infer the best estimator over the hypothesis space such that f (x) gives Ernesto De Vito, the maximum information of the output variable y for any unseen input x. The given samples University of Genoa, Italy m {x , y } are not exact in the sense that for underlying relation function f (x ) 6= y but f (x ) ≈ y . i i i i i i i=1 *Correspondence: We assume that the uncertainty follows the probability distribution ρ on the sample space X × Y Abhishake Rastogi and the underlying function (called the regression function) for the probability distribution ρ is abhishekrastogi2012@gmail.com given by Specialty section: f (x) = ydρ(y|x), x ∈ X, This article was submitted to Mathematics of Computation and Data Science, where ρ(y|x) is the conditional probability measure for given x. The problem of obtaining a section of the journal estimator from examples is ill-posed. Therefore, we apply the regularization schemes [4–7] to Frontiers in Applied Mathematics and stabilize the problem. Various regularization schemes are studied for inverse problems. In the Statistics context of learning theory [2, 3, 8–10], the square loss-regularization (Tikhonov regularization) Received: 02 November 2016 is widely considered to obtain the regularized estimator [9, 11–16]. Gerfo et al. [6] introduced Accepted: 09 March 2017 general regularization in the learning theory and provided the error bounds under Hölder’s source Published: 27 March 2017 condition [5]. Bauer et al. [4] discussed the convergence issues for general regularization under Citation: general source condition [17] by removing the Lipschitz condition on the regularization considered Rastogi A and Sampath S (2017) in Gerfo et al. [6]. Caponnetto and De Vito [12] discussed the square-loss regularization under the Optimal Rates for the Regularized polynomial decay of the eigenvalues of the integral operator L with Hölder’s source condition. For Learning Algorithms under General the inverse statistical learning problem, Blanchard and Mücke [18] analyzed the convergence rates Source Condition. 
Here we discuss the convergence issues of general regularization schemes under general source condition and the polynomial decay of the eigenvalues of the integral operator in the vector-valued framework. We present the minimax upper convergence rates for Tikhonov regularization under general source condition $\Omega_{\phi,R}$, for a monotone increasing index function $\phi$. For general regularization the minimax rates are obtained using operator monotone index functions $\phi$. The concept of effective dimension [19, 20] is exploited to achieve the convergence rates; in the choice of the regularization parameter the effective dimension plays an important role. We also discuss the lower convergence rates for any learning algorithm under the smoothness conditions. We present the results in the vector-valued function setting, so in particular they can be applied to multi-task learning problems.

The structure of the paper is as follows. In the second section, we introduce some basic assumptions and notations for supervised learning problems. In Section 3, we present the upper and lower convergence rates under the smoothness conditions in the minimax setting.

2. LEARNING FROM EXAMPLES: NOTATIONS AND ASSUMPTIONS

In the learning theory framework [2, 3, 8–10], the sample space $Z = X \times Y$ consists of two spaces: the input space $X$ (a locally compact second countable Hausdorff space) and the output space $(Y, \langle\cdot,\cdot\rangle_Y)$ (a real separable Hilbert space). The input space $X$ and the output space $Y$ are related by an unknown probability distribution $\rho$ on $Z$. The probability measure can be split as $\rho(x,y) = \rho(y|x)\,\rho_X(x)$, where $\rho(y|x)$ is the conditional probability measure of $y$ given $x$ and $\rho_X$ is the marginal probability measure on $X$. The only available information is the random i.i.d. sample $\mathbf{z} = ((x_1,y_1),\ldots,(x_m,y_m))$ drawn according to the probability measure $\rho$. Given the training set $\mathbf{z}$, learning theory aims to develop an algorithm which provides an estimator $f_{\mathbf{z}}: X \to Y$ such that $f_{\mathbf{z}}(x)$ predicts the output variable $y$ for any given input $x$. The goodness of the estimator can be measured by the generalization error of a function $f$,

$$\mathcal{E}(f) := \mathcal{E}_\rho(f) = \int_Z V(f(x),y)\,d\rho(x,y), \qquad (1)$$

where $V: Y \times Y \to \mathbb{R}$ is the loss function. The minimizer of $\mathcal{E}(f)$ for the square loss function $V(f(x),y) = \|f(x)-y\|_Y^2$ is given by

$$f_\rho(x) := \int_Y y\,d\rho(y|x), \qquad (2)$$

where $f_\rho$ is called the regression function. The regression function $f_\rho$ belongs to the space of square integrable functions provided that

$$\int_Z \|y\|_Y^2\,d\rho(x,y) < \infty. \qquad (3)$$

We search for the minimizer of the generalization error over a hypothesis space $\mathcal{H}$,

$$f_{\mathcal{H}} := \operatorname*{argmin}_{f\in\mathcal{H}} \int_Z \|f(x)-y\|_Y^2\,d\rho(x,y), \qquad (4)$$

where $f_{\mathcal{H}}$ is called the target function. In case $f_\rho \in \mathcal{H}$, $f_{\mathcal{H}}$ becomes the regression function $f_\rho$.

Because of the inaccessibility of the probability distribution $\rho$, we minimize the regularized empirical estimate of the generalization error over the hypothesis space $\mathcal{H}$,

$$f_{\mathbf{z},\lambda} := \operatorname*{argmin}_{f\in\mathcal{H}} \left\{ \frac{1}{m}\sum_{i=1}^m \|f(x_i)-y_i\|_Y^2 + \lambda\|f\|_{\mathcal{H}}^2 \right\}, \qquad (5)$$

where $\lambda$ is a positive regularization parameter. Regularization schemes [4–7, 10] are used to incorporate various features in the solution such as boundedness, monotonicity and smoothness. In order to optimize the vector-valued regularization functional, one of the main problems is to choose the appropriate hypothesis space, which is assumed to be a source to provide the estimator.

2.1. Reproducing Kernel Hilbert Space as a Hypothesis Space

Definition 2.1. (Vector-valued reproducing kernel Hilbert space) For a non-empty set $X$ and a real Hilbert space $(Y,\langle\cdot,\cdot\rangle_Y)$, the Hilbert space $(\mathcal{H},\langle\cdot,\cdot\rangle_{\mathcal{H}})$ of functions from $X$ to $Y$ is called a reproducing kernel Hilbert space if for any $x \in X$ and $y \in Y$ the linear functional which maps $f \in \mathcal{H}$ to $\langle y, f(x)\rangle_Y$ is continuous.

By the Riesz lemma [21], for every $x \in X$ and $y \in Y$ there exists a linear operator $K_x: Y \to \mathcal{H}$ such that

$$\langle y, f(x)\rangle_Y = \langle K_x y, f\rangle_{\mathcal{H}}, \quad \forall f \in \mathcal{H}.$$

Therefore, the adjoint operator $K_x^*: \mathcal{H} \to Y$ is given by $K_x^* f = f(x)$. Through the linear operator $K_x: Y \to \mathcal{H}$ we define the linear operator $K(x,t): Y \to Y$,

$$K(x,t)y := (K_t y)(x).$$

From Proposition 2.1 [22], the linear operator $K(x,t) \in \mathcal{L}(Y)$ (the set of bounded linear operators on $Y$), $K(x,t)^* = K(t,x)$, and $K(x,x)$ is a non-negative bounded linear operator. For any $m \in \mathbb{N}$, $\{x_i: 1 \le i \le m\} \subset X$ and $\{y_i: 1 \le i \le m\} \subset Y$, we have $\sum_{i,j=1}^m \langle y_i, K(x_i,x_j)y_j\rangle_Y \ge 0$. The operator-valued function $K: X \times X \to \mathcal{L}(Y)$ is called the kernel.

There is a one-to-one correspondence between the kernels and the reproducing kernel Hilbert spaces [22, 23]. So a reproducing kernel Hilbert space $\mathcal{H}$ corresponding to a kernel $K$ can be denoted as $\mathcal{H}_K$ and its norm as $\|\cdot\|_{\mathcal{H}_K}$. In the following we suppress $K$ by simply writing $\mathcal{H}$ for the reproducing kernel Hilbert space and $\|\cdot\|_{\mathcal{H}}$ for its norm. Throughout the paper we assume that the reproducing kernel Hilbert space $\mathcal{H}$ is separable and that

(i) $K_x: Y \to \mathcal{H}$ is a Hilbert–Schmidt operator for all $x \in X$ and $\kappa^2 := \sup_{x\in X} \operatorname{Tr}(K_x^* K_x) < \infty$;
(ii) the real-valued function on $X \times X$ defined by $(x,t) \mapsto \langle K_t v, K_x w\rangle_{\mathcal{H}}$ is measurable for all $v, w \in Y$.

By the representer theorem [22], the solution of the penalized regularization problem (5) is of the form

$$f_{\mathbf{z},\lambda} = \sum_{i=1}^m K_{x_i} c_i, \quad \text{for } (c_1,\ldots,c_m) \in Y^m.$$

Definition 2.2. Let $H$ be a separable Hilbert space and $\{e_k\}_{k=1}^\infty$ an orthonormal basis of $H$. Then for any positive operator $A \in \mathcal{L}(H)$ we define $\operatorname{Tr}(A) = \sum_{k=1}^\infty \langle Ae_k, e_k\rangle$. It is well known that the number $\operatorname{Tr}(A)$ is independent of the choice of the orthonormal basis.

Definition 2.3. An operator $A \in \mathcal{L}(H)$ is called a Hilbert–Schmidt operator if $\operatorname{Tr}(A^*A) < \infty$. The family of all Hilbert–Schmidt operators is denoted by $\mathcal{L}_2(H)$. For $A \in \mathcal{L}_2(H)$, we define $\operatorname{Tr}(A) = \sum_{k=1}^\infty \langle Ae_k, e_k\rangle$ for an orthonormal basis $\{e_k\}_{k=1}^\infty$ of $H$.

It is well known that $\mathcal{L}_2(H)$ is a separable Hilbert space with the inner product $\langle A, B\rangle_{\mathcal{L}_2(H)} = \operatorname{Tr}(B^*A)$, and its norm satisfies

$$\|A\|_{\mathcal{L}(H)} \le \|A\|_{\mathcal{L}_2(H)} \le \operatorname{Tr}(|A|),$$

where $|A| = \sqrt{A^*A}$ and $\|\cdot\|_{\mathcal{L}(H)}$ is the operator norm (for more details see [24]). For the positive trace class operator $K_xK_x^*$, we have

$$\|K_x^*K_x\|_{\mathcal{L}(\mathcal{H})} \le \|K_x^*K_x\|_{\mathcal{L}_2(\mathcal{H})} \le \operatorname{Tr}(K_x^*K_x) \le \kappa^2.$$

Given the ordered set $\mathbf{x} = (x_1,\ldots,x_m) \in X^m$, the sampling operator $S_{\mathbf{x}}: \mathcal{H} \to Y^m$ is defined by $S_{\mathbf{x}}(f) = (f(x_1),\ldots,f(x_m))$, and its adjoint $S_{\mathbf{x}}^*: Y^m \to \mathcal{H}$ is given by $S_{\mathbf{x}}^*\mathbf{y} = \frac{1}{m}\sum_{i=1}^m K_{x_i} y_i$ for all $\mathbf{y} = (y_1,\ldots,y_m) \in Y^m$. The regularization scheme (5) can be expressed as

$$f_{\mathbf{z},\lambda} = \operatorname*{argmin}_{f\in\mathcal{H}} \left\{ \|S_{\mathbf{x}} f - \mathbf{y}\|_m^2 + \lambda\|f\|_{\mathcal{H}}^2 \right\}, \qquad (6)$$

where $\|\mathbf{y}\|_m^2 = \frac{1}{m}\sum_{i=1}^m \|y_i\|_Y^2$. We obtain the explicit expression of $f_{\mathbf{z},\lambda}$ by taking the functional derivative of the above expression over the RKHS $\mathcal{H}$.

Theorem 2.1. For a positive choice of $\lambda$, the functional (6) has the unique minimizer

$$f_{\mathbf{z},\lambda} = (S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{-1} S_{\mathbf{x}}^* \mathbf{y}. \qquad (7)$$

Define $f_\lambda$ as the minimizer of the optimization functional

$$f_\lambda := \operatorname*{argmin}_{f\in\mathcal{H}} \left\{ \int_Z \|f(x)-y\|_Y^2\,d\rho(x,y) + \lambda\|f\|_{\mathcal{H}}^2 \right\}. \qquad (8)$$

Using the fact that $\mathcal{E}(f) = \|L_K^{1/2}(f - f_{\mathcal{H}})\|_{\mathcal{H}}^2 + \mathcal{E}(f_{\mathcal{H}})$, we get the expression of $f_\lambda$,

$$f_\lambda = (L_K + \lambda I)^{-1} L_K f_{\mathcal{H}}, \qquad (9)$$

where the integral operator $L_K: \mathscr{L}^2_{\rho_X} \to \mathscr{L}^2_{\rho_X}$ is a self-adjoint, non-negative, compact operator defined as

$$L_K(f)(x) := \int_X K(x,t) f(t)\,d\rho_X(t), \quad x \in X.$$

The integral operator $L_K$ can also be defined as a self-adjoint operator on $\mathcal{H}$. We use the same notation $L_K$ for both operators defined on different domains. It is well known that $L_K^{1/2}$ is an isometry from the space of square integrable functions to the reproducing kernel Hilbert space.

In order to achieve uniform convergence rates for learning algorithms we need some prior assumptions on the probability measure $\rho$. Following the notion of Bauer et al. [4] and Caponnetto and De Vito [12], we consider the class of probability measures $\mathcal{P}_\phi$ which satisfy the assumptions:

(i) For the probability measure $\rho$ on $X \times Y$,

$$\int_Z \|y\|_Y^2\,d\rho(x,y) < \infty. \qquad (10)$$

(ii) The minimizer $f_{\mathcal{H}}$ of the generalization error (4) over the hypothesis space $\mathcal{H}$ exists.
(iii) There exist some constants $M, \Sigma$ such that for almost all $x \in X$,

$$\int_Y \left( e^{\|y - f_{\mathcal{H}}(x)\|_Y / M} - \frac{\|y - f_{\mathcal{H}}(x)\|_Y}{M} - 1 \right) d\rho(y|x) \le \frac{\Sigma^2}{2M^2}. \qquad (11)$$

(iv) The target function $f_{\mathcal{H}}$ belongs to the class $\Omega_{\phi,R}$ with

$$\Omega_{\phi,R} := \left\{ f \in \mathcal{H} : f = \phi(L_K) g \text{ and } \|g\|_{\mathcal{H}} \le R \right\}, \qquad (12)$$

where $\phi$ is a continuous increasing index function defined on the interval $[0,\kappa^2]$ with the assumption $\phi(0) = 0$. This condition is usually referred to as the general source condition [17].

In addition, we consider the set of probability measures $\mathcal{P}_{\phi,b}$ which satisfy the conditions (i), (ii), (iii), (iv) and whose eigenvalues $t_n$ of the integral operator $L_K$ follow the polynomial decay: for fixed positive constants $\alpha, \beta$ and $b > 1$,

$$\alpha n^{-b} \le t_n \le \beta n^{-b} \quad \forall n \in \mathbb{N}. \qquad (13)$$

Under the polynomial decay of the eigenvalues, the effective dimension $\mathcal{N}(\lambda)$, which measures the complexity of the RKHS, can be estimated from Proposition 3 [12] as follows,

$$\mathcal{N}(\lambda) := \operatorname{Tr}\left( (L_K + \lambda I)^{-1} L_K \right) \le \frac{\beta b}{b-1}\,\lambda^{-1/b}, \quad \text{for } b > 1, \qquad (14)$$

and without the polynomial decay condition (13) we have

$$\mathcal{N}(\lambda) \le \|(L_K + \lambda I)^{-1}\|_{\mathcal{L}(\mathcal{H})} \operatorname{Tr}(L_K) \le \frac{\kappa^2}{\lambda}.$$
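To make the closed form (7) and the effective dimension (14) concrete, the following is a minimal scalar-valued sketch (so $Y = \mathbb{R}$), assuming a Gaussian kernel and synthetic data that are not part of the paper. By the representer theorem, minimizing (5) over the span of the kernel sections reduces to the linear system $(\mathbf{K} + m\lambda I)c = \mathbf{y}$ for the Gram matrix $\mathbf{K}$, and the eigenvalues of $\mathbf{K}/m$ serve as a plug-in surrogate for those of $L_K$ when estimating $\mathcal{N}(\lambda)$.

```python
import numpy as np

def gaussian_kernel(X1, X2, width=1.0):
    # Gram matrix of the Gaussian kernel; any bounded kernel could be used here.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

def tikhonov_estimator(X, y, lam, width=1.0):
    """Coefficients of f_{z,lambda} = sum_i c_i k(., x_i) for the scheme (5)/(7).

    Minimizing (1/m) sum ||f(x_i) - y_i||^2 + lam ||f||_H^2 over the span of the
    kernel sections leads to the linear system (K + m*lam*I) c = y.
    """
    m = len(y)
    K = gaussian_kernel(X, X, width)
    c = np.linalg.solve(K + m * lam * np.eye(m), y)
    return c, K

def effective_dimension(K, lam):
    """Empirical surrogate of N(lambda) = Tr((L_K + lam I)^{-1} L_K) in (14),
    obtained by replacing the eigenvalues of L_K with those of the normalized
    Gram matrix K/m (an approximation, not an exact computation)."""
    t = np.linalg.eigvalsh(K / len(K))
    t = np.clip(t, 0.0, None)          # guard against tiny negative round-off
    return float(np.sum(t / (t + lam)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 1))
    y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)
    c, K = tikhonov_estimator(X, y, lam=1e-2)
    print("N(lambda) ~", effective_dimension(K, 1e-2))
```

The same surrogate eigenvalues can be used to inspect how fast the decay (13) holds for a given kernel and marginal distribution.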
We discuss the convergence issues of the learning algorithms ($\mathbf{z} \to f_{\mathbf{z}} \in \mathcal{H}$) in a probabilistic sense through exponential tail inequalities of the form

$$\operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z}} - f_\rho\|_\rho \le \varepsilon(m) \log\left(\frac{2}{\eta}\right) \right\} \ge 1 - \eta$$

for all $0 < \eta \le 1$, where $\varepsilon(m)$ is a positive decreasing function of $m$. Using these probabilistic estimates we can obtain error estimates in expectation by integrating the tail inequalities:

$$\mathbb{E}_{\mathbf{z}}\left( \|f_{\mathbf{z}} - f_\rho\|_\rho \right) = \int_0^\infty \operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z}} - f_\rho\|_\rho > t \right\} dt \le \int_0^\infty \exp\left( -\frac{t}{\varepsilon(m)} \right) dt = \varepsilon(m),$$

where $\|f\|_\rho = \|f\|_{\mathscr{L}^2_{\rho_X}} = \left\{ \int_X \|f(x)\|_Y^2\, d\rho_X(x) \right\}^{1/2}$ and $\mathbb{E}_{\mathbf{z}}(\xi) = \int_{Z^m} \xi\, d\rho(z_1)\cdots d\rho(z_m)$.

3. CONVERGENCE ANALYSIS

In this section, we analyze the convergence issues of the learning algorithms on the reproducing kernel Hilbert space under the smoothness priors in the supervised learning framework. We discuss the upper and lower convergence rates for vector-valued estimators in the standard minimax setting. Therefore, the estimates can be utilized in particular for scalar-valued functions and multi-task learning algorithms.

3.1. Upper Rates for Tikhonov Regularization

In general, we consider Tikhonov regularization in learning theory. Tikhonov regularization is briefly discussed in the literature [7, 9, 10, 25]. The error estimates for Tikhonov regularization are discussed theoretically under Hölder's source condition [12, 15, 16]. We establish the error estimates for the Tikhonov regularization scheme under the general source condition $f_{\mathcal{H}} \in \Omega_{\phi,R}$ for some continuous increasing index function $\phi$ and the polynomial decay of the eigenvalues of the integral operator $L_K$.

In order to estimate the error bounds, we consider the following inequality used in the papers [4, 12], which is based on the results of Pinelis and Sakhanenko [26].

Proposition 3.1. Let $\xi$ be a random variable on the probability space $(\Omega, \mathcal{B}, P)$ with values in a real separable Hilbert space $H$. If there exist two constants $Q$ and $S$ satisfying

$$\mathbb{E}\left\{ \|\xi - \mathbb{E}(\xi)\|_H^n \right\} \le \frac{1}{2}\, n!\, S^2 Q^{n-2} \quad \forall n \ge 2, \qquad (15)$$

then for any $0 < \eta < 1$ and for all $m \in \mathbb{N}$,

$$\operatorname{Prob}\left\{ (\omega_1,\ldots,\omega_m) \in \Omega^m : \left\| \frac{1}{m} \sum_{i=1}^m \left[\xi(\omega_i) - \mathbb{E}(\xi(\omega_i))\right] \right\|_H \le 2\left( \frac{Q}{m} + \frac{S}{\sqrt{m}} \right) \log\left( \frac{2}{\eta} \right) \right\} \ge 1 - \eta.$$

In particular, the inequality (15) holds if $\|\xi(\omega)\|_H \le Q$ and $\mathbb{E}(\|\xi(\omega)\|_H^2) \le S^2$.

We estimate the error bounds for the regularized estimators by measuring the effect of random sampling and the complexity of $f_{\mathcal{H}}$. The quantities described in Proposition 3.2 express the probabilistic estimates of the perturbation due to random sampling, while the expressions of Proposition 3.3 describe the complexity of the target function $f_{\mathcal{H}}$, usually referred to as the approximation errors. The approximation errors are independent of the samples $\mathbf{z}$.

Proposition 3.2. Let $\mathbf{z}$ be i.i.d. samples drawn according to a probability measure $\rho$ satisfying the assumptions (10), (11) and $\kappa^2 = \sup_{x\in X} \operatorname{Tr}(K_x^* K_x)$. Then for all $0 < \eta < 1$, with confidence $1 - \eta$, we have

$$\left\| (L_K + \lambda I)^{-1/2} \left\{ S_{\mathbf{x}}^* \mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_{\mathcal{H}} \right\} \right\|_{\mathcal{H}} \le 2\left( \frac{\kappa M}{m\sqrt{\lambda}} + \sqrt{\frac{\Sigma^2 \mathcal{N}(\lambda)}{m}} \right) \log\left( \frac{4}{\eta} \right) \qquad (16)$$

and

$$\|S_{\mathbf{x}}^* S_{\mathbf{x}} - L_K\|_{\mathcal{L}_2(\mathcal{H})} \le 2\left( \frac{\kappa^2}{m} + \frac{\kappa^2}{\sqrt{m}} \right) \log\left( \frac{4}{\eta} \right). \qquad (17)$$

The proof of the first expression is the content of step 3.2 of Theorem 4 [12], while the proof of the second expression can be obtained from Theorem 2 in De Vito et al. [25].

Proposition 3.3. Suppose $f_{\mathcal{H}} \in \Omega_{\phi,R}$. Then:

(i) Under the assumption that $\phi(t)\sqrt{t}$ and $\sqrt{t}/\phi(t)$ are non-decreasing functions, we have

$$\|f_\lambda - f_{\mathcal{H}}\|_\rho \le R\,\phi(\lambda)\sqrt{\lambda}. \qquad (18)$$

(ii) Under the assumption that $\phi(t)$ and $t/\phi(t)$ are non-decreasing functions, we have

$$\|f_\lambda - f_{\mathcal{H}}\|_\rho \le R\,\kappa\,\phi(\lambda) \qquad (19)$$

and

$$\|f_\lambda - f_{\mathcal{H}}\|_{\mathcal{H}} \le R\,\phi(\lambda). \qquad (20)$$

Under the source condition $f_{\mathcal{H}} \in \Omega_{\phi,R}$, the proposition can be proved using the ideas of Theorem 10 [4].

Theorem 3.1. Let $\mathbf{z}$ be i.i.d. samples drawn according to the probability measure $\rho \in \mathcal{P}_\phi$, where $\phi$ is an index function such that $\phi(t)$ and $t/\phi(t)$ are non-decreasing. Then for all $0 < \eta < 1$, with confidence $1 - \eta$, the following upper bound holds for the regularized estimator $f_{\mathbf{z},\lambda}$ (7):

$$\|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_{\mathcal{H}} \le 2\left\{ R\,\phi(\lambda) + \frac{2\kappa M}{m\lambda} + \sqrt{\frac{4\Sigma^2 \mathcal{N}(\lambda)}{m\lambda}} \right\} \log\left( \frac{4}{\eta} \right),$$

provided that

$$\sqrt{m}\,\lambda \ge 8\kappa^2 \log(4/\eta). \qquad (21)$$

Proof. The error of the regularized solution $f_{\mathbf{z},\lambda}$ can be estimated in terms of the sample error and the approximation error as follows:

$$\|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_{\mathcal{H}} \le \|f_{\mathbf{z},\lambda} - f_\lambda\|_{\mathcal{H}} + \|f_\lambda - f_{\mathcal{H}}\|_{\mathcal{H}}. \qquad (22)$$

Now $f_{\mathbf{z},\lambda} - f_\lambda$ can be expressed as

$$f_{\mathbf{z},\lambda} - f_\lambda = (S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{-1} \left\{ S_{\mathbf{x}}^* \mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_\lambda - \lambda f_\lambda \right\}.$$

Then $f_\lambda = (L_K + \lambda I)^{-1} L_K f_{\mathcal{H}}$ implies $L_K f_{\mathcal{H}} = L_K f_\lambda + \lambda f_\lambda$. Therefore,

$$f_{\mathbf{z},\lambda} - f_\lambda = (S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{-1} \left\{ S_{\mathbf{x}}^* \mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_{\mathcal{H}} + (S_{\mathbf{x}}^* S_{\mathbf{x}} - L_K)(f_{\mathcal{H}} - f_\lambda) \right\}.$$

Employing the RKHS-norm we get

$$\|f_{\mathbf{z},\lambda} - f_\lambda\|_{\mathcal{H}} \le \left\| (S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{-1} \left\{ S_{\mathbf{x}}^* \mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_{\mathcal{H}} + (S_{\mathbf{x}}^* S_{\mathbf{x}} - L_K)(f_{\mathcal{H}} - f_\lambda) \right\} \right\|_{\mathcal{H}} \le I_1 I_2 + I_3 \|f_\lambda - f_{\mathcal{H}}\|_{\mathcal{H}}/\lambda, \qquad (23)$$

where $I_1 = \|(S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{-1}(L_K+\lambda I)^{1/2}\|_{\mathcal{L}(\mathcal{H})}$, $I_2 = \|(L_K+\lambda I)^{-1/2}(S_{\mathbf{x}}^*\mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_{\mathcal{H}})\|_{\mathcal{H}}$ and $I_3 = \|S_{\mathbf{x}}^* S_{\mathbf{x}} - L_K\|_{\mathcal{L}(\mathcal{H})}$.

The estimates of $I_2$ and $I_3$ can be obtained from Proposition 3.2, and the only task is to bound $I_1$. For this we consider

$$(S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{-1}(L_K+\lambda I)^{1/2} = \left\{ I - (L_K+\lambda I)^{-1}(L_K - S_{\mathbf{x}}^* S_{\mathbf{x}}) \right\}^{-1}(L_K+\lambda I)^{-1/2},$$

which implies

$$I_1 \le \sum_{n=0}^\infty \left\| (L_K+\lambda I)^{-1}(L_K - S_{\mathbf{x}}^* S_{\mathbf{x}}) \right\|_{\mathcal{L}(\mathcal{H})}^n \left\| (L_K+\lambda I)^{-1/2} \right\|_{\mathcal{L}(\mathcal{H})} \qquad (24)$$

provided that $\|(L_K+\lambda I)^{-1}(L_K - S_{\mathbf{x}}^* S_{\mathbf{x}})\|_{\mathcal{L}(\mathcal{H})} < 1$. To verify this condition, we consider

$$\|(L_K+\lambda I)^{-1}(S_{\mathbf{x}}^* S_{\mathbf{x}} - L_K)\|_{\mathcal{L}(\mathcal{H})} \le I_3/\lambda.$$

Now using Proposition 3.2 we get with confidence $1-\eta/2$,

$$\|(L_K+\lambda I)^{-1}(S_{\mathbf{x}}^* S_{\mathbf{x}} - L_K)\|_{\mathcal{L}(\mathcal{H})} \le \frac{4\kappa^2}{\sqrt{m}\,\lambda}\log\left(\frac{4}{\eta}\right).$$

From the condition (21) we get with confidence $1-\eta/2$,

$$\|(L_K+\lambda I)^{-1}(S_{\mathbf{x}}^* S_{\mathbf{x}} - L_K)\|_{\mathcal{L}(\mathcal{H})} \le \frac{1}{2}. \qquad (25)$$

Consequently, using (25) in the inequality (24) we obtain with probability $1-\eta/2$,

$$I_1 = \|(S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{-1}(L_K+\lambda I)^{1/2}\|_{\mathcal{L}(\mathcal{H})} \le 2\|(L_K+\lambda I)^{-1/2}\|_{\mathcal{L}(\mathcal{H})} \le \frac{2}{\sqrt{\lambda}}. \qquad (26)$$

From Proposition 3.2 we have with confidence $1-\eta/2$,

$$\|S_{\mathbf{x}}^* S_{\mathbf{x}} - L_K\|_{\mathcal{L}(\mathcal{H})} \le 2\left( \frac{\kappa^2}{m} + \frac{\kappa^2}{\sqrt{m}} \right)\log\left(\frac{4}{\eta}\right).$$

Again from the condition (21) we get with probability $1-\eta/2$,

$$I_3 = \|S_{\mathbf{x}}^* S_{\mathbf{x}} - L_K\|_{\mathcal{L}(\mathcal{H})} \le \frac{\lambda}{2}. \qquad (27)$$

Therefore, the inequality (23) together with (16), (20), (26) and (27) provides the desired bound.

The following theorem discusses the error estimates in the $\mathscr{L}^2$-norm. The proof is similar to that of the above theorem.

Theorem 3.2. Let $\mathbf{z}$ be i.i.d. samples drawn according to the probability measure $\rho \in \mathcal{P}_\phi$ and let $f_{\mathbf{z},\lambda}$ be the regularized solution (7) corresponding to Tikhonov regularization. Then for all $0 < \eta < 1$, with confidence $1 - \eta$, the following upper bounds hold:

(i) Under the assumption that $\phi(t)\sqrt{t}$ and $\sqrt{t}/\phi(t)$ are non-decreasing functions,

$$\|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho \le 2\left\{ R\,\phi(\lambda)\sqrt{\lambda} + \frac{2\kappa M}{m\sqrt{\lambda}} + \sqrt{\frac{4\Sigma^2 \mathcal{N}(\lambda)}{m}} \right\} \log\left(\frac{4}{\eta}\right).$$

(ii) Under the assumption that $\phi(t)$ and $t/\phi(t)$ are non-decreasing functions,

$$\|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho \le \left\{ R(\kappa + \sqrt{\lambda})\,\phi(\lambda) + \frac{4\kappa M}{m\sqrt{\lambda}} + \sqrt{\frac{16\Sigma^2 \mathcal{N}(\lambda)}{m}} \right\} \log\left(\frac{4}{\eta}\right),$$

provided that

$$\sqrt{m}\,\lambda \ge 8\kappa^2 \log(4/\eta). \qquad (28)$$

We derive the convergence rates of the Tikhonov regularizer based on a data-driven strategy for the choice of the parameter $\lambda$ over the class of probability measures $\mathcal{P}_{\phi,b}$.
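The bounds of Theorems 3.1 and 3.2 balance an approximation part, which grows with $\lambda$ through $\phi(\lambda)$, against sample-error parts that shrink as $\lambda$ grows. The sketch below, with illustrative constants that are not taken from the paper and a Hölder index function $\phi(t) = t^r$, evaluates the two parts of the Theorem 3.1 bound on a grid of $\lambda$ and compares the balancing value with the closed-form choice appearing later in Corollary 3.1; the two are only expected to agree in order of magnitude.

```python
import numpy as np

# Illustrative constants only (not the paper's data): R, kappa, M, Sigma,
# a Hoelder index function phi(t) = t**r and the decay exponent b from (13).
R, kappa, M, Sigma = 1.0, 1.0, 1.0, 1.0
r, b, m = 0.5, 2.0, 10_000

phi = lambda t: t ** r
N = lambda lam: (b / (b - 1)) * lam ** (-1.0 / b)   # effective-dimension bound (14), beta = 1

def rkhs_bound_parts(lam):
    """Approximation and sample-error parts of the Theorem 3.1 bound
    (up to the common log factor)."""
    approx = R * phi(lam)
    sample = 2 * kappa * M / (m * lam) + np.sqrt(4 * Sigma ** 2 * N(lam) / (m * lam))
    return approx, sample

lams = np.logspace(-6, 0, 200)
totals = [sum(rkhs_bound_parts(l)) for l in lams]
best = lams[int(np.argmin(totals))]
print("balancing lambda ~", best)
print("closed-form choice m^(-b/(2br+b+1)) =", m ** (-b / (2 * b * r + b + 1)))
```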
Theorem 3.3. Under the same assumptions as Theorem 3.2 and the hypothesis (13), the convergence of the estimator $f_{\mathbf{z},\lambda}$ (7) to the target function $f_{\mathcal{H}}$ can be described as follows.

(i) If $\phi(t)\sqrt{t}$ and $\sqrt{t}/\phi(t)$ are non-decreasing functions, then under the parameter choice $\lambda \in (0,1]$, $\lambda = \Psi^{-1}(m^{-1/2})$, where $\Psi(t) = t^{\frac{1}{2}+\frac{1}{2b}}\phi(t)$, we have

$$\operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho \le C\, \left(\Psi^{-1}(m^{-1/2})\right)^{1/2} \phi\!\left(\Psi^{-1}(m^{-1/2})\right) \log\left(\frac{4}{\eta}\right) \right\} \ge 1 - \eta$$

and

$$\lim_{\tau\to\infty} \limsup_{m\to\infty} \sup_{\rho\in\mathcal{P}_{\phi,b}} \operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho > \tau\,\left(\Psi^{-1}(m^{-1/2})\right)^{1/2}\phi\!\left(\Psi^{-1}(m^{-1/2})\right) \right\} = 0.$$

(ii) If $\phi(t)$ and $t/\phi(t)$ are non-decreasing functions, then under the parameter choice $\lambda \in (0,1]$, $\lambda = \Theta^{-1}(m^{-1/2})$, where $\Theta(t) = t^{\frac{1}{2b}}\phi(t)$, we have

$$\operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho \le C'\,\phi\!\left(\Theta^{-1}(m^{-1/2})\right) \log\left(\frac{4}{\eta}\right) \right\} \ge 1 - \eta$$

and

$$\lim_{\tau\to\infty} \limsup_{m\to\infty} \sup_{\rho\in\mathcal{P}_{\phi,b}} \operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho > \tau\,\phi\!\left(\Theta^{-1}(m^{-1/2})\right) \right\} = 0.$$

Proof. (i) Let $\Psi(t) = t^{\frac{1}{2}+\frac{1}{2b}}\phi(t)$. Then $\lim_{t\to 0}\Psi(t)/\sqrt{t} = 0$, so under the parameter choice $\lambda = \Psi^{-1}(m^{-1/2})$ we have $\lim_{m\to\infty}\sqrt{m}\,\lambda = \infty$ and the condition (28) holds for sufficiently large $m$. Moreover, using $m^{-1/2} = \Psi(\lambda) = \lambda^{\frac12+\frac{1}{2b}}\phi(\lambda)$ together with the bound (14) on the effective dimension, for sufficiently large $m$,

$$\frac{1}{m\sqrt{\lambda}} = \lambda^{\frac12+\frac1b}\phi(\lambda)^2 \le \sqrt{\lambda}\,\phi(\lambda) \quad\text{and}\quad \sqrt{\frac{\mathcal{N}(\lambda)}{m}} \le \sqrt{\frac{\beta b}{b-1}}\,\sqrt{\lambda}\,\phi(\lambda).$$

Under the fact $\lambda \le 1$, from Theorem 3.2 and Equation (14) it follows that with confidence $1-\eta$,

$$\|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho \le C\,\left(\Psi^{-1}(m^{-1/2})\right)^{1/2}\phi\!\left(\Psi^{-1}(m^{-1/2})\right)\log\left(\frac{4}{\eta}\right), \qquad (29)$$

where $C = 2R + 4\kappa M + 4\sqrt{\beta b\Sigma^2/(b-1)}$. Now defining $\tau := C\log(4/\eta)$ gives $\eta = \eta_\tau = 4e^{-\tau/C}$, and the estimate (29) can be re-expressed as

$$\operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho > \tau\,\left(\Psi^{-1}(m^{-1/2})\right)^{1/2}\phi\!\left(\Psi^{-1}(m^{-1/2})\right) \right\} \le \eta_\tau. \qquad (30)$$

(ii) Suppose $\Theta(t) = t^{\frac{1}{2b}}\phi(t)$. Under the parameter choice $\lambda \in (0,1]$, $\lambda = \Theta^{-1}(m^{-1/2})$, the condition (28) holds for sufficiently large $m$, and from (28),

$$\frac{1}{m\sqrt{\lambda}} = \frac{\lambda^{\frac{1}{2b}}\phi(\lambda)\sqrt{\lambda}}{\sqrt{m}\,\lambda} \le \frac{\lambda^{\frac{1}{2b}+\frac12}\phi(\lambda)}{8\kappa^2} \le \frac{\phi(\lambda)}{8\kappa^2},$$

while $\sqrt{\mathcal{N}(\lambda)/m} \le \sqrt{\beta b/(b-1)}\,\lambda^{-\frac{1}{2b}}\,m^{-1/2} = \sqrt{\beta b/(b-1)}\,\phi(\lambda)$. From Theorem 3.2 and Equation (14) it follows that with confidence $1-\eta$,

$$\|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho \le C'\,\phi\!\left(\Theta^{-1}(m^{-1/2})\right)\log\left(\frac{4}{\eta}\right), \qquad (31)$$

where $C' = R(\kappa+1) + \frac{M}{2\kappa} + 4\sqrt{\beta b\Sigma^2/(b-1)}$. Now defining $\tau := C'\log(4/\eta)$ gives $\eta = \eta_\tau = 4e^{-\tau/C'}$, and the estimate (31) can be re-expressed as

$$\operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho > \tau\,\phi\!\left(\Theta^{-1}(m^{-1/2})\right) \right\} \le \eta_\tau. \qquad (32)$$

Then from Equations (30) and (32) our conclusions follow.

Theorem 3.4. Under the same assumptions as Theorem 3.1 and hypothesis (13), with the parameter choice $\lambda \in (0,1]$, $\lambda = \Psi^{-1}(m^{-1/2})$, where $\Psi(t) = t^{\frac12+\frac{1}{2b}}\phi(t)$, the convergence of the estimator $f_{\mathbf{z},\lambda}$ (7) to the target function $f_{\mathcal{H}}$ can be described as

$$\operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_{\mathcal{H}} \le C\,\phi\!\left(\Psi^{-1}(m^{-1/2})\right)\log\left(\frac{4}{\eta}\right) \right\} \ge 1 - \eta$$

and

$$\lim_{\tau\to\infty}\limsup_{m\to\infty}\sup_{\rho\in\mathcal{P}_{\phi,b}} \operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_{\mathcal{H}} > \tau\,\phi\!\left(\Psi^{-1}(m^{-1/2})\right) \right\} = 0.$$

The proof of the theorem follows the same steps as that of Theorem 3.3 (i). We obtain the following corollary as a consequence of Theorems 3.3 and 3.4.

Corollary 3.1. Under the same assumptions as Theorems 3.3 and 3.4, for Tikhonov regularization with Hölder's source condition $f_{\mathcal{H}} \in \Omega_{\phi,R}$, $\phi(t) = t^r$, for all $0 < \eta < 1$, with confidence $1 - \eta$: for the parameter choice $\lambda = m^{-\frac{b}{2br+b+1}}$ we have

$$\|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_{\mathcal{H}} \le C\, m^{-\frac{br}{2br+b+1}} \log\left(\frac{4}{\eta}\right) \quad \text{for } 0 \le r \le 1,$$

$$\|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho \le C\, m^{-\frac{2br+b}{4br+2b+2}} \log\left(\frac{4}{\eta}\right) \quad \text{for } 0 \le r \le \frac{1}{2},$$

and for the parameter choice $\lambda = m^{-\frac{b}{2br+1}}$ we have

$$\|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho \le C'\, m^{-\frac{br}{2br+1}} \log\left(\frac{4}{\eta}\right) \quad \text{for } 0 \le r \le 1.$$
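The parameter choices of Theorems 3.3 and 3.4 are defined implicitly through $\Psi(\lambda) = m^{-1/2}$. The following sketch inverts this equation numerically by bisection for a generic increasing $\Psi$ and, for the Hölder case $\phi(t) = t^r$, checks the result against the explicit choice $\lambda = m^{-b/(2br+b+1)}$ of Corollary 3.1; the numeric values of $r$, $b$ and $m$ are assumptions made only for illustration.

```python
import math

def solve_parameter(psi, m, lo=1e-16, hi=1.0, iters=200):
    """Numerically invert the parameter-choice equation psi(lam) = m**(-0.5)
    by bisection; psi is assumed continuous and increasing on (0, 1]."""
    target = m ** -0.5
    for _ in range(iters):
        mid = math.sqrt(lo * hi)          # bisect on the log scale
        if psi(mid) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Hoelder source condition phi(t) = t**r with eigenvalue-decay exponent b,
# using Psi(t) = t**(1/2 + 1/(2b)) * phi(t) as in Theorem 3.4.
r, b, m = 0.5, 2.0, 10_000
psi = lambda t: t ** (0.5 + 0.5 / b + r)
lam_numeric = solve_parameter(psi, m)
lam_closed = m ** (-b / (2 * b * r + b + 1))   # Corollary 3.1
print(lam_numeric, lam_closed)                  # the two should agree closely
```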
3.2. Upper Rates for General Regularization Schemes

Bauer et al. [4] discussed the error estimates for general regularization schemes under general source condition. Here we study the convergence issues for general regularization schemes under general source condition and the polynomial decay of the eigenvalues of the integral operator $L_K$. We define the regularization in the learning theory framework similarly to the way it is considered for ill-posed inverse problems (see Section 3.1 [4]).

Definition 3.1. A family of functions $g_\lambda: [0,\kappa^2] \to \mathbb{R}$, $0 < \lambda \le \kappa^2$, is said to be a regularization if it satisfies the following conditions:

• $\exists D: \sup_{\sigma\in(0,\kappa^2]} |\sigma\, g_\lambda(\sigma)| \le D$;
• $\exists B: \sup_{\sigma\in(0,\kappa^2]} |g_\lambda(\sigma)| \le \frac{B}{\lambda}$;
• $\exists \gamma: \sup_{\sigma\in(0,\kappa^2]} |1 - g_\lambda(\sigma)\sigma| \le \gamma$;
• the maximal $p$ satisfying the condition

$$\sup_{\sigma\in(0,\kappa^2]} |1 - g_\lambda(\sigma)\sigma|\,\sigma^p \le \gamma_p\, \lambda^p$$

is called the qualification of the regularization $g_\lambda$, where $\gamma_p$ does not depend on $\lambda$.

The properties of general regularization are satisfied by a large class of learning algorithms, essentially all the linear regularization schemes. We refer to Section 2.2 [10] for a brief discussion of the regularization schemes. Here we consider the general regularized solution corresponding to the above regularization:

$$f_{\mathbf{z},\lambda} = g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})\, S_{\mathbf{x}}^* \mathbf{y}. \qquad (33)$$

Here we discuss the connection between the qualification of the regularization and the general source condition [17].

Definition 3.2. The qualification $p$ covers the index function $\phi$ if the function $t \mapsto \frac{t^p}{\phi(t)}$ is non-decreasing on $t \in (0,\kappa^2]$.

The following result is a restatement of Proposition 3 [17].

Proposition 3.4. Suppose $\phi$ is a non-decreasing index function and the qualification of the regularization $g_\lambda$ covers $\phi$. Then

$$\sup_{\sigma\in(0,\kappa^2]} |1 - g_\lambda(\sigma)\sigma|\,\phi(\sigma) \le c_g\,\phi(\lambda), \quad c_g = \max(\gamma, \gamma_p).$$

Generally, the index function $\phi$ is not stable with respect to perturbations in the integral operator $L_K$. In practice, we only have access to the perturbed empirical operator $S_{\mathbf{x}}^* S_{\mathbf{x}}$, but the source condition can be expressed in terms of $L_K$ only. So we want to control the difference $\phi(L_K) - \phi(S_{\mathbf{x}}^* S_{\mathbf{x}})$. In order to obtain the error estimates for general regularization, we further restrict the index functions to operator monotone functions, which are defined as follows.

Definition 3.3. A function $\phi_1: [0,d] \to [0,\infty)$ is said to be an operator monotone index function if $\phi_1(0) = 0$ and for every pair of non-negative self-adjoint operators $A, B$ such that $\|A\|, \|B\| \le d$ and $A \le B$ we have $\phi_1(A) \le \phi_1(B)$.

We consider the class of operator monotone index functions:

$$F_\mu = \left\{ \phi_1: [0,\kappa^2] \to [0,\infty) \text{ operator monotone},\ \phi_1(0) = 0,\ \phi_1(\kappa^2) \le \mu \right\}.$$

For the above class of operator monotone functions, from Theorem 1 [4], given $\phi_1 \in F_\mu$ there exists $c_{\phi_1}$ such that

$$\|\phi_1(S_{\mathbf{x}}^* S_{\mathbf{x}}) - \phi_1(L_K)\|_{\mathcal{L}(\mathcal{H})} \le c_{\phi_1}\,\phi_1\!\left( \|S_{\mathbf{x}}^* S_{\mathbf{x}} - L_K\|_{\mathcal{L}(\mathcal{H})} \right).$$

Here we observe that the rate of convergence of $\phi_1(S_{\mathbf{x}}^* S_{\mathbf{x}})$ to $\phi_1(L_K)$ is slower than the convergence rate of $S_{\mathbf{x}}^* S_{\mathbf{x}}$ to $L_K$. Therefore, we consider the following class of index functions:

$$F = \left\{ \phi = \phi_2\,\phi_1 : \phi_1 \in F_\mu,\ \phi_2: [0,\kappa^2] \to [0,\infty) \text{ non-decreasing Lipschitz},\ \phi_2(0) = 0 \right\}.$$

The splitting $\phi = \phi_2\phi_1$ is not unique, so we can take $\phi_2$ as a Lipschitz function with Lipschitz constant 1. Now using Corollary 1.2.2 [27] we get

$$\|\phi_2(S_{\mathbf{x}}^* S_{\mathbf{x}}) - \phi_2(L_K)\|_{\mathcal{L}_2(\mathcal{H})} \le \|S_{\mathbf{x}}^* S_{\mathbf{x}} - L_K\|_{\mathcal{L}_2(\mathcal{H})}.$$

The general source condition $f_{\mathcal{H}} \in \Omega_{\phi,R}$ corresponding to index functions in the class $F$ covers a wide range of source conditions, such as Hölder's source condition $\phi(t) = t^r$ and the logarithmic source condition $\phi(t) = t^p \log^{-\nu}\!\left(\frac{1}{t}\right)$. Following the analysis of Bauer et al. [4] we develop the error estimates of general regularization for the index class $F$ under suitable priors on the probability measure $\rho$.
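Definition 3.1 is satisfied by the standard spectral filters from the inverse-problems literature. The sketch below is an illustration rather than part of the paper's analysis: it evaluates Tikhonov regularization $g_\lambda(\sigma) = 1/(\sigma+\lambda)$, spectral cut-off and Landweber iteration on a grid of spectral values and reports empirical surrogates for the constants $D$, $B$, $\gamma$ and the qualification-one constant $\gamma_1$ of Definition 3.1.

```python
import numpy as np

# Common spectral filters written in the form of Definition 3.1; the printed
# constants are numerical surrogates for D, B, gamma, gamma_1 (illustrative only).
def tikhonov(sigma, lam):
    return 1.0 / (sigma + lam)                      # qualification p = 1

def spectral_cutoff(sigma, lam):
    return np.where(sigma >= lam, 1.0 / np.maximum(sigma, lam), 0.0)

def landweber(sigma, lam, step=1.0):
    k = max(int(1.0 / lam), 1)                      # lam ~ 1/(number of iterations)
    return (1.0 - (1.0 - step * sigma) ** k) / np.maximum(sigma, 1e-15)

kappa2, lam = 1.0, 1e-2
sigma = np.linspace(1e-6, kappa2, 10_000)

for name, g in [("tikhonov", tikhonov), ("cutoff", spectral_cutoff), ("landweber", landweber)]:
    g_vals = g(sigma, lam)
    residual = np.abs(1.0 - sigma * g_vals)
    print(name,
          "D ~", np.max(np.abs(sigma * g_vals)),            # sup |sigma g_lam(sigma)|
          "B ~", np.max(np.abs(g_vals)) * lam,              # sup |g_lam(sigma)| * lam
          "gamma ~", np.max(residual),                      # sup |1 - sigma g_lam(sigma)|
          "gamma_1 ~", np.max(residual * sigma) / lam)      # qualification-1 constant
```

Spectral cut-off has arbitrary qualification, so it covers any index function in the sense of Definition 3.2, while Tikhonov regularization has qualification one.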
Theorem 3.5. Let $\mathbf{z}$ be i.i.d. samples drawn according to the probability measure $\rho \in \mathcal{P}_\phi$. Suppose $f_{\mathbf{z},\lambda}$ is the regularized solution (33) corresponding to general regularization and the qualification of the regularization covers $\phi$. Then for all $0 < \eta < 1$, with confidence $1 - \eta$, the following upper bound holds:

$$\|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_{\mathcal{H}} \le \left\{ R c_g (1 + c_{\phi_1})\,\phi(\lambda) + \frac{4R\mu\gamma\kappa^2}{\sqrt{m}} + \frac{2\sqrt{2}\,\nu_1 \kappa M}{m\lambda} + \sqrt{\frac{8\nu_1^2\Sigma^2\mathcal{N}(\lambda)}{m\lambda}} \right\} \log\left(\frac{4}{\eta}\right),$$

provided that

$$\sqrt{m}\,\lambda \ge 8\kappa^2\log(4/\eta). \qquad (34)$$

Proof. We consider the error expression for the general regularized solution (33),

$$f_{\mathbf{z},\lambda} - f_{\mathcal{H}} = g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^*\mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_{\mathcal{H}}) - r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}}) f_{\mathcal{H}}, \qquad (35)$$

where $r_\lambda(\sigma) = 1 - g_\lambda(\sigma)\sigma$.

Now the first term can be expressed as

$$g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^*\mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_{\mathcal{H}}) = g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{1/2}(S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{-1/2}(L_K + \lambda I)^{1/2}(L_K + \lambda I)^{-1/2}(S_{\mathbf{x}}^*\mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_{\mathcal{H}}).$$

Employing the RKHS-norm we get

$$\|g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^*\mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_{\mathcal{H}})\|_{\mathcal{H}} \le I_2\, I_5\, \|g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{1/2}\|_{\mathcal{L}(\mathcal{H})}, \qquad (36)$$

where $I_2 = \|(L_K + \lambda I)^{-1/2}(S_{\mathbf{x}}^*\mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_{\mathcal{H}})\|_{\mathcal{H}}$ and $I_5 = \|(S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{-1/2}(L_K + \lambda I)^{1/2}\|_{\mathcal{L}(\mathcal{H})}$.

The estimate of $I_2$ can be obtained from the first estimate of Proposition 3.2, and from the second estimate of Proposition 3.2 together with the condition (34) we obtain with probability $1 - \eta/2$,

$$\|(L_K + \lambda I)^{-1/2}(L_K - S_{\mathbf{x}}^* S_{\mathbf{x}})(L_K + \lambda I)^{-1/2}\|_{\mathcal{L}(\mathcal{H})} \le \frac{1}{\lambda}\|S_{\mathbf{x}}^* S_{\mathbf{x}} - L_K\|_{\mathcal{L}(\mathcal{H})} \le \frac{4\kappa^2}{\sqrt{m}\,\lambda}\log\left(\frac{4}{\eta}\right) \le \frac{1}{2},$$

which implies that with confidence $1 - \eta/2$,

$$I_5 = \|(S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{-1/2}(L_K + \lambda I)^{1/2}\|_{\mathcal{L}(\mathcal{H})} = \left\| \left\{ I - (L_K+\lambda I)^{-1/2}(L_K - S_{\mathbf{x}}^* S_{\mathbf{x}})(L_K+\lambda I)^{-1/2} \right\}^{-1} \right\|_{\mathcal{L}(\mathcal{H})}^{1/2} \le \sqrt{2}. \qquad (37)$$

From the properties of the regularization we have

$$\|g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^* S_{\mathbf{x}})^{1/2}\|_{\mathcal{L}(\mathcal{H})} \le \sup_{0<\sigma\le\kappa^2} |g_\lambda(\sigma)\sqrt{\sigma}| \le \sqrt{\Big( \sup_{0<\sigma\le\kappa^2}|g_\lambda(\sigma)\sigma| \Big)\Big( \sup_{0<\sigma\le\kappa^2}|g_\lambda(\sigma)| \Big)} \le \sqrt{\frac{BD}{\lambda}}. \qquad (38)$$

Hence it follows that

$$\|g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{1/2}\|_{\mathcal{L}(\mathcal{H})} \le \sup_{0<\sigma\le\kappa^2} |g_\lambda(\sigma)(\sigma+\lambda)^{1/2}| \le \sup_{0<\sigma\le\kappa^2} |g_\lambda(\sigma)\sqrt{\sigma}| + \sqrt{\lambda}\sup_{0<\sigma\le\kappa^2}|g_\lambda(\sigma)| \le \frac{\nu_1}{\sqrt{\lambda}}, \qquad (39)$$

where $\nu_1 := B + \sqrt{BD}$. Therefore, using (16), (37) and (39) in Equation (36), we conclude that with probability $1 - \eta$,

$$\|g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^*\mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_{\mathcal{H}})\|_{\mathcal{H}} \le 2\sqrt{2}\,\nu_1 \left( \frac{\kappa M}{m\lambda} + \sqrt{\frac{\Sigma^2\mathcal{N}(\lambda)}{m\lambda}} \right) \log\left(\frac{4}{\eta}\right). \qquad (40)$$

Now we consider the second term,

$$r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}}) f_{\mathcal{H}} = r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})\phi(L_K)v = r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})\phi(S_{\mathbf{x}}^* S_{\mathbf{x}})v + r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})\phi_2(S_{\mathbf{x}}^* S_{\mathbf{x}})\left( \phi_1(L_K) - \phi_1(S_{\mathbf{x}}^* S_{\mathbf{x}}) \right)v + r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})\left( \phi_2(L_K) - \phi_2(S_{\mathbf{x}}^* S_{\mathbf{x}}) \right)\phi_1(L_K)v.$$

On applying the RKHS-norm we get

$$\|r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}}) f_{\mathcal{H}}\|_{\mathcal{H}} \le R c_g \phi(\lambda) + R c_g c_{\phi_1}\phi_2(\lambda)\,\phi_1\!\left( \|L_K - S_{\mathbf{x}}^* S_{\mathbf{x}}\|_{\mathcal{L}(\mathcal{H})} \right) + R\mu\gamma\,\|L_K - S_{\mathbf{x}}^* S_{\mathbf{x}}\|_{\mathcal{L}_2(\mathcal{H})}.$$

Here we used the fact that if the qualification of the regularization covers $\phi = \phi_2\phi_1$, then the qualification also covers $\phi_1$ and $\phi_2$ separately. From Equations (17) and (34) we have with probability $1 - \eta/2$,

$$\|S_{\mathbf{x}}^* S_{\mathbf{x}} - L_K\|_{\mathcal{L}(\mathcal{H})} \le \frac{4\kappa^2}{\sqrt{m}}\log\left(\frac{4}{\eta}\right) \le \frac{\lambda}{2}. \qquad (41)$$

Therefore, with probability $1 - \eta/2$,

$$\|r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}}) f_{\mathcal{H}}\|_{\mathcal{H}} \le R c_g (1 + c_{\phi_1})\,\phi(\lambda) + \frac{4R\mu\gamma\kappa^2}{\sqrt{m}}\log\left(\frac{4}{\eta}\right). \qquad (42)$$

Combining the bounds (40) and (42) we get the desired result.

Theorem 3.6. Let $\mathbf{z}$ be i.i.d. samples drawn according to the probability measure $\rho \in \mathcal{P}_\phi$ and let $f_{\mathbf{z},\lambda}$ be the regularized solution (33) corresponding to general regularization. Then for all $0 < \eta < 1$, with confidence $1 - \eta$, the following upper bounds hold:

(i) if the qualification of the regularization covers $\phi$,

$$\|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho \le \left\{ R c_g (1 + c_{\phi_1})(\kappa + \sqrt{\lambda})\,\phi(\lambda) + \frac{4R\mu\gamma\kappa^2(\kappa+\sqrt{\lambda})}{\sqrt{m}} + \frac{2\sqrt{2}\,\nu_2\kappa M}{m\sqrt{\lambda}} + \sqrt{\frac{8\nu_2^2\Sigma^2\mathcal{N}(\lambda)}{m}} \right\} \log\left(\frac{4}{\eta}\right);$$

(ii) if the qualification of the regularization covers $\phi(t)\sqrt{t}$,

$$\|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho \le \left\{ 2R c_g (1 + c_{\phi_1})\,\phi(\lambda)\sqrt{\lambda} + 4R\mu(\gamma + c_g)\kappa^2\sqrt{\frac{\lambda}{m}} + \frac{2\sqrt{2}\,\nu_2\kappa M}{m\sqrt{\lambda}} + \sqrt{\frac{8\nu_2^2\Sigma^2\mathcal{N}(\lambda)}{m}} \right\} \log\left(\frac{4}{\eta}\right),$$

provided that

$$\sqrt{m}\,\lambda \ge 8\kappa^2\log(4/\eta). \qquad (43)$$

Proof. Here we establish the $\mathscr{L}^2$-norm estimate for the error expression

$$f_{\mathbf{z},\lambda} - f_{\mathcal{H}} = g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^*\mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_{\mathcal{H}}) - r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}}) f_{\mathcal{H}}.$$

On applying the $\mathscr{L}^2$-norm to the first term we get

$$\|g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^*\mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_{\mathcal{H}})\|_\rho \le I_2\, I_5\, \|L_K^{1/2} g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{1/2}\|_{\mathcal{L}(\mathcal{H})}, \qquad (44)$$

where $I_2$ and $I_5$ are as above; their estimates follow from Proposition 3.2 and (37), respectively. Now we consider

$$\|L_K^{1/2} g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{1/2}\|_{\mathcal{L}(\mathcal{H})} \le \|L_K^{1/2} - (S_{\mathbf{x}}^* S_{\mathbf{x}})^{1/2}\|_{\mathcal{L}(\mathcal{H})}\, \|g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{1/2}\|_{\mathcal{L}(\mathcal{H})} + \|(S_{\mathbf{x}}^* S_{\mathbf{x}})^{1/2} g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{1/2}\|_{\mathcal{L}(\mathcal{H})}.$$

Since $\phi(t) = \sqrt{t}$ is an operator monotone function, from Equation (41) we get with probability $1 - \eta/2$,

$$\|L_K^{1/2} - (S_{\mathbf{x}}^* S_{\mathbf{x}})^{1/2}\|_{\mathcal{L}(\mathcal{H})} \le \left( \|L_K - S_{\mathbf{x}}^* S_{\mathbf{x}}\|_{\mathcal{L}(\mathcal{H})} \right)^{1/2} \le \sqrt{\lambda}.$$

Then using the properties of the regularization and Equation (38) we conclude that with probability $1 - \eta/2$,

$$\|L_K^{1/2} g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I)^{1/2}\|_{\mathcal{L}(\mathcal{H})} \le \sqrt{\lambda}\sup_{0<\sigma\le\kappa^2}|g_\lambda(\sigma)(\sigma+\lambda)^{1/2}| + \sup_{0<\sigma\le\kappa^2}|g_\lambda(\sigma)(\sigma^2 + \lambda\sigma)^{1/2}| \le \sup_{0<\sigma\le\kappa^2}|g_\lambda(\sigma)\sigma| + \lambda\sup_{0<\sigma\le\kappa^2}|g_\lambda(\sigma)| + 2\sqrt{\lambda}\sup_{0<\sigma\le\kappa^2}|g_\lambda(\sigma)\sqrt{\sigma}| \le B + D + 2\sqrt{BD} =: \nu_2. \qquad (45)$$

From Equation (44) together with Equations (16), (37) and (45) we obtain with probability $1 - \eta$,

$$\|g_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^*\mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_{\mathcal{H}})\|_\rho \le 2\sqrt{2}\,\nu_2\left( \frac{\kappa M}{m\sqrt{\lambda}} + \sqrt{\frac{\Sigma^2\mathcal{N}(\lambda)}{m}} \right)\log\left(\frac{4}{\eta}\right). \qquad (46)$$

The second term can be expressed as

$$\|r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}}) f_{\mathcal{H}}\|_\rho \le \|L_K^{1/2} - (S_{\mathbf{x}}^* S_{\mathbf{x}})^{1/2}\|_{\mathcal{L}(\mathcal{H})}\,\|r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}}) f_{\mathcal{H}}\|_{\mathcal{H}} + \|(S_{\mathbf{x}}^* S_{\mathbf{x}})^{1/2} r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}}) f_{\mathcal{H}}\|_{\mathcal{H}}$$
$$\le \|L_K - S_{\mathbf{x}}^* S_{\mathbf{x}}\|_{\mathcal{L}(\mathcal{H})}^{1/2}\,\|r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}}) f_{\mathcal{H}}\|_{\mathcal{H}} + \|r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^* S_{\mathbf{x}})^{1/2}\phi(S_{\mathbf{x}}^* S_{\mathbf{x}})v\|_{\mathcal{H}} + \|r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^* S_{\mathbf{x}})^{1/2}\phi_2(S_{\mathbf{x}}^* S_{\mathbf{x}})\left(\phi_1(S_{\mathbf{x}}^* S_{\mathbf{x}}) - \phi_1(L_K)\right)v\|_{\mathcal{H}} + \|r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}})(S_{\mathbf{x}}^* S_{\mathbf{x}})^{1/2}\left(\phi_2(S_{\mathbf{x}}^* S_{\mathbf{x}}) - \phi_2(L_K)\right)\phi_1(L_K)v\|_{\mathcal{H}}.$$

Here two cases arise.

Case 1. If the qualification of the regularization covers $\phi$, then using Equation (17) we obtain with probability $1 - \eta/2$,

$$\|r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}}) f_{\mathcal{H}}\|_\rho \le (\kappa + \sqrt{\lambda})\left( R c_g(1 + c_{\phi_1})\,\phi(\lambda) + \frac{4R\mu\gamma\kappa^2}{\sqrt{m}}\log\left(\frac{4}{\eta}\right) \right). \qquad (47)$$

Case 2. If the qualification of the regularization covers $\phi(t)\sqrt{t}$, we get with probability $1 - \eta/2$,

$$\|r_\lambda(S_{\mathbf{x}}^* S_{\mathbf{x}}) f_{\mathcal{H}}\|_\rho \le 2R c_g(1 + c_{\phi_1})\,\phi(\lambda)\sqrt{\lambda} + 4R\mu(\gamma + c_g)\kappa^2\sqrt{\frac{\lambda}{m}}\log\left(\frac{4}{\eta}\right). \qquad (48)$$

Combining the error estimates (46), (47) and (48) we get the desired results.
We discuss the convergence rates of the general regularizer based on the data-driven strategy for the parameter choice of $\lambda$ over the class of probability measures $\mathcal{P}_{\phi,b}$. The proofs of Theorems 3.7 and 3.8 are similar to that of Theorem 3.3.

Theorem 3.7. Under the same assumptions as Theorem 3.5 and hypothesis (13), with the parameter choice $\lambda \in (0,1]$, $\lambda = \Psi^{-1}(m^{-1/2})$ where $\Psi(t) = t^{\frac12+\frac{1}{2b}}\phi(t)$, the convergence of the estimator $f_{\mathbf{z},\lambda}$ (33) to the target function $f_{\mathcal{H}}$ can be described as

$$\operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_{\mathcal{H}} \le C_1\,\phi\!\left(\Psi^{-1}(m^{-1/2})\right)\log\left(\frac{4}{\eta}\right) \right\} \ge 1 - \eta,$$

where $C_1 = R c_g(1 + c_{\phi_1}) + 4R\mu\gamma\kappa^2 + 2\sqrt{2}\,\nu_1\kappa M + \sqrt{8\beta b\,\nu_1^2\Sigma^2/(b-1)}$, and

$$\lim_{\tau\to\infty}\limsup_{m\to\infty}\sup_{\rho\in\mathcal{P}_{\phi,b}} \operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_{\mathcal{H}} > \tau\,\phi\!\left(\Psi^{-1}(m^{-1/2})\right) \right\} = 0.$$

Theorem 3.8. Under the same assumptions as Theorem 3.6 and hypothesis (13), the convergence of the estimator $f_{\mathbf{z},\lambda}$ (33) to the target function $f_{\mathcal{H}}$ can be described as follows.

(i) If the qualification of the regularization covers $\phi$, then under the parameter choice $\lambda \in (0,1]$, $\lambda = \Theta^{-1}(m^{-1/2})$ where $\Theta(t) = t^{\frac{1}{2b}}\phi(t)$, we have

$$\operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho \le C_2\,\phi\!\left(\Theta^{-1}(m^{-1/2})\right)\log\left(\frac{4}{\eta}\right) \right\} \ge 1 - \eta,$$

where $C_2 = R c_g(1 + c_{\phi_1})(\kappa + 1) + 4R\mu\gamma\kappa^2(\kappa+1) + \frac{\nu_2 M}{2\kappa} + \sqrt{8\beta b\,\nu_2^2\Sigma^2/(b-1)}$, and

$$\lim_{\tau\to\infty}\limsup_{m\to\infty}\sup_{\rho\in\mathcal{P}_{\phi,b}} \operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho > \tau\,\phi\!\left(\Theta^{-1}(m^{-1/2})\right) \right\} = 0.$$

(ii) If the qualification of the regularization covers $\phi(t)\sqrt{t}$, then under the parameter choice $\lambda \in (0,1]$, $\lambda = \Psi^{-1}(m^{-1/2})$ where $\Psi(t) = t^{\frac12+\frac{1}{2b}}\phi(t)$, we have

$$\operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho \le C_3\,\left(\Psi^{-1}(m^{-1/2})\right)^{1/2}\phi\!\left(\Psi^{-1}(m^{-1/2})\right)\log\left(\frac{4}{\eta}\right) \right\} \ge 1 - \eta,$$

where $C_3 = 2R c_g(1 + c_{\phi_1}) + 4R\mu(\gamma + c_g)\kappa^2 + 2\sqrt{2}\,\nu_2\kappa M + \sqrt{8\beta b\,\nu_2^2\Sigma^2/(b-1)}$, and

$$\lim_{\tau\to\infty}\limsup_{m\to\infty}\sup_{\rho\in\mathcal{P}_{\phi,b}} \operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho > \tau\,\left(\Psi^{-1}(m^{-1/2})\right)^{1/2}\phi\!\left(\Psi^{-1}(m^{-1/2})\right) \right\} = 0.$$

We obtain the following corollary as a consequence of Theorems 3.7 and 3.8.

Corollary 3.2. Under the same assumptions as Theorems 3.7 and 3.8, for general regularization of qualification $p$ with Hölder's source condition $f_{\mathcal{H}} \in \Omega_{\phi,R}$, $\phi(t) = t^r$, for all $0 < \eta < 1$, with confidence $1 - \eta$: for the parameter choice $\lambda = m^{-\frac{b}{2br+b+1}}$ we have

$$\|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_{\mathcal{H}} \le \tilde{C}\, m^{-\frac{br}{2br+b+1}} \log\left(\frac{4}{\eta}\right) \quad \text{for } 0 \le r \le p,$$

$$\|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho \le \tilde{C}\, m^{-\frac{2br+b}{4br+2b+2}} \log\left(\frac{4}{\eta}\right) \quad \text{for } 0 \le r \le p - \frac{1}{2},$$

and for the parameter choice $\lambda = m^{-\frac{b}{2br+1}}$ we have

$$\|f_{\mathbf{z},\lambda} - f_{\mathcal{H}}\|_\rho \le \tilde{C}'\, m^{-\frac{br}{2br+1}} \log\left(\frac{4}{\eta}\right) \quad \text{for } 0 \le r \le p.$$

Remark 3.1. It is important to observe from Corollaries 3.1 and 3.2 that, using the concept of operator monotonicity of the index function, we are able to achieve the same error estimates for general regularization as for Tikhonov regularization, up to a constant multiple.

Remark 3.2. (Related work) Corollary 3.1 provides the same order of convergence as Theorem 1 [12] for Tikhonov regularization under the Hölder source condition $f_{\mathcal{H}} \in \Omega_{\phi,R}$, $\phi(t) = t^r$, $\frac{1}{2} \le r \le 1$, and the polynomial decay of the eigenvalues (13). Blanchard and Mücke [18] addressed the convergence rates for the inverse statistical learning problem for general regularization under the Hölder source condition with the assumption that the regression function belongs to $\mathcal{H}$. In particular, the upper convergence rates discussed in Blanchard and Mücke [18] agree with Corollary 3.2 for the considered learning problem, which is referred to as the direct learning problem in Blanchard and Mücke [18]. Under the fact $\mathcal{N}(\lambda) \le \frac{\kappa^2}{\lambda}$, from Theorems 3.5 and 3.6 we obtain estimates similar to those of Theorem 10 [4] for general regularization schemes without the polynomial decay condition on the eigenvalues (13).

Remark 3.3. For real-valued functions and multi-task learning algorithms (the output space $Y \subset \mathbb{R}^m$ for some $m \in \mathbb{N}$) we can obtain the error estimates from our analysis without imposing any condition on the conditional probability measure (11) for a bounded output space $Y$.

Remark 3.4. We can address the convergence issues of the binary classification problem [28] using our error estimates, similarly to the discussions in Section 3.3 [4] and Section 5 [16].
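For Hölder source conditions the rates of Corollaries 3.1 and 3.2 are explicit powers of $m$. The short sketch below, assuming a few illustrative values of $b$ and $r$ (not taken from the paper), tabulates the three exponents so the effect of the smoothness $r$ and the eigenvalue-decay exponent $b$ can be read off directly.

```python
from fractions import Fraction

def holder_rate_exponents(b, r):
    """Exponents theta of the m**(-theta) rates in Corollaries 3.1/3.2 for
    phi(t) = t**r and eigenvalue-decay exponent b (illustration only)."""
    b, r = Fraction(b), Fraction(r)
    rkhs = Fraction(b * r, 2 * b * r + b + 1)                    # RKHS-norm rate
    l2_interp = Fraction(2 * b * r + b, 4 * b * r + 2 * b + 2)   # L2 rate, lambda = m^{-b/(2br+b+1)}
    l2_direct = Fraction(b * r, 2 * b * r + 1)                   # L2 rate, lambda = m^{-b/(2br+1)}
    return rkhs, l2_interp, l2_direct

for b, r in [(2, 1), (2, Fraction(1, 2)), (4, 1)]:
    print((b, r), holder_rate_exponents(b, r))
```

Larger $b$ (faster eigenvalue decay) and larger $r$ (smoother target) both push the exponents toward $1/2$, the parametric rate.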
3.3. Lower Rates for General Learning Algorithms

In this section, we discuss the estimates of the minimum possible error over a subclass of the probability measures $\mathcal{P}_{\phi,b}$ parameterized by suitable functions $f \in \mathcal{H}$. Throughout this section we assume that $Y$ is finite-dimensional.

Let $\{v_j\}_{j=1}^d$ be a basis of $Y$ and $f \in \Omega_{\phi,R}$. Then we parameterize the probability measure based on the function $f$,

$$\rho_f(x,y) := \frac{1}{2dL}\sum_{j=1}^d \left( a_j(x)\,\delta_{y + dLv_j} + b_j(x)\,\delta_{y - dLv_j} \right)\nu(x), \qquad (49)$$

where $a_j(x) = L - \langle f, K_x v_j\rangle_{\mathcal{H}}$, $b_j(x) = L + \langle f, K_x v_j\rangle_{\mathcal{H}}$, $L = 4\kappa\,\phi(\kappa^2)R$ and $\delta_\xi$ denotes the Dirac measure with unit mass at $\xi$. It is easy to observe that the marginal distribution of $\rho_f$ over $X$ is $\nu$ and that the regression function for the probability measure $\rho_f$ is $f$ (see Proposition 4 [12]). In addition, for the conditional probability measure $\rho_f(y|x)$ we have

$$\int_Y \left( e^{\|y-f(x)\|_Y/M} - \frac{\|y-f(x)\|_Y}{M} - 1 \right) d\rho_f(y|x) \le \sum_{i=2}^\infty \frac{(dL + \|f(x)\|_Y)^{i-2}}{M^i\, i!}\left( d^2L^2 - \|f(x)\|_Y^2 \right) \le \frac{\Sigma^2}{2M^2},$$

provided that $dL + L/4 \le M$ and $2dL \le \Sigma$. We assume that the eigenvalues of the integral operator $L_K$ follow the polynomial decay (13) for the marginal probability measure $\nu$. Then we conclude that the probability measure $\rho_f$ parameterized by $f$ belongs to the class $\mathcal{P}_{\phi,b}$.

The concepts of information theory such as the Kullback–Leibler information and Fano inequalities (Lemma 3.3 [29]) are the main ingredients in the analysis of lower bounds. In the literature [12, 29], the closeness of probability measures is described through the Kullback–Leibler information: given two probability measures $\rho_1$ and $\rho_2$, it is defined as

$$K(\rho_1, \rho_2) := \int_Z \log(g(z))\, d\rho_1(z),$$

where $g$ is the density of $\rho_1$ with respect to $\rho_2$, that is, $\rho_1(E) = \int_E g(z)\,d\rho_2(z)$ for all measurable sets $E$. Following the analysis of Caponnetto and De Vito [12] and DeVore et al. [29] we establish the lower rates of accuracy that can be attained by any learning algorithm.

To estimate the lower rates of learning algorithms, we generate $N_\varepsilon$ functions belonging to $\Omega_{\phi,R}$ for given $\varepsilon > 0$ such that (53) holds. Then we construct the probability measures $\rho_i \in \mathcal{P}_{\phi,b}$ from Equation (49), parameterized by these functions $f_i$ ($1 \le i \le N_\varepsilon$). On applying Lemma 3.3 [29], we obtain the lower convergence rates using the Kullback–Leibler information.

Theorem 3.9. Let $\mathbf{z}$ be i.i.d. samples drawn according to the probability measure $\rho \in \mathcal{P}_{\phi,b}$ under the hypothesis $\dim(Y) = d < \infty$. Then for any learning algorithm ($\mathbf{z} \to f_{\mathbf{z}} \in \mathcal{H}$) there exists a probability measure $\rho_* \in \mathcal{P}_{\phi,b}$ and $f_{\rho_*} \in \mathcal{H}$ such that for all $0 < \varepsilon < \varepsilon_o$,

$$\operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z}} - f_{\rho_*}\|_{\mathcal{H}} > \varepsilon/2 \right\} \ge \min\left\{ \frac{1}{1 + e^{-\ell_\varepsilon/24}},\ \vartheta\, e^{\frac{\ell_\varepsilon}{48} - \frac{c\,m\,\varepsilon^2}{\ell_\varepsilon^{\,b}}} \right\},$$

where $\vartheta = e^{-3/e}$, $c = \frac{64\beta}{15(b-1)\,d\,L^2}\left( 1 - \frac{1}{2^{\,b-1}} \right)$ and $\ell_\varepsilon = \left\lfloor \frac{1}{2}\left( \frac{\alpha}{\phi^{-1}(\varepsilon/R)} \right)^{1/b} \right\rfloor$.

Proof. For given $\varepsilon > 0$, we define

$$g = \sum_{n=\ell+1}^{2\ell} \frac{\varepsilon\,\sigma^{n-\ell}}{\sqrt{\ell}\,\phi(t_n)}\, e_n,$$

where $\sigma = (\sigma^1,\ldots,\sigma^\ell) \in \{-1,+1\}^\ell$, $t_n$ are the eigenvalues of the integral operator $L_K$, and $e_n$ are the eigenvectors of $L_K$ and an orthonormal basis of the RKHS $\mathcal{H}$. Under the decay condition $\alpha n^{-b} \le t_n$ on the eigenvalues we get

$$\|g\|_{\mathcal{H}}^2 = \sum_{n=\ell+1}^{2\ell} \frac{\varepsilon^2}{\ell\,\phi^2(t_n)} \le \frac{\varepsilon^2}{\phi^2(t_{2\ell})} \le \frac{\varepsilon^2}{\phi^2\!\left( \frac{\alpha}{(2\ell)^b} \right)}.$$

Hence $f = \phi(L_K)g \in \Omega_{\phi,R}$ provided that $\|g\|_{\mathcal{H}} \le R$, or equivalently,

$$\ell \le \frac{1}{2}\left( \frac{\alpha}{\phi^{-1}(\varepsilon/R)} \right)^{1/b}. \qquad (50)$$

For $\ell = \ell_\varepsilon = \left\lfloor \frac{1}{2}\left( \alpha/\phi^{-1}(\varepsilon/R) \right)^{1/b} \right\rfloor$, choose $\varepsilon_o$ such that $\ell_{\varepsilon_o} > 16$. Then according to Proposition 6 [12], for every positive $\varepsilon < \varepsilon_o$ (so that $\ell_\varepsilon > \ell_{\varepsilon_o}$) there exist $N_\varepsilon \in \mathbb{N}$ and $\sigma_1,\ldots,\sigma_{N_\varepsilon} \in \{-1,+1\}^{\ell_\varepsilon}$ such that

$$\sum_{n=1}^{\ell_\varepsilon} (\sigma_i^n - \sigma_j^n)^2 \ge \ell_\varepsilon, \quad \text{for all } 1 \le i,j \le N_\varepsilon,\ i \ne j, \qquad (51)$$

and

$$N_\varepsilon \ge e^{\ell_\varepsilon/24}. \qquad (52)$$

Now we suppose $f_i = \phi(L_K)g_i$ with, for $\varepsilon > 0$,

$$g_i = \sum_{n=\ell_\varepsilon+1}^{2\ell_\varepsilon} \frac{\varepsilon\,\sigma_i^{n-\ell_\varepsilon}}{\sqrt{\ell_\varepsilon}\,\phi(t_n)}\, e_n, \quad \text{for } i = 1,\ldots,N_\varepsilon,$$

where $\sigma_i = (\sigma_i^1,\ldots,\sigma_i^{\ell_\varepsilon}) \in \{-1,+1\}^{\ell_\varepsilon}$. Then from Equation (51) we get

$$\varepsilon \le \|f_i - f_j\|_{\mathcal{H}}, \quad \text{for all } 1 \le i,j \le N_\varepsilon,\ i \ne j. \qquad (53)$$

For $1 \le i,j \le N_\varepsilon$ we have

$$\|f_i - f_j\|_{\mathscr{L}^2_\nu}^2 = \sum_{n=\ell_\varepsilon+1}^{2\ell_\varepsilon} \frac{\beta\,\varepsilon^2\,(\sigma_i^{n-\ell_\varepsilon} - \sigma_j^{n-\ell_\varepsilon})^2}{\ell_\varepsilon\, n^b} \le \sum_{n=\ell_\varepsilon+1}^{2\ell_\varepsilon} \frac{4\beta\varepsilon^2}{\ell_\varepsilon\, n^b} \le \frac{4\beta\varepsilon^2}{\ell_\varepsilon} \int_{\ell_\varepsilon}^{2\ell_\varepsilon} \frac{dx}{x^b} = \frac{c''\,\varepsilon^2}{\ell_\varepsilon^{\,b}}, \qquad (54)$$

where $c'' = \frac{4\beta}{b-1}\left( 1 - \frac{1}{2^{\,b-1}} \right)$. We define the sets

$$A_i = \left\{ \mathbf{z} : \|f_{\mathbf{z}} - f_i\|_{\mathcal{H}} < \frac{\varepsilon}{2} \right\}, \quad \text{for } 1 \le i \le N_\varepsilon.$$

It is clear from Equation (53) that the $A_i$ are disjoint sets. On applying Lemma 3.3 [29] with the probability measures $\rho_{f_i}^m$, $1 \le i \le N_\varepsilon$, we obtain that either

$$p := \max_{1\le i\le N_\varepsilon} \rho_{f_i}^m(A_i) \ge \frac{N_\varepsilon}{N_\varepsilon + 1} \qquad (55)$$

or

$$\min_{1\le j\le N_\varepsilon} \frac{1}{N_\varepsilon}\sum_{i=1, i\ne j}^{N_\varepsilon} K(\rho_{f_i}^m, \rho_{f_j}^m) \ge \Psi_{N_\varepsilon}(p), \qquad (56)$$

where $\Psi_{N}(p) = \log(N) + (1-p)\log\left(\frac{1-p}{p}\right) - p\log\left(\frac{N-p}{p}\right)$. Further,

$$\Psi_{N_\varepsilon}(p) \ge (1-p)\log(N_\varepsilon) + (1-p)\log(1-p) - \log(p) + 2p\log(p) \ge -\log(p) + \log\!\left(\sqrt{N_\varepsilon}\right) - 3/e, \qquad (57)$$

since the minimum value of $x\log(x)$ on $[0,1]$ is $-1/e$. For the joint probability measures $\rho_{f_i}^m$, $\rho_{f_j}^m$ ($\rho_{f_i}, \rho_{f_j} \in \mathcal{P}_{\phi,b}$, $1 \le i,j \le N_\varepsilon$), from Proposition 4 [12] and Equation (54) we get

$$K(\rho_{f_i}^m, \rho_{f_j}^m) = m\,K(\rho_{f_i}, \rho_{f_j}) \le \frac{16m}{15\,d\,L^2}\,\|f_i - f_j\|_{\mathscr{L}^2_\nu}^2 \le \frac{c'\,m\,\varepsilon^2}{\ell_\varepsilon^{\,b}}, \qquad (58)$$

where $c' = 16c''/(15\,d\,L^2)$. Therefore, Equations (55) and (56), together with Equations (57) and (58), imply

$$p = \max_{1\le i\le N_\varepsilon} \operatorname{Prob}_{\mathbf{z}}\left\{ \mathbf{z} : \|f_{\mathbf{z}} - f_i\|_{\mathcal{H}} > \frac{\varepsilon}{2} \right\} \ge \min\left\{ \frac{N_\varepsilon}{N_\varepsilon+1},\ \sqrt{N_\varepsilon}\, e^{-3/e}\, e^{-\frac{c' m\varepsilon^2}{\ell_\varepsilon^{\,b}}} \right\}.$$

From Equation (52), for the probability measure $\rho_*$ attaining this maximum the result follows.

The lower estimates in the $\mathscr{L}^2$-norm can be obtained similarly to the above theorem.

Theorem 3.10. Let $\mathbf{z}$ be i.i.d. samples drawn according to the probability measure $\rho \in \mathcal{P}_{\phi,b}$ under the hypothesis $\dim(Y) = d < \infty$. Then for any learning algorithm ($\mathbf{z} \to f_{\mathbf{z}} \in \mathcal{H}$) there exists a probability measure $\rho_* \in \mathcal{P}_{\phi,b}$ and $f_{\rho_*} \in \mathcal{H}$ such that for all $0 < \varepsilon < \varepsilon_o$,

$$\operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z}} - f_{\rho_*}\|_{\mathscr{L}^2(X,\nu)} > \varepsilon/2 \right\} \ge \min\left\{ \frac{1}{1 + e^{-\ell_\varepsilon/24}},\ \vartheta\, e^{\frac{\ell_\varepsilon}{48} - \frac{64\,m\,\varepsilon^2}{15\,d\,L^2}} \right\},$$

where $\vartheta = e^{-3/e}$, $\ell_\varepsilon = \left\lfloor \frac{1}{2}\left( \frac{\alpha}{\psi^{-1}(\varepsilon/R)} \right)^{1/b} \right\rfloor$ and $\psi(t) = \sqrt{t}\,\phi(t)$.

Theorem 3.11. Under the same assumptions as Theorem 3.10, for $\psi(t) = \sqrt{t}\,\phi(t)$ and $\Psi(t) = t^{\frac12+\frac{1}{2b}}\phi(t)$, the estimator $f_{\mathbf{z}}$ corresponding to any learning algorithm converges to the regression function $f_\rho$ with the following lower rate:

$$\lim_{\tau\to 0}\liminf_{m\to\infty}\inf_{l\in\mathcal{A}}\sup_{\rho\in\mathcal{P}_{\phi,b}} \operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z}}^{\,l} - f_\rho\|_{\mathscr{L}^2(X,\nu)} > \tau\,\psi\!\left(\Psi^{-1}(m^{-1/2})\right) \right\} = 1,$$

where $\mathcal{A}$ denotes the set of all learning algorithms $l: \mathbf{z} \mapsto f_{\mathbf{z}}^{\,l}$.

Proof. Under the condition $\ell_\varepsilon = \left\lfloor \frac12\left(\alpha/\psi^{-1}(\varepsilon/R)\right)^{1/b} \right\rfloor$, from Theorem 3.10 we get

$$\operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z}} - f_\rho\|_{\mathscr{L}^2(X,\nu)} > \frac{\varepsilon}{2} \right\} \ge \min\left\{ \frac{1}{1+e^{-\ell_\varepsilon/24}},\ \vartheta\, \exp\!\left( \frac{1}{48}\cdot\frac{1}{2}\left(\frac{\alpha}{\psi^{-1}(\varepsilon/R)}\right)^{1/b} - \frac{64\,m\,\varepsilon^2}{15\,d\,L^2} \right) \right\}.$$

Choosing $\varepsilon = \tau R\,\psi\!\left(\Psi^{-1}(m^{-1/2})\right)$, we obtain

$$\operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z}} - f_\rho\|_{\mathscr{L}^2(X,\nu)} > \frac{\tau R}{2}\,\psi\!\left(\Psi^{-1}(m^{-1/2})\right) \right\} \ge \min\left\{ \frac{1}{1+e^{-\ell_\varepsilon/24}},\ \vartheta\, e^{\,c\,(\Psi^{-1}(m^{-1/2}))^{-1/b}} \right\},$$

where $c = \frac{1}{48}\left(\frac{\alpha}{2}\right)^{1/b} - \frac{64\,R^2\tau^2}{15\,d\,L^2} > 0$ for $0 < \tau < \min\left\{ \frac{\sqrt{5d}\,L\,(\alpha/2)^{1/(2b)}}{32R},\ 1 \right\}$. Now as $m \to \infty$, $\varepsilon \to 0$ and $\ell_\varepsilon \to \infty$. Therefore, for $c > 0$ we conclude that

$$\liminf_{m\to\infty}\inf_{l\in\mathcal{A}}\sup_{\rho\in\mathcal{P}_{\phi,b}} \operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z}}^{\,l} - f_\rho\|_{\mathscr{L}^2(X,\nu)} > \frac{\tau R}{2}\,\psi\!\left(\Psi^{-1}(m^{-1/2})\right) \right\} = 1.$$

Choosing $\varepsilon = \tau R\,\phi\!\left(\Psi^{-1}(m^{-1/2})\right)$ we get the following convergence rate from Theorem 3.9.

Theorem 3.12. Under the same assumptions as Theorem 3.9, for $\Psi(t) = t^{\frac12+\frac{1}{2b}}\phi(t)$, the estimator $f_{\mathbf{z}}$ corresponding to any learning algorithm converges to the regression function $f_\rho$ with the following lower rate:

$$\lim_{\tau\to 0}\liminf_{m\to\infty}\inf_{l\in\mathcal{A}}\sup_{\rho\in\mathcal{P}_{\phi,b}} \operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z}}^{\,l} - f_\rho\|_{\mathcal{H}} > \tau\,\phi\!\left(\Psi^{-1}(m^{-1/2})\right) \right\} = 1.$$

We obtain the following corollary as a consequence of Theorems 3.11 and 3.12.

Corollary 3.3. For any learning algorithm, under Hölder's source condition $f_\rho \in \Omega_{\phi,R}$, $\phi(t) = t^r$, and the polynomial decay condition (13) for $b > 1$, the lower convergence rates can be described as

$$\lim_{\tau\to 0}\liminf_{m\to\infty}\inf_{l\in\mathcal{A}}\sup_{\rho\in\mathcal{P}_{\phi,b}} \operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z}}^{\,l} - f_\rho\|_{\mathscr{L}^2(X,\nu)} > \tau\, m^{-\frac{2br+b}{4br+2b+2}} \right\} = 1$$

and

$$\lim_{\tau\to 0}\liminf_{m\to\infty}\inf_{l\in\mathcal{A}}\sup_{\rho\in\mathcal{P}_{\phi,b}} \operatorname{Prob}_{\mathbf{z}}\left\{ \|f_{\mathbf{z}}^{\,l} - f_\rho\|_{\mathcal{H}} > \tau\, m^{-\frac{br}{2br+b+1}} \right\} = 1.$$

If the minimax lower rate coincides with the upper convergence rate for $\lambda = \lambda_m$, then the choice of parameter is said to be optimal. For the parameter choice $\lambda = \Psi^{-1}(m^{-1/2})$, Theorem 3.3 and Theorem 3.8 share the upper convergence rate with the lower convergence rate of Theorem 3.11 in the $\mathscr{L}^2$-norm. For the same parameter choice, Theorem 3.4 and Theorem 3.7 share the upper convergence rate with the lower convergence rate of Theorem 3.12 in the RKHS-norm. Therefore, the choice of the parameter is optimal. It is important to observe that we get the same convergence rates for $b = 1$.
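The optimality claim can be checked symbolically for the Hölder case $\phi(t) = t^r$: substituting the parameter choice $\lambda = m^{-b/(2br+b+1)}$ of Corollary 3.1 into $\Psi$, $\phi$ and $\psi$ should reproduce $m^{-1/2}$ and the lower-rate exponents of Corollary 3.3. The sketch below performs this check with sympy; it is a verification aid under these assumptions, not part of the paper's proofs.

```python
import sympy as sp

m, b, r = sp.symbols('m b r', positive=True)
lam = m**(-b / (2*b*r + b + 1))          # parameter choice of Corollary 3.1

phi = lam**r                              # Hoelder index function phi(t) = t**r at t = lam
Psi = lam**(sp.Rational(1, 2) + 1/(2*b)) * phi
psi = sp.sqrt(lam) * phi

# Psi(lam) * sqrt(m) should simplify to 1, confirming lam = Psi^{-1}(m^{-1/2}).
print(sp.simplify(Psi * sp.sqrt(m)))

# Upper rates phi(lam), psi(lam) divided by the lower-rate powers of Corollary 3.3
# should both simplify to 1 (upper and lower exponents coincide).
print(sp.simplify(phi / m**(-b*r / (2*b*r + b + 1))))
print(sp.simplify(psi / m**(-(2*b*r + b) / (4*b*r + 2*b + 2))))
```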
We assume that the conditional probability AUTHOR CONTRIBUTIONS measure ρ(y|x) follows the normal distribution centered at All authors listed, have made substantial, direct and intellectual f and the marginal probability measure ρ = ν. Now H X contribution to the work, and approved it for publication. we can derive the individual lower rates over the considered class of probability measures from the ideas of the literature [12, 30]. ACKNOWLEDGMENTS Theorem 3.14. Let z be i.i.d. samples drawn according to the The authors are grateful to the reviewers for their helpful probability measure P where φ is the index function satisfying comments and pointing out a subtle error that led to improve φ,b r r 1 2 the conditions that φ(t)/t , t /φ(t) are non-decreasing functions the quality of the paper. REFERENCES 11. Abhishake Sivananthan S. Multi-penalty regularization in learning theory. J Complex. (2016) 36:141–65. doi: 10.1016/j.jco.2016.05.003 1. Cucker F, Smale S. On the mathematical foundations of learning. 12. Caponnetto A, De Vito E. Optimal rates for the regularized least-squares Bull Am Math Soc. (2002) 39:1–49. doi: 10.1090/S0273-0979-01- algorithm. Found Comput Math. (2007) 7:331–68. doi: 10.1007/s10208- 00923-5 006-0196-8 2. Evgeniou T, Pontil M, Poggio T. Regularization networks and support vector 13. Smale S, Zhou DX. Estimating the approximation error in learning theory. machines. Adv Comput Math. (2000) 13:1–50. doi: 10.1023/A:1018946025316 Anal Appl. (2003) 1:17–41. doi: 10.1142/S0219530503000089 3. Vapnik VN, Vapnik V. Statistical Learning Theory. New York, NY: Wiley 14. Smale S, Zhou DX. Shannon sampling and function reconstruction from (1998). point values. Bull Am Math Soc. (2004) 41:279–306. doi: 10.1090/S0273-0979- 4. Bauer F, Pereverzev S, Rosasco L. On regularization algorithms in 04-01025-0 learning theory. J Complex. (2007) 23:52–72. doi: 10.1016/j.jco.2006. 15. Smale S, Zhou DX. Shannon sampling II: connections to learning theory. Appl 07.001 Comput Harmon Anal. (2005) 19:285–302. doi: 10.1016/j.acha.2005.03.001 5. Engl HW, Hanke M, Neubauer A. Regularization of Inverse Problems. 16. Smale S, Zhou DX. Learning theory estimates via integral operators and Dordrecht: Kluwer Academic Publishers Group (1996). their approximations. Constr Approx. (2007) 26:153–72. doi: 10.1007/s00365- 6. Gerfo LL, Rosasco L, Odone F, De Vito E, Verri A. Spectral 006-0659-y algorithms for supervised learning. Neural Comput. (2008) 20:1873–97. 17. Mathé P, Pereverzev SV. Geometry of linear ill-posed problems in variable doi: 10.1162/neco.2008.05-07-517 Hilbert scales. Inverse Probl. (2003) 19:789–803. doi: 10.1088/0266-5611/ 7. Tikhonov AN, Arsenin VY. Solutions of Ill-Posed Problems. Washington, DC: 19/3/319 W. H. Winston (1977). 18. Blanchard G, Mücke N. Optimal rates for regularization of statistical inverse 8. Bousquet O, Boucheron S, Lugosi G. Introduction to statistical learning learning problems. arXiv:1604.04054 (2016). theory. In: Bousquet O, von Luxburg U, Ratsch G editors. Advanced Lectures 19. Mendelson S. On the performance of kernel classes. J Mach Learn Res. (2003) on Machine Learning, Volume 3176 of Lecture Notes in Computer Science. 4:759–71. Berlin; Heidelberg: Springer (2004). pp. 169–207. 20. Zhang T. Effective dimension and generalization of kernel learning. In: Thrun 9. Cucker F, Zhou DX. Learning Theory: An Approximation Theory Viewpoint. S, Becker S, Obermayer K. editors. 
22. Micchelli CA, Pontil M. On learning vector-valued functions. Neural Comput. (2005) 17:177–204. doi: 10.1162/0899766052530802
23. Aronszajn N. Theory of reproducing kernels. Trans Am Math Soc. (1950) 68:337–404. doi: 10.1090/S0002-9947-1950-0051437-7
24. Reed M, Simon B. Functional Analysis, Vol. 1. San Diego, CA: Academic Press (1980).
25. De Vito E, Rosasco L, Caponnetto A, De Giovannini U, Odone F. Learning from examples as an inverse problem. J Mach Learn Res. (2005) 6:883–904.
26. Pinelis IF, Sakhanenko AI. Remarks on inequalities for the probabilities of large deviations. Theory Probab Appl. (1985) 30:127–31. doi: 10.1137/1130013
27. Peller VV. Multiple operator integrals in perturbation theory. Bull Math Sci. (2016) 6:15–88. doi: 10.1007/s13373-015-0073-y
28. Boucheron S, Bousquet O, Lugosi G. Theory of classification: a survey of some recent advances. ESAIM Probab Stat. (2005) 9:323–75. doi: 10.1051/ps:2005018
29. DeVore R, Kerkyacharian G, Picard D, Temlyakov V. Approximation methods for supervised learning. Found Comput Math. (2006) 6:3–58. doi: 10.1007/s10208-004-0158-6
30. Györfi L, Kohler M, Krzyżak A, Walk H. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. New York, NY: Springer-Verlag (2002).
31. De Vito E, Pereverzyev S, Rosasco L. Adaptive kernel methods using the balancing principle. Found Comput Math. (2010) 10:455–79. doi: 10.1007/s10208-010-9064-2
32. Bauer F, Reiss M. Regularization independent of the noise level: an analysis of quasi-optimality. Inverse Probl. (2008) 24:055009. doi: 10.1088/0266-5611/24/5/055009
33. Lu S, Pereverzev SV, Tautenhahn U. A model function method in regularized total least squares. Appl Anal. (2010) 89:1693–703. doi: 10.1080/00036811.2010.492502

Conflict of Interest Statement: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Rastogi and Sampath. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
