Junhong Lin, V. Cevher (2018). Optimal Convergence for Distributed Learning with Stochastic Gradient Methods and Spectral-Regularization Algorithms. ArXiv, abs/1801.07226.
G. Blanchard, Nicole Mücke (2016). Optimal Rates for Regularization of Statistical Inverse Learning Problems. Foundations of Computational Mathematics, 18.
P. Mathé, S. Pereverzev (2002). Moduli of Continuity for Operator Valued Functions. Numerical Functional Analysis and Optimization, 23.
Qiang Wu, Yiming Ying, Ding-Xuan Zhou (2006). Learning Rates of Least-Square Regularized Regression. Foundations of Computational Mathematics, 6.
Abhishake Rastogi, Sivananthan Sampath (2016). Optimal Rates for the Regularized Learning Algorithms under General Source Condition. Frontiers Appl. Math. Stat., 3.
L. Gerfo, L. Rosasco, F. Odone, E. Vito, A. Verri (2008). Spectral Algorithms for Supervised Learning. Neural Computation, 20.
Ingo Steinwart, A. Christmann (2008). Support vector machines. Wiley Interdisciplinary Reviews: Computational Statistics, 1.
Tong Zhang (2005). Learning Bounds for Kernel Regression Using Effective Data Dimensionality. Neural Computation, 17.
Z. Szabó, A. Gretton, B. Póczos, Bharath Sriperumbudur (2014). Two-stage sampled learning theory on distributions. arXiv: Statistics Theory.
Ingo Steinwart, D. Hush, C. Scovel (2009). Optimal Rates for Regularized Least Squares Regression.
Y. Yao, L. Rosasco, A. Caponnetto (2007). On Early Stopping in Gradient Descent Learning. Constructive Approximation, 26.
F. Bauer, S. Pereverzyev, L. Rosasco (2007). On regularization algorithms in learning theory. J. Complex., 23.
Junhong Lin, L. Rosasco (2016). Optimal Rates for Multi-pass Stochastic Gradient Methods. J. Mach. Learn. Res., 18.
S. Smale, Ding-Xuan Zhou (2007). Learning Theory Estimates via Integral Operators and Their Approximations. Constructive Approximation, 26.
Alessandro Rudi, R. Camoriano, L. Rosasco (2015). Less is More: Nyström Computational Regularization. arXiv: Machine Learning.
Alessandro Rudi, Guillermo Cañas, L. Rosasco (2013). On the Sample Complexity of Subspace Learning.
I. Pinelis, A. Sakhanenko (1986). Remarks on Inequalities for Large Deviation Probabilities. Theory of Probability and Its Applications, 30.
M. Birman, M. Solomyak (2003). Double Operator Integrals in a Hilbert Space. Integral Equations and Operator Theory, 47.
Lee Dicker, Dean Foster, Daniel Hsu (2017). Kernel ridge vs. principal component regression: Minimax bounds and the qualification of regularization operators. Electronic Journal of Statistics, 11.
J. Ramsay, Bernard Silverman (1997). Functional Data Analysis.
Junhong Lin, V. Cevher (2018). Optimal Convergence for Distributed Learning with Stochastic Gradient Methods and Spectral Algorithms. J. Mach. Learn. Res., 21.
P. Mathé, S. Pereverzyev (2006). Regularization of some linear ill-posed problems with discretized random noisy data. Math. Comput., 75.
J. Fujii, M. Fujii, T. Furuta, Ritsuo Nakamoto (1993). Norm inequalities equivalent to Heinz inequality, 118.
A. Caponnetto, E. Vito (2007). Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics, 7.
H. Engl, M. Hanke, A. Neubauer (1996). Regularization of Inverse Problems.
G. Myleiko, S. Pereverzyev, S. Solodky (2019). Regularized Nyström subsampling in regression and ranking problems under general smoothness assumptions. Analysis and Applications.
Tong Zhang, Bin Yu (2005). Boosting with early stopping: Convergence and consistency. Annals of Statistics, 33.
Alessandro Rudi, R. Camoriano, L. Rosasco (2015). Less is More: Nyström Computational Regularization.
Zheng-Chu Guo, Shaobo Lin, Ding-Xuan Zhou (2017). Learning theory of distributed spectral algorithms. Inverse Problems, 33.
Alessandro Rudi, L. Rosasco (2016). Generalization Properties of Learning with Random Features. ArXiv, abs/1602.04474.
L. Rosasco, S. Villa (2014). Learning with Incremental Iterative Regularization.
Random Design Analysis of Ridge Regression. 25th Annual Conference on Learning Theory.
Shaobo Lin, Xin Guo, Ding-Xuan Zhou (2016). Distributed Learning with Regularized Least Squares. J. Mach. Learn. Res., 18.
F. Cucker, Ding-Xuan Zhou (2007). Learning Theory: An Approximation Theory Viewpoint: Index.
arXiv:1801.06720v4 [stat.ML] 15 Jul 2022

In this paper, we study regression problems over a separable Hilbert space with the square loss, covering non-parametric regression over a reproducing kernel Hilbert space. We investigate a class of spectral/regularized algorithms, including ridge regression, principal component regression, and gradient methods. We prove optimal, high-probability convergence results in terms of variants of norms for the studied algorithms, considering a capacity assumption on the hypothesis space and a general source condition on the target function. Consequently, we obtain almost sure convergence results with optimal rates. Our results improve and generalize previous results, filling a theoretical gap for the non-attainable cases.

Keywords: Learning theory, Reproducing kernel Hilbert space, Sampling operator, Regularization scheme, Regression.

1 Introduction

Let the input space $H$ be a separable Hilbert space with inner product denoted by $\langle\cdot,\cdot\rangle_H$, and let the output space be $\mathbb{R}$. Let $\rho$ be an unknown probability measure on $H\times\mathbb{R}$, $\rho_X(\cdot)$ the induced marginal measure on $H$, and $\rho(\cdot|x)$ the conditional probability measure on $\mathbb{R}$ with respect to $x\in H$ and $\rho$. Let the hypothesis space be $H_\rho = \{f: H\to\mathbb{R} \mid \exists\,\omega\in H \text{ with } f(x) = \langle\omega,x\rangle_H,\ \rho_X\text{-almost surely}\}$. The goal of least-squares regression is to approximately solve the following expected risk minimization,

$$\inf_{f\in H_\rho}\mathcal{E}(f),\qquad \mathcal{E}(f) = \int_{H\times\mathbb{R}} (f(x)-y)^2\, d\rho(x,y), \tag{1}$$

where the measure $\rho$ is known only through a sample $\mathbf{z} = \{z_i = (x_i,y_i)\}_{i=1}^n$ of size $n\in\mathbb{N}$, independently and identically distributed according to $\rho$. Let $L^2_{\rho_X}$ be the Hilbert space of square integrable functions from $H$ to $\mathbb{R}$ with respect to $\rho_X$, with its norm given by $\|f\|_\rho = \left(\int_H |f(x)|^2\, d\rho_X\right)^{1/2}$. The function that minimizes the expected risk over all measurable functions is the regression function [6, 27], defined as

$$f_\rho(x) = \int_{\mathbb{R}} y\, d\rho(y|x),\qquad x\in H,\ \rho_X\text{-almost every}. \tag{2}$$

Throughout this paper, we assume that the support of $\rho_X$ is compact and that there exists a constant $\kappa\in[1,\infty[$ such that

$$\langle x,x'\rangle_H \le \kappa^2,\qquad \forall x,x'\in H,\ \rho_X\text{-almost every}. \tag{3}$$

Under this assumption, $H_\rho$ is a subspace of $L^2_{\rho_X}$, and a solution $f_H$ for (1) is the projection of the regression function $f_\rho$ onto the closure of $H_\rho$ in $L^2_{\rho_X}$. See e.g., [14, 1], or Section 2 for further details.

The above problem was raised for non-parametric regression with kernel methods [6, 27] and it is closely related to functional regression [20]. A common and classic approach for the above problem is based on spectral algorithms. It amounts to solving an empirical linear equation, where a filter function is used for regularization in order to avoid over-fitting and to ensure good performance, see e.g., [1, 10]. Such approaches include ridge regression, principal component regression, gradient methods and iterated ridge regression.

A large amount of research has been carried out for spectral algorithms within the setting of learning with kernel methods, see e.g., [26, 5] for Tikhonov regularization, [33, 31] for gradient methods, and [4, 1] for general spectral algorithms. Statistical results have been developed in these references, but they are still not satisfactory. For example, most of the previous results are restricted either to the universally consistent case (i.e., $H_\rho$ is dense in $L^2_{\rho_X}$) [26, 31, 4] or to the attainable case (i.e., $f_H\in H_\rho$) [5, 1]. Also, some of these results require the unnatural assumption that the sample size is large enough, and the derived convergence rates tend to be (capacity-dependently) suboptimal in the non-attainable cases. Finally, it is still unclear whether one can derive capacity-dependently optimal convergence rates for spectral algorithms under a general source assumption.

In this paper, we study statistical results for spectral algorithms.
Considering a capacity assumption on the space $H$ [32, 5], and a general source condition [1] on the target function $f_H$, we show high-probability, optimal convergence results in terms of variants of norms for spectral algorithms. As a corollary, we obtain almost sure convergence results with optimal rates. The general source condition is used to characterize the regularity/smoothness of the target function $f_H$ in $L^2_{\rho_X}$, rather than in $H_\rho$ as in [5, 1]. The derived convergence rates are optimal in a minimax sense. Our results not only resolve the issues mentioned in the last paragraph but also generalize previous results to convergence results with different norms and consider a more general source condition.

2 Learning with Kernel Methods and Notations

In this section, we first introduce supervised learning with kernel methods, which is a special instance of the learning setting considered in this paper. We then introduce some useful notations and auxiliary operators.

Learning with Kernel Methods. Let $\Xi$ be a closed subset of the Euclidean space $\mathbb{R}^d$. Let $\mu$ be an unknown but fixed Borel probability measure on $\Xi\times Y$. Assume that $\{(\xi_i,y_i)\}_{i=1}^n$ are i.i.d. from the distribution $\mu$. A reproducing kernel $K$ is a symmetric function $K:\Xi\times\Xi\to\mathbb{R}$ such that $(K(u_i,u_j))_{i,j=1}^{\ell}$ is positive semidefinite for any finite set of points $\{u_i\}_{i=1}^{\ell}$ in $\Xi$. The kernel $K$ defines a reproducing kernel Hilbert space (RKHS) $(H_K,\|\cdot\|_K)$ as the completion of the linear span of the set $\{K_\xi(\cdot) := K(\xi,\cdot): \xi\in\Xi\}$ with respect to the inner product $\langle K_\xi,K_u\rangle_K := K(\xi,u)$. For any $f\in H_K$, the reproducing property holds: $f(\xi) = \langle K_\xi,f\rangle_K$. In learning with kernel methods, one considers the following minimization problem

$$\inf_{f\in H_K}\int_{\Xi\times\mathbb{R}} (f(\xi)-y)^2\, d\mu(\xi,y).$$

Since $f(\xi) = \langle K_\xi,f\rangle_K$ by the reproducing property, the above can be rewritten as

$$\inf_{f\in H_K}\int_{\Xi\times\mathbb{R}} (\langle f,K_\xi\rangle_K-y)^2\, d\mu(\xi,y).$$

Defining another probability measure $\rho(K_\xi,y) = \mu(\xi,y)$, the above reduces to (1).
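The kernel setting above can be made concrete with a small numerical sketch. The Gaussian kernel, the bandwidth, and the random points below are illustrative assumptions and are not taken from the paper. The sketch builds a function $f=\sum_i c_i K_{u_i}$ in the span of kernel sections, computes $\|f\|_K^2 = c^\top G c$ from the Gram matrix $G$, and evaluates $f(\xi)$, which by the reproducing property equals $\langle K_\xi, f\rangle_K$.

```python
import math
import random

def gaussian_kernel(u, v, bandwidth=1.0):
    """Gaussian kernel K(u, v) = exp(-|u - v|^2 / (2 * bandwidth^2)) (illustrative choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def gram_matrix(points, kernel):
    """Matrix (K(u_i, u_j))_{i,j} for a finite set of points."""
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(gram, coeffs):
    """c^T G c, i.e. the squared RKHS norm of f = sum_i c_i K_{u_i}."""
    n = len(coeffs)
    return sum(coeffs[i] * gram[i][j] * coeffs[j] for i in range(n) for j in range(n))

def evaluate(points, coeffs, kernel, xi):
    """f(xi) for f = sum_i c_i K_{u_i}; equals <K_xi, f>_K by the reproducing property."""
    return sum(c * kernel(u, xi) for c, u in zip(coeffs, points))

random.seed(0)
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(6)]
coeffs = [random.gauss(0, 1) for _ in points]
gram = gram_matrix(points, gaussian_kernel)
norm_sq = quadratic_form(gram, coeffs)
value_at_origin = evaluate(points, coeffs, gaussian_kernel, (0.0, 0.0))
```

Positive semidefiniteness shows up as `norm_sq >= 0`, and the Cauchy-Schwarz inequality gives $|f(\xi)| \le \sqrt{K(\xi,\xi)}\,\|f\|_K$, which is the pointwise control that makes the reduction to (1) work with bounded inputs.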
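Notations and Auxiliary Operators. We next introduce some notations and auxiliary operators which will be useful in the following. For a given bounded operator $L: L^2_{\rho_X}\to H$, $\|L\|$ denotes the operator norm of $L$, i.e., $\|L\| = \sup_{f\in L^2_{\rho_X},\,\|f\|_\rho=1}\|Lf\|_H$.

Let $\mathcal{S}_\rho: H\to L^2_{\rho_X}$ be the linear map $\omega\mapsto\langle\omega,\cdot\rangle_H$, which is bounded by $\kappa$ under Assumption (3). Furthermore, we consider the adjoint operator $\mathcal{S}_\rho^*: L^2_{\rho_X}\to H$, the covariance operator $\mathcal{T}: H\to H$ given by $\mathcal{T} = \mathcal{S}_\rho^*\mathcal{S}_\rho$, and the operator $\mathcal{L}: L^2_{\rho_X}\to L^2_{\rho_X}$ given by $\mathcal{L} = \mathcal{S}_\rho\mathcal{S}_\rho^*$. It can be easily proved that $\mathcal{S}_\rho^* g = \int_H x\,g(x)\,d\rho_X(x)$, $\mathcal{L}f = \int_H f(x)\langle x,\cdot\rangle_H\,d\rho_X(x)$ and $\mathcal{T} = \int_H \langle\cdot,x\rangle_H\,x\,d\rho_X(x)$. Under Assumption (3), the operators $\mathcal{T}$ and $\mathcal{L}$ can be proved to be positive trace class operators (and hence compact):

$$\|\mathcal{L}\| = \|\mathcal{T}\| \le \operatorname{tr}(\mathcal{T}) = \int_H \operatorname{tr}(x\otimes x)\,d\rho_X(x) = \int_H \|x\|_H^2\,d\rho_X(x) \le \kappa^2. \tag{4}$$

For any $\omega\in H$, it is easy to prove the following isometry property [27]:

$$\|\mathcal{S}_\rho\omega\|_\rho = \|\sqrt{\mathcal{T}}\,\omega\|_H. \tag{5}$$

Moreover, according to the spectral theorem,

$$\|\mathcal{L}^{-1/2}\mathcal{S}_\rho\omega\|_\rho \le \|\omega\|_H. \tag{6}$$

We define the sampling operator $\mathcal{S}_{\mathbf{x}}: H\to\mathbb{R}^n$ by $(\mathcal{S}_{\mathbf{x}}\omega)_i = \langle\omega,x_i\rangle_H$, $i\in[n]$, where the norm $\|\cdot\|_{\mathbb{R}^n}$ in $\mathbb{R}^n$ is the Euclidean norm times $1/\sqrt{n}$. Its adjoint operator $\mathcal{S}_{\mathbf{x}}^*:\mathbb{R}^n\to H$, defined by $\langle\mathcal{S}_{\mathbf{x}}^*\mathbf{y},\omega\rangle_H = \langle\mathbf{y},\mathcal{S}_{\mathbf{x}}\omega\rangle_{\mathbb{R}^n}$ for $\mathbf{y}\in\mathbb{R}^n$, is thus given by $\mathcal{S}_{\mathbf{x}}^*\mathbf{y} = \frac{1}{n}\sum_{i=1}^n y_i x_i$. Moreover, we can define the empirical covariance operator $\mathcal{T}_{\mathbf{x}}: H\to H$ such that $\mathcal{T}_{\mathbf{x}} = \mathcal{S}_{\mathbf{x}}^*\mathcal{S}_{\mathbf{x}}$. Obviously,

$$\mathcal{T}_{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^n \langle\cdot,x_i\rangle_H\,x_i.$$

By Assumption (3), similar to (4), we have

$$\|\mathcal{T}_{\mathbf{x}}\| \le \operatorname{tr}(\mathcal{T}_{\mathbf{x}}) \le \kappa^2. \tag{7}$$

A simple calculation shows that [6, 27] for all $f\in L^2_{\rho_X}$,

$$\mathcal{E}(f)-\mathcal{E}(f_\rho) = \|f-f_\rho\|_\rho^2.$$

Then it is easy to see that (1) is equivalent to $\inf_{f\in H_\rho}\|f-f_\rho\|_\rho^2$. Using the projection theorem, one can prove that a solution $f_H$ for the above problem is the projection of the regression function $f_\rho$ onto the closure of $H_\rho$ in $L^2_{\rho_X}$, and moreover, for all $f\in H_\rho$ (see e.g., [14]),

$$\mathcal{S}_\rho^* f_\rho = \mathcal{S}_\rho^* f_H, \tag{8}$$

and

$$\mathcal{E}(f)-\mathcal{E}(f_H) = \|f-f_H\|_\rho^2. \tag{9}$$

3 Spectral/Regularized Algorithms

In this section, we motivate and introduce spectral algorithms. The search for an approximate solution in $H_\rho$ for Problem (1) is equivalent to the search for an approximate solution in $H$ for

$$\inf_{\omega\in H}\widetilde{\mathcal{E}}(\omega),\qquad \widetilde{\mathcal{E}}(\omega) = \int_{H\times\mathbb{R}} (\langle\omega,x\rangle_H-y)^2\,d\rho(x,y). \tag{10}$$

As the expected risk $\widetilde{\mathcal{E}}(\omega)$ cannot be computed exactly and can only be approximated through the empirical risk $\widetilde{\mathcal{E}}_{\mathbf{z}}(\omega)$, defined as

$$\widetilde{\mathcal{E}}_{\mathbf{z}}(\omega) = \frac{1}{n}\sum_{i=1}^n (\langle\omega,x_i\rangle_H-y_i)^2,$$

a first idea to deal with the problem is to replace the objective function in (10) with the empirical risk, which leads to an estimator $\hat\omega$ satisfying the empirical linear equation

$$\mathcal{T}_{\mathbf{x}}\hat\omega = \mathcal{S}_{\mathbf{x}}^*\mathbf{y}.$$

However, solving the empirical linear equation directly may lead to a solution that fits the sample points very well but has a large expected risk. This is called the overfitting phenomenon in statistical learning theory. Moreover, the inverse of the empirical covariance operator $\mathcal{T}_{\mathbf{x}}$ does not exist in general. To tackle this issue, a common approach in statistical learning theory and inverse problems is to replace $\mathcal{T}_{\mathbf{x}}^{-1}$ with an alternative, regularized one, which leads to spectral algorithms [8, 4, 1].

A spectral algorithm is generated by a specific choice of filter function. Recall that the definition of filter functions is given as follows.

Definition 3.1 (Filter functions). Let $\Lambda$ be a subset of $\mathbb{R}_+$. A class of functions $\{\mathcal{G}_\lambda:[0,\kappa^2]\to[0,\infty[,\ \lambda\in\Lambda\}$ is said to be filter functions with qualification $\tau$ ($\tau\ge 1$) if there exist some positive constants $E, F_\tau<\infty$ such that

$$\sup_{\alpha\in[0,1]}\ \sup_{\lambda\in\Lambda}\ \sup_{u\in]0,\kappa^2]} |u^\alpha\mathcal{G}_\lambda(u)|\,\lambda^{1-\alpha} \le E, \tag{11}$$

and

$$\sup_{\alpha\in[0,\tau]}\ \sup_{\lambda\in\Lambda}\ \sup_{u\in]0,\kappa^2]} |(1-\mathcal{G}_\lambda(u)u)|\,u^\alpha\lambda^{-\alpha} \le F_\tau. \tag{12}$$

Given a filter function $\mathcal{G}_\lambda$, the spectral algorithm is defined as follows.

Algorithm 1. Let $\mathcal{G}_\lambda$ be a filter function indexed with $\lambda>0$. The spectral algorithm over the samples $\mathbf{z}$ is given by

$$\omega_\lambda^{\mathbf{z}} = \mathcal{G}_\lambda(\mathcal{T}_{\mathbf{x}})\,\mathcal{S}_{\mathbf{x}}^*\mathbf{y}, \tag{13}$$

and

$$f_\lambda^{\mathbf{z}} = \mathcal{S}_\rho\,\omega_\lambda^{\mathbf{z}}. \tag{14}$$

Let $L$ be a self-adjoint, compact operator over a separable Hilbert space $H$.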
$\mathcal{G}_\lambda(L)$ is an operator on $H$ defined by spectral calculus: suppose that $\{(\sigma_i,\psi_i)\}_i$ is a set of normalized eigenpairs of $L$ with the eigenfunctions $\{\psi_i\}_i$ forming an orthonormal basis of $H$, then

$$\mathcal{G}_\lambda(L) = \sum_i \mathcal{G}_\lambda(\sigma_i)\,\psi_i\otimes\psi_i.$$

Different filter functions correspond to different regularization algorithms. The following examples provide several specific choices of filter functions, which lead to different types of regularization methods, see e.g. [10, 1, 26].

Example 3.1 (Spectral cut-off). Consider the spectral cut-off or truncated singular value decomposition (TSVD) defined by

$$\mathcal{G}_\lambda(u) = \begin{cases} u^{-1}, & \text{if } u\ge\lambda,\\ 0, & \text{if } u<\lambda.\end{cases}$$

Then the qualification $\tau$ could be any positive number and $E = F_\tau = 1$.

Example 3.2 (Gradient methods). The choice $\mathcal{G}_\lambda(u) = \sum_{k=1}^{t}\eta(1-\eta u)^{t-k}$ with $\eta\in]0,\kappa^{-2}]$, where we identify $\lambda = (\eta t)^{-1}$, corresponds to gradient methods or the Landweber iteration algorithm. The qualification $\tau$ could be any positive number, $E = 1$, and $F_\tau = (\tau/e)^\tau$.

Example 3.3 ((Iterated) ridge regression). Let $l\in\mathbb{N}$. Consider the function

$$\mathcal{G}_\lambda(u) = \sum_{i=1}^{l}\lambda^{i-1}(\lambda+u)^{-i} = \frac{1}{u}\left(1-\frac{\lambda^l}{(\lambda+u)^l}\right).$$

It is easy to show that the qualification $\tau = l$, $E = l$ and $F_\tau = 1$. In the case that $l = 1$, the algorithm is ridge regression.

The performance of spectral algorithms can be measured in terms of the excess risk, $\mathcal{E}(f_\lambda^{\mathbf{z}})-\inf_{H_\rho}\mathcal{E}$, which is exactly $\|f_\lambda^{\mathbf{z}}-f_H\|_\rho^2$ according to (9). Assuming that $f_H\in H_\rho$, which implies that there exists some $\omega_*$ such that $f_H = \mathcal{S}_\rho\omega_*$ (in this case, the solution with minimal $H$-norm for $f_H = \mathcal{S}_\rho\omega$ is denoted by $\omega_H$), it can be measured in terms of the $H$-norm, $\|\omega_\lambda^{\mathbf{z}}-\omega_H\|_H$, which is closely related to $\|\mathcal{L}^{-1/2}\mathcal{S}_\rho(\omega_\lambda^{\mathbf{z}}-\omega_H)\|_\rho = \|\mathcal{L}^{-1/2}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho$ according to (6). In what follows, we will measure the performance of spectral algorithms in terms of a broader class of norms, $\|\mathcal{L}^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho$, where $a\in[0,\frac12]$ is such that $\mathcal{L}^{-a}f_H$ is well defined. Throughout this paper, we assume that $1/n\le\lambda\le 1$.
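To illustrate Algorithm 1 and the three example filters, here is a minimal sketch in the simplest possible setting, $H=\mathbb{R}$, where $\mathcal{T}_{\mathbf{x}}$ and $\mathcal{S}_{\mathbf{x}}^*\mathbf{y}$ are scalars and spectral calculus reduces to evaluating the filter at one number. The sample values, step size, and function names are hypothetical and chosen only for illustration. The sketch also checks that the gradient-method filter reproduces plain gradient descent on the empirical risk started at zero.

```python
def cutoff_filter(u, lam):
    """Spectral cut-off (Example 3.1): invert u only when u >= lam."""
    return 1.0 / u if u >= lam else 0.0

def landweber_filter(u, eta, t):
    """Gradient-method filter (Example 3.2): sum_{k=1}^t eta * (1 - eta*u)^(t-k)."""
    return sum(eta * (1.0 - eta * u) ** (t - k) for k in range(1, t + 1))

def iterated_ridge_filter(u, lam, l=1):
    """Iterated ridge filter (Example 3.3): sum_{i=1}^l lam^(i-1) * (lam + u)^(-i)."""
    return sum(lam ** (i - 1) * (lam + u) ** (-i) for i in range(1, l + 1))

def gradient_descent_scalar(t_x, b, eta, t):
    """t steps of omega <- omega - eta * (t_x * omega - b), starting from omega = 0."""
    omega = 0.0
    for _ in range(t):
        omega -= eta * (t_x * omega - b)
    return omega

# Hypothetical one-dimensional sample with |x_i| <= 1, so kappa^2 = 1 works.
xs = [0.9, -0.4, 0.7, 0.2, -0.8]
ys = [1.7, -0.9, 1.5, 0.3, -1.5]
n = len(xs)
t_x = sum(x * x for x in xs) / n             # empirical covariance T_x (a scalar here)
b = sum(x * y for x, y in zip(xs, ys)) / n   # S_x^* y (a scalar here)

eta, steps = 0.5, 20                         # eta <= kappa^{-2}; lambda = 1 / (eta * steps)
omega_filter = landweber_filter(t_x, eta, steps) * b   # formula (13) with the Landweber filter
omega_gd = gradient_descent_scalar(t_x, b, eta, steps)

lam = 0.1
omega_ridge = iterated_ridge_filter(t_x, lam, l=1) * b  # ridge regression: b / (t_x + lam)
omega_cutoff = cutoff_filter(t_x, lam) * b              # unregularized solution when t_x >= lam
```

In higher dimensions the same formulas apply eigenvalue by eigenvalue to $\mathcal{T}_{\mathbf{x}}$; the scalar case is used here only so that the sketch needs no linear-algebra library.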
4 Convergence Results

In this section, we first introduce some basic assumptions and then present convergence results for spectral algorithms.

4.1 Assumptions

The first assumption relates to a moment condition on the output value $y$.

Assumption 1. There exist positive constants $Q$ and $M$ such that for all $l\ge 2$ with $l\in\mathbb{N}$,

$$\int_{\mathbb{R}} |y|^l\,d\rho(y|x) \le \frac12\,l!\,M^{l-2}Q^2, \tag{15}$$

$\rho_X$-almost surely.

The above assumption is very standard in statistical learning theory. It is satisfied if $y$ is bounded almost surely, or if $y = \langle\omega_*,x\rangle_H+\epsilon$, where $\epsilon$ is a Gaussian random variable with zero mean that is independent from $x$. Obviously, Assumption 1 implies that the regression function $f_\rho$ is bounded almost surely, as

$$|f_\rho(x)| \le \int_{\mathbb{R}} |y|\,d\rho(y|x) \le \left(\int_{\mathbb{R}} |y|^2\,d\rho(y|x)\right)^{1/2} \le Q. \tag{16}$$

The next assumption relates to the regularity/smoothness of the target function $f_H$. As $f_H\in\overline{\operatorname{Range}(\mathcal{S}_\rho)}$ and $\mathcal{L} = \mathcal{S}_\rho\mathcal{S}_\rho^*$, it is natural to assume a general source condition on $f_H$ as follows.

Assumption 2. $f_H$ satisfies

$$\int_H (f_H(x)-f_\rho(x))^2\,x\otimes x\,d\rho_X(x) \preceq B^2\,\mathcal{T}, \tag{17}$$

and the following source condition

$$f_H = \phi(\mathcal{L})\,g_0,\qquad\text{with } \|g_0\|_\rho\le R. \tag{18}$$

Here, $B, R\ge 0$ and $\phi:[0,\kappa^2]\to\mathbb{R}_+$ is a non-decreasing index function such that $\phi(0) = 0$ and $\phi(\kappa^2)<\infty$. Moreover, for some $\zeta\in[0,\tau]$, $\phi(u)u^{-\zeta}$ is non-decreasing, and the qualification $\tau$ of $\mathcal{G}_\lambda$ covers the index function $\phi$.

Recall that the statement that the qualification $\tau$ of $\mathcal{G}_\lambda$ covers the index function $\phi$ is defined as follows [1].

Definition 4.1. We say that the qualification $\tau$ covers the index function $\phi$ if there exists a $c>0$ such that for all $0<\lambda\le\kappa^2$,

$$c\,\frac{\lambda^\tau}{\phi(\lambda)} \le \inf_{\lambda\le u\le\kappa^2}\frac{u^\tau}{\phi(u)}. \tag{19}$$

Condition (17) is trivially satisfied if $f_H$ is bounded almost surely. Moreover, when making a consistency assumption, i.e., $\inf_{H_\rho}\mathcal{E} = \mathcal{E}(f_\rho)$, as in [26, 4, 5, 28] for kernel-based non-parametric regression, it is satisfied with $B = 0$. Condition (18) is a more general source condition that characterizes the "regularity/smoothness" of the target function. It is trivially satisfied with $\phi(u) = 1$ as $f_H\in\overline{H_\rho}\subseteq L^2_{\rho_X}$.
In non-parametric regression with kernel methods, one typically considers Hölder conditions (corresponding to $\phi(u) = u^\alpha$, $\alpha\ge 0$) [26, 5, 4]. The works [1, 18, 21] consider a general source condition but only with an index function $\phi(u)\sqrt{u}$, where $\phi$ can be decomposed as $\psi\vartheta$, $\psi:[0,b]\to\mathbb{R}_+$ is operator monotone with $\psi(0) = 0$ and $\psi(b)<\infty$, and $\vartheta:[0,\kappa^2]\to\mathbb{R}_+$ is Lipschitz continuous with $\vartheta(0) = 0$. In the latter case $\inf_{H_\rho}\mathcal{E}$ has a solution $f_H$ in $H_\rho$, as [27, 22]

$$\mathcal{L}^{1/2}(L^2_{\rho_X}) \subseteq H_\rho. \tag{20}$$

In this paper, we will consider a source assumption with respect to a more general index function, $\phi = \psi\vartheta$, where $\psi:[0,b]\to\mathbb{R}_+$ is operator monotone with $\psi(0) = 0$ and $\psi(b)<\infty$, and $\vartheta:[0,\kappa^2]\to\mathbb{R}_+$ is Lipschitz continuous. Without loss of generality, we assume that the Lipschitz constant of $\vartheta$ is 1, as one can always scale both sides of the source condition (18). Recall that the function $\psi$ is called operator monotone on $[0,b]$ if, for any pair of self-adjoint operators $U, V$ with spectra in $[0,b]$ such that $U\preceq V$, we have $\psi(U)\preceq\psi(V)$.

Finally, the last assumption relates to the capacity of the hypothesis space $H_\rho$ (induced by $H$).

Assumption 3. For some $\gamma\in]0,1]$ and $c_\gamma>0$, $\mathcal{T}$ satisfies

$$\operatorname{tr}(\mathcal{T}(\mathcal{T}+\lambda I)^{-1}) \le c_\gamma\lambda^{-\gamma},\qquad\text{for all } \lambda>0. \tag{21}$$

The left-hand side of (21) is called the effective dimension [5], or the degrees of freedom [32]. It can be related to covering/entropy number conditions, see [27] for further details. Assumption 3 is always true for $\gamma = 1$ and $c_\gamma = \kappa^2$, since $\mathcal{T}$ is a trace class operator, which implies that the eigenvalues of $\mathcal{T}$, denoted by $\sigma_i$, satisfy $\operatorname{tr}(\mathcal{T}) = \sum_i\sigma_i\le\kappa^2$. This is referred to as the capacity independent setting. Assumption 3 with $\gamma\in]0,1]$ allows one to derive better rates. It is satisfied, e.g., if the eigenvalues of $\mathcal{T}$ satisfy a polynomial decaying condition $\sigma_i\sim i^{-1/\gamma}$, or with $\gamma = 0$ if $\mathcal{T}$ is finite rank.

4.2 Main Results

Now we are ready to state our main results as follows.

Theorem 4.2. Under Assumptions 1, 2 and 3, let $a\in[0,\frac12\wedge\zeta]$, $\lambda = n^{\theta-1}$ with $\theta\in[0,1]$, and $\delta\in]0,1[$.
The following hold with probability at least $1-\delta$.

1) If $\phi:[0,b]\to\mathbb{R}_+$ is operator monotone with $b>\kappa^2$ and $\phi(b)<\infty$, or Lipschitz continuous with constant 1 over $[0,\kappa^2]$, then

$$\|\mathcal{L}^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho \le \lambda^{-a}\left(\frac{\tilde C_1}{n\lambda^{\frac12\vee(1-\zeta)}}+\frac{\tilde C_2}{\sqrt{n\lambda^\gamma}}+\tilde C_3\,\phi(\lambda)\right)\log\frac{6}{\delta}\left(\log\frac{6}{\delta}+\gamma(\theta^{-1}\wedge\log n)\right)^{1-a}. \tag{22}$$

2) If $\phi = \psi\vartheta$, where $\psi:[0,b]\to\mathbb{R}_+$ is operator monotone with $b>\kappa^2$, $\psi(0) = 0$ and $\psi(b)<\infty$, and $\vartheta:[0,\kappa^2]\to\mathbb{R}_+$ is non-decreasing, Lipschitz continuous with constant 1 and $\vartheta(0) = 0$. Furthermore, assume that the qualification of $\mathcal{G}_\lambda$ covers $\vartheta(u)u^{\frac12-a}$. Then

$$\|\mathcal{L}^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho \le \lambda^{-a}\log\frac{6}{\delta}\left(\log\frac{6}{\delta}+\gamma(\theta^{-1}\wedge\log n)\right)^{1-a}\times\left(\frac{\tilde C_1}{n\lambda^{\frac12\vee(1-\zeta)}}+\frac{\tilde C_4}{\sqrt{n\lambda^\gamma}}+\tilde C_5\,\phi(\lambda)+\tilde C_6\,\vartheta(\lambda)\psi(n^{-1/2})\right). \tag{23}$$

Here, $\tilde C_1,\tilde C_2,\cdots,\tilde C_6$ are positive constants depending only on $\kappa^2, c_\gamma,\gamma,\zeta,\phi,\tau, B, M, Q, R, E, F_\tau, b, a, c$ and $\|\mathcal{T}\|$ (independent from $\lambda, n,\delta$, and $\theta$, and given explicitly in the proof).

The above theorem provides high-probability convergence results with respect to variants of norms for spectral algorithms. Balancing the different terms in the upper bounds, one has the following results with an optimal, data-dependent choice of regularization parameters. Throughout the rest of this paper, $C$ denotes a positive constant that depends only on $\kappa^2, c_\gamma,\gamma,\zeta,\phi,\tau, B, M, Q, R, E, F_\tau, b, a, c$ and $\|\mathcal{T}\|$, and it could be different at each appearance.

Corollary 4.3. Under the assumptions and notations of Theorem 4.2, let $2\zeta+\gamma>1$ and $\lambda = \Theta^{-1}(n^{-1})$, where $\Theta(u) = (\phi(u)/\phi(1))^2u^\gamma$. The following holds with probability at least $1-\delta$.

1) Let $\phi$ be as in Part 1) of Theorem 4.2, then

$$\|\mathcal{L}^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho \le C\,\frac{\phi(\Theta^{-1}(n^{-1}))}{(\Theta^{-1}(n^{-1}))^a}\,\log^{2-a}\frac{6}{\delta}. \tag{24}$$

2) Let $\phi$ be as in Part 2) of Theorem 4.2 and $\lambda\ge n^{-1/2}$, then (24) holds.

The error bounds in the above corollary are optimal as they match the minimax rates from [21] (considering only the case $\zeta\ge 1/2$ and $a = 0$).
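The parameter choice $\lambda=\Theta^{-1}(n^{-1})$ in Corollary 4.3 can be computed numerically whenever $\Theta$ is increasing. The following sketch is an illustration under the assumption of a Hölder index function $\phi(u)=u^\zeta$ (so $\phi(1)=1$ and $\Theta(u)=u^{2\zeta+\gamma}$); the bisection routine and the parameter values are hypothetical and not part of the paper. It inverts $\Theta$ by bisection and compares the result with the closed form $n^{-1/(2\zeta+\gamma)}$, and it evaluates the rate factor $\phi(\lambda)/\lambda^a$ from (24), ignoring the constant $C$ and the logarithmic term.

```python
def theta(u, zeta, gamma):
    """Theta(u) = (phi(u) / phi(1))^2 * u^gamma for the assumed index phi(u) = u^zeta."""
    return (u ** zeta) ** 2 * u ** gamma

def invert_theta(target, zeta, gamma, tol=1e-14):
    """Solve Theta(u) = target on ]0, 1] by bisection (Theta is increasing there)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if theta(mid, zeta, gamma) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative values with 2*zeta + gamma = 2 > 1 and a <= min(1/2, zeta).
n, zeta, gamma, a = 10_000, 0.75, 0.5, 0.25
lam = invert_theta(1.0 / n, zeta, gamma)
lam_closed = n ** (-1.0 / (2 * zeta + gamma))
rate = lam ** zeta / lam ** a                  # phi(lam) / lam^a, the n-dependent factor in (24)
rate_closed = n ** (-(zeta - a) / (2 * zeta + gamma))
```

With these values, $\lambda$ comes out as $n^{-1/2}$ and the rate factor as $n^{-1/4}$, which is the exponent $(\zeta-a)/(2\zeta+\gamma)$ expected for a Hölder source condition.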
The assumption that the qualification of $\mathcal{G}_\lambda$ covers $\vartheta(u)u^{\frac12-a}$ in Part 2) of Corollary 4.3 is also implicitly required in [1, 18, 21], and it is always satisfied for principal component analysis and gradient methods. The condition $\lambda\ge n^{-1/2}$ will be satisfied in most cases when the index function has a Lipschitz continuous part, and moreover, it is trivially satisfied when $\zeta\ge 1$, as will be seen from the proof.

As a direct corollary of Theorem 4.2, we have the following results considering Hölder source conditions.

Corollary 4.4. Under the assumptions and notations of Theorem 4.2, let $\phi(u) = \kappa^{-2(\zeta-1)}u^\zeta$ in Assumption 2 and $\lambda = n^{-\frac{1}{1\vee(2\zeta+\gamma)}}$. Then with probability at least $1-\delta$,

$$\|\mathcal{L}^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho \le C\begin{cases} n^{-\frac{\zeta-a}{2\zeta+\gamma}}\log^{2-a}\frac{6}{\delta} & \text{if } 2\zeta+\gamma>1,\\[4pt] n^{-(\zeta-a)}\log\frac{6}{\delta}\left(\log\frac{6}{\delta}+\log n^\gamma\right)^{1-a} & \text{if } 2\zeta+\gamma\le 1.\end{cases} \tag{25}$$

The error bounds in (25) are optimal as the convergence rates match the minimax rates shown in [5, 3] with $\zeta\ge 1/2$. The above result asserts that spectral algorithms with an appropriate regularization parameter converge optimally.

Corollary 4.4 provides high-probability convergence results for the studied algorithms. It implies convergence in expectation and almost sure convergence, as shown in the following. Moreover, when $\zeta\ge 1/2$, it can be translated into convergence results with respect to norms related to $H$.

Corollary 4.5. Under the assumptions of Corollary 4.4, the following holds.

1) For any $q\in\mathbb{N}_+$, we have

$$\mathbb{E}\|\mathcal{L}^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho^q \le C\begin{cases} n^{-\frac{q(\zeta-a)}{2\zeta+\gamma}} & \text{if } 2\zeta+\gamma>1,\\[4pt] n^{-q(\zeta-a)}(1\vee\log n^\gamma)^{q(1-a)} & \text{if } 2\zeta+\gamma\le 1.\end{cases} \tag{26}$$

2) For any $0<\epsilon<\zeta-a$,

$$\lim_{n\to\infty}\|\mathcal{L}^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho\,n^{\frac{\zeta-a-\epsilon}{1\vee(2\zeta+\gamma)}} = 0,\qquad\text{almost surely.}$$

3) If $\zeta\ge 1/2$, then for some $\omega_H\in H$, $\mathcal{S}_\rho\omega_H = f_H$ almost surely, and with probability at least $1-\delta$,

$$\|\mathcal{T}^{\frac12-a}(\omega_\lambda^{\mathbf{z}}-\omega_H)\|_H \le C\,n^{-\frac{\zeta-a}{2\zeta+\gamma}}\log^{2-a}\frac{6}{\delta}. \tag{27}$$

Remark 4.6. If $H = \mathbb{R}^d$, then Assumption 3 is trivially satisfied with $c_\gamma = \kappa^2(d\wedge\sigma_{\min}^{-1})$, $\gamma = 0$, and Assumption 2 could be satisfied with any $\zeta>1/2$.
(Footnote: Note that this is not true in general if $H$ is a general Hilbert space, and the proof for the finite-dimensional cases could be simplified, leading to some smaller constants in the error bounds.) Here $\sigma_{\min}$ denotes the smallest eigenvalue of $\mathcal{T}$. Thus, following from the proof of Theorem 4.2, we have that with probability at least $1-\delta$,

$$\|\mathcal{L}^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho \le C\sqrt{\frac{c_\gamma}{n}}\,\log\frac{6}{\delta}\left(\log\frac{6}{\delta}+\log c_\gamma\right)^{1-a}.$$

The proofs of all the results stated in this subsection are postponed to the next section.

4.3 Discussions

There is a large amount of research on theoretical results for non-parametric regression with kernel methods in the literature, see e.g., [30, 23, 29, 15, 7, 18, 25, 13] and references therein. As noted in Section 2, our results apply to non-parametric regression with kernel methods. In what follows, we will translate some of the results for kernel-based regression into results for regression over a general Hilbert space and compare our results with these results.

We first compare Corollary 4.4 with some of the results in the literature for spectral algorithms with Hölder source conditions. Making a source assumption as

$$f_\rho = \mathcal{L}^\zeta g_0\qquad\text{with } \|g_0\|_\rho\le R, \tag{28}$$

$1/2\le\zeta\le\tau$, and with $\gamma>0$, [11] shows that with probability at least $1-\delta$,

$$\|f_\lambda^{\mathbf{z}}-f_\rho\|_\rho \le C\,n^{-\frac{\zeta}{2\zeta+\gamma}}\log^4\frac{1}{\delta}.$$

Condition (28) implies that $f_\rho\in H_\rho$, as $H_\rho = \operatorname{range}(\mathcal{S}_\rho)$ and $\mathcal{L} = \mathcal{S}_\rho\mathcal{S}_\rho^*$. Thus $f_H = f_\rho$ almost surely. In comparison, Corollary 4.4 is more general. It provides convergence results in terms of different norms for a more general Hölder source condition, allowing $0<\zeta\le 1/2$ and $\gamma = 0$. Besides, it does not require the extra assumption $f_H = f_\rho$, and the derived error bound in (25) has a smaller order of dependence on $\log\frac{1}{\delta}$. For the assumption (28) with $0\le\zeta<1/2$, certain results are derived in [26] for Tikhonov regularization and in [31] for gradient methods, but the rates are suboptimal and capacity-independent.
Recently, [13] shows that under the assumption (28), with $\zeta\in]0,\tau]$ and $\gamma\in[0,1]$, spectral algorithms have the following error bounds in expectation,

$$\mathbb{E}\|f_\lambda^{\mathbf{z}}-f_\rho\|_\rho^2 \le C\begin{cases} n^{-\frac{2\zeta}{2\zeta+\gamma}} & \text{if } 2\zeta+\gamma>1,\\[4pt] n^{-2\zeta}(1\vee\log n^\gamma) & \text{if } 2\zeta+\gamma\le 1.\end{cases}$$

Note also that [7] provides the same optimal error bounds as the above, but is restricted to the cases $1/2\le\zeta\le\tau$ and $n\ge n_0$. In comparison, Corollary 4.4 is more general. It provides convergence results with different norms and it does not require the universal consistency assumption. (Footnote: Such an assumption is satisfied if $\inf_{H_\rho}\mathcal{E} = \mathcal{E}(f_\rho)$, and it is supported by many function classes and reproducing kernel Hilbert spaces in learning with kernel methods [27].) The derived error bound in (25) is more meaningful as it holds with high probability. However, it has an extra logarithmic factor in the upper bound for the case $2\zeta+\gamma\le 1$, which is worse than that from [13].

[1, 3] study statistical results for spectral algorithms under a Hölder source condition, $f_H = \mathcal{L}^\zeta g_0$ with $1/2\le\zeta\le\tau$. In particular, [3] shows that if

$$n \ge C\lambda^{-2}\log^2\frac{1}{\delta}, \tag{29}$$

then with probability at least $1-\delta$, with $1/2<\zeta\le\tau$ and $0\le a\le 1/2$,

$$\|\mathcal{L}^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho \le C\,n^{-\frac{\zeta-a}{2\zeta+\gamma}}\log\frac{1}{\delta}.$$

In comparison, Corollary 4.4 provides optimal convergence rates even in the case that $0\le\zeta\le 1/2$, while it does not require the extra condition (29). Note that we do not pursue an error bound that depends both on $R$ and the noise level as those in [3, 7], but it should be easy to modify our proof to derive such error bounds (at least in the case that $\zeta\ge 1/2$). The only results so far for the non-attainable cases with a general Hölder condition with respect to $f_H$ (rather than $f_\rho$) are from [14], where convergence rates of order $O(n^{-\frac{\zeta}{1\vee(2\zeta+\gamma)}}\log n)$ are derived (but only) for gradient methods, assuming $n$ is large enough.

We next compare Theorem 4.2 with results from [1, 21] for spectral algorithms considering general source conditions.
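Assuming that $f_H = \phi(\mathcal{L})\sqrt{\mathcal{L}}\,g_0$ with $\|g_0\|_\rho\le R$ (which implies $f_H = \mathcal{S}_\rho\omega_H$ for some $\omega_H\in H$), where $\phi$ is as in Part 2) of Theorem 4.2, [1] shows that if the qualification of $\mathcal{G}_\lambda$ covers $\phi(u)\sqrt{u}$ and (29) holds, then with probability at least $1-\delta$,

$$\|\mathcal{L}^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho \le C\lambda^{-a}\left(\phi(\lambda)\sqrt{\lambda}+\frac{1}{\sqrt{\lambda n}}\right)\log\frac{6}{\delta},\qquad a = 0,\frac12.$$

The error bound is capacity independent, i.e., with $\gamma = 1$. Involving the capacity assumption, the error bound is further improved in [21] to

$$\|\mathcal{L}^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho \le C\lambda^{-a}\left(\phi(\lambda)\sqrt{\lambda}+\frac{1}{\sqrt{n\lambda^\gamma}}\right)\log\frac{6}{\delta},\qquad a = 0,\frac12.$$

(Footnote: Note that from the proof in [21], we can see that the results from [21] also require (29).) As noted in [11, Discussion], these results lead to the following estimates in expectation

$$\mathbb{E}\|\mathcal{L}^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho^2 \le C\lambda^{-2a}\left(\phi^2(\lambda)\lambda+\frac{1}{n\lambda^\gamma}\right)\log n,\qquad a = 0,\frac12.$$

In comparison with these results, Theorem 4.2 is more general, considering a general source assumption and covering the general case that $f_H$ may not be in $H_\rho$. Furthermore, it provides convergence results with respect to a broader class of norms, and it does not require the condition (29). Finally, it leads to convergence results in expectation with a better rate (without the logarithmic factor) when the index function is $\phi(u)\sqrt{u}$, and it can infer almost-sure convergence results.

5 Proofs

In this section, we prove the results stated in Section 4. We first give some basic lemmas, and then give the proof of the main results.

5.1 Lemmas

Deterministic Estimates

We first introduce the following lemma, which is a generalization of [1, Proposition 7]. For notational simplicity, we denote

$$\mathcal{R}_\lambda(u) = 1-\mathcal{G}_\lambda(u)u, \tag{30}$$

and

$$\mathcal{N}(\lambda) = \operatorname{tr}(\mathcal{T}(\mathcal{T}+\lambda)^{-1}).$$

Lemma 5.1. Let $\phi:[0,\kappa^2]\to\mathbb{R}_+$ be a non-decreasing index function such that the qualification $\tau$ of the filter function $\mathcal{G}_\lambda$ covers the index function $\phi$, and for some $\zeta\in[0,\tau]$, $\phi(u)u^{-\zeta}$ is non-decreasing. Then for all $a\in[0,\zeta]$,

$$\sup_{0<u\le\kappa^2}|\mathcal{R}_\lambda(u)|\,\phi(u)u^{-a} \le c_g\,\phi(\lambda)\lambda^{-a},\qquad c_g = \frac{F_\tau}{c\wedge 1}, \tag{31}$$

where $c$ is from Definition 4.1.

Proof. When $\lambda\le u\le\kappa^2$, by (19), we have

$$\frac{\phi(u)}{u^\tau} \le \frac{1}{c}\,\frac{\phi(\lambda)}{\lambda^\tau}.$$

Thus,

$$|\mathcal{R}_\lambda(u)|\,\phi(u)u^{-a} = |\mathcal{R}_\lambda(u)|\,u^{\tau-a}\phi(u)u^{-\tau} \le |\mathcal{R}_\lambda(u)|\,u^{\tau-a}c^{-1}\phi(\lambda)\lambda^{-\tau} \le F_\tau c^{-1}\lambda^{-a}\phi(\lambda),$$

where for the last inequality, we used (12). When $0<u\le\lambda$, since $\phi(u)u^{-\zeta}$ is non-decreasing,

$$|\mathcal{R}_\lambda(u)|\,\phi(u)u^{-a} = |\mathcal{R}_\lambda(u)|\,u^{\zeta-a}\phi(u)u^{-\zeta} \le |\mathcal{R}_\lambda(u)|\,u^{\zeta-a}\phi(\lambda)\lambda^{-\zeta} \le F_\tau\,\phi(\lambda)\lambda^{-a},$$

where we used (12) for the last inequality. From the above analysis, one can finish the proof.

Using the above lemma, we have the following results for the deterministic vector $\omega_\lambda$, defined by

$$\omega_\lambda = \mathcal{G}_\lambda(\mathcal{T})\,\mathcal{S}_\rho^* f_H. \tag{32}$$

Lemma 5.2. Under Assumption 2, we have for all $a\in[0,\zeta]$,

$$\|\mathcal{L}^{-a}(\mathcal{S}_\rho\omega_\lambda-f_H)\|_\rho \le c_g R\,\phi(\lambda)\lambda^{-a}, \tag{33}$$

and

$$\|\omega_\lambda\|_H \le E\,\phi(\kappa^2)\,\kappa^{-(2\zeta\wedge 1)}\lambda^{-(\frac12-\zeta)_+}. \tag{34}$$

The left-hand side of (33) is often called the true bias.

Proof. Following from the definition of $\omega_\lambda$ in (32), we have

$$\mathcal{S}_\rho\omega_\lambda-f_H = \mathcal{S}_\rho\mathcal{G}_\lambda(\mathcal{T})\mathcal{S}_\rho^* f_H-f_H = (\mathcal{L}\mathcal{G}_\lambda(\mathcal{L})-I)f_H.$$

Introducing (18), with the notation $\mathcal{R}_\lambda(u) = 1-\mathcal{G}_\lambda(u)u$, we get

$$\|\mathcal{L}^{-a}(\mathcal{S}_\rho\omega_\lambda-f_H)\|_\rho = \|\mathcal{L}^{-a}\mathcal{R}_\lambda(\mathcal{L})\phi(\mathcal{L})g_0\|_\rho \le \|\mathcal{L}^{-a}\mathcal{R}_\lambda(\mathcal{L})\phi(\mathcal{L})\|\,R.$$

Applying the spectral theorem with (4) and Lemma 5.1, which leads to

$$\|\mathcal{L}^{-a}\mathcal{R}_\lambda(\mathcal{L})\phi(\mathcal{L})\| \le \sup_{u\in[0,\kappa^2]}|\mathcal{R}_\lambda(u)|\,u^{-a}\phi(u) \le c_g\,\phi(\lambda)\lambda^{-a},$$

one can get (33). From the definition of $\omega_\lambda$ in (32) and applying (18), we have

$$\|\omega_\lambda\|_H = \|\mathcal{G}_\lambda(\mathcal{T})\mathcal{S}_\rho^*\phi(\mathcal{L})g_0\|_H \le \|\mathcal{G}_\lambda(\mathcal{T})\mathcal{S}_\rho^*\phi(\mathcal{L})\|\,R.$$

According to the spectral theorem, with (4), one has

$$\|\mathcal{G}_\lambda(\mathcal{T})\mathcal{S}_\rho^*\phi(\mathcal{L})\| = \|\phi(\mathcal{L})\mathcal{S}_\rho\mathcal{G}_\lambda(\mathcal{T})\mathcal{G}_\lambda(\mathcal{T})\mathcal{S}_\rho^*\phi(\mathcal{L})\|^{1/2} = \|\mathcal{G}_\lambda(\mathcal{L})\mathcal{L}^{1/2}\phi(\mathcal{L})\| \le \sup_{u\in[0,\kappa^2]}|\mathcal{G}_\lambda(u)u^{1/2}\phi(u)|.$$

Since both $\phi(u)$ and $\phi(u)u^{-\zeta}$ are non-decreasing and non-negative over $[0,\kappa^2]$, $\phi(u)u^{-\zeta'}$ is also non-decreasing for any $\zeta'\in[0,\zeta]$. If $\zeta\ge 1/2$, then

$$\sup_{u\in[0,\kappa^2]}|\mathcal{G}_\lambda(u)|\,u^{1/2}\phi(u) = \sup_{u\in[0,\kappa^2]}|\mathcal{G}_\lambda(u)|\,u\,\phi(u)u^{-1/2} \le E\,\phi(\kappa^2)\kappa^{-1},$$

where for the last inequality, we used (11) and that $\phi(u)u^{-1/2}$ is non-decreasing. If $\zeta<1/2$, similarly, we have

$$\sup_{u\in[0,\kappa^2]}|\mathcal{G}_\lambda(u)|\,u^{1/2}\phi(u) = \sup_{u\in[0,\kappa^2]}|\mathcal{G}_\lambda(u)|\,u^{\frac12+\zeta}\phi(u)u^{-\zeta} \le E\lambda^{\zeta-\frac12}\phi(\kappa^2)\kappa^{-2\zeta}.$$

From the above analysis, one can prove (34).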
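The bound (31) can be checked numerically in a simple special case. The sketch below is an illustration under stated assumptions, not part of the proof: it takes ridge regression ($\mathcal{G}_\lambda(u)=1/(\lambda+u)$, so $\tau=1$ and $F_\tau=1$), the Hölder index $\phi(u)=u^\zeta$ with $\zeta\le 1$ (for which Definition 4.1 holds with $c=1$, hence $c_g=1$), and $\kappa^2=1$. The supremum over $u$ is approximated on a uniform grid, and the function names are hypothetical.

```python
def ridge_residual(u, lam):
    """R_lam(u) = 1 - G_lam(u) * u for ridge regression, where G_lam(u) = 1 / (lam + u)."""
    return lam / (lam + u)

def true_bias_sup(lam, zeta, a, kappa_sq=1.0, grid=20_000):
    """Grid approximation of sup over 0 < u <= kappa^2 of |R_lam(u)| * phi(u) * u^(-a), phi(u) = u^zeta."""
    return max(
        ridge_residual(kappa_sq * k / grid, lam) * (kappa_sq * k / grid) ** (zeta - a)
        for k in range(1, grid + 1)
    )

def lemma_bound(lam, zeta, a, c_g=1.0):
    """Right-hand side of (31): c_g * phi(lam) * lam^(-a) with phi(u) = u^zeta."""
    return c_g * lam ** (zeta - a)

# Each entry: (lam, zeta, a, approximate supremum, bound from Lemma 5.1); all cases have 0 <= a <= zeta <= 1.
checks = [
    (lam, zeta, a, true_bias_sup(lam, zeta, a), lemma_bound(lam, zeta, a))
    for lam in (1e-3, 1e-2, 1e-1, 1.0)
    for zeta, a in ((0.25, 0.0), (0.5, 0.25), (1.0, 0.5), (1.0, 0.0))
]
```

In this special case the inequality reduces to $\lambda u^{s}/(\lambda+u)\le\lambda^{s}$ with $s=\zeta-a\in[0,1]$, which also follows directly from the weighted arithmetic-geometric mean inequality $\lambda^{1-s}u^{s}\le(1-s)\lambda+su$.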
Probabilistic Estimates

We next introduce the following lemma, whose proof can be found in [13]. Note that the lemma improves those from [12] for the matrix case and Lemma 7.2 in [24] for the operator case, as it does not need the assumption that the sample size is large enough, while accounting for the influence of $\gamma$ on the logarithmic factor.

Lemma 5.3. Under Assumption 3, let $\delta \in (0, 1)$, $\lambda = n^{-\theta}$ for some $\theta \ge 0$, and
\[ a_{n,\delta,\gamma}(\theta) = 8\kappa^2\left(\log\frac{4\kappa^2(c_\gamma + 1)}{\delta\|T\|} + \theta\gamma\min\left(\frac{1}{e(1-\theta)_+}, \log n\right)\right). \tag{35} \]
We have with probability at least $1 - \delta$,
\[ \|(T + \lambda)^{1/2}(T_{\mathbf x} + \lambda)^{-1/2}\|^2 \le 3a_{n,\delta,\gamma}(\theta)(1 \vee n^{\theta-1}), \]
and
\[ \|(T + \lambda)^{-1/2}(T_{\mathbf x} + \lambda)^{1/2}\|^2 \le a_{n,\delta,\gamma}(\theta)(1 \vee n^{\theta-1}). \]

To prove the next lemmas, we need the following concentration result for Hilbert space valued random variables, used in [5] and based on the results in [19].

Lemma 5.4. Let $w_1, \cdots, w_m$ be i.i.d. random variables in a Hilbert space with norm $\|\cdot\|$. Suppose that there are two positive constants $B$ and $\sigma^2$ such that
\[ E[\|w_1 - E[w_1]\|^l] \le \frac12 l!B^{l-2}\sigma^2, \quad \forall l \ge 2. \tag{36} \]
Then for any $0 < \delta < 1/2$, the following holds with probability at least $1 - \delta$,
\[ \left\|\frac1m\sum_{k=1}^m w_k - E[w_1]\right\| \le 2\left(\frac{B}{m} + \frac{\sigma}{\sqrt m}\right)\log\frac2\delta. \]

The following lemma is a consequence of the lemma above (see e.g., [26] for a proof).

Lemma 5.5. Let $0 < \delta < 1/2$. It holds with probability at least $1 - \delta$:
\[ \|T - T_{\mathbf x}\| \le \|T - T_{\mathbf x}\|_{HS} \le \frac{6\kappa^2}{\sqrt n}\log\frac2\delta. \]
Here, $\|\cdot\|_{HS}$ denotes the Hilbert-Schmidt norm.

One novelty of this paper is the following new lemma, which provides a probabilistic estimate on the terms caused by both the variance and the approximation error. The lemma is mainly motivated by [26, 5, 14, 13]. Note that the condition (17) is slightly weaker than the condition $\|f_H\|_\infty < \infty$ required in [14] for analyzing gradient methods.

Lemma 5.6. Under Assumptions 1, 2 and 3, let $\omega_\lambda$ be given by (32). For all $\delta \in\, ]0, 1/2[$, the following holds with probability at least $1 - \delta$:
\[ \|T_\lambda^{-1/2}[(T_{\mathbf x}\omega_\lambda - S_{\mathbf x}^*\mathbf y) - (T\omega_\lambda - S_\rho^* f_H)]\|_H \le \left(\frac{C_1}{n\lambda^{\frac12\vee(1-\zeta)}} + \sqrt{\frac{C_2(\phi(\lambda))^2}{n\lambda}} + \sqrt{\frac{C_3}{n\lambda^\gamma}}\right)\log\frac2\delta. \tag{37} \]
Here, $C_1 = 8\kappa(M + E\phi(\kappa^2)\kappa^{(1-2\zeta)_+})$, $C_2 = 96c_g^2R^2\kappa^2$ and $C_3 = 32(3B^2 + 4Q^2)c_\gamma$.

Proof. Let $\xi_i = T_\lambda^{-1/2}(\langle\omega_\lambda, x_i\rangle_H - y_i)x_i$ for all $i \in [n]$. From the definition of the regression function $f_\rho$ in (2) and (8), a simple calculation shows that
\[ E[\xi] = E[T_\lambda^{-\frac12}(\langle\omega_\lambda, x\rangle_H - f_\rho(x))x] = T_\lambda^{-\frac12}(T\omega_\lambda - S_\rho^* f_\rho) = T_\lambda^{-\frac12}(T\omega_\lambda - S_\rho^* f_H). \tag{38} \]
In order to apply Lemma 5.4, we need to estimate $E[\|\xi - E[\xi]\|_H^l]$ for any $l \in \mathbb N$ with $l \ge 2$. In fact, using Hölder's inequality twice,
\[ E\|\xi - E[\xi]\|_H^l \le E(\|\xi\|_H + E\|\xi\|_H)^l \le 2^{l-1}(E\|\xi\|_H^l + (E\|\xi\|_H)^l) \le 2^l E\|\xi\|_H^l. \tag{39} \]
We now estimate $E\|\xi\|_H^l$. By Hölder's inequality,
\[ E\|\xi\|_H^l = E[\|T_\lambda^{-\frac12}x\|_H^l\,|y - \langle\omega_\lambda, x\rangle_H|^l] \le 2^{l-1}E[\|T_\lambda^{-\frac12}x\|_H^l(|y|^l + |\langle\omega_\lambda, x\rangle_H|^l)]. \]
According to (3), one has
\[ \|T_\lambda^{-\frac12}x\|_H \le \|T_\lambda^{-\frac12}\|\|x\|_H \le \frac{1}{\sqrt\lambda}\kappa. \tag{40} \]
Moreover, by the Cauchy-Schwarz inequality and (3), $|\langle\omega_\lambda, x\rangle_H| \le \|\omega_\lambda\|_H\|x\|_H \le \kappa\|\omega_\lambda\|_H$. Thus, we get
\[ E\|\xi\|_H^l \le 2^{l-1}\left(\frac{\kappa}{\sqrt\lambda}\right)^{l-2}E[\|T_\lambda^{-\frac12}x\|_H^2(|y|^l + (\kappa\|\omega_\lambda\|_H)^{l-2}|\langle\omega_\lambda, x\rangle_H|^2)]. \tag{41} \]
Note that by (15),
\[ E[\|T_\lambda^{-\frac12}x\|_H^2|y|^l] = \int_H\|T_\lambda^{-\frac12}x\|_H^2\int_{\mathbb R}|y|^l\,d\rho(y|x)\,d\rho_X(x) \le \frac12 l!M^{l-2}Q^2\int_H\|T_\lambda^{-\frac12}x\|_H^2\,d\rho_X(x). \]
Using $\|w\|_H^2 = \operatorname{tr}(w \otimes w)$, which implies
\[ \int_H\|T_\lambda^{-\frac12}x\|_H^2\,d\rho_X(x) = \int_H\operatorname{tr}(T_\lambda^{-\frac12}x \otimes xT_\lambda^{-\frac12})\,d\rho_X(x) = \operatorname{tr}(T_\lambda^{-\frac12}TT_\lambda^{-\frac12}) = N(\lambda), \tag{42} \]
we get
\[ E[\|T_\lambda^{-\frac12}x\|_H^2|y|^l] \le \frac12 l!M^{l-2}Q^2N(\lambda). \tag{43} \]
Besides, by the Cauchy-Schwarz inequality,
\[ E[\|T_\lambda^{-\frac12}x\|_H^2|\langle\omega_\lambda, x\rangle_H|^2] \le 3E[\|T_\lambda^{-\frac12}x\|_H^2(|\langle\omega_\lambda, x\rangle_H - f_H(x)|^2 + |f_H(x) - f_\rho(x)|^2 + |f_\rho(x)|^2)]. \]
By (40) and (33),
\[ E[\|T_\lambda^{-\frac12}x\|_H^2|\langle\omega_\lambda, x\rangle_H - f_H(x)|^2] \le \frac{\kappa^2}{\lambda}E[|\langle\omega_\lambda, x\rangle_H - f_H(x)|^2] = \frac{\kappa^2}{\lambda}\|S_\rho\omega_\lambda - f_H\|_\rho^2 \le c_g^2R^2\kappa^2\frac{(\phi(\lambda))^2}{\lambda}, \]
and by (16) and (42),
\[ E[\|T_\lambda^{-\frac12}x\|_H^2|f_\rho(x)|^2] \le Q^2E[\|T_\lambda^{-\frac12}x\|_H^2] = Q^2N(\lambda). \]
Therefore,
\[ E[\|T_\lambda^{-\frac12}x\|_H^2|\langle\omega_\lambda, x\rangle_H|^2] \le 3\left(c_g^2R^2\kappa^2\phi^2(\lambda)\lambda^{-1} + E[\|T_\lambda^{-\frac12}x\|_H^2|f_H(x) - f_\rho(x)|^2] + Q^2N(\lambda)\right). \]
Using $\|w\|_H^2 = \operatorname{tr}(w \otimes w)$ and (17), we have
\[ E[\|T_\lambda^{-\frac12}x\|_H^2|f_H(x) - f_\rho(x)|^2] = E[|f_H(x) - f_\rho(x)|^2\operatorname{tr}(T_\lambda^{-\frac12}x \otimes xT_\lambda^{-\frac12})] = \operatorname{tr}(T_\lambda^{-1}E[(f_H(x) - f_\rho(x))^2x \otimes x]) \le B^2\operatorname{tr}(T_\lambda^{-1}T) = B^2N(\lambda), \]
and therefore,
\[ E[\|T_\lambda^{-\frac12}x\|_H^2|\langle\omega_\lambda, x\rangle_H|^2] \le 3\left(c_g^2R^2\kappa^2(\phi(\lambda))^2\lambda^{-1} + (B^2 + Q^2)N(\lambda)\right). \]
Introducing the above estimate and (43) into (41), we derive
\[ E\|\xi\|_H^l \le 2^{l-1}\left(\frac{\kappa}{\sqrt\lambda}\right)^{l-2}\left(\frac12 l!M^{l-2}Q^2N(\lambda) + 3(\kappa\|\omega_\lambda\|_H)^{l-2}\left(c_g^2R^2\kappa^2(\phi(\lambda))^2\lambda^{-1} + (B^2 + Q^2)N(\lambda)\right)\right) \]
\[ \le 2^{l-1}\frac12 l!\left(\frac{\kappa M + \kappa^2\|\omega_\lambda\|_H}{\sqrt\lambda}\right)^{l-2}\left(Q^2N(\lambda) + 3\left(c_g^2R^2\kappa^2(\phi(\lambda))^2\lambda^{-1} + (B^2 + Q^2)N(\lambda)\right)\right) \]
\[ \le 2^{l-1}\frac12 l!\left(\frac{\kappa M + \kappa^2\|\omega_\lambda\|_H}{\sqrt\lambda}\right)^{l-2}\left(3c_g^2R^2\kappa^2(\phi(\lambda))^2\lambda^{-1} + (3B^2 + 4Q^2)c_\gamma\lambda^{-\gamma}\right), \]
where for the last inequality, we used Assumption 3. Introducing the above estimate into (39), and then substituting with (34) and noting that $\lambda \le 1$, we get
\[ E[\|\xi - E[\xi]\|_H^l] \le \frac12 l!\left(\frac{4\kappa(M + E\phi(\kappa^2)\kappa^{(1-2\zeta)_+})}{\lambda^{\frac12\vee(1-\zeta)}}\right)^{l-2}8\left(3c_g^2R^2\kappa^2(\phi(\lambda))^2\lambda^{-1} + (3B^2 + 4Q^2)c_\gamma\lambda^{-\gamma}\right). \]
Applying Lemma 5.4, one can get the desired result.

Basic Operator Inequalities

Lemma 5.7. [9, Cordes inequality] Let $A$ and $B$ be two positive bounded linear operators on a separable Hilbert space. Then
\[ \|A^sB^s\| \le \|AB\|^s, \quad\text{when } 0 \le s \le 1. \]

Lemma 5.8. [17, 16] Suppose $\psi$ is an operator monotone index function on $[0, b]$, with $b > 1$. Then there is a constant $c_\psi < \infty$ depending on $b - a$, such that for any pair $B_1, B_2$, $\|B_1\|, \|B_2\| \le a$, of non-negative self-adjoint operators on some Hilbert space, it holds,
\[ \|\psi(B_1) - \psi(B_2)\| \le c_\psi\psi(\|B_1 - B_2\|). \]
Moreover, there is $c > 0$ such that
\[ c\frac{\lambda}{\psi(\lambda)} \le \frac{\sigma}{\psi(\sigma)}, \]
whenever $0 < \lambda < \sigma \le a < b$.

Lemma 5.9. Let $\vartheta : [0, a] \to \mathbb R$ be Lipschitz continuous with constant 1 and $\vartheta(0) = 0$. Then for any pair $B_1, B_2$, $\|B_1\|, \|B_2\| \le a$, of non-negative self-adjoint operators on some Hilbert space, it holds,
\[ \|\vartheta(B_1) - \vartheta(B_2)\|_{HS} \le \|B_1 - B_2\|_{HS}. \]
Proof. The result follows from [2, Subsection 8.2].

5.2 Proof of Main Results

Now we are ready to prove Theorem 4.2.

Proof of Theorem 4.2.
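Following from Lemmas 5.3, 5.5 and 5.6, and by a simple calculation, with $\lambda = n^{\theta-1}$ and $\theta \in [0, 1]$, we get that with probability at least $1 - \delta$, the following holds:
\[ \|T_\lambda^{\frac12}T_{\mathbf x\lambda}^{-\frac12}\|^2 \vee \|T_\lambda^{-\frac12}T_{\mathbf x\lambda}^{\frac12}\|^2 \le \Delta_1, \qquad \|T - T_{\mathbf x}\| \le \|T - T_{\mathbf x}\|_{HS} \le \Delta_3, \tag{44} \]
and
\[ \|T_\lambda^{-1/2}[(T_{\mathbf x}\omega_\lambda - S_{\mathbf x}^*\mathbf y) - (T\omega_\lambda - S_\rho^* f_H)]\|_H \le \Delta_2, \tag{45} \]
where
\[ \Delta_1 = C_4\left(\log\frac6\delta + \gamma\left(\frac1\theta \wedge \log n\right)\right), \qquad C_4 = 24\kappa^2\log\frac{2e\kappa^2(c_\gamma + 1)}{\|T\|}, \]
\[ \Delta_2 = \left(\frac{C_1}{n\lambda^{\frac12\vee(1-\zeta)}} + \sqrt{C_2}\,\phi(\lambda) + \frac{\sqrt{C_3}}{\sqrt{n\lambda^\gamma}}\right)\log\frac6\delta, \qquad \Delta_3 = \frac{6\kappa^2}{\sqrt n}\log\frac6\delta. \]
Obviously, we have $\Delta_1 \ge 1$ since $\log\frac6\delta > 1$ and by (4), $C_4 \ge 1$.

We now begin with the following inequality:
\[ \|L^{-a}(f_\lambda^{\mathbf z} - f_H)\|_\rho = \|L^{-a}(S_\rho\omega_\lambda^{\mathbf z} - f_H)\|_\rho \le \|L^{-a}S_\rho(\omega_\lambda^{\mathbf z} - \omega_\lambda)\|_\rho + \|L^{-a}(S_\rho\omega_\lambda - f_H)\|_\rho. \]
Using (33), we get
\[ \|L^{-a}(f_\lambda^{\mathbf z} - f_H)\|_\rho \le \|L^{-a}S_\rho(\omega_\lambda^{\mathbf z} - \omega_\lambda)\|_\rho + c_gR\phi(\lambda)\lambda^{-a} \le \|L^{-a}S_\rho T_\lambda^{a-\frac12}\|\|T_\lambda^{\frac12-a}T_{\mathbf x\lambda}^{a-\frac12}\|\|T_{\mathbf x\lambda}^{\frac12-a}(\omega_\lambda^{\mathbf z} - \omega_\lambda)\|_H + c_gR\phi(\lambda)\lambda^{-a}. \]
By the spectral theorem, $L = S_\rho S_\rho^*$, $T = S_\rho^*S_\rho$, and (4), we have $\|L^{-a}S_\rho T_\lambda^{a-\frac12}\| \le 1$. Moreover, by Lemma 5.7 and $0 \le a \le \zeta \wedge \frac12$,
\[ \|T_\lambda^{\frac12-a}T_{\mathbf x\lambda}^{a-\frac12}\| = \|T_\lambda^{\frac12(1-2a)}T_{\mathbf x\lambda}^{-\frac12(1-2a)}\| \le \|T_\lambda^{\frac12}T_{\mathbf x\lambda}^{-\frac12}\|^{1-2a} \le \Delta_1^{\frac12-a}. \]
We thus get
\[ \|L^{-a}(f_\lambda^{\mathbf z} - f_H)\|_\rho \le \Delta_1^{\frac12-a}\|T_{\mathbf x\lambda}^{\frac12-a}(\omega_\lambda^{\mathbf z} - \omega_\lambda)\|_H + c_gR\phi(\lambda)\lambda^{-a}. \]
Adding and subtracting the same term, using the triangle inequality and recalling the notation $R_\lambda(u)$ defined in (30), we get
\[ \|L^{-a}(f_\lambda^{\mathbf z} - f_H)\|_\rho \le \Delta_1^{\frac12-a}\left(\|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\omega_\lambda\|_H + \|T_{\mathbf x\lambda}^{\frac12-a}(\omega_\lambda^{\mathbf z} - G_\lambda(T_{\mathbf x})T_{\mathbf x}\omega_\lambda)\|_H\right) + c_gR\phi(\lambda)\lambda^{-a}. \]
Using (13),
\[ \|L^{-a}(f_\lambda^{\mathbf z} - f_H)\|_\rho \le \Delta_1^{\frac12-a}\left(\|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\omega_\lambda\|_H + \|T_{\mathbf x\lambda}^{\frac12-a}G_\lambda(T_{\mathbf x})(S_{\mathbf x}^*\mathbf y - T_{\mathbf x}\omega_\lambda)\|_H\right) + c_gR\phi(\lambda)\lambda^{-a}. \tag{46} \]

Estimating $\|T_{\mathbf x\lambda}^{\frac12-a}G_\lambda(T_{\mathbf x})(S_{\mathbf x}^*\mathbf y - T_{\mathbf x}\omega_\lambda)\|_H$: We first have
\[ \|T_{\mathbf x\lambda}^{\frac12-a}G_\lambda(T_{\mathbf x})(S_{\mathbf x}^*\mathbf y - T_{\mathbf x}\omega_\lambda)\|_H \le \|T_{\mathbf x\lambda}^{\frac12-a}G_\lambda(T_{\mathbf x})T_{\mathbf x\lambda}^{\frac12}\|\|T_{\mathbf x\lambda}^{-\frac12}T_\lambda^{\frac12}\|\|T_\lambda^{-\frac12}(S_{\mathbf x}^*\mathbf y - T_{\mathbf x}\omega_\lambda)\|_H. \]
With (11) and (7), we have
\[ \|T_{\mathbf x\lambda}^{\frac12-a}G_\lambda(T_{\mathbf x})T_{\mathbf x\lambda}^{\frac12}\| \le \sup_{u\in[0,\kappa^2]}|(u + \lambda)^{1-a}G_\lambda(u)| \le \sup_{u\in[0,\kappa^2]}|(u^{1-a} + \lambda^{1-a})G_\lambda(u)| \le 2E\lambda^{-a}, \]
and thus
\[ \|T_{\mathbf x\lambda}^{\frac12-a}G_\lambda(T_{\mathbf x})(S_{\mathbf x}^*\mathbf y - T_{\mathbf x}\omega_\lambda)\|_H \le 2E\lambda^{-a}\Delta_1^{1/2}\|T_\lambda^{-\frac12}(S_{\mathbf x}^*\mathbf y - T_{\mathbf x}\omega_\lambda)\|_H. \]
Since by (33), (8) and (45),
\[ \|T_\lambda^{-\frac12}(S_{\mathbf x}^*\mathbf y - T_{\mathbf x}\omega_\lambda)\|_H \le \|T_\lambda^{-\frac12}[(T_{\mathbf x}\omega_\lambda - S_{\mathbf x}^*\mathbf y) - (T\omega_\lambda - S_\rho^* f_H)]\|_H + \|T_\lambda^{-\frac12}(T\omega_\lambda - S_\rho^* f_H)\|_H \]
\[ \le \|T_\lambda^{-\frac12}[(T_{\mathbf x}\omega_\lambda - S_{\mathbf x}^*\mathbf y) - (T\omega_\lambda - S_\rho^* f_H)]\|_H + \|T_\lambda^{-\frac12}S_\rho^*\|\|S_\rho\omega_\lambda - f_H\|_\rho \le \Delta_2 + c_gR\phi(\lambda), \]
we thus have
\[ \|T_{\mathbf x\lambda}^{\frac12-a}G_\lambda(T_{\mathbf x})(S_{\mathbf x}^*\mathbf y - T_{\mathbf x}\omega_\lambda)\|_H \le 2E\lambda^{-a}\Delta_1^{1/2}(\Delta_2 + c_gR\phi(\lambda)). \tag{47} \]

Estimating $\|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\omega_\lambda\|_H$: Note that from the definition of $\omega_\lambda$ in (32), (18), $L = S_\rho S_\rho^*$, and $T = S_\rho^*S_\rho$,
\[ \omega_\lambda = G_\lambda(T)S_\rho^*\phi(L)g_0 = G_\lambda(T)\phi(T)S_\rho^*g_0, \]
and thus,
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\omega_\lambda\|_H \le \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})G_\lambda(T)\phi(T)S_\rho^*\|R = \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})G_\lambda(T)\phi(T)T^{\frac12}\|R. \tag{48} \]
In what follows, we will estimate $\|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})G_\lambda(T)\phi(T)T^{\frac12}\|$, considering three different cases.

Case 1: $\phi(\cdot)$ is operator monotone. We first have
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\phi(T)G_\lambda(T)T^{\frac12}\| \le \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})T_{\mathbf x\lambda}^{\frac12}\|\|T_{\mathbf x\lambda}^{-\frac12}T_\lambda^{\frac12}\|\|T_\lambda^{-\frac12}T^{\frac12}\|\|\phi(T)G_\lambda(T)\| \le \|T_{\mathbf x\lambda}^{1-a}R_\lambda(T_{\mathbf x})\|\Delta_1^{\frac12}\|\phi(T)G_\lambda(T)\|. \]
By the spectral theorem and (12), with (7),
\[ \|T_{\mathbf x\lambda}^{1-a}R_\lambda(T_{\mathbf x})\| \le \sup_{u\in[0,\kappa^2]}|(u + \lambda)^{1-a}R_\lambda(u)| \le \sup_{u\in[0,\kappa^2]}|(u^{1-a} + \lambda^{1-a})R_\lambda(u)| \le 2F\lambda^{1-a}, \]
(where we write $F = F_\tau$ throughout) and it thus follows that
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\phi(T)G_\lambda(T)T^{\frac12}\| \le 2F\Delta_1^{\frac12}\lambda^{1-a}\|\phi(T)G_\lambda(T)\|. \]
Using the spectral theorem, with (4), we get
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\phi(T)G_\lambda(T)T^{\frac12}\| \le 2F\Delta_1^{\frac12}\lambda^{1-a}\sup_{u\in[0,\kappa^2]}|G_\lambda(u)\phi(u)|. \]
When $0 < u \le \lambda$, as $\phi(u)$ is non-decreasing, $\phi(u) \le \phi(\lambda)$. Applying (11), we have
\[ G_\lambda(u)\phi(u) \le E\phi(\lambda)\lambda^{-1}. \]
When $\lambda < u \le \kappa^2$, following from Lemma 5.8, we have that there is a $c'_\phi \ge 1$, which depends only on $\phi$, $\kappa^2$ and $b$, such that
\[ \phi(u)u^{-1} \le c'_\phi\phi(\lambda)\lambda^{-1}. \]
Then, combining with (11),
\[ G_\lambda(u)\phi(u) = G_\lambda(u)u\,\phi(u)u^{-1} \le Ec'_\phi\phi(\lambda)\lambda^{-1}. \]
Therefore, for all $0 < u \le \kappa^2$, $G_\lambda(u)\phi(u) \le Ec'_\phi\phi(\lambda)\lambda^{-1}$ and consequently,
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\phi(T)G_\lambda(T)T^{\frac12}\| \le 2Ec'_\phi F\Delta_1^{\frac12}\lambda^{-a}\phi(\lambda). \]
Introducing the above into (48), we get
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\omega_\lambda\|_H \le 2Ec'_\phi FR\Delta_1^{\frac12}\lambda^{-a}\phi(\lambda). \tag{49} \]

Case 2: $\phi(\cdot)$ is Lipschitz continuous with constant 1. By the triangle inequality, we have
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\phi(T)G_\lambda(T)T^{\frac12}\| \le \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\phi(T_{\mathbf x})G_\lambda(T)T^{\frac12}\| + \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})(\phi(T) - \phi(T_{\mathbf x}))G_\lambda(T)T^{\frac12}\| \]
\[ \le \|R_\lambda(T_{\mathbf x})\phi(T_{\mathbf x})\|\|T_{\mathbf x\lambda}^{\frac12-a}T_\lambda^{a-\frac12}\|\|T_\lambda^{\frac12-a}G_\lambda(T)T^{\frac12}\| + \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\|\|\phi(T) - \phi(T_{\mathbf x})\|_{HS}\|G_\lambda(T)T^{\frac12}\|. \]
Since $\phi(u)$ is Lipschitz continuous with constant 1 and $\phi(0) = 0$, then according to Lemma 5.9, $\|\phi(T) - \phi(T_{\mathbf x})\|_{HS} \le \|T - T_{\mathbf x}\|_{HS}$. It thus follows that
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\phi(T)G_\lambda(T)T^{\frac12}\| \le \|R_\lambda(T_{\mathbf x})\phi(T_{\mathbf x})\|\|T_{\mathbf x\lambda}^{\frac12-a}T_\lambda^{a-\frac12}\|\|T_\lambda^{\frac12-a}G_\lambda(T)T^{\frac12}\| + \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\|\|T - T_{\mathbf x}\|_{HS}\|G_\lambda(T)T^{\frac12}\| \]
\[ \le c_g\phi(\lambda)\|T_{\mathbf x\lambda}^{\frac12-a}T_\lambda^{a-\frac12}\|\|T_\lambda^{\frac12-a}G_\lambda(T)T^{\frac12}\| + \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\|\Delta_3\|G_\lambda(T)T^{\frac12}\|, \]
where for the last inequality, we used (31) to bound $\|R_\lambda(T_{\mathbf x})\phi(T_{\mathbf x})\|$:
\[ \|R_\lambda(T_{\mathbf x})\phi(T_{\mathbf x})\| \le \sup_{u\in[0,\kappa^2]}|R_\lambda(u)\phi(u)| \le c_g\phi(\lambda). \]
Applying Lemma 5.7, which implies
\[ \|T_{\mathbf x\lambda}^{\frac12-a}T_\lambda^{a-\frac12}\| = \|T_{\mathbf x\lambda}^{\frac12(1-2a)}T_\lambda^{-\frac12(1-2a)}\| \le \|T_{\mathbf x\lambda}^{\frac12}T_\lambda^{-\frac12}\|^{1-2a} \le \Delta_1^{\frac12-a}, \]
we get
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\phi(T)G_\lambda(T)T^{\frac12}\| \le c_g\phi(\lambda)\Delta_1^{\frac12-a}\|T_\lambda^{\frac12-a}G_\lambda(T)T^{\frac12}\| + \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\|\Delta_3\|G_\lambda(T)T^{\frac12}\|. \tag{50} \]
By the spectral theorem and (11), with (4) and $0 \le a \le \frac12$, we have
\[ \|G_\lambda(T)T^{\frac12}\| \le \sup_{u\in[0,\kappa^2]}|u^{\frac12}G_\lambda(u)| \le E\lambda^{-\frac12} \tag{51} \]
and
\[ \|T_\lambda^{\frac12-a}G_\lambda(T)T^{\frac12}\| \le \sup_{u\in[0,\kappa^2]}(u^{\frac12-a} + \lambda^{\frac12-a})|G_\lambda(u)|u^{\frac12} \le 2E\lambda^{-a}. \]
Similarly, by (12), with (7),
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\| \le \sup_{u\in[0,\kappa^2]}(u^{\frac12-a} + \lambda^{\frac12-a})|R_\lambda(u)| \le 2F\lambda^{\frac12-a}. \]
Therefore, following from the above three estimates and (50), we get
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\phi(T)G_\lambda(T)T^{\frac12}\| \le 2c_g\phi(\lambda)\Delta_1^{\frac12-a}E\lambda^{-a} + 2EF\lambda^{-a}\Delta_3. \tag{52} \]
Introducing the above into (48), we get
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\omega_\lambda\|_H \le 2ER\lambda^{-a}(c_g\phi(\lambda)\Delta_1^{\frac12-a} + F\Delta_3). \tag{53} \]
Applying (53) (or (49)) and (47) into (46), by a direct calculation, we get
\[ \|L^{-a}(f_\lambda^{\mathbf z} - f_H)\|_\rho \le \Delta_1^{\frac12-a}2E\lambda^{-a}\left(\Delta_1^{\frac12}\Delta_2 + \Delta_1^{\frac12}R(c_g + c''_\phi)\phi(\lambda) + FR\Delta_3\right) + c_gR\phi(\lambda)\lambda^{-a}. \]
Here, $c''_\phi = c'_\phi F$ if $\phi$ is operator monotone, or $c''_\phi = c_g$ if $\phi$ is Lipschitz continuous with constant 1. Introducing $\Delta_1$, $\Delta_2$ and $\Delta_3$, by a direct calculation and $\lambda \le 1$, one can prove the first part of the theorem with
\[ \tilde C_1 = 2EC_4^{1-a}C_1, \qquad \tilde C_2 = 2E\sqrt{C_3}\,C_4^{1-a} + 12\kappa^2EFRC_4^{\frac12-a}, \qquad \tilde C_3 = 2E\sqrt{C_2}\,C_4^{1-a} + c_gR + 2ERC_4^{1-a}(c_g + c''_\phi). \]

Case 3: $\phi = \psi\vartheta$, where $\psi$ is operator monotone and $\vartheta$ is Lipschitz continuous with constant 1. Since $\phi = \vartheta\psi$, we can rewrite $\phi(T)$ as
\[ \phi(T_{\mathbf x}) + (\vartheta(T) - \vartheta(T_{\mathbf x}))\psi(T) + \vartheta(T_{\mathbf x})(\psi(T) - \psi(T_{\mathbf x})). \]
Thus, together with the triangle inequality,
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\phi(T)G_\lambda(T)T^{\frac12}\| \le \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\phi(T_{\mathbf x})G_\lambda(T)T^{\frac12}\| + \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})(\vartheta(T) - \vartheta(T_{\mathbf x}))G_\lambda(T)T^{\frac12}\|\|\psi(T)\| + \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\vartheta(T_{\mathbf x})(\psi(T) - \psi(T_{\mathbf x}))G_\lambda(T)T^{\frac12}\|. \tag{54} \]
Following the same argument as that for (52), we know that
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\phi(T_{\mathbf x})G_\lambda(T)T^{\frac12}\| \le 2c_gE\phi(\lambda)\Delta_1^{\frac12-a}\lambda^{-a}, \tag{55} \]
and
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})(\vartheta(T) - \vartheta(T_{\mathbf x}))G_\lambda(T)T^{\frac12}\| \le 2EF\lambda^{-a}\Delta_3. \tag{56} \]
As the qualification of $G_\lambda$ covers $\vartheta(u)u^{\frac12-a}$, applying the spectral theorem and Lemma 5.1, we get
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\vartheta(T_{\mathbf x})\| \le \sup_{u\in[0,\kappa^2]}(u + \lambda)^{\frac12-a}R_\lambda(u)\vartheta(u) \le c'_g\vartheta(\lambda)(\lambda^{\frac12-a} + \lambda^{\frac12-a}). \]
Since $\psi$ is operator monotone on $[0, b]$ where $b > \kappa^2$, we know from Lemma 5.8 that there exists a positive constant $c_\psi < \infty$ depending on $b - \kappa^2$, such that
\[ \|\psi(T) - \psi(T_{\mathbf x})\| \le c_\psi\psi(\|T - T_{\mathbf x}\|). \]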
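If $\sqrt n \ge 6\log\frac6\delta$, as $\psi$ is non-decreasing, following from (44), we have $\psi(\|T - T_{\mathbf x}\|) \le \psi(\|T - T_{\mathbf x}\|_{HS}) \le \psi(\Delta_3)$ and thus
\[ \|\psi(T) - \psi(T_{\mathbf x})\| \le c_\psi\psi(\Delta_3) \le c'_\psi c_\psi\psi(n^{-1/2})6\kappa^2\log\frac6\delta, \]
where for the last inequality, we used Lemma 5.8. If $\sqrt n \le 6\log\frac6\delta$, then as $\|T - T_{\mathbf x}\| \le \max(\|T\|, \|T_{\mathbf x}\|) \le \kappa^2$,
\[ \|\psi(T) - \psi(T_{\mathbf x})\| \le c_\psi\psi(\kappa^2) \le c_\psi\psi(\kappa^2)6\log\frac6\delta\frac{1}{\sqrt n} \le c'_\psi c_\psi6\kappa^2\log\frac6\delta\,\psi(n^{-1/2}), \]
where for the last inequality, we used Lemma 5.8. Therefore, following from the above analysis and (51),
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\vartheta(T_{\mathbf x})(\psi(T) - \psi(T_{\mathbf x}))G_\lambda(T)T^{\frac12}\| \le \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\vartheta(T_{\mathbf x})\|\|\psi(T) - \psi(T_{\mathbf x})\|\|G_\lambda(T)T^{\frac12}\| \le 12c'_gc'_\psi c_\psi E\kappa^2\log\frac6\delta\,\lambda^{-a}\vartheta(\lambda)\psi(n^{-1/2}). \]
Introducing the above estimate, (55) and (56) into (54), with $\|\psi(T)\| \le \psi(\kappa^2)$ (since $\psi$ is operator monotone and (4)), we conclude that
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\phi(T)G_\lambda(T)T^{\frac12}\| \le 2\lambda^{-a}\left(c_gE\phi(\lambda)\Delta_1^{\frac12-a} + EF\Delta_3\psi(\kappa^2) + 6\kappa^2c'_gc'_\psi c_\psi E\vartheta(\lambda)\psi(n^{-\frac12})\log\frac6\delta\right). \]
Introducing the above into (48), we get
\[ \|T_{\mathbf x\lambda}^{\frac12-a}R_\lambda(T_{\mathbf x})\omega_\lambda\|_H \le 2\lambda^{-a}\left(c_gE\phi(\lambda)\Delta_1^{\frac12-a} + EF\psi(\kappa^2)\Delta_3 + 6\kappa^2c'_gc'_\psi c_\psi E\vartheta(\lambda)\psi(n^{-\frac12})\log\frac6\delta\right)R. \]
Combining the above and (47) with (46), by a direct calculation, we get
\[ \|L^{-a}(f_\lambda^{\mathbf z} - f_H)\|_\rho \le \lambda^{-a}\left(2E\Delta_1^{1-a}(\Delta_2 + 2c_gR\phi(\lambda)) + c_gR\phi(\lambda) + 2EFR\psi(\kappa^2)\Delta_1^{\frac12-a}\Delta_3 + 12\kappa^2c'_gERc'_\psi c_\psi\Delta_1^{\frac12-a}\vartheta(\lambda)\psi(n^{-\frac12})\log\frac6\delta\right). \]
Introducing $\Delta_1$, $\Delta_2$ and $\Delta_3$, by a simple calculation, with $\lambda \le 1$, we can prove the second part of the theorem with
\[ \tilde C_4 = 2EC_4^{1-a}\sqrt{C_3} + 12EFR\psi(\kappa^2)\kappa^2C_4^{\frac12-a}, \qquad \tilde C_5 = 2EC_4^{1-a}(3c_gR + \sqrt{C_2}), \qquad \tilde C_6 = 12\kappa^2c_\psi c'_gc'_\psi ERC_4^{\frac12-a}. \]

Proof of Corollary 4.3. Let $\theta$ be such that $\lambda = n^{\theta-1}$. As $\Theta(u)$ is non-decreasing, $\Theta(0) = 0$, $\Theta(1) = 1$ and $\lambda$ satisfies $\Theta(\lambda) = n^{-1}$, we have $0 \le \lambda \le 1$. Moreover, since $\phi(\lambda)\lambda^{-\zeta}$ is non-decreasing, which implies
\[ \frac{\phi(1)}{\sqrt n\,\phi(\lambda)\lambda^{-\zeta}} \ge \frac{1}{\sqrt n}, \tag{57} \]
and since
\[ \lambda^{\frac{\gamma+2\zeta}{2}} = \frac{\phi(1)}{\sqrt n\,\phi(\lambda)\lambda^{-\zeta}}, \]
we get $\lambda \ge n^{-\frac{1}{2\zeta+\gamma}}$. Thus, with $2\zeta + \gamma > 1$, $\theta = \log_n\lambda + 1 \ge -\frac{1}{2\zeta+\gamma} + 1 > 0$. Also, $\theta \le 1$ as $\lambda \le 1$. Applying Part 1) of Theorem 4.2, and noting that by $2\zeta + \gamma > 1$, $1 \ge \lambda \ge n^{-\frac{1}{2\zeta+\gamma}}$ and (57),
\[ \frac{1}{n\lambda^{\frac12}} \le \frac{1}{\sqrt n} \le \frac{1}{\sqrt{n\lambda^\gamma}} = \frac{\phi(\lambda)}{\phi(1)}, \qquad \frac{1}{n\lambda^{1-\zeta}} \le \frac{1}{\sqrt{n\lambda^\gamma}} \quad (\text{if } 2\zeta \le 1), \]
we can prove the first desired result. The second desired result can be proved by using Part 2) of Theorem 4.2, the above estimates, as well as $\psi(n^{-1/2}) \le \psi(\lambda)$ (since $\psi$ is non-decreasing).

Proof of Corollary 4.4. If $\zeta \le 1$, then $\phi$ is operator monotone [17, Theorem 1 and Example 1]. If $\zeta \ge 1$, then $\phi$ is Lipschitz continuous with constant 1 over $[0, \kappa^2]$. Applying Part 1) of Theorem 4.2, one can prove the desired results.

Proof of Corollary 4.5. The proof can be done by using Corollary 4.4 with simple arguments. For notational simplicity, we let
\[ \Lambda_n = \begin{cases} n^{-\frac{\zeta-a}{2\zeta+\gamma}} & \text{if } 2\zeta + \gamma > 1, \\ n^{-(\zeta-a)}(1 \vee \log n^\gamma)^{(1-a)} & \text{if } 2\zeta + \gamma \le 1. \end{cases} \]
1) Using the fact that for any non-negative random variable $\xi$, $E[\xi] = \int_{t\ge0}\Pr(\xi \ge t)\,dt$, and Corollary 4.4, for any $q \in \mathbb N_+$,
\[ E\|L^{-a}(f_\lambda^{\mathbf z} - f_H)\|_\rho^q \le C\int_{t\ge0}\exp\left\{-\left(\frac{t}{C\Lambda_n^q}\right)^{\frac{1}{q(2-a)}}\right\}dt \le C\Lambda_n^q. \]
2) By Corollary 4.4, we have, with $\delta = n^{-2}$,
\[ \sum_{n=1}^\infty\Pr\left(n^{\frac{\zeta-a-\epsilon}{1\vee(2\zeta+\gamma)}}\|L^{-a}(f_\lambda^{\mathbf z} - f_H)\|_\rho > Cn^{\frac{\zeta-a-\epsilon}{1\vee(2\zeta+\gamma)}}\Lambda_n\log^{2-a}n^2\right) \le \sum_{n=1}^\infty\delta < \infty. \]
Note that $Cn^{\frac{\zeta-a-\epsilon}{1\vee(2\zeta+\gamma)}}\Lambda_n\log^{2-a}n^2 \to 0$ as $n \to \infty$. Thus, applying the Borel-Cantelli lemma, one can prove Part 2).
3) Following the argument from the proof of Corollary 4.4, one can prove Part 3).

Acknowledgment

This manuscript version is made available under the CC-BY-NC-ND 4.0 license. JL and VC's work was supported in part by Office of Naval Research (ONR) under grant agreement number N62909-17-1-2111, in part by Hasler Foundation Switzerland under grant agreement number 16066, and in part by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement number 725594).

References

[1] F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23(1):52–72, 2007.
[2] M. S. Birman and M. Solomyak.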
Double operator integrals in a Hilbert space. Integral Equations and Operator Theory, 47(2):131–168, 2003.
[3] G. Blanchard and N. Mücke. Optimal rates for regularization of statistical inverse learning problems. arXiv preprint arXiv:1604.04054, 2016.
[4] A. Caponnetto. Optimal learning rates for regularization operators in learning theory. Technical report, 2006.
[5] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
[6] F. Cucker and D. X. Zhou. Learning theory: an approximation theory viewpoint, volume 24. Cambridge University Press, 2007.
[7] L. H. Dicker, D. P. Foster, and D. Hsu. Kernel ridge vs. principal component regression: Minimax bounds and the qualification of regularization operators. Electronic Journal of Statistics, 11(1):1022–1047, 2017.
[8] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of inverse problems, volume 375. Springer Science & Business Media, 1996.
[9] J. Fujii, M. Fujii, T. Furuta, and R. Nakamoto. Norm inequalities equivalent to Heinz inequality. Proceedings of the American Mathematical Society, 118(3):827–830, 1993.
[10] L. L. Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri. Spectral algorithms for supervised learning. Neural Computation, 20(7):1873–1897, 2008.
[11] Z.-C. Guo, S.-B. Lin, and D.-X. Zhou. Learning theory of distributed spectral algorithms. Inverse Problems, 2017.
[12] D. Hsu, S. M. Kakade, and T. Zhang. Random design analysis of ridge regression. Foundations of Computational Mathematics, 14(3):569–600, 2014.
[13] J. Lin and V. Cevher. Optimal convergence for distributed learning with stochastic gradient methods and spectral algorithms. arXiv, 2018.
[14] J. Lin and L. Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(97):1–47, 2017.
[15] S.-B. Lin, X. Guo, and D.-X. Zhou. Distributed learning with regularized least squares. Journal of Machine Learning Research, 18(92):1–31, 2017.
[16] P. Mathé and S. Pereverzev. Regularization of some linear ill-posed problems with discretized random noisy data. Mathematics of Computation, 75(256):1913–1929, 2006.
[17] P. Mathé and S. V. Pereverzev. Moduli of continuity for operator valued functions. 2002.
[18] G. Myleiko, S. Pereverzyev Jr, and S. Solodky. Regularized Nyström subsampling in regression and ranking problems under general smoothness assumptions. 2017.
[19] I. Pinelis and A. Sakhanenko. Remarks on inequalities for large deviation probabilities. Theory of Probability & Its Applications, 30(1):143–148, 1986.
[20] J. O. Ramsay. Functional data analysis. Wiley Online Library, 2006.
[21] A. Rastogi and S. Sampath. Optimal rates for the regularized learning algorithms under general source condition. Frontiers in Applied Mathematics and Statistics, 3:3, 2017.
[22] L. Rosasco and S. Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630–1638, 2015.
[23] A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization. Advances in Neural Information Processing Systems, pages 1657–1665, 2015.
[24] A. Rudi, G. D. Canas, and L. Rosasco. On the sample complexity of subspace learning. In Advances in Neural Information Processing Systems, pages 2067–2075, 2013.
[25] A. Rudi and L. Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3215–3225, 2017.
[26] S. Smale and D.-X. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153–172, 2007.
[27] I. Steinwart and A. Christmann. Support vector machines. Springer Science & Business Media, 2008.
[28] I. Steinwart, D. R. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Conference On Learning Theory, 2009.
[29] Z. Szabó, A. Gretton, B.
Póczos, and B. Sriperumbudur. Two-stage sampled learning theory on distributions. In Artificial Intelligence and Statistics, pages 948–957, 2015.
[30] Q. Wu, Y. Ying, and D.-X. Zhou. Learning rates of least-square regularized regression. Foundations of Computational Mathematics, 6(2):171–192, 2006.
[31] Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
[32] T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077–2098, 2005.
[33] T. Zhang and B. Yu. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33(4):1538–1579, 2005.

A List of Notations

Notation: Meaning
$H$: the input space, a separable Hilbert space
$\rho, \rho_X$: the fixed probability measure on $H \times \mathbb R$, the induced marginal measure of $\rho$ on $H$
$\rho(\cdot|x)$: the conditional probability measure on $\mathbb R$ w.r.t. $x \in H$ and $\rho$
$H_\rho$: the hypothesis space, $\{f : H \to \mathbb R \mid \exists\omega \in H$ with $f(x) = \langle\omega, x\rangle_H$, $\rho_X$-almost surely$\}$
$n$: the sample size
$\mathbf z$: the whole samples $\{z_i\}_{i=1}^n$, where each $z_i$ is i.i.d. according to $\rho$
$\mathbf y$: the vector of sample outputs, $(y_1, \cdots, y_n)$
$\mathbf x$: the set of sample inputs, $\{x_1, \cdots, x_n\}$
$\mathcal E$: the expected risk defined by (1)
$L^2_{\rho_X}$: the Hilbert space of square integrable functions from $H$ to $\mathbb R$ with respect to $\rho_X$
$f_\rho$: the regression function defined by (2)
$\kappa^2$: the constant from the boundedness assumption (3) on the input space $H$
$S_\rho$: the linear map from $H$ to $L^2_{\rho_X}$ defined by $S_\rho\omega = \langle\omega, \cdot\rangle_H$
$S_\rho^*$: the adjoint operator of $S_\rho$: $S_\rho^* f = \int_H f(x)x\,d\rho_X(x)$
$L$: the operator from $L^2_{\rho_X}$ to $L^2_{\rho_X}$, $L(f) = S_\rho S_\rho^* f = \int_H\langle x, \cdot\rangle_H f(x)\,d\rho_X(x)$
$T$: the covariance operator from $H$ to $H$, $T = S_\rho^*S_\rho = \int_H\langle\cdot, x\rangle_H x\,d\rho_X(x)$
$S_{\mathbf x}$: the sampling operator from $H$ to $\mathbb R^n$, $(S_{\mathbf x}\omega)_i = \langle\omega, x_i\rangle_H$, $i \in \{1, \cdots, n\}$
$S_{\mathbf x}^*$: the adjoint operator of $S_{\mathbf x}$, $S_{\mathbf x}^*\mathbf y = \frac1n\sum_{i=1}^n y_ix_i$
$T_{\mathbf x}$: the empirical covariance operator, $T_{\mathbf x} = S_{\mathbf x}^*S_{\mathbf x} = \frac1n\sum_{i=1}^n\langle\cdot, x_i\rangle_H x_i$
$f_H$: the projection of $f_\rho$ onto the closure of $H_\rho$ in $L^2_{\rho_X}$
$G_\lambda(\cdot)$: the filter function of the regularized algorithm from Definition 3.1
$\tau$: the qualification of the filter function $G_\lambda$
$E, F_\tau$: the constants related to the filter function $G_\lambda$ from (11) and (12)
$\lambda$: a regularization parameter, $\lambda > 0$
$\omega_\lambda^{\mathbf z}$: an estimated vector defined by (13)
$f_\lambda^{\mathbf z}$: an estimated function defined by (14)
$M, Q$: the positive constants from Assumption (15)
$B$: the constant from (17)
$\phi, R$: the function and the parameter related to the 'regularity' of $f_H$ (see Assumption 2)
$\gamma, c_\gamma$: the parameters related to the effective dimension (see Assumption 3)
$\{\sigma_i\}_i$: the sequence of eigenvalues of $L$
$\psi, \vartheta$: the functions from Part 2 of Theorem 4.2, $\phi = \psi\vartheta$
$T_\lambda$: $T_\lambda = T + \lambda$
$T_{\mathbf x\lambda}$: $T_{\mathbf x\lambda} = T_{\mathbf x} + \lambda$
$\zeta$: the parameter related to the Hölder source condition on $f_H$ (see (28))
$R_\lambda(u)$: $= 1 - G_\lambda(u)u$
$N(\lambda)$: $= \operatorname{tr}(T(T + \lambda)^{-1})$
$c_g$: the constant from Lemma 5.1
$\omega_\lambda$: the population vector defined by (32)
$a_{n,\delta,\gamma}(\theta)$: the quantity defined by (35)
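To make the effective dimension $N(\lambda) = \operatorname{tr}(T(T+\lambda)^{-1})$ from the list above more concrete, the following sketch evaluates it for a diagonalized operator and compares it with a bound of the form $c_\gamma\lambda^{-\gamma}$, as in Assumption 3. Everything specific here is an illustrative assumption rather than something stated in the paper: the polynomially decaying spectrum $\sigma_i = i^{-1/\gamma}$, the choice $\gamma = 1/2$, the truncation at `m` eigenvalues, and the constant $c_\gamma = \pi/2$ (which comes from comparing the sum with $\int_0^\infty (1+\lambda t^2)^{-1}dt$ for this particular spectrum). The function names are hypothetical.

```python
import math


def effective_dimension(eigenvalues, lam):
    """N(lam) = tr(T (T + lam)^{-1}) = sum_i sigma_i / (sigma_i + lam) for a diagonalized T."""
    return sum(s / (s + lam) for s in eigenvalues)


def polynomial_spectrum(gamma, m):
    """Assumed eigenvalues sigma_i = i**(-1/gamma), i = 1..m (the truncation at m is an assumption)."""
    return [i ** (-1.0 / gamma) for i in range(1, m + 1)]


gamma = 0.5
sigmas = polynomial_spectrum(gamma, 100_000)
c_gamma = math.pi / 2  # integral of 1 / (1 + s^2) over [0, inf); valid for gamma = 1/2 only

if __name__ == "__main__":
    for lam in (1e-1, 1e-2, 1e-3):
        n_lam = effective_dimension(sigmas, lam)
        bound = c_gamma * lam ** (-gamma)
        print(f"lam={lam:g}  N(lam)={n_lam:.3f}  c_gamma*lam^(-gamma)={bound:.3f}")
```

In this toy setting $N(\lambda)$ grows like $\lambda^{-\gamma}$ as $\lambda$ decreases, which is the kind of behavior that the parameter $\gamma$ is meant to capture.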
Applied and Computational Harmonic Analysis – arXiv (Cornell University)
Published: Jan 20, 2018