{"title": "Fast Learning from Non-i.i.d. Observations", "book": "Advances in Neural Information Processing Systems", "page_first": 1768, "page_last": 1776, "abstract": "We prove an oracle inequality for generic regularized empirical risk minimization algorithms learning from $\\alpha$-mixing processes. To illustrate this oracle inequality, we use it to derive learning rates for some learning methods including least squares SVMs. Since the proof of the oracle inequality uses recent localization ideas developed for independent and identically distributed (i.i.d.) processes, it turns out that these learning rates are close to the optimal rates known in the i.i.d. case.", "full_text": "Fast Learning from Non-i.i.d. Observations\n\nIngo Steinwart\nInformation Sciences Group CCS-3\nLos Alamos National Laboratory\nLos Alamos, NM 87545, USA\ningo@lanl.gov\n\nAndreas Christmann\nUniversity of Bayreuth\nDepartment of Mathematics\nD-95440 Bayreuth\nAndreas.Christmann@uni-bayreuth.de\n\nAbstract\n\nWe prove an oracle inequality for generic regularized empirical risk minimization algorithms learning from α-mixing processes. To illustrate this oracle inequality, we use it to derive learning rates for some learning methods including least squares SVMs. Since the proof of the oracle inequality uses recent localization ideas developed for independent and identically distributed (i.i.d.) processes, it turns out that these learning rates are close to the optimal rates known in the i.i.d. case.\n\n1 Introduction\n\nIn the past, most articles investigating statistical properties of learning algorithms assumed that the observed data was generated in an i.i.d. fashion. However, in many applications this assumption cannot be strictly justified, since the sample points are intrinsically temporal and thus often weakly dependent. 
Typical examples of this phenomenon are applications where the observations come from (suitably pre-processed) time series, for example financial predictions, signal processing, system observation and diagnosis, and speech or text recognition. A set of natural and widely accepted notions for describing such weak dependencies1 are mixing concepts such as α-, β-, and φ-mixing, since a) they offer a generalization of i.i.d. processes that is satisfied by various types of stochastic processes including Markov chains and many time series models, and b) they quantify the dependence in a conceptually simple way that is accessible to various types of analysis.\nBecause of these features, the machine learning community is currently in the process of appreciating and accepting these notions, as the increasing number of articles in this direction shows. Probably the first work in this direction goes back to Yu [20], whose techniques for β-mixing processes inspired subsequent work such as [18, 10, 11], while the analysis of specific learning algorithms probably started with [9, 5, 8]. More recently, [7] established consistency of regularized boosting algorithms learning from β-mixing processes, while [15] established consistency of support vector machines (SVMs) learning from α-mixing processes, which constitute the largest class of mixing processes. For the latter, [21] established generalization bounds for empirical risk minimization (ERM) and [19, 17] analyzed least squares support vector machines (LS-SVMs).\nIn this work, we establish a general oracle inequality for generic regularized learning algorithms and α-mixing observations by combining a Bernstein inequality for such processes [9] with localization ideas for i.i.d. processes pioneered by [6] and refined by, e.g., [1]. 
To illustrate this oracle inequality, we then use it to show learning rates for some algorithms including ERM over finite sets and LS-SVMs. In the ERM case our results match those in the i.i.d. case if one replaces the number of observations with the “effective number of observations”, while, for LS-SVMs, our rates are at least quite close to the recently obtained optimal rates [16] for i.i.d. observations. However, the latter difference is not surprising when considering the fact that [16] used heavy machinery from empirical process theory such as Talagrand's inequality and localized Rademacher averages, while our results only use a light-weight argument based on Bernstein's inequality.\n\n1For example, [4] write on page 71: “. . . it is a common practice to assume a certain mild asymptotic independence (such as α-mixing) as a precondition in the context of . . . nonlinear time series.”\n\n2 Definitions, Results, and Examples\n\nLet X be a measurable space and Y ⊂ R be closed. Furthermore, let (Ω, A, μ) be a probability space and Z := (Z_i)_{i≥1} be a stochastic process such that Z_i : Ω → X × Y for all i ≥ 1. For n ≥ 1, we further write D_n := ((X_1, Y_1), . . . , (X_n, Y_n)) := (Z_1, . . . , Z_n) for a training set of length n that is distributed according to the first n components of Z. Throughout this work, we assume that Z is stationary, i.e., the (X × Y)^n-valued random variables (Z_{i_1}, . . . , Z_{i_n}) and (Z_{i_1+i}, . . . , Z_{i_n+i}) have the same distribution for all n, i, i_1, . . . , i_n ≥ 1. We further write P for the distribution of one (and thus all) Z_i, i.e., for all measurable A ⊂ X × Y, we have\n\n$$P(A) = \\mu\\bigl(\\{\\omega \\in \\Omega : Z_i(\\omega) \\in A\\}\\bigr). \\qquad (1)$$\n\nTo learn from stationary processes whose components are not independent, [15] suggests that it is necessary to replace the independence assumption by a notion that still guarantees certain concentration inequalities. We will focus on α-mixing, which is based on the α-mixing coefficients\n\n$$\\alpha(\\mathcal{Z}, \\mu, n) := \\sup\\bigl\\{ |\\mu(A \\cap B) - \\mu(A)\\mu(B)| : i \\ge 1,\\ A \\in \\mathcal{A}_1^i,\\ B \\in \\mathcal{A}_{i+n}^\\infty \\bigr\\}, \\qquad n \\ge 1,$$\n\nwhere $\\mathcal{A}_1^i$ and $\\mathcal{A}_{i+n}^\\infty$ are the σ-algebras generated by (Z_1, . . . , Z_i) and (Z_{i+n}, Z_{i+n+1}, . . . ), respectively. Throughout this work, we assume that the process Z is geometrically α-mixing, that is,\n\n$$\\alpha(\\mathcal{Z}, \\mu, n) \\le c \\exp(-b n^\\gamma), \\qquad n \\ge 1, \\qquad (2)$$\n\nfor some constants b > 0, c ≥ 0, and γ > 0. Of course, i.i.d. processes satisfy (2) for c = 0 and all b, γ > 0. Moreover, several time series models such as ARMA and GARCH, which are often used to describe, e.g., financial data, satisfy (2) under natural conditions [4, Chapter 2.6.1], and the same is true for many Markov chains including some dynamical systems perturbed by dynamic noise, see e.g. [18, Chapter 3.5]. An extensive and thorough account of mixing concepts, including stronger mixing notions such as β- and φ-mixing, is provided by [3].\nLet us now describe the learning algorithms we are interested in. To this end, we assume that we have a hypothesis set F consisting of bounded measurable functions f : X → R that is pre-compact with respect to the supremum norm ‖·‖_∞, i.e., for all ε > 0, the covering numbers\n\n$$\\mathcal{N}(\\mathcal{F}, \\|\\cdot\\|_\\infty, \\varepsilon) := \\inf\\Bigl\\{ n \\ge 1 : \\exists f_1, \\dots, f_n \\in \\mathcal{F} \\text{ such that } \\mathcal{F} \\subset \\bigcup_{i=1}^n B(f_i, \\varepsilon) \\Bigr\\}$$\n\nare finite, where B(f_i, ε) := {f ∈ ℓ_∞(X) : ‖f − f_i‖_∞ ≤ ε} denotes the ε-ball with center f_i in the space ℓ_∞(X) of bounded functions f : X → R. 
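The exponent α = γ/(γ + 1) appearing in the results below turns the sample size n into an "effective number of observations" n^α for a geometrically α-mixing process satisfying (2). A minimal sketch of this bookkeeping (the function name is ours, for illustration only):

```python
def effective_sample_size(n, gamma):
    """Effective number of observations n**alpha for a geometrically
    alpha-mixing process with mixing exponent gamma as in (2),
    where alpha = gamma / (gamma + 1)."""
    alpha = gamma / (gamma + 1.0)
    return n ** alpha
```

As γ grows the dependence decays faster, α approaches 1, and the effective sample size approaches the i.i.d. count n.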
Moreover, we assume that we have a regularizer, that is, a function Υ : F → [0, ∞). Following [13, Definition 2.22], we further say that a function L : X × Y × R → [0, ∞) is a loss that can be clipped at some M > 0 if L is measurable and\n\n$$L(x, y, \\bar t\\,) \\le L(x, y, t), \\qquad (x, y, t) \\in X \\times Y \\times \\mathbb{R}, \\qquad (3)$$\n\nwhere $\\bar t$ denotes the clipped value of t at ±M, that is, $\\bar t := t$ if t ∈ [−M, M], $\\bar t := -M$ if t < −M, and $\\bar t := M$ if t > M. Various often-used loss functions can be clipped. For example, if Y := {−1, 1} and L is a convex, margin-based loss represented by φ : R → [0, ∞), that is, L(y, t) = φ(yt) for all y ∈ Y and t ∈ R, then L can be clipped if and only if φ has a global minimum, see [13, Lemma 2.23]. In particular, the hinge loss, the least squares loss for classification, and the squared hinge loss can be clipped, but the logistic loss for classification and the AdaBoost loss cannot be clipped. On the other hand, [12] established a simple technique, which is similar to inserting a small amount of noise into the labeling process, to construct a clippable modification of an arbitrary convex, margin-based loss. Moreover, if Y := [−M, M] and L is a convex, distance-based loss represented by some ψ : R → [0, ∞), that is, L(y, t) = ψ(y − t) for all y ∈ Y and t ∈ R, then L can be clipped whenever ψ(0) = 0, see again [13, Lemma 2.23]. In particular, the least squares loss and the pinball loss used for quantile regression can be clipped if the space of labels Y is bounded.\nGiven a loss function L and an f : X → R, we often use the notation L ∘ f for the function (x, y) ↦ L(x, y, f(x)). 
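The clipping operation and condition (3) are easy to illustrate. The following toy sketch (our own code, not from the paper) checks (3) for the hinge loss, whose representing function φ(s) = max{0, 1 − s} attains its global minimum, so the loss can be clipped at M = 1:

```python
def clip(t, m):
    """Clip t at +/- m, i.e. the map t -> bar t from (3)."""
    return max(-m, min(m, t))

def hinge(y, t):
    """Margin-based hinge loss L(y, t) = phi(y * t) with phi(s) = max(0, 1 - s)."""
    return max(0.0, 1.0 - y * t)

# Condition (3): clipping never increases the loss.
for y in (-1, 1):
    for t in (-5.0, -1.0, 0.0, 0.5, 1.0, 3.0):
        assert hinge(y, clip(t, 1.0)) <= hinge(y, t)
```

The same check fails for the logistic loss, whose φ has no global minimum, matching the discussion above.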
Moreover, the L-risk is defined by\n\n$$\\mathcal{R}_{L,P}(f) := \\int_{X \\times Y} L(x, y, f(x)) \\, dP(x, y),$$\n\nand the minimal L-risk is $\\mathcal{R}_{L,P}^* := \\inf\\{\\mathcal{R}_{L,P}(f) \\mid f : X \\to \\mathbb{R}\\}$. In addition, a function $f_{L,P}^*$ satisfying $\\mathcal{R}_{L,P}(f_{L,P}^*) = \\mathcal{R}_{L,P}^*$ is called a Bayes decision function. Finally, we denote empirical risks based on D_n by $\\mathcal{R}_{L,D_n}(f)$, that is, for a realization D_n(ω) of the training set D_n we have\n\n$$\\mathcal{R}_{L,D_n(\\omega)}(f) = \\frac{1}{n} \\sum_{i=1}^n L\\bigl(X_i(\\omega), Y_i(\\omega), f(X_i(\\omega))\\bigr).$$\n\nGiven a regularizer Υ : F → [0, ∞), a clippable loss, and an accuracy δ ≥ 0, we consider learning methods that, for all n ≥ 1, produce a decision function $f_{D_n,\\Upsilon} \\in \\mathcal{F}$ satisfying\n\n$$\\Upsilon(f_{D_n,\\Upsilon}) + \\mathcal{R}_{L,D_n}(\\bar f_{D_n,\\Upsilon}) \\le \\inf_{f \\in \\mathcal{F}} \\bigl( \\Upsilon(f) + \\mathcal{R}_{L,D_n}(f) \\bigr) + \\delta. \\qquad (4)$$\n\nNote that methods such as SVMs (see below) that minimize the right-hand side of (4) exactly satisfy (4) because of (3). The following theorem, which is our main result, establishes an oracle inequality for methods (4) when the training data is generated by Z.\n\nTheorem 2.1 Let L : X × Y × R → [0, ∞) be a loss that can be clipped at M > 0 and that satisfies L(x, y, 0) ≤ 1, L(x, y, t) ≤ B, and\n\n$$|L(x, y, t) - L(x, y, t')| \\le |t - t'| \\qquad (5)$$\n\nfor all (x, y) ∈ X × Y and t, t′ ∈ [−M, M], where B > 0 is some constant. Moreover, let Z := (Z_i)_{i≥1} be an X × Y-valued process that satisfies (2), and P be defined by (1). Assume that there exist a Bayes decision function $f_{L,P}^*$ and constants ϑ ∈ [0, 1] and $V \\ge B^{2-\\vartheta}$ such that\n\n$$\\mathbb{E}_P \\bigl( L \\circ \\bar f - L \\circ f_{L,P}^* \\bigr)^2 \\le V \\cdot \\bigl( \\mathbb{E}_P (L \\circ \\bar f - L \\circ f_{L,P}^*) \\bigr)^\\vartheta, \\qquad f \\in \\mathcal{F}, \\qquad (6)$$\n\nwhere F is a hypothesis set and L ∘ f denotes the function (x, y) ↦ L(x, y, f(x)). Finally, let Υ : F → [0, ∞) be a regularizer, f_0 ∈ F be a fixed function, and B_0 ≥ B be a constant such that ‖L ∘ f_0‖_∞ ≤ B_0. Then, for all fixed ε > 0, δ ≥ 0, τ > 0, and $n \\ge \\max\\{b/8, 2^{2+5/\\gamma} b^{-1/\\gamma}\\}$, every learning method defined by (4) satisfies with probability μ not less than $1 - 3Ce^{-\\tau}$:\n\n$$\\Upsilon(f_{D_n,\\Upsilon}) + \\mathcal{R}_{L,P}(\\bar f_{D_n,\\Upsilon}) - \\mathcal{R}_{L,P}^* < 3\\bigl( \\Upsilon(f_0) + \\mathcal{R}_{L,P}(f_0) - \\mathcal{R}_{L,P}^* \\bigr) + \\Bigl( \\frac{36 c_\\sigma V (\\tau + \\ln \\mathcal{N}(\\mathcal{F}, \\|\\cdot\\|_\\infty, \\varepsilon))}{n^\\alpha} \\Bigr)^{\\frac{1}{2-\\vartheta}} + \\frac{4 B_0 c_B \\tau}{n^\\alpha} + 4\\varepsilon + 2\\delta,$$\n\nwhere $\\alpha := \\frac{\\gamma}{\\gamma+1}$, $C := 1 + 4e^{-2}c$, $c_\\sigma := (8^{2+\\gamma}/b)^{1/(1+\\gamma)}$, and $c_B := c_\\sigma/3$.\n\nBefore we illustrate this theorem by a few examples, let us briefly discuss the variance bound (6). For example, if Y = [−M, M] and L is the least squares loss, then it is well known that (6) is satisfied for V := 16M² and ϑ = 1, see e.g. [13, Example 7.3]. Moreover, under some assumptions on the distribution P, [14] established a variance bound of the form (6) for the so-called pinball loss used for quantile regression. In addition, for the hinge loss, (6) is satisfied for ϑ := q/(q + 1) if Tsybakov's noise assumption holds for q, see [13, Theorem 8.24]. 
Finally, based on [2], [12] established a variance bound with ϑ = 1 for the earlier mentioned clippable modifications of strictly convex, twice continuously differentiable margin-based loss functions.\nOne might wonder why the constant B_0 is necessary in Theorem 2.1, since apparently it only adds further complexity. However, a closer look reveals that the constant B only bounds functions of the form L ∘ f̄, while B_0 bounds the function L ∘ f_0 for an unclipped f_0 ∈ F. Since we do not assume that all f ∈ F satisfy f̄ = f, we conclude that in general B_0 is necessary. We refer to Examples 2.4 and 2.5 for situations where B_0 is significantly larger than B.\nLet us now consider a few examples of learning methods to which Theorem 2.1 applies. The first one is empirical risk minimization over a finite set.\n\nExample 2.2 Let the hypothesis set F be finite and Υ(f) = 0 for all f ∈ F. Moreover, assume that ‖f‖_∞ ≤ M for all f ∈ F. Then, for accuracy δ := 0, the learning method described by (4) is ERM, and Theorem 2.1 provides, by some simple estimates, the oracle inequality\n\n$$\\mathcal{R}_{L,P}(f_{D_n,\\Upsilon}) - \\mathcal{R}_{L,P}^* < 3 \\inf_{f \\in \\mathcal{F}} \\bigl( \\mathcal{R}_{L,P}(f) - \\mathcal{R}_{L,P}^* \\bigr) + \\Bigl( \\frac{36 c_\\sigma V (\\tau + \\ln|\\mathcal{F}|)}{n^\\alpha} \\Bigr)^{\\frac{1}{2-\\vartheta}} + \\frac{4 B c_B \\tau}{n^\\alpha}.$$\n\nBesides constants, this oracle inequality is an exact analogue of the standard oracle inequality for ERM learning from i.i.d. 
processes, see [13, Theorem 7.2].\n\nBefore we present another example, let us first reformulate Theorem 2.1 for the case that the involved covering numbers have a certain polynomial behavior.\n\nCorollary 2.3 Consider the situation of Theorem 2.1 and additionally assume that there exist constants a > 0 and p ∈ (0, 1] such that\n\n$$\\ln \\mathcal{N}(\\mathcal{F}, \\|\\cdot\\|_\\infty, \\varepsilon) \\le a \\varepsilon^{-2p}, \\qquad \\varepsilon > 0.$$\n\nThen there is a constant $c_{p,\\vartheta} > 0$, depending only on p and ϑ, such that the inequality of Theorem 2.1 reduces to\n\n$$\\Upsilon(f_{D_n,\\Upsilon}) + \\mathcal{R}_{L,P}(\\bar f_{D_n,\\Upsilon}) - \\mathcal{R}_{L,P}^* < 3\\bigl( \\Upsilon(f_0) + \\mathcal{R}_{L,P}(f_0) - \\mathcal{R}_{L,P}^* \\bigr) + c_{p,\\vartheta} \\Bigl( \\frac{c_\\sigma V a}{n^\\alpha} \\Bigr)^{\\frac{1}{2+2p-\\vartheta}} + \\Bigl( \\frac{36 c_\\sigma V \\tau}{n^\\alpha} \\Bigr)^{\\frac{1}{2-\\vartheta}} + \\frac{4 B_0 c_B \\tau}{n^\\alpha} + 2\\delta.$$\n\nFor the learning rates considered in the following examples, the exact value of $c_{p,\\vartheta}$ is of no importance. However, a careful numerical analysis shows that $c_{p,\\vartheta} \\le 40$ for all p ∈ (0, 1] and ϑ ∈ [0, 1].\nCorollary 2.3 can be applied to various methods including, e.g., SVMs with the hinge loss or the pinball loss, and regularized boosting algorithms. For the latter, we refer to, e.g., [2] for some learning rates in the i.i.d. case and to [7] for a consistency result in the case of geometrically β-mixing observations. Unfortunately, a detailed exposition of the learning rates resulting from Corollary 2.3 for all these algorithms is clearly beyond the scope of this paper, and hence we will only discuss learning rates for LS-SVMs. However, the only reason we picked LS-SVMs is that they are one of the few methods for which both rates for learning from α-mixing processes and optimal rates in the i.i.d. case are known. By considering LS-SVMs we can thus assess the sharpness of our results. Let us begin by briefly recalling LS-SVMs. 
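In coordinates, the LS-SVM optimization problem recalled next has a well-known closed form: by the representer theorem, the minimizer of λ‖f‖²_H + R_{L,D_n}(f) is f = Σᵢ aᵢ k(·, xᵢ) with a = (K + nλI)⁻¹ y, where K is the kernel Gram matrix. A minimal NumPy sketch (our own illustration, not the authors' code):

```python
import numpy as np

def lssvm_fit(xs, ys, lam, kernel):
    """Solve min_{f in H}  lam * ||f||_H^2 + (1/n) * sum_i (y_i - f(x_i))^2.
    Representer theorem: f = sum_i a_i k(., x_i) with a = (K + n*lam*I)^{-1} y,
    where K is the Gram matrix K[i, j] = k(x_i, x_j)."""
    n = len(xs)
    gram = np.array([[kernel(xi, xj) for xj in xs] for xi in xs])
    return np.linalg.solve(gram + n * lam * np.eye(n), np.asarray(ys, dtype=float))

def lssvm_predict(coef, xs, kernel, x):
    """Evaluate f(x) = sum_i a_i k(x_i, x)."""
    return sum(a * kernel(xi, x) for a, xi in zip(coef, xs))
```

For small λ the empirical risk term dominates, so the fitted function nearly interpolates the training labels; larger λ enforces a smaller RKHS norm.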
To this end, let X be a compact metric space and k be a continuous kernel on X with reproducing kernel Hilbert space (RKHS) H. Given a regularization parameter λ > 0 and the least squares loss L(y, t) := (y − t)², the LS-SVM finds the unique solution\n\n$$f_{D_n,\\lambda} = \\arg\\min_{f \\in H} \\bigl( \\lambda \\|f\\|_H^2 + \\mathcal{R}_{L,D_n}(f) \\bigr).$$\n\nTo describe the approximation properties of H, we further need the approximation error function\n\n$$A(\\lambda) := \\inf_{f \\in H} \\bigl( \\lambda \\|f\\|_H^2 + \\mathcal{R}_{L,P}(f) - \\mathcal{R}_{L,P}^* \\bigr), \\qquad \\lambda > 0.$$\n\nExample 2.4 (Rates for least squares SVMs) Let X be a compact metric space, Y = [−1, 1], and Z and P as above. Furthermore, let L be the least squares loss and H be the RKHS of a continuous kernel k over X. Assume that the closed unit ball B_H of H satisfies\n\n$$\\ln \\mathcal{N}(B_H, \\|\\cdot\\|_\\infty, \\varepsilon) \\le a \\varepsilon^{-2p}, \\qquad \\varepsilon > 0, \\qquad (7)$$\n\nwhere a > 0 and p ∈ (0, 1] are some constants. In addition, assume that the approximation error function satisfies $A(\\lambda) \\le c\\lambda^\\beta$ for some c > 0, β ∈ (0, 1], and all λ > 0. We define\n\n$$\\rho := \\min\\Bigl\\{ \\beta, \\frac{\\beta}{\\beta + 2p\\beta + p} \\Bigr\\}.$$\n\nThen Corollary 2.3 applied to $\\mathcal{F} := \\lambda^{-1/2} B_H$ shows that the LS-SVM using $\\lambda_n := n^{-\\alpha\\rho/\\beta}$ learns with rate $n^{-\\alpha\\rho}$. Let us compare this rate with other recent results: [17] establishes the learning rate\n\n$$n^{-\\frac{2\\beta}{\\beta+3}},$$\n\nwhenever (2) is satisfied for some α. At first glance, this rate looks stronger, since it is independent of α. However, a closer look shows that it depends on the confidence level $1 - 3Ce^{-\\tau}$ by a factor of $e^\\tau$ rather than by the factor of τ appearing in our analysis, and hence these rates are not comparable. Moreover, in the case α = 1, our rates are still faster whenever p ∈ (0, 1/3], which is, e.g., 
satisfied for sufficiently smooth kernels, see e.g. [13, Theorem 6.26]. Moreover, [19] has recently established the rate\n\n$$n^{-\\frac{\\alpha\\beta}{2p+1}}, \\qquad (8)$$\n\nwhich is faster than ours if and only if $\\beta > \\frac{1+p}{1+2p}$. In particular, for highly smooth kernels such as the Gaussian RBF kernels, where p can be chosen arbitrarily close to 0, their rate is never faster. Moreover, [19] requires knowing α, which, as we will briefly discuss in Remark 2.6, is not the case for our rates. In this regard, it is interesting to note that their iterative proof procedure, see [13, Chapter 7.1] for a generic description of this technique, can also be applied to our oracle inequality. The resulting rate is essentially $n^{-\\alpha \\min\\{\\beta, \\beta/(\\beta+p\\beta+p)\\}}$, which is always faster than (8). Due to space constraints and the fact that these rates require knowing α and β, we skip a detailed exposition. Finally, both [19] and [17] only consider LS-SVMs, while Theorem 2.1 applies to various learning methods.\n\nExample 2.5 (Almost optimal rates for least squares SVMs) Consider the situation of Example 2.4, and additionally assume that there exists a constant $C_p > 0$ such that\n\n$$\\|f\\|_\\infty \\le C_p \\|f\\|_H^p \\|f\\|_{L_2(P_X)}^{1-p}, \\qquad f \\in H. \\qquad (9)$$\n\nAs in [16], we can then bound $B_0 \\le \\lambda^{(\\beta-1)p}$, and hence the SVM using $\\lambda_n := n^{-\\frac{\\alpha}{\\beta+2p\\beta+p}}$ learns with rate\n\n$$n^{-\\frac{\\alpha\\beta}{\\beta+2p\\beta+p}},$$\n\ncompared to the optimal rate $n^{-\\frac{\\beta}{\\beta+p}}$ in the i.i.d. case, see [16]. In particular, if $H = W^m(X)$ is a Sobolev space over $X \\subset \\mathbb{R}^d$ with smoothness m > d/2, and the marginal distribution $P_X$ is absolutely continuous with respect to the uniform distribution, where the corresponding density is bounded away from 0 and ∞, then (7) and (9) are satisfied for p := d/(2m). 
Moreover, the assumption on the approximation error function is satisfied for β := s/m whenever $f_{L,P}^* \\in W^s(X)$ and s ∈ (d/2, m]. Consequently, the resulting learning rate is\n\n$$n^{-\\frac{2s\\alpha}{2s+d+2ds/m}},$$\n\nwhich in the i.i.d. case, where α = 1, is worse than the optimal rate $n^{-\\frac{2s}{2s+d}}$ by the additional term 2ds/m in the denominator of the exponent. Note that this difference can be made arbitrarily small by picking a sufficiently large m. Unfortunately, we do not know whether the extra term 2ds/m is an artifact of our proof techniques, which are relatively lightweight compared to the heavy machinery used in the i.i.d. case. Similarly, we do not know whether the Bernstein inequality for α-mixing processes we use, see Theorem 3.1, is optimal, but it is the best inequality we could find in the literature. However, if there is, or will be, a better version of this inequality, our oracle inequalities can easily be improved, since our techniques only require a generic form of Bernstein's inequality.\n\nRemark 2.6 In the examples above, the rates were achieved by picking particular regularization sequences that depend on both α and β, which, in turn, are almost never known in practice. Fortunately, there exists an easy way to achieve the above rates without such knowledge. Indeed, let us assume we pick a polynomially growing $n^{-1/p}$-net $\\Lambda_n$ of (0, 1], split the training sample $D_n$ into two (almost) equally sized and consecutive parts $D_n^{(1)}$ and $D_n^{(2)}$, compute $f_{D_n^{(1)},\\lambda}$ for all $\\lambda \\in \\Lambda_n$, and pick a $\\lambda^* \\in \\Lambda_n$ whose $f_{D_n^{(1)},\\lambda^*}$ minimizes the $\\mathcal{R}_{L,D_n^{(2)}}$-risk over $\\Lambda_n$. Then combining Example 2.2 with the oracle inequality of Corollary 2.3 for LS-SVMs shows that the learning rates of Examples 2.4 and 2.5 are also achieved by this training-validation approach. 
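The training-validation selection just described can be sketched generically; the helper names and the toy constant-predictor instantiation below are ours, purely for illustration:

```python
def select_by_validation(train, val, lambdas, fit, risk):
    """Remark 2.6 style selection: fit one decision function per lambda on the
    first (consecutive) half of the sample, then pick the lambda whose function
    minimizes the empirical risk on the second half."""
    best = min(lambdas, key=lambda lam: risk(fit(train, lam), val))
    return best, fit(train, best)

# Toy instantiation: a constant predictor c = mean(y) / (1 + lam), the exact
# minimizer of  lam * c**2 + (1/n) * sum_i (y_i - c)**2  over c.
def fit_const(train, lam):
    return sum(train) / (len(train) * (1.0 + lam))

def mse(c, val):
    return sum((y - c) ** 2 for y in val) / len(val)
```

No knowledge of α or β enters the selection step; only the grid of candidate regularization parameters and the held-out empirical risk are used.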
Although the proof is a straightforward modification of [13, Theorem 7.24], it is beyond the page limit of this paper.\n\n3 Proofs\n\nIn the following, ⌊t⌋ denotes the largest integer n satisfying n ≤ t, and similarly, ⌈t⌉ denotes the smallest integer n satisfying n ≥ t.\nThe key result we need to prove the oracle inequality of Theorem 2.1 is the following Bernstein-type inequality for geometrically α-mixing processes, which was established in [9, Theorem 4.3]:\n\nTheorem 3.1 Let Z := (Z_i)_{i≥1} be an X × Y-valued stochastic process that satisfies (2) and P be defined by (1). Furthermore, let h : X × Y → R be a bounded measurable function for which there exist constants B > 0 and σ ≥ 0 such that $\\mathbb{E}_P h = 0$, $\\mathbb{E}_P h^2 \\le \\sigma^2$, and ‖h‖_∞ ≤ B. For n ≥ 1 we define\n\n$$n^{(\\gamma)} := \\Bigl\\lfloor n \\Bigl\\lceil \\bigl( 8n/b \\bigr)^{\\frac{1}{\\gamma+1}} \\Bigr\\rceil^{-1} \\Bigr\\rfloor.$$\n\nThen, for all n ≥ 1 and all ε > 0, we have\n\n$$\\mu\\Bigl( \\Bigl\\{ \\omega \\in \\Omega : \\frac{1}{n} \\sum_{i=1}^n h(Z_i(\\omega)) \\ge \\varepsilon \\Bigr\\} \\Bigr) \\le \\bigl( 1 + 4 e^{-2} c \\bigr) \\exp\\Bigl( -\\frac{3 \\varepsilon^2 n^{(\\gamma)}}{6\\sigma^2 + 2\\varepsilon B} \\Bigr). \\qquad (10)$$\n\nBefore we prove Theorem 2.1, we need to slightly modify (10). To this end, we first observe that ⌈t⌉ ≤ 2t for all t ≥ 1 and ⌊t⌋ ≥ t/2 for all t ≥ 2. From this it is easy to conclude that, for all n satisfying $n \\ge n_0 := \\max\\{b/8, 2^{2+5/\\gamma} b^{-1/\\gamma}\\}$, we have\n\n$$n^{(\\gamma)} \\ge 2^{-\\frac{2\\gamma+5}{\\gamma+1}} b^{\\frac{1}{\\gamma+1}} n^\\alpha,$$\n\nwhere $\\alpha := \\frac{\\gamma}{\\gamma+1}$. 
For $C := 1 + 4e^{-2}c$, $c_\\sigma := (8^{2+\\gamma}/b)^{1/(1+\\gamma)}$, and $c_B := c_\\sigma/3$, we thus obtain\n\n$$\\mu\\Bigl( \\Bigl\\{ \\omega \\in \\Omega : \\frac{1}{n} \\sum_{i=1}^n h(Z_i(\\omega)) \\ge \\varepsilon \\Bigr\\} \\Bigr) \\le C e^{-\\tau}, \\qquad n \\ge n_0,$$\n\nwhere $\\tau := \\frac{\\varepsilon^2 n^\\alpha}{c_\\sigma \\sigma^2 + \\varepsilon c_B B}$. Simple transformations and estimations then yield\n\n$$\\mu\\Bigl( \\Bigl\\{ \\omega \\in \\Omega : \\frac{1}{n} \\sum_{i=1}^n h(Z_i(\\omega)) \\ge \\sqrt{\\frac{\\tau c_\\sigma \\sigma^2}{n^\\alpha}} + \\frac{c_B B \\tau}{n^\\alpha} \\Bigr\\} \\Bigr) \\le C e^{-\\tau} \\qquad (11)$$\n\nfor all $n \\ge \\max\\{b/8, 2^{2+5/\\gamma} b^{-1/\\gamma}\\}$ and τ > 0. In the following, we will use only this inequality. In addition, we will need the following simple and well-known lemma:\n\nLemma 3.2 For q ∈ (1, ∞), define q′ ∈ (1, ∞) by 1/q + 1/q′ = 1. Then, for all a, b ≥ 0, we have $(qa)^{2/q}(q'b)^{2/q'} \\le (a + b)^2$ and $ab \\le a^q/q + b^{q'}/q'$.\n\nProof of Theorem 2.1: For f : X → R we define $h_f := L \\circ f - L \\circ f_{L,P}^*$. By the definition of $f_{D_n,\\Upsilon}$, we then have $\\Upsilon(f_{D_n,\\Upsilon}) + \\mathbb{E}_{D_n} h_{\\bar f_{D_n,\\Upsilon}} \\le \\Upsilon(f_0) + \\mathbb{E}_{D_n} h_{f_0} + \\delta$, and consequently we obtain\n\n$$\\Upsilon(f_{D_n,\\Upsilon}) + \\mathcal{R}_{L,P}(\\bar f_{D_n,\\Upsilon}) - \\mathcal{R}_{L,P}^* = \\Upsilon(f_{D_n,\\Upsilon}) + \\mathbb{E}_P h_{\\bar f_{D_n,\\Upsilon}} \\le \\Upsilon(f_0) + \\mathbb{E}_{D_n} h_{f_0} - \\mathbb{E}_{D_n} h_{\\bar f_{D_n,\\Upsilon}} + \\mathbb{E}_P h_{\\bar f_{D_n,\\Upsilon}} + \\delta = \\bigl(\\Upsilon(f_0) + \\mathbb{E}_P h_{f_0}\\bigr) + \\bigl(\\mathbb{E}_{D_n} h_{f_0} - \\mathbb{E}_P h_{f_0}\\bigr) + \\bigl(\\mathbb{E}_P h_{\\bar f_{D_n,\\Upsilon}} - \\mathbb{E}_{D_n} h_{\\bar f_{D_n,\\Upsilon}}\\bigr) + \\delta. \\qquad (12)$$\n\nLet us first bound the term $\\mathbb{E}_{D_n} h_{f_0} - \\mathbb{E}_P h_{f_0}$. To this end, we further split this difference into\n\n$$\\mathbb{E}_{D_n} h_{f_0} - \\mathbb{E}_P h_{f_0} = \\bigl(\\mathbb{E}_{D_n}(h_{f_0} - h_{\\bar f_0}) - \\mathbb{E}_P(h_{f_0} - h_{\\bar f_0})\\bigr) + \\bigl(\\mathbb{E}_{D_n} h_{\\bar f_0} - \\mathbb{E}_P h_{\\bar f_0}\\bigr). \\qquad (13)$$\n\nNow $L \\circ f_0 - L \\circ \\bar f_0 \\ge 0$ implies $h_{f_0} - h_{\\bar f_0} = L \\circ f_0 - L \\circ \\bar f_0 \\in [0, B_0]$, and hence we obtain\n\n$$\\mathbb{E}_P\\bigl((h_{f_0} - h_{\\bar f_0}) - \\mathbb{E}_P(h_{f_0} - h_{\\bar f_0})\\bigr)^2 \\le \\mathbb{E}_P(h_{f_0} - h_{\\bar f_0})^2 \\le B_0\\, \\mathbb{E}_P(h_{f_0} - h_{\\bar f_0}).$$\n\nInequality (11) applied to $h := (h_{f_0} - h_{\\bar f_0}) - \\mathbb{E}_P(h_{f_0} - h_{\\bar f_0})$ thus shows that\n\n$$\\mathbb{E}_{D_n}(h_{f_0} - h_{\\bar f_0}) - \\mathbb{E}_P(h_{f_0} - h_{\\bar f_0}) < \\sqrt{\\frac{\\tau c_\\sigma B_0\\, \\mathbb{E}_P(h_{f_0} - h_{\\bar f_0})}{n^\\alpha}} + \\frac{c_B B_0 \\tau}{n^\\alpha}$$\n\nholds with probability μ not less than $1 - Ce^{-\\tau}$. Moreover, using $\\sqrt{ab} \\le a + b/4$ for a, b ≥ 0, we find\n\n$$\\sqrt{n^{-\\alpha} \\tau c_\\sigma B_0\\, \\mathbb{E}_P(h_{f_0} - h_{\\bar f_0})} \\le \\mathbb{E}_P(h_{f_0} - h_{\\bar f_0}) + n^{-\\alpha} c_\\sigma B_0 \\tau/4,$$\n\nand consequently we have with probability μ not less than $1 - Ce^{-\\tau}$ that\n\n$$\\mathbb{E}_{D_n}(h_{f_0} - h_{\\bar f_0}) - \\mathbb{E}_P(h_{f_0} - h_{\\bar f_0}) < \\mathbb{E}_P(h_{f_0} - h_{\\bar f_0}) + \\frac{7 c_B B_0 \\tau}{4 n^\\alpha}. \\qquad (14)$$\n\nIn order to bound the remaining term in (13), that is $\\mathbb{E}_{D_n} h_{\\bar f_0} - \\mathbb{E}_P h_{\\bar f_0}$, we first observe that (5) implies $\\|h_{\\bar f_0}\\|_\\infty \\le B$, and hence we have $\\|h_{\\bar f_0} - \\mathbb{E}_P h_{\\bar f_0}\\|_\\infty \\le 2B$. 
Moreover, (6) yields\n\n$$\\mathbb{E}_P(h_{\\bar f_0} - \\mathbb{E}_P h_{\\bar f_0})^2 \\le \\mathbb{E}_P h_{\\bar f_0}^2 \\le V (\\mathbb{E}_P h_{\\bar f_0})^\\vartheta.$$\n\nIn addition, if ϑ ∈ (0, 1], Lemma 3.2 implies, for $q := \\frac{2}{2-\\vartheta}$, $q' := \\frac{2}{\\vartheta}$, $a := (n^{-\\alpha} c_\\sigma 2^{-\\vartheta} \\vartheta^\\vartheta V \\tau)^{1/2}$, and $b := (2\\vartheta^{-1} \\mathbb{E}_P h_{\\bar f_0})^{\\vartheta/2}$, that\n\n$$\\sqrt{\\frac{c_\\sigma V \\tau (\\mathbb{E}_P h_{\\bar f_0})^\\vartheta}{n^\\alpha}} \\le \\Bigl(1 - \\frac{\\vartheta}{2}\\Bigr) \\Bigl(\\frac{c_\\sigma 2^{-\\vartheta} \\vartheta^\\vartheta V \\tau}{n^\\alpha}\\Bigr)^{\\frac{1}{2-\\vartheta}} + \\mathbb{E}_P h_{\\bar f_0} \\le \\Bigl(\\frac{c_\\sigma V \\tau}{n^\\alpha}\\Bigr)^{\\frac{1}{2-\\vartheta}} + \\mathbb{E}_P h_{\\bar f_0}.$$\n\nSince $\\mathbb{E}_P h_{\\bar f_0} \\ge 0$, this inequality also holds for ϑ = 0, and hence (11) shows that we have\n\n$$\\mathbb{E}_{D_n} h_{\\bar f_0} - \\mathbb{E}_P h_{\\bar f_0} < \\mathbb{E}_P h_{\\bar f_0} + \\Bigl(\\frac{c_\\sigma V \\tau}{n^\\alpha}\\Bigr)^{\\frac{1}{2-\\vartheta}} + \\frac{2 c_B B \\tau}{n^\\alpha} \\qquad (15)$$\n\nwith probability μ not less than $1 - Ce^{-\\tau}$. By combining this estimate with (14) and (13), we now obtain that with probability μ not less than $1 - 2Ce^{-\\tau}$ we have\n\n$$\\mathbb{E}_{D_n} h_{f_0} - \\mathbb{E}_P h_{f_0} < \\mathbb{E}_P h_{f_0} + \\Bigl(\\frac{c_\\sigma V \\tau}{n^\\alpha}\\Bigr)^{\\frac{1}{2-\\vartheta}} + \\frac{2 c_B B \\tau}{n^\\alpha} + \\frac{7 c_B B_0 \\tau}{4 n^\\alpha}, \\qquad (16)$$\n\ni.e., we have established a bound on the second term in (12).\nLet us now fix a minimal ε-net C of F, that is, an ε-net of cardinality $|\\mathcal{C}| = \\mathcal{N}(\\mathcal{F}, \\|\\cdot\\|_\\infty, \\varepsilon)$. Let us first consider the case $n^\\alpha < 3 c_B (\\tau + \\ln|\\mathcal{C}|)$. 
Combining (16) with (12) and using $B \\le B_0$, $B^{2-\\vartheta} \\le V$, $3c_B \\le c_\\sigma$, $2 \\le 4^{1/(2-\\vartheta)}$, and $\\mathbb{E}_P h_{\\bar f_{D_n,\\Upsilon}} - \\mathbb{E}_{D_n} h_{\\bar f_{D_n,\\Upsilon}} \\le 2B$, we then find\n\n$$\\Upsilon(f_{D_n,\\Upsilon}) + \\mathcal{R}_{L,P}(\\bar f_{D_n,\\Upsilon}) - \\mathcal{R}_{L,P}^* \\le \\Upsilon(f_0) + 2\\mathbb{E}_P h_{f_0} + \\Bigl(\\frac{c_\\sigma V \\tau}{n^\\alpha}\\Bigr)^{\\frac{1}{2-\\vartheta}} + \\frac{2 c_B B \\tau}{n^\\alpha} + \\frac{7 c_B B_0 \\tau}{4 n^\\alpha} + \\bigl(\\mathbb{E}_P h_{\\bar f_{D_n,\\Upsilon}} - \\mathbb{E}_{D_n} h_{\\bar f_{D_n,\\Upsilon}}\\bigr) + \\delta \\le \\Upsilon(f_0) + 2\\mathbb{E}_P h_{f_0} + \\Bigl(\\frac{c_\\sigma V (\\tau + \\ln|\\mathcal{C}|)}{n^\\alpha}\\Bigr)^{\\frac{1}{2-\\vartheta}} + \\frac{4 c_B B_0 \\tau}{n^\\alpha} + 2B \\Bigl(\\frac{c_\\sigma (\\tau + \\ln|\\mathcal{C}|)}{n^\\alpha}\\Bigr)^{\\frac{1}{2-\\vartheta}} + \\delta \\le 3\\Upsilon(f_0) + 3\\mathbb{E}_P h_{f_0} + \\Bigl(\\frac{36 c_\\sigma V (\\tau + \\ln|\\mathcal{C}|)}{n^\\alpha}\\Bigr)^{\\frac{1}{2-\\vartheta}} + \\frac{4 c_B B_0 \\tau}{n^\\alpha} + \\delta$$\n\nwith probability μ not less than $1 - 2Ce^{-\\tau}$. It thus remains to consider the case $n^\\alpha \\ge 3 c_B (\\tau + \\ln|\\mathcal{C}|)$. To establish a non-trivial bound on the term $\\mathbb{E}_P h_{\\bar f_{D_n,\\Upsilon}} - \\mathbb{E}_{D_n} h_{\\bar f_{D_n,\\Upsilon}}$ in (12), we define the functions\n\n$$g_{f,r} := \\frac{\\mathbb{E}_P h_{\\bar f} - h_{\\bar f}}{\\mathbb{E}_P h_{\\bar f} + r}, \\qquad f \\in \\mathcal{F},$$\n\nwhere r > 0 is a real number to be fixed later. For f ∈ F, we then have $\\|g_{f,r}\\|_\\infty \\le 2Br^{-1}$, and, for ϑ > 0, $q := \\frac{2}{2-\\vartheta}$, $q' := \\frac{2}{\\vartheta}$, a := r, and $b := \\mathbb{E}_P h_{\\bar f} \\ne 0$, the first inequality of Lemma 3.2 yields\n\n$$\\mathbb{E}_P g_{f,r}^2 \\le \\frac{\\mathbb{E}_P h_{\\bar f}^2}{(\\mathbb{E}_P h_{\\bar f} + r)^2} \\le \\frac{(2-\\vartheta)^{2-\\vartheta} \\vartheta^\\vartheta\\, \\mathbb{E}_P h_{\\bar f}^2}{4 r^{2-\\vartheta} (\\mathbb{E}_P h_{\\bar f})^\\vartheta} \\le V r^{\\vartheta-2}. \\qquad (17)$$\n\nMoreover, for ϑ ∈ (0, 1] and $\\mathbb{E}_P h_{\\bar f} = 0$, we have $\\mathbb{E}_P h_{\\bar f}^2 = 0$ by the variance bound (6), which in turn implies $\\mathbb{E}_P g_{f,r}^2 = 0 \\le V r^{\\vartheta-2}$. Finally, it is not hard to see that $\\mathbb{E}_P g_{f,r}^2 \\le V r^{\\vartheta-2}$ also holds for ϑ = 0. 
Now, (11) together with a simple union bound yields\n\n$$\\mu\\Bigl( \\Bigl\\{ D_n \\in (X \\times Y)^n : \\sup_{f \\in \\mathcal{C}} \\mathbb{E}_{D_n} g_{f,r} < \\sqrt{\\frac{c_\\sigma V \\tau}{n^\\alpha r^{2-\\vartheta}}} + \\frac{2 c_B B \\tau}{n^\\alpha r} \\Bigr\\} \\Bigr) \\ge 1 - C\\,|\\mathcal{C}|\\,e^{-\\tau},$$\n\nand consequently we see that, with probability μ not less than $1 - C|\\mathcal{C}|e^{-\\tau}$, we have\n\n$$\\mathbb{E}_P h_{\\bar f} - \\mathbb{E}_{D_n} h_{\\bar f} < \\bigl(\\mathbb{E}_P h_{\\bar f} + r\\bigr) \\Bigl( \\sqrt{\\frac{c_\\sigma V \\tau}{n^\\alpha r^{2-\\vartheta}}} + \\frac{2 c_B B \\tau}{n^\\alpha r} \\Bigr) \\qquad (18)$$\n\nfor all f ∈ C. Since $f_{D_n,\\Upsilon} \\in \\mathcal{F}$, there now exists an $f_{D_n} \\in \\mathcal{C}$ with $\\|f_{D_n,\\Upsilon} - f_{D_n}\\|_\\infty \\le \\varepsilon$. By the assumed Lipschitz continuity of L, the latter implies\n\n$$\\bigl| h_{\\bar f_{D_n}}(x, y) - h_{\\bar f_{D_n,\\Upsilon}}(x, y) \\bigr| \\le \\bigl| \\bar f_{D_n}(x) - \\bar f_{D_n,\\Upsilon}(x) \\bigr| \\le \\bigl| f_{D_n}(x) - f_{D_n,\\Upsilon}(x) \\bigr| \\le \\varepsilon$$\n\nfor all (x, y) ∈ X × Y. Combining this with (18), applied for τ + ln|C| instead of τ, we obtain\n\n$$\\mathbb{E}_P h_{\\bar f_{D_n,\\Upsilon}} - \\mathbb{E}_{D_n} h_{\\bar f_{D_n,\\Upsilon}} < \\bigl(\\mathbb{E}_P h_{\\bar f_{D_n,\\Upsilon}} + \\varepsilon + r\\bigr) \\Bigl( \\sqrt{\\frac{c_\\sigma V (\\tau + \\ln|\\mathcal{C}|)}{n^\\alpha r^{2-\\vartheta}}} + \\frac{2 c_B B (\\tau + \\ln|\\mathcal{C}|)}{n^\\alpha r} \\Bigr) + 2\\varepsilon$$\n\nwith probability μ not less than $1 - Ce^{-\\tau}$. By combining this estimate with (12) and (16), we then obtain that\n\n$$\\Upsilon(f_{D_n,\\Upsilon}) + \\mathbb{E}_P h_{\\bar f_{D_n,\\Upsilon}} < \\Upsilon(f_0) + 2\\mathbb{E}_P h_{f_0} + \\Bigl(\\frac{c_\\sigma V \\tau}{n^\\alpha}\\Bigr)^{\\frac{1}{2-\\vartheta}} + \\frac{2 c_B B \\tau}{n^\\alpha} + \\frac{7 c_B B_0 \\tau}{4 n^\\alpha} + \\bigl(\\mathbb{E}_P h_{\\bar f_{D_n,\\Upsilon}} + \\varepsilon + r\\bigr) \\Bigl( \\sqrt{\\frac{c_\\sigma V (\\tau + \\ln|\\mathcal{C}|)}{n^\\alpha r^{2-\\vartheta}}} + \\frac{2 c_B B (\\tau + \\ln|\\mathcal{C}|)}{n^\\alpha r} \\Bigr) + 2\\varepsilon + \\delta \\qquad (19)$$\n\nholds with probability μ not less than $1 - 3Ce^{-\\tau}$. Consequently, it remains to bound the various terms. To this end, we first observe that for\n\n$$r := \\Bigl(\\frac{36 c_\\sigma V (\\tau + \\ln|\\mathcal{C}|)}{n^\\alpha}\\Bigr)^{\\frac{1}{2-\\vartheta}}$$\n\nwe obtain, since $6 \\le 36^{1/(2-\\vartheta)}$,\n\n$$\\Bigl(\\frac{c_\\sigma V \\tau}{n^\\alpha}\\Bigr)^{\\frac{1}{2-\\vartheta}} \\le \\frac{r}{6} \\qquad \\text{and} \\qquad \\sqrt{\\frac{c_\\sigma V (\\tau + \\ln|\\mathcal{C}|)}{n^\\alpha r^{2-\\vartheta}}} \\le \\frac{1}{6}.$$\n\nIn addition, $V \\ge B^{2-\\vartheta}$, $c_\\sigma \\ge 3c_B$, $6 \\le 36^{1/(2-\\vartheta)}$, and $n^\\alpha \\ge 3 c_B (\\tau + \\ln|\\mathcal{C}|)$ imply\n\n$$\\frac{2 c_B B (\\tau + \\ln|\\mathcal{C}|)}{r n^\\alpha} = \\frac{6}{9} \\cdot \\frac{3 c_B (\\tau + \\ln|\\mathcal{C}|)}{n^\\alpha} \\cdot \\frac{B}{r} \\le \\frac{6}{9} \\cdot \\frac{1}{r} \\cdot \\Bigl(\\frac{3 c_B V (\\tau + \\ln|\\mathcal{C}|)}{n^\\alpha}\\Bigr)^{\\frac{1}{2-\\vartheta}} \\le \\frac{6}{9} \\cdot \\frac{1}{r} \\cdot \\frac{1}{6} \\Bigl(\\frac{36 c_\\sigma V (\\tau + \\ln|\\mathcal{C}|)}{n^\\alpha}\\Bigr)^{\\frac{1}{2-\\vartheta}} = \\frac{1}{9},$$\n\nand analogously $2 c_B B \\tau / n^\\alpha \\le r/9$. Using these estimates together with $1/6 + 1/9 \\le 1/3$ in (19), we see that\n\n$$\\Upsilon(f_{D_n,\\Upsilon}) + \\mathbb{E}_P h_{\\bar f_{D_n,\\Upsilon}} < \\Upsilon(f_0) + 2\\mathbb{E}_P h_{f_0} + \\frac{r}{3} + \\frac{7 c_B B_0 \\tau}{4 n^\\alpha} + \\frac{\\mathbb{E}_P h_{\\bar f_{D_n,\\Upsilon}} + \\varepsilon + r}{3} + 2\\varepsilon + \\delta$$\n\nholds with probability μ not less than $1 - 3Ce^{-\\tau}$. 
Consequently, we have

$$\Upsilon(f_{D_n,\Upsilon}) + \mathbb{E}_P h_{\bar f_{D_n,\Upsilon}} \;<\; 3\Upsilon(f_0) + 3\,\mathbb{E}_P h_{f_0} + \Bigl(\frac{36\,c_\sigma V(\tau+\ln|\mathcal C|)}{n_\alpha}\Bigr)^{\frac{1}{2-\vartheta}} + \frac{4c_B B_0\tau}{n_\alpha} + 4\varepsilon + 2\delta\,,$$

i.e. we have shown the assertion.

Proof of Corollary 2.3: The result follows from minimizing the right-hand side of the oracle inequality of Theorem 2.1 with respect to $\varepsilon$.

References

[1] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Ann. Statist., 33:1497–1537, 2005.

[2] G. Blanchard, G. Lugosi, and N. Vayatis. On the rate of convergence of regularized boosting classifiers. J. Mach. Learn. Res., 4:861–894, 2003.

[3] R. C. Bradley. Introduction to Strong Mixing Conditions. Vol. 1–3. Kendrick Press, Heber City, UT, 2007.

[4] J. Fan and Q. Yao. Nonlinear Time Series. Springer, New York, 2003.

[5] A. Irle. On consistency in nonparametric estimation under mixing conditions. J. Multivariate Anal., 60:123–147, 1997.

[6] W. S. Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in learning with squared loss. IEEE Trans. Inform. Theory, 44:1974–1980, 1998.

[7] A. Lozano, S. Kulkarni, and R. Schapire. Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 819–826. MIT Press, Cambridge, MA, 2006.

[8] R. Meir. Nonparametric time series prediction through adaptive model selection. Mach. Learn., 39:5–34, 2000.

[9] D. S. Modha and E. Masry. Minimum complexity regression estimation with weakly dependent observations. IEEE Trans. Inform. Theory, 42:2133–2145, 1996.

[10] M. Mohri and A. Rostamizadeh. Stability bounds for non-i.i.d.
processes. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1025–1032. MIT Press, Cambridge, MA, 2008.

[11] M. Mohri and A. Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1097–1104. 2009.

[12] I. Steinwart. Two oracle inequalities for regularized boosting classifiers. Statistics and Its Interface, 2:271–284, 2009.

[13] I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.

[14] I. Steinwart and A. Christmann. Estimating conditional quantiles with the help of the pinball loss. Bernoulli, accepted with minor revision.

[15] I. Steinwart, D. Hush, and C. Scovel. Learning from dependent observations. J. Multivariate Anal., 100:175–194, 2009.

[16] I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In S. Dasgupta and A. Klivans, editors, Proceedings of the 22nd Annual Conference on Learning Theory, pages 79–93. 2009.

[17] H. Sun and Q. Wu. Regularized least square regression with dependent samples. Adv. Comput. Math., to appear.

[18] M. Vidyasagar. A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. Springer, London, 2nd edition, 2003.

[19] Y.-L. Xu and D.-R. Chen. Learning rates of regularized regression for exponentially strongly mixing sequence. J. Statist. Plann. Inference, 138:2180–2189, 2008.

[20] B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. Ann. Probab., 22:94–116, 1994.

[21] B. Zou and L. Li. The performance bounds of learning machines based on exponentially strongly mixing sequences. Comput. Math.
Appl., 53:1050–1058, 2007.