{"title": "Stability Bounds for Non-i.i.d. Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1025, "page_last": 1032, "abstract": "The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds. A key advantage of these bounds is that they are designed for specific learning algorithms, exploiting their particular properties. But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence, which is clear in system diagnosis or time series prediction problems. This paper studies the scenario where the observations are drawn from a stationary beta-mixing sequence, which implies a dependence between observations that weakens over time. It proves novel stability-based generalization bounds that hold even in this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case. We also illustrate their application in the case of several general classes of learning algorithms, including Support Vector Regression and Kernel Ridge Regression.", "full_text": "Stability Bounds for Non-i.i.d. Processes

Mehryar Mohri
Courant Institute of Mathematical Sciences and Google Research
251 Mercer Street, New York, NY 10012
mohri@cims.nyu.edu

Afshin Rostamizadeh
Department of Computer Science, Courant Institute of Mathematical Sciences
251 Mercer Street, New York, NY 10012
rostami@cs.nyu.edu

Abstract

The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds. A key advantage of these bounds is that they are designed for specific learning algorithms, exploiting their particular properties.
But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence, which is clear in system diagnosis or time series prediction problems. This paper studies the scenario where the observations are drawn from a stationary mixing sequence, which implies a dependence between observations that weakens over time. It proves novel stability-based generalization bounds that hold even in this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case. We also illustrate their application in the case of several general classes of learning algorithms, including Support Vector Regression and Kernel Ridge Regression.

1 Introduction

The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds [2-4, 6]. A learning algorithm is stable when the hypotheses it outputs differ in a limited way when small changes are made to the training set. A key advantage of stability bounds is that they are tailored to specific learning algorithms, exploiting their particular properties. They do not depend on complexity measures such as the VC-dimension, covering numbers, or Rademacher complexity, which characterize a class of hypotheses, independently of any algorithm.

But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). Note that the i.i.d. assumption is typically not tested or derived from a data analysis. In many machine learning applications this assumption does not hold.
The observations received by the learning algorithm often have some inherent temporal dependence, which is clear in system diagnosis or time series prediction problems. A typical example of time series data is stock pricing, where clearly prices of different stocks on the same day or of the same stock on different days may be dependent.

This paper studies the scenario where the observations are drawn from a stationary mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations that weakens over time [8, 10, 16, 17]. Our proofs are also based on the independent block technique commonly used in such contexts [17] and a generalized version of McDiarmid's inequality [7]. We prove novel stability-based generalization bounds that hold even in this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the usefulness of stability bounds to non-i.i.d. scenarios. We also illustrate their application to general classes of learning algorithms, including Support Vector Regression (SVR) [15] and Kernel Ridge Regression [13].

Algorithms such as support vector regression (SVR) [14, 15] have been used in the context of time series prediction, in which the i.i.d. assumption does not hold, some with good experimental results [9, 12]. To our knowledge, the use of these algorithms in non-i.i.d. scenarios has not been supported by any theoretical analysis. The stability bounds we give for SVR and many other kernel regularization-based algorithms can thus be viewed as the first theoretical basis for their use in such scenarios.

In Section 2, we introduce the definitions for the non-i.i.d. problems we are considering and discuss the learning scenarios. Section 3 gives our main generalization bounds based on stability, including the full proof and analysis.
In Section 4, we apply these bounds to general kernel regularization-based algorithms, including Support Vector Regression and Kernel Ridge Regression.

2 Preliminaries

We first introduce some standard definitions for dependent observations in mixing theory [5] and then briefly discuss the learning scenarios in the non-i.i.d. case.

2.1 Non-i.i.d. Definitions

Definition 1. A sequence of random variables Z = {Z_t}_{t=-\infty}^{\infty} is said to be stationary if for any t and non-negative integers m and k, the random vectors (Z_t, \ldots, Z_{t+m}) and (Z_{t+k}, \ldots, Z_{t+m+k}) have the same distribution.

Thus, the index t, or time, does not affect the distribution of a variable Z_t in a stationary sequence. This does not, however, imply independence. In particular, for i < j < k, Pr[Z_j | Z_i] may not equal Pr[Z_k | Z_i]. The following is a standard definition giving a measure of the dependence of the random variables Z_t within a stationary sequence. There are several equivalent definitions of this quantity; we adopt here that of [17].

Definition 2. Let Z = {Z_t}_{t=-\infty}^{\infty} be a stationary sequence of random variables. For any i, j \in \mathbb{Z} \cup \{-\infty, +\infty\}, let \sigma_i^j denote the \sigma-algebra generated by the random variables Z_k, i \le k \le j. Then, for any positive integer k, the \beta-mixing and \varphi-mixing coefficients of the stochastic process Z are defined as

\beta(k) = \sup_n \operatorname*{E}_{B \in \sigma_{-\infty}^{n}} \Big[ \sup_{A \in \sigma_{n+k}^{\infty}} \big| \Pr[A \mid B] - \Pr[A] \big| \Big], \qquad \varphi(k) = \sup_{n,\; A \in \sigma_{n+k}^{\infty},\; B \in \sigma_{-\infty}^{n}} \big| \Pr[A \mid B] - \Pr[A] \big|.   (1)

Z is said to be \beta-mixing (\varphi-mixing) if \beta(k) \to 0 (resp. \varphi(k) \to 0) as k \to \infty.
It is said to be algebraically \beta-mixing (algebraically \varphi-mixing) if there exist real numbers \beta_0 > 0 (resp. \varphi_0 > 0) and r > 0 such that \beta(k) \le \beta_0 / k^r (resp. \varphi(k) \le \varphi_0 / k^r) for all k, and exponentially mixing if there exist real numbers \beta_0 > 0 (resp. \varphi_0 > 0) and \beta_1 > 0 (resp. \varphi_1 > 0) such that \beta(k) \le \beta_0 \exp(-\beta_1 k^r) (resp. \varphi(k) \le \varphi_0 \exp(-\varphi_1 k^r)) for all k.

Both \beta(k) and \varphi(k) measure the dependence of the events on those that occurred more than k units of time in the past. \beta-mixing is a weaker assumption than \varphi-mixing. We will be using a concentration inequality that leads to simple bounds but that applies to \varphi-mixing processes only. However, the main proofs presented in this paper are given in the more general case of \beta-mixing sequences. This is a standard assumption adopted in previous studies of learning in the presence of dependent observations [8, 10, 16, 17]. As pointed out in [16], \beta-mixing seems to be "just the right" assumption for carrying over several PAC-learning results to the case of weakly-dependent sample points. Several results have also been obtained in the more general context of \alpha-mixing, but they seem to require the stronger condition of exponential mixing [11]. Mixing assumptions can be checked in some cases, such as with Gaussian or Markov processes [10]. The mixing parameters can also be estimated in such cases.

Most previous studies use a technique originally introduced by [1] based on independent blocks of equal size [8, 10, 17]. This technique is particularly relevant when dealing with stationary \beta-mixing. We will need a related but somewhat different technique, since the blocks we consider may not have the same size. The following lemma is a special case of Corollary 2.7 from [17].

Lemma 1 (Yu [17], Corollary 2.7).
Let \mu \ge 1 and suppose that h is a measurable function, with absolute value bounded by M, on a product probability space \big(\prod_{j=1}^{\mu} \Omega_j, \prod_{j=1}^{\mu} \sigma_{r_j}^{s_j}\big), where r_i \le s_i \le r_{i+1} for all i. Let Q be a probability measure on the product space with marginal measures Q_i on (\Omega_i, \sigma_{r_i}^{s_i}), and let Q^{i+1} be the marginal measure of Q on \big(\prod_{j=1}^{i+1} \Omega_j, \prod_{j=1}^{i+1} \sigma_{r_j}^{s_j}\big), i = 1, \ldots, \mu - 1. Let \beta(Q) = \sup_{1 \le i \le \mu - 1} \beta(k_i), where k_i = r_{i+1} - s_i, and let P = \prod_{i=1}^{\mu} Q_i. Then,

\big| \operatorname*{E}_{Q}[h] - \operatorname*{E}_{P}[h] \big| \le (\mu - 1) M \beta(Q).   (2)

The lemma gives a measure of the difference between the distribution of \mu blocks where the blocks are independent in one case and dependent in the other case. The distribution within each block is assumed to be the same in both cases. For a monotonically decreasing function \beta, we have \beta(Q) = \beta(k^*), where k^* = \min_i(k_i) is the smallest gap between blocks.

2.2 Learning Scenarios

We consider the familiar supervised learning setting where the learning algorithm receives a sample of m labeled points S = (z_1, \ldots, z_m) = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m, where X is the input space and Y the set of labels (Y = \mathbb{R} in the regression case), both assumed to be measurable. For a fixed learning algorithm, we denote by h_S the hypothesis it returns when trained on the sample S. The error of a hypothesis on a pair z \in X \times Y is measured in terms of a cost function c : Y \times Y \to \mathbb{R}_+. Thus, c(h(x), y) measures the error of a hypothesis h on a pair (x, y); c(h(x), y) = (h(x) - y)^2 in the standard regression case. We will use the shorthand c(h, z) := c(h(x), y) for a hypothesis h and z = (x, y) \in X \times Y and will assume that c is upper bounded by a constant M > 0.
We denote by \hat{R}(h) the empirical error of a hypothesis h for a training sample S = (z_1, \ldots, z_m):

\hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} c(h, z_i).   (3)

In the standard machine learning scenario, the sample pairs z_1, \ldots, z_m are assumed to be i.i.d., a restrictive assumption that does not always hold in practice. We will consider here the more general case of dependent samples drawn from a stationary mixing sequence Z over X \times Y. As in the i.i.d. case, the objective of the learning algorithm is to select a hypothesis with small error over future samples. But, here, we must distinguish two versions of this problem.

In the most general version, future samples depend on the training sample S, and thus the generalization error or true error of the hypothesis h_S trained on S must be measured by its expected error conditioned on the sample S:

R(h_S) = \operatorname*{E}_{z}[c(h_S, z) \mid S].   (4)

This is the most realistic setting in this context, which matches time series prediction problems. A somewhat less realistic version is one where the samples are dependent, but the test points are assumed to be independent of the training sample S. The generalization error of the hypothesis h_S trained on S is then:

R(h_S) = \operatorname*{E}_{z}[c(h_S, z) \mid S] = \operatorname*{E}_{z}[c(h_S, z)].   (5)

This setting seems less natural since, if samples are dependent, then future test points must also depend on the training points, even if that dependence is relatively weak due to the time interval after which test points are drawn. Nevertheless, it is this somewhat less realistic setting that has been studied by all previous machine learning studies that we are aware of [8, 10, 16, 17], even when examining specifically a time series prediction problem [10]. Thus, the bounds derived in these studies cannot be applied to the more general setting.

We will consider instead the most general setting, with the definition of the generalization error based on Eq. 4.
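To make the quantity \hat{R}(h) concrete, here is a minimal Python sketch of the empirical error of Eq. 3 under the squared regression cost c(h(x), y) = (h(x) - y)^2; the hypothesis h and the tiny sample S below are illustrative stand-ins, not objects from the paper.

```python
# Sketch of the empirical error R_hat(h) = (1/m) * sum_i c(h, z_i) (Eq. 3)
# for the squared regression cost. `h` and `S` are made-up toy values.

def cost(prediction, label):
    """Squared cost c(h(x), y) = (h(x) - y)**2 used in standard regression."""
    return (prediction - label) ** 2

def empirical_error(h, S):
    """Average cost of hypothesis h over the training sample S = [(x, y), ...]."""
    m = len(S)
    return sum(cost(h(x), y) for (x, y) in S) / m

# Toy usage: a linear hypothesis evaluated on three labeled points.
h = lambda x: 2.0 * x
S = [(1.0, 2.5), (2.0, 4.0), (3.0, 5.5)]
print(empirical_error(h, S))
```

Note that nothing in this computation depends on whether the z_i are independent; it is the relation between \hat{R}(h_S) and R(h_S) that the dependence affects.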
Clearly, our analysis applies to the less general setting just discussed as well.

3 Non-i.i.d. Stability Bounds

This section gives generalization bounds for \hat{\beta}-stable algorithms over a mixing stationary distribution.1 The first two sections present our main proofs, which hold for \beta-mixing stationary distributions. In the third section, we will be using a concentration inequality that applies to \varphi-mixing processes only.

The condition of \hat{\beta}-stability is an algorithm-dependent property first introduced in [4] and [6]. It was later used successfully by [2, 3] to show algorithm-specific stability bounds for i.i.d. samples. Roughly speaking, a learning algorithm is said to be stable if small changes to the training set do not produce large deviations in its output. The following gives the precise technical definition.

Definition 3. A learning algorithm is said to be (uniformly) \hat{\beta}-stable if the hypotheses it returns for any two training samples S and S' that differ by a single point satisfy

\forall z \in X \times Y, \quad |c(h_S, z) - c(h_{S'}, z)| \le \hat{\beta}.   (6)

Many generalization error bounds rely on McDiarmid's inequality. But this inequality requires the random variables to be i.i.d. and thus is not directly applicable in our scenario. Instead, we will use a theorem that extends McDiarmid's inequality to general mixing distributions (Theorem 1, Section 3.3). To obtain a stability-based generalization bound, we will apply this theorem to \Phi(S) = R(h_S) - \hat{R}(h_S). To do so, we need to show, as with the standard McDiarmid's inequality, that \Phi is a Lipschitz function and, to make it useful, bound E[\Phi]. The next two sections describe how we achieve both of these in this non-i.i.d.
scenario.

3.1 Lipschitz Condition

As discussed in Section 2.2, in the most general scenario, test points depend on the training sample. We first present a lemma that relates the expected value of the generalization error in that scenario and the same expectation in the scenario where the test point is independent of the training sample.

We denote by R(h_S) = E_z[c(h_S, z) | S] the expectation in the dependent case and by \tilde{R}(h_{S_b}) = E_{\tilde{z}}[c(h_{S_b}, \tilde{z})] that expectation when the test points are assumed independent of the training, with S_b denoting a sequence similar to S but with the last b points removed. Figure 1(a) illustrates that sequence. The block S_b is assumed to have exactly the same distribution as the corresponding block of the same size in S.

Lemma 2. Assume that the learning algorithm is \hat{\beta}-stable and that the cost function c is bounded by M. Then, for any sample S of size m drawn from a \beta-mixing stationary distribution and for any b \in \{0, \ldots, m\}, the following holds:

\big| \operatorname*{E}_{S}[R(h_S)] - \operatorname*{E}_{S}[\tilde{R}(h_{S_b})] \big| \le b \hat{\beta} + \beta(b) M.   (7)

Proof. The \hat{\beta}-stability of the learning algorithm implies that

\operatorname*{E}_{S}[R(h_S)] = \operatorname*{E}_{S,z}[c(h_S, z)] \le \operatorname*{E}_{S,z}[c(h_{S_b}, z)] + b \hat{\beta}.   (8)

The application of Lemma 1 yields

\operatorname*{E}_{S}[R(h_S)] \le \operatorname*{E}_{S,\tilde{z}}[c(h_{S_b}, \tilde{z})] + b \hat{\beta} + \beta(b) M = \operatorname*{E}_{S}[\tilde{R}(h_{S_b})] + b \hat{\beta} + \beta(b) M.   (9)

The other side of the inequality of the lemma can be shown following the same steps.

We can now prove a Lipschitz bound for the function \Phi.

1The standard variable used for the stability coefficient is \beta.
To avoid confusion with the \beta-mixing coefficient, we will use \hat{\beta} instead.

Figure 1: Illustration of the sequences derived from S that are considered in the proofs: (a) S_b, (b) S_i, (c) S_{i,b}, (d) \tilde{S}^i_{i,b}. [Diagram not reproduced: each panel shows the blocks of b points removed around the test point z and/or the training point z_i.]

Lemma 3. Let S = (z_1, z_2, \ldots, z_m) and S^i = (z'_1, z'_2, \ldots, z'_m) be two sequences drawn from a \beta-mixing stationary process that differ only in point i \in [1, m], and let h_S and h_{S^i} be the hypotheses returned by a \hat{\beta}-stable algorithm when trained on each of these samples. Then, for any i \in [1, m], the following inequality holds:

|\Phi(S) - \Phi(S^i)| \le 2(b + 1) \hat{\beta} + 2 \beta(b) M + \frac{M}{m}.   (10)

Proof. To prove this inequality, we first bound the difference of the empirical errors as in [3], then the difference of the true errors. Bounding the difference of costs on agreeing points with \hat{\beta} and the one that disagrees with M yields

|\hat{R}(h_S) - \hat{R}(h_{S^i})| = \frac{1}{m} \sum_{j=1}^{m} |c(h_S, z_j) - c(h_{S^i}, z'_j)| = \frac{1}{m} \sum_{j \ne i} |c(h_S, z_j) - c(h_{S^i}, z'_j)| + \frac{1}{m} |c(h_S, z_i) - c(h_{S^i}, z'_i)| \le \hat{\beta} + \frac{M}{m}.   (11)

Now, applying Lemma 2 to both generalization error terms and using \hat{\beta}-stability results in

|R(h_S) - R(h_{S^i})| \le |\tilde{R}(h_{S_b}) - \tilde{R}(h_{S^i_b})| + 2 b \hat{\beta} + 2 \beta(b) M = \big| \operatorname*{E}_{\tilde{z}}[c(h_{S_b}, \tilde{z}) - c(h_{S^i_b}, \tilde{z})] \big| + 2 b \hat{\beta} + 2 \beta(b) M \le \hat{\beta} + 2 b \hat{\beta} + 2 \beta(b) M.   (12)

The lemma's statement is obtained by combining inequalities 11 and 12.

3.2 Bound on E[\Phi]

As mentioned earlier, to make the bound useful, we also need to bound E_S[\Phi(S)]. This is done by analyzing independent blocks using Lemma 1.

Lemma 4.
Let h_S be the hypothesis returned by a \hat{\beta}-stable algorithm trained on a sample S drawn from a stationary \beta-mixing distribution. Then, for all b \in [1, m], the following inequality holds:

\operatorname*{E}_{S}[|\Phi(S)|] \le (6b + 1) \hat{\beta} + 3 \beta(b) M.   (13)

Proof. We first analyze the term E_S[\hat{R}(h_S)]. Let S_i be the sequence S with the b points before and after point z_i removed. Figure 1(b) illustrates this definition. S_i is thus made of three blocks. Let \tilde{S}_i denote a similar set of three blocks, each with the same distribution as the corresponding block in S_i, but such that the three blocks are independent. In particular, the middle block, reduced to one point \tilde{z}_i, is independent of the two others. By the \hat{\beta}-stability of the algorithm,

\operatorname*{E}_{S}[\hat{R}(h_S)] = \operatorname*{E}_{S}\Big[\frac{1}{m} \sum_{i=1}^{m} c(h_S, z_i)\Big] \le \operatorname*{E}_{S_i}\Big[\frac{1}{m} \sum_{i=1}^{m} c(h_{S_i}, z_i)\Big] + 2 b \hat{\beta}.   (14)

Applying Lemma 1 to the first term of the right-hand side yields

\operatorname*{E}_{S}[\hat{R}(h_S)] \le \operatorname*{E}_{\tilde{S}_i}\Big[\frac{1}{m} \sum_{i=1}^{m} c(h_{\tilde{S}_i}, \tilde{z}_i)\Big] + 2 b \hat{\beta} + 2 \beta(b) M.   (15)

Combining the independent block sequences associated to \hat{R}(h_S) and R(h_S) will help us prove the lemma in a way similar to the i.i.d. case treated in [3]. Let S_b be defined as in the proof of Lemma 2. To deal with independent block sequences defined with respect to the same hypothesis, we will consider the sequence S_{i,b} = S_i \cap S_b, which is illustrated by Figure 1(c). This can result in as many as four blocks.
As before, we will consider a sequence \tilde{S}_{i,b} with a similar set of blocks, each with the same distribution as the corresponding blocks in S_{i,b}, but such that the blocks are independent. Since three blocks of at most b points are removed from each hypothesis, by the \hat{\beta}-stability of the learning algorithm, the following holds:

\operatorname*{E}_{S}[\Phi(S)] = \operatorname*{E}_{S}[\hat{R}(h_S) - R(h_S)] = \operatorname*{E}_{S,z}\Big[\frac{1}{m} \sum_{i=1}^{m} c(h_S, z_i) - c(h_S, z)\Big]   (16)
\le \operatorname*{E}_{S_{i,b},z}\Big[\frac{1}{m} \sum_{i=1}^{m} c(h_{S_{i,b}}, z_i) - c(h_{S_{i,b}}, z)\Big] + 6 b \hat{\beta}.   (17)

Now, the application of Lemma 1 to the difference of two cost functions, also bounded by M, as in the right-hand side leads to

\operatorname*{E}_{S}[\Phi(S)] \le \operatorname*{E}_{\tilde{S}_{i,b},\tilde{z}}\Big[\frac{1}{m} \sum_{i=1}^{m} c(h_{\tilde{S}_{i,b}}, \tilde{z}_i) - c(h_{\tilde{S}_{i,b}}, \tilde{z})\Big] + 6 b \hat{\beta} + 3 \beta(b) M.   (18)

Since \tilde{z} and \tilde{z}_i are independent and the distribution is stationary, they have the same distribution and we can replace \tilde{z}_i with \tilde{z} in the empirical cost and write

\operatorname*{E}_{S}[\Phi(S)] \le \operatorname*{E}_{\tilde{S}_{i,b},\tilde{z}}\Big[\frac{1}{m} \sum_{i=1}^{m} c(h_{\tilde{S}^i_{i,b}}, \tilde{z}) - c(h_{\tilde{S}_{i,b}}, \tilde{z})\Big] + 6 b \hat{\beta} + 3 \beta(b) M \le \hat{\beta} + 6 b \hat{\beta} + 3 \beta(b) M,   (19)

where \tilde{S}^i_{i,b} is the sequence derived from \tilde{S}_{i,b} by replacing \tilde{z}_i with \tilde{z}. The last inequality holds by the \hat{\beta}-stability of the learning algorithm. The other side of the inequality in the statement of the lemma can be shown following the same steps.

3.3 Main Results

This section presents several theorems that constitute the main results of this paper. We will use the following theorem, which extends McDiarmid's inequality to \varphi-mixing distributions.

Theorem 1 (Kontorovich and Ramanan [7], Thm. 1.1). Let \Phi : Z^m \to \mathbb{R} be a function defined over a countable space Z.
If \Phi is l-Lipschitz with respect to the Hamming metric for some l > 0, then the following holds for all \epsilon > 0:

\Pr_{Z}\big[ |\Phi(Z) - \operatorname{E}[\Phi(Z)]| > \epsilon \big] \le 2 \exp\Big( \frac{-\epsilon^2}{2 m l^2 \|\Delta_m\|_\infty^2} \Big), \quad \text{where } \|\Delta_m\|_\infty \le 1 + 2 \sum_{k=1}^{m} \varphi(k).

Theorem 2 (General Non-i.i.d. Stability Bound). Let h_S denote the hypothesis returned by a \hat{\beta}-stable algorithm trained on a sample S drawn from a \varphi-mixing stationary distribution, and let c be a measurable non-negative cost function upper bounded by M > 0. Then, for any b \in [0, m] and any \epsilon > 0, the following generalization bound holds:

\Pr_{S}\Big[ |R(h_S) - \hat{R}(h_S)| > \epsilon + (6b + 1)\hat{\beta} + 6 M \varphi(b) \Big] \le 2 \exp\Bigg( \frac{-\epsilon^2 \big(1 + 2 \sum_{i=1}^{m} \varphi(i)\big)^{-2}}{2 m \big(2(b+1)\hat{\beta} + 2 M \varphi(b) + M/m\big)^2} \Bigg).   (20)

Proof. The theorem follows directly from the application of Lemma 3 and Lemma 4 to Theorem 1.

The theorem gives a general stability bound for \varphi-mixing stationary sequences. If we further assume that the sequence is algebraically \varphi-mixing, that is, for all k, \varphi(k) = \varphi_0 k^{-r} for some r > 1, then we can solve for the value of b to optimize the bound.

Theorem 3 (Non-i.i.d. Stability Bound for Algebraically Mixing Sequences). Let h_S denote the hypothesis returned by a \hat{\beta}-stable algorithm trained on a sample S drawn from an algebraically \varphi-mixing stationary distribution, \varphi(k) = \varphi_0 k^{-r} with r > 1, and let c be a measurable non-negative cost function upper bounded by M > 0. Then, for any \epsilon > 0, the following generalization bound holds:

\Pr_{S}\Big[ |R(h_S) - \hat{R}(h_S)| > \epsilon + \hat{\beta} + 6(r+1) M \varphi(b) \Big] \le 2 \exp\Bigg( \frac{-\epsilon^2 \big(4 + 2/(r-1)\big)^{-2}}{2 m \big(2\hat{\beta} + 2(r+1) M \varphi(b) + M/m\big)^2} \Bigg),

where \varphi(b) = \varphi_0 \big( \hat{\beta} / (r \varphi_0 M) \big)^{r/(r+1)}.

Proof. For an algebraically mixing sequence, the value of b minimizing the bound of Theorem 2 satisfies \hat{\beta} b = r M \varphi(b), which gives b = \big( \hat{\beta} / (r \varphi_0 M) \big)^{-1/(r+1)} and \varphi(b) = \varphi_0 \big( \hat{\beta} / (r \varphi_0 M) \big)^{r/(r+1)}. The following term can be bounded as

1 + 2 \sum_{i=1}^{m} \varphi(i) = 1 + 2 \sum_{i=1}^{m} i^{-r} \le 1 + 2 \Big( 1 + \int_{1}^{m} i^{-r}\, di \Big) = 1 + 2 \Big( 1 + \frac{m^{1-r} - 1}{1 - r} \Big).   (21)

For r > 1, the exponent of m is negative, and so we can bound this last term by 3 + 2/(r - 1). Plugging this value and the minimizing value of b into the bound of Theorem 2 yields the statement of the theorem.

In the case of a zero mixing coefficient (\varphi = 0 and b = 0), the bounds of Theorem 2 and Theorem 3 coincide with the i.i.d. stability bound of [3]. In order for the right-hand side of these bounds to converge, we must have \hat{\beta} = o(1/\sqrt{m}) and \varphi(b) = o(1/\sqrt{m}). For several general classes of algorithms, \hat{\beta} \le O(1/m) [3]. In the case of algebraically mixing sequences with r > 1 assumed in Theorem 3, \hat{\beta} \le O(1/m) implies \varphi(b) = \varphi_0 (\hat{\beta}/(r \varphi_0 M))^{r/(r+1)} < O(1/\sqrt{m}). The next section illustrates the application of Theorem 3 to several general classes of algorithms.

4 Application

We now present the application of our stability bounds to several algorithms in the case of an algebraically mixing sequence.
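The block size chosen in the proof of Theorem 3 can be checked numerically. The following is a small Python sketch, with made-up illustration values for the stability coefficient \hat{\beta}, the mixing parameters \varphi_0 and r, and the cost bound M, verifying that b = (\hat{\beta}/(r \varphi_0 M))^{-1/(r+1)} balances \hat{\beta} b against r M \varphi(b):

```python
# Numeric sanity check of the minimizing block size in Theorem 3 under
# algebraic mixing, phi(k) = phi0 * k**(-r): b solves beta_hat * b = r*M*phi(b).
# All constants below are made-up illustration values, not from the paper.

beta_hat = 1e-3     # stability coefficient, e.g. O(1/m) for kernel methods
phi0, r = 1.0, 2.0  # algebraic mixing parameters, r > 1
M = 1.0             # upper bound on the cost function

b = (beta_hat / (r * phi0 * M)) ** (-1.0 / (r + 1))
phi_b = phi0 * b ** (-r)

# At the optimum the two competing terms of the bound agree.
print(beta_hat * b, r * M * phi_b)
```

Since \hat{\beta} shrinks as O(1/m) while the mixing parameters stay fixed, the optimal b grows slowly with m, and \varphi(b) decays at the rate m^{-r/(r+1)} stated after Theorem 3.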
Our bound applies to all algorithms based on the minimization of a regularized objective function based on the norm \|\cdot\|_K in a reproducing kernel Hilbert space, where K is a positive definite symmetric kernel:

\operatorname*{argmin}_{h \in H} \; \frac{1}{m} \sum_{i=1}^{m} c(h, z_i) + \lambda \|h\|_K^2,   (22)

under some general conditions, since these algorithms are stable with \hat{\beta} \le O(1/m) [3]. Two specific instances of these algorithms are SVR, for which the cost function is based on the \epsilon-insensitive cost:

c(h, z) = |h(x) - y|_\epsilon = \begin{cases} 0 & \text{if } |h(x) - y| \le \epsilon, \\ |h(x) - y| - \epsilon & \text{otherwise,} \end{cases}   (23)

and Kernel Ridge Regression [13], for which c(h, z) = (h(x) - y)^2.

Corollary 1. Assume a bounded output Y = [0, B], for some B > 0, and assume that K(x, x) \le \kappa for all x, for some \kappa > 0. Let h_S denote the hypothesis returned by the algorithm when trained on a sample S drawn from an algebraically \varphi-mixing stationary distribution. Then, with probability at least 1 - \delta, the following generalization bounds hold:

a. Support vector regression (SVR):

R(h_S) \le \hat{R}(h_S) + \frac{13 \kappa^2}{2 \lambda m} + 5 \Big( \frac{3 \kappa^2}{\lambda} + \kappa \sqrt{\frac{B}{\lambda}} \Big) \sqrt{\frac{2 \ln(1/\delta)}{m}};   (24)

b. Kernel Ridge Regression (KRR):

R(h_S) \le \hat{R}(h_S) + \frac{26 \kappa^2 B^2}{\lambda m} + 5 \Big( \frac{12 \kappa^2 B^2}{\lambda} + \kappa \sqrt{\frac{B}{\lambda}} \Big) \sqrt{\frac{2 \ln(1/\delta)}{m}}.   (25)

Proof. It has been shown in [3] that for SVR \hat{\beta} \le \kappa^2/(2 \lambda m) and M < \kappa \sqrt{B/\lambda}, and that for KRR \hat{\beta} \le 2 \kappa^2 B^2/(\lambda m) and M < \kappa \sqrt{B/\lambda}. Plugging these values into the bound of Theorem 3 and using the lower bound on r, r > 1, yields the statement of the corollary.

These bounds give, to the best of our knowledge, the first stability-based generalization bounds for SVR and KRR in a non-i.i.d. scenario.
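For concreteness, the two costs instantiating the regularized objective of Eq. 22 can be sketched in Python; the numeric values in the usage lines are illustrative only.

```python
# The epsilon-insensitive cost of Eq. 23 (SVR) next to the squared cost
# used by Kernel Ridge Regression. `eps` and the sample values below are
# illustrative choices, not parameters fixed by the paper.

def eps_insensitive(prediction, label, eps):
    """c(h, z) = max(|h(x) - y| - eps, 0): zero inside the eps-tube."""
    return max(abs(prediction - label) - eps, 0.0)

def squared(prediction, label):
    """KRR cost c(h, z) = (h(x) - y)**2."""
    return (prediction - label) ** 2

print(eps_insensitive(1.2, 1.0, eps=0.5))  # inside the tube -> 0.0
print(eps_insensitive(2.0, 1.0, eps=0.5))  # |2.0 - 1.0| - 0.5 -> 0.5
print(squared(2.0, 1.0))                   # -> 1.0
```

Both costs are bounded when the outputs and predictions are bounded, which is what allows the constant M in Corollary 1 to be instantiated as \kappa \sqrt{B/\lambda}.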
Similar bounds can be obtained for other families of algorithms, such as maximum entropy discrimination, which can be shown to have comparable stability properties [3]. Our bounds have the same convergence behavior as those derived by [3] in the i.i.d. case. In fact, they differ only by some constants. As in the i.i.d. case, they are non-trivial when the condition \lambda \gg 1/\sqrt{m} on the regularization parameter holds for all large values of m. It would be interesting to give a quantitative comparison of our bounds and the generalization bounds of [10] based on covering numbers for mixing stationary distributions, in the scenario where test points are independent of the training sample. In general, because the bounds of [10] are not algorithm-dependent, one can expect tighter bounds using stability, provided that a tight bound is given on the stability coefficient. The comparison also depends on how fast the covering number grows with the sample size and trade-off parameters such as \lambda. For a fixed \lambda, the asymptotic behavior of our stability bounds for SVR and KRR is tight.

5 Conclusion

Our stability bounds for mixing stationary sequences apply to large classes of algorithms, including SVR and KRR, extending to weakly dependent observations existing bounds in the i.i.d. case. Since they are algorithm-specific, these bounds can often be tighter than other generalization bounds. Weaker notions of stability might help further improve or refine them.

References

[1] S. N. Bernstein. Sur l'extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Math. Ann., 97:1-59, 1927.
[2] O. Bousquet and A. Elisseeff. Algorithmic stability and generalization performance. In NIPS 2000, 2001.
[3] O. Bousquet and A. Elisseeff. Stability and generalization. JMLR, 2:499-526, 2002.
[4] L. Devroye and T. Wagner.
Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25:601-604, 1979.
[5] P. Doukhan. Mixing: Properties and Examples. Springer-Verlag, 1994.
[6] M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. In Computational Learning Theory, pages 152-162, 1997.
[7] L. Kontorovich and K. Ramanan. Concentration inequalities for dependent random variables via the martingale method, 2006.
[8] A. Lozano, S. Kulkarni, and R. Schapire. Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In NIPS, 2006.
[9] D. Mattera and S. Haykin. Support vector machines for dynamic reconstruction of a chaotic system. In Advances in kernel methods: support vector learning, pages 211-241. MIT Press, Cambridge, MA, 1999.
[10] R. Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5-34, 2000.
[11] D. Modha and E. Masry. On the consistency in nonparametric estimation under mixing assumptions. IEEE Transactions on Information Theory, 44:117-133, 1998.
[12] K.-R. Müller, A. Smola, G. Rätsch, B. Schölkopf, J. K., and V. Vapnik. Predicting time series with support vector machines. In Proceedings of ICANN'97, LNCS, pages 999-1004. Springer, 1997.
[13] C. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of ICML '98, pages 515-521. Morgan Kaufmann Publishers Inc., 1998.
[14] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[15] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
[16] M. Vidyasagar. Learning and Generalization: With Applications to Neural Networks. Springer, 2003.
[17] B. Yu.
Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94-116, Jan. 1994.", "award": [], "sourceid": 197, "authors": [{"given_name": "Mehryar", "family_name": "Mohri", "institution": null}, {"given_name": "Afshin", "family_name": "Rostamizadeh", "institution": null}]}