{"title": "Predictive PAC Learning and Process Decompositions", "book": "Advances in Neural Information Processing Systems", "page_first": 1619, "page_last": 1627, "abstract": "We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular ($\\beta$-mixing) processes, of independent interest.", "full_text": "Predictive PAC Learning and Process Decompositions\n\nCosma Rohilla Shalizi\nStatistics Department\n\nCarnegie Mellon University\nPittsburgh, PA 15213 USA\ncshalizi@cmu.edu\n\nAryeh Kontorovich\n\nComputer Science Department\n\nBen Gurion University\nBeer Sheva 84105 Israel\n\nkaryeh@cs.bgu.ac.il\n\nAbstract\n\nWe informally call a stochastic process learnable if it admits a generalization error\napproaching zero in probability for any concept class with \ufb01nite VC-dimension\n(IID processes are the simplest example). A mixture of learnable processes need\nnot be learnable itself, and certainly its generalization error need not decay at the\nsame rate. 
In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular (β-mixing) processes, of independent probability-theoretic interest.\n\n1 Introduction\n\nStatistical learning theory, especially the theory of "probably approximately correct" (PAC) learning, has mostly developed under the assumption that data are independent and identically distributed (IID) samples from a fixed, though perhaps adversarially-chosen, distribution. As recently as 1997, Vidyasagar [1] named extending learning theory to stochastic processes of dependent variables as a major open problem. Since then, considerable progress has been made for specific classes of processes, particularly strongly-mixing sequences and exchangeable sequences. (Especially relevant contributions, for our purposes, came from [2, 3] on exchangeability, from [4, 5] on absolute regularity1, and from [6, 7] on ergodicity; others include [8, 9, 10, 11, 12, 13, 14, 15, 16, 17].) 
Our goal in this paper is to point out that many practically-important classes of stochastic processes possess a special sort of structure, namely that they are convex combinations of simpler, extremal processes. This both demands something of a re-orientation of the goals of learning, and makes the new goals vastly easier to attain than they might seem.\n\nMain results Our main contribution is threefold: a conceptual definition of learning from non-IID data (Definition 1); a technical result establishing tight generalization bounds for mixtures of learnable processes (Theorem 2), with a direct corollary about exchangeable sequences (Corollary 1); and an application to mixtures of absolutely regular sequences, for which we provide a new characterization.\n\nNotation X1, X2, . . . will be a sequence of dependent random variables taking values in a common measurable space X, which we assume to be "standard" [18, ch. 3] to avoid technicalities, implying that their σ-field has a countable generating basis; the reader will lose little if they think of X as R^d. (We believe our ideas apply to stochastic processes with multidimensional index sets as well, but use sequences here.) X_i^j will stand for the block (X_i, X_{i+1}, . . . , X_{j−1}, X_j). Generic infinite-dimensional distributions, measures on X^∞, will be µ or ρ; these are probability laws for X_1^∞.\n\n1Absolutely regular processes are ones where the joint distribution of past and future events approaches independence, in L1, as the separation between events goes to infinity; see §4 below for a precise statement and extensive discussion. Absolutely regular sequences are now more commonly called "β-mixing", but we use the older name to avoid confusion with the other sort of "mixing".\n\n
Any such stochastic process can be represented through the shift map τ : X^∞ ↦ X^∞ (which just drops the first coordinate, (τx)_t = x_{t+1}), and a suitable distribution of initial conditions. When we speak of a set being invariant, we mean invariant under the shift. The collection of all such probability measures is itself a measurable space, and a generic measure there will be π.\n\n2 Process Decompositions\n\nSince the time of de Finetti and von Neumann, an important theme of the theory of stochastic processes has been finding ways of representing complicated but structured processes, obeying certain symmetries, as mixtures of simpler processes with the same symmetries, as well as (typically) some sort of 0-1 law. (See, for instance, the beautiful paper by Dynkin [19], and the statistically-motivated [20].) In von Neumann's original ergodic decomposition [18, §7.9], stationary processes, whose distributions are invariant over time, proved to be convex combinations of stationary ergodic measures, ones where all invariant sets have either probability 0 or probability 1. In de Finetti's theorem [21, ch. 1], exchangeable sequences, which are invariant under permutation, are expressed as mixtures of IID sequences2. Similar results are now also known for asymptotically mean stationary sequences [18, §8.4], for partially-exchangeable sequences [22], for stationary random fields, and even for infinite exchangeable arrays (including networks and graphs) [21, ch. 7].\nThe common structure shared by these decompositions is as follows.\n\n1. The probability law ρ of the composite but symmetric process is a convex combination of the simpler, extremal processes µ ∈ M with the same symmetry. The infinite-dimensional distributions of these extremal processes are, naturally, mutually singular.\n2. 
Sample paths from the composite process are generated hierarchically, first by picking an extremal process µ from M according to some measure π supported on M, and then generating a sample path from µ. Symbolically,\n\nµ ∼ π,  X_1^∞ | µ ∼ µ\n\n3. Each realization of the composite process therefore gives information about only a single extremal process, as this is an invariant along each sample path.\n\n3 Predictive PAC\n\nThese points raise subtle but key issues for PAC learning theory. Recall the IID case: random variables X1, X2, . . . are all generated from a common distribution µ^{(1)}, leading to an infinite-dimensional process distribution µ. Against this, we have a class F of functions f. The goal in PAC theory is to find a sample complexity function3 s(ε, δ, F, µ) such that, whenever n ≥ s,\n\nP_µ( sup_{f∈F} | (1/n) ∑_{t=1}^{n} f(X_t) − E_µ[f] | ≥ ε ) ≤ δ  (1)\n\nThat is, PAC theory seeks a finite-sample uniform law of large numbers for F. Because of its importance, it will be convenient to abbreviate the supremum,\n\nsup_{f∈F} | (1/n) ∑_{t=1}^{n} f(X_t) − E_µ[f] | ≡ Γ_n\n\nusing the letter "Γ" as a reminder that when this goes to zero, F is a Glivenko-Cantelli class (for µ). Γ_n is also a function of F and of µ, but we suppress this in the notation for brevity.\n\n2This is actually a special case of the ergodic decomposition [21, pp. 25–26].\n3Standard PAC is defined as distribution-free, but here we maintain the dependence on µ for consistency with future notation. 
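To make Γ_n concrete, here is a small simulation of our own (not part of the paper): we estimate the uniform deviation for a finite class of threshold indicators under IID Uniform(0,1) sampling. The class, the distribution, and all names below are arbitrary illustrative choices.

```python
import random

def gamma_n(sample, functions, true_means):
    """Uniform deviation Gamma_n = sup_f |empirical mean of f - E[f]|
    over a finite function class (a stand-in for a VC class)."""
    n = len(sample)
    return max(
        abs(sum(f(x) for x in sample) / n - m)
        for f, m in zip(functions, true_means)
    )

# Toy class: indicators f_c(x) = 1{x <= c} on Uniform(0,1) data, so that
# E[f_c] = c.  (Illustrative choice only; any Glivenko-Cantelli class works.)
thresholds = [0.1 * i for i in range(1, 10)]
functions = [lambda x, c=c: 1.0 if x <= c else 0.0 for c in thresholds]
true_means = thresholds

random.seed(0)
for n in (100, 10000):
    sample = [random.random() for _ in range(n)]
    print(n, round(gamma_n(sample, functions, true_means), 3))
```

As expected from the uniform law of large numbers, the printed deviation shrinks as n grows.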
We will also pass over the important and intricate, but fundamentally technical, issue of establishing that Γ_n is measurable (see [23] for a thorough treatment of this topic).\nWhat one has in mind, of course, is that there is a space H of predictive models (classifiers, regressions, . . . ) h, and that F is the image of H through an appropriate loss function ℓ, i.e., each f ∈ F can be written as\n\nf(x) = ℓ(x, h(x))\n\nfor some h ∈ H. If Γ_n → 0 in probability for this "loss function" class, then empirical risk minimization is reliable. That is, the function f̂_n which minimizes the empirical risk n^{−1} ∑_t f(X_t) has an expected risk in the future which is close to the best attainable risk over all of F, R(F, µ) = inf_{f∈F} E_µ[f]. Indeed, since when n ≥ s, with high (≥ 1 − δ) probability all functions have empirical risks within ε of their true risks, with high probability the true risk E_µ[f̂_n] is within 2ε of R(F, µ). Although empirical risk minimization is not the only conceivable learning strategy, it is, in a sense, a canonical one (computational considerations aside). The latter is an immediate consequence of the VC-dimension characterization of PAC learnability:\nTheorem 1 Suppose that the concept class H is PAC learnable from IID samples. Then H is learnable via empirical risk minimization.\nPROOF: Since H is PAC-learnable, it must necessarily have a finite VC-dimension [24]. But for H of finite VC-dimension and IID samples, Γ_n → 0 in probability (see [25] for a simple proof). This implies that the empirical risk minimizer is a PAC learner for H. ∎\nIn extending these ideas to non-IID processes, a subtle issue arises, concerning which expectation value we would like empirical means to converge towards. 
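The empirical-risk-minimization argument of Theorem 1 can be sketched in code (an illustrative toy of ours, not the paper's construction): a hypothesis class of thresholds on [0,1], which has VC dimension 1, its 0-1 loss class, and an ERM learner whose true risk is checked on fresh IID data. All names and the data distribution below are assumptions for the example.

```python
import random

def zero_one_loss(h, x, y):
    return 0.0 if h(x) == y else 1.0

def erm(hypotheses, data):
    """Empirical risk minimization: return the h minimizing average 0-1 loss."""
    def emp_risk(h):
        return sum(zero_one_loss(h, x, y) for x, y in data) / len(data)
    return min(hypotheses, key=emp_risk)

# Hypotheses h_c(x) = 1{x >= c} on a grid of thresholds; the true concept
# uses c* = 0.5, which lies in the grid, so ERM can achieve zero risk.
hypotheses = [lambda x, c=c: int(x >= c) for c in [i / 20 for i in range(21)]]

random.seed(1)
data = [(x, int(x >= 0.5)) for x in (random.random() for _ in range(500))]
h_hat = erm(hypotheses, data)
risk = sum(zero_one_loss(h_hat, x, int(x >= 0.5)) for x in
           (random.random() for _ in range(5000))) / 5000
print(round(risk, 3))
```

With 500 training points the empirical risk minimizer recovers a threshold near 0.5, and its true risk on fresh data is correspondingly small.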
In the IID case, because µ is simply the infinite product of µ^{(1)} and f is a function on X, we can without trouble identify expectations under the two measures with each other, and with expectations conditional on the first n observations:\n\nE_µ[f(X)] = E_{µ^{(1)}}[f(X)] = E_µ[f(X_{n+1}) | X_1^n]\n\nThings are not so tidy when µ is the law of a dependent stochastic process.\nIn introducing a notion of "predictive PAC learning", Pestov [3], like Berti and Rigo [2] earlier, proposes that the target should be the conditional expectation, in our notation E_µ[f(X_{n+1}) | X_1^n]. This however presents two significant problems. First, in general there is no single value for this — it truly is a function of the past X_1^n, or at least some part of it. (Consider even the case of a binary Markov chain.) The other, and related, problem with this idea of predictive PAC is that it presents learning with a perpetually moving target. Whether the function which minimizes the empirical risk is going to do well by this criterion involves rather arbitrary details of the process. To truly pursue this approach, one would have to actually learn the predictive dependence structure of the process, a quite formidable undertaking, though perhaps not hopeless [26].\nBoth of these complications are exacerbated when the process producing the data is actually a mixture over simpler processes, as we have seen is very common in interesting applied settings. This is because, in addition to whatever dependence may be present within each extremal process, X_1^n gives (partial) information about what that process is. Finding E_ρ[X_{n+1} | X_1^n] amounts to first finding all the individual conditional expectations, E_µ[X_{n+1} | X_1^n], and then averaging them with respect to the posterior distribution π(µ | X_1^n). 
This averaging over the posterior produces additional dependence between past and future. (See [27] on quantifying how much extra apparent Shannon information this creates.) As we show in §4 below, continuous mixtures of absolutely regular processes are far from being absolutely regular themselves. This makes it exceedingly hard, if not impossible, to use a single sample path, no matter how long, to learn anything about global expectations.\nThese difficulties all simply dissolve if we change the target distribution. What a learner should care about is not averaging over some hypothetical prior distribution of inaccessible alternative dynamical systems, but rather what will happen in the future of the particular realization which provided the training data, and must continue to provide the testing data. To get a sense of what this means, notice that for an ergodic µ,\n\nE_µ[f] = lim_{m→∞} (1/m) ∑_{i=1}^{m} E[f(X_{n+i}) | X_1^n]\n\n(from [28, Cor. 4.4.1]). That is, matching expectations under the process measure µ means controlling the long-run average behavior, and not just the one-step-ahead expectation suggested in [3, 2]. Empirical risk minimization now makes sense: it is attempting to find a model which will work well not just at the next step (which may be inherently unstable), but will continue to work well, on average, indefinitely far into the future.\nWe are thus led to the following definition.\nDefinition 1 Let X_1^∞ be a stochastic process with law µ, and let I be the σ-field generated by all the invariant events. 
We say that µ admits predictive PAC learning of a function class F when there exists a sample-complexity function s(ε, δ, F, µ) such that, if n ≥ s, then\n\nP_µ( sup_{f∈F} | (1/n) ∑_{t=1}^{n} f(X_t) − E_µ[f | I] | ≥ ε ) ≤ δ\n\nA class of processes P admits distribution-free predictive PAC learning if there exists a common sample-complexity function for all µ ∈ P, in which case we write s(ε, δ, F, µ) = s(ε, δ, F, P).\nAs is well-known, distribution-free predictive PAC learning, in this sense, is possible for IID processes (F must have finite VC dimension). For an ergodic µ, [6] shows that s(ε, δ, F, µ) exists and is finite if and only if, once again, F has a finite VC dimension; this implies predictive PAC learning, but not distribution-free predictive PAC. Since ergodic processes can converge arbitrarily slowly, some extra condition must be imposed on P to ensure that dependence decays fast enough for each µ. A sufficient restriction is that all processes in P be stationary and absolutely regular (β-mixing), with a common upper bound on the β dependence coefficients, as [5, 14] show how to turn algorithms which are PAC on IID data into ones which are PAC on such sequences, with a penalty in extra sample complexity depending on µ only through the rate of decay of correlations4. 
We may apply these familiar results straightforwardly, because, when µ is ergodic, all invariant sets have either measure 0 or measure 1, conditioning on I has no effect, and E_µ[f | I] = E_µ[f].\nOur central result is now almost obvious.\n\nTheorem 2 Suppose that distribution-free predictive PAC learning is possible for a class of functions F and a class of processes M, with sample-complexity function s(ε, δ, F, M). Then the class of processes P formed by taking convex mixtures from M admits distribution-free PAC learning with the same sample-complexity function.\nPROOF: Suppose that n ≥ s(ε, δ, F). Then by the law of total expectation,\n\nP_ρ(Γ_n ≥ ε) = E_ρ[P_ρ(Γ_n ≥ ε | µ)]  (2)\n= E_ρ[P_µ(Γ_n ≥ ε)]  (3)\n≤ E_ρ[δ] = δ  (4)\n\n∎\nIn words, if the same bound holds for each component of the mixture, then it still holds after averaging over the mixture. It is important here that we are only attempting to predict the long-run average risk along the continuation of the same sample path as that which provided the training data; with this as our goal, almost all sample paths look like — indeed, are — realizations of single components of the mixture, and so the bound for extremal processes applies directly to them5. By contrast, there may be no distribution-free bounds at all if one does not condition on I.\n\n4We suspect that similar results could be derived for many of the weak dependence conditions of [29].\n5After formulating this idea, we came across a remarkable paper by Wiener [30], where he presents a qualitative version of highly similar considerations, using the ergodic decomposition to argue that a full dynamical model of the weather is neither necessary nor even helpful for meteorological forecasting. 
The same paper also lays out the idea of sensitive dependence on initial conditions, and the kernel trick of turning nonlinear problems into linear ones by projecting into infinite-dimensional feature spaces.\n\nA useful consequence of this innocent-looking result is that it lets us immediately apply PAC learning results for extremal processes to composite processes, without any penalty in the sample complexity. For instance, we have the following corollary:\nCorollary 1 Let F have finite VC dimension, and have distribution-free sample complexity s(ε, δ, F, M) for all IID measures µ ∈ P. Then the class M of exchangeable processes composed from P admits distribution-free PAC learning with the same sample complexity,\n\ns(ε, δ, F, P) = s(ε, δ, F, M)\n\nThis is in contrast with, for instance, the results obtained by [2, 3], which both go from IID sequences (laws in P) to exchangeable ones (laws in M) at the cost of considerably increased sample complexity. The easiest point of comparison is with Theorem 4.2 of [3], which, in our notation, shows that\n\ns(ε, δ, M) ≤ s(ε/2, δε, P)\n\nWe pay no such penalty in sample complexity because our target of learning is E_µ[f | I], not E_ρ[f | X_1^n]. This means we do not have to use data points to narrow the posterior distribution π(µ | X_1^n), and that we can give a much more direct argument.\nIn [3], Pestov asks whether "one [can] maintain the initial sample complexity" when going from IID to exchangeable sequences; the answer is yes, if one picks the right target. This holds whenever the data-generating process is a mixture of extremal processes for which learning is possible. A particularly important special case of this is the absolutely regular processes.\n\n4 Mixtures of Absolutely Regular Processes\n\nRoughly speaking, an absolutely regular process is one which is asymptotically independent in a particular sense, where the joint distribution between past and future events approaches, in L1, the product of the marginal distributions as the time-lag between past and future grows. These are particularly important for PAC learning, since much of the existing IID learning theory translates directly to this setting.\nTo be precise, let X_{−∞}^∞ be a two-sided6 stationary process. The β-dependence coefficient at lag k is\n\nβ(k) ≡ ‖ P_{−∞}^0 ⊗ P_k^∞ − P_{−(1:k)} ‖_TV  (5)\n\nwhere P_{−(1:k)} is the joint distribution of X_{−∞}^0 and X_k^∞, so that β(k) is the total variation distance between the actual joint distribution of the past and future, and the product of their marginals. Equivalently [31, 32]\n\nβ(k) = E[ sup_{B∈σ(X_k^∞)} | P(B | X_{−∞}^0) − P(B) | ]  (6)\n\nwhich, roughly, is the expected total variation distance between the marginal distribution of the future and its distribution conditional on the past.\nAs is well known, β(k) is non-increasing in k, and of course ≥ 0, so β(k) must have a limit as k → ∞; it will be convenient to abbreviate this as β(∞). When β(∞) = 0, the process is said to be β-mixing, or absolutely regular. All absolutely regular processes are also ergodic [32].\nThe importance of absolutely regular processes for learning comes essentially from a result due to Yu [4]. Let X_1^n be a part of a sample path from an absolutely regular process µ, whose dependence coefficients are β(k). 
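As a worked example of ours (not from the paper): for a stationary two-state Markov chain, the conditional characterization of the β coefficient reduces, by the Markov property, to the mean total-variation distance between the k-step transition law and the stationary distribution; with the convention p = P(0→1), q = P(1→0), a standard computation gives β(k) = 2π₀π₁|1 − p − q|^k. The sketch below checks that closed form against direct matrix computation.

```python
# Assumed setup: transition matrix P = [[1-p, p], [q, 1-q]], stationary
# distribution pi = (q/(p+q), p/(p+q)), and the Markov-chain identity
#   beta(k) = sum_x pi(x) * TV( P^k(x, .), pi ).
def beta_markov(p, q, k):
    pi = [q / (p + q), p / (p + q)]
    P = [[1 - p, p], [q, 1 - q]]
    Pk = [[1.0, 0.0], [0.0, 1.0]]          # identity, then k multiplications
    for _ in range(k):
        Pk = [[sum(Pk[i][m] * P[m][j] for m in range(2)) for j in range(2)]
              for i in range(2)]
    tv = lambda row: 0.5 * sum(abs(row[j] - pi[j]) for j in range(2))
    return pi[0] * tv(Pk[0]) + pi[1] * tv(Pk[1])

p, q = 0.2, 0.3
for k in (1, 5, 20):
    closed = 2 * (q / (p + q)) * (p / (p + q)) * abs(1 - p - q) ** k
    assert abs(beta_markov(p, q, k) - closed) < 1e-12
    print(k, round(closed, 6))
```

The geometric decay of β(k) here is exactly the kind of common upper bound on dependence coefficients that the learning results above require.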
Fix integers m and a such that 2ma = n, so that the sequence is divided into 2m contiguous blocks of length a, and define µ^{(m,a)} to be the distribution of m length-a blocks. (That is, µ^{(m,a)} approximates µ by IID blocks.) Then |µ(C) − µ^{(m,a)}(C)| ≤ (m − 1)β(a) [4, Lemma 4.1]. Since in particular the event C could be taken to be {Γ_n > ε}, this approximation result allows distribution-free learning bounds for IID processes to translate directly into distribution-free learning bounds for absolutely regular processes with bounded β coefficients.\n\n6We have worked with one-sided processes so far, but the devices for moving between the two representations are standard, and this definition is more easily stated in its two-sided version.\n\nIf M contains only absolutely regular processes, then a measure π on M creates a ρ which is a mixture of absolutely regular processes, or a MAR process. It is easy to see that absolute regularity of the component processes (β_µ(k) → 0) does not imply absolute regularity of the mixture process (β_ρ(k) ↛ 0). To see this, suppose that M consists of two processes: µ_0, which puts unit probability mass on the sequence of all zeros, and µ_1, which puts unit probability on the sequence of all ones; and that π gives these two equal probability. Then β_{µ_i}(k) = 0 for both i, but the past and the future of ρ are not independent of each other.\nMore generally, suppose π is supported on just two absolutely regular processes, µ and µ′. For each such µ, there exists a set of typical sequences T_µ ⊂ X^∞, in which the infinite sample path of µ lies almost surely7, and these sets do not overlap8, T_µ ∩ T_µ′ = ∅. 
This implies that ρ(T_µ) = π(µ), but\n\nρ(T_µ | X_{−∞}^0) = 1 if X_{−∞}^0 ∈ T_µ, and 0 otherwise.\n\nThus the change in the probability of T_{µ_1} due to conditioning on the past is π(µ_1) if the selected component was µ_2, and 1 − π(µ_1) = π(µ_2) if the selected component is µ_1. We can reason in parallel for T_{µ_2}, and so the average change in probability is\n\nπ_1(1 − π_1) + π_2(1 − π_2) = 2π_1(1 − π_1)\n\nand this must be β_ρ(∞). Similar reasoning when π is supported on q extremal processes shows\n\nβ_ρ(∞) = ∑_{i=1}^{q} π_i(1 − π_i)\n\nso that the general case is\n\nβ_ρ(∞) = ∫ [1 − π({µ})] dπ(µ)\n\nThis implies that if π has no atoms, β_ρ(∞) = 1 always. Since β_ρ(k) is non-increasing, β_ρ(k) = 1 for all k, for a continuous mixture of absolutely regular processes. Conceptually, this arises because of the use of infinite-dimensional distributions for both past and future in the definition of the β-dependence coefficient. Having seen an infinite past is sufficient, for an ergodic process, to identify the process, and of course the future must be a continuation of this past.\nMARs thus display a rather odd separation between the properties of individual sample paths, which approach independence asymptotically in time, and the ensemble-level behavior, where there is ineradicable dependence, and indeed maximal dependence for continuous mixtures. Any one realization of a MAR, no matter how long, is indistinguishable from a realization of an absolutely regular process which is a component of the mixture. 
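The two-point mixture above is easy to simulate (an illustrative sketch of ours, not part of the paper): along any single path the empirical mean hits the component mean E_µ[f | I] exactly, while its average deviation from the ensemble mean E_ρ[f] = π₁ settles at 2π₁(1 − π₁), the value identified for β_ρ(∞).

```python
import random

# mu_0 puts all mass on the all-zeros path, mu_1 on the all-ones path;
# the mixture picks mu_1 with probability pi1.
def sample_path(pi1, n, rng):
    bit = 1 if rng.random() < pi1 else 0
    return [bit] * n                     # a constant path from one component

rng = random.Random(42)
pi1 = 0.3
dev_component, dev_mixture = [], []
for _ in range(2000):
    path = sample_path(pi1, 100, rng)
    m = sum(path) / len(path)
    dev_component.append(abs(m - path[0]))   # deviation from E_mu[f | I]
    dev_mixture.append(abs(m - pi1))         # deviation from E_rho[f]
print(max(dev_component))                     # identically 0: learnable target
print(sum(dev_mixture) / len(dev_mixture))    # near 2*pi1*(1-pi1) = 0.42
```

The component-conditional target is learned perfectly from one path, while the mixture-level deviation never shrinks, which is exactly the separation the text describes.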
The distinction between a MAR and a single absolutely regular process only becomes apparent at the level of ensembles of paths.\nIt is desirable to characterize MARs. They are stationary, but non-ergodic, and have non-vanishing β(∞). However, this is not sufficient to characterize them. Bernoulli shifts are stationary and ergodic, but not absolutely regular9. It follows that a mixture of Bernoulli shifts will be stationary, and by the preceding argument will have positive β(∞), but will not be a MAR.\nRoughly speaking, given the infinite past of a MAR, events in the future become asymptotically independent as the separation between them increases10. A more precise statement needs to control the approach to independence of the component processes in a MAR. We say that ρ is a β̃-uniform MAR when ρ is a MAR, and, for π-almost-all µ, β_µ(k) ≤ β̃(k), with β̃(k) → 0. Then if we condition on finite histories X_{−n}^0 and let n grow, widely separated future events become asymptotically independent.\n\n7Since X is "standard", so is X^∞, and the latter's σ-field σ(X_{−∞}^∞) has a countable generating basis, say B. For each B ∈ B, the set T_{µ,B} = {x ∈ X^∞ : lim_{n→∞} n^{−1} ∑_{t=0}^{n−1} 1_B(τ^t x) = µ(B)} exists, is measurable, is dynamically invariant, and, by Birkhoff's ergodic theorem, µ(T_{µ,B}) = 1 [18, §7.9]. Then T_µ ≡ ∩_{B∈B} T_{µ,B} also has µ-measure 1, because B is countable.\n8Since µ ≠ µ′, there exists at least one set C with µ(C) ≠ µ′(C). The set T_{µ,C} then cannot overlap at all with the set T_{µ′,C}, and so T_µ cannot intersect T_{µ′}.\n9They are, however, mixing in the sense of ergodic theory [28].\n10ρ-almost-surely, X_{−∞}^0 belongs to the typical set of a unique absolutely regular process, say µ, and so the posterior concentrates on that µ, π(· | X_{−∞}^0) = δ_µ. Hence ρ(X_1^l, X_{l+k}^∞ | X_{−∞}^0) = µ(X_{−∞}^l, X_{l+k}^∞), which tends to µ(X_{−∞}^l)µ(X_{l+k}^∞) as k → ∞ because µ is absolutely regular.\n\nTheorem 3 A stationary process ρ is a β̃-uniform MAR if and only if\n\nlim_{k→∞} lim_{n→∞} E[ sup_l sup_{B∈σ(X_{k+l}^∞)} | ρ(B | X_1^l, X_{−n}^0) − ρ(B | X_{−n}^0) | ] = 0  (7)\n\nBefore proceeding to the proof, it is worth remarking on the order of the limits: for finite n, conditioning on X_{−n}^0 still gives a MAR, not a single (albeit random) absolutely-regular process. Hence the k → ∞ limit for fixed n would always be positive, and indeed 1 for a continuous π.\nPROOF "Only if": Re-write Eq. 7, expressing ρ in terms of π and the extremal processes:\n\nlim_{k→∞} lim_{n→∞} E[ sup_l sup_{B∈σ(X_{k+l}^∞)} | ∫ ( µ(B | X_1^l, X_{−n}^0) − µ(B | X_{−n}^0) ) dπ(µ | X_{−n}^0) | ]\n\nAs n → ∞, µ(B | X_{−n}^0) → µ(B | X_{−∞}^0), and µ(B | X_1^l, X_{−n}^0) → µ(B | X_{−∞}^l). But, in expectation, both of these are within β̃(k) of µ(B), and so within 2β̃(k) of each other. 
"If": Consider the contrapositive. If ρ is not a uniform MAR, either it is a non-uniform MAR, or it is not a MAR at all. If it is a non-uniform MAR, then no matter what function β̃(k) tending to zero we propose, the set of µ with β_µ ≥ β̃ must have positive π measure, i.e., a positive-measure set of processes must converge arbitrarily slowly. Therefore there must exist a set B (or a sequence of such sets) witnessing this arbitrarily slow convergence, and hence the limit in Eq. 7 will be strictly positive. If ρ is not a MAR at all, we know from the ergodic decomposition of stationary processes that it must be a mixture of ergodic processes, and so it must give positive π weight to processes which are not absolutely regular at all, i.e., µ for which β_µ(∞) > 0. The witnessing events B for these processes a fortiori drive the limit in Eq. 7 above zero. ∎\n\n5 Discussion and future work\n\nWe have shown that with the right conditioning, a natural and useful notion of predictive PAC emerges. This notion is natural in the sense that for a learner sampling from a mixture of ergodic processes, the only thing that matters is the future behavior of the component he is "stuck" in — and certainly not the process average over all the components. This insight enables us to adapt the recent PAC bounds for mixing processes to mixtures of such processes. An interesting question then is to characterize those processes that are convex mixtures of a given kind of ergodic process (de Finetti's theorem was the first such characterization). In this paper, we have addressed this question for mixtures of uniformly absolutely regular processes. Another fascinating question is how to extend predictive PAC to real-valued functions [33, 34].\n\nReferences\n[1] Mathukumalli Vidyasagar. 
A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. Springer-Verlag, Berlin, 1997.\n[2] Patrizia Berti and Pietro Rigo. A Glivenko-Cantelli theorem for exchangeable random variables. Statistics and Probability Letters, 32:385–391, 1997.\n[3] Vladimir Pestov. Predictive PAC learnability: A paradigm for learning from exchangeable input data. In Proceedings of the 2010 IEEE International Conference on Granular Computing (GrC 2010), pages 387–391, Los Alamitos, California, 2010. IEEE Computer Society. URL http://arxiv.org/abs/1006.1129.\n[4] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. Annals of Probability, 22:94–116, 1994. URL http://projecteuclid.org/euclid.aop/1176988849.\n[5] M. Vidyasagar. Learning and Generalization: With Applications to Neural Networks. Springer-Verlag, Berlin, second edition, 2003.\n[6] Terrence M. Adams and Andrew B. Nobel. Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. Annals of Probability, 38:1345–1367, 2010. URL http://arxiv.org/abs/1010.3162.\n[7] Ramon van Handel. The universal Glivenko-Cantelli property. Probability Theory and Related Fields, 155:911–934, 2013. doi: 10.1007/s00440-012-0416-5. URL http://arxiv.org/abs/1009.4434.\n[8] Dharmendra S. Modha and Elias Masry. Memory-universal prediction of stationary random processes. IEEE Transactions on Information Theory, 44:117–133, 1998. doi: 10.1109/18.650998.\n[9] Ron Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39:5–34, 2000. URL http://www.ee.technion.ac.il/~rmeir/Publications/MeirTimeSeries00.pdf.\n[10] Rajeeva L. Karandikar and Mathukumalli Vidyasagar. Rates of uniform convergence of empirical means with mixing processes. Statistics and Probability Letters, 58:297–307, 2002. 
doi: 10.1016/S0167-7152(02)00124-4.

[11] David Gamarnik. Extension of the PAC framework to finite and countable Markov chains. IEEE Transactions on Information Theory, 49:338–345, 2003. doi: 10.1145/307400.307478.

[12] Ingo Steinwart and Andreas Christmann. Fast learning from non-i.i.d. observations. In Y. Bengio, D. Schuurmans, John Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22 [NIPS 2009], pages 1768–1776. MIT Press, Cambridge, Massachusetts, 2009. URL http://books.nips.cc/papers/files/nips22/NIPS2009_1061.pdf.

[13] Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11, 2010. URL http://www.jmlr.org/papers/v11/mohri10a.html.

[14] Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-I.I.D. processes. In Daphne Koller, D. Schuurmans, Y. Bengio, and Léon Bottou, editors, Advances in Neural Information Processing Systems 21 [NIPS 2008], pages 1097–1104, 2009. URL http://books.nips.cc/papers/files/nips21/NIPS2008_0419.pdf.

[15] Pierre Alquier and Olivier Wintenberger. Model selection for weakly dependent time series forecasting. Bernoulli, 18:883–913, 2012. doi: 10.3150/11-BEJ359. URL http://arxiv.org/abs/0902.2924.

[16] Ben London, Bert Huang, and Lise Getoor. Improved generalization bounds for large-scale structured prediction. In NIPS Workshop on Algorithmic and Statistical Approaches for Large Social Networks, 2012. URL http://linqs.cs.umd.edu/basilic/web/Publications/2012/london:nips12asalsn/.

[17] Ben London, Bert Huang, Benjamin Taskar, and Lise Getoor. Collective stability in structured prediction: Generalization from one example. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning [ICML-13], volume 28, pages 828–836, 2013.
URL http://jmlr.org/proceedings/papers/v28/london13.html.

[18] Robert M. Gray. Probability, Random Processes, and Ergodic Properties. Springer-Verlag, New York, second edition, 2009. URL http://ee.stanford.edu/~gray/arp.html.

[19] E. B. Dynkin. Sufficient statistics and extreme points. Annals of Probability, 6:705–730, 1978. URL http://projecteuclid.org/euclid.aop/1176995424.

[20] Steffen L. Lauritzen. Extreme point models in statistics. Scandinavian Journal of Statistics, 11:65–91, 1984. URL http://www.jstor.org/pss/4615945. With discussion and response.

[21] Olav Kallenberg. Probabilistic Symmetries and Invariance Principles. Springer-Verlag, New York, 2005.

[22] Persi Diaconis and David Freedman. De Finetti's theorem for Markov chains. Annals of Probability, 8:115–130, 1980. doi: 10.1214/aop/1176994828. URL http://projecteuclid.org/euclid.aop/1176994828.

[23] R. M. Dudley. A course on empirical processes. In École d'été de probabilités de Saint-Flour, XII–1982, volume 1097 of Lecture Notes in Mathematics, pages 1–142. Springer, Berlin, 1984.

[24] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36:929–965, 1989. doi: 10.1145/76359.76371. URL http://users.soe.ucsc.edu/~manfred/pubs/J14.pdf.

[25] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005. URL http://www.numdam.org/item?id=PS_2005__9__323_0.

[26] Frank B. Knight. A predictive view of continuous time processes. Annals of Probability, 3:573–596, 1975. URL http://projecteuclid.org/euclid.aop/1176996302.

[27] William Bialek, Ilya Nemenman, and Naftali Tishby.
Predictability, complexity and learning. Neural Computation, 13:2409–2463, 2001. URL http://arxiv.org/abs/physics/0007070.

[28] Andrzej Lasota and Michael C. Mackey. Chaos, Fractals, and Noise: Stochastic Aspects of Dynamics. Springer-Verlag, Berlin, 1994. First edition published as Probabilistic Properties of Deterministic Systems, Cambridge University Press, 1985.

[29] Jérôme Dedecker, Paul Doukhan, Gabriel Lang, José Rafael León R., Sana Louhichi, and Clémentine Prieur. Weak Dependence: With Examples and Applications. Springer, New York, 2007.

[30] Norbert Wiener. Nonlinear prediction and dynamics. In Jerzy Neyman, editor, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 3, pages 247–252, Berkeley, 1956. University of California Press. URL http://projecteuclid.org/euclid.bsmsp/1200502197.

[31] Paul Doukhan. Mixing: Properties and Examples. Springer-Verlag, New York, 1995.

[32] Richard C. Bradley. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144, 2005. URL http://arxiv.org/abs/math/0511078.

[33] Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44:615–631, 1997. doi: 10.1145/263867.263927. URL http://tau.ac.il/~nogaa/PDFS/learn3.pdf.

[34] Peter L. Bartlett and Philip M. Long. Prediction, learning, uniform convergence, and scale-sensitive dimensions. Journal of Computer and System Sciences, 56:174–190, 1998. doi: 10.1006/jcss.1997.1557. URL http://www.phillong.info/publications/more_theorems.pdf.
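The discussion's central point, that a learner sampling from a mixture of ergodic processes only cares about the component it is "stuck" in and not the process average, can be sketched with a toy simulation. This is an illustrative sketch, not code from the paper: the mixture of two IID Bernoulli processes (with arbitrary biases 0.3 and 0.7) is the simplest exchangeable-but-not-IID example, and the empirical mean stands in for the empirical risk of a trivial concept class.

```python
import random

def sample_mixture_path(n, p_components=(0.3, 0.7), seed=0):
    """Draw one sample path from a mixture of two IID Bernoulli processes.

    The mixture first picks a component (a coin bias), and then the whole
    path is IID from that component. The resulting process is exchangeable
    but not IID: successive observations are dependent through the
    latent component.
    """
    rng = random.Random(seed)
    p = rng.choice(p_components)  # the component this path is "stuck" in
    path = [1 if rng.random() < p else 0 for _ in range(n)]
    return p, path

def empirical_mean(path):
    return sum(path) / len(path)

if __name__ == "__main__":
    p, path = sample_mixture_path(100_000, seed=42)
    m = empirical_mean(path)
    # Conditional on the component, the empirical mean converges to that
    # component's mean p ...
    print(abs(m - p))    # small
    # ... but not to the average over components, 0.5: the gap stays near
    # |p - 0.5| = 0.2 no matter how long the path is.
    print(abs(m - 0.5))
```

Conditioning on the latent component here plays the same role as conditioning on the mixture component in the predictive PAC definition: the quantity that concentrates is the component-conditional mean, not the mixture mean.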