{"title": "Lifelong Learning with Weighted Majority Votes", "book": "Advances in Neural Information Processing Systems", "page_first": 3612, "page_last": 3620, "abstract": "Better understanding of the potential benefits of information transfer and representation learning is an important step towards the goal of building intelligent systems that are able to persist in the world and learn over time. In this work, we consider a setting where the learner encounters a stream of tasks but is able to retain only limited information from each encountered task, such as a learned predictor. In contrast to most previous works analyzing this scenario, we do not make any distributional assumptions on the task generating process. Instead, we formulate a complexity measure that captures the diversity of the observed tasks. We provide a lifelong learning algorithm with error guarantees for every observed task (rather than on average). We show sample complexity reductions in comparison to solving every task in isolation in terms of our task complexity measure. Further, our algorithmic framework can naturally be viewed as learning a representation from encountered tasks with a neural network.", "full_text": "Lifelong Learning with Weighted Majority Votes\n\nAnastasia Pentina\n\nIST Austria\n\napentina@ist.ac.at\n\nRuth Urner\n\nMax Planck Institute for Intelligent Systems\n\nrurner@tuebingen.mpg.de\n\nAbstract\n\nBetter understanding of the potential benefits of information transfer and representation learning is an important step towards the goal of building intelligent systems that are able to persist in the world and learn over time. In this work, we consider a setting where the learner encounters a stream of tasks but is able to retain only limited information from each encountered task, such as a learned predictor. In contrast to most previous works analyzing this scenario, we do not make any distributional assumptions on the task generating process. 
Instead, we formulate a complexity measure that captures the diversity of the observed tasks. We provide a lifelong learning algorithm with error guarantees for every observed task (rather than on average). We show sample complexity reductions in comparison to solving every task in isolation in terms of our task complexity measure. Further, our algorithmic framework can naturally be viewed as learning a representation from encountered tasks with a neural network.\n\n1 Introduction\n\nMachine learning has made significant progress in understanding both theoretical and practical aspects of solving a single prediction problem from a set of annotated examples. However, if we aim at building autonomous agents capable of persisting in the world, we need to establish methods for continuously learning various tasks over time [25, 26]. There is no hope of initially providing, for example, an autonomous robot with sufficiently rich prior knowledge to solve any problem that it may encounter during the course of its life. Therefore, an important goal of machine learning research is to replicate humans' ability to learn from experience and to reuse knowledge from previously encountered tasks for solving new ones more efficiently. This is the aim of lifelong learning, or learning to learn, where a learning algorithm encounters a stream of tasks and tries to exploit commonalities between them by transferring information from earlier tasks to later ones. The first theoretical formulation of this framework was proposed by Baxter [4]. In that model, tasks are generated by a probability distribution and the goal, given a sample of tasks from this distribution, is to perform well in expectation over tasks. Under certain assumptions, such as a shared good hypothesis set, this model allows for sample complexity savings [4]. However, good performance in expectation is often too weak a requirement. 
To stay with the robot example, failure on a single task may cause severe malfunction – and the end of the robot's life. Moreover, the theoretical analysis of this model relies on the assumption that the learner maintains access to training data for all previously observed tasks, which allows the formulation of a joint optimization problem. However, it is unlikely that an autonomous robot is able to keep all this data.\n\nThus, we instead focus on a streaming setting for lifelong learning, where the learner can only retain learned models from previously encountered tasks. These models have a much more compact description than the joint training data. Specifically, we are interested in analysis and performance guarantees in the scenario where 1) tasks arrive one at a time without distributional or i.i.d. assumptions, 2) the learner can only keep the learned hypotheses from previously observed tasks, 3) error bounds are required for every single task, rather than on average.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThe first analysis of this challenging setting was recently provided by Balcan et al. [3]. That work demonstrates sample complexity improvements for learning linear halfspaces (and some boolean function classes) in the lifelong learning setting in comparison to solving each task in isolation, under the assumption that the tasks share a common low-dimensional representation. However, the analysis relies on the marginal distributions of all tasks being isotropic log-concave. It was stated as an open challenge in that work whether similar guarantees (error bounds for every task, while only keeping limited information from earlier tasks) were possible under less restrictive distributional assumptions.\n\nIn this work, we (partially) answer this question in the positive. We do so by proposing to learn with weighted majority votes rather than linear combinations over linear predictors. 
We show that the shift from linear combinations to majority votes introduces stability to the learned ensemble that allows exploiting it for later tasks. Additionally, we show that this stability is achieved for any ground hypothesis class. We formulate a relatedness assumption on the sequence of tasks (similar to one used in [3]) that captures how suitable to lifelong learning a sequence of tasks is. With this, we prove that sample complexity savings through lifelong learning are obtained for arbitrary marginal distributions (provided that these marginal distributions are related in terms of their discrepancy [5, 17]). This is a significant generalization towards more practically relevant scenarios.\n\nSummary of our work We employ a natural algorithmic paradigm, similar to the one in [3]. The algorithm maintains a set of base hypotheses from some fixed ground hypothesis class H. These base hypotheses are predictors learned on previous tasks. For each new task, the algorithm first attempts to achieve good prediction performance with a weighted majority vote over the current base hypotheses, and uses this predictor for the task if successful. Otherwise (if no majority vote classifier achieves high accuracy), the algorithm resorts to learning a classifier from the ground class for this task. This classifier is then added to the set of base hypotheses, to be used for subsequent tasks. We describe this algorithm in Section 4.1.\n\nIf the ground class is the class of linear predictors, this algorithm is actually learning a neural network. Each base classifier becomes a node in a hidden middle layer, which represents a learned feature representation of the neural net. A new task is then either solved by employing the representation learned from previous tasks (the current middle layer), and just learning task-specific weights for the last layer, or, in case this is not possible, it extends the current representation. 
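The decision loop just summarized can be sketched in a few lines. This is a minimal illustration of the paradigm, not the paper's implementation: the names `erm_ground` and `erm_votes` are ours and stand in for ERM oracles over the ground class H and over weighted majority votes of the stored base hypotheses, and the sample-size bookkeeping of Section 4.1 is omitted.

```python
def lifelong_loop(tasks, erm_ground, erm_votes, epsilon):
    """Sketch of the lifelong learning paradigm (illustrative oracle names).

    tasks      -- labeled training samples [(x, y), ...], one list per task
    erm_ground -- oracle mapping a sample to a hypothesis from the ground class H
    erm_votes  -- oracle mapping (bases, sample) to a weighted majority vote
                  over the fixed base hypotheses and its empirical error
    """
    bases, predictors = [], []
    for sample in tasks:
        if bases:
            g, err = erm_votes(bases, sample)
            if err <= epsilon:          # a vote over stored models suffices: reuse
                predictors.append(g)
                continue
        g = erm_ground(sample)          # otherwise learn from the ground class ...
        bases.append(g)                 # ... and extend the set of base hypotheses
        predictors.append(g)
    return predictors, bases
```

With linear predictors as the ground class, `bases` plays the role of the hidden middle layer described in Section 4.2.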
See also Section 4.2.\n\nThis paradigm yields sample complexity savings if the tasks encountered are related in the sense that for many tasks, good classification accuracy can be achieved with a weighted majority vote over previously learned models. We formally capture this property as an effective dimension of the sequence of tasks. We prove in Section 4.3 that if this effective dimension is bounded by k, then the total sample complexity for learning n tasks with this paradigm is upper bounded by Õ((nk + VC(H)k²)/ε), a reduction from Õ(n·VC(H)/ε), the sample complexity of learning n tasks individually without retaining information from previous tasks.\n\nThe main technical difficulty is to control the propagation of errors. Since every task is represented by a finite training set, the learner has access only to approximations of the true labeling functions, which may degrade the quality of this collection of functions as a "basis" for new tasks. Balcan et al. [3] control this error propagation using elegant geometric arguments for linear predictors under isotropic log-concave distributions. We show that moving from linear combinations to majority votes yields the required stability for the quality of the representation under arbitrary distributions.\n\nFinally, while we first present our algorithm and results for known upper bounds of k base tasks and n tasks in total, we also provide a variation of the algorithm that does not need to know the number of tasks and the complexity parameter of the task sequence. We show that similar sample complexity improvements are achievable in this setting in Section 5.\n\n2 Related Work\n\nLifelong learning. While there are many ways in which prediction tasks may be related [20], most of the existing approaches to transfer or lifelong learning exploit possible similarities between the optimal predictors for the considered tasks. 
In particular, one widely used relatedness assumption is that these predictors can be described as linear or sparse combinations of some common meta-features, and the corresponding methods aim at learning these representations [10, 2, 15, 3]. Though this idea was originally used in the multi-task setting, it was later extended to lifelong learning by Eaton et al. [9], who proposed a method for sequentially updating the underlying representation as new tasks arrive. These settings were theoretically analyzed in a series of works [18, 19, 23, 21] that have demonstrated that information transfer can lead to provable sample complexity reductions compared to solving each task independently. However, all these results rely on Baxter's model of lifelong learning and therefore assume access to the training data for all (observed) tasks and provide guarantees only on average performance over all tasks. An exception is [6], where the authors provide error guarantees for every task in the multi-task scenario. However, these guarantees are due to the relatedness assumption used, which implies that all tasks have the same expected error. The task relatedness assumption that we employ is related to the one used in [1] for multi-task learning with expert advice. There the authors consider a setting where there exists a small subset of experts that perform well on all tasks. Similarly, we assume that there is a small subset of base tasks, such that the remaining ones can be solved well using majority votes over the corresponding base hypotheses.\n\nMajority votes. Weighted majority votes are a type of ensemble predictor that is theoretically well understood and widely used in practice. In particular, they are employed in boosting [11]. They are also often considered in works that utilize PAC-Bayesian techniques [22, 12]. Majority votes are also used in the concept drift setting [14]. 
The corresponding method, conceptually similar to the one proposed here, dynamically updates a set of experts and uses their weighted majority votes for making predictions.\n\n3 Formal Setting\n\n3.1 General notation and background\n\nWe let X ⊆ R^d denote a domain set and let Y denote a label set. A hypothesis is a function h : X → Y, and a hypothesis class H is a set of hypotheses. We model learning tasks as pairs ⟨D, h*⟩ of a distribution D over X and a labeling function h* : X → Y. The quality of a hypothesis is measured by a loss function ℓ : Y × Y → R+. We deal with binary classification tasks, that is, Y = {−1, 1}, under the 0/1-loss function, that is, ℓ(y, y') = [[y ≠ y']] (we let [[·]] denote the indicator function). The risk of a hypothesis h with respect to task ⟨D, h*⟩ is defined as its expected loss:\n\nL_{D,h*}(h) := E_{x∼D}[ℓ(h(x), h*(x))].\n\nGiven a sample S = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, the empirical risk of h with respect to S is\n\nL_S(h) := (1/n) Σ_{i=1}^n ℓ(h(x_i), y_i).\n\nFor binary classification, the sample complexity of learning a hypothesis class is characterized (that is, upper and lower bounded) by the VC-dimension of the class [27]. We will employ the following generalization bounds for classes of finite VC-dimension:\n\nTheorem 1 (Corollaries 5.2 and 5.3 in [7]). Let H be a class of binary functions with a finite VC-dimension. There exists a constant C, such that for any δ ∈ (0, 1), for any task ⟨D, h*⟩, with probability at least 1 − δ over a training set S of size n, sampled i.i.d. from ⟨D, h*⟩:\n\nL_{D,h*}(ĥ) ≤ L_S(ĥ) + sqrt(L_S(ĥ) · Δ) + Δ,   (1)\n\nL_S(ĥ) ≤ L_{D,h*}(ĥ) + sqrt(L_{D,h*}(ĥ) · Δ) + Δ,   (2)\n\nL_{D,h*}(ĥ) ≤ inf_{h∈H} L_{D,h*}(h) + sqrt(inf_{h∈H} L_{D,h*}(h) · Δ) + Δ,   (3)\n\nwhere ĥ ∈ argmin_{h∈H} L_S(h) is an empirical risk minimizer and\n\nΔ = C (VC(H) log(n) + log(1/δ)) / n.   (4)\n\nIn the realizable case (h* ∈ H), the above bounds imply that the sample complexity is upper bounded by Õ((VC(H) + log(1/δ))/ε).\n\nWeighted majority votes Given a hypothesis class H, we define the class of k-majority votes as\n\nMV(H, k) = { g : X → Y | ∃ h_1, ..., h_k ∈ H, ∃ w_1, ..., w_k ∈ R : g(x) = sign(Σ_{i=1}^k w_i h_i(x)) }.\n\nWe will omit k in the above notation if it is clear from the context. The VC-dimension of MV(H, k) is upper bounded as\n\nVC(MV(H, k)) ≤ k (VC(H) + 1)(3 log(VC(H) + 1) + 2)   (5)\n\n(see Theorem 10.3 in [24]). This implies, in particular, that the VC-dimension of majority votes over a fixed set of k functions is upper bounded by Õ(k log(k)).\n\n3.2 Lifelong learning\n\nIn the lifelong learning setting the learner encounters a stream of prediction problems ⟨D_1, h*_1⟩, ..., ⟨D_n, h*_n⟩, one at a time. In this work we focus on the realizable case, i.e. h*_i ∈ H for every i and some fixed H. In contrast to most works on lifelong learning, we assume that the only information the learner is able to store about the already observed tasks is the obtained predictors, i.e. it does not have access to the training data of already solved tasks. 
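The class MV(H, k) defined above is straightforward to evaluate once the base hypotheses and weights are fixed. The following helper is our illustration (not code from the paper), computing g(x) = sign(Σ_i w_i h_i(x)) with sign(0) resolved to +1 by convention:

```python
def majority_vote(weights, bases, x):
    """Weighted majority vote g(x) = sign(sum_i w_i * h_i(x)).

    Each base hypothesis h_i maps a point to {-1, +1}; the weights are
    arbitrary reals, so negated base predictors are also representable.
    (Illustrative helper; sign(0) is resolved to +1 by convention.)
    """
    s = sum(w * h(x) for w, h in zip(weights, bases))
    return 1 if s >= 0 else -1
```

For example, with bases h_1(x) = sign(x[0]) and h_2(x) = sign(x[1]) and weights (2, 1), the first base dominates whenever the two disagree, and a single base with a negative weight realizes its negation.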
The need to keep little information from previous tasks has also been argued for in the context of domain adaptation [16].\n\nPossible benefits of information transfer depend on how related or, in other words, how diverse the observed tasks are. Moreover, since we do not make any assumptions on the task generation process (in contrast to Baxter's i.i.d. model [4]), we will formulate our relatedness assumption in terms of a sequence of tasks. Intuitively, one would expect information transfer to be beneficial if only a few times throughout the course of learning the information obtained from the already solved tasks is not sufficient to solve the current one. In order to formalize this intuition, we use the following (pseudo-)metric over the hypothesis class with respect to a marginal distribution D:\n\nd_D(h, h') = E_{x∼D} [[h(x) ≠ h'(x)]].   (6)\n\nFurther, we can define the distance of a hypothesis to a hypothesis space as\n\nd_D(h, H') = min_{h'∈H'} d_D(h, h')   (7)\n\nand the distance between two sets of hypotheses as\n\nd_D(H, H') = max_{h∈H} d_D(h, H') = max_{h∈H} min_{h'∈H'} d_D(h, h').   (8)\n\nNote that the latter is not necessarily a metric over subsets of the hypothesis space. However, it does satisfy the triangle inequality (see Section 1 in the supplementary material).\n\nNow we can formulate the diversity measure for a sequence of learning tasks that we will employ. Note that the concepts below are closely related to the ones used in [3] for the case of linear predictors and linear combinations over these.\n\nDefinition 1. A sequence of learning tasks ⟨D_1, h*_1⟩, ..., ⟨D_n, h*_n⟩ is γ-separated if for every i, d_{D_i}(h*_i, MV(h*_1, ..., h*_{i−1})) > γ.\n\nDefinition 2. A sequence of learning tasks ⟨D_1, h*_1⟩, ..., ⟨D_n, h*_n⟩ has γ-effective dimension k if the largest γ-separated subsequence of these tasks has length k.\n\nFormally, we will assume that the γ-effective dimension k of the observed sequence of tasks is relatively small for a sufficiently small γ. Note that this assumption can also be seen as a relaxation of the one used in [8]. There the authors assumed that there exists a set of k hypotheses such that every task can be solved well by one of them. This would correspond to substituting the sets of weighted majority votes MV(h*_1, ..., h*_{i−1}) by just the collections {h*_1, ..., h*_{i−1}} in the above definitions.\n\nMoreover, we will assume that the marginal distributions have small discrepancy with respect to the hypothesis set H:\n\ndisc_H(D_i, D_j) = max_{h,h'∈H} |d_{D_i}(h, h') − d_{D_j}(h, h')|.   (9)\n\nThis is a measure of task relatedness that was introduced in [13] and shown to be beneficial in the context of domain adaptation [5, 17]. Note, however, that we do not make any assumptions on the marginal distributions D_1, ..., D_n themselves.\n\n4 Algorithm and complexity guarantees\n\n4.1 The algorithm\n\nWe employ a natural algorithmic paradigm, which is similar to the one in [3]. Algorithm 1 below provides pseudocode for our procedure. The algorithm takes as parameters a class H, which we call the ground class, accuracy and confidence parameters ε and δ, as well as a task horizon n (the number of tasks to be solved) and a parameter k (a guessed upper bound on the number of tasks that will not be solvable as majority votes over earlier tasks). 
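To make Definitions 1 and 2 concrete, the following sketch (our illustration, not part of the paper) estimates the disagreement pseudo-metric d_D on a finite sample and greedily scans a task sequence, keeping a task whenever its labeling function appears more than γ away from every sampled majority vote over the kept tasks. The true distance to MV(...) minimizes over all real weight vectors, which random search can only upper-bound, so the returned length is a heuristic estimate related to the γ-effective dimension, not an exact computation.

```python
import random

def disagreement(h1, h2, xs):
    """Empirical analogue of d_D(h, h') = E_x [[h(x) != h'(x)]] on points xs."""
    return sum(h1(x) != h2(x) for x in xs) / len(xs)

def dist_to_majority_votes(h, bases, xs, n_trials=500, seed=0):
    """Upper bound on d_D(h, MV(bases)) found by sampling random weight vectors."""
    rng = random.Random(seed)
    best = 1.0
    for _ in range(n_trials):
        ws = [rng.uniform(-1.0, 1.0) for _ in bases]
        g = lambda x, ws=ws: (1 if sum(w * b(x) for w, b in zip(ws, bases)) >= 0
                              else -1)
        best = min(best, disagreement(h, g, xs))
    return best

def greedy_separated_length(labelers, samples, gamma):
    """Greedy gamma-separated subsequence (cf. Definitions 1 and 2): keep a
    task if its labeler is estimated > gamma away from votes over kept ones."""
    kept = []
    for h, xs in zip(labelers, samples):
        if not kept or dist_to_majority_votes(h, kept, xs) > gamma:
            kept.append(h)
    return len(kept)
```

Note that a negated copy of a stored labeler has distance zero to the majority-vote class, since the weights may be negative.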
In Section 5 we present a version that does not need to know n and k in advance.\n\nAlgorithm 1 Lifelong learning of majority votes\n1: Input parameters H, n, k, ε, δ\n2: set δ' = δ/(2n), ε' = ε/(8k)\n3: draw a training set S_1 from ⟨D_1, h*_1⟩, such that Δ_1 := Δ(VC(H), δ', |S_1|) ≤ ε'\n4: g_1 = argmin_{h∈H} L_{S_1}(h)\n5: set k̃ = 1, i_1 = 1, h̃_1 = g_1\n6: for i = 2 to n do\n7: draw a training set S_i from ⟨D_i, h*_i⟩, such that Δ_i := Δ(VC(MV(h̃_1, ..., h̃_k̃)), δ', |S_i|) ≤ ε/40\n8: g_i = argmin_{h∈MV(h̃_1,...,h̃_k̃)} L_{S_i}(h)\n9: if L_{S_i}(g_i) + sqrt(L_{S_i}(g_i) · Δ_i) + Δ_i > ε then\n10: draw a training set S_i from ⟨D_i, h*_i⟩, such that Δ_i := Δ(VC(H), δ', |S_i|) ≤ ε'\n11: g_i = argmin_{h∈H} L_{S_i}(h)\n12: set k̃ = k̃ + 1, h̃_k̃ = g_i, i_k̃ = i\n13: end if\n14: end for\n15: return g_1, ..., g_n\n\nDuring the course of its "life", the algorithm maintains a set of base hypotheses (h̃_1, ..., h̃_k̃) from the ground class, which are predictors learned on previous tasks. In order to solve the first task, it uses the hypothesis set H and a large enough training set S_1 to ensure the error guarantee ε' ≤ ε/(8k) with probability at least 1 − δ', where δ' = δ/(2n). The learned hypothesis h̃_1 ∈ H is the first member of the set of base hypotheses. For each new task i, the algorithm first attempts to achieve good prediction performance (up to error ε) with a weighted majority vote over the base hypotheses, i.e. it attempts to learn this task using the class MV(h̃_1, ..., h̃_k̃), and uses the obtained predictor for the task if successful. 
Otherwise (if no majority vote classifier achieves high accuracy), the algorithm resorts to learning a classifier from the base class for this task, which is then added to the set of base hypotheses, to be used for subsequent tasks. The error guarantees are ensured with Theorem 1 by choosing the training sets S_i large enough so that\n\nΔ_i := Δ(VC(H_i), δ', |S_i|) := C (VC(H_i) log(|S_i|) + log(1/δ')) / |S_i| ≤ cε,\n\nwhere H_i is either the ground class H or the set of weighted majority votes over the current set of base hypotheses MV(h̃_1, ..., h̃_k̃), and the constant c is set according to the case, see the pseudocode.\n\nWhile this approach is very natural, the challenge is to analyze it and to specify the parameters. In particular, we need to ensure that the algorithm will not have to search over the (potentially large) hypothesis set H too often and, consequently, will lead to provable sample complexity reductions over solving each task independently. The following theorem summarizes the performance guarantees for Algorithm 1 (the proof is in Section 4.3).\n\nTheorem 2. Consider running Algorithm 1 on a sequence of tasks with γ-effective dimension at most k and disc_H(D_i, D_j) ≤ ξ for all i, j. Then, if γ ≤ ε/4 and kξ < ε/8, with probability at least 1 − δ:\n\n• The error of every task is bounded: L_{D_i,h*_i}(g_i) ≤ ε for every i = 1, ..., n.\n\n• The total number of labeled examples used is Õ((nk + VC(H)k²)/ε).\n\nDiscussion Note that if we assume that all tasks are realizable by H, independently learning them up to error ε would have sample complexity Õ(VC(H)n/ε). The sample complexity of learning n tasks in the lifelong learning regime with our paradigm, in contrast, is Õ((nk + VC(H)k²)/ε). This is a significant reduction if the effective dimension k of the task sequence is small in comparison to the total number n of tasks, as well as to the complexity measure VC(H) of the ground class. That is, if most tasks are learnable as combinations of previously stored base predictors, much less data is required overall.\n\nNote that for all those tasks that are solved as majority votes, our algorithm and analysis actually require realizability only by the class of k-majority votes over H and not by the ground class H. Learning the n tasks independently under this assumption has sample complexity Õ(VC(H)k/ε + (n−k)VC(H)k/ε). In contrast, the lifelong learning method gradually identifies the relevant set of base predictors and thereby reduces the number of required examples.\n\n4.2 Neural networks\n\nIf the ground class is the class of linear predictors, our algorithm is actually learning a neural network (with sign() as the activation function). Each base classifier becomes a new node in a hidden middle layer. Thus, the maintained set of base classifiers can be viewed as a feature representation in the neural net, which was learned based on the encountered tasks. A new task is then either solved by employing the representation learned from previous tasks (the current middle layer), and just learning task-specific weights for the last layer; or, in case this is not possible, a fresh linear classifier is learned and added as a node to the middle layer. Thus, in this case, the feature representation is extended.\n\n4.3 Analysis\n\nWe start by presenting the following two lemmas that show how to control the error propagation of the learned representations (sets of base classifiers). We then proceed to the proof of Theorem 2.\n\nLemma 1. Let V = MV(h_1, ..., h_k, g) and Ṽ = MV(h_1, ..., h_k, g̃). 
Then, for any distribution D:\n\nd_D(V, Ṽ) ≤ d_D(g, g̃).   (10)\n\nProof. By the definition of d_D(V, Ṽ) there exists u ∈ V such that:\n\nd_D(V, Ṽ) = d_D(u, Ṽ).   (11)\n\nWe can represent u as u = sign(Σ_{i=1}^k α_i h_i + α g) and let u_1 = Σ_{i=1}^k α_i h_i. Note that while all the h_i's, g and g̃ are assumed to take values in {−1, 1}, u_1 can take values in R. Then:\n\nd_D(u, Ṽ) = min_{h̃∈Ṽ} d_D(u, h̃) ≤ min_{h̃∈MV(u_1,g̃)} d_D(u, h̃) ≤ max_{h∈MV(u_1,g)} min_{h̃∈MV(u_1,g̃)} d_D(h, h̃) = d_D(MV(u_1, g), MV(u_1, g̃)).\n\nNow we show that for any α_1 u_1 + α_2 g ∈ MV(u_1, g) there exists a close hypothesis in MV(u_1, g̃). In particular, this hypothesis is α_1 u_1 + α_2 g̃:\n\nd_D(α_1 u_1 + α_2 g, α_1 u_1 + α_2 g̃) = E_{x∼D} [[sign(α_1 u_1(x) + α_2 g(x)) ≠ sign(α_1 u_1(x) + α_2 g̃(x))]]\n= E_{x∼D} [[α_1² u_1²(x) + α_1 α_2 u_1(x) g(x) + α_1 α_2 u_1(x) g̃(x) + α_2² g(x) g̃(x) < 0]].   (12)\n\nNote that for every x on which g and g̃ agree, i.e. g(x) g̃(x) = 1, we obtain:\n\nα_1² u_1²(x) + α_1 α_2 u_1(x) g(x) + α_1 α_2 u_1(x) g̃(x) + α_2² g(x) g̃(x) = (α_1 u_1(x) + α_2 g(x))² ≥ 0.\n\nTherefore:\n\nd_D(α_1 u_1 + α_2 g, α_1 u_1 + α_2 g̃) ≤ E_{x∼D} [[g(x) ≠ g̃(x)]] = d_D(g, g̃).\n\nLemma 2. Let V_k = MV(h_1, ..., h_k) and Ṽ_k = MV(h̃_1, ..., h̃_k). For any distribution D, if d_D(h_i, h̃_i) ≤ ε_i for every i = 1, ..., k, then d_D(V_k, Ṽ_k) ≤ Σ_{i=1}^k ε_i.\n\nFor the proof see Section 2 in the supplementary material.\n\nProof of Theorem 2. 1. First, note that for every task Algorithm 1 solves at most 2 estimation problems with a probability of failure δ' for each of them. Therefore, with a union bound argument, the probability of any of these estimations being wrong is at most 2·n·δ' = δ. Thus, from now on we assume that all the estimations were correct, that is, the high probability events of Theorem 1 hold.\n\n2. To see that the error of every encountered task is bounded by ε, note that there are two cases. For tasks i that are solved by a majority vote over previous tasks, we have L_{S_i}(g_i) + sqrt(L_{S_i}(g_i) · Δ_i) + Δ_i ≤ ε. In this case, Equation (1) in Theorem 1 implies L_{D_i,h*_i}(g_i) ≤ ε. For tasks i that are not solved as a majority vote over previous tasks, we have Δ_i = Δ(VC(H), δ', |S_i|) ≤ ε/(8k). Since task i is realizable by the base class H, we have inf_{h∈H} L_{D_i,h*_i}(h) = 0, and thus Equation (3) of Theorem 1 implies L_{D_i,h*_i}(g_i) ≤ ε/(8k) < ε.\n\n3. To upper bound the sample complexity we first prove that the number k̃ of tasks which are not learned as majority votes over previous tasks is at most k. For that we use induction, showing that for every k̂ ≤ k̃, when we create a new h̃_k̂ from the i_k̂-th task, we have that\n\nd_{D_{i_k̂}}(h*_{i_k̂}, MV(h*_{i_1}, ..., h*_{i_{k̂−1}})) > γ.   (13)\n\nThis implies k̃ ≤ k by invoking that the γ-effective dimension of the sequence of encountered tasks is at most k.\n\nTo proceed to the induction, note that for k̂ = 1, the claim follows immediately. Consider k̂ > 1. If we create a new h̃_k̂, it means that the condition in line 9 is true, which is:\n\nL_{S_{i_k̂}}(g_{i_k̂}) + sqrt(L_{S_{i_k̂}}(g_{i_k̂}) · Δ_i) + Δ_i > ε.   (14)\n\nTherefore L_{S_{i_k̂}}(g_{i_k̂}) > 0.83ε. Consequently, due to (2), L_{D_{i_k̂},h*_{i_k̂}}(g_{i_k̂}) > 0.67ε. Finally, by (3), inf_g L_{D_{i_k̂},h*_{i_k̂}}(g) > 0.5ε, where the infimum is over the class MV(h̃_1, ..., h̃_{k̂−1}). Therefore there is no majority vote predictor based on h̃_1, ..., h̃_{k̂−1} that leads to error less than ε/2 on the problem i_k̂. In other words:\n\nd_{D_{i_k̂}}(h*_{i_k̂}, MV(h̃_1, ..., h̃_{k̂−1})) > ε/2.   (15)\n\nNow, by way of contradiction, suppose that d_{D_{i_k̂}}(h*_{i_k̂}, MV(h*_{i_1}, ..., h*_{i_{k̂−1}})) ≤ γ. By construction, for every j = 1, ..., k̂−1, d_{D_{i_j}}(h*_{i_j}, h̃_j) ≤ ε' ≤ ε/(8k). By the definition of discrepancy and the assumption on the marginal distributions it follows that for all j:\n\nd_{D_{i_k̂}}(h*_{i_j}, h̃_j) ≤ d_{D_{i_j}}(h*_{i_j}, h̃_j) + disc_H(D_{i_j}, D_{i_k̂}) ≤ ε' + ξ.   (16)\n\nTherefore by Lemma 2:\n\nd_{D_{i_k̂}}(MV(h*_{i_1}, ..., h*_{i_{k̂−1}}), MV(h̃_1, ..., h̃_{k̂−1})) ≤ k(ε' + ξ).   (17)\n\nConsequently, by using the triangle inequality:\n\nd_{D_{i_k̂}}(h*_{i_k̂}, MV(h̃_1, ..., h̃_{k̂−1})) ≤ γ + k(ε' + ξ) ≤ ε/4 + ε/8 + ε/8 = ε/2,   (18)\n\nwhich is in contradiction with (15).\n\n4. The total sample complexity of Algorithm 1 consists of two parts. 
First, for every task Algorithm 1 checks whether it can be solved by a majority vote over the at most k̃ base predictors. For that it employs Theorem 1 and therefore needs the following number of samples:\n\nÕ(n k̃ log(k̃) log(k̃ log k̃) log(2n/δ) / ε) = Õ(nk/ε).   (19)\n\nSecond, there are at most k̃ tasks that satisfy the condition in line 9 and are learned using the hypothesis set H with estimation error ε' = ε/(8k). Therefore the corresponding sample complexity is O(k̃ VC(H) log(2n/δ) / (ε/(8k))) = Õ(VC(H)k²/ε).\n\n5 Lifelong learning with unknown horizon\n\nIn this section we present a modification of Algorithm 1 for the case when the total number of tasks n and the complexity of the task sequence k are not known in advance. The main difference between Algorithm 2 and Algorithm 1 is that with unknown n and k the learner has to adapt the parameters δ' and ε' on the fly. We show that this can be done by the doubling trick that is often used in online learning. 
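The parameter schedule behind this can be stated compactly. The sketch below (ours, for illustration) reads the logarithms in Algorithm 2 as base 2, an assumption on our part. This choice makes the per-task confidences sum geometrically: tasks i in [2^l, 2^{l+1}) each receive δ_i = δ/2^{2l+2}, so the total failure probability over all i ≥ 2 is at most δ/2, which together with δ_1 = δ/2 keeps the overall budget below δ without knowing n.

```python
import math

def per_task_parameters(i, k_tilde, epsilon, delta):
    """Doubling-trick schedule for task i >= 2 with k_tilde stored bases.

    Sketch of the parameter updates of Algorithm 2, assuming base-2 logs:
    delta_i shrinks with the task index, epsilon'_i with the base count,
    so neither the horizon n nor the complexity k is needed in advance.
    """
    l = math.floor(math.log2(i))
    m = math.floor(math.log2(k_tilde) + 1)
    delta_i = delta / 2 ** (2 * l + 2)
    eps_i = epsilon / 2 ** (2 * m + 4)
    return delta_i, eps_i
```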
Theorem 3 summarizes the resulting guarantees (the proof can be found in the supplementary material, Section 3).

Algorithm 2 Lifelong learning of majority votes with unknown horizon
1: Input parameters $\mathcal H$, $\epsilon$, $\delta$
2: set $\delta_1 = \delta/2$, $\epsilon'_1 = \epsilon/16$
3: draw a training set $S_1$ from $\langle D_1, h^*_1\rangle$ of size $m$, such that $\Delta(\mathrm{VC}(\mathcal H), \delta_1, m) \le \epsilon'_1$ (see (4))
4: $g_1 = \arg\min_{h\in\mathcal H} L_{S_1}(h)$
5: set $\hat k = 1$, $i_1 = 1$, $\tilde h_1 = g_1$
6: for $i = 2$ to $n$ do
7:   set $l = \lfloor \log i\rfloor$, $m = \lfloor \log \hat k + 1\rfloor$
8:   set $\delta_i = \delta/2^{2l+2}$, $\epsilon'_i = \epsilon/2^{2m+4}$
9:   draw a training set $S_i$ from $\langle D_i, h^*_i\rangle$ of size $m$, such that $\Delta(\mathrm{VC}(\mathrm{MV}(\tilde h_1,\dots,\tilde h_{\hat k})), \delta_i, m) \le \epsilon/40$ (see (4))
10:  $g_i = \arg\min_{h\in \mathrm{MV}(\tilde h_1,\dots,\tilde h_{\hat k})} L_{S_i}(h)$
11:  if $L_{S_i}(g_i) + \sqrt{L_{S_i}(g_i)\cdot\Delta} + \Delta > \epsilon$ then
12:    draw a training set $S_i$ from $\langle D_i, h^*_i\rangle$ of size $m$, such that $\Delta(\mathrm{VC}(\mathcal H), \delta_i, m) \le \epsilon'_i$ (see (4))
13:    $g_i = \arg\min_{h\in\mathcal H} L_{S_i}(h)$
14:    set $\hat k = \hat k + 1$, $\tilde h_{\hat k} = g_i$, $i_{\hat k} = i$
15:  end if
16: end for
17: return $g_1,\dots,g_n$

Theorem 3. Consider running Algorithm 2 on a sequence of tasks with $\gamma$-effective dimension at most $k$ and $\mathrm{disc}_{\mathcal H}(D_i, D_j) \le \xi$ for all $i, j$. Then, if $\gamma \le \epsilon/4$ and $k\xi < \epsilon/8$, with probability at least $1-\delta$:

• The error of every task is bounded: $L_{D_i,h^*_i}(g_i) \le \epsilon$ for every $i = 1,\dots,n$.
• The total number of labeled examples used is $\tilde O\left(\frac{nk + \mathrm{VC}(\mathcal H)\,k^3}{\epsilon}\right)$.

6 Conclusion

In this work, we have shown sample complexity improvements with lifelong learning in the challenging, yet, as argued, important setting where tasks arrive in a stream (without assumptions on the task-generating process), where the learner is only allowed to maintain limited amounts of information from previously encountered tasks, and where high performance is required for every single task, rather than on average. While such improvements have been established in very specific settings [3], our work shows they are possible in much more general and realistic scenarios. We hope that this will open the door for more work in this area of machine lifelong learning and lead to a better understanding of how and when learning machines can benefit from past experience. An intriguing direction is to investigate whether there exists a more general characterization of ensemble methods and/or data distributions that would lead to benefits with lifelong learning. Another is to better understand lifelong learning with neural networks, analyzing cases of more complex network structures and activation functions, an area where current machine learning practice yields exciting successes, but little is understood.

Acknowledgments

This work was in parts funded by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement no 308036.

References
[1] J. Abernethy, P. Bartlett, and A. Rakhlin. Multitask learning with expert advice. In Workshop on Computational Learning Theory (COLT), 2007.
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning (ML), 2008.
[3] M.-F. Balcan, A. Blum, and S. Vempala. Efficient representations for lifelong learning and autoencoding. In Workshop on Computational Learning Theory (COLT), 2015.
[4] J. Baxter.
A model of inductive bias learning. Journal of Artificial Intelligence Research (JAIR), 12, 2000.
[5] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning (ML), 2010.
[6] S. Ben-David and R. Schuller. Exploiting task relatedness for multiple task learning. In Workshop on Computational Learning Theory (COLT), 2003.
[7] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
[8] K. Crammer and Y. Mansour. Learning multiple tasks using shared hypotheses. In Conference on Neural Information Processing Systems (NIPS), 2012.
[9] E. Eaton and P. L. Ruvolo. ELLA: An efficient lifelong learning algorithm. In International Conference on Machine Learning (ICML), 2013.
[10] A. Evgeniou and M. Pontil. Multi-task feature learning. In Conference on Neural Information Processing Systems (NIPS), 2007.
[11] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In International Conference on Machine Learning (ICML), 1996.
[12] P. Germain, A. Habrard, F. Laviolette, and E. Morvant. A new PAC-Bayesian perspective on domain adaptation. In International Conference on Machine Learning (ICML), 2016.
[13] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In International Conference on Very Large Data Bases (VLDB), 2004.
[14] J. Z. Kolter and M. A. Maloof. Dynamic weighted majority: An ensemble method for drifting concepts. Journal of Machine Learning Research (JMLR), 8:2755–2790, 2007.
[15] A. Kumar and H. Daumé III. Learning task grouping and overlap in multi-task learning. In International Conference on Machine Learning (ICML), 2012.
[16] I. Kuzborskij and F. Orabona. Stability and hypothesis transfer learning. In International Conference on Machine Learning (ICML), 2013.
[17] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Workshop on Computational Learning Theory (COLT), 2009.
[18] A. Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327–350, 2009.
[19] A. Maurer, M. Pontil, and B. Romera-Paredes. Sparse coding for multitask and transfer learning. In International Conference on Machine Learning (ICML), 2013.
[20] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[21] A. Pentina and S. Ben-David. Multi-task and lifelong learning of kernels. In Algorithmic Learning Theory (ALT), 2015.
[22] A. Pentina and C. H. Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning (ICML), 2014.
[23] M. Pontil and A. Maurer. Excess risk bounds for multitask learning with trace norm regularization. In Workshop on Computational Learning Theory (COLT), 2013.
[24] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014.
[25] S. Thrun and T. M. Mitchell. Lifelong robot learning. Robotics and Autonomous Systems, 15(1–2):25–46, 1995.
[26] S. Thrun and L. Pratt. Learning to Learn. Kluwer Academic Publishers, 1998.
[27] V. N. Vapnik and A. J. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.