{"title": "Online-Within-Online Meta-Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 13110, "page_last": 13120, "abstract": "We study the problem of learning a series of tasks in a fully online Meta-Learning\nsetting. The goal is to exploit similarities among the tasks to incrementally adapt\nan inner online algorithm in order to incur a low averaged cumulative error over\nthe tasks. We focus on a family of inner algorithms based on a parametrized\nvariant of online Mirror Descent. The inner algorithm is incrementally adapted\nby an online Mirror Descent meta-algorithm using the corresponding within-task\nminimum regularized empirical risk as the meta-loss. In order to keep the process\nfully online, we approximate the meta-subgradients by the online inner algorithm.\nAn upper bound on the approximation error allows us to derive a cumulative\nerror bound for the proposed method. Our analysis can also be converted to the\nstatistical setting by online-to-batch arguments. We instantiate two examples of the\nframework in which the meta-parameter is either a common bias vector or feature\nmap. Finally, preliminary numerical experiments confirm our theoretical findings.", "full_text": "Online-Within-Online Meta-Learning\n\nGiulia Denevi1,2, Dimitris Stamos3, Carlo Ciliberto3,4 and Massimiliano Pontil1,3\n\ngiulia.denevi@iit.it, c.ciliberto@imperial.ac.uk, {d.stamos.12,m.pontil}@ucl.ac.uk\n\n1Istituto Italiano di Tecnologia (Italy), 2University of Genoa (Italy),\n\n3University College of London (UK),4Imperial College of London (UK),\n\nAbstract\n\nWe study the problem of learning a series of tasks in a fully online Meta-Learning\nsetting. The goal is to exploit similarities among the tasks to incrementally adapt\nan inner online algorithm in order to incur a low averaged cumulative error over\nthe tasks. We focus on a family of inner algorithms based on a parametrized\nvariant of online Mirror Descent. 
The inner algorithm is incrementally adapted by an online Mirror Descent meta-algorithm using the corresponding within-task minimum regularized empirical risk as the meta-loss. In order to keep the process fully online, we approximate the meta-subgradients by the online inner algorithm. An upper bound on the approximation error allows us to derive a cumulative error bound for the proposed method. Our analysis can also be converted to the statistical setting by online-to-batch arguments. We instantiate two examples of the framework in which the meta-parameter is either a common bias vector or feature map. Finally, preliminary numerical experiments confirm our theoretical findings.\n\n1 Introduction\n\nHumans can quickly adapt knowledge gained when learning past tasks, in order to solve new tasks from just a handful of examples. In contrast, learning systems are still rather limited when it comes to transferring knowledge over a sequence of learning problems. Overcoming this limitation can have a broad impact in artificial intelligence, as it can save the expensive preparation of large training samples, often humanly annotated, needed by current machine learning methods. As a result, Meta-Learning is receiving increasing attention, from both applied [15, 32] and theoretical [5, 40, 17] perspectives. Until very recently, Meta-Learning was mainly studied in the batch statistical setting, where data are assumed to be independently sampled from some distribution and they are processed in one batch, see [6, 23, 24, 25, 26, 29]. Only recently has much interest arisen in investigating more efficient methods combining ideas from Online Learning and Meta-Learning, see [1, 12, 13, 30, 3, 21, 16, 8, 11]. 
In\nthis setting, which is sometimes referred to as Lifelong Learning, the tasks are observed sequentially\n\u2013 via corresponding sets of training examples \u2013 and the broad goal is to exploit similarities across the\ntasks to incrementally adapt an inner (within-task) algorithm to such a sequence. There are different\nways to deal with Meta-Learning in an online framework: the so-called Online-Within-Batch (OWB)\nframework, where the tasks are processed online but the data within each task are processed in one\nbatch, see [1, 12, 13, 16, 8, 3, 21], or the so-called Online-Within-Online (OWO) framework, where\ndata are processed sequentially both within and across the tasks, see [1, 3, 21, 16, 11]. Previous work\nmainly analyzed speci\ufb01c settings, see the technical discussion in App. A. The main goal of this work\nis to propose an OWO Meta-Learning approach that can be adapted to a broad family of algorithms.\nWe consider a general class of inner algorithms based on primal-dual Online Learning [37, 33, 38, 36,\n35]. In particular, we discuss in detail the case of online Mirror Descent on a regularized variant of the\nempirical risk. The regularizer belongs to a general family of strongly convex functions parametrized\nby a meta-parameter. The inner algorithm is adapted by a meta-algorithm, which also consists in\napplying online Mirror Descent on a meta-objective given by the within-task minimum regularized\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fempirical risk. The interplay between the meta-algorithm and the inner algorithm plays a key role\nin our analysis. The latter is used to compute a good approximation of the meta-subgradient which\nis supplied to the former. A key novelty of our analysis is to show that, exploiting a closed form\nexpression of the error on the meta-subgradients, we can automatically derive a cumulative error\nbound for the entire procedure. 
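To make the interplay concrete, the following minimal sketch instantiates the two nested online algorithms for the simplest case treated later in the paper, a bias vector shared across the tasks with squared-Euclidean regularization and the hinge loss. The function names, step sizes, and the use of the last inner iterate as a stand-in for the dual-based meta-subgradient are our own illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def owo_bias_sketch(tasks, lam=1.0, inner_lr=0.1, meta_lr=0.1):
    """Illustrative online-within-online loop with a shared bias vector.

    Hypothetical simplification: plain subgradient steps, hinge loss, and
    the last inner iterate standing in for the dual-based meta-subgradient."""
    d = tasks[0][0].shape[1]
    theta = np.zeros(d)                   # meta-parameter: shared bias
    total_err = 0.0
    for X, y in tasks:                    # tasks arrive sequentially
        w = theta.copy()                  # inner algorithm starts at the bias
        for x_i, y_i in zip(X, y):        # data arrive sequentially within the task
            margin = y_i * (x_i @ w)
            total_err += max(0.0, 1.0 - margin)   # hinge error suffered online
            # subgradient of hinge_i(<x_i, w>) + (lam/2)||w - theta||^2 at w
            g = (-y_i * x_i if margin < 1 else np.zeros(d)) + lam * (w - theta)
            w = w - inner_lr * g          # within-task online step
        meta_g = lam * (theta - w)        # crude approximate meta-subgradient
        theta = theta - meta_lr * meta_g  # across-task online step
    return theta, total_err
```

On tasks whose optimal weight vectors cluster around a common point, the bias drifts toward that point, so later tasks start from a better initialization, which is the transfer effect the paper quantifies.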
Our analysis also holds for more aggressive primal-dual online updates, and it can be adapted to the statistical setting by online-to-batch arguments.\nContributions. Our contribution is threefold. First, we derive an efficient and theoretically grounded OWO Meta-Learning framework which is inspired by Multi-Task Learning (MTL). Our framework applies to a wide class of within-task algorithms and tasks' relationships. Second, we establish how our analysis can be converted to the statistical setting. Finally, we show how our general analysis can be directly applied to two important families of inner algorithms in which the meta-parameter is either a bias vector or a feature map shared across the tasks.\nPaper organization. We start by introducing our OWO Meta-Learning setting in Sec. 2. In Sec. 3 we recall some background material from primal-dual Online Learning. In Sec. 4 we outline the proposed method and give a cumulative error bound for it. In Sec. 5 we show how the above analysis can be used to derive guarantees for our method in the statistical setting. In Sec. 6 we specialize our framework to two important examples in which the tasks share a common bias vector or feature map. Finally, in Sec. 7 we report preliminary experiments with our method and in Sec. 8 we draw conclusions. Technical proofs are postponed to the appendix.\n\n2 Setting\n\nIn this section we introduce the OWO Meta-Learning problem. We consider a learner facing a sequence of online tasks. Corresponding to each task, there is an input space X, an output space Y and a dataset Z = (z_i)_{i=1}^n = (x_i, y_i)_{i=1}^n ∈ (X × Y)^n, which is observed sequentially. Online Learning aims to design an algorithm that makes predictions through time from past information. More precisely, at each step i ∈ {1, . . . 
, n}: (a) a datapoint z_i = (x_i, y_i) is observed, (b) the algorithm outputs a label ŷ_i, (c) the learner incurs the error ℓ_i(ŷ_i), where ℓ_i(·) = ℓ(·, y_i) for a loss function ℓ. To simplify our presentation, throughout we let X ⊆ R^d, Y ⊆ R and we consider algorithms that perform linear predictions of the form ŷ_i = ⟨x_i, w_i⟩, where (w_i)_{i=1}^n is a sequence of weight vectors updated by the algorithm and ⟨·,·⟩ denotes the standard inner product in R^d. The goal is to bound the cumulative error of the algorithm, i.e. E_inner(Z) = Σ_{i=1}^n ℓ_i(⟨x_i, w_i⟩), with respect to (w.r.t.) the same quantity incurred by a vector ŵ ∈ R^d fixed in hindsight, i.e. Σ_{i=1}^n ℓ_i(⟨x_i, ŵ⟩).\nIn the OWO Meta-Learning setting, we have a family of inner online algorithms identified by a meta-parameter θ belonging to a prescribed set Θ and the goal is to adapt θ to a sequence of learning tasks, in an online fashion. Throughout this work, Θ will be a closed, convex, non-empty subset of a Euclidean space M. The broad goal is to "transfer information" gained when learning previous tasks, in order to help learning future tasks. For this purpose, we propose a Meta-Learning procedure, acting across the tasks, which modifies the inner algorithm one task after another. More precisely, we let T be the number of tasks and, for each task t ∈ {1, . . . , T}, we let Z_t = (x_{t,i}, y_{t,i})_{i=1}^n be the corresponding data sequence. At each time t: (a) the meta-learner incrementally receives a task dataset Z_t, (b) it runs the inner online algorithm with meta-parameter θ_t on Z_t, returning the predictor vectors (w_{θ_t,i})_{i=1}^n, (c) it incrementally incurs the errors ℓ_{t,i}(⟨x_{t,i}, w_{θ_t,i}⟩), where ℓ_{t,i}(·) = ℓ(·, y_{t,i}), (d) the meta-parameter (and, consequently, the inner algorithm) is updated to θ_{t+1}. 
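As a small concrete check of these definitions, the within-task cumulative error and its comparison against a vector fixed in hindsight can be computed directly; the absolute loss is chosen for illustration and the helper names are ours.

```python
import numpy as np

def cumulative_error(X, y, weights, loss=lambda p, t: abs(p - t)):
    """E_inner(Z): sum of the per-step losses of the online iterates w_i
    on the linear predictions <x_i, w_i> (absolute loss as an example)."""
    return sum(loss(x @ w, t) for x, t, w in zip(X, y, weights))

def regret(X, y, weights, w_hat):
    """Cumulative error of the online iterates minus that of a single
    comparator vector w_hat fixed in hindsight."""
    return cumulative_error(X, y, weights) - cumulative_error(X, y, [w_hat] * len(y))
```

A good inner algorithm keeps this regret sublinear in n, which is exactly the within-task quantity the bounds below control.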
Denoting by\nEinner(Zt,\u2713 t) the cumulative error of the inner algorithm with meta-parameter \u2713t on the dataset Zt,\nthe goal is to bound the error accumulated across the tasks, i.e.\n\ni=1\n\nw.r.t. the same quantity incurred by a sequence of tasks\u2019 vectors ( \u02c6wt)T\n\nt=1 \ufb01xed in hindsight, i.e.\n\nEinner(Zt,\u2713 t) =\n\n`t,i(hxt,i, w\u2713t,ii),\n\n(1)\n\nTXt=1\n\nnXi=1\n\nt=1 =\n\nTXt=1\n\nEmeta(Zt)T\ni=1 `t,i(hxt,i, \u02c6wti).\n\nPT\nt=1Pn\n\nThe setting we consider in the paper is inspired by previous work on Multi-Task Learning, such as\n[2, 10, 18]. To describe it, we use extended real-valued functions and, for any data sequence Z and\n1Throughout the paper we use the double subscript notation \u201ct,i\u201d, to denote the {outer, inner} task index.\n\n2\n\n\fmeta-parameter \u2713 2 \u21e5, we de\ufb01ne the within-task minimum regularized empirical risk\n`i(hxi, wi),\n\nLZ(\u2713) = min\n\nw2Rd RZ(w) + f (w, \u2713)\n\nRZ(w) =\n\n1\nn\n\nnXi=1\n\nwhere > 0 is a regularization parameter and f is an appropriate complexity term ensuring the\nexistence and the uniqueness of the above minimizer \u02c6w\u2713. Assuming the entire sequence (Zt)T\nt=1\navailable in hindsight, introducing the notation Lt = LZt, many MTL methods read as follows\n\n(2)\n\nmin\n\u27132M\n\nTXt=1\n\nLt(\u2713) + \u2318F (\u2713),\n\n(3)\n\nwhere \u2318> 0 is a meta-regularization parameter and F is an appropriate meta-regularizer ensuring\nthat the above minimum is attained. We stress that in our OWO Meta-Learning setting, the data\nare received sequentially, both within and across the tasks. The above formulation inspires us to\ntake a within-task online algorithm that mimics well the (batch) objective in Eq. (2) and to de\ufb01ne\nt=1. 
Obviously, in this setting,\nas meta-objectives for the online meta-algorithm the functions (Lt)T\nthe meta-objectives (and consequently their subgradients used by the meta-algorithm) are computed\nonly up to an approximation error, depending on the speci\ufb01c properties of the inner algorithm we are\nusing. We will show how to control and exploit this approximation error in the analysis.\nIn the sequel, for an Euclidean space V, we let 0(V) to be the set of proper, closed and convex\nfunctions over V and, for any f 2 0(V), we denote by Domf its domain (we refer to App. B and\n[31] for notions on convex analysis). In this work, we make the following standard assumptions in\nwhich we introduce two norms k\u00b7k \u2713 and |||\u00b7||| that will be speci\ufb01ed in two applications below.\nAssumption 1 (Loss and regularizer). Let `(\u00b7, y) be a convex and closed real-valued function for\nany y 2Y and let f 2 0(Rd \u21e5M ) be such that, for any \u2713 2 \u21e5, f (\u00b7,\u2713 ) is 1-strongly convex w.r.t. a\nnorm k\u00b7k \u2713 over Rd, inf w2Rd f (w, \u2713) = 0 and, for any \u2713/2 \u21e5, Domf (\u00b7,\u2713 ) = ;.\nAssumption 2 (Meta-regularizer). Let F be a closed and 1-strongly convex function w.r.t. a norm\n|||\u00b7||| over M such that inf \u27132M F (\u2713) = 0 and DomF =\u21e5 .\nNotice that the norm w.r.t. which the function f (\u00b7,\u2713 ) is assumed to be strongly convex may vary\nwith \u2713. Moreover, under Asm. 1, DomLZ =\u21e5 and, since LZ is de\ufb01ned as the partial minimum of a\nfunction in 0(Rd \u21e5M ), LZ 2 0(M). This property supports the choice of this function as the\nmeta-objective for our meta-algorithm. Finally, by Lemma 29 in App. B, Asm. 1 and Asm. 2 ensure\nthe existence and the uniqueness of the minimizers in Eq. (2) and Eq. (3).\nWe conclude this section by giving two examples included in the framework above. 
The \ufb01rst one\nis inspired by the MTL variance regularizer in [14], while the second example, which can be easily\nextended to more general MTL regularizers such as in [2, 10, 27, 28], relates to the MTL trace norm\nregularizer. As we will see in the following, in the \ufb01rst example the tasks\u2019 predictors are encouraged\nto stay close to a common bias vector, in the second example they are encouraged to lie in the range\nof a low-rank feature map. In order to describe these examples we require some additional notation.\nWe let k\u00b7k 2, k\u00b7k F , k\u00b7k Tr, k\u00b7k 1, be the Euclidean, Frobenius, trace, and operator norm, respectively.\nWe also let \u201c\u00b7\u2020\u201d be the pseudo-inverse, Tr(\u00b7) be the trace, Ran(\u00b7) be the range and Sd (resp. Sd\n+) be\nthe set of symmetric (resp. positive semi-de\ufb01nite) matrices in Rd\u21e5d. Finally, \u25c6S denotes the indicator\nfunction of the set S, taking value 0 when the argument belongs to S and +1 otherwise.\nExample 1 (Bias). We choose M =\u21e5= Rd, F (\u00b7) = 1\n2, satisfying Asm. 2 with |||\u00b7||| = k\u00b7k 2,\nand f (\u00b7,\u2713 ) = 1\nExample 2 (Feature Map). We choose M = Sd and \u21e5= S, where S = {\u2713 2 Sd\nFor a \ufb01xed \u27130 2S , we set F (\u00b7) = 1\nf (\u00b7,\u2713 ) = 1\nWe will return to these examples in Sec. 6, specializing our method and our analysis to these settings.\n\n+ : Tr(\u2713) \uf8ff 1}.\nF + \u25c6S(\u00b7), satisfying Asm. 2 with |||\u00b7||| = k\u00b7k F , and\n2h\u00b7,\u2713 \u2020\u00b7i + \u25c6Ran(\u2713)(\u00b7) + \u25c6S(\u2713), satisfying Asm. 1 with k\u00b7k \u2713 =ph\u00b7,\u2713 \u2020\u00b7i for any \u2713 2S .\n\n2, satisfying Asm. 
1 with k\u00b7k \u2713 = k\u00b7k 2 for every \u2713 2 Rd.\n\n2k\u00b7 \u27130k2\n\n2k\u00b7 \u2713k2\n\n2k\u00b7k 2\n\n3 Preliminaries: primal-dual Online Learning\n\nOur OWO Meta-Learning method consists in the application of two nested primal-dual online\nalgorithms, one operating within the tasks and another across the tasks. In particular, even though our\n\n3\n\n\fAlgorithm 1 Primal-dual online algorithm \u2013 online Mirror Descent\n\nm=1, (Am)M\n\nInput (gm)M\nm=1, (\u270fm)M\nInitialization \u21b51 = (), v1 = rr\u21e4(0) 2 Dom r\nFor m = 1 to M\n\nm=1, (cm)M\n\nm=1, r as described in the text\n\nReceive gm, Am, cm+1, \u270fm\nSuffer gm(Amvm) and compute \u21b50m 2 @\u270fm gm(Amvm)\nUpdate \u21b5m+1 = (\u21b5m,\u21b5 0m)\nDe\ufb01ne vm+1 = rr\u21e4 1/cm+1Pm\n\nj=1 A\u21e4j \u21b5m+1,j 2 Dom r\n\nm=1 , (vm)M +1\nm=1\n\nReturn (\u21b5m)M +1\n\nmXj=1\n\nanalysis holds also for more aggressive schemes, in this work, we consider online Mirror Descent\nalgorithm. In this section we brie\ufb02y recall some material from the primal-dual interpretation of this\nalgorithm that will be used in our subsequent analysis. The material of this section is an adaptation\nfrom [37, 33, 38, 36, 35]; we refer to App. C for a more detailed presentation.\nOnline Mirror Descent algorithm on a (primal) problem can be derived from the following primal-\ndual framework in which we introduce an appropriate dual algorithm. Speci\ufb01cally, at each iteration\nm 2{ 1, . . . , M}, we consider the following instantaneous primal optimization problem\n\n\u02c6Pm+1 = inf\nv2V\n\nPm+1(v)\n\nPm+1(v) =\n\ngj(Ajv) + cmr(v)\n\n(4)\n\nwhere V is an Euclidean space, cm > 0, r 2 0(V) is a 1-strongly convex function w.r.t. a norm\nk\u00b7k over V (with dual norm k\u00b7k \u21e4) such that inf v2V r(v) = 0, for any j 2{ 1, . . . , M}, letting Vj an\nEuclidean space, gj 2 0(Vj) and Aj : V!V j is a linear operator with adjoint A\u21e4j . As explained in\nApp. 
C, the corresponding dual problem is given by\n\ninf\n\n(5)\n\n1\ncm\n\nDm+1(\u21b5)\n\n\u02c6Dm+1 =\n\nmXj=1\n\nmXj=1\n\nDm+1(\u21b5) =\n\n\u21b52V1\u21e5\u00b7\u00b7\u00b7\u21e5Vm\n\ng\u21e4j (\u21b5j) + cmr\u21e4\u21e3\n\nA\u21e4j \u21b5j\u2318,\nwhere g\u21e4j and r\u21e4 are respectively the conjugate functions of gj and r. After this, we de\ufb01ne the dual\nscheme in which the dual variable \u21b5m+1 is updated by a greedy coordinate descent approach on the\ndual, setting \u21b5m+1 = (\u21b5m,\u21b5 0m), where \u21b50m 2 @\u270fmgm(Amvm) is an \u270fm-subgradient of gm at Amvm\nand vm is the current primal iteration. The primal variable is then updated from the dual one by a\nvariant of the Karush\u2013Kuhn\u2013Tucker (KKT) conditions, providing its belonging to Dom r, see Alg. 1.\nIn this paper, following [36], we refer to such a scheme as lazy online Mirror Descent. However, the\nterm linearized Follow-The-Regularized-Leader is historically more accurate. We recall also that\nsuch a scheme includes many well-known algorithms, when one properly speci\ufb01es the complexity\nterm r. The behavior of Alg. 1 is analyzed in the next result which will be a key tool for our analysis.\nTheorem 1 (Dual optimality gap for Alg. 1). Let (vm)M\nm=1 be the primal iterates returned by Alg. 1\nwhen applied to the generic problem in Eq. (4) and let Dual = DM +1(\u21b5M +1) \u02c6DM +1 be the\ncorresponding (non-negative) dual optimality gap at the last dual iterate \u21b5M +1 of the algorithm.\n\nMXm=1\n\nDual \uf8ff \n\n1. If, for any m 2{ 1, . . . , M}, cm+1 cm, then,\ngm(Amvm) + \u02c6PM +1 +\n\n2. If, for any m 2{ 1, . . . , M}, cm =Pm\n\nMXm=1\nMXm=1ngm(Amvm) + mr(vm)o + \u02c6PM +1 +\n\nDual \uf8ff \n\n1\n2\n\nj=1 j for some j > 0, then,\n1\n\n\u21e4\n\n1\n\n+\n\ncmA\u21e4m\u21b50m2\nMXm=1\n\nMXm=1\ncmA\u21e4m\u21b50m2\n\n1\n2\n\n\u270fm.\n\n+\n\n\u21e4\n\nMXm=1\n\n\u270fm.\n\nThe \ufb01rst (resp. second) inequality in Thm. 
1 links the dual optimality gap of the last dual iterate\ngenerated by Alg. 1, with the (resp. regularized) cumulative error of the corresponding primal iterates.\nNote that this result can be readily used to bound the cumulative error (resp. its regularized version)\nof Alg. 1 by the batch regularized comparative \u02c6PM +1 and additional terms. In the following section,\nwe will make use of the above theorem in order to analyze our OWO Meta-Learning method.\n\n4\n\n\fE reg\ninner(Z, \u2713) =\n\nnXi=1n`i(hxi, w\u2713,ii) + f (w\u2713,i,\u2713 )o,\ninner(Z, \u2713) nLZ(\u2713)\u2318 +\n\nnXi=1\n\n1\n2\n\n\u270f\u2713 = \u21e3E reg\n\nDual \uf8ff \u270f\u2713\n\n(6)\n\n.\n\n(7)\n\n1\n\nixis0\u2713,i2\n\n\u2713,\u21e4\n\ni=1\n\nAlgorithm 2 Within-task algorithm\nInput > 0, \u2713 2 \u21e5, Z = (zi)n\nInitialization s\u2713,1 = (), w\u2713,1 = rf (\u00b7,\u2713 )\u21e4(0)\nFor i = 1 to n\nReceive the datapoint zi = (xi, yi)\nCompute s0\u2713,i 2 @`i(hxi, w\u2713,ii) \u2713 R\nDe\ufb01ne (s\u2713,i+1)i = s0\u2713,i, i = (i + 1)\nUpdate w\u2713,i+1=rf (\u00b7,\u2713 )\u21e41/iPi\n\nj=1 xjs0\u2713,j\n\nReturn (w\u2713,i)n+1\n\ni=1 , \u00afw\u2713 =\n\nw\u2713,i, s\u2713,n+1\n\n1\nn\n\nnXi=1\n\n4 Method\n\nAlgorithm 3 Meta-algorithm\n\nt=1\n\nInput \u2318> 0, (Zt)T\nInitialization \u27131 = rF \u21e4(0)\nFor t = 1 to T\nReceive incrementally the dataset Zt\nRun Alg. 2 with \u2713t over Zt\nCompute s\u2713t,n+1\nCompute r0\u2713t as in Prop. 3 using s\u2713t,n+1\nUpdate \u2713t+1 = rF \u21e4 1/\u2318Pt\nj=1 r0\u2713j\n\nReturn (\u2713t)T +1\n\nt=1 , \u00af\u2713 =\n\n\u2713t\n\n1\nT\n\nTXt=1\n\nIn this section we present the proposed OWO Meta-Learning method and we establish a (regularized)\ncumulative error bound for it. As anticipated in Sec. 2, the method consists in the application of\nAlg. 1 both to the (non-normalized) within-task problem in Eq. (2) and to the across-tasks problem in\nEq. (3), corresponding, as we will show in the following, to Alg. 2 and Alg. 
3, respectively. In order\nto analyze our method, we start from studying the behavior of the inner Alg. 2.\nProposition 2 (Dual optimality gap for the inner Alg. 2). Let Asm. 1 hold. Then, Alg. 2 coincides with\nAlg. 1 applied to the non-normalized within-task problem in Eq. (2). As a consequence, introducing\nthe regularized cumulative error of the iterates generated by Alg. 2,\n\nwhere w\u2713,i 2 Domf (\u00b7,\u2713 ) for any i 2{ 1, . . . , n}, the following upper bound for the associated dual\noptimality gap Dual introduced in Thm. 1 holds\n\nProof. The inner Alg. 2 coincides with Alg. 1 applied to the non-normalized within-task problem in\nEq. (2), once one makes the identi\ufb01cations \u21b50m s0\u2713,i for the (exact) subgradients and realizes that\nthe non-normalized within-task problem in Eq. (2) is of the form in Eq. (4) with\nm M, j i, M n, v w, V Rd, gj `i, Aj x>i , cm n, r(\u00b7) f (\u00b7,\u2713 ).\nNow, the bound in the statement directly derives from the second point of Thm. 1.\n\nSince Dual 0, by moving the terms and normalizing by the number of points n, the above\nare bounded, for an appropriate choice of , the inner\nresult tells us that, when the terms kxis0\u2713,ik2\n\u2713,\u21e4\nalgorithm attempts to mimic the function LZ in Eq. (2), as the number of points n increases. The\nmethod we propose in this work relies on the application of Alg. 1 also to the meta-problem in\nEq. (3) as the tasks are sequentially observed, using the functions (Lt)T\nt=1 as meta-objectives. A key\ndif\ufb01culty here is that the meta-objective is de\ufb01ned via the inner batch problem in Eq. (2), hence it is\nnot available exactly but it is only approximately approached by the within-task online algorithm.\nFrom a practical point of view, this means that in this case, differently from the inner algorithm, the\nresulting meta-algorithm has to deal with an error on the meta-subgradients at each iteration. 
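For the bias example (Ex. 1), this construction admits a compact sketch: the within-task lazy subgradient scheme stores the scalar dual variables s'_i, and their weighted sum both drives the primal iterates and yields the approximate meta-subgradient. The step-size and sign conventions below are our reading of the extracted description of Alg. 2, so treat this as a sketch rather than the verbatim algorithm.

```python
import numpy as np

def inner_biased_osd(X, y, theta, lam):
    """Sketch of the within-task algorithm for the shared-bias example:
    lazy online subgradient descent on (1/n) sum_i hinge_i(<x_i, w>)
    + (lam/2)||w - theta||^2.  Returns the online iterates, the hinge
    cumulative error, and an approximate meta-subgradient built from
    the stored dual scalars."""
    n, d = X.shape
    w = theta.copy()
    dual_sum = np.zeros(d)            # running sum of s_i * x_i
    iterates, err = [], 0.0
    for i, (x_i, y_i) in enumerate(zip(X, y), start=1):
        iterates.append(w.copy())
        margin = y_i * (x_i @ w)
        err += max(0.0, 1.0 - margin)
        s_i = -y_i if margin < 1 else 0.0        # hinge subgradient scalar
        dual_sum += s_i * x_i
        w = theta - dual_sum / (lam * (i + 1))   # lazy (FTRL-style) update
    meta_subgrad = dual_sum / n       # epsilon-subgradient of L_Z at theta
    return iterates, err, meta_subgrad
```

The point of the sketch is that no extra work is needed to obtain the meta-subgradient: the same dual quantities the inner algorithm already maintains are re-used, keeping the whole procedure fully online.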
Our\nnext result describes how, leveraging on the dual optimality gap for the inner Alg. 2, we can compute\nan \u270f-subgradient of the meta-objective, where \u270f is (up to normalization) the value stated in Prop. 2.\nThis will allow us to develop an ef\ufb01cient method which is computationally appealing and fully online.\nProposition 3 (Computation of an \u270f-subgradient of LZ). Let Asm. 1 hold and let s\u2713,n+1 be the output\nof Alg. 2 with \u2713 2 \u21e5 over the dataset Z. Let r\u2713 2 @{Dn+1(s\u2713,n+1,\u00b7)}(\u2713), where\ns 2 Rn\n\nDn+1(s, \u2713) =\n\n1\nn\n\n(8)\n\n`\u21e4i (si) + nf\u21e4(\u00b7,\u2713 )\u21e3\n\nxisi\u2318\n\nnXi=1\n\nnXi=1\n\n5\n\n\fis the dual of the non-normalized Eq. (2). Then, r0\u2713 = r\u2713/n 2 @\u270f\u2713/nLZ(\u2713), with \u270f\u2713 as in Prop. 2.\nThe proof of the above statement is reported in App. D. It is based on rewriting the meta-objective\nas LZ(\u2713) = 1/n maxs2Rn{Dn+1(s, \u2713)} (by strong duality, see Lemma 34 in App. D) and it\nessentially exploits Prop. 2, according to which, the last dual iteration s\u2713,n+1 returned by Alg. 2 is an\n\u270f\u2713-maximizer of the dual objective Dn+1(\u00b7,\u2713 ). We remark that the procedure described above to\ncompute an \u270f-subgradient has been already used in our work [11] for the statistical setting in Ex. 1.\nHere, with a different proof technique, we show that it can be extended also to more general inner\nregularizers. Leveraging on the form of the error on the meta-subgradients in Prop. 3, we now show\nhow we can automatically deduce a (regularized) cumulative error bound for the entire procedure.\nTheorem 4 (Cumulative error bound). Let Asm. 1 and Asm. 2 hold. Then, Alg. 3 coincides with\nAlg. 1 applied to the outer-tasks problem in Eq. (3). As a consequence, introducing the regularized\ncumulative error for the iterates generated by the combination of Alg. 2 and Alg. 
3,\n\nTXt=1\nmeta(Zt)T\nt=1 =\nE reg\nt=1 \uf8ff nT 1\n\nT\n\nE reg\ninner(Zt,\u2713 t) =\n\nTXt=1\n\nnXi=1n`t,i(hxt,i, w\u2713t,ii) + f (w\u2713t,i,\u2713 t)o,\n\n(9)\n\nwhere \u2713t 2 \u21e5 for any t 2{ 1, . . . , T}, for any sequence of vectors ( \u02c6wt)T\nt=1 in Rd and any \u2713 2 \u21e5\nsuch that f ( \u02c6wt,\u2713 ) < +1 for any t 2{ 1, . . . , T}, the following upper bound holds\nTXt=1\nnXi=1\nmeta(Zt)T\n1\nE reg\n\u21e4!.\nTXt=1r0\u2713t2\n\nProof. The meta-algorithm in Alg. 3 coincides with Alg. 1 applied to the outer-tasks problem in\nEq. (3), once one makes the identi\ufb01cations \u21b50m r0\u2713t for the (approximated) subgradients and\nrealizes that the outer-tasks problem in Eq. (3) is of the form in Eq. (4) with\n\nixt,is0\u2713t,i2\n\nRZt( \u02c6wt) +\n\nTXt=1\n\nTXt=1\n\nf ( \u02c6wt,\u2713 ) +\n\n1\n2\u2318T\n\n\u2318F (\u2713)\n\n2nT\n\n\u2713t,\u21e4\n\n\nT\n\n1\n\n+\n\nT\n\n+\n\nm M, j t, M T, v \u2713, V \u21e5, gj Lt, Aj I, cm \u2318, r F.\nAs a consequence, denoting by Dual the associated dual optimality gap introduced in Thm. 1,\nspecializing the \ufb01rst point of Thm. 1 to this setting and exploiting the fact Dual 0, we get\n\n\u270f\u2713t.\n\n(10)\n\n0 \uf8ff \n\nLt(\u2713t) + min\n\nTXt=1\n\n\u27132\u21e5n TXt=1\n\nLt(\u2713) + \u2318F (\u2713)o +\n\n1\n2\u2318\n\nTXt=1r0\u2713t2\n\n\u21e4\n\n+\n\n1\nn\n\nTXt=1\n\nSubstituting the closed form of \u270f\u2713t in Prop. 2 (applied to the task t) into Eq. (10), one immediately\nt=1 Lt(\u2713t) erases. The desired statement then directly follows by rearranging\n\nobserves that the termPT\nthe remaining terms, using the de\ufb01nition of (Lt)T\nWhen the inputs are bounded and both the inner loss and meta-objective are Lipschitz w.r.t. the\nassociated norms (as we will see for Ex. 1), the termsr0\u2713t2\ncan be upper\nbounded by a constant. In this case, for an appropriate choice of and \u2318, we recover a reasonable\nrate \u02dcO(1/pn) + O(1/pT ). 
However, when the bounds onr0\u2713t2\nhide a dependency w.r.t. or n\n\n\u21e4\n(as we will see for Ex. 2), the bound must be accordingly analyzed.\n\nandxt,is0\u2713t,i2\n\nt=1 and multiplying by the number of points n.\n\n\u2713t,\u21e4\n\n\u21e4\n\n5 Adaptation to the statistical setting\n\nIn this section we present guarantees for our method in the statistical setting. Following the framework\noutlined in [6, 23, 26] we assume that, for any t 2{ 1, . . . , T}, the within-task dataset Zt is an\nindependently identically distributed (i.i.d.) sample from a distribution (task) \u00b5t, and in turn the\nt=1 are an i.i.d. sample from a meta-distribution \u21e2. The estimator we consider here is\ntasks (\u00b5t)T\ni=1 w\u2713,i, the average of the iterates resulting from applying Alg. 2 to a test dataset Z\n\u00afw\u00af\u2713 = 1\nwith meta-parameter \u00af\u2713 = 1\nt=1 \u2713t, the average of the meta-parameters returned by our online\nmeta-algorithm in Alg. 3 applied to the training datasets (Zt)T\nt=1. We wish to study the performance\nof such an estimator in expectation w.r.t. the tasks sampled from the environment \u21e2.\n\nT PT\n\nnPn\n\n6\n\n\fFormally, for any \u00b5 \u21e0 \u21e2, we require that the corresponding true risk R\u00b5(w) = E(x,y)\u21e0\u00b5`(hx, wi, y)\nadmits minimizers over the entire space Rd and we denote by w\u00b5 the minimum norm one. With these\ningredients, we introduce the oracle E\u21e2 = E\u00b5\u21e0\u21e2 R\u00b5(w\u00b5), representing the expected minimum error\nover the environment of tasks, and, introducing the transfer risk of the estimator \u00afw\u00af\u2713:\n(11)\n\nEstat( \u00afw\u00af\u2713) = E\u00b5\u21e0\u21e2 EZ\u21e0\u00b5n R\u00b5( \u00afw\u00af\u2713(Z)),\n\nwe give a bound on it w.r.t. the oracle E\u21e2. This is described in the following theorem.\nTheorem 5 (Transfer risk bound). Let the same assumptions in Thm. 4 hold in the i.i.d. statistical\nsetting. 
Then, introducing the regularized transfer risk of the average \u00afw\u00af\u2713 of the iterates resulting\nfrom the combination of Alg. 2 and Alg. 3,\n\nfor any \u2713 2 \u21e5 such that E\u00b5\u21e0\u21e2f (w\u00b5,\u2713 ) < +1, the following upper bound holds in expectation w.r.t.\nthe sampling of the datasets (Zt)T\n\nE E reg\n\nstat( \u00afw\u00af\u2713) \uf8ffE \u21e2 + E\u00b5\u21e0\u21e2f (w\u00b5,\u2713 ) +\n\nstat( \u00afw\u00af\u2713) = E\u00b5\u21e0\u21e2 EZ\u21e0\u00b5nhR\u00b5( \u00afw\u00af\u2713(Z)) + f ( \u00afw\u00af\u2713(Z), \u00af\u2713)i,\nE reg\nixt,is0\u2713t,i2\n\n1\n2nT E\n\nnXi=1\n\n\u2713t,\u21e4\n\nt=1\n\n1\n\n\u2318F (\u2713)\n\n+\n\nT\n\n+\n\n1\n2\u2318T E\n\nTXt=1\nTXt=1r0\u2713t2\n\n\u21e4\n\n+ E E\u00b5\u21e0\u21e2 EZ\u21e0\u00b5n\n\n1\n\n2n\n\nnXi=1\n\n1\n\nixis0\u00af\u2713,i2\n\n.\n\n\u00af\u2713,\u21e4\n\nThe proof of the statement above is reported in App. E. It exploits the regularized cumulative error\nbound given in Thm. 4 for our Meta-Learning procedure and two nested online-to-batch conversion\nsteps [9, 22], one within-task and one across-tasks. The bound above is composed by the expectation\nof the terms comparing in Thm. 4 plus an additional term. Such a term comes out from the online-\nto-batch conversion and, as we will see in the sequel, it does not affect the general behavior of the\nbound. Finally, we observe that, differently from [1, Thm. 6.1] and [3, Thm. 3.3], the theorem\nabove holds for the average of the meta-parameters (\u2713t)T\nt=1 returned by our meta-algorithm (not\nfor a meta-parameter randomly sampled from the pool) and, consequently, it does not require their\nmemorization or the introduction of additional randomization to the process. In the following section\nwe will show that specializing Thm. 4 and Thm. 5 to Ex. 1 and Ex. 2, we will get meaningful bounds.\n\n6 Examples\n\n1\n\n1\n\ni=1\n\nt=1\n\nT PT\n\nT PT\n\nt=1 Ct and \u02c6Ctot = 1\n\nIn this section we specify our framework to Ex. 
1 and Ex. 2 outlined at the end of Sec. 2. In order\nto do this, we require the following assumption, which is for instance satis\ufb01ed by the absolute loss\n\ni=1 xt,ix>t,i, \u02c6Ct =\n\u02c6Ct. We also use the notation kCtotk1,a =\n\n`(\u02c6y, y) =\u02c6y y and the hinge loss `(\u02c6y, y) = max0, 1 y\u02c6y , where y, \u02c6y 2Y .\nAssumption 3 (Lipschitz Loss). Let `(\u00b7, y) be L-Lipschitz for any y 2Y .\nnPn\nBelow, for any task t 2{ 1, . . . , T}, we let the input covariance matrices Ct = 1\nPn\nT PT\n\ni xt,ix>t,i, Ctot = 1\nt=1 kCtka\n1\n\nwith a = 1, 2 and, in the statistical setting, we let C\u21e2 = E\u00b5\u21e0\u21e2 E(x,y)\u21e0\u00b5xx>.\n\nBias. In App. G we report the adaptation of our method in Alg. 2 and Alg. 3 (cf. Alg. 5 and Alg. 6)\nand we specify Thm. 4 and Thm. 5 (cf. Cor. 40 and Cor. 42) to Ex. 1. In such a case, the resulting\ninner algorithm coincides with online Subgradient Descent on the regularized empirical risk and,\nsimilarly, the resulting meta-algorithm coincides with online Subgradient Descent (with approximated\nt=1. We thus recover the method in [11] with a slightly\nsubgradients) on the meta-objectives (Lt)T\ndifferent choice of the inner algorithm step size. Our results (see App. G.4.2) are in line with [11],\nwhere we present the same bound in Cor. 42 with slightly worse constants.\nFeature map. In App. H.1 we report the adaptation of our method in Alg. 2 and Alg. 3 (cf. Alg. 7 and\nAlg. 8) to Ex. 2. In this case, the resulting inner algorithm coincides with a pre-conditioned variant\nof online Subgradient Descent on the regularized empirical risk and the resulting meta-algorithm\ncoincides with a lazy variant of online Subgradient Descent (with approximate subgradients) on the\nt=1, projected on the set S. 
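The projection onto S = {θ PSD : Tr(θ) ≤ 1} required by this lazy projected variant is a standard eigenvalue computation; a sketch follows, with the radius left as a parameter and the helper name ours.

```python
import numpy as np

def project_psd_trace_ball(theta, radius=1.0):
    """Frobenius projection of a symmetric matrix onto
    S = { theta PSD : Tr(theta) <= radius }.  The projection acts on the
    eigenvalues: clip the negative ones, and if the trace is still too
    large, project the eigenvalues onto the simplex of size `radius`."""
    vals, vecs = np.linalg.eigh((theta + theta.T) / 2)
    vals = np.clip(vals, 0.0, None)
    if vals.sum() > radius:
        # standard sorted simplex-projection step on the eigenvalues
        u = np.sort(vals)[::-1]
        css = np.cumsum(u) - radius
        k = np.nonzero(u - css / np.arange(1, len(u) + 1) > 0)[0][-1]
        vals = np.clip(vals - css[k] / (k + 1), 0.0, None)
    return (vecs * vals) @ vecs.T
```

Because only the eigenvalues move, the projection preserves the eigenvectors of the meta-iterate, which is what lets the meta-algorithm learn a low-rank shared feature map.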
The meta-algorithm we retrieve is a slightly different version of the algorithm we propose in [12] for an OWB statistical framework.
Our next result specializes the cumulative error bound in Thm. 4 to Ex. 2. The proof is in App. H.2.

Corollary 6 (Cumulative error bound, feature map). Let Asm. 3 hold, consider the setting in Thm. 4 applied to Ex. 2 and let $\hat C^{\rm tot}_{\theta_{1:T}} = \frac{1}{T}\sum_{t=1}^T \theta_t\,\hat C_t$. Then, for any sequence of vectors $(\hat w_t)_{t=1}^T$ in $\mathbb{R}^d$, introducing $\hat B = \frac{1}{T}\sum_{t=1}^T \hat w_t\hat w_t^\top$, for any $\theta\in\mathcal{S}$ such that ${\rm Ran}(\hat B)\subseteq{\rm Ran}(\theta)$, the following bound holds for our method with an appropriate choice of hyper-parameters:
$$\mathcal{E}^{\rm reg}_{\rm meta}\big((Z_t)_{t=1}^T\big) \le nT\Bigg(\frac{1}{T}\sum_{t=1}^T \mathcal{R}_{Z_t}(\hat w_t) + L\sqrt{{\rm Tr}\big(\theta^\dagger\hat B\big)\bigg(\frac{{\rm Tr}\big(\hat C^{\rm tot}_{\theta_{1:T}}\big)}{n} + \|\theta-\theta_0\|_F\sqrt{\frac{\|C_{\rm tot}\|_{1,2}}{T}}\bigg)}\Bigg).$$

The next result specializes the transfer risk bound in Thm. 5 to Ex. 2. The proof is in App. H.3.

Corollary 7 (Transfer risk bound, feature map). Let Asm. 3 hold and consider the setting in Thm. 5 applied to Ex. 2. Then, in expectation w.r.t. the sampling of the datasets $(Z_t)_{t=1}^T$, introducing $B_\rho = \mathbb{E}_{\mu\sim\rho}\, w_\mu w_\mu^\top$, for any $\theta\in\mathcal{S}$ such that ${\rm Ran}(B_\rho)\subseteq{\rm Ran}(\theta)$, the following bound holds for our method with an appropriate choice of hyper-parameters:
$$\mathbb{E}\,\mathcal{E}^{\rm reg}_{\rm stat}(\bar w_{\bar\theta}) \le \mathcal{E}_\rho + L\sqrt{{\rm Tr}\big(\theta^\dagger B_\rho\big)\bigg(\frac{\big(2\log(n)+1\big)\,{\rm Tr}\big(\mathbb{E}[\bar\theta\, C_\rho]\big)}{n} + \|\theta-\theta_0\|_F\sqrt{\frac{\mathbb{E}\,\|C_{\rm tot}\|_{1,2}}{T}}\bigg)}.$$

We now analyze the statistical setting. Following [12, 26, 25] we study whether, as the number of tasks grows, our method mimics the performance of the inner algorithm with the best feature map in hindsight (oracle, see App. H.4.1) for any task. We note that, once the meta-parameter $\theta$ in the statement (hence, the hyper-parameters) is fixed in an appropriate way, the above bound in Cor.
7 becomes comparable to the bound for the best feature map in hindsight; see the discussion in App. H.4.2. Hence, we recover the same conclusion: there is an advantage in using the feature map found by our Meta-Learning method w.r.t. solving each task independently when $\|C_\rho\|_\infty$ is small (the inputs are high-dimensional, for instance) and $B_\rho$ is low-rank (the tasks share a low-dimensional representation). In addition, note that the bound in Cor. 7 converges to the oracle, as the number of tasks grows, at a rate of $O(T^{-1/4})$, whereas the corresponding bounds for the bias example (cf. Cor. 40 and Cor. 42 in App. G) yield the faster $O(T^{-1/2})$ rate, suggesting that feature learning is a more difficult problem than bias learning. Regarding the non-statistical setting, the bound in Cor. 6 is less clear to interpret because of the presence of the modified version of the inputs' covariance matrix $\hat C^{\rm tot}_{\theta_{1:T}}$. Future work may be devoted to investigating this point, which could be either an artifact of our analysis or due to some intrinsic characteristic of the problem we are considering.

7 Experiments

We present preliminary experiments with our OWO Meta-Learning method (ONL-ONL)² in the statistical setting of Ex. 2. In all experiments, the hyper-parameters and the step size $\eta$ were chosen by a meta-validation procedure (see App. I for more details) and we fixed $\theta_0 = I/d$ for the meta-algorithm in Alg. 8. We compared ONL-ONL to the modified batch-online (BAT-ONL) variant, in which the meta-subgradients in the meta-training phase are computed with higher accuracy by a convex solver (such as CVX); to Independent-Task Learning (ITL), i.e. running the inner Alg. 7 with the feature map $\theta = I/d$ for each task; and, in the synthetic data experiment, to the Oracle, i.e. running the inner Alg. 7 with the best feature map in hindsight for each task, see App. H.4.1.

Synthetic data.
We considered the regression setting with the absolute loss function. We generated $T_{\rm tot} = 3600$ tasks. For each task, the corresponding dataset $(x_i, y_i)_{i=1}^{n_{\rm tot}}$ of $n_{\rm tot} = 80$ points was generated according to the linear equation $y = \langle x, w_\mu\rangle + \epsilon$, with $x$ sampled uniformly on the unit sphere in $\mathbb{R}^d$ with $d = 20$ and $\epsilon$ sampled from a Gaussian distribution, $\epsilon \sim \mathcal{G}(0, 0.2)$. The tasks' predictors $w_\mu$ were generated as $w_\mu = P\tilde w_\mu$, with the components of $\tilde w_\mu \in \mathbb{R}^{d/5}$ sampled from $\mathcal{G}(0,1)$ and $\tilde w_\mu$ then normalized to have unit norm, and with $P \in \mathbb{R}^{d\times d/5}$ a matrix with orthonormal columns. In this setting, the operator norm of the inputs' covariance matrix $C_\rho$ is small (equal to $1/d$) and the weight vectors' covariance matrix $B_\rho$ is low-rank, a favorable setting for our method according to Cor. 7. Looking at the results in Fig. 1 (Left), we can state that, in this setting, our method outperforms ITL and tends to the Oracle as the number of training tasks increases. Moreover, the performance of ONL-ONL and BAT-ONL is comparable, suggesting that our approximation of the meta-subgradients is an effective way to keep the entire process fully online.

²The code is available at https://github.com/dstamos/Adversarial-LTL

Figure 1: Synthetic data (Left) and Movielens-100k dataset (Right). Performance of different methods as the number of training tasks increases. The results are averaged over 10 runs/splits of the data.

Figure 2: Mini-Wiki dataset (Left) and Jester-1 dataset (Right). Performance of different methods as the number of training tasks increases. The results are averaged over 10 splits of the data.

Real data.
We further validated the proposed method on three real datasets: 1) the Movielens-100k dataset³, containing the ratings given by different users to different movies; 2) the Mini-Wiki dataset from [3], containing sentences from Wikipedia pages; and 3) the Jester-1 dataset⁴, containing the ratings given by different users to different jokes. For the Movielens-100k and the Jester-1 datasets we considered each user as a task and each movie/joke as a point. Specifically, we cast each task as a regression problem where the labels are the ratings of the users and the raw features are simply the index of the movie/joke (i.e. a matrix completion setting where the input dimension $d$ coincides with the number of points). For the Mini-Wiki dataset we cast each task as a multi-class classification problem where the labels are the Wikipedia pages and the features are vectors of dimension $d = 50$. After processing the data, we ended up with a total number of $T_{\rm tot} = 939, 813, 5700$ tasks and $n_{\rm tot} = 939, 128, 100$ points per task for the Movielens-100k, the Mini-Wiki and the Jester-1 datasets, respectively. In the above formulation of the problem for the Movielens-100k and the Jester-1 datasets, it is possible to show that the ITL algorithm is unable to predict any rating for the films/jokes with no observed ratings. For this reason, in order to evaluate the performance of the Meta-Learning methods ONL-ONL and BAT-ONL, we introduced a more competitive baseline for this particular formulation of the problem which, for the films/jokes without any observed rating, predicts the average of the ratings of all the observed users at the end of the entire sequence of tasks. We denote this method as BAT. In Fig. 1 (Right) and Fig. 2 we report the performance of the methods using the absolute loss for the Movielens-100k and the Jester-1 datasets and the multi-class hinge loss for the Mini-Wiki dataset.
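As an aside on the synthetic benchmark described at the beginning of this section, the task-generation process can be sketched as follows. Dimensions and noise level follow the text; since the text does not specify whether $\mathcal{G}(0, 0.2)$ denotes the standard deviation or the variance, the sketch assumes the standard deviation, and the function name is illustrative.

```python
import numpy as np

def sample_synthetic_task(P, n=80, noise_std=0.2, rng=None):
    """One synthetic regression task: inputs uniform on the unit sphere,
    labels y = <x, w_mu> + eps, with the task predictor w_mu = P @ w_tilde
    confined to the low-dimensional column space of P."""
    if rng is None:
        rng = np.random.default_rng()
    d, k = P.shape
    w_tilde = rng.standard_normal(k)
    w_tilde /= np.linalg.norm(w_tilde)              # unit-norm low-dimensional predictor
    w_mu = P @ w_tilde
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # normalized Gaussians: uniform on the sphere
    y = x @ w_mu + noise_std * rng.standard_normal(n)
    return x, y, w_mu

# Shared representation: d = 20 inputs, tasks span a d/5 = 4 dimensional subspace.
rng = np.random.default_rng(0)
P, _ = np.linalg.qr(rng.standard_normal((20, 4)))   # matrix with orthonormal columns
tasks = [sample_synthetic_task(P, rng=rng) for _ in range(5)]
```

Because every $w_\mu$ lies in the 4-dimensional column space of $P$, the covariance $B_\rho$ is low-rank while $\|C_\rho\|_\infty = 1/d$ is small, the regime Cor. 7 identifies as favorable.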
The results we obtained are consistent with the synthetic experiments above, showing the effectiveness of our method also in real-life scenarios. We also note that the online Meta-Learning methods outperform the BAT method as the number of training tasks increases.

8 Conclusion

We presented a fully online Meta-Learning method stemming from primal-dual Online Learning. Our method can be adapted to a wide class of learning algorithms and it covers various types of task relatedness. By means of a new analysis technique, we derived a cumulative error bound for our method, from which it is also possible to obtain guarantees in the statistical setting. We illustrated our framework with two important examples, bias and feature learning, improving upon state-of-the-art results. To conclude, we believe that the generality of our framework and of our method of proof could be a valuable starting point for future theoretical investigations of Meta-Learning.

³https://grouplens.org/datasets/movielens/
⁴http://goldberg.berkeley.edu/jester-data/

Acknowledgments
This work was supported in part by EPSRC Grant N. EP/P009069/1.

References
[1] P. Alquier, T. T. Mai, and M. Pontil. Regret bounds for lifelong learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 261–269, 2017.

[2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

[3] M.-F. Balcan, M. Khodak, and A. Talwalkar. Provable guarantees for gradient-based meta-learning. In International Conference on Machine Learning, pages 424–433, 2019.

[4] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces, volume 408. Springer, 2011.

[5] J. Baxter. Theoretical models of learning to learn. In Learning to Learn, pages 71–94.
1998.

[6] J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

[7] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[8] B. Bullins, E. Hazan, A. Kalai, and R. Livni. Generalize across tasks: Efficient algorithms for linear representation learning. In Algorithmic Learning Theory, pages 235–246, 2019.

[9] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

[10] C. Ciliberto, Y. Mroueh, T. Poggio, and L. Rosasco. Convex learning of multiple tasks and their structure. In International Conference on Machine Learning, pages 1548–1557, 2015.

[11] G. Denevi, C. Ciliberto, R. Grazzi, and M. Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In International Conference on Machine Learning, pages 1566–1575, 2019.

[12] G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Incremental learning-to-learn with statistical guarantees. In Proc. 34th Conference on Uncertainty in Artificial Intelligence (UAI), 2018.

[13] G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Learning to learn around a common mean. In Advances in Neural Information Processing Systems, pages 10190–10200, 2018.

[14] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.

[15] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135. PMLR, 2017.

[16] C. Finn, A. Rajeswaran, S. Kakade, and S. Levine. Online meta-learning. In International Conference on Machine Learning, pages 1920–1930, 2019.

[17] R.
Gupta and T. Roughgarden. A PAC approach to application-specific algorithm selection. SIAM Journal on Computing, 46(3):992–1017, 2017.

[18] L. Jacob, J.-P. Vert, and F. R. Bach. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems, pages 745–752, 2009.

[19] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms: Advanced Theory and Bundle Methods. Springer, 2010.

[20] S. Kakade, S. Shalev-Shwartz, and A. Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Unpublished manuscript, http://ttic.uchicago.edu/shai/papers/KakadeShalevTewari09.pdf, 2009.

[21] M. Khodak, M. Florina-Balcan, and A. Talwalkar. Adaptive gradient-based meta-learning methods. arXiv preprint arXiv:1906.02717, 2019.

[22] N. Littlestone. From on-line to batch learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory, pages 269–284, 1989.

[23] A. Maurer. Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6:967–994, 2005.

[24] A. Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327–350, 2009.

[25] A. Maurer, M. Pontil, and B. Romera-Paredes. Sparse coding for multitask and transfer learning. In International Conference on Machine Learning, 2013.

[26] A. Maurer, M. Pontil, and B. Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884, 2016.

[27] A. M. McDonald, M. Pontil, and D. Stamos. New perspectives on k-support and cluster norms. Journal of Machine Learning Research, 17(155):1–38, 2016.

[28] C. A. Micchelli, J. M. Morales, and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38(3):455–489, 2013.

[29] A. Pentina and C. Lampert.
A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pages 991–999, 2014.

[30] A. Pentina and R. Urner. Lifelong learning with weighted majority votes. In Advances in Neural Information Processing Systems, pages 3612–3620, 2016.

[31] J. Peypouquet. Convex Optimization in Normed Spaces: Theory, Methods and Examples. Springer, 2015.

[32] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In 5th International Conference on Learning Representations, 2017.

[33] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.

[34] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[35] S. Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

[36] S. Shalev-Shwartz and S. M. Kakade. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems, pages 1457–1464, 2009.

[37] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. In Advances in Neural Information Processing Systems, pages 1265–1272, 2007.

[38] S. Shalev-Shwartz and Y. Singer. Logarithmic regret algorithms for strongly convex repeated games. The Hebrew University, 2007.

[39] P. Stange. On the efficient update of the singular value decomposition. In PAMM: Proceedings in Applied Mathematics and Mechanics, volume 8, pages 10827–10828. Wiley Online Library, 2008.

[40] S. Thrun and L. Pratt. Learning to Learn. Springer, 1998.

[41] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P.-A. Manzagol, and H. Larochelle. Meta-dataset: A dataset of datasets for learning to learn from few examples.
arXiv preprint arXiv:1903.03096, 2019.