{"title": "Adaptive Gradient-Based Meta-Learning Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 5917, "page_last": 5928, "abstract": "We build a theoretical framework for designing and understanding practical meta-learning methods that integrates sophisticated formalizations of task-similarity with the extensive literature on online convex optimization and sequential prediction algorithms. Our approach enables the task-similarity to be learned adaptively, provides sharper transfer-risk bounds in the setting of statistical learning-to-learn, and leads to straightforward derivations of average-case regret bounds for efficient algorithms in settings where the task-environment changes dynamically or the tasks share a certain geometric structure. We use our theory to modify several popular meta-learning algorithms and improve their training and meta-test-time performance on standard problems in few-shot and federated learning.", "full_text": "Adaptive Gradient-Based Meta-Learning Methods\n\nMikhail Khodak\n\nCarnegie Mellon University\n\nkhodak@cmu.edu\n\nMaria-Florina Balcan\n\nCarnegie Mellon University\n\nninamf@cs.cmu.edu\n\nAmeet Talwalkar\n\nCarnegie Mellon University\n\n& Determined AI\n\ntalwalkar@cmu.edu\n\nAbstract\n\nWe build a theoretical framework for designing and understanding practical meta-\nlearning methods that integrates sophisticated formalizations of task-similarity with\nthe extensive literature on online convex optimization and sequential prediction\nalgorithms. Our approach enables the task-similarity to be learned adaptively,\nprovides sharper transfer-risk bounds in the setting of statistical learning-to-learn,\nand leads to straightforward derivations of average-case regret bounds for ef\ufb01cient\nalgorithms in settings where the task-environment changes dynamically or the\ntasks share a certain geometric structure. 
We use our theory to modify several\npopular meta-learning algorithms and improve their meta-test-time performance\non standard problems in few-shot learning and federated learning.\n\n1 Introduction\n\nMeta-learning, or learning-to-learn (LTL) [52], has recently re-emerged as an important direction\nfor developing algorithms for multi-task learning, dynamic environments, and federated settings.\nBy using the data of numerous training tasks, meta-learning methods seek to perform well on new,\npotentially related test tasks without using many samples. Successful modern approaches have\nalso focused on exploiting the capabilities of deep neural networks, whether by learning multi-task\nembeddings passed to simple classifiers [51] or by neural control of optimization algorithms [46].\nBecause of its simplicity and flexibility, a common approach is parameter-transfer, where all tasks\nuse the same class of \u0398-parameterized functions f\u03b8 : X \u21a6 Y; often a shared model \u03c6 \u2208 \u0398 is\nlearned that is used to train within-task models. In gradient-based meta-learning (GBML) [23],\n\u03c6 is a meta-initialization for a gradient descent method over samples from a new task. GBML is\nused in a variety of LTL domains such as vision [38, 44, 35], federated learning [16], and robotics\n[20, 1]. Its simplicity also raises many practical and theoretical questions about the task-relations\nit can exploit and the settings in which it can succeed. Addressing these issues has naturally led\nseveral authors to online convex optimization (OCO) [55], either directly [24, 34] or from online-to-\nbatch conversion [34, 19]. 
These efforts study how to \ufb01nd a meta-initialization, either by proving\nalgorithmic learnability [24] or giving meta-test-time performance guarantees [34, 19].\nHowever, this recent line of work has so far considered a very restricted, if natural, notion of task-\nsimilarity \u2013 closeness to a single \ufb01xed point in the parameter space. We introduce a new theoretical\nframework, Average Regret-Upper-Bound Analysis (ARUBA), that enables the derivation of meta-\nlearning algorithms that can provably take advantage of much more sophisticated structure. ARUBA\ntreats meta-learning as the online learning of a sequence of losses that each upper bounds the regret\non a single task. These bounds often have convenient functional forms that are (a) suf\ufb01ciently nice, so\nthat we can draw upon the existing OCO literature, and (b) strongly dependent on both the task-data\nand the meta-initialization, thus encoding task-similarity in a mathematically accessible way. Using\nARUBA we introduce or dramatically improve upon GBML results in the following settings:\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f\u2022 Adapting to the Task-Similarity: A major drawback of previous work is a reliance on knowing\nthe task-similarity beforehand to set the learning rate [24] or regularization [19], or the use of\na sub-optimal guess-and-tune approach using the doubling trick [34]. ARUBA yields a simple\ngradient-based algorithm that eliminates the need to guess the similarity by learning it on-the-\ufb02y.\n\u2022 Adapting to Dynamic Environments: While previous theoretical work has largely considered\na \ufb01xed initialization [24, 34], in many practical applications of GBML the optimal initialization\nvaries over time due to a changing environment [1]. 
We show how ARUBA reduces the problem\nof meta-learning in dynamic environments to a dynamic regret-minimization problem, for which\nthere exists a vast array of online algorithms with provable guarantees that can be directly applied.\n\u2022 Adapting to the Inter-Task Geometry: A recurring notion in LTL is that certain model weights,\nsuch as feature extractors, are shared, whereas others, such as classi\ufb01cation layers, vary between\ntasks. By only learning a \ufb01xed initialization we must re-learn this structure on every task. Using\nARUBA we provide a method that adapts to this structure and determines which directions in \u0398\nneed to be updated by learning a Mahalanobis-norm regularizer for online mirror descent (OMD).\nWe show how a variant of this can be used to meta-learn a per-coordinate learning-rate for certain\nGBML methods, such as MAML [23] and Reptile [44], as well as for FedAvg, a popular federated\nlearning algorithm [41]. This leads to improved meta-test-time performance on few-shot learning\nand a simple, tuning-free approach to effectively add user-personalization to FedAvg.\n\n\u2022 Statistical Learning-to-Learn: ARUBA allows us to leverage powerful results in online-to-batch\nconversion [54, 33] to derive new bounds on the transfer risk when using GBML for statistical\nLTL [8], including fast rates in the number of tasks when the task-similarity is known and high-\nprobability guarantees for a class of losses that includes linear regression. This improves upon the\nguarantees of Khodak et al. [34] and Denevi et al. [19] for similar or identical GBML methods.\n\n1.1 Related Work\n\nTheoretical LTL: The statistical analysis of LTL was formalized by Baxter [8]. Several works have\nbuilt upon this theory for modern LTL, such as via a PAC-Bayesian perspective [3] or by learning the\nkernel for the ridge regression [18]. 
However, much effort has also been devoted to the online setting,\noften through the framework of lifelong learning [45, 5, 2]. Alquier et al. [2] consider a many-task\nnotion of regret similar to the one we study in order to learn a shared data representation, although\nour algorithms are much more practical. Recently, Bullins et al. [11] developed an ef\ufb01cient online\napproach to learning a linear data embedding, but such a setting is distinct from GBML and more\nclosely related to popular shared-representation methods such as ProtoNets [51]. Nevertheless, our\napproach does strongly rely on online learning through the study of data-dependent regret-upper-\nbounds, which has a long history of use in deriving adaptive single-task methods [40, 21]; however,\nin meta-learning there is typically not enough data to adapt to without considering multi-task data.\nAnalyzing regret-upper-bounds was done implicitly by Khodak et al. [34], but their approach is\nlargely restricted to using Follow-the-Leader (FTL) as the meta-algorithm. Similarly, Finn et al. [24]\nuse FTL to show learnability of the MAML meta-initialization. In contrast, the ARUBA framework\ncan handle general classes of meta-algorithms, which leads not only to new and improved results in\nstatic, dynamic, and statistical settings but also to signi\ufb01cantly more practical LTL methods.\nGBML: GBML stems from the Model-Agnostic Meta-Learning (MAML) algorithm [23] and has\nbeen widely used in practice [1, 44, 31]. An expressivity result was shown for MAML by Finn and\nLevine [22], proving that the meta-learner can approximate any permutation-invariant learner given\nenough data and a speci\ufb01c neural architecture. Under strong-convexity and smoothness assumptions\nand using a \ufb01xed learning rate, Finn et al. [24] show that the MAML meta-initialization is learnable,\nalbeit via an impractical FTL method. In contrast to these efforts, Khodak et al. 
[34] and Denevi et al. [19] focus on providing finite-sample meta-test-time performance guarantees in the convex setting, the former for the SGD-based Reptile algorithm of Nichol et al. [44] and the latter for a regularized variant. Our work improves upon these analyses by considering the case when the learning rate, a proxy for the task-similarity, is not known beforehand as in Finn et al. [24] and Denevi et al. [19] but must be learned online; Khodak et al. [34] do consider an unknown task-similarity but use a doubling-trick-based approach that considers the absolute deviation of the task-parameters from the meta-initialization and is thus average-case suboptimal and sensitive to outliers. Furthermore, ARUBA can handle more sophisticated and dynamic notions of task-similarity and in certain settings can provide better statistical guarantees than those of Khodak et al. [34] and Denevi et al. [19].\n\n2 Average Regret-Upper-Bound Analysis\n\nOur main contribution is ARUBA, a framework for analyzing the learning of $\\mathcal{X}$-parameterized learning algorithms via reduction to the online learning of a sequence of functions $U_t : \\mathcal{X} \\mapsto \\mathbb{R}$ upper-bounding their regret on task $t$. We consider a meta-learner facing a sequence of online learning tasks $t = 1, \\dots, T$, each with $m_t$ loss functions $\\ell_{t,i} : \\Theta \\mapsto \\mathbb{R}$ over action-space $\\Theta \\subset \\mathbb{R}^d$. The learner has access to a set of learning algorithms parameterized by $x \\in \\mathcal{X}$ that can be used to determine the action $\\theta_{t,i} \\in \\Theta$ on each round $i \\in [m_t]$ of task $t$. Thus on each task $t$ the meta-learner chooses $x_t \\in \\mathcal{X}$, runs the corresponding algorithm, and suffers regret $R_t(x_t) = \\sum_{i=1}^{m_t} \\ell_{t,i}(\\theta_{t,i}) - \\min_{\\theta} \\sum_{i=1}^{m_t} \\ell_{t,i}(\\theta)$.\n\nWe propose to analyze the meta-learner\u2019s performance by studying the online learning of a sequence of regret-upper-bounds $U_t(x_t) \\ge R_t(x_t)$, specifically by bounding the average regret-upper-bound $\\bar{U}_T = \\frac{1}{T} \\sum_{t=1}^T U_t(x_t)$. 
The following two observations highlight why we care about this quantity:\n\n1. Generality: Many algorithms of interest in meta-learning have regret guarantees $U_t(x)$ with nice, e.g. smooth and convex, functional forms that depend strongly on both their parameterizations $x \\in \\mathcal{X}$ and the task-data. This data-dependence lets us adaptively set the parameterization $x_t \\in \\mathcal{X}$.\n2. Consequences: By definition of $U_t$ we have that $\\bar{U}_T$ bounds the task-averaged regret (TAR) $\\bar{R}_T = \\frac{1}{T} \\sum_{t=1}^T R_t(x_t)$ [34]. Thus if the average regret-upper-bound is small then the meta-learner will perform well on-average across tasks. In Section 5 we further show that a low average regret-upper-bound will also lead to strong statistical guarantees in the batch setting.\n\nARUBA\u2019s applicability depends only on finding a low-regret algorithm over the functions $U_t$; then by observation 2 we get a task-averaged regret bound where the first term vanishes as $T \\to \\infty$ while by observation 1 the second term can be made small due to the data-dependent task-similarity:\n\n$$\\bar{R}_T \\le \\bar{U}_T \\le o_T(1) + \\min_x \\frac{1}{T} \\sum_{t=1}^T U_t(x)$$\n\nThe Case of Online Gradient Descent: Suppose the meta-learner uses online gradient descent (OGD) as the within-task learning algorithm, as is done by Reptile [44]. OGD can be parameterized by an initialization $\\phi \\in \\Theta$ and a learning rate $\\eta > 0$, so that $\\mathcal{X} = \\{(\\phi, \\eta) : \\phi \\in \\Theta, \\eta > 0\\}$. Using the notation $v_{a:b} = \\sum_{i=a}^b v_i$ and $\\nabla_{t,j} = \\nabla \\ell_{t,j}(\\theta_{t,j})$, at each round $i$ of task $t$ OGD plays $\\theta_{t,i} = \\arg\\min_{\\theta \\in \\Theta} \\frac{1}{2}\\|\\theta - \\phi\\|_2^2 + \\eta \\langle \\nabla_{t,1:i-1}, \\theta \\rangle$. 
The regret of this procedure when run on $m$ convex $G$-Lipschitz losses has a well-known upper-bound [48, Theorem 2.11]\n\n$$U_t(x) = U_t(\\phi, \\eta) = \\frac{1}{2\\eta}\\|\\theta_t^* - \\phi\\|_2^2 + \\eta G^2 m \\ge \\sum_{i=1}^m \\ell_{t,i}(\\theta_{t,i}) - \\ell_{t,i}(\\theta_t^*) = R_t(x) \\tag{1}$$\n\nwhich is convex in the learning rate $\\eta$ and the initialization $\\phi$. Note the strong data dependence via $\\theta_t^* \\in \\arg\\min_\\theta \\sum_{i=1}^{m_t} \\ell_{t,i}(\\theta)$, the optimal action in hindsight. To apply ARUBA, first note that if $\\bar{\\theta}^* = \\frac{1}{T}\\theta^*_{1:T}$ is the mean of the optimal actions $\\theta_t^*$ on each task and $V^2 = \\frac{1}{T}\\sum_{t=1}^T \\|\\theta_t^* - \\bar{\\theta}^*\\|_2^2$ is their empirical variance, then $\\min_{\\phi,\\eta} \\frac{1}{T}\\sum_{t=1}^T U_t(\\phi, \\eta) = O(GV\\sqrt{m})$. Thus by running a low-regret algorithm on the regret-upper-bounds $U_t$ the meta-learner will suffer task-averaged regret at most $o_T(1) + O(GV\\sqrt{m})$, which can be much better than the single-task regret $O(GD\\sqrt{m})$, where $D$ is the $\\ell_2$-radius of $\\Theta$, if $V \\ll D$, i.e. if the optimal actions $\\theta_t^*$ are close together. See Theorem 3.2 for the result yielded by ARUBA in this simple setting.\n\n3 Adapting to Similar Tasks and Dynamic Environments\n\nWe now demonstrate the effectiveness of ARUBA for analyzing GBML by using it to prove a general bound for a class of algorithms that can adapt to both task-similarity, i.e. when the optimal actions $\\theta_t^*$ for each task are close to some good initialization, and to changing environments, i.e. when this initialization changes over time. 
The task-similarity will be measured using the Bregman divergence $B_R(\\theta \\| \\phi) = R(\\theta) - R(\\phi) - \\langle \\nabla R(\\phi), \\theta - \\phi \\rangle$ of a 1-strongly-convex function $R : \\Theta \\mapsto \\mathbb{R}$ [10], a generalized notion of distance. Note that for $R(\\cdot) = \\frac{1}{2}\\|\\cdot\\|_2^2$ we have $B_R(\\theta \\| \\phi) = \\frac{1}{2}\\|\\theta - \\phi\\|_2^2$. A changing environment will be studied by analyzing dynamic regret, which for a sequence of actions $\\{\\phi_t\\}_t \\subset \\Theta$ taken by some online algorithm over a sequence of loss functions $\\{f_t : \\Theta \\mapsto \\mathbb{R}\\}_t$ is defined w.r.t. a reference sequence $\\Psi = \\{\\psi_t\\}_t \\subset \\Theta$ as $R_T(\\Psi) = \\sum_{t=1}^T f_t(\\phi_t) - f_t(\\psi_t)$. Dynamic regret measures the performance of an online algorithm taking actions $\\phi_t$ relative to a potentially time-varying comparator taking actions $\\psi_t$. Note that when we fix $\\psi_t = \\psi^* \\in \\arg\\min_{\\psi \\in \\Theta} \\sum_{t=1}^T f_t(\\psi)$ we recover the standard static regret, in which the comparator always uses the same action.\n\nAlgorithm 1: Generic online algorithm for gradient-based parameter-transfer meta-learning. To run OGD within-task set $R(\\cdot) = \\frac{1}{2}\\|\\cdot\\|_2^2$. 
To run FTRL within-task substitute $\\ell_{t,j}(\\theta)$ for $\\langle \\nabla_{t,j}, \\theta \\rangle$.\nSet meta-initialization $\\phi_1 \\in \\Theta$ and learning rate $\\eta_1 > 0$.\nfor task $t \\in [T]$ do\n    for round $i \\in [m_t]$ do\n        $\\theta_{t,i} \\leftarrow \\arg\\min_{\\theta \\in \\Theta} B_R(\\theta \\| \\phi_t) + \\eta_t \\langle \\nabla_{t,1:i-1}, \\theta \\rangle$ // online mirror descent step\n        Suffer loss $\\ell_{t,i}(\\theta_{t,i})$\n    Update $\\phi_{t+1}, \\eta_{t+1}$ // meta-update of OMD initialization and learning rate\n\nPutting these together, we seek to define variants of Algorithm 1 for which as $T \\to \\infty$ the average regret scales with $V_\\Psi$, where $V_\\Psi^2 = \\frac{1}{T}\\sum_{t=1}^T B_R(\\theta_t^* \\| \\psi_t)$, without knowing this quantity in advance. Note for fixed $\\psi_t = \\bar{\\theta}^* = \\frac{1}{T}\\theta^*_{1:T}$ this measures the empirical standard deviation of the optimal task-actions $\\theta_t^*$. Thus achieving our goal implies that average performance improves with task-similarity.\n\nOn each task $t$ Algorithm 1 runs online mirror descent with regularizer $\\frac{1}{\\eta_t} B_R(\\cdot \\| \\phi_t)$ for initialization $\\phi_t \\in \\Theta$ and learning rate $\\eta_t > 0$. It is well-known that OMD and the related Follow-the-Regularized-Leader (FTRL), for which our results also hold, generalize many important online methods, e.g. OGD and multiplicative weights [26]. For $m_t$ convex losses with mean squared Lipschitz constant $G_t^2$ they also share a convenient, data-dependent regret-upper-bound for any $\\theta_t^* \\in \\Theta$ [48, Theorem 2.15]:\n\n$$R_t \\le U_t(\\phi_t, \\eta_t) = \\frac{1}{\\eta_t} B_R(\\theta_t^* \\| \\phi_t) + \\eta_t G_t^2 m_t \\tag{2}$$\n\nAll that remains is to come up with update rules for the meta-initialization $\\phi_t \\in \\Theta$ and the learning rate $\\eta_t > 0$ in Algorithm 1 so that the average over $T$ of these upper-bounds $U_t(\\phi_t, \\eta_t)$ is small. 
While this can be viewed as a single online learning problem to determine actions $x_t = (\\phi_t, \\eta_t) \\in \\Theta \\times (0, \\infty)$, it is easier to decouple $\\phi$ and $\\eta$ by first defining two function sequences $\\{f_t^{init}\\}_t$ and $\\{f_t^{sim}\\}_t$:\n\n$$f_t^{init}(\\phi) = B_R(\\theta_t^* \\| \\phi)\\, G_t \\sqrt{m_t} \\qquad f_t^{sim}(v) = \\left( \\frac{B_R(\\theta_t^* \\| \\phi_t)}{v} + v \\right) G_t \\sqrt{m_t} \\tag{3}$$\n\nWe show in Theorem 3.1 that to get an adaptive algorithm it suffices to specify two OCO algorithms, INIT and SIM, such that the actions $\\phi_t = \\mathrm{INIT}(t)$ achieve good (dynamic) regret over $f_t^{init}$ and the actions $v_t = \\mathrm{SIM}(t)$ achieve low (static) regret over $f_t^{sim}$; these actions then determine the update rules of $\\phi_t$ and $\\eta_t = v_t / (G_t \\sqrt{m_t})$. We will specialize Theorem 3.1 to derive algorithms that provably adapt to task similarity (Theorem 3.2) and to dynamic environments (Theorem 3.3).\n\nTo understand the formulation of $f_t^{init}$ and $f_t^{sim}$, first note that $f_t^{sim}(v) = U_t(\\phi_t, v / (G_t \\sqrt{m_t}))$, so the online algorithm SIM over $f_t^{sim}$ corresponds to an online algorithm over the regret-upper-bounds $U_t$ when the sequence of initializations $\\phi_t$ is chosen adversarially. Once we have shown that SIM is low-regret we can compare its losses $f_t^{sim}(v_t)$ to those of an arbitrary fixed $v > 0$; this is the first line in the proof of Theorem 3.1 (below). For fixed $v$, each $f_t^{init}(\\phi_t)$ is an affine transformation of $f_t^{sim}(v)$, so the algorithm INIT with low dynamic regret over $f_t^{init}$ corresponds to an algorithm with low dynamic regret over the regret-upper-bounds $U_t$ when $\\eta_t = v / (G_t \\sqrt{m_t})\\ \\forall\\, t$. 
Thus once we have shown a dynamic regret guarantee for INIT we can compare its losses $f_t^{init}(\\phi_t)$ to those of an arbitrary comparator sequence $\\{\\psi_t\\}_t \\subset \\Theta$; this is the second line in the proof of Theorem 3.1.\n\nTheorem 3.1. Assume $\\Theta \\subset \\mathbb{R}^d$ is convex, each task $t \\in [T]$ is a sequence of $m_t$ convex losses $\\ell_{t,i} : \\Theta \\mapsto \\mathbb{R}$ with mean squared Lipschitz constant $G_t^2$, and $R : \\Theta \\mapsto \\mathbb{R}$ is 1-strongly-convex.\n\u2022 Let INIT be an algorithm whose dynamic regret over functions $\\{f_t^{init}\\}_t$ w.r.t. any reference sequence $\\Psi = \\{\\psi_t\\}_{t=1}^T \\subset \\Theta$ is upper-bounded by $U_T^{init}(\\Psi)$.\n\u2022 Let SIM be an algorithm whose static regret over functions $\\{f_t^{sim}\\}_t$ w.r.t. any $v > 0$ is upper-bounded by a non-increasing function $U_T^{sim}(v)$ of $v$.\nIf Algorithm 1 sets $\\phi_t = \\mathrm{INIT}(t)$ and $\\eta_t = \\frac{\\mathrm{SIM}(t)}{G_t \\sqrt{m_t}}$ then for $V_\\Psi^2 = \\frac{\\sum_{t=1}^T B_R(\\theta_t^* \\| \\psi_t)\\, G_t \\sqrt{m_t}}{\\sum_{t=1}^T G_t \\sqrt{m_t}}$ it will achieve average regret\n\n$$\\bar{R}_T \\le \\bar{U}_T \\le \\frac{U_T^{sim}(V_\\Psi)}{T} + \\frac{1}{T} \\min\\left\\{ \\frac{U_T^{init}(\\Psi)}{V_\\Psi},\\ 2\\sqrt{U_T^{init}(\\Psi) \\sum_{t=1}^T G_t \\sqrt{m_t}} \\right\\} + \\frac{2 V_\\Psi}{T} \\sum_{t=1}^T G_t \\sqrt{m_t}$$\n\nProof. 
For $\\sigma_t = G_t \\sqrt{m_t}$ we have by the regret bound on OMD/FTRL (2) that\n\n$$\\bar{U}_T T = \\sum_{t=1}^T \\left( \\frac{B_R(\\theta_t^* \\| \\phi_t)}{v_t} + v_t \\right) \\sigma_t \\le \\min_{v > 0}\\ U_T^{sim}(v) + \\sum_{t=1}^T \\left( \\frac{B_R(\\theta_t^* \\| \\phi_t)}{v} + v \\right) \\sigma_t$$\n$$\\le \\min_{v > 0}\\ U_T^{sim}(v) + \\frac{U_T^{init}(\\Psi)}{v} + \\sum_{t=1}^T \\left( \\frac{B_R(\\theta_t^* \\| \\psi_t)}{v} + v \\right) \\sigma_t$$\n$$\\le U_T^{sim}(V_\\Psi) + \\min\\left\\{ \\frac{U_T^{init}(\\Psi)}{V_\\Psi},\\ 2\\sqrt{U_T^{init}(\\Psi)\\, \\sigma_{1:T}} \\right\\} + 2 V_\\Psi \\sigma_{1:T}$$\n\nwhere the last line follows by substituting $v = \\max\\left\\{ V_\\Psi, \\sqrt{U_T^{init}(\\Psi) / \\sigma_{1:T}} \\right\\}$.\n\nSimilar Tasks in Static Environments: By Theorem 3.1, if we can specify algorithms INIT and SIM with sublinear regret over $f_t^{init}$ and $f_t^{sim}$ (3), respectively, then the average regret will converge to $O(V_\\Psi \\sqrt{m})$ as desired. We first show an approach in the case when the optimal actions $\\theta_t^*$ are close to a fixed point in $\\Theta$, i.e. for fixed $\\psi_t = \\bar{\\theta}^* = \\frac{1}{T}\\theta^*_{1:T}$. Henceforth we assume the Lipschitz constant $G$ and number of rounds $m$ are the same across tasks; detailed statements are in the supplement.\n\nNote that if $R(\\cdot) = \\frac{1}{2}\\|\\cdot\\|_2^2$ then $\\{f_t^{init}\\}_t$ are quadratic functions, so playing $\\phi_{t+1} = \\frac{1}{t}\\theta^*_{1:t}$ has logarithmic regret [48, Corollary 2.2]. We use a novel strongly convex coupling argument to show that this holds for any such sequence of Bregman divergences, even for nonconvex $B_R(\\theta_t^* \\| \\cdot)$. The second sequence $\\{f_t^{sim}\\}_t$ is harder because it is not smooth near 0 and not strongly convex if $\\theta_t^* = \\phi_t$. We study a regularized sequence $\\tilde{f}_t^{sim}(v) = f_t^{sim}(v) + \\varepsilon^2 / v$ for $\\varepsilon \\ge 0$. 
Assuming a bound of $D^2$ on the Bregman divergence and setting $\\varepsilon = 1/\\sqrt[4]{T}$, we achieve $\\tilde{O}(\\sqrt{T})$ regret on the original sequence by running exponentially-weighted online-optimization (EWOO) [28] on the regularized sequence:\n\n$$v_t = \\frac{\\int_0^{\\sqrt{D^2 + \\varepsilon^2}} v \\exp\\left(-\\gamma \\sum_{s < t} \\tilde{f}_s^{sim}(v)\\right) dv}{\\int_0^{\\sqrt{D^2 + \\varepsilon^2}} \\exp\\left(-\\gamma \\sum_{s < t} \\tilde{f}_s^{sim}(v)\\right) dv} \\quad \\text{for} \\quad \\gamma = \\frac{2}{DG\\sqrt{m}} \\min\\left\\{ \\frac{\\varepsilon^2}{D^2}, 1 \\right\\} \\tag{4}$$\n\nNote that while EWOO is inefficient in high dimensions, we require only single-dimensional integrals. In the supplement we also show that simply setting $v_{t+1}^2 = \\varepsilon^2 t + \\sum_{s \\le t} B_R(\\theta_s^* \\| \\phi_t)$ has only a slightly worse regret of $\\tilde{O}(T^{3/5})$. These guarantees suffice to show the following:\n\nTheorem 3.2. Under the assumptions of Theorem 3.1 and boundedness of $B_R$ over $\\Theta$, if INIT plays $\\phi_{t+1} = \\frac{1}{t}\\theta^*_{1:t}$ and SIM uses $\\varepsilon$-EWOO (4) with $\\varepsilon = 1/\\sqrt[4]{T}$ then Algorithm 1 achieves average regret\n\n$$\\bar{R}_T \\le \\bar{U}_T = \\tilde{O}\\left( \\min\\left\\{ \\frac{1 + \\frac{1}{V}}{\\sqrt{T}},\\ \\frac{1}{\\sqrt[4]{T}} \\right\\} + V \\right) \\sqrt{m} \\quad \\text{for} \\quad V^2 = \\min_{\\phi \\in \\Theta} \\frac{1}{T} \\sum_{t=1}^T B_R(\\theta_t^* \\| \\phi)$$\n\nObserve that if $V$, the average deviation of $\\theta_t^*$, is $\\Omega_T(1)$ then the bound becomes $O(V\\sqrt{m})$ at rate $\\tilde{O}(1/\\sqrt{T})$, while if $V = o_T(1)$ the bound tends to zero. Theorem 3.2 can be compared to the main result of Khodak et al. [34], who set the learning rate via a doubling trick. We improve upon their result in two aspects. First, their asymptotic regret is $O(D^* \\sqrt{m})$, where $D^*$ is the maximum distance between any two optimal actions. 
Note that $V$ is always at most $D^*$, and indeed may be much smaller in the presence of outliers. Second, our result is more general, as we do not need convex $B_R(\\theta_t^* \\| \\cdot)$.\n\nRemark 3.1. We assume an oracle giving a unique $\\theta^* \\in \\arg\\min_{\\theta \\in \\Theta} \\sum_{\\ell \\in S} \\ell(\\theta)$ for any finite loss sequence $S$, which may be inefficient or undesirable. One can instead use the last or average iterate of within-task OMD/FTRL for the meta-update; in the supplement we show that this incurs an additional $o(\\sqrt{m})$ regret term under a quadratic growth assumption that holds in many practical settings [34].\n\nFigure 1: Left - Theorem 3.2 improves upon [34, Theorem 2.1] via its dependence on the average deviation $V$ rather than the maximal deviation $D^*$ of the optimal task-parameters $\\theta_t^*$ (light blue). Right - a case where Theorem 3.3 yields a strong task-similarity-based guarantee via a dynamic comparator $\\Psi$ despite the deviation $V$ being large.\n\nFigure 2: Learning rate variation across layers of a convolutional net trained on Mini-ImageNet using Algorithm 2. Following intuition outlined in Section 6, shared feature extractors are not updated much if at all compared to higher layers.\n\nRelated Tasks in Changing Environments: In many settings we have a changing environment and so it is natural to study dynamic regret. This has been widely analyzed by the online learning community [15, 30], often by showing a dynamic regret bound consisting of a sublinear term plus a bound on the variation in the action or function space. Using Theorem 3.1 we can show dynamic guarantees for GBML via reduction to such bounds. We provide an example in the Euclidean geometry using the popular path-length-bound $P_\\Psi = \\sum_{t=2}^T \\|\\psi_t - \\psi_{t-1}\\|_2$ for reference actions $\\Psi = \\{\\psi_t\\}_{t=1}^T$ [55]. We use a result showing that OGD with learning rate $\\eta \\le 1/\\beta$ over $\\alpha$-strongly-convex, $\\beta$-strongly-smooth, and $L$-Lipschitz functions has a bound of $O(L(1 + P_\\Psi))$ on its dynamic regret [42, Corollary 1]. Observe that in the case of $R(\\cdot) = \\frac{1}{2}\\|\\cdot\\|_2^2$ the sequence $f_t^{init}$ in Theorem 3.1 consists of $DG\\sqrt{m}$-Lipschitz quadratic functions. 
Thus using Theorem 3.1 we achieve the following:\n\nTheorem 3.3. Under Theorem 3.1 assumptions, bounded $\\Theta$, and $R(\\cdot) = \\frac{1}{2}\\|\\cdot\\|_2^2$, if INIT is OGD with learning rate $\\frac{1}{G\\sqrt{m}}$ and SIM uses $\\varepsilon$-EWOO (4) with $\\varepsilon = 1/\\sqrt[4]{T}$ then by using OGD within-task Algorithm 1 will achieve for any fixed comparator sequence $\\Psi = \\{\\psi_t\\}_{t \\in [T]} \\subset \\Theta$ the average regret\n\n$$\\bar{R}_T \\le \\bar{U}_T = \\tilde{O}\\left( \\min\\left\\{ \\frac{1 + \\frac{1}{V_\\Psi}}{\\sqrt{T}},\\ \\frac{1}{\\sqrt[4]{T}} \\right\\} + \\min\\left\\{ \\frac{1 + P_\\Psi}{V_\\Psi T},\\ \\sqrt{\\frac{1 + P_\\Psi}{T}} \\right\\} + V_\\Psi \\right) \\sqrt{m}$$\n\nfor $V_\\Psi^2 = \\frac{1}{2T} \\sum_{t=1}^T \\|\\theta_t^* - \\psi_t\\|_2^2$ and $P_\\Psi = \\sum_{t=2}^T \\|\\psi_t - \\psi_{t-1}\\|_2$.\n\nThis bound controls the average regret across tasks using the deviation $V_\\Psi$ of the optimal task parameters $\\theta_t^*$ from some reference sequence $\\Psi$, which is assumed to vary slowly or sparsely so that the path length $P_\\Psi$ is small. Figure 1 illustrates when such a guarantee improves over Theorem 3.2. 
Note also that Theorem 3.3 specifies OGD as the meta-update algorithm INIT, so under the approximation that each task $t$\u2019s last iterate is close to $\\theta_t^*$ this suggests that simple GBML methods such as Reptile [44] or FedAvg [41] are adaptive. The generality of ARUBA also allows for the incorporation of other dynamic regret bounds [25, 53] and other non-static notions of regret [27].\n\n4 Adapting to the Inter-Task Geometry\n\nPreviously we gave improved guarantees for learning OMD under a simple notion of task-similarity: closeness of the optimal actions $\\theta_t^*$. We now turn to new algorithms that can adapt to a more sophisticated task-similarity structure. Specifically, we study a class of learning algorithms parameterized by an initialization $\\phi \\in \\Theta$ and a symmetric positive-definite matrix $H \\in \\mathcal{M} \\subset \\mathbb{R}^{d \\times d}$ that play\n\n$$\\theta_{t,i} = \\arg\\min_{\\theta \\in \\Theta} \\frac{1}{2}\\|\\theta - \\phi\\|_{H^{-1}}^2 + \\langle \\nabla_{t,1:i-1}, \\theta \\rangle \\tag{5}$$\n\nThis corresponds to $\\theta_{t,i+1} = \\theta_{t,i} - H \\nabla_{t,i}$, so if the optimal actions $\\theta_t^*$ vary strongly in certain directions, a matrix emphasizing those directions improves within-task performance. By strong-convexity of $\\frac{1}{2}\\|\\theta - \\phi\\|_{H^{-1}}^2$ w.r.t. $\\|\\cdot\\|_{H^{-1}}$, the regret-upper-bound is $U_t(\\phi, H) = \\frac{1}{2}\\|\\theta_t^* - \\phi\\|_{H^{-1}}^2 + \\sum_{i=1}^m \\|\\nabla_{t,i}\\|_H^2$ [48, Theorem 2.15]. We first study the diagonal case, i.e. learning a per-coordinate learning rate $\\eta \\in \\mathbb{R}^d$ to get iteration $\\theta_{t,i+1} = \\theta_{t,i} - \\eta_t \\odot \\nabla_{t,i}$. 
We propose to set $\\eta_t$ at each task $t$ as follows:\n\n$$\\eta_t = \\sqrt{\\frac{\\sum_{s < t} \\varepsilon_s^2 + \\frac{1}{2}(\\theta_s^* - \\phi_s)^2}{\\sum_{s < t} \\zeta_s^2 + \\sum_{i=1}^{m_s} \\nabla_{s,i}^2}} \\quad \\text{for} \\quad \\varepsilon_t^2 = \\frac{\\varepsilon^2}{(t+1)^p},\\ \\zeta_t^2 = \\frac{\\zeta^2}{(t+1)^p}\\ \\forall\\, t \\ge 0, \\text{ where } \\varepsilon, \\zeta, p > 0 \\tag{6}$$\n\nObserve the similarity between this update and AdaGrad [21], which is also inversely related to the sum of the element-wise squares of all gradients seen so far. Our method adds multi-task information by setting the numerator to depend on the sum of squared distances between the initializations $\\phi_t$ set by the algorithm and that task\u2019s optimal action $\\theta_t^*$. This algorithm has the following guarantee:\n\nTheorem 4.1. Let $\\Theta$ be a bounded convex subset of $\\mathbb{R}^d$, let $\\mathcal{D} \\subset \\mathbb{R}^{d \\times d}$ be the set of positive definite diagonal matrices, and let each task $t \\in [T]$ consist of a sequence of $m$ convex Lipschitz loss functions $\\ell_{t,i} : \\Theta \\mapsto \\mathbb{R}$. Suppose for each task $t$ we run the iteration in Equation 5 setting $\\phi = \\frac{1}{t-1}\\theta^*_{1:t-1}$ and setting $H = \\mathrm{Diag}(\\eta_t)$ via Equation 6 for $\\varepsilon = 1$, $\\zeta = \\sqrt{m}$, and $p = \\frac{2}{5}$. Then we achieve\n\n$$\\bar{R}_T \\le \\bar{U}_T = \\min_{\\phi \\in \\Theta,\\, H \\in \\mathcal{D}} \\tilde{O}\\left( \\sum_{j=1}^d \\min\\left\\{ \\frac{\\frac{1}{H_{jj}} + H_{jj}}{T^{2/5}},\\ \\frac{1}{\\sqrt[5]{T}} \\right\\} \\right) \\sqrt{m} + \\frac{1}{T} \\sum_{t=1}^T \\left( \\frac{\\|\\theta_t^* - \\phi\\|_{H^{-1}}^2}{2} + \\sum_{i=1}^m \\|\\nabla_{t,i}\\|_H^2 \\right)$$\n\nAs $T \\to \\infty$ the average regret converges to the minimum over $\\phi, H$ of the last two terms, which corresponds to running OMD with the optimal initialization and per-coordinate learning rate on every task. 
The rate of convergence of $T^{-2/5}$ is slightly slower than the usual $1/\\sqrt{T}$ achieved in the previous section; this is due to the algorithm\u2019s adaptivity to within-task gradients, whereas previously we simply assumed a known Lipschitz bound $G_t$ when setting $\\eta_t$. This adaptivity makes the algorithm much more practical, leading to a method for adaptively learning a within-task learning rate using multi-task information; this is outlined in Algorithm 2 and shown to significantly improve GBML performance in Section 6. Note also the per-coordinate separation of the left term, which shows that the algorithm converges more quickly on non-degenerate coordinates. The per-coordinate specification of $\\eta_t$ (6) can be further generalized to learning a full-matrix adaptive regularizer, for which we show guarantees in Theorem 4.2. However, the rate is much slower, and without further assumptions such methods will have $\\Omega(d^2)$ computation and memory requirements.\nTheorem 4.2. Let $\\Theta$ be a bounded convex subset of $\\mathbb{R}^d$ and let each task $t \\in [T]$ consist of a sequence of $m$ convex Lipschitz loss functions $\\ell_{t,i} : \\Theta \\mapsto \\mathbb{R}$. Suppose for each task $t$ we run the iteration in Equation 5 with $\\phi = \\frac{1}{t-1} \\theta^*_{1:t-1}$ and $H$ the unique positive definite solution of $B_t^2 = H G_t^2 H$ for\n\n$$B_t^2 = t \\varepsilon^2 I_d + \\frac{1}{2} \\sum_{s < t} (\\theta^*_s - \\phi_s)(\\theta^*_s - \\phi_s)^T \\quad \\text{and} \\quad G_t^2 = t \\zeta^2 I_d + \\sum_{s < t} \\sum_{i=1}^m \\nabla_{s,i} \\nabla_{s,i}^T$$\n\nfor $\\varepsilon = 1/\\sqrt[8]{T}$ and $\\zeta = \\sqrt{m}/\\sqrt[8]{T}$. 
Then for $\\lambda_j$ corresponding to the $j$th largest eigenvalue we have\n\n$$\\bar{R}_T \\le \\bar{U}_T = \\tilde{O}\\left( \\frac{1}{\\sqrt[8]{T}} \\right) \\sqrt{m} + \\min_{\\phi \\in \\Theta,\\, H \\succ 0} \\frac{2 \\lambda_1^2(H)}{\\lambda_d(H)} \\frac{1 + \\log T}{T} + \\frac{1}{T} \\sum_{t=1}^T \\left( \\frac{\\|\\theta^*_t - \\phi\\|_{H^{-1}}^2}{2} + \\sum_{i=1}^m \\|\\nabla_{t,i}\\|_H^2 \\right)$$\n\n5 Fast Rates and High Probability Bounds for Statistical Learning-to-Learn\nBatch-setting transfer risk bounds have been an important motivation for studying LTL via online learning [2, 34, 19]. If the regret-upper-bounds are convex, which is true for most practical variants of OMD/FTRL, ARUBA yields several new results in the classical distribution over task-distributions setup of Baxter [8]. In Theorem 5.1 we present bounds on the risk $\\ell_P(\\bar\\theta)$ of the parameter $\\bar\\theta$ obtained by running OMD/FTRL on i.i.d. samples from a new task distribution $P$ and averaging the iterates.\nTheorem 5.1. Assume $\\Theta, X$ are convex Euclidean subsets. Let convex losses $\\ell_{t,i} : \\Theta \\mapsto [0, 1]$ be drawn i.i.d. $P_t \\sim Q$, $\\{\\ell_{t,i}\\}_i \\sim P_t^m$ for distribution $Q$ over tasks. Suppose they are passed to an algorithm with average regret upper-bound $\\bar{U}_T$ that at each $t$ picks $x_t \\in X$ to initialize a within-task method with convex regret upper-bound $U_t : X \\mapsto [0, B\\sqrt{m}]$, for $B \\ge 0$. If the within-task algorithm is initialized by $\\bar{x} = \\frac{1}{T} x_{1:T}$ and it takes actions $\\theta_1, \\dots, \\theta_m$ on $m$ i.i.d. losses from new task $P \\sim Q$ then $\\bar\\theta = \\frac{1}{m} \\theta_{1:m}$ satisfies the following transfer risk bounds for any $\\theta^* \\in \\Theta$ (all w.p. 
$1 - \\delta$):\n1. general case: $\\mathbb{E}_{P \\sim Q} \\mathbb{E}_{P^m} \\ell_P(\\bar\\theta) \\le \\mathbb{E}_{P \\sim Q} \\ell_P(\\theta^*) + L_T$ for $L_T = \\frac{\\bar{U}}{m} + B \\sqrt{\\frac{8}{mT} \\log \\frac{1}{\\delta}}$.\n2. $\\rho$-self-bounded losses $\\ell$: if $\\exists\\, \\rho > 0$ s.t. 
$\\rho\\, \\mathbb{E}_{\\ell \\sim P} \\Delta\\ell(\\theta) \\ge \\mathbb{E}_{\\ell \\sim P} (\\Delta\\ell(\\theta) - \\mathbb{E}_{\\ell \\sim P} \\Delta\\ell(\\theta))^2$ for all distributions $P \\sim Q$, where $\\Delta\\ell(\\theta) = \\ell(\\theta) - \\ell(\\theta^*)$ for any $\\theta^* \\in \\arg\\min_{\\theta \\in \\Theta} \\ell_P(\\theta)$, then for $L_T$ as above we have $\\mathbb{E}_{P \\sim Q} \\ell_P(\\bar\\theta) \\le \\mathbb{E}_{P \\sim Q} \\ell_P(\\theta^*) + L_T + \\sqrt{\\frac{2 \\rho L_T}{m} \\log \\frac{2}{\\delta}} + \\frac{3\\rho + 2}{m} \\log \\frac{2}{\\delta}$.\n3. $\\alpha$-strongly-convex, $G$-Lipschitz regret-upper-bounds $U_t$: in parts 1 and 2 above we can substitute $L_T = \\frac{\\bar{U} + \\min_x \\mathbb{E}_{P \\sim Q} U(x)}{m} + \\frac{4G}{T} \\sqrt{\\frac{\\bar{U}}{\\alpha m} \\log \\frac{8 \\log T}{\\delta}} + \\frac{\\max\\{16 G^2, 6 \\alpha B \\sqrt{m}\\}}{\\alpha m T} \\log \\frac{8 \\log T}{\\delta}$.\nIn the general case, Theorem 5.1 provides bounds on the excess transfer risk decreasing with $\\bar{U}/m$ and $1/\\sqrt{mT}$. Thus if $\\bar{U}$ improves with task-similarity so will the transfer risk as $T \\to \\infty$. Note that the second term is $1/\\sqrt{mT}$ rather than $1/\\sqrt{T}$ as in most analyses [34, 19]; this is because regret is $m$-bounded but the OMD regret-upper-bound is $O(\\sqrt{m})$-bounded. The results also demonstrate ARUBA\u2019s ability to utilize specialized results from the online-to-batch conversion literature. This is witnessed by the guarantee for self-bounded losses, a class which Zhang [54] shows includes linear regression; we use a result by the same author to obtain high-probability bounds, whereas previous GBML bounds are in-expectation [34, 19]. We also apply a result due to Kakade and Tewari [33] for the case of strongly-convex regret-upper-bounds, enabling fast rates in the number of tasks $T$. 
The strongly-convex case is especially relevant for GBML since it holds for OGD with fixed learning rate.\n\nAlgorithm 2: ARUBA: an approach for modifying a generic batch GBML method to learn a per-coordinate learning rate. Two specialized variants provided below.\nInput: $T$ tasks, update method for meta-initialization, within-task descent method, settings $\\varepsilon, \\zeta, p > 0$\nInitialize $b_1 \\leftarrow \\varepsilon^2 1_d$, $g_1 \\leftarrow \\zeta^2 1_d$\nfor task $t = 1, 2, \\dots, T$ do\n    Set $\\phi_t$ according to update method, $\\eta_t \\leftarrow \\sqrt{b_t / g_t}$\n    Run descent method from $\\phi_t$ with learning rate $\\eta_t$:\n        observe gradients $\\nabla_{t,1}, \\dots, \\nabla_{t,m_t}$\n        obtain within-task parameter $\\hat\\theta_t$\n    $b_{t+1} \\leftarrow b_t + \\frac{\\varepsilon^2 1_d}{(t+1)^p} + \\frac{1}{2} (\\phi_t - \\hat\\theta_t)^2$\n    $g_{t+1} \\leftarrow g_t + \\frac{\\zeta^2 1_d}{(t+1)^p} + \\sum_{i=1}^{m_t} \\nabla_{t,i}^2$\nResult: initialization $\\phi_T$, learning rate $\\eta_T = \\sqrt{b_T / g_T}$\nARUBA++: starting with $\\eta_{T,1} = \\eta_T$ and $g_{T,1} = g_T$, adaptively reset the learning rate by setting $\\hat{g}_{T,i+1} \\leftarrow \\hat{g}_{T,i} + c \\nabla_i^2$ for some $c > 0$ and then updating $\\eta_{T,i+1} \\leftarrow \\sqrt{b_T / g_{T,i+1}}$.\nIsotropic: $b_t$ and $g_t$ are scalars tracking the sum of squared distances and sum of squared gradient norms, respectively.\n\nFigure 3: Next-character prediction performance for recurrent networks trained on the Shakespeare dataset [12] using FedAvg [41] and its modifications by Algorithm 2. Note that the two ARUBA methods require no learning rate tuning when personalizing the model (refine), unlike both FedAvg methods; this is a critical improvement in federated settings. Furthermore, isotropic ARUBA has negligible overhead by only communicating scalars.\n\nWe present two consequences of these results for the algorithms from Section 3 when run on i.i.d. data. 
To measure task-similarity we use the variance $V_Q^2 = \\min_{\\phi \\in \\Theta} \\mathbb{E}_{P \\sim Q} \\mathbb{E}_{P^m} \\|\\theta^* - \\phi\\|_2^2$ of the empirical risk minimizer $\\theta^*$ of an $m$-sample task drawn from $Q$. If $V_Q$ is known we can use strong-convexity of the regret-upper-bounds to obtain a fast rate for learning the initialization, as shown in the first part of Corollary 5.1. The result can be loosely compared to Denevi et al. [19], who provide a similar asymptotic improvement but with a slower rate of $O(1/\\sqrt{T})$ in the second term. However, their task-similarity measures the deviation of the true, not empirical, risk-minimizers, so the results are not directly comparable. Corollary 5.1 also gives a guarantee for when we do not know $V_Q$ and must learn the learning rate $\\eta$ in addition to the initialization; here we match the rate of Denevi et al. [19], who do not learn $\\eta$, up to some additional fast $o(1/\\sqrt{m})$ terms.\nCorollary 5.1. In the setting of Theorems 3.2 & 5.1, if $\\delta \\le 1/e$ and Algorithm 1 uses within-task OGD with initialization $\\phi_{t+1} = \\frac{1}{t} \\theta^*_{1:t}$ and step-size $\\eta_t = \\frac{V_Q + 1/\\sqrt{T}}{G \\sqrt{m}}$ for $V_Q$ as above, then w.p. $1 - \\delta$\n\n$$\\mathbb{E}_{P \\sim Q} \\mathbb{E}_{P^m} \\ell_P(\\bar\\theta) \\le \\mathbb{E}_{P \\sim Q} \\ell_P(\\theta^*) + \\tilde{O}\\left( \\frac{V_Q}{\\sqrt{m}} + \\left( \\frac{1}{\\sqrt{mT}} + \\frac{1}{T} \\right) \\log \\frac{1}{\\delta} \\right)$$\n\nIf $\\eta_t$ is set adaptively using $\\varepsilon$-EWOO as in Theorem 3.2 for $\\varepsilon = 1/\\sqrt[4]{mT} + 1/\\sqrt{m}$ then w.p. 
$1 - \\delta$\n\n$$\\mathbb{E}_{P \\sim Q} \\mathbb{E}_{P^m} \\ell_P(\\bar\\theta) \\le \\mathbb{E}_{P \\sim Q} \\ell_P(\\theta^*) + \\tilde{O}\\left( \\frac{V_Q}{\\sqrt{m}} + \\min\\left\\{ \\frac{\\frac{1}{\\sqrt{m}} + \\frac{1}{\\sqrt{T}}}{V_Q m}, \\frac{1}{\\sqrt[4]{m^3 T}} + \\frac{1}{m} \\right\\} + \\sqrt{\\frac{1}{T} \\log \\frac{1}{\\delta}} \\right)$$\n\n6 Empirical Results: Adaptive Methods for Few-Shot & Federated Learning\nA generic GBML method does the following at iteration $t$: (1) initialize a descent method at $\\phi_t$; (2) take gradient steps with learning rate $\\eta$ to get task-parameter $\\hat\\theta_t$; (3) update meta-initialization to $\\phi_{t+1}$. Motivated by Section 4, in Algorithm 2 we outline a generic way of replacing $\\eta$ by a per-coordinate rate learned on-the-fly. This entails keeping track of two quantities: (1) $b_t \\in \\mathbb{R}^d$, a per-coordinate sum over $s < t$ of the squared distances from the initialization $\\phi_s$ to within-task parameter $\\hat\\theta_s$; (2) $g_t \\in \\mathbb{R}^d$, a per-coordinate sum of the squared gradients seen so far. At task $t$ we set $\\eta$ to be the element-wise square root of $b_t / g_t$, allowing multi-task information to inform the trajectory. For example, if along coordinate $j$ the $\\hat\\theta_{t,j}$ is usually not far from initialization then $b_j$ will be small and thus so will $\\eta_j$; then if on a new task we get a high noisy gradient along coordinate $j$ the performance will be less adversely affected because it will be down-weighted by the learning rate. Single-task algorithms such as AdaGrad [21] and Adam [36] also work by reducing the learning rate along frequent directions. 
However, in meta-learning some coordinates may be frequently updated during meta-training because good task-weights vary strongly from the best initialization along them, and thus their gradients should not be downweighted; ARUBA encodes this intuition in the numerator using the distance-traveled per-task along each direction, which increases the learning rate along high-variance directions. We show in Figure 2 that this is realized in practice, as ARUBA assigns a faster rate to deeper layers than to lower-level feature extractors, following standard intuition in parameter-transfer meta-learning. As described in Algorithm 2, we also consider two variants: ARUBA++, which updates the meta-learned learning-rate at meta-test-time in a manner similar to AdaGrad, and Isotropic ARUBA, which only tracks scalar quantities and is thus useful for communication-constrained settings.\n\nOrder | Method | 20-way Omniglot, 1-shot | 20-way Omniglot, 5-shot | 5-way Mini-ImageNet, 1-shot | 5-way Mini-ImageNet, 5-shot\n1st | 1st-Order MAML [23] | 89.4 \u00b1 0.5 | 97.9 \u00b1 0.1 | 48.07 \u00b1 1.75 | 63.15 \u00b1 0.91\n1st | Reptile [44] w. Adam [36] | 89.43 \u00b1 0.14 | 97.12 \u00b1 0.32 | 49.97 \u00b1 0.32 | 65.99 \u00b1 0.58\n1st | Reptile w. ARUBA | 86.67 \u00b1 0.17 | 96.61 \u00b1 0.13 | 50.73 \u00b1 0.32 | 65.69 \u00b1 0.61\n1st | Reptile w. ARUBA++ | 89.66 \u00b1 0.3 | 97.49 \u00b1 0.28 | 50.35 \u00b1 0.74 | 65.89 \u00b1 0.34\n2nd | 2nd-Order MAML | 95.8 \u00b1 0.3 | 98.9 \u00b1 0.2 | 48.7 \u00b1 1.84 | 63.11 \u00b1 0.92\n2nd | Meta-SGD [38] | 95.93 \u00b1 0.38 | 98.97 \u00b1 0.19 | 50.47 \u00b1 1.87 | 64.03 \u00b1 0.94\n\nTable 1: Meta-test-time performance of GBML algorithms on few-shot classification benchmarks. 1st-order and 2nd-order results obtained from Nichol et al. [44] and Li et al. 
[38], respectively.\n\nFew-Shot Classification: We first examine if Algorithm 2 can improve performance on Omniglot [37] and Mini-ImageNet [46], two standard few-shot learning benchmarks, when used to modify Reptile, a simple meta-learning method [44]. 
In its serial form Reptile is roughly the algorithm we study in Section 3 when OGD is used within-task and $\\eta$ is fixed. Thus we can set Reptile+ARUBA to be Algorithm 2 with $\\hat\\theta_t$ the last iterate of OGD and the meta-update a weighted sum of $\\hat\\theta_t$ and $\\phi_t$. In practice, however, Reptile uses Adam [36] to exploit multi-task gradient information. As shown in Table 1, ARUBA matches or exceeds this baseline on Mini-ImageNet, although on Omniglot it requires the additional within-task updating of ARUBA++ to show improvement.\nIt is less clear how ARUBA can be applied to MAML [23], as by only taking one step the distance traveled will be proportional to the gradient, so $\\eta$ will stay fixed. We also do not find that ARUBA improves multi-step MAML, which is perhaps not surprising as it is further removed from our theory due to its use of held-out data. In Table 1 we compare to Meta-SGD [38], which does learn a per-coordinate learning rate for MAML by automatic differentiation. This requires more computation but does lead to consistent improvement. As with the original Reptile, our modification performs better on Mini-ImageNet but worse on Omniglot compared to MAML and its modification Meta-SGD.\nFederated Learning: A main goal in this setting is to use data on heterogeneous nodes to learn a global model without much communication; leveraging this to get a personalized model is an auxiliary goal [50], with a common application being next-character prediction on mobile devices. A popular method is FedAvg [41], where at each communication round $r$ the server sends a global model $\\phi_r$ to a batch of nodes, which then run local OGD; the server then sets $\\phi_{r+1}$ to the average of the returned models. This can be seen as a GBML method with each node a task, making it easy to apply ARUBA: each node simply sends its accumulated squared gradients to the server together with its model. 
The server can use this information and the squared difference between $\\phi_r$ and $\\phi_{r+1}$ to compute a learning rate $\\eta_{r+1}$ via Algorithm 2 and send it to each node in the next round. We use FedAvg with ARUBA to train a character LSTM [29] on the Shakespeare dataset, a standard benchmark of a thousand users with varying amounts of non-i.i.d. data [41, 12]. Figure 3 shows that ARUBA significantly improves over non-tuned FedAvg and matches the performance of FedAvg with a tuned learning rate schedule. Unlike both baselines we also do not require step-size tuning when refining the global model for personalization. This reduced need for hyperparameter optimization is crucial in federated settings, where the number of user-data accesses is extremely limited.\n7 Conclusion\nIn this paper we introduced ARUBA, a framework for analyzing GBML that is both flexible and consequential, yielding new guarantees for adaptive, dynamic, and statistical LTL via online learning. As a result we devised a novel per-coordinate learning rate applicable to generic GBML procedures, improving their training and meta-test-time performance on few-shot and federated learning. We see great potential for applying ARUBA to derive many other new LTL methods in a similar manner.\n\nAcknowledgments\nWe thank Jeremy Cohen, Travis Dick, Nikunj Saunshi, Dravyansh Sharma, Ellen Vitercik, and our three anonymous reviewers for helpful feedback. This work was supported in part by DARPA FA875017C0141, National Science Foundation grants CCF-1535967, CCF-1910321, IIS-1618714, IIS-1705121, IIS-1838017, and IIS-1901403, a Microsoft Research Faculty Fellowship, a Bloomberg Data Science research grant, an Amazon Research Award, an Amazon Web Services Award, an Okawa Grant, a Google Faculty Award, a JP Morgan AI Research Faculty Award, and a Carnegie Bosch Institute Research Award. 
Any opinions, findings and conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, the National Science Foundation, or any other funding agency.\n\nReferences\n[1] Maruan Al-Shedivat, Trapit Bansal, Yura Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. In Proceedings of the 6th International Conference on Learning Representations, 2018.\n\n[2] Pierre Alquier, The Tien Mai, and Massimiliano Pontil. Regret bounds for lifelong learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.\n\n[3] Ron Amit and Ron Meir. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In Proceedings of the 35th International Conference on Machine Learning, 2018.\n\n[4] Kazuoki Azuma. Weighted sums of certain dependent random variables. T\u00f4hoku Mathematical Journal, 19:357\u2013367, 1967.\n\n[5] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Efficient representations for lifelong learning and autoencoding. In Proceedings of the Conference on Learning Theory, 2015.\n\n[6] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705\u20131749, 2005.\n\n[7] Peter L. Bartlett, Elad Hazan, and Alexander Rakhlin. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, 2008.\n\n[8] Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149\u2013198, 2000.\n\n[9] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[10] Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. 
USSR Computational Mathematics and Mathematical Physics, 7:200\u2013217, 1967.\n\n[11] Brian Bullins, Elad Hazan, Adam Kalai, and Roi Livni. Generalize across tasks: Efficient algorithms for linear representation learning. In Proceedings of the 30th International Conference on Algorithmic Learning Theory, 2019.\n\n[12] Sebastian Caldas, Peter Wu, Tian Li, Jakub Kone\u010dn\u00fd, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. LEAF: A benchmark for federated settings. arXiv, 2018.\n\n[13] Nicol\u00f3 Cesa-Bianchi and Claudio Gentile. Improved risk tail bounds for on-line algorithms. In Advances in Neural Information Processing Systems, 2005.\n\n[14] Nicol\u00f3 Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050\u20132057, 2004.\n\n[15] Nicol\u00f3 Cesa-Bianchi, Pierre Gaillard, Gabor Lugosi, and Gilles Stoltz. A new look at shifting regret. HAL, 2012.\n\n[16] Fei Chen, Zhenhua Dong, Zhenguo Li, and Xiuqiang He. Federated meta-learning for recommendation. arXiv, 2018.\n\n[17] Chandler Davis. Notions generalizing convexity for functions defined on spaces of matrices. In Proceedings of Symposia in Pure Mathematics, 1963.\n\n[18] Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Incremental learning-to-learn with statistical guarantees. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2018.\n\n[19] Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. arXiv, 2019.\n\n[20] Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in Neural Information Processing Systems, 2017.\n\n[21] John Duchi, Elad Hazan, and Yoram Singer. 
Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121\u20132159, 2011.\n\n[22] Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. In Proceedings of the 6th International Conference on Learning Representations, 2018.\n\n[23] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.\n\n[24] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergei Levine. Online meta-learning. In Proceedings of the 36th International Conference on Machine Learning, 2019. To Appear.\n\n[25] Eric C. Hall and Rebecca M. Willet. Online optimization in dynamic environments. arXiv, 2016.\n\n[26] Elad Hazan. Introduction to online convex optimization. In Foundations and Trends in Optimization, volume 2, pages 157\u2013325. now Publishers Inc., 2015.\n\n[27] Elad Hazan and C. Seshadri. Efficient learning algorithms for changing environments. In Proceedings of the 26th International Conference on Machine Learning, 2009.\n\n[28] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169\u2013192, 2007.\n\n[29] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735\u20131780, 1997.\n\n[30] Ali Jadbabaie, Alexander Rakhlin, and Shahin Shahrampour. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.\n\n[31] Ghassen Jerfel, Erin Grant, Thomas L. Griffiths, and Katherine Heller. Online gradient-based mixtures for transfer modulation in meta-learning. arXiv, 2018.\n\n[32] Sham Kakade and Shai Shalev-Shwartz. 
Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems, 2008.\n\n[33] Sham Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems, 2008.\n\n[34] Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. In Proceedings of the 36th International Conference on Machine Learning, 2019. To Appear.\n\n[35] Jaehong Kim, Sangyeul Lee, Sungwan Kim, Moonsu Cha, Jung Kwon Lee, Youngduck Choi, Yongseok Choi, Dong-Yeon Choi, and Jiwon Kim. Auto-Meta: Automated gradient based meta learner search. arXiv, 2018.\n\n[36] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015.\n\n[37] Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum. One shot learning of simple visual concepts. In Proceedings of the Conference of the Cognitive Science Society (CogSci), 2017.\n\n[38] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv, 2017.\n\n[39] Elliott H. Lieb. Convex trace functions and the Wigner-Yanase-Dyson conjecture. Advances in Mathematics, 11:267\u2013288, 1973.\n\n[40] H. Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the Conference on Learning Theory, 2010.\n\n[41] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.\n\n[42] Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, and Alejandro Ribeiro. 
Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In Proceedings of the 55th IEEE Conference on Decision and Control, 2016.\n\n[43] Ken-ichiro Moridomi, Kohei Hatano, and Eiji Takimoto. Online linear optimization with the log-determinant regularizer. IEICE Transactions on Information and Systems, E101-D(6):1511\u20131520, 2018.\n\n[44] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv, 2018.\n\n[45] Anastasia Pentina and Christoph H. Lampert. A PAC-Bayesian bound for lifelong learning. In Proceedings of the 31st International Conference on Machine Learning, 2014.\n\n[46] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In Proceedings of the 5th International Conference on Learning Representations, 2017.\n\n[47] Ankan Saha, Prateek Jain, and Ambuj Tewari. The interplay between stability and regret in online learning. arXiv, 2012.\n\n[48] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107\u2013194, 2011.\n\n[49] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11, 2010.\n\n[50] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems, 2017.\n\n[51] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.\n\n[52] Sebastian Thrun and Lorien Pratt. Learning to Learn. Springer Science & Business Media, 1998.\n\n[53] Lijun Zhang, Tianbao Yang, Jinfeng Yi, Rong Jin, and Zhi-Hua Zhou. Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems, 2017.\n\n[54] Tong Zhang. 
Data dependent concentration bounds for sequential prediction algorithms. In Proceedings of the International Conference on Learning Theory, 2005.\n\n[55] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 2003.", "award": [], "sourceid": 3181, "authors": [{"given_name": "Mikhail", "family_name": "Khodak", "institution": "CMU"}, {"given_name": "Maria-Florina", "family_name": "Balcan", "institution": "Carnegie Mellon University"}, {"given_name": "Ameet", "family_name": "Talwalkar", "institution": "CMU"}]}