{"title": "Meta-Learning with Implicit Gradients", "book": "Advances in Neural Information Processing Systems", "page_first": 113, "page_last": 124, "abstract": "A core capability of intelligent systems is the ability to quickly learn new tasks by drawing on prior experience. Gradient (or optimization) based meta-learning has recently emerged as an effective approach for few-shot learning. In this formulation, meta-parameters are learned in the outer loop, while task-specific models are learned in the inner-loop, by using only a small amount of data from the current task. A key challenge in scaling these approaches is the need to differentiate through the inner loop learning process, which can impose considerable computational and memory burdens. By drawing upon implicit differentiation, we develop the implicit MAML algorithm, which depends only on the solution to the inner level optimization and not the path taken by the inner loop optimizer. This effectively decouples the meta-gradient computation from the choice of inner loop optimizer. As a result, our approach is agnostic to the choice of inner loop optimizer and can gracefully handle many gradient steps without vanishing gradients or memory constraints. Theoretically, we prove that implicit MAML can compute accurate meta-gradients with a memory footprint that is, up to small constant factors, no more than that which is required to compute a single inner loop gradient and at no overall increase in the total computational cost. Experimentally, we show that these benefits of implicit MAML translate into empirical gains on few-shot image recognition benchmarks.", "full_text": "Meta-Learning with Implicit Gradients\n\nAravind Rajeswaran\u2217\nUniversity of Washington\n\nChelsea Finn\u2217\n\nUniversity of California Berkeley\n\naravraj@cs.washington.edu\n\ncbfinn@cs.stanford.edu\n\nSham M. 
Kakade\nUniversity of Washington\nsham@cs.washington.edu\n\nSergey Levine\nUniversity of California Berkeley\nsvlevine@eecs.berkeley.edu\n\nAbstract\n\nA core capability of intelligent systems is the ability to quickly learn new tasks by drawing on prior experience. Gradient (or optimization) based meta-learning has recently emerged as an effective approach for few-shot learning. In this formulation, meta-parameters are learned in the outer loop, while task-specific models are learned in the inner loop, by using only a small amount of data from the current task. A key challenge in scaling these approaches is the need to differentiate through the inner loop learning process, which can impose considerable computational and memory burdens. By drawing upon implicit differentiation, we develop the implicit MAML algorithm, which depends only on the solution to the inner level optimization and not the path taken by the inner loop optimizer. This effectively decouples the meta-gradient computation from the choice of inner loop optimizer. As a result, our approach is agnostic to the choice of inner loop optimizer and can gracefully handle many gradient steps without vanishing gradients or memory constraints. Theoretically, we prove that implicit MAML can compute accurate meta-gradients with a memory footprint no more than that which is required to compute a single inner loop gradient and at no overall increase in the total computational cost. Experimentally, we show that these benefits of implicit MAML translate into empirical gains on few-shot image recognition benchmarks.\n\n1 Introduction\n\nA core aspect of intelligence is the ability to quickly learn new tasks by drawing upon prior experience from related tasks. 
Recent work has studied how meta-learning algorithms [51, 55, 41] can acquire such a capability by learning to efficiently learn a range of tasks, thereby enabling learning of a new task with as little as a single example [50, 57, 15]. Meta-learning algorithms can be framed in terms of recurrent [25, 50, 48] or attention-based [57, 38] models that are trained via a meta-learning objective, to essentially encapsulate the learned learning procedure in the parameters of a neural network. An alternative formulation is to frame meta-learning as a bi-level optimization procedure [35, 15], where the \u201cinner\u201d optimization represents adaptation to a given task, and the \u201couter\u201d objective is the meta-training objective. Such a formulation can be used to learn the initial parameters of a model such that optimizing from this initialization leads to fast adaptation and generalization. In this work, we focus on this class of optimization-based methods, and in particular the model-agnostic meta-learning (MAML) formulation [15]. MAML has been shown to be as expressive as black-box approaches [14], is applicable to a broad range of settings [16, 37, 1, 18], and recovers a convergent and consistent optimization procedure [13].\n\n\u2217Equal Contributions. Project page: http://sites.google.com/view/imaml\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: To compute the meta-gradient \u2211_i dL_i(\u03c6_i)/d\u03b8, the MAML algorithm differentiates through the optimization path, as shown in green, while first-order MAML computes the meta-gradient by approximating d\u03c6_i/d\u03b8 as I. 
Our implicit MAML approach derives an analytic expression for the exact meta-gradient without differentiating through the optimization path, by estimating local curvature.\n\nDespite its appealing properties, meta-learning an initialization requires backpropagation through the inner optimization process. As a result, the meta-learning process requires higher-order derivatives, imposes a non-trivial computational and memory burden, and can suffer from vanishing gradients. These limitations make it harder to scale optimization-based meta-learning methods to tasks involving medium or large datasets, or those that require many inner-loop optimization steps. Our goal is to develop an algorithm that addresses these limitations.\n\nThe main contribution of our work is the development of the implicit MAML (iMAML) algorithm, an approach for optimization-based meta-learning with deep neural networks that removes the need for differentiating through the optimization path. Our algorithm aims to learn a set of parameters such that an optimization algorithm that is initialized at and regularized to this parameter vector leads to good generalization for a variety of learning tasks. By leveraging the implicit differentiation approach, we derive an analytical expression for the meta (or outer level) gradient that depends only on the solution to the inner optimization and not the path taken by the inner optimization algorithm, as depicted in Figure 1. This decoupling of meta-gradient computation and choice of inner level optimizer has a number of appealing properties.\n\nFirst, the inner optimization path need not be stored nor differentiated through, thereby making implicit MAML memory efficient and scalable to a large number of inner optimization steps. Second, implicit MAML is agnostic to the inner optimization method used, as long as it can find an approximate solution to the inner-level optimization problem. 
This permits the use of higher-order methods, and in principle even non-differentiable optimization methods or components like sample-based optimization, line-search, or those provided by proprietary software (e.g. Gurobi). Finally, we also provide the first (to our knowledge) non-asymptotic theoretical analysis of bi-level optimization. We show that an \u03b5-approximate meta-gradient can be computed via implicit MAML using \u02dcO(log(1/\u03b5)) gradient evaluations and \u02dcO(1) memory, meaning the memory required does not grow with the number of gradient steps.\n\n2 Problem Formulation and Notations\n\nWe first present the meta-learning problem in the context of few-shot supervised learning, and then generalize the notation to aid the rest of the exposition in the paper.\n\n2.1 Review of Few-Shot Supervised Learning and MAML\n\nIn this setting, we have a collection of meta-training tasks {T_i}_{i=1}^M drawn from P(T). Each task T_i is associated with a dataset D_i, from which we can sample two disjoint sets: D_i^tr and D_i^test. These datasets each consist of K input-output pairs. Let x \u2208 X and y \u2208 Y denote inputs and outputs, respectively. The datasets take the form D_i^tr = {(x_i^k, y_i^k)}_{k=1}^K, and similarly for D_i^test. 
We are interested in learning models of the form h_\u03c6(x) : X \u2192 Y, parameterized by \u03c6 \u2208 \u03a6 \u2261 R^d. Performance on a task is specified by a loss function, such as the cross entropy or squared error loss. We will write the loss function in the form L(\u03c6, D), as a function of a parameter vector and dataset. The goal for task T_i is to learn task-specific parameters \u03c6_i using D_i^tr such that we can minimize the population or test loss of the task, L(\u03c6_i, D_i^test).\n\nIn the general bi-level meta-learning setup, we consider a space of algorithms that compute task-specific parameters using a set of meta-parameters \u03b8 \u2208 \u0398 \u2261 R^d and the training dataset from the task, such that \u03c6_i = Alg(\u03b8, D_i^tr) for task T_i. The goal of meta-learning is to learn meta-parameters that produce good task-specific parameters after adaptation, as specified below:\n\n\u03b8*_ML := argmin_{\u03b8 \u2208 \u0398} F(\u03b8), where F(\u03b8) = (1/M) \u2211_{i=1}^M L( Alg(\u03b8, D_i^tr), D_i^test ),  (1)\n\nwhere Alg(\u03b8, D_i^tr) constitutes the inner level and F(\u03b8) the outer level. We view this as a bi-level optimization problem since we typically interpret Alg(\u03b8, D_i^tr) as either explicitly or implicitly solving an underlying optimization problem. At meta-test (deployment) time, when presented with a dataset D_j^tr corresponding to a new task T_j \u223c P(T), we can achieve good generalization performance (i.e., low test error) by using the adaptation procedure with the meta-learned parameters as \u03c6_j = Alg(\u03b8*_ML, D_j^tr).\n\nIn the case of MAML [15], Alg(\u03b8, D) corresponds to one or multiple steps of gradient descent initialized at \u03b8. 
For example, if one step of gradient descent is used, we have:\n\n\u03c6_i \u2261 Alg(\u03b8, D_i^tr) = \u03b8 \u2212 \u03b1 \u2207_\u03b8 L(\u03b8, D_i^tr).  (inner level of MAML)  (2)\n\nTypically, \u03b1 is a scalar hyperparameter, but can also be a learned vector [34]. Hence, for MAML, the meta-learned parameter (\u03b8*_ML) has a learned inductive bias that is particularly well-suited for fine-tuning on tasks from P(T) using K samples. To solve the outer-level problem with gradient-based methods, we require a way to differentiate through Alg. In the case of MAML, this corresponds to backpropagating through the dynamics of gradient descent.\n\n2.2 Proximal Regularization in the Inner Level\n\nTo have sufficient learning in the inner level while also avoiding over-fitting, Alg needs to incorporate some form of regularization. Since MAML uses a small number of gradient steps, this corresponds to early stopping and can be interpreted as a form of regularization and Bayesian prior [20]. In cases like ill-conditioned optimization landscapes and medium-shot learning, we may want to take many gradient steps, which poses two challenges for MAML. First, we need to store and differentiate through the long optimization path of Alg, which imposes a considerable computation and memory burden. Second, the dependence of the model-parameters {\u03c6_i} on the meta-parameters (\u03b8) shrinks and vanishes as the number of gradient steps in Alg grows, making meta-learning difficult. To overcome these limitations, we consider a more explicitly regularized algorithm:\n\nAlg\u22c6(\u03b8, D_i^tr) = argmin_{\u03c6' \u2208 \u03a6} L(\u03c6', D_i^tr) + (\u03bb/2) ||\u03c6' \u2212 \u03b8||^2.  (3)\n\nThe proximal regularization term in Eq. 3 encourages \u03c6_i to remain close to \u03b8, thereby retaining a strong dependence throughout. 
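As a concrete numerical sketch of the regularized inner problem in Eq. 3, the following solves it with plain gradient descent on a toy quadratic task loss. The loss, step size, and iteration count are hypothetical choices made only for this illustration; any inner optimizer that approximately minimizes the regularized objective would serve.

```python
import numpy as np

def inner_adapt(theta, grad_loss, lam=2.0, lr=0.05, steps=500):
    """Approximately solve Eq. 3: argmin_phi L(phi) + (lam/2)||phi - theta||^2
    by gradient descent on the regularized objective."""
    phi = theta.copy()
    for _ in range(steps):
        g = grad_loss(phi) + lam * (phi - theta)  # gradient of the regularized objective
        phi -= lr * g
    return phi

# Hypothetical task loss L(phi) = 0.5 phi^T A phi - b^T phi (stands in for a training loss)
A = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
grad_loss = lambda phi: A @ phi - b

theta = np.zeros(2)
phi = inner_adapt(theta, grad_loss)
# For this quadratic, the regularized minimizer has the closed form (A + lam*I)^{-1} (b + lam*theta)
phi_star = np.linalg.solve(A + 2.0 * np.eye(2), b + 2.0 * theta)
```

Because the proximal term is strongly convex, the iterates contract toward the unique regularized minimizer regardless of how ill-conditioned the task loss is on its own.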
The regularization strength (\u03bb) plays a role similar to the learning rate (\u03b1) in MAML, controlling the strength of the prior (\u03b8) relative to the data (D_i^tr). Like \u03b1, the regularization strength \u03bb may also be learned. Furthermore, both \u03b1 and \u03bb can be scalars, vectors, or full matrices. For simplicity, we treat \u03bb as a scalar hyperparameter. In Eq. 3, we use \u22c6 to denote that the optimization problem is solved exactly. In practice, we use iterative algorithms (denoted by Alg) for finite iterations, which return approximate minimizers. We explicitly consider the discrepancy between approximate and exact solutions in our analysis.\n\n2.3 The Bi-Level Optimization Problem\n\nFor notational convenience, we will sometimes express the dependence on task T_i using a subscript instead of arguments, e.g. we write:\n\nL_i(\u03c6) := L(\u03c6, D_i^test),   \u02c6L_i(\u03c6) := L(\u03c6, D_i^tr),   Alg_i(\u03b8) := Alg(\u03b8, D_i^tr).\n\nWith this notation, the bi-level meta-learning problem can be written more generally as:\n\n\u03b8*_ML := argmin_{\u03b8 \u2208 \u0398} F(\u03b8), where F(\u03b8) = (1/M) \u2211_{i=1}^M L_i( Alg\u22c6_i(\u03b8) ), and\n\nAlg\u22c6_i(\u03b8) := argmin_{\u03c6' \u2208 \u03a6} G_i(\u03c6', \u03b8), where G_i(\u03c6', \u03b8) = \u02c6L_i(\u03c6') + (\u03bb/2) ||\u03c6' \u2212 \u03b8||^2.  (4)\n\n2.4 Total and Partial Derivatives\n\nWe use d to denote the total derivative and \u2207 to denote the partial derivative. For a nested function of the
For nested function of the\nform Li(\u03c6i) where \u03c6i = Algi(\u03b8), we have from chain rule\n\nd\u03b8Li(Algi(\u03b8)) =\n\n\u2207\u03c6Li(\u03c6) |\u03c6=Algi(\u03b8) =\n\ndAlgi(\u03b8)\n\nd\u03b8\n\ndAlgi(\u03b8)\n\nd\u03b8\n\n\u2207\u03c6Li(Algi(\u03b8))\n\nNote the important distinction between d\u03b8Li(Algi(\u03b8)) and \u2207\u03c6Li(Algi(\u03b8)). The former passes\nderivatives through Algi(\u03b8) while the latter does not. \u2207\u03c6Li(Algi(\u03b8)) is simply the gradient func-\ntion, i.e. \u2207\u03c6Li(\u03c6), evaluated at \u03c6 = Algi(\u03b8). Also note that d\u03b8Li(Algi(\u03b8)) and \u2207\u03c6Li(Algi(\u03b8))\nare d\u2013dimensional vectors, while dAlgi(\u03b8)\nis a (d \u00d7 d)\u2013size Jacobian matrix. Throughout this text,\nd\u03b8 interchangeably.\nwe will also use d\u03b8 and d\n\nd\u03b8\n\n3 The Implicit MAML Algorithm\n\nOur aim is to solve the bi-level meta-learning problem in Eq. 4 using an iterative gradient based\nalgorithm of the form \u03b8 \u2190 \u03b8 \u2212 \u03b7 d\u03b8F (\u03b8). Although we derive our method based on standard\ngradient descent for simplicity, any other optimization method, such as quasi-Newton or Newton\nmethods, Adam [28], or gradient descent with momentum can also be used without modi\ufb01cation.\nThe gradient descent update be expanded using the chain rule as\n\ni (\u03b8)\n\n\u2207\u03c6Li(Alg(cid:63)\n\ni (\u03b8)).\n\n(5)\n\nM(cid:88)\n\n\u03b8 \u2190 \u03b8 \u2212 \u03b7\n\ndAlg(cid:63)\nd\u03b8\ni (\u03b8)) is simply \u2207\u03c6Li(\u03c6) |\u03c6=Alg(cid:63)\n\n1\nM\n\ni=1\n\nHere, \u2207\u03c6Li(Alg(cid:63)\ni (\u03b8) which can be easily obtained in practice via\nautomatic differentiation. For this update rule, we must compute dAlg(cid:63)\ni is implicitly\nde\ufb01ned as an optimization problem (Eq. 4), which presents the primary challenge. 
We now present an efficient algorithm (in compute and memory) to compute the meta-gradient.\n\n3.1 Meta-Gradient Computation\n\nIf Alg\u22c6_i(\u03b8) is implemented as an iterative algorithm, such as gradient descent, then one way to compute dAlg\u22c6_i(\u03b8)/d\u03b8 is to propagate derivatives through the iterative process, either in forward mode or reverse mode. However, this has the drawback of depending explicitly on the path of the optimization, which has to be fully stored in memory, quickly becoming intractable when the number of gradient steps needed is large. Furthermore, for second order optimization methods, such as Newton's method, third derivatives are needed, which are difficult to obtain. Moreover, this approach becomes impossible when non-differentiable operations, such as line-searches, are used. However, by recognizing that Alg\u22c6_i is implicitly defined as the solution to an optimization problem, we may employ a different strategy that does not need to consider the path of the optimization but only the final result. This is derived in the following lemma.\n\nLemma 1. (Implicit Jacobian) Consider Alg\u22c6_i(\u03b8) as defined in Eq. 4 for task T_i. Let \u03c6_i = Alg\u22c6_i(\u03b8) be the result of Alg\u22c6_i(\u03b8). If ( I + (1/\u03bb) \u2207^2_\u03c6 \u02c6L_i(\u03c6_i) ) is invertible, then the derivative Jacobian is\n\ndAlg\u22c6_i(\u03b8)/d\u03b8 = ( I + (1/\u03bb) \u2207^2_\u03c6 \u02c6L_i(\u03c6_i) )^{-1}.  (6)\n\nNote that the derivative (Jacobian) depends only on the final result of the algorithm, and not the path taken by the algorithm. 
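Lemma 1 can be verified on a toy problem where the inner loss is quadratic, so that Alg\u22c6_i has a closed form whose Jacobian can also be estimated by finite differences. The matrices and the evaluation point below are hypothetical illustration values.

```python
import numpy as np

lam = 2.0
A = np.array([[3.0, 1.0], [1.0, 2.0]])    # Hessian of a quadratic inner loss L_hat
b = np.array([1.0, 0.0])

# Exact inner solution of Eq. 4 for this loss: phi*(theta) = (A + lam I)^{-1} (b + lam theta)
Alg_star = lambda theta: np.linalg.solve(A + lam * np.eye(2), b + lam * theta)

# Lemma 1: dAlg*/dtheta = (I + (1/lam) * Hessian)^{-1}
J_implicit = np.linalg.inv(np.eye(2) + A / lam)

# Finite-difference Jacobian of Alg* for comparison (columns = directional derivatives)
eps = 1e-6
theta = np.array([0.3, -0.7])
J_fd = np.stack([(Alg_star(theta + eps * e) - Alg_star(theta - eps * e)) / (2 * eps)
                 for e in np.eye(2)], axis=1)
```

For the quadratic case the two Jacobians coincide because (I + A/\u03bb)^{-1} = \u03bb(A + \u03bbI)^{-1}, and, as the lemma states, neither depends on how the inner minimizer was reached.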
Thus, in principle, any optimization algorithm can be used to compute Alg\u22c6_i(\u03b8), thereby decoupling meta-gradient computation from the choice of inner level optimizer.\n\nPractical Algorithm: While Lemma 1 provides an idealized way to compute the Alg\u22c6_i Jacobians, and thus by extension the meta-gradient, it may be difficult to use it directly in practice. Two issues are particularly relevant. First, the meta-gradients require computation of Alg\u22c6_i(\u03b8), which is the exact solution to the inner optimization problem. In practice, we may be able to obtain only approximate solutions. Second, explicitly forming and inverting the matrix in Eq. 6 for computing the Jacobian may be intractable for large deep neural networks.\n\nAlgorithm 1 Implicit Model-Agnostic Meta-Learning (iMAML)\n1: Require: Distribution over tasks P(T), outer step size \u03b7, regularization strength \u03bb\n2: while not converged do\n3:   Sample mini-batch of tasks {T_i}_{i=1}^B \u223c P(T)\n4:   for each task T_i do\n5:     Compute task meta-gradient g_i = Implicit-Meta-Gradient(T_i, \u03b8, \u03bb)\n6:   end for\n7:   Average the above gradients to get \u02c6\u2207F(\u03b8) = (1/B) \u2211_{i=1}^B g_i\n8:   Update meta-parameters with gradient descent: \u03b8 \u2190 \u03b8 \u2212 \u03b7 \u02c6\u2207F(\u03b8)  // (or Adam)\n9: end while\n\nAlgorithm 2 Implicit Meta-Gradient Computation\n1: Input: Task T_i, meta-parameters \u03b8, regularization strength \u03bb\n2: Hyperparameters: Optimization accuracy thresholds \u03b4 and \u03b4'\n3: Obtain task parameters \u03c6_i using an iterative optimization solver such that: ||\u03c6_i \u2212 Alg\u22c6_i(\u03b8)|| \u2264 \u03b4\n4: Compute the partial outer-level gradient v_i = \u2207_\u03c6 L_i(\u03c6_i)\n5: Use an iterative solver (e.g. CG) along with reverse mode differentiation (to compute Hessian-vector products) to compute g_i such that: ||g_i \u2212 ( I + (1/\u03bb) \u2207^2 \u02c6L_i(\u03c6_i) )^{-1} v_i|| \u2264 \u03b4'\n6: Return: g_i\n\n
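The pieces of Algorithm 1 can be sketched end-to-end on a synthetic task family. The quadratic task losses, their closed-form inner solution, the task centers, and the step sizes below are all hypothetical choices for illustration; in practice both the inner solve and the linear solve in step 5 of Algorithm 2 would be iterative and approximate.

```python
import numpy as np

lam, eta = 2.0, 0.5
# Hypothetical tasks: train and test loss for task c is 0.5||phi - c||^2
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]

def inner_solve(theta, c):
    # Eq. 3 with L_hat(phi) = 0.5||phi - c||^2 has this closed form;
    # in general an iterative solver returns a delta-accurate phi_i instead
    return (c + lam * theta) / (1.0 + lam)

def implicit_meta_grad(theta, c):
    phi = inner_solve(theta, c)                     # step 3 of Algorithm 2
    v = phi - c                                     # step 4: gradient of the test loss
    # step 5: (I + Hessian/lam)^{-1} v, with Hessian = I for this toy loss
    return np.linalg.solve(np.eye(2) + np.eye(2) / lam, v)

theta = np.zeros(2)
for _ in range(200):                                # outer loop of Algorithm 1
    grads = [implicit_meta_grad(theta, c) for c in centers]
    theta -= eta * np.mean(grads, axis=0)
# theta should approach the mean of the task centers, here (1, 1)
```

For this family the meta-objective is minimized at the mean of the task centers, so the outer iterates converging there is a simple end-to-end check that the implicit meta-gradients point the right way.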
To address these difficulties, we consider approximations to the idealized approach that enable a practical algorithm. First, we consider an approximate solution to the inner optimization problem, which can be obtained with iterative optimization algorithms like gradient descent.\n\nDefinition 1. (\u03b4-approximate algorithm) Let Alg_i(\u03b8) be a \u03b4-accurate approximation of Alg\u22c6_i(\u03b8), i.e.\n\n||Alg_i(\u03b8) \u2212 Alg\u22c6_i(\u03b8)|| \u2264 \u03b4.\n\nSecond, we will perform a partial or approximate matrix inversion given by:\n\nDefinition 2. (\u03b4'-approximate Jacobian-vector product) Let g_i be a vector such that\n\n||g_i \u2212 ( I + (1/\u03bb) \u2207^2_\u03c6 \u02c6L_i(\u03c6_i) )^{-1} \u2207_\u03c6 L_i(\u03c6_i)|| \u2264 \u03b4',\n\nwhere \u03c6_i = Alg_i(\u03b8) and Alg_i is based on Definition 1.\n\nNote that g_i in Definition 2 is an approximation of the meta-gradient for task T_i. Observe that g_i can be obtained as an approximate solution to the optimization problem:\n\nmin_w (1/2) w^T ( I + (1/\u03bb) \u2207^2_\u03c6 \u02c6L_i(\u03c6_i) ) w \u2212 w^T \u2207_\u03c6 L_i(\u03c6_i).  (7)\n\nThe conjugate gradient (CG) algorithm is particularly well suited for this problem due to its excellent iteration complexity and requirement of only Hessian-vector products of the form \u2207^2 \u02c6L_i(\u03c6_i) v. Such Hessian-vector products can be obtained cheaply without explicitly forming or storing the Hessian matrix (as we discuss in Appendix C). This CG-based inversion has been successfully deployed in Hessian-free or Newton-CG methods for deep learning [36, 44] and trust region methods in reinforcement learning [52, 47]. Algorithm 1 presents the full practical algorithm. 
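The CG solve behind Eq. 7 needs only products with (I + (1/\u03bb)\u2207\u00b2L\u0302_i), i.e. one Hessian-vector product plus a scaled copy per iteration. The sketch below uses an explicit 2\u00d72 Hessian purely as a stand-in for the matrix-free Hessian-vector product that reverse-mode autodiff would supply; all numbers are hypothetical.

```python
import numpy as np

def conjugate_gradient(mvp, v, iters=20, tol=1e-10):
    """Solve M w = v for symmetric positive-definite M,
    given only the matrix-vector product u -> M u."""
    w = np.zeros_like(v)
    r = v.copy()                 # residual v - M w
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Mp = mvp(p)
        a = rs / (p @ Mp)
        w += a * p
        r -= a * Mp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w

lam = 2.0
H = np.array([[4.0, 1.0], [1.0, 3.0]])   # stands in for the Hessian at phi_i
v = np.array([1.0, -2.0])                 # stands in for the gradient of L_i at phi_i

hvp = lambda u: H @ u                     # with autodiff this would be a double-backward pass
g = conjugate_gradient(lambda u: u + hvp(u) / lam, v)
```

Only a handful of vectors are kept across iterations, which is why the memory cost of this step stays at a small constant multiple of a single gradient computation.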
Note that the approximations used to develop a practical algorithm introduce errors in the meta-gradient computation. We analyze the impact of these errors in Section 3.2 and show that they are controllable. See Appendix A for how iMAML generalizes prior gradient-optimization-based meta-learning algorithms.\n\n3.2 Theory\n\nIn Section 3.1, we outlined a practical algorithm that makes approximations to the idealized update rule of Eq. 5. Here, we attempt to analyze the impact of these approximations, and also to understand the computation and memory requirements of iMAML. We find that iMAML can match the\n\nTable 1: Compute and memory for computing the meta-gradient when using a \u03b4-accurate Alg_i, and the corresponding approximation error. Our compute time is measured in terms of the number of \u2207\u02c6L_i computations. All results are in \u02dcO(\u00b7) notation, which hides additional log factors; the error bound hides additional problem-dependent Lipschitz and smoothness parameters (see the respective theorem statements). \u03ba \u2265 1 is the condition number for the inner objective G_i (see Equation 4), and D is the diameter of the search space. The notions of error are subtly different: we assume all methods solve the inner optimization to an error level of \u03b4 (as per Definition 1). For our algorithm, the error refers to the \u2113_2 error in the computation of d_\u03b8 L_i(Alg\u22c6_i(\u03b8)). For the other algorithms, the error refers to the \u2113_2 error in the computation of d_\u03b8 L_i(Alg_i(\u03b8)). We use Prop 3.1 of Shaban et al. [53] to provide the guarantee we use. 
See Appendix D for additional discussion.\n\nAlgorithm | Compute | Memory | Error\nMAML (GD + full back-prop) | \u03ba log(D/\u03b4) | Mem(\u2207\u02c6L_i) \u00b7 \u03ba log(D/\u03b4) | 0\nMAML (Nesterov's AGD + full back-prop) | \u221a\u03ba log(D/\u03b4) | Mem(\u2207\u02c6L_i) \u00b7 \u221a\u03ba log(D/\u03b4) | 0\nTruncated back-prop [53] (GD) | \u03ba log(D/\u03b4) | Mem(\u2207\u02c6L_i) \u00b7 \u03ba log(1/\u03b5) | \u03b5\nImplicit MAML (this work) | \u03ba log(D/\u03b4) | Mem(\u2207\u02c6L_i) | \u03b4\n\nminimax computational complexity of backpropagating through the path of the inner optimizer, but is substantially better in terms of memory usage. This work, to our knowledge, also provides the first non-asymptotic result that analyzes the approximation error due to implicit gradients. Theorem 1 provides the computational and memory complexity for obtaining an \u03b5-approximate meta-gradient. We assume L_i is smooth but do not require it to be convex. We assume that G_i in Eq. 4 is strongly convex, which can be made possible by appropriate choice of \u03bb. The key to our analysis is a second order Lipschitz assumption, i.e. \u02c6L_i(\u00b7) is \u03c1-Lipschitz Hessian. This assumption and setting has received considerable attention in recent optimization and deep learning literature [26, 42].\n\nTable 1 summarizes our complexity results and compares with MAML and truncated backpropagation [53] through the path of the inner optimizer. We use \u03ba to denote the condition number of the inner problem induced by G_i (see Equation 4), which can be viewed as a measure of hardness of the inner optimization problem. Mem(\u2207\u02c6L_i) is the memory taken to compute a single derivative \u2207\u02c6L_i. 
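The Mem(\u2207L\u0302_i) accounting rests on Hessian-vector products costing no more than a constant factor over a gradient. A minimal matrix-free sketch, using a symmetric finite difference of gradients as a stand-in for the reverse-mode double-backward pass an autodiff framework would actually use, and checked on a hypothetical quadratic:

```python
import numpy as np

def hvp_fd(grad_fn, phi, u, eps=1e-5):
    """Hessian-vector product via a symmetric finite difference of gradients.
    Memory is O(d): two gradient evaluations, no Hessian is ever formed, and
    the cost does not depend on how the parameters phi were produced."""
    return (grad_fn(phi + eps * u) - grad_fn(phi - eps * u)) / (2 * eps)

# Hypothetical quadratic check: grad of 0.5 phi^T A phi is A phi, so the HVP must equal A u
A = np.array([[2.0, 0.5], [0.5, 1.0]])
grad_fn = lambda phi: A @ phi
phi = np.array([0.3, -0.1])
u = np.array([1.0, 2.0])
hv = hvp_fd(grad_fn, phi, u)
```

For a quadratic the finite difference is exact up to floating-point rounding; for general losses the reverse-mode product is preferred, but the memory picture is the same.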
Under the assumption that Hessian-vector products are computed with the reverse mode of autodifferentiation, we will have that both the compute time and the memory used for computing a Hessian-vector product are within a (universal) constant factor of the compute time and memory used for computing \u2207\u02c6L_i itself (see Appendix C). This allows us to measure the compute time in terms of the number of \u2207\u02c6L_i computations. We refer readers to Appendix D for additional discussion about the algorithms and their trade-offs.\n\nOur main theorem is as follows:\n\nTheorem 1. (Informal Statement; Approximation error in Algorithm 2) Suppose that: L_i(\u00b7) is a B-Lipschitz and L-smooth function; that G_i(\u00b7, \u03b8) (in Eq. 4) is a \u00b5-strongly convex function with condition number \u03ba; that D is the diameter of the search space for \u03c6 in the inner optimization problem (i.e. ||Alg\u22c6_i(\u03b8)|| \u2264 D); and that \u02c6L_i(\u00b7) is \u03c1-Lipschitz Hessian. Let g_i be the task meta-gradient returned by Algorithm 2. For any task i and desired accuracy level \u03b5, Algorithm 2 computes an approximate task-specific meta-gradient with the following guarantee:\n\n||g_i \u2212 d_\u03b8 L_i(Alg\u22c6_i(\u03b8))|| \u2264 \u03b5.\n\nFurthermore, under the assumption that the Hessian-vector products are computed by the reverse mode of autodifferentiation (Assumption 1), Algorithm 2 can be implemented using at most \u02dcO( \u221a\u03ba log( poly(\u03ba, D, B, L, \u03c1, \u00b5, \u03bb) / \u03b5 ) ) gradient computations of \u02c6L_i(\u00b7) and 2 \u00b7 Mem(\u2207\u02c6L_i) memory.\n\nThe formal statement of the theorem and the proof are provided in the appendix. Importantly, the algorithm's memory requirement is equivalent to the memory needed for Hessian-vector products, which is a small constant factor over the memory required for gradient computations, assuming the reverse mode of 
auto-differentiation is used. Finally, based on the above, we also present Corollary 1 in the appendix, which shows that iMAML efficiently finds a stationary point of F(\u00b7), due to iMAML having controllable exact-solve error.\n\n4 Experimental Results and Discussion\n\nIn our experimental evaluation, we aim to answer the following questions empirically: (1) Does the iMAML algorithm asymptotically compute the exact meta-gradient? (2) With finite iterations, does iMAML approximate the meta-gradient more accurately compared to MAML? (3) How do the computation and memory requirements of iMAML compare with MAML? (4) Does iMAML lead to better results in realistic meta-learning problems? We have answered (1)-(3) through our theoretical analysis, and now attempt to validate it through numerical simulations. For (1) and (2), we will use a simple synthetic example for which we can compute the exact meta-gradient and compare against it (exact-solve error, see Definition 3). For (3) and (4), we will use the common few-shot image recognition domains of Omniglot and Mini-ImageNet.\n\nTo study the question of meta-gradient accuracy, Figure 2 considers a synthetic regression example, where the predictions are linear in parameters. This provides an analytical expression for Alg\u22c6_i, allowing us to compute the true meta-gradient. We fix gradient descent (GD) to be the inner optimizer for both MAML and iMAML. The problem is constructed so that the condition number (\u03ba) is large, thereby necessitating many GD steps. We find that both iMAML and MAML asymptotically match the exact meta-gradient, but iMAML computes a better approximation in finite iterations. We observe that with 2 CG iterations, iMAML incurs a small terminal error. This is consistent with our theoretical analysis. In Algorithm 2, \u03b4 is dominated by \u03b4' when only a small number of CG steps are used. 
However, the terminal error vanishes with just 5 CG steps. The computational cost of 1 CG step is comparable to 1 inner GD step with the MAML algorithm, since both require one Hessian-vector product (see Appendix C for discussion). Thus, the computational cost as well as the memory of iMAML with 100 inner GD steps is significantly smaller than that of MAML with 100 GD steps.\n\nTo study (3), we turn to the Omniglot dataset [30], which is a popular few-shot image recognition domain. Figure 2 presents the compute and memory trade-offs for MAML and iMAML (on 20-way, 5-shot Omniglot). Memory for iMAML is based on Hessian-vector products and is independent of the number of GD steps in the inner loop. The memory use is also independent of the number of CG iterations, since the intermediate computations need not be stored in memory. On the other hand, memory for MAML grows linearly in gradient steps, reaching the capacity of a 12 GB GPU in approximately 16 steps. First-order MAML (FOMAML) does not back-propagate through the optimization process, and thus its computational cost is only that of performing gradient descent, which is needed for all the algorithms. The computational cost for iMAML is also similar to FOMAML, along with a constant overhead for CG that depends on the number of CG steps. Note, however, that FOMAML does not compute an accurate meta-gradient, since it ignores the Jacobian. Compared to FOMAML, the compute cost of MAML grows at a faster rate. FOMAML requires only gradient computations, while backpropagating through GD (as done in MAML) requires a Hessian-vector product at each iteration, which is more expensive.\n\nFinally, we study the empirical performance of iMAML on the Omniglot and Mini-ImageNet domains. Following the few-shot learning protocol in prior work [57], we run the iMAML algorithm on the\n\nFigure 2: Accuracy, Computation, and Memory tradeoffs of iMAML, MAML, and FOMAML. (a) Meta-gradient accuracy level in synthetic example. 
Computed gradients are compared against the exact meta-gradient per Def 3. (b) Computation and memory trade-offs with a 4-layer CNN on the 20-way-5-shot Omniglot task. We implemented iMAML in PyTorch, and for an apples-to-apples comparison, we use a PyTorch implementation of MAML from: https://github.com/dragen1860/MAML-Pytorch\n\nTable 2: Omniglot results. MAML results are taken from the original work of Finn et al. [15], and first-order MAML and Reptile results are from Nichol et al. [43]. iMAML with gradient descent (GD) uses 16 and 25 steps for 5-way and 20-way tasks respectively. iMAML with Hessian-free uses 5 CG steps to compute the search direction and performs line-search to pick the step size. Both versions of iMAML use \u03bb = 2.0 for regularization, and 5 CG steps to compute the task meta-gradient.\n\nAlgorithm | 5-way 1-shot | 5-way 5-shot | 20-way 1-shot | 20-way 5-shot\nMAML [15] | 98.7 \u00b1 0.4% | 99.9 \u00b1 0.1% | 95.8 \u00b1 0.3% | 98.9 \u00b1 0.2%\nfirst-order MAML [15] | 98.3 \u00b1 0.5% | 99.2 \u00b1 0.2% | 89.4 \u00b1 0.5% | 97.9 \u00b1 0.1%\nReptile [43] | 97.68 \u00b1 0.04% | 99.48 \u00b1 0.06% | 89.43 \u00b1 0.14% | 97.12 \u00b1 0.32%\niMAML, GD (ours) | 99.16 \u00b1 0.35% | 99.67 \u00b1 0.12% | 94.46 \u00b1 0.42% | 98.69 \u00b1 0.1%\niMAML, Hessian-Free (ours) | 99.50 \u00b1 0.26% | 99.74 \u00b1 0.11% | 96.18 \u00b1 0.36% | 99.14 \u00b1 0.1%\n\ndataset for different numbers of class labels and shots (in the N-way, K-shot setting), and compare two variants of iMAML with published results of the most closely related algorithms: MAML, FOMAML, and Reptile. While these methods are not state-of-the-art on this benchmark, they provide an apples-to-apples comparison for studying the use of implicit gradients in optimization-based meta-learning. For a fair comparison, we use the identical convolutional architecture as these prior works. 
Note, however, that architecture tuning can lead to better results for all algorithms [27].

The first variant of iMAML we consider involves solving the inner-level problem (the regularized objective function in Eq. 4) using gradient descent. The meta-gradient is computed using conjugate gradient (CG), and the meta-parameters are updated using Adam. This presents the most straightforward comparison with MAML, which would follow a similar procedure, but backpropagate through the path of optimization as opposed to invoking implicit differentiation. The second variant of iMAML uses a second-order method for the inner-level problem. In particular, we consider the Hessian-free (or Newton-CG) method [44, 36]. This method makes a local quadratic approximation to the objective function (in our case, G(φ′, θ)) and approximately computes the Newton search direction using CG. Since CG requires only Hessian-vector products, this way of approximating the Newton search direction is scalable to large deep neural networks. The step size can be computed using regularization, damping, a trust region, or line search. We use a line search on the training loss in our experiments to also illustrate how our method can handle non-differentiable inner optimization loops. We refer the reader to Nocedal & Wright [44] and Martens [36] for a more detailed exposition of this optimization algorithm. Similar approaches have also gained prominence in reinforcement learning [52, 47].

Tables 2 and 3 present the results on Omniglot and Mini-ImageNet, respectively. On the Omniglot domain, we find that the GD version of iMAML is competitive with the full MAML algorithm, and substantially better than its approximations (i.e., first-order MAML and Reptile), especially for the harder 20-way tasks.
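To make the first variant concrete: for the ℓ2-regularized inner problem, the implicit meta-gradient reduces to solving a linear system of the form (I + ∇²L̂/λ) x = ∇L at the inner solution, which CG can do using Hessian-vector products alone. The following is a minimal NumPy sketch on a toy problem; the explicit matrix H stands in for the Hessian-vector products one would compute by automatic differentiation, and all names are our own illustrative choices:

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=25, tol=1e-12):
    # Standard CG for A x = b with SPD A, accessed only through matvec.
    # Each iteration costs one matrix-vector product and O(dim) memory.
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(1)
d, lam = 5, 2.0                      # lambda = 2.0, as in Table 2
M = rng.standard_normal((d, d))
H = M @ M.T                          # Hessian of inner loss at the solution
g_test = rng.standard_normal(d)      # gradient of the test loss at phi*

# Implicit meta-gradient: solve (I + H / lam) x = g_test with CG.
# Only Hessian-vector products H @ v are needed, never H itself.
meta_grad = conjugate_gradient(lambda v: v + (H @ v) / lam, g_test)
```

For a deep network, the lambda `matvec` would be replaced by an autodiff Hessian-vector product on the inner-loop training loss, keeping memory independent of both the number of inner GD steps and the number of CG iterations.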
We also find that iMAML with Hessian-free optimization performs substantially better than the other methods, suggesting that powerful optimizers in the inner loop can offer benefits to meta-learning. In the Mini-ImageNet domain, we find that iMAML performs better than MAML and FOMAML. We used λ = 0.5 and 10 gradient steps in the inner loop. We did not perform an extensive hyperparameter sweep, and expect that the results can improve with better hyperparameters. 5 CG steps were used to compute the meta-gradient. The Hessian-free version also uses 5 CG steps for the search direction. Additional experimental details are in Appendix F.

Table 3: Mini-ImageNet 5-way 1-shot accuracy

Algorithm            | 5-way 1-shot
MAML                 | 48.70 ± 1.84 %
first-order MAML     | 48.07 ± 1.75 %
Reptile              | 49.97 ± 0.32 %
iMAML GD (ours)      | 48.96 ± 1.84 %
iMAML HF (ours)      | 49.30 ± 1.88 %

5 Related Work

Our work considers the general meta-learning problem [51, 55, 41], including few-shot learning [30, 57]. Meta-learning approaches can generally be categorized into metric-learning approaches that learn an embedding space where non-parametric nearest neighbors works well [29, 57, 54, 45, 3], black-box approaches that train a recurrent or recursive neural network to take datapoints as input and produce weight updates [25, 5, 33, 48] or predictions for new inputs [50, 12, 58, 40, 38], and optimization-based approaches that use bi-level optimization to embed learning procedures, such as gradient descent, into the meta-optimization problem [15, 13, 8, 60, 34, 17, 59, 23]. Hybrid approaches have also been considered to combine the benefits of different approaches [49, 56].
We build upon optimization-based approaches, particularly the MAML algorithm [15], which meta-learns an initial set of parameters such that gradient-based fine-tuning leads to good generalization. Prior work has considered a number of inner loops, ranging from a very general setting where all parameters are adapted using gradient descent [15], to more structured and specialized settings, such as ridge regression [8], Bayesian linear regression [23], and simulated annealing [2]. The main difference between our work and these approaches is that we show how to analytically derive the gradient of the outer objective without differentiating through the inner learning procedure.

Mathematically, we view optimization-based meta-learning as a bi-level optimization problem. Such problems have been studied in the context of few-shot meta-learning (as discussed previously), gradient-based hyperparameter optimization [35, 46, 19, 11, 10], and a range of other settings [4, 31]. Some prior works have derived implicit gradients for related problems [46, 11, 4], while others propose innovations to aid back-propagation through the optimization path for specific algorithms [35, 19, 24], or approximations like truncation [53]. While the broad idea of implicit differentiation is well known, it has not been empirically demonstrated in the past for learning more than a few parameters (e.g., hyperparameters), or outside of highly structured settings such as quadratic programs [4]. In contrast, our method meta-trains deep neural networks with thousands of parameters. Closest to our setting is the recent work of Lee et al. [32], which uses implicit differentiation for quadratic programs in a final SVM layer.
In contrast, our formulation allows for adapting the full network for generic objectives (beyond the hinge loss), thereby allowing for wider applications.

We also note that prior works involving implicit differentiation make a strong assumption of an exact solution in the inner level, thereby providing only asymptotic guarantees. In contrast, we provide finite-time guarantees, which allow us to analyze the case where the inner level is solved approximately. In practice, the inner level is likely to be solved using iterative optimization algorithms like gradient descent, which return only approximate solutions within finitely many iterations. Thus, this paper places implicit gradient methods on a strong theoretical footing for practical use.

6 Conclusion

In this paper, we develop a method for optimization-based meta-learning that removes the need for differentiating through the inner optimization path, allowing us to decouple the outer meta-gradient computation from the choice of inner optimization algorithm. We showed how this gives us significant gains in compute and memory efficiency, and also conceptually allows us to use a variety of inner optimization methods. While we focused on developing the foundations and theoretical analysis of this method, we believe that this work opens up a number of interesting avenues for future study.

Broader classes of inner loop procedures. While we studied different gradient-based optimization methods in the inner loop, iMAML can in principle be used with a variety of inner loop algorithms, including dynamic programming methods such as Q-learning, two-player adversarial games such as GANs, energy-based models [39], actor-critic RL methods, and higher-order model-based trajectory optimization methods. This significantly expands the kinds of problems that optimization-based meta-learning can be applied to.

More flexible regularizers.
We explored one very simple regularizer, ℓ2 regularization to the parameter initialization, which already increases the expressive power over the implicit regularization that MAML provides through truncated gradient descent. To further allow the model to flexibly regularize the inner optimization, a simple extension of iMAML is to learn a vector- or matrix-valued λ, which would enable the meta-learner to co-adapt and co-regularize various parameters of the model. Regularizers that act on parameterized density functions would also enable meta-learning to be effective for few-shot density estimation.

Acknowledgements

Aravind Rajeswaran thanks Emo Todorov for valuable discussions about implicit gradients and potential application domains; Aravind Rajeswaran also thanks Igor Mordatch and Rahul Kidambi for helpful discussions and feedback. Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery; Sham Kakade also graciously acknowledges support from ONR award N00014-18-1-2247, NSF Award CCF-1703574, and NSF award CCF-1740551.

References

[1] Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. CoRR, abs/1710.03641, 2017.
[2] Ferran Alet, Tomás Lozano-Pérez, and Leslie P Kaelbling. Modular meta-learning. arXiv preprint arXiv:1806.10166, 2018.
[3] Kelsey R Allen, Evan Shelhamer, Hanul Shin, and Joshua B Tenenbaum. Infinite mixture prototypes for few-shot learning. arXiv preprint arXiv:1902.04552, 2019.
[4] Brandon Amos and J Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 136-145, 2017.
[5] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981-3989, 2016.
[6] Walter Baur and Volker Strassen. The complexity of partial derivatives. Theoretical Computer Science, 22:317-330, 1983.
[7] Atilim Gunes Baydin, Barak A. Pearlmutter, and Alexey Radul. Automatic differentiation in machine learning: a survey. CoRR, abs/1502.05767, 2015.
[8] Luca Bertinetto, Joao F Henriques, Philip HS Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.
[9] Sebastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 2015.
[10] Chuong B. Do, Chuan-Sheng Foo, and Andrew Y. Ng. Efficient multiple hyperparameter learning for log-linear models. In NIPS, 2007.
[11] Justin Domke. Generic methods for optimization-based modeling. In AISTATS, 2012.
[12] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv:1611.02779, 2016.
[13] Chelsea Finn. Learning to Learn with Gradients. PhD thesis, UC Berkeley, 2018.
[14] Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. arXiv:1710.11622, 2017.
[15] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. International Conference on Machine Learning (ICML), 2017.
[16] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905, 2017.
[17] Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pages 9516-9527, 2018.
[18] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. International Conference on Machine Learning (ICML), 2019.
[19] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, pages 1165-1173, 2017.
[20] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. International Conference on Learning Representations (ICLR), 2018.
[21] Andreas Griewank. Some bounds on the complexity of gradients, Jacobians, and Hessians. 1993.
[22] Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2008.
[23] James Harrison, Apoorva Sharma, and Marco Pavone. Meta-learning priors for efficient online Bayesian regression. arXiv preprint arXiv:1807.08912, 2018.
[24] Laurent Hascoët and Mauricio Araya-Polo. Enabling user-driven checkpointing strategies in reverse-mode automatic differentiation. CoRR, abs/cs/0606042, 2006.
[25] Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, 2001.
[26] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In ICML, 2017.
[27] Jaehong Kim, Youngduck Choi, Moonsu Cha, Jung Kwon Lee, Sangyeul Lee, Sungwan Kim, Yongseok Choi, and Jiwon Kim. Auto-Meta: Automated gradient based meta learner search. arXiv preprint arXiv:1806.06927, 2018.
[28] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.
[29] Gregory Koch. Siamese neural networks for one-shot image recognition. ICML Deep Learning Workshop, 2015.
[30] Brenden M Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B Tenenbaum. One shot learning of simple visual concepts. In Conference of the Cognitive Science Society (CogSci), 2011.
[31] Benoit Landry, Zachary Manchester, and Marco Pavone. A differentiable augmented lagrangian method for bilevel nonlinear optimization. arXiv preprint arXiv:1902.03319, 2019.
[32] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. arXiv preprint arXiv:1904.03758, 2019.
[33] Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.
[34] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
[35] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113-2122, 2015.
[36] James Martens. Deep learning via Hessian-free optimization. In ICML, 2010.
[37] Fei Mi, Minlie Huang, Jiyong Zhang, and Boi Faltings. Meta-learning for low-resource natural language generation in task-oriented dialogue systems. arXiv preprint arXiv:1905.05644, 2019.
[38] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141, 2017.
[39] Igor Mordatch. Concept learning with energy-based models. CoRR, abs/1811.02486, 2018.
[40] Tsendsuren Munkhdalai and Hong Yu. Meta networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2554-2563, 2017.
[41] Devang K Naik and RJ Mammone. Meta-neural networks that learn by learning. In International Joint Conference on Neural Networks (IJCNN), 1992.
[42] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Math. Program., 108:177-205, 2006.
[43] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
[44] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering, 2000.
[45] Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pages 721-731, 2018.
[46] Fabian Pedregosa. Hyperparameter optimization with approximate gradient. arXiv preprint arXiv:1602.02355, 2016.
[47] Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham Kakade. Towards generalization and simplicity in continuous control. In NIPS, 2017.
[48] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
[49] Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.
[50] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), 2016.
[51] Jurgen Schmidhuber. Evolutionary principles in self-referential learning. Diploma thesis, Institut f. Informatik, Tech. Univ. Munich, 1987.
[52] John Schulman, Sergey Levine, Pieter Abbeel, Michael I Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning (ICML), 2015.
[53] Amirreza Shaban, Ching-An Cheng, Olivia Hirschey, and Byron Boots. Truncated back-propagation for bilevel optimization. CoRR, abs/1810.10667, 2018.
[54] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077-4087, 2017.
[55] Sebastian Thrun and Lorien Pratt. Learning to Learn. Springer Science & Business Media, 1998.
[56] Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle. Meta-Dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096, 2019.
[57] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Neural Information Processing Systems (NIPS), 2016.
[58] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv:1611.05763, 2016.
[59] Fengwei Zhou, Bin Wu, and Zhenguo Li. Deep meta-learning: Learning to learn in the concept space. arXiv preprint arXiv:1802.03596, 2018.
[60] Luisa M Zintgraf, Kyriacos Shiarlis, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast context adaptation via meta-learning. arXiv preprint arXiv:1810.03642, 2018.