{"title": "Distributed Multitask Reinforcement Learning with Quadratic Convergence", "book": "Advances in Neural Information Processing Systems", "page_first": 8907, "page_last": 8916, "abstract": "Multitask reinforcement learning (MTRL) suffers from scalability issues when the number of tasks or trajectories grows large. The main reason behind this drawback is the reliance on centeralised solutions. Recent methods exploited the connection between MTRL and general consensus to propose scalable solutions. These methods, however, suffer from two drawbacks. First, they rely on predefined objectives, and, second, exhibit linear convergence guarantees. In this paper, we improve over state-of-the-art by deriving multitask reinforcement learning from a variational inference perspective. We then propose a novel distributed solver for MTRL with quadratic convergence guarantees.", "full_text": "Distributed Multitask Reinforcement Learning with\n\nQuadratic Convergence\n\nRasul Tutunov\nPROWLER.io\n\nCambridge, United Kingdom\n\nrasul@prowler.io\n\nDongho Kim\nPROWLER.io\n\nCambridge, United Kingdom\n\ndongho@prowler.io\n\nHaitham Bou-Ammar\n\nPROWLER.io\n\nCambridge, United Kingdom\nhaitham@prowler.io\n\nAbstract\n\nMultitask reinforcement learning (MTRL) suffers from scalability issues when\nthe number of tasks or trajectories grows large. The main reason behind this\ndrawback is the reliance on centeralised solutions. Recent methods exploited the\nconnection between MTRL and general consensus to propose scalable solutions.\nThese methods, however, suffer from two drawbacks. First, they rely on prede\ufb01ned\nobjectives, and, second, exhibit linear convergence guarantees. In this paper, we\nimprove over state-of-the-art by deriving multitask reinforcement learning from a\nvariational inference perspective. 
We then propose a novel distributed solver for MTRL with quadratic convergence guarantees.\n\n1 Introduction\n\nReinforcement learning (RL) allows agents to solve sequential decision-making problems with limited feedback. Applications with these characteristics are ubiquitous, ranging from stock trading [1] to robotics control [2, 3]. Though successful, RL methods typically require substantial amounts of data and computation to learn successful behaviour. Multitask and transfer learning techniques [4\u20136, 2, 7] have been developed to remedy these problems by allowing for knowledge reuse between tasks to bias initial behaviour. Unfortunately, such methods suffer from scalability constraints when the number of tasks or policy dimensions grows large.\n\nTwo promising directions remedy these scalability problems. In the first, tasks are streamed online and models are fit iteratively. This alternative has been well explored under the name of lifelong RL [8, 9]. When considering lifelong learning, however, one comes to recognise that these improvements in computation come hand-in-hand with a decrease in the model's accuracy, due to the usage of approximations to the original loss (e.g., second-order expansions [10]) as well as the unavailability of all tasks in batch. Interested readers are referred to [11] for an in-depth discussion of the limitations of lifelong reinforcement learners.\n\nThe other direction, based on decentralised optimisation, remedies scalability and accuracy constraints by distributing computation across multiple units. Though successful in supervised learning [12], this direction has yet to be well explored in the context of MTRL. Recently, however, the authors in [11] proposed a distributed solver for MTRL with linear convergence guarantees based on the Alternating Direction Method of Multipliers (ADMM). Their method relied on a connection between MTRL and distributed general consensus. 
However, such ADMM-based techniques suffer from the following drawbacks. First, these algorithms only achieve linear convergence in the order of $O(1/k)$, with $k$ being the iteration count. Second, for linear convergence, additional restrictive assumptions on the penalty terms have to be imposed. Finally, they require a large number of iterations to arrive at accurate (in terms of consensus error) solutions, as noted by [13] and validated in our experiments; see Section 5.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this paper, we remedy the above problems by proposing a distributed solver for MTRL that exhibits quadratic convergence. Contrary to [11], our technique does not impose restrictive assumptions on the reinforcement learning loss function and can thus be deemed more general. We achieve our results in two steps. First, we reformulate MTRL as variational inference. Second, we map the resultant objective to general consensus, which allows us to exploit the symmetric diagonal dominance property of the curvature of our dual problem. We show that our novel distributed solver, which uses Chebyshev polynomials, has quadratic convergence guarantees.\n\nWe analyse the performance of our method both theoretically and empirically. On the theory side, we formally prove quadratic convergence. On the empirical side, we show that our new technique outperforms state-of-the-art methods from both distributed optimisation and lifelong reinforcement learning on a variety of graph topologies. We further show that these improvements arrive at relatively small increases in the communication overhead between the nodes.\n\n2 Background\n\nReinforcement learning (RL) [14] algorithms are successful in solving sequential decision-making (SDM) tasks. In RL, the agent's goal is to sequentially select actions that maximise its total expected return. 
We formalise such problems as a Markov decision process (MDP) $\\mathcal{Z} = \\langle \\mathcal{X}, \\mathcal{A}, \\mathcal{P}, \\mathcal{R}, \\gamma \\rangle$, where $\\mathcal{X} \\subseteq \\mathbb{R}^{d}$ is the set of states, $\\mathcal{A} \\subseteq \\mathbb{R}^{m}$ is the set of possible actions, $\\mathcal{P}: \\mathcal{X} \\times \\mathcal{A} \\times \\mathcal{X} \\mapsto [0,1]$ represents the state transition probability describing the task's dynamics, $\\mathcal{R}: \\mathcal{X} \\times \\mathcal{A} \\times \\mathcal{X} \\mapsto \\mathbb{R}$ is the reward function measuring the agent's performance, and $\\gamma \\in [0,1)$ is the discount factor. The dynamics of an RL problem commence as follows: at each time step $h$, the agent is at state $x_h \\in \\mathcal{X}$ and has to choose an action $a_h \\in \\mathcal{A}$, transitioning it to a new state $x_{h+1} \\sim p(x_{h+1}|x_h, a_h)$ as given by $\\mathcal{P}$. This transition yields a reward $r_{h+1} = \\mathcal{R}(x_h, a_h, x_{h+1})$. We assume that actions are generated by a policy $\\pi: \\mathcal{X} \\times \\mathcal{A} \\mapsto [0,1]$, which is defined as a distribution over state-action pairs, i.e., $\\pi(a_h|x_h)$ is the probability of choosing action $a_h$ in a state $x_h$. The goal of the agent is to find an optimal policy $\\pi^{\\star}$ that maximises its expected return, given by $\\mathbb{E}_{\\pi}\\left[\\sum_{h=1}^{H} \\gamma^{h} r_h\\right]$, with $H$ being the horizon length.\n\nPolicy Search RL parameterises a policy by a vector of unknown parameters $\\theta$. As such, the RL problem is transformed into one of searching over the parameter space for $\\theta^{\\star}$ that maximises:\n\n$$J(\\theta) = \\mathbb{E}_{p_{\\theta}(\\tau)}[\\mathcal{R}(\\tau)] = \\int_{\\tau} p_{\\theta}(\\tau) \\mathcal{R}(\\tau)\\,d\\tau, \\quad (1)$$\n\nwhere a trajectory $\\tau$ is a sequence of accumulated state-action pairs $[x_{0:H}, a_{0:H}]$. 
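As a concrete illustration of Equation 1, the expectation can be estimated from sampled behaviour, and the standard score-function (REINFORCE) identity, $\nabla J(\theta) = \mathbb{E}[\mathcal{R}(\tau)\,\nabla \log p_{\theta}(\tau)]$, gives a gradient estimate from the same samples. The sketch below uses a one-step softmax "policy" over two actions (a bandit); the toy rewards and all names are our own illustration, not code from the paper.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def estimate_objective_and_grad(prefs, rewards, n_samples, rng):
    """Monte Carlo estimates of J(theta) = E[R] and of the score-function
    gradient E[R * grad log pi_theta(a)] for a softmax policy over actions."""
    probs = softmax(prefs)
    j_hat = 0.0
    grad = [0.0] * len(prefs)
    for _ in range(n_samples):
        a = rng.choices(range(len(prefs)), weights=probs)[0]
        r = rewards[a]
        j_hat += r
        for k in range(len(prefs)):
            # for softmax preferences: d/d theta_k log pi(a) = 1{k == a} - pi(k)
            grad[k] += r * ((1.0 if k == a else 0.0) - probs[k])
    n = float(n_samples)
    return j_hat / n, [g / n for g in grad]
```

For this toy case the gradient has the closed form $\pi(k)\,(\mathcal{R}(k) - J(\theta))$, which the sampled estimate approaches as the number of trajectories grows.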
Furthermore, the probability of acquiring a certain trajectory, $p_{\\theta}(\\tau)$, and the total reward $\\mathcal{R}(\\tau)$ for a trace $\\tau$ are defined as $p_{\\theta}(\\tau) = P_0(x_0) \\prod_{h=1}^{H} p(x_{h+1}|x_h, a_h)\\, \\pi_{\\theta}(a_h|x_h)$ and $\\mathcal{R}(\\tau) = \\frac{1}{H}\\sum_{h=0}^{H} r_{h+1}$, with $P_0: \\mathcal{X} \\mapsto [0,1]$ being the initial state distribution.\n\nPolicy search can also be cast as variational inference by connecting RL and probabilistic inference [15\u201318]. In this formulation the goal is to derive the posterior distribution over trajectories conditioned on a desired output, given a prior trajectory distribution. The desired output is denoted as a binary random variable $\\hat{R}$, where $\\hat{R} = 1$ indicates the optimal reward event. This is typically related to trajectory rewards using $p(\\hat{R} = 1|\\tau) \\propto \\exp(\\mathcal{R}(\\tau))$. With this definition, the optimisation objective $J(\\theta)$ in Equation 1 becomes $p_{\\theta}(\\hat{R} = 1) = \\int_{\\tau} p(\\hat{R} = 1|\\tau)\\, p_{\\theta}(\\tau)\\, d\\tau$. From the log-marginal of the binary event, we can write the evidence lower bound (ELBO). The ELBO is derived by introducing a variational distribution $q_{\\phi}(\\tau)$ and applying Jensen's inequality:\n\n$$\\log p_{\\theta}(\\hat{R}) \\geq \\int_{\\tau} q_{\\phi}(\\tau) \\left[ \\log p(\\hat{R}|\\tau) + \\log \\frac{p_{\\theta}(\\tau)}{q_{\\phi}(\\tau)} \\right] d\\tau = \\mathbb{E}_{q_{\\phi}(\\tau)}\\left[ \\log p(\\hat{R}|\\tau) \\right] - D_{KL}(q_{\\phi}(\\tau) \\| p_{\\theta}(\\tau)),$$\n\nwith $D_{KL}(q(\\tau) \\| p(\\tau))$ being the Kullback-Leibler divergence between $q(\\tau)$ and $p(\\tau)$.\n\n3 Multitask Reinforcement Learning as Variational Inference\n\nRL algorithms require substantial amounts of trajectories and learning time for successful behaviour. Acquiring large training samples easily leads to wear and tear on the system and thus worsens the problem. When data is scarce, learning task policies jointly through multitask reinforcement learning (MTRL) rather than independently significantly improves performance [4, 19]. 
In MTRL, the agent is faced with a series of $T$ SDM tasks $\\mathcal{Z}^{(1)}, \\ldots, \\mathcal{Z}^{(T)}$. Each task is an MDP denoted by $\\mathcal{Z}^{(t)} = \\langle \\mathcal{X}^{(t)}, \\mathcal{A}^{(t)}, \\mathcal{P}^{(t)}, \\mathcal{R}^{(t)}, \\gamma^{(t)} \\rangle$, and the goal for the agent is to learn a set of optimal policies $\\Pi^{\\star} = \\{\\pi^{\\star}_{\\theta^{(1)}}, \\ldots, \\pi^{\\star}_{\\theta^{(T)}}\\}$ with corresponding parameters $\\Theta^{\\star} = \\{\\theta^{(1)\\star}, \\ldots, \\theta^{(T)\\star}\\}$. Rather than defining the optimisation objective directly, as done in [10, 4], we provide a probabilistic modelling view of the problem by framing MTRL as an instance of variational inference. We define a set of binary reward events $\\hat{R}_1, \\ldots, \\hat{R}_T \\in \\{0,1\\}^{T}$, where $p(\\hat{R}_k|\\tau_k) \\propto \\exp(\\mathcal{R}_k(\\tau_k))$. Here, trajectories are assumed to be latent, and the goal of the agent is to determine a set of policy parameters that assign high density to trajectories with high rewards. In other words, the goal is to find a set of policies that maximise the log-marginal of the reward events:\n\n$$\\log p_{\\theta_1:\\theta_T}\\left(\\hat{R}_1, \\ldots, \\hat{R}_T\\right) = \\log \\int_{\\tau_1} \\cdots \\int_{\\tau_T} \\prod_{t=1}^{T} p(\\hat{R}_t|\\tau_t)\\, p_{\\theta_t}(\\tau_t)\\, d\\tau_1 \\ldots d\\tau_T,$$\n\nwhere $p_{\\theta_t}(\\tau_t)$ is the trajectory density for task $t$: $p_{\\theta_t}(\\tau_t) = P^{(t)}_0(x_0) \\prod_{h=1}^{H_t} p^{(t)}(x^{(t)}_{h+1}|x^{(t)}_h, a^{(t)}_h)\\, \\pi_{\\theta_t}(a^{(t)}_h|x^{(t)}_h)$. To handle the intractability in computing the above integrals, we derive an ELBO using a variational distribution $q_{\\phi}(\\tau_1, \\ldots, \\tau_T)$:\n\n$$\\log \\int_{\\tau_1} \\cdots \\int_{\\tau_T} \\prod_{t=1}^{T} p(\\hat{R}_t|\\tau_t)\\, p_{\\theta_t}(\\tau_t)\\, d\\tau_1 \\ldots d\\tau_T \\geq \\mathbb{E}_{q_{\\phi}(\\cdot)}\\left[ \\sum_{t=1}^{T} \\log p(\\hat{R}_t|\\tau_t) + \\log \\frac{\\prod_{t=1}^{T} p_{\\theta_t}(\\tau_t)}{q_{\\phi}(\\cdot)} \\right].$$\n\nUsing the above, the optimisation objective of multitask reinforcement learning can be written as:\n\n$$\\max_{\\phi, \\theta_1:\\theta_T} \\; \\mathbb{E}_{q_{\\phi}(\\tau_1, \\ldots, \\tau_T)}\\left[ \\sum_{t=1}^{T} \\log p(\\hat{R}_t|\\tau_t) \\right] + \\mathbb{E}_{q_{\\phi}(\\tau_1, \\ldots, \\tau_T)}\\left[ \\log \\frac{\\prod_{t=1}^{T} p_{\\theta_t}(\\tau_t)}{q_{\\phi}(\\tau_1, \\ldots, \\tau_T)} \\right].$$\n\nWe assume a mean-field variational approximation [20], i.e., $q_{\\phi}(\\tau_1, \\ldots, \\tau_T) = \\prod_{t=1}^{T} q_{\\phi_t}(\\tau_t)$. Furthermore, we assume that the distribution\u00b9 $q_{\\phi_t}(\\tau_t)$ follows that of $p_{\\theta_t}(\\tau_t)$. Hence, we write:\n\n$$\\max_{\\phi_1:\\phi_T, \\theta_1:\\theta_T} \\; \\sum_{t=1}^{T} \\mathbb{E}_{q_{\\phi_t}(\\tau_t)}\\left[ \\log p(\\hat{R}_t|\\tau_t) \\right] - \\sum_{t=1}^{T} D_{KL}\\left(q_{\\phi_t}(\\tau_t) \\| p_{\\theta_t}(\\tau_t)\\right). \\quad (2)$$\n\nSo far, we have discussed MTRL assuming independence between the policy parameters $\\theta_1, \\ldots, \\theta_T$. To benefit from shared knowledge between tasks, we next introduce coupling by allowing for parameter sharing across MDPs. Inspired by stochastic variational inference [21], we decompose $\\theta_t = \\Theta_{sh} \\tilde{\\theta}_t$, where $\\Theta_{sh}$ is a shared set of parameters between tasks, while $\\tilde{\\theta}_t$ represents task-specific parameters introduced to \u201cspecialise\u201d shared knowledge to the peculiarities of each task $t \\in \\{1, \\ldots, T\\}$. For instance, if a task parameter $\\theta_t \\in \\mathbb{R}^{d}$, our decomposition yields $\\Theta_{sh} \\in \\mathbb{R}^{d \\times k}$ and $\\tilde{\\theta}_t \\in \\mathbb{R}^{k \\times 1}$, with $k$ representing the dimension of the shared latent knowledge.\n\nSolving the problem in Equation 2 amounts to determining both variational and model parameters, i.e., $\\phi_1, \\ldots, \\phi_T$, $\\Theta_{sh}$, and $\\tilde{\\theta}_1, \\ldots, \\tilde{\\theta}_T$. 
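On a small discrete trajectory space, the per-task bound used in Equation 2 (expected log-likelihood of the reward event minus a KL term) can be checked numerically against the exact log-marginal; by construction the bound is tight when $q$ equals the posterior over trajectories. The toy rewards and distributions below are our own illustration.

```python
import math

def elbo(q, p_theta, log_lik):
    """E_q[log p(R | tau)] - KL(q || p_theta) over a finite trajectory space.
    q and p_theta map trajectory ids to probabilities; log_lik to log p(R | tau)."""
    return sum(q[t] * (log_lik[t] + math.log(p_theta[t] / q[t])) for t in q)

def log_marginal(p_theta, log_lik):
    """log sum_tau p(R | tau) p_theta(tau)."""
    return math.log(sum(p_theta[t] * math.exp(log_lik[t]) for t in p_theta))
```

Plugging in any variational distribution stays below the log-marginal (Jensen), and the gap closes exactly at the posterior $q^{\star}(\tau) \propto p_{\theta}(\tau)\exp(\mathcal{R}(\tau))$.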
We propose an expectation-maximisation style algorithm for computing each of the above free variables. Namely, in the E-step we solve for $\\phi_1, \\ldots, \\phi_T$ while keeping $\\Theta_{sh}$ and $\\tilde{\\theta}_1, \\ldots, \\tilde{\\theta}_T$ fixed. In the M-step, on the other hand, we determine $\\Theta_{sh}$ and $\\tilde{\\theta}_1, \\ldots, \\tilde{\\theta}_T$ given the updated variational parameters. In both of these steps, solving for the task-specific and variational parameters can be made efficient using parallelisation. Determining $\\Theta_{sh}$, however, requires knowledge of all tasks, making it unscalable as the number of tasks grows large. To remedy this problem, we next propose a novel distributed Newton method with quadratic convergence guarantees\u00b2. Applying this method to determine $\\Theta_{sh}$ results in a highly scalable learner, as shown in the following sections.\n\n\u00b9Please note that we leave exploring other forms of the variational distribution as an interesting direction for future work.\n\n\u00b2Contrary to stochastic variational inference, we are not restricted to exponential family distributions.\n\n4 Scalable Multitask Reinforcement Learning\n\nAs mentioned earlier, the problem of determining $\\Theta_{sh}$ can become computationally intensive with an increasing number of tasks. In this section, we devise a distributed Newton method for $\\Theta_{sh}$ to aid in scalability. 
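Before turning to that solver, the E/M alternation described above can be sketched as a generic block-coordinate loop. The toy objective below stands in for the ELBO and is purely illustrative; the paper's actual E- and M-steps operate on trajectories, variational parameters, and policy parameters.

```python
def em_style_fit(e_step, m_step, phi0, theta0, iters):
    """Alternate: update variational parameters with the model fixed (E-step),
    then update model parameters with the variational ones fixed (M-step)."""
    phi, theta = phi0, theta0
    for _ in range(iters):
        phi = e_step(theta)    # E-step: maximise the bound over phi, theta fixed
        theta = m_step(phi)    # M-step: maximise the bound over theta, phi fixed
    return phi, theta

# Illustrative bound f(phi, theta) = -(phi - theta)**2 - (theta - 3.0)**2,
# whose coordinate-wise maximisers are phi = theta and theta = (phi + 3) / 2.
phi, theta = em_style_fit(lambda th: th, lambda ph: (ph + 3.0) / 2.0, 0.0, 0.0, 40)
```

Each half-step can only improve the surrogate objective, so the alternation converges to a coordinate-wise optimum (here, $\phi = \theta = 3$).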
Given an updated variational (i.e., E-step) and fixed task-specific parameters, the optimisation problem for $\\Theta_{sh}$ can be written as:\n\n$$\\max_{\\Theta_{sh}} \\sum_{t=1}^{T} \\frac{1}{N_t} \\left[ \\sum_{i_t=1}^{N_t} \\log \\left[ p\\left(\\hat{R}^{(i_t)}_t \\middle| \\tau^{(i_t)}_t\\right) \\times p_{\\Theta_{sh}, \\tilde{\\theta}_t}\\left(\\tau^{(i_t)}_t\\right) \\right] \\right] \\equiv \\max_{\\Theta_{sh}} \\sum_{t=1}^{T} J^{(t)}_{MTRL}\\left(\\Theta_{sh} \\tilde{\\theta}_t\\right), \\quad (3)$$\n\nwhere $i_t \\in \\{1, \\ldots, N_t\\}$ denotes the index of trajectory $i$ for task $t \\in \\{1, \\ldots, T\\}$. The above equation omits functions independent of $\\Theta_{sh}$, and estimates the variational expectation by sampling $N_t$ trajectories for each of the tasks.\n\nOur scaling strategy is to allow for a distributed framework generalising to any topology of connected processors. Hence, we assume an undirected graph $\\mathcal{G} = (\\mathcal{V}, \\mathcal{E})$ of computational units. Here, $\\mathcal{V}$ denotes the set of nodes (processors) and $\\mathcal{E}$ the set of edges. Similar to [11], we assume $n$ nodes connected via $|\\mathcal{E}|$ edges. Contrary to their work, however, no specific node-ordering assumptions are imposed. Before writing the problem above in an equivalent distributed fashion, we first introduce \u201cvec(A)\u201d to denote the column-wise vectorisation of a matrix $A$. This notation allows us to rewrite the product $\\Theta_{sh} \\tilde{\\theta}_t$ in terms of a vectorised version of the optimisation variable $\\Theta_{sh}$, where $\\text{vec}(\\Theta_{sh} \\tilde{\\theta}_t) = (\\tilde{\\theta}^{\\top}_t \\otimes I_{d \\times d})\\,\\text{vec}(\\Theta_{sh}) \\in \\mathbb{R}^{d \\times 1}$. Hence, the equivalent distributed formulation of Equation 3 is given by:\n\n$$\\min_{\\Theta^{(1)}_{sh}:\\Theta^{(n)}_{sh}} \\sum_{i=1}^{n} \\sum_{t=1}^{T_i} -J^{(t)}_{MTRL}\\left(\\left(\\tilde{\\theta}^{\\top}_t \\otimes I_{d \\times d}\\right)\\text{vec}(\\Theta^{(i)}_{sh})\\right) \\quad \\text{s.t.} \\quad \\Theta^{(1)}_{sh} = \\cdots = \\Theta^{(n)}_{sh}, \\quad (4)$$\n\nwhere $T_i$ is the total number of tasks assigned to node $i$ such that $\\sum_{i=1}^{n} T_i = T$. Intuitively, the above distributes Equation 3 among $n$ nodes, where each computes its local copy of $\\Theta_{sh}$. For the distributed version to coincide with the centralised one, all nodes have to arrive at a consensus (in a fully distributed fashion) on the value of $\\Theta_{sh}$. As such, feasible solutions of the distributed and centralised versions coincide, making the two problems equivalent.\n\nNow, we can apply any off-the-shelf distributed optimisation algorithm. Unfortunately, current techniques suffer from drawbacks prohibiting their direct usage for MTRL. Generally, there are two popular classes of algorithms for distributed optimisation. The first is sub-gradient based, while the second relies on a decomposition-coordination procedure. Sub-gradient algorithms proceed by taking a gradient step followed by an averaging step at each iteration. The computation of each step is relatively cheap and can be implemented in a distributed fashion [22]. Though cheap to compute, the best known convergence rate of sub-gradient methods is slow, given by $O(1/\\sqrt{K})$ with $K$ being the total number of iterations [23, 24]. The second class of algorithms solves constrained problems by relying on dual methods. One well-known state-of-the-art method from this class is the Alternating Direction Method of Multipliers (ADMM) [13]. ADMM decomposes the original problem into two subproblems which are then solved sequentially, leading to updates of the dual variables. In [23], the authors show that ADMM can be fully distributed over a network, leading to improved convergence rates of the order of $O(1/K)$. Recently, the authors in [11] applied the method of [23] to distributed MTRL. 
In our experiments, we significantly outperform [11], especially in high-dimensional environments.\n\nSubstantial rate improvements can be gained from adopting second-order (Newton) methods. Though a variety of techniques have been proposed in [25\u201327], less progress has been made at addressing ADMM's accuracy and convergence-rate issues. In a recent attempt [25], the authors propose a distributed second-order method for general consensus by using the approach in [27] to compute the Newton direction. As detailed in Section 6, this method suffers from two problems. First, it fails to outperform ADMM and, second, it faces storage and computational deficiencies for large data sets; thus ADMM retains state-of-the-art status.\n\nNext, we develop a distributed solver that outperforms others both theoretically and empirically. On the theory side, we develop the first distributed MTRL algorithm with provable quadratic convergence guarantees. On the empirical side, we demonstrate the superiority of our method on a variety of benchmarks.\n\nFigure 1: High-level depiction of our distribution framework for the shared parameters. Each of the vectors $y_i$ holds the $i$th components of the shared parameters across all $n$ nodes.\n\n4.1 Laplacian-Based Distributed Multitask Reinforcement Learning\n\nFor maximum performance boost, we aim to have our algorithm exploit (locally) the structure of the computational graph connecting the processing units. To consider such an effect, we rewrite our distributed MTRL problem in terms of the graph Laplacian $\\mathcal{L}$, a matrix that reflects the graph structure. Formally, $\\mathcal{L}$ is an $n \\times n$ matrix such that $\\mathcal{L}(i,j) = \\text{degree}(i)$ when $i = j$, $-1$ when $(i,j) \\in \\mathcal{E}$, and $0$ otherwise. Of course, this matrix cannot be known to all the nodes in the network. We ensure full distribution by allowing each node to access only its local neighbourhood. 
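The Laplacian just defined can be assembled directly from the edge list. A small sanity check (our own illustrative code, not the paper's) confirms that it is symmetric, that every row sums to zero (so $\mathcal{L}\mathbf{1} = 0$), and that it is diagonally dominant:

```python
def graph_laplacian(n, edges):
    """Dense Laplacian of an undirected graph on nodes 0..n-1:
    L[i][i] = degree(i), L[i][j] = -1 for each edge (i, j)."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][j] -= 1
        L[j][i] -= 1
        L[i][i] += 1
        L[j][j] += 1
    return L

def matvec(L, v):
    """Plain matrix-vector product, used to check L @ v."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in L]
```

For a path graph on three nodes, `graph_laplacian(3, [(0, 1), (1, 2)])` produces `[[1, -1, 0], [-1, 2, -1], [0, -1, 1]]`; multiplying by a constant vector gives zero, while a non-constant vector does not, which is what makes the Laplacian useful for encoding consensus.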
To view the problem in Equation 4 from a graph-topology perspective, we introduce a set of $dk$ vectors $y_1, \\ldots, y_{dk}$, each in $\\mathbb{R}^{n}$. The goal of the $i$th such vector is to hold the $i$th component of each of $\\text{vec}(\\Theta^{(1)}_{sh}), \\ldots, \\text{vec}(\\Theta^{(n)}_{sh})$. This process is depicted in Figure 1, where, for instance, the first vector $y_1$ accumulates the first component of the shared parameters, $\\theta^{(1)}_{1,sh}, \\ldots, \\theta^{(n)}_{1,sh}$, from all nodes.\n\nWe can now describe consensus on the copies of the shared parameters as consensus between the components of $y_1, \\ldots, y_{dk}$. Clearly, the components in $y_r$ coincide if the $r$th component of the shared parameters equates across nodes. Hence, consensus between the components (for all $r$) corresponds to consensus on all dimensions of the shared parameters. This is exactly the constraint in Equation 4. One can think of a vector with equal components as one parallel to $\\mathbf{1}$, namely $y_r = c_r \\mathbf{1}$. Consequently, we can introduce the graph Laplacian into our constraints by requiring $y_1, \\ldots, y_{dk}$ to be solutions of $\\mathcal{L}y_1 = 0, \\ldots, \\mathcal{L}y_{dk} = 0$. This is true since, for a connected graph, the only solutions to $\\mathcal{L}v = 0$ are vectors $v$ parallel to the vector of ones. Hence, a vector $y_r$ satisfying the above system has to be of the form $c_r \\mathbf{1}$, i.e., its components equate. Hence, we write:\n\n$$\\min_{y_1:y_{dk}} \\sum_{i=1}^{n} \\sum_{t=1}^{T_i} -J^{(t)}_{MTRL}\\left(\\left(\\tilde{\\theta}^{\\top}_t \\otimes I_{d \\times d}\\right)\\tilde{y}_i\\right) \\quad \\text{s.t.} \\quad \\mathcal{L}y_1 = 0, \\ldots, \\mathcal{L}y_{dk} = 0 \\iff M y = 0, \\quad (5)$$\n\nwith $\\tilde{y}_i = [y_1(i), \\ldots, y_{dk}(i)]^{\\top}$ denoting $\\text{vec}(\\Theta^{(i)}_{sh})$, $M = I_{dk \\times dk} \\otimes \\mathcal{L}$ a block-diagonal matrix of size $ndk \\times ndk$ with Laplacian blocks, and $y \\in \\mathbb{R}^{ndk}$ a vector collecting $y_1, \\ldots, y_{dk}$.\n\n4.2 Solution Methodology\n\nThe problem in Equation 5 is a constrained optimisation problem that can be solved by descending (in a distributed fashion) in the dual function. 
Though adopting second-order techniques (e.g., Newton iteration) can lead to improved convergence speeds, direct application of standard Newton is difficult, as we require a distributed procedure to accurately compute the direction of descent\u00b3.\n\n\u00b3It is worth noting that some techniques for determining the Newton direction in a distributed fashion exist. These techniques, however, are inaccurate; see Section 5.\n\nIn the following, we propose an accurate and scalable distributed Newton method. Our solution is decomposed into two steps. First, we write the constrained problem as an unconstrained one by introducing the dual function of Equation 5. Second, we exploit the symmetric diagonally dominant (SDD) property of the Hessian, previously proved for a broader setting in Lemma 2 of [28], by developing a Chebyshev solver to compute the Newton direction. To formulate the dual, we introduce a vector of Lagrange multipliers $\\lambda = [\\lambda^{\\top}_1, \\ldots, \\lambda^{\\top}_{dk}]^{\\top} \\in \\mathbb{R}^{ndk}$, where $\\lambda_i \\in \\mathbb{R}^{n}$ is a vector of multipliers, one for each dimension of $\\text{vec}(\\Theta_{sh})$. For fully distributed computation, we assume each node only stores its corresponding components $\\lambda_1(i), \\ldots, \\lambda_{dk}(i)$. After deriving the Lagrangian, we can write the dual function $q(\\lambda)$ as\u2074:\n\n$$q(\\lambda) = \\sum_{i=1}^{n} \\inf_{y_1(i):y_{dk}(i)} \\left( \\sum_{t=1}^{T_i} -J^{(t)}_{MTRL}\\left(\\left(\\tilde{\\theta}^{\\top}_t \\otimes I_{d \\times d}\\right)\\tilde{y}_i\\right) + y_1(i)[\\mathcal{L}\\lambda_1]_i + \\cdots + y_{dk}(i)[\\mathcal{L}\\lambda_{dk}]_i \\right),$$\n\nwhich is clearly separable across the computational nodes in $\\mathcal{G}$. 
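Each summand of the dual touches only terms of the form $[\mathcal{L}\lambda_r]_i$, which node $i$ can assemble from its own multiplier and those of its neighbours, since by the Laplacian's definition $[\mathcal{L}\lambda_r]_i = d(i)\lambda_r(i) - \sum_{j \in N(i)} \lambda_r(j)$. A small illustrative check (the graph and values are ours, not the paper's):

```python
def local_laplacian_entry(i, lam, neighbours):
    """[L lam]_i computed by node i from its neighbourhood only:
    degree(i) * lam(i) minus the sum of the neighbours' multipliers."""
    return len(neighbours[i]) * lam[i] - sum(lam[j] for j in neighbours[i])

# Check on a 4-cycle: every node computes its entry locally, and the entries
# sum to zero because the all-ones vector lies in the Laplacian's left nullspace.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
neighbours = {i: [] for i in range(4)}
for a, b in edges:
    neighbours[a].append(b)
    neighbours[b].append(a)
lam = [0.5, -1.0, 2.0, 0.25]
local = [local_laplacian_entry(i, lam, neighbours) for i in range(4)]
```

No node ever needs the full multiplier vector, which is exactly what keeps the dual computations distributed.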
Before discussing the SDD properties of the dual Hessian, we still require a procedure that allows us to infer the primal variables (i.e., $y$) given updated parameters $\\lambda$. We recognise that the primal variables can be found as the solution to the following system of equations:\n\n$$\\frac{\\partial f_i(\\cdot)}{\\partial y_1(i)} = -[\\mathcal{L}\\lambda_1]_i, \\; \\ldots, \\; \\frac{\\partial f_i(\\cdot)}{\\partial y_{dk}(i)} = -[\\mathcal{L}\\lambda_{dk}]_i, \\quad \\text{where } f_i(\\cdot) = \\sum_{t=1}^{T_i} -J^{(t)}_{MTRL}\\left(\\left(\\tilde{\\theta}^{\\top}_t \\otimes I_{d \\times d}\\right)\\tilde{y}_i\\right). \\quad (6)$$\n\nIt is also clear that Equation 6 is locally defined for every node $i \\in \\mathcal{V}$, since for each $r = 1, \\ldots, dk$ we have $-[\\mathcal{L}\\lambda_r]_i = \\sum_{j \\in N(i)} \\lambda_r(j) - d(i)\\lambda_r(i)$, where $N(i)$ is the neighbourhood of node $i$. As such, each node $i$ can construct its own system of equations by collecting $\\{\\lambda_1(j), \\ldots, \\lambda_{dk}(j)\\}$ from its neighbours, without the need for full communication. These systems can then be solved locally to determine the primal variables\u2075.\n\nAs mentioned earlier, we update $\\lambda$ using a distributed Newton method. At every iteration $s$ of the optimisation algorithm, the descent direction is thus computed as the solution of $H(\\lambda_s) d_s = -g_s$, where $H(\\lambda_s)$ is the Hessian, $d_s$ the Newton direction, and $g_s$ the gradient. The Hessian and the gradient of our objective are given by:\n\n$$H(\\lambda_s) = -M \\left[ \\sum_{i=1}^{n} \\sum_{t=1}^{T_i} -\\nabla^2 J^{(t)}_{MTRL}(y(\\lambda_s)) \\right]^{-1} M \\quad \\text{and} \\quad \\nabla q(\\lambda_s) = M y(\\lambda_s).$$\n\nUnfortunately, inverting $H(\\lambda_s)$ to determine the Newton direction is not possible in a distributed setting, since computing the inverse requires global information. Given the form of $M$ and following the results in [28], one can show that the above Hessian exhibits the SDD property. Luckily, this property can be exploited for a distributed solver, as we show next.\n\nThe story of computing an approximation to the exact solution of an SDD system of linear equations starts with the standard splitting of symmetric matrices. 
Given a symmetric matrix\u2076 $H$, the standard splitting is given by $H = D_0 - A_0$, where $D_0$ is a diagonal matrix consisting of the diagonal elements of $H$, while $A_0$ is a matrix collecting the negated off-diagonal components of $H$. As the goal is to determine a solution of the SDD system, we will be interested in inverses of $H$. Generalising the work in [29], we recognise that the inverse can be approximated as:\n\n$$(D_0 - A_0)^{-1} \\approx D_0^{-\\frac{1}{2}} \\prod_{\\ell=0}^{O(\\log m)} \\left[ I + \\left( D_0^{-\\frac{1}{2}} A_0 D_0^{-\\frac{1}{2}} \\right)^{2^{\\ell}} \\right] D_0^{-\\frac{1}{2}} = \\hat{P}_m(H),$$\n\nwhere $\\hat{P}_m(H)$ is a polynomial in $H$ of degree $m \\sim \\kappa(H)$. These computations need access to neither the Hessian nor its inverse: they can be described using only local Hessian-vector products, hence allowing for fast implementation using automatic differentiation. The goal of the Newton update is thus to find a solution of the form $d^{(m)}_s = P_m(H(\\lambda_s)) \\nabla q(\\lambda_s)$ such that $d^{(m)}_s$ is an $\\epsilon$-close solution to $d^{\\star}_s$. Consequently, the difference $d^{(m)}_s - d^{\\star}_s$ can be written as:\n\n$$d^{(m)}_s - d^{\\star}_s = \\left[ H(\\lambda_s) P_m(H(\\lambda_s)) - I \\right] d^{\\star}_s = -Q_m(H(\\lambda_s)) d^{\\star}_s,$$\n\nwhere $Q_m(H(\\lambda_s)) = -H(\\lambda_s)P_m(H(\\lambda_s)) + I$. Therefore, instead of seeking $P_m(\\cdot)$, one can think of constructing polynomials $Q_m(\\cdot)$ that reduce the term $d^{(m)}_s - d^{\\star}_s$ as fast as possible. 
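The product expansion above rests on the telescoping identity $(I - C)(I + C)(I + C^2)(I + C^4)\cdots = I - C^{2^{L+1}}$ with $C = D_0^{-1/2} A_0 D_0^{-1/2}$, so truncating the product yields an approximate inverse whenever the spectral radius of $C$ is below one, with the residual squaring at every extra factor. A numerical sketch (our own illustration, using NumPy, not the paper's solver):

```python
import numpy as np

def sdd_inverse_approx(H, rounds):
    """Approximate (D0 - A0)^{-1} by the truncated product
    D0^{-1/2} * prod_l (I + C^{2^l}) * D0^{-1/2}, with C = D0^{-1/2} A0 D0^{-1/2}."""
    d = np.diag(H)
    A0 = np.diag(d) - H                      # negated off-diagonal part of H
    Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
    C = Dinv_sqrt @ A0 @ Dinv_sqrt
    acc = np.eye(H.shape[0])
    Cpow = C.copy()
    for _ in range(rounds):                  # each round squares the residual
        acc = acc @ (np.eye(H.shape[0]) + Cpow)
        Cpow = Cpow @ Cpow
    return Dinv_sqrt @ acc @ Dinv_sqrt

H = np.array([[2.0, -1.0], [-1.0, 2.0]])     # a small SDD matrix
approx = sdd_inverse_approx(H, 4)
```

With only four factors the residual is of order $\rho(C)^{2^4}$, which is why $O(\log m)$ factors suffice in the expansion above.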
This can be formalised in terms of the properties of $Q_m(\\cdot)$ by requiring the polynomial to have minimal degree, as well as to satisfy the following for a precision parameter $\\epsilon$: $Q_m(0) = 1$ and $|Q_m(\\mu_i)| \\leq \\epsilon$, with $\\mu_i$ being the $i$th smallest eigenvalue of $H(\\lambda_s)$. The first condition is a result of observing that $Q_m(z) = -zP_m(z) + 1$, while the second guarantees an $\\epsilon$-approximate solution:\n\n$$\\|d^{(m)}_s - d^{\\star}_s\\|^2_{H(\\lambda_s)} \\leq \\max_i |Q_m(\\mu_i)|^2 \\|d^{\\star}_s\\|^2_{H(\\lambda_s)} \\leq \\epsilon^2 \\|d^{\\star}_s\\|^2_{H(\\lambda_s)}.$$\n\nIn other words, finding a $Q_m(z)$ that has minimal degree and satisfies the above two conditions guarantees an efficient and $\\epsilon$-close solution to $d^{\\star}_s$.\n\n\u2074Please notice that we use the notation $q(\\lambda)$ for the dual function and $q_{\\phi_t}(\\tau_t)$ for the variational distribution.\n\n\u2075Please note that for the case of log-concave policies, we can determine the relation between primal and dual variables in closed form by simple algebraic manipulation.\n\n\u2076Please note that we use $H$ to denote $H(\\lambda_s)$.\n\nFigure 2: (a) Communication overhead in the HC case. Our method has an increase proportional to the condition number of the graph, which is slower compared to the other techniques. (b) and (c) Running times till convergence to a threshold of $10^{-5}$. (d) Number of iterations for a $10^{-5}$ consensus error on the HC dynamical system on different graph topologies. Panels: (a) Communication Overhead; (b) Running Times (SM, DM, & CP); (c) Running Times (HC & HR); (d) Effect of Graph Topology.\n\nChebyshev polynomials of the first kind satisfy our requirements. 
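These polynomials are cheap to evaluate through the standard three-term recurrence $T_0(z) = 1$, $T_1(z) = z$, $T_{m+1}(z) = 2zT_m(z) - T_{m-1}(z)$, a textbook fact that also makes such solvers implementable with matrix-vector products only. A minimal illustration (our own code, not the paper's):

```python
import math

def chebyshev_t(m, z):
    """Evaluate the first-kind Chebyshev polynomial T_m(z) via the
    three-term recurrence T_{m+1} = 2 z T_m - T_{m-1}."""
    t_prev, t_curr = 1.0, z
    if m == 0:
        return t_prev
    for _ in range(m - 1):
        t_prev, t_curr = t_curr, 2.0 * z * t_curr - t_prev
    return t_curr
```

On $[-1, 1]$ the recurrence reproduces $\cos(m \arccos z)$ and stays bounded by one in magnitude, while outside that interval the values grow rapidly, which is the behaviour exploited next.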
Their form is defined as $T_m(z) = \\cos(m \\arccos(z))$ if $z \\in [-1, 1]$, and $T_m(z) = \\frac{1}{2}\\left( \\left(z + \\sqrt{z^2 - 1}\\right)^m + \\left(z - \\sqrt{z^2 - 1}\\right)^m \\right)$ otherwise. Interestingly, $|T_m(z)| \\leq 1$ on $[-1, 1]$, and among all polynomials of degree $m$ with leading coefficient $1$, the polynomial $\\frac{1}{2^{m-1}} T_m(z)$ attains its minimal and maximal values on this interval (i.e., it has the sharpest increase outside the range $[-1, 1]$). We posit that a good candidate is\n\n$$Q^{\\star}_m(z) = T_m\\left( \\frac{\\mu_N + \\mu_2 - 2z}{\\mu_N - \\mu_2} \\right) \\Big/ T_m\\left( \\frac{\\mu_N + \\mu_2}{\\mu_N - \\mu_2} \\right),$$\n\nwith $\\mu_i$ being the $i$th smallest eigenvalue of the symmetric matrix $H$ describing the system of linear equations. First, it is easy to see that when $z = 0$ these polynomials attain a value of unity (i.e., $Q^{\\star}_m(0) = 1$). Second, it can be shown that for any $s$ and $z \\in [\\mu_2, \\mu_{ndk}]$, $|Q^{\\star}_m(z)|^2$ is bounded as $|Q^{\\star}_m(z)|^2 \\leq 4 \\exp\\left(-4m/\\left(\\sqrt{\\kappa(H)} + 1\\right)\\right)$. Therefore, choosing the approximate solution as\u2077 $d^{(m)}_s = -H(\\lambda_s)^{-1}\\left(I - Q_m(H(\\lambda_s))\\right)g_s$ guarantees an $\\epsilon$-close solution. Please note that by exploiting the recursive properties of Chebyshev polynomials, we can derive an approximate solution without the need to compute $H^{-1}$ explicitly. In addition to the time and message complexities of this new solver, other implementation details can be found in the appendix. We now show quadratic convergence of the distributed Newton method:\n\nTheorem 1. 
Distributed Newton method using the Chebyshev solver exhibits the following two convergence phases for some constants $c_1$ and $c_2$:\n\nStrict Decrease: if $\\|\\nabla q(\\lambda_s)\\|_2 > c_1$, then $\\|\\nabla q(\\lambda_{s+1})\\|_2 - \\|\\nabla q(\\lambda_s)\\|_2 \\leq -c_2 \\frac{\\mu_2^4(\\mathcal{L})}{\\mu_n^3(\\mathcal{L})}$.\n\nQuadratic Decrease: if $\\|\\nabla q(\\lambda_s)\\|_2 \\leq c_1$, then for any $l \\geq 1$: $\\|\\nabla q(\\lambda_{s+l})\\|_2 \\leq \\frac{2c_1}{2^{2^l}} + O(\\epsilon)$.\n\n5 Experiments & Results\n\nWe conducted two sets of experiments to compare against distributed and multitask learning methods. On the distributed side, we evaluated our algorithm against five other approaches: 1) ADD [27], 2) ADMM [11], 3) distributed averaging [30], 4) network-newton [26, 25], and 5) sub-gradients. We are chiefly interested in the convergence speeds of both the objective value and the consensus error, as well as the communication overhead and running times of these approaches. The comparison against [11] (which we title distributed ADMM in the figures) allows us to understand whether we surpass the state of the art, while that against ADD and network-newton sheds light on the accuracy of our Newton-direction approximation. When it comes to online methods, we compare our performance in terms of jump-start and asymptotic performance to policy gradients [31, 32], PG-ELLA [10], and GO-MTL [33]. Our experiments ran on five systems: simple mass (SM), double mass (DM), cart-pole (CP), helicopter (HC), and humanoid robot (HR).\n\nWe followed the experimental protocol in [10, 33], where we generated 5000 SM, 500 DM, and 1000 CP tasks by varying the dynamical parameters of each of the above systems. These tasks were then\n\n\u2077Please note that the solution of this system can be split into $dk$ linear systems that can be solved efficiently using the distributed Chebyshev solver. 
Due to space constraints, these details can be found in the appendix.

[Figure 2: (a) local communication exchange vs. accuracy requirements on HC; (b) time to convergence [sec] on SM, DM, and CP; (c) time to convergence [sec] on HC and HR; (d) number of iterations to convergence on S. Random, M. Random, L. Random, and bar-bell graphs, with off-scale bars marked > 700 and > 1000. Each panel compares SDD-Newton, ADD-Newton, ADMM, distributed averaging, network Newton, and distributed gradients.]

(a) Jump-Start (b) Asymptotic Performance

Figure 3: Demonstration of jump-start and asymptotic results.

distributed over graphs with edges generated uniformly at random. Namely, a graph of 10 nodes and 25 edges was used for both the SM and DM experiments, while one with 50 nodes and 150 edges was used for CP. To distribute our computations, we made use of MATLAB's parallel pool running on 10 nodes. For all methods, tasks were assigned evenly across 10 agents⁸. An ε = 1/100 was provided to the Chebyshev solver for determining the approximate Newton direction in all cases. Step sizes were determined separately for each algorithm using a grid-search-like technique over {0.01, . . . , 1} to ensure the best operating conditions. Results reporting improvements in the consensus error (i.e., the error measuring the deviation from agreement among the nodes) can be found in the appendix due to space constraints.
Communication Overhead & Running Times: It can be argued that our improved results come at a high communication cost between processors. This may be true, as our method relies on an SDD-solver while others exchange only a few messages per iteration.
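The consensus error reported in these experiments measures deviation from agreement among the nodes. One simple way to quantify it is the largest distance between any agent's local parameter copy and the network-wide average; the sketch below uses that convention as our own assumption, which need not match the paper's exact metric.

```python
import math

def consensus_error(local_params):
    """Deviation from agreement among nodes: the maximum Euclidean
    distance between any agent's local parameter vector and the
    network-wide average. Equals zero exactly when all agents agree."""
    n_agents = len(local_params)
    dim = len(local_params[0])
    # Network-wide average of the local copies.
    mean = [sum(p[j] for p in local_params) / n_agents for j in range(dim)]
    # Largest per-agent distance to that average.
    return max(
        math.sqrt(sum((p[j] - mean[j]) ** 2 for j in range(dim)))
        for p in local_params
    )

# Three agents holding identical copies of a 2-D parameter vector.
params = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
print(consensus_error(params))  # exact agreement -> 0.0
```

A distributed optimiser is then judged by how quickly this quantity falls below a threshold such as the 10⁻⁵ used in the topology experiments.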
We conducted an experiment measuring local communication exchange with respect to accuracy requirements. Results on the HC system, reported in Figure 2a, demonstrate that this increase is negligible compared to other methods. Clearly, as accuracy demands increase, so does the communication overhead of all algorithms. Distributed SDD-Newton has a growth rate proportional to the condition number of the graph, which is much slower than the exponential growth observed for the other techniques. Having shown that the increase in communication cost is small, we now turn our attention to the running times to convergence on all dynamical systems. Figures 2b and 2c report running times to convergence computed according to a 10⁻⁵ error threshold. All these experiments were run on a small random graph of 20 nodes and 50 edges. Clearly, our method is faster than the others for both low- and high-dimensional policies. A final question to be answered concerns the effect of different graph topologies on the performance of SDD-Newton. Taking the HC benchmark, we generated four graph topologies representing small (S. Random), medium (M. Random), and large (L. Random) random networks, and a bar-bell graph, with nodes varying from 10 to 150 and edges from 25 to 250. The bar-bell graph contained two cliques of 10 nodes each, connected by a 10-node line graph. We then measured the number of iterations required by all algorithms to achieve a consensus error of 10⁻⁵. Figure 2d reports these results, showing that our method is again faster than the others.
Benchmarking Against RL: We finally assessed our method against the current MTRL literature, including PG-ELLA [10] and GO-MTL [33]. For the experimental procedure, we followed the technique described in [11], where the reward function was given by −√(x_h − x_ref), with x_ref being the reference state.
As base-learners we used policy gradients as detailed in [34], which acquired 1000 trajectories of length 150 each. We report jump-start and asymptotic performance in Figures 3a and 3b. These results show that our method can outperform the others in terms of both jump-start and asymptotic performance while requiring fewer iterations. Moreover, it is clear that our method outperforms streaming models, e.g., PG-ELLA.

6 Conclusions & Future Work

We proposed a distributed solver for multitask reinforcement learning with quadratic convergence. Our next steps include developing an incremental version of our algorithm using generalised Hessians, and conducting experiments on truly distributed architectures to quantify the trade-off between communication and computation.

⁸When graphs grew larger, nodes were grouped together and provided to one processor.

[Figure 3: bar charts of jump-start improvement (%) and asymptotic performance (%) on SM, CP, HC, and DM for PG-ELLA, GO-MTL, Dist-ADMM, and SDD-Newton.]

References

[1] Thira Chavarnakul and David Enke. A hybrid stock trading system for intelligent technical analysis-based equivolume charting. Neurocomputing, 72(16-18), 2009.

[2] Marc Peter Deisenroth, Peter Englert, Jochen Peters, and Dieter Fox. Multi-task policy search for robotics. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 3876–3881. IEEE, 2014.

[3] Jens Kober and Jan R. Peters. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems, pages 849–856, 2009.

[4] Matthew E. Taylor and Peter Stone. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research, 10:1633–1685, 2009.

[5] Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. Sequential Transfer in Multi-armed Bandit with Finite Set of Models.
In Advances in Neural Information Processing Systems 26, 2013.

[6] A. Lazaric. Transfer in Reinforcement Learning: A Framework and a Survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State of the Art. Springer, 2011.

[7] Matthijs Snel and Shimon Whiteson. Learning potential functions and their representations for multi-task reinforcement learning. Autonomous Agents and Multi-Agent Systems, 28(4):637–681, 2014.

[8] Sebastian Thrun and Joseph O'Sullivan. Learning More From Less Data: Experiment in Lifelong Learning. In Seminar Digest, 1996.

[9] Paul Ruvolo and Eric Eaton. ELLA: An Efficient Lifelong Learning Algorithm. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013.

[10] Haitham Bou-Ammar, Eric Eaton, Paul Ruvolo, and Matthew Taylor. Online Multi-task Learning for Policy Gradient Methods. In Tony Jebara and Eric P. Xing, editors, Proceedings of the 31st International Conference on Machine Learning. JMLR Workshop and Conference Proceedings, 2014.

[11] Salam El Bsat, Haitham Bou-Ammar, and Matthew Taylor. Scalable Multitask Policy Gradient Reinforcement Learning. In AAAI, February 2017.

[12] Pedro A. Forero, Alfonso Cano, and Georgios B. Giannakis. Consensus-based distributed support vector machines. Journal of Machine Learning Research, 11:1663–1707, August 2010.

[13] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[14] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

[15] Marc Toussaint. Robot trajectory optimization using approximate inference. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1049–1056, 2009.

[16] Gerhard Neumann. Variational Inference for Policy Search in Changing Situations.
In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 817–824, 2011.

[17] Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems 26, pages 207–215, 2013.

[18] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Rémi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimization. In ICLR, pages 1–22, 2018.

[19] Hui Li, Xuejun Liao, and Lawrence Carin. Multi-task reinforcement learning in partially observable stochastic environments. Journal of Machine Learning Research, 10:1131–1186, 2009.

[20] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

[21] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic Variational Inference. Journal of Machine Learning Research, 14:1303–1347, 2013.

[22] A. Nedic and A. E. Ozdaglar. Distributed Subgradient Methods for Multi-Agent Optimization. IEEE Transactions on Automatic Control, (1):48–61, 2009.

[23] Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pages 5445–5450. IEEE, 2012.

[24] J. L. Goffin. On Convergence Rates of Subgradient Optimization Methods. Mathematical Programming, 13, 1977.

[25] A. Mokhtari, Q. Ling, and A. Ribeiro. Network Newton-Part I: Algorithm and Convergence. ArXiv e-prints, 2015, 1504.06017.

[26] A. Mokhtari, Q. Ling, and A. Ribeiro. Network Newton-Part II: Convergence Rate and Implementation. ArXiv e-prints, 2015, 1504.06020.

[27] M. Zargham, A. Ribeiro, A. E. Ozdaglar, and A. Jadbabaie. Accelerated Dual Descent for Network Flow Optimization.
IEEE Transactions on Automatic Control, 2014.

[28] R. Tutunov, H. Bou-Ammar, and A. Jadbabaie. Distributed SDDM Solvers: Theory & Applications. ArXiv e-prints, 2015.

[29] Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. CoRR, abs/cs/0607105, 2006.

[30] A. Olshevsky. Linear Time Average Consensus on Fixed Graphs and Implications for Decentralized Optimization and Multi-Agent Control. ArXiv e-prints, 2014.

[31] Jens Kober and Jan Peters. Policy Search for Motor Primitives in Robotics. Machine Learning, 84(1-2), July 2011.

[32] John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust Region Policy Optimization. In ICML, 2015.

[33] Abhishek Kumar and Hal Daumé III. Learning Task Grouping and Overlap in Multi-Task Learning. In International Conference on Machine Learning (ICML), 2012.

[34] Jan Peters and Stefan Schaal. Natural Actor-Critic. Neurocomputing, 71, 2008.