{"title": "Accelerated consensus via Min-Sum Splitting", "book": "Advances in Neural Information Processing Systems", "page_first": 1374, "page_last": 1384, "abstract": "We apply the Min-Sum message-passing protocol to solve the consensus problem in distributed optimization. We show that while the ordinary Min-Sum algorithm does not converge, a modified version of it known as Splitting yields convergence to the problem solution. We prove that a proper choice of the tuning parameters allows Min-Sum Splitting to yield subdiffusive accelerated convergence rates, matching the rates obtained by shift-register methods. The acceleration scheme embodied by Min-Sum Splitting for the consensus problem bears similarities with lifted Markov chains techniques and with multi-step first order methods in convex optimization.", "full_text": "Accelerated consensus via Min-Sum Splitting\n\nPatrick Rebeschini\n\nDepartment of Statistics\n\nUniversity of Oxford\n\nSekhar Tatikonda\n\nDepartment of Electrical Engineering\n\npatrick.rebeschini@stats.ox.ac.uk\n\nsekhar.tatikonda@yale.edu\n\nYale University\n\nAbstract\n\nWe apply the Min-Sum message-passing protocol to solve the consensus problem\nin distributed optimization. We show that while the ordinary Min-Sum algorithm\ndoes not converge, a modi\ufb01ed version of it known as Splitting yields convergence to\nthe problem solution. We prove that a proper choice of the tuning parameters allows\nMin-Sum Splitting to yield subdiffusive accelerated convergence rates, matching\nthe rates obtained by shift-register methods. 
The acceleration scheme embodied by Min-Sum Splitting for the consensus problem bears similarities with lifted Markov chains techniques and with multi-step first order methods in convex optimization.

1 Introduction

Min-Sum is a local message-passing algorithm designed to distributedly optimize an objective function that can be written as a sum of component functions, each of which depends on a subset of the decision variables. Due to its simplicity, Min-Sum has emerged as a canonical protocol to address large scale problems in a variety of domains, including signal processing, statistics, and machine learning. For problems supported on tree graphs, the Min-Sum algorithm corresponds to dynamic programming and is guaranteed to converge to the problem solution. For arbitrary graphs, the ordinary Min-Sum algorithm may fail to converge, or it may converge to something different from the problem solution [28]. In the case of strictly convex objective functions, there are known sufficient conditions to guarantee the convergence and correctness of the algorithm. The most general condition requires the Hessian of the objective function to be scaled diagonally dominant [28, 25]. While the Min-Sum scheme can be applied to optimization problems with constraints, by incorporating the constraints into the objective function as hard barriers, the known sufficient conditions do not apply in this case.
In [34], a generalization of the traditional Min-Sum scheme has been proposed, based on a reparametrization of the original objective function. This algorithm is called Splitting, as it can be derived by creating equivalent graph representations for the objective function by "splitting" the nodes of the original graph. 
In the case of unconstrained problems with quadratic objective functions, where Min-Sum is also known as Gaussian Belief Propagation, the algorithm with splitting has been shown to yield convergence in settings where the ordinary Min-Sum does not converge [35]. To date, however, a theoretical investigation of the rates of convergence of Min-Sum Splitting has not been carried out.
In this paper we establish rates of convergence for the Min-Sum Splitting algorithm applied to solve the consensus problem, which can be formulated as an equality-constrained problem in optimization. The basic version of the consensus problem is the network averaging problem. In this setting, each node in a graph is assigned a real number, and the goal is to design a distributed protocol that allows the nodes to iteratively exchange information with their neighbors so as to arrive at consensus on the average across the network. Early work includes [42, 41]. The design of distributed algorithms to solve the averaging problem has received a lot of attention recently, as consensus represents a widely-used primitive to compute aggregate statistics in a variety of fields. Applications include, for instance, estimation problems in sensor networks, distributed tracking and localization, multi-agent coordination, and distributed inference [20, 21, 9, 19]. Consensus is typically combined with some form of local optimization over a peer-to-peer network, as in the case of iterative subgradient methods [29, 40, 17, 10, 6, 16, 39].

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In large-scale machine learning, consensus is used as a tool to distribute the minimization of a loss function over a large dataset into a network of processors that can exchange and aggregate information, and only have access to a subset of the data [31, 11, 26, 3].
Classical algorithms to solve the network averaging problem involve linear dynamical systems supported on the nodes of the graph. Even when the coefficients that control the dynamics are optimized, these methods are known to suffer from a "diffusive" rate of convergence, which corresponds to the rate of convergence to stationarity exhibited by the "diffusion" random walk naturally associated to a graph [44, 2]. This rate is optimal for graphs with good expansion properties, such as complete graphs or expanders. In this case the convergence time, i.e., the number of iterations required to reach a prescribed level of error accuracy ε > 0 in the ℓ2 norm relative to the initial condition, scales independently of the dimension of the problem, as Θ(log 1/ε). For graphs with geometry this rate is suboptimal [7], and it does not yield a convergence time that matches the lower bound Ω(D log 1/ε), where D is the graph diameter [37, 36]. For example, in both cycle graphs and in grid-like topologies the number of iterations scales like Θ(D^2 log 1/ε) (if n is the number of nodes, D ∼ n in a cycle and D ∼ √n in a two-dimensional torus). Θ(D^2 log 1/ε) is also the convergence time exhibited in random geometric graphs, which represent the relevant topologies for many applications in sensor networks [9]. 
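The diffusive Θ(D^2 log 1/ε) behavior described above can be observed numerically. The sketch below is our own illustration, not code from the paper: it runs the linear consensus iteration x^t = W x^{t−1} on cycle graphs, using a lazy symmetric random walk as W (an assumed choice) and the slowest eigenvector of the walk as worst-case initial condition, and counts the iterations needed to shrink the ℓ2 error by a factor ε. Doubling n (hence the diameter D) roughly quadruples the count.

```python
import math

def lazy_cycle_step(x):
    # One consensus step x <- W x on the n-cycle with the lazy walk:
    # stay with probability 1/2, move to each neighbour with probability 1/4.
    n = len(x)
    return [0.5 * x[i] + 0.25 * x[i - 1] + 0.25 * x[(i + 1) % n] for i in range(n)]

def consensus_time(n, eps=1e-3):
    """Iterations until the l2 error relative to the initial condition is <= eps."""
    # Worst-case initial condition: the slowest eigenvector of the lazy walk.
    x = [math.cos(2.0 * math.pi * i / n) for i in range(n)]
    avg = sum(x) / n
    err = lambda y: math.sqrt(sum((v - avg) ** 2 for v in y))
    err0 = err(x)
    t = 0
    while err(x) > eps * err0:
        x = lazy_cycle_step(x)
        t += 1
    return t

t20, t40 = consensus_time(20), consensus_time(40)
```

Since the spectral gap of the lazy walk on the n-cycle is (1 − cos(2π/n))/2 ≈ π²/n², the measured ratio t40/t20 comes out close to 4, the D^2 scaling discussed in the text.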
In [7] it was established that for a class of graphs with geometry (polynomial growth or finite doubling dimension), the mixing time of any reversible Markov chain scales at least like D^2, embodying the fact that symmetric walks on these graphs take D^2 steps to travel distances of order D.
Min-Sum schemes to solve the consensus problem have been previously investigated in [27]. The authors show that the ordinary Min-Sum algorithm does not converge in graphs with cycles. They investigate a modified version of it that uses a soft barrier function to incorporate the equality constraints into the objective function. In the case of d-regular graphs, upon a proper choice of initial conditions, the authors show that the algorithm they propose reduces to a linear process supported on the directed edges of the graph, and they characterize the convergence time of the algorithm in terms of the Cesàro mixing time of a Markov chain defined on the set of directed edges of the original graph. In the case of cycle graphs (i.e., d = 2), they prove that the mixing time scales like O(D), which yields the convergence time O(D/ε log 1/ε). See Theorem 4 and Theorem 5 in [27]. In the case of (d/2)-dimensional tori (D ∼ n^{2/d}), they conjecture that the mixing time is Θ(D^{2(d−1)/d}), but do not present bounds for the convergence time. See Conjecture 1 in [27]. For other graph topologies, they leave the mixing time (and convergence time) achieved by their method as an open question.
In this paper we show that the Min-Sum scheme based on splitting yields convergence to the consensus solution, and we analytically establish rates of convergence for any graph topology. First, we show that a certain parametrization of the Min-Sum protocol for consensus yields a linear message-passing update for any graph and for any choice of initial conditions. 
Second, we show that the introduction of the splitting parameters is not only fundamental to guarantee the convergence and correctness of the Min-Sum scheme in the consensus problem, but that proper tuning of these parameters yields accelerated (i.e., "subdiffusive") asymptotic rates of convergence. We establish a square-root improvement for the asymptotic convergence time over diffusive methods, which allows Min-Sum Splitting to scale like O(D log(D/ε)) for cycles and tori. Our results show that Min-Sum schemes are competitive and get close to the optimal rate O(D log(1/ε)) recently established for some algorithms based on Nesterov's acceleration [30, 36]. The main tool used for the analysis involves the construction of an auxiliary linear process supported on the nodes of the original graph to track the evolution of the Min-Sum Splitting algorithm, which is instead supported on the directed edges. This construction allows us to relate the convergence time of the Min-Sum scheme to the spectral gap of the matrix describing the dynamics of the auxiliary process, which is easier to analyze than the matrix describing the dynamics on the edges as in [27].
In the literature, overcoming the suboptimal convergence rate of classical algorithms for network averaging consensus has motivated the design of several accelerated methods. Two main lines of research have been developed, and seem to have evolved independently of each other: one involves lifted Markov chains techniques, see [37] for a review, the other involves accelerated first order methods in convex optimization, see [13] for a review. Another contribution of this paper is to show that Min-Sum Splitting bears similarities with both types of accelerated methods. On the one hand, Min-Sum can be seen as a process on a lifted space, which is the space of directed edges in the original graph. 
Here, splitting is seen to introduce a directionality in the message exchange of the ordinary Min-Sum protocol that is analogous to the directionality introduced in non-reversible random walks on lifted graphs to achieve faster convergence to stationarity. The advantage of the Min-Sum algorithm over lifted Markov chain methods is that no lifted graph needs to be constructed. On the other hand, the directionality induced on the edges by splitting translates into a memory term for the auxiliary algorithm running on the nodes. This memory term, which allows nodes to remember previous values and incorporate them into the next update, directly relates the Min-Sum Splitting algorithm to accelerated multi-step first order methods in convex optimization. In particular, we show that a proper choice of the splitting parameters recovers the same matrix that supports the evolution of shift-register methods used in numerical analysis for linear solvers, and, as a consequence, we recover the same accelerated rate of convergence for consensus [45, 4, 24].
To summarize, the main contributions of this paper are:

1. First connection of Min-Sum schemes with lifted Markov chains techniques and multi-step methods in convex optimization.
2. First proof of how the directionality embedded in Belief Propagation protocols can be tuned and exploited to accelerate the convergence rate towards the problem solution.
3. First analysis of convergence rates for Min-Sum Splitting. New proof technique based on the introduction of an auxiliary process to track the evolution of the algorithm on the nodes.
4. Design of a Min-Sum protocol for the consensus problem that achieves better convergence rates than the ones established (and conjectured) for the Min-Sum method in [27].

Our results motivate further studies to generalize the acceleration due to splittings to other problems.
The paper is organized as follows. 
In Section 2 we introduce the Min-Sum Splitting algorithm in its general form. In Section 3 we describe the consensus problem and review the classical diffusive algorithms. In Section 4 we review the main accelerated methods that have been proposed in the literature. In Section 5 we specialize the Min-Sum Splitting algorithm to the consensus problem, and show that a proper parametrization yields a linear exchange of messages supported on the directed edges of the graph. In Section 6 we derive the auxiliary message-passing algorithm that allows us to track the evolution of the Min-Sum Splitting algorithm via a linear process with memory supported on the nodes of the graph. In Section 7 we state Theorem 1, which shows that a proper choice of the tuning parameters recovers the rates of shift-registers. Proofs are given in the supplementary material.

2 The Min-Sum Splitting algorithm

The Min-Sum algorithm is a distributed routine to optimize a cost function that is the sum of components supported on a given graph structure. Given a simple graph G = (V, E) with n := |V| vertices and m := |E| edges, let us assume that we are given a set of functions φ_v : R → R ∪ {∞}, for each v ∈ V, and φ_vw = φ_wv : R × R → R ∪ {∞}, for each {v, w} ∈ E, and that we want to solve the following problem over the decision variables x = (x_v)_{v∈V} ∈ R^V:

minimize Σ_{v∈V} φ_v(x_v) + Σ_{{v,w}∈E} φ_vw(x_v, x_w).   (1)

The Min-Sum algorithm describes an iterative exchange of messages, which are functions of the decision variables, associated to each directed edge in G. Let E := {(v, w) ∈ V × V : {v, w} ∈ E} be the set of directed edges associated to the undirected edges in E (each undirected edge corresponds to two directed edges). 
In this work we consider the synchronous implementation of the Min-Sum algorithm where at any given time step s, each directed edge (v, w) supports two messages, ξ̂^s_vw, μ̂^s_vw : R → R ∪ {∞}. Messages are computed iteratively. Given an initial choice of messages μ̂^0 = (μ̂^0_vw)_{(v,w)}, the Min-Sum scheme that we investigate in this paper is given in Algorithm 1. Henceforth, for each v ∈ V, let N(v) := {w ∈ V : {v, w} ∈ E} denote the neighbors of node v.
The formulation of the Min-Sum scheme given in Algorithm 1, which we refer to as Min-Sum Splitting, was introduced in [34]. This formulation admits as tuning parameters the real number δ ∈ R and the symmetric matrix Γ = (Γ_vw)_{v,w∈V} ∈ R^{V×V}. Without loss of generality, we assume that the sparsity of Γ respects the structure of the graph G, in the sense that if {v, w} ∉ E then Γ_vw = 0 (note that Algorithm 1 only involves summations with respect to nearest neighbors in the graph). The choice of δ = 1 and Γ = A, where A is the adjacency matrix defined as A_vw := 1 if {v, w} ∈ E and A_vw := 0 otherwise, yields the ordinary Min-Sum algorithm.

Algorithm 1: Min-Sum Splitting
Input: messages μ̂^0 = (μ̂^0_vw)_{(v,w)}; parameters δ ∈ R and Γ ∈ R^{V×V} symmetric; time t ≥ 1.
for s ∈ {1, . . . , t} do
  ξ̂^s_wv = φ_v/δ − μ̂^{s−1}_wv + Σ_{z∈N(v)} Γ_zv μ̂^{s−1}_zv, for each directed edge (w, v);
  μ̂^s_wv = min_{z∈R} {φ_vw(· , z)/Γ_vw + (δ − 1) ξ̂^s_wv + δ ξ̂^s_vw(z)}, for each directed edge (w, v);
μ^t_v = φ_v + δ Σ_{w∈N(v)} Γ_wv μ̂^t_wv, v ∈ V;
Output: x^t_v = arg min_{z∈R} μ^t_v(z), v ∈ V.

For an arbitrary choice of strictly positive integer parameters, Algorithm 1 can be seen to correspond to the ordinary Min-Sum algorithm applied to a new formulation of the original problem, where an equivalent objective function is obtained from the original one in (1) by splitting each term φ_vw into Γ_vw ∈ N \ {0} terms, and each term φ_v into δ ∈ N \ {0} terms. Namely, minimize Σ_{v∈V} Σ_{k=1}^{δ} φ^k_v(x_v) + Σ_{{v,w}∈E} Σ_{k=1}^{Γ_vw} φ^k_vw(x_v, x_w), with φ^k_v := φ_v/δ and φ^k_vw := φ_vw/Γ_vw.^1 Hence the reason for the name "splitting" algorithm. Despite this interpretation, Algorithm 1 is defined for any real choice of parameters δ and Γ.
In this paper we investigate the convergence behavior of the Min-Sum Splitting algorithm for some choices of δ and Γ, in the case of the consensus problem that we define in the next section.

3 The consensus problem and standard diffusive algorithms

Given a simple graph G = (V, E) with n := |V| nodes, for each v ∈ V let φ_v : R → R ∪ {∞} be a given function. The consensus problem is defined as follows:

minimize Σ_{v∈V} φ_v(x_v)   subject to x_v = x_w, {v, w} ∈ E.   (2)

We interpret G as a communication graph where each node represents an agent, and each edge represents a communication channel between neighboring agents. 
Each agent v is given the function φ_v, and agents collaborate by iteratively exchanging information with their neighbors in G with the goal of eventually arriving at the solution of problem (2). The consensus problem amounts to designing distributed algorithms to solve problem (2) that respect the communication constraints encoded by G.
A classical setting investigated in the literature is the least-square case yielding the network averaging problem, where for a given b ∈ R^V we have^2 φ_v(z) := (1/2) z^2 − b_v z and the solution of problem (2) is b̄ := (1/n) Σ_{v∈V} b_v. In this setup, each agent v ∈ V is given a number b_v, and agents want to exchange information with their neighbors according to a protocol that allows each of them to eventually reach consensus on the average b̄ across the entire network. Classical algorithms to solve this problem involve a linear exchange of information of the form x^t = W x^{t−1} with x^0 = b, for a given matrix W ∈ R^{V×V} that respects the topology of the graph G (i.e., W_vw ≠ 0 only if {v, w} ∈ E or v = w), so that W^t → 11^T/n for t → ∞, where 1 is the all ones vector. This linear iteration allows for a distributed exchange of information among agents, as at any iteration each agent v ∈ V only receives information from his/her neighbors N(v) via the update: x^t_v = W_vv x^{t−1}_v + Σ_{w∈N(v)} W_vw x^{t−1}_w. The original literature on this problem investigates the case where the matrix W has non-negative coefficients and represents the transition matrix of a random walk on the nodes of the graph G, so that W_vw is interpreted as the probability that a random walk at node v visits node w in the next time step. 
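To make the local update x^t_v = W_vv x^{t−1}_v + Σ_{w∈N(v)} W_vw x^{t−1}_w concrete, here is a minimal sketch (our own toy example, not code from the paper) on a 3 × 3 torus. The lazy uniform weights W_vv = 1/2 and W_vw = 1/8 are an assumed choice that is doubly stochastic on this 4-regular graph, so every node converges to the network average b̄ and the sum of the values is preserved at every step.

```python
def torus_neighbors(i, j, k):
    # the four wrap-around neighbours of node (i, j) on the k x k torus
    return [((i - 1) % k, j), ((i + 1) % k, j), (i, (j - 1) % k), (i, (j + 1) % k)]

def consensus_step(x, k):
    # local update x_v <- W_vv x_v + sum_w W_vw x_w with lazy uniform
    # weights on the 4-regular torus: W_vv = 1/2, W_vw = 1/8 per neighbour
    return {
        (i, j): 0.5 * x[(i, j)] + sum(0.125 * x[w] for w in torus_neighbors(i, j, k))
        for i in range(k) for j in range(k)
    }

k = 3
b = {(i, j): float(3 * i + j) for i in range(k) for j in range(k)}  # the b_v values
avg = sum(b.values()) / len(b)                                     # the target average
x = dict(b)
for _ in range(200):
    x = consensus_step(x, k)
```

After 200 iterations the values are numerically indistinguishable from b̄, since the second-largest eigenvalue modulus of this W is well below 1 on so small a graph.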
A popular choice is given by the Metropolis-Hastings method [37], which involves the doubly-stochastic matrix W^MH defined as W^MH_vw := 1/(2 d_max) if {v, w} ∈ E, W^MH_vv := 1 − d_v/(2 d_max), and W^MH_vw := 0 otherwise, where d_v := |N(v)| is the degree of node v, and d_max := max_{v∈V} d_v is the maximum degree of the graph G.

^1 As mentioned in [34], one can also consider a more general formulation of the splitting algorithm with δ → (δ_v)_{v∈V} (possibly also with time-varying parameters). The current choice of the algorithm is motivated by the fact that in the present case the output of the algorithm can be tracked by analyzing a linear system on the nodes of the graph, as we will show in Section 5.
^2 In the literature, the classical choice is φ_v(z) := (1/2)(z − b_v)^2, which yields the same results as the quadratic function that we define in the main text, as constant terms in the objective function do not alter the optimal point of the problem but only the optimal value of the objective function.

In [44], necessary and sufficient conditions are given for a generic matrix W to satisfy W^t → 11^T/n, namely, 1^T W = 1^T, W1 = 1, and ρ(W − 11^T/n) < 1, where ρ(M) denotes the spectral radius of a given matrix M. The authors show that the problem of choosing the optimal symmetric matrix W that minimizes ρ(W − 11^T/n) = ‖W − 11^T/n‖ (where ‖M‖ denotes the spectral norm of a matrix M, which coincides with ρ(M) if M is symmetric) is a convex problem and it can be cast as a semi-definite program. Typically, the optimal matrix involves negative coefficients, hence departing from the random walk interpretation. 
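Returning to the Metropolis-Hastings matrix W^MH defined above: the following quick numerical check (a hypothetical 4-node star graph chosen purely for illustration) confirms that the construction is symmetric and doubly stochastic, and that the induced iteration x^t = W^MH x^{t−1} drives every node to the average of the initial values.

```python
# Hypothetical example graph: the star K_{1,3} with hub node 1.
edges = [(0, 1), (1, 2), (1, 3)]
n = 4
nbrs = {v: [] for v in range(n)}
for v, w in edges:
    nbrs[v].append(w)
    nbrs[w].append(v)
dmax = max(len(nbrs[v]) for v in range(n))  # maximum degree d_max

# W^MH: 1/(2 d_max) on edges, 1 - d_v/(2 d_max) on the diagonal, 0 elsewhere.
W = [[0.0] * n for _ in range(n)]
for v in range(n):
    W[v][v] = 1.0 - len(nbrs[v]) / (2.0 * dmax)
    for w in nbrs[v]:
        W[v][w] = 1.0 / (2.0 * dmax)

row_sums = [sum(row) for row in W]
col_sums = [sum(W[v][w] for v in range(n)) for w in range(n)]

b = [0.0, 1.0, 2.0, 5.0]   # initial values; their average is 2.0
x = b[:]
for _ in range(2000):
    x = [sum(W[v][w] * x[w] for w in range(n)) for v in range(n)]
```

Double stochasticity (rows and columns summing to 1) is what guarantees that the average is the fixed point the iteration converges to.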
However, even the optimal choice of symmetric matrix is shown to yield a diffusive rate of convergence, which is already attained by the matrix W^MH [7]. This rate corresponds to the speed of convergence to stationarity achieved by the diffusion random walk, defined as the Markov chain with transition matrix diag(d)^{−1} A, where diag(d) ∈ R^{V×V} is the degree matrix, i.e., diagonal with diag(d)_vv := d_v, and A ∈ R^{V×V} is the adjacency matrix, i.e., symmetric with A_vw := 1 if {v, w} ∈ E, and A_vw := 0 otherwise. For instance, the condition ‖W − 11^T/n‖^t ≤ ε, where ‖ · ‖ is the ℓ2 norm, yields a convergence time that scales like t ∼ Θ(D^2 log(1/ε)) in cycle graphs and tori [33], where D is the graph diameter. The authors in [7] established that for a class of graphs with geometry (polynomial growth or finite doubling dimension) the mixing time of any reversible Markov chain scales at least like D^2, and it is achieved by Metropolis-Hastings [37].

4 Accelerated algorithms

To overcome the diffusive behavior typical of classical consensus algorithms, two main types of approaches have been investigated in the literature, which seem to have been developed independently.
The first approach involves the construction of a lifted graph Ĝ = (V̂, Ê) and of a linear system supported on the nodes of it, of the form x̂^t = Ŵ x̂^{t−1}, where Ŵ ∈ R^{V̂×V̂} is the transition matrix of a non-reversible Markov chain on the nodes of Ĝ. This approach has its origins in the work of [8] and [5], where it was observed for the first time that certain non-reversible Markov chains on properly-constructed lifted graphs yield better mixing times than reversible chains on the original graphs. 
For some simple graph topologies, such as cycle graphs and two-dimensional grids, the construction of the optimal lifted graphs is well-understood already from the works in [8, 5]. A general theory of lifting in the context of Gossip algorithms has been investigated in [18, 37]. However, this construction incurs additional overhead, which yields non-optimal computational complexity, even for cycle graphs and two-dimensional grids. Typically, lifted random walks on arbitrary graph topologies are constructed on a case-by-case basis, exploiting the specifics of the graph at hand. This is the case, for instance, for random geometric graphs [22, 23]. The key property that allows non-reversible lifted Markov chains to achieve subdiffusive rates is the introduction of a directionality in the process to break the diffusive nature of reversible chains. The strength of the directionality depends on global properties of the original graph, such as the number of nodes [8, 5] or the diameter [37]. See Figure 1.

Figure 1: (a) Symmetric Markov chain W on the nodes of the ring graph G. (b) Non-reversible Markov chain Ŵ on the nodes of the lifted graph Ĝ [8]. (c) Ordinary Min-Sum algorithm on the directed edges associated to G (i.e., K̂(δ, Γ), Algorithm 2, with δ = 1 and Γ = A, where A is the adjacency matrix of G). (d) Min-Sum Splitting K̂(δ, Γ), Algorithm 2, with δ = 1, Γ = γW, γ = 2/(1 + √(1 − ρ_W^2)) as in Theorem 1. Here, ρ_W is Θ(1 − 1/n^2) and γ ≈ 2(1 − 1/n) for n large.

The matrix K̂(δ, Γ) has negative entries, departing from the Markov chain interpretation. This is also the case for the optimal tuning in classical consensus schemes [44] and for the ADMM lifting in [12].
The second approach involves designing linear updates that are supported on the original graph G and keep track of a longer history of previous iterates. This approach relies on the fact that the original consensus update x^t = W x^{t−1} can be interpreted as a primal-dual gradient ascent method to solve problem (2) with a quadratic objective function [32]. This allows the implementation of accelerated gradient methods. To the best of our knowledge, this idea was first introduced in [14], and since then it has been investigated in many other papers. We refer to [13, 24], and references therein, for a review and comparison of multi-step accelerated methods for consensus. The simplest multi-step extension of gradient methods is Polyak's "heavy ball," which involves adding a "momentum" term to the standard update and yields a primal iterate of the form x^t = W x^{t−1} + γ(x^{t−1} − x^{t−2}). Another popular multi-step method involves Nesterov's acceleration, and yields x^t = (1 + γ) W x^{t−1} − γ W x^{t−2}. Aligned with the idea of adding a momentum term is the idea of adding a shift register term, which yields x^t = (1 + γ) W x^{t−1} − γ x^{t−2}. For our purposes, we note that these methods can be written as

(x^t ; x^{t−1}) = K (x^{t−1} ; x^{t−2}),   (3)

for a certain matrix K ∈ R^{2n×2n}. As in the case of lifted Markov chains techniques, also multi-step methods are able to achieve accelerated rates by exploiting some form of global information: the choice of the parameter γ that yields subdiffusive rates depends on the eigenvalues of W.
Remark 1. 
Beyond lifted Markov chains techniques and accelerated \ufb01rst order methods, many other\nalgorithms have been proposed to solve the consensus problem. The literature is vast. As we focus on\nMin-Sum schemes, an exhaustive literature review on consensus is beyond the scope of our work. Of\nparticular interest for our results is the distributed ADMM approach [3, 43, 38]. Recently in [12],\nfor a class of unconstrained problems with quadratic objective functions, it has been shown that\nmessage-passing ADMM schemes can be interpreted as lifting of gradient descent techniques. This\nprompts for further investigation to connect Min-Sum, ADMM, and accelerated \ufb01rst order methods.\n\nIn the next two sections we show that Min-Sum Splitting bears similarities with both types of\naccelerated methods described above. On the one hand, in Section 5 we show that the estimates xt\nv\u2019s\nof Algorithm 1 applied to the network averaging problem can be interpreted as the result of a linear\nprocess supported on a lifted space, i.e., the space E of directed edges associated to the undirected\nedges of G. On the other hand, in Section 6 we show that the estimates xt\nv\u2019s can be seen as the result\nof a linear multi-step process supported on the nodes of G, which can be written as in (3). Later on,\nin Section 7 and Section 8, we will see that the similarities just described go beyond the structure\nof the processes, and they extend to the acceleration mechanism itself. In particular, the choice of\nsplitting parameters that yields subdiffusive convergence rates, matching the asymptotic rates of shift\nregister methods, is also shown to depend on global information about G.\n\n5 Min-Sum Splitting for consensus\n\nWe apply Min-Sum Splitting to solve network averaging. 
We show that in this case the message-passing protocol is a linear exchange of parameters associated to the directed edges in $E$. Given $\delta \in \mathbb{R}$ and $\Gamma \in \mathbb{R}^{V\times V}$ symmetric, let $\hat h(\delta) \in \mathbb{R}^{E}$ be the vector defined as $\hat h(\delta)_{wv} := b_w + (1 - 1/\delta)\,b_v$, and let $\hat K(\delta,\Gamma) \in \mathbb{R}^{E\times E}$ be the matrix defined as
$$
\hat K(\delta,\Gamma)_{wv,zu} :=
\begin{cases}
\delta\,\Gamma_{zw} & \text{if } u = w,\ z \in N(w)\setminus\{v\},\\
\delta\,(\Gamma_{vw}-1) & \text{if } u = w,\ z = v,\\
(\delta-1)\,\Gamma_{zv} & \text{if } u = v,\ z \in N(v)\setminus\{w\},\\
(\delta-1)\,(\Gamma_{wv}-1) & \text{if } u = v,\ z = w,\\
0 & \text{otherwise.}
\end{cases}
\tag{4}
$$
Consider Algorithm 2 with initial conditions $\hat R^0 = (\hat R^0_{vw})_{(v,w)\in E} \in \mathbb{R}^{E}$, $\hat r^0 = (\hat r^0_{vw})_{(v,w)\in E} \in \mathbb{R}^{E}$.

Algorithm 2: Min-Sum Splitting, consensus problem, quadratic case
  Input: $\hat R^0, \hat r^0 \in \mathbb{R}^{E}$; $\delta \in \mathbb{R}$, $\Gamma \in \mathbb{R}^{V\times V}$ symmetric; $\hat K(\delta,\Gamma)$ defined in (4); $t \ge 1$.
  for $s \in \{1,\ldots,t\}$ do
    $\hat R^s = (2 - 1/\delta)\mathbf{1} + \hat K(\delta,\Gamma)\,\hat R^{s-1}$;  $\hat r^s = \hat h(\delta) + \hat K(\delta,\Gamma)\,\hat r^{s-1}$;
  Output: $x^t_v := \dfrac{b_v + \delta \sum_{w\in N(v)} \Gamma_{wv}\,\hat r^t_{wv}}{1 + \delta \sum_{w\in N(v)} \Gamma_{wv}\,\hat R^t_{wv}}$, $v \in V$.

Proposition 1. Let $\delta \in \mathbb{R}$ and $\Gamma \in \mathbb{R}^{V\times V}$ symmetric be given. Consider Algorithm 1 applied to problem (2) with $\phi_v(z) := \frac{1}{2} z^2 - b_v z$ and with quadratic initial messages: $\hat\mu^0_{vw}(z) = \frac{1}{2} \hat R^0_{vw} z^2 - \hat r^0_{vw} z$, for some $\hat R^0_{vw}, \hat r^0_{vw} \in \mathbb{R}$. Then the messages remain quadratic, i.e., $\hat\mu^s_{vw}(z) = \frac{1}{2} \hat R^s_{vw} z^2 - \hat r^s_{vw} z$ for any $s \ge 1$, and the parameters evolve as in Algorithm 2. If $1 + \delta \sum_{w\in N(v)} \Gamma_{wv} \hat R^t_{wv} > 0$ for any $v \in V$ and $t \ge 1$, then the output of Algorithm 2 coincides with the output of Algorithm 1.

6 Auxiliary message-passing scheme

We show that the output of Algorithm 2 can be tracked by a new message-passing scheme that corresponds to a multi-step linear exchange of parameters associated to the nodes of $G$. This auxiliary algorithm represents the main tool to establish convergence rates for the Min-Sum Splitting protocol, i.e., Theorem 1 below. The intuition behind the auxiliary process is that while Algorithm 1 (hence, Algorithm 2) involves an exchange of messages supported on the directed edges $E$, the computation of the estimates $x^t_v$ only involves the belief functions $\mu^t_v$, which are supported on the nodes of $G$. Due to the simple nature of the pairwise equality constraints in the consensus problem, in the present case a reparametrization allows one to track the output of Min-Sum via an algorithm that directly updates the belief functions on the nodes of the graph, which yields Algorithm 3.

Given $\delta \in \mathbb{R}$ and $\Gamma \in \mathbb{R}^{n\times n}$ symmetric, define the matrix $K(\delta,\Gamma) \in \mathbb{R}^{2n\times 2n}$ as
$$
K(\delta,\Gamma) :=
\begin{pmatrix}
(1-\delta)I - (1-\delta)\,\mathrm{diag}(\Gamma\mathbf{1}) + \delta\Gamma & \delta I - \delta\,\mathrm{diag}(\Gamma\mathbf{1}) + (1-\delta)\Gamma\\
\delta I & (1-\delta)I
\end{pmatrix},
\tag{5}
$$
where $I \in \mathbb{R}^{V\times V}$ is the identity matrix and $\mathrm{diag}(\Gamma\mathbf{1}) \in \mathbb{R}^{V\times V}$ is diagonal with $(\mathrm{diag}(\Gamma\mathbf{1}))_{vv} = (\Gamma\mathbf{1})_v = \sum_{w\in N(v)} \Gamma_{vw}$. Consider Algorithm 3 with initial conditions $R^0, r^0, Q^0, q^0 \in \mathbb{R}^{V}$.

Algorithm 3: Auxiliary message-passing
  Input: $R^0, r^0, Q^0, q^0 \in \mathbb{R}^{V}$; $\delta \in \mathbb{R}$, $\Gamma \in \mathbb{R}^{V\times V}$ symmetric; $K(\delta,\Gamma)$ defined in (5); $t \ge 1$.
  for $s \in \{1,\ldots,t\}$ do
    $\begin{pmatrix} R^s \\ Q^s \end{pmatrix} = K(\delta,\Gamma)\begin{pmatrix} R^{s-1} \\ Q^{s-1} \end{pmatrix}$;  $\begin{pmatrix} r^s \\ q^s \end{pmatrix} = K(\delta,\Gamma)\begin{pmatrix} r^{s-1} \\ q^{s-1} \end{pmatrix}$;
  Output: $x^t_v := r^t_v / R^t_v$, $v \in V$.

Proposition 2. Let $\delta \in \mathbb{R}$ and $\Gamma \in \mathbb{R}^{V\times V}$ symmetric be given. The output of Algorithm 2 with initial conditions $\hat R^0, \hat r^0 \in \mathbb{R}^{E}$ is the output of Algorithm 3 with $R^0_v := 1 + \delta \sum_{w\in N(v)} \Gamma_{wv} \hat R^0_{wv}$, $Q^0_v := 1 - \delta \sum_{w\in N(v)} \Gamma_{wv} \hat R^0_{wv}$, $r^0_v := b_v + \delta \sum_{w\in N(v)} \Gamma_{wv} \hat r^0_{wv}$, and $q^0_v := b_v - \delta \sum_{w\in N(v)} \Gamma_{vw} \hat r^0_{vw}$.

Proposition 2 shows that upon proper initialization, the outputs of Algorithm 2 and Algorithm 3 are equivalent. Hence, Algorithm 3 represents a tool to investigate the convergence behavior of the Min-Sum Splitting algorithm. Analytically, the advantage of the formulation given in Algorithm 3 over the one given in Algorithm 2 is that the former involves two coupled systems of $n$ equations whose convergence behavior can be explicitly linked to the spectral properties of the $n \times n$ matrix $\Gamma$, as we will see in Theorem 1 below. By contrast, the linear system of $2m$ equations in Algorithm 2 does not seem to exhibit an immediate link to the spectral properties of $\Gamma$. In this respect, we note that the previous paper that investigated Min-Sum schemes for consensus, i.e., [27], characterized the convergence rate of the algorithm under consideration (albeit only in the case of $d$-regular graphs, and upon initializing the quadratic terms to the fixed point) in terms of the spectral gap of a matrix that controls a linear system of $2m$ equations.
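As a concrete illustration of Algorithm 3, the following sketch (ours, not the authors' implementation; it assumes NumPy, and it adopts the tuning $\delta = 1$, $\Gamma = \gamma W$ analyzed in Section 7) builds the block matrix $K(\delta,\Gamma)$ from (5), uses the initialization of Proposition 2 for $\hat R^0 = \hat r^0 = 0$, i.e., $R^0 = Q^0 = \mathbf{1}$ and $r^0 = q^0 = b$, and iterates the two coupled linear systems:

```python
import numpy as np

def min_sum_splitting_consensus(b, W, t):
    """Auxiliary message-passing (Algorithm 3) for consensus, with the
    tuning delta = 1, Gamma = gamma * W used in Theorem 1.

    b : initial node values; W : symmetric matrix with W @ 1 = 1.
    Returns the estimates x^t, which converge to mean(b)."""
    n = len(b)
    # rho_W: second largest eigenvalue of W in magnitude
    rho_W = np.abs(np.linalg.eigvalsh(W - np.ones((n, n)) / n)).max()
    gamma = 2.0 / (1.0 + np.sqrt(1.0 - rho_W ** 2))
    delta, Gamma = 1.0, gamma * W
    I, D = np.eye(n), np.diag(Gamma @ np.ones(n))  # D = diag(Gamma 1)
    # 2n x 2n block matrix K(delta, Gamma) from (5)
    K = np.block([
        [(1 - delta) * I - (1 - delta) * D + delta * Gamma,
         delta * I - delta * D + (1 - delta) * Gamma],
        [delta * I, (1 - delta) * I],
    ])
    # initialization of Proposition 2 for \hat R^0 = \hat r^0 = 0
    RQ = np.ones(2 * n)              # stacked (R^0, Q^0)
    rq = np.concatenate([b, b])      # stacked (r^0, q^0)
    for _ in range(t):
        RQ, rq = K @ RQ, K @ rq
    return rq[:n] / RQ[:n]           # x^t_v = r^t_v / R^t_v
```

On, e.g., a cycle graph with the lazy random-walk weights $W = \frac{1}{2}I + \frac{1}{4}A$ (an illustrative choice of $W$; any symmetric matrix with $W\mathbf{1} = \mathbf{1}$ and $\rho_W < 1$ works), the estimates converge to the average $\bar b$, as guaranteed by Theorem 1 below.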
However, the authors only list results on the behavior of this spectral gap in the case of cycle graphs, i.e., $d = 2$, and present a conjecture for $2d$-tori.

7 Accelerated convergence rates for Min-Sum Splitting

We investigate the convergence behavior of the Min-Sum Splitting algorithm to solve problem (2) with quadratic objective functions. Henceforth, without loss of generality, let $b \in \mathbb{R}^{V}$ be given with $0 < b_v < 1$ for each $v \in V$, and let $\phi_v(z) := \frac{1}{2} z^2 - b_v z$. Define $\bar b := \sum_{v\in V} b_v / n$. Recall from [27] that the ordinary Min-Sum algorithm (i.e., Algorithm 2 with $\delta = 1$ and $\Gamma = A$, where $A$ is the adjacency matrix of the graph $G$) does not converge if the graph $G$ has a cycle.

We now show that a proper choice of the tuning parameters allows Min-Sum Splitting to converge to the problem solution in a subdiffusive way. The proof of this result, which is contained in the supplementary material, relies on the use of the auxiliary method defined in Algorithm 3 to track the evolution of the Min-Sum Splitting scheme. Here, recall that $\|x\|$ denotes the $\ell_2$ norm of a given vector $x$, $\|M\|$ denotes the $\ell_2$ matrix norm of the given matrix $M$, and $\rho(M)$ its spectral radius.

Theorem 1. Let $W \in \mathbb{R}^{V\times V}$ be a symmetric matrix with $W\mathbf{1} = \mathbf{1}$ and $\rho_W := \rho(W - \mathbf{1}\mathbf{1}^T/n) < 1$. Let $\delta = 1$ and $\Gamma = \gamma W$, with $\gamma = 2/(1 + \sqrt{1-\rho_W^2})$. Let $x^t$ be the output at time $t$ of Algorithm 2 with initial conditions $\hat R^0 = \hat r^0 = 0$. Define
$$
K := \begin{pmatrix} \gamma W & (1-\gamma)I \\ I & 0 \end{pmatrix},
\qquad
K_\infty := \frac{1}{(2-\gamma)n} \begin{pmatrix} \mathbf{1}\mathbf{1}^T & (1-\gamma)\mathbf{1}\mathbf{1}^T \\ \mathbf{1}\mathbf{1}^T & (1-\gamma)\mathbf{1}\mathbf{1}^T \end{pmatrix}.
\tag{6}
$$
Then, for any $v \in V$ we have $\lim_{t\to\infty} x^t_v = \bar b$ and $\|x^t - \bar b\mathbf{1}\| \le \frac{4\sqrt{2}}{2-\gamma}\,\|(K - K_\infty)^t\|$. The asymptotic rate of convergence is given by
$$
\rho_K := \rho(K - K_\infty) = \lim_{t\to\infty} \|(K - K_\infty)^t\|^{1/t}
= \sqrt{\frac{1-\sqrt{1-\rho_W^2}}{1+\sqrt{1-\rho_W^2}}} < \rho_W < 1,
$$
which satisfies $\frac{1}{2}\sqrt{1/(1-\rho_W)} \le 1/(1-\rho_K) \le \sqrt{1/(1-\rho_W)}$.

Theorem 1 shows that the choice of splitting parameters $\delta = 1$ and $\Gamma = \gamma W$, where $\gamma$ and $W$ are defined as in the statement of the theorem, allows the Min-Sum Splitting scheme to achieve the asymptotic rate of convergence given by the second largest eigenvalue in magnitude of the matrix $K$ defined in (6), i.e., the quantity $\rho_K$. The matrix $K$ is the same matrix that describes shift-register methods for consensus [45, 4, 24]. In fact, the proof of Theorem 1 relies on the spectral analysis previously established for shift-registers, which can be traced back to [15]. See also [13, 24].

Following [27], let us consider the absolute measure of error given by $\|x^t - \bar b\mathbf{1}\|/\sqrt{n}$ (recall that we assume $0 < b_v < 1$ so that $\|b\| \le \sqrt{n}$). From Theorem 1 it follows that, asymptotically, $\|x^t - \bar b\mathbf{1}\|/\sqrt{n} \lesssim 4\sqrt{2}\,\rho_K^t/(2-\gamma)$.
If we define the asymptotic convergence time as the minimum time $t$ so that, asymptotically, $\|x^t - \bar b\mathbf{1}\|/\sqrt{n} \lesssim \varepsilon$, then the Min-Sum Splitting scheme investigated in Theorem 1 has an asymptotic convergence time that is $O(1/(1-\rho_K)\,\log\{[1/(1-\rho_K)]/\varepsilon\})$. Given the last bound in Theorem 1, this result achieves (modulo logarithmic terms) a square-root improvement over the convergence time of diffusive methods, which scale like $\Theta(1/(1-\rho_W)\,\log 1/\varepsilon)$. For cycle graphs and, more generally, for higher-dimensional tori (where $1/(1-\rho_W)$ is $\Theta(D^2)$ so that $1/(1-\rho_K)$ is $\Theta(D)$ [33, 1]) the convergence time is $O(D \log D/\varepsilon)$, where $D$ is the graph diameter.

As prescribed by Theorem 1, the choice of $\gamma$ that makes the Min-Sum scheme achieve a subdiffusive rate depends on global properties of the graph $G$. Namely, $\gamma$ depends on the quantity $\rho_W$, the second largest eigenvalue in magnitude of the matrix $W$. This fact connects the acceleration mechanism induced by splitting in the Min-Sum scheme to the acceleration mechanism of lifted Markov chains techniques (see Figure 1) and multi-step first order methods, as described in Section 4.

It remains to be investigated how choices of splitting parameters different from the ones investigated in Theorem 1 affect the convergence behavior of the Min-Sum Splitting algorithm.

8 Conclusions

The Min-Sum Splitting algorithm has been previously observed to yield convergence in settings where the ordinary Min-Sum protocol does not converge [35]. In this paper we proved that the introduction of splitting parameters is not only fundamental to guarantee the convergence of the Min-Sum scheme applied to the consensus problem, but that proper tuning of these parameters yields accelerated convergence rates.
As prescribed by Theorem 1, the choice of splitting parameters that yields subdiffusive rates involves global information, via the spectral gap of a matrix associated to the original graph (see the choice of $\gamma$ in Theorem 1). The acceleration mechanism exploited by Min-Sum Splitting is analogous to the acceleration mechanism exploited by lifted Markov chain techniques, where the transition matrix of the lifted random walk is typically chosen to depend on the total number of nodes in the graph [8, 5] or on its diameter [37] (global pieces of information), and to the acceleration mechanism exploited by multi-step gradient methods, where the momentum/shift-register term is chosen as a function of the eigenvalues of a matrix supported on the original graph [13] (again, global information). Prior to our results, this connection seems not to have been established in the literature. Our findings motivate further studies to generalize the acceleration due to splittings to other problem instances, beyond consensus.

Acknowledgements

This work was partially supported by the NSF under Grant EECS-1609484.

References

[1] David Aldous and James Allen Fill. Reversible Markov chains and random walks on graphs, 2002. Unfinished monograph, recompiled 2014, available at http://www.stat.berkeley.edu/~aldous/RWG/book.html.

[2] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

[3] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, 2011.

[4] Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. Proc. 44th Ann. Allerton Conf.
Commun., Contr., Computat, pages 952–959, 2006.

[5] Fang Chen, László Lovász, and Igor Pak. Lifting Markov chains to speed up mixing. In Proceedings of the Thirty-first Annual ACM Symposium on Theory of Computing, pages 275–281, 1999.

[6] J. Chen and A. H. Sayed. Diffusion adaptation strategies for distributed optimization and learning over networks. IEEE Transactions on Signal Processing, 60(8):4289–4305, 2012.

[7] P. Diaconis and L. Saloff-Coste. Moderate growth and random walk on finite groups. Geometric & Functional Analysis GAFA, 4(1):1–36, 1994.

[8] Persi Diaconis, Susan Holmes, and Radford M. Neal. Analysis of a nonreversible Markov chain sampler. The Annals of Applied Probability, 10(3):726–752, 2000.

[9] A. G. Dimakis, S. Kar, J. M. F. Moura, M. G. Rabbat, and A. Scaglione. Gossip algorithms for distributed signal processing. Proceedings of the IEEE, 98(11):1847–1864, 2010.

[10] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Trans. Automat. Contr., 57(3):592–606, 2012.

[11] Pedro A. Forero, Alfonso Cano, and Georgios B. Giannakis. Consensus-based distributed support vector machines. J. Mach. Learn. Res., 11:1663–1707, 2010.

[12] G. França and J. Bento. Markov chain lifting and distributed ADMM. IEEE Signal Processing Letters, 24(3):294–298, 2017.

[13] E. Ghadimi, I. Shames, and M. Johansson. Multi-step gradient methods for networked optimization. IEEE Transactions on Signal Processing, 61(21):5417–5429, 2013.

[14] Bhaskar Ghosh, S. Muthukrishnan, and Martin H. Schultz. First and second order diffusive methods for rapid, coarse, distributed load balancing (extended abstract). In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 72–81, 1996.

[15] Gene H.
Golub and Richard S. Varga. Chebyshev semi-iterative methods, successive over-relaxation iterative methods, and second order Richardson iterative methods. Numer. Math., 3(1):147–156, 1961.

[16] D. Jakovetić, J. Xavier, and J. M. F. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5):1131–1146, May 2014.

[17] Björn Johansson, Maben Rabi, and Mikael Johansson. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157–1170, 2010.

[18] K. Jung, D. Shah, and J. Shin. Distributed averaging via lifted Markov chains. IEEE Transactions on Information Theory, 56(1):634–647, 2010.

[19] S. Kar, S. Aldosari, and J. M. F. Moura. Topology for distributed inference on graphs. IEEE Transactions on Signal Processing, 56(6):2609–2613, 2008.

[20] V. Lesser, C. Ortiz, and M. Tambe, editors. Distributed Sensor Networks: A Multiagent Perspective, volume 9. Kluwer Academic Publishers, 2003.

[21] Dan Li, K. D. Wong, Yu Hen Hu, and A. M. Sayeed. Detection, classification, and tracking of targets. IEEE Signal Processing Magazine, 19(2):17–29, 2002.

[22] W. Li and H. Dai. Accelerating distributed consensus via lifting Markov chains. In 2007 IEEE International Symposium on Information Theory, pages 2881–2885, 2007.

[23] W. Li, H. Dai, and Y. Zhang. Location-aided fast distributed consensus in wireless networks. IEEE Transactions on Information Theory, 56(12):6208–6227, 2010.

[24] Ji Liu, Brian D. O. Anderson, Ming Cao, and A. Stephen Morse. Analysis of accelerated gossip algorithms. Automatica, 49(4):873–883, 2013.

[25] Dmitry M. Malioutov, Jason K. Johnson, and Alan S. Willsky. Walk-sums and belief propagation in Gaussian graphical models. J. Mach. Learn. Res., 7:2031–2064, 2006.

[26] G. Mateos, J. A. Bazerque, and G. B. Giannakis.
Distributed sparse linear regression. IEEE Transactions on Signal Processing, 58(10):5262–5276, 2010.

[27] C. C. Moallemi and B. Van Roy. Consensus propagation. IEEE Transactions on Information Theory, 52(11):4753–4766, 2006.

[28] Ciamac C. Moallemi and Benjamin Van Roy. Convergence of min-sum message-passing for convex optimization. IEEE Transactions on Information Theory, 56(4):2041–2050, 2010.

[29] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

[30] A. Olshevsky. Linear Time Average Consensus on Fixed Graphs and Implications for Decentralized Optimization and Multi-Agent Control. ArXiv e-prints (1411.4186), 2014.

[31] J. B. Predd, S. R. Kulkarni, and H. V. Poor. A collaborative training algorithm for distributed learning. IEEE Transactions on Information Theory, 55(4):1856–1871, 2009.

[32] M. G. Rabbat, R. D. Nowak, and J. A. Bucklew. Generalized consensus computation in networked systems with erasure links. In IEEE 6th Workshop on Signal Processing Advances in Wireless Communications, pages 1088–1092, 2005.

[33] Sébastien Roch. Bounding fastest mixing. Electron. Commun. Probab., 10:282–296, 2005.

[34] N. Ruozzi and S. Tatikonda. Message-passing algorithms: Reparameterizations and splittings. IEEE Transactions on Information Theory, 59(9):5860–5881, 2013.

[35] Nicholas Ruozzi and Sekhar Tatikonda. Message-passing algorithms for quadratic minimization. Journal of Machine Learning Research, 14:2287–2314, 2013.

[36] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3027–3036, 2017.

[37] Devavrat Shah. Gossip algorithms.
Foundations and Trends in Networking, 3(1):1–125, 2009.

[38] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014.

[39] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

[40] S. Sundhar Ram, A. Nedić, and V. V. Veeravalli. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications, 147(3):516–545, 2010.

[41] J. Tsitsiklis, D. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.

[42] John N. Tsitsiklis. Problems in Decentralized Decision Making and Computation. PhD thesis, Department of EECS, MIT, 1984.

[43] E. Wei and A. Ozdaglar. Distributed alternating direction method of multipliers. In 2012 51st IEEE Conference on Decision and Control (CDC), pages 5445–5450, 2012.

[44] Lin Xiao and Stephen Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65–78, 2004.

[45] David M. Young. Second-degree iterative methods for the solution of large linear systems. Journal of Approximation Theory, 5(2):137–148, 1972.