{"title": "Deep Learning Games", "book": "Advances in Neural Information Processing Systems", "page_first": 1678, "page_last": 1686, "abstract": "We investigate a reduction of supervised learning to game playing that reveals new connections and learning methods. For convex one-layer problems, we demonstrate an equivalence between global minimizers of the training problem and Nash equilibria in a simple game. We then show how the game can be extended to general acyclic neural networks with differentiable convex gates, establishing a bijection between the Nash equilibria and critical (or KKT) points of the deep learning problem. Based on these connections we investigate alternative learning methods, and find that regret matching can achieve competitive training performance while producing sparser models than current deep learning approaches.", "full_text": "Deep Learning Games\n\nDale Schuurmans\u2217\n\nGoogle\n\ndaes@ualberta.ca\n\nMartin Zinkevich\n\nGoogle\n\nmartinz@google.com\n\nAbstract\n\nWe investigate a reduction of supervised learning to game playing that reveals new\nconnections and learning methods. For convex one-layer problems, we demonstrate\nan equivalence between global minimizers of the training problem and Nash\nequilibria in a simple game. We then show how the game can be extended to general\nacyclic neural networks with differentiable convex gates, establishing a bijection\nbetween the Nash equilibria and critical (or KKT) points of the deep learning\nproblem. Based on these connections we investigate alternative learning methods,\nand \ufb01nd that regret matching can achieve competitive training performance while\nproducing sparser models than current deep learning strategies.\n\n1\n\nIntroduction\n\nIn this paper, we investigate a new approach to reducing supervised learning to game playing. 
Unlike\nwell known reductions [8, 29, 30], we avoid duality as a necessary component in the reduction,\nwhich allows a more \ufb02exible perspective that can be extended to deep models. An interesting \ufb01nding\nis that the no-regret strategies used to solve large-scale games [35] provide effective stochastic\ntraining methods for supervised learning problems. In particular, regret matching [12], a step-size\nfree algorithm, appears capable of ef\ufb01cient stochastic optimization performance in practice.\nA central contribution of this paper is to demonstrate how supervised learning of a directed acyclic\nneural network with differentiable convex gates can be expressed as a simultaneous move game with\nsimple player actions and utilities. For variations of the learning problem (i.e. whether regularization\nis considered) we establish connections between the critical points (or KKT points) and Nash\nequilibria in the corresponding game. As expected, deep learning games are not simple, since even\napproximately training deep models is hard in the worst case [13]. Nevertheless, the reduction reveals\nnew possibilities for training deep models that have not been previously considered. In particular, we\ndiscover that regret matching with simple initialization can offer competitive training performance\ncompared to state-of-the-art deep learning heuristics while providing sparser solutions.\nRecently, we have become aware of unpublished work [2] that also proposes a reduction of supervised\ndeep learning to game playing. Although the reduction presented in this paper was developed\nindependently, we acknowledge that others have also begun to consider the connection between deep\nlearning and game theory. 
We compare these two specific reductions in Appendix J, and outline the distinct advantages of the approach developed in this paper.

2 One-Layer Learning Games

We start by considering the simpler one-layer case, which allows us to introduce the key concepts that will then be extended to deep models. Consider the standard supervised learning problem where one is given a set of paired data {(x_t, y_t)}_{t=1}^T, such that (x_t, y_t) ∈ X × Y, and wishes to learn a predictor h : X → Y. For simplicity, we assume X = R^m and Y = R^n. A standard generalized linear model can be expressed as h(x) = φ(θx) for some output transfer function φ : R^n → R^n and matrix θ ∈ R^{n×m} denoting the trainable parameters of the model. Despite the presence of the transfer function φ, such models are typically trained by minimizing an objective that is convex in z = θx.

OLP (One-layer Learning Problem) Given a loss function ℓ : R^n × R^n → R that is convex in the first argument, let ℓ_t(z) = ℓ(z, y_t) and L_t(θ) = ℓ_t(θ x_t). The training problem is to minimize L(θ) = T^{-1} Σ_{t=1}^T L_t(θ) with respect to the parameters θ.

∗Work performed at Google Brain while on a sabbatical leave from the University of Alberta.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We first identify a simple game whose Nash equilibria correspond to global minima of the one-layer learning problem. This basic relationship establishes a connection between supervised learning and game playing that we will exploit below. Although this reduction is not a significant contribution by itself, the one-layer case allows us to introduce some key concepts that we will deploy later when considering deep neural networks. 
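To make the OLP concrete, the training objective L(θ) = T^{-1} Σ_{t=1}^T ℓ_t(θ x_t) can be instantiated with a univariate logistic loss as follows (a minimal numpy sketch on toy data; the names and data are illustrative, not from the paper):

```python
import numpy as np

def logistic_loss(z, y):
    # ell(z, y) = log(1 + exp(-y z)), convex in its first argument z
    return np.log1p(np.exp(-y * z))

def olp_objective(theta, X, Y):
    # L(theta) = T^{-1} sum_t ell_t(theta x_t); here n = 1, so theta is a vector
    return float(np.mean(logistic_loss(X @ theta, Y)))

# T = 3 toy points in R^2 with labels in {-1, +1}
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
Y = np.array([1.0, 1.0, -1.0])

# at theta = 0 every example contributes log 2
assert abs(olp_objective(np.zeros(2), X, Y) - np.log(2)) < 1e-12
```

Since the data above is linearly separable, scaling any separating θ drives the objective toward zero, which is why the synthetic experiments below use data that is separable but not with a large margin.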
A one-shot simultaneous move game is defined by specifying: a set of players, a set of actions for each player, and a set of utility functions that specify the value to each player given a joint action selection [36, Page 9] (also see Appendix E). Corresponding to the OLP specified above, we propose the following game.

OLG (One-layer Learning Game) There are two players, a protagonist p and an antagonist a. The protagonist chooses a parameter matrix θ ∈ R^{n×m}. The antagonist chooses a set of T vectors and scalars {a_t, b_t}_{t=1}^T, a_t ∈ R^n, b_t ∈ R, such that a_t^⊤ z + b_t ≤ ℓ_t(z) for all z ∈ R^n; that is, the antagonist chooses an affine minorant of the local loss for each training example. Both players make their action choice without knowledge of the other player's choice. Given a joint action selection (θ, {a_t, b_t}) we define the utility of the antagonist as U^a = T^{-1} Σ_{t=1}^T (a_t^⊤ θ x_t + b_t), and the utility of the protagonist as U^p = −U^a. This is a two-person zero-sum game with continuous actions.

A Nash equilibrium is defined by a joint assignment of actions such that no player has any incentive to deviate. That is, if σ^p = θ denotes the action choice for the protagonist and σ^a = {a_t, b_t} the choice for the antagonist, then the joint action σ = (σ^p, σ^a) is a Nash equilibrium if U^p(σ̃^p, σ^a) ≤ U^p(σ^p, σ^a) for all σ̃^p, and U^a(σ^p, σ̃^a) ≤ U^a(σ^p, σ^a) for all σ̃^a.

Using this characterization one can then determine a bijection between the Nash equilibria of the OLG and the global minimizers of the OLP.

Theorem 1 (1) If (θ∗, {a_t, b_t}) is a Nash equilibrium of the OLG, then θ∗ must be a global minimum of the OLP. 
(2) If \u03b8\u2217 is a global minimizer of the OLP, then there exists an antagonist strategy {at, bt}\nsuch that (\u03b8\u2217, {at, bt}) is a Nash equilibrium of the OLG. (All proofs are given in the appendix.)\n\nThus far, we have ignored the fact that it is important to control model complexity to improve\ngeneralization, not merely minimize the loss. Although model complexity is normally controlled by\nregularizing \u03b8, we will \ufb01nd it more convenient to equivalently introduce a constraint \u03b8 \u2208 \u0398 for some\nconvex set \u0398 (which we assume satis\ufb01es an appropriate constraint quali\ufb01cation; see Appendix C).\nThe learning problem and corresponding game can then be modi\ufb01ed accordingly while still preserving\nthe bijection between their solution concepts.\nOCP (One-layer Constrained Learning Problem) Add optimization constraint \u03b8 \u2208 \u0398 to the OLP.\nOCG (One-layer Constrained Learning Game) Add protagonist action constraint \u03b8 \u2208 \u0398 to OLG.\n\nTheorem 2 (1) If (\u03b8\u2217, {at, bt}) is a Nash equilibrium of the OCG, then \u03b8\u2217 must be a constrained\nglobal minimum of the OCP. (2) If \u03b8\u2217 is a constrained global minimizer of the OCP, then there exists\nan antagonist strategy {at, bt} such that (\u03b8\u2217, {at, bt}) is a Nash equilibrium of the OCG.\n\n2.1 Learning Algorithms\n\nThe tight connection between convex learning and two-person zero-sum games raises the ques-\ntion of whether techniques for \ufb01nding Nash equilibria might offer alternative training approaches.\nSurprisingly, the answer appears to be yes.\nThere has been substantial progress in on-line algorithms for \ufb01nding Nash equilibria, both in theory\n[5, 24, 34] and practice [35]. 
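The antagonist strategy in Theorems 1 and 2 can be made concrete: for a differentiable convex loss, the best-response affine minorant is the tangent of ℓ_t at z_t = θ x_t. A minimal numpy sketch (the univariate logistic loss is an illustrative choice, not prescribed by the theorems):

```python
import numpy as np

def logistic_loss(z, y):
    return np.log1p(np.exp(-y * z))          # convex in z

def logistic_grad(z, y):
    return -y / (1.0 + np.exp(y * z))        # d/dz of the loss

def best_response_minorant(z, y):
    # tangent line a*z' + b <= ell(z', y), with equality at z' = z;
    # by convexity the tangent is a global affine minorant
    a = logistic_grad(z, y)
    b = logistic_loss(z, y) - a * z
    return a, b

z0, y = 0.5, 1.0
a, b = best_response_minorant(z0, y)
assert np.isclose(a * z0 + b, logistic_loss(z0, y))      # tight at z0
for z in np.linspace(-3.0, 3.0, 13):
    assert a * z + b <= logistic_loss(z, y) + 1e-12      # minorant everywhere
```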
In the two-person zero-sum case, large games are solved by pitting two regret-minimizing learning algorithms against each other, exploiting the fact that when both achieve a regret rate of ε/2, their respective average strategies form an ε-Nash equilibrium [38]. For the game as described above, where the protagonist action is θ ∈ Θ and the antagonist action is denoted σ^a, we imagine playing in rounds, where on round k the joint action is denoted by σ^(k) = (θ^(k), σ_a^(k)). Since the utility function of each player U^i, i ∈ {p, a}, is affine in their own action choice for any fixed action chosen by the other player, each faces an online convex optimization problem [37] (note that maximizing U^i is equivalent to minimizing −U^i; see also Appendix G). The total regret of a player, say the protagonist, is defined with respect to their utility function after K rounds as R^p(σ^(1), ..., σ^(K)) = max_{θ∈Θ} Σ_{k=1}^K [U^p(θ, σ_a^(k)) − U^p(θ^(k), σ_a^(k))]. (Nature can also be introduced to choose a random training example on each round, which simply requires the definition of regret to be expressed in terms of expectations over nature's choices.)

To accommodate regularization in the learning problem, we impose parameter constraints Θ. A particularly interesting case occurs when one defines Θ = {θ : ‖θ‖_1 ≤ β}, since the L1 ball constraint is equivalent to imposing L1 regularization. There are two distinct advantages to L1 regularization in this context. First, as is well known, L1 encourages sparsity in the solution. 
Second, and much less appreciated, is the fact that any polytope constraint allows one to reduce the constrained online convex optimization problem to learning from expert advice over a finite number of experts [37]: given a polytope Θ, define the convex hull basis H(Θ) to be a matrix whose columns are the vertices of Θ. An expert can then be assigned to each vertex in H(Θ), and an algorithm for learning from expert advice can be applied by mapping its strategy on round k, ρ^(k) (a probability distribution over the experts), back to an action choice in the original problem via θ^(k) = H(Θ) ρ^(k), while the utility vector on round k, u^(k), can be passed back to the experts via H(Θ)^⊤ u^(k) [37].

Since this reduction allows any method for learning from expert advice to be applied to L1 constrained online convex optimization, we investigated whether alternative algorithms for supervised training might be uncovered. We considered two algorithms for learning from expert advice: the normalized exponentiated weight algorithm (EWA) [22, 32] (Algorithm 3), and regret matching (RM), a simpler method from the economics and game theory literature [12] (Algorithm 2). For supervised learning, these algorithms operate by using a stochastic sample of the gradient to perform their updates (outer loop, Algorithm 1). EWA possesses superior regret bounds that demonstrate only a logarithmic dependence on the number of actions; however, RM is simpler, hyperparameter-free, and still possesses reasonable regret bounds [9, 10]. Although exponentiated gradient methods have been applied to supervised learning [18, 32], we are not aware of any previous attempt to apply regret matching to supervised training. 
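For the L1 ball Θ = {θ : ‖θ‖_1 ≤ β} in R^d, the convex hull basis H(Θ) has 2d columns, the vertices ±β e_i, so the expert reduction above is simple to write down (a minimal numpy sketch; the function name is illustrative):

```python
import numpy as np

def l1_ball_hull_basis(d, beta):
    # H(Theta) for Theta = {theta : ||theta||_1 <= beta}: the 2d columns
    # are the vertices +/- beta * e_i of the L1 ball
    eye = beta * np.eye(d)
    return np.hstack([eye, -eye])

d, beta = 3, 2.0
H = l1_ball_hull_basis(d, beta)

# an expert distribution rho maps back to a feasible parameter theta = H(Theta) rho
rho = np.full(2 * d, 1.0 / (2 * d))
theta = H @ rho
assert np.abs(theta).sum() <= beta + 1e-12     # uniform rho gives theta = 0 here

# a utility (gradient) vector u on theta is passed back to the experts as H(Theta)^T u
u = np.array([1.0, -2.0, 0.5])
expert_utilities = H.T @ u
assert np.allclose(expert_utilities[:d], beta * u)
```

Any point of the L1 ball is a convex combination of these vertices, so every expert distribution ρ yields a feasible θ, and conversely.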
We compared these to projected stochastic gradient descent (PSGD), which\nis the obvious modi\ufb01cation of stochastic gradient descent (SGD) that retains a similar regret bound\n[7, 28] (Algorithm 4).\n\n2.2 Evaluation\n\nTo investigate the utility of these methods for supervised learning, we conducted experiments on\nsynthetic data and on the MNIST data set [20]. Note that PSGD and EWA have a step size parameter,\n\u03b7(k), that greatly affects their performance. The best regret bounds are achieved for step sizes of the\nform \u03b7k\u22121/2 and \u03b7 log(m)k\u22121/2 respectively [28]; we also tuned \u03b7 to generate the best empirical\nresults. Since the underlying optimization problems are convex, these experiments merely focus on\nthe speed of convergence to a global minimum of the constrained training problem.\nThe \ufb01rst set of experiments considered synthetic problems. The data dimension was set to m = 10,\nand T = 100 training points were drawn from a standard multivariate Gaussian. For univariate\nprediction, a random hyperplane was chosen to label the data (hence the data was linearly separable,\nbut not with a large margin). The logistic training loss achieved by the running average of the\nprotagonist strategy \u00af\u03b8 over the entire training set is plotted in Figure 1a. For multivariate prediction, a\n4\u00d710 target matrix, \u03b8\u2217, was randomly generated to label training data by arg max(\u03b8\u2217xt). The training\nsoftmax loss achieved by the running average of the protagonist strategy \u00af\u03b8 over the entire training\nset is shown in Figure 1b. The third experiment was conducted on MNIST, which is an n = 10\nclass problem over m = 784 dimensional inputs with T = 60, 000 training examples, evidently not\nlinearly separable. For this experiment, we used mini-batches of size 100. The training loss of the\nrunning average protagonist strategy \u00af\u03b8 (single run) is shown in Figure 1c. 
The apparent effectiveness of RM in these experiments is a surprising outcome. Even after tuning η for both PSGD and EWA, they do not surpass the performance of RM, which is hyperparameter free. We did not anticipate this observation; the effectiveness of RM for supervised learning appears not to have been previously noticed. (We do not expect RM to be competitive in high dimensional sparse problems, since its regret bound has a square root and not a logarithmic dependence on n [9].)

Figure 1: Training loss achieved by different no-regret algorithms. (a) Logistic loss, synthetic data. (b) Softmax loss, synthetic data. (c) Softmax loss, MNIST data. Subfigures (a) and (b) are averaged over 100 repeats, log scale x-axis. Subfigure (c) is averaged over 10 repeats (PSGD theory off scale).

3 Deep Learning Games

A key contribution of this paper is to show how the problem of training a feedforward neural network with differentiable convex gates can be reduced to a game. A practical consequence of this reduction is that it suggests new approaches to training deep models that are inspired by methods that have recently proved successful for solving massive-scale games.

Feedforward Neural Network A feedforward neural network is defined by a directed acyclic graph with additional objects attached to the vertices and edges. The network architecture is specified by N = (V, E, I, O, F), where V is a set of vertices, E ⊆ V × V is a set of edges, I = {i_1, ..., i_m} ⊂ V is a set of input vertices, O = {o_1, ..., o_n} ⊂ V is a set of output vertices, and F = {f_v : v ∈ V} is a set of activation functions, where f_v : R → R. The trainable parameters are given by θ : E → R. In the graph defined by G = (V, E), a path (v_1, ..., v_k) consists of a sequence of vertices such that (v_j, v_{j+1}) ∈ E for all j. A cycle is a path where the first and last vertex are equal. 
We assume that G contains no cycles, the input vertices have no incoming edges (i.e. (u, i) ∉ E for all i ∈ I, u ∈ V), and the output vertices have no outgoing edges (i.e. (o, v) ∉ E for all o ∈ O, v ∈ V). A directed acyclic graph generates a partial order ≤ on the vertices where u ≤ v if and only if there is a path from u to v. For all v ∈ V, define E_v = {(u, u′) ∈ E : u′ = v}. The network is related to the training data by assuming |I| = m, the number of input vertices corresponds to the number of input features, and |O| = n, the number of output vertices corresponds to the number of output dimensions. It is a good idea (but not required) to have two additional bias inputs, whose corresponding input features are always set to 0 and 1, respectively, and have edges to all non-input nodes in the graph. Usually, the activation functions on input and output nodes are the identity, i.e. f_v(x) = x for v ∈ I ∪ O. Given a training input x_t ∈ R^m, the computation of the network N is expressed by a circuit value function c_t that assigns values to each vertex based on the partial order over vertices:

c_t(i_k, θ) = f_{i_k}(x_{tk}) for i_k ∈ I;    c_t(v, θ) = f_v(Σ_{u:(u,v)∈E} c_t(u, θ) θ(u, v)) for v ∈ V − I.    (1)

Let c_t(o, θ) denote the vector of values at the output vertices, i.e. (c_t(o, θ))_k = c_t(o_k, θ). Since each f_v is assumed differentiable, the output c_t(o, θ) must also be differentiable with respect to θ. When we wish to impose constraints on θ we assume the constraints factor over vertices, and are applied across the incoming edges to each vertex. That is, for each v ∈ V − I the parameters θ restricted to E_v are required to be in a set Θ_v ⊆ R^{E_v}, and Θ = Π_{v∈V−I} Θ_v. 
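The circuit value function in (1) is simply a forward pass in topological order over the DAG; a minimal plain-Python sketch (the tiny two-input network and all names are illustrative):

```python
import math

def circuit_values(vertices, parents, inputs, theta, f, x):
    # Forward pass implementing the circuit value function of Eq. (1).
    # vertices must be listed in a topological order of the DAG (V, E);
    # parents[v] lists the u with (u, v) in E; theta maps edges (u, v) to weights.
    c = {}
    for v in vertices:
        if v in inputs:
            c[v] = f[v](x[inputs.index(v)])                 # c_t(i_k) = f_{i_k}(x_{t,k})
        else:
            z = sum(c[u] * theta[(u, v)] for u in parents[v])
            c[v] = f[v](z)                                  # c_t(v) = f_v(sum over incoming edges)
    return c

# toy network: two identity inputs -> one softplus gate -> identity output
ident = lambda s: s
softplus = lambda s: math.log(1.0 + math.exp(s))
verts = ['i1', 'i2', 'h', 'o']
parents = {'h': ['i1', 'i2'], 'o': ['h']}
w = {('i1', 'h'): 1.0, ('i2', 'h'): -1.0, ('h', 'o'): 2.0}
f = {'i1': ident, 'i2': ident, 'h': softplus, 'o': ident}
c = circuit_values(verts, parents, ['i1', 'i2'], w, f, [1.0, 1.0])
assert abs(c['o'] - 2.0 * math.log(2.0)) < 1e-12   # h sees 1 - 1 = 0, softplus(0) = log 2
```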
(We additionally assume each Θ_v satisfies constraint qualifications, see Appendix C, and the factorization requirement can also be altered to allow more complex network architectures, see Appendix H.) If Θ = R^E, we consider the network to be unconstrained. If Θ is bounded, we consider the network to be bounded.

DLP (Deep Learning Problem) Given a loss function ℓ(z, y) that is convex in the first argument and satisfies 0 ≤ ℓ(z, y) < ∞ for all z ∈ R^n, define ℓ_t(z) = ℓ(z, y_t) and L_t(θ) = ℓ_t(c_t(o, θ)). The training problem is to find a θ ∈ Θ that minimizes L(θ) = T^{-1} Σ_{t=1}^T L_t(θ).

DLG (Deep Learning Game) We define a one-shot simultaneous move game [36, page 9] with infinite action sets (Appendix E); we need to specify the players, action sets, and utility functions.

Players: The players consist of a protagonist p for each v ∈ V − I, an antagonist a, and a set of self-interested zannis s_v, one for each vertex v ∈ V.² Actions: The protagonist for vertex v chooses a parameter function θ_v ∈ Θ_v. The antagonist chooses a set of T vectors and scalars {a_t, b_t}_{t=1}^T, a_t ∈ R^n, b_t ∈ R, such that a_t^⊤ z + b_t ≤ ℓ_t(z) for all z ∈ R^n; that is, the antagonist chooses an affine minorant of the local loss for each training example. Each zanni s_v chooses a set of 2T scalars (q_vt, d_vt), q_vt ∈ R, d_vt ∈ R, such that q_vt z + d_vt ≤ f_v(z) for all z ∈ R; that is, the zanni chooses an affine minorant of its local activation function f_v for each training example. All players make their action choices without knowledge of the other players' choices. Utilities: For a joint action σ = (θ, {a_t, b_t}, {q_vt, d_vt}), the zannis' utilities are defined recursively following the partial order on vertices. 
First, for each i ∈ I the utility for zanni s_i on training example t is U^s_{it}(σ) = d_{it} + q_{it} x_{it}, and for each v ∈ V − I the utility for zanni s_v on example t is U^s_{vt}(σ) = d_{vt} + q_{vt} Σ_{u:(u,v)∈E} U^s_{ut}(σ) θ(u, v). The total utility for each zanni s_v is given by U^s_v(σ) = Σ_{t=1}^T U^s_{vt}(σ) for v ∈ V. The utility for the antagonist a is then given by U^a = T^{-1} Σ_{t=1}^T U^a_t, where U^a_t(σ) = b_t + Σ_{k=1}^n a_{kt} U^s_{o_k t}(σ). The utility for all protagonists is the same, U^p(σ) = −U^a(σ). (This representation also allows for an equivalent game where nature selects an example t, tells the antagonist and the zannis, and then everyone plays their actions simultaneously.)

The next lemma shows how the zannis and the antagonist can be expected to act.

Lemma 3 Given a fixed protagonist action θ, there exists a unique joint action for all agents σ = (θ, {a_t, b_t}, {q_vt, d_vt}) where the zannis and the antagonist are playing best responses to σ. Moreover, U^p(σ) = −L(θ), ∇_θ U^p(σ) = −∇L(θ), and given some protagonist at v ∈ V − I, if we hold all other agents' strategies fixed, U^p(σ) is an affine function of the strategy of the protagonist at v. We define σ as the joint action expansion for θ.

There is more detail in the appendix about the joint action expansion. However, the key point is that if the current cost and partial derivatives can be calculated for each parameter, one can construct the affine function for each agent. We will return to this in Section 3.1.

A KKT point is a point that satisfies the KKT conditions [15, 19]: roughly, that either it is a critical point (where the gradient is zero), or it is a point on the boundary of Θ where the gradient is pointing out of Θ "perpendicularly" (see Appendix C). 
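Lemma 3 can be checked numerically on a tiny chain network: the zanni and antagonist best responses are tangent minorants at the current operating point, and the resulting protagonist utility is affine with U^p = −L and matching gradient. A minimal sketch under these assumptions (the one-path network, squared loss, and softplus gate are illustrative choices, not from the paper):

```python
import math

# one-path network: x --w1--> softplus gate --w2--> identity output, squared loss
soft = lambda s: math.log(1.0 + math.exp(s))     # smooth convex activation
dsoft = lambda s: 1.0 / (1.0 + math.exp(-s))     # its derivative

def loss(z, y):
    return (z - y) ** 2                          # convex in z

def U_p(w1p, w2p, q, d, a, b, x):
    # protagonist utility with the zanni (q, d) and antagonist (a, b) held fixed;
    # note it is affine in w1p (and in w2p) separately
    s_v = d + q * (x * w1p)                      # zanni value at the hidden gate
    s_o = s_v * w2p                              # identity output gate
    return -(b + a * s_o)

x, y, w1, w2 = 1.5, 2.0, 0.7, -0.4
z_v = x * w1
c_v = soft(z_v)                                  # forward pass
c_o = c_v * w2

# best responses: tangent minorants at the current operating point
q, d = dsoft(z_v), soft(z_v) - dsoft(z_v) * z_v
a = 2.0 * (c_o - y)
b = loss(c_o, y) - a * c_o

# Lemma 3: U^p(sigma) = -L(theta) at the joint action expansion ...
assert abs(U_p(w1, w2, q, d, a, b, x) + loss(c_o, y)) < 1e-12

# ... and grad U^p = -grad L, checked numerically coordinate by coordinate
eps = 1e-6
dL_dw2 = (loss(c_v * (w2 + eps), y) - loss(c_v * (w2 - eps), y)) / (2 * eps)
dUp_dw2 = (U_p(w1, w2 + eps, q, d, a, b, x) - U_p(w1, w2 - eps, q, d, a, b, x)) / (2 * eps)
assert abs(dUp_dw2 + dL_dw2) < 1e-5
dL_dw1 = (loss(soft(x * (w1 + eps)) * w2, y) - loss(soft(x * (w1 - eps)) * w2, y)) / (2 * eps)
dUp_dw1 = (U_p(w1 + eps, w2, q, d, a, b, x) - U_p(w1 - eps, w2, q, d, a, b, x)) / (2 * eps)
assert abs(dUp_dw1 + dL_dw1) < 1e-5
```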
We can now state the main theorem of the paper,\nshowing a one to one relationship between KKT points and Nash equilibria.\n\nTheorem 4 (DLG Nash Equilibrium) The joint action \u03c3 = (\u03b8, {at, bt}, {qvt, dvt}) is a Nash equi-\nlibrium of the DLG iff it is the joint action expansion for \u03b8 and \u03b8 is a KKT point of the DLP.\n\nCorollary 5 If the network is unbounded, the joint action \u03c3 = (\u03b8, {at, bt}, {qvt, dvt}) is a Nash\nequilibrium of the DLG iff it is the joint action expansion for \u03b8 and \u03b8 is a critical point of the DLP.\n\nFinally we note that sometimes we need to add constraints between edges incident on different\nnodes. For example, in a convolutional neural network, one will have edges e = {u, v} and\ne\u2032 = {u\u2032, v\u2032} such that there is a constraint \u03b8e = \u03b8e\u2032 (see Appendix H). In game theory, if two agents\nact simultaneously it is dif\ufb01cult to have one agent\u2019s viable actions depend on another agent\u2019s action.\nTherefore, if parameters are constrained in this manner, it is better to have one agent control both.\nThe appendix (beginning with Appendix B) extends our model and theory to handle such parameter\ntying, which allows us to handle both convolutional networks and non-convex activation functions\n(Appendix I). Our theory does not apply to non-smooth activation functions, however (e.g. ReLU\ngates), but these can be approximated arbitrarily closely by differentiable activations.\n\n3.1 Learning Algorithms\n\nCharacterizing the deep learning problem as a game motivates the consideration of equilibrium\n\ufb01nding methods as potential training algorithms. Given the previous reduction to expert algorithms,\nwe will consider the use of the L1 ball constraint \u0398v = {\u03b8v : k\u03b8vk1 \u2264 \u03b2} at each vertex v. 
For deep learning, we have investigated a simple approach by training independent protagonist agents at each vertex against a best response antagonist and best response zannis [14]. In this case, it is possible to devise interesting and novel learning strategies based on the algorithms for learning from expert advice. Since the optimization problem is no longer convex in a local protagonist action θ_v, we do not expect convergence to a joint, globally optimal strategy among protagonists.

² Nomenclature explanation: Protagonists nominally strive toward a common goal, but their actions can interfere with one another. Zannis are traditionally considered servants, but their motivations are not perfectly aligned with the protagonists. The antagonist is diametrically opposed to the protagonists.

Algorithm 1 Main Loop
  On round k, observe some x_t (or mini-batch)
  Antagonist and zannis choose best responses, which ensures ∇U^p_v(θ_v) = −∇L(θ_v^(k))
  g_v^(k) ← ∇U^p_v(θ_v)
  Apply update to r_v^(k), ρ_v^(k), and θ_v^(k) for all v ∈ V

Algorithm 2 Regret Matching (RM)
  r_v^(k+1) ← r_v^(k) + H(Θ_v)^⊤ g_v^(k) − (ρ_v^(k)⊤ H(Θ_v)^⊤ g_v^(k)) 1
  ρ_v^(k+1) ← (r_v^(k+1))_+ / (1^⊤ (r_v^(k+1))_+)
  θ_v^(k+1) ← H(Θ_v) ρ_v^(k+1)

Algorithm 3 Exponentiated Weighted Average (EWA)
  r_v^(k+1) ← r_v^(k) + η^(k) H(Θ_v)^⊤ g_v^(k)
  ρ_v^(k+1) ← exp(r_v^(k+1)) / (1^⊤ exp(r_v^(k+1)))
  θ_v^(k+1) ← H(Θ_v) ρ_v^(k+1)

Algorithm 4 Projected SGD (PSGD)
  r_v^(k+1) ← r_v^(k) + η^(k) H(Θ_v)^⊤ g_v^(k)
  ρ_v^(k+1) ← L2_project_to_simplex(r_v^(k+1))
  θ_v^(k+1) ← H(Θ_v) ρ_v^(k+1)
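The RM update of Algorithm 2 at a single vertex can be sketched as follows (a numpy sketch; the uniform fallback when no regret is positive is an assumption we add, since the pseudocode leaves that corner case unspecified):

```python
import numpy as np

def rm_step(r, rho, H, g):
    # one regret-matching update, in the spirit of Algorithm 2, at a single vertex
    u = H.T @ g                        # pass the utility gradient to the experts
    r = r + u - float(rho @ u)         # accumulate regret against the mixed strategy
    pos = np.maximum(r, 0.0)
    total = pos.sum()
    # fallback to uniform when no regret is positive (our assumption)
    rho = pos / total if total > 0 else np.full(r.size, 1.0 / r.size)
    theta = H @ rho                    # map the expert distribution back to Theta_v
    return r, rho, theta

d, beta = 2, 1.0
H = np.hstack([beta * np.eye(d), -beta * np.eye(d)])   # L1-ball vertices as experts
r = np.zeros(2 * d)
rho = np.full(2 * d, 1.0 / (2 * d))
g = np.array([1.0, 0.0])               # utility gradient (= negative loss gradient)
r, rho, theta = rm_step(r, rho, H, g)
# all mass moves to the vertex +beta*e_1, which best aligns with g
assert np.isclose(theta[0], beta) and np.isclose(theta[1], 0.0)
```

Note the update is step-size free: the only quantities are the accumulated regrets, which is what makes RM hyperparameter-free apart from the constraint bound β and the regret initialization scale σ.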
Nevertheless, one can develop a generic approach for using the game to generate a learning algorithm.

Algorithm Outline On each round, nature chooses a random training example (or mini-batch). For each v ∈ V, each protagonist v selects her actions θ_v ∈ Θ_v deterministically. The antagonist and zannis then select their actions, which are best responses to the θ_v and to each other.³ The protagonist utilities U^p_v are then calculated. Given the zanni and antagonist choices, U^p_v is affine in the protagonist's action, and also, by Lemma 3, for all e ∈ E_v we have ∂L_t/∂w_e = −∂U^p_v(θ_v)/∂w_e. Each protagonist v ∈ V then observes their utility and uses this to update their strategy. See Algorithm 1 for the general loop, and Algorithms 2, 3 and 4 for specific updates.

Given the characterization developed previously, we know that a Nash equilibrium will correspond to a critical point in the training problem (which is almost certain to be a local minimum rather than a saddle point [21]). It is interesting to note that the usual process of backpropagating the sampled (sub)gradients corresponds to computing the best response actions for the zannis and the antagonist, which then yields the resulting affine utility for the protagonists.

3.2 Experimental Evaluation

We conducted a set of experiments to investigate the plausibility of applying expert algorithms at each vertex in a feedforward neural network. For comparison, we considered current methods for training deep models, including SGD [3], SGD with momentum [33], RMSprop, Adagrad [6], and Adam [17]. Since none of these impose constraints, they technically solve an easier optimization problem, but they are also un-regularized and therefore might exhibit weaker generalization. We tuned the step size parameter for each comparison method on each problem. 
For the expert algorithms, RM, EWA and PSGD, we found that EWA and PSGD were not competitive, even after tuning their step sizes. For RM, we initially found that it learned too quickly, with the top layers of the model becoming sparse; however, we discovered that RM works remarkably well simply by initializing the cumulative regret vectors r_v^(0) with random values drawn from a Gaussian with large standard deviation σ.

As a sanity check, we first conducted experiments on synthetic combinatorial problems: "parity", defined by y = x_1 ⊕ ··· ⊕ x_m, and "folded parity", defined by y = (x_1 ∧ x_2) ⊕ ··· ⊕ (x_{m−1} ∧ x_m) [27]. Parity cannot be approximated by a single-layer model but is representable with a single hidden layer of linear threshold gates [11], while folded parity is known to be not representable by a (small weights) linear threshold circuit with only a single hidden layer; at least two hidden layers are required [27]. For parity we trained an m-4m-1 architecture, and for folded parity we trained an m-4m-4m-1 architecture, both fully connected, m = 8. Here we chose the L1 constraint bound to be β = 10 and the initialization scale as σ = 100. For the nonlinear activation functions we used a smooth approximation of the standard ReLU gate, f_v(x) = τ log(1 + e^{x/τ}) with τ = 0.5. The results shown in Figure 2a and Figure 2d confirm that RM performs competitively, even when producing models with sparsity, top to bottom, of 18% and 13% for parity, and 27%, 19% and 21% for folded parity.

³ Conceptually, each zanni has a copy of the algorithm of each protagonist and an algorithm for selecting a joint action for all antagonists and zannis, and thus does not technically depend upon θ_v. In practice, these multiple copies are unnecessary, and one merely calculates θ_v ∈ Θ_v first.

Figure 2: Experimental results. (a) Learning parity with logistic loss, m-4m-1 architecture, 100 repeats. (d) Folded parity with logistic loss, m-4m-4m-1 architecture, 100 repeats. (b) and (c): MNIST train loss and test error, fully connected 784-1024-1024-10 architecture, 10 repeats. (e) and (f): MNIST train loss and test error, convolutional 28×28-c(5×5, 64)-c(5×5, 64)-c(5×5, 64)-10 architecture, 10 repeats.

We next conducted a few experiments on MNIST data. The first experiment used a fully connected 784-1024-1024-10 architecture, where RM was run with β = 30 and initialization scales (σ_1, σ_2, σ_3) = (50, 200, 50). The second experiment was run with a convolutional architecture 28×28-c(5×5, 64)-c(5×5, 64)-c(5×5, 64)-10 (convolution windows 5 × 5 with depth 64), where RM was run with (β_1, β_2, β_3, β_4) = (30, 30, 30, 10) and initialization scales σ = 500. The mini-batch size was 100, and the x-axis in the plots gives results after each "update" batch of 600 mini-batches (i.e. one epoch over the training data). Figures 2b, 2c, 2e and 2f show the evolution of the training loss and test misclassification errors. We dropped all but SGD, Adam, RMSprop and RM here, since these seemed to dominate the other methods in our experiments. It is surprising that RM can demonstrate convergence rates that are competitive with tuned RMSprop, and even outperforms methods like SGD and Adam that are routinely used in practice. An even more interesting finding is that the solutions found by RM were sparse while achieving lower test misclassification errors than standard deep learning methods. 
In particular, in the fully connected case, the final solution produced by RM zeroed out 32%, 26% and 63% of the parameter matrices (from the input to the output layer) respectively. For the convolutional case, RM zeroed out 29%, 27%, 28% and 43% of the parameter matrices respectively. Regarding run times, we observed that our Tensorflow implementation of RM was only 7% slower than RMSProp on the convolutional architecture, but 85% slower in the fully connected case.

4 Related Work

There are several works that consider using regret minimization to solve offline optimization problems. Once stochastic gradient descent was connected to regret minimization in [4], a series of papers followed [26, 25, 31]. Two popular approaches are currently Adagrad [6] and traditional stochastic gradient descent. The theme of simplifying the loss is very common: it appears in batch gradient and incremental gradient approaches [23] as the majorization-minimization family of algorithms. In the regret minimization literature, the idea of simplifying the class of losses by choosing a minimizer from a particular family of functions first appeared in [37], and has since been further developed.

By contrast, using games for optimization has a much shorter history. It has been shown that a game between people can be used to solve optimal coloring [16]. There is also a history of using regret minimization in games: of interest is [38], which decomposes a single agent into multiple agents, providing some inspiration for this paper. In the context of deep networks, a paper of interest connects brain processes to prediction markets [1]. However, the closest work appears to be the recent manuscript [2] that also poses the optimization of a deep network as a game. Although the games described there are similar, unlike [2], we focus on differentiable activation functions, and define agents with different information and motivations. 
Importantly, [2] does not characterize all the Nash equilibria of its proposed game. We discuss these issues in more detail in Appendix J.

5 Conclusion

We have investigated a reduction of deep learning to game playing that establishes a bijection between KKT points and Nash equilibria. One of the novel algorithms considered for supervised learning, regret matching, appears to provide a competitive alternative that has the additional benefit of achieving sparsity without unduly sacrificing speed or accuracy. It will be interesting to investigate alternative training heuristics for deep games, and whether similar successes can be achieved on larger deep models or recurrent models.

References

[1] D. Balduzzi. Cortical prediction markets. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, pages 1265–1272, 2014.

[2] D. Balduzzi. Deep online convex optimization using gated games. http://arxiv.org/abs/1604.01952, 2016.

[3] L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, Second Edition, pages 421–436, 2012.

[4] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, September 2004.

[5] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[6] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[7] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In International Conference on Machine Learning, pages 272–279, 2008.

[8] Y. Freund and R. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.

[9] G.
Gordon. No-regret algorithms for structured prediction problems. Technical Report CMU-CALD-05-112, Carnegie Mellon University, 2005.

[10] G. Gordon. No-regret algorithms for online convex programs. In NIPS 19, 2006.

[11] A. Hajnal. Threshold circuits of bounded depth. JCSS, 46(2):129–154, 1993.

[12] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.

[13] K. Hoeffgen, H. Simon, and K. Van Horn. Robust trainability of single neurons. JCSS, 52(2):114–125, 1995.

[14] M. Johanson, N. Bard, N. Burch, and M. Bowling. Finding optimal abstract strategies in extensive-form games. In AAAI Conference on Artificial Intelligence, pages 1371–1379, 2012.

[15] W. Karush. Minima of functions of several variables with inequalities as side constraints. Master's thesis, University of Chicago, Chicago, Illinois, 1939.

[16] M. Kearns, S. Suri, and N. Montfort. An experimental study of the coloring problem on human subject networks. Science, 313:824–827, 2006.

[17] D. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[18] J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

[19] H. Kuhn and A. Tucker. Nonlinear programming. In Proceedings of the 2nd Berkeley Symposium, pages 481–492. University of California Press, 1951.

[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[21] J. Lee, M. Simchowitz, M. Jordan, and B. Recht. Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, volume 49, 2016.

[22] N. Littlestone and M. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

[23] J. Mairal.
Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.

[24] A. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems 26, pages 3066–3074, 2013.

[25] N. Ratliff, D. Bagnell, and M. Zinkevich. Subgradient methods for structured prediction. In Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS), 2007.

[26] N. Ratliff, J. A. Bagnell, and M. Zinkevich. Maximum margin planning. In Twenty-Second International Conference on Machine Learning (ICML), 2006.

[27] A. Razborov. On small depth threshold circuits. In Algorithm Theory (SWAT 92), 1992.

[28] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

[29] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. In NIPS 19, 2006.

[30] S. Shalev-Shwartz and Y. Singer. A primal-dual perspective of online learning algorithms. Machine Learning, 69(2-3):115–142, 2007.

[31] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.

[32] N. Srinivasan, V. Ravichandran, K. Chan, J. Vidhya, S. Ramakirishnan, and S. Krishnan. Exponentiated backpropagation algorithm for multilayer feedforward neural networks. In ICONIP, volume 1, 2002.

[33] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of ICML, pages 1139–1147, 2013.

[34] V. Syrgkanis, A. Agarwal, H. Luo, and R. Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems 28, pages 2971–2979, 2015.

[35] O. Tammelin, N. Burch, M.
Johanson, and M. Bowling. Solving heads-up limit Texas hold'em. In International Joint Conference on Artificial Intelligence (IJCAI), pages 645–652, 2015.

[36] V. Vazirani, N. Nisan, T. Roughgarden, and É. Tardos. Algorithmic Game Theory. Cambridge University Press, 2007.

[37] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Twentieth International Conference on Machine Learning, 2003.

[38] M. Zinkevich, M. Bowling, M. Johanson, and C. Piccione. Regret minimization in games with incomplete information. In NIPS, 2007.