{"title": "Gradient Estimation Using Stochastic Computation Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 3528, "page_last": 3536, "abstract": "In a variety of problems originating in supervised, unsupervised, and reinforcement learning, the loss function is defined by an expectation over a collection of random variables, which might be part of a probabilistic model or the external world. Estimating the gradient of this loss function, using samples, lies at the core of gradient-based learning algorithms for these problems. We introduce the formalism of stochastic computation graphs--directed acyclic graphs that include both deterministic functions and conditional probability distributions--and describe how to easily and automatically derive an unbiased estimator of the loss function's gradient. The resulting algorithm for computing the gradient estimator is a simple modification of the standard backpropagation algorithm. The generic scheme we propose unifies estimators derived in a variety of prior work, along with variance-reduction techniques therein. It could assist researchers in developing intricate models involving a combination of stochastic and deterministic operations, enabling, for example, attention, memory, and control actions.", "full_text": "Gradient Estimation Using Stochastic Computation Graphs

John Schulman¹,² (joschu@eecs.berkeley.edu), Nicolas Heess¹ (heess@google.com), Theophane Weber¹ (theophane@google.com), Pieter Abbeel² (pabbeel@eecs.berkeley.edu)
¹ Google DeepMind   ² University of California, Berkeley, EECS Department

Abstract

In a variety of problems originating in supervised, unsupervised, and reinforcement learning, the loss function is defined by an expectation over a collection of random variables, which might be part of a probabilistic model or the external world.
Estimating the gradient of this loss function, using samples, lies at the core of gradient-based learning algorithms for these problems. We introduce the formalism of stochastic computation graphs—directed acyclic graphs that include both deterministic functions and conditional probability distributions—and describe how to easily and automatically derive an unbiased estimator of the loss function's gradient. The resulting algorithm for computing the gradient estimator is a simple modification of the standard backpropagation algorithm. The generic scheme we propose unifies estimators derived in a variety of prior work, along with variance-reduction techniques therein. It could assist researchers in developing intricate models involving a combination of stochastic and deterministic operations, enabling, for example, attention, memory, and control actions.

1 Introduction

The great success of neural networks is due in part to the simplicity of the backpropagation algorithm, which allows one to efficiently compute the gradient of any loss function defined as a composition of differentiable functions. This simplicity has allowed researchers to search in the space of architectures for those that are both highly expressive and conducive to optimization, yielding, for example, convolutional neural networks in vision [12] and LSTMs for sequence data [9]. However, the backpropagation algorithm is only sufficient when the loss function is a deterministic, differentiable function of the parameter vector.

A rich class of problems arising throughout machine learning requires optimizing loss functions that involve an expectation over random variables. Two broad categories of these problems are (1) likelihood maximization in probabilistic models with latent variables [17, 18], and (2) policy gradients in reinforcement learning [5, 23, 26].
Combining ideas from those two perennial topics, recent models of attention [15] and memory [29] have used networks that involve a combination of stochastic and deterministic operations.

In most of these problems, from probabilistic modeling to reinforcement learning, the loss functions and their gradients are intractable, as they involve either a sum over an exponential number of latent variable configurations, or high-dimensional integrals that have no analytic solution. Prior work (see Section 6) has provided problem-specific derivations of Monte-Carlo gradient estimators; however, to our knowledge, no previous work addresses the general case.

Appendix C recalls several classic and recent techniques in variational inference [14, 10, 21] and reinforcement learning [23, 25, 15], where the loss functions can be straightforwardly described using the formalism of stochastic computation graphs that we introduce. For these examples, the variance-reduced gradient estimators derived in prior work are special cases of the results in Sections 3 and 4.

The contributions of this work are as follows:
• We introduce a formalism of stochastic computation graphs, and in this general setting, we derive unbiased estimators for the gradient of the expected loss.
• We show how this estimator can be computed as the gradient of a certain differentiable function (which we call the surrogate loss); hence, it can be computed efficiently using the backpropagation algorithm. This observation enables a practitioner to write an efficient implementation using automatic differentiation software.
• We describe variance reduction techniques that can be applied to the setting of stochastic computation graphs, generalizing prior work from reinforcement learning and variational inference.
• We briefly describe how to generalize some other optimization techniques to this setting: majorization-minimization algorithms, by constructing an expression that bounds the loss function; and quasi-Newton / Hessian-free methods [13], by computing estimates of Hessian-vector products.

The main practical result of this article is that to compute the gradient estimator, one just needs to make a simple modification to the backpropagation algorithm, where extra gradient signals are introduced at the stochastic nodes. Equivalently, the resulting algorithm is just the backpropagation algorithm, applied to the surrogate loss function, which has extra terms introduced at the stochastic nodes. The modified backpropagation algorithm is presented in Section 5.

2 Preliminaries

2.1 Gradient Estimators for a Single Random Variable

This section will discuss computing the gradient of an expectation taken over a single random variable—the estimators described here will be the building blocks for more complex cases with multiple variables. Suppose that x is a random variable, f is a function (say, the cost), and we are interested in computing $\frac{\partial}{\partial\theta} \mathbb{E}_x[f(x)]$. There are a few different ways that the process for generating x could be parameterized in terms of θ, which lead to different gradient estimators.
• We might be given a parameterized probability distribution x ∼ p(·; θ).
In this case, we can use the score function (SF) estimator [3]:

$$\frac{\partial}{\partial\theta} \mathbb{E}_x[f(x)] = \mathbb{E}_x\!\left[ f(x) \frac{\partial}{\partial\theta} \log p(x;\theta) \right]. \quad (1)$$

This classic equation is derived as follows:

$$\frac{\partial}{\partial\theta} \mathbb{E}_x[f(x)] = \frac{\partial}{\partial\theta} \int dx\, p(x;\theta) f(x) = \int dx\, \frac{\partial}{\partial\theta} p(x;\theta) f(x) \quad (2)$$
$$= \int dx\, p(x;\theta) \frac{\partial}{\partial\theta} \log p(x;\theta) f(x) = \mathbb{E}_x\!\left[ f(x) \frac{\partial}{\partial\theta} \log p(x;\theta) \right]. \quad (3)$$

This equation is valid if and only if p(x; θ) is a continuous function of θ; however, it does not need to be a continuous function of x [4].

• x may be a deterministic, differentiable function of θ and another random variable z, i.e., we can write x(z, θ). Then, we can use the pathwise derivative (PD) estimator, defined as follows:

$$\frac{\partial}{\partial\theta} \mathbb{E}_z[f(x(z,\theta))] = \mathbb{E}_z\!\left[ \frac{\partial}{\partial\theta} f(x(z,\theta)) \right].$$

This equation, which merely swaps the derivative and expectation, is valid if and only if f(x(z, θ)) is a continuous function of θ for all z [4].¹ That is not true if, for example, f is a step function.

¹ Note that for the pathwise derivative estimator, f(x(z, θ)) merely needs to be a continuous function of θ—it is sufficient that this function is almost-everywhere differentiable. A similar statement can be made about p(x; θ) and the score function estimator. See Glasserman [4] for a detailed discussion of the technical requirements for these gradient estimators to be valid.

• Finally, θ might appear both in the probability distribution and inside the expectation, e.g., in $\frac{\partial}{\partial\theta} \mathbb{E}_{z \sim p(\cdot;\theta)}[f(x(z,\theta))]$.
Then the gradient estimator has two terms:

$$\frac{\partial}{\partial\theta} \mathbb{E}_{z \sim p(\cdot;\theta)}[f(x(z,\theta))] = \mathbb{E}_{z \sim p(\cdot;\theta)}\!\left[ \frac{\partial}{\partial\theta} f(x(z,\theta)) + \left( \frac{\partial}{\partial\theta} \log p(z;\theta) \right) f(x(z,\theta)) \right]. \quad (4)$$

This formula can be derived by writing the expectation as an integral and differentiating, as in Equation (2).

In some cases, it is possible to reparameterize a probabilistic model—moving θ from the distribution to inside the expectation, or vice versa. See [3] for a general discussion, and see [10, 21] for a recent application of this idea to variational inference.

The SF and PD estimators are applicable in different scenarios and have different properties:

1. SF is valid under more permissive mathematical conditions than PD. SF can be used if f is discontinuous, or if x is a discrete random variable.
2. SF only requires sample values f(x), whereas PD requires the derivatives f′(x). In the context of control (reinforcement learning), SF can be used to obtain unbiased policy gradient estimators in the "model-free" setting, where we have no model of the dynamics and only have access to sample trajectories.
3. SF tends to have higher variance than PD when both estimators are applicable (see for instance [3, 21]). The variance of SF increases (often linearly) with the dimensionality of the sampled variables. Hence, PD is usually preferable when x is high-dimensional. On the other hand, PD has high variance if the function f is rough, which occurs in many time-series problems due to an "exploding gradient problem" / "butterfly effect".
4. PD allows for a deterministic limit; SF does not. This idea is exploited by the deterministic policy gradient algorithm [22].

Nomenclature. The methods of estimating gradients of expectations have been independently proposed in several different fields, which use differing terminology.
What we call the score function estimator (via [3]) is alternatively called the likelihood ratio estimator [5] and REINFORCE [26]. We chose this term because the score function is a well-known object in statistics. What we call the pathwise derivative estimator (from the mathematical finance literature [4] and reinforcement learning [16]) is alternatively called infinitesimal perturbation analysis and stochastic backpropagation [21]. We chose this term because pathwise derivative is evocative of propagating a derivative through a sample path.

2.2 Stochastic Computation Graphs

The results of this article will apply to stochastic computation graphs, which are defined as follows:

Definition 1 (Stochastic Computation Graph). A directed, acyclic graph, with three types of nodes:
1. Input nodes, which are set externally, including the parameters we differentiate with respect to.
2. Deterministic nodes, which are functions of their parents.
3. Stochastic nodes, which are distributed conditionally on their parents.
Each parent v of a non-input node w is connected to it by a directed edge (v, w).

In the subsequent diagrams of this article, we will use circles to denote stochastic nodes and squares to denote deterministic nodes, as illustrated below. The structure of the graph fully specifies what estimator we will use: SF, PD, or a combination thereof. This graphical notation is shown below, along with the single-variable estimators from Section 2.1.

[Diagram: legend for input, deterministic, and stochastic nodes. The graph θ → x → f with x stochastic gives the SF estimator; the graph (θ, z) → x → f with z stochastic and x deterministic gives the PD estimator.]

2.3 Simple Examples

Several simple examples that illustrate the stochastic computation graph formalism are shown below. The gradient estimators can be described by writing the expectations as integrals and differentiating, as with the simpler estimators from Section 2.1.
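Before moving to graph-level results, the two single-variable estimators from Section 2.1 can be checked numerically. The following sketch is illustrative code, not from the paper; the choice of distribution x ∼ N(θ, 1) and cost f(x) = x² is an arbitrary toy example, for which the true gradient is ∂/∂θ E[x²] = 2θ.

```python
import numpy as np

# Toy check of the SF and PD estimators for x ~ N(theta, 1), f(x) = x^2.
# True gradient: d/dtheta E[x^2] = d/dtheta (theta^2 + 1) = 2*theta.
rng = np.random.default_rng(0)
theta, n = 1.5, 500_000

# Score function (SF): E[f(x) * d/dtheta log p(x; theta)], where
# d/dtheta log N(x; theta, 1) = (x - theta).
x = rng.normal(theta, 1.0, size=n)
sf_samples = x**2 * (x - theta)

# Pathwise derivative (PD): reparameterize x = theta + z with z ~ N(0, 1),
# so d/dtheta f(x(z, theta)) = 2 * (theta + z).
z = rng.normal(0.0, 1.0, size=n)
pd_samples = 2 * (theta + z)

print(sf_samples.mean(), pd_samples.mean())   # both ~ 2*theta = 3.0
print(sf_samples.var(), pd_samples.var())     # SF variance >> PD variance
```

Both sample means converge to 2θ, but the PD samples have variance 4 here while the SF samples have variance around 50, illustrating property 3 above.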
However, they are also implied by the general results that we will present in Section 3.

Figure 1: Simple stochastic computation graphs (circles denote stochastic nodes; squares, deterministic nodes):
(1) Graph θ → x → y → f, with y stochastic. Objective: $\mathbb{E}_y[f(y)]$. Gradient estimator: $\frac{\partial x}{\partial\theta} \frac{\partial}{\partial x} \log p(y \mid x)\, f(y)$.
(2) Graph θ → x → y → f, with x stochastic. Objective: $\mathbb{E}_x[f(y(x))]$. Gradient estimator: $\frac{\partial}{\partial\theta} \log p(x \mid \theta)\, f(y(x))$.
(3) Graph θ → x → y → f, with x and y stochastic. Objective: $\mathbb{E}_{x,y}[f(y)]$. Gradient estimator: $\frac{\partial}{\partial\theta} \log p(x \mid \theta)\, f(y)$.
(4) Graph with θ → x (stochastic) and θ → y (deterministic), both feeding f. Objective: $\mathbb{E}_x[f(x, y(\theta))]$. Gradient estimator: $\frac{\partial}{\partial\theta} \log p(x \mid \theta)\, f(x, y(\theta)) + \frac{\partial y}{\partial\theta} \frac{\partial f}{\partial y}$.
(5) Chain x₀ → x₁ → x₂ with θ influencing the stochastic nodes x₁, x₂ and costs f₁(x₁), f₂(x₂). Objective: $\mathbb{E}_{x_1,x_2}[f_1(x_1) + f_2(x_2)]$. Gradient estimator: $\frac{\partial}{\partial\theta} \log p(x_1 \mid \theta, x_0)\,(f_1(x_1) + f_2(x_2)) + \frac{\partial}{\partial\theta} \log p(x_2 \mid \theta, x_1)\, f_2(x_2)$.

These simple examples illustrate several important motifs, where stochastic and deterministic nodes are arranged in series or in parallel. For example, note that in (2) the derivative of y does not appear in the estimator, since the path from θ to f is "blocked" by x. Similarly, in (3), p(y | x) does not appear (this type of behavior is particularly useful if we only have access to a simulator of a system, but not access to the actual likelihood function). On the other hand, (4) has a direct path from θ to f, which contributes a term to the gradient estimator. (5) resembles a parameterized Markov reward process, and it illustrates that we'll obtain score function terms of the form grad log-probability × future costs.

The examples above all have one input θ, but the formalism accommodates models with multiple inputs, for example a stochastic neural network with multiple layers of weights and biases, which may influence different subsets of the stochastic and cost nodes. See Appendix C for nontrivial examples with stochastic nodes and multiple inputs.
The figure on the right shows a deterministic computation graph representing classification loss for a two-layer neural network, which has four parameters (W1, b1, W2, b2) (weights and biases). Of course, this deterministic computation graph is a special type of stochastic computation graph.

[Figure: x → h1 (via W1, b1) → h2 (via W2, b2) → softmax → cross-entropy loss, with y = label as a second input to the loss.]

3 Main Results on Stochastic Computation Graphs

3.1 Gradient Estimators

This section will consider a general stochastic computation graph, in which a certain set of nodes are designated as costs, and we would like to compute the gradient of the sum of costs with respect to some input node θ.

In brief, the main results of this section are as follows:
1. We derive a gradient estimator for an expected sum of costs in a stochastic computation graph. This estimator contains two parts: (1) a score function part, which is a sum of terms grad log-prob of variable × sum of costs influenced by variable; and (2) a pathwise derivative term, which propagates the dependence through differentiable functions.
2. This gradient estimator can be computed efficiently by differentiating an appropriate "surrogate" objective function.

Let Θ denote the set of input nodes, D the set of deterministic nodes, and S the set of stochastic nodes. Further, we will designate a set of cost nodes C, which are scalar-valued and deterministic. (Note that there is no loss of generality in assuming that the costs are deterministic—if a cost is stochastic, we can simply append a deterministic node that applies the identity function to it.) We will use θ to denote an input node (θ ∈ Θ) that we differentiate with respect to.
In the context of machine learning, we will usually be most concerned with differentiating with respect to a parameter vector (or tensor); however, the theory we present does not make any assumptions about what θ represents.

For the results that follow, we need to define the notion of "influence", for which we will introduce two relations, ≺ and ≺_D. The relation v ≺ w ("v influences w") means that there exists a sequence of nodes a₁, a₂, ..., a_K, with K ≥ 0, such that (v, a₁), (a₁, a₂), ..., (a_{K−1}, a_K), (a_K, w) are edges in the graph. The relation v ≺_D w ("v deterministically influences w") is defined similarly, except that now we require that each a_k is a deterministic node. For example, in Figure 1, diagram (5) above, θ influences {x₁, x₂, f₁, f₂}, but it only deterministically influences {x₁, x₂}.

Notation Glossary. Θ: input nodes. D: deterministic nodes. S: stochastic nodes. C: cost nodes. v ≺ w: v influences w. v ≺_D w: v deterministically influences w. DEPS_v: "dependencies", {w ∈ Θ ∪ S | w ≺_D v}. Q̂_v: sum of cost nodes influenced by v. v̂: the sampled value of the node v.

Next, we will establish a condition that is sufficient for the existence of the gradient. Namely, we will stipulate that every edge (v, w) with w lying in the "influenced" set of θ corresponds to a differentiable dependency: if w is deterministic, then the Jacobian ∂w/∂v must exist; if w is stochastic, then the probability mass function p(w | v, ...) must be differentiable with respect to v. More formally:

Condition 1 (Differentiability Requirements). Given input node θ ∈ Θ, for all edges (v, w) which satisfy θ ≺_D v and θ ≺_D w, the following condition holds: if w is deterministic, the Jacobian ∂w/∂v exists, and if w is stochastic, the derivative of the probability mass function, ∂/∂v p(w | PARENTS_w), exists.

Note that Condition 1 does not require that all the functions in the graph be differentiable. If the path from an input θ to deterministic node v is blocked by stochastic nodes, then v may be a nondifferentiable function of its parents. If a path from input θ to stochastic node v is blocked by other stochastic nodes, the likelihood of v given its parents need not be differentiable; in fact, it does not need to be known.²

² This fact is particularly important for reinforcement learning, allowing us to compute policy gradient estimates despite having a discontinuous dynamics function or reward function.

We need a few more definitions to state the main theorems. Let DEPS_v := {w ∈ Θ ∪ S | w ≺_D v}, the "dependencies" of node v, i.e., the set of nodes that deterministically influence it. Note the following:
• If v ∈ S, the probability mass function of v is a function of DEPS_v, i.e., we can write p(v | DEPS_v).
• If v ∈ D, v is a deterministic function of DEPS_v, so we can write v(DEPS_v).
Let Q̂_v := Σ_{c ∈ C, v ≺ c} ĉ, i.e., the sum of costs downstream of node v. These costs will be treated as constant, fixed to the values obtained during sampling. In general, we will use the hat symbol v̂ to denote a sample value of variable v, which will be treated as constant in the gradient formulae.

Now we can write down a general expression for the gradient of the expected sum of costs in a stochastic computation graph:

Theorem 1. Suppose that θ ∈ Θ satisfies Condition 1.
Then the following two equivalent equations hold:

$$\frac{\partial}{\partial\theta} \mathbb{E}\Big[\sum_{c \in C} c\Big] = \mathbb{E}\Bigg[ \sum_{\substack{w \in S, \\ \theta \prec_D w}} \Big( \frac{\partial}{\partial\theta} \log p(w \mid \mathrm{DEPS}_w) \Big) \hat{Q}_w + \sum_{\substack{c \in C, \\ \theta \prec_D c}} \frac{\partial}{\partial\theta} c(\mathrm{DEPS}_c) \Bigg] \quad (5)$$

$$= \mathbb{E}\Bigg[ \sum_{c \in C} \hat{c} \sum_{\substack{w \prec c, \\ \theta \prec_D w}} \frac{\partial}{\partial\theta} \log p(w \mid \mathrm{DEPS}_w) + \sum_{\substack{c \in C, \\ \theta \prec_D c}} \frac{\partial}{\partial\theta} c(\mathrm{DEPS}_c) \Bigg] \quad (6)$$

Proof: See Appendix A.

The estimator expressions above have two terms. The first term is due to the influence of θ on probability distributions. The second term is due to the influence of θ on the cost variables through a chain of differentiable functions. The first term in Equation (5) involves a sum of gradients times "downstream" costs, whereas the first term in Equation (6) has a sum of costs times "upstream" gradients.

3.2 Surrogate Loss Functions

The next corollary lets us write down a "surrogate" objective L, which is a function of the inputs that we can differentiate to obtain an unbiased gradient estimator.

Corollary 1. Let $L(\Theta, \mathcal{S}) := \sum_w \log p(w \mid \mathrm{DEPS}_w)\, \hat{Q}_w + \sum_{c \in C} c(\mathrm{DEPS}_c)$. Then differentiation of L gives us an unbiased gradient estimate: $\frac{\partial}{\partial\theta} \mathbb{E}\big[\sum_{c \in C} c\big] = \mathbb{E}\big[\frac{\partial}{\partial\theta} L(\Theta, \mathcal{S})\big]$.

One practical consequence of this result is that we can apply a standard automatic differentiation procedure to L to obtain an unbiased gradient estimator. In other words, we convert the stochastic computation graph into a deterministic computation graph, to which we can apply the backpropagation algorithm.

There are several alternative ways to define the surrogate objective function that give the same gradient as L from Corollary 1.
We could also write $L(\Theta, \mathcal{S}) := \sum_w \frac{p(\hat{w} \mid \mathrm{DEPS}_w)}{\hat{P}_w} \hat{Q}_w + \sum_{c \in C} c(\mathrm{DEPS}_c)$, where $\hat{P}_w$ is the probability $p(w \mid \mathrm{DEPS}_w)$ obtained during sampling, which is viewed as a constant.

The surrogate objective from Corollary 1 is actually an upper bound on the true objective in the case that (1) all costs c ∈ C are negative, and (2) the costs are not deterministically influenced by the parameters Θ. This construction allows majorization-minimization algorithms (similar to EM) to be applied to general stochastic computation graphs. See Appendix B for details.

Figure 2: Deterministic computation graphs obtained as surrogate loss functions of the stochastic computation graphs from Figure 1. For example, graph (1) becomes log p(y | x) f̂; graphs (2)-(4) each contain the node log p(x; θ) f̂ (with graph (4) retaining the deterministic path from θ through y to f); and graph (5) becomes log p(x₁ | x₀; θ)(f̂₁ + f̂₂) + log p(x₂ | x₁; θ) f̂₂.

3.3 Higher-Order Derivatives

The gradient estimator for a stochastic computation graph is itself a stochastic computation graph. Hence, it is possible to compute the gradient yet again (for each component of the gradient vector), and get an estimator of the Hessian. For most problems of interest, it is not efficient to compute this dense Hessian. On the other hand, one can also differentiate the gradient-vector product to get a Hessian-vector product—this computation is usually not much more expensive than the gradient computation itself. The Hessian-vector product can be used to implement a quasi-Newton algorithm via the conjugate gradient algorithm [28]. A variant of this technique, called Hessian-free
A variant of this technique, called Hessian-free\noptimization [13], has been used to train large neural networks.\n\n@\u2713 Ex\u21e0p(\u00b7; \u2713) [f (x)] = Ex\u21e0p(\u00b7; \u2713)\u21e5 @\n\n4 Variance Reduction\nConsider estimating @\n@\u2713 Ex\u21e0p(\u00b7; \u2713) [f (x)]. Clearly this expectation is unaffected by subtracting a con-\n@\u2713 Ex\u21e0p(\u00b7; \u2713) [f (x) b]. Taking the score function estimator,\nstant b from the integrand, which gives @\n@\u2713 log p(x; \u2713)(f (x) b)\u21e4. Taking b = Ex [f (x)] gener-\nwe get @\nally leads to substantial variance reduction\u2014b is often called a baseline3 (see [6] for a more thorough\ndiscussion of baselines and their variance reduction properties).\nWe can make a general statement for the case of stochastic computation graphs\u2014that we can\nadd a baseline to every stochastic node, which depends all of the nodes it doesn\u2019t in\ufb02uence. Let\nNONINFLUENCED(v) := {w | v \u2303 w}.\nTheorem 2.\n\n\u2713 @\n\n@\u2713\n\nlog p(v | PARENTSv)\u25c6( \u02c6Qv b(NONINFLUENCED(v)) + Xc2C\u232b\u2713\n\n@\n@\u2713\n\nc375\n\n@\n\n@\u2713 E\"Xc2C\n\nc# = E264Xv2Sv\u2713\n\nProof: See Appendix A.\n\n5 Algorithms\nAs shown in Section 3, the gradient estimator can be obtained by differentiating a surrogate objective\nfunction L. Hence, this derivative can be computed by performing the backpropagation algorithm\non L. That is likely to be the most practical and ef\ufb01cient method, and can be facilitated by automatic\ndifferentiation software.\nAlgorithm 1 shows explicitly how to compute the gradient estimator in a backwards pass through\nthe stochastic computation graph. The algorithm will recursively compute gv := @\nevery deterministic and input node v.\n\n@v E\uf8ffPc2Cvc\n\nc at\n\n6 Related Work\nAs discussed in Section 2, the score function and pathwise derivative estimators have been used in a\nvariety of different \ufb01elds, under different names. 
See [3] for a review of gradient estimation, mostly from the simulation optimization literature. Glasserman's textbook [4] provides an extensive treatment of various gradient estimators and Monte Carlo estimators in general. Griewank and Walther's textbook [8] is a comprehensive reference on computation graphs and automatic differentiation (of deterministic programs). The notation and nomenclature we use is inspired by Bayes nets and influence diagrams [19]. (In fact, a stochastic computation graph is a type of Bayes network, where the deterministic nodes correspond to degenerate probability distributions.)

The topic of gradient estimation has drawn significant recent interest in machine learning. Gradients for networks with stochastic units were investigated in Bengio et al. [2], though they are concerned with differentiating through individual units and layers, not with how to deal with arbitrarily structured models and loss functions.

³ The optimal baseline for scalar θ is in fact the weighted expectation $\frac{\mathbb{E}_x[f(x)\, s(x)^2]}{\mathbb{E}_x[s(x)^2]}$, where $s(x) = \frac{\partial}{\partial\theta} \log p(x;\theta)$.

Algorithm 1: Compute Gradient Estimator for Stochastic Computation Graph

for v ∈ Graph do
    g_v = 1_{dim v} if v ∈ C, else 0_{dim v}        ▷ Initialization at output nodes
end for
Compute Q̂_w for all nodes w ∈ Graph
for v in REVERSETOPOLOGICALSORT(NONINPUTS) do      ▷ Reverse traversal
    for w ∈ PARENTS_v do
        if not ISSTOCHASTIC(w) then
            if ISSTOCHASTIC(v) then
                g_w += (∂/∂w log p(v | PARENTS_v)) Q̂_v
            else
                g_w += (∂v/∂w)ᵀ g_v
            end if
        end if
    end for
end for
return [g_θ]_{θ ∈ Θ}
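Algorithm 1 can be made concrete with a toy implementation; the code below is an illustrative sketch (names, graph encoding, and distribution choices are our assumptions, not the paper's code) that runs the backward pass on graph (2) of Figure 1, with x ∼ N(θ, 1) stochastic, y = 2x and f = y² deterministic, and f the single cost node.

```python
import numpy as np

# Toy backward pass in the spirit of Algorithm 1 for the graph
# theta -> x -> y -> f, where x ~ N(theta, 1) is stochastic.
# The resulting single-sample estimator is
# (d/dtheta log p(x | theta)) * f_hat = (x - theta) * f_hat,
# whose expectation is d/dtheta E[(2x)^2] = 8 * theta.

def estimate_grad(theta, rng):
    # Forward sampling pass.
    x = rng.normal(theta, 1.0)
    y = 2.0 * x
    f = y**2

    # Q_hat: sum of sampled costs downstream of each node.
    Qhat = {"theta": f, "x": f, "y": f, "f": f}

    g = {"theta": 0.0, "x": 0.0, "y": 0.0, "f": 1.0}   # 1 at cost nodes
    stochastic = {"x"}
    parents = {"f": ["y"], "y": ["x"], "x": ["theta"]}
    # Jacobians of deterministic nodes w.r.t. their parents (sampled values).
    jac = {("f", "y"): 2.0 * y, ("y", "x"): 2.0}
    # Grad-log-probability of stochastic nodes w.r.t. their parents.
    dlogp = {("x", "theta"): x - theta}    # d/dtheta log N(x; theta, 1)

    for v in ["f", "y", "x"]:              # reverse topological order
        for w in parents[v]:
            if w in stochastic:
                continue                   # gradients stop at stochastic parents
            if v in stochastic:
                g[w] += dlogp[(v, w)] * Qhat[v]   # score-function term
            else:
                g[w] += jac[(v, w)] * g[v]        # ordinary backprop term
    return g["theta"]

rng = np.random.default_rng(2)
theta = 1.0
grads = [estimate_grad(theta, rng) for _ in range(200_000)]
print(np.mean(grads))   # ~ 8 * theta = 8.0
```

Averaged over samples, the estimator matches the analytic gradient 8θ; note that the score-function term is injected exactly where a stochastic node meets a differentiable parent, as described in Section 1.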
Kingma and Welling [11] consider a similar framework, although only with continuous latent variables, and point out that reparameterization can be used to convert hierarchical Bayesian models into neural networks, which can then be trained by backpropagation.

The score function method is used to perform variational inference in general models (in the context of probabilistic programming) in Wingate and Weber [27], and similarly in Ranganath et al. [20]; both papers mostly focus on mean-field approximations without amortized inference. It is used to train generative models using neural networks with discrete stochastic units in Mnih and Gregor [14] and Gregor et al. [7]; both amortize inference by using an inference network.

Generative models with continuous-valued latent variables are trained (again using an inference network) with the reparameterization method by Rezende, Mohamed, and Wierstra [21] and by Kingma and Welling [10]. Rezende et al. also provide a detailed discussion of reparameterization, including a discussion comparing the variance of the SF and PD estimators.

Bengio, Léonard, and Courville [2] have recently written a paper about gradient estimation in neural networks with stochastic units or non-differentiable activation functions—including Monte Carlo estimators and heuristic approximations. The notion that policy gradients can be computed in multiple ways was pointed out in early work on policy gradients by Williams [26]. However, all of this prior work deals with specific structures of the stochastic computation graph and does not address the general case.

7 Conclusion

We have developed a framework for describing a computation with stochastic and deterministic operations, called a stochastic computation graph.
Given a stochastic computation graph, we can automatically obtain a gradient estimator, provided that the graph satisfies the appropriate conditions on differentiability of the functions at its nodes. The gradient can be computed efficiently in a backwards traversal through the graph: one approach is to apply the standard backpropagation algorithm to one of the surrogate loss functions from Section 3; another approach (which is roughly equivalent) is to apply the modified backpropagation procedure shown in Algorithm 1. The results we have presented are sufficiently general to automatically reproduce a variety of gradient estimators that have been derived in prior work in reinforcement learning and probabilistic modeling, as we show in Appendix C. We hope that this work will facilitate further development of interesting and expressive models.

8 Acknowledgements

We would like to thank Shakir Mohamed, Dave Silver, Yuval Tassa, Andriy Mnih, and others at DeepMind for insightful comments.

References

[1] J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, pages 319–350, 2001.
[2] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[3] M. C. Fu. Gradient estimation. Handbooks in Operations Research and Management Science, 13:575–616, 2006.
[4] P. Glasserman. Monte Carlo Methods in Financial Engineering, volume 53. Springer Science & Business Media, 2003.
[5] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
[6] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. The Journal of Machine Learning Research, 5:1471–1530, 2004.
[7] K.
Gregor, I. Danihelka, A. Mnih, C. Blundell, and D. Wierstra. Deep autoregressive networks. arXiv preprint arXiv:1310.8499, 2013.
[8] A. Griewank and A. Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, 2008.
[9] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[10] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv:1312.6114, 2013.
[11] D. P. Kingma and M. Welling. Efficient gradient-based inference through transformations between Bayes nets and neural nets. arXiv preprint arXiv:1402.0480, 2014.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[13] J. Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 735–742, 2010.
[14] A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. arXiv:1402.0030, 2014.
[15] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212, 2014.
[16] R. Munos. Policy gradient in continuous time. The Journal of Machine Learning Research, 7:771–791, 2006.
[17] R. M. Neal. Learning stochastic feedforward networks. Department of Computer Science, University of Toronto, 1990.
[18] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Springer, 1998.
[19] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 2014.
[20] R. Ranganath, S. Gerrish, and D. M.
Blei. Black box variational inference. arXiv preprint arXiv:1401.0118, 2013.
[21] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv:1401.4082, 2014.
[22] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
[23] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pages 1057–1063. Citeseer, 1999.
[24] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis. Learning model-free robot control by a Monte Carlo EM algorithm. Autonomous Robots, 27(2):123–130, 2009.
[25] D. Wierstra, A. Förster, J. Peters, and J. Schmidhuber. Recurrent policy gradients. Logic Journal of IGPL, 18(5):620–634, 2010.
[26] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
[27] D. Wingate and T. Weber. Automated variational inference in probabilistic programming. arXiv preprint arXiv:1301.1299, 2013.
[28] S. J. Wright and J. Nocedal. Numerical Optimization, volume 2. Springer New York, 1999.
[29] W. Zaremba and I. Sutskever. Reinforcement learning neural Turing machines. arXiv preprint arXiv:1505.00521, 2015.
", "award": [], "sourceid": 1947, "authors": [{"given_name": "John", "family_name": "Schulman", "institution": "UC Berkeley / Google"}, {"given_name": "Nicolas", "family_name": "Heess", "institution": "Google DeepMind"}, {"given_name": "Theophane", "family_name": "Weber", "institution": "Google DeepMind"}, {"given_name": "Pieter", "family_name": "Abbeel", "institution": "UC Berkeley"}]}