{"title": "Reparameterization Gradient for Non-differentiable Models", "book": "Advances in Neural Information Processing Systems", "page_first": 5553, "page_last": 5563, "abstract": "We present a new algorithm for stochastic variational inference that targets at models with non-differentiable densities. One of the key challenges in stochastic variational inference is to come up with a low-variance estimator of the gradient of a variational objective. We tackle the challenge by generalizing the reparameterization trick, one of the most effective techniques for addressing the variance issue for differentiable models, so that the trick works for non-differentiable models as well. Our algorithm splits the space of latent variables into regions where the density of the variables is differentiable, and their boundaries where the density may fail to be differentiable. For each differentiable region, the algorithm applies the standard reparameterization trick and estimates the gradient restricted to the region. For each potentially non-differentiable boundary, it uses a form of manifold sampling and computes the direction for variational parameters that, if followed, would increase the boundary\u2019s contribution to the variational objective. The sum of all the estimates becomes the gradient estimate of our algorithm. Our estimator enjoys the reduced variance of the reparameterization gradient while remaining unbiased even for non-differentiable models. The experiments with our preliminary implementation confirm the benefit of reduced variance and unbiasedness.", "full_text": "Reparameterization Gradient\nfor Non-differentiable Models\n\nWonyeol Lee\n\nSchool of Computing, KAIST\n\nDaejeon, South Korea\n\nHangyeol Yu\n\nHongseok Yang\n\n{wonyeol, yhk1344, hongseok.yang}@kaist.ac.kr\n\nAbstract\n\nWe present a new algorithm for stochastic variational inference that targets at\nmodels with non-differentiable densities. 
One of the key challenges in stochastic variational inference is to come up with a low-variance estimator of the gradient of a variational objective. We tackle the challenge by generalizing the reparameterization trick, one of the most effective techniques for addressing the variance issue for differentiable models, so that the trick works for non-differentiable models as well. Our algorithm splits the space of latent variables into regions where the density of the variables is differentiable, and their boundaries where the density may fail to be differentiable. For each differentiable region, the algorithm applies the standard reparameterization trick and estimates the gradient restricted to the region. For each potentially non-differentiable boundary, it uses a form of manifold sampling and computes the direction for variational parameters that, if followed, would increase the boundary's contribution to the variational objective. The sum of all the estimates becomes the gradient estimate of our algorithm. Our estimator enjoys the reduced variance of the reparameterization gradient while remaining unbiased even for non-differentiable models. The experiments with our preliminary implementation confirm the benefit of reduced variance and unbiasedness.

1 Introduction

Stochastic variational inference (SVI) is a popular choice for performing posterior inference in Bayesian machine learning. It picks a family of variational distributions, and formulates posterior inference as a problem of finding a member of this family that is closest to the target posterior. SVI then solves this optimization problem approximately using stochastic gradient ascent. One major challenge in developing an effective SVI algorithm is the difficulty of designing a low-variance estimator for the gradient of the optimization objective. 
Addressing this challenge has been the driver of recent advances in SVI, such as the reparameterization trick [13, 30, 31, 26, 15], clever control variates [28, 7, 8, 34, 6, 23], and continuous relaxations of discrete distributions [20, 10].
Our goal is to tackle the challenge for models with non-differentiable densities. Such a model naturally arises when one starts to use both discrete and continuous random variables or specifies a model using programming constructs, such as if statements, as in probabilistic programming [4, 22, 37, 5]. The high variance of a gradient estimate is a more serious issue for these models than for those with differentiable densities. Key techniques for addressing it simply do not apply in the absence of differentiability. For instance, a prerequisite for the so-called reparameterization trick is the differentiability of a model's density function.
In this paper, we present a new gradient estimator for non-differentiable models. Our estimator splits the space of latent variables into regions where the joint density of the variables is differentiable, and their boundaries where the density may fail to be differentiable. For each differentiable region, the estimator applies the standard reparameterization trick and estimates the gradient restricted to the region.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

For each potentially non-differentiable boundary, it uses a form of manifold sampling, and computes the direction for variational parameters that, if followed, would increase the boundary's contribution to the variational objective. This manifold sampling step cannot be skipped if we want to get an unbiased estimator, and it only adds a linear overhead to the overall estimation time for a large class of non-differentiable models. 
The result of our gradient estimator is the sum of all the estimated values for regions and boundaries.
Our estimator generalizes the estimator based on the reparameterization trick. When a model has a differentiable density, the two estimators coincide. But even when a model's density is not differentiable, so that the reparameterization estimator is not applicable, ours still applies; it continues to be an unbiased estimator, and enjoys variance reduction from reparameterization. The unbiasedness of our estimator is not trivial, and follows from an existing yet less well-known theorem on exchanging integration and differentiation under a moving domain [3] and the divergence theorem.
We have implemented a prototype of an SVI algorithm that uses our gradient estimator and works for models written in a simple first-order loop-free probabilistic programming language. The experiments with this prototype confirm the strength of our estimator in terms of variance reduction.

2 Variational Inference and Reparameterization Gradient

Before presenting our results, we review the basics of stochastic variational inference.
Let x and z be, respectively, observed and latent variables living in R^m and R^n, and p(x, z) a density that specifies a probabilistic model for x and z. We are interested in inferring information about the posterior density p(z|x_0) for a given value x_0 of x.
Variational inference approaches this posterior-inference problem from the optimization angle. It recasts posterior inference as a problem of finding a best approximation to the posterior among a collection of pre-selected distributions {q_θ(z)}_{θ∈R^d}, called variational distributions, which all have easy-to-compute and easy-to-differentiate densities and permit efficient sampling. 
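For instance, a diagonal-Gaussian variational family admits both easy density evaluation and easy sampling. The sketch below is our own minimal illustration (the parameter values are arbitrary) of drawing from such a family by pushing parameter-free noise through a smooth map, which is exactly the structure the reparameterization trick will exploit:

```python
import numpy as np

rng = np.random.default_rng(3)

# A diagonal-Gaussian variational family on R^2: theta = (mu, log_sigma).
# The values below are arbitrary example parameters.
mu = np.array([1.0, -2.0])
log_sigma = np.array([0.0, 0.5])

# Sampling z ~ q_theta via a smooth map f_theta of parameter-free noise
# eps ~ N(0, I): z = f_theta(eps) = mu + sigma * eps.
eps = rng.standard_normal((100_000, 2))
z = mu + np.exp(log_sigma) * eps  # f_theta(eps)

# Empirical check that f_theta(eps) indeed has the distribution q_theta.
assert np.allclose(z.mean(axis=0), mu, atol=0.03)
assert np.allclose(z.std(axis=0), np.exp(log_sigma), atol=0.03)
```

Because the map is smooth in the parameters, gradients with respect to (mu, log_sigma) can flow through sampled z's, which is the basis of the estimator recalled next.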
A standard objective for this optimization is to maximize a lower bound of log p(x_0) called the evidence lower bound, or simply ELBO:
$$\operatorname{argmax}_\theta\ \mathrm{ELBO}_\theta, \quad\text{where}\quad \mathrm{ELBO}_\theta \triangleq \mathbb{E}_{q_\theta(z)}\!\left[\log\frac{p(x_0,z)}{q_\theta(z)}\right]. \tag{1}$$
It is equivalent to the objective of minimizing the KL divergence from q_θ(z) to the posterior p(z|x_0).
Most recent variational-inference algorithms solve the optimization problem (1) by stochastic gradient ascent. They repeatedly estimate the gradient of ELBO_θ and move θ in the direction of this estimate:
$$\theta \leftarrow \theta + \eta\cdot\widehat{\nabla_\theta\mathrm{ELBO}_\theta}.$$
The success of this iterative scheme crucially depends on whether it can estimate the gradient well in terms of computation time and variance. As a result, a large part of the research effort on stochastic variational inference has been devoted to constructing low-variance gradient estimators or reducing the variance of existing estimators.
The reparameterization trick [13, 30] is the technique of choice for constructing a low-variance gradient estimator for models with differentiable densities. It can be applied in our case if the joint p(x, z) is differentiable with respect to the latent variable z. The trick is a two-step recipe for building a gradient estimator. First, it tells us to find a distribution q(ε) on R^n and a smooth function f : R^d × R^n → R^n such that f_θ(ε) for ε ∼ q(ε) has the distribution q_θ. Next, the reparameterization trick suggests using the following estimator:
$$\widehat{\nabla_\theta\mathrm{ELBO}_\theta} \triangleq \frac{1}{N}\sum_{i=1}^{N}\nabla_\theta\log\frac{r(f_\theta(\epsilon_i))}{q_\theta(f_\theta(\epsilon_i))}, \quad\text{where}\quad r(z)\triangleq p(x_0,z)\ \text{and}\ \epsilon_1,\dots,\epsilon_N\sim q(\epsilon). \tag{2}$$
The reparameterization gradient in (2) is unbiased, and has variance significantly lower than the so-called score estimator (or REINFORCE) [35, 27, 36, 28], which does not exploit differentiability. But so far its use has been limited to differentiable models. We will next explain how to lift this limitation.

3 Reparameterization for Non-differentiable Models

Our main result is a new unbiased gradient estimator for a class of non-differentiable models, which can use the reparameterization trick despite the non-differentiability.
Recall the notation from the previous section: x ∈ R^m and z ∈ R^n for observed and latent variables, p(x, z) for their joint density, x_0 for an observed value, and q_θ(z) for a variational distribution parameterized by θ ∈ R^d.
Our result makes two assumptions. First, the variational distribution q_θ(z) satisfies the conditions of the reparameterization gradient. Namely, q_θ(z) is continuously differentiable with respect to θ ∈ R^d, and is the distribution of f_θ(ε) for a smooth function f : R^d × R^n → R^n and a random variable ε ∈ R^n distributed by q(ε). Also, the function f_θ on R^n is bijective for every θ ∈ R^d. Second, the joint density r(z) = p(x_0, z) at x = x_0 has the following form:
$$r(z) = \sum_{k=1}^{K} \mathbb{1}[z\in R_k]\cdot r_k(z), \tag{3}$$
where r_k is a non-negative continuously differentiable function R^n → R, R_k is a (measurable) subset of R^n with measurable boundary ∂R_k such that $\int_{\partial R_k} dz = 0$, and {R_k}_{1≤k≤K} is a partition of R^n.
Note that r(z) is an unnormalized posterior under the observation x = x_0. The assumption indicates that the posterior r may be non-differentiable at some z's, but all the non-differentiabilities occur only at the boundaries ∂R_k of regions R_k. 
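To make the form (3) concrete, here is a small sketch of our own (not taken from the paper's implementation; the branch constants c1, c2 are hypothetical placeholders for the likelihood under each branch) of a density with K = 2 regions created by an if statement, exactly the kind of construct that probabilistic programs use:

```python
import numpy as np

def normpdf(z, mu, sigma):
    # Density of N(z | mu, sigma^2).
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Unnormalized posterior r(z) of the form (3) with K = 2 regions
# R1 = {z > 0} and R2 = {z <= 0}; the if-branch is the source of
# non-differentiability at the boundary {0}.
c1, c2 = 0.8, 0.2  # hypothetical branch constants, for illustration only

def r(z):
    if z > 0:
        return c1 * normpdf(z, 0.0, 1.0)  # r_1(z) on region R1
    else:
        return c2 * normpdf(z, 0.0, 1.0)  # r_2(z) on region R2

# r is built from two smooth pieces, but it jumps at the boundary z = 0,
# so it is not even continuous there, let alone differentiable.
jump = r(1e-12) - r(-1e-12)
assert abs(jump - (c1 - c2) / np.sqrt(2 * np.pi)) < 1e-9
```

Each piece r_k is smooth on its own region, and the single boundary point {0} is a null set, matching the assumption above.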
Also, it ensures that when considered under the usual Lebesgue measure on R^n, these non-differentiable points are negligible (i.e., they are included in a null set of the measure). As we illustrate in the experiments section, models satisfying our assumption naturally arise when one starts to use both discrete and continuous random variables or specifies models using programming constructs, such as if statements, as in probabilistic programming [4, 22, 37, 5].
Our estimator is derived from the following theorem:

Theorem 1. Let r(z) be as in (3), and define
$$h_k(\epsilon, \theta) \triangleq \log\frac{r_k(f_\theta(\epsilon))}{q_\theta(f_\theta(\epsilon))}, \qquad V(\epsilon, \theta) \in \mathbb{R}^{d\times n}, \qquad V(\epsilon, \theta)_{ij} \triangleq \frac{\partial}{\partial\theta_i}\Big(f_\theta^{-1}(z)\Big)_j\,\Big|_{z=f_\theta(\epsilon)}.$$
Then,
$$\nabla_\theta \mathrm{ELBO}_\theta = \underbrace{\mathbb{E}_{q(\epsilon)}\Bigg[\sum_{k=1}^{K} \mathbb{1}[f_\theta(\epsilon)\in R_k]\cdot\nabla_\theta h_k(\epsilon,\theta)\Bigg]}_{\mathrm{RepGrad}_\theta} \;+\; \underbrace{\sum_{k=1}^{K} \int_{f_\theta^{-1}(\partial R_k)} \big(q(\epsilon)\,h_k(\epsilon,\theta)\,V(\epsilon,\theta)\big) \bullet d\Sigma}_{\mathrm{BouContr}_\theta},$$
where the second term uses the surface integral of q(ε)h_k(ε, θ)V(ε, θ) over the boundary f_θ^{-1}(∂R_k) expressed in terms of ε, dΣ is the normal vector of this boundary that is outward-pointing with respect to f_θ^{-1}(R_k), and the • operation denotes matrix-vector multiplication.

The theorem says that the gradient of ELBO_θ comes from two sources. The first is the usual reparameterized gradient of each h_k but restricted to its region R_k. The second source is the sum of the surface integrals over the region boundaries ∂R_k. 
Intuitively, the surface integral for k computes the direction to move θ in order to increase the contribution of the boundary ∂R_k to ELBO_θ. Note that the integrand of the surface integral has the additional V term. This term is a by-product of rephrasing the original integration over z in terms of the reparameterization variable ε. We write RepGrad_θ for the contribution from the first source, and BouContr_θ for that from the second source.
The proof of the theorem uses an existing but less well-known theorem about interchanging integration and differentiation under a moving domain [3], together with the divergence theorem. It appears in the supplementary material of this paper.
At this point, some readers may feel uneasy about the BouContr_θ term in our theorem. They may reason like this. Every boundary ∂R_k is a measure-zero set in R^n, and non-differentiabilities occur only at these ∂R_k's. So, why do we need more than RepGrad_θ, the case-split version of the usual reparameterization? Unfortunately, this heuristic reasoning is incorrect, as indicated by the following proposition:

Proposition 2. There are models satisfying this section's conditions such that ∇_θ ELBO_θ ≠ RepGrad_θ.

Proof. Consider the model p(x, z) = N(z|0, 1)·(1[z > 0]·N(x|5, 1) + 1[z ≤ 0]·N(x|−2, 1)) for x ∈ R and z ∈ R, the variational distribution q_θ(z) = N(z|θ, 1) for θ ∈ R, and its reparameterization f_θ(ε) = ε + θ and q(ε) = N(ε|0, 1) for ε ∈ R. For an observed value x_0 = 0, the joint density p(x_0, z) becomes r(z) = 1[z > 0]·c_1 N(z|0, 1) + 1[z ≤ 0]·c_2 N(z|0, 1), where c_1 = N(0|5, 1) and c_2 = N(0|−2, 1). 
Notice that r is non-differentiable only at z = 0, and {0} is a null set in R.
For any θ, ∇_θ ELBO_θ is computed as follows. Since
$$\log\frac{r(z)}{q_\theta(z)} = \mathbb{1}[z>0]\cdot\Big(\tfrac{\theta^2}{2} - z\theta + \log c_1\Big) + \mathbb{1}[z\le 0]\cdot\Big(\tfrac{\theta^2}{2} - z\theta + \log c_2\Big),$$
we have¹
$$\mathrm{ELBO}_\theta = \tfrac{1}{2}\Big[-\theta^2 + \mathrm{erf}(\theta/\sqrt{2})\log(c_1/c_2) + \log(c_1 c_2)\Big], \quad\text{and thus}\quad \nabla_\theta\mathrm{ELBO}_\theta = -\theta + \frac{\log(c_1/c_2)\,\exp(-\theta^2/2)}{\sqrt{2\pi}}.$$
On the other hand, RepGrad_θ is computed as follows. After reparameterizing z into ε, we have
$$\log\frac{r(f_\theta(\epsilon))}{q_\theta(f_\theta(\epsilon))} = \mathbb{1}[\epsilon+\theta>0]\cdot\Big(-\tfrac{\theta^2}{2}-\epsilon\theta+\log c_1\Big) + \mathbb{1}[\epsilon+\theta\le 0]\cdot\Big(-\tfrac{\theta^2}{2}-\epsilon\theta+\log c_2\Big),$$
so the term inside the expectation of RepGrad_θ is 1[ε + θ > 0]·(−θ − ε) + 1[ε + θ ≤ 0]·(−θ − ε), and we obtain RepGrad_θ = −θ.
Note that ∇_θ ELBO_θ ≠ RepGrad_θ for any θ. The difference between the two quantities is BouContr_θ in Theorem 1. 
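The quantities in this proof can be checked numerically. The sketch below is our own illustration (the test point θ and the sample size are arbitrary choices): it Monte Carlo-estimates RepGrad_θ, evaluates the one-dimensional boundary term, and compares their sum with the closed-form gradient, in line with Theorem 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Model from the proof of Proposition 2, observed x0 = 0.
c1 = np.exp(-0.5 * 5.0 ** 2) / np.sqrt(2 * np.pi)     # N(0 | 5, 1)
c2 = np.exp(-0.5 * (-2.0) ** 2) / np.sqrt(2 * np.pi)  # N(0 | -2, 1)

def q(eps):
    # Standard normal density of the reparameterization variable.
    return np.exp(-0.5 * eps ** 2) / np.sqrt(2 * np.pi)

theta = 0.7  # arbitrary test point

# Closed-form gradient derived in the proof.
true_grad = -theta + np.log(c1 / c2) * np.exp(-0.5 * theta ** 2) / np.sqrt(2 * np.pi)

# RepGrad: the case-split reparameterization gradient; inside the
# expectation the term is (-theta - eps) on both sides of the boundary.
eps = rng.standard_normal(200_000)
repgrad = np.mean(-theta - eps)  # approximately -theta

# BouContr: in 1D the "surface integral" over f_theta^{-1}(dR) = {-theta}
# is a point evaluation; with h1 - h2 = log(c1/c2) at the boundary and the
# outward normals of the two regions cancelling signs, it reduces to:
boucontr = q(-theta) * np.log(c1 / c2)

assert abs(repgrad - (-theta)) < 0.01               # RepGrad alone misses the boundary term
assert abs(true_grad - repgrad) > 1.0               # ... and is badly biased here
assert abs(repgrad + boucontr - true_grad) < 0.01   # Theorem 1: the sum is unbiased
```

The large gap between `repgrad` and `true_grad` is exactly the BouContr term the heuristic reasoning above overlooks.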
The main culprit here is that interchanging differentiation and integration is sometimes invalid: for D_1, D_2(θ) ⊂ R^n and α_1, α_2 : R^n × R^d → R, the equations
$$\nabla_\theta \int_{D_1} \alpha_1(\epsilon,\theta)\,d\epsilon = \int_{D_1} \nabla_\theta\,\alpha_1(\epsilon,\theta)\,d\epsilon \quad\text{and}\quad \nabla_\theta \int_{D_2(\theta)} \alpha_2(\epsilon,\theta)\,d\epsilon = \int_{D_2(\theta)} \nabla_\theta\,\alpha_2(\epsilon,\theta)\,d\epsilon$$
do not hold in general: the first fails if α_1 is not differentiable in θ, and the second fails if D_2(·) is not constant (even when α_2 is differentiable in θ).

The RepGrad_θ term in Theorem 1 can be easily estimated by standard Monte Carlo:
$$\mathrm{RepGrad}_\theta \approx \frac{1}{N}\sum_{i=1}^{N}\Bigg(\sum_{k=1}^{K}\mathbb{1}\big[f_\theta(\epsilon_i)\in R_k\big]\cdot\nabla_\theta h_k(\epsilon_i,\theta)\Bigg) \quad\text{for i.i.d. } \epsilon_1,\dots,\epsilon_N\sim q(\epsilon).$$
We write $\widehat{\mathrm{RepGrad}_\theta}$ for this estimate.
However, estimating the other BouContr_θ term is not that easy, because of the difficulty of estimating the surface integrals in the term. In general, to approximate a surface integral well, we need a parameterization of the surface and a scheme for generating samples from it [2]; this general methodology and a known theorem related to our case are reviewed in the supplementary material. In this paper, we focus on a class of models that use relatively simple (reparameterized) boundaries f_θ^{-1}(∂R_k) and permit, as a result, an efficient method for estimating the surface integrals in BouContr_θ.
A good way to understand our simple-boundary condition is to start with something even simpler, namely the condition that f_θ^{-1}(∂R_k) is an (n−1)-dimensional hyperplane {ε | a · ε = c}. Here the operation · denotes the dot product. 
A surface integral over such a hyperplane can be estimated using the following theorem:

Theorem 3. Let $q(\epsilon) = \prod_{i=1}^{n} q(\epsilon_i)$ and S a measurable subset of R^n. Assume that S = {ε | a · ε > c} or S = {ε | a · ε ≥ c} for some a ∈ R^n and c ∈ R, and that a_j ≠ 0 for some j. Then,
$$\int_{\partial S} \big(q(\epsilon)F(\epsilon)\big)\bullet d\Sigma = \mathbb{E}_{q(\zeta)}\big[G(g(\zeta))\bullet \mathbf{n}\big] \quad\text{for all measurable } F : \mathbb{R}^n\to\mathbb{R}^{d\times n}.$$
Here dΣ is the normal vector pointing outward with respect to S, ζ ranges over R^{n−1}, its density q(ζ) is the product of the densities of its components, and this component density q(ζ_i) is the same as the density q(ε_{i'}) of the i'-th component of ε, where i' = i + 1[i ≥ j]. Also,
$$G(\epsilon) \triangleq q(\epsilon_j)F(\epsilon), \qquad g(\zeta) \triangleq \Big(\zeta_1,\dots,\zeta_{j-1},\ \tfrac{1}{a_j}(c - a_{-j}\cdot\zeta),\ \zeta_j,\dots,\zeta_{n-1}\Big)^{\!\top},$$
$$a_{-j} \triangleq (a_1,\dots,a_{j-1},a_{j+1},\dots,a_n), \qquad \mathbf{n} \triangleq \mathrm{sgn}(-a_j)\Big(\tfrac{a_1}{a_j},\dots,\tfrac{a_{j-1}}{a_j},\,1,\,\tfrac{a_{j+1}}{a_j},\dots,\tfrac{a_n}{a_j}\Big)^{\!\top}.$$

¹ The error function erf is defined by $\mathrm{erf}(x) = 2\int_0^x \exp(-t^2)\,dt/\sqrt{\pi}$.

The theorem says that if the boundary ∂S is an (n−1)-dimensional hyperplane {ε | a · ε = c}, we can parameterize the surface by a linear map g : R^{n−1} → R^n and express the surface integral as an expectation over q(ζ). This distribution for ζ is the marginalization of q(ε) over the j-th component. Inside the expectation, we have the product of the matrix G and the vector n. The matrix comes from the integrand of the surface integral, and the vector is the direction of the surface. 
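The theorem can be exercised on a tiny example. The sketch below is our own check in n = 2 with q(ε) a product of standard normals; the hyperplane coefficients and the integrand F are arbitrary illustrative choices. It compares the theorem's Monte Carlo estimate against an independent arclength quadrature of the same line integral:

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(u):
    # Standard normal density; we assume q(eps) factors into standard normals.
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def F(e0, e1):
    # An arbitrary measurable F : R^2 -> R^{1 x 2}, chosen for illustration.
    return np.stack([e0, e1 ** 2], axis=-1)

# Boundary of S = {eps | a . eps > c} in n = 2 dimensions (example values).
a = np.array([1.0, 2.0])
c = 0.5
j = 1  # 0-based index with a[j] != 0 (the theorem's j)

# Theorem 3 estimate: E_{zeta ~ q(zeta)} [G(g(zeta)) . n], where zeta is the
# remaining coordinate and n is a non-unit normal absorbing the area scaling.
n_vec = np.sign(-a[j]) * a / a[j]
zeta = rng.standard_normal(400_000)
e0, e1 = zeta, (c - a[0] * zeta) / a[j]       # g(zeta)
mc = np.mean(phi(e1) * (F(e0, e1) @ n_vec))   # G(eps) = q(eps_j) F(eps)

# Independent check: line integral over {a . eps = c} by arclength quadrature.
p0 = c * a / (a @ a)                              # a point on the hyperplane
u = np.array([a[1], -a[0]]) / np.linalg.norm(a)   # unit tangent
n_hat = -a / np.linalg.norm(a)                    # unit outward normal of S
t = np.linspace(-8.0, 8.0, 16001)
pts = p0 + t[:, None] * u
vals = phi(pts[:, 0]) * phi(pts[:, 1]) * (F(pts[:, 0], pts[:, 1]) @ n_hat)
quad = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(t))  # trapezoid rule

assert abs(mc - quad) < 0.01
```

Note how the non-unit normal n in the theorem bundles the unit normal with the Jacobian of the linear parameterization g, which is why no explicit area-scaling factor appears in the expectation.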
Note that G(ε) has q(ε_j) instead of q(ε); the missing part of q(ε) has been converted to the distribution q(ζ).
When every f_θ^{-1}(∂R_k) is an (n−1)-dimensional hyperplane {ε | a · ε = c} for a ∈ R^n and c ∈ R with a_{j_k} ≠ 0, we can use Theorem 3 and estimate the surface integrals in BouContr_θ as follows:
$$\int_{f_\theta^{-1}(\partial R_k)} \big(q(\epsilon)h_k(\epsilon,\theta)V(\epsilon,\theta)\big)\bullet d\Sigma \;\approx\; \frac{1}{M}\sum_{i=1}^{M} W(g(\zeta_i))\bullet \mathbf{n} \quad\text{for i.i.d. } \zeta_1,\dots,\zeta_M\sim q(\zeta),$$
where W(ε) = q(ε_{j_k}) h_k(ε, θ) V(ε, θ). Let $\widehat{\mathrm{BouContr}}_{(\theta,k)}$ be this estimate. Then, our estimator for the gradient of ELBO_θ in this case computes:
$$\widehat{\nabla_\theta\mathrm{ELBO}_\theta} \;\triangleq\; \widehat{\mathrm{RepGrad}_\theta} + \sum_{k=1}^{K}\widehat{\mathrm{BouContr}}_{(\theta,k)}.$$
The estimator is unbiased because of Theorems 1 and 3:

Corollary 4. $\mathbb{E}\big[\widehat{\nabla_\theta\mathrm{ELBO}_\theta}\big] = \nabla_\theta\mathrm{ELBO}_\theta$.

We now relax the condition that each boundary is a hyperplane, and consider a more liberal simple-boundary condition, which is often satisfied by non-differentiable models from a first-order loop-free probabilistic programming language. This new condition and the estimator under this condition are what we have used in our implementation. The relaxed condition is that the regions {f_θ^{-1}(R_k)}_{1≤k≤K} are obtained by partitioning R^n with L (n−1)-dimensional hyperplanes. That is, there are affine maps Φ_1, ..., Φ_L : R^n → R such that for all 1 ≤ k ≤ K,
$$f_\theta^{-1}(R_k) = \bigcap_{l=1}^{L} S_{l,(\sigma_k)_l} \quad\text{for some } \sigma_k \in \{-1,1\}^L,$$
where S_{l,1} = {ε | Φ_l(ε) > 0} and S_{l,−1} = {ε | Φ_l(ε) ≤ 0}. 
Each affine map Φ_l defines an (n−1)-dimensional hyperplane ∂S_{l,1}, and (σ_k)_l specifies on which side of the hyperplane ∂S_{l,1} the region f_θ^{-1}(R_k) lies. Every probabilistic model written in a first-order probabilistic programming language satisfies the relaxed condition if the model does not contain a loop, uses only a fixed finite number of random variables, and has a branch condition for each if statement that is linear in the latent variable z; in such a case, L is the number of if statements in the model.
Under the new condition, how can we estimate BouContr_θ? A naive approach is to estimate the k-th surface integral for each k (in some way) and sum them up. However, with L hyperplanes, the number K of regions can grow as fast as O(L^n), implying that the naive approach is slow. Even worse, the boundaries f_θ^{-1}(∂R_k) do not satisfy the condition of Theorem 3, and just estimating the surface integral over each f_θ^{-1}(∂R_k) may be difficult.
A solution is to transform the original formulation of BouContr_θ so that it can be expressed as the sum of surface integrals over the ∂S_{l,1}'s. The transformation is based on the following derivation:
$$\mathrm{BouContr}_\theta = \sum_{k=1}^{K} \int_{f_\theta^{-1}(\partial R_k)} \big(q(\epsilon)h_k(\epsilon,\theta)V(\epsilon,\theta)\big)\bullet d\Sigma = \sum_{l=1}^{L} \int_{\partial S_{l,1}} \Bigg(q(\epsilon)V(\epsilon,\theta)\sum_{k=1}^{K} \mathbb{1}\Big[\epsilon\in\overline{f_\theta^{-1}(R_k)}\Big]\,(\sigma_k)_l\, h_k(\epsilon,\theta)\Bigg)\bullet d\Sigma \tag{4}$$
where $\overline{T}$ denotes the closure of T ⊂ R^n, and dΣ in (4) is the normal vector pointing outward with respect to S_{l,1}. 
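The key fact behind the efficiency of (4) is that, at a generic boundary point, only two of the K indicator terms survive. The sketch below is our own toy check (the affine maps and the test point are hypothetical choices) of that counting argument for L = 2 hyperplanes in R^2:

```python
import numpy as np

# Two affine maps Phi_l (hypothetical choices) partition R^2 into K = 4
# regions indexed by sign vectors sigma in {-1, 1}^2, as in the text.
Phis = [lambda e: e[0] - 0.3,          # Phi_1: boundary is a vertical line
        lambda e: 0.5 * e[0] + e[1]]   # Phi_2: a tilted line through the origin

sigmas = [(s1, s2) for s1 in (1, -1) for s2 in (1, -1)]  # the K sign vectors

def in_closure(e, sigma):
    # eps lies in the closure of the region with sign vector sigma iff every
    # Phi_l(eps) has the right sign, weakly (closure turns > into >=).
    return all(s * Phis[l](e) >= 0 for l, s in enumerate(sigma))

# A generic point on the boundary of S_{1,1} (Phi_1 = 0) that does not lie
# on the other hyperplane.
e = np.array([0.3, 0.8])
assert abs(Phis[0](e)) < 1e-12 and Phis[1](e) != 0

touching = [sigma for sigma in sigmas if in_closure(e, sigma)]
assert len(touching) == 2                           # exactly two regions touch
assert sorted(s[0] for s in touching) == [-1, 1]    # they differ only in (sigma)_1
```

This is why each of the L surface integrals in (4) can be evaluated with constant work per sample rather than summing over all K regions.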
Since {f_θ^{-1}(R_k)}_k are obtained by partitioning R^n with {∂S_{l,1}}_l, we can rearrange the sum of K surface integrals over the complicated boundaries f_θ^{-1}(∂R_k) into the sum of L surface integrals over the hyperplanes ∂S_{l,1}, as above. Although the expression inside the summation over k in (4) looks complicated, for almost all ε, the indicator function is nonzero for exactly two k's: k_1 with (σ_{k_1})_l = 1 and k_{−1} with (σ_{k_{−1}})_l = −1. So we can efficiently estimate the l-th surface integral in (4) using Theorem 3; call this estimate $\widehat{\mathrm{BouContr}}'_{(\theta,l)}$. Then, our estimator for the gradient of ELBO_θ in this more general case computes:
$$\widehat{\nabla_\theta\mathrm{ELBO}_\theta}{}' \;\triangleq\; \widehat{\mathrm{RepGrad}_\theta} + \sum_{l=1}^{L}\widehat{\mathrm{BouContr}}'_{(\theta,l)}. \tag{5}$$
The estimator is unbiased because of Theorems 1 and 3 and Equation (4):

Corollary 5. $\mathbb{E}\big[\widehat{\nabla_\theta\mathrm{ELBO}_\theta}{}'\big] = \nabla_\theta\mathrm{ELBO}_\theta$.

4 Experimental Evaluation

We experimentally compare our gradient estimator (OURS) to the score estimator (SCORE), an unbiased gradient estimator that is applicable to non-differentiable models, and the reparameterization estimator (REPARAM), a biased gradient estimator that computes only $\widehat{\mathrm{RepGrad}_\theta}$ (discussed in Section 3). REPARAM is biased in our experiments because it is applied to non-differentiable models.
We implemented a black-box variational inference engine that accepts a probabilistic model written in a simple probabilistic programming language (which supports basic constructs such as sample, observe, and if statements) and performs variational inference using one of the three aforementioned gradient estimators. Our implementation² is written in Python and uses autograd [18], 
Our implementation2 is written in Python and uses autograd [18],\nan automatic differentiation package for Python, to automatically compute the gradient term in\n(cid:86)\nRepGrad\u03b8\n\nfor an arbitrary probabilistic model.\n\nBenchmarks. We evaluate our estimator on three models for small sequential data:\n\u2022 temperature [33] models the random dynamics of a controller that attempts to keep the temper-\nature of a room within speci\ufb01ed bounds. The controller\u2019s state has a continuous part for the room\ntemperature and a discrete part that records the on or off of an air conditioner. At each time step,\nthe value of this discrete part decides which of two different random state updates is employed,\nand incurs the non-differentiability of the model\u2019s density. We use a synthetically-generated\nsequence of 21 noisy measurements of temperatures, and perform posterior inference on the\nsequence of the controller\u2019s states given these noisy measurements. This model consists of a\n41-dimensional latent variable and 80 if statements.\n\u2022 textmsg [1] is a model for the numbers of per-day SNS messages over the period of 74 days\n(skipping every other day). It allows the SNS-usage pattern to change over the period, and this\nchange causes non-differentiability. Finding the posterior distribution over this change is the\ngoal of the inference problem in this case. We use the data from [1]. This model consists of a\n3-dimensional latent variable and 37 if statements.\n\n\u2022 influenza [32] is a model for the US in\ufb02uenza mortality data in 1969. The mortality rate in each\nmonth depends on whether the dominant in\ufb02uenza virus is of type 1 or 2, and \ufb01nding this type\ninformation from a sequence of observed mortality rates is the goal of the inference. The virus\ntype is the cause of non-differentiability in this example. This model consists of a 37-dimensional\nlatent variable and 24 if statements.\n\nExperimental setup. 
We optimize the ELBO objective using Adam [11] with two stepsizes: 0.001 and 0.01. We run Adam for 10000 iterations, and at each iteration we compute each estimator using N ∈ {1, 8, 16} Monte Carlo samples. For OURS, we use a single subsample l (drawn uniformly at random from {1, ..., L}) to estimate the summation in (5), and use N Monte Carlo samples to compute $\widehat{\mathrm{BouContr}}'_{(\theta,l)}$. While maximizing the ELBO, we measure two things: the variance of the estimated gradients of the ELBO, and the ELBO itself. Since each gradient is not a scalar, we measure two kinds of variance of the gradient, as in [23]: Avg(V(·)), the average variance of its components, and V(‖·‖₂), the variance of its l2-norm. To estimate the variances and the ELBO objective, we use 16 and 1000 Monte Carlo samples, respectively.

² Code is available at https://github.com/wonyeol/reparam-nondiff.

(a) stepsize = 0.001
Estimator  Type of Variance   temperature   textmsg       influenza
REPARAM    Avg(V(·))          4.45 × 10⁻⁹   2.91 × 10⁻²   4.38 × 10⁻³
REPARAM    V(‖·‖₂)            2.45 × 10⁻⁸   2.92 × 10⁻²   2.12 × 10⁻³
OURS       Avg(V(·))          1.85 × 10⁻⁶   2.77 × 10⁻²   4.89 × 10⁻³
OURS       V(‖·‖₂)            7.59 × 10⁻⁵   2.46 × 10⁻²   2.36 × 10⁻³

(b) stepsize = 0.01
Estimator  Type of Variance   temperature    textmsg       influenza
REPARAM    Avg(V(·))          3.88 × 10⁻¹¹   5.03 × 10⁻⁴   2.46 × 10⁻³
REPARAM    V(‖·‖₂)            6.11 × 10⁻¹¹   1.02 × 10⁻³   1.26 × 10⁻³
OURS       Avg(V(·))          1.24 × 10⁻¹¹   5.07 × 10⁻⁴   2.80 × 10⁻³
OURS       V(‖·‖₂)            8.05 × 10⁻¹¹   8.12 × 10⁻⁴   1.40 × 10⁻³

Table 1: Ratio of {REPARAM, OURS}'s average variance to SCORE's for N = 1. The values for SCORE are all 1, so omitted. The optimization trajectories used to compute the above variances are shown in Figure 1.

Estimator   temperature   textmsg   influenza
SCORE       21.7          4.9       18.7
REPARAM     46.1          15.4      251.4
OURS        79.2          24.9      269.8

Table 2: Computation time (in ms) per iteration for N = 1.

Results. Table 1 compares the average variance of each estimator for N = 1, where the average is taken over a single optimization trajectory. The table clearly shows that during the optimization process, OURS has variances several orders of magnitude (sometimes < 10⁻¹⁰ times) smaller than SCORE's. Since OURS computes additional terms when compared with REPARAM, we expect that OURS would have larger variances than REPARAM, and this is confirmed by the table. It is noteworthy, however, that for most benchmarks, the averaged variances of OURS are very close to those of REPARAM. This suggests that the additional term BouContr_θ in our estimator often introduces much smaller variance than the reparameterization term RepGrad_θ.
Figure 1 shows the ELBO objective, for the different estimators with different N's, as a function of the iteration number. As expected, using a larger N makes all estimators converge faster in a more stable manner. In all three benchmarks, OURS outperforms (or performs similarly to) the other two and converges stably, and REPARAM beats SCORE. Increasing the stepsize to 0.01 makes SCORE unstable in temperature and textmsg. It is also worth noting that REPARAM converges to sub-optimal values in temperature (possibly because REPARAM is biased).
Table 2 shows the computation time per iteration of each approach for N = 1. 
Our implementation performs the worst in this wall-time comparison, but the gap between OURS and REPARAM is not huge: the computation time of OURS is less than 1.72 times that of REPARAM in all benchmarks. Furthermore, we want to point out that our implementation is an early unoptimized prototype, and there is room for several improvements. For instance, it currently constructs Python functions dynamically, and computes the gradients of these functions using autograd. This dynamic approach is costly because autograd is not optimized for such dynamically constructed functions; this can also be observed in the bad performance of REPARAM, particularly in influenza, which employs the same strategy of dynamically constructing functions and taking their gradients. So one possible optimization is to avoid this gradient computation of dynamically constructed functions by building the functions statically during compilation.

Figure 1: The ELBO objective as a function of the iteration number. {dotted, dashed, solid} lines represent {N = 1, N = 8, N = 16}. Panels: (a) temperature (stepsize = 0.001); (b) temperature (stepsize = 0.01); (c) textmsg (stepsize = 0.001); (d) textmsg (stepsize = 0.01); (e) influenza (stepsize = 0.001); (f) influenza (stepsize = 0.01).

5 Related Work

A common example of a model with a non-differentiable density is one that uses discrete random variables, typically together with continuous random variables.³ Coming up with an efficient algorithm for stochastic variational inference for such a model has been an active research topic. Maddison et al. [20] and Jang et al. [10] proposed continuous relaxations of discrete random variables that convert non-differentiable variational objectives to differentiable ones and make the reparameterization trick applicable. 
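As a reminder of the mechanism behind these relaxations, here is a small sketch of our own (the category probabilities are arbitrary example values; this is not the paper's code): perturbing log-probabilities with Gumbel noise and taking arg max gives exact categorical samples, and replacing arg max with a temperature-τ softmax gives the differentiable surrogate used by [20, 10]:

```python
import numpy as np

rng = np.random.default_rng(2)
pi = np.array([0.2, 0.5, 0.3])  # a categorical distribution (example values)

# Gumbel-Max: z = argmax_i (log pi_i + G_i), with G_i ~ Gumbel(0, 1),
# is an exact sample from Categorical(pi).
num = 200_000
gumbels = -np.log(-np.log(rng.random((num, 3))))
z = np.argmax(np.log(pi) + gumbels, axis=1)
freqs = np.bincount(z, minlength=3) / num

# Gumbel-Softmax / Concrete relaxation: replace argmax with a
# temperature-tau softmax to obtain a differentiable surrogate of z.
tau = 0.5
logits = (np.log(pi) + gumbels) / tau
soft = np.exp(logits - logits.max(axis=1, keepdims=True))
soft /= soft.sum(axis=1, keepdims=True)

assert np.allclose(freqs, pi, atol=0.01)          # argmax version is exact
assert np.allclose(soft.sum(axis=1), 1.0)         # relaxed samples live on the simplex
assert (np.argmax(soft, axis=1) == z).all()       # softmax preserves the argmax
```

The hard arg max version is the one this paper's Section 5 proposes to combine with its estimator, since arg max can be expressed with if statements of the form (3).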
Also, a variety of control variates for the standard score estimator [35, 27, 36, 28] for the gradients of variational objectives have been developed [28, 7, 8, 34, 6, 23], some of which use biased yet differentiable control variates so that the reparameterization trick can be used to correct the bias [7, 34, 6].

Our work extends this line of research by adding a version of the reparameterization trick that can be applied to models with discrete random variables. For instance, consider a model p(x, z) with z discrete. By applying the Gumbel-Max reparameterization [9, 21] to z, we transform p(x, z) to p(x, z, c), where c is sampled from the Gumbel distribution and z in p(x, z, c) is defined deterministically from c using the arg max operation.

^3 Another common example of such a model is one that uses if statements whose branch conditions contain continuous random variables, which is the main focus of our work.
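This Gumbel-Max construction can be sketched concretely. The snippet below is our NumPy illustration of the standard trick (not code from our implementation): z is an exact categorical sample, computed as a deterministic but non-differentiable arg max of the continuous noise c.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_max_sample(logits, rng):
    """Exact categorical sample via the Gumbel-Max trick.

    z is a deterministic function of the continuous noise c, computed with
    an arg max, i.e. expressible as nested non-differentiable if statements.
    """
    c = rng.gumbel(size=logits.shape)   # continuous latent noise c
    return int(np.argmax(logits + c))   # z = argmax_i (logits_i + c_i)

logits = np.log(np.array([0.7, 0.2, 0.1]))
counts = np.bincount(
    [gumbel_max_sample(logits, rng) for _ in range(10_000)], minlength=3)
print(counts / 10_000)  # empirical frequencies close to [0.7, 0.2, 0.1]
```

The arg max here is exactly the kind of branch on continuous random variables that our estimator is designed to handle.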
Since arg max can be written as if statements, we can express p(x, z, c) in the form of (3), to which our reparameterization gradient can be applied. Investigating the effectiveness of this approach for discrete random variables is an interesting topic for future research.

The reparameterization trick was initially used with the normal distribution [13, 30], but its scope was soon extended to other common distributions, such as gamma, Dirichlet, and beta [14, 31, 26]. Techniques for constructing normalizing flows [29, 12] can also be viewed as methods for creating distributions in a reparameterized form. In this paper, we did not consider these recent developments and focused mainly on reparameterization with the normal distribution. One interesting future avenue is to develop our approach further for these other reparameterization cases. We expect that the main challenge will be to find an effective method for handling the surface integrals in Theorem 1.

6 Conclusion

We have presented a new estimator for the gradient of the standard variational objective, the ELBO. The key feature of our estimator is that it keeps variance under control by using a form of the reparameterization trick even when the density of a model is not differentiable. The estimator splits the space of the latent random variable into a lower-dimensional subspace, where the density may fail to be differentiable, and the rest, where the density is differentiable. It then estimates the contributions of both parts to the gradient separately, using a version of manifold sampling for the former and the reparameterization trick for the latter. We have shown the unbiasedness of our estimator using a theorem for interchanging integration and differentiation under a moving domain [3] and the divergence theorem. We have also experimentally demonstrated the promise of our estimator using three time-series models.
One interesting future direction is to investigate the possibility of applying our ideas to recent variational objectives [24, 17, 19, 16, 25], which are based on tighter lower bounds on the marginal likelihood than the standard ELBO.

Viewed from a high level, our work suggests a heuristic of splitting the latent space into a bad yet tiny subspace and the remaining good one, and solving an estimation problem in each subspace separately. The latter subspace has several good properties, so it may allow the use of efficient estimation techniques that exploit those properties. The former subspace, on the other hand, is tiny, and the estimation error from it may therefore be relatively small. We would like to explore this heuristic and its extensions in different contexts, such as stochastic variational inference with different objectives [24, 17, 19, 16, 25].

Acknowledgments

We thank Hyunjik Kim, George Tucker, Frank Wood, and anonymous reviewers for their helpful comments, and Shin Yoo and Seongmin Lee for allowing and helping us to use their cluster machines. This research was supported by the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921), and also by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT (2017M3C4A7068177).

References

[1] C. Davidson-Pilon. Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference. Addison-Wesley Professional, 2015.

[2] P. Diaconis, S. Holmes, and M. Shahshahani. Sampling from a Manifold, volume 10 of Collections, pages 102-125. Institute of Mathematical Statistics, Beachwood, Ohio, USA, 2013.

[3] H. Flanders. Differentiation Under the Integral Sign. The American Mathematical Monthly, 80(6):615-627, 1973.

[4] N. D. Goodman, V. K.
Mansinghka, D. M. Roy, K. Bonawitz, and J. B. Tenenbaum. Church: a language for generative models. In Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence (UAI), 2008.

[5] A. D. Gordon, T. A. Henzinger, A. V. Nori, and S. K. Rajamani. Probabilistic Programming. In International Conference on Software Engineering (ICSE, FOSE track), 2014.

[6] W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. K. Duvenaud. Backpropagation through the Void: Optimizing control variates for black-box gradient estimation. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.

[7] S. Gu, S. Levine, I. Sutskever, and A. Mnih. MuProp: Unbiased Backpropagation for Stochastic Neural Networks. In Proceedings of the 4th International Conference on Learning Representations (ICLR), 2016.

[8] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.

[9] E. J. Gumbel. Statistical Theory of Extreme Values and Some Practical Applications: a Series of Lectures. Number 33. US Govt. Print. Office, 1954.

[10] E. Jang, S. Gu, and B. Poole. Categorical Reparameterization with Gumbel-Softmax. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.

[11] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.

[12] D. P. Kingma, T. Salimans, R. Józefowicz, X. Chen, I. Sutskever, and M. Welling. Improving Variational Autoencoders with Inverse Autoregressive Flow. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), 2016.

[13] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes.
In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.

[14] D. A. Knowles. Stochastic gradient variational Bayes for gamma approximating distributions. arXiv preprint arXiv:1509.01631, 2015.

[15] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic Differentiation Variational Inference. J. Mach. Learn. Res., 18(1):430-474, Jan. 2017.

[16] T. A. Le, M. Igl, T. Rainforth, T. Jin, and F. Wood. Auto-Encoding Sequential Monte Carlo. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.

[17] Y. Li and R. E. Turner. Rényi Divergence Variational Inference. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), 2016.

[18] D. Maclaurin. Modeling, Inference and Optimization with Composable Differentiable Procedures. PhD thesis, Harvard University, 2016.

[19] C. J. Maddison, J. Lawson, G. Tucker, N. Heess, M. Norouzi, A. Mnih, A. Doucet, and Y. W. Teh. Filtering Variational Objectives. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), 2017.

[20] C. J. Maddison, A. Mnih, and Y. W. Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.

[21] C. J. Maddison, D. Tarlow, and T. Minka. A* Sampling. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS), 2014.

[22] V. K. Mansinghka, D. Selsam, and Y. N. Perov. Venture: a higher-order probabilistic programming platform with programmable inference. arXiv, 2014.

[23] A. C. Miller, N. J. Foti, A. D'Amour, and R. P. Adams. Reducing Reparameterization Gradient Variance. arXiv preprint arXiv:1705.07880, 2017.

[24] A. Mnih and D. J. Rezende. Variational Inference for Monte Carlo Objectives.
In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.

[25] C. A. Naesseth, S. W. Linderman, R. Ranganath, and D. M. Blei. Variational Sequential Monte Carlo. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018. To appear.

[26] C. A. Naesseth, F. J. R. Ruiz, S. W. Linderman, and D. M. Blei. Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[27] J. W. Paisley, D. M. Blei, and M. I. Jordan. Variational Bayesian Inference with Stochastic Search. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

[28] R. Ranganath, S. Gerrish, and D. Blei. Black Box Variational Inference. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), 2014.

[29] D. J. Rezende and S. Mohamed. Variational Inference with Normalizing Flows. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

[30] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.

[31] F. J. R. Ruiz, M. K. Titsias, and D. M. Blei. The Generalized Reparameterization Gradient. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), 2016.

[32] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications (Springer Texts in Statistics). Springer-Verlag, 2005.

[33] S. E. Z. Soudjani, R. Majumdar, and T. Nagapetyan.
Multilevel Monte Carlo Method for Statistical Model Checking of Hybrid Systems. In Proceedings of the 14th International Conference on Quantitative Evaluation of Systems (QEST), 2017.

[34] G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, and J. Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), 2017.

[35] R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn., 8(3-4):229-256, May 1992.

[36] D. Wingate and T. Weber. Automated Variational Inference in Probabilistic Programming. CoRR, abs/1301.1299, 2013.

[37] F. Wood, J.-W. van de Meent, and V. Mansinghka. A New Approach to Probabilistic Programming Inference. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), 2014.