{"title": "Non-parametric Structured Output Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4214, "page_last": 4224, "abstract": "Deep neural networks (DNNs) and probabilistic graphical models (PGMs) are the two main tools for statistical modeling. While DNNs provide the ability to model rich and complex relationships between input and output variables, PGMs provide the ability to encode dependencies among the output variables themselves. End-to-end training methods for models with structured graphical dependencies on top of neural predictions have recently emerged as a principled way of combining these two paradigms. While these models have proven to be powerful in discriminative settings with discrete outputs, extensions to structured continuous spaces, as well as performing efficient inference in these spaces, are lacking. We propose non-parametric structured output networks (NSON), a modular approach that cleanly separates a non-parametric, structured posterior representation from a discriminative inference scheme but allows joint end-to-end training of both components. Our experiments evaluate the ability of NSONs to capture structured posterior densities (modeling) and to compute complex statistics of those densities (inference). We compare our model to output spaces of varying expressiveness and popular variational and sampling-based inference algorithms.", "full_text": "Non-parametric Structured Output Networks\n\nAndreas M. Lehrmann\n\nDisney Research\n\nPittsburgh, PA 15213\n\nLeonid Sigal\n\nDisney Research\n\nPittsburgh, PA 15213\n\nandreas.lehrmann@disneyresearch.com\n\nlsigal@disneyresearch.com\n\nAbstract\n\nDeep neural networks (DNNs) and probabilistic graphical models (PGMs) are\nthe two main tools for statistical modeling. 
While DNNs provide the ability to\nmodel rich and complex relationships between input and output variables, PGMs\nprovide the ability to encode dependencies among the output variables themselves.\nEnd-to-end training methods for models with structured graphical dependencies\non top of neural predictions have recently emerged as a principled way of com-\nbining these two paradigms. While these models have proven to be powerful in\ndiscriminative settings with discrete outputs, extensions to structured continuous\nspaces, as well as performing ef\ufb01cient inference in these spaces, are lacking. We\npropose non-parametric structured output networks (NSON), a modular approach\nthat cleanly separates a non-parametric, structured posterior representation from\na discriminative inference scheme but allows joint end-to-end training of both\ncomponents. Our experiments evaluate the ability of NSONs to capture structured\nposterior densities (modeling) and to compute complex statistics of those densities\n(inference). 
We compare our model to output spaces of varying expressiveness and popular variational and sampling-based inference algorithms.

1 Introduction

In recent years, deep neural networks have led to tremendous progress in domains such as image classification [1, 2] and segmentation [3], object detection [4, 5] and natural language processing [6, 7]. These achievements can be attributed to their hierarchical feature representation, the development of effective regularization techniques [8, 9] and the availability of large amounts of training data [10, 11]. While a lot of effort has been spent on identifying optimal network structures and training schemes to enable these advances, the expressiveness of the output space has not evolved at the same rate. Indeed, it is striking that most neural architectures model categorical posterior distributions that do not incorporate any structural assumptions about the underlying task; they are discrete and global (Figure 1a). However, many tasks are naturally formulated as structured problems or would benefit from continuous representations due to their high cardinality. In those cases, it is desirable to learn an expressive posterior density reflecting the dependencies in the underlying task.

As a simple example, consider a stripe of n noisy pixels in a natural image. If we want to learn a neural network that encodes the posterior distribution p_θ(y | x) of the clean output y given the noisy input x, we must ensure that p_θ is expressive enough to represent potentially complex noise distributions and structured enough to avoid modeling spurious dependencies between the variables. Probabilistic graphical models [12], such as Bayesian networks or Markov random fields, have a long history in machine learning and provide principled frameworks for such structured data. 
It is therefore natural to use their factored representations as a means of enforcing structure in a deep neural network. While initial results along this line of research have been promising [13, 14], they focus exclusively on the discrete case and/or mean-field inference.

Instead, we propose a deep neural network that encodes a non-parametric posterior density that factorizes over a graph (Figure 1b). We perform recurrent inference inspired by message-passing in this structured output space and show how to learn all components end-to-end.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

[Figure 1 shows: (a) a deep neural network (convolution, ReLU, pooling, FC, dropout layers) mapping an input x to a categorical posterior p_θ(Y = y) with θ = {p_i}_{i=1}^{|Y|}; (b) a deep neural network predicting joint kernel parameters θ̃ = {w̃, μ̃, B̃}, a conditioning operation τ_κ(·, y) yielding θ_1, θ_{2|1}, θ_{3|2}, ..., θ_{n|·}, a non-parametric graphical model p_θ(x)(y) over Y_1, ..., Y_n with Y_i | pa(Y_i) ~ p_{θ_{i|pa(i)}}, and a recurrent inference network (Fig. 2) with losses L_M and L_I.]

Figure 1: Overview: Non-parametric Structured Output Networks. (a) Traditional neural networks use a series of convolution and inner product modules to predict a discrete posterior without graphical structure (e.g., VGG [15]): discrete, global, parametric. (b) Non-parametric structured output networks use a deep neural network to predict a non-parametric graphical model p_θ(x)(y) (NGM) that factorizes over a graph: continuous, structured, non-parametric. A recurrent inference network (RIN) computes statistics t[p_θ(x)(y)] from this structured output density. At training time, we propagate stochastic gradients from both NGM and RIN back to the inputs.

1.1 Related Work

Our framework builds upon elements from neural networks, structured models, non-parametric statistics, and approximate inference. We will first present prior work on structured neural networks and then discuss the relevant literature on approximate non-parametric inference.

1.1.1 Structured Neural Networks

Structured neural networks combine the expressive representations of deep neural networks with the structured dependencies of probabilistic graphical models. Early attempts to combine both frameworks used high-level features from neural networks (e.g., fc7) to obtain fixed unary potentials for a graphical model [18]. More recently, statistical models and their associated inference tasks have been reinterpreted as (layers in) neural networks, which has allowed true end-to-end training and blurred the line between both paradigms: [13, 14] express the classic mean-field update equations as a series of layers in a recurrent neural network (RNN). Structure inference machines [17] use an RNN to simulate message-passing in a graphical model with soft edges for activity recognition. A full backward pass through loopy BP was proposed in [19]. The structural-RNN [16] models all node and edge potentials in a spatio-temporal factor graph as RNNs that are shared among groups of nodes/edges with similar semantics. 
Table 1 summarizes some important properties of these methods. Notably, all output spaces except for the non-probabilistic work [16] are discrete.

1.1.2 Inference in Structured Neural Networks

In contrast to a discrete and global posterior, which allows inference of common statistics (e.g., its mode) in linear time, expressive output spaces, as in Figure 1b, require message-passing schemes [20]

[Table 1 compares VGG [15], MRF-RNN [14], Structural-RNN [16], Structure Inference Machines [17], Deep Structured Models [13], and NSON (ours) along the axes Continuous, Structured, Non-parametric, End-to-end Training, Prob. Inference, and Posterior Sampling; only NSON satisfies all of the output-space properties, performing probabilistic inference by message passing.]

Table 1: Output Space Properties Across Models. [MF: mean-field; MP: message passing; D: direct; '–': not applicable]

to propagate and aggregate information. Local potentials outside of the exponential family, such as non-parametric distributions, lead to intractable message updates, so one needs to resort to approximate inference methods, which include the following two popular groups:

Variational Inference. Variational methods, such as mean-field and its structured variants [12], approximate an intractable target distribution with a tractable variational distribution by maximizing the evidence lower bound (ELBO). Stochastic extensions allow the use of this technique even on large datasets [21]. If the model is not in the conjugate-exponential family [22], as is the case for non-parametric graphical models, black box methods must be used to approximate an intractable expectation in the ELBO [23]. For fully-connected graphs with Gaussian pairwise potentials, the dense-CRF model [24] proposes an efficient way to perform the variational updates using the permutohedral lattice [25]. 
For general edge potentials, [26] proposes a density estimation technique that allows the use of non-parametric edge potentials.

Sampling-based Inference. This group of methods employs (sets of) samples to approximate intractable operations when computing message updates. Early works use iterative refinements of approximate clique potentials in junction trees [27]. Non-parametric belief propagation (NBP) [28, 29] represents each message as a kernel density estimate and uses Gibbs sampling for propagation. Particle belief propagation [30] represents each message as a set of samples drawn from an approximation to the receiving node's marginal, effectively circumventing the kernel smoothing required in NBP. Diverse particle selection [31] keeps a diverse set of hypothesized solutions at each node that pass through an iterative augmentation-update-selection scheme that preserves message values. Finally, a mean shift density approximation has been used as an alternative to sampling in [32].

1.2 Contributions

Our NSON model is inspired by the structured neural architectures (Section 1.1.1). However, in contrast to those approaches, we model structured dependencies on top of expressive non-parametric densities. 
In doing so, we build an inference network that computes statistics of these non-parametric output densities, thereby replacing the need for more conventional inference (Section 1.1.2).

In particular, we make the following contributions: (1) We propose non-parametric structured output networks, a novel approach combining the predictive power of deep neural networks with the structured representation and multimodal flexibility of non-parametric graphical models; (2) We show how to train the resulting output density together with recurrent inference modules in an end-to-end way; (3) We compare non-parametric structured output networks to a variety of alternative output densities and demonstrate superior performance of the inference module in comparison to variational and sampling-based approaches.

2 Non-parametric Structured Output Networks

Traditional neural networks (Figure 1a; [15]) encode a discrete posterior distribution by predicting an input-conditioned parameter vector θ̃(x) of a categorical distribution, i.e., Y | X = x ~ p_θ̃(x). Non-parametric structured output networks (Figure 1b) do the same, except that θ̃(x) parameterizes a continuous graphical model with non-parametric potentials. It consists of three components: a deep neural network (DNN), a non-parametric graphical model (NGM), and a recurrent inference network (RIN). While the DNN+NGM encode a structured posterior (≙ model), the RIN computes complex statistics in this output space (≙ inference).

At a high level, the DNN, conditioned on an input x, predicts the parameters θ̃ = {θ̃_ij} (e.g., kernel weights, centers and bandwidths) of local non-parametric distributions over a node and its parents according to the NGM's graph structure (Figure 1b). 
Using a function τ_κ, these local joint distributions are then transformed to conditional distributions parameterized by θ = {θ_{i|j}} (e.g., through a closed-form conditioning operation) and assembled into a structured joint density p_θ(x)(y) with conditional (in)dependencies prescribed by the graphical model. Parameters of the DNN are optimized with respect to a maximum-likelihood loss L_M. Simultaneously, a recurrent inference network (detailed in Figure 2), which takes θ̃ as input, is trained to compute statistics of the structured distribution (e.g., marginals) using a separate inference loss L_I. The following two paragraphs discuss these elements in more detail.

Model (DNN+NGM). The DNN is parameterized by a weight vector Λ_M and encodes a function from a generic input space X to a Cartesian parameter space Θ^n,

    x ↦ θ̃(x) = (θ̃_{i,pa(i)}(x))_{i=1}^n,    (1)

each of whose components models a joint kernel density (Y_i, pa(Y_i)) ~ p_{θ̃_{i,pa(i)}(x)} and thus, implicitly, the local conditional distribution Y_i | pa(Y_i) ~ p_{θ_{i|pa(i)}(x)} of a non-parametric graphical model

    p_θ(x)(y) = ∏_{i=1}^n p_{θ_{i|pa(i)}(x)}(y_i | pa(y_i))    (2)

over a structured output space Y with directed, acyclic graph G = (Y, E). Here, pa(·) denotes the set of parent nodes w.r.t. 
G, which we fix in advance based on prior knowledge or structure learning [12]. The conditional density of a node Y = Y_i with parents Y′ = pa(Y_i) and parameters θ = θ_{i|pa(i)}(x) is thus given by¹

    p_θ(y | y′) = ∑_{j=1}^N w^{(j)} · |B^{(j)}| κ(B^{(j)}(y − μ^{(j)})),    (3)

where the differentiable kernel κ(u) = ∏_i q(u_i) is defined in terms of a symmetric, zero-mean density q with positive variance and the conditional parameters θ = (w, μ, B) ∈ Θ correspond to the full set of kernel weights, kernel centers, and kernel bandwidth matrices, respectively.² The functional relationship between θ and its joint counterpart θ̃ = θ̃_{i,pa(i)}(x) is mediated through a kernel-dependent conditioning operation τ_κ(θ̃) = τ_κ(w̃, μ̃, B̃) = θ and can be computed in closed form for a wide range of kernels, including Gaussian, cosine, logistic and other kernels with sigmoid CDF. In particular, for block decompositions B̃^{(j)} = (B̃_y^{(j)}, B̃_{y′}^{(j)}) and μ̃^{(j)} = (μ̃_y^{(j)}, μ̃_{y′}^{(j)}), we obtain

    τ_κ(θ̃) = θ = { w^{(j)} ∝ w̃^{(j)} · |B̃_{y′}^{(j)}| κ(B̃_{y′}^{(j)}(y′ − μ̃_{y′}^{(j)})),  μ^{(j)} = μ̃_y^{(j)},  B^{(j)} = B̃_y^{(j)} },  1 ≤ j ≤ N.    (4)

See Appendix A.1 for a detailed derivation. We refer to the structured posterior density in Eq. (2) with the non-parametric local potentials in Eq. (3) as a non-parametric structured output network. Given an output training set D_Y = {y^(i) ∈ Y}_{i=1}^{N′}, traditional kernel density estimation [33] can be viewed as an extreme special case of this architecture in which the discriminative, trainable DNN is replaced with a generative, closed-form estimator and n := 1 (no structure), N := N′ (#kernels = #training points), w^(i) := (N′)^{−1} (uniform weights), B^(i) := B^(0) (shared covariance) and μ^(i) := y^(i) (fixed centers). 
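To make the conditioning operation τ_κ of Eq. (4) concrete, the following is a minimal sketch for a Gaussian kernel with diagonal (inverse) bandwidths on a two-dimensional joint over (y, y′): each kernel is reweighted by its marginal value at the observed y′, while the y-centers and y-bandwidths are simply copied. The function names and the tuple layout are our own illustration, not the paper's implementation:

```python
import math

def q(u):
    """Standard normal component density (one valid choice of the symmetric, zero-mean q)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def condition(joint, y_prime):
    """Eq. (4) for a 2-D joint kernel density over (y, y'):
    w^(j) proportional to w~^(j) * |B~_y'| * kappa(B~_y' (y' - mu~_y')),
    mu^(j) = mu~_y^(j), B^(j) = B~_y^(j), then renormalize the weights.
    joint: list of (w_tilde, (mu_y, mu_yp), (b_y, b_yp)) with b = inverse bandwidth."""
    reweighted = []
    for w, (mu_y, mu_yp), (b_y, b_yp) in joint:
        w_new = w * b_yp * q(b_yp * (y_prime - mu_yp))
        reweighted.append((w_new, mu_y, b_y))
    z = sum(w for w, _, _ in reweighted)
    return [(w / z, mu, b) for w, mu, b in reweighted]

def density(cond, y):
    """Evaluate the resulting conditional mixture p_theta(y | y') of Eq. (3)."""
    return sum(w * b * q(b * (y - mu)) for w, mu, b in cond)
```

As expected, kernels whose y′-component sits close to the observed y′ dominate the conditional mixture.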
When learning Λ_M from data, we can easily enforce parts or all of those restrictions in our model (see Section 5), but Section 3 will provide all necessary derivations for the more general case shown above.

Inference (RIN). In contrast to traditional classification networks with discrete label posterior, non-parametric structured output networks encode a complex density with rich statistics. We employ a recurrent inference network with parameters Λ_I to compute such statistics t from the predicted parameters θ̃(x) ∈ Θ^n,

    θ̃(x) ↦_{Λ_I} t[p_θ(x)].    (5)

Similar to conditional graphical models, the underlying assumption is that the input-conditioned density p_θ(x) contains all information about the semantic entities of interest and that we can infer whichever statistic we are interested in from it. A popular example of a statistic is a summary statistic,

    t[p_θ(x)](y_i) = op_{y\y_i} p_θ(x)(y) d(y\y_i),    (6)

which is known as sum-product BP (op = ∫; computing marginals) and max-product BP (op = max; computing max-marginals). Note, however, that we can attach recurrent inference networks corresponding to arbitrary tasks to this meta representation. Section 4 discusses the necessary details.

¹We write B^{−(j)} := (B^{(j)})^{−1} and B^{−⊤} := (B^{−1})^{⊤} to avoid double superscripts.
²Note that θ represents the parameters of a specific node; different nodes may have different parameters.

3 Learning Structured Densities using Non-Parametric Back-Propagation

The previous section introduced the model and inference components of a non-parametric structured output network. 
We will now describe how to learn the model (DNN+NGM) from a supervised training set (x^(i), y^(i)) ~ p_D.

3.1 Likelihood Loss

We write θ(x; Λ_M) = τ_κ(θ̃(x; Λ_M)) to explicitly refer to the weights Λ_M of the deep neural network predicting the non-parametric graphical model (Eq. (1)). Since the parameters of p_θ(x) are deterministic predictions from the input x, the only free and learnable parameters are the components of Λ_M. We train the DNN via empirical risk minimization with a negative log-likelihood loss L_M,

    Λ*_M = argmin_{Λ_M} E_{(x,y)~p̂_D}[L_M(θ(x; Λ_M), y)] = argmax_{Λ_M} E_{(x,y)~p̂_D}[log p_{θ(x;Λ_M)}(y)],    (7)

where p̂_D refers to the empirical distribution and the expectation in Eq. (7) is taken over the factorization in Eq. (2) and the local distributions in Eq. (3). Note the similarities and differences between a non-parametric structured output network and a non-parametric graphical model with unary potentials from a neural network: both model classes describe a structured posterior. However, while the unaries in the latter perform a reweighting of the potentials, a non-parametric structured output network predicts those potentials directly and allows joint optimization of its DNN and NGM components by back-propagating the structured loss first through the nodes of the graphical model and then through the layers of the neural network all the way back to the input.

3.2 Topological Non-parametric Gradients

We optimize Eq. (7) via stochastic gradient descent of the loss L_M w.r.t. the deep neural network weights Λ_M using Adam [34]. 
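Under the factorization of Eq. (2), the negative log-likelihood of Eq. (7) reduces to a sum of local conditional log-densities, one per node. A minimal sketch for one training sample (helper names are our own; one-dimensional kernels with already-conditioned per-node mixtures are assumed):

```python
import math

def log_mixture(y, params):
    """log p_theta(y) for a 1-D kernel density (Eq. 3);
    params = list of (weight, center, inverse_bandwidth) with Gaussian q."""
    total = sum(w * b * math.exp(-0.5 * (b * (y - mu)) ** 2) / math.sqrt(2.0 * math.pi)
                for w, mu, b in params)
    return math.log(total)

def nll(y_vec, local_params):
    """L_M for one sample: -sum_i log p(y_i | pa(y_i)) (Eqs. 2 and 7).
    local_params[i] holds the conditioned mixture of node i given its parents."""
    return -sum(log_mixture(yi, p) for yi, p in zip(y_vec, local_params))
```

In an end-to-end setting, `local_params` would be produced by the DNN plus the conditioning operation τ_κ, and the gradient of `nll` would be propagated back through both.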
Importantly, the gradients ∇_{Λ_M} L_M(θ(x; Λ_M), y) decompose into a factor from the deep neural network and a factor from the non-parametric graphical model,

    ∇_{Λ_M} L_M(θ(x; Λ_M), y) = (∂ log p_{θ(x;Λ_M)}(y) / ∂θ̃(x; Λ_M)) · (∂θ̃(x; Λ_M) / ∂Λ_M),    (8)

where the partial derivatives of the second factor can be obtained via standard back-propagation and the first factor decomposes according to the graphical model's graph structure G,

    ∂ log p_{θ(x;Λ_M)}(y) / ∂θ̃(x; Λ_M) = ∑_{i=1}^n ∂ log p_{θ_{i|pa(i)}(x;Λ_M)}(y_i | pa(y_i)) / ∂θ̃(x; Λ_M).    (9)

The gradient of a local model w.r.t. the joint parameters θ̃(x; Λ_M) is given by two factors accounting for the gradient w.r.t. the conditional parameters and the Jacobian of the conditioning operation,

    ∂ log p_{θ_{i|pa(i)}(x;Λ_M)}(y_i | pa(y_i)) / ∂θ̃(x; Λ_M) = (∂ log p_{θ_{i|pa(i)}(x;Λ_M)}(y_i | pa(y_i)) / ∂θ(x; Λ_M)) · (∂θ(x; Λ_M) / ∂θ̃(x; Λ_M)).    (10)

Note that the Jacobian takes a block-diagonal form, because θ = θ_{i|pa(i)}(x; Λ_M) is independent of θ̃ = θ̃_{j,pa(j)}(x; Λ_M) for i ≠ j. Each block constitutes the backward pass through a node Y_i's conditioning operation,

    ∂θ/∂θ̃ = ∂(w, μ, B)/∂(w̃, μ̃, B̃) = [ ∂w/∂w̃  ∂w/∂μ̃  ∂w/∂B̃ ;  0  ∂μ/∂μ̃  0 ;  0  0  ∂B/∂B̃ ],    (11)

where the individual entries are given by the derivatives of Eq. (4), e.g.,

    ∂w/∂w̃ = (−w ⊗ w + diag(w)) · diag(w̃)^{−1}.    (12)

Similar equations exist for the derivatives of the weights w.r.t. the kernel locations and kernel bandwidth matrices; the remaining cases are simple projections. 
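Eq. (12) can be sanity-checked numerically: with the reweighting factors c (the kernel evaluations at y′ in Eq. (4)) held fixed, w_j ∝ w̃_j c_j, and the Jacobian ∂w/∂w̃ has the stated outer-product form. A small self-contained check (illustrative only; `normalize` and `weight_jacobian` are not from the paper):

```python
def normalize(w_tilde, c):
    """w_j proportional to w~_j * c_j, as induced by Eq. (4)."""
    z = sum(a * b for a, b in zip(w_tilde, c))
    return [a * b / z for a, b in zip(w_tilde, c)]

def weight_jacobian(w_tilde, c):
    """Eq. (12): dw/dw~ = (-w (outer) w + diag(w)) * diag(w~)^{-1},
    i.e., entry (i, j) = (delta_ij * w_i - w_i * w_j) / w~_j."""
    w = normalize(w_tilde, c)
    n = len(w)
    return [[((w[i] if i == j else 0.0) - w[i] * w[j]) / w_tilde[j]
             for j in range(n)] for i in range(n)]
```

A finite-difference comparison against `normalize` confirms the closed form.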
In practice, we may be able to group the potentials p_{θ_{i|pa(i)}} according to their semantic meaning, in which case we can train one potential per group instead of one potential per node by sharing the corresponding parameters in Eq. (9). All topological operations can be implemented as separate layers in a deep neural network and the corresponding gradients can be obtained using automatic differentiation.

3.3 Distributional Non-parametric Gradients

We have shown how the gradient of the loss factorizes over the graph of the output space. Next, we will provide the gradients of those local factors log p_θ(y | y′) (Eq. (3)) w.r.t. the local parameters θ = θ_{i|pa(i)}. To reduce notational clutter, we introduce the shorthand ŷ^{(k)} := B^{(k)}(y − μ^{(k)}) to refer to the normalized input and provide only final results; detailed derivations for all gradients and worked-out examples for specific kernels can be found in Appendix A.2.

Kernel Weights.

    ∇_w log p_θ(y | y′) = η / (w^⊤ η),    η := (|B^{(k)}| κ(ŷ^{(k)}))_{k=1}^N.    (13)

Note that w is required to lie on the standard (N−1)-simplex Δ^{(N−1)}. 
Different normalizations are possible, including a softmax or a projection onto the simplex, i.e., π_{Δ^{(N−1)}}(w^{(i)}) = max(0, w^{(i)} + u), where u is the unique translation such that the positive points sum to 1 [35].

Kernel Centers.

    ∇_μ log p_θ(y | y′) = (w ⊙ γ) / (w^⊤ η),    γ := (−B^{(k)⊤} |B^{(k)}| · ∂κ(ŷ^{(k)})/∂ŷ^{(k)})_{k=1}^N.    (14)

The kernel centers do not underlie any spatial restrictions, but proper initialization is important. Typically, we use the centers of a k-means clustering with k := N to initialize the kernel centers.

Kernel Bandwidth Matrices.

    ∇_B log p_θ(y | y′) = (w ⊙ ζ) / (w^⊤ η),    ζ := (B^{−⊤(k)} |B^{(k)}| · (κ(ŷ^{(k)}) + ∂κ(ŷ^{(k)})/∂ŷ^{(k)} ŷ^{(k)⊤}))_{k=1}^N.    (15)

While computation of the gradient w.r.t. B is a universal approach, specific kernels may allow alternative gradients: in a Gaussian kernel, for instance, the Gramian of the bandwidth matrix acts as a covariance matrix. We can thus optimize B^{(k)} B^{(k)⊤} in the interior of the cone of positive-semidefinite matrices by computing the gradients w.r.t. the Cholesky factor of the inverse covariance matrix.

4 Inferring Complex Statistics using Neural Belief Propagation

The previous sections introduced non-parametric structured output networks and showed how their components, DNN and NGM, can be learned from data. Since the resulting posterior density p_θ(x)(y) (Eq. (2)) factorizes over a graph, we can, in theory, use local messages to propagate beliefs about statistics t[p_θ(x)(y)] along its edges (BP; [20]). 
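The simplex projection mentioned for the kernel weights can be sketched as the standard sort-based Euclidean projection, written with the shift u from the text (a sketch of the technique from [35], not the paper's code):

```python
def project_simplex(v):
    """Euclidean projection onto the probability simplex:
    pi(v)_i = max(0, v_i + u), where u is the unique shift such that
    the positive entries sum to 1."""
    s = sorted(v, reverse=True)
    cumulative, u = 0.0, 0.0
    for i, x in enumerate(s, 1):
        cumulative += x
        t = (1.0 - cumulative) / i  # candidate shift using the i largest entries
        if x + t > 0:
            u = t                   # condition holds for a prefix; keep the last valid shift
    return [max(0.0, x + u) for x in v]
```

Points already on the simplex are fixed points of the projection, which makes the operation safe to apply after every gradient step.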
However, special care must be taken to handle intractable operations caused by non-parametric local potentials and to allow an end-to-end integration. For ease of exposition, we assume that we can represent the local conditional distributions as a set of pairwise potentials {ψ(y_i, y_j)}, effectively converting our directed model to a normalized MRF. This is not limiting, as we can always convert a factor graph representation of Eq. (2) into an equivalent pairwise MRF [36]. In this setting, a BP message μ_{i→j}(y_j) from Y_i to Y_j takes the form

    μ_{i→j}(y_j) = op_{y_i} ψ(y_i, y_j) · μ_{·→i}(y_i),    (16)

where the operator op_{y_i} computes a summary statistic, such as integration or maximization, and μ_{·→i}(y_i) is the product of all incoming messages at Y_i. In case of a graphical model with non-parametric local distributions (Eq. (3)), this computation is not feasible for two reasons: (1) the pre-messages μ_{·→i}(y_i) are products of sums, which means that the number of kernels grows exponentially in the number of incoming messages; (2) the functional op_{y_i} does not usually have an analytic form.

[Figure 2 shows: (a) the recurrent inference network: a DNN (Fig. 1(b)(i)) predicts {θ̃_ij} for the non-parametric graphical model (Fig. 1(b)(ii), loss L_M); a stacking layer followed by FC+ReLU maps θ̃_ij and the incoming messages μ̂^{(t−1)}_{k→i}, k ∈ ne(i)\j, to outgoing messages μ̂^{(t)}_{i→j} for t = 1, ..., T and i = 1, ..., n, and to beliefs b̂_i with per-node losses L_I^{(i)}; (b) a partially unrolled inference network computing μ̂^{(t)}_{2→4} from θ̃_21, θ̃_32, θ̃_42 and μ̂^{(t−1)}_{1→2}, μ̂^{(t−1)}_{3→2}.]

Figure 2: Inferring Complex Statistics. Expressive output spaces require explicit inference procedures to obtain posterior statistics. 
We use an inference network inspired by message-passing schemes in non-parametric graphical models. (a) An RNN iteratively computes outgoing messages from incoming messages and the local potential. (b) Unrolled inference network illustrating the computation of μ̂_{2→4} in the graph shown in Figure 1b.

Inspired by recent results in imitation learning [37] and inference machines for classification [17, 38], we take an alternate route and use an RNN to model the exchange of information between non-parametric nodes. In particular, we introduce an RNN node μ̂_{i→j} for each message and connect them in time according to Eq. (16), i.e., each node has incoming connections from its local potential θ̃_ij, predicted by the DNN, and the nodes {μ̂_{k→i} : k ∈ ne_G(i)\j}, which correspond to the incoming messages. The message computation itself is approximated through an FC+ReLU layer with weights Λ_I^{i→j}. An approximate message μ̂_{i→j} from Y_i to Y_j can thus be written as

    μ̂_{i→j} = ReLU(FC_{Λ_I}^{i→j}(Stacking(θ̃_ij, {μ̂_{k→i} : k ∈ ne_G(i)\j}))),    (17)

where ne_G(·) returns the neighbors of a node in G. The final beliefs b̂_i = μ̂_{·→i} · μ̂_{i→j} are implemented analogously. Similar to (loopy) belief updates in traditional message-passing, we run the RNN for a fixed number of iterations, at each step passing all neural messages. Furthermore, using the techniques discussed in Section 3.3, we can ensure that the messages are valid non-parametric distributions. All layers in this recurrent inference network are differentiable, so that we can propagate a decomposable inference loss L_I = ∑_{i=1}^n L_I^{(i)} end-to-end back to the inputs. In practice, we find that generic loss functions work well (see Section 5) and that canonic loss functions can often be obtained directly from the statistic. 
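A single neural message of Eq. (17) is just a ReLU-activated affine map over the stacked local parameters and incoming messages. A minimal sketch with plain lists instead of tensors (`neural_message` and its argument layout are our own illustration):

```python
def relu(v):
    return [max(0.0, x) for x in v]

def neural_message(W, b, theta_ij, incoming):
    """Eq. (17): mu_hat_{i->j} = ReLU(FC(Stacking(theta_ij, incoming))).
    W (list of rows) and b are the weights of the edge-specific FC layer;
    incoming holds the messages {mu_hat_{k->i} : k in ne(i)\\j}."""
    x = list(theta_ij)
    for m in incoming:
        x.extend(m)  # the Stacking operation
    return relu([sum(wi * xi for wi, xi in zip(row, x)) + bi
                 for row, bi in zip(W, b)])
```

Running one such map per edge, for a fixed number of iterations, reproduces the unrolled computation of Figure 2b.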
The DNN weights Λ_M are thus updated so as to both predict the right posterior density and, together with the RIN weights Λ_I, perform correct inference in it (Figure 2).

5 Experiments

We validate non-parametric structured output networks at both the model (DNN+NGM) and the inference level (RIN). Model validation consists of a comparison to baselines along two binary axes, structuredness and non-parametricity. Inference validation compares our RIN unit to the two predominant groups of approaches for inference in structured non-parametric densities, i.e., sampling-based and variational inference (Section 1.1.2).

5.1 Dataset

We test our approach on simple natural pixel statistics from Microsoft COCO [11] by sampling stripes y = (y_i)_{i=1}^n ∈ [0, 255]^n of n = 10 pixels. Each pixel y_i is corrupted by a linear noise model, leading to the observable output x_i = λ · y_i + ε, with ε ~ N(255 · δ_{λ,−1}, σ²) and λ ~ Ber(β), where the target space of the Bernoulli trial is {−1, +1}. For our experiments, we set σ² = 100 and β = 0.5. Using this noise process, we generate training and test sets of sizes 100,000 and 1,000, respectively.

5.2 Model Validation

The distributional gradients (Eq. (9)) comprise three types of parameters: kernel locations, kernel weights, and kernel bandwidth matrices. 
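The noise process of Section 5.1 can be reproduced in a few lines (a sketch; the function name and RNG handling are our own):

```python
import random

def corrupt(stripe, sigma2=100.0, beta=0.5, rng=random):
    """x_i = lam * y_i + eps, with eps ~ N(255 * [lam == -1], sigma2)
    and lam in {-1, +1}, P(lam = +1) = beta."""
    out = []
    for y in stripe:
        lam = 1 if rng.random() < beta else -1
        eps = rng.gauss(255.0 if lam == -1 else 0.0, sigma2 ** 0.5)
        out.append(lam * y + eps)
    return out
```

Note the bimodal effect: with λ = +1 a pixel y stays near y, while with λ = −1 it lands near 255 − y, which is exactly the kind of multimodal posterior a single Gaussian cannot capture.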
Default values for the latter two exist in the form of uniform weights and plug-in bandwidth estimates [33], respectively, so we can turn optimization of those parameter groups on/off as desired.³

(a) Model Validation (expected test log-likelihood; parameter group estimation: kernel locations only | +B | +W | +W+B):

Gaussian (unstructured, parametric): −1.13 (ML estimation)
Kernel Density (unstructured, non-parametric): +6.66 (plug-in bandwidth estimation)
Gaussian + neural network (unstructured, parametric): −0.90 | +2.54 | −0.88 | +2.90
GGM [39] + neural network (structured, parametric): −0.85 | +1.55 | −0.93 | +1.53
Mixture Density [40] + neural network (unstructured, non-parametric): +9.22 | +6.87 | +11.18 | +11.51
NGM-100 + neural network (ours; structured, non-parametric): +15.26 | +15.30 | +16.00 | +16.46

(b) Inference Validation (particles | marginal log-likelihood | runtime in sec):

BB-VI [23]: 400 | +2.30 | 660.65;  800 | +3.03 | 1198.08
P-BP [30]: 50 | +2.91 | 0.49;  100 | +6.13 | 2.11;  200 | +7.01 | 6.43;  400 | +8.85 | 21.13
RIN-100 (ours): – | +16.62 | 0.04

Table 2: Quantitative Evaluation. (a) We report the expected log-likelihood of the test set under the predicted posterior p_θ(x)(y), showing the need for a structured and non-parametric approach to model rich posteriors. (b) Inference using our RIN architecture is much faster than sampling-based or variational inference while still leading to accurate marginals. [(N/G)GM: Non-parametric/Gaussian Graphical Model; RIN-x: Recurrent Inference Network with x kernels; P-BP: Particle Belief Propagation; BB-VI: Black Box Variational Inference]

In addition to those variations, non-parametric structured output networks with a Gaussian kernel κ = N(· | 0, I) comprise a number of popular baselines as special cases, including neural networks predicting a Gaussian posterior (n = 1, N = 1), mixture density networks (n = 1, N > 1; [40]), and Gaussian graphical models (n > 1, N = 1; [39]). 
For the sake of completeness, we also report the performance of two basic posteriors without a preceding neural network, namely a pure Gaussian and traditional kernel density estimation (KDE). We compare our approach to these baselines in terms of the expected log-likelihood on the test set, which is a relative measure of the KL-divergence to the true posterior.

Setup and Results. For the two basic models, we learn a joint density p(y, x) by maximum likelihood (Gaussian) and plug-in bandwidth estimation (KDE) and condition on the inputs x to infer the labels y. We train the other four models for 40 epochs, using a Gaussian kernel and a diagonal bandwidth matrix for the non-parametric models. The DNN consists of 2 fully-connected layers with 256 units, and the kernel weights are constrained to lie on a simplex with a softmax layer. The NGM uses a chain-structured graph that connects each pixel to its immediate neighbors. Table 2a shows our results. Ablation study: unsurprisingly, a purely Gaussian posterior cannot represent the true posterior appropriately. A multimodal kernel density works better than a neural network with a parametric posterior but cannot compete with the two non-parametric models attached to the neural network. Among the methods with a neural network, optimization of kernel locations only (first column) generally performs worst. However, the +B setting (second column) sometimes gets trapped in local minima, especially in the case of global mixture densities. If we decide to estimate a second parameter group, weights (+W) should therefore be preferred over bandwidths (+B). Best results are obtained when estimation is turned on for all three parameter groups. Baselines: the two non-parametric methods consistently perform better than the parametric approaches, confirming our claim that non-parametric densities are a powerful alternative to a parametric posterior.
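The simplex constraint on the kernel weights described in the setup can be sketched as follows: a softmax maps unconstrained logits to valid mixture weights, which then enter a kernel-mixture log-likelihood. This is an illustrative one-dimensional sketch, not the paper's implementation; `mixture_loglik` and its arguments are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Map unconstrained logits to the probability simplex (numerically stable)."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def mixture_loglik(y, locs, logits, h):
    """Log-likelihood of a scalar y under a 1-D Gaussian kernel mixture whose
    weights come from a softmax layer, so they are positive and sum to one."""
    w = softmax(logits)                                   # weights on the simplex
    z = (y - locs) / h
    comp = np.exp(-0.5 * z ** 2) / (h * np.sqrt(2.0 * np.pi))
    return np.log(np.dot(w, comp))

w = softmax(np.zeros(4))                                  # equal logits -> uniform weights
ll = mixture_loglik(0.0, np.array([-1.0, 0.0, 1.0]), np.zeros(3), 0.5)
```

Parameterizing the weights through logits keeps gradient updates unconstrained while the resulting weights always remain a valid distribution.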
Furthermore, a comparison of the last two rows shows a substantial improvement due to our factored representation, demonstrating the importance of incorporating structure into high-dimensional, continuous estimation problems.

Learned Graph Structures. While the output variables in our experiments with one-dimensional pixel stripes have a canonical dependence structure, the optimal connectivity of the NGM in tasks with complex or no spatial semantics might be less obvious. As an example, we consider the case of two-dimensional image patches of size 10 × 10, which we extract and corrupt following the same protocol and noise process as above. Instead of specifying the graph by hand, we use a mutual information criterion [41] to learn the optimal arborescence from the training labels. With estimation of all parameter groups turned on (+W+B), we obtain results that are fully in line with those above: the expected test log-likelihood of NSONs (+153.03) is again superior to a global mixture density (+76.34), which in turn outperforms the two parametric approaches (GGM: +18.60; Gaussian: −19.03). A full ablation study as well as a visualization of the inferred graph structure are shown in Appendix A.3.

³Since plug-in estimators depend on the kernel locations, the gradient w.r.t. the kernel locations needs to take these dependencies into account by backpropagating through the estimator and computing the total derivative.

5.3 Inference Validation
Section 4 motivated the use of a recurrent inference network (RIN) to infer rich statistics from structured, non-parametric densities. We compare this choice to the other two groups of approaches, i.e., variational and sampling-based inference (Section 1.1.2), in a marginal inference task. To this end, we pick one popular member from each group as a baseline for our RIN architecture.

Particle Belief Propagation (P-BP; [30]).
Sum-product particle belief propagation approximates a BP message (Eq. (16); op := ∫) with a set of particles {y_j^(s)}_{s=1}^S per node Y_j by computing

    μ̂_{i→j}(y_j^(k)) = Σ_{s=1}^S [ψ(y_i^(s), y_j^(k)) · μ̂_{·→i}(y_i^(s))] / [S · ρ(y_i^(s))],   (18)

where the particles are sampled from a proposal distribution ρ that approximates the true marginal by running MCMC on the beliefs μ̂_{·→i}(y_i) · μ̂_{i→j}(y_i). Similar versions exist for other operators [42].

Black Box Variational Inference (BB-VI; [23]). Black box variational inference maximizes the ELBO L_VI[q] with respect to a variational distribution q by approximating its gradient through a set of samples {y^(s)}_{s=1}^S ∼ q and performing stochastic gradient ascent,

    ∇ L_VI[q] = ∇ E_q(y)[log(p_θ(y)/q(y))] ≈ S⁻¹ Σ_{s=1}^S ∇ log q(y^(s)) · log(p_θ(y^(s))/q(y^(s))).   (19)

A statistic t (Eq. (5)) can then be estimated from the tractable variational distribution q(y) instead of the complex target distribution p_θ(y). We use an isotropic Gaussian kernel κ = N(· | 0, I) together with the traditional factorization q(y) = ∏_{i=1}^n q_i(y_i), in which case variational sampling is straightforward and the (now unconditional) gradients are given directly by Section 3.3.

5.3.1 Setup and Results
We train our RIN architecture with a negative log-likelihood loss attached to each belief node, L_I^(i) = −log p_{θ_i}(y_i), and compare its performance to the results obtained from P-BP and BB-VI by calculating the sum of marginal log-likelihoods. For the baselines, we consider different numbers of particles, which affects both performance and speed. Additionally, for BB-VI we track the performance across 1024 optimization steps and report the best results. Table 2b summarizes our findings.
Among the baselines, P-BP performs better than BB-VI once a certain particle threshold is exceeded. We believe this is a manifestation of the special requirements associated with inference in non-parametric densities: while BB-VI needs to fit a large number of parameters, which poses the risk of getting trapped in local minima, P-BP relies solely on the evaluation of potentials. However, both methods are outperformed by a significant margin by our RIN, which we attribute to its end-to-end training in accordance with DNN+NGM and its ability to propagate and update full distributions instead of their mere values at a discrete set of points. In addition to pure performance, a key advantage of RIN inference over more traditional inference methods is its speed: our RIN approach is over 50× faster than P-BP with 100 particles and orders of magnitude faster than BB-VI. This is significant, even when taking dependencies on hardware and implementation into account, and allows the use of expressive non-parametric posteriors in time-critical applications.

6 Conclusion
We proposed non-parametric structured output networks, a highly expressive framework consisting of a deep neural network predicting a non-parametric graphical model and a recurrent inference network computing statistics in this structured output space. We showed how all three components can be learned end-to-end by backpropagating non-parametric gradients through directed graphs and neural messages. Our experiments showed that non-parametric structured output networks are necessary for both effective learning of multimodal posteriors and efficient inference of complex statistics in them.
We believe that NSONs are suitable for a variety of other structured tasks and can be used to obtain accurate approximations to many intractable statistics of non-parametric densities beyond (max-)marginals.

References
[1] Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet Classification with Deep Convolutional Neural Networks. NIPS (2012)
[2] He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. CVPR (2016)
[3] Shelhamer, E., Long, J., Darrell, T.: Fully Convolutional Networks for Semantic Segmentation. PAMI (2016)
[4] Girshick, R.: Fast R-CNN. ICCV (2015)
[5] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497 [cs.CV] (2015)
[6] Collobert, R., Weston, J.: A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. ICML (2008)
[7] Bahdanau, D., Cho, K., Bengio, Y.: Neural Machine Translation by Jointly Learning to Align and Translate. ICLR (2015)
[8] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR (2014)
[9] Ioffe, S., Szegedy, C.: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML (2015)
[10] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. CVPR (2009)
[11] Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C.L., Dollar, P.: Microsoft COCO: Common Objects in Context. arXiv:1405.0312 [cs.CV] (2014)
[12] Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press (2009)
[13] Schwing, A., Urtasun, R.: Fully Connected Deep Structured Networks.
arXiv:1503.02351 [cs.CV] (2015)
[14] Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.: Conditional Random Fields as Recurrent Neural Networks. ICCV (2015)
[15] Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR (2015)
[16] Jain, A., Zamir, A.R., Savarese, S., Saxena, A.: Structural-RNN: Deep Learning on Spatio-Temporal Graphs. CVPR (2016)
[17] Deng, Z., Vahdat, A., Hu, H., Mori, G.: Structure Inference Machines: Recurrent Neural Networks for Analyzing Relations in Group Activity Recognition. CVPR (2016)
[18] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.: Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. ICLR (2015)
[19] Chen, L.C., Schwing, A., Yuille, A., Urtasun, R.: Learning Deep Structured Models. ICML (2015)
[20] Pearl, J.: Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann (1988)
[21] Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.: Stochastic Variational Inference. JMLR (2013)
[22] Ghahramani, Z., Beal, M.: Propagation Algorithms for Variational Bayesian Learning. NIPS (2001)
[23] Ranganath, R., Gerrish, S., Blei, D.M.: Black Box Variational Inference. JMLR W&CP (2014)
[24] Kraehenbuehl, P., Koltun, V.: Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. NIPS (2012)
[25] Adams, A., Baek, J., Davis, M.A.: Fast High-Dimensional Filtering Using the Permutohedral Lattice. Computer Graphics Forum (2010)
[26] Campbell, N., Subr, K., Kautz, J.: Fully-Connected CRFs with Non-Parametric Pairwise Potentials. CVPR (2013)
[27] Koller, D., Lerner, U., Angelov, D.: A General Algorithm for Approximate Inference and its Application to Hybrid Bayes Nets. UAI (1999)
[28] Isard, M.: Pampas: Real-Valued Graphical Models for Computer Vision.
CVPR (2003)
[29] Sudderth, E., Ihler, A., Freeman, W., Willsky, A.: Nonparametric Belief Propagation. CVPR (2003)
[30] Ihler, A., McAllester, D.: Particle Belief Propagation. AISTATS (2009)
[31] Pacheco, J., Zuffi, S., Black, M.J., Sudderth, E.: Preserving Modes and Messages via Diverse Particle Selection. ICML (2014)
[32] Park, M., Liu, Y., Collins, R.T.: Efficient Mean Shift Belief Propagation for Vision Tracking. CVPR (2008)
[33] Scott, D.: Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley (1992)
[34] Kingma, D., Ba, J.: Adam: A Method for Stochastic Optimization. ICLR (2015)
[35] Wang, W., Carreira-Perpiñán, M.Á.: Projection onto the Probability Simplex: An Efficient Algorithm with a Simple Proof, and an Application. arXiv:1309.1541 [cs.LG] (2013)
[36] Yedidia, J.S., Freeman, W.T., Weiss, Y.: Understanding Belief Propagation and its Generalizations. Technical report, Mitsubishi Electric Research Laboratories (2001)
[37] Sun, W., Venkatraman, A., Gordon, G.J., Boots, B., Bagnell, J.A.: Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction. arXiv:1703.01030 [cs.LG] (2017)
[38] Ross, S., Munoz, D., Hebert, M., Bagnell, J.A.: Learning Message-Passing Inference Machines for Structured Prediction. CVPR (2011)
[39] Weiss, Y., Freeman, W.T.: Correctness of Belief Propagation in Gaussian Graphical Models of Arbitrary Topology. Neural Computation (2001)
[40] Bishop, C.M.: Mixture Density Networks. Technical report, Aston University (1994)
[41] Lehrmann, A., Gehler, P., Nowozin, S.: A Non-Parametric Bayesian Network Prior of Human Pose. ICCV (2013)
[42] Kothapa, R., Pacheco, J., Sudderth, E.B.: Max-Product Particle Belief Propagation. Technical
report, Brown University (2011)