{"title": "On the Accuracy of Self-Normalized Log-Linear Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1783, "page_last": 1791, "abstract": "Calculation of the log-normalizer is a major computational obstacle in applications of log-linear models with large output spaces. The problem of fast normalizer computation has therefore attracted significant attention in the theoretical and applied machine learning literature. In this paper, we analyze a recently proposed technique known as ``self-normalization'', which introduces a regularization term in training to penalize log normalizers for deviating from zero. This makes it possible to use unnormalized model scores as approximate probabilities. Empirical evidence suggests that self-normalization is extremely effective, but a theoretical understanding of why it should work, and how generally it can be applied, is largely lacking.We prove upper bounds on the loss in accuracy due to self-normalization, describe classes of input distributionsthat self-normalize easily, and construct explicit examples of high-variance input distributions. Our theoretical results make predictions about the difficulty of fitting self-normalized models to several classes of distributions, and we conclude with empirical validation of these predictions on both real and synthetic datasets.", "full_text": "On the Accuracy of Self-Normalized\n\nLog-Linear Models\n\nJacob Andreas\u2217, Maxim Rabinovich\u2217, Michael I. Jordan, Dan Klein\n\nComputer Science Division, University of California, Berkeley\n{jda,rabinovich,jordan,klein}@cs.berkeley.edu\n\nAbstract\n\nCalculation of the log-normalizer is a major computational obstacle in applica-\ntions of log-linear models with large output spaces. The problem of fast normal-\nizer computation has therefore attracted signi\ufb01cant attention in the theoretical and\napplied machine learning literature. 
In this paper, we analyze a recently proposed technique known as \u201cself-normalization\u201d, which introduces a regularization term in training to penalize log normalizers for deviating from zero. This makes it possible to use unnormalized model scores as approximate probabilities. Empirical evidence suggests that self-normalization is extremely effective, but a theoretical understanding of why it should work, and how generally it can be applied, is largely lacking.\nWe prove upper bounds on the loss in accuracy due to self-normalization, describe classes of input distributions that self-normalize easily, and construct explicit examples of high-variance input distributions. Our theoretical results make predictions about the difficulty of fitting self-normalized models to several classes of distributions, and we conclude with empirical validation of these predictions.\n\n1 Introduction\n\nLog-linear models, a general class that includes conditional random fields (CRFs) and generalized linear models (GLMs), offer a flexible yet tractable approach to modeling conditional probability distributions p(y|x) [1, 2]. When the set of possible y values is large, however, the computational cost of computing a normalizing constant for each x can be prohibitive\u2014involving a summation with many terms, a high-dimensional integral, or an expensive dynamic program.\nThe machine translation community has recently described several procedures for training \u201cself-normalized\u201d log-linear models [3, 4].
The goal of self-normalization is to choose model parameters that simultaneously yield accurate predictions and produce normalizers clustered around unity. Model scores can then be used as approximate surrogates for probabilities, obviating normalizer computation.\nIn particular, given a model of the form\n\np_\u03b7(y | x) = exp(\u03b7\u22a4T(y, x) \u2212 A(\u03b7, x))   (1)\n\nwith\n\nA(\u03b7, x) = log \u2211_{y \u2208 Y} exp(\u03b7\u22a4T(y, x)),   (2)\n\nwe seek a setting of \u03b7 such that A(x, \u03b7) is close enough to zero (with high probability under p(x)) to be ignored.\n\n\u2217Authors contributed equally.\n\nThis paper aims to understand the theoretical properties of self-normalization. Empirical results have already demonstrated the efficacy of this approach\u2014for discrete models with many output classes, it appears that normalizer values can be made nearly constant without sacrificing too much predictive accuracy, providing dramatic efficiency increases at minimal performance cost.\nThe broad applicability of self-normalization makes it likely to spread to other large-scale applications of log-linear models, including structured prediction (with combinatorially many output classes) and regression (with continuous output spaces). But it is not obvious that we should expect such approaches to be successful: the number of inputs (if finite) can be on the order of millions, the geometry of the resulting input vectors x highly complex, and the class of functions A(\u03b7, x) associated with different inputs quite rich. To find a nontrivial parameter setting with A(\u03b7, x) roughly constant seems challenging enough; to require that the corresponding \u03b7 also lead to good classification results seems too much.
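Concretely, the computation that self-normalization tries to avoid is the log-sum-exp in Eq. (2). The following minimal numpy sketch (toy feature values of our own invention, not from the paper) shows the exact log-probabilities of Eq. (1) and the fact that using raw scores in their place incurs an error of exactly |A(\u03b7, x)|:

```python
import numpy as np

def log_normalizer(eta, feats):
    """A(eta, x) = log sum_y exp(eta . T(y, x)); `feats` holds one row per output y."""
    scores = feats @ eta                          # one score per candidate output
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())   # numerically stable log-sum-exp

def log_probs(eta, feats):
    """Exact log p_eta(y | x): raw scores shifted down by the log-normalizer."""
    scores = feats @ eta
    return scores - log_normalizer(eta, feats)

# Toy 3-output example: rows are feature vectors T(y, x) at a fixed input x.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
eta = np.array([0.2, -0.1])

A = log_normalizer(eta, feats)
# If training drives A(eta, x) toward 0, the raw scores feats @ eta can stand in
# for log-probabilities; the worst-case substitution error is exactly |A(eta, x)|.
approx_error = np.abs((feats @ eta) - log_probs(eta, feats)).max()
assert np.isclose(approx_error, abs(A))
```

The point of the last assertion is that self-normalization is not an approximation of the scores themselves: once A(\u03b7, x) is near zero, unnormalized scores and log-probabilities coincide.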
And yet for many input distributions that arise in practice, it appears possible to choose \u03b7 to make A(\u03b7, x) nearly constant without having to sacrifice classification accuracy.\nOur goal is to bridge the gap between theoretical intuition and practical experience. Previous work [5] bounds the sample complexity of self-normalizing training procedures for a restricted class of models, but leaves open the question of how self-normalization interacts with the predictive power of the learned model. This paper seeks to answer that question. We begin by generalizing the previously-studied model to a much more general class of distributions, including distributions with continuous support (Section 3). Next, we provide what we believe to be the first characterization of the interaction between self-normalization and model accuracy (Section 4). This characterization is given from two perspectives:\n\n\u2022 a bound on the \u201clikelihood gap\u201d between self-normalized and unconstrained models\n\u2022 a conditional distribution provably hard to represent with a self-normalized model\n\nIn Section 5, we present empirical evidence that these bounds correctly characterize the difficulty of self-normalization, and in the conclusion we survey a set of open problems that we believe merit further investigation.\n\n2 Problem background\n\nThe immediate motivation for this work is a procedure proposed to speed up decoding in a machine translation system with a neural-network language model [3]. The language model used is a standard feed-forward neural network, with a \u201csoftmax\u201d output layer that turns the network\u2019s predictions into a distribution over the vocabulary, where each probability is log-proportional to its output activation. It is observed that with a sufficiently large vocabulary, it becomes prohibitive to obtain probabilities from this model (which must be queried millions of times during decoding).
To fix this, the language model is trained with the following objective:\n\nmax_W \u2211_i [ N(y_i | x_i; W) \u2212 log \u2211_{y\u2032} exp(N(y\u2032 | x_i; W)) \u2212 \u03b1 (log \u2211_{y\u2032} exp(N(y\u2032 | x_i; W)))\u00b2 ]\n\nwhere N(y|x; W) is the response of output y in the neural net with weights W given an input x. From a Lagrangian perspective, the extra penalty term simply confines W to the set of \u201cempirically normalizing\u201d parameters, for which all log-normalizers are close (in squared error) to the origin. For a suitable choice of \u03b1, it is observed that the trained network is simultaneously accurate enough to produce good translations, and close enough to self-normalized that the raw scores N(y_i|x_i) can be used in place of log-probabilities without substantial further degradation in quality.\nWe seek to understand the observed success of these models in finding accurate, normalizing parameter settings. While it is possible to derive bounds of the kind we are interested in for general neural networks [6], in this paper we work with a simpler linear parameterization that we believe captures the interesting aspects of this problem.\u00b9\n\n\u00b9It is possible to view a log-linear model as a single-layer network with a softmax output. More usefully, all of the results presented here apply directly to trained neural nets in which the last layer only is retrained to self-normalize [7].\n\nRelated work\n\nThe approach described at the beginning of this section is closely related to an alternative self-normalization trick based on noise-contrastive estimation (NCE) [8]. NCE is an alternative to direct optimization of likelihood, instead training a classifier to distinguish between true samples from the model, and \u201cnoise\u201d samples from some other distribution. The structure of the training objective makes it possible to replace explicit computation of each log-normalizer with an estimate.
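The penalized objective above is straightforward to state in code. The sketch below is hypothetical: a linear scorer s(y|x) = (Wx)_y stands in for the neural network N, and the data are made-up toy values; it only illustrates that the \u03b1 term can never increase the objective:

```python
import numpy as np

def self_norm_objective(W, X, y, alpha):
    """Sum_i [ s(y_i|x_i) - log Z_i - alpha * (log Z_i)^2 ] for a linear scorer
    s(y|x) = (W x)_y; the alpha term penalizes log-normalizers away from zero."""
    total = 0.0
    for xi, yi in zip(X, y):
        scores = W @ xi                               # one score per output class
        m = scores.max()
        logZ = m + np.log(np.exp(scores - m).sum())   # stable log-sum-exp
        total += scores[yi] - logZ - alpha * logZ ** 2
    return total

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4)) * 0.1     # 3 output classes, 4 input features
X = rng.normal(size=(5, 4))           # 5 toy training inputs
y = [0, 2, 1, 0, 2]                   # toy labels

loose = self_norm_objective(W, X, y, alpha=0.0)   # plain log-likelihood
tight = self_norm_objective(W, X, y, alpha=1.0)   # with self-normalization penalty
assert tight <= loose   # the squared penalty is nonpositive, so it only subtracts
```

At the optimum, of course, the two objectives are maximized by different W; the penalty trades likelihood for normalizers near zero, which is exactly the tradeoff analyzed in Section 4.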
In traditional NCE, these values are treated as part of the parameter space, and estimated simultaneously with the model parameters; there exist guarantees that the normalizer estimates will eventually converge to their true values. It is instead possible to fix all of these estimates to one. In this case, empirical evidence suggests that the resulting model will also exhibit self-normalizing behavior [4].\nA host of other techniques exist for solving the computational problem posed by the log-normalizer. Many of these involve approximating the associated sum or integral using quadrature [9], herding [10], or Monte Carlo methods [11]. For the special case of discrete, finite output spaces, an alternative approach\u2014the hierarchical softmax\u2014is to replace the large sum in the normalizer with a series of binary decisions [12]. The output classes are arranged in a binary tree, and the probability of generating a particular output is the product of probabilities along the edges leading to it. This reduces the cost of computing the normalizer from O(k) to O(log k). While this limits the set of distributions that can be learned, and still requires greater-than-constant time to compute normalizers, it appears to work well in practice. It cannot, however, be applied to problems with continuous output spaces.\n\n3 Self-normalizable distributions\n\nWe begin by providing a slightly more formal characterization of a general log-linear model:\nDefinition 1 (Log-linear models).
Given a space of inputs X, a space of outputs Y, a measure \u00b5 on Y, a nonnegative function h : Y \u2192 R, and a function T : X \u00d7 Y \u2192 R^d that is \u00b5-measurable with respect to its second argument, we can define a log-linear model indexed by parameters \u03b7 \u2208 R^d, with the form\n\np_\u03b7(y|x) = h(y) exp(\u03b7\u22a4T(x, y) \u2212 A(x, \u03b7)),   (3)\n\nwhere\n\nA(x, \u03b7) := log \u222b_Y h(y) exp(\u03b7\u22a4T(x, y)) d\u00b5(y).   (4)\n\nIf A(x, \u03b7) < \u221e, then \u222b_Y p_\u03b7(y|x) d\u00b5(y) = 1, and p_\u03b7(y|x) is a probability density over Y.\u00b2\nWe next formalize our notion of a self-normalized model.\nDefinition 2 (Self-normalized models). The log-linear model p_\u03b7(y|x) is self-normalized with respect to a set S \u2282 X if for all x \u2208 S, A(x, \u03b7) = 0. In this case we say that S is self-normalizable, and \u03b7 is self-normalizing w.r.t. S.\nAn example of a normalizable set is shown in Figure 1a, and we provide additional examples below:\n\n\u00b2Some readers may be more familiar with generalized linear models, which also describe exponential family distributions with a linear dependence on input. The presentation here is strictly more general, and has a few notational advantages: it makes explicit the dependence of A on x and \u03b7 but not y, and lets us avoid tedious bookkeeping involving natural and mean parameterizations. [13]\n\n(a) A self-normalizable set S for fixed \u03b7: the solutions (x_1, x_2) to A(x, \u03b7) = 0 with \u03b7\u22a4T(x, y) = \u03b7_y\u22a4x and \u03b7 = {(\u22121, 1), (\u22121, \u22122)}. The set forms a smooth one-dimensional manifold bounded on either side by hyperplanes normal to (\u22121, 1) and (\u22121, \u22122).\n\n(b) Sets of approximately normalizing parameters \u03b7 for fixed p(x): solutions (\u03b7_1, \u03b7_2) to E[A(x, \u03b7)\u00b2] = \u03b4\u00b2 with T(x, y) = (x + y, \u2212xy), y \u2208 {\u22121, 1} and p(x) uniform on {1, 2}.
For a given upper bound on normalizer variance, the feasible set of parameters is nonconvex, and grows as \u03b4 increases.\n\nFigure 1: Self-normalizable data distributions and parameter sets.\n\nExample. Suppose\n\nS = {log 2, \u2212log 2}, Y = {\u22121, 1}, T(x, y) = [xy, 1], \u03b7 = (1, log(2/5)).\n\nThen for either x \u2208 S,\n\nA(x, \u03b7) = log(e^{log 2 + log(2/5)} + e^{\u2212log 2 + log(2/5)}) = log((2/5)(2 + 1/2)) = 0,\n\nand \u03b7 is self-normalizing with respect to S.\nIt is also easy to choose parameters that do not result in a self-normalized distribution, and in fact to construct a target distribution which cannot be self-normalized:\nExample. Suppose\n\nX = {(1, 0), (0, 1), (1, 1)}, Y = {\u22121, 1}, T(x, y) = (x_1 y, x_2 y, 1).\n\nThen there is no \u03b7 such that A(x, \u03b7) = 0 for all x, and A(x, \u03b7) is constant if and only if \u03b7 = 0.\n\nAs previously motivated, downstream uses of these models may be robust to small errors resulting from improper normalization, so it would be useful to generalize this definition of normalizable distributions to distributions that are only approximately normalizable. Exact normalizability of the conditional distribution is a deterministic statement\u2014there either does or does not exist some x that violates the constraint. In Figure 1a, for example, it suffices to have a single x off of the indicated surface to make a set non-normalizable. Approximate normalizability, by contrast, is inherently a probabilistic statement, involving a distribution p(x) over inputs. Note carefully that we are attempting to represent p(y|x) but have no representation of (or control over) p(x), and that approximate normalizability depends on p(x) but not p(y|x).\nInformally, if some input violates the self-normalization constraint by a large margin, but occurs only very infrequently, there is no problem; instead we are concerned with expected deviation.
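The first worked example above can be checked numerically; this is a pure-Python sketch of that specific example (T(x, y) = [xy, 1], \u03b7 = (1, log(2/5))), nothing beyond what the text already states:

```python
import math

def A(x, eta):
    """Log-normalizer for T(x, y) = [x*y, 1] with y in {-1, +1}."""
    return math.log(sum(math.exp(eta[0] * x * y + eta[1]) for y in (-1, 1)))

eta = (1.0, math.log(2.0 / 5.0))

# On the self-normalizable set S = {log 2, -log 2}, the normalizer vanishes.
for x in (math.log(2), -math.log(2)):
    assert abs(A(x, eta)) < 1e-12

# An input off S violates the constraint, as Definition 2 requires.
assert abs(A(1.0, eta)) > 0.01
```

The second example in the text (three inputs on the Boolean square) can be probed the same way, and no choice of \u03b7 zeroes the normalizer at all three points simultaneously.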
It is also at this stage that the distinction between penalization of the normalizer vs. log-normalizer becomes important. The normalizer is necessarily bounded below by zero (so overestimates might appear much worse than underestimates), while the log-normalizer is unbounded in both directions. For most applications we are concerned with log probabilities and log-odds ratios, for which an expected normalizer close to zero is just as bad as one close to infinity. Thus the log-normalizer is the natural choice of quantity to penalize.\nDefinition 3 (Approximately self-normalized models). The log-linear distribution p_\u03b7(y|x) is \u03b4-approximately normalized with respect to a distribution p(x) over X if E[A(X, \u03b7)\u00b2] < \u03b4\u00b2. In this case we say that p(x) is \u03b4-approximately self-normalizable, and \u03b7 is \u03b4-approximately self-normalizing.\n\nThe sets of \u03b4-approximately self-normalizing parameters for a fixed input distribution and feature function are depicted in Figure 1b. Unlike self-normalizable sets of inputs, self-normalizing and approximately self-normalizing sets of parameters may have complex geometry.\nThroughout this paper, we will assume that vectors of sufficient statistics T(x, y) have \u2113\u2082 norm at most R, natural parameter vectors \u03b7 have \u2113\u2082 norm at most B (that is, they are Ivanov-regularized), and that vectors of both kinds lie in R^d. Finally, we assume that all input vectors have a constant feature\u2014in particular, that x_0 = 1 for every x (with corresponding weight \u03b7_0).\u00b3\nThe first question we must answer is whether the problem of training self-normalized models is feasible at all\u2014that is, whether there exist any exactly self-normalizable data distributions p(x), or at least \u03b4-approximately self-normalizable distributions for small \u03b4. Section 3 already gave an example of an exactly normalizable distribution.
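The geometry of Figure 1a can also be reproduced numerically. Under the caption's assumptions (two output classes with weight vectors \u03b7 = {(\u22121, 1), (\u22121, \u22122)}), solving A(x, \u03b7) = 0 for x_1 in terms of x_2 gives the manifold in closed form; a short sketch verifies this:

```python
import math

# One weight vector per output class, as in the Figure 1a caption.
etas = [(-1.0, 1.0), (-1.0, -2.0)]

def A(x):
    """Log-normalizer log sum_y exp(eta_y . x) for the two-class setup."""
    return math.log(sum(math.exp(e[0] * x[0] + e[1] * x[1]) for e in etas))

# A(x) = 0  <=>  exp(-x1) * (exp(x2) + exp(-2*x2)) = 1
#          <=>  x1 = log(exp(x2) + exp(-2*x2)).
for x2 in (-1.0, 0.0, 0.5, 2.0):
    x1 = math.log(math.exp(x2) + math.exp(-2.0 * x2))
    assert abs(A((x1, x2))) < 1e-12   # every such point lies on the set S_eta
```

Sweeping x2 and plotting the resulting (x1, x2) pairs traces out the one-dimensional manifold shown in Figure 1a.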
In fact, there are large classes of both exactly and approximately normalizable distributions.\nObservation. Given some fixed \u03b7, consider the set S_\u03b7 = {x \u2208 X : A(x, \u03b7) = 0}. Any distribution p(x) supported on S_\u03b7 is normalizable. Additionally, every self-normalizable distribution is characterized by at least one such \u03b7.\n\nThis definition provides a simple geometric characterization of self-normalizable distributions. An example solution set is shown in Figure 1a. More generally, if y is discrete and T(x, y) consists of |Y| repetitions of a fixed feature function t(x) (as in Figure 1a), then we can write\n\nA(x, \u03b7) = log \u2211_{y \u2208 Y} exp(\u03b7_y\u22a4 t(x)).   (5)\n\nProvided \u03b7_y\u22a4 t(x) is convex in x for each \u03b7_y, the level sets of A as a function of x form the boundaries of convex sets. In particular, exactly normalizable sets are always the boundaries of convex regions, as in the simple example in Figure 1a.\nWe do not, in general, expect real-world datasets to be supported on the precise class of self-normalizable surfaces. Nevertheless, it is very often observed that data of practical interest lie on other low-dimensional manifolds within their embedding feature spaces. Thus we can ask whether it is sufficient for a target distribution to be well-approximated by a self-normalizing one. We begin by constructing an appropriate measurement of the quality of this approximation.\nDefinition 4 (Closeness). An input distribution p(x) is D-close to a set S if\n\nE[ inf_{x\u2217 \u2208 S} sup_{y \u2208 Y} ||T(X, y) \u2212 T(x\u2217, y)||\u2082 ] \u2264 D.   (6)\n\nIn other words, p(x) is D-close to S if a random sample from p is no more than a distance D from S in expectation. Now we can relate the quality of this approximation to the level of self-normalization achieved. Generalizing a result from [5], we have:\nProposition 1. Suppose p(x) is D-close to {x : A(x, \u03b7) = 0}.
Then p(x) is BD-approximately self-normalizable (recalling that ||\u03b7||\u2082 \u2264 B).\n\n\u00b3It will occasionally be instructive to consider the special case where X is the Boolean hypercube, and we will explicitly note where this assumption is made. Otherwise all results apply to general distributions, both continuous and discrete.\n\n(Proofs for this section may be found in Appendix A.)\nThe intuition here is that data distributions that place most of their mass in feature space close to normalizable sets are approximately normalizable on the same scale.\n\n4 Normalization and model accuracy\n\nSo far our discussion has concerned the problem of finding conditional distributions that self-normalize, without any concern for how well they actually perform at modeling the data. Here the relationship between the approximately self-normalized distribution and the true distribution p(y|x) (which we have so far ignored) is essential. Indeed, if we are not concerned with making a good model it is always trivial to make a normalized one\u2014simply take \u03b7 = 0 and then scale \u03b7_0 appropriately! We ultimately desire both good self-normalization and good data likelihood, and in this section we characterize the tradeoff between maximizing data likelihood and satisfying a self-normalization constraint.\nWe achieve this characterization by measuring the likelihood gap between the classical maximum likelihood estimator, and the MLE subject to a self-normalization constraint. Specifically, given pairs ((x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)), let \u2113(\u03b7|x, y) = \u2211_i log p_\u03b7(y_i|x_i).
Then define\n\n\u02c6\u03b7 = argmax_\u03b7 \u2113(\u03b7|x, y)   (7)\n\n\u02c6\u03b7_\u03b4 = argmax_{\u03b7 : V(\u03b7) \u2264 \u03b4\u00b2} \u2113(\u03b7|x, y)   (8)\n\n(where V(\u03b7) = (1/n) \u2211_i A(x_i, \u03b7)\u00b2).\nWe would like to obtain a bound on the likelihood gap, which we define as the quantity\n\n\u2206\u2113(\u02c6\u03b7, \u02c6\u03b7_\u03b4) = (1/n) (\u2113(\u02c6\u03b7|x, y) \u2212 \u2113(\u02c6\u03b7_\u03b4|x, y)).   (9)\n\nWe claim:\nTheorem 2. Suppose Y has finite measure. Then asymptotically as n \u2192 \u221e,\n\n\u2206\u2113(\u02c6\u03b7, \u02c6\u03b7_\u03b4) \u2264 (1 \u2212 \u03b4/(R||\u02c6\u03b7||\u2082)) E[KL(p_{\u02c6\u03b7}(\u00b7|X) || Unif)].   (10)\n\n(Proofs for this section may be found in Appendix B.)\nThis result lower-bounds the likelihood at \u02c6\u03b7_\u03b4 by explicitly constructing a scaled version of \u02c6\u03b7 that satisfies the self-normalization constraint. Specifically, if \u03b7 is chosen so that normalizers are penalized for distance from log \u00b5(Y) (e.g. the logarithm of the number of classes in the finite case), then any increase in \u03b7 along the span of the data is guaranteed to increase the penalty. From here it is possible to choose an \u03b1 \u2208 (0, 1) such that \u03b1\u02c6\u03b7 satisfies the constraint. The likelihood at \u03b1\u02c6\u03b7 is necessarily less than \u2113(\u02c6\u03b7_\u03b4|x, y), and can be used to obtain the desired lower bound.\nThus at one extreme, distributions close to uniform can be self-normalized with little loss of likelihood. What about the other extreme\u2014distributions \u201cas far from uniform as possible\u201d? With suitable assumptions about the form of p_{\u02c6\u03b7}(y|x), we can use the same construction of a self-normalizing parameter to achieve an alternative characterization for distributions that are close to deterministic:\nProposition 3.
Suppose that X is a subset of the Boolean hypercube, Y is finite, and T(x, y) is the conjunction of each element of x with an indicator on the output class. Suppose additionally that on every input x, p_{\u02c6\u03b7}(y|x) makes a unique best prediction\u2014that is, for each x \u2208 X, there exists a unique y\u2217 \u2208 Y such that whenever y \u2260 y\u2217, \u03b7\u22a4T(x, y\u2217) > \u03b7\u22a4T(x, y). Then\n\n\u2206\u2113(\u02c6\u03b7, \u02c6\u03b7_\u03b4) \u2264 b (||\u03b7||\u2082 \u2212 \u03b4/R)\u00b2 e^{\u2212c\u03b4/R}   (11)\n\nfor distribution-dependent constants b and c.\n\nThis result is obtained by representing the constrained likelihood with a second-order Taylor expansion about the true MLE. All terms in the likelihood gap vanish except for the remainder; this can be upper-bounded by ||\u02c6\u03b7_\u03b4||\u2082\u00b2 times the largest eigenvalue of the feature covariance matrix at \u02c6\u03b7_\u03b4, which in turn is bounded by e^{\u2212c\u03b4/R}.\nThe favorable rate we obtain for this case indicates that \u201call-nonuniform\u201d distributions are also an easy class for self-normalization. Together with Theorem 2, this suggests that hard distributions must have some mixture of uniform and nonuniform predictions for different inputs. This is supported by the results in Section 5.\nThe next question is whether there is a corresponding lower bound; that is, whether there exist any conditional distributions for which all nearby distributions are provably hard to self-normalize. The existence of a direct analog of Theorem 2 remains an open problem, but we make progress by developing a general framework for analyzing normalizer variance.\nOne key issue is that while likelihoods are invariant to certain changes in the natural parameters, the log normalizers (and therefore their variance) are far from invariant. We therefore focus on equivalence classes of natural parameters, as defined below.
Throughout, we will assume a fixed distribution p(x) on the inputs x.\nDefinition 5 (Equivalence of parameterizations). Two natural parameter values \u03b7 and \u03b7\u2032 are said to be equivalent (with respect to an input distribution p(x)), denoted \u03b7 \u223c \u03b7\u2032, if\n\np_\u03b7(y|X) = p_{\u03b7\u2032}(y|X)   a.s. p(x).\n\nWe can then define the optimal log normalizer variance for the distribution associated with a natural parameter value.\nDefinition 6 (Optimal variance). We define the optimal log normalizer variance of the log-linear model associated with a natural parameter value \u03b7 by\n\nV\u2217(\u03b7) = inf_{\u03b7\u2032 \u223c \u03b7} Var_{p(x)}[A(X, \u03b7\u2032)].\n\nWe now specialize to the case where Y is finite with |Y| = K and where T : Y \u00d7 X \u2192 R^{Kd} satisfies\n\nT(k, x)_{k\u2032j} = \u03b4_{kk\u2032} x_j.\n\nThis is an important special case that arises, for example, in multi-way logistic regression. In this setting, we can show that despite the fundamental non-identifiability of the model, the variance can still be shown to be high under any parameterization of the distribution.\nTheorem 4. Let X = {0, 1}^d and let the input distribution p(x) be uniform on X. There exists an \u03b7_0 \u2208 R^{Kd} such that for \u03b7 = \u03b1\u03b7_0, \u03b1 > 0,\n\nV\u2217(\u03b7) \u2265 ||\u03b7||\u2082\u00b2 / (32d(d \u2212 1)) \u2212 4K e^{\u2212\u221a(1 \u2212 1/d) ||\u03b7||\u2082 / (2(d \u2212 1))} ||\u03b7||\u2082.\n\n5 Experiments\n\nThe high-level intuition behind the results in the preceding section can be summarized as follows: 1) for predictive distributions that are in expectation high-entropy or low-entropy, self-normalization results in a relatively small likelihood gap; 2) for mixtures of high- and low-entropy distributions, self-normalization may result in a large likelihood gap.
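For intuition, the variance appearing in Definition 6 and Theorem 4 is easy to estimate by Monte Carlo. The sketch below uses a generic random \u03b7 (our own choice, not the \u03b7_0 constructed in the theorem), so the estimated Var[A(X, \u03b7)] is only an upper bound on V\u2217(\u03b7) over the equivalence class:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, n = 6, 4, 20000   # hypercube dimension, classes, Monte Carlo samples

# Multiclass parameterization of Theorem 4: class k scores x with its own eta_k.
etas = rng.normal(size=(K, d))
X = rng.integers(0, 2, size=(n, d)).astype(float)   # p(x) uniform on {0,1}^d

scores = X @ etas.T                                  # (n, K) class scores
m = scores.max(axis=1, keepdims=True)
A = (m + np.log(np.exp(scores - m).sum(axis=1, keepdims=True))).ravel()

var_A = A.var()   # Monte Carlo estimate of Var_p(x)[A(X, eta)]
assert var_A > 0.0   # a generic eta does not normalize the whole hypercube
```

Searching over the equivalence class (shifting each class score by a shared function of x) would tighten this toward V\u2217(\u03b7); the theorem says that for its particular \u03b7_0 no such reparameterization can drive the variance below the stated bound.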
More generally, we expect that an increased tolerance for normalizer variance will be associated with a decreased likelihood gap.\nIn this section we provide experimental confirmation of these predictions. We begin by generating a set of random sparse feature vectors, and an initial weight vector \u03b7_0. In order to produce a sequence of label distributions that smoothly interpolate between low-entropy and high-entropy, we introduce a temperature parameter \u03c4, and for various settings of \u03c4 draw labels from p_{\u03c4\u03b7}. We then fit a self-normalized model to these training pairs. In addition to the synthetic data, we compare our results to empirical data [3] from a self-normalized language model.\nFigure 2a plots the tradeoff between the likelihood gap and the error in the normalizer, under various distributions (characterized by their KL from uniform). Here the tradeoff between self-normalization and model accuracy can be seen\u2014as the normalization constraint is relaxed, the likelihood gap decreases.\n\n(a) Normalization / likelihood tradeoff. As the normalization constraint \u03b4 is relaxed, the likelihood gap \u2206\u2113 decreases. Lines marked \u201cKL=\u201d are from synthetic data; the line marked \u201cLM\u201d is from [3].\n\n(b) Likelihood gap as a function of expected divergence from the uniform distribution.
As predicted by theory, the likelihood gap increases, then decreases, as predictive distributions become more peaked.\n\nFigure 2: Experimental results\n\nFigure 2b shows how the likelihood gap varies as a function of the quantity E[KL(p_\u03b7(\u00b7|X) || Unif)]. As predicted, it can be seen that both extremes of this quantity result in small likelihood gaps, while intermediate values result in large likelihood gaps.\n\n6 Conclusions\n\nMotivated by the empirical success of self-normalizing parameter estimation procedures, we have attempted to establish a theoretical basis for the understanding of such procedures. We have characterized both self-normalizable distributions, by constructing provably easy examples, and training procedures, by bounding the loss of likelihood associated with self-normalization.\nWhile we have addressed many of the important first-line theoretical questions around self-normalization, this study of the problem is by no means complete. We hope this family of problems will attract further study in the larger machine learning community; toward that end, we provide the following list of open questions:\n\n1. How else can the approximately self-normalizable distributions be characterized? The class of approximately normalizable distributions we have described is unlikely to correspond perfectly to real-world data. We expect that Proposition 1 can be generalized to other parametric classes, and relaxed to accommodate spectral or sparsity conditions.\n\n2. Are the upper bounds in Theorem 2 or Proposition 3 tight? Our constructions involve relating the normalization constraint to the \u2113\u2082 norm of \u03b7, but in general some parameters can have very large norm and still give rise to almost-normalized distributions.\n\n3. Do corresponding lower bounds exist?
While it is easy to construct exactly self-normalizable distributions (which suffer no loss of likelihood), we have empirical evidence that hard distributions also exist. It would be useful to lower-bound the loss of likelihood in terms of some simple property of the target distribution.\n\n4. Is the hard distribution in Theorem 4 stable? This is related to the previous question. The existence of high-variance distributions is less worrisome if such distributions are fairly rare. If the lower bound falls off quickly as the given construction is perturbed, then the associated distribution may still be approximately self-normalizable with a good rate.\n\nWe have already seen that new theoretical insights in this domain can translate directly into practical applications. Thus, in addition to their inherent theoretical interest, answers to each of these questions might be applied directly to the training of approximately self-normalized models in practice. We expect that self-normalization will find increasingly many applications, and we hope the results in this paper provide a first step toward a complete theoretical and empirical understanding of self-normalization in log-linear models.\n\nAcknowledgments The authors would like to thank Robert Nishihara for useful discussions. JA and MR are supported by NSF Graduate Fellowships, and MR is additionally supported by the Fannie and John Hertz Foundation Fellowship.\n\nReferences\n[1] Lafferty, J. D.; McCallum, A.; Pereira, F. C. N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001; pp 282\u2013289.\n[2] McCullagh, P.; Nelder, J. A. Generalized linear models; Chapman and Hall, 1989.\n[3] Devlin, J.; Zbib, R.; Huang, Z.; Lamar, T.; Schwartz, R.; Makhoul, J. Fast and robust neural network joint models for statistical machine translation. Proceedings of the Annual Meeting of the Association for Computational Linguistics.
2014.\n[4] Vaswani, A.; Zhao, Y.; Fossum, V.; Chiang, D. Decoding with large-scale neural language models improves translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2013.\n[5] Andreas, J.; Klein, D. When and why are log-linear models self-normalizing? Proceedings of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics. 2014.\n[6] Bartlett, P. L. IEEE Transactions on Information Theory 1998, 44, 525\u2013536.\n[7] Anthony, M.; Bartlett, P. Neural network learning: theoretical foundations; Cambridge University Press, 2009.\n[8] Gutmann, M.; Hyv\u00e4rinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. Proceedings of the International Conference on Artificial Intelligence and Statistics. 2010; pp 297\u2013304.\n[9] O\u2019Hagan, A. Journal of Statistical Planning and Inference 1991, 29, 245\u2013260.\n[10] Chen, Y.; Welling, M.; Smola, A. Proceedings of the Conference on Uncertainty in Artificial Intelligence 2010, 109\u2013116.\n[11] Doucet, A.; De Freitas, N.; Gordon, N. An introduction to sequential Monte Carlo methods; Springer, 2001.\n[12] Morin, F.; Bengio, Y. Proceedings of the International Conference on Artificial Intelligence and Statistics 2005, 246.\n[13] Yang, E.; Allen, G.; Liu, Z.; Ravikumar, P. K. Graphical models via generalized linear models. Advances in Neural Information Processing Systems. 2012; pp 1358\u20131366.", "award": [], "sourceid": 1059, "authors": [{"given_name": "Jacob", "family_name": "Andreas", "institution": "UC Berkeley"}, {"given_name": "Maxim", "family_name": "Rabinovich", "institution": "UC Berkeley"}, {"given_name": "Michael", "family_name": "Jordan", "institution": "UC Berkeley"}, {"given_name": "Dan", "family_name": "Klein", "institution": "UC Berkeley"}]}