{"title": "On Controllable Sparse Alternatives to Softmax", "book": "Advances in Neural Information Processing Systems", "page_first": 6422, "page_last": 6432, "abstract": "Converting an n-dimensional vector to a probability distribution over n objects is a commonly used component in many machine learning tasks like multiclass classification, multilabel classification, attention mechanisms etc. For this, several probability mapping functions have been proposed and employed in literature such as softmax, sum-normalization, spherical softmax, and sparsemax, but there is very little understanding in terms how they relate with each other. Further, none of the above formulations offer an explicit control over the degree of sparsity. To address this, we develop a unified framework that encompasses all these formulations as special cases. This framework ensures simple closed-form solutions and existence of sub-gradients suitable for learning via backpropagation. Within this framework, we propose two novel sparse formulations, sparsegen-lin and sparsehourglass, that seek to provide a control over the degree of desired sparsity. We further develop novel convex loss functions that help induce the behavior of aforementioned formulations in the multilabel classification setting, showing improved performance. We also demonstrate empirically that the proposed formulations, when used to compute attention weights, achieve better or comparable performance on standard seq2seq tasks like neural machine translation and abstractive summarization.", "full_text": "On Controllable Sparse Alternatives to Softmax\n\nAnirban Laha1\u2020\u2217\n\nSaneem A. Chemmengath1\u2217 Priyanka Agrawal1 Mitesh M. Khapra2\n\n1 IBM Research\n\n2 Robert Bosch Center for DS and AI, and Dept of CSE, IIT Madras\n\nKarthik Sankaranarayanan1\n\nHarish G. 
Ramaswamy2\n\nAbstract\n\nConverting an n-dimensional vector to a probability distribution over n objects\nis a commonly used component in many machine learning tasks like multiclass\nclassi\ufb01cation, multilabel classi\ufb01cation, attention mechanisms etc. For this, several\nprobability mapping functions have been proposed and employed in literature such\nas softmax, sum-normalization, spherical softmax, and sparsemax, but there is very\nlittle understanding in terms how they relate with each other. Further, none of the\nabove formulations offer an explicit control over the degree of sparsity. To address\nthis, we develop a uni\ufb01ed framework that encompasses all these formulations as\nspecial cases. This framework ensures simple closed-form solutions and existence\nof sub-gradients suitable for learning via backpropagation. Within this framework,\nwe propose two novel sparse formulations, sparsegen-lin and sparsehourglass, that\nseek to provide a control over the degree of desired sparsity. We further develop\nnovel convex loss functions that help induce the behavior of aforementioned\nformulations in the multilabel classi\ufb01cation setting, showing improved performance.\nWe also demonstrate empirically that the proposed formulations, when used to\ncompute attention weights, achieve better or comparable performance on standard\nseq2seq tasks like neural machine translation and abstractive summarization.\n\n1\n\nIntroduction\n\nVarious widely used probability mapping functions such as sum-normalization, softmax, and spherical\nsoftmax enable mapping of vectors from the euclidean space to probability distributions. The\nneed for such functions arises in multiple problem settings like multiclass classi\ufb01cation [1, 2],\nreinforcement learning [3, 4] and more recently in attention mechanism [5, 6, 7, 8, 9] in deep neural\nnetworks, amongst others. 
Even though softmax is the most prevalent approach amongst them, it\nhas a shortcoming in that its outputs are composed of only non-zeroes and is therefore ill-suited\nfor producing sparse probability distributions as output. The need for sparsity is motivated by\nparsimonious representations [10] investigated in the context of variable or feature selection. Sparsity\nin the input space offers bene\ufb01ts of model interpretability as well as computational bene\ufb01ts whereas\non the output side, it helps in \ufb01ltering large output spaces, for example in large scale multilabel\nclassi\ufb01cation settings [11]. While there have been several such mapping functions proposed in\nliterature such as softmax [4], spherical softmax [12, 13] and sparsemax [14, 15], very little is\nunderstood in terms of how they relate to each other and their theoretical underpinnings. Further, for\nsparse formulations, often there is a need to trade-off interpretability for accuracy, yet none of these\nformulations offer an explicit control over the desired degree of sparsity.\nMotivated by these shortcomings, in this paper, we introduce a general formulation encompassing all\nsuch probability mapping functions which serves as a unifying framework to understand individual\nformulations such as hardmax, softmax, sum-normalization, spherical softmax and sparsemax as spe-\ncial cases, while at the same time helps in providing explicit control over degree of sparsity. With the\n\u2217Equal contribution by the \ufb01rst two authors. 
Corresponding authors: {anirlaha,saneem.cg}@in.ibm.com.\n\n\u2020This author was also brie\ufb02y associated with IIT Madras during the course of this work.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\faim of controlling sparsity, we propose two new formulations: sparsegen-lin and sparsehourglass.\nOur framework also ensures simple closed-form solutions and existence of sub-gradients similar to\nsoftmax. This enables them to be employed as activation functions in neural networks which require\ngradients for backpropagation and are suitable for tasks that require sparse attention mechanism\n[14]. We also show that the sparsehourglass formulation can extend from translation invariance to\nscale invariance with an explicit control, thus helping to achieve an adaptive trade-off between these\ninvariance properties as may be required in a problem domain.\nWe further propose new convex loss functions which can help induce the behaviour of the above\nproposed formulations in a multilabel classi\ufb01cation setting. These loss functions are derived from a\nviolation of constraints required to be satis\ufb01ed by the corresponding mapping functions. This way of\nde\ufb01ning losses leads to an alternative loss de\ufb01nition for even the sparsemax function [14]. Through\nexperiments we are able to achieve improved results in terms of sparsity and prediction accuracies for\nmultilabel classi\ufb01cation.\nLastly, the existence of sub-gradients for our proposed formulations enable us to employ them to\ncompute attention weights [5, 7] in natural language generation tasks. The explicit controls provided\nby sparsegen-lin and sparsehourglass help to achieve higher interpretability while providing better\nor comparable accuracy scores. A recent work [16] had also proposed a framework for attention;\nhowever, they had not explored the effect of explicit sparsity controls. 
To summarize, our contributions are the following:\n\n\u2022 A general framework of formulations producing probability distributions with connections to hardmax, softmax, sparsemax, spherical softmax and sum-normalization (Sec.3).\n\u2022 New formulations like sparsegen-lin and sparsehourglass as special cases of the general framework which enable explicit control over the desired degree of sparsity (Sec.3.2, 3.5).\n\u2022 A formulation, sparsehourglass, which enables us to adaptively trade off between the translation and scale invariance properties through explicit control (Sec.3.5).\n\u2022 Convex multilabel loss functions corresponding to all the above formulations proposed by us. These enable us to achieve improvements in the multilabel classification problem (Sec.4).\n\u2022 Experiments for sparse attention on natural language generation tasks showing comparable or better accuracy scores while achieving higher interpretability (Sec.5).\n\n2 Preliminaries and Problem Setup\n\nNotations: For K \u2208 Z+, we denote [K] := {1, . . . , K}. Let z \u2208 RK be a real vector denoted as z = {z1, . . . , zK}. 1 and 0 denote the vectors of ones and zeros respectively. Let \u2206K\u22121 := {p \u2208 RK | 1T p = 1, p \u2265 0} be the (K \u2212 1)-dimensional simplex and p \u2208 \u2206K\u22121 be denoted as p = {p1, . . . , pK}. We use [t]+ := max{0, t}. Let A(z) := {k \u2208 [K] | zk = maxj zj} be the set of maximal elements of z.\nDefinition: A probability mapping function is a map \u03c1 : RK \u2192 \u2206K\u22121 which transforms a score vector z to a categorical distribution (denoted as \u03c1(z) = {\u03c11(z), . . . , \u03c1K(z)}). The support of \u03c1(z) is S(z) := {j \u2208 [K] | \u03c1j(z) > 0}. Such mapping functions can be used as activation functions for machine learning models. 
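As a concrete instance of such an activation, the familiar softmax mapping from RK to the simplex can be sketched in a few lines. This is a minimal reference implementation of ours (not tied to any particular library); the max-shift is a standard numerical-stability trick that does not change the output.

```python
import math

def softmax(z):
    """Map a score vector z in R^K to a point on the (K-1)-simplex.

    Shifting by max(z) avoids overflow in exp() and leaves the result
    unchanged, since softmax is translation invariant.
    """
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([1.0, 2.0, 3.0])
```

Every output lies strictly inside the simplex: all coordinates are positive and sum to one, which is exactly the full-support limitation discussed next.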
Some known probability mapping functions are listed below:\n\n\u2022 Softmax function is defined as: \u03c1i(z) = exp(zi) / \u2211j\u2208[K] exp(zj), \u2200i \u2208 [K]. Softmax is easy to evaluate and differentiate and its logarithm is the negative log-likelihood loss [14].\n\u2022 Spherical softmax - another function which is simple to compute and derivative-friendly: \u03c1i(z) = zi\u00b2 / \u2211j\u2208[K] zj\u00b2, \u2200i \u2208 [K]. Spherical softmax is not defined for \u2211j\u2208[K] zj\u00b2 = 0.\n\u2022 Sum-normalization: \u03c1i(z) = zi / \u2211j\u2208[K] zj, \u2200i \u2208 [K]. It is not used in practice much as the mapping is not defined if zi < 0 for any i \u2208 [K] and for \u2211j\u2208[K] zj = 0.\n\nThe above mapping functions are limited to producing distributions with full support. Consider there is a single value of zi significantly higher than the rest: its desired probability should be exactly 1, while the rest should be grounded to zero (hardmax mapping). Unfortunately, that does not happen unless the rest of the values tend to \u2212\u221e (in case of softmax) or are equal to 0 (in case of spherical softmax and sum-normalization).\n\n(a) Softmax\n\n(b) Sparsemax\n\n(c) Sum-normalization\n\nFigure 1: Visualization of probability mapping functions in two dimensions. The contour plots show values of \u03c11(z). The green line segment connecting (1,0) and (0,1) is the 1-dimensional probability simplex. Each contour (here a line) contains points in the R2 plane which have the same \u03c11(z), the exact value marked on the contour line.\n\n\u2022 Sparsemax, recently introduced by [14], circumvents this issue by projecting the score vector z onto a simplex [15]: \u03c1(z) = argmin_{p\u2208\u2206K\u22121} \u2016p \u2212 z\u2016\u2082\u00b2. 
This offers an intermediate solution between softmax (no zeroes) and hardmax (zeroes except for the highest value).\nThe contour plots for softmax, sparsemax and sum-normalization in two dimensions (z \u2208 R2) are shown in Fig.1. The contours of sparsemax are concentrated over a narrow region, while the remaining region corresponds to sparse solutions. For softmax, the contour plots are spread over the whole real plane, confirming the absence of sparse solutions. Sum-normalization is not defined outside the first quadrant, and yet, the contours cover the whole quadrant, denying sparse solutions.\n\n3 Sparsegen Activation Framework\n\nDefinition: We propose a generic probability mapping function inspired from the sparsemax formulation (in Sec.2) which we call sparsegen:\n\n\u03c1(z) = sparsegen(z; g, \u03bb) = argmin_{p\u2208\u2206K\u22121} \u2016p \u2212 g(z)\u2016\u2082\u00b2 \u2212 \u03bb\u2016p\u2016\u2082\u00b2   (1)\n\nwhere g : RK \u2192 RK is a component-wise transformation function applied on z. Here gi(z) denotes the i-th component of g(z). The coefficient \u03bb < 1 controls the regularization strength. For \u03bb > 0, the second term becomes the negative L-2 norm of p. In addition to minimizing the error on projection of g(z), Eq.1 tries to maximize the norm, which encourages larger probability values for some indices, hence moving the rest to zero. The above formulation has a closed-form solution (see App. A.1 for solution details), which can be computed in O(K) time using the modified randomized median finding algorithm as followed in [15] while solving the projection onto simplex problem.\nThe choices of both \u03bb and g can help control the cardinality of the support set S(z), thus influencing the sparsity of \u03c1(z). \u03bb can help produce distributions with support ranging from full (uniform distribution when \u03bb \u2192 1\u2212) to minimum (hardmax when \u03bb \u2192 \u2212\u221e). 
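The closed-form solution of Eq.1 can be sketched by reducing sparsegen to a projection onto the simplex. The sort-based O(K log K) routine below is our own illustration for clarity (the paper uses an O(K) randomized median-finding algorithm with the same output); function names are ours.

```python
import numpy as np

def simplex_projection(v):
    """Euclidean projection of v onto the probability simplex (sparsemax)."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                      # scores sorted in descending order
    cssv = np.cumsum(u) - 1.0                 # cumulative sums minus 1
    k = np.arange(1, len(v) + 1)
    support = u - cssv / k > 0                # coordinates kept in the support
    tau = cssv[support][-1] / k[support][-1]  # threshold tau(v)
    return np.maximum(v - tau, 0.0)

def sparsegen(z, g=lambda v: v, lam=0.0):
    """Eq.1, computed via sparsegen(z; g, lam) = sparsemax(g(z) / (1 - lam))."""
    assert lam < 1.0
    return simplex_projection(g(np.asarray(z, dtype=float)) / (1.0 - lam))
```

With lam = 0 and g the identity this is exactly sparsemax; pushing lam toward 1 shrinks the non-sparse region and yields sparser outputs.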
Let S(z, \u03bb1) denote the support of sparsegen for a particular coefficient \u03bb1. It is easy to show: if |S(z, \u03bb1)| > |A(z)|, then there exists \u03bbx > \u03bb1 for an x < |S(z, \u03bb1)| such that |S(z, \u03bbx)| = x. In other words, if a sparser solution exists, it can be obtained by changing \u03bb. The following result gives an alternate interpretation for \u03bb:\nResult: The sparsegen formulation (Eq.1) is equivalent to the following, when \u03b3 = 1/(1 \u2212 \u03bb) (where \u03b3 > 0): \u03c1(z) = argmin_{p\u2208\u2206K\u22121} \u2016p \u2212 \u03b3g(z)\u2016\u2082\u00b2.\nThe above result says that scaling g(z) by \u03b3 = 1/(1 \u2212 \u03bb) is equivalent to applying the negative L-2 norm with \u03bb coefficient when considering projection of g(z) onto the simplex. Thus, we can write:\n\nsparsegen(z; g, \u03bb) = sparsemax(g(z)/(1 \u2212 \u03bb)).   (2)\n\nThis equivalence helps us borrow results from sparsemax to establish various properties for sparsegen.\nJacobian of sparsegen: To train a model with sparsegen as an activation function, it is essential to compute its Jacobian matrix denoted by J\u03c1(z) := [\u2202\u03c1i(z)/\u2202zj]i,j for using gradient-based optimization techniques. We use Eq.2 and results from [14] (Sec.2.5) to derive the Jacobian for sparsegen by applying the chain rule of derivatives:\n\nJsparsegen(z) = Jsparsemax(g(z)/(1 \u2212 \u03bb)) \u00d7 Jg(z)/(1 \u2212 \u03bb)   (3)\n\nwhere Jg(z) is the Jacobian of g(z) and Jsparsemax(z) = Diag(s) \u2212 ssT/|S(z)|. Here s is an indicator vector whose ith entry is 1 if i \u2208 S(z). 
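For the common identity case g(z) = z, the Jacobian of Eq.3 reduces to (1/(1 \u2212 \u03bb)) (Diag(s) \u2212 ssT/|S(z)|), and s can be read off the output distribution itself. A minimal sketch (the function name is ours):

```python
import numpy as np

def sparsegen_lin_jacobian(p, lam=0.0):
    """Jacobian of sparsegen with g(z) = z, evaluated at an output p = sparsegen(z):
    (1 / (1 - lam)) * (Diag(s) - s s^T / |S(z)|), where s indicates the support."""
    p = np.asarray(p, dtype=float)
    s = (p > 0).astype(float)                 # indicator vector of the support S(z)
    J = np.diag(s) - np.outer(s, s) / s.sum()
    return J / (1.0 - lam)
```

Note that each row sums to zero (probability mass is conserved), and coordinates outside the support receive zero gradient.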
Diag(s) is a matrix created using s as its diagonal entries.\n\n3.1 Special cases of Sparsegen: sparsemax, softmax and spherical softmax\n\nApart from \u03bb, one can control the sparsity of sparsegen through g(z) as well. Moreover, certain choices of \u03bb and g(z) help us establish connections with existing activation functions (see Sec.2). The following cases illustrate these connections (more details in App.A.2):\nExample 1: g(z) = exp(z) (sparsegen-exp): exp(z) denotes element-wise exponentiation of z, that is gi(z) = exp(zi). Sparsegen-exp reduces to softmax when \u03bb = 1 \u2212 \u2211j\u2208[K] exp(zj), as it results in S(z) = [K] as per Eq.14 in App.A.2.\nExample 2: g(z) = z\u00b2 (sparsegen-sq): z\u00b2 denotes element-wise square of z. As observed for sparsegen-exp, when \u03bb = 1 \u2212 \u2211j\u2208[K] zj\u00b2, sparsegen-sq reduces to spherical softmax.\nExample 3: g(z) = z, \u03bb = 0: This case is equivalent to the projection onto the simplex objective adopted by sparsemax. Setting \u03bb \u2260 0 leads to the regularized extension of sparsemax as seen next.\n\n3.2 Sparsegen-lin: Extension of sparsemax\n\n\u03c1(z) = sparsegen-lin(z) = argmin_{p\u2208\u2206K\u22121} \u2016p \u2212 z\u2016\u2082\u00b2 \u2212 \u03bb\u2016p\u2016\u2082\u00b2   (4)\n\nThe negative L-2 norm regularizer in Eq.4 helps to control the width of the non-sparse region (see Fig.2 for the region plot in two dimensions). In the extreme case of \u03bb \u2192 1\u2212, the whole real plane maps to the sparse region whereas for \u03bb \u2192 \u2212\u221e, the whole real plane renders non-sparse solutions.\n\nFigure 2: Sparsegen-lin: Region plot for z = {z1, z2} \u2208 R2 when \u03bb = 0.5. \u03c1(z) is sparse in the red region, whereas non-sparse in the blue region. The dashed red lines depict the boundaries between sparse and non-sparse regions. For \u03bb = 0.5, points like z or z\u2032 are mapped onto the sparse points A or B. 
Whereas for sparsemax (\u03bb = 0), they fall in the blue region (the boundaries of sparsemax are shown by lighter dashed red lines passing through A and B). The point z0 lies in the blue region, producing a non-sparse solution. Interestingly, more points like z\u2032\u2032 and z\u2032\u2032\u2032, which currently lie in the red region, can fall in the blue region for some \u03bb < 0. For \u03bb > 0.5, the blue region becomes smaller, as more points map to sparse solutions.\n\n3.3 Desirable properties for probability mapping functions\n\nLet us enumerate below some properties a probability mapping function \u03c1 should possess:\n1. Monotonicity: If zi \u2265 zj, then \u03c1i(z) \u2265 \u03c1j(z). This does not always hold true for sum-normalization and spherical softmax when one or both of zi, zj is less than zero. For sparsegen, both gi(z) and gi(z)/(1 \u2212 \u03bb) should be monotonic increasing, which implies \u03bb needs to be less than 1.\n2. Full domain: The domain of \u03c1 should include negatives as well as positives, i.e. Dom(\u03c1) = RK. Sum-normalization does not satisfy this as it is not defined if some dimensions of z are negative.\n3. Existence of Jacobian: This enables usage in any training algorithm where gradient-based optimization is used. For sparsegen, the Jacobian of g(z) should be easily computable (Eq.3).\n4. Lipschitz continuity: The derivative of the function should be upper bounded. This is important for the stability of the optimization technique used in training. Softmax and sparsemax are 1-Lipschitz whereas spherical softmax and sum-normalization are not Lipschitz continuous. Eq.3 shows the Lipschitz constant for sparsegen is upper bounded by 1/(1 \u2212 \u03bb) times the Lipschitz constant for g(z).\n5. 
Translation invariance: Adding a constant c to every element in z should not change the output\ndistribution : \u03c1(z + c1) = \u03c1(z). Sparsemax and softmax are translation invariant whereas sum-\nnormalization and spherical softmax are not. Sparsegen is translation invariant iff for all c \u2208 R there\nexist a \u02dcc \u2208 R such that g(z + c1) = g(z) + \u02dcc1. This follows from Eq.2.\n6. Scale invariance: Multiplying every element in z by a constant c should not change the output\ndistribution : \u03c1(cz) = \u03c1(z). Sum-normalization and spherical softmax satisfy this property whereas\nsparsemax and softmax are not scale invariant. Sparsegen is scale invariant iff for all c \u2208 R there\nexist a \u02c6c \u2208 R such that g(cz) = g(z) + \u02c6c1. This also follows from Eq.2.\n7. Permutation invariance: If there is a permutation matrix P , then \u03c1(P z) = P \u03c1(z). For sparsegen,\nthe precondition is that g(z) should be a permutation invariant function.\n8. Idempotence: \u03c1(z) = z, \u2200z \u2208 \u2206K\u22121. This is true for sparsemax and sum-normalization. For\nsparsegen, it is true if and only if g(z) = z, \u2200z \u2208 \u2206K\u22121 and \u03bb = 0.\nIn the next section, we discuss in detail about the scale invariance and translation invariance properties\nand propose a new formulation achieving a trade-off between these properties.\n\n3.4 Trading off Translation and Scale Invariances\n\nAs mentioned in Sec.1, scale invariance is a desirable property to have for probability mapping\nfunctions. Consider applying sparsemax on two vectors z = {0, 1}, \u00afz = {100, 101} \u2208 R2. Both\nwould result in {0, 1} as the output. However, ideally \u00afz should have mapped to a distribution\nnearer to {0.5, 0.5} instead. Scale invariant functions will not have such a problem. Among the\nexisting functions, only sum-normalization and spherical softmax satisfy scale invariance. 
While sum-normalization is only defined for positive values of z, spherical softmax is not monotonic or Lipschitz continuous. In addition, both of these methods are also not defined for z = 0, thus making them unusable for practical purposes. It can be shown that any probability mapping function with the scale invariance property will not be Lipschitz continuous and will be undefined for z = 0.\nA recent work [13] pointed out the lack of clarity over whether scale invariance is more desired than the translation invariance property of softmax and sparsemax. We take this into account to achieve a trade-off between the two invariances. In the usual scale invariance property, scaling a vector z essentially results in another vector along the line connecting z and the origin. That resultant vector also has the same output probability distribution as the original vector (see Sec.3.3). We propose to scale the vector z along the line connecting it with a point (we call it the anchor point henceforth) other than the origin, yet achieving the same output. Interestingly, the choice of this anchor point can act as a control to help achieve a trade-off between scale invariance and translation invariance.\nLet a vector z be projected onto the simplex along the line connecting it with an anchor point q = (\u2212q, . . . ,\u2212q) \u2208 RK, for q > 0 (see Fig.3a for K = 2). We choose g(z) as the point where this line intersects with the affine hyperplane 1T \u02c6z = 1 containing the probability simplex. Thus, g(z) is set equal to \u03b1z + (1 \u2212 \u03b1)q, where \u03b1 = (1 + Kq)/(\u2211i zi + Kq) (we denote it as \u03b1(z) as \u03b1 is a function of z). From the translation invariance property of sparsemax, the resultant mapping function can be shown equivalent to considering g(z) = \u03b1(z)z in Eq.1. 
We refer to this variant of sparsegen assuming g(z) = \u03b1(z)z and \u03bb = 0 as sparsecone.\nInterestingly, when the parameter q = 0, sparsecone reduces to sum-normalization (scale invariant) and when q \u2192 \u221e, it is equivalent to sparsemax (translation invariant). Thus the parameter q acts as a control taking sparsecone from scale invariance to translation invariance. At intermediate values (that is, for 0 < q < \u221e), sparsecone is approximately scale invariant with respect to the anchor point q. However, it is undefined for z where \u2211i zi < \u2212Kq (beyond the black dashed line shown in Fig.3a). In this case the denominator term of \u03b1(z) (that is, \u2211i zi + Kq) becomes negative, destroying the monotonicity of \u03b1(z)z. Also note that sparsecone is not Lipschitz continuous.\n\n(a) Sparsecone\n\n(b) Sparsehourglass\n\nFigure 3: (a) Sparsecone: The vector z maps to a point p on the simplex along the line connecting it to the point q = (\u2212q,\u2212q). Here we consider q = 1. The red region corresponds to the sparse region whereas blue covers the non-sparse region. (b) Sparsehourglass: For the vector z\u2032 in the positive half-space, the mapping to the solution p\u2032 can be obtained similarly as sparsecone. For the vector z in the negative half-space, a mirror point \u02dcz needs to be found, which leads to the solution p.\n\n3.5 Proposed Solution: Sparsehourglass\n\nTo alleviate the issue of monotonicity when \u2211i zi < \u2212Kq, we choose to restrict applying sparsecone only to the positive half-space HK+ := {z \u2208 RK | \u2211i zi \u2265 0}. For the remaining negative half-space HK\u2212 := {z \u2208 RK | \u2211i zi < 0}, we define a mirror point function to transform to a point in HK+, on which sparsecone can be applied. Thus the solution for a point in the negative half-space is given by the solution of its corresponding mirror point in the positive half-space. 
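The sparsecone mapping of Sec.3.4 can be sketched directly from its definition: rescale z by \u03b1(z) and project onto the simplex. The sort-based projection and the function names are our own illustration.

```python
import numpy as np

def simplex_projection(v):
    """Euclidean projection of v onto the probability simplex (sparsemax)."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    support = u - cssv / k > 0
    tau = cssv[support][-1] / k[support][-1]
    return np.maximum(v - tau, 0.0)

def sparsecone(z, q=1.0):
    """Sparsecone: project alpha(z) * z onto the simplex, with
    alpha(z) = (1 + K q) / (sum(z) + K q); defined only for sum(z) > -K q."""
    z = np.asarray(z, dtype=float)
    K = len(z)
    assert z.sum() > -K * q, "sparsecone is undefined in this region"
    alpha = (1.0 + K * q) / (z.sum() + K * q)
    return simplex_projection(alpha * z)
```

At q = 0 the mapping is scale invariant (it coincides with sum-normalization on positive inputs), while large q approaches the translation-invariant sparsemax.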
This mirror point function has some necessary properties (see App. A.4 for details), which can be satisfied by defining m: mi(z) = zi \u2212 2(\u2211j zj)/K, \u2200i \u2208 [K]. Interestingly, this can alternatively be achieved by choosing g(z) = \u02c6\u03b1(z)z, where \u02c6\u03b1 is a slight modification of \u03b1 given by \u02c6\u03b1(z) = (1 + Kq)/(|\u2211i zi| + Kq). This leads to the definition of a new probability mapping function (which we call sparsehourglass):\n\n\u03c1(z) = sparsehourglass(z) = argmin_{p\u2208\u2206K\u22121} \u2016p \u2212 ((1 + Kq)/(|\u2211i\u2208[K] zi| + Kq)) z\u2016\u2082\u00b2   (5)\n\nLike sparsecone, sparsehourglass also reduces to sparsemax when q \u2192 \u221e. Similarly, q = 0 for sparsehourglass leads to a corrected version of sum-normalization (we call it sum normalization++), which works for the negative domain as well unlike the original version defined in Sec.2. Another advantage of sparsehourglass is that it is Lipschitz continuous with Lipschitz constant equal to (1 + 1/Kq) (proof details in App.A.5). Table 1 summarizes all the formulations seen in this paper and compares them against the various important properties mentioned in Sec.3.3. Note that sparsehourglass is the only probability mapping function which satisfies all the properties. Even though it does not satisfy both scale invariance and translation invariance simultaneously, it is possible to achieve these separately through different values of the q parameter, which can be decided independent of z.\n\n4 Sparsity Inducing Loss Functions for Multilabel Classification\n\nAn important usage of such sparse probability mapping functions is in the output mapping of multilabel classification models. Typical multilabel problems have hundreds of possible labels or tags, but any single instance has only a few tags [17]. 
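Sparse mappings like sparsehourglass (Eq.5 above) supply exactly this kind of sparse output over a large label space. A minimal sketch, assuming the \u02c6\u03b1 form above (the sort-based projection and function names are ours):

```python
import numpy as np

def simplex_projection(v):
    """Euclidean projection of v onto the probability simplex (sparsemax)."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    support = u - cssv / k > 0
    tau = cssv[support][-1] / k[support][-1]
    return np.maximum(v - tau, 0.0)

def sparsehourglass(z, q=1.0):
    """Eq.5: project alpha_hat(z) * z onto the simplex, with
    alpha_hat(z) = (1 + K q) / (|sum(z)| + K q); defined on all of R^K for q > 0."""
    z = np.asarray(z, dtype=float)
    K = len(z)
    alpha = (1.0 + K * q) / (abs(z.sum()) + K * q)
    return simplex_projection(alpha * z)
```

Unlike sparsecone, the absolute value in the denominator keeps the mapping defined (and monotonic) on the negative half-space as well, including z = 0.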
Thus, a function which takes in a vector in RK and outputs a sparse version of the vector is of great value.\nGiven training instances (xi, yi) \u2208 X \u00d7 {0, 1}K, we need to find a model function f : X \u2192 RK that produces a score vector over the label space, which on application of \u03c1 : RK \u2192 \u2206K\u22121 (the sparse probability mapping function in question) leads to correct prediction of the label vector yi.\n\nTable 1: Summary of the properties satisfied by probability mapping functions. Here \u2714 denotes \u2018satisfied in general\u2019, \u2718 signifies \u2018not satisfied\u2019 and (\u2713) says \u2018satisfied for some constant parameter independent of z\u2019. Note that PERMUTATION INV and existence of JACOBIAN are satisfied by all.\n\nFUNCTION | IDEMPOTENCE | MONOTONIC | TRANSLATION INV | SCALE INV | FULL DOMAIN | LIPSCHITZ\nSUM NORMALIZATION | \u2714 | \u2718 | \u2718 | \u2714 | \u2718 | \u221e\nSPHERICAL SOFTMAX | \u2718 | \u2718 | \u2718 | \u2714 | \u2718 | \u221e\nSOFTMAX | \u2718 | \u2714 | \u2714 | \u2718 | \u2714 | 1\nSPARSEMAX | \u2714 | \u2714 | \u2714 | \u2718 | \u2714 | 1\nSPARSEGEN-LIN | (\u2713) | \u2714 | \u2714 | \u2718 | \u2714 | 1/(1 \u2212 \u03bb)\nSPARSEGEN-EXP | \u2718 | \u2714 | \u2718 | \u2718 | \u2714 | \u221e\nSPARSEGEN-SQ | \u2718 | \u2718 | \u2718 | \u2718 | \u2714 | \u221e\nSPARSECONE | \u2714 | \u2718 | (\u2713) | (\u2713) | \u2718 | \u221e\nSPARSEHOURGLASS | \u2714 | \u2714 | (\u2713) | (\u2713) | \u2714 | (1 + 1/Kq)\nSUM NORMALIZATION++ | \u2714 | \u2714 | \u2718 | \u2714 | \u2714 | \u221e\n\nDefine \u03b7i := yi/\u2016yi\u20161, which is a probability distribution over the labels. 
Considering a loss function L : \u2206K\u22121 \u00d7 \u2206K\u22121 \u2192 [0,\u221e) and representing zi := f(xi), a natural way for training using \u03c1 is to find a function f : X \u2192 RK that minimises the error R(f) below over a hypothesis class F:\n\nR(f) = \u2211_{i=1}^{M} L(\u03c1(zi), \u03b7i)   (6)\n\nIn the prediction phase, for a test instance x, one can simply predict the non-zero elements in the vector \u03c1(f\u2217(x)) where f\u2217 is the minimizer of the above training objective R(f).\nFor all cases where \u03c1 produces sparse probability distributions, one can show that the training objective R above is highly non-convex in f, even for the case of a linear hypothesis class F. However, if we remove the strict requirement of the training objective depending on \u03c1(z) (as in Eq.6), and use a loss function which can work with z directly, a convex objective is possible. We, thus, design a loss function L : RK \u00d7 \u2206K\u22121 \u2192 [0,\u221e) such that L(z, \u03b7) = 0 only if \u03c1(z) = \u03b7.\nTo derive such a loss function, we proceed by enumerating a list of constraints which will be satisfied by the zero-loss region in the K-dimensional space of the vector z. For sparsehourglass, the closed-form solution is given by \u03c1i(z) = [\u02c6\u03b1(z)zi \u2212 \u03c4(z)]+ (see App.A.1). This enables us to list down the following constraints for zero loss: (1) \u02c6\u03b1(z)(zi \u2212 zj) = 0, \u2200i, j | \u03b7i = \u03b7j \u2260 0, and (2) \u02c6\u03b1(z)(zi \u2212 zj) \u2265 \u03b7i, \u2200i, j | \u03b7i \u2260 0, \u03b7j = 0. 
The value of the loss when any such constraint is violated is simply determined by piece-wise linear functions, which lead to the following loss function for sparsehourglass:\n\nLsparsehg,hinge(z, \u03b7) = \u2211_{i,j: \u03b7i\u22600, \u03b7j\u22600} |zi \u2212 zj| + \u2211_{i,j: \u03b7i\u22600, \u03b7j=0} max{\u03b7i/\u02c6\u03b1(z) \u2212 (zi \u2212 zj), 0}.   (7)\n\nIt can be easily proved that the above loss function is convex in z using the properties that both the sum of convex functions and the maximum of convex functions result in convex functions. The above strategy can also be applied to derive a multilabel loss function for sparsegen-lin:\n\nLsparsegen-lin,hinge(z, \u03b7) = \u2211_{i,j: \u03b7i\u22600, \u03b7j\u22600} |zi \u2212 zj|/(1 \u2212 \u03bb) + \u2211_{i,j: \u03b7i\u22600, \u03b7j=0} max{\u03b7i \u2212 (zi \u2212 zj)/(1 \u2212 \u03bb), 0}.   (8)\n\nThe above loss function for sparsegen-lin can be used to derive a multilabel loss for sparsemax by setting \u03bb = 0 (which we use in our experiments for \u201csparsemax+hinge\u201d). The piecewise-linear losses proposed in this section based on violation of constraints are similar to the well-known hinge loss, whereas the sparsemax loss proposed by [14] (which we use in our experiments for \u201csparsemax+huber\u201d) has connections with Huber loss. We have shown through our experiments in the next section that hinge loss variants for multilabel classification work better than Huber loss variants.\n\n5 Experiments and Results\n\nHere we present two sets of evaluations for the proposed probability mapping functions and loss functions. First, we apply them on the multilabel classification task studying the effect of varying label density in a synthetic dataset, followed by evaluation on real multilabel datasets. 
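As a concrete sketch of the hinge-style losses of Sec.4, the sparsegen-lin variant (Eq.8) might be implemented as below; the naive O(K\u00b2) double loop is for clarity only, and the function name is ours.

```python
import numpy as np

def sparsegen_lin_hinge(z, eta, lam=0.0):
    """Multilabel hinge loss of Eq.8; lam = 0 gives the 'sparsemax+hinge' loss."""
    z = np.asarray(z, dtype=float)
    eta = np.asarray(eta, dtype=float)
    on = np.nonzero(eta != 0)[0]              # labels with eta_i != 0
    off = np.nonzero(eta == 0)[0]             # labels with eta_j == 0
    loss = 0.0
    for i in on:
        for j in on:                          # pull scores of true labels together
            loss += abs(z[i] - z[j]) / (1.0 - lam)
        for j in off:                         # enforce a margin over false labels
            loss += max(eta[i] - (z[i] - z[j]) / (1.0 - lam), 0.0)
    return loss
```

Being a sum of absolute values and maxima of affine functions of z, the loss is convex, and it vanishes exactly when the true labels share a score that clears the margin over every false label.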
Next, we report\nresults of sparse attention on NLP tasks of machine translation and abstractive summarization.\n\n7\n\n\f5.1 Multilabel Classi\ufb01cation\n\nWe compare the proposed activations and loss functions for multilabel classi\ufb01cation with both\nsynthetic and real datasets. We use a linear prediction model followed by a loss function during\ntraining. During test time, the corresponding activation is directly applied to the output of the linear\nmodel. We consider the following activation-loss pairs: (1) softmax+log: KL-divergence loss applied\non top of softmax outputs, (2) sparsemax+huber: multilabel classi\ufb01cation method from [14], (3)\nsparsemax+hinge: hinge loss as in Eq.8 with \u03bb = 0 is used during training compared to Huber loss\nin (2), and (4) sparsehg+hinge: for sparsehourglass (in short sparsehg), loss in Eq.7 is used during\ntraining. Please note as we have a convex system of equations due to an underlying linear prediction\nmodel, applying Eq.8 in training and applying sparsegen-lin activation during test time produces the\nsame result as sparsemax+hinge. For softmax+log, we used a threshold p0, above which a label is\npredicted to be \u201con\u201d. For others, a label is predicted \u201con\u201d if its predicted probability is non-zero. We\ntune hyperparams q for sparsehg+hinge and p0 for softmax+log using validation set.\n\n5.1.1 Synthetic dataset with varied label density\n\nWe use scikit-learn for generating synthetic datasets (details in App.A.6). We conducted experiments\nin three settings: (1) varying mean number of labels per instance, (2) varying range of number of\nlabels and, (3) varying document length. In the \ufb01rst setting, we study the ability to model varying\nlabel sparsity. We draw number of labels N uniformly at random from set {\u00b5 \u2212 1, \u00b5, \u00b5 + 1}\nwhere \u00b5 \u2208 {2. . . 9} is mean number of labels. 
For the second setting, we study how these models perform when label density varies across instances. We draw N uniformly at random from the set {5 − r, . . . , 5 + r}; the parameter r controls the variation of label density across instances. In the third setting, we experiment with different document lengths: we draw N from a Poisson distribution with mean 5 and vary the document length L from 200 to 2000. In the first two settings, the document length was fixed at 2000. We report the F-score2 and the Jensen-Shannon divergence (JSD) on the test set.

Fig.5 shows the F-score on the test sets in the three experimental settings. We observe that sparsemax+hinge and sparsehg+hinge consistently perform better than sparsemax+huber in all three cases, especially when the label distributions are sparser. Note that sparsehg+hinge performs better than sparsemax+hinge in most cases. From the empirical comparison between sparsemax+hinge and sparsemax+huber, we conclude that the proposed hinge loss variants produce sparser and more accurate predictions. This observation is also supported by our analysis of sparsity in the outputs (see Fig.4, where lower curves indicate sparser outputs; this analysis corresponds to the setting of Fig.5a), where we find that the hinge loss variants encourage more sparsity. We also find that the hinge loss variants do better than softmax+log in terms of the JSD metric (details in App.A.7.1).

5.1.2 Real Multilabel datasets

We further experiment with three real datasets3 for multilabel classification: Birds, Scene and Emotions. The experimental setup and baselines are the same as those for the synthetic dataset described in Sec.5.1.1. For each dataset, we consider only examples with at least one label. Results are shown in Table 3 in App.A.7.2.
All methods give comparable results on these benchmark datasets.

[Figure 4: Sparsity comparison]

5.2 Sparse Attention for Natural Language Generation

Here we demonstrate the effectiveness of our formulations experimentally on two natural language generation tasks: neural machine translation and abstractive sentence summarization. The purpose of these experiments is twofold: first, to show the effectiveness of our proposed formulations sparsegen-lin and sparsehourglass in the attention framework on these tasks; and second, to show that control over sparsity leads to enhanced interpretability. We borrow the encoder-decoder architecture with attention (see Fig.7 in App.A.8). We replace the softmax function in attention with our proposed functions, using sparsemax as a baseline. In addition, we use another baseline in which we tune the temperature of the softmax function.

2 Micro-averaged F1 score.
3 Available at http://mulan.sourceforge.net/datasets-mlc.html

[Figure 5: F-score on multilabel classification synthetic dataset. Panels: (a) varying mean #labels, (b) varying range #labels, (c) varying document length.]

Table 2: Sparse Attention Results. Here R-1, R-2 and R-L denote the ROUGE scores.

                      |  TRANSLATION  |                    SUMMARIZATION
Attention             | FR-EN   EN-FR | Gigaword            | DUC 2003            | DUC 2004
                      | BLEU    BLEU  | R-1    R-2    R-L   | R-1    R-2    R-L   | R-1    R-2    R-L
softmax               | 36.00   36.73 | 34.80  16.64  32.15 | 27.95  9.22   24.54 | 30.68  12.24  28.12
softmax (with temp.)  | 36.38   37.27 | 35.00  17.15  32.57 | 27.78  8.91   24.53 | 31.64  12.89  28.51
sparsemax             | 36.08   35.78 | 34.89  16.88  32.20 | 27.29  8.48   24.04 | 30.80  12.01  28.04
sparsegen-lin         | 36.63   36.63 | 35.90  17.57  33.37 | 28.13  9.00   24.89 | 31.85  12.28  29.13
sparsehg              | 35.78   35.69 | 35.14  16.91  32.66 | 27.39  9.11   24.53 | 30.64  12.05  28.18
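The temperature-tuned softmax baseline simply rescales the scores before normalization. A minimal sketch (our own illustration; the temperature T is the tuned hyperparameter) is:

```python
import numpy as np

def softmax_with_temperature(z, temp=1.0):
    """Softmax over z / temp. Lower temp gives sharper (more peaked)
    attention weights, but the output never becomes exactly sparse."""
    s = z / temp
    s = s - s.max()        # shift for numerical stability
    e = np.exp(s)
    return e / e.sum()
```

Unlike sparsemax, sparsegen-lin, and sparsehourglass, even a very low temperature only concentrates the attention weights; no weight is ever exactly zero.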
More details are provided in App.A.8.
Experimental Setup: We adopt the same experimental setup as [16], on top of the OpenNMT framework [18]. We varied only the control parameters required by our formulations. The models for the different control parameters were trained for 13 epochs, and the epoch with the best validation accuracy was chosen as the best model for that setting. The best control parameter for a formulation was again selected based on validation accuracy. For all our formulations, we report the test scores corresponding to the best control parameter in Table 2.
Neural Machine Translation: We consider the FR-EN language pair from the NMT-Benchmark project and perform experiments in both directions. We see (refer to Table 2) that sparsegen-lin surpasses the BLEU scores of softmax and sparsemax for FR-EN translation, whereas the sparsehg formulation yields comparable performance. Quantitatively, these metrics show that adding explicit controls does not come at the cost of accuracy. In addition, it is encouraging to see (refer to Fig.8 in App.A.8) that increasing λ for sparsegen-lin leads to crisper and hence more interpretable attention heatmaps (the fewer activated columns per row, the better). We have also analyzed the average sparsity of heatmaps over the whole test set and indeed observed that larger λ leads to sparser attention.
Abstractive Summarization: We next run our experiments on the abstractive summarization datasets Gigaword, DUC 2003 and DUC 2004, and report ROUGE metrics. The results in Table 2 show that sparsegen-lin stands out in performance, with the other formulations following closely and remaining comparable to softmax and sparsemax. It is also encouraging to see that all the models trained on Gigaword generalize well to the other datasets, DUC 2003 and DUC 2004.
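The effect of λ on sparsity can be illustrated numerically. Using the closed-form reduction of sparsegen-lin to a sparsemax projection of the scaled scores z/(1 − λ) (valid for λ < 1; completing the square in the sparsegen objective gives this form), a small sketch shows that increasing λ shrinks the support of the attention weights; the function names here are our own:

```python
import numpy as np

def sparsemax(z):
    # Euclidean projection onto the probability simplex, as in [14].
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    cssv = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cssv
    tau = (cssv[support][-1] - 1.0) / k[support][-1]
    return np.maximum(z - tau, 0.0)

def sparsegen_lin(z, lam=0.0):
    # Assumed closed form: sparsemax applied to the scaled scores z / (1 - lam).
    return sparsemax(z / (1.0 - lam))

scores = np.array([1.0, 0.5, 0.1, -0.2])
for lam in (0.0, 0.5, 0.8):
    p = sparsegen_lin(scores, lam)
    print(lam, np.count_nonzero(p))   # the support shrinks as lam grows
```

At λ = 0 this is exactly sparsemax; as λ approaches 1, attention collapses onto the top-scoring positions, which is the behavior behind the crisper heatmaps discussed above.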
Here again the λ control leads to more interpretable attention heatmaps, as shown in Fig.9 in App.A.8, and we have also observed the same with the average sparsity of heatmaps over the test set.

6 Conclusions and Future Work

In this paper, we investigated a family of sparse probability mapping functions, unifying them under a general framework. This framework helped us understand connections to existing formulations in the literature such as softmax, spherical softmax and sparsemax. Our proposed probability mapping functions provide explicit control over sparsity, enabling higher interpretability. These functions have closed-form solutions, and their sub-gradients can be computed easily. We have also proposed convex loss functions, which helped us achieve better accuracies in the multilabel classification setting. Applying these formulations to compute sparse attention weights for NLP tasks also yielded improvements, in addition to providing control for enhanced interpretability. As future work, we intend to apply these sparse attention formulations to efficient read and write operations in memory networks [19]. We would also like to investigate the application of these sparse formulations in knowledge distillation and reinforcement learning settings.

Acknowledgements

We thank our colleagues at IBM, Abhijit Mishra, Disha Shrivastava, and Parag Jain, for the numerous discussions and suggestions which helped in shaping this paper.

References

[1] M. Aly. Survey on multiclass classification methods. Neural networks, pages 1–9, 2005.

[2] John S. Bridle.
Probabilistic interpretation of feedforward classi\ufb01cation network outputs, with\nrelationships to statistical pattern recognition. In Fran\u00e7oise Fogelman Souli\u00e9 and Jeanny H\u00e9rault,\neditors, Neurocomputing, pages 227\u2013236, Berlin, Heidelberg, 1990. Springer Berlin Heidelberg.\n[3] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press,\n\nCambridge, MA, USA, 1st edition, 1998.\n\n[4] B. Gao and L. Pavel. On the properties of the softmax function with application in game theory\n\nand reinforcement learning. ArXiv e-prints, 2017.\n\n[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly\nlearning to align and translate. In Proceedings of the International Conference on Learning\nRepresentations (ICLR), San Diego, CA, 2015.\n\n[6] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov,\nRichard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation\nwith visual attention. In Francis R. Bach and David M. Blei, editors, ICML, volume 37 of JMLR\nWorkshop and Conference Proceedings, pages 2048\u20132057. JMLR.org, 2015.\n\n[7] K. Cho, A. Courville, and Y. Bengio. Describing multimedia content using attention-based\nencoder-decoder networks. IEEE Transactions on Multimedia, 17(11):1875\u20131886, Nov 2015.\n[8] Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive\nsentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in\nNatural Language Processing, pages 379\u2013389, Lisbon, Portugal, September 2015. Association\nfor Computational Linguistics.\n\n[9] Preksha Nema, Mitesh Khapra, Anirban Laha, and Balaraman Ravindran. Diversity driven\nattention model for query-based abstractive summarization. 
In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, August 2017. Association for Computational Linguistics.

[10] Francis Bach, Rodolphe Jenatton, and Julien Mairal. Optimization with Sparsity-Inducing Penalties (Foundations and Trends(R) in Machine Learning). Now Publishers Inc., Hanover, MA, USA, 2011.

[11] Mohammad S. Sorower. A literature survey on algorithms for multi-label learning. 2010.

[12] Pascal Vincent, Alexandre de Brébisson, and Xavier Bouthillier. Efficient exact gradient update for training deep networks with very large sparse targets. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 1108–1116, Cambridge, MA, USA, 2015. MIT Press.

[13] Alexandre de Brébisson and Pascal Vincent. An exploration of softmax alternatives belonging to the spherical loss family. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

[14] André F. T. Martins and Ramón F. Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 1614–1623. JMLR.org, 2016.

[15] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 272–279, New York, NY, USA, 2008. ACM.

[16] Vlad Niculae and Mathieu Blondel. A regularized framework for sparse and structured neural attention. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R.
Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3340–3350. Curran Associates, Inc., 2017.

[17] Ashish Kapoor, Raajay Viswanathan, and Prateek Jain. Multilabel classification using Bayesian compressed sensing. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2645–2653. Curran Associates, Inc., 2012.

[18] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. OpenNMT: Open-source toolkit for neural machine translation. CoRR, abs/1701.02810, 2017.

[19] Sainbayar Sukhbaatar, arthur szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc., 2015.