{"title": "Sigsoftmax: Reanalysis of the Softmax Bottleneck", "book": "Advances in Neural Information Processing Systems", "page_first": 286, "page_last": 296, "abstract": "Softmax is an output activation function for modeling categorical probability distributions in many applications of deep learning. However, a recent study revealed that softmax can be a bottleneck of representational capacity of neural networks in language modeling (the softmax bottleneck). In this paper, we propose an output activation function for breaking the softmax bottleneck without additional parameters. We re-analyze the softmax bottleneck from the perspective of the output set of log-softmax and identify the cause of the softmax bottleneck. On the basis of this analysis, we propose sigsoftmax, which is composed of a multiplication of an exponential function and sigmoid function. Sigsoftmax can break the softmax bottleneck. The experiments on language modeling demonstrate that sigsoftmax and mixture of sigsoftmax outperform softmax and mixture of softmax, respectively.", "full_text": "Sigsoftmax: Reanalysis of the Softmax Bottleneck\n\nSekitoshi Kanai\n\nNTT Software Innovation Center, Keio Univ.\n\nkanai.sekitoshi@lab.ntt.co.jp\n\nYuki Yamanaka\n\nNTT Secure Platform Laboratories\nyamanaka.yuki@lab.ntt.co.jp\n\nYasuhiro Fujiwara\n\nNTT Software Innovation Center\n\nfujiwara.yasuhiro@lab.ntt.co.jp\n\nShuichi Adachi\n\nKeio Univ.\n\nadachi.shuichi@appi.keio.ac.jp\n\nAbstract\n\nSoftmax is an output activation function for modeling categorical probability distri-\nbutions in many applications of deep learning. However, a recent study revealed\nthat softmax can be a bottleneck of representational capacity of neural networks in\nlanguage modeling (the softmax bottleneck). In this paper, we propose an output\nactivation function for breaking the softmax bottleneck without additional parame-\nters. 
We re-analyze the softmax bottleneck from the perspective of the output set\nof log-softmax and identify the cause of the softmax bottleneck. On the basis of\nthis analysis, we propose sigsoftmax, which is composed of a multiplication of an\nexponential function and sigmoid function. Sigsoftmax can break the softmax\nbottleneck. The experiments on language modeling demonstrate that sigsoftmax and\nmixture of sigsoftmax outperform softmax and mixture of softmax, respectively.\n\n1 Introduction\n\nDeep neural networks are used in many recent applications such as image recognition [17, 13],\nspeech recognition [12], and natural language processing [24, 32, 7]. High representational capacity\nand generalization performance of deep neural networks are achieved by many layers, activation\nfunctions and regularization methods [26, 13, 31, 14, 10]. Although various model architectures\nare built in the above applications, softmax is commonly used as an output activation function for\nmodeling categorical probability distributions [4, 10, 13, 24, 32, 7, 12]. For example, in language\nmodeling, softmax is employed for representing the probability of the next word over the vocabulary\nin a sentence. When using softmax, we train the model by minimizing negative log-likelihood with a\ngradient-based optimization method. We can easily calculate the gradient of negative log-likelihood\nwith softmax, and it is numerically stable [3, 4].\nEven though softmax is widely used, few studies have attempted to improve its modeling performance\n[6, 8]. This is because deep neural networks with softmax are believed to have a universal\napproximation property. However, Yang et al. [34] recently revealed that softmax can be a bottleneck of\nrepresentational capacity in language modeling. They showed that the representational capacity of the\nsoftmax-based model is restricted by the length of the hidden vector in the output layer. 
In language\nmodeling, the length of the hidden vector is much smaller than the vocabulary size. As a result, the\nsoftmax-based model cannot completely learn the true probability distribution, and this is called\nthe softmax bottleneck. For breaking the softmax bottleneck, Yang et al. [34] proposed mixture of\nsoftmax (MoS) that mixes multiple softmax outputs. However, this analysis of softmax does not\nexplicitly show why softmax can be a bottleneck. Furthermore, MoS is an additional layer or mixture\nmodel rather than an alternative activation function to softmax: MoS has learnable parameters and\nhyper-parameters.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this paper, we propose a novel output activation function for breaking the softmax bottleneck\nwithout additional parameters. We re-analyze the softmax bottleneck from the point of view of the\noutput set (range) of a function and show why softmax can be a bottleneck. This paper reveals that\n(i) the softmax bottleneck occurs because softmax uses only exponential functions for nonlinearity\nand (ii) the range of log-softmax is a subset of the vector space whose dimension depends on the\ndimension of the input space. As an alternative activation function to softmax, we explore the output\nfunctions composed of rectified linear unit (ReLU) and sigmoid functions. In addition, we propose\nsigsoftmax, which is composed of a multiplication of an exponential function and sigmoid function.\nSigsoftmax has desirable properties for output activation functions, e.g., the calculation of its gradient\nis numerically stable. More importantly, sigsoftmax can break the softmax bottleneck, and the range\nof softmax can be a subset of that of sigsoftmax. Experiments on language modeling demonstrate\nthat sigsoftmax can break the softmax bottleneck and outperform softmax. 
In addition, mixture of\nsigsoftmax outperforms MoS.\n\n2 Preliminaries\n\n2.1 Softmax\n\nDeep neural networks use softmax in learning categorical distributions. For example, in\nclassification, a neural network uses softmax to learn the probability distribution over M classes y \u2208 R^M\nconditioned on the input x as P\u03b8(y|x) where \u03b8 is a parameter. Let h(x) \u2208 R^d be a hidden vector\nand W \u2208 R^{M\u00d7d} be a weight matrix in the output layer. The output of softmax fs(\u00b7) represents the\nconditional probability of the i-th class as follows:\n\n$$P_\\theta(y_i|x) = [f_s(W h(x))]_i = \\frac{\\exp([W h(x)]_i)}{\\sum_{m=1}^{M} \\exp([W h(x)]_m)}, \\tag{1}$$\n\nwhere [fs]i represents the i-th element of fs. We can see that each element of fs is bounded from\nzero to one since the output of exponential functions is non-negative in eq. (1). The summation of\nall elements of fs is obviously one. From these properties, we can regard output of the softmax\ntrained by minimizing negative log-likelihood as a probability [4, 21]. If we only need the most likely\nlabel, we can find such a label by comparing elements of W h(x) without the calculations of softmax\nfs(W h(x)) once we have trained the softmax-based model. This is because exponential functions\nin softmax are monotonically increasing.\nTo train the softmax-based models, negative log-likelihood (cross entropy) is used as a loss function.\nSince the loss function is minimized by stochastic gradient descent (SGD), the properties of the\ngradients of functions are very important [26, 28, 9, 15]. One advantage of softmax is that the gradient\nof log-softmax is easily calculated as follows [3, 4, 1, 8]:\n\n$$\\frac{\\partial [\\log f_s(z)]_i}{\\partial z_j} = \\begin{cases} 1 - [f_s(z)]_j & \\text{if } j = i, \\\\\\\\ -[f_s(z)]_j & \\text{if } j \\neq i, \\end{cases} \\tag{2}$$\n\nwhere z = W h(x). Whereas the derivative of the logarithm can cause a division by zero since\nd log(z)/dz = 1/z, the derivative of log-softmax cannot. 
As a result, softmax is numerically stable.\n\n2.2 Softmax bottleneck\n\nIn recurrent neural network (RNN) language modeling, given a corpus of tokens Y = (Y1, . . . , YT),\nthe joint probability P(Y) is factorized as P(Y) = \u220f_t P(Yt|Y<t) = \u220f_t P(Yt|Xt), where Xt =\nY<t is referred to as the context of the conditional probability. Output of softmax fs(W h(Xt))\nlearns P(Yt|Xt) where (a) h(Xt) \u2208 R^d is the hidden vector corresponding to the context Xt and\n(b) W is a weight matrix in the output layer (embedding layer). A natural language is assumed as a\nfinite set of pairs of xt and P\u2217(Y|xt) as L = {(x1, P\u2217(Y|x1)), . . . , (xN, P\u2217(Y|xN))}, where N is\nthe number of possible contexts. The objective of language modeling is to learn a model distribution\nP\u03b8(Y|X) parameterized by \u03b8 to match the true data distribution P\u2217(Y|X). Note that upper- and\nlower-case letters are used for variables and constants, respectively, in this section. Under the above\nassumptions, let y1, . . . , yM be M possible tokens in the language L. The previous study of Yang\net al. [34] considers the following three matrices:\n\n$$H_\\theta = \\begin{bmatrix} h(x_1)^T \\\\\\\\ h(x_2)^T \\\\\\\\ \\vdots \\\\\\\\ h(x_N)^T \\end{bmatrix}, \\quad W, \\quad A = \\begin{bmatrix} \\log P^*(y_1|x_1) & \\log P^*(y_2|x_1) & \\cdots & \\log P^*(y_M|x_1) \\\\\\\\ \\log P^*(y_1|x_2) & \\log P^*(y_2|x_2) & \\cdots & \\log P^*(y_M|x_2) \\\\\\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\\\\\ \\log P^*(y_1|x_N) & \\log P^*(y_2|x_N) & \\cdots & \\log P^*(y_M|x_N) \\end{bmatrix}. \\tag{3}$$\n\nH\u03b8 \u2208 R^{N\u00d7d} is a matrix composed of the hidden vectors, W \u2208 R^{M\u00d7d} is a weight matrix, and\nA \u2208 R^{N\u00d7M} is a matrix composed of the log probabilities of the true distribution. 
By using these\nmatrices, the rank of H\u03b8W^T should be greater than or equal to rank(A) \u2212 1 so that the\nsoftmax-based model completely learns L [34]. However, the rank of H\u03b8W^T is at most d whatever function family U\nis used for H\u03b8 and W. Therefore, if we have d < rank(A) \u2212 1, softmax can be the bottleneck of\nrepresentational capacity as shown in the following theorem:\nTheorem 1 (Softmax Bottleneck [34]). If d < rank(A) \u2212 1, for any function family U and any\nmodel parameter \u03b8, there exists a context x in L such that P\u03b8(Y|x) \u2260 P\u2217(Y|x).\nThis theorem shows that the length of the hidden vector in the output layer determines the\nrepresentational power of RNN with softmax. In language modeling, the rank of A can be extremely high since\ncontexts can vary and vocabulary size M is much larger than d. Therefore, the softmax can be the\nbottleneck of the representational power.\n\n2.3 Mixture of softmax\n\nA simple approach to improving the representational capacity is to use a weighted sum of the several\nmodels. In fact, Yang et al. [34] use this approach for breaking the softmax bottleneck. As the\nalternative to softmax, they propose the mixture of softmax (MoS), which is the weighted sum of K\nsoftmax functions:\n\n$$P_\\theta(y_i|x) = \\sum_{k=1}^{K} \\pi(x, k) \\frac{\\exp([W h(x, k)]_i)}{\\sum_{m=1}^{M} \\exp([W h(x, k)]_m)}, \\tag{4}$$\n\nwhere \u03c0(x, k) is the prior or mixture weight of the k-th component, and h(x, k) is the k-th context\nvector associated with the context x. Let h\u2032(x) be input of MoS for the context x. The priors and\ncontext vectors are parameterized as $\\pi(x, k) = \\frac{\\exp(w_{\\pi,k}^T h'(x))}{\\sum_{k'=1}^{K} \\exp(w_{\\pi,k'}^T h'(x))}$ and $h(x, k) = \\tanh(W_{h,k} h'(x))$,\nrespectively. MoS can break the softmax bottleneck since the rank of the approximate A can be\narbitrarily large [34]. 
Therefore, language modeling with MoS performs better than that with softmax.\nHowever, in this method, the number of mixtures K is a hyper-parameter that needs to be tuned.\nIn addition, weights Wh,k and w\u03c0,k are additional parameters. Thus, MoS can be regarded as an\nadditional layer or mixing technique rather than the improvement of the activation function.\n\n2.4 Related work\n\nPrevious studies proposed alternative functions to softmax [8, 25, 27]. The study of de Br\u00e9bisson\nand Vincent [8] explored spherical family functions: the spherical softmax and Taylor softmax. They\nshowed that these functions do not outperform softmax when the length of an output vector is large.\nIn addition, the spherical softmax has a hyper-parameter that should be carefully tuned for numerical\nstability reasons [8]. On the other hand, the Taylor softmax might suffer from the softmax bottleneck\nsince it approximates softmax. Mohassel and Zhang [25] proposed a ReLU-based alternative function\nto softmax for privacy-preserving machine learning since softmax is expensive to compute inside a\nsecure computation. However, it leads to a division by zero since all outputs of ReLUs frequently\nbecome zeros and the denominator for normalization becomes zero. Several studies improved the\nefficiency of softmax [11, 30, 33, 20]. However, they did not improve the representational capacity.\n\n3 Proposed method\n\n3.1 Reanalysis of the softmax bottleneck\n\nThe analysis of the softmax bottleneck [34] is based on matrix factorization and reveals that the rank\nof H\u03b8W\u03b8^T needs to be greater than or equal to rank(A) \u2212 1. Since the rank of H\u03b8W\u03b8^T is at most\nthe length of the hidden vector in the output layer, the length of the hidden vector determines the\nrepresentational power as described in Sec. 2.2. However, this analysis does not explicitly reveal\nthe cause of the softmax bottleneck. 
To identify the cause of the softmax bottleneck, we re-analyze\nthe softmax bottleneck from the perspective of the range of log-softmax because it should be large\nenough to approximate the true log probabilities.\nLog-softmax is a logarithm of softmax and is used in training of deep learning as mentioned\nin Sec. 2.1. By using the notation in Sec. 2.1, log-softmax log(fs(z)) can be represented as\n$[\\log(f_s(z))]_i = \\log\\left(\\frac{\\exp(z_i)}{\\sum_{m=1}^{M} \\exp(z_m)}\\right) = z_i - \\log\\left(\\sum_{m=1}^{M} \\exp(z_m)\\right)$. This function can be expressed as\n\n$$\\log(f_s(z)) = z - \\log\\left(\\sum_{m=1}^{M} \\exp(z_m)\\right) \\mathbf{1}, \\tag{5}$$\n\nwhere 1 is the vector of all ones. To represent various log probability distributions log(P\u2217(y|x)),\nthe range of log(fs(z)) \u2208 R^M should be sufficiently large. Therefore, we investigate the range of\nlog(fs(z)). We assume that the hidden vector h in the output layer can be an arbitrary vector in\nR^d where d \u2264 M, and the weight matrix W \u2208 R^{M\u00d7d} is the full rank matrix; the rank of W is d.1\nUnder these assumptions, the input vector space of softmax S (z \u2208 S) is a d dimensional vector\nspace, and we have the following theorem:\nTheorem 2. Let S \u2286 R^M be the d dimensional vector space and z \u2208 S be input of log-softmax.\nThe range of log-softmax {log(fs(z))|z \u2208 S} is a subset of a d + 1 dimensional vector space.\n\nProof. The input of log-softmax z = W h can be represented by d singular vectors of W since the\nrank of W is d. In other words, the space of input vectors z is spanned by d basis vectors. Thus,\nthe input vector space {z|z \u2208 S} is represented as $\\{\\sum_{l=1}^{d} k^{(l)} u^{(l)} \\mid k^{(l)} \\in \\mathbb{R}\\}$ where $u^{(l)} \\in \\mathbb{R}^M$ for\nl = 1, . . . , d are linearly independent vectors and $k^{(l)}$ are their coefficients. From eq. 
(5), by using\n$u^{(l)}$ and $k^{(l)}$, the range of log-softmax {log(fs(z))|z \u2208 S} becomes\n\n$$\\{\\log(f_s(z)) \\mid z \\in S\\} = \\left\\{\\sum_{l=1}^{d} k^{(l)} u^{(l)} - c\\left(\\sum_{l=1}^{d} k^{(l)} u^{(l)}\\right) \\mathbf{1} \\mid k^{(l)} \\in \\mathbb{R}\\right\\}, \\tag{6}$$\n\nwhere $c\\left(\\sum_{l=1}^{d} k^{(l)} u^{(l)}\\right) = \\log\\left(\\sum_{m=1}^{M} \\exp\\left(\\left[\\sum_{l=1}^{d} k^{(l)} u^{(l)}\\right]_m\\right)\\right)$. This is the linear combination of d\nlinearly independent vectors $u^{(l)}$ and 1. Therefore, we have the following relation:\n\n$$\\left\\{\\sum_{l=1}^{d} k^{(l)} u^{(l)} - c\\left(\\sum_{l=1}^{d} k^{(l)} u^{(l)}\\right) \\mathbf{1} \\mid k^{(l)} \\in \\mathbb{R}\\right\\} \\subseteq \\left\\{\\sum_{l=1}^{d} k^{(l)} u^{(l)} + k^{(d+1)} \\mathbf{1} \\mid k^{(l)} \\in \\mathbb{R}\\right\\}, \\tag{7}$$\n\nwhere $\\{\\sum_{l=1}^{d} k^{(l)} u^{(l)} + k^{(d+1)} \\mathbf{1} \\mid k^{(l)} \\in \\mathbb{R}\\}$ is the vector space spanned by $u^{(l)}$ and 1. Let Y be the\nvector space $\\{\\sum_{l=1}^{d} k^{(l)} u^{(l)} + k^{(d+1)} \\mathbf{1} \\mid k^{(l)} \\in \\mathbb{R}\\}$. The dimension of Y becomes\n\n$$\\dim(Y) = \\begin{cases} d + 1 & \\text{if } \\mathbf{1} \\notin \\{\\sum_{l=1}^{d} k^{(l)} u^{(l)} \\mid k^{(l)} \\in \\mathbb{R}\\}, \\\\\\\\ d & \\text{if } \\mathbf{1} \\in \\{\\sum_{l=1}^{d} k^{(l)} u^{(l)} \\mid k^{(l)} \\in \\mathbb{R}\\}. \\end{cases} \\tag{8}$$\n\nWe can see that Y is the d or d + 1 dimensional linear subspace of R^M. From eqs. (7) and (8), output\nvectors of log-softmax exist in the d + 1 dimensional vector space, which completes the proof.\n\nTheorem 2 shows that the log-softmax has at most d + 1 linearly independent output vectors, even\nif the various inputs are applied to the model. Therefore, if the vectors of true log probabilities\nlog P\u2217(y|x) have more than d + 1 linearly independent vectors, the softmax-based model cannot\ncompletely represent the true probabilities. Figure 1 illustrates Theorems 1 and 2 when M = 3 and\nd = 1. We can prove Theorem 1 by using Theorem 2 as follows:\nProof. If we have d < rank(A) \u2212 1, i.e., rank(A) > d + 1, the number of linearly independent\nvectors of log P\u2217(y|x) is larger than d + 1. On the other hand, the output vectors log P\u03b8(y|x) of\nthe model cannot contain more than d + 1 linearly independent vectors, by Theorem 2. 
Therefore, the\nsoftmax-based model cannot completely learn P\u2217(y|x), i.e., there exists a context x in L such that\nP\u03b8(Y|x) \u2260 P\u2217(Y|x).\n\n1 If neural networks have the universal approximation property, h can be an arbitrary vector in R^d. If not, the\ninput space is a subset of a d dimensional vector space, and the range of log-softmax is still a subset of a d + 1\ndimensional vector space. When rank(W) < d, we can examine the range of log-softmax in the same way by\nreplacing d with rank(W). If a bias is used in the output layer, the dimension of S can be d + 1.\n\nFigure 1: Softmax bottleneck (M = 3, d = 1). The input space S is a gray dashed straight line, and\nthe range of log-softmax is the black curve on the orange plane spanned by z and 1. On the other\nhand, {log(P\u2217)} is the blue dash-dotted curve over the 3 dimensional (3-D) space since M = 3. We\ncan see that the range of log-softmax cannot match the {log(P\u2217)} over the 3-D space.\n\nThe above analysis shows that the softmax bottleneck occurs because the output of log-softmax is the\nlinear combination of the input z and vector 1 as eq. (5). Linear combination of the input and vector\n1 increases the number of linearly independent vectors by at most one, and as a result, the output\nvectors become at most d + 1 linearly independent vectors. The reason log-softmax becomes the\nlinear combination is that the logarithm of the exponential function log(exp(z)) is z.\nBy contrast, the number of linearly independent output vectors of a nonlinear function can be much\ngreater than the number of linearly independent input vectors. 
Therefore, if the exponential\nfunctions are replaced with other nonlinear functions, the logarithm of such functions can be nonlinear\nand the softmax bottleneck can be broken without additional parameters.\nOur analysis provides the new insight that the range of log-softmax is a subset of a lower-dimensional\nvector space, although the dimension of a vector space is strongly related to the rank of a matrix.\nFurthermore, our analysis explicitly shows the cause of the softmax bottleneck.\n\n3.2 Alternative functions to softmax and desirable properties\n\nIn the previous section, we explained that the softmax bottleneck can be broken by replacing\nexponential functions with other nonlinear functions. In this section, we explain the desirable properties of\nan alternative function to softmax. We formulate a new output function f(\u00b7) as follows:\n\n$$[f(z)]_i = \\frac{[g(z)]_i}{\\sum_{m=1}^{M} [g(z)]_m}. \\tag{9}$$\n\nThe new function is composed of the nonlinear function g(z) and the division for the normalization\nso that the summation of the elements is one. As the alternative function to softmax, a new output\nfunction f(z) and its g(z) should have all of the following properties:\n\nNonlinearity of log(g(z)): As mentioned in Secs. 2.2 and 3.1, softmax can be the bottleneck of\nthe representational power because log(exp(z)) is z. Provided that log(g(z)) is a linear\nfunction, {log(f(z))|z \u2208 S} is a subset of the d + 1 dimensional vector space. In order to\nbreak the softmax bottleneck, log(g(z)) should be nonlinear.\n\nNumerically stable: In training of deep learning, we need to calculate the gradient for optimization.\nThe derivative of logarithm of [f(z)]i with respect to zj is\n\n$$\\frac{\\partial \\log([f(z)]_i)}{\\partial z_j} = \\frac{1}{[f(z)]_i} \\frac{\\partial [f(z)]_i}{\\partial z_j}. \\tag{10}$$\n\nWe can see that this function has a division by [f(z)]i. It can cause a division by zero since\n[f(z)]i can be close to zero if networks completely go wrong in training. 
The alternative\nfunctions should avoid a division by zero similar to softmax as shown in eq. (2).\n\nNon-negative: In eq. (9), all elements of g(z) should be non-negative to limit output in [0, 1].\nTherefore, g(z) should be non-negative: [g(z)]i \u2265 0. Note that if g(z) is non-positive,\nf(z) are also limited to [0, 1]. We only mention non-negative since non-positive functions\ng(z) can easily be non-negative as \u2212g(z).\n\nMonotonically increasing: g(z) should be monotonically increasing so that f(z) becomes a\nsmoothed version of the argmax function [4, 2]. If g(z) is monotonically increasing,\nwe can obtain the label that has the maximum value of f(z) by comparing elements of z.\n\nNote that, if we use ReLU as g(z), the ReLU-based function f(z) does not have all the above\nproperties since the gradient of its logarithm is not numerically stable. If we use sigmoid as g(z),\nthe new sigmoid-based function satisfies the above properties. However, the output of sigmoid is\nbounded above as [g(z)]i \u2264 1, and this restriction might limit the representational power. In fact,\nthe sigmoid-based function does not outperform softmax on the large dataset in Sec. 4. We discuss\nthese functions in detail in the supplementary material. In the next section, we propose a new output\nactivation function that can break the softmax bottleneck, and satisfies all the above properties.\n\n3.3 Sigsoftmax\n\nFor breaking the softmax bottleneck, we propose sigsoftmax given as follows:\nDefinition 1. Sigsoftmax is defined as\n\n$$[f(z)]_i = \\frac{\\exp(z_i)\\sigma(z_i)}{\\sum_{m=1}^{M} \\exp(z_m)\\sigma(z_m)}, \\tag{11}$$\n\nwhere \u03c3(\u00b7) represents a sigmoid function.\n\nWe theoretically show that sigsoftmax can break the softmax bottleneck and has the desired properties.\nIn the same way as in the analysis of softmax in Sec. 
3.1, we examine the range of log-sigsoftmax.\nSince we have log(\u03c3(z)) = log(1/(1 + exp(\u2212z))) = z \u2212 log(1 + exp(z)), log-sigsoftmax becomes\n\n$$\\log(f(z)) = 2z - \\log(1 + \\exp(z)) - c'(z) \\mathbf{1}, \\tag{12}$$\n\nwhere $c'(z) = \\log\\left(\\sum_{m=1}^{M} \\exp(z_m)\\sigma(z_m)\\right)$, and log(1 + exp(z)) is the nonlinear function called\nsoftplus [10]. Since log-sigsoftmax is composed of a nonlinear function, its output vectors can contain\nmore than d + 1 linearly independent vectors. Therefore, we have the following theorem:\nTheorem 3. Let S \u2286 R^M be the d dimensional vector space and z \u2208 S be input of log-sigsoftmax,\nsome range of log-sigsoftmax {log(f(z))|z \u2208 S} is not a subset of a d + 1 dimensional vector space.\n\nThe detailed proof of this theorem is given in the supplementary material. Theorem 3 shows that\nsigsoftmax can break the softmax bottleneck; even if the vectors of the true log probabilities are more\nthan d + 1 linearly independent vectors, the sigsoftmax-based model can learn the true probabilities.\nHowever, the representational powers of sigsoftmax and softmax are difficult to compare only by\nusing the theorem based on the vector space. This is because both functions are nonlinear and their\nranges are not necessarily vector spaces, even though they are subsets of vector spaces. Therefore,\nwe directly compare the ranges of sigsoftmax and softmax as the following theorem:\nTheorem 4. Let z \u2208 S be the input of sigsoftmax f(\u00b7) and softmax fs(\u00b7). If S is a d dimensional\nvector space and 1 \u2208 S, the range of softmax is a subset of the range of sigsoftmax\n\n$$\\{f_s(z) \\mid z \\in S\\} \\subseteq \\{f(z) \\mid z \\in S\\}. \\tag{13}$$\n\nProof. If we have 1 \u2208 S, S can be written as $S = \\{\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)} + k'^{(d)} \\mathbf{1} \\mid k'^{(l)} \\in \\mathbb{R}\\}$ where $u'^{(l)}$\n(l = 1, . . . , d \u2212 1) and 1 are linearly independent vectors. 
In addition, the arbitrary elements of S\ncan be written as $\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)} + k'^{(d)} \\mathbf{1}$, and thus, $z = \\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)} + k'^{(d)} \\mathbf{1}$. For the output\nof softmax, by substituting $z = \\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)} + k'^{(d)} \\mathbf{1}$ for eq. (1), we have\n\n$$[f_s(z)]_i = \\frac{\\exp([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_i + k'^{(d)})}{\\sum_{m=1}^{M} \\exp([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_m + k'^{(d)})} = \\frac{\\exp([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_i)}{\\sum_{m=1}^{M} \\exp([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_m)}. \\tag{14}$$\n\nAs a result, the range of softmax becomes as follows:\n\n$$\\left\\{f_s\\left(\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)} + k'^{(d)} \\mathbf{1}\\right) \\mid k'^{(l)} \\in \\mathbb{R}\\right\\} = \\left\\{\\frac{\\exp([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_i)}{\\sum_{m=1}^{M} \\exp([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_m)} \\mid k'^{(l)} \\in \\mathbb{R}\\right\\}. \\tag{15}$$\n\nOn the other hand, by substituting $z = \\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)} + k'^{(d)} \\mathbf{1}$ for eq. 
(11), the output of sigsoftmax\nbecomes as follows:\n\n$$[f(z)]_i = \\frac{\\exp([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_i + k'^{(d)}) \\sigma([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_i + k'^{(d)})}{\\sum_{m=1}^{M} \\exp([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_m + k'^{(d)}) \\sigma([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_m + k'^{(d)})} = \\frac{\\exp([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_i) \\sigma([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_i + k'^{(d)})}{\\sum_{m=1}^{M} \\exp([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_m) \\sigma([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_m + k'^{(d)})}. \\tag{16}$$\n\nWhen $k'^{(l)}$ are fixed for l = 1, . . . , d \u2212 1 and $k'^{(d)} \\to +\\infty$,2 we have the following equality:\n\n$$\\lim_{k'^{(d)} \\to +\\infty} \\frac{\\exp([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_i) \\sigma([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_i + k'^{(d)})}{\\sum_{m=1}^{M} \\exp([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_m) \\sigma([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_m + k'^{(d)})} = \\frac{\\exp([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_i)}{\\sum_{m=1}^{M} \\exp([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_m)}, \\tag{17}$$\n\nsince $\\lim_{k \\to +\\infty} \\sigma(v + k) = 1$ when v is fixed. From eq. (17), sigsoftmax has the following relation:\n\n$$\\left\\{\\frac{\\exp([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_i)}{\\sum_{m=1}^{M} \\exp([\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)}]_m)} \\mid k'^{(l)} \\in \\mathbb{R}\\right\\} = \\{f(z) \\mid z \\in S'\\} \\subseteq \\{f(z) \\mid z \\in S\\}, \\tag{18}$$\n\nwhere S\u2032 is a hyperplane of S with $k'^{(d)} = +\\infty$, $S' = \\{\\sum_{l=1}^{d-1} k'^{(l)} u'^{(l)} + k'^{(d)} \\mathbf{1} \\mid k'^{(l)} \\in \\mathbb{R}$ for l =\n1, . . . , d \u2212 1, $k'^{(d)} = +\\infty\\} \\subset S$. From eqs. (15) and (18), we can see that the range of sigsoftmax\nincludes the range of softmax. Therefore, we have {fs(z)|z \u2208 S} \u2286 {f(z)|z \u2208 S}.\nTheorem 4 shows that the range of sigsoftmax can be larger than that of softmax if 1 \u2208 S. The\nassumption 1 \u2208 S means that there exist inputs whose outputs are equal probabilities for all\nlabels, p\u03b8(yi|x) = 1/M for all i. This assumption is not very strong in practice. If 1 \u2209 S, the\nrange of sigsoftmax can include the range of softmax by introducing one learnable scalar parameter\nb into sigsoftmax as $[f(z + b\\mathbf{1})]_i = \\frac{\\exp(z_i)\\sigma(z_i + b)}{\\sum_{m=1}^{M} \\exp(z_m)\\sigma(z_m + b)}$. In this case, if softmax can fit the true\nprobability, b can become large enough for sigsoftmax to approximately equal softmax. 
In the\nexperiments, we did not use b in order to confirm that sigsoftmax can outperform softmax without\nadditional parameters. From Theorems 3 and 4, sigsoftmax can break the softmax bottleneck, and\nfurthermore, the representational power of sigsoftmax can be higher than that of softmax.\nThen, we show that sigsoftmax has the desirable properties introduced in Sec. 3.2, as stated in the\nfollowing theorem, which follows from Definition 1; its proof is given in the supplementary material:\nTheorem 5. Sigsoftmax has the following properties:\n\n1. Nonlinearity of log(g(z)): log(g(z)) = 2z \u2212 log(1 + exp(z)).\n2. Numerically stable: $\\frac{\\partial \\log [f(z)]_i}{\\partial z_j} = \\begin{cases} (1 - [f(z)]_j)(2 - \\sigma(z_j)) & i = j, \\\\\\\\ -[f(z)]_j (2 - \\sigma(z_j)) & i \\neq j. \\end{cases}$\n3. Non-negative: [g(z)]i = exp(zi)\u03c3(zi) \u2265 0.\n4. Monotonically increasing: z1 \u2264 z2 \u21d2 exp(z1)\u03c3(z1) \u2264 exp(z2)\u03c3(z2).\n\nSince sigsoftmax is an alternative function to softmax, we can use the weighted sum of sigsoftmax\nfunctions in the same way as MoS. Mixture of sigsoftmax (MoSS) is the following function:\n\n$$P_\\theta(y_i|x) = \\sum_{k=1}^{K} \\pi(x, k) \\frac{\\exp([W h(x, k)]_i) \\sigma([W h(x, k)]_i)}{\\sum_{m=1}^{M} \\exp([W h(x, k)]_m) \\sigma([W h(x, k)]_m)}. \\tag{19}$$\n\n\u03c0(x, k) is also composed of sigsoftmax as $\\pi(x, k) = \\frac{\\exp(w_{\\pi,k}^T h'(x)) \\sigma(w_{\\pi,k}^T h'(x))}{\\sum_{k'=1}^{K} \\exp(w_{\\pi,k'}^T h'(x)) \\sigma(w_{\\pi,k'}^T h'(x))}$.\n\n4 Experiments\n\nTo evaluate the effectiveness of sigsoftmax, we conducted experiments on word-level language\nmodeling. We compared sigsoftmax with softmax, the ReLU-based function and the sigmoid-based\nfunction. 
We also compared the mixture of sigsoftmax with that of softmax; MoSS with MoS.\nNote that we provide the character-level language modeling experiments on text8 [18] and word-level\nlanguage modeling experiments on One Billion Word dataset [5] in the supplementary material.\nSince the softmax bottleneck does not occur on character-level language modeling, we confirmed the\nperformance of sigsoftmax is similar to that of softmax in these experiments. On One Billion Word\ndataset, we used an efficient method [11] since One Billion Word is a massive dataset. We confirmed\nthat sigsoftmax can outperform softmax on the massive dataset.\n\n2 Even though k\u2032(d) is extremely large, the input vector is the element of the input space S.\n\nTable 1: Results of the language modeling experiment on PTB.\n\n            Softmax   g: ReLU          g: Sigmoid  Sigsoftmax  MoS       MoSS\nValidation  51.2\u00b10.5  (4.91\u00b15)\u00d710^3    49.2\u00b10.4    49.7\u00b10.5    48.6\u00b10.2  48.3\u00b10.1\nTest        50.5\u00b10.5  (2.78\u00b18)\u00d710^5    48.9\u00b10.3    49.2\u00b10.4    48.0\u00b10.1  47.7\u00b10.07\n\nTable 2: Results of the language modeling experiment on WT2.\n\n            Softmax   g: ReLU          g: Sigmoid  Sigsoftmax  MoS       MoSS\nValidation  45.3\u00b10.2  (1.79\u00b10.8)\u00d710^3  45.7\u00b10.1    44.9\u00b10.1    42.5\u00b10.1  42.1\u00b10.2\nTest        43.3\u00b10.1  (2.30\u00b12)\u00d710^4    43.5\u00b10.1    42.9\u00b10.1    40.8\u00b10.03 40.3\u00b10.2\n\n4.1 Experimental conditions\n\nWe used Penn Treebank dataset (PTB) [19, 24] and WikiText-2 dataset (WT2) [22] by following the\nprevious studies [23, 16, 34]. PTB is commonly used to evaluate the performance of RNN-based\nlanguage modeling [24, 35, 23, 34]. PTB is split into a training set (about 930 k tokens), validation\nset (about 74 k tokens), and test set (about 82 k tokens). The vocabulary size M was set to 10 k, and\nall words outside the vocabulary were replaced with a special token. 
WT2 is a collection of tokens\nfrom the set of articles on Wikipedia. WT2 is also split into a training set (about 2100 k), validation\nset (about 220 k), and test set (about 250 k). The vocabulary size M was 33,278. Since WT2 is larger\nthan PTB, language modeling of WT2 may require more representational power than that of PTB.\nWe trained a three-layer long short-term memory (LSTM) model with each output function. After we\ntrained models, we finetuned them and applied the dynamic evaluation [16]. For fair comparison, the\nexperimental conditions, such as unit sizes, dropout rates, initialization, and the optimization method\nwere the same as in the previous studies [23, 34, 16], whose code we used (footnote 3), except for the\nnumber of epochs. We set the epochs to be twice as large as the original epochs used in [23] since the losses\ndid not converge in the original epochs. In addition, we trained each model with various random\nseeds and evaluated the average and standard deviation of validation and test perplexities for each\nmethod. The detailed conditions and the results at training and finetuning steps are provided in the\nsupplementary material.\n\n4.2 Experimental results\n\nValidation perplexities and test perplexities of PTB and WT2 modeling are listed in Tabs. 1 and 2. Note\nthat we confirmed these results are statistically different by a pairwise t-test (significance level of 5%). Table 1\nshows that the sigmoid-based function achieved the lowest perplexities among output activation\nfunctions on PTB. However, the sigmoid-based function did not outperform softmax on WT2. This is\nbecause sigmoid is bounded above by one, \u03c3(\u00b7) \u2264 1, and it may restrict the representational power.\nAs a result, the sigmoid-based function did not perform well on the large dataset. On the other hand,\nsigsoftmax achieved lower perplexities than softmax on PTB and achieved the lowest perplexities on\nWT2. 
Furthermore, between the mixture models, MoSS achieved lower perplexities than MoS. Even\nthough we trained and finetuned the models under conditions that are highly optimized for softmax\nand MoS in [23, 34], sigsoftmax and MoSS outperformed softmax and MoS, respectively. Therefore,\nwe conclude that sigsoftmax outperforms softmax as an activation function.\n\n4.3 Evaluation of linear independence\n\nIn this section, we evaluate the linear independence of the output vectors of each function. First, we fed the\nwhole test data to the finetuned models and obtained the log-output log(P_\u03b8(y_t|x_t)), e.g., log-softmax, at\neach time step. Next, we formed the matrix \u00c2 = [log(P_\u03b8(y_1|x_1)), ..., log(P_\u03b8(y_T|x_T))] \u2208 R^{M\u00d7T},\nwhere T is the number of tokens in the test data. M and T were respectively 10,000 and 82,430 on the\nPTB test set and 33,278 and 245,570 on the WT2 test set. Finally, we examined the rank of \u00c2 since\nthe rank of a matrix is N if the matrix is composed of N linearly independent vectors. Note that\nnumerical rank computation is subject to roundoff error, so we used the threshold used in\n[29, 34] to determine the ranks. The ranks of \u00c2 are listed in Tab. 3. The calculated singular values for\ndetecting ranks are presented in the supplementary material.\n\n3 https://github.com/salesforce/awd-lstm-lm (Note that Merity et al. [23] further tuned some hyper-parameters in their code to obtain results better than those in the original paper.); https://github.com/benkrause/dynamic-evaluation; https://github.com/zihangdai/mos\n\nTable 3: The number of linearly independent log-output vectors on test datasets: Ranks of \u00c2.\n\n      Softmax   g: ReLU   g: Sigmoid  Sigsoftmax  MoS     MoSS\nPTB   402       8243      1304        4640        9980    9986\nWT2   402       31400     463         5465        12093   19834\n\nWe can see that the log-softmax outputs contain 402 linearly independent vectors. 
In the experiments,\nthe number of hidden units was set to 400, and we used a bias vector in the output layer. As a result, the\ndimension of the input space S was at most 401, and by Theorem 2 the log-softmax output vectors can theoretically\ncontain at most 402 linearly independent vectors. Therefore, we confirmed that the range\nof log-softmax is a subset of a (d + 1)-dimensional vector space. On the other hand, the numbers\nof linearly independent output vectors of sigsoftmax, ReLU and sigmoid-based functions are not\nbounded by 402. Therefore, sigsoftmax, ReLU and sigmoid-based functions can break the softmax\nbottleneck. The ranks of the ReLU-based function are larger than those of the other activation functions.\nHowever, the ReLU-based function is numerically unstable as mentioned in Sec. 3.2. As a result, it\nwas not trained well, as shown in Tabs. 1 and 2. MoSS has more linearly independent output vectors\nthan MoS. Therefore, MoSS may have more representational power than MoS.\n\n5 Conclusion\n\nIn this paper, we investigated the range of log-softmax and identified the cause of the softmax bottleneck.\nWe proposed sigsoftmax, which can break the softmax bottleneck and has more representational\npower than softmax without additional parameters. Experiments on language modeling demonstrated\nthat sigsoftmax outperformed softmax. Since sigsoftmax has the desirable properties for output\nactivation functions, it has the potential to replace softmax in many applications. Breaking the\nsoftmax bottleneck is a necessary condition for fitting the model to the true distribution. In\nfuture work, we will investigate sufficient conditions for fitting the model to the true distribution.\n\nReferences\n\n[1] Christopher M Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.\n\n[2] Christopher M Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006.\n\n[3] John S Bridle. 
Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Proc. NIPS, pages 211\u2013217, 1990.\n\n[4] John S Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pages 227\u2013236. Springer, 1990.\n\n[5] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. Technical report, Google, 2013. URL http://arxiv.org/abs/1312.3005.\n\n[6] Binghui Chen, Weihong Deng, and Junping Du. Noisy softmax: Improving the generalization ability of dcnn via postponing the early softmax saturation. In Proc. CVPR, pages 5372\u20135381, 2017.\n\n[7] Kyunghyun Cho, Bart Van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder\u2013decoder for statistical machine translation. In Proc. EMNLP, pages 1724\u20131734. ACL, 2014.\n\n[8] Alexandre de Br\u00e9bisson and Pascal Vincent. An exploration of softmax alternatives belonging to the spherical loss family. In Proc. ICLR, 2016.\n\n[9] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proc. AISTATS, pages 249\u2013256, 2010.\n\n[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.\n\n[11] \u00c9douard Grave, Armand Joulin, Moustapha Ciss\u00e9, David Grangier, and Herv\u00e9 J\u00e9gou. Efficient softmax approximation for GPUs. In Proc. ICML, pages 1302\u20131310, 2017.\n\n[12] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Proc. ICASSP, pages 6645\u20136649. IEEE, 2013.\n\n[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 
Deep residual learning for image recognition. In Proc. CVPR, pages 770\u2013778, 2016.\n\n[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. ICML, pages 448\u2013456, 2015.\n\n[15] Sekitoshi Kanai, Yasuhiro Fujiwara, and Sotetsu Iwamura. Preventing gradient explosions in gated recurrent units. In Proc. NIPS, pages 435\u2013444, 2017.\n\n[16] Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence models. arXiv preprint arXiv:1709.07432, 2017.\n\n[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Proc. NIPS, pages 1097\u20131105, 2012.\n\n[18] Matt Mahoney. Large text compression benchmark. 2011. URL http://www.mattmahoney.net/text/text.html.\n\n[19] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313\u2013330, 1993.\n\n[20] Andre Martins and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proc. ICML, pages 1614\u20131623, 2016.\n\n[21] Roland Memisevic, Christopher Zach, Marc Pollefeys, and Geoffrey E Hinton. Gated softmax classification. In Proc. NIPS, pages 1603\u20131611, 2010.\n\n[22] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In Proc. ICLR, 2017.\n\n[23] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing lstm language models. In Proc. ICLR, 2018.\n\n[24] Tomas Mikolov. Statistical language models based on neural networks. PhD thesis, Brno University of Technology, 2012.\n\n[25] Payman Mohassel and Yupeng Zhang. Secureml: A system for scalable privacy-preserving machine learning. 
In Security and Privacy (SP), 2017 IEEE Symposium on, pages 19\u201338. IEEE, 2017.\n\n[26] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proc. ICML, pages 807\u2013814. Omnipress, 2010.\n\n[27] Yann Ollivier. Riemannian metrics for neural networks i: feedforward networks. arXiv preprint arXiv:1303.0818, 2013.\n\n[28] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proc. ICML, pages 1310\u20131318, 2013.\n\n[29] William H Press, Saul A Teukolsky, William T Vetterling, and Brian P Flannery. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.\n\n[30] Kyuhong Shim, Minjae Lee, Iksoo Choi, Yoonho Boo, and Wonyong Sung. SVD-softmax: Fast softmax approximation on large vocabulary neural networks. In Proc. NIPS, pages 5469\u20135479, 2017.\n\n[31] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929\u20131958, 2014.\n\n[32] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Proc. NIPS, pages 3104\u20133112, 2014.\n\n[33] Michalis K. Titsias. One-vs-each approximation to softmax for scalable estimation of probabilities. In Proc. NIPS, pages 4161\u20134169, 2016.\n\n[34] Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck: a high-rank rnn language model. In Proc. ICLR, 2018.\n\n[35] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 
Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.\n", "award": [], "sourceid": 188, "authors": [{"given_name": "Sekitoshi", "family_name": "Kanai", "institution": "NTT, Keio University"}, {"given_name": "Yasuhiro", "family_name": "Fujiwara", "institution": "NTT Software Innovation Center"}, {"given_name": "Yuki", "family_name": "Yamanaka", "institution": "NTT Secure Platform Laboratories"}, {"given_name": "Shuichi", "family_name": "Adachi", "institution": "Keio University"}]}