{"title": "Learning towards Minimum Hyperspherical Energy", "book": "Advances in Neural Information Processing Systems", "page_first": 6222, "page_last": 6233, "abstract": "Neural networks are a powerful class of nonlinear functions that can be trained end-to-end on various applications. While the over-parametrization nature in many neural networks renders the ability to fit complex functions and the strong representation power to handle challenging tasks, it also leads to highly correlated neurons that can hurt the generalization ability and incur unnecessary computation cost. As a result, how to regularize the network to avoid undesired representation redundancy becomes an important issue. To this end, we draw inspiration from a well-known problem in physics -- Thomson problem, where one seeks to find a state that distributes N electrons on a unit sphere as evenly as possible with minimum potential energy. In light of this intuition, we reduce the redundancy regularization problem to generic energy minimization, and propose a minimum hyperspherical energy (MHE) objective as generic regularization for neural networks. We also propose a few novel variants of MHE, and provide some insights from a theoretical point of view. Finally, we apply neural networks with MHE regularization to several challenging tasks. Extensive experiments demonstrate the effectiveness of our intuition, by showing the superior performance with MHE regularization.", "full_text": "Learning towards Minimum Hyperspherical Energy\n\nWeiyang Liu1,*, Rongmei Lin2,*, Zhen Liu1,*, Lixin Liu3, Zhiding Yu4, Bo Dai1,5, Le Song1,6\n\n1Georgia Institute of Technology 2Emory University\n\n3South China University of Technology 4NVIDIA 5Google Brain 6Ant Financial\n\nAbstract\n\nNeural networks are a powerful class of nonlinear functions that can be trained\nend-to-end on various applications. 
While the over-parametrization nature in many neural networks renders the ability to fit complex functions and the strong representation power to handle challenging tasks, it also leads to highly correlated neurons that can hurt the generalization ability and incur unnecessary computation cost. As a result, how to regularize the network to avoid undesired representation redundancy becomes an important issue. To this end, we draw inspiration from a well-known problem in physics \u2013 Thomson problem, where one seeks to find a state that distributes N electrons on a unit sphere as evenly as possible with minimum potential energy. In light of this intuition, we reduce the redundancy regularization problem to generic energy minimization, and propose a minimum hyperspherical energy (MHE) objective as generic regularization for neural networks. We also propose a few novel variants of MHE, and provide some insights from a theoretical point of view. Finally, we apply neural networks with MHE regularization to several challenging tasks. Extensive experiments demonstrate the effectiveness of our intuition, by showing the superior performance with MHE regularization.\n\n1 Introduction\nThe recent success of deep neural networks has led to their wide application in a variety of tasks. With the over-parametrization nature and deep layered architecture, current deep networks [14, 46, 42] are able to achieve impressive performance on large-scale problems. Despite such success, having redundant and highly correlated neurons (e.g., weights of kernels/filters in convolutional neural networks (CNNs)) caused by over-parametrization presents an issue [37, 41], which motivated a series of influential works in network compression [10, 1] and parameter-efficient network architectures [16, 19, 62]. 
These works either compress the network by pruning redundant neurons or directly modify the network architecture, aiming to achieve comparable performance while using fewer parameters. Yet, it remains an open problem to find a unified and principled theory that guides the network compression in the context of optimal generalization ability.\n\n* indicates equal contributions. Correspondence to: Weiyang Liu .\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nAnother stream of works seeks to further release the network generalization power by alleviating redundancy through diversification [57, 56, 5, 36] as rigorously analyzed by [59]. Most of these works address the redundancy problem by enforcing relatively large diversity between pairwise projection bases via regularization. Our work broadly falls into this category by sharing a similar high-level target, but the spirit and motivation behind our proposed models are distinct. In particular, there is a recent trend of studies that feature the significance of angular learning at both loss and convolution levels [29, 28, 30, 27], based on the observation that the angles in deep embeddings learned by CNNs tend to encode semantic difference. The key intuition is that angles preserve the most abundant and discriminative information for visual recognition. As a result, hyperspherical geodesic distances between neurons naturally play a key role in this context, and thus, it is intuitively desired to impose discrimination by keeping their projections on the hypersphere as far away from each other as possible. 
While the concept of imposing large angular diversities was also considered in [59, 57, 56, 36], these works do not consider diversity in terms of global equidistribution of embeddings on the hypersphere, and thus fail to achieve state-of-the-art performance.\nGiven the above motivation, we draw inspiration from a well-known physics problem, called the Thomson problem [48, 43]. The goal of the Thomson problem is to determine the minimum electrostatic potential energy configuration of N mutually-repelling electrons on the surface of a unit sphere. We identify the intrinsic resemblance between the Thomson problem and our target, in the sense that diversifying neurons can be seen as searching for an optimal configuration of electron locations. Similarly, we characterize the diversity for a group of neurons by defining a generic hyperspherical potential energy using their pairwise relationship. Higher energy implies higher redundancy, while lower energy indicates that these neurons are more diverse and more uniformly spaced. To reduce the redundancy of neurons and improve the neural networks, we propose a novel minimum hyperspherical energy (MHE) regularization framework, where the diversity of neurons is promoted by minimizing the hyperspherical energy in each layer. As verified by comprehensive experiments on multiple tasks, MHE is able to consistently improve the generalization power of neural networks.\nMHE faces different situations when it is applied to hidden layers and output layers. For hidden layers, applying MHE straightforwardly may still encourage some degree of redundancy since it will produce co-linear bases pointing to opposite directions (see Fig. 1 middle). In order to avoid such redundancy, we propose the half-space MHE which constructs a group of virtual neurons and minimizes the hyperspherical energy of both existing and virtual neurons. 
For output layers, MHE aims to distribute the classifier neurons (see footnote 1) as uniformly as possible to improve the inter-class feature separability. Different from MHE in hidden layers, classifier neurons should be distributed in the full space for the best classification performance [29, 28]. An intuitive comparison among the widely used orthonormal regularization, the proposed MHE and half-space MHE is provided in Fig. 1. One can observe that both MHE and half-space MHE are able to uniformly distribute the neurons over the hypersphere and half-space hypersphere, respectively. In contrast, conventional orthonormal regularization tends to group neurons closer, especially when the number of neurons is greater than the dimension.\nMHE is originally defined on Euclidean distance, as indicated in the Thomson problem. However, we further consider minimizing hyperspherical energy defined with respect to angular distance, which we will refer to as angular-MHE (A-MHE) in the rest of the paper. In addition, we give some theoretical insights of MHE regularization, by discussing the asymptotic behavior and generalization error. Last, we apply MHE regularization to multiple vision tasks, including generic object recognition, class-imbalance learning, and face recognition. In the experiments, we show that MHE is architecture-agnostic and can considerably improve the generalization ability.\n2 Related Works\nDiversity regularization has been shown to be useful in sparse coding [32, 35], ensemble learning [26, 24], self-paced learning [21], metric learning [58], etc. Early studies in sparse coding [32, 35] show that the generalization ability of codebook can be improved via diversity regularization, where the diversity is often modeled using the (empirical) covariance matrix. 
More recently, a series of studies have featured diversity regularization in neural networks [59, 57, 56, 5, 36, 55], where regularization is mostly achieved via promoting large angle/orthogonality, or reducing covariance between bases. Our work differs from these studies by formulating the diversity of neurons on the entire hypersphere, therefore promoting diversity from a more global, top-down perspective.\nMethods other than diversity-promoting regularization have been widely proposed to improve CNNs [44, 20, 33, 30] and generative adversarial nets (GANs) [4, 34]. MHE can be regarded as a complement that can be applied on top of these methods.\n\nFigure 1: Orthonormal, MHE and half-space MHE regularization. The red dots denote the neurons optimized by the gradient of the corresponding regularization. The rightmost pink dots denote the virtual negative neurons. We randomly initialize the weights of 10 neurons on a 3D Sphere and optimize them with SGD.\n\nFootnote 1: Classifier neurons are the projection bases of the last layer (i.e., output layer) before input to softmax.\n\n3 Learning Neurons towards Minimum Hyperspherical Energy\n3.1 Formulation of Minimum Hyperspherical Energy\nMinimum hyperspherical energy defines an equilibrium state of the configuration of neurons\u2019 directions. We argue that the power of neural representation of each layer can be characterized by the hyperspherical energy of its neurons, and therefore a minimal energy configuration of neurons can induce better generalization. 
Before delving into details, we first define the hyperspherical energy functional for $N$ neurons (i.e., kernels) of dimension $d+1$, $W_N = \\{w_1, \\cdots, w_N \\in \\mathbb{R}^{d+1}\\}$, as\n\n$$E_{s,d}(\\hat{w}_i|_{i=1}^{N}) = \\sum_{i=1}^{N} \\sum_{j=1, j \\neq i}^{N} f_s\\big(\\|\\hat{w}_i - \\hat{w}_j\\|\\big) = \\begin{cases} \\sum_{i \\neq j} \\|\\hat{w}_i - \\hat{w}_j\\|^{-s}, & s > 0 \\\\\\\\ \\sum_{i \\neq j} \\log\\big(\\|\\hat{w}_i - \\hat{w}_j\\|^{-1}\\big), & s = 0 \\end{cases} \\tag{1}$$\n\nwhere $\\|\\cdot\\|$ denotes the Euclidean norm, $f_s(\\cdot)$ is a decreasing real-valued function, and $\\hat{w}_i = w_i / \\|w_i\\|$ is the $i$-th neuron weight projected onto the unit hypersphere $\\mathbb{S}^d = \\{w \\in \\mathbb{R}^{d+1} \\,|\\, \\|w\\| = 1\\}$. We also denote $\\hat{W}_N = \\{\\hat{w}_1, \\cdots, \\hat{w}_N \\in \\mathbb{S}^d\\}$, and $E_s = E_{s,d}(\\hat{w}_i|_{i=1}^{N})$ for short. There are plenty of choices for $f_s(\\cdot)$, but in this paper we use $f_s(z) = z^{-s}$, $s > 0$, known as Riesz $s$-kernels. Particularly, as $s \\to 0$, $z^{-s} = s \\log(z^{-1}) + 1 + O(s^2)$, which to first order in $s$ is an affine transformation of $\\log(z^{-1})$. It follows that optimizing the logarithmic hyperspherical energy $E_0 = \\sum_{i \\neq j} \\log(\\|\\hat{w}_i - \\hat{w}_j\\|^{-1})$ is essentially the limiting case of optimizing the hyperspherical energy $E_s$. We therefore define $f_0(z) = \\log(z^{-1})$ for convenience.\nThe goal of the MHE criterion is to minimize the energy in Eq. (1) by varying the orientations of the neuron weights $w_1, \\cdots, w_N$. To be precise, we solve an optimization problem: $\\min_{W_N} E_s$ with $s \\geq 0$. In particular, when $s = 0$, we solve the logarithmic energy minimization problem:\n\n$$\\arg\\min_{W_N} E_0 = \\arg\\min_{W_N} \\exp(E_0) = \\arg\\max_{W_N} \\prod_{i \\neq j} \\|\\hat{w}_i - \\hat{w}_j\\|, \\tag{2}$$\n\nin which we essentially maximize the product of Euclidean distances. $E_0$, $E_1$ and $E_2$ have interesting yet profound connections. 
Note that the Thomson problem corresponds to minimizing $E_1$, which is an NP-hard problem. Therefore in practice we can only compute its approximate solution by heuristics. In neural networks, such a differentiable objective can be directly optimized via gradient descent.\n3.2 Logarithmic Hyperspherical Energy $E_0$ as a Relaxation\nOptimizing the original energy in Eq. (1) is equivalent to optimizing its logarithmic form $\\log E_s$. To efficiently solve this difficult optimization problem, we can instead optimize the lower bound of $\\log E_s$ as a surrogate energy, by applying Jensen\u2019s inequality:\n\n$$\\arg\\min_{W_N} \\Big\\{ E_{\\log} := \\sum_{i=1}^{N} \\sum_{j=1, j \\neq i}^{N} \\log\\Big( f_s\\big(\\|\\hat{w}_i - \\hat{w}_j\\|\\big) \\Big) \\Big\\} \\tag{3}$$\n\nWith $f_s(z) = z^{-s}$, $s > 0$, we observe that $E_{\\log}$ becomes $sE_0 = s \\sum_{i \\neq j} \\log(\\|\\hat{w}_i - \\hat{w}_j\\|^{-1})$, which is identical to the logarithmic hyperspherical energy $E_0$ up to a multiplicative factor $s$. Therefore, minimizing $E_0$ can also be viewed as a relaxation of minimizing $E_s$ for $s > 0$.\n3.3 MHE as Regularization for Neural Networks\nNow that we have introduced the formulation of MHE, we propose MHE regularization for neural networks. 
In supervised neural network learning, the entire objective function is shown as follows:\n\n$$\\mathcal{L} = \\underbrace{\\frac{1}{m} \\sum_{j=1}^{m} \\ell\\big(\\langle w_i^{\\mathrm{out}}, x_j \\rangle_{i=1}^{c}, y_j\\big)}_{\\text{training data fitting}} + \\underbrace{\\lambda_h \\cdot \\sum_{j=1}^{L-1} \\frac{1}{N_j(N_j - 1)} \\{E_s\\}_j}_{T_h:\\ \\text{hyperspherical energy for hidden layers}} + \\underbrace{\\lambda_o \\cdot \\frac{1}{N_L(N_L - 1)} E_s(\\hat{w}_i^{\\mathrm{out}}|_{i=1}^{c})}_{T_o:\\ \\text{hyperspherical energy for output layer}} \\tag{4}$$\n\nwhere $x_j$ is the feature of the $j$-th training sample entering the output layer, $w_i^{\\mathrm{out}}$ is the classifier neuron for the $i$-th class in the output fully-connected layer and $\\hat{w}_i^{\\mathrm{out}}$ denotes its normalized version. $\\{E_s\\}_i$ denotes the hyperspherical energy for the neurons in the $i$-th layer. $c$ is the number of classes, $m$ is the batch size, $L$ is the number of layers of the neural network, and $N_i$ is the number of neurons in the $i$-th layer. $E_s(\\hat{w}_i^{\\mathrm{out}}|_{i=1}^{c})$ denotes the hyperspherical energy of neurons $\\{\\hat{w}_1^{\\mathrm{out}}, \\cdots, \\hat{w}_c^{\\mathrm{out}}\\}$. The $\\ell_2$ weight decay is omitted here for simplicity, but we will use it in practice. An alternative interpretation of MHE regularization from a decoupled view is given in Section 3.7 and Appendix C. MHE has different effects and interpretations in regularizing hidden layers and output layers.\nMHE for hidden layers. To make neurons in the hidden layers more discriminative and less redundant, we propose to use MHE as a form of regularization. MHE encourages the normalized neurons to be uniformly distributed on a unit hypersphere, which is partially inspired by the observation in [30] that angular difference in neurons preserves semantic (label-related) information. 
To some extent, MHE maximizes the average angular difference between neurons (specifically, by minimizing the hyperspherical energy of neurons in every hidden layer). For instance, in CNNs we minimize the hyperspherical energy of kernels in convolutional and fully-connected layers except the output layer.\nMHE for output layers. For the output layer, we propose to enhance the inter-class feature separability with MHE to learn discriminative and well-separated features. For classification tasks, MHE regularization is complementary to the softmax cross-entropy loss in CNNs. The softmax loss focuses more on the intra-class compactness, while MHE encourages the inter-class separability. Therefore, MHE on output layers can induce features with better generalization power.\n3.4 MHE in Half Space\nDirectly applying the MHE formulation may still encounter some redundancy. An example in Fig. 2, with two neurons in a 2-dimensional space, illustrates this potential issue. Directly imposing the original MHE regularization leads to a solution in which the two neurons are colinear but point in opposite directions. To avoid such redundancy, we propose the half-space MHE regularization which constructs some virtual neurons and minimizes the hyperspherical energy of both original and virtual neurons together. Specifically, half-space MHE constructs a colinear virtual neuron with opposite direction for every existing neuron. Therefore, we end up with minimizing the hyperspherical energy with $2N_i$ neurons in the $i$-th layer (i.e., minimizing $E_s(\\{\\hat{w}_k, -\\hat{w}_k\\}|_{k=1}^{2N_i})$). This half-space variant will encourage the neurons to be less correlated and less redundant, as illustrated in Fig. 2. Note that half-space MHE can only be used in hidden layers, because the colinear neurons do not constitute redundancy in output layers, as shown in [29]. 
Nevertheless, colinearity is usually not likely to happen in high-dimensional spaces, especially when the neurons are optimized to fit training data. This may be the reason that the original MHE regularization still consistently improves the baselines.\n\nFigure 2: Half-space MHE.\n\n3.5 MHE beyond Euclidean Distance\nThe hyperspherical energy is originally defined based on the Euclidean distance on a hypersphere, which can be viewed as an angular measure. In addition to Euclidean distance, we further consider the geodesic distance on a unit hypersphere as a distance measure for neurons, which is exactly the same as the angle between neurons. Specifically, we use $\\arccos(\\hat{w}_i^\\top \\hat{w}_j)$ to replace $\\|\\hat{w}_i - \\hat{w}_j\\|$ in hyperspherical energies. Following this idea, we propose angular MHE (A-MHE) as a simple extension, where the hyperspherical energy is rewritten as:\n\n$$E^a_{s,d}(\\hat{w}_i|_{i=1}^{N}) = \\sum_{i=1}^{N} \\sum_{j=1, j \\neq i}^{N} f_s\\big(\\arccos(\\hat{w}_i^\\top \\hat{w}_j)\\big) = \\begin{cases} \\sum_{i \\neq j} \\arccos(\\hat{w}_i^\\top \\hat{w}_j)^{-s}, & s > 0 \\\\\\\\ \\sum_{i \\neq j} \\log\\big(\\arccos(\\hat{w}_i^\\top \\hat{w}_j)^{-1}\\big), & s = 0 \\end{cases} \\tag{5}$$\n\nwhich can be viewed as redefining MHE based on geodesic distance on hyperspheres (i.e., angle), and can be used as an alternative to the original hyperspherical energy $E_s$ in Eq. (4). Note that A-MHE can also be learned in full-space or half-space, leading to similar variants as original MHE. The key difference between MHE and A-MHE lies in the optimization dynamics, because their gradients w.r.t. the neuron weights are quite different. A-MHE is also more computationally expensive than MHE.\n3.6 Mini-batch Approximation for MHE\nWith a large number of neurons in one layer, calculating MHE can be computationally expensive as it requires computing the pair-wise distances between neurons. 
To address this issue, we propose the mini-batch version of MHE to approximate the MHE (either original or half-space) objective.\nMini-batch approximation for MHE on hidden layers. For hidden layers, mini-batch approximation iteratively takes a random batch of neurons as input and minimizes their hyperspherical energy as an approximation to the MHE. Note that the gradient of the mini-batch objective is an unbiased estimate of the original gradient of MHE.\nData-dependent mini-batch approximation for output layers. For the output layer, the data-dependent mini-batch approximation iteratively takes the classifier neurons corresponding to the classes that exist in mini-batches. It minimizes $\\frac{1}{m(N-1)} \\sum_{i=1}^{m} \\sum_{j=1, j \\neq y_i}^{N} f_s(\\|\\hat{w}_{y_i} - \\hat{w}_j\\|)$ in each iteration, where $y_i$ denotes the class label of the $i$-th sample in each mini-batch, $m$ is the mini-batch size, and $N$ is the number of neurons (in one particular layer).\n\n[Figure 2 panels: original MHE with neurons $\\hat{w}_1$, $\\hat{w}_2$; half-space MHE with neurons $\\hat{w}_1$, $\\hat{w}_2$ and virtual neurons $-\\hat{w}_1$, $-\\hat{w}_2$.]\n\n3.7 Discussions\nConnections to scientific problems. The hyperspherical energy minimization has close relationships with scientific problems. When $s = 1$, Eq. (1) reduces to the Thomson problem [48, 43] (in physics) where one needs to determine the minimum electrostatic potential energy configuration of $N$ mutually-repelling electrons on a unit sphere. When $s = \\infty$, Eq. (1) becomes the Tammes problem [47] (in geometry) where the goal is to pack a given number of circles on the surface of a sphere such that the minimum distance between circles is maximized. When $s = 0$, Eq. (1) becomes Whyte\u2019s problem where the goal is to maximize the product of Euclidean distances as shown in Eq. (2). Our work aims to make use of important insights from these scientific problems to improve neural networks.\nUnderstanding MHE from decoupled view. 
Inspired by decoupled networks [27], we can view the original convolution as the multiplication of the angular function $g(\\theta) = \\cos(\\theta)$ and the magnitude function $h(\\|w\\|, \\|x\\|) = \\|w\\| \\cdot \\|x\\|$: $f(w, x) = h(\\|w\\|, \\|x\\|) \\cdot g(\\theta)$, where $\\theta$ is the angle between the kernel $w$ and the input $x$. From the equation above, we can see that the norm of the kernel and the direction (i.e., angle) of the kernel affect the inner product similarity differently. Typically, weight decay regularizes the kernel by minimizing its $\\ell_2$ norm, while there is no regularization on the direction of the kernel. Therefore, MHE completes this missing piece by promoting angular diversity. Combining MHE with a standard neural network, the entire regularization term becomes\n\n$$\\mathcal{L}_{\\mathrm{reg}} = \\underbrace{\\lambda_w \\cdot \\frac{1}{\\sum_{j=1}^{L} N_j} \\sum_{j=1}^{L} \\sum_{i=1}^{N_j} \\|w_i\\|}_{\\text{Weight decay: regularizing the magnitude of kernels}} + \\underbrace{\\lambda_h \\cdot \\sum_{j=1}^{L-1} \\frac{1}{N_j(N_j - 1)} \\{E_s\\}_j + \\lambda_o \\cdot \\frac{1}{N_L(N_L - 1)} E_s(\\hat{w}_i^{\\mathrm{out}}|_{i=1}^{c})}_{\\text{MHE: regularizing the direction of kernels}}$$\n\nwhere $\\lambda_w$, $\\lambda_h$ and $\\lambda_o$ are weighting hyperparameters for these three regularization terms. From the decoupled view, MHE makes a lot of sense in regularizing neural networks, since it plays a role complementary and orthogonal to weight decay. More discussions are in Appendix C.\nComparison to orthogonality/angle-promoting regularizations. Promoting orthogonality or large angles between bases has been a popular choice for encouraging diversity. 
Probably the most related and widely used one is the orthonormal regularization [30] which aims to minimize $\\|W^\\top W - I\\|_F$, where $W$ denotes the weights of a group of neurons with each column being one neuron and $I$ is an identity matrix. One similar regularization is the orthogonality regularization [36] which minimizes the sum of the cosine values between all the kernel weights. These methods encourage kernels to be orthogonal to each other, while MHE does not. Instead, MHE encourages the hyperspherical diversity among these kernels, and these kernels are not necessarily orthogonal to each other. [56] proposes the angular constraint which aims to constrain the angles between different kernels of the neural network, but quite different from MHE, they use a hard constraint to impose this angular regularization. Moreover, these methods model diversity regularization at a more local level, while MHE regularization seeks to model the problem in a more top-down manner.\nNormalized neurons in MHE. From Eq. (1), one can see that the normalized neurons are used to compute MHE, because we aim to encourage the diversity on a hypersphere. However, a natural question may arise: what if we use the original (i.e., unnormalized) neurons to compute MHE? First, combining the norm of kernels (i.e., neurons) into MHE may lead to a trivial gradient descent direction: simply increasing the norm of all kernels. If all kernel directions stay unchanged, increasing the norm of all kernels by a factor can effectively decrease the objective value of MHE. Second, coupling the norm of kernels into MHE may conflict with weight decay which aims to decrease the norm of kernels. Moreover, normalized neurons imply that the importance of all neurons is the same, which matches the intuition in [28, 30, 27]. If we desire different importance for different neurons, we can also manually assign a fixed weight for each neuron. 
This may be useful when we already know that certain neurons are more important and we want them to be relatively fixed. The neuron with large weight tends to be updated less. We will discuss it more in Appendix D.\n4 Theoretical Insights\nThis section leverages a number of rigorous theoretical results from [38, 23, 12, 25, 11, 23, 8, 54] and provides theoretical yet intuitive understandings about MHE.\n4.1 Asymptotic Behavior\nThis subsection shows how the hyperspherical energy behaves asymptotically. Specifically, as $N \\to \\infty$, we can show that the solution $\\hat{W}_N$ tends to be uniformly distributed on hypersphere $\\mathbb{S}^d$ when the hyperspherical energy defined in Eq. (1) achieves its minimum.\nDefinition 1 (minimal hyperspherical s-energy). We define the minimal $s$-energy for $N$ points on the unit hypersphere $\\mathbb{S}^d = \\{w \\in \\mathbb{R}^{d+1} \\,|\\, \\|w\\| = 1\\}$ as\n\n$$\\varepsilon_{s,d}(N) := \\inf_{\\hat{W}_N \\subset \\mathbb{S}^d} E_{s,d}(\\hat{w}_i|_{i=1}^{N}) \\tag{6}$$\n\nwhere the infimum is taken over all possible $\\hat{W}_N$ on $\\mathbb{S}^d$. Any configuration of $\\hat{W}_N$ to attain the infimum is called an $s$-extremal configuration. Usually $\\varepsilon_{s,d}(N) = \\infty$ if $N$ is greater than $d$ and $\\varepsilon_{s,d}(N) = 0$ if $N = 0, 1$.\nWe discuss the asymptotic behavior ($N \\to \\infty$) in three cases: $0 < s < d$, $s = d$, and $s > d$. We first write the energy integral as $I_s(\\mu) = \\iint_{\\mathbb{S}^d \\times \\mathbb{S}^d} \\|u - v\\|^{-s} \\, d\\mu(u) \\, d\\mu(v)$, which is taken over all probability measures $\\mu$ supported on $\\mathbb{S}^d$. With $0 < s < d$, $I_s(\\mu)$ is minimal when $\\mu$ is the spherical measure $\\sigma_d = \\mathcal{H}_d(\\cdot)|_{\\mathbb{S}^d} / \\mathcal{H}_d(\\mathbb{S}^d)$ on $\\mathbb{S}^d$, where $\\mathcal{H}_d(\\cdot)$ denotes the $d$-dimensional Hausdorff measure. When $s \\geq d$, $I_s(\\mu)$ becomes infinity, which therefore requires different analysis. In general, we can say all $s$-extremal configurations asymptotically converge to uniform distribution on a hypersphere, as stated in Theorem 1. 
This asymptotic behavior has been heavily studied in [38, 23, 12].\nTheorem 1 (asymptotic uniform distribution on hypersphere). Any sequence of optimal $s$-energy configurations $(\\hat{W}^\\star_N)|_{N=2}^{\\infty} \\subset \\mathbb{S}^d$ is asymptotically uniformly distributed on $\\mathbb{S}^d$ in the sense of the weak-star topology of measures, namely\n\n$$\\frac{1}{N} \\sum_{v \\in \\hat{W}^\\star_N} \\delta_v \\to \\sigma_d, \\quad \\text{as } N \\to \\infty \\tag{7}$$\n\nwhere $\\delta_v$ denotes the unit point mass at $v$, and $\\sigma_d$ is the spherical measure on $\\mathbb{S}^d$.\nTheorem 2 (asymptotics of the minimal hyperspherical s-energy). We have that $\\lim_{N \\to \\infty} \\frac{\\varepsilon_{s,d}(N)}{p(N)}$ exists for the minimal $s$-energy. For $0 < s < d$, $p(N) = N^2$. For $s = d$, $p(N) = N^2 \\log N$. For $s > d$, $p(N) = N^{1+s/d}$. Particularly if $0 < s < d$, we have $\\lim_{N \\to \\infty} \\frac{\\varepsilon_{s,d}(N)}{N^2} = I_s(\\sigma_d)$.\nTheorem 2 tells us the growth power of the minimal hyperspherical $s$-energy when $N$ goes to infinity. Therefore, different potential power $s$ leads to different optimization dynamics. In the light of the behavior of the energy integral, MHE regularization will focus more on local influence from neighborhood neurons instead of global influences from all the neurons as the power $s$ increases.\n4.2 Generalization and Optimality\nAs proved in [54], in one-hidden-layer neural network, the diversity of neurons can effectively eliminate the spurious local minima despite the non-convexity in learning dynamics of neural networks. Following such an argument, our MHE regularization, which encourages the diversity of neurons, naturally matches the theoretical intuition in [54], and effectively promotes the generalization of neural networks. 
While hyperspherical energy is minimized such that neurons become diverse on hyperspheres, the hyperspherical diversity is closely related to the generalization error.\nMore specifically, in a one-hidden-layer neural network $f(x) = \\sum_{k=1}^{n} v_k \\sigma(W_k^\\top x)$ with least squares loss $L(f) = \\frac{1}{2m} \\sum_{i=1}^{m} (y_i - f(x_i))^2$, we can compute its gradient w.r.t. $W_k$ as $\\frac{\\partial L}{\\partial W_k} = \\frac{1}{m} \\sum_{i=1}^{m} (f(x_i) - y_i) v_k \\sigma'(W_k^\\top x_i) x_i$. ($\\sigma(\\cdot)$ is the nonlinear activation function and $\\sigma'(\\cdot)$ is its subgradient. $x \\in \\mathbb{R}^d$ is the training sample. $W_k$ denotes the weights of hidden layer and $v_k$ is the weights of output layer.) Subsequently, we can rewrite this gradient as a matrix form: $\\frac{\\partial L}{\\partial W} = D \\cdot r$ where $D \\in \\mathbb{R}^{dn \\times m}$, $D_{\\{di-d+1:di,\\, j\\}} = v_i \\sigma'(W_i^\\top x_j) x_j \\in \\mathbb{R}^d$ and $r \\in \\mathbb{R}^m$, $r_i = \\frac{1}{m}(f(x_i) - y_i)$. Further, we can obtain the inequality $\\|r\\| \\leq \\frac{1}{\\lambda_{\\min}(D)} \\|\\frac{\\partial L}{\\partial W}\\|$. $\\|r\\|$ is actually the training error. To make the training error small, we need to lower bound $\\lambda_{\\min}(D)$ away from zero. From [54, 3], one can know that the lower bound of $\\lambda_{\\min}(D)$ is directly related to the hyperspherical diversity of neurons. After bounding the training error, it is easy to bound the generalization error using Rademacher complexity.\n5 Applications and Experiments\n5.1 Improving Network Generalization\nFirst, we perform ablation study and some exploratory experiments on MHE. Then we apply MHE to large-scale object recognition and class-imbalance learning. For all the experiments on CIFAR-10 and CIFAR-100 in the paper, we use moderate data augmentation, following [14, 27]. For ImageNet-2012, we follow the same data augmentation in [30]. We train all the networks using SGD with momentum 0.9, and the network initialization follows [13]. All the networks use BN [20] and ReLU if not otherwise specified. 
Experimental details are given in each subsection and Appendix A.\n5.1.1 Ablation Study and Exploratory Experiments\nVariants of MHE. We evaluate all different variants of MHE on CIFAR-10 and CIFAR-100, including original MHE (with the power $s = 0, 1, 2$) and half-space MHE (with the power $s = 0, 1, 2$) with both Euclidean and angular distance. In this experiment, all methods use CNN-9 (see Appendix A). The results in Table 1 show that all the variants of MHE perform consistently better than the baseline. Specifically, the half-space MHE has more significant performance gain compared to the other MHE variants, and MHE with Euclidean and angular distance perform similarly. In general, MHE with $s = 2$ performs best among $s = 0, 1, 2$. In the following experiments, we use $s = 2$ and Euclidean distance for both MHE and half-space MHE by default if not otherwise specified.\n\nTable 1: Testing error (%) of different MHE on CIFAR-10/100.\nMethod | CIFAR-10, s = 2 | CIFAR-10, s = 1 | CIFAR-10, s = 0 | CIFAR-100, s = 2 | CIFAR-100, s = 1 | CIFAR-100, s = 0\nMHE | 6.22 | 6.74 | 6.44 | 27.15 | 27.09 | 26.16\nHalf-space MHE | 6.28 | 6.54 | 6.30 | 25.61 | 26.30 | 26.18\nA-MHE | 6.21 | 6.77 | 6.45 | 26.17 | 27.31 | 27.90\nHalf-space A-MHE | 6.52 | 6.49 | 6.44 | 26.03 | 26.52 | 26.47\nBaseline (no MHE, independent of s): 7.75 on CIFAR-10, 28.13 on CIFAR-100.\n\nTable 2: Testing error (%) of different width on CIFAR-100.\nMethod | 16/32/64 | 32/64/128 | 64/128/256 | 128/256/512 | 256/512/1024\nBaseline | 47.72 | 38.64 | 28.13 | 24.95 | 25.45\nMHE | 36.84 | 30.05 | 26.75 | 24.05 | 23.14\nHalf-space MHE | 35.16 | 29.33 | 25.96 | 23.38 | 21.83\n\nNetwork width. We evaluate MHE with different network width. We use CNN-9 as our base network, and change its filter number in Conv1.x, Conv2.x and Conv3.x (see Appendix A) to 16/32/64, 32/64/128, 64/128/256, 128/256/512 and 256/512/1024. Results in Table 2 show that both MHE and half-space MHE consistently outperform the baseline, showing stronger generalization. 
Interestingly, both MHE and half-space MHE show larger gains when the filter number in each layer is smaller, indicating that MHE can help the network make better use of its neurons. In general, half-space MHE performs consistently better than MHE, showing the necessity of reducing collinearity redundancy among neurons. Both MHE and half-space MHE outperform the baseline by a large margin when the network is either very wide or very narrow, showing their advantage in improving generalization.\nNetwork depth. We perform experiments with different network depths to better evaluate the performance of MHE. We fix the filter numbers in Conv1.x, Conv2.x and Conv3.x to 64, 128 and 256, respectively. We compare a 6-layer CNN, a 9-layer CNN and a 15-layer CNN. The results are given in Table 3. Both MHE and half-space MHE perform significantly better than the baseline. More interestingly, the baseline CNN-15 cannot converge, while CNN-15 converges reasonably well when MHE is used to regularize the network. Moreover, half-space MHE consistently shows better generalization than MHE across network depths.\n\nTable 3: Testing error (%) of different depth on CIFAR-100. N/C: not converged.\nMethod | CNN-6 | CNN-9 | CNN-15\nBaseline | 32.08 | 28.13 | N/C\nMHE | 28.16 | 26.75 | 26.9\nHalf-space MHE | 27.56 | 25.96 | 25.84\n\nAblation study. Since the current MHE regularizes the neurons in the hidden layers and the output layer simultaneously, we perform an ablation study for MHE to further investigate where the gain comes from. This experiment uses CNN-9. The results are given in Table 4. \u201cH\u201d means that we apply MHE to all the hidden layers, while \u201cO\u201d means that we apply MHE to the output layer. Because half-space MHE cannot be applied to the output layer, the corresponding entries in the table are \u201cN/A\u201d. In general, we find that applying MHE to both the hidden layers and the output layer yields the best performance, and using MHE in the hidden layers usually produces better accuracy than using MHE in the output layer.\n\nTable 4: Ablation study on CIFAR-100.\nMethod | H \u221a, O \u00d7 | H \u00d7, O \u221a | H \u221a, O \u221a\nMHE | 26.55 | 26.85 | 26.16\nHalf-space MHE | 26.28 | N/A | 25.61\nA-MHE | 26.56 | 27.8 | 26.17\nHalf-space A-MHE | 26.64 | N/A | 26.03\nBaseline (no MHE) | 28.13\n\nHyperparameter experiment. We evaluate how the selection of the hyperparameter affects the performance. We experiment with different hyperparameters from $10^{-2}$ to $10^{2}$ on CIFAR-100 with CNN-9. HS-MHE denotes the half-space MHE. We evaluate MHE variants by separately applying MHE to the output layer (\u201cO\u201d), MHE to the hidden layers (\u201cH\u201d), and the half-space MHE to the hidden layers (\u201cH\u201d). The results in Fig. 3 show that our MHE is not very hyperparameter-sensitive and can consistently beat the baseline by a considerable margin. One can observe that MHE\u2019s hyperparameter works well from $10^{-2}$ to $10^{2}$ and is therefore easy to set. In contrast, the hyperparameter of weight decay could be more sensitive than that of MHE. Half-space MHE consistently outperforms the original MHE under all the hyperparameter settings. Interestingly, applying MHE only to hidden layers can achieve better accuracy than applying MHE only to output layers.\n\n[Figure 3: Hyperparameter. x-axis: value of hyperparameter ($10^{-2}$ to $10^{2}$); y-axis: testing error on CIFAR-100 (%); curves: Baseline, MHE (O), MHE (H), HS-MHE (H).]\n\nMHE for ResNets. Besides the standard CNN, we also evaluate MHE on ResNet-32 to show that our MHE is architecture-agnostic and can improve accuracy on multiple types of architectures. Besides ResNets, MHE can also be applied to GoogLeNet [46], SphereNets [30] (the experimental results are given in Appendix E), DenseNet [17], etc. Detailed architecture settings are given in Appendix A. The results on CIFAR-10 and CIFAR-100 are given in Table 5. One can observe that applying MHE to ResNet also achieves considerable improvements, showing that MHE is generally useful for different architectures. Most importantly, adding MHE regularization does not affect the original architecture settings, and it can readily improve the network generalization at a negligible computational cost.\n\nTable 5: Error (%) of ResNet-32.\nMethod | CIFAR-10 | CIFAR-100\nResNet-110-original [14] | 6.61 | 25.16\nResNet-1001 [15] | 4.92 | 22.71\nResNet-1001 (64 batch) [15] | 4.64 | -\nbaseline | 5.19 | 22.87\nMHE | 4.72 | 22.19\nHalf-space MHE | 4.66 | 22.04\n\n5.1.2 Large-scale Object Recognition\nWe evaluate MHE on the large-scale ImageNet-2012 dataset. Specifically, we perform experiments using ResNets, and report the top-1 validation error (center crop) in Table 6. From the results, we still observe that both MHE and half-space MHE yield consistently better recognition accuracy than the baseline and the orthonormal regularization (after tuning its hyperparameter). To better evaluate the consistency of MHE\u2019s performance gain, we use two ResNets with different depth: ResNet-18 and ResNet-34. On these two networks, both MHE and half-space MHE outperform the baseline by a significant margin, showing consistently better generalization power. Moreover, half-space MHE performs slightly better than full-space MHE, as expected.\n\nTable 6: Top-1 error (%) on ImageNet.\nMethod | ResNet-18 | ResNet-34\nbaseline | 33.95 | 30.04\nOrthogonal [36] | 33.65 | 29.74\nOrthonormal | 33.61 | 29.75\nMHE | 33.50 | 29.60\nHalf-space MHE | 33.45 | 29.50\n\nFigure 4: Class-imbalance learning on MNIST.\n\n5.1.3 Class-imbalance Learning\nBecause MHE aims to maximize the hyperspherical margin between different classifier neurons in the output layer, we can naturally apply MHE to class-imbalance learning, where the number of training samples in different classes is imbalanced. We demonstrate the power of MHE in class-imbalance learning through a toy experiment. We first randomly throw away 98% of the training data for digit 0 in MNIST (only 100 samples are preserved for digit 0), and then train a 6-layer CNN on this imbalanced MNIST. To visualize the learned features, we set the output feature dimension to 2. The features and classifier neurons on the full training set are visualized in Fig. 4, where each color denotes a digit and red arrows are the normalized classifier neurons. Although we train the network on the imbalanced training set, we visualize the features of the full training set for better demonstration. The visualization for the full testing set is also given in Appendix H. From Fig. 4, one can see that the CNN without MHE tends to ignore the imbalanced class (digit 0) and the learned classifier neuron is highly biased to another digit. In contrast, the CNN with MHE can learn a reasonably separable distribution even if digit 0 only has 2% samples compared to the other classes. Using MHE in this toy setting can readily improve the accuracy on the full testing set from 88.5% to 98%.\n
Most importantly, the classifier neuron for digit 0 is also properly learned, similar to the one learned on the balanced dataset. Note that half-space MHE cannot be applied to the classifier neurons, because the classifier neurons usually need to occupy the full feature space.\nWe experiment with MHE in two data imbalance settings on CIFAR-10: 1) single class imbalance (S): all classes have the same number of images except one class, which has significantly fewer, and 2) multiple class imbalance (M): the number of images decreases as the class index decreases from 9 to 0. We use CNN-9 for all the compared regularizations. Detailed setups are provided in Appendix A. In Table 7, we report the error rate on the whole testing set. In addition, we report the error rate (denoted by Err. (S)) on the imbalanced class (single imbalance setting) in the full testing set. From the results, one can observe that CNN-9 with MHE is able to effectively perform recognition when classes are imbalanced. Even when given only a small portion of training data in a few classes, CNN-9 with MHE can achieve very competitive accuracy on the full testing set, showing MHE\u2019s superior generalization power. Moreover, we also provide experimental results on imbalanced CIFAR-100 in Appendix H.\n\nTable 7: Error on imbalanced CIFAR-10.\nMethod | Single | Err. (S) | Multiple\nBaseline | 9.80 | 30.40 | 12.00\nOrthonormal | 8.34 | 26.80 | 10.80\nMHE | 7.98 | 25.80 | 10.25\nHalf-space MHE | 7.90 | 26.40 | 9.59\nA-MHE | 7.96 | 26.00 | 9.88\nHalf-space A-MHE | 7.59 | 25.90 | 9.89\n\n[Figure 4 panels: (a) CNN without MHE; (b) CNN with MHE.]\n\n5.2 SphereFace+: Improving Inter-class Feature Separability via MHE for Face Recognition\nWe have shown that full-space MHE for output layers can encourage classifier neurons to distribute more evenly on the hypersphere and therefore improve inter-class feature separability.\n
Intuitively, the classifier neurons serve as the approximate centers for features from each class, and can therefore guide the feature learning. We also observe that open-set face recognition (e.g., face verification) requires the feature centers to be as separable as possible [28]. This connection inspires us to apply MHE to face recognition. Specifically, we propose SphereFace+ by applying MHE to SphereFace [28]. The objective of SphereFace, the angular softmax loss ($\\ell_{\\mathrm{SF}}$) that encourages intra-class feature compactness, is naturally complementary to that of MHE. The objective function of SphereFace+ is defined as\n\n$$L_{\\mathrm{SF+}} = \\underbrace{\\frac{1}{m} \\sum_{j=1}^{m} \\ell_{\\mathrm{SF}}\\big(\\langle w_i^{\\mathrm{out}}, x_j \\rangle_{i=1}^{c}, y_j, m_{\\mathrm{SF}}\\big)}_{\\text{Angular softmax loss: promoting intra-class compactness}} + \\lambda_{\\mathrm{M}} \\cdot \\underbrace{\\frac{1}{m(N-1)} \\sum_{i=1}^{m} \\sum_{j=1, j \\neq y_i}^{N} f_s\\big(\\|\\hat{w}_{y_i}^{\\mathrm{out}} - \\hat{w}_j^{\\mathrm{out}}\\|\\big)}_{\\text{MHE: promoting inter-class separability}} \\qquad (8)$$\n\nwhere $c$ is the number of classes, $m$ is the mini-batch size, $N$ is the number of classifier neurons, $x_i$ is the deep feature of the i-th face ($y_i$ is its groundtruth label), and $w_i^{\\mathrm{out}}$ is the i-th classifier neuron. $m_{\\mathrm{SF}}$ is a hyperparameter for SphereFace, controlling the degree of intra-class feature compactness (i.e., the size of the angular margin). Because face datasets usually have thousands of identities, we use the data-dependent mini-batch approximation of MHE, as shown in Eq. (8), in the output layer to reduce computational cost. MHE completes a missing piece for SphereFace by promoting the inter-class separability. SphereFace+ consistently outperforms SphereFace, and achieves state-of-the-art performance on both LFW [18] and MegaFace [22] datasets. More results on MegaFace are given in Appendix I.\n
MHE can also improve other face recognition methods, as shown in Appendix F.\n\nTable 8: Accuracy (%) on SphereFace-20 network.\n$m_{\\mathrm{SF}}$ | LFW, SphereFace | LFW, SphereFace+ | MegaFace, SphereFace | MegaFace, SphereFace+\n1 | 96.35 | 97.15 | 39.12 | 45.90\n2 | 98.87 | 99.05 | 60.48 | 68.51\n3 | 98.97 | 99.13 | 63.71 | 66.89\n4 | 99.26 | 99.32 | 70.68 | 71.30\n\nTable 9: Accuracy (%) on SphereFace-64 network.\n$m_{\\mathrm{SF}}$ | LFW, SphereFace | LFW, SphereFace+ | MegaFace, SphereFace | MegaFace, SphereFace+\n1 | 96.93 | 97.47 | 41.07 | 45.55\n2 | 99.03 | 99.22 | 62.01 | 67.07\n3 | 99.25 | 99.35 | 69.69 | 70.89\n4 | 99.42 | 99.47 | 72.72 | 73.03\n\nPerformance under different $m_{\\mathrm{SF}}$. We evaluate SphereFace+ with two different architectures (SphereFace-20 and SphereFace-64) proposed in [28]. Specifically, SphereFace-20 and SphereFace-64 are 20-layer and 64-layer modified residual networks, respectively. We train our network with the publicly available CASIA-WebFace dataset [60], and then test the learned model on the LFW and MegaFace datasets. On the MegaFace dataset, the reported accuracy is the rank-1 identification accuracy with 1 million distractors. All the results in Table 8 and Table 9 are computed without model ensemble and PCA. One can observe that SphereFace+ consistently outperforms SphereFace by a considerable margin on both LFW and MegaFace datasets under all the settings of $m_{\\mathrm{SF}}$. Moreover, the performance gain generalizes across network architectures with different depth.\nComparison to state-of-the-art methods. We also compare our methods with some widely used loss functions. All the compared methods use the SphereFace-64 network trained on the CASIA dataset. All the results in Table 10 are computed without model ensemble and PCA. Compared to the other state-of-the-art methods, SphereFace+ achieves the best accuracy on the LFW dataset, while being comparable to the best accuracy on the MegaFace dataset. Current state-of-the-art face recognition methods [49, 28, 51, 6, 31] usually only focus on compressing the intra-class features, which makes MHE a potentially useful tool to further improve these face recognition methods.\n\nTable 10: Comparison to state-of-the-art.\nMethod | LFW | MegaFace\nSoftmax Loss | 97.88 | 54.86\nSoftmax+Contrastive [45] | 98.78 | 65.22\nTriplet Loss [40] | 98.70 | 64.80\nL-Softmax Loss [29] | 99.10 | 67.13\nSoftmax+Center Loss [53] | 99.05 | 65.49\nCosineFace [51, 49] | 99.10 | 75.10\nSphereFace | 99.42 | 72.72\nSphereFace+ (ours) | 99.47 | 73.03\n\n6 Concluding Remarks\nWe borrow some useful ideas and insights from physics and propose a novel regularization method for neural networks, called minimum hyperspherical energy (MHE), to encourage the angular diversity of neuron weights. MHE can be easily applied to every layer of a neural network as a plug-in regularization, without modifying the original network architecture. Different from existing methods, such diversity can be viewed as a uniform distribution over a hypersphere. In this paper, MHE has been specifically used to improve network generalization for generic image classification, class-imbalance learning and large-scale face recognition, showing consistent improvements in all tasks. Moreover, MHE can significantly improve the image generation quality of GANs (see Appendix G). In summary, our paper casts a novel view on regularizing the neurons by introducing hyperspherical diversity.\n\nAcknowledgements\n\nThis project was supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, NSF IIS-1841351 EAGER, NSF CCF-1836822, NSF CNS-1704701, ONR N00014-15-1-2340, Intel ISTC, NVIDIA, Amazon AWS and Siemens. We would like to thank NVIDIA corporation for donating Titan Xp GPUs to support our research. We also thank Tuo Zhao for the valuable discussions and suggestions.\n\nReferences\n[1] Alireza Aghasi, Afshin Abdi, Nam Nguyen, and Justin Romberg. Net-trim: A layer-wise convex pruning of deep neural networks. In NIPS, 2017. 1\n\n[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. 20\n\n[3] Dmitriy Bilyk and Michael T Lacey. One-bit sensing, discrepancy and stolarsky\u2019s principle. Sbornik: Mathematics, 208(6):744, 2017. 6\n\n[4] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. In ICLR, 2017. 2, 20\n\n[5] Michael Cogswell, Faruk Ahmed, Ross Girshick, Larry Zitnick, and Dhruv Batra. Reducing overfitting in deep networks by decorrelating representations. In ICLR, 2016. 1, 2\n\n[6] Jiankang Deng, Jia Guo, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018. 9\n\n[7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.\n
20\n\n[8] Mario G\u00f6tz and Edward B Saff. Note on d-extremal configurations for the sphere in $\\mathbb{R}^{d+1}$. In Recent Progress in Multivariate Approximation, pages 159\u2013162. Springer, 2001. 5, 15\n\n[9] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In NIPS, 2017. 20\n\n[10] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, 2016. 1\n\n[11] DP Hardin and EB Saff. Minimal riesz energy point configurations for rectifiable d-dimensional manifolds. arXiv preprint math-ph/0311024, 2003. 5, 15\n\n[12] DP Hardin and EB Saff. Discretizing manifolds via minimum energy points. Notices of the AMS, 51(10):1186\u20131194, 2004. 5, 6\n\n[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015. 6\n\n[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 1, 6, 8, 13\n\n[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016. 8\n\n[16] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. 1\n\n[17] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017. 8\n\n[18] Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, 2007.\n
9\n\n[19] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and < 0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016. 1\n\n[20] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. 2, 6, 20\n\n[21] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann. Self-paced learning with diversity. In NIPS, 2014. 2\n\n[22] Ira Kemelmacher-Shlizerman, Steven M Seitz, Daniel Miller, and Evan Brossard. The megaface benchmark: 1 million faces for recognition at scale. In CVPR, 2016. 9\n\n[23] Arno Kuijlaars and E Saff. Asymptotics for minimal discrete energy on the sphere. Transactions of the American Mathematical Society, 350(2):523\u2013538, 1998. 5, 6, 15\n\n[24] Ludmila I Kuncheva and Christopher J Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine learning, 51(2):181\u2013207, 2003. 2\n\n[25] Naum Samouilovich Landkof. Foundations of modern potential theory, volume 180. Springer, 1972. 5, 15\n\n[26] Nan Li, Yang Yu, and Zhi-Hua Zhou. Diversity regularized ensemble pruning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2012. 2\n\n[27] Weiyang Liu, Zhen Liu, Zhiding Yu, Bo Dai, Rongmei Lin, Yisen Wang, James M Rehg, and Le Song. Decoupled networks. In CVPR, 2018. 1, 5, 6, 16\n\n[28] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In CVPR, 2017. 1, 2, 5, 9, 14, 19\n\n[29] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016. 1, 2, 4, 9, 22\n\n[30] Weiyang Liu, Yan-Ming Zhang, Xingguo Li, Zhiding Yu, Bo Dai, Tuo Zhao, and Le Song.\n
Deep hyperspherical learning. In NIPS, 2017. 1, 2, 4, 5, 6, 8, 16, 18\n\n[31] Yu Liu, Hongyang Li, and Xiaogang Wang. Rethinking feature discrimination and polymerization for large-scale recognition. arXiv preprint arXiv:1710.00870, 2017. 9\n\n[32] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In ICML, 2009. 2\n\n[33] Dmytro Mishkin and Jiri Matas. All you need is a good init. In ICLR, 2016. 2\n\n[34] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018. 2, 20\n\n[35] Ignacio Ramirez, Pablo Sprechmann, and Guillermo Sapiro. Classification and clustering via dictionary learning with structured incoherence and shared features. In CVPR, 2010. 2\n\n[36] Pau Rodr\u00edguez, Jordi Gonzalez, Guillem Cucurull, Josep M Gonfaus, and Xavier Roca. Regularizing cnns with locally constrained decorrelations. In ICLR, 2017. 1, 2, 5, 8\n\n[37] Aruni RoyChowdhury, Prakhar Sharma, Erik Learned-Miller, and Aruni Roy. Reducing duplicate filters in deep neural networks. In NIPS workshop on Deep Learning: Bridging Theory and Practice, 2017. 1\n\n[38] Edward B Saff and Arno BJ Kuijlaars. Distributing many points on a sphere. The mathematical intelligencer, 19(1):5\u201311, 1997. 5, 6\n\n[39] Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NIPS, 2016. 20\n\n[40] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015. 9\n\n[41] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. In ICML, 2016. 1\n\n[42] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.\n
arXiv:1409.1556, 2014. 1\n\n[43] Steve Smale. Mathematical problems for the next century. The mathematical intelligencer, 20(2):7\u201315, 1998. 2, 5\n\n[44] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929\u20131958, 2014. 2\n\n[45] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation from predicting 10,000 classes. In CVPR, 2014. 9\n\n[46] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015. 1, 8\n\n[47] Pieter Merkus Lambertus Tammes. On the origin of number and arrangement of the places of exit on the surface of pollen-grains. Recueil des travaux botaniques n\u00e9erlandais, 27(1):1\u201384, 1930. 5\n\n[48] Joseph John Thomson. XXIV. On the structure of the atom: an investigation of the stability and periods of oscillation of a number of corpuscles arranged at equal intervals around the circumference of a circle; with application of the results to the theory of atomic structure. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 7(39):237\u2013265, 1904. 2, 5\n\n[49] Feng Wang, Weiyang Liu, Haijun Liu, and Jian Cheng. Additive margin softmax for face verification. arXiv preprint arXiv:1801.05599, 2018. 9, 19\n\n[50] Feng Wang, Xiang Xiang, Jian Cheng, and Alan L Yuille. Normface: L2 hypersphere embedding for face verification. arXiv preprint arXiv:1704.06369, 2017. 19\n\n[51] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Zhifeng Li, Dihong Gong, Jingchao Zhou, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. arXiv preprint arXiv:1801.09414, 2018. 9, 14\n\n[52] David Warde-Farley and Yoshua Bengio. Improving generative adversarial networks with denoising feature matching.\n
In ICLR, 2017. 20\n\n[53] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016. 9\n\n[54] Bo Xie, Yingyu Liang, and Le Song. Diverse neural network learns true target functions. arXiv preprint arXiv:1611.03131, 2016. 5, 6\n\n[55] Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. arXiv:1703.01827, 2017. 2\n\n[56] Pengtao Xie, Yuntian Deng, Yi Zhou, Abhimanu Kumar, Yaoliang Yu, James Zou, and Eric P Xing. Learning latent space models with angular constraints. In ICML, 2017. 1, 2, 5\n\n[57] Pengtao Xie, Aarti Singh, and Eric P Xing. Uncorrelation and evenness: a new diversity-promoting regularizer. In ICML, 2017. 1, 2\n\n[58] Pengtao Xie, Wei Wu, Yichen Zhu, and Eric P Xing. Orthogonality-promoting distance metric learning: convex relaxation and theoretical analysis. In ICML, 2018. 2\n\n[59] Pengtao Xie, Jun Zhu, and Eric Xing. Diversity-promoting bayesian learning of latent variable models. In ICML, 2016. 1, 2\n\n[60] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv:1411.7923, 2014. 9\n\n[61] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499\u20131503, 2016. 14\n\n[62] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.\n
1", "award": [], "sourceid": 3053, "authors": [{"given_name": "Weiyang", "family_name": "Liu", "institution": "Georgia Institute of Technology"}, {"given_name": "Rongmei", "family_name": "Lin", "institution": "Emory University"}, {"given_name": "Zhen", "family_name": "Liu", "institution": "Georgia Institute of Technology"}, {"given_name": "Lixin", "family_name": "Liu", "institution": "SCUT"}, {"given_name": "Zhiding", "family_name": "Yu", "institution": "NVIDIA"}, {"given_name": "Bo", "family_name": "Dai", "institution": "Google Brain"}, {"given_name": "Le", "family_name": "Song", "institution": "Ant Financial & Georgia Institute of Technology"}]}