{"title": "Deep Hyperspherical Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3950, "page_last": 3960, "abstract": "Convolution as inner product has been the founding basis of convolutional neural networks (CNNs) and the key to end-to-end visual representation learning. Benefiting from deeper architectures, recent CNNs have demonstrated increasingly strong representation abilities. Despite such improvement, the increased depth and larger parameter space have also led to challenges in properly training a network. In light of such challenges, we propose hyperspherical convolution (SphereConv), a novel learning framework that gives angular representations on hyperspheres. We introduce SphereNet, deep hyperspherical convolution networks that are distinct from conventional inner product based convolutional networks. In particular, SphereNet adopts SphereConv as its basic convolution operator and is supervised by generalized angular softmax loss - a natural loss formulation under SphereConv. We show that SphereNet can effectively encode discriminative representation and alleviate training difficulty, leading to easier optimization, faster convergence and comparable (even better) classification accuracy over convolutional counterparts. We also provide some theoretical insights for the advantages of learning on hyperspheres. In addition, we introduce the learnable SphereConv, i.e., a natural improvement over prefixed SphereConv, and SphereNorm, i.e., hyperspherical learning as a normalization method. 
Experiments have verified our conclusions.", "full_text": "Deep Hyperspherical Learning\n\nWeiyang Liu1, Yan-Ming Zhang2, Xingguo Li3,1, Zhiding Yu4, Bo Dai1, Tuo Zhao1, Le Song1\n1Georgia Institute of Technology 2Institute of Automation, Chinese Academy of Sciences\n\n3University of Minnesota 4Carnegie Mellon University\n\n{wyliu,tourzhao}@gatech.edu, ymzhang@nlpr.ia.ac.cn, lsong@cc.gatech.edu\n\nAbstract\n\nConvolution as inner product has been the founding basis of convolutional neural\nnetworks (CNNs) and the key to end-to-end visual representation learning. Ben-\ne\ufb01ting from deeper architectures, recent CNNs have demonstrated increasingly\nstrong representation abilities. Despite such improvement, the increased depth and\nlarger parameter space have also led to challenges in properly training a network.\nIn light of such challenges, we propose hyperspherical convolution (SphereConv),\na novel learning framework that gives angular representations on hyperspheres.\nWe introduce SphereNet, deep hyperspherical convolution networks that are dis-\ntinct from conventional inner product based convolutional networks. In particular,\nSphereNet adopts SphereConv as its basic convolution operator and is supervised\nby generalized angular softmax loss - a natural loss formulation under SphereConv.\nWe show that SphereNet can effectively encode discriminative representation and\nalleviate training dif\ufb01culty, leading to easier optimization, faster convergence and\ncomparable (even better) classi\ufb01cation accuracy over convolutional counterparts.\nWe also provide some theoretical insights for the advantages of learning on hy-\nperspheres. In addition, we introduce the learnable SphereConv, i.e., a natural\nimprovement over pre\ufb01xed SphereConv, and SphereNorm, i.e., hyperspherical\nlearning as a normalization method. 
Experiments have verified our conclusions.\n\n1 Introduction\n\nRecently, deep convolutional neural networks have led to significant breakthroughs in many vision problems such as image classification [9, 18, 19, 6], segmentation [3, 13, 1], object detection [3, 16], etc. While showing stronger representation power than many conventional hand-crafted features, CNNs often require a large amount of training data and face certain training difficulties such as overfitting, vanishing/exploding gradients, covariate shift, etc. The increasing depth of recently proposed CNN architectures has further aggravated these problems.\nTo address these challenges, regularization techniques such as dropout [9] and orthogonality parameter constraints [21] have been proposed. Batch normalization [8] can also be viewed as an implicit regularization of the network, as it normalizes each layer's output distribution. Recently, deep residual learning [6] emerged as a promising way to overcome vanishing gradients in deep networks. However, [20] pointed out that residual networks (ResNets) are essentially an exponential ensemble of shallow networks: they avoid the vanishing/exploding gradient problem but do not provide a direct solution. As a result, training an ultra-deep network still remains an open problem. Besides vanishing/exploding gradients, network optimization is also very sensitive to initialization, and finding better initializations is thus widely studied [5, 14, 4]. In general, a large parameter space is double-edged, considering the benefit of representation power and the associated training difficulties. Therefore, proposing better learning frameworks to overcome such challenges remains important.\nIn this paper, we introduce a novel convolutional learning framework that can effectively alleviate training difficulties, while giving better performance than dot product based convolution. 
Our idea is to project parameter learning onto unit hyperspheres, where layer activations depend only on the geodesic distance between kernels and input signals1 instead of their inner products. To this end, we propose the SphereConv operator as the basic module for our network layers. We also propose softmax losses accordingly under such a representation framework. Specifically, the proposed softmax losses supervise network learning by also taking the SphereConv activations from the last layer instead of inner products. Note that the geodesic distance on a unit hypersphere is the angle between inputs and kernels. Therefore, the learning objective is essentially a function of the input angles, and we call it the generalized angular softmax loss in this paper. The resulting architecture is the hyperspherical convolutional network (SphereNet), which is shown in Fig. 1.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: Deep hyperspherical convolutional network architecture.\n\nOur key motivation for proposing SphereNet is that angular information matters in convolutional representation learning. We argue this motivation from several aspects: training stability, training efficiency, and generalization power. SphereNet can also be viewed as an implicit regularization of the network, as it normalizes the activation distributions. The weight norm is no longer important since the entire network operates only on angles; as a result, ℓ2 weight decay is also no longer needed in SphereNet. SphereConv to some extent also alleviates the covariate shift problem [8]. The output of SphereConv operators is bounded between −1 and 1 (0 to 1 if considering ReLU), which makes the variance of each output also bounded.\nOur second intuition is that angles preserve the most abundant discriminative information in convolutional learning. 
We gain such intuition from the 2D Fourier transform, where an image is decomposed into a combination of a set of templates with magnitude and phase information in the 2D frequency domain. If one reconstructs an image with the original magnitudes and random phases, the resulting images are generally not recognizable. However, if one reconstructs the image with random magnitudes and the original phases, the resulting images are still recognizable. This shows that the most important structural information in an image for visual recognition is encoded by the phases. This fact inspires us to project the network learning into angular space. In terms of low-level information, SphereConv is able to preserve shape, edges, texture and relative color. SphereConv can learn to selectively drop the color depth but preserve the RGB ratio. Thus the semantic information of an image is preserved.\nSphereNet can also be viewed as a non-trivial generalization of [12, 11]. By proposing a loss that discriminatively supervises the network on a hypersphere, [11] achieves state-of-the-art performance on face recognition. However, the rest of the network remains a conventional convolutional network. In contrast, SphereNet not only generalizes the hyperspherical constraint to every layer, but also to different nonlinear functions of the input angles. Specifically, we propose three instances of SphereConv operators: linear, cosine and sigmoid. The sigmoid SphereConv is the most flexible one, with a parameter controlling the shape of the angular function. As a simple extension to the sigmoid SphereConv, we also present a learnable SphereConv operator. Moreover, the proposed generalized angular softmax (GA-Softmax) loss naturally generalizes the angular supervision in [11] using the SphereConv operators. 
Additionally, the SphereConv can serve as a normalization method that is comparable to batch normalization, leading to an extension, spherical normalization (SphereNorm).\nSphereNet can be easily applied to other network architectures such as GoogLeNet [19], VGG [18] and ResNet [6]: one simply needs to replace the convolutional operators and the loss functions with the proposed SphereConv operators and hyperspherical loss functions. In summary, SphereConv can be viewed as an alternative to the original convolution operator, and serves as a new measure of correlation. SphereNet may open up an interesting direction for exploring neural networks. We ask whether the inner product based convolution operator is an optimal correlation measure for all tasks; our answer is likely to be "no".\n\n1Without loss of generality, we study CNNs here, but our method is generalizable to any other neural nets.\n\n2 Hyperspherical Convolutional Operator\n2.1 Definition\nThe convolutional operator in CNNs is simply a linear matrix multiplication, written as F(w, x) = w^T x + b_F, where w is a convolutional filter, x denotes a local patch from the bottom feature map and b_F is the bias. The matrix multiplication here essentially computes the similarity between the local patch and the filter. Thus the standard convolution layer can be viewed as patch-wise matrix multiplication. Different from the standard convolutional operator, the hyperspherical convolutional (SphereConv) operator computes the similarity on a hypersphere and is defined as:\n\nF_s(w, x) = g(θ(w,x)) + b_{F_s},   (1)\n\nwhere θ(w,x) is the angle between the kernel parameter w and the local patch x. g(θ(w,x)) denotes a function of θ(w,x) (usually a monotonically decreasing function), and b_{F_s} is the bias. To simplify analysis and discussion, the bias terms are usually left out. The angle θ(w,x) can be interpreted as the geodesic distance (arc length) between w and x on a unit hypersphere. In contrast to the convolutional operator that works in the entire space, SphereConv only focuses on the angles between local patches and the filters, and therefore operates on the hypersphere space. In this paper, we present three specific instances of the SphereConv operator. To facilitate the computation, we constrain the output of SphereConv operators to [−1, 1] (although this is not a necessary requirement).\nLinear SphereConv. In the linear SphereConv operator, g is a linear function of θ(w,x), with the form\n\ng(θ(w,x)) = aθ(w,x) + b,   (2)\n\nwhere a and b are parameters of the linear SphereConv operator. In order to constrain the output range to [−1, 1] while θ(w,x) ∈ [0, π], we use a = −2/π and b = 1 (not necessarily an optimal design).\nCosine SphereConv. The cosine SphereConv operator is a nonlinear function of θ(w,x), with its g being of the form\n\ng(θ(w,x)) = cos(θ(w,x)),   (3)\n\nwhich can be reformulated as w^T x / (||w||_2 ||x||_2). Therefore, it can be viewed as a doubly normalized convolutional operator, which bridges the SphereConv operator and the convolutional operator.\nSigmoid SphereConv. The sigmoid SphereConv operator is derived from the sigmoid function, and its g can be written as\n\ng(θ(w,x)) = [(1 + exp(−π/(2k))) / (1 − exp(−π/(2k)))] · [(1 − exp(θ(w,x)/k − π/(2k))) / (1 + exp(θ(w,x)/k − π/(2k)))],   (4)\n\nFigure 2: SphereConv operators.\n\nwhere k > 0 is the parameter that controls the curvature of the function. 
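As a concrete reference, the three operators above can be sketched in a few lines (a hedged sketch, not the authors' code; numpy only, biases omitted, with the a = −2/π, b = 1 choice from Eq. (2)):

```python
import numpy as np

def angle(w, x):
    """Geodesic distance on the unit hypersphere: the angle between w and x."""
    c = (w @ x) / (np.linalg.norm(w) * np.linalg.norm(x))
    return np.arccos(np.clip(c, -1.0, 1.0))

def g_linear(t):                      # Eq. (2) with a = -2/pi, b = 1; range [-1, 1]
    return -2.0 / np.pi * t + 1.0

def g_cosine(t):                      # Eq. (3)
    return np.cos(t)

def g_sigmoid(t, k=0.3):              # Eq. (4); k > 0 controls the curvature
    a = np.pi / (2.0 * k)
    return ((1 + np.exp(-a)) / (1 - np.exp(-a))
            * (1 - np.exp(t / k - a)) / (1 + np.exp(t / k - a)))

def sphere_conv(w, x, g=g_cosine):    # Eq. (1), bias omitted
    return g(angle(w, x))

# all three map theta = 0 to 1 and theta = pi to -1
for g in (g_linear, g_cosine, g_sigmoid):
    assert abs(g(0.0) - 1.0) < 1e-6 and abs(g(np.pi) + 1.0) < 1e-6
# an orthogonal kernel/patch pair sits at theta = pi/2
assert abs(sphere_conv(np.array([1.0, 0.0]), np.array([0.0, 2.0]))) < 1e-9
```

Note that the output depends only on the angle: rescaling either `w` or `x` leaves `sphere_conv` unchanged.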
When k is close to 0, g(θ(w,x)) approximates the step function. As k becomes larger, g(θ(w,x)) behaves more like a linear function, i.e., the linear SphereConv operator. The sigmoid SphereConv is one instance of the parametric SphereConv family; with more parameters introduced, the parametric SphereConv can have richer representation power. To increase the flexibility of the parametric SphereConv, we discuss later in the paper the case where these parameters are jointly learned via back-prop.\n2.2 Optimization\nThe optimization of the SphereConv operators is nearly the same as that of the convolutional operator and also follows standard back-propagation. Using the chain rule, we have the gradients of the SphereConv with respect to the weights and the feature input:\n\n∂g(θ(w,x))/∂w = ∂g(θ(w,x))/∂θ(w,x) · ∂θ(w,x)/∂w,   ∂g(θ(w,x))/∂x = ∂g(θ(w,x))/∂θ(w,x) · ∂θ(w,x)/∂x.   (5)\n\nFor different SphereConv operators, both ∂θ(w,x)/∂w and ∂θ(w,x)/∂x are the same, so the only difference lies in the ∂g(θ(w,x))/∂θ(w,x) part. For ∂θ(w,x)/∂w and ∂θ(w,x)/∂x, we have\n\n∂θ(w,x)/∂w = ∂ arccos(w^T x / (||w||_2 ||x||_2)) / ∂w,   ∂θ(w,x)/∂x = ∂ arccos(w^T x / (||w||_2 ||x||_2)) / ∂x,   (6)\n\nwhich are straightforward to compute and therefore omitted here. Because ∂g(θ(w,x))/∂θ(w,x) for the linear SphereConv, the cosine SphereConv and the sigmoid SphereConv is a, −sin(θ(w,x)) and −2 exp(θ(w,x)/k − π/(2k)) / (k(1 + exp(θ(w,x)/k − π/(2k)))^2) respectively, all these partial gradients can be easily computed.\n2.3 Theoretical Insights\nWe provide a fundamental analysis of the cosine SphereConv operator in the case of a linear neural network to justify that the SphereConv operator can improve the conditioning of the problem. Specifically, we consider one layer of a linear neural network, where the observation is F = U*V*^T (ignoring the bias), U* ∈ R^{n×k} is the weight, and V* ∈ R^{m×k} is the input that embeds the weights from previous layers. Without loss of generality, we assume the rows satisfy ||U_{i,:}||_2 = ||V_{j,:}||_2 = 1 for all i = 1, ..., n and j = 1, ..., m, and consider\n\nmin_{U∈R^{n×k}, V∈R^{m×k}} G(U, V) = (1/2)||F − UV^T||_F^2.   (7)\n\nThis is closely related to matrix factorization, and (7) can also be viewed as the expected version of the matrix sensing problem [10]. The following lemma demonstrates a critical scaling issue of (7) for U and V that significantly deteriorates the conditioning without changing the objective of (7).\nLemma 1. Consider a pair of global optimal points U, V satisfying F = UV^T and Tr(V^T V ⊗ I_n) ≤ Tr(U^T U ⊗ I_m). For any real c > 1, let Ũ = cU and Ṽ = V/c; then we have κ(∇²G(Ũ, Ṽ)) = Ω(c²κ(∇²G(U, V))), where κ = λ_max/λ_min is the restricted condition number, with λ_max being the largest eigenvalue and λ_min being the smallest nonzero eigenvalue.\nLemma 1 implies that the conditioning of problem (7) at an unbalanced global optimum scaled by a constant c is Ω(c²) times larger than the conditioning at a balanced global optimum. Note that λ_min = 0 may happen, so we consider the restricted condition number here. Similar results hold beyond global optima. This is an undesired geometric structure, which further leads to slow and unstable optimization procedures, e.g., using stochastic gradient descent (SGD). This motivates us to consider the SphereConv operator discussed above, which is equivalent to projecting data onto the hypersphere and leads to a better conditioned problem.\nNext, we consider our proposed cosine SphereConv operator for one layer of the linear neural network. Based on our previous discussion of SphereConv, we consider an equivalent problem:\n\nmin_{U∈R^{n×k}, V∈R^{m×k}} G_S(U, V) = (1/2)||F − D_U U V^T D_V||_F^2,   (8)\n\nwhere D_U = diag(1/||U_{1,:}||_2, ..., 1/||U_{n,:}||_2) ∈ R^{n×n} and D_V = diag(1/||V_{1,:}||_2, ..., 1/||V_{m,:}||_2) ∈ R^{m×m} are diagonal matrices. We provide a result analogous to Lemma 1 for (8).\nLemma 2. For any real c > 1, let Ũ = cU and Ṽ = V/c; then we have λ_i(∇²G_S(Ũ, Ṽ)) = λ_i(∇²G_S(U, V)) for all i ∈ [(n + m)k] = {1, 2, ..., (n + m)k} and κ(∇²G_S(Ũ, Ṽ)) = κ(∇²G_S(U, V)), where κ is defined as in Lemma 1.\nWe see from Lemma 2 that the issue of increasing condition number caused by the scaling is eliminated by the SphereConv operator in the entire parameter space. This enhances the geometric structure over (7), which further results in improved convergence of optimization procedures. If we extend the result from one layer to multiple layers, the scaling issue propagates. 
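Lemma 1's scaling effect can be seen numerically in the k = 1 case. The sketch below is our own illustration (not from the paper): for k = 1 one can derive by hand that the Hessian of G at a rank-one global optimum F = UV^T is the block matrix [[||V||² I, UV^T], [VU^T, ||U||² I]], and rescaling (U, V) to (cU, V/c) leaves the objective unchanged while blowing up the restricted condition number:

```python
import numpy as np

def hessian_at_opt(U, V):
    # Hessian of G(U, V) = 0.5 * ||F - U V^T||_F^2 at a global optimum F = U V^T,
    # k = 1 case, parameters ordered as (U, V); derived by hand for this sketch
    n, m = len(U), len(V)
    H = np.zeros((n + m, n + m))
    H[:n, :n] = (V @ V) * np.eye(n)
    H[n:, n:] = (U @ U) * np.eye(m)
    H[:n, n:] = np.outer(U, V)
    H[n:, :n] = np.outer(V, U)
    return H

def restricted_cond(H, tol=1e-9):
    lam = np.linalg.eigvalsh(H)
    lam = lam[lam > tol]                 # drop zero eigenvalues (restricted kappa)
    return lam.max() / lam.min()

U = np.array([1.0, 0.0])                 # balanced optimum, ||U|| = ||V|| = 1
V = np.array([1.0, 0.0])
c = 3.0                                  # rescaling leaves F = U V^T unchanged
kappa_bal = restricted_cond(hessian_at_opt(U, V))
kappa_scaled = restricted_cond(hessian_at_opt(c * U, V / c))
assert kappa_scaled > c**2 * kappa_bal   # conditioning degrades at least as c^2
```

Here kappa_bal is 2 while kappa_scaled is roughly c⁴, consistent with the Ω(c²) lower bound stated in Lemma 1.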
Roughly speaking, when we train N layers, in the worst case the conditioning of the problem can be c^N times worse with a scaling factor c > 1. The analysis is similar to the one-layer case, but the computation of the Hessian matrix and the associated eigenvalues is much more complicated. Though our analysis is elementary, it provides an important insight and a straightforward illustration of the advantage of using the SphereConv operator. The extension to more general cases, e.g., using a nonlinear activation function (e.g., ReLU), requires much more sophisticated analysis to bound the eigenvalues of the Hessian, which is deferred to future investigation.\n2.4 Discussion\nComparison to convolutional operators. Convolutional operators compute the inner product between the kernels and the local patches, while the SphereConv operators compute a function of the angle between the kernels and the local patches. If we normalize the convolutional operator in terms of both w and x, then the normalized convolutional operator is equivalent to the cosine SphereConv operator. Essentially, they use different metric spaces. Interestingly, SphereConv operators can also be interpreted as functions of the geodesic distance on a unit hypersphere.\nExtension to fully connected layers. Because a fully connected layer can be viewed as a special convolution layer with the kernel size equal to the input feature map, the SphereConv operators can be easily generalized to fully connected layers. This also indicates that SphereConv operators can be applied not only to deep CNNs, but also to linear models like logistic regression, SVM, etc.\nNetwork Regularization. Because the norm of the weights is no longer crucial, we stop using ℓ2 weight decay to regularize the network. SphereNets are learned on hyperspheres, so we regularize the network based on angles instead of norms. To avoid redundant kernels, we want the kernels uniformly spaced around the hypersphere, but it is difficult to formulate such constraints. As a tradeoff, we encourage orthogonality. Given a set of kernels W, where the i-th column W_i holds the weights of the i-th kernel, the network will also minimize ||W^T W − I||_F^2, where I is an identity matrix.\nDetermining the optimal SphereConv. In practice, we could treat the type of SphereConv as a hyperparameter and use cross validation to determine which SphereConv is the most suitable one. For sigmoid SphereConv, we could also use cross validation to determine its hyperparameter k. In general, we need to specify a SphereConv operator before using it, but prefixing a SphereConv may not be an optimal choice (even with cross validation). What if we treat the hyperparameter k in sigmoid SphereConv as a learnable parameter and use back-prop to learn it? Following this idea, we further extend sigmoid SphereConv to a learnable SphereConv in the next subsection.\nSphereConv as normalization. Because SphereConv can partially address covariate shift, it can also serve as a normalization method similar to batch normalization. The difference is that SphereConv normalizes the network in terms of the feature map and the kernel weights, while batch normalization works on mini-batches. Thus they do not conflict with each other and can be used simultaneously.\n2.5 Extension: Learnable SphereConv and SphereNorm\nLearnable SphereConv. It is a natural idea to replace the current prefixed SphereConv with a learnable one. There are plenty of parametrization choices for making the SphereConv learnable; we present a very simple learnable SphereConv operator based on the sigmoid SphereConv. Because the sigmoid SphereConv has a hyperparameter k, we can treat it as a learnable parameter that can be updated by back-prop. 
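The angular regularizer ||W^T W − I||_F^2 described under Network Regularization can be sketched as follows (our own illustration; columns of W hold the kernels):

```python
import numpy as np

def orth_penalty(W):
    """Angular regularizer ||W^T W - I||_F^2: zero for orthonormal kernels,
    large when kernels point in similar directions on the hypersphere."""
    k = W.shape[1]
    M = W.T @ W - np.eye(k)
    return float(np.sum(M * M))

# orthonormal kernels incur no penalty; a duplicated kernel is penalized
Q, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((8, 3)))
assert orth_penalty(Q) < 1e-10
W_dup = np.stack([Q[:, 0], Q[:, 0], Q[:, 1]], axis=1)   # kernel 0 duplicated
assert orth_penalty(W_dup) > 1.0
```

During training this penalty would simply be added to the task loss; it discourages redundant kernels without constraining their norms, matching the angle-based regularization the text argues for.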
In back-prop, k is updated using k_{t+1} = k_t + η ∂L/∂k, where t denotes the current iteration index and ∂L/∂k can be easily computed by the chain rule. Usually, we also require k to be positive. The learning of k is in fact similar to the parameter learning in PReLU [5].\nSphereNorm: hyperspherical learning as a normalization method. Similar to batch normalization (BatchNorm), we note that hyperspherical learning can also be viewed as a form of normalization, because SphereConv constrains the output values to [−1, 1] ([0, 1] after ReLU). Different from BatchNorm, SphereNorm normalizes the network based on spatial information and the weights, so it has nothing to do with mini-batch statistics. Because SphereNorm normalizes both the input and the weights, it can avoid covariate shift due to large weights and large inputs, while BatchNorm can only prevent covariate shift caused by the inputs. In this sense, it will work better than BatchNorm when the batch size is small. Besides, SphereConv is more flexible in terms of design choices (e.g., linear, cosine, and sigmoid), and each may lead to different advantages.\nSimilar to BatchNorm, we can use a rescaling strategy for SphereNorm. Specifically, we rescale the output of SphereConv via βF_s(w, x) + γ, where β and γ are learned by back-prop (similar to BatchNorm, the rescaling parameters can be either learned or prefixed). In fact, SphereNorm does not conflict with BatchNorm at all and can be used simultaneously with BatchNorm. Interestingly, we find that using both is empirically better than using either one alone.\n3 Learning Objective on Hyperspheres\nFor learning on hyperspheres, we can either use a conventional loss function such as the softmax loss, or use loss functions that are tailored to the SphereConv operators. We present some possible choices for these tailored loss functions.\nWeight-normalized Softmax Loss. 
The input feature and its label are denoted as x_i and y_i, respectively. The original softmax loss can be written as L = (1/N) Σ_i L_i = (1/N) Σ_i −log(e^{f_{y_i}} / Σ_j e^{f_j}), where N is the number of training samples and f_j is the score of the j-th class (j ∈ [1, K], K is the number of classes). The class score vector f is usually the output of a fully connected layer W, so we have f_j = W_j^T x_i + b_j and f_{y_i} = W_{y_i}^T x_i + b_{y_i}, in which x_i, W_j and W_{y_i} are the i-th training sample, and the j-th and y_i-th columns of W, respectively. We can rewrite L_i as\n\nL_i = −log( e^{W_{y_i}^T x_i + b_{y_i}} / Σ_j e^{W_j^T x_i + b_j} ) = −log( e^{||W_{y_i}|| ||x_i|| cos(θ_{y_i,i}) + b_{y_i}} / Σ_j e^{||W_j|| ||x_i|| cos(θ_{j,i}) + b_j} ),   (9)\n\nwhere θ_{j,i} (0 ≤ θ_{j,i} ≤ π) is the angle between the vectors W_j and x_i. The decision boundary of the original softmax loss is determined by the vector f. Specifically, in the binary-class case, the decision boundary of the softmax loss is W_1^T x + b_1 = W_2^T x + b_2. Considering the intuition of the SphereConv operators, we want to make the decision boundary depend only on the angles. To this end, we normalize the weights (||W_j|| = 1) and zero out the biases (b_j = 0), following the intuition in [11] (sometimes we could keep the biases when the data is imbalanced). The decision boundary becomes ||x|| cos(θ_1) = ||x|| cos(θ_2). 
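The rewrite in Eq. (9) and the weight-normalized decision rule can be verified numerically (a sketch, assuming numpy; the shapes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)               # one input feature x_i
W = rng.standard_normal((4, 3))          # columns W_j are class weights
b = np.zeros(3)

# inner-product scores vs. the norm-angle rewrite of Eq. (9)
f_inner = W.T @ x + b
cos_t = (W.T @ x) / (np.linalg.norm(W, axis=0) * np.linalg.norm(x))
f_angle = np.linalg.norm(W, axis=0) * np.linalg.norm(x) * cos_t + b
assert np.allclose(f_inner, f_angle)

# with unit-norm weights and zero biases, scores reduce to ||x|| cos(theta_j),
# so the decision depends only on the angles theta_j
Wn = W / np.linalg.norm(W, axis=0)
assert np.allclose(Wn.T @ x, np.linalg.norm(x) * cos_t)
```

Since ||x|| is shared across classes, the argmax over the normalized scores is decided purely by cos(θ_j), which is exactly the angular decision boundary described above.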
Similar to SphereConv, we can generalize the decision boundary to ||x||g(θ_1) = ||x||g(θ_2), so the weight-normalized softmax (W-Softmax) loss can be written as\n\nL_i = −log( e^{||x_i|| g(θ_{y_i,i})} / Σ_j e^{||x_i|| g(θ_{j,i})} ),   (10)\n\nwhere g(·) can take the form of the linear, cosine, or sigmoid SphereConv. We thus term these three weight-normalized loss functions the linear W-Softmax loss, cosine W-Softmax loss, and sigmoid W-Softmax loss, respectively.\nGeneralized Angular Softmax Loss. Inspired by [11], we use a multiplicative parameter m to impose margins on hyperspheres. We propose a generalized angular softmax (GA-Softmax) loss, which extends the W-Softmax loss to a loss function that favors a feature distribution with large angular margin. In general, the GA-Softmax loss is formulated as\n\nL_i = −log( e^{||x_i|| g(mθ_{y_i,i})} / ( e^{||x_i|| g(mθ_{y_i,i})} + Σ_{j≠y_i} e^{||x_i|| g(θ_{j,i})} ) ),   (11)\n\nwhere g(·) can also take the linear, cosine or sigmoid form, similar to the W-Softmax loss. We can see that the A-Softmax loss [11] is exactly the cosine GA-Softmax loss, and the W-Softmax loss is the special case (m = 1) of the GA-Softmax loss. Note that we usually require θ_{j,i} ∈ [0, π/m], because cos(mθ_{j,i}) is monotonically decreasing only when mθ_{j,i} ∈ [0, π]. To address this, [12, 11] construct a monotonically decreasing function recursively from the [0, π/m] part of cos(mθ_{j,i}). Although this indeed partially addresses the issue, it may introduce a number of saddle points (w.r.t. W) in the loss surface. Originally, ∂g/∂θ is close to 0 only when θ is close to 0 or π. However, in L-Softmax [12] or A-Softmax (cosine GA-Softmax), this is not the case: ∂g/∂θ is 0 whenever θ = kπ/m, k = 0, ..., m, which may cause instability in training. The sigmoid GA-Softmax loss has similar issues. If we instead use the linear GA-Softmax loss, this problem is automatically resolved and training will likely become more stable in practice. There are also many choices of g(·) for designing a specific GA-Softmax loss, and each one has different optimization dynamics. The optimal one may depend on the task itself (e.g., cosine GA-Softmax has been shown effective in deep face recognition [11]).\nDiscussion of Sphere-normalized Softmax Loss. We have also considered the sphere-normalized softmax loss (S-Softmax), which simultaneously normalizes the weights (W_j) and the feature x. It seems a more natural choice than W-Softmax for the proposed SphereConv and makes the entire framework more unified. In fact, we have tried this, and the empirical results are not that good, because the optimization becomes very difficult. If we use the S-Softmax loss to train a network from scratch, we cannot get reasonable results without extra tricks, which is why we do not use it in this paper. For completeness, we give some discussion here. Normally, it is very difficult to make the S-Softmax loss value small enough, because we normalize the features to the unit hypersphere. To make this loss work, we need to either normalize the features to a value much larger than 1 (a hypersphere with large radius) and then tune the learning rate, or first train the network with the softmax loss from scratch and then use the S-Softmax loss for finetuning.\n4 Experiments and Results\n4.1 Experimental Settings\nWe first perform a comprehensive ablation study and exploratory experiments for the proposed SphereNets, and then evaluate the SphereNets on image classification. 
For the image classification task, we perform experiments on CIFAR-10 (only with random left-right flipping), CIFAR-10+ (with full data augmentation), CIFAR-100 and the large-scale ImageNet 2012 dataset [17].\nGeneral Settings. For CIFAR-10, CIFAR-10+ and CIFAR-100, we follow the same settings as [7, 12]. For the ImageNet 2012 dataset, we mostly follow the settings in [9]. We attach more details in Appendix B. For fairness, batch normalization and ReLU are used in all methods unless specified otherwise, and the compared CNNs have the same architectures as the SphereNets.\nTraining. Appendix A gives the network details. For CIFAR-10 and CIFAR-100, we use ADAM, starting with a learning rate of 0.001. The batch size is 128 unless specified otherwise. The learning rate is divided by 10 at 34K and 54K iterations, and training stops at 64K. For both the A-Softmax and GA-Softmax losses, we use m = 4. For ImageNet 2012, we use SGD with momentum 0.9. The learning rate starts at 0.1 and is divided by 10 at 200K and 375K iterations. Training stops at 550K iterations.\n4.2 Ablation Study and Exploratory Experiments\nWe perform a comprehensive ablation and exploratory study on the SphereNet and evaluate every component individually in order to analyze its advantages. 
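The CIFAR training schedule stated in Sec. 4.1 above can be written as a small step function (a sketch of the stated schedule, not the authors' code):

```python
def cifar_lr(iteration, base=0.001):
    """Step schedule from Sec. 4.1: ADAM starting at 0.001,
    divided by 10 at 34K and 54K iterations, training stops at 64K."""
    if iteration < 34000:
        return base
    if iteration < 54000:
        return base / 10
    return base / 100

# e.g. cifar_lr(1000) is 0.001 and cifar_lr(40000) is 0.0001
```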
We use the 9-layer CNN as default (if not specified) and perform image classification on CIFAR-10 without any data augmentation.\n\nOperator \ Loss | Softmax | Sigmoid (0.1) W-Softmax | Sigmoid (0.3) W-Softmax | Sigmoid (0.7) W-Softmax | Linear W-Softmax | Cosine W-Softmax | A-Softmax (m=4) | GA-Softmax (m=4)\nSigmoid (0.1) | 90.97 | 90.91 | 90.89 | 90.88 | 91.07 | 91.13 | 91.87 | 91.99\nSigmoid (0.3) | 91.08 | 91.44 | 91.37 | 91.21 | 91.34 | 91.28 | 92.13 | 92.38\nSigmoid (0.7) | 91.05 | 91.16 | 91.47 | 91.07 | 90.99 | 91.18 | 92.22 | 92.36\nLinear | 91.10 | 90.93 | 91.42 | 90.96 | 90.95 | 91.24 | 92.21 | 92.32\nCosine | 90.89 | 90.88 | 91.08 | 91.22 | 91.17 | 90.99 | 91.94 | 92.19\nOriginal Conv | 90.58 | 90.58 | 90.73 | 90.78 | 91.08 | 90.68 | 91.78 | 91.80\nTable 1: Classification accuracy (%) with different loss functions (rows: convolution operators; columns: loss functions).\n\nComparison of different loss functions. We first evaluate all the SphereConv operators with different loss functions. All the compared SphereConv operators use the 9-layer CNN architecture in this experiment. From the results in Table 1, one can observe that the SphereConv operators consistently outperform the original convolutional operator. For the compared loss functions other than A-Softmax and GA-Softmax, the effect on accuracy is less pronounced than that of the SphereConv operators, but the sigmoid W-Softmax loss is more flexible and thus works slightly better than the others. The sigmoid SphereConv operators with suitably chosen parameters also work better than the others. Note that the W-Softmax loss is in fact comparable to the original softmax loss, because our SphereNet optimizes angles and the W-Softmax loss is derived from the original softmax loss. Therefore, it is fair to compare SphereNet with W-Softmax against a CNN with the softmax loss. From Table 1, we can see that the SphereConv operators are consistently better than the convolutional operators. 
When we use a large-margin loss function such as A-Softmax [11] or the proposed GA-Softmax, the accuracy can be further boosted. One may notice that A-Softmax is in fact cosine GA-Softmax. The superior performance of A-Softmax with SphereNet shows that our architecture is more suitable for learning with angular losses. Moreover, our proposed large-margin loss (linear GA-Softmax) performs the best among all the compared loss functions.
Comparison of different network architectures. We are also interested in how our SphereConv operators work in different architectures. We evaluate all the proposed SphereConv operators with the same architecture at different depths and with an entirely different architecture (ResNet). Our baseline CNN architecture follows the design of the VGG network [18], only with different numbers of convolutional layers. For a fair comparison, we use cosine W-Softmax for all SphereConv operators and the original softmax for the original convolution operators. From the results in Table 2, one can see that SphereNets greatly outperform the CNN baselines, usually by more than 1%. When applied to ResNet, our SphereConv operators also work better than the baseline. Note that we use a ResNet architecture similar to the CIFAR-10 experiment in [6]. We do not use data augmentation for CIFAR-10 in this experiment, so the ResNet accuracy is much lower than the one reported in [6]. Our results on different network architectures show consistent and significant improvement over CNNs.

SphereConv Operator  CNN-3  CNN-9  CNN-18  CNN-45  CNN-60  ResNet-32
Sigmoid (0.1)        82.08  91.13  91.43   89.34   87.67   90.94
Sigmoid (0.3)        81.92  91.28  91.55   89.73   87.85   91.70
Sigmoid (0.7)        82.40  91.18  91.69   89.85   88.42   91.19
Linear               82.31  91.15  91.24   90.15   89.91   91.25
Cosine               82.23  90.99  91.23   90.05   89.28   91.38
Original Conv        81.19  90.68  90.62   88.23   88.15   90.40

Table 2: Classification accuracy (%) with different network architectures.

SphereConv Operator  Acc. (%)
Sigmoid (0.1)        86.29
Sigmoid (0.3)        85.67
Sigmoid (0.7)        85.51
Linear               85.34
Cosine               85.25
CNN w/o ReLU         80.73

Table 3: Acc. w/o ReLU.

Comparison of different widths (numbers of filters). We evaluate SphereNet with different numbers of filters. Fig. 3(c) shows the convergence of SphereNets of different widths; 16/32/48 means conv1.x, conv2.x and conv3.x have 16, 32 and 48 filters, respectively. One can observe that when the number of filters is small, SphereNet performs similarly to the CNN (slightly worse). However, as we increase the number of filters, the final accuracy surpasses the CNN baseline, with even faster and more stable convergence. With large width, we find that SphereNets perform consistently better than the CNN baselines, showing that SphereNets can make better use of width.
Learning without ReLU. We notice that a SphereConv operator is no longer a matrix multiplication, so it is essentially a non-linear function. Because the SphereConv operators already introduce certain non-linearity to the network, we evaluate how much gain such non-linearity brings. To this end, we remove the ReLU activation and compare our SphereNet with CNNs without ReLU. The results are given in Table 3. All compared methods use 18-layer CNNs (with BatchNorm). Although removing ReLU greatly reduces the classification accuracy, our SphereNet still outperforms the CNN without ReLU by a significant margin, showing its rich non-linearity and representation power.

Figure 3: Testing accuracy over iterations. (a) ResNet vs. SphereResNet. (b) Plain CNN vs. plain SphereNet. (c) Different width of SphereNet. (d) Ultra-deep plain CNN vs. ultra-deep plain SphereNet.

Convergence.
One of the most significant advantages of SphereNet is its training stability and convergence speed. We evaluate the convergence with two different architectures: CNN-9 and ResNet-32. For a fair comparison, we use the original softmax loss for all compared methods (including SphereNets). ADAM is used for the stochastic optimization and the learning rate is the same for all networks. From Fig. 3(a), the SphereResNet converges significantly faster than the original ResNet baseline on both CIFAR-10 and CIFAR-10+, and its final accuracy is also higher than the baselines. In Fig. 3(b), we evaluate SphereNet with and without orthogonality constraints on the kernel weights. With the same network architecture, SphereNet also converges much faster and performs better than the baselines, and the orthogonality constraints can bring additional performance gains in some cases. More generally, one can observe from Fig. 3 that SphereNet converges quickly and very stably in every case, while the CNN baseline fluctuates within a relatively wide range.
Optimizing ultra-deep networks. Partially because of the alleviation of the covariate shift problem and the improvement in conditioning, our SphereNet is able to optimize ultra-deep neural networks without using residual units or any form of shortcuts. For SphereNets, we use the cosine SphereConv operator with the cosine W-Softmax loss. We directly optimize a very deep plain network with 69 stacked convolutional layers. From Fig. 3(d), one can see that SphereNet converges much more easily than the CNN baseline and is able to achieve nearly 90% final accuracy.
4.3 Preliminary Study towards Learnable SphereConv
Although the learnable SphereConv is not a main theme of this paper, we still run some preliminary evaluations on it. For the proposed learnable sigmoid SphereConv, we learn the parameter k independently for each filter.
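A minimal sketch of this per-filter parameterization (illustrative names; keeping k in (0, 1) via a logistic squashing is our assumption, not a detail specified in the paper):

```python
import math

def squash(raw):
    """Logistic map from an unconstrained parameter to k in (0, 1), so that
    gradient updates can never push k out of its valid range (an assumption
    for this sketch, not necessarily the authors' implementation)."""
    return 1.0 / (1.0 + math.exp(-raw))

class PerFilterK:
    """Illustrative container: one unconstrained parameter per filter.
    All k start at 0.5 (raw = 0); a real framework would update raw
    by back-propagation along with the kernel weights."""
    def __init__(self, num_filters):
        self.raw = [0.0] * num_filters

    def k(self, i):
        return squash(self.raw[i])
```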
It is also trivial to learn it in a layer-shared or network-shared fashion. With the same 9-layer architecture used in Section 4.2, the learnable SphereConv (with cosine W-Softmax loss) achieves 91.64% on CIFAR-10 (without full data augmentation), while the best sigmoid SphereConv (with cosine W-Softmax loss) achieves 91.22%. In Fig. 4, we also plot the frequency histograms of k in Conv1.1 (64 filters), Conv2.1 (96 filters) and Conv3.1 (128 filters) of the final learned SphereNet. From Fig. 4, we observe that each layer learns a different distribution of k. The first convolutional layer (Conv1.1) tends to distribute k uniformly over a large range of values from 0 to 1, potentially extracting information from all levels of angular similarity. The fourth convolutional layer (Conv2.1) tends to learn a more concentrated distribution of k than Conv1.1, while the seventh convolutional layer (Conv3.1) learns a highly concentrated distribution of k centered around 0.8. Note that we initialize all k to a constant 0.5 and learn them with back-propagation.
4.4 Evaluation of SphereNorm
From Section 4.2, we can clearly see the convergence advantage of SphereNets. In general, we can view SphereConv as a normalization method (comparable to batch normalization) that can be applied to all kinds of networks.
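Viewed this way, the cosine SphereConv acts as a normalization. A minimal sketch, assuming the response is simply the cosine of the angle between a kernel and an input patch:

```python
import math

def sphere_norm_response(w, x, eps=1e-12):
    """Cosine SphereConv as a normalization: the response is the cosine of
    the angle between kernel w and input patch x (both flattened), so it is
    bounded in [-1, 1] and invariant to the scales of both w and x."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    return dot / max(norm_w * norm_x, eps)
```

Because the response is bounded and scale-invariant, it controls the magnitude of pre-activations much as activation normalization does, which motivates using it as a drop-in normalization.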
This section evaluates the challenging scenario where the mini-batch size is small (results under batch size 128 can be found in Section 4.2). We use the same 9-layer CNN as in Section 4.2. For simplicity, we use the cosine SphereConv as SphereNorm. The softmax loss is used in both CNNs and SphereNets. From Fig. 5, we observe that SphereNorm achieves a final accuracy similar to BatchNorm, but converges faster and more stably. SphereNorm plus the orthogonality constraint helps convergence slightly, while rescaled SphereNorm does not seem to work well. When BatchNorm and SphereNorm are used together, we obtain the fastest convergence and the highest final accuracy, showing the excellent compatibility of SphereNorm.

Figure 4: Frequency histogram of k.

Figure 5: Convergence under different mini-batch sizes on CIFAR-10 dataset (same setting as Section 4.2).

4.5 Image Classification on CIFAR-10+ and CIFAR-100
We first evaluate SphereNet in the classic image classification task. We use the CIFAR-10+ and CIFAR-100 datasets and perform random flips (both horizontal and vertical) and random crops as data augmentation (CIFAR-10 with full data augmentation is denoted as CIFAR-10+). We use ResNet-32 as the baseline architecture. For the SphereNet of the same architecture, we evaluate the sigmoid SphereConv operator (k = 0.3) with the sigmoid W-Softmax (k = 0.3) loss (S-SW), the linear SphereConv operator with the linear W-Softmax loss (L-LW), the cosine SphereConv operator with the cosine W-Softmax loss (C-CW), and the sigmoid SphereConv operator (k = 0.3) with the GA-Softmax loss (S-G). From Table 4, we can see that SphereNet outperforms many current state-of-the-art methods and is even comparable to ResNet-1001, which is far deeper than ours. This experiment further validates our idea that learning on hyperspheres constrains the parameter space to a more semantic and label-related one.

Method                        CIFAR-10+  CIFAR-100
ELU [2]                       94.16      72.34
FitResNet (LSUV) [14]         93.45      65.72
ResNet-1001 [7]               95.38      77.29
Baseline ResNet-32 (softmax)  93.26      72.85
SphereResNet-32 (S-SW)        94.47      76.02
SphereResNet-32 (L-LW)        94.33      75.62
SphereResNet-32 (C-CW)        94.64      74.92
SphereResNet-32 (S-G)         95.01      76.39

Table 4: Acc. (%) on CIFAR-10+ & CIFAR-100.

4.6 Large-scale Image Classification on ImageNet-2012
We evaluate SphereNets on the large-scale ImageNet-2012 dataset. We only use the minimum data augmentation strategy in this experiment (details in Appendix B). For the ResNet-18 baseline and SphereResNet-18, we use the same filter numbers in each layer. We develop two types of SphereResNet-18, termed v1 and v2 respectively.
In SphereResNet-18-v2, we do not use SphereConv in the 1×1 shortcut convolutions that match the numbers of channels; in SphereResNet-18-v1, we do. Fig. 6 shows the single-crop validation error over iterations. One can observe that both SphereResNets converge much faster than the ResNet baseline; SphereResNet-18-v1 converges the fastest but yields a slightly worse yet comparable accuracy, while SphereResNet-18-v2 not only converges faster than ResNet-18 but also shows slightly better accuracy.
5 Limitations and Future Work
Our work still has some limitations: (1) SphereNets show large performance gains when the network is wide enough; if the network is not wide enough, SphereNets still converge much faster but yield slightly worse (yet comparable) recognition accuracy. (2) The computational complexity of each neuron is slightly higher than in CNNs. (3) SphereConvs are still mostly prefixed. Possible future work includes designing/learning a better SphereConv, efficiently computing the angles to reduce computational complexity, applications to tasks that require fast convergence (e.g.
reinforcement learning and recurrent neural networks), better angular regularization to replace orthogonality, etc.

Figure 6: Validation error (%) on ImageNet.

Acknowledgements

We thank Zhen Liu (Georgia Tech) for helping with the experiments and providing suggestions. This project was supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, NSF CNS-1704701, ONR N00014-15-1-2340, Intel ISTC, NVIDIA and Amazon AWS. Xingguo Li is supported by a doctoral dissertation fellowship from the University of Minnesota. Yan-Ming Zhang is supported by the National Natural Science Foundation of China under Grant 61773376.

References
[1] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
[2] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv:1511.07289, 2015.
[3] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik.
Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[4] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. arXiv:1603.05027, 2016.
[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[10] Xingguo Li, Zhaoran Wang, Junwei Lu, Raman Arora, Jarvis Haupt, Han Liu, and Tuo Zhao. Symmetry, saddle points, and global geometry of nonconvex matrix factorization. arXiv:1612.09296, 2016.
[11] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[12] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016.
[13] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[14] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv:1511.06422, 2015.
[15] Yuji Nakatsukasa. Eigenvalue perturbation bounds for Hermitian block tridiagonal matrices. Applied Numerical Mathematics, 62(1):67–78, 2012.
[16] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.
Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
[17] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, pages 1–42, 2014.
[18] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
[19] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[20] Andreas Veit, Michael J. Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In NIPS, 2016.
[21] Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. arXiv:1703.01827, 2017.