{"title": "CapProNet: Deep Feature Learning via Orthogonal Projections onto Capsule Subspaces", "book": "Advances in Neural Information Processing Systems", "page_first": 5814, "page_last": 5823, "abstract": "In this paper, we formalize the idea behind capsule nets of using a capsule vector rather than a neuron activation to predict the label of samples. To this end, we propose to learn a group of capsule subspaces onto which an input feature vector is projected. Then the lengths of resultant capsules are used to score the probability of belonging to different classes. We train such a Capsule Projection Network (CapProNet) by learning an orthogonal projection matrix for each capsule subspace, and show that each capsule subspace is updated until it contains input feature vectors corresponding to the associated class. With low dimensionality of capsule subspace as well as an iterative method to estimate the matrix inverse, only a small negligible computing overhead is incurred to train the network. Experiment results on image datasets show the presented network can greatly improve the performance of state-of-the-art Resnet backbones by $10-20\\%$ with almost the same computing cost.", "full_text": "CapProNet: Deep Feature Learning via Orthogonal\n\nProjections onto Capsule Subspaces\n\nLiheng Zhang\u2020, Marzieh Edraki\u2020, and Guo-Jun Qi\u2020\u2021\u2217\n\u2020Laboratory for MAchine Perception and LEarning,\n\nUniversity of Central Florida\nhttp://maple.cs.ucf.edu\n\u2021Huawei Cloud, Seattle, USA\n\nAbstract\n\nIn this paper, we formalize the idea behind capsule nets of using a capsule vector\nrather than a neuron activation to predict the label of samples. To this end, we\npropose to learn a group of capsule subspaces onto which an input feature vector is\nprojected. Then the lengths of resultant capsules are used to score the probability\nof belonging to different classes. 
We train such a Capsule Projection Network\n(CapProNet) by learning an orthogonal projection matrix for each capsule sub-\nspace, and show that each capsule subspace is updated until it contains input feature\nvectors corresponding to the associated class. We will also show that the capsule\nprojection can be viewed as normalizing the multiple columns of the weight matrix\nsimultaneously to form an orthogonal basis, which makes it more effective in\nincorporating novel components of input features to update capsule representations.\nIn other words, the capsule projection can be viewed as a multi-dimensional weight\nnormalization in capsule subspaces, where the conventional weight normalization\nis simply a special case of the capsule projection onto 1D lines. Only a small\nnegligible computing overhead is incurred to train the network in low-dimensional\ncapsule subspaces or through an alternative hyper-power iteration to estimate the\nnormalization matrix. Experiment results on image datasets show the presented\nmodel can greatly improve the performance of the state-of-the-art ResNet back-\nbones by 10 \u2212 20% and that of the Densenet by 5 \u2212 7% respectively at the same\nlevel of computing and memory expenses. The CapProNet establishes the com-\npetitive state-of-the-art performance for the family of capsule nets by signi\ufb01cantly\nreducing test errors on the benchmark datasets.\n\n1\n\nIntroduction\n\nSince the idea of capsule net [15, 9] was proposed, many efforts [8, 17, 14, 1] have been made to\nseek better capsule architectures as the next generation of deep network structures. Among them are\nthe dynamic routing [15] that can dynamically connect the neurons between two consecutive layers\nbased on their output capsule vectors. 
While these efforts have revolutionized the idea of\nbuilding a new generation of deep networks, there is still much room to improve the state of the art\nfor capsule nets.\nIn this paper, we do not intend to introduce brand new architectures for capsule nets. Instead,\nwe focus on formalizing the principled idea of using the overall length of a capsule rather than\na single neuron activation to model the presence of an entity [15, 9]. Unlike existing ideas in the\nliterature [15, 9], we formulate this idea by learning a group of capsule subspaces to represent a set\nof entity classes. Once capsule subspaces are learned, we can obtain a set of capsules by performing an\northogonal projection of feature vectors onto these capsule subspaces.\n\n\u2217Corresponding author: G.-J. Qi, email: guojunq@gmail.com and guojun.qi@huawei.com.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nThen, one can adopt the principle of separating the presence of an entity and its instantiation\nparameters into capsule length and orientation, respectively. In particular, we use the lengths\nof capsules to score the presence of entity classes corresponding to different subspaces, while\ntheir orientations are used to instantiate the parameters of entity properties such as poses, scales,\ndeformations and textures. In this way, one can use the capsule length to achieve the intra-class\ninvariance in detecting the presence of an entity against appearance variations, as well as model the\nequivalence of the instantiation parameters of entities by encoding them into capsule orientations\n[15].\nFormally, each capsule subspace is spanned by a basis from the columns of a weight matrix in the\nneural network. A capsule projection is performed by projecting input feature vectors fed from a\nbackbone network onto the capsule subspace. 
Speci\ufb01cally, an input feature vector is orthogonally\ndecomposed into the capsule component as the projection onto a capsule subspace and the complement\ncomponent perpendicular to the subspace. By analyzing the gradient through the capsule projection,\none can show that a capsule subspace is iteratively updated along the complement component that\ncontains the novel characteristics of the input feature vector. The training process will continue until\nall presented feature vectors of an associated class are well contained by the corresponding capsule\nsubspace, or simply the back-propagated error accounting for misclassi\ufb01cation caused by capsule\nlengths vanishes.\nWe call the proposed deep network with the capsule projections the CapProNet for brevity. The\nCapProNet is friendly to any existing network architecture \u2013 it is built upon the embedded features\ngenerated by some neural networks and outputs the projected capsule vectors in the subspaces\naccording to different classes. This makes it amenable to be used together with existing network\narchitectures. We will conduct experiments on image datasets to demonstrate the CapProNet can\ngreatly improve the state-of-the-art results by sophisticated networks with only small negligible\ncomputing overhead.\n\n1.1 Our Findings\n\nBrie\ufb02y, we summarize our main \ufb01ndings from experiments upfront about the proposed CapProNet.\n\u2022 The proposed CapProNet signi\ufb01cantly advances the capsule net performance [15] by re-\nducing its test error from 10.3% and 4.3% on CIFAR10 and SVHN to 3.64% and 1.54%\nrespectively upon the chosen backbones.\n\u2022 The proposed CapProNet can also greatly reduce the error rate of various backbone networks\nby adding capsule projection layers into these networks. 
For example, the error rate can\nbe reduced by 10 \u2212 20% with a ResNet backbone, and by 5 \u2212 6%\nwith a DenseNet backbone, with only < 1% and 0.04% computing and memory overhead\nin training the model compared with the backbones.\n\u2022 The orthogonal projection onto capsule subspaces plays a critical role in delivering competitive\nperformance. In contrast, simply grouping neurons into capsules does not\nnoticeably improve the performance. This shows the capsule projection plays an indispensable\nrole in the CapProNet delivering competitive results.\n\u2022 Our insight into the gradient of capsule projection in Section 2.3 explains the advantage of\nupdating capsule subspaces to continuously contain novel components of training examples\nuntil they are correctly classified. We also find that the capsule projection can be viewed as a\nhigh-dimensional extension of weight normalization in Section 2.4, where the conventional\nweight normalization is merely a simple case of the capsule projection onto 1D lines.\n\nThe source code is available at https://github.com/maple-research-lab.\nThe remainder of this paper is organized as follows. We present the idea of the Capsule Projection\nNet (CapProNet) in Section 2, and discuss the implementation details in Section 3. The review of\nrelated work follows in Section 4, and the experiment results are demonstrated in Section 5. Finally,\nwe conclude the paper and discuss the future work in Section 6.\n\n2 The Capsule Projection Nets\n\nIn this section, we begin by briefly revisiting the idea of conventional neural networks in classification\ntasks. Then we formally present the orthogonal projection of input feature vectors onto multiple\ncapsule subspaces where capsule lengths are separated from their orientations to score the presence\nof entities belonging to different classes. 
Finally, we analyze the gradient of the resultant capsule\nprojection by showing how capsule subspaces are updated iteratively to adopt novel characteristics of\ninput feature vectors through back-propagation.\n\n2.1 Revisit: Conventional Neural Networks\n\nConsider a feature vector $x \\in \\mathbb{R}^d$ generated by a deep network to represent an input entity. Given its\nground truth label $y \\in \\{1, 2, \\cdots, L\\}$, the output layer of the deep network aims to learn a group of\nweight vectors $\\{w_1, w_2, \\cdots, w_L\\}$ such that\n\n$$w_y^T x > w_l^T x, \\quad \\text{for all } l \\neq y. \\tag{1}$$\n\nThis hard constraint is usually relaxed to a differentiable softmax objective, and the backpropagation\nalgorithm is performed to train $\\{w_1, w_2, \\cdots, w_L\\}$ and the backbone network generating the input\nfeature vector $x$.\n\n2.2 Capsule Projection onto Subspaces\n\nUnlike simply grouping neurons to form capsules for classification, we propose to learn a group\nof capsule subspaces $\\{\\mathcal{S}_1, \\mathcal{S}_2, \\cdots, \\mathcal{S}_L\\}$, each associated with one of $L$ classes. Suppose we have\na feature vector $x \\in \\mathbb{R}^d$ generated by a backbone network from an input sample. Then, to learn\na proper feature representation, we project $x$ onto these capsule subspaces, yielding $L$ capsules\n$\\{v_1, v_2, \\cdots, v_L\\}$ as projections. Then, we will use the lengths of these capsules to score the\nprobability of the input sample belonging to different classes by assigning it to the class with\nthe longest capsule.\nFormally, for each capsule subspace $\\mathcal{S}_l$ of dimension $c$, we learn a weight matrix $W_l \\in \\mathbb{R}^{d \\times c}$\nthe columns of which form the basis of the subspace, i.e., $\\mathcal{S}_l = \\mathrm{span}(W_l)$ is spanned by the\ncolumn vectors. Then the orthogonal projection $v_l$ of a vector $x$ onto $\\mathcal{S}_l$ is found by solving\n$v_l = \\arg\\min_{v \\in \\mathrm{span}(W_l)} \\|x - v\\|_2$. 
This orthogonal projection problem has the following closed-form solution\n\n$$v_l = P_l x, \\quad \\text{and} \\quad P_l = W_l W_l^{+}$$\n\nwhere $P_l$ is called the projection matrix\u00b2 for capsule subspace $\\mathcal{S}_l$, and $W_l^{+}$ is the Moore-Penrose\npseudoinverse [4].\nWhen the columns of $W_l$ are independent, $W_l^{+}$ becomes $(W_l^T W_l)^{-1} W_l^T$. In this case, since we\nonly need the capsule length $\\|v_l\\|_2$ to predict the class of an entity, we have\n\n$$\\|v_l\\|_2 = \\sqrt{v_l^T v_l} = \\sqrt{x^T P_l^T P_l x} = \\sqrt{x^T W_l \\Sigma_l W_l^T x} \\tag{2}$$\n\nwhere $\\Sigma_l = (W_l^T W_l)^{-1}$ can be seen as a normalization matrix applied to the transformed feature\nvector $W_l^T x$ as a way to normalize the $W_l$-transformation based on the capsule projection. As we\nwill discuss in the next subsection, this normalization plays a critical role in updating $W_l$ along the\northogonal direction of the subspace so that novel components pertaining to the properties of input\nentities can be gradually updated to the subspace.\nIn practice, since $c \\ll d$, the $c$ columns of $W_l$ are usually independent in a high-dimensional $d$-D\nspace. Otherwise, one can always add a small $\\epsilon I$ to $W_l^T W_l$ to avoid the numeric singularity when\ntaking the matrix inverse. Later on, we will discuss a fast iterative algorithm to compute the matrix\ninverse with a hyper-power sequence that can be seamlessly integrated with the back-propagation\niterations.\n\n\u00b2 A projection matrix $P$ for a subspace $\\mathcal{S}$ is a symmetric idempotent matrix (i.e., $P^T = P$ and $P^2 = P$)\nsuch that its range space is $\\mathcal{S}$.\n\n2.3 Insight into Gradients\n\nIn this section, we take a look at the gradient used to update $W_l$ in each iteration, which can give us\nsome insight into how the CapProNet works in learning the capsule subspaces.\nSuppose we minimize a loss function $\\ell$ to train the capsule projection and the network. For simplicity,\nwe only consider a single sample $x$ and its capsule $v_l$. 
Then by the chain rule and the differential of\nthe inverse matrix [13], we have the following gradient of $\\ell$ with respect to $W_l$\n\n$$\\frac{\\partial \\ell}{\\partial W_l} = \\frac{\\partial \\ell}{\\partial \\|v_l\\|_2} \\frac{\\partial \\|v_l\\|_2}{\\partial W_l} = \\frac{\\partial \\ell}{\\partial \\|v_l\\|_2} \\frac{(I - P_l) x x^T W_l^{+T}}{\\|v_l\\|_2} \\tag{3}$$\n\nwhere the operator $(I - P_l)$ can be viewed as the projection onto the orthogonal complement of the\ncapsule subspace spanned by the columns of $W_l$, $W_l^{+T}$ denotes the transpose of $W_l^{+}$, and the factor\n$\\frac{\\partial \\ell}{\\partial \\|v_l\\|_2}$ is the back-propagated error accounting for misclassification caused by $\\|v_l\\|_2$.\nDenote by $x_\\perp \\triangleq (I - P_l) x$ the projection of $x$ onto the orthogonal component perpendicular to the\ncurrent capsule subspace $\\mathcal{S}_l$. Then, the above gradient $\\frac{\\partial \\ell}{\\partial W_l}$ only contains the columns parallel to $x_\\perp$\n(up to coefficients in the vector $x^T W_l^{+T}$). This shows that the basis of the current capsule subspace\n$\\mathcal{S}_l$ in the columns of $W_l$ is updated along this orthogonal component of the input $x$ to the subspace.\nOne can regard $x_\\perp$ as representing the novel component of $x$ not yet contained in the current $\\mathcal{S}_l$; this\nshows capsule subspaces are updated to contain the novel component of each input feature vector\nuntil all training feature vectors are well contained in these subspaces, or the back-propagated errors\nthat account for misclassification caused by $\\|v_l\\|_2$ vanish.\nFigure 1 illustrates an example of a 2-D capsule\nsubspace $\\mathcal{S}$ spanned by two basis vectors $w_1$\nand $w_2$. An input feature vector $x$ is decomposed\ninto the capsule projection $v$ onto $\\mathcal{S}$ and\nan orthogonal complement $x_\\perp$ perpendicular to\nthe subspace. 
In one training iteration, the two basis\nvectors $w_1$ and $w_2$ are updated to $w_1'$ and $w_2'$\nalong the orthogonal direction $x_\\perp$, where $x_\\perp$ is\nviewed as containing novel characteristics of an\nentity not yet contained by $\\mathcal{S}$.\n\nFigure 1: This figure illustrates a 2-D capsule subspace $\\mathcal{S}$ spanned by two basis vectors $w_1$ and $w_2$.\nAn input feature vector $x$ is decomposed into the capsule projection $v$ onto $\\mathcal{S}$ and an orthogonal\ncomplement $x_\\perp$ perpendicular to the subspace. In one training iteration, the two basis vectors $w_1$ and\n$w_2$ are updated to $w_1'$ and $w_2'$ along the orthogonal direction $x_\\perp$, where $x_\\perp$ is viewed as containing\nnovel characteristics of an entity not yet contained by $\\mathcal{S}$.\n\n2.4 A Perspective of Multiple-Dimensional Weight Normalization\n\nAs discussed in the last subsection and Figure 1,\nthe orthogonal components represent the novel information in input data, and\nthe orthogonal decomposition thus enables us\nto update capsule subspaces by more effectively\nincorporating novel characteristics/components\nthan the classic capsule nets.\nOne can also view the capsule projection as normalizing\nthe column basis of the weight matrix $W_l$\nsimultaneously in a high-dimensional capsule\nspace. If the capsule dimension $c$ is set to 1, it is\nnot hard to see that Eq. (2) can be rewritten by\nsetting $v_l$ to $\\frac{|W_l^T x|}{\\|W_l\\|}$. It produces the conventional weight normalization of the vector $W_l \\in \\mathbb{R}^d$, as a\nspecial 1D case of the capsule projection. As the capsule dimension $c$ grows, $W_l$ can be normalized\nby replacing $v_l$ with $\\Sigma_l^{1/2} W_l^T x$, which keeps $\\|v_l\\|_2$ unchanged in Eq. (2). This enables us to extend\nthe conventional weight normalization to high dimensional capsule subspaces.\n\n3 Implementation Details\n\nWe will discuss some implementation details in this section, including 1) the computing cost to\nperform capsule projection and a fast iterative method by using hyper-power sequences without\nrestart; 2) the objective to train the capsule projection.\n\n3.1 Computing Normalization Matrix\n\nTaking a matrix inverse to get the normalization matrix $\\Sigma_l$ would be expensive with an increasing\ndimension $c$. But after the model is trained, it is fixed in the inference with only one-time computing.\nFortunately, the dimension $c$ of a capsule subspace is usually much smaller than the feature dimension\n$d$, which is usually in the hundreds or even thousands. For example, $c$ is typically no larger than 8 in\nexperiments. Thus, taking a matrix inverse to compute these normalization matrices only incurs\na negligible computing overhead compared with the training of many other layers in a deep\nnetwork.\nAlternatively, one can take advantage of an iterative algorithm to compute the normalization matrix.\nWe consider the following hyper-power sequence\n\n$$\\Sigma_l \\leftarrow 2\\Sigma_l - \\Sigma_l W_l^T W_l \\Sigma_l$$\n\nwhich has been proven to converge to $(W_l^T W_l)^{-1}$ with a proper initial point [2, 3]. In the stochastic gradient\nmethod, since only a small change is made to update $W_l$ in each training iteration, it is often\nsufficient to use this recursion to make a one-step update on the normalization matrix from the last\niteration. The normalization matrix $\\Sigma_l$ can be initialized to $(W_l^T W_l)^{-1}$ at the very first iteration to\ngive an ideal start. This could further save computing cost in training the network.\nIn experiments, a very small computing overhead was incurred in the capsule projection. 
For example,\ntraining the ResNet110 on CIFAR10/100 cost about 0.16 seconds per iteration on a batch of 128\nimages. In comparison, training the CapProNet with a ResNet110 backbone in an end-to-end fashion\nonly cost less than 0.001 additional seconds per iteration, that is, less than 1% computing overhead\nfor the CapProNet compared with its backbone. For the inference, we did not find any noticeable\ncomputing overhead for the CapProNet compared with its backbone network.\n\n3.2 Training Capsule Projections\n\nGiven a group of capsule vectors $\\{v_1, v_2, \\cdots, v_L\\}$ corresponding to a feature vector $x$ and its ground\ntruth label $y$, we train the model by requiring\n\n$$\\|v_y\\|_2 > \\|v_l\\|_2, \\quad \\text{for all } l \\neq y.$$\n\nIn other words, we require that $\\|v_y\\|_2$ be larger than the lengths of all the other capsules. As\na consequence, we can minimize the following negative logarithmic softmax function\n$\\ell(x, y) = -\\log \\frac{\\exp(\\|v_y\\|_2)}{\\sum_{l=1}^{L} \\exp(\\|v_l\\|_2)}$ to train the capsule subspaces and the network generating $x$ through\nback-propagation in an end-to-end fashion. Once the model is trained, we will classify a test sample into\nthe class with the longest capsule.\n\n4 Related Work\n\nThe presented CapProNets are inspired by the CapsuleNets by adopting the idea of using a capsule\nvector rather than a neural activation output to predict the presence of an entity and its properties\n[15, 9]. In particular, the overall length of a capsule vector is used to represent the existence of the\nentity and its direction instantiates the properties of the entity. 
We formalize this idea in this paper by\nexplicitly learning a group of capsule subspaces and projecting embedded features onto these subspaces.\nThe advantage of these capsule subspaces is that their directions can represent characteristics of an entity,\nwhich contain much richer information, such as its positions, orientations, scales and textures,\nthan a single activation output. By performing an orthogonal projection of an input feature vector\nonto a capsule subspace, one can find the best direction revealing these properties. Otherwise, the\nentity is thought of as being absent, as the projection vanishes when the input feature vector is nearly\nperpendicular to the capsule subspace.\n\n5 Experiments\n\nWe conduct experiments on benchmark datasets to evaluate the proposed CapProNet compared with\nthe other deep network models.\n\n5.1 Datasets\n\nWe use the CIFAR, SVHN and ImageNet datasets in experiments to evaluate the performance.\nCIFAR The CIFAR dataset contains 50,000 and 10,000 images of 32 \u00d7 32 pixels for the training\nand test sets respectively. A standard data augmentation is adopted with horizontal flipping and\nshifting. The images are labeled with 10 and 100 categories, namely CIFAR10 and CIFAR100\ndatasets. A separate validation set of 5,000 images is split from the training set to choose the model\nhyperparameters, and the final test errors are reported with the chosen hyperparameters by training\nthe model on all 50,000 training images.\nSVHN The Street View House Number (SVHN) dataset has 73,257 and 26,032 images of colored\ndigits in the training and test sets, with an additional 531,131 training images available. Following\nthe widely used evaluation protocol in literature [5, 11, 12, 16], all the training examples are used\nwithout data augmentation, while a separate validation set of 6,000 images is split from the training\nset. 
The model with the smallest validation error is selected and the error rate is reported.\nImageNet The ImageNet dataset consists of 1.2 million training and 50k validation images. We\napply mean image subtraction as the only pre-processing step on images and use random cropping,\nscaling and horizontal flipping for data augmentation [6]. The final resolution of both train and\nvalidation sets is 224 \u00d7 224, and 20k images are chosen randomly from the training set for tuning\nhyperparameters.\n\n5.2 Backbone Networks\n\nWe test various networks such as ResNet [6], ResNet (pre-activation) [7], WideResNet [18] and\nDensenet [10] as the backbones in experiments. The last output layer of a backbone network is\nreplaced by the capsule projection, where the feature vector from the second last layer of the backbone\nis projected onto multiple capsule subspaces.\nThe CapProNet is trained from scratch in an end-to-end fashion on the given training set. For the\nsake of fair comparison, the strategies used to train the respective backbones [6, 7, 18], such as the\nlearning rate schedule, parameter initialization, and the stochastic optimization solver, are adopted to\ntrain the CapProNet. We will denote the CapProNet with a backbone X by CapProNet+X below.\n\n5.3 Results\n\nWe perform experiments with various networks as backbones for comparison with the proposed\nCapProNet. In particular, we consider three variants of ResNets \u2013 the classic one reported in [11] with\n110 layers, the ResNet with pre-activation [7] with 164 layers, and two paradigms of WideResNets\n[18] with 16 and 28 layers, as well as Densenet-BC [10] with 100 layers. Compared with ResNet\nand ResNet with pre-activation, WideResNet has fewer but wider layers and reaches smaller error\nrates as shown in Table 1. We test the CapProNet+X with these different backbone networks to\nevaluate if it can consistently improve these state-of-the-art backbones. 
It is clear from Table 1\nthat the CapProNet+X outperforms the corresponding backbone networks by a remarkable margin.\nFor example, the CapProNet+ResNet reduces the error rate by 19%, 17.5% and 10% on CIFAR10,\nCIFAR100 and SVHN, while CapProNet+Densenet reduces the error rate by 5.8%, 4.8% and 6.8%\nrespectively. Finally, we note that the CapProNet significantly advances the capsule net performance\n[15] by reducing its test error from 10.3% and 4.3% on CIFAR10 and SVHN to 3.64% and 1.54%\nrespectively based on the chosen backbones.\nWe also evaluate the CapProNet with Resnet50 and Resnet101 backbones for single crop Top-1/Top-5\nresults on the ImageNet validation set. To ensure fair comparison, we retrain the backbone networks\nbased on the official Resnet model\u00b3, where both the original Resnet [6] and CapProNet are trained with\nthe same training strategies on four GPUs. The results are reported in Table 2, where CapProNet+X\n\n\u00b3 https://github.com/tensorflow/models/tree/master/official/resnet\n\nTable 1: Error rates on CIFAR10, CIFAR100, and SVHN. The best results are highlighted in bold\nfor the methods with the same network architectures. 
Not all results on different combinations of\nnetwork backbones or datasets have been reported in literature, and missing results are remarked \u201c-\"\nin the table.\n\nMethod\n\nResNet [6]\n\nResNet (reported by [11])\nCapProNet+ResNet (c=2)\nCapProNet+ResNet (c=4)\nCapProNet+ResNet (c=8)\nResNet (pre-activation) [7]\n\nCapProNet+ResNet (pre-activation c=4)\nCapProNet+ResNet (pre-activation c=8)\n\nWideResNet [18]\n\nwith Dropout\n\nCapProNet+WideResNet (c=4)\n\nwith Dropout\n\nCapProNet+WideResNet (c=8)\n\nwith Dropout\n\nDensenet-BC k=12 [10]\n\nCapProNet Densenet-BC k=12 (c=4)\nCapProNet Densenet-BC k=12 (c=8)\n\nDepth\n110\n110\n110\n110\n110\n164\n164\n164\n16\n28\n16\n16\n28\n16\n16\n28\n16\n100\n100\n100\n\n-\n\n6.61\n6.41\n5.24\n5.27\n5.19\n5.46\n4.88\n4.89\n4.81\n4.17\n\n27.22\n22.65\n22.45\n21.93\n24.33\n21.37\n20.91\n22.07\n20.50\n\nParams CIFAR10 CIFAR100\n1.7M\n1.7M\n1.7M\n1.7M\n1.7M\n1.7M\n1.7M\n1.7M\n11.0M\n36.5M\n2.7M\n11.0M\n36.5M\n2.7M\n11.0M\n36.5M\n2.7M\n0.8M\n0.8M\n0.8M\n\n22.27\n21.22\n21.19\n\n4.51\n4.35\n4.25\n\n20.12\n19.83\n\n21.33\n19.98\n\n4.04\n3.85\n\n4.20\n3.64\n\n-\n\n-\n\n-\n\n-\n\n-\n\n-\n\nSVHN\n\n-\n\n2.01\n1.79\n1.82\n1.79\n\n-\n-\n-\n-\n-\n\n-\n-\n\n-\n-\n\n1.64\n\n1.58\n\n1.54\n1.76\n1.64\n1.64\n\nTable 2: The CapProNet results with Resnet50 and Resnet101 backbones for Single crop top-1/top-5\nerror rate on ImageNet validation set with image resolution of 224 \u00d7 224, as well as the comparison\nwith original baseline results.\nreported result[6]\n\nCapProNet (c=2) CapProNet (c=4) CapProNet (c=8)\n\n23.282 / 6.8\n22.192 / 6.178\n\n23.265 / 6.648\n\n21.89 / 6\n\n23.203 / 6.78\n21.9 / 6.01\n\nMethod\nResnet50\nResnet101\n\n24.8 / 7.8\n23.6 / 7.1\n\nour rerun\n24.09/7.13\n22.81 /6.67\n\nsuccessfully outperforms the original backbones on both Top-1 and Top-5 error rates. It is worth\nnoting the gains are only obtained with the last layer of backbones replaced by the capsule project\nlayer. 
We believe the error rate can be further reduced by replacing the intermediate convolutional\nlayers with the capsule projections, and we leave it to our future research.\nWe also note that the CapProNet+X consistently outperforms the backbone counterparts with varying\ndimensions c of capsule subspaces. In particular, with the WideResNet backbones, in most cases, the\nerror rates are reduced with an increasing capsule dimension c on all datasets, where the smallest\nerror rates often occur at c = 8. In contrast, while CapProNet+X still clearly outperforms both\nResNet and ResNet (pre-activation) backbones, the error rates are roughly at the same level. This is\nprobably because both ResNet backbones have a much smaller input dimension d = 64 of feature\nvectors into the capsule projection than that of the WideResNet backbone, where d = 128 and d = 160\nwith 16 and 28 layers, respectively. This suggests that a larger input dimension enables\nthe use of capsule subspaces of higher dimensions to encode patterns of variations along more directions\nin a higher dimensional input feature space.\nTo further assess the effect of capsule projection, we compare with the method that simply groups\nthe output neurons into capsules without performing orthogonal projection onto capsule subspaces.\nWe still use the lengths of these resultant \u201ccapsules\u201d of grouped neurons to classify input images\nand the model is trained in an end-to-end fashion accordingly. Unfortunately, this approach, namely\nGroupNeuron+ResNet in Table 3, does not show significant improvement over the backbone network.\nFor example, the smallest error rate by GroupNeuron+ResNet is 6.26 at c = 2, a small improvement\n\nFigure 2: These figures plot the 2-D capsule subspaces and projected capsules corresponding to\nten classes on the CIFAR10 dataset. 
In each figure, red capsules represent samples from the class\ncorresponding to the subspace, while green capsules belong to a different class. It shows red samples\nhave larger capsule length (relative to the origin) than those of green samples. This validates the\ncapsule length as the classification criterion in the proposed model. Note that some figures have\ndifferent scales in two axes for a better illustration.\n\nover the error rate of 6.41 reached by ResNet110. This demonstrates the capsule projection makes an\nindispensable contribution to improving model performances.\nWhen training on CIFAR10/100 and SVHN, one iteration typically costs \u223c 0.16 seconds for Resnet-110,\nwith an additional less than 0.01 second to train the corresponding CapProNet. That is less\nthan 1% computing overhead. The memory overhead for the model parameters is even smaller. For\nexample, the CapProNet+ResNet only has an additional 640 \u2212 6400 parameters at c = 2 compared\nwith 1.7M parameters in the backbone ResNet. We do not notice any large computing or memory\noverheads with the ResNet (pre-activation) or WideResNet, either. This shows the advantage of\nCapProNet+X as its error rate reduction is not achieved by consuming much more computing and\nmemory resources.\n\nTable 3: Comparison between GroupNeuron and CapProNet with the ResNet110 backbone on the CIFAR10\ndataset. The best results are highlighted in bold for c = 2, 4, 8 capsules. It shows the need of\ncapsule projection to obtain better results.\n\nc | GroupNeuron | CapProNet\n2 | 6.26 | 5.24\n4 | 6.29 | 5.27\n8 | 6.42 | 5.19\n\n5.4 Visualization of Projections onto Capsule Subspaces\n\nTo give an intuitive insight into the learned capsule\nsubspaces, we plot the projection of input\nfeature vectors onto capsule subspaces. Instead\nof directly using $P_l x$ to project feature vectors\nonto capsule subspaces in the original input space\n$\\mathbb{R}^d$, we use $(W_l^T W_l)^{-1/2} W_l^T x$ to project\nan input feature vector $x$ onto $\\mathbb{R}^c$, since this\nprojection preserves the capsule length $\\|v_l\\|_2$\ndefined in (2).\nFigure 2 illustrates the 2-D capsule subspaces\nlearned on CIFAR10 when c = 2 and d = 64 in\nCapProNet+ResNet110, where each subspace\ncorresponds to one of ten classes. Red points\nrepresent the capsules projected from the class of input samples corresponding to the subspace while\ngreen points correspond to one of the other classes. The figure shows that red capsules have larger\nlength than green ones, which suggests the capsule length is a valid metric to classify samples into\ntheir corresponding classes. Meanwhile, the orientation of a capsule reflects various instantiations of\na sample in these subspaces. These figures visualize the separation of the lengths of capsules from\ntheir orientations in classification tasks.\n\n6 Conclusions and Future Work\n\nIn this paper, we present a novel capsule projection network by learning a group of capsule subspaces for\ndifferent classes. Specifically, the parameters of an orthogonal projection are learned for each class and\nthe lengths of projected capsules are used to predict the entity class for a given input feature vector.\nThe training continues until the capsule subspaces contain input feature vectors of corresponding\nclasses or the back-propagated error vanishes. Experiment results on real image datasets show that\nthe proposed CapProNet+X could greatly improve the performance of backbone network without\nincurring large computing and memory overheads. 
While we only test the capsule projection as the\noutput layer in this paper, we will attempt to insert it into intermediate layers of backbone networks as\nwell, and hope this could give rise to a new generation of capsule networks with more discriminative\narchitectures in the future.\n\nAcknowledgements\n\nL. Zhang and M. Edraki made equal contributions to implementing the idea: L. Zhang conducted\nexperiments on the CIFAR10 and SVHN datasets, and visualized projections in capsule subspaces on\nCIFAR10. M. Edraki performed experiments on CIFAR100. G.-J. Qi initiated and formulated the\nidea, and prepared the paper.\n\nReferences\n\n[1] Mohammad Taha Bahadori. Spectral capsule networks. 6th International Conference on Learning Representations, 2018.\n\n[2] Adi Ben-Israel. An iterative method for computing the generalized inverse of an arbitrary matrix. Mathematics of Computation, 19(91):452\u2013455, 1965.\n\n[3] Adi Ben-Israel and Dan Cohen. On iterative computation of generalized inverses and associated projections. SIAM Journal on Numerical Analysis, 3(3):410\u2013419, 1966.\n\n[4] Adi Ben-Israel and Thomas NE Greville. Generalized inverses: theory and applications, volume 15. Springer Science & Business Media, 2003.\n\n[5] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.\n\n[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770\u2013778, 2016.\n\n[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630\u2013645. Springer, 2016.\n\n[8] Geoffrey Hinton, Nicholas Frosst, and Sara Sabour. Matrix capsules with EM routing. 2018.\n\n[9] Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44\u201351. Springer, 2011.\n\n[10] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.\n\n[11] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646\u2013661. Springer, 2016.\n\n[12] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562\u2013570, 2015.\n\n[13] Kaare Brandt Petersen et al. The matrix cookbook.\n\n[14] David Rawlinson, Abdelrahman Ahmed, and Gideon Kowadlo. Sparse unsupervised capsules generalize better. arXiv preprint arXiv:1804.06094, 2018.\n\n[15] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3859\u20133869, 2017.\n\n[16] Pierre Sermanet, Soumith Chintala, and Yann LeCun. Convolutional neural networks applied to house numbers digit classification. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 3288\u20133291. IEEE, 2012.\n\n[17] Dilin Wang and Qiang Liu. An optimization view on dynamic routing between capsules. 6th International Conference on Learning Representations, 2018.\n\n[18] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.", "award": [], "sourceid": 2815, "authors": [{"given_name": "Liheng", "family_name": "Zhang", "institution": "University of Central Florida"}, {"given_name": "Marzieh", "family_name": "Edraki", "institution": "University of Central Florida"}, {"given_name": "Guo-Jun", "family_name": "Qi", "institution": "Futurewei Technologies, Inc."}]}