{"title": "STAR-Caps: Capsule Networks with Straight-Through Attentive Routing", "book": "Advances in Neural Information Processing Systems", "page_first": 9101, "page_last": 9110, "abstract": "Capsule networks have been shown to be powerful models for image classification, thanks to their ability to represent and capture viewpoint variations of an object. However, the high computational complexity of capsule networks that stems from the recurrent dynamic routing poses a major drawback making their use for large-scale image classification challenging. In this work, we propose Star-Caps, a capsule-based network that exploits a straight-through attentive routing to address the drawbacks of capsule networks. By utilizing attention modules augmented by differentiable binary routers, the proposed mechanism estimates the routing coefficients between capsules without recurrence, as opposed to prior related work. Subsequently, the routers utilize straight-through estimators to make binary decisions to either connect or disconnect the route between capsules, allowing stable and faster performance. The experiments conducted on several image classification datasets, including MNIST, SmallNorb, CIFAR-10, CIFAR-100, and ImageNet show that Star-Caps outperforms the baseline capsule networks.", "full_text": "STAR-CAPS: Capsule Networks with Straight-Through Attentive Routing

Karim Ahmed
Department of Computer Science
Dartmouth College
karim@cs.dartmouth.edu

Lorenzo Torresani
Department of Computer Science
Dartmouth College
LT@dartmouth.edu

Abstract

Capsule networks have been shown to be powerful models for image classification, thanks to their ability to represent and capture viewpoint variations of an object. However, the high computational complexity of capsule networks, which stems from the recurrent dynamic routing, poses a major drawback, making their use for large-scale image classification challenging. 
In this work, we propose STAR-CAPS, a capsule-based network that exploits a straight-through attentive routing to address the drawbacks of capsule networks. By utilizing attention modules augmented by differentiable binary routers, the proposed mechanism estimates the routing coefficients between capsules without recurrence, as opposed to prior related work. Subsequently, the routers utilize straight-through estimators to make binary decisions to either connect or disconnect the route between capsules, allowing stable and faster performance. The experiments conducted on several image classification datasets, including MNIST, SmallNorb, CIFAR-10, CIFAR-100 and ImageNet, show that STAR-CAPS outperforms the baseline capsule networks.

1 Introduction

Convolutional neural networks (CNNs) have achieved successful performance on different computer vision tasks [7, 14, 25, 6, 22]. By using local receptive fields and shared weights, CNNs can identify the existence of entities regardless of their spatial locations (translation invariance). CNNs use a deep sequence of convolutional layers and max-pooling operations which downsample the spatial size. Max-pooling is considered a primitive form of routing in which the output only attends to the most active neuron in the pool. By throwing away information about the precise position of an entity, max-pooling achieves some translation invariance. To mitigate the viewpoint variations of an entity, CNNs combine the activities of the pool, i.e., they overlap the sub-sampling pools. However, CNNs fail to represent the part-whole relationships of the entities; thus, they cannot detect radically new viewpoints, due to losing the precise spatial relationships in the max-pooling operations. Contrarily, capsule networks [23, 8] utilize trainable viewpoint-invariant transformations that learn to represent the part-whole relationships of the entities. 
Although capsule models have been shown to be more powerful than traditional convolutional neural networks at detecting viewpoint variations [23, 8], the computational complexity of these models during training and inference is a major drawback which limits utilizing these networks efficiently on large-scale image classification datasets. This poses a dilemma: choosing between capsule networks and convolutional neural networks requires sacrificing either the computational efficiency or the mechanism to detect viewpoint variations, respectively.
In this work, we present STAR-CAPS, a capsule-based architecture that utilizes a straight-through attentive routing to address the drawbacks of the recurrent dynamic routing. The proposed routing mechanism is based on efficient attention modules augmented by differentiable binary routers, which make routing decisions utilizing a set of straight-through gradient estimators [10, 1]. We outline the motivation and the contributions of our work next.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1.1 Motivation and Contributions

The computational complexity of capsule networks, during the training stage as well as inference, stems from the complex mechanisms of the voting and routing steps. In the voting step, the n lower-level capsules cast votes for the m higher-level capsules. This is achieved by transforming the lower-level pose matrices using distinct (n × m) transformation matrices. For a capsule layer with a kernel size of k, the voting step in one forward pass requires (k² × n × m) matrix-matrix multiplications. In the routing step, the recurrent dynamic routing algorithms [8, 23] depend on multiple iterations to update the agreements. Each iteration requires additional expensive operations such as matrix multiplications or exponential functions. 
The routing complexity gets intensified in the EM routing algorithm [8], which requires two steps (E-step and M-step) per iteration. Though a capsule network architecture has a fixed number of parameters, training and inference time can increase dramatically according to the number of routing iterations defined a priori as a hyperparameter.
To address the computational complexity of capsule networks, we replace the recurrent dynamic routing by a non-recursive attention-based routing mechanism. The motivation of our routing mechanism stems from the relation between the non-recurrent self-attention employed in the Transformer [26] and the recurrent dynamic routing [8, 23]. Compared to recurrent neural networks, self-attention [26] has been shown to be faster and more powerful. In fact, the recurrent dynamic routing can be seen as an attention mechanism, but in the opposite direction [8]. As an additional advantage of our proposed routing mechanism, the capsule network avoids the underfitting/overfitting caused by an improper setting of the number of routing iterations [23]. The experiments conducted by Sabour et al. [23] showed that fewer routing iterations may lead to underfitting, whereas a large number of iterations causes overfitting; thus, training a capsule network often requires trial and error to identify a satisfactory number of routing iterations for a specific task and dataset. 
Furthermore, compared to the baseline capsule network [8], our approach shows stable and better performance without being sensitive to the predefined number of capsules in each layer and their initializations.
Our main contributions can be summarized as follows.
• To enable faster training and inference, we replace the recurrent dynamic routing mechanism by efficient attention modules augmented by differentiable binary routers, which exploit a group of straight-through gradient estimators to make routing decisions.
• As an additional benefit of the proposed routing mechanism, the capsule network avoids the underfitting/overfitting that occurs in the recurrent dynamic routing mechanisms, caused by choosing an improper number of iterations. Furthermore, our approach allows more stable performance without being sensitive to the predefined number of capsules and their initializations.
• We conducted experiments on several image classification datasets, including MNIST, SmallNorb, CIFAR-10, CIFAR-100 and ImageNet. Our results show that STAR-CAPS outperforms the baseline capsule networks.

2 Background
2.1 Capsule Networks
A capsule neural network consists of capsule layers, where each layer is constructed from a set of capsules. A capsule is a unit that represents a group of neurons, formulated as a vector [23] or a matrix [8], that reflects properties of an entity such as pose. Figure 1 shows traditional neural layers vs. capsule layers. In traditional neural networks, the neurons are connected through a set of weights learned during training. 
In capsule networks, the information flow between the lower-level and the higher-level capsules can be described in two steps: (1) voting, in which lower-level capsules cast votes for the higher-level capsules, and (2) routing, where lower-level and higher-level capsules are connected via routing coefficients learned by a dynamic routing algorithm. In DynamicCaps [23] the capsule is a vector that represents the pose, and its length indicates the existence of an entity. In EMCaps [8] the capsule has a pose matrix and an activation scalar.
In general, the architecture of capsule networks [8, 23] consists of: (i) a traditional convolutional layer; (ii) a PrimaryCaps layer, a special convolutional capsule layer that converts activities into vector capsules [23] or matrix capsules [8]; (iii) a set of convolutional capsule layers (ConvCaps layers) that learn the part-whole relationships of entities; and (iv) a final capsule layer (ClassCaps) which outputs the final class predictions. During voting, the pose of a lower-level capsule is multiplied by trainable weights (a transformation matrix) to cast a vote for each higher-level capsule.

Figure 1: Traditional Neural Layers (left) vs. Capsule Neural Layers (right).

Capsules make use of this underlying linearity to allow learning and representing the part-whole relationships of the entities, thus detecting the viewpoint variations [8]. Recurrent dynamic routing is a routing-by-agreement iterative approach, in which each lower-level capsule sends its vote to the capsules in the higher level that agree. These agreements are achieved through many iterations of adjusting the routing coefficients. 
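To make the cost of this recurrence concrete, here is a simplified numpy sketch of routing-by-agreement in the spirit of DynamicCaps [23] (vector capsules; the shapes, the zero logit initialization, and the agreement update are illustrative simplifications, not the exact published algorithm):

```python
import numpy as np

def squash(s):
    """Squashing non-linearity: shrinks short vectors toward zero and
    long vectors toward (but below) unit length."""
    norm2 = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + 1e-9)

def dynamic_routing(votes, num_iters=3):
    """Recurrent routing-by-agreement (simplified sketch).
    votes: (n_lower, n_higher, dim) votes cast by lower-level capsules.
    Each iteration re-normalizes the routing logits by the agreement
    between the votes and the current higher-level outputs."""
    n_lower, n_higher, _ = votes.shape
    b = np.zeros((n_lower, n_higher))                         # routing logits
    for _ in range(num_iters):                                # the recurrence STAR-CAPS removes
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients
        s = (c[:, :, None] * votes).sum(axis=0)               # weighted sum of votes
        v = squash(s)                                         # higher-level outputs
        b = b + (votes * v[None, :, :]).sum(axis=-1)          # agreement update
    return v, c

rng = np.random.default_rng(0)
v_out, coup = dynamic_routing(rng.standard_normal((8, 4, 16)))
assert v_out.shape == (4, 16) and np.allclose(coup.sum(axis=1), 1.0)
```

Every iteration of the loop touches all n_lower × n_higher votes, which is exactly the per-iteration cost that a non-recurrent routing mechanism avoids.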
The routing-by-agreement algorithm in DynamicCaps [23] is a dynamic iterative mechanism based on coordinate descent optimization, whereas in EMCaps [8] the routing is based on an Expectation-Maximization procedure.
2.2 Attention
The Transformer [26] relies on multi-head self-attention to capture the dependencies between the input and the output. The self-attention layers decide how to attend to various parts of the input and generate attention coefficients to update the representations. Compared to the recurrent layers used in recurrent neural networks, self-attention layers, which do not use any recurrence, have been shown to be faster and more powerful [26]. One can notice the relation between the self-attention mechanism employed in the Transformer [26] and the recurrent dynamic routing approaches [8, 23] in capsule networks. Dynamic routing [8, 23] can be seen as an attention mechanism, but in the opposite direction. Dynamic routing is a bottom-up approach where the competition is between the higher-level capsules that a lower-level capsule might send its vote to; whereas attention-based routing is a top-down approach where the competition is between the lower-level capsules that a higher-level capsule might attend to. Several prior works have utilized attention mechanisms with capsule-based networks. Zhang et al. [30] proposed a relation extraction approach based on capsule networks with attention; however, the proposed attention mechanism was used as an augmentation to a capsule network [23] that utilizes a dynamic routing mechanism. Li et al. [18] proposed to improve the information aggregation for multi-head attention with a dynamic routing algorithm. Xinyi et al. [29] proposed a capsule graph network that utilizes an attention module to scale node embeddings, followed by dynamic routing to generate graph capsules. 
Differently from the prior work, we propose a capsule-based architecture that replaces the recurrent dynamic routing mechanism by a non-recurrent attentive routing mechanism.
2.3 Straight-Through Estimators
Our approach utilizes routing modules to make binary decisions to either connect or disconnect the route between capsules. Propagating gradients through discrete stochastic nodes has been studied in the literature; for instance, Bengio et al. [1] proposed a straight-through estimator to estimate and propagate the gradients through discrete stochastic neurons. In STAR-CAPS, we adopt a straight-through estimator based on Gumbel-Softmax [10] to implement the binary routers. Differently from our approach, Guo et al. [5] and Viet et al. [27] use Gumbel-Softmax [10] to decide which layers in a CNN to fine-tune during transfer learning, and for adaptive inference in CNNs, respectively.

Figure 2: Overview of a STAR-CAPS layer.

3 STAR-CAPS Architecture

STAR-CAPS is a capsule-based network that utilizes a straight-through attentive routing mechanism. We opt to formulate each capsule as a matrix rather than a vector to save parameters [8]. Given the pose features from the lower-level capsules, we transform the pose through shared trainable weight matrices, i.e. a single weight matrix between each lower-level capsule and all the higher-level capsules. We call the output of this transformation the pre-vote. 
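The parameter and computation saving of the shared transformation can be illustrated with a small numpy sketch (all sizes here are hypothetical, chosen only for the example):

```python
import numpy as np

# Hypothetical sizes: n_lower = 8 lower-level capsules, (p x p) poses with p = 4.
n_lower, n_higher, p = 8, 5, 4
rng = np.random.default_rng(0)
poses = rng.standard_normal((n_lower, p, p))     # P_i, lower-level pose matrices

# STAR-CAPS keeps a SINGLE trainable matrix W_i per lower-level capsule,
# shared across all higher-level capsules: only n_lower matrix products.
W = rng.standard_normal((n_lower, p, p))
pre_votes = poses @ W                            # pre-vote of capsule i: P_i @ W_i

# Classic voting instead needs a distinct matrix per (lower, higher) pair,
# i.e. n_lower * n_higher matrix products per spatial position.
W_full = rng.standard_normal((n_lower, n_higher, p, p))
votes = poses[:, None] @ W_full                  # (n_lower, n_higher, p, p)

assert pre_votes.shape == (n_lower, p, p)
assert votes.shape == (n_lower, n_higher, p, p)
```

The shared form computes n transforms where the voting step of the baseline computes n × m, which is where the efficiency gain of the pre-vote comes from.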
The routing between the lower-level and higher-level capsules takes place through two components: the Attention Estimator and the Straight-Through Router. Given the pre-vote, each Attention Estimator estimates an attentive coefficient matrix that acts as a soft relevance signal for each higher-level capsule. Additionally, each Attention Estimator is sequentially augmented by a Straight-Through Router, a differentiable binary router that acts as a gate. This router estimates a binary signal that decides whether to connect or disconnect the current route between the lower-level capsule and the higher-level capsule. The binary signal estimated by the router can be seen as a hard-attention coefficient, albeit differentiable. Conceptually, each route can be seen as a double-attention (soft & hard) mechanism. Between each lower-level capsule and all the higher-level capsules, we build a tree of double-attentions, thus creating a forest of double-attentions in each capsule layer. During training, each double-attention component learns the connectivity between capsules in a stochastic dynamic manner, yet differentiably, which can be seen as an attention-based connectivity search mechanism. Next, we give an overview of the overall architecture (§ 3.1), then we discuss the Attention Estimator (§ 3.2) and the Straight-Through Router (§ 3.3).

3.1 Overview
Our architecture starts with a regular convolutional layer (Conv) with kernel (k̆ × k̆), c̆ channels and ReLU non-linearity, followed by a sequence of capsule layers. The first capsule layer is a primary capsule type (PrimaryCaps) [8], followed by a set of m convolutional capsule layers (ConvCaps). PrimaryCaps and ConvCaps layers have a kernel size of (k × k). The final layer (ClassCaps) predicts the classes, where each class is represented by one capsule, i.e. the number of capsules is equal to the number of classes. 
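The soft/hard double-attention route described at the start of this section can be sketched in numpy as follows (hypothetical layer sizes; the random attention matrices and binary gates stand in for the outputs of the learned Attention Estimators and Straight-Through Routers):

```python
import numpy as np

rng = np.random.default_rng(0)
n_lower, n_higher, p = 6, 3, 4                    # hypothetical layer sizes
pre_votes = rng.standard_normal((n_lower, p, p))  # pre-votes from lower capsules
A = rng.random((n_lower, n_higher, p, p))         # soft attention matrices A_ij
delta = rng.integers(0, 2, (n_lower, n_higher))   # hard 0/1 routing decisions

def route(pre_votes, A, delta):
    """Double-attention aggregation: keep only connected routes (delta == 1),
    normalize their soft attention element-wise across the connected lower
    capsules, and sum the resulting attentive votes per higher capsule."""
    mask = delta[:, :, None, None].astype(float)
    denom = (A * mask).sum(axis=0, keepdims=True) + 1e-9  # per-higher normalizer
    A_norm = A / denom                                    # normalized attention
    att_votes = pre_votes[:, None] * A_norm               # attentive votes
    return (att_votes * mask).sum(axis=0)                 # (n_higher, p, p) poses

out = route(pre_votes, A, delta)
assert out.shape == (n_higher, p, p)
```

Note that a higher-level capsule whose routes are all gated off simply receives a zero pose in this sketch; the learned routers make such all-off configurations a trainable outcome rather than a failure mode.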
Each capsule layer ℓ ∈ {0, . . . , m, m + 1} contains n_ℓ capsules. Each capsule is composed of a pose matrix defined explicitly, whereas the activation is implicitly encoded, as we will discuss later. We use the following notation to define a capsule network:

{Conv(k̆, c̆), PrimaryCaps(k, n_0), {ConvCaps_ℓ(k, n_ℓ) | 1 ≤ ℓ ≤ m}, ClassCaps(n_{m+1})}

ConvCaps is the key layer of the architecture, where the routing between capsules takes place. Figure 2 illustrates an overview of ConvCaps ℓ using our proposed routing mechanism. The input of ConvCaps ℓ is the set of pose matrices P_{ℓ−1} = {P_i ∈ R^{p×p} | i ∈ {1, . . . , n_{ℓ−1}}} generated by the lower-level capsules in layer ℓ − 1. Correspondingly, the output is the set of pose matrices P_ℓ = {P̃_j ∈ R^{p×p} | j ∈ {1, . . . , n_ℓ}} generated by the higher-level capsules defined in the current layer ℓ. Pose matrices are not stored parameters; they act as a group of activities.
Transformation of Input Pose: Given P_{ℓ−1}, each input pose matrix¹ P_i ∈ R^{p×p} is multiplied by a trainable transformation matrix W_i ∈ R^{p×p}. We point out that the output of each transformation is not the actual vote, considering that there is a single transformation matrix for each input pose matrix. Thus, we call the transformed pose the pre-vote V^pre_i ∈ R^{p×p}:

V^pre_i = P_i W_i,  ∀i ∈ {1, . . . , n_{ℓ−1}}   (1)

Attentions: For capsule i, we build a tree structure of Attention Estimator (§ 3.2) modules. Each module estimates a distinct attentive matrix A_ij for every capsule j, given the shared V^pre_i:

{T_ij : V^pre_i ∈ R^{p×p} → A_ij ∈ R^{p×p} | i ∈ {1, . . . , n_{ℓ−1}}, j ∈ {1, . . . , n_ℓ}}   (2)

Routers: Given the attentive matrix A_ij estimated by T_ij (Eqn. 2), a Straight-Through Router (§ 3.3) R_ij, acting as a gate, estimates a binary decision value δ_ij ∈ {0, 1} indicating whether to disconnect (δ_ij = 0) or connect (δ_ij = 1) the route between capsules i and j. This mechanism can be seen as a hard attention, yet differentiable (see § 3.3), where each R_ij sends its hard attention signal to the higher-level capsules.

{R_ij : A_ij ∈ R^{p×p} → δ_ij ∈ {0, 1} | i ∈ {1, . . . , n_{ℓ−1}}, j ∈ {1, . . . , n_ℓ}}   (3)

Calculation of Output Pose: Each higher-level capsule j receives a list of n_{ℓ−1} tuples of features, where each tuple (V^pre_i, A_ij, δ_ij) is generated by the lower-level capsule i. The output pose matrix P̃_j ∈ R^{p×p} of capsule j in ConvCaps ℓ is calculated as follows:

Ã_ij = A_ij ⊘ Σ_{i=1, δ_ij=1}^{n_{ℓ−1}} A_ij, ∀i ∈ {1, . . . , n_{ℓ−1}};   P̃_j = Σ_{i=1, δ_ij=1}^{n_{ℓ−1}} V^pre_i ⊙ Ã_ij   (4)

where ⊘ is element-wise division, Σ_{δ_ij=1} is a summation masked by δ_ij, ⊙ is element-wise product, and (V^pre_i ⊙ Ã_ij) is the attentive vote V^attn_ij.
Activation Probability: The ClassCaps layer (ℓ = m + 1) outputs the final predictions, where each capsule represents a single class. The activation probability a_t indicates the presence of an object class t. This activation is implicitly encoded in the capsule. Given P̃_t, we calculate a_t as follows:

a_t = M(σ(P̃_t)) = (1/p²) Σ_{s=1}^{p} Σ_{ŝ=1}^{p} σ(P̃_t[s, ŝ]),   t ∈ {1, . . . , n_{m+1}}   (5)

where σ is a sigmoid function and M is a global average pooling [19].
Loss Function: Given the activations a_t, t ∈ {1, . . . , n_{m+1}}, we calculate the spread loss [8].

3.2 Attention Estimator
The role of the Attention Estimator T_ij (Eqn. 2) is to estimate the attentive matrix A_ij ∈ R^{p×p} with c channels. We propose an efficient bottleneck architecture which consists of 3 convolutional layers. The architecture² starts with Conv2D(c, 1x1, d) and ends with Conv2D(d, 1x1, c), followed by a BatchNorm [9] and a LeakyReLU [20]. We set c = k² and d ≤ k². Inspired by the recent work of Wu et al. [28], we design the middle layer as a lightweight 2D convolution (LightConv2D) with H attention heads, which is a depth-wise separable [2, 11, 24] convolution that shares d/H output channels and whose weights are normalized using a Softmax2D.

¹For each input sample in the training batch, the size of the pose matrix is (c × p × p), where c is the number of channels. For simplicity, we frequently omit c from our notation.
²Conv2D(c, 1x1, d) is a 2D convolution with c input channels, 1x1 kernel size, d output channels.

3.3 Straight-Through Router
Given the attentive matrix A_ij, the Straight-Through Router R_ij (Eqn. 3) estimates a binary decision signal δ_ij ∈ {0: disconnect, 1: connect}. We design the router to be a differentiable hard attention module. The intuition is to allow learning the attention-based connectivity or relevance between capsules. The Straight-Through Router consists of two sequential sub-modules, the Decision-Learner and the Decision-Maker. We discuss the details³ next.
Decision-Learner: The Decision-Learner learns a pair of decision scores Π ∈ R²; we will assume Π = {π_0, π_1}. Conceptually, it can be defined as DL_θDL : A ∈ R^{c×p×p} → Π ∈ R². First, we apply a global average pooling [19] on A, to capture the confidence maps [19] of the c channels and to reduce the computational complexity. 
Then, we apply Conv2D(c, 1x1, c) followed by a BatchNorm [9] and a LeakyReLU [20]. Finally, we apply Conv2D(c, 1x1, 2) to generate the unnormalized real-valued decision scores Π. Empirically, this simple architecture enables fast and efficient estimation of the decision scores, which is essential to minimize the overall computational overhead of the routing process between the capsules.
Decision-Maker: Given the real-valued scores Π, we estimate a binary decision parameter δ ∈ {0, 1} that indicates a decision chosen from a set of two mutually exclusive and exhaustive events: (i) connect (if δ = 1) or (ii) disconnect (if δ = 0) the route between the current two capsules. The Decision-Maker can be represented as DM : Π ∈ R² → I ∼ Bernoulli(δ), where I is a Bernoulli (indicator) random variable parameterized by δ ∈ {0, 1}. Conceptually, this representation can be seen as a binarization function of the real-valued scores Π such that each value in the pair of binary outcomes is the complement of the other. A simple way to implement DM is to adopt a deterministic approach during training, such as selecting the position with the maximum value of Π. However, this approach is not differentiable and tends to memorize the same generated binary samples throughout training. Propagating gradients through discrete stochastic nodes has been studied in the literature; for example, Bengio et al. [1] proposed a "straight-through estimator" to estimate and propagate the gradients through discrete stochastic neurons. In our work, we adopt a "straight-through estimator" based on Gumbel-Softmax [10].
Given a discrete categorical distribution with class probabilities, we can draw samples using the Gumbel-Max trick [21, 4]. 
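A self-contained, pure-Python sketch of the Gumbel-Max trick, together with its softmax relaxation that underlies the Gumbel-Softmax estimator (the two-class log-scores here stand in for the router's decision scores Π; this is an illustration, not the paper's implementation):

```python
import math
import random

random.seed(0)

def gumbel():
    """Draw g ~ Gumbel(0, 1) as -log(-log(U)) with U ~ Uniform(0, 1)."""
    return -math.log(-math.log(random.random()))

def gumbel_max_sample(log_probs):
    """Gumbel-Max trick: argmax_k (log p_k + g_k) is an exact sample from
    the categorical distribution with probabilities p_k."""
    noisy = [lp + gumbel() for lp in log_probs]
    return max(range(len(noisy)), key=noisy.__getitem__)

def gumbel_softmax(log_probs, tau=1.0):
    """Differentiable relaxation: softmax of (log p_k + g_k) / tau.
    As tau -> 0 this approaches the one-hot argmax of the noisy scores."""
    noisy = [(lp + gumbel()) / tau for lp in log_probs]
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    z = sum(exps)
    return [e / z for e in exps]

# Empirical check of the hard sampler against the target p = (0.2, 0.8),
# i.e. "connect" with probability 0.8.
log_p = [math.log(0.2), math.log(0.8)]
draws = [gumbel_max_sample(log_p) for _ in range(20000)]
freq_connect = sum(draws) / len(draws)
assert abs(freq_connect - 0.8) < 0.02

soft = gumbel_softmax(log_p, tau=0.5)
assert abs(sum(soft) - 1.0) < 1e-9
```

A straight-through estimator combines the two: the hard sample is used in the forward pass, while gradients flow through the soft relaxation in the backward pass.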
In our case, we have two classes (disconnect and connect), and we assume that the unnormalized real-valued scores {π_0, π_1} generated by DL_θDL are the log probabilities of these two classes, i.e. π_κ = log[p_κ], where κ ∈ {0, 1} and p_κ is the probability of class κ. Thus, we can draw a sample from a Bernoulli distribution (as a special case of the categorical distribution) parameterized by {p_0, p_1} as follows:

µ = argmax_{κ ∈ {0, 1}} [(π_0 + g_0), (π_1 + g_1)]   (6)

where {g_0, g_1} are i.i.d. samples drawn from the Gumbel distribution Gumbel(0, 1), acting as noise to introduce stochasticity; Gumbel(0, 1) is defined as −log(−log(U)), U ∼ Uniform(0, 1). The argmax is non-differentiable, however. Alternatively, we can use the Gumbel-Softmax estimator [10] to sample from a discrete Bernoulli distribution, by using a softmax as a continuous differentiable approximation to argmax:

ν_κ = exp((π_κ + g_κ)/τ) / Σ_{κ̂=0}^{1} exp((π_κ̂ + g_κ̂)/τ),   κ ∈ {0, 1},   where τ is the temperature   (7)

The Decision-Maker DM is implemented as a straight-through Gumbel-softmax [10], which uses Eqn. (6) in the forward pass. Thus, the binary decision parameter δ = µ. In the backward pass, the gradients of the binary samples are approximated by computing the gradients of the continuous softmax in Eqn. (7), i.e. ∇_θ µ ≈ ∇_θ ν.

4 Experiments

We evaluated our approach on the task of image classification using the following datasets: MNIST [15], SmallNorb [16], CIFAR-10 [13], CIFAR-100 [13], and ImageNet [3]. 
The baseline models are based on EMCaps [8], since the capsule in EMCaps [8] is formulated as a matrix similar to our approach, and it showed better general performance compared to DynamicCaps [23].

³Henceforth, for simplicity we omit the subscript index (ij) from our notation.

Models and training settings. Unless otherwise specified, STAR-CAPS models as well as EMCaps [8] models consist of a (5 × 5) Conv with ReLU, 1 PrimaryCaps, 2 ConvCaps, and 1 ClassCaps. The kernel size of ConvCaps is k = 3. The number of channels of Conv and the number of capsules in each layer will be specified for each model using the following notation: #capsules={c̆, n_0, n_1, n_2, n_3}, as described in (§ 3). We use the Adam [12] optimizer, with coefficients (0.9, 0.999). The initial learning rate is 0.01, and the training batch size is T = 128.

Figure 3: Comparison between STAR-CAPS and EMCaps [8] models trained on MNIST. The gray box shows #capsules {c̆, n_0, n_1, n_2, n_3}; whereas the green box shows the (training time; testing time) in secs per batch.

4.1 Evaluation on MNIST
We perform training on MNIST [15] gray-level 28x28 images. The dataset consists of 60K training images and 10K testing images. We compare different STAR-CAPS and EMCaps models in terms of accuracy, training time, and testing time. For STAR-CAPS models, we set k̂ = 3 and d = 3. For EMCaps models, the number of routing iterations is 2. Figure 3 shows the classification accuracy of different STAR-CAPS and EMCaps models. Each model varies in terms of the number of capsules and the number of parameters. We notice that STAR-CAPS models yield better accuracy compared to EMCaps models. Furthermore, STAR-CAPS shows more stable performance and faster training and testing time. We point out that we could not train an EMCaps model with a larger number of parameters than the model shown in Figure 3, i.e. the EMCaps:{32, 32, 32, 32, 10} model with 319K parameters. 
This is because larger EMCaps models, in addition to being very expensive to train, were overfitting under different hyperparameter settings.

Table 1: Performance sensitivity to the predefined # capsules: STAR-CAPS vs. EMCaps evaluated on MNIST. We report (mean±std) of the test accuracy over 3 runs.

Model | #Params | Accuracy (mean±std)
STAR-CAPS:{32, 4, 64, 4, 10} | 143K | 99.49±0.11
EMCaps:{32, 4, 64, 4, 10} | 77K | 96.89±0.13
STAR-CAPS:{64, 8, 64, 8, 10} | 281K | 99.57±0.09
EMCaps:{64, 8, 64, 8, 10} | 159K | 98.12±0.12

Our experiments show that the performance of the baseline EMCaps [8] models can be sensitive to the number of capsules defined for each layer and their initializations. For instance, on MNIST, when training an EMCaps model in which one or more capsule layers contain a large number of capsules while the lower-level or the higher-level capsule layers have a small number of capsules, the performance of this model becomes unstable even with careful initializations of the capsules. On the other hand, STAR-CAPS mitigates this problem by learning to disconnect the superfluous capsules during routing more efficiently. In Table 1, we compare the performance of STAR-CAPS and EMCaps using two model variations that use different numbers of capsules.
4.2 Evaluation on SmallNorb
SmallNorb [16] contains gray-level stereo images of 5 toy classes. Each image represents 18 azimuths (range 0-340), 6 lighting variations, and 9 elevations. We follow the data preprocessing of EMCaps [8], yielding randomly cropped training image patches of size 32x32. We compare the performance of two different STAR-CAPS and EMCaps models with comparable numbers of parameters. 
STAR-CAPS:{32, 8, 8, 8, 5} achieves 98.0% compared to EMCaps:{64, 8, 16, 16, 5}, which achieves 97.8%; whereas both STAR-CAPS:{32, 32, 16, 16, 5} and EMCaps:{32, 32, 32, 32, 5} achieve 98.2%.

Table 2: Detection of novel viewpoints on SmallNorb

Model | #Params | Familiar | Novel
Type1 (low capacity): EMCaps | 73K | 95.72±0.02 | 86.07±0.03
Type1 (low capacity): STAR-CAPS | 68K | 95.66±0.03 | 86.12±0.05
Type2 (high capacity): CNN | 4.2M | 96.3 | 80.0
Type2 (high capacity): EMCaps | 316K | 96.3 | 86.5
Type2 (high capacity): STAR-CAPS | 318K | 96.3 | 86.3

Detection of novel viewpoints: We use SmallNorb to evaluate the ability of STAR-CAPS to detect novel viewpoints, similar to the experiments in EMCaps [8]. We create a subset of SmallNorb with two parts, each part containing images of a distinct azimuth range, as follows: "Train-viewpoints", which contains the training images with azimuths (300, 320, 340, 0, 20, 40), and "Test-viewpoints", which has the testing images of the azimuth range (60-280). We train two types of models (low capacity and high capacity) for STAR-CAPS and EMCaps on "Train-viewpoints", and we evaluate the models on "Test-viewpoints". Table 2 shows two types of experiments on SmallNorb (novel and familiar viewpoints). Type1: 3 runs of EMCaps:{64, 8, 16, 16, 5} and STAR-CAPS:{32, 8, 8, 8, 5}, fully trained on familiar views and tested on both novel and familiar views. Type2: EMCaps:{32, 32, 32, 32, 5} and STAR-CAPS:{32, 32, 16, 16, 5}, trained on familiar views and early-stopped when test accuracy reached 96.3% (as the CNN model in [8]). In Type1, we notice that STAR-CAPS achieves comparable results (small differences in accuracy) to EMCaps on both familiar and novel viewpoints. 
In Type2, on the novel viewpoints, STAR-CAPS performs dramatically better than the CNN model (+6.3%) and its accuracy is only slightly lower than that of EMCaps (-0.2%).

4.3 Evaluation on CIFAR-10/CIFAR-100

The CIFAR-10 and CIFAR-100 [13] datasets contain images of size 32x32, with 10 and 100 classes, respectively. For each dataset, the training set consists of 50,000 images and the test set contains 10,000 images. We train a CIFAR-10 model based on STAR-CAPS:{32, 8, 8, 8, 10}, which achieves a test accuracy of 91.23% with a test time of 0.21 secs/batch, compared to EMCaps:{256, 32, 32, 32, 10}, which achieves 88.10%. Another relevant work is EncapNet [17], which achieves an accuracy of 88.07%. On CIFAR-100, our STAR-CAPS model achieves 67.66%, while an EMCaps:{256, 32, 32, 32, 100} model failed to converge, yielding 19%.

4.4 Evaluation on ImageNet

ImageNet [3] is a large-scale dataset with 1000 classes. To our knowledge, no prior work has been able to train an EMCaps [8] model on ImageNet. We point out that the EncapNet [17] model, which reported preliminary results on ImageNet, was built upon a deep residual network [7] augmented with a capsule module. We construct a STAR-CAPS model that starts with a 7x7 Conv layer producing 64 output channels, followed by a single bottleneck residual block with 256 output channels. We then add 4 capsule layers, with 64 capsules for the PrimaryCaps layer and 128 capsules for the ConvCaps layers. The Top-1 validation accuracy of this model is 60.07% and the Top-5 accuracy is 85.66%.

5 Conclusion

We presented STAR-CAPS, a capsule-based network that utilizes straight-through attentive routing to address the computational complexity of capsule networks.
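As a rough illustration of the straight-through routing just mentioned, the sketch below implements the generic straight-through estimator idea [1] in NumPy: the forward pass hard-thresholds a routing logit into a binary connect/disconnect decision, while the backward pass propagates gradients through the soft sigmoid as if the threshold were the identity. The function names, shapes, and the sigmoid parameterization are our own illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def st_route(logits):
    """Forward pass of a straight-through binary router (illustrative).

    Each logit scores one route between a lower- and a higher-level
    capsule; the forward pass hard-thresholds it to a 0/1 connect
    decision, while the soft sigmoid is kept for the backward pass.
    """
    soft = sigmoid(logits)
    hard = (soft > 0.5).astype(logits.dtype)
    return hard, soft

def st_route_grad(soft, grad_out):
    """Straight-through backward pass: the non-differentiable
    threshold is treated as the identity, so the incoming gradient
    flows through the sigmoid derivative soft * (1 - soft)."""
    return grad_out * soft * (1.0 - soft)

# Routing logits for three hypothetical capsule routes.
logits = np.array([-2.0, 0.3, 1.5])
hard, soft = st_route(logits)   # hard == [0., 1., 1.]
grad = st_route_grad(soft, np.ones_like(soft))
```

Because each connect/disconnect decision is made in a single pass, no recurrent agreement iterations are required, which is what allows the stable and faster performance reported above relative to EM-based routing.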
The proposed routing is a double-attention mechanism utilizing (a) Attention Estimators, which estimate attention matrices between capsules, and (b) Straight-Through Routers, which make binary connectivity decisions between capsules. Our experiments showed that STAR-CAPS outperforms the baseline capsule models.

Acknowledgments

This work was funded in part by NSF award CNS-120552. We gratefully acknowledge NVIDIA and Facebook for the donation of GPUs used for portions of this work.

References

[1] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[2] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1251–1258, 2017.

[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.

[4] E. J. Gumbel. Statistical theory of extreme values and some practical applications: a series of lectures. US Govt. Print. Office, number 33, 1954.

[5] Yunhui Guo, Honghui Shi, Abhishek Kumar, Kristen Grauman, Tajana Rosing, and Rogerio Feris. SpotTune: Transfer learning through adaptive fine-tuning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4805–4814, 2019.

[6] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2961–2969, 2017.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[8] Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. In ICLR, 2018.

[9] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 448–456, 2015.

[10] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In ICLR, 2017.

[11] Lukasz Kaiser, Aidan N. Gomez, and Francois Chollet. Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059, 2017.

[12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[13] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.

[15] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits, 1998.

[16] Yann LeCun, Fu Jie Huang, Leon Bottou, et al. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, pages 97–104, 2004.

[17] Hongyang Li, Xiaoyang Guo, Bo Dai, Wanli Ouyang, and Xiaogang Wang. Neural network encapsulation. In ECCV, 2018.

[18] Jian Li, Baosong Yang, Zi-Yi Dou, Xing Wang, Michael R. Lyu, and Zhaopeng Tu. Information aggregation for multi-head attention with routing-by-agreement.
arXiv preprint arXiv:1904.03100, 2019.

[19] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In International Conference on Learning Representations (ICLR), 2014.

[20] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of ICML, volume 30, page 3, 2013.

[21] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

[22] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.

[23] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866, 2017.

[24] L. Sifre and S. Mallat. Rigid-motion scattering for texture classification. PhD thesis, Ecole Polytechnique, CMAP, 2014.

[25] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.

[26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[27] Andreas Veit and Serge Belongie. Convolutional networks with adaptive inference graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–18, 2018.

[28] Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. In ICLR, 2019.

[29] Zhang Xinyi and Lihui Chen. Capsule graph neural network.
In ICLR, 2019.

[30] Ningyu Zhang, Shumin Deng, Zhanlin Sun, Xi Chen, Wei Zhang, and Huajun Chen. Attention-based capsule networks with dynamic routing for relation extraction. arXiv preprint arXiv:1812.11321, 2018.