{"title": "CondConv: Conditionally Parameterized Convolutions for Efficient Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 1307, "page_last": 1318, "abstract": "Convolutional layers are one of the basic building blocks of modern deep neural networks. One fundamental assumption is that convolutional kernels should\nbe shared for all examples in a dataset. We propose conditionally parameterized convolutions (CondConv), which learn specialized convolutional kernels\nfor each example. Replacing normal convolutions with CondConv enables us to increase the size and capacity of a network, while maintaining efficient inference. We demonstrate that scaling networks with CondConv improves the performance and inference cost trade-off of several existing convolutional neural\nnetwork architectures on both classification and detection tasks. On ImageNet classification, our CondConv approach applied to EfficientNet-B0 achieves state-ofthe-art performance of 78.3% accuracy with only 413M multiply-adds. Code and checkpoints for the CondConv Tensorflow layer and CondConv-EfficientNet models are available at: https://github.com/tensorflow/tpu/tree/master/ models/official/efficientnet/condconv.", "full_text": "CondConv: Conditionally Parameterized\n\nConvolutions for Ef\ufb01cient Inference\n\nBrandon Yang\u2217\nGoogle Brain\n\nbcyang@google.com\n\nQuoc V. Le\nGoogle Brain\n\nqvl@google.com\n\nGabriel Bender\nGoogle Brain\n\ngbender@google.com\n\nJiquan Ngiam\nGoogle Brain\n\njngiam@google.com\n\nAbstract\n\nConvolutional layers are one of the basic building blocks of modern deep neu-\nral networks. One fundamental assumption is that convolutional kernels should\nbe shared for all examples in a dataset. We propose conditionally parameter-\nized convolutions (CondConv), which learn specialized convolutional kernels\nfor each example. 
Replacing normal convolutions with CondConv enables us to increase the size and capacity of a network, while maintaining efficient inference. We demonstrate that scaling networks with CondConv improves the performance and inference cost trade-off of several existing convolutional neural network architectures on both classification and detection tasks. On ImageNet classification, our CondConv approach applied to EfficientNet-B0 achieves state-of-the-art performance of 78.3% accuracy with only 413M multiply-adds. Code and checkpoints for the CondConv TensorFlow layer and CondConv-EfficientNet models are available at: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet/condconv.

1 Introduction

Deep convolutional neural networks (CNNs) have achieved state-of-the-art performance on many tasks in computer vision [23, 22]. Improvements in performance have largely come from increasing model size and capacity to scale to larger and larger datasets [29, 17, 33]. However, current approaches to increasing model capacity are computationally expensive. Deploying the best-performing models for inference can consume significant datacenter capacity [19] and is often not feasible for applications with strict latency constraints.
One fundamental assumption in the design of convolutional layers is that the same convolutional kernels are applied to every example in a dataset. To increase the capacity of a model, model developers usually add more convolutional layers or increase the size of existing convolutions (kernel height/width, number of input/output channels). In either case, the computational cost of the additional capacity increases proportionally to the size of the input to the convolution, which can be large. Due to this assumption and a focus on mobile deployment, current computationally efficient models have very few parameters [15, 36, 41].
However, there is a growing class of computer vision\napplications that are not constrained by parameter count, but have strict latency requirements at\ninference, such as real-time server-side video processing and perception for self-driving cars. In this\npaper, we aim to design models to better serve these applications.\n\n\u2217Work done as part of the Google AI Residency.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f(a) CondConv: (\u03b11W1 + . . . + \u03b1nWn) \u2217 x\n\n(b) Mixture of Experts: \u03b11(W1\u2217x)+. . .+\u03b1n(Wn\u2217x)\nFigure 1: (a) Our CondConv layer architecture with n = 3 kernels vs. (b) a mixture of experts\napproach. By parameterizing the convolutional kernel conditionally on the input, CondConv is\nmathematically equivalent to the mixture of experts approach, but requires only 1 convolution.\n\nWe propose conditionally parameterized convolutions (CondConv), which challenge the paradigm\nof static convolutional kernels by computing convolutional kernels as a function of the input. In\nparticular, we parameterize the convolutional kernels in a CondConv layer as a linear combination of\nn experts (\u03b11W1 + . . . + \u03b1nWn) \u2217 x, where \u03b11, . . . , \u03b1n are functions of the input learned through\ngradient descent. To ef\ufb01ciently increase the capacity of a CondConv layer, model developers can\nincrease the number of experts. This is much more computationally ef\ufb01cient than increasing the\nsize of the convolutional kernel itself, because the convolutional kernel is applied at many different\npositions within the input, while the experts are combined only once per input. 
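Figure 1 states that combining the expert kernels before convolving is equivalent to convolving with each expert separately and mixing the outputs. Because convolution is linear in the kernel, this can be checked numerically; the sketch below uses a toy 1-D convolution in plain Python, with made-up kernels, routing weights, and input values.

```python
def conv1d(w, x):
    """Valid 1-D convolution (cross-correlation) of kernel w over signal x."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]

# Two "experts" (static kernels) and example-dependent routing weights a1, a2.
w1, w2 = [1.0, 0.0, -1.0], [0.5, 0.5, 0.5]
a1, a2 = 0.3, 0.7
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# CondConv: combine the kernels once, then run a single convolution.
condconv = conv1d([a1 * u + a2 * v for u, v in zip(w1, w2)], x)

# Mixture of experts: run one convolution per expert, then mix the outputs.
moe = [a1 * p + a2 * q for p, q in zip(conv1d(w1, x), conv1d(w2, x))]

# The two formulations agree, but CondConv pays for only one convolution.
assert all(abs(p - q) < 1e-9 for p, q in zip(condconv, moe))
```

The same identity holds for 2-D convolutions and any number of experts, since the weighted sum of kernels commutes with the convolution.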
This allows model developers to increase model capacity and performance while maintaining efficient inference.
CondConv can be used as a drop-in replacement for existing convolutional layers in CNN architectures. We demonstrate that replacing convolutional layers with CondConv improves model capacity and performance on several CNN architectures on ImageNet classification and COCO object detection, while maintaining efficient inference. In our analysis, we find that CondConv layers learn semantically meaningful relationships across examples to compute the conditional convolutional kernels.

2 Related Work

Conditional computation. Similar to CondConv, conditional computation aims to increase model capacity without a proportional increase in computation cost. In conditional computation models, this is achieved by activating only a portion of the entire network for each example [3, 8, 5, 2]. However, conditional computation models are often challenging to train, since they require learning discrete routing decisions from individual examples to different experts. Unlike these approaches, CondConv does not require discrete routing of examples, so it can be easily optimized with gradient descent.
One approach to conditional computation uses reinforcement learning or evolutionary methods to learn discrete routing functions [34, 30, 25, 9]. BlockDrop [47] and SkipNet [45] use reinforcement learning to learn the subset of blocks needed to process a given input. Another approach uses unsupervised clustering methods to partition examples into sub-networks. Gross et al. [12] use a two-stage training and clustering pipeline to train a hard mixture of experts model. Mullapudi et al. [31] use clusters as labels to train a routing function between branches in a deep CNN model. Finally, Shazeer et al.
[37] proposed the sparsely-gated mixture-of-experts layer, which achieves significant success on large language modeling using noisy top-k gating.
Prior work in conditional computation demonstrates the potential of designing large models that process different sets of examples with different sub-networks. Our work on CondConv pushes the boundaries of this paradigm by enabling each individual example to be processed with different weights.
Weight generating networks. Ha et al. [13] propose the use of a small network to generate weights for a larger network. Unlike CondConv, for CNNs, these weights are the same for every example in the dataset. This enables greater weight-sharing, which achieves a lower parameter count but worse performance than the original network. In neural machine translation, Platanios et al. [32] generate weights to translate between different language pairs, but use the same weights for every example within each language pair.
Multi-branch convolutional networks. Multi-branch architectures like Inception [40] and ResNeXt [48] have shown success on a variety of computer vision tasks. In these architectures, a layer consists of multiple convolutional branches, which are aggregated to compute the final output. A CondConv layer is mathematically equivalent to a multi-branch convolutional layer where each branch is a single convolution and outputs are aggregated by a weighted sum, but it only requires the computation of one convolution.
Example-dependent activation scaling. Some recent work proposes to adapt the activations of neural networks conditionally on the input. Squeeze-and-Excitation networks [16] learn to scale the activations of every layer output. GaterNet [4] uses a separate network to select a binary mask over filters in a larger backbone network. Attention-based methods [28, 1, 44] scale previous layer inputs based on learned attention weights.
Scaling activations has similar motivations to CondConv, but is restricted to modulating activations in the base network.
Input-dependent convolutional layers. In language modeling, Wu et al. [46] use input-dependent convolutional kernels as a form of local attention. In vision, De Brabandere et al. [18] generate small input-dependent convolutional filters to transform images for next-frame and stereo prediction. Rather than learning input-dependent weights, Dai et al. [7] propose to learn different convolutional offsets for each example. Finally, in recent work, SplineNets [20] apply input-dependent convolutional weights, modeled as 1-dimensional B-splines, to implement continuous neural decision graphs.

3 Conditionally Parameterized Convolutions

In a regular convolutional layer, the same convolutional kernel is used for all input examples. In a CondConv layer, the convolutional kernel is computed as a function of the input example (Fig 1a). Specifically, we parameterize the convolutional kernels in CondConv by:

Output(x) = σ((α1 · W1 + ... + αn · Wn) ∗ x)

where each αi = ri(x) is an example-dependent scalar weight computed using a routing function with learned parameters, n is the number of experts, and σ is an activation function. When we adapt a convolutional layer to use CondConv, each kernel Wi has the same dimensions as the kernel in the original convolution.
We typically increase the capacity of a regular convolutional layer by increasing the kernel height/width or the number of input/output channels. However, each additional parameter in a convolution requires additional multiply-adds proportional to the number of pixels in the input feature map, which can be large. In a CondConv layer, we compute a convolutional kernel for each example as a linear combination of n experts before applying the convolution.
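As a concrete sketch (not the paper's TensorFlow implementation), the forward pass of a CondConv layer on a single example might look as follows. For brevity the sketch uses a 1-D, depthwise-style convolution rather than a full 2-D convolution; the routing follows the scheme described in this section (global average pooling, a fully-connected layer, Sigmoid), and the kernels, routing matrix R, and input values are illustrative placeholders.

```python
import math

def condconv_1d(x, experts, R):
    """One CondConv forward pass for a single example.

    x:       input feature map as a list of channels, each a list of values.
    experts: list of n kernels; each kernel holds one list of taps per channel
             (a depthwise-style 1-D kernel, for simplicity of the sketch).
    R:       routing matrix [channels][n], mapping pooled input to n expert logits.
    """
    n = len(experts)
    # Routing: global average pool per channel, fully-connected layer, Sigmoid.
    pooled = [sum(ch) / len(ch) for ch in x]
    logits = [sum(pooled[c] * R[c][e] for c in range(len(x))) for e in range(n)]
    alpha = [1.0 / (1.0 + math.exp(-z)) for z in logits]

    # Combine the n expert kernels once, conditioned on this example...
    k = len(experts[0][0])
    kernel = [[sum(alpha[e] * experts[e][c][j] for e in range(n))
               for j in range(k)]
              for c in range(len(x))]

    # ...then apply a single (depthwise) valid convolution per channel.
    return [[sum(kernel[c][j] * x[c][i + j] for j in range(k))
             for i in range(len(x[c]) - k + 1)]
            for c in range(len(x))]

# Toy example: 2 channels of length 4, n = 2 experts with 2-tap kernels.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.0, -0.5, 1.0]]
experts = [[[1.0, 0.0], [0.0, 1.0]],
           [[0.5, 0.5], [1.0, -1.0]]]
R = [[0.1, -0.2], [0.3, 0.4]]
y = condconv_1d(x, experts, R)
```

A different input x yields different pooled statistics, hence different routing weights alpha and a different effective kernel, which is the conditional parameterization the layer is named for.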
Crucially, each convolutional kernel only needs to be computed once but is applied at many different positions in the input image. This means that by increasing n, we can increase the capacity of the network with only a small increase in inference cost; each additional parameter requires only 1 additional multiply-add.
A CondConv layer is mathematically equivalent to a more expensive linear mixture of experts formulation, where each expert corresponds to a static convolution (Fig 1b):

σ((α1 · W1 + ... + αn · Wn) ∗ x) = σ(α1 · (W1 ∗ x) + ... + αn · (Wn ∗ x))

Thus, CondConv has the same capacity as a linear mixture of experts formulation with n experts, but is computationally efficient since it requires computing only one expensive convolution. This formulation gives insight into the properties of CondConv and relates it to prior work on conditional computation and mixture of experts. The per-example routing function is crucial to CondConv performance: if the learned routing function is constant for all examples, a CondConv layer has the same capacity as a static convolutional layer.
We wish to design a per-example routing function that is computationally efficient, able to meaningfully differentiate between input examples, and easily interpretable. We compute the example-dependent routing weights αi = ri(x) from the layer input in three steps: global average pooling, a fully-connected layer, and Sigmoid activation:

r(x) = Sigmoid(GlobalAveragePool(x) R)

Figure 2: On ImageNet validation, increasing the number of experts per layer of our CondConv-MobileNetV1 models improves performance relative to inference cost compared to the MobileNetV1 frontier [38] across a spectrum of model sizes. Models with more experts per layer achieve monotonically higher accuracy.
We train CondConv models with {1, 2, 4, 8, 16, 32} experts at width\nmultipliers {0.25, 0.50, 0.75, 1.0}.\n\nwhere R is a matrix of learned routing weights mapping the pooled inputs to n expert weights. A\nnormal convolution operation operates only over local receptive \ufb01elds, so our routing function allows\nadaptation of local operations using global context.\nThe CondConv layer can be used in place of any convolutional layer in a network. The same\napproach can easily be extended to other linear functions like those in depth-wise convolutions and\nfully-connected layers.\n\n4 Experiments\n\nWe evaluate CondConv on ImageNet classi\ufb01cation and COCO object detection by scaling up the\nMobileNetV1 [15], MobileNetV2 [36], ResNet-50 [14], MnasNet [41], and Ef\ufb01cientNet [42]\narchitectures. In practice, we have two options to train CondConv models, which are mathematically\nequivalent. We can either \ufb01rst compute the kernel for each example and apply convolutions with\na batch size of one (Fig. 1a), or we can use the linear mixture of experts formulation (Fig. 1b) to\nperform batch convolutions on each branch and sum the outputs. Current accelerators are optimized\nto train on large batch convolutions, and it is dif\ufb01cult to fully utilize them for small batch sizes. Thus,\nwith small numbers of experts (<=4), we found it to be more ef\ufb01cient to train CondConv layers\nwith the linear mixture of experts formulation and large batch convolutions, then use our ef\ufb01cient\nCondConv approach for inference. With larger numbers of experts (>4), training CondConv layers\ndirectly with batch size one is more ef\ufb01cient.\n\n4.1\n\nImageNet Classi\ufb01cation\n\nWe evaluate our approach on the ImageNet 2012 classi\ufb01cation dataset [35]. The ImageNet dataset\nconsists of 1.28 million training images and 50K validation images from 1000 classes. 
We train all models on the entire training set and compare the single-crop top-1 validation set accuracy at input image resolution 224x224. For MobileNetV1, MobileNetV2, and ResNet-50, we use the same training hyperparameters for all models on ImageNet, following [21], except we use BatchNorm momentum of 0.9 and disable the exponential moving average on weights. For MnasNet [41] and EfficientNet [42], we use the same training hyperparameters as the original papers, with the batch size, learning rate, and training steps scaled appropriately for our hardware configuration. For fair comparison, we retrain all of our baseline models with the same hyperparameters and regularization search space as the CondConv models. We measure performance as ImageNet top-1 accuracy relative to computational cost in multiply-adds (MADDs).

Table 1: ImageNet validation accuracy and inference cost for our CondConv models on several baseline model architectures. All models use 8 experts per CondConv layer. CondConv improves the accuracy of all baseline architectures with a small relative increase in inference cost (<10%).

                        Baseline2                     CondConv
                        MADDs (x10^6)  Top-1 (%)     MADDs (x10^6)  Top-1 (%)
MobileNetV1 (1.0x)      567            71.9          600            73.7
MobileNetV2 (1.0x)      301            71.6          329            74.6
MnasNet-A1              312            74.9          325            76.2
ResNet-50               4093           77.7          4213           78.6
EfficientNet-B0         391            77.2          413            78.3

For each baseline architecture, we evaluate CondConv by replacing convolutional layers with CondConv layers and increasing the number of experts per layer. We share routing weights between layers in a block (a residual block, inverted bottleneck block, or separable convolution). Additionally, for some models, we replace the fully-connected classification layer with a 1x1 CondConv layer. For the exact architectural details, refer to Appendix A.
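The small (<10%) overhead reported in Table 1 follows from the cost structure described in Section 3: routing and kernel mixing are paid once per example, while the convolution itself is paid at every output position. The back-of-the-envelope counter below makes this concrete; the layer shape is chosen for illustration only (it is not taken from any of the paper's architectures), and the pooling and Sigmoid costs are ignored as negligible.

```python
def conv_madds(h, w, k, cin, cout):
    """Multiply-adds for a standard conv: every weight is used at each output pixel."""
    return h * w * k * k * cin * cout

def condconv_madds(h, w, k, cin, cout, n):
    """CondConv: routing FC (cin*n) + combining n kernels (n * kernel params) + one conv."""
    kernel_params = k * k * cin * cout
    return cin * n + n * kernel_params + conv_madds(h, w, k, cin, cout)

# Illustrative 14x14 feature map, 3x3 conv, 256 -> 256 channels, 8 experts.
base = conv_madds(14, 14, 3, 256, 256)
cond = condconv_madds(14, 14, 3, 256, 256, 8)

# With 8 experts the layer holds ~8x the convolutional parameters,
# yet the extra multiply-adds stay well under 10% of the base conv.
assert (cond - base) / base < 0.10
```

The gap widens with the spatial size of the feature map: the base convolution cost scales with h*w, while the per-example expert-mixing cost does not.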
Our ablation experiments in Table 3 and Table 4\nsuggest CondConv improves performance across a wide range of layer and routing architectures.\nWe use two general regularization techniques for models with large capacity. First, we use\nDropout [39] on the input to the fully-connected layer preceding the logits, with keep probabil-\nity between 0.6 and 1.0. Second, we also add data augmentation using the AutoAugment [6]\nImageNet policy and Mixup [49] with \u03b1 = 0.2. To address over\ufb01tting in the large ResNet models, we\nadditionally introduce a new data augmentation technique for CondConv based on Shake-Shake [10]\nby randomly dropping out experts during training.\nOn MobileNetV1, we \ufb01nd that increasing the number of CondConv experts improves accuracy\nrelative to inference cost compared to the performance frontier with static convolutional scaling\ntechniques using the channel width and input size (Figure 2). Moreover, we \ufb01nd that increasing\nthe number of CondConv experts leads to monotonically increasing performance with suf\ufb01cient\nregularization.\nWe further \ufb01nd that CondConv improves performance relative to inference cost on a wide range of\narchitectures (Table 1). This includes architectures that take advantage of architecture search [42, 41],\nSqueeze-and-Excitation [16], and large architectures with ordinary convolutions not optimized for\ninference time [14]. For more in depth comparisons, see Appendix A.\nOur CondConv-Ef\ufb01cientNet-B0 model achieves state-of-the-art performance of 78.3% accuracy\nwith 413M multiply-adds, when compared to the MixNet frontier [43]. To directly compare our\nCondConv scaling approach to the compound scaling coef\ufb01cient proposed by Tan et al. [42], we\nadditionally scale the CondConv-Ef\ufb01cientNet-B0 model with a depth multiplier of 1.1x, which\nwe call CondConv-Ef\ufb01cientNet-B0-depth. Our CondConv-Ef\ufb01cientNet-B0-depth model achieves\n79.5% accuracy with only 614M multiply-adds. 
When trained with the same hyperparameters and regularization search space, the EfficientNet-B1 model, which is scaled from the EfficientNet-B0 model using the compound coefficient, achieves 79.2% accuracy with 700M multiply-adds. In this regime, CondConv scaling outperforms static convolutional scaling with the compound coefficient.

4.2 COCO Object Detection

We next evaluate the effectiveness of CondConv on a different task and dataset with the COCO object detection dataset [24]. Our experiments use the MobileNetV1 feature extractor and the Single Shot Detector [26] with 300x300 input resolution (SSD300).
Following Howard et al. [15], we train on the combined COCO training and validation sets excluding 8,000 minival images, on which we evaluate our networks. We train our models using a batch size of 1024 for 20,000 steps. For the learning rate, we use linear warmup from 0.3 to 0.9 over 1,000 steps, followed by cosine decay [27] from 0.9. We use the data augmentation scheme proposed by Liu et al. [26]. We use the same convolutional feature layer dimensions, SSD hyperparameters, and training hyperparameters across all models. We measure performance as COCO minival mean average precision (mAP) relative to computational cost in multiply-adds (MADDs).
We use our CondConv-MobileNetV1 models with depth multipliers {0.50, 0.75, 1.0} as the feature extractors for object detection. We further replace the additional convolutional feature extractor layers in SSD with CondConv layers.

2 Our re-implementation of the baseline models and our CondConv models use the same hyperparameters and regularization search space for fair comparison. For reference, published results for the baselines are: MobileNetV1 (1.0x): 70.6% [15]. MobileNetV2 (1.0x): 72.0% [36]. MnasNet-A1: 75.2% [41]. ResNet-50: 76.4% [11]. EfficientNet-B0: 76.3% [42].

Table 2: COCO object detection minival performance of our CondConv-MobileNetV1 SSD300 architecture with 8 experts per layer. Mean average precision (mAP) is reported with the COCO primary challenge metric (AP at IoU=0.50:0.05:0.95). CondConv improves mAP at all model sizes with a small relative increase in inference cost (<5%).

                        Baseline3               CondConv
                        MADDs (x10^6)  mAP      MADDs (x10^6)  mAP
MobileNetV1 (0.5x)      352            14.4     363            18.0
MobileNetV1 (0.75x)     730            18.2     755            21.0
MobileNetV1 (1.0x)      1230           20.3     1280           22.4

Table 3: Different routing architectures. Our baseline CondConv(CC)-MobileNetV1 uses a one-layer, fully-connected routing function with Sigmoid activation for each CondConv block.

Routing Fn                MADDs (x10^6)  Valid Top-1 (%)
CC-MobileNetV1 (0.25x)    55.7           62.0
Single                    55.5           56.5
Partially-shared          55.6           62.5
Hidden (small)            55.6           57.7
Hidden (medium)           55.9           62.2
Hidden (large)            57.8           54.1
Hierarchical              55.7           60.3
Softmax                   55.7           60.5

Table 4: CondConv at different layers in our CondConv(CC)-MobileNetV1 (0.25x) model. FC refers to the final classification layer. CondConv improves performance at every layer.

CondConv Begin Layer      MADDs (x10^6)  Valid Top-1 (%)
CC-MobileNetV1 (0.25x)    55.7           62.0
MobileNetV1 (0.25x)       41.2           50.0
1                         56.3           62.5
5                         56.0           62.0
7                         55.7           62.0
13                        52.5           59.5
15 (FC Only)              49.3           54.2
7 (No FC)                 47.6           60.2

CondConv with 8 experts improves object detection performance at all model sizes (Table 2). Our CondConv-MobileNetV1 (0.75x) SSD model exceeds the MobileNetV1 (1.0x) SSD baseline by 0.7 mAP at 60% of the inference cost.
Moreover, our CondConv-MobileNetV1 (1.0x) SSD model improves upon the MobileNetV1 (1.0x) SSD baseline by 2.1 mAP at similar inference cost.

4.3 Ablation studies

We perform ablation experiments to better understand model design with the CondConv block. In all experiments, we compare against the same baseline CondConv-MobileNetV1 (0.25x) model with 32 experts per CondConv layer, trained with the same setup as Section 4 and no additional Dropout or data augmentation. The baseline model achieves 61.98% ImageNet Top-1 validation accuracy with 55.7M multiply-adds. The MobileNetV1 (0.25x) architecture achieves 50.4% Top-1 accuracy with 41.2M multiply-adds.4 We choose this setup for its ease of training and the large effect of CondConv.

3 Our re-implementation of the baseline models and our CondConv models use the same hyperparameters for fair comparison. As a published reference, Howard et al. [15] report mAP of 19.3 for MobileNetV1 (1.0x).
4 Our implementation. Howard et al. [15] report a top-1 accuracy of 50.0% with different hyperparameters.

(a) Layer 12  (b) Layer 26  (c) Fully Connected (FC)
Figure 3: Mean routing weights for four classes averaged across the ImageNet validation set at three different depths in our CondConv-MobileNetV1 (0.5x) model. CondConv routing weights are more class-specific at greater depth.

Figure 4: Distribution of routing weights in the final CondConv layer of our CondConv-MobileNetV1 (0.5x) model when evaluated on all images in the ImageNet validation set. Routing weights follow a bi-modal distribution.

4.3.1 Routing function

We investigate different choices for the routing function in Table 3. The baseline model computes new routing weights for each layer. Single computes the routing weights only once at CondConv 7 (the 7th separable convolutional block), and uses the same routing weights in all subsequent layers. Partially-shared shares the routing weights between every other layer.
Both the baseline model and Partially-shared significantly outperform Single, which suggests that routing at multiple depths in the network improves quality. Partially-shared slightly outperforms the baseline, suggesting that sharing routing functions among nearby layers can improve quality.
We then experiment with more complex routing functions by introducing a hidden layer with ReLU activation after the global average pooling step. We vary the hidden layer size to be input_dim/8 for Hidden (small), input_dim for Hidden (medium), and input_dim · 8 for Hidden (large). Adding a non-linear hidden layer of appropriate size can slightly improve performance. Large hidden layer sizes are prone to over-fitting, even with the same number of experts.
Next, we experiment with Hierarchical routing functions by concatenating the routing weights of the previous layer to the output of the global average pooling layer in the routing function. This adds a dependency between the CondConv routing weights at different layers, which we find is also prone to overfitting.
Finally, we experiment with the Softmax activation function to compute routing weights. The baseline's Sigmoid significantly outperforms Softmax, which suggests that multiple experts are often useful for a single example.

4.3.2 CondConv Layer Depth

We analyze the effect of CondConv layers at different depths in the CondConv-MobileNetV1 (0.25x) model (Table 4). We use CondConv layers starting from the listed layer and in all subsequent layers. We further perform ablation studies specific to the final fully-connected classification layer. We find CondConv layers improve performance when applied at every layer in the network.
Additionally, we find that applying CondConv before layer 7 in the network has only small effects on performance.

Figure 5: Routing weights in the final CondConv layer in our CondConv-MobileNetV1 (0.5x) model for 2 classes averaged across the ImageNet validation set. Error bars indicate one standard deviation.

Figure 6: Top 10 classes with the highest mean routing weight for 4 different experts in the final CondConv layer in our CondConv-MobileNetV1 (0.5x) model, as measured across the ImageNet validation set. Expert 1 is most activated for wheeled vehicles; expert 2 is most activated for rectangular structures; expert 3 is most activated for cylindrical household objects; expert 4 is most activated for brown and black dog breeds.

For image classification with the CondConv-MobileNetV1 (0.25x) model, CondConv in the final classification layer accounts for a significant fraction of the additional inference cost. Using a normal final classification layer results in smaller performance gains, but is more efficient.

5 Analysis

In this section, we aim to gain a better understanding of the learned kernels and routing functions in our CondConv-MobileNetV1 architecture. We study our CondConv-MobileNetV1 (0.50x) architecture with 32 experts per layer trained on ImageNet with Mixup and AutoAugment, which achieves 71.6% top-1 validation accuracy. We evaluate our CondConv-MobileNetV1 (0.5x) model on the 50,000 ImageNet validation examples, and compute the routing weights at CondConv layers in the network.
We first study inter-class variation between the routing weights at different layers in the network. We visualize the average routing weight for four different classes (cliff, pug, goldfish, and plane, as suggested by Hu et al. [16] for semantic and appearance diversity) at three different depths in the network (Layer 12, Layer 26, and the final fully-connected layer).
The distribution of the routing weights is very similar across classes at early layers in the network, and becomes increasingly class-specific at later layers (Figure 3). This suggests an explanation for why replacing additional convolutional layers with CondConv layers near the input of the network does not significantly improve performance.
We next analyze the distribution of the routing weights of the final fully-connected layer in Figure 4. The routing weights follow a bi-modal distribution, with most experts receiving a routing weight close to 0 or 1. This suggests that the experts are sparsely activated, even without regularization, and further suggests the specialization of the experts.
We then study intra-class variation between the routing weights in the final CondConv layer (Figure 5). Within one class, some kernels are activated with high weight and small variance for all examples. However, even within one class, there can be large variation in the routing weights between examples.
Finally, to better understand the experts in the final CondConv layer, we visualize the top 10 classes with the highest mean routing weight for four different experts on the ImageNet validation set (Figure 6). We show the exemplar image with the highest routing weight within each class. CondConv layers learn to specialize in semantically and visually meaningful ways.

6 Conclusion

In this paper, we proposed conditionally parameterized convolutions (CondConv). CondConv challenges the assumption that convolutional kernels should be shared across all input examples. This introduces a new direction for increasing model capacity while maintaining efficient inference: increase the size and complexity of the kernel-generating function.
Since the kernel is computed only\nonce, then convolved across the input, increasing the complexity of the kernel-generating function\ncan be much more ef\ufb01cient than adding additional convolutions or expanding existing convolutions.\nCondConv also highlights an important research question in the trend towards larger datasets on\nhow to best uncover, represent, and leverage the relationship between examples to improve model\nperformance. In the future, we hope to further explore the design space and limitations of CondConv\nwith larger datasets, more complex kernel-generating functions, and architecture search to design\nbetter base architectures.\n\nReferences\n\n[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly\nlearning to align and translate. In International Conference on Learning Representations, 2015.\n[2] Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computa-\n\ntion in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.\n\n[3] Yoshua Bengio, Nicholas L\u00e9onard, and Aaron Courville. Estimating or propagating gradients\nthrough stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.\n[4] Zhourong Chen, Yang Li, Samy Bengio, and Si Si. Gaternet: Dynamic \ufb01lter selection in convo-\nlutional neural network via a dedicated global gating network. arXiv preprint arXiv:1811.11205,\n2018.\n\n[5] Kyunghyun Cho and Yoshua Bengio. Exponentially increasing the capacity-to-computation\n\nratio for conditional computation in deep learning. arXiv preprint arXiv:1406.7362, 2014.\n\n[6] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment:\nLearning augmentation policies from data. In Proceedings of the IEEE Conference on Computer\nVision and Pattern Recognition, 2019.\n\n[7] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei.\nDeformable convolutional networks. 
In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.

[8] Andrew Davis and Itamar Arel. Low-rank approximations for conditional feedforward computation in deep neural networks. arXiv preprint arXiv:1312.4461, 2013.

[9] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.

[10] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.

[11] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

[12] Sam Gross, Marc'Aurelio Ranzato, and Arthur Szlam. Hard mixtures of experts for large scale weakly supervised vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6865–6873, 2017.

[13] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. In International Conference on Learning Representations, 2017.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[15] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[16] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.

[17] Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V.
Le, and Zhifeng Chen. GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1808.07233, 2018.

[18] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems, pages 667–675, 2016.

[19] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 1–12. IEEE, 2017.

[20] Cem Keskin and Shahram Izadi. Splinenets: Continuous neural decision graphs. In Advances in Neural Information Processing Systems, 2018.

[21] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[23] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pages 396–404, 1990.

[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[25] Lanlan Liu and Jia Deng. Dynamic deep neural networks: Optimizing accuracy-efficiency trade-offs by selective execution.
In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[26] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

[27] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

[28] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing, 2015.

[29] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. arXiv preprint arXiv:1805.00932, 2018.

[30] Mason McGill and Pietro Perona. Deciding how to decide: Dynamic routing in artificial neural networks. In International Conference on Machine Learning, 2017.

[31] Ravi Teja Mullapudi, William R. Mark, Noam Shazeer, and Kayvon Fatahalian. Hydranets: Specialized dynamic architectures for efficient inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[32] Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. Contextual parameter generation for universal neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 425–435, 2018.

[33] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In Thirty-Third AAAI Conference on Artificial Intelligence, 2019.

[34] Clemens Rosenbaum, Tim Klinger, and Matthew Riemer.
Routing networks: Adaptive selection of non-linear functions for multi-task learning. In International Conference on Learning Representations, 2018.

[35] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[36] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.

[37] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017.

[38] N. Silberman and S. Guadarrama. Tensorflow-slim image classification model library, 2016.

[39] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

[40] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[41] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.

[42] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.

[43] Mingxing Tan and Quoc V Le.
Mixnet: Mixed depthwise convolutional kernels. arXiv preprint arXiv:1907.09595, 2019.

[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[45] Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E Gonzalez. Skipnet: Learning dynamic routing in convolutional networks. In European Conference on Computer Vision, pages 420–436. Springer, 2018.

[46] Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. In International Conference on Learning Representations, 2019.

[47] Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. Blockdrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[48] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.

[49] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2017.

Appendices

A ImageNet Architectures

In this section, we provide detailed descriptions of the specific CondConv architectures we used for each baseline model, and elaborate on individual results. All CondConv results are reported with 8 experts per layer. For fair comparison, all results reported use the same training hyperparameters and regularization search space as the CondConv model they are compared against.

CondConv-MobileNetV1.
We replace the convolutional layers starting from the sixth separable convolutional block, as well as the final fully-connected classification layer, of the baseline MobileNetV1 model with CondConv layers. We share routing weights between the depthwise and pointwise layers within a separable convolution block. Our CondConv-MobileNetV1 (0.5x) model with 32 experts per CondConv layer achieves 71.6% accuracy at 190M multiply-adds, comparable to the MobileNetV1 (1.0x) model at 71.7% at 571M multiply-adds.

CondConv-MobileNetV2. We replace the convolutional layers in the final 6 inverted residual blocks and the final fully-connected classification layer of the baseline MobileNetV2 architecture with CondConv layers. We share routing weights between convolutional layers in each inverted bottleneck block. Our CondConv-MobileNetV2 (1.0x) model achieves 74.6% accuracy at 329M multiply-adds. The MobileNetV2 (1.4x) architecture with static convolutions scaled by width multiplier achieves similar accuracy of 74.5% in our implementation (74.7% in [36]), but requires 585M multiply-adds.

CondConv-MnasNet-A1. We replace the convolutional layers in the final 3 block groups of the baseline MnasNet-A1 architecture with CondConv layers. We share routing weights between convolutional layers in each inverted bottleneck block within a block group. The baseline MnasNet-A1 model achieves 74.9% accuracy with 312M multiply-adds. Our CondConv-MnasNet-A1 model achieves 76.2% accuracy with 329M multiply-adds. A larger model from the same search space using static convolutional layers, MnasNet-A2, achieves 75.6% accuracy with 340M multiply-adds [41].

CondConv-ResNet-50. We replace the convolutional layers in the final 3 residual blocks and the final fully-connected classification layer of the baseline ResNet-50 architecture with CondConv layers. The baseline ResNet-50 model achieves 77.7% accuracy at 4096M multiply-adds.
Our CondConv-ResNet-50 architecture achieves 78.6% accuracy at 4213M multiply-adds. With sufficient regularization, CondConv improves the performance of even large model architectures with ordinary convolutions that are not optimized for inference time.

CondConv-EfficientNet-B0. We replace the convolutional layers in the final 3 block groups of the baseline EfficientNet-B0 architecture with CondConv layers. We share routing weights between convolutional layers in each inverted bottleneck block within a block group. The baseline EfficientNet-B0 model achieves 77.2% accuracy with 391M multiply-adds. Our CondConv-EfficientNet-B0 model achieves 78.3% accuracy with 413M multiply-adds.

To directly compare our CondConv scaling approach to the compound scaling coefficient proposed by Tan et al. [42], we additionally scale the CondConv-EfficientNet-B0 model with a depth multiplier of 1.1x, which we call CondConv-EfficientNet-B0-depth. Our CondConv-EfficientNet-B0-depth model achieves 79.5% accuracy with only 614M multiply-adds. When trained with the same hyperparameters and regularization search space, the EfficientNet-B1 model, which is scaled from the EfficientNet-B0 model using the compound coefficient, achieves 79.2% accuracy with 700M multiply-adds. In this regime, CondConv scaling outperforms static convolutional scaling with the compound coefficient.
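To make the computation behind these architectures concrete, the sketch below illustrates a single-example CondConv forward pass in plain NumPy: per-example routing weights are produced by global average pooling followed by a learned linear map and a sigmoid, the expert kernels are combined into one kernel, and only then is a single convolution applied. This is an illustrative re-implementation, not the released TensorFlow layer; the function name `condconv_forward`, the variable names, and the naive `valid`-mode convolution loop are ours.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def condconv_forward(x, expert_kernels, routing_w):
    """Toy single-example CondConv forward pass.

    x:              input feature map, shape (H, W, C_in)
    expert_kernels: per-expert kernels, shape (n_experts, k, k, C_in, C_out)
    routing_w:      routing-function weights, shape (C_in, n_experts)
    Returns the output feature map and the routing weights.
    """
    # Example-dependent routing: global average pool, linear map, sigmoid.
    pooled = x.mean(axis=(0, 1))                       # (C_in,)
    r = sigmoid(pooled @ routing_w)                    # (n_experts,)

    # Combine the experts into ONE kernel before convolving; this is the key
    # to efficient inference: one convolution regardless of n_experts.
    kernel = np.tensordot(r, expert_kernels, axes=1)   # (k, k, C_in, C_out)

    # Naive 'valid' cross-correlation, for illustration only.
    k = kernel.shape[0]
    H, W, _ = x.shape
    out = np.zeros((H - k + 1, W - k + 1, kernel.shape[-1]))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + k, j:j + k, :]             # (k, k, C_in)
            out[i, j] = np.tensordot(patch, kernel, axes=3)
    return out, r
```

Because convolution is linear in the kernel, combining the expert kernels first and convolving once gives the same result as a routing-weighted sum of per-expert convolutions, but at roughly the cost of a single convolution; adding experts grows parameter count while adding only the small routing and kernel-combination cost at inference.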