{"title": "SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers", "book": "Advances in Neural Information Processing Systems", "page_first": 4977, "page_last": 4989, "abstract": "The vast majority of processors in the world are actually microcontroller units (MCUs), which find widespread use performing simple control tasks in applications ranging from automobiles to medical devices and office equipment. The Internet of Things (IoT) promises to inject machine learning into many of these every-day objects via tiny, cheap MCUs. However, these resource-impoverished hardware platforms severely limit the complexity of machine learning models that can be deployed. For example, although convolutional neural networks (CNNs) achieve state-of-the-art results on many visual recognition tasks, CNN inference on MCUs is challenging due to severe memory limitations. To circumvent the memory challenge associated with CNNs, various alternatives have been proposed that do fit within the memory budget of an MCU, albeit at the cost of prediction accuracy. This paper challenges the idea that CNNs are not suitable for deployment on MCUs. We demonstrate that it is possible to automatically design CNNs which generalize well, while also being small enough to fit onto memory-limited MCUs. Our Sparse Architecture Search method combines neural architecture search with pruning in a single, unified approach, which learns superior models on four popular IoT datasets. The CNNs we find are more accurate and up to 7.4\u00d7 smaller than previous approaches, while meeting the strict MCU working memory constraint.", "full_text": "SpArSe: Sparse Architecture Search for CNNs on\n\nResource-Constrained Microcontrollers\n\nIgor Fedorov\n\nArm ML Research\n\nigor.fedorov@arm.com\n\nRyan P. Adams\n\nPrinceton University\nrpa@princeton.edu\n\nMatthew Mattina\nArm ML Research\n\nmatthew.mattina@arm.com\n\nPaul N. 
Whatmough\nArm ML Research\n\npaul.whatmough@arm.com\n\nAbstract\n\nThe vast majority of processors in the world are actually microcontroller units\n(MCUs), which \ufb01nd widespread use performing simple control tasks in applications\nranging from automobiles to medical devices and of\ufb01ce equipment. The Internet\nof Things (IoT) promises to inject machine learning into many of these every-day\nobjects via tiny, cheap MCUs. However, these resource-impoverished hardware\nplatforms severely limit the complexity of machine learning models that can be\ndeployed. For example, although convolutional neural networks (CNNs) achieve\nstate-of-the-art results on many visual recognition tasks, CNN inference on MCUs\nis challenging due to severe memory limitations. To circumvent the memory\nchallenge associated with CNNs, various alternatives have been proposed that do\n\ufb01t within the memory budget of an MCU, albeit at the cost of prediction accuracy.\nThis paper challenges the idea that CNNs are not suitable for deployment on\nMCUs. We demonstrate that it is possible to automatically design CNNs which\ngeneralize well, while also being small enough to \ufb01t onto memory-limited MCUs.\nOur Sparse Architecture Search method combines neural architecture search with\npruning in a single, uni\ufb01ed approach, which learns superior models on four popular\nIoT datasets. The CNNs we \ufb01nd are more accurate and up to 7.4\u00d7 smaller than\nprevious approaches, while meeting the strict MCU working memory constraint.\n\n1\n\nIntroduction\n\nThe microcontroller unit (MCU) is a truly ubiquitous computer. MCUs are self-contained single-chip\nprocessors which are small (\u223c 1cm2), cheap (\u223c $1), and power ef\ufb01cient (\u223c 1mW). Applications\nare extremely broad, but often include seemingly banal tasks such as simple control and sequencing\noperations for everyday devices like washing machines, microwave ovens, and telephones. 
The key advantage of MCUs over application specific integrated circuits is that they are programmed with software and can be readily updated to fix bugs, change functionality, or add new features. The short time to market and flexibility of software has led to the staggering popularity of MCUs. In the developed world, a typical home is likely to have around four general-purpose microprocessors; in contrast, the number of MCUs is around three dozen [46]. A typical mid-range car may have about 30 MCUs. Public market estimates suggest that around 50 billion MCU chips will ship in 2019 [1], which far eclipses other chips like graphics processing units (GPUs), whose shipments totalled roughly 100 million units in 2018 [2].

MCUs can be highly resource constrained; Table 1 compares MCUs with bigger processors. The broad proliferation of MCUs relative to desktop GPUs and CPUs stems from the fact that they are orders of magnitude cheaper (~600×) and less power hungry (~250,000×).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Table 1: Processors for ML inference: estimated characteristics to indicate the relative capabilities.

Processor                     Power    Cost    Memory   Compute          Usecase
Nvidia 1080Ti GPU             250 W    $700    11 GB    10 TFLOPs/Sec    Desktop
Intel i9-9900K CPU            95 W     $499    256 GB   500 GFLOPs/Sec   Desktop
Google Pixel 1 (Arm CPU)      ~5 W     --      4 GB     50 GOPs/Sec      Mobile
Raspberry Pi (Arm CPU)        1.5 W    --      1 GB     50 GOPs/Sec      Hobbyist
Micro Bit (Arm MCU)           ~1 mW    $1.75   16 KB    16 MOPs/Sec      IoT
Arduino Uno (Microchip MCU)   ~1 mW    $1.14   2 KB     4 MOPs/Sec       IoT
In recent years, MCUs\nhave been used to inject intelligence and connectivity into everything from industrial monitoring\nsensors to consumer devices, a trend commonly referred to as the Internet of Things (IoT) [10, 22, 43].\nDeploying machine learning (ML) models on MCUs is a critical part of many IoT applications,\nenabling local autonomous intelligence rather than relying on expensive and insecure communication\nwith the cloud [9]. In the context of supervised visual tasks, state-of-the-art (SOTA) ML models\ntypically take the form of convolutional neural networks (CNNs) [35]. While tools for deploying\nCNNs on MCUs have started to appear [7, 6, 4], the CNNs themselves remain far too large for\nthe memory-limited MCUs commonly used in IoT devices. In the remainder of this work, we use\nMCU to refer speci\ufb01cally to IoT-sized MCUs, like the Micro Bit. In contrast to this work, the\nmajority of preceding research on compute/memory ef\ufb01cient CNN inference has targeted CPUs and\nGPUs [26, 11, 61, 62, 45, 54, 49].\nTo illustrate the challenge of deploying CNNs on MCUs, consider the seemingly simple task of\ndeploying the well-known LeNet CNN on an Arduino Uno to perform MNIST character recogni-\ntion [38]. Assuming the weights can be quantized to 8-bit integers, 420 KB of memory is required to\nstore the model parameters, which exceeds the Uno\u2019s 32 KB of read-only (\ufb02ash) memory. An addi-\ntional 391 (resp. 12) KB of random access memory (RAM) is then required to store the intermediate\nfeature maps produced by LeNet under memory model (5) (resp. (6)), which far exceeds the Uno\u2019s\n2 KB RAM. The dispiriting implication is that it is not possible to perform LeNet inference on the\nUno. 
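The arithmetic behind this example is easy to reproduce. The sketch below is our own illustration (not the paper's code): it assumes the standard Caffe-style LeNet layer shapes for 28×28 MNIST inputs and 8-bit weights and activations. It recovers the ~420 KB parameter figure and the ~391 KB working memory under model (5); the estimate under model (6) lands slightly above the quoted 12 KB because it counts each pooling layer's input and output separately rather than assuming in-place or fused pooling.

```python
# Estimate LeNet memory needs on an MCU, assuming 8-bit weights/activations.
# Each entry: (name, input elements, parameter count, output elements).
# Shapes assume the Caffe-style LeNet: 5x5 convs with 20 and 50 maps,
# 2x2 pools, FC layers of 500 and 10 units, 28x28 single-channel input.
LENET = [
    ("conv1", 1 * 28 * 28, 5 * 5 * 1 * 20 + 20, 20 * 24 * 24),
    ("pool1", 20 * 24 * 24, 0, 20 * 12 * 12),
    ("conv2", 20 * 12 * 12, 5 * 5 * 20 * 50 + 50, 50 * 8 * 8),
    ("pool2", 50 * 8 * 8, 0, 50 * 4 * 4),
    ("fc1", 50 * 4 * 4, 800 * 500 + 500, 500),
    ("fc2", 500, 500 * 10 + 10, 10),
]

model_size = sum(w for _, _, w, _ in LENET)        # bytes at 8 bits per weight
wm_weights = max(x + w for _, x, w, _ in LENET)    # model (5): inputs + weights
wm_act = max(x + y for _, x, _, y in LENET)        # model (6): inputs + outputs

print(f"model size:    {model_size / 1024:.0f} KB")  # ~421 KB, > 32 KB flash
print(f"WM, model (5): {wm_weights / 1024:.0f} KB")  # ~392 KB, >> 2 KB RAM
print(f"WM, model (6): {wm_act / 1024:.0f} KB")      # ~14 KB, >> 2 KB RAM
```

Under either working-memory model, the bottleneck layer alone (fc1 under (5), the first conv/pool pair under (6)) already exceeds the Uno's 2 KB of RAM.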
This has led many to conclude that CNNs should be abandoned on constrained MCUs [36, 24].\nNevertheless, the sheer popularity of MCUs coupled with the dearth of techniques for leveraging\nCNNs on MCUs motivates our work, where we take a step towards bridging this gap.\nDeployment of CNNs on MCUs is challenging along multiple dimensions, including power con-\nsumption and latency, but as the example above illustrates, it is the hard memory constraints that\nmost directly prohibit the use of these networks. MCUs typically include two types of memory. The\n\ufb01rst is static RAM, which is relatively fast, but volatile and small in capacity. RAM is used to store\nintermediate data. The second is \ufb02ash memory, which is non-volatile and larger than RAM; it is\ntypically used to store the program binary and any constant data. Flash memory has very limited\nwrite endurance, and is therefore treated as read-only memory (ROM). The two MCU memory types\nintroduce the following constraints on CNN model architecture:\n\nC1 : The maximum size of intermediate feature maps cannot exceed the RAM capacity.\nC2 : The model parameters must not exceed the ROM (\ufb02ash memory) capacity.\n\nTo the best of our knowledge, there are currently no CNN architectures or training procedures\nthat produce CNNs satisfying these memory constraints for MCUs with less than 2 KB RAM and\ndeployed using standard toolchains [36, 24]. This is true even ignoring the memory required for the\nruntime (in RAM) and the program itself (in ROM). The severe memory constraints for inference on\nMCUs have pushed research away from CNNs and toward simpler classi\ufb01ers based on decision trees\nand nearest neighbors [36, 24]. We demonstrate for the \ufb01rst time that it is possible to design CNNs\nthat are at least as accurate as Kumar et al. [36], Gupta et al. [24] and at the same time satisfy C1-C2,\neven for devices with just 2 KB of RAM. 
We achieve this result by designing CNNs that are heavily\nspecialized for deployment on MCUs using a method we call Sparse Architecture Search (SpArSe).\nThe key insight is that combining neural architecture search (NAS) and network pruning allows us to\nbalance generalization performance against tight memory constraints C1-C2. Critically, we enable\nSpArSe to search over pruning strategies in conjunction with conventional hyperparameters around\nmorphology and training. Pruning enables SpArSe to quickly evaluate many sub-networks of a given\n\n2\n\n\f(a) Acc = 73.84%, MS = 1.31 KB, WM = 1.28 KB\n(b) Acc=73.58%, MS = 0.61 KB, WM = 14.3 KB\nFigure 1: Model architectures found with best test accuracy on CIFAR10-binary, while optimizing for\n(a) 2KB for both MODELSIZE (MS) and WORKINGMEMORY (WM), and (b) minimum MS. Each\nnode in the graph is annotated with MS and WM using the model in (5), and the values in square\nbrackets show the quantities before and after pruning, respectively. Optimizing for WM yields more\nthan 11.2x WM reduction. Note that pruning has a considerable impact on the CNN.\n\nnetwork, thereby expanding the scope of the overall search. While previous NAS approaches have\nautomated the discovery of performant models with reduced parameterizations, we are the \ufb01rst to\nsimultaneously consider performance, parameter memory constraints, and inference-time working\nmemory constraints.\nWe use SpArSe to uncover SOTA models on four datasets, in terms of accuracy and model size,\noutperforming both pruning of popular architectures and MCU-speci\ufb01c models [36, 24]. The multi-\nobjective approach of SpArSe leads to new insights in the design of memory-constrained architectures.\nFig. 1a shows an example of a discovered architecture which has high accuracy, small model size,\nand \ufb01ts within 2KB RAM. 
By contrast, we find that optimizing networks solely to minimize the number of parameters (as is typically done in the NAS literature, e.g., [14]) is not sufficient to identify networks that minimize RAM usage. Fig. 1b illustrates one such example.

1.1 Related work

CNNs designed for resource constrained inference have been widely published in recent years [49, 30, 63], motivated by the goal of enabling inference on mobile phone platforms [60, 29]. Advances include depth-wise separable layers [50], deployment-centric pruning [62, 45], quantization [58, 21], and matrix decomposition techniques [55]. More recently, NAS has been leveraged to achieve even more efficient networks on mobile phone platforms [11, 52]. In a complementary line of work, Gural and Murmann [25] propose memory-optimal direct convolutions (MODC). Unlike MODC, SpArSe yields CNNs that can be deployed with off-the-shelf tools and is shown to work on an array of IoT datasets.

Although mobile phones are more constrained than general-purpose CPUs and GPUs, they still have many orders of magnitude more memory capacity and compute performance than MCUs (Table 1). In contrast, little attention has been paid to running CNNs on MCUs, which represent the most numerous compute platform in the world. Kumar et al. [36] propose Bonsai, a pruned shallow decision tree with non-axis aligned decision boundaries. Gupta et al. [24] propose a compressed k-nearest neighbors (kNN) approach (ProtoNN), where model size is reduced by projecting data into a low-dimensional space, maintaining a subset of prototypes to classify against, and pruning parameters. We build upon Kumar et al. [36], Gupta et al. [24] by targeting the same MCUs, but using NAS to find CNNs which are at least as small and more accurate.

Algorithms for identifying performant CNN architectures have received significant attention recently [64, 14, 11, 40, 23, 15, 39]. 
The approaches closest to SpArSe are Stamoulis et al. [52] and Elsken et al. [14]. In Stamoulis et al. [52], the authors optimize the kernel size and number of feature maps of the MBConv layers in a MobileNetV2 backbone [49] by expressing each of the layer choices as a pruned version of a superkernel. In some ways, Stamoulis et al. [52] is less a NAS algorithm and more of a structured pruning approach, given that the only allowed architectures are reductions of MobileNetV2. SpArSe does not constrain architectures to be pruned versions of a baseline, which can be too restrictive of an assumption for ultra small CNNs. SpArSe is not based on an existing backbone, giving it greater flexibility to extend to different problems. Like Elsken et al. [14], SpArSe uses a form of weight sharing called network morphism [59] to search over architectures without training each one from scratch. SpArSe extends the concept of morphisms to expedite training and pruning CNNs. Because Elsken et al. [14] seek compact architectures by using the number of network edges as one of the objectives in the search, potential gains from weight sparsity are ignored, which can be significant (Section 3 [18, 19]). Moreover, since SpArSe optimizes both the architecture and weight sparsity, Elsken et al. 
[14] can be seen as a special case of SpArSe.

2 SpArSe framework: CNN design as multi-objective optimization

Our approach to designing a small but performant CNN is to specify a multi-objective optimization problem that balances the competing criteria. We denote a point in the design space as Ω = {α, ϑ, ω, θ}, in which: α = {V, E} is a directed acyclic graph describing the network connectivity, where V and E denote the set of graph vertices and edges; ω denotes the network weights; ϑ represents the operations performed at each edge, i.e. convolution, pooling, etc.; and θ are hyperparameters governing the training process. The vertices v_i, v_j ∈ V represent network neurons, which are connected to each other if (v_i, v_j) ∈ E through an operation ϑ_ij parameterized by ω. The competing objectives in the present work of targeting constrained MCUs are:

f1(Ω) = 1 − VALIDATIONACCURACY(Ω)    (1)
f2(Ω) = MODELSIZE(ω)    (2)
f3(Ω) = max_{l ∈ 1,...,L} WORKINGMEMORY_l(Ω)    (3)

where VALIDATIONACCURACY(Ω) is the accuracy of the trained model on validation data; MODELSIZE(ω), or MS, is the number of bits needed to store the model parameters ω; and WORKINGMEMORY_l(Ω) is the working memory in bits needed to compute the output of layer l, with the maximum taken over the L layers to account for in-place operations. We refer to (3) as the working memory (WM) for Ω. There is no single Ω which minimizes all of (1)-(3) simultaneously. For instance, (1) prefers large networks with many non-zero weights whereas (2) favors networks with no weights. Likewise, (3) prefers configurations with small intermediate representations, whereas (2) has no preference as to the size of the feature maps. 
Therefore, in the context of CNN design, it is more appropriate to seek the set of Pareto optimal configurations, where Ω* is Pareto optimal if f_k(Ω*) ≤ f_k(Ω) ∀k, Ω and ∃j : f_j(Ω*) < f_j(Ω) ∀Ω ≠ Ω*. The concept of Pareto optimality is appealing for multi-objective optimization, as it allows the ready identification of optimal designs subject to arbitrary constraints in a subset of the objectives.

2.1 Search space

Our search space is designed to encompass CNNs of varying depth, width, and connectivity. Each graph consists of optional input downsampling followed by a variable number of blocks, where each block contains a variable number of convolutional layers, each parametrized by its own kernel size, number of output channels, convolution type, and padding. We consider regular, depthwise separable, and downsampled convolutions, where we define a downsampled convolution to be a 1 × 1 convolution that downsamples the input in depth, followed by a regular convolution. Each convolution is followed by optional batch-normalization, ReLU, and spatial downsampling through pooling of a variable window size. Each set of two consecutive convolutions has an optional residual connection. Inspired by the decision tree approach in Kumar et al. [36], we let the output layer use features at multiple scales by optionally routing the output of each block to the output layer through a fully connected (FC) layer (see Fig. 1a). All of the FC layer outputs are merged before going through an FC layer that generates the output. The search space also includes parameters governing CNN training and pruning. The Appendix contains a complete description of the search space.

2.2 Quantifying memory requirements

The VALIDATIONACCURACY(Ω) metric is readily available for models via a held-out validation set or by cross-validation. 
However, the memory constraints of interest in this work demand more careful specification. For simplicity, we estimate the model size as

MODELSIZE(ω) ≈ ‖ω‖_0.    (4)

For working memory, we consider two different models:

WORKINGMEMORY1_l(Ω) ≈ ‖x_l‖_0 + ‖ω_l‖_0    (5)
WORKINGMEMORY2_l(Ω) ≈ ‖x_l‖_0 + ‖y_l‖_0    (6)

where x_l, y_l, and ω_l are the input, output, and weights for layer l, respectively. The assumption in (5) is that the inputs to layer l and the weights need to reside in RAM to compute the output, which is consistent with deployment tools like [7] which allow layer outputs to be written to an SD card. The model in (6) is also a standard RAM usage model, adopted in [8], for example. For merge nodes that sum two vector inputs x1_l and x2_l, we set x_l = [(x1_l)^T (x2_l)^T]^T in (5)-(6). The reliance of (4)-(6) on the ℓ0 norm is motivated by our use of pruning to minimize the number of non-zeros in both ω and {x_l}_{l=1}^{L}, which is also the compression mechanism used in related work [36, 24]. Note that (4)-(6) are reductive to varying degrees. However, since SpArSe is a black-box optimizer, the measures in (4)-(6) can be readily updated as MCU deployment toolchains mature.

2.3 Neural network pruning

Pruning [37] is essential to MCU deployment using SpArSe, as it heavily reduces the model size and working memory without significantly impacting classification accuracy. Pruning is a procedure for zeroing out network parameters ω and can be seen as a way to generate a new set of parameters ω̄ that have lower ‖ω̄‖_0. 
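Because (4)-(6) are all ℓ0 counts, they are cheap to evaluate from a concrete set of (possibly pruned) tensors. The sketch below is illustrative only: the layer shapes are invented and the functions are not the paper's implementation, just a direct transcription of the three measures.

```python
import numpy as np

def model_size(weights):
    """MODELSIZE (4): total number of non-zero weights across all layers."""
    return sum(int(np.count_nonzero(w)) for w in weights)

def working_memory(layers, use_weights):
    """WORKINGMEMORY under model (5) (inputs + weights, use_weights=True)
    or model (6) (inputs + outputs). `layers` is a list of (x, w, y)
    arrays per layer; the maximum is taken over layers, as in (3)."""
    return max(
        int(np.count_nonzero(x)) + int(np.count_nonzero(w if use_weights else y))
        for x, w, y in layers
    )

rng = np.random.default_rng(0)
# Toy two-layer network with roughly half of the weights pruned to zero.
w1 = rng.normal(size=(64, 32)) * (rng.random((64, 32)) < 0.5)
w2 = rng.normal(size=(32, 10)) * (rng.random((32, 10)) < 0.5)
x, h, y = np.ones(64), np.ones(32), np.ones(10)
layers = [(x, w1, h), (h, w2, y)]

print(model_size([w1, w2]))          # non-zero parameter count, (4)
print(working_memory(layers, True))  # model (5)
print(working_memory(layers, False)) # model (6)
```

This also makes the text's observation concrete: structured pruning, which zeros whole rows/columns and hence shrinks x_l and y_l, is the only way to reduce (6), whereas unstructured pruning already reduces (4) and (5).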
We consider both unstructured and structured, or channel [27], pruning, where the difference is that the latter prunes away entire groups of weights corresponding to output feature maps for convolution layers and input neurons for FC layers. Both forms of pruning reduce ‖ω‖_0 and, consequently, (4)-(5). Structured pruning is critical for reducing (5)-(6) because it provides a mechanism for reducing the size of layer inputs. We use Sparse Variational Dropout (SpVD) [44] and Bayesian Compression (BC) [42] to realize unstructured and structured pruning, respectively. Both approaches assume a sparsity promoting prior on the weights and approximate the weight posterior by a distribution parameterized by φ. See the Appendix for a description of SpVD and BC. Notably, φ contains all of the information about the network weight values as well as which weights to prune.

2.4 Multi-objective Bayesian optimization

SpArSe consists of three stages, where each stage m samples T_m configurations. At iteration n, a new configuration Ω_n is generated by the multi-objective Bayesian optimizer (MOBO) with probability ρ_m and uniformly at random with probability 1 − ρ_m. We adopt the combination of model-based and entirely random sampling from [17] to increase search space coverage. The optimizer considers candidates which are morphs of previous configurations and returns both the new and reference configurations (Section 2.5). The parameters of the new architecture are then inherited from the reference before being retrained and pruned.

SpArSe uses a MOBO based on the idea of random scalarizations [47]. The MOBO approach is appealing as it builds flexible nonparametric models of the unknown objectives and enables reasoning about uncertainty in the search for the Pareto frontier. A scalarized objective is given by g(Ω) = max_{k ∈ 1,...,K} λ_k f_k(Ω), where λ_k is drawn randomly. 
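A minimal sketch of this scalarization step follows. It assumes the objective values have been rescaled to [0, 1] and draws λ from a Dirichlet prior; the paper only states that λ_k is drawn randomly, so both the prior and the function name are our own illustration.

```python
import numpy as np

def random_scalarization_pick(candidates, rng):
    """Pick a candidate by the randomly-scalarized objective
    g(Omega) = max_k lambda_k * f_k(Omega), with lambda redrawn
    each iteration.

    `candidates` is an (n, K) array of objective vectors
    (error, model size, working memory), each rescaled to [0, 1].
    Restricting the support of lambda steers the search toward a
    preferred region of the Pareto frontier."""
    n, K = candidates.shape
    lam = rng.dirichlet(np.ones(K))        # one random preference vector
    g = np.max(lam * candidates, axis=1)   # scalarized objective per candidate
    return int(np.argmin(g))

rng = np.random.default_rng(1)
# Toy objective vectors: rows trade accuracy off against memory.
objs = np.array([[0.05, 0.9, 0.8],   # accurate but large
                 [0.30, 0.1, 0.1],   # small but inaccurate
                 [0.10, 0.3, 0.2]])  # balanced
print(random_scalarization_pick(objs, rng))
```

A useful property of this weighted-max (Tchebyshev-style) scalarization is that, for strictly positive λ, a strictly dominated candidate can never minimize g, so repeated random draws only ever return (weakly) Pareto optimal points.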
Choosing the domain of the prior on λ_k allows the user to specify preferences about the region of the Pareto frontier to explore. For example, IoT practitioners may care about models with less than 1000 parameters. Since the functional form of f_k(Ω) is unknown in practice, it is modeled by a Gaussian process [48] with a kernel κ(·,·) that supports the types of variables included in Ω, i.e., real-valued, discrete, categorical, and hierarchically related variables [53, 20]. A new Ω_n is sampled by minimizing g(·) through Thompson sampling. This MOBO yields better coverage of the Pareto frontier than the deterministic scalarization methods used in [11, 52].

Figure 2: SpArSe results from minimization of (1 − VALIDATIONACCURACY(Ω)), MODELSIZE(ω).

2.5 Network morphism

Evaluating each configuration Ω_n from a random initialization is slow, as evidenced by early NAS works which required thousands of GPU days [64, 65]. Search time can be reduced by constraining each proposal to be a morph of a reference Ω_r ∈ {Ω_j}_{j=0}^{n−1} [14]. Loosely speaking, we say that Ω_n is a morph of Ω_r if most of the elements in Ω_n are identical to those in Ω_r. The advantage of using morphism to generate Ω_n is that most of φ_n can be inherited from φ_r, where φ_r denotes the weight posterior parameters for configuration Ω_r. Initializing φ_n in this way means that Ω_n inherits knowledge about the value and pruning mask for most of its weights. Compared to running SpVD/BC from scratch, morphisms enable pruning proposals using 2-8× fewer epochs, depending on the dataset. 
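To illustrate why a morphed child need not be trained from scratch, the sketch below implements one classic function-preserving morph (a Net2Net-style width increase for an FC layer). SpArSe's actual morph set and its inheritance of the posterior parameters φ are described in the Appendix; this is only a simplified analogue.

```python
import numpy as np

def widen_fc(w_in, w_out, new_units, rng):
    """Function-preserving width morph of a hidden FC layer (Net2Net-style).

    w_in:  (d, h) weights into the hidden layer
    w_out: (h, c) weights out of the hidden layer
    New units are copies of randomly chosen existing units; the outgoing
    weights of each replicated unit are split evenly among its copies so
    the widened network computes the same function (exact for linear and
    ReLU activations)."""
    d, h = w_in.shape
    idx = rng.integers(0, h, size=new_units)           # units to replicate
    w_in2 = np.concatenate([w_in, w_in[:, idx]], axis=1)
    w_out2 = np.concatenate([w_out, w_out[idx, :]], axis=0)
    counts = 1 + np.bincount(idx, minlength=h)         # copies per original unit
    scale = np.concatenate([1.0 / counts, 1.0 / counts[idx]])
    return w_in2, w_out2 * scale[:, None]

rng = np.random.default_rng(0)
w_in = rng.normal(size=(8, 4))
w_out = rng.normal(size=(4, 3))
w_in2, w_out2 = widen_fc(w_in, w_out, 2, rng)

x = rng.normal(size=8)
before = np.maximum(x @ w_in, 0) @ w_out
after = np.maximum(x @ w_in2, 0) @ w_out2
print(np.allclose(before, after))  # child matches the parent exactly
```

Because the child starts from a point that reproduces the parent's function, only a short retraining/pruning phase is needed, which is what enables the 2-8× reduction in epochs quoted above.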
Further details on morphism are given in the Appendix, including allowed morphs. Because our search space includes such a diversity of parameters, including architectural parameters, pruning hyperparameters, etc., we find it helpful to perform the search in stages, where each successive stage increasingly limits the set of possible proposals. This coarse-to-fine search enables exploring decisions at increasing granularity, to wit: Stage 1) a candidate configuration can be generated by applying modifications to any of {Ω_r}_{r=1}^{n−1}; Stage 2) the allowable morphs are restricted to the pruning parameters; Stage 3) the reference configurations are restricted to the Pareto optimal points.

3 Results

We report results on a variety of datasets: MNIST (55e3, 5e3, 10e3) [38], CIFAR10 (45e3, 5e3, 10e3) [34], CUReT (3704, 500, 1408) [57], and Chars4k (3897, 500, 1886) [16], corresponding to classification problems with 10, 10, 61, and 62 classes, respectively, with the training/validation/test set sizes provided in parentheses. To match the setup in [36], we also report on binary versions of these datasets, meaning that the classes are split into two groups and re-labeled. The only pre-processing we perform is mean subtraction and division by the standard deviation. 
Experiments were run on four NVIDIA RTX 2080 GPUs. We compare against previous SOTA works: Bonsai [36], ProtoNN [24], Gradient Boosted Decision Tree Ensemble Pruning [12], kNN, radial basis function support vector machine (SVM), and MODC [25]. We do not compare against previous NAS works because they have not addressed the memory-constrained classification problem addressed here.

[Figure 2 panels: accuracy versus number of parameters for SpArSe stages 1-3, GBDT, MODC, RBF-SVM, Bonsai, ProtoNN, LeNet+SpVD, and kNN on MNIST, CIFAR10-binary, CUReT, and Chars4k; axis and legend text omitted.]

Table 2: Dominating configurations for parameter minimization experiment. SpArSe models are listed on top and the competing method on bottom. SpArSe finds CNNs that are more accurate and have fewer parameters than competing methods. The amount of time spent obtaining each dominating configuration is reported in GPU days (GPUD).

                      MNIST                  CIFAR10-binary         CUReT                  Chars4k
                      Acc    ‖ω‖0     GPUD   Acc    ‖ω‖0     GPUD   Acc    ‖ω‖0    GPUD   Acc    ‖ω‖0    GPUD
SpArSe                97.24  510      11     73.08  487      1      96.45  8.5e3   1      67.82  1.7e3   1
Bonsai                97.01  2.15e4          73.02  512             95.23  2.9e4          58.59  2.6e4
SpArSe                --     --       --     76.66  1.4e3    9      --     --      --     --     --      --
Bonsai (16 kB)                               76.64  4.1e3
SpArSe                96.84  476      11     76.56  1.4e3    10     96.45  8.5e3   1      --     --      --
ProtoNN               95.88  1.6e4           76.35  4.1e3           94.44  1.6e4
SpArSe                98.78  804      11     77.90  1.6e3    8      96.45  8.5e3   1      67.82  1.7e3   1
GBDT                  97.90  1.5e6           77.19  4e5             90.81  6.1e5          43.34  2.5e6
SpArSe                96.84  476      11     76.34  1.4e3    10     96.45  8.5e3   2      67.82  1.7e3   1
kNN                   94.34  4.71e7          73.70  2e7             89.81  2.6e6          39.32  1.7e6
SpArSe                97.42  569      10     81.77  3.2e3    3      97.58  2.2e4   2      67.82  1.7e3   1
RBF-SVM               97.30  1e7             81.68  1.6e7           97.43  2.3e6          48.04  2e6
SpArSe                99.16  1e3      8      75.35  1.4e3    10     --     --      --     --     --      --
LeNet + SpVD          99.10  1.8e3           75.09  1.6e5
SpArSe                99.17  1.45e3   1      --     --       --     --     --      --     --     --      --
MODC                  99.15  3e3

3.1 Models optimized for number of parameters

First, we address C2 by showing that SpArSe finds CNNs with higher accuracy and fewer parameters than previously published methods. 
We use unstructured pruning and optimize {f_k(Ω)}_{k=1}^{2}. Fig. 2 shows the Pareto curves for SpArSe and confirms that it finds smaller and more accurate models on all datasets. For each competing method, we also report the SpArSe-obtained configuration which attains the same or higher test accuracy and minimum number of parameters, which we term the dominating configuration. Results are shown in Table 2. To confirm that SpArSe learns non-trivial solutions, we compare with applying SpVD pruning to LeNet in Fig. 2 and Table 2.

3.2 Models optimized for total memory footprint

Next, we demonstrate that SpArSe resolves C1-C2 by finding CNNs that consume less device memory than Bonsai [36]. We use structured pruning and optimize {f_k(Ω)}_{k=1}^{3}. We quantize weights and activations to one byte to yield realistic memory calculations and for fair comparison with Bonsai [5]. Table 3 compares SpArSe to Bonsai in terms of accuracy, MS, and WM under the model in (5). For all datasets and metrics, SpArSe yields CNNs which outperform Bonsai. For MNIST, Bonsai reports performance on a binarized dataset, whereas we use the original ten-class problem, i.e., we solve a significantly more complex problem with fewer resources. Table 4 reports results for WM model (6), showing that SpArSe outperforms Bonsai across all metrics and datasets, with the exception that Bonsai yields a model with smaller MS for CIFAR10-binary.

3.3 What SpArSe reveals about pruning

Pruning can be considered a form of NAS, where ω̄ represents a sub-network of {α, ϑ, ω} given by {{V, E_p}, ϑ, ω}, and E_p ⊆ E contains only the edges for which ω̄ is non-zero [18]. The question then becomes, should one look for E_p directly or begin with a large edge-set E and prune it? There is conflicting evidence whether the same validation accuracy can be achieved by both approaches [18, 19, 41]. 
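The relationship between the edge set E and the surviving non-zeros ω̄ can be made concrete with a toy pruner. SpArSe actually uses SpVD/BC; the global magnitude-based rule below is a deliberately simplified stand-in, just to show how pruning a large E induces a sub-network edge set E_p.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights globally across layers.

    A simplified stand-in for the Bayesian pruning (SpVD/BC) used by
    SpArSe: the surviving non-zeros define the retained edge set E_p."""
    flat = np.concatenate([np.abs(w).ravel() for w in weights])
    k = int(sparsity * flat.size)
    thresh = np.partition(flat, k)[k] if k > 0 else 0.0
    return [np.where(np.abs(w) >= thresh, w, 0.0) for w in weights]

rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 8)), rng.normal(size=(8, 4))]
pruned = prune_by_magnitude(weights, sparsity=0.9)

edges = sum(w.size for w in weights)                     # |E|
nonzero = sum(int(np.count_nonzero(w)) for w in pruned)  # ||w_bar||_0 = |E_p|
print(edges, nonzero)  # → 160 16
```

In this framing, searching for E_p directly fixes the sub-network up front, whereas pruning explores many sub-networks of one large graph; the scatter plots in Fig. 3b-3c show why the two are not interchangeable for small CNNs.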
Importantly, previous NAS approaches have focused on searching for E_p directly by using |E| as one of the optimization objectives [14]. On the other hand, SpArSe is able to explore both strategies and learn the optimal interaction between network graph α, operations ϑ, and pruning. Fig.

Table 3: Comparison of Bonsai with SpArSe for WM model (5). The first row shows the highest accuracy model for WM ≤ 2KB and the second row shows the highest accuracy model for WM, MS ≤ 2KB. For MNIST, SpArSe is evaluated on the full ten-class dataset whereas Bonsai reports on a reduced two-class problem. SpArSe finds models with smaller MS, less WM, and higher accuracy in all cases. WM, MS reported in KB. Best performance highlighted in bold.

                        MNIST                      CIFAR10-binary             CUReT-binary               Chars4K-binary             USPS-binary
                        Acc     WM    MS    GPUD   Acc     WM    MS    GPUD   Acc     WM    MS    GPUD   Acc     WM    MS    GPUD   Acc     WM    MS    GPUD
SpArSe (WM ≤ 2KB)       98.64   1.96  2.77  1      73.84   1.28  0.78  5      80.68   1.66  2.34  1      77.78   0.72  0.46  1      96.76   1.06  1.60  1
SpArSe (WM, MS ≤ 2KB)   96.49   1.33  1.44  1      73.84   1.28  0.78  5      79.97   1.43  1.69  1      77.78   0.72  0.46  1      96.76   1.06  1.60  1
Bonsai                  94.38*  <2    1.96  --     73.02   <2    1.98  --     --      --    --    --     74.28   <2    2     --     94.42   <2    2     --

Table 4: SpArSe versus Bonsai for WM model (6). 
See Table 3 for details.

| Dataset        | Model  | Acc    | WM   | MS    | GPUD |
|----------------|--------|--------|------|-------|------|
| MNIST          | SpArSe | 97.03  | 1.38 | 15    | 1    |
| MNIST          | SpArSe | 95.76  | 0.62 | 1.76  | 2    |
| MNIST          | Bonsai | 94.38* | < 2  | 1.96  | –    |
| CIFAR10-binary | SpArSe | 73.66  | 1.13 | 3.95  | 25   |
| CIFAR10-binary | SpArSe | 71.76  | 1.40 | 1.88  | 27   |
| CIFAR10-binary | Bonsai | 73.02  | < 2  | 1.98  | –    |
| CUReT-binary   | SpArSe | 73.22  | 1.9  | 0.14  | 2    |
| CUReT-binary   | SpArSe | 73.22  | 1.9  | 0.14  | 2    |
| CUReT-binary   | Bonsai | –      | –    | –     | –    |
| Chars4K-binary | SpArSe | 76.83  | 0.39 | 20.12 | 1    |
| Chars4K-binary | SpArSe | 74.87  | 1.64 | 0.16  | 3    |
| Chars4K-binary | Bonsai | 74.71  | < 2  | 2     | –    |
| USPS-binary    | SpArSe | 97.56  | 1.81 | 31.79 | 1    |
| USPS-binary    | SpArSe | 96.21  | 0.98 | 1.48  | 1    |
| USPS-binary    | Bonsai | 94.42  | < 2  | 2     | –    |

Fig. 3a compares SpArSe to SpArSe without pruning on MNIST. The results show that including pruning as part of the optimization yields roughly an 80× reduction in the number of parameters, indicating that the formulation of SpArSe is better suited to designing tiny CNNs than that of [14]. To gain more insight, we show scatter plots of |E| versus ‖ω̄‖₀ for the best-performing configurations on two datasets in Fig. 3b-3c, revealing two important trends (see the Appendix for results on the Chars4K and CUReT datasets). First, ‖ω̄‖₀ tends to increase with increasing |E| for |E| greater than some threshold ζ. This suggests that optimizing |E| can be a proxy for optimizing ‖ω̄‖₀ when targeting large networks. At the same time, ‖ω̄‖₀ tends to decrease with increasing |E| for |E| < ζ, which has implications for both NAS and pruning in the context of small CNNs. Fig. 3b-3c suggest that |E| is not always indicative of weight sparsity, so that minimizing |E| would lead to ignoring graphs with more edges but the same number of non-zero weights.
Since CNNs with more edges contain more subgraphs, it is possible that one of these subgraphs has better accuracy and the same number of non-zero weights as the subgraphs of a graph with fewer edges. The key is that pruning provides a mechanism for uncovering such high-performing subgraphs [18].

3.4 Ablation study

Table 5 presents an ablation experiment on SpArSe with MNIST in which we replaced the multi-objective optimizer with a product scalarizer [11, 28] and excluded pruning from the search [13]. In both cases, the algorithm was incapable of finding architectures that are both accurate and meet strict MCU memory requirements. These results support the design choices made in SpArSe in the context of memory-constrained MCUs. Table 5 also shows that searching without morphisms yields higher accuracy while meeting the same constraints, albeit at the cost of a 50% longer search.

Figure 3: Fig. 3a: Pareto frontier of SpArSe with and without pruning, where both experiments sample the same number (325) of configurations. Fig. 3b-3c: scatter plots of |E| versus ‖ω̄‖₀ for the best-performing configurations from the parameter minimization experiment. Fig. 3b: MNIST networks with > 95% accuracy. Fig. 3c: CIFAR10-binary networks with > 70% accuracy.

Table 5: Ablation on MNIST using WM model (6), searching for models with WM, MS ≤ 2KB on a 250-configuration budget.
SpArSe w/o pruning did not yield a model that satisfies the constraints.

|      | SpArSe | SpArSe w/o morphism | SpArSe w/ product scalarization | SpArSe w/o pruning |
|------|--------|---------------------|---------------------------------|--------------------|
| Acc  | 95.76  | 97.46               | 11.35                           | –                  |
| WM   | 0.62   | 0.68                | 0.01                            | –                  |
| MS   | 1.76   | 1.31                | 0.05                            | –                  |
| GPUD | 2      | 3                   | 2                               | –                  |

Table 6: Measurements of SpArSe models on the Micro Bit (µBit) and STM32F413 (STM) MCUs, compared with Bonsai on the Arduino Uno (Bonsai latency/energy columns report Arduino Uno numbers). Latency in ms, energy in mJ per inference; WM and MS in KB.

| Dataset        | Model  | Acc    | WM   | MS    | Lat. µBit | mJ/inf µBit | Lat. STM | mJ/inf STM |
|----------------|--------|--------|------|-------|-----------|-------------|----------|------------|
| MNIST          | SpArSe | 96.97  | 1.32 | 15.86 | –         | –           | 285.82   | 203.79     |
| MNIST          | SpArSe | 95.76  | 0.71 | 2.35  | 115.40    | 12.17       | 27.06    | 19.29      |
| MNIST          | Bonsai | 94.38* | < 2  | 1.96  | 8.9       | 2.18        | 8.9      | 2.18       |
| CIFAR10-binary | SpArSe | 73.4   | 2.4  | 9.94  | –         | –           | 2529.84  | 1803.78    |
| CIFAR10-binary | SpArSe | 70.48  | 2.12 | 2.74  | –         | –           | 498.57   | 355.48     |
| CIFAR10-binary | Bonsai | 73.02  | < 2  | 1.98  | 8.16      | 2.01        | 8.16     | 2.01       |
| CUReT-binary   | SpArSe | 73.22  | 2.06 | 0.56  | 671.72    | 70.87       | 103.67   | 73.92      |
| CUReT-binary   | SpArSe | 73.22  | 2.06 | 0.56  | 671.72    | 70.87       | 103.67   | 73.92      |
| CUReT-binary   | Bonsai | –      | –    | –     | –         | –           | –        | –          |
| Chars4K-binary | SpArSe | 74.87  | 1.87 | 0.27  | 207.04    | 21.83       | 77.89    | 55.54      |
| Chars4K-binary | SpArSe | 74.87  | 1.87 | 0.27  | 207.04    | 21.83       | 77.89    | 55.54      |
| Chars4K-binary | Bonsai | 74.71  | < 2  | 2     | 8.55      | 2.1         | 8.55     | 2.1        |

3.5 Latency and power measurements

For validation, we use uTensor [7] to convert CNNs from SpArSe into baremetal C++, which we compile using mbed-cli [3] and deploy on the Micro Bit and STM32F413 MCUs.
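The model size (MS) and working memory (WM) figures reported throughout can be approximated with a back-of-the-envelope calculation. The sketch below uses made-up layer shapes and a simplified max-over-layers WM model; it is not the paper's exact WM models (5)-(6), only an illustration of why one-byte quantization makes the memory accounting realistic.

```python
# Rough 8-bit memory estimate for a tiny CNN executed layer by layer.
# Layer shapes below are invented for illustration.
layers = [
    # (name, n_weights, in_activations, out_activations)
    ("conv1", 3 * 3 * 1 * 4, 28 * 28 * 1, 14 * 14 * 4),
    ("conv2", 3 * 3 * 4 * 8, 14 * 14 * 4, 7 * 7 * 8),
    ("fc",    7 * 7 * 8 * 10, 7 * 7 * 8,  10),
]

BYTES = 1  # one byte per weight and per activation after quantization

# Model size: all weights must fit in read-only flash.
ms_bytes = sum(w for _, w, _, _ in layers) * BYTES

# Working memory: at any point, the input and output buffers of the
# current layer must coexist in RAM (a simplified WM model; the paper's
# models (5)-(6) account for buffer reuse more carefully).
wm_bytes = max(i + o for _, _, i, o in layers) * BYTES

print(f"MS = {ms_bytes / 1024:.2f} KB, WM = {wm_bytes / 1024:.2f} KB")
```

Under this toy accounting, the first layer's activation buffers dominate WM while the fully connected layer dominates MS, which mirrors why the search must trade off both quantities rather than just the parameter count.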
Table 6 shows the latency and energy-per-inference measurements. Since uTensor has limited operator support, some networks reported in Table 6 differ from those in Table 4. Due to uTensor issues with memory management, including memory leaks, some models could only be run on the larger MCU. Corresponding measurements for Bonsai cannot be directly compared because Bonsai operates on extracted features instead of the raw input image itself [41]. A recent related work, MODC [25], is considerably slower than SpArSe, at 684 ms for MNIST on the Arduino Uno. Although it may be too early to say whether CNN latency and power consumption can meet application requirements, we hope this work provides much-needed data to start answering this question.

4 Conclusion

Although MCUs are the most widely deployed computing platform, they have been largely ignored by ML researchers. This paper makes the case for targeting MCUs for deployment of ML, enabling future IoT products and use cases. We demonstrate that, contrary to previous assertions, it is in fact possible to design CNNs for MCUs with as little as 2KB RAM. SpArSe optimizes CNNs for the multiple constraints of MCU hardware platforms, finding models that are both smaller and more accurate than previous SOTA non-CNN models across a range of standard datasets.

4.1 Acknowledgements

We thank Michael Bartling, Patrick Hansen, and Neil Tan for their help in model deployment.

References

[1] The shape of the MCU market. URL https://www.embedded.com/electronics-blogs/break-points/4441588/The-shape-of-the-MCU-market. Accessed: 2019-05-02.

[2] Global shipments of discrete graphics processing units from 2015 to 2018 (in million units). URL https://www.statista.com/statistics/865846/worldwide-discrete-gpus-shipment/. Accessed: 2019-05-23.

[3] Arm mbed-cli. URL https://github.com/ARMmbed/mbed-cli. Accessed: 2019-05-02.

[4] Microsoft Embedded Learning Library.
URL https://microsoft.github.io/ELL/. Accessed: 2019-05-02.

[5] TensorFlow Quantization-Aware Training. URL https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/quantize. Accessed: 2019-05-02.

[6] TensorFlow Lite for Microcontrollers. URL https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/experimental/micro. Accessed: 2019-05-02.

[7] uTensor. URL http://utensor.ai/. Accessed: 2019-05-02.

[8] Visual Wake Words Challenge, CVPR 2019. URL https://docs.google.com/document/u/2/d/e/2PACX-1vStp3uPhxJB0YTwL4T__Q5xjclmrj6KRs55xtMJrCyi82GoyHDp2X0KdhoYcyjEzKe4v75WBqPObdkP/pub. Accessed: 2019-05-02.

[9] Why the Future of Machine Learning is Tiny. URL https://petewarden.com/2018/06/11/why-the-future-of-machine-learning-is-tiny/. Accessed: 2019-05-02.

[10] Luigi Atzori, Antonio Iera, and Giacomo Morabito. The internet of things: A survey. Computer Networks, 54(15):2787–2805, 2010. ISSN 1389-1286. doi: 10.1016/j.comnet.2010.05.010.

[11] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HylVB3AqYm.

[12] O. Dekel, C. Jacobbs, and L. Xiao. Pruning decision forests. Personal Communications, 2016.

[13] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377, 2018.

[14] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Efficient multi-objective neural architecture search via lamarckian evolution. In International Conference on Learning Representations, 2019.

[15] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1–21, 2019.

[16] Teófilo Emídio de Campos, Bodla Rakesh Babu, and Manik Varma.
Character recognition in natural images. Volume 2, pages 273–280, January 2009.

[17] Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 1436–1445, 2018.

[18] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.

[19] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. CoRR, abs/1902.09574, 2019. URL http://arxiv.org/abs/1902.09574.

[20] Eduardo C. Garrido-Merchán and Daniel Hernández-Lobato. Dealing with categorical and integer-valued variables in Bayesian optimization with Gaussian processes. arXiv preprint arXiv:1805.03463, 2018.

[21] Dibakar Gope, Ganesh Dasika, and Matthew Mattina. Ternary hybrid neural-tree networks for highly constrained IoT applications. CoRR, abs/1903.01531, 2019.

[22] Jayavardhana Gubbi, Rajkumar Buyya, Slaven Marusic, and Marimuthu Palaniswami. Internet of things (IoT): A vision, architectural elements, and future directions. Future Generation Computer Systems, 29(7):1645–1660, 2013.

[23] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420, 2019.

[24] Chirag Gupta, Arun Sai Suggala, Ankit Goyal, Harsha Vardhan Simhadri, Bhargavi Paranjape, Ashish Kumar, Saurabh Goyal, Raghavendra Udupa, Manik Varma, and Prateek Jain. ProtoNN: Compressed and accurate kNN for resource-scarce devices. In Proceedings of the 34th International Conference on Machine Learning, pages 1331–1340. JMLR.org, 2017.

[25] Albert Gural and Boris Murmann.
Memory-optimal direct convolutions for maximizing classification accuracy in embedded applications. In ICML, pages 2515–2524, 2019.

[26] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. EIE: Efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 243–254. IEEE, 2016.

[27] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.

[28] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, pages 784–800, 2018.

[29] José Miguel Hernández-Lobato, Michael A. Gelbart, Brandon Reagen, Robert Adolf, Daniel Hernández-Lobato, Paul N. Whatmough, David Brooks, Gu-Yeon Wei, and Ryan P. Adams. Designing neural network hardware accelerators with decoupled objective evaluations. In NIPS Workshop on Bayesian Optimization, 2016.

[30] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017. URL http://arxiv.org/abs/1704.04861.

[31] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, Nov 1999. ISSN 1573-0565. doi: 10.1023/A:1007665907178.

[32] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In The International Conference on Learning Representations, 2014.

[33] Durk P. Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick.
In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.

[34] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[35] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[36] Ashish Kumar, Saurabh Goyal, and Manik Varma. Resource-efficient machine learning in 2 KB RAM for the internet of things. In Proceedings of the 34th International Conference on Machine Learning, pages 1935–1944. JMLR.org, 2017.

[37] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.

[38] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[39] Marius Lindauer. Literature on Neural Architecture Search at AutoML.org at Freiburg. URL https://www.automl.org/automl/literature-on-neural-architecture-search/. Accessed: 2019-05-02.

[40] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.

[41] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2019.

[42] Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3288–3298, 2017.

[43] Francois Meunier, Adam Wood, Keith Weiss, Katy Huberty, Simon Flannery, Joseph Moore, Craig Hettenbach, and Bill Lu. The 'Internet of Things' is now.
Morgan Stanley Research, pages 1–96, 2014.

[44] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2498–2507, 2017.

[45] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient transfer learning. CoRR, abs/1611.06440, 2016. URL http://arxiv.org/abs/1611.06440.

[46] United Nations, Department of Economic and Social Affairs, Population Division. World population prospects: The 2010 revision. UN, 2010.

[47] Biswajit Paria, Kirthevasan Kandasamy, and Barnabás Póczos. A flexible multi-objective Bayesian optimization approach using random scalarizations. arXiv preprint arXiv:1805.12168, 2018.

[48] Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.

[49] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018.

[50] Laurent Sifre and Stéphane Mallat. Rigid-motion scattering for image classification. Ph.D. thesis, 1:3, 2014.

[51] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. How to train deep variational autoencoders and probabilistic ladder networks. In 33rd International Conference on Machine Learning (ICML 2016), 2016.

[52] Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Marculescu. Single-Path NAS: Designing hardware-efficient ConvNets in less than 4 hours.
arXiv preprint arXiv:1904.02877, 2019.

[53] Kevin Swersky, David Duvenaud, Jasper Snoek, Frank Hutter, and Michael A. Osborne. Raiders of the lost architecture: Kernels for Bayesian optimization in conditional parameter spaces. arXiv preprint arXiv:1409.4011, 2014.

[54] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.

[55] Urmish Thakker, Igor Fedorov, Jesse G. Beu, Dibakar Gope, Chu Zhou, Ganesh Dasika, and Matthew Mattina. Pushing the limits of RNN compression. ArXiv, abs/1910.02558, 2019.

[56] Michael E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research (JMLR), 1(Jun):211–244, 2001.

[57] Manik Varma and Andrew Zisserman. A statistical approach to texture classification from single images. International Journal of Computer Vision, 62(1-2):61–81, 2005.

[58] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. HAQ: Hardware-aware automated quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[59] Tao Wei, Changhu Wang, Yong Rui, and Chang Wen Chen. Network morphism. In International Conference on Machine Learning, pages 564–572, 2016.

[60] Paul N. Whatmough, Chuteng Zhou, Patrick Hansen, Shreyas Kolala Venkataramanaiah, Jaesun Seo, and Matthew Mattina. FixyNN: Efficient hardware for mobile computer vision via transfer learning. In Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019.

[61] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5687–5695, 2017.

[62] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. NetAdapt: Platform-aware neural network adaptation for mobile applications. In The European Conference on Computer Vision (ECCV), September 2018.

[63] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.

[64] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

[65] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.