{"title": "Deep Active Learning with a Neural Architecture Search", "book": "Advances in Neural Information Processing Systems", "page_first": 5976, "page_last": 5986, "abstract": "We consider active learning of deep neural networks. Most active learning works in this context have focused on studying effective querying mechanisms and assumed that an appropriate network architecture is a priori known for the problem at hand. We challenge this assumption and propose a novel active strategy whereby the learning algorithm searches for effective architectures on the fly, while actively learning. We apply our strategy using three known querying techniques (softmax response, MC-dropout, and coresets) and show that the proposed approach overwhelmingly outperforms active learning using fixed architectures.", "full_text": "Deep Active Learning with a Neural Architecture\n\nSearch\n\nYonatan Geifman\n\nTechnion \u2013 Israel Institute of Technology\nyonatan.g@cs.technion.ac.il\n\nRan El-Yaniv\n\nTechnion \u2013 Israel Institute of Technology\n\nrani@cs.technion.ac.il\n\nAbstract\n\nWe consider active learning of deep neural networks. Most active learning works\nin this context have focused on studying effective querying mechanisms and as-\nsumed that an appropriate network architecture is a priori known for the problem at\nhand. We challenge this assumption and propose a novel active strategy whereby\nthe learning algorithm searches for effective architectures on the \ufb02y, while actively\nlearning. We apply our strategy using three known querying techniques (softmax\nresponse, MC-dropout, and coresets) and show that the proposed approach over-\nwhelmingly outperforms active learning using \ufb01xed architectures.\n\n1\n\nIntroduction\n\nActive learning allows a learning algorithm to control the learning process, by actively selecting\nthe labeled training sample from a large pool of unlabeled instances. 
Theoretically, active learning\nhas a huge potential, especially in cases where exponential speedup in sample complexity can be\nachieved [10, 25, 9]. Active learning becomes particularly important when considering supervised\ndeep neural models, which are hungry for large and costly labeled training samples. For example,\nwhen considering supervised learning of medical diagnoses for radiology images, the labeling of\nimages must be performed by professional radiologists whose availability is scarce and consultation\ntime is costly.\nIn this paper, we focus on active learning of image classi\ufb01cation with deep neural models. There are\nonly a few works on this topic and, for the most part, they concentrate on one issue: How to select\nthe subsequent instances to be queried. They are also mostly based on the uncertainty sampling\nprinciple in which querying uncertain instances tends to expedite the learning process. For example,\n[6] employ a Monte-Carlo dropout (MC-dropout) technique for estimating uncertainty of unlabeled\ninstances. [24] applied the well-known softmax response (SR) to estimate uncertainty. [21] and\n[7] proposed to use coresets on the neural embedding space and then exploit the coreset loss of\nunlabeled points as a proxy for their uncertainty. A drawback of most of these works is their heavy\nuse of prior knowledge regarding the neural architecture. That is, they utilize an architecture already\nknown to be useful for the classi\ufb01cation problem at hand.\nWhen considering active learning of a new learning task, e.g., involving medical images or remote\nsensing, there is no known off-the-shelf working architecture. 
Moreover, even if one receives from\nan oracle the \u201ccorrect\u201d architecture for the passive learning problem (an architecture that induces the\nbest performance if trained over a very large labeled training sample), it is unlikely that this archi-\ntecture will be effective in the early stages of an active learning session. The reason is that a large\nand expressive architecture will tend to over\ufb01t when trained over a small sample and, consequently,\nits generalization performance and the induced querying function (from the over\ufb01t model) can be\npoor (we demonstrate this phenomenon in Section 5).\nTo overcome this challenge, we propose to perform a neural architecture search (NAS) in every\nactive learning round. We present a new algorithm, the incremental neural architecture search\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f(iNAS), which can be integrated together with any active querying strategy. In iNAS, we perform\nan incremental search for the best architecture from a restricted set of candidate architectures. The\nmotivating intuition is that the capacity of the architectural class should start small, with limited ar-\nchitectural capacity, and should be monotonically non-decreasing along the active learning process.\nThe iNAS algorithm thus only allows for small architectural increments in each active round. We\nimplement iNAS using a \ufb02exible architecture family consisting of changeable numbers of stacks,\neach consisting of a \ufb02uid number of Resnet blocks. The resulting active learning algorithm, which\nwe term active-iNAS, consistently and signi\ufb01cantly improves all known deep active learning algo-\nrithms. 
We demonstrate this advantage of active-iNAS with the above three querying functions over\nthree image classi\ufb01cation datasets: CIFAR-10, CIFAR-100, and SVHN.\n\n2 Related Work\n\nActive learning has attracted considerable attention since the early days of machine learning.\nThe literature on active learning in the context of classical models such as SVMs is extensive\n[4, 5, 23, 2, 1, 13], and clearly beyond the scope of this paper. Active learning of deep neural\nmodels, as we consider here, has hardly been considered to date. Among the prominent related\nresults, we note Gal et al. [6], who presented active learning algorithms for deep models based on\na Bayesian Monte-Carlo dropout (MC-dropout) technique for estimating uncertainty. Wang et al.\n[24] applied the well-known softmax response (SR) idea supplemented with pseudo-labeling (self-\nlabeling of highly con\ufb01dent points) for active learning. Sener and Savarese [21] and Geifman and\nEl-Yaniv [7] proposed using coresets on the neural embedding space and then exploiting the coreset\nloss of unlabeled points as a proxy for their uncertainty. A major de\ufb01ciency of most of these results\nis that the active learning algorithms were applied with a neural architecture that is already known\nto work well for the learning problem at hand. This hindsight knowledge is, of course, unavailable\nin a true active learning setting. To mitigate this problematic aspect, in [7] it was suggested that the\nactive learning be applied only over the \u201clong tail\u201d; namely, to initially utilize a large labeled training\nsample to optimize the neural architecture, and only then to start the active learning process. This\npartial remedy suffers from two de\ufb01ciencies. First, it cannot be implemented in small learning prob-\nlems where the number of labeled instances is small (e.g., smaller than the \u201clong tail\u201d). 
Secondly, in\nGeifman and El-Yaniv\u2019s solution, the architecture is \ufb01xed after it has been initially optimized. This\nmeans that the \ufb01nal model, which may require a larger architecture, is likely to be sub-optimal.\nHere, we initiate the discussion of architecture optimization in active learning within the context of\ndeep neural models. Surprisingly, the problem of hyperparameter selection in classical models (such\nas SVMs) has not been discussed for the most part. One exception is the work of Huang et al. [13]\nwho brie\ufb02y considered this problem in the context of linear models and showed that active learning\nperformance curves can be signi\ufb01cantly enhanced using a proper choice of (\ufb01xed) hyperparameters.\nHuang et al. however, chose the hyperparameters in hindsight. In contrast, we consider a dynamic\noptimization of neural architectures during the active learning session.\nIn neural architecture search (NAS), the goal is to devise algorithms that automatically optimize\nthe neural architecture for a given problem. Several NAS papers have recently proposed a number\nof approaches. In [28], a reinforcement learning algorithm was used to optimize the architecture\nof a neural network. In [29], a genetic algorithm is used to optimize the structure of two types of\n\u201cblocks\u201d (a combination of neural network layers and building components) that have been used\nfor constructing architectures. The number of blocks comprising the full architecture was manually\noptimized. It was observed that the optimal number of blocks is mostly dependent on the size of the\ntraining set. More ef\ufb01cient optimization techniques were proposed in [16, 19, 20, 18]. In all these\nworks, the architecture search algorithms were focused on optimizing the structure of one (or two)\nblocks that were manually connected together to span the full architecture. 
The algorithm proposed in [17] optimizes both the block structure and the number of blocks simultaneously.

3 Problem Setting

We first define a standard supervised learning problem. Let $\mathcal{X}$ be a feature space and $\mathcal{Y}$ be a label space. Let $P(X, Y)$ be an unknown underlying distribution, where $X \in \mathcal{X}$, $Y \in \mathcal{Y}$. Based on a labeled training set $S_m = \{(x_i, y_i)\}_{i=1}^{m}$ of $m$ labeled training samples, the goal is to select a prediction function $f \in \mathcal{F}$, $f : \mathcal{X} \to \mathcal{Y}$, so as to minimize the risk $R_\ell(f) = \mathbb{E}_{(X,Y)}[\ell(f(X), Y)]$, where $\ell(\cdot) \in \mathbb{R}^+$ is a given loss function. For any labeled set $S$ (training or validation), the empirical risk over $S$ is defined as $\hat{r}_S(f) = \frac{1}{|S|} \sum_{i=1}^{|S|} \ell(f(x_i), y_i)$.

In the pool-based active learning setting, we are given a set $U = \{x_1, x_2, \ldots, x_u\}$ of unlabeled samples. Typically, the acquisition of unlabeled instances is cheap and, therefore, $U$ can be very large. The task of the active learner is to choose points from $U$ to be labeled by an annotator so as to train an accurate model for a given labeling budget (number of labeled points). The points are selected by a query function denoted by $Q$. Query functions often select points based on information inferred from the current model $f_\theta$, the existing training set $S$, and the current pool $U$. In the mini-batch pool-based active learning setting, the $n$ points to be labeled are queried in bundles called mini-batches, such that a model is trained after each mini-batch.

NAS is formulated as follows. Consider a class $\mathcal{A}$ of architectures, where each architecture $A \in \mathcal{A}$ represents a hypothesis class containing all models $f_\theta \in A$, where $\theta$ represents the parameter vector of the architecture $A$. 
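The empirical risk $\hat{r}_S(f)$ defined above can be sketched in a few lines (a minimal illustration; the toy parity classifier and the choice of the 0-1 loss are assumptions for the example, not the paper's setup):

```python
# Empirical risk r_hat_S(f) = (1/|S|) * sum_i loss(f(x_i), y_i) over a labeled set S.
def empirical_risk(f, labeled_set, loss):
    return sum(loss(f(x), y) for x, y in labeled_set) / len(labeled_set)

# Illustrative choices: a 0-1 loss and a toy "classifier" predicting parity.
zero_one = lambda pred, y: 0.0 if pred == y else 1.0
S = [(0, 0), (1, 1), (2, 1), (3, 1)]
f = lambda x: x % 2
assert empirical_risk(f, S, zero_one) == 0.25  # one mistake out of four points
```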
The objective in NAS is to solve

$$A^* = \operatorname*{argmin}_{A \in \mathcal{A}} \; \min_{f_\theta \in A|S} R_\ell(f_\theta). \quad (1)$$

Since $R_\ell(f)$ depends on an unknown distribution, it is typically proxied by an empirical quantity such as $\hat{r}_S(f)$, where $S$ is a training or validation set.

4 Deep Active Learning with a Neural Architecture Search

In this section we define a neural architecture search space over which we apply a novel search algorithm. This search space, together with the algorithm, constitutes a new NAS technique that drives our new active learning algorithm.

4.1 Modular Architecture Search Space

Modern neural network architectures are often modeled as a composition of one or several basic building blocks (sometimes referred to as "cells") containing several layers [11, 14, 27, 26, 12]. Stacks are composed of several blocks connected together. The full architecture is a sequence of stacks, where usually down-sampling and depth-expansion are performed between stacks. For example, consider the Resnet-18 architecture. This network begins with two initial layers and continues with four consecutive stacks, each consisting of two Resnet basic blocks, followed by an average pooling layer and ending with a softmax layer. The Resnet basic block contains two batch-normalized 3×3 convolutional layers with a ReLU activation and a residual connection. Between every two stacks, the feature maps' resolution is reduced by a factor of 2 (using a strided convolution layer), and the width (the number of feature maps in each layer, denoted $W$) is doubled, starting from 64 in the first block. This classic architecture has several variants, which differ in the number and type of blocks in each stack.

In this work, we consider "homogeneous" architectures composed of a single block type and with each stack containing the same number of blocks. 
We denote such an architecture by $A(B, N_{blocks}, N_{stacks})$, where $B$ is the block, $N_{blocks}$ is the number of blocks in each stack, and $N_{stacks}$ is the number of stacks. For example, using this notation, Resnet-18 is $A(B_r, 2, 4)$, where $B_r$ is the Resnet basic block. Figure 1(a) depicts the proposed homogeneous architecture.

For a given block $B$, we define a modular architecture search space as $\mathcal{A} = \{A(B, i, j) : i \in \{1, 2, \ldots, N_{blocks}\}, j \in \{1, 2, \ldots, N_{stacks}\}\}$, which is simply all possible architectures spanned by the grid defined by the two corners $A(B, 1, 1)$ and $A(B, N_{blocks}, N_{stacks})$. Clearly, the space $\mathcal{A}$ is restricted in the sense that it contains only a limited subspace of architectures; nevertheless, it contains $N_{blocks} \times N_{stacks}$ architectures with diversity in both the number of layers and the number of parameters.

4.2 Search Space as a Directed Acyclic Graph (DAG)

The main idea in our search strategy is to start from the smallest possible architecture (in the modular search space) and iteratively search for an optimal incremental architectural expansion within the modular search space. We define the depth of an architecture to be the number of layers in the architecture. We denote the depth of $A(B, i, j)$ by $|A(B, i, j)| = ij\beta + \alpha$, where $\beta$ is the number of layers in the block $B$ and $\alpha$ is the number of layers in the initial block (all the layers appearing before the first block) plus the number of layers in the classification block (all the layers appearing after the last block). 

Figure 1: (a) The general proposed architecture contains $N_{blocks}$ blocks in each stack and $N_{stacks}$ stacks. (b) A search space up to $A(B, 5, 4)$ plotted on a grid. The horizontal axis ($i$) represents the number of blocks; the vertical axis ($j$) represents the number of stacks. The arrows represent all the edges of the graph. The number in each vertex is the number of blocks in the architecture ($ij$).
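As a concrete illustration, the grid-shaped search space and the depth formula $|A(B, i, j)| = ij\beta + \alpha$ can be sketched as follows (the block type $B$ is abstracted away; the values $\beta = 2$ and $\alpha = 2$ are assumptions chosen so that the Resnet-18-like point $A(B_r, 2, 4)$ has depth 18, not values stated in the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Arch:
    """A homogeneous architecture A(B, i, j): i blocks per stack, j stacks."""
    blocks: int   # i
    stacks: int   # j

def depth(a, beta=2, alpha=2):
    """Depth |A(B, i, j)| = i*j*beta + alpha layers (beta and alpha assumed)."""
    return a.blocks * a.stacks * beta + alpha

def search_space(n_blocks, n_stacks):
    """All architectures spanned by the corners A(B,1,1) and A(B,n_blocks,n_stacks)."""
    return [Arch(i, j) for i in range(1, n_blocks + 1)
                       for j in range(1, n_stacks + 1)]

grid = search_space(5, 4)          # the grid of Figure 1(b)
assert len(grid) == 5 * 4
assert depth(Arch(2, 4)) == 18     # the Resnet-18-like point A(B_r, 2, 4)
```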
It is convenient to represent the architecture search space as a directed acyclic graph (DAG) $G = (V, E)$, where the vertex set is $V = \{A(B, i, j)\}$, $B$ is a fixed neural block (e.g., a Resnet basic block), $i \in \{1, 2, \ldots, N_{blocks}\}$ is the number of blocks in each stack, and $j \in \{1, 2, \ldots, N_{stacks}\}$ is the number of stacks. The edge set $E$ is defined based on two incremental expansion steps. The first step increases the depth of the network without changing the number of stacks (i.e., without affecting the width), and the second step increases the depth while also increasing the number of stacks (i.e., increasing the width). Both increment steps are defined so as to perform the minimum possible architectural expansion (within the search space). Thus, when expanding $A(B, i, j)$ using the first step, the resulting architecture is $A(B, i + 1, j)$. When expanding $A(B, i, j)$ using the second step, we reduce the number of blocks in each stack to perform a minimal expansion, resulting in the architecture $A(B, \lfloor \frac{ij}{j+1} \rfloor + 1, j + 1)$. The parameters of the latter architecture are obtained by rounding up the solution $i'$ of the following problem:

$$i' = \operatorname*{argmin}_{i' > 0} |A(B, i', j + 1)| \quad \text{s.t.} \quad |A(B, i', j + 1)| > |A(B, i, j)|.$$

We conclude that each of these steps is indeed depth-expanding. In the first step, the expansion is made only along the depth dimension, while the second step affects the number of stacks and expands the width as well. In both steps, the increment is the smallest possible within the modular search space.

In Figure 1(b), we depict the DAG $G$ on a grid whose coordinates are $i$ (blocks) and $j$ (stacks). The modular search space in this example is all the architectures in the range $A(B, 1, 1)$ to $A(B, 5, 4)$. The arrows represent all edges in $G$. 
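The two expansion edges can be sketched directly from the formulas above (a minimal sketch; the depth formula again uses assumed $\beta$ and $\alpha$ values):

```python
def depth(i, j, beta=2, alpha=2):
    """|A(B, i, j)| = i*j*beta + alpha (beta and alpha are assumed values)."""
    return i * j * beta + alpha

def successors(i, j, n_blocks, n_stacks):
    """The DAG edges leaving A(B, i, j): deepen within stacks, or add a stack."""
    out = []
    if i + 1 <= n_blocks:                        # first step: A(B, i+1, j)
        out.append((i + 1, j))
    if j + 1 <= n_stacks:                        # second step: A(B, floor(ij/(j+1)) + 1, j+1)
        out.append((i * j // (j + 1) + 1, j + 1))
    return out

# Every edge is strictly depth-expanding, as claimed in the text.
for (i, j) in [(1, 1), (3, 2), (5, 3)]:
    for (i2, j2) in successors(i, j, n_blocks=5, n_stacks=4):
        assert depth(i2, j2) > depth(i, j)
```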
In this formulation, it is evident that every path starting from any architecture can be expanded up to the largest possible architecture. Moreover, every architecture is reachable when starting from the smallest architecture $A(B, 1, 1)$. These two properties serve our search strategy well.

4.3 Incremental Neural Architecture Search

The proposed incremental neural architecture search (iNAS) procedure is described in Algorithm 1 and operates as follows. Given a small initial architecture $A(B, i_0, j_0)$, a training set $S$, and an architecture search space $\mathcal{A}$, we first randomly partition the set $S$ into training and validation subsets, $S'$ and $V'$, respectively, $S = S' \cup V'$. On iteration $t$, a set of candidate architectures is selected based on the edges of the search DAG (see Section 4.2), including the current architecture and the two connected vertices (lines 5-6). This step creates a candidate set $\mathcal{A}'$ consisting of three architectures, $\mathcal{A}' = \{A(B, i, j),\ A(B, \lfloor \frac{ij}{j+1} \rfloor + 1, j + 1),\ A(B, i + 1, j)\}$. In line 7, the best candidate in terms of validation performance is selected and denoted $A_t = A(B, i_t, j_t)$. The optimization problem formulated in line 7 is an approximation of the NAS objective formulated in Equation (1). 
The algorithm terminates whenever $A_t = A_{t-1}$, or when a predefined maximum number of iterations is reached (in which case $A_t$ is the final output).

Algorithm 1 iNAS
1: iNAS($S$, $A(B, i_0, j_0)$, $\mathcal{A}$, $T_{iNAS}$)
2: Let $S'$, $V'$ be a random train/validation split of $S$
3: for $t = 1 : T_{iNAS}$ do
4:   $i \leftarrow i_{t-1}$; $j \leftarrow j_{t-1}$
5:   $\mathcal{A}' = \{A(B, i, j),\ A(B, \lfloor \frac{ij}{j+1} \rfloor + 1, j + 1),\ A(B, i + 1, j)\}$
6:   $\mathcal{A}' = \mathcal{A}' \cap \mathcal{A}$
7:   $A(B, i_t, j_t) = \operatorname{argmin}_{A \in \mathcal{A}'} \hat{r}_{V'}(\operatorname{argmin}_{f_\theta \in A} \hat{r}_{S'}(f_\theta))$
8:   if $A(B, i_t, j_t) = A(B, i_{t-1}, j_{t-1})$ then
9:     break
10:  end if
11: end for
12: Return $A(B, i_t, j_t)$

Algorithm 2 Deep Active Learning with iNAS
1: active-iNAS($U$, $A_0$, $\mathcal{A}$, $Q$, $b$, $k$)
2: $t \leftarrow 1$
3: $S_t \leftarrow$ sample $k$ points from $U$ at random
4: $U_1 \leftarrow U \setminus S_1$
5: while true do
6:   $A_t \leftarrow$ iNAS($S_t$, $A_{t-1}$, $\mathcal{A}$, $T_{iNAS}$)
7:   train $f_\theta \in A_t$ using $S_t$
8:   if budget exhausted or $U_t = \emptyset$ then
9:     Return $f_\theta$
10:  end if
11:  $S' \leftarrow Q(f_\theta, S_t, U_t, b)$
12:  $S_{t+1} \leftarrow S_t \cup S'$
13:  $U_{t+1} \leftarrow U_t \setminus S'$
14:  $t \leftarrow t + 1$
15: end while

4.4 Active Learning with iNAS

The deep active learning with incremental neural architecture search (active-iNAS) technique is described in Algorithm 2 and works as follows. Given a pool $U$ of unlabeled instances from $\mathcal{X}$, a set of architectures $\mathcal{A}$ is induced using a composition of basic building blocks $B$ as shown in Section 4.1, together with an initial (small) architecture $A_0 \in \mathcal{A}$, a query function $Q$, an initial (passively) labeled training set size $k$, and an active learning batch size $b$. We first sample uniformly at random $k$ points from $U$ to constitute the initial training set $S_1$. We then iterate the following three steps. First, we search for an optimal neural architecture using the iNAS algorithm over the search space $\mathcal{A}$ with the current training set $S_t$ (line 6). 
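Algorithms 1 and 2 can be summarized in a compact, runnable skeleton (training and querying are abstracted into callables; the toy validation-error model and all names here are illustrative assumptions, not the paper's implementation):

```python
import random

def candidates(i, j, n_blocks=5, n_stacks=4):
    """Candidate set A' (Algorithm 1, lines 5-6): current arch + two successors."""
    cands = [(i, j)]
    if i + 1 <= n_blocks:
        cands.append((i + 1, j))
    if j + 1 <= n_stacks:
        cands.append((i * j // (j + 1) + 1, j + 1))
    return cands

def inas(val_error, a0, t_max=10):
    """Algorithm 1: greedily move to the candidate minimizing validation error."""
    a = a0
    for _ in range(t_max):
        best = min(candidates(*a), key=val_error)
        if best == a:                        # line 8: no candidate improves; stop
            return a
        a = best
    return a

def active_inas(pool, label, val_error, query, a0, k, b, budget):
    """Algorithm 2: alternate architecture search, training, and querying."""
    s = random.sample(pool, k)               # initial labeled seed S_1
    u = [x for x in pool if x not in s]      # remaining pool U_1
    a = a0
    while True:
        a = inas(lambda arch: val_error(arch, len(s)), a)
        model = (a, [label(x) for x in s])   # stand-in for training f_theta on S_t
        if len(s) >= budget or not u:
            return model
        new = query(model, s, u, b)          # S' <- Q(f_theta, S_t, U_t, b)
        s += new
        u = [x for x in u if x not in new]
```

With a toy validation-error model that prefers capacity proportional to the labeled-sample size, this skeleton exhibits the intended behavior: the selected architecture grows monotonically as labels accumulate.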
The initial architecture for iNAS is chosen to be the architecture selected in the previous active round ($A_{t-1}$), under the assumption that the architecture size is non-decreasing along the active learning process. The resulting architecture at iteration $t$ is denoted $A_t$. Next, we train a model $f_\theta \in A_t$ based on $S_t$ (line 7). Finally, if the querying budget allows, the algorithm requests $b$ new points using $Q(f_\theta, S_t, U_t, b)$ and updates $S_{t+1}$ and $U_{t+1}$ correspondingly. Otherwise, the algorithm returns $f_\theta$ (lines 8-14).

4.5 Theoretical Motivation and Implementation Notes

The iNAS algorithm is designed to exploit the prior knowledge gleaned from samples of increasing size, which is motivated by straightforward statistical learning arguments. iNAS starts with a small capacity so as to avoid overfitting in the early stages, and then allows for capacity increments as labeled data accumulates. Recall from statistical learning theory that for a given hypothesis class $\mathcal{F}$ and training set $S_m$, the generalization gap can be bounded as follows, with probability at least $1 - \delta$:

$$R(f) - \hat{r}_{S_m}(f) \leq O\left(\sqrt{\frac{d \log(m/d) + \log(1/\delta)}{m}}\right),$$

where $d$ is the VC-dimension of $\mathcal{F}$. Recently, Bartlett et al. [3] showed a nearly tight bound for the VC-dimension of deep neural networks. Let $W$ be the number of parameters in a neural network, $L$ the number of layers, and $U$ the number of computation units (neurons/filters); [3] showed that the VC-dimension of ReLU-activated regression models is bounded as $VCdim(\mathcal{F}) \leq O(\bar{L} W \log U)$, where $\bar{L} \triangleq \frac{1}{W} \sum_{i=1}^{L} W_i$ and $W_i$ is the number of parameters from the input to layer $i$. As can be seen, the expansion steps proposed in iNAS are designed to minimally expand the VC-dimension of $\mathcal{F}$. When adding blocks, $W$, $U$ and $\bar{L}$ grow linearly. As a result, the VC-dimension grows linearly. 
When adding a stack (in the iNAS algorithm), $W$ and $U$ grow sub-exponentially, and $L$ (and $\bar{L}$) also grows. Along the active learning session, $m$ grows linearly in incremental steps; this motivates a linear growth of the VC-dimension (in incremental steps) so as to keep the generalization gap bound as small as possible. Alternative approaches that are often used, such as a full grid search on each active round, would not enjoy these benefits and would be prone to overfitting (not to mention that a full grid search could be computationally prohibitive).

Turning now to the run time of active-iNAS: when running with small active learning mini-batches, the iNAS algorithm requires only one iteration at each round, so only three additional models have to be trained at each round. In our implementation of iNAS, we apply "premature evaluation" as considered in [22]; our models are evaluated after $T_{SGD}/4$ epochs, where $T_{SGD}$ is the total number of epochs in each round. Our final active-iNAS implementation thus takes only $1.75\,T_{SGD}$ epochs for each active round. For example, in the CIFAR-10 experiment ($T_{SGD} = 200$), an active learning round requires less than 2 GPU hours on average (Nvidia Titan-Xp GPU).

Figure 2: Active learning curves for the CIFAR-10 dataset using various query functions: (a) softmax response, (b) MC-dropout, (c) coreset. In black (solid) – Active-iNAS (ours), blue (dashed) – Resnet-18 fixed architecture, and red (dashed) – $A(B_r, 1, 2)$ fixed.

5 Experiments

We first compare active-iNAS to active learning performed with a fixed architecture over three datasets, using three querying functions: softmax response, coresets, and MC-dropout. Then we analyze the architectures learned by iNAS along the active process. 
We also empirically motivate\nthe use of iNAS by showing how optimized architecture can improve the query function. Finally,\nwe compare the resulting active learning algorithm obtained with the active-iNAS framework.\n\n5.1 Experimental Setting\n\nWe used an architecture search space that is based on the Resnet architecture [11]. The initial block\ncontains a convolutional layer with \ufb01lter size of 3 \u00d7 3 and depth of 64, followed by a max-pooling\nlayer having a spatial size of 3 \u00d7 3 and strides of 2. The basic block contains two convolutional\nlayers of size 3 \u00d7 3 followed by a ReLU activation. A residual connection is added before the\nactivation of the second convolutional layer, and a batch normalization [15] is used after each layer.\nThe classi\ufb01cation block contains an average pooling layer that reduces the spatial dimension to\n1 \u00d7 1, and a fully connected classi\ufb01cation layer followed by softmax. The search space is de\ufb01ned\naccording to the formulation in Section 4.1, and spans all architectures in the range A(Br, 1, 1) to\nA(Br, 12, 5).\nAs a baseline, we chose two \ufb01xed architectures. The \ufb01rst architecture was the one optimized for\nthe \ufb01rst active round (optimized over the initial seed of labeled points), and which coincidentally\nhappened to be A(Br, 1, 2) on all tested datasets. The second architecture was the well-known\nResnet-18, denoted as A(Br, 2, 4), which is some middle point in our search grid.\nWe trained all models using stochastic gradient descent (SGD) with a batch size of 128 and momen-\ntum of 0.9 for 200 epochs. We used a learning rate of 0.1, with a learning rate multiplicative decay\nof 0.1 after epochs 100 and 150. Since we were dealing with different sizes of training sets along\nthe active learning process, the epoch size kept changing. We \ufb01xed the size of an epoch to be 50,000\ninstances (by oversampling), regardless of the current size of the training set St. 
A weight decay of\n5e-4 was used, and standard data augmentation was applied containing horizontal \ufb02ips, four pixel\nshifts and up to 15-degree rotations.\n\n6\n\n\f(a) Softmax response\n\n(b) MC-dropout\n\n(c) Coreset\n\nFigure 3: Active learning curves for CIFAR-100 dataset using various query functions, (a) softmax response,\n(b) MC-dopout, (c) coreset. In black (solid) \u2013 Active-iNAS (ours), blue (dashed) \u2013 Resnet-18 \ufb01xed architecture,\nand red (dashed) \u2013 A(Br, 1, 2) \ufb01xed.\n\n(a) Softmax response\n\n(b) MC-dropout\n\n(c) Coreset\n\nFigure 4: Active learning curves for SVHN dataset using various query functions, (a) softmax response, (b)\nMC-dopout, (c) coreset. In black (solid) \u2013 Active-iNAS (ours), blue (dashed) \u2013 Resnet-18 \ufb01xed architecture,\nand red (dashed) \u2013 A(Br, 1, 2) \ufb01xed.\n\nThe active learning was implemented with an initial labeled training seed (k) of 2000 instances. The\nactive mini-batch size (b) was initialized to 2000 instances and updated to 5000 after reaching 10000\nlabeled instances. The maximal budget was set to 50,000 for all datasets1. For time ef\ufb01ciency rea-\nsons, the iNAS algorithm was implemented with TiN AS = 1, and the training of new architectures\nin iNAS was early-stopped after 50 epochs, similar to what was done in [22].\n\n5.2 Active-iNAS vs. Fixed Architecture\n\nThe results of an active learning algorithm are often depicted by a curve measuring the trade-off\nbetween labeled points (or a budget) vs. performance (accuracy in our case). For example, in\nFigure 2(a) we see the results obtained by active-iNAS and two \ufb01xed architectures for classify-\ning CIFAR-10 images using the softmax response querying function. In black (solid), we see the\ncurve for the active-iNAS method. The results of A(Br, 1, 2) and Resnet-18 (A(Br, 2, 4)) appear\nin (dashed) red and (dashed) blue, respectively. 
The x-axis corresponds to the number of labeled points consumed, starting from $k = 2000$ (the initial seed size) and ending with 50,000. In each active learning curve, the standard error of the mean over three random repetitions is shadowed.

We present results for CIFAR-10, CIFAR-100 and SVHN. We first analyze the results for CIFAR-10 (Figure 2). Consider the graphs corresponding to the fixed architectures (red and blue). It is evident that for all query functions, the small architecture (red) outperforms the big one (Resnet-18, in blue) in the early stage of the active process. Later on, we see that the big and expressive Resnet-18 outperforms the small architecture. Active-iNAS consistently and significantly outperforms both fixed architectures almost throughout the entire range. It is most striking that active-iNAS is better than each of the fixed architectures even when all are consuming the entire training budget. Later on, we speculate about the reason for this phenomenon, as well as about the switch between the red and blue curves occurring roughly around 15,000 training points (in Figure 2(a)).

Turning now to CIFAR-100 (Figure 3), we see qualitatively very similar behaviors and relations between the various active learners. We now see that the learning problem is considerably harder, as indicated by the smaller area under all the curves. Nevertheless, in this problem active-iNAS achieves a substantially greater advantage over the fixed architectures with all three query functions. Finally, in the SVHN digit classification task, which is known to be easier than both CIFAR tasks, we again see qualitatively similar behaviors that are now much less pronounced, as all active learners are quite effective. 

¹SVHN contains 73,256 instances and was, therefore, trimmed to 50,000.
On the other hand, in the SVHN task, active-iNAS impressively obtains almost maximal performance after consuming only 20% of the training budget.

Figure 5: Comparison of active-iNAS for various query functions across three datasets: (a) CIFAR-10, (b) CIFAR-100, (c) SVHN. In black (solid) – softmax response, red (dashed) – MC-dropout, and blue (dashed) – coreset.

Figure 6: (a) The architectures learned by iNAS as a function of the number of labeled samples. Blue curves represent the number of parameters (×1e6) and black curves represent the number of layers. CIFAR-10 – solid line, CIFAR-100 – dashed line, and SVHN – dotted line. (b) Comparison of AUC-GAIN for softmax response active learning over CIFAR-10 for two architectures. In red (solid), the small architecture ($A(B_r, 1, 2)$), and in blue (dashed), the Resnet-18 architecture ($A(B_r, 2, 4)$).

5.3 Analyzing the Learned Architectures

In addition to the standard performance results presented in Section 5.2, it is interesting to inspect the sequence of architectures that have been selected by iNAS along the active learning process. In Figure 6(a) we depict these dynamics; for example, consider the CIFAR-10 dataset, appearing in solid lines, where the blue curve represents the number of parameters in the network and the black curve shows the number of layers in the architecture. Comparing CIFAR-10 (solid) and CIFAR-100 (dashed), we see that active-iNAS prefers, for CIFAR-100, deeper architectures compared to its choices for CIFAR-10. In contrast, on SVHN (dotted), active-iNAS gravitates to shallower but wider architectures, which result in significantly larger numbers of parameters. 
The iNAS algorithm is relatively stable in the sense that in the vast majority of random repeats of the experiments, similar sequences of architectures have been learned (this result is not shown in the figure).

A hypothesis that might explain the latter results is that CIFAR-100 contains a larger number of "concepts", requiring a deeper hierarchy of learned CNN layers compared to CIFAR-10. SVHN is a simpler and less noisy learning problem and, therefore, larger architectures can be used without significant risk of overfitting.

5.4 Enhanced Querying with Active-iNAS

In this section we argue and demonstrate that optimized architectures not only improve generalization at each step, but also enhance the query function quality². In order to isolate the contribution of the query function, we normalize the active performance by the performance of a passive learner obtained with the same model. A common approach for this normalization has already been proposed in [13, 2]; we define a conceptually similar normalization as follows. Let the relative AUC gain be the relative reduction of the area under the curve (AUC) of the 0-1 loss in the active learning curve, compared to the AUC of the passive learner (trained over the same number of random queries at each round); namely,

$$\text{AUC-GAIN}(PA, AC, m) = \frac{AUC_m(PA) - AUC_m(AC)}{AUC_m(PA)},$$

where $AC$ is an active learning algorithm, $PA$ is its passive application (with the same architecture), $m$ is a labeling budget, and $AUC_m(\cdot)$ is the area under the learning curve (0-1 loss) of the algorithm with budget $m$. Clearly, high values of AUC-GAIN correspond to high performance, and vice versa.

In Figure 6(b), we used the AUC-GAIN to measure the performance of the softmax response querying function on the CIFAR-10 dataset over all training budgets up to the maximal (50,000). 
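The AUC-GAIN normalization can be sketched as follows (an illustration only; the trapezoidal integration over accuracy curves and the toy numbers are assumptions):

```python
def auc_01_loss(budgets, accuracies):
    """Area under the 0-1 loss learning curve via the trapezoidal rule."""
    losses = [1.0 - a for a in accuracies]
    return sum((b2 - b1) * (l1 + l2) / 2.0
               for b1, b2, l1, l2 in zip(budgets, budgets[1:], losses, losses[1:]))

def auc_gain(budgets, passive_acc, active_acc):
    """AUC-GAIN(PA, AC, m) = (AUC_m(PA) - AUC_m(AC)) / AUC_m(PA)."""
    auc_pa = auc_01_loss(budgets, passive_acc)
    auc_ac = auc_01_loss(budgets, active_acc)
    return (auc_pa - auc_ac) / auc_pa

budgets = [2000, 4000, 6000]
passive = [0.60, 0.70, 0.75]   # toy passive-learner accuracy per budget
active  = [0.65, 0.78, 0.80]   # toy active-learner accuracy per budget
assert auc_gain(budgets, passive, active) > 0   # active beats passive
```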
We compare the performance of this query function applied over two different architectures: the small architecture, A(Br, 1, 2), and Resnet-18, A(Br, 2, 4). We note that it is unclear how to define AUC-GAIN for active-iNAS because its architecture changes dynamically.
As can easily be seen, the small architecture dramatically outperforms Resnet-18 in the early stages. Later on, the AUC-GAIN curves switch, and Resnet-18 catches up and outperforms the small architecture. This result supports the intuition that improvements in generalization tend to improve the effectiveness of the querying function. We hypothesize that the outstanding results of active-iNAS shown in Section 5.2 have been achieved not only by the improved generalization of every single model, but also by the effect of the optimized architecture on the querying function.

5.5 Query Function Comparison

In Section 5.2 we demonstrated that active-iNAS consistently outperformed direct active applications of three querying functions. Here, we compare the performance of the three active-iNAS methods, applied with those three functions: softmax response, MC-dropout, and coreset. In Figure 5 we compare these three active-iNAS algorithms over the three datasets. In all three datasets, softmax response is among the top performers, whereas each of the other two querying functions is sometimes the worst; thus, softmax response achieves the best overall results. For example, on CIFAR-10 and SVHN, MC-dropout is on par with softmax response, but on CIFAR-100 MC-dropout is the worst. The poor performance of MC-dropout on CIFAR-100 may be caused by the large number of classes, as pointed out by [8] in the context of selective classification. In all cases, coreset is slightly behind softmax response.
This is in sharp contrast to the results presented by [21] and [7].
We conclude this section by emphasizing that our results indicate that the combination of softmax response with active-iNAS is the best active learning method.

6 Concluding Remarks

We presented active-iNAS, an algorithm that effectively integrates deep neural architecture optimization with active learning. The active algorithm performs a monotone search for the locally best architecture on the fly. Our experiments indicate that active-iNAS outperforms standard active learners that utilize suitable and commonly used fixed architectures. In terms of absolute performance quality, to the best of our knowledge, the combination of active-iNAS and softmax response is the best active learner over the datasets we considered.

Acknowledgments

This research was supported by The Israel Science Foundation (grant No. 81/017).

² We only consider querying functions that are defined in terms of a model (such as all query functions considered here).

References

[1] Maria-Florina Balcan and Phil Long. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, pages 288–316, 2013.

[2] Yoram Baram, Ran El-Yaniv, and Kobi Luz. Online choice of active learning algorithms. Journal of Machine Learning Research, 5(Mar):255–291, 2004.

[3] Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019.

[4] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.

[5] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Information, prediction, and query by committee.
In Advances in Neural Information Processing Systems (NIPS) 5, pages 483–490, 1993.

[6] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. arXiv preprint arXiv:1703.02910, 2017.

[7] Yonatan Geifman and Ran El-Yaniv. Deep active learning over the long tail. arXiv preprint arXiv:1711.00941, 2017.

[8] Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, pages 4878–4887, 2017.

[9] Roei Gelbhart and Ran El-Yaniv. The relationship between agnostic selective classification active learning and the disagreement coefficient. arXiv preprint arXiv:1703.06536, 2017.

[10] Steve Hanneke et al. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[12] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[13] Tzu-Kuo Huang, Alekh Agarwal, Daniel J. Hsu, John Langford, and Robert E. Schapire. Efficient and parsimonious agnostic active learning. In Advances in Neural Information Processing Systems, pages 2755–2763, 2015.

[14] Forrest Iandola, Matt Moskewicz, Sergey Karayev, Ross Girshick, Trevor Darrell, and Kurt Keutzer. Densenet: Implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869, 2014.

[15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift.
arXiv preprint arXiv:1502.03167, 2015.

[16] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. arXiv preprint arXiv:1712.00559, 2017.

[17] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017.

[18] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.

[19] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.

[20] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.

[21] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.

[22] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.

[23] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2(Nov):45–66, 2001.

[24] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 2016.

[25] Yair Wiener, Steve Hanneke, and Ran El-Yaniv. A compression technique for analyzing disagreement-based active learning.
The Journal of Machine Learning Research, 16(1):713–745, 2015.

[26] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.

[27] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[28] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

[29] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2(6), 2017.