{"title": "Improving Simple Models with Confidence Profiles", "book": "Advances in Neural Information Processing Systems", "page_first": 10296, "page_last": 10306, "abstract": "In this paper, we propose a new method called ProfWeight for transferring information from a pre-trained deep neural network that has a high test accuracy to a simpler interpretable model or a very shallow network of low complexity and a priori low test accuracy. We are motivated by applications in interpretability and model deployment in severely memory constrained environments (like sensors). Our method uses linear probes to generate confidence scores through flattened intermediate representations. Our transfer method involves a theoretically justified weighting of samples during the training of the simple model using confidence scores of these intermediate layers. The value of our method is first demonstrated on CIFAR-10, where our weighting method significantly improves (3-4\\%) networks with only a fraction of the number of Resnet blocks of a complex Resnet model. We further demonstrate operationally significant results on a real manufacturing problem, where we dramatically increase the test accuracy of a CART model (the domain standard) by roughly $13\\%$.", "full_text": "Improving Simple Models with Con\ufb01dence Pro\ufb01les\n\nAmit Dhurandhar*\n\nIBM Research,\n\nYorktown Heights, NY\nadhuran@us.ibm.com\n\nRonny Luss\nIBM Research,\n\nYorktown Heights, NY\nrluss@us.ibm.com\n\nKarthikeyan Shanmugam*\n\nIBM Research,\n\nYorktown Heights, NY\n\nkarthikeyan.shanmugam2@ibm.com\n\nPeder Olsen\nIBM Research,\n\nYorktown Heights, NY\npederao@us.ibm.com\n\n\u2217\n\nAbstract\n\nIn this paper, we propose a new method called ProfWeight for transferring infor-\nmation from a pre-trained deep neural network that has a high test accuracy to a\nsimpler interpretable model or a very shallow network of low complexity and a\npriori low test accuracy. 
We are motivated by applications in interpretability and\nmodel deployment in severely memory constrained environments (like sensors).\nOur method uses linear probes to generate con\ufb01dence scores through \ufb02attened\nintermediate representations. Our transfer method involves a theoretically justi\ufb01ed\nweighting of samples during the training of the simple model using con\ufb01dence\nscores of these intermediate layers. The value of our method is \ufb01rst demonstrated\non CIFAR-10, where our weighting method signi\ufb01cantly improves (3-4%) networks\nwith only a fraction of the number of Resnet blocks of a complex Resnet model.\nWe further demonstrate operationally signi\ufb01cant results on a real manufacturing\nproblem, where we dramatically increase the test accuracy of a CART model (the\ndomain standard) by roughly 13%.\n\n1\n\nIntroduction\n\nComplex models such as deep neural networks have shown remarkable success in applications such\nas computer vision, speech and time series analysis [15, 18, 26, 10]. One of the primary concerns with\nthese models has been their lack of transparency which has curtailed their widespread use in domains\nwhere human experts are responsible for critical decisions [21]. Recognizing this limitation, there\nhas been a surge of methods recently [29, 27, 5, 28, 25] to make deep networks more interpretable.\nThese methods highlight important features that contribute to the particular classi\ufb01cation of an input\nby a deep network and have been shown to reasonably match human intuition.\nWe, in this paper, however, propose an intuitive model agnostic method to enhance the performance\nof simple models (viz. lasso, decision trees, etc.) using a pretrained deep network. A natural question\nto ask is, given the plethora of explanation techniques available for deep networks, why do we care\nabout enhancing simple models? 
Here are a few reasons why simple models are still important.\nDomain Experts' Preference: In applications where the domain experts are responsible for critical\ndecisions, they usually have a favorite model (viz. lasso in medicine, decision trees in advanced\nmanufacturing) that they trust [31]. Their preference is to use something they have been using for\nyears and are comfortable with.\n\n\u2217First two authors have equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Above we depict our general idea. In (a), we see the k hidden representations H1 \u00b7\u00b7\u00b7 Hk\nof a pretrained neural network. The diode symbols (triangle with vertical line) attached to each Hi\n\u2200i \u2208 {1, ..., k} denote the probes Pi as in [3], with the \u02c6Yi denoting the respective outputs. In (b)-(c),\nwe see example plots created by plotting the confidence score of the true label at each probe. In (b),\nwe see a well-written digit \"4\", which is presumably an easy example to classify, and hence the confidence\nscores are high even at lower-level probes. This sharply contrasts with the curve in (c), which is for a much\nharder example of the digit \"4\".\n\nSmall Data Settings: Companies usually have limited amounts of usable data collected for their\nbusiness problems. As such, simple models are often preferred here as they are less likely to\noverfit the data and can in addition provide useful insight [20]. In such settings, improving the simple\nmodels using a complex model trained on a much larger publicly/privately available corpus with the\nsame feature representation as the small dataset would be highly desirable.\nResource-Limited Settings: In addition to interpretability, simple models are also useful in settings\nwhere there are power and memory constraints. 
For example, in certain Internet-of-Things (IoT) settings [24],\nsuch as those involving mobile devices or unmanned aerial vehicles (UAVs) [11], there are strict power\nand memory constraints which preclude the deployment of large deep networks. In such settings,\nneural networks with only a few layers and possibly up to a few tens of thousands of neurons are\nconsidered reasonable to deploy.\nWe propose a method where we add probes to the intermediate layers of a deep neural network. A\nprobe is essentially a logistic classifier (linear model with bias followed by a softmax) added to an\nintermediate layer of a pretrained neural network so as to obtain its predictions from that layer. We call\nthese linear probes throughout this paper. This is depicted in figure 1(a), where k probes are added to\nk hidden layers. Note that there is no backpropagation of gradients through the probes to the hidden\nlayers. In other words, the hidden representations are fixed once the neural network is trained, with\nonly the probes being updated to fit the labels based on these previously learned representations. Also\nnote that we are not required to add probes to each layer. We may do so only at certain layers which\nrepresent logical units for a given neural network architecture. For example, in a Resnet [18] we may\nadd probes only after each Residual unit/block.\nThe confidence scores of the true label of an input, when plotted at each of the probe outputs, form a\ncurve that we call a confidence profile for that input. This is seen in figure 1 (b)-(c). We now want to\nsomehow use these confidence profiles to improve our simple model. It\u2019s worth noting that probes\nhave been used before, but for a different purpose. For instance in [3], the authors use them to study\nproperties of the neural network in terms of its stability and dynamics, but not for information transfer\nas we do. 
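Concretely, such a probe is just multinomial logistic regression fit on the frozen, flattened representations. The following minimal numpy sketch (function names are ours for illustration, not from the paper's code, and it assumes the representations have already been extracted as an array) shows both training a probe and reading off the true-label confidence scores Pu(Ru(x))[y]:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_probe(R, y, num_classes, lr=0.5, epochs=200):
    """Fit a linear probe P_u(r) = softmax(W r + b) on frozen, flattened
    representations R (m x d). Gradients never reach the network: only W, b move."""
    m, d = R.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot labels
    for _ in range(epochs):
        P = softmax(R @ W + b)
        G = (P - Y) / m          # gradient of mean cross-entropy w.r.t. logits
        W -= lr * R.T @ G
        b -= lr * G.sum(axis=0)
    return W, b

def probe_confidence(W, b, R, y):
    """Confidence score of the true label at this probe: P_u(R_u(x))[y]."""
    P = softmax(R @ W + b)
    return P[np.arange(len(y)), y]
```

Stacking `probe_confidence` outputs across probes for one input yields exactly the confidence profile plotted in figure 1 (b)-(c).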
We consider functions of these confidence scores, starting from an intermediate layer up to\nthe final layer, to weight samples during training of the simple model. The first function we consider,\nthe area under the curve (AUC) traced by the confidence scores, yields much improvement in the simple\nmodel's performance. We then turn to learning the weights using neural networks that take as input\nthe confidence scores and output an optimal weighting. The choice of intermediate layers is made\nbased on the simple model's complexity, as described later.\nWe observe in experiments that our proposed method can improve performance of simple models\nthat are desirable in the respective domains. On CIFAR [19], we improve a simple neural network\nwith very few Resnet blocks which can be deployed on UAVs and in IoT applications where there\nare memory and power constraints [24]. On a real manufacturing dataset, we significantly improve\n\n\fAlgorithm 1 ProfWeight\n\nInput: k-unit neural network N, learning algorithm for simple model LS, dataset DN used to\ntrain N, dataset DS = {(x1, y1), ..., (xm, ym)} to train a simple model, and margin parameter \u03b1.\n1) Attach probes P1, ..., Pk to the k units of N.\n2) Train probes based on DN and obtain errors e1, ..., ek on DS. {There is no backpropagation of\ngradients here to the hidden units/layers of N.}\n3) Train the simple model S \u2190 LS(DS, \u03b2, 1_m) and obtain its error eS. {S is obtained by unweighted\ntraining. 1_m denotes an m-dimensional vector of 1s.}\n4) Let I \u2190 {u | eu \u2264 eS \u2212 \u03b1}. {I contains the indices of all probes that are more accurate than the\nsimple model S by a margin \u03b1 on DS.}\n5) Compute weights w using Algorithm 2 or 3 for AUC or neural network, respectively.\n6) Obtain simple model Sw,\u03b2 \u2190 LS(DS, \u03b2, w). {Train the simple model on DS along with the\nweights w associated with each input.}\nreturn Sw,\u03b2\n\nAlgorithm 2 AUC Weight Computation\n\nInput: Neural network N, probes Pu, dataset DS, and index set I from Algorithm 1.\n1) Set i \u2190 1, w = 0_m (m-vector of zeros)\nwhile i \u2264 m do\n2) Obtain confidence scores {ciu = Pu(Ru(xi))[yi] | u \u2208 I}.\n3) Compute wi \u2190 (1/|I|) \u03a3_{u\u2208I} ciu. {In other words, estimate the AUC for sample (xi, yi) \u2208 DS\nbased on the probes indexed by I. | \u00b7 | denotes cardinality.}\n4) Increment i, i.e., i \u2190 i + 1\nend while\nreturn w\n\na decision tree classifier, which is the method of choice of a fab engineer working in an advanced\nmanufacturing plant.\nThe primary intuition behind our approach is to identify examples that the simple model will most\nlikely fail on, i.e. to identify truly hard examples. We then want to inform the simple model to ignore\nthese examples and make it expend more effort on other relatively easier examples that it could\nperform well on, with the eventual hope of improving generalization performance. This is analogous\nto a teacher (i.e. complex model) informing a student (i.e. 
simple model) about aspects of the syllabus\nhe should focus on and which he/she could very well ignore since it is way above his/her current level.\nWe further ground this intuition and justify our approach by showing that it is a speci\ufb01c instance of a\nmore general procedure that weights examples so as to learn the optimal simple model.\n\n2 General Framework\n\nIn this section we provide a simple method to transfer information from a complex model to a simple\none by characterizing the hardness of examples. We envision doing this with the help of con\ufb01dence\npro\ufb01les that are obtained by adding probes to different layers of a neural network.\nAs seen in \ufb01gure 1(b)-(c), our intuition is that easy examples should be resolved by the network, that is,\nclassi\ufb01ed correctly with high con\ufb01dence at lower layer probes themselves, while hard examples would\neither not be resolved at all or be resolved only at or close to the \ufb01nal layer probes. This captures\nthe notion of the network having to do more work and learn more \ufb01nely grained representations for\ngetting the harder examples correctly classi\ufb01ed. One way to capture this notion is to compute the area\nunder the curve (AUC) traced by the con\ufb01dence scores at each of the probes for a given input-output\npair. AUC amounts to averaging the values involved. Thus, as seen in \ufb01gure 1(b), the higher the\nAUC, the easier is the example to classify. Note that the con\ufb01dence scores are for the true label of\nthat example and not for the predicted label, which may be different.\n\n3\n\n\fAlgorithm 3 Neural Network Weight Computation\n\nInput: Weight space C, dataset DS, # of iterations N and index set I from Algorithm 1.\n1) Obtain con\ufb01dence scores {ciu = Pu(Ru(xi))[yi] | u \u2208 I} for xi when predicting the class yi\nusing the probes Pu for i \u2208 {1, . . . 
, m}.\n2) Initialize i = 1, w0 = 1_m and \u03b20 (simple model parameters)\nwhile i \u2264 N do\n3) Update simple model parameters: \u03b2i = argmin_\u03b2 \u03a3_{j=1}^{m} \u03bb(S_{w_{i\u22121},\u03b2}(xj), yj)\n4) Update weights: wi = argmin_{w\u2208C} \u03a3_{j=1}^{m} \u03bb(S_{w,\u03b2i}(xj), yj) + \u03b3R(w), where R(\u00b7) is a\nregularization term set to ((1/m) \u03a3_{j=1}^{m} wj \u2212 1)^2 with scaling parameter \u03b3. {Note that the weight\nspace C restricts w to be a neural network that takes as input the confidence scores ciu}\n5) Increment i, i.e., i \u2190 i + 1\nend while\nreturn w = wN\n\nWe next formalize this intuition, which suggests that a truly hard example is one that is more of an outlier\nthan a prototypical example of the class that it belongs to. In other words, if X \u00d7 Y denotes the\ninput-output space and p(x, y) denotes the joint distribution of the data, then a hard example (xh, yh)\nhas low p(yh|xh).\nA learning algorithm LS is trying to learn a simple model that \u201cbest\" matches p(y|x) so as to have\nlow generalization error. The dataset DS = {(x1, y1), ..., (xm, ym)} used to produce the simple model\nmay or may not be representative of p(x, y), and so may not produce this best match. We thus have\nto bias DS and/or the loss of LS so that we produce this best match. The most natural way to bias is\nby associating weights W = {w1, ..., wm} with each of the m examples (x1, y1), ..., (xm, ym) in DS.\nThis setting thus seems to have some resemblance to covariate shift [1], where one is trying to match\ndistributions. Our goal here, however, is not to match distributions but to bias the dataset in such a\nway that we produce the best performing simple model.\nSuppose \u03bb(\u00b7, \u00b7) denotes a loss function, w a vector of m weights to be estimated for examples in DS, and\nSw,\u03b2 = LS(DS, \u03b2, w) is a simple model with parameters \u03b2 that is trained by LS on the weighted\ndataset. 
C is the space of allowed weights based on constraints (viz. a penalty on extremely low\nweights) that would eliminate trivial solutions such as all weights being close to zero, and B is the\nsimple model\u2019s parameter space. Then, ideally, we would want to solve the following optimization\nproblem:\n\nS\u2217 = S_{w\u2217,\u03b2\u2217} = argmin_{w\u2208C} min_{\u03b2\u2208B} E[\u03bb(S_{w,\u03b2}(x), y)]    (1)\n\nThat is, we want to learn the optimal simple model S\u2217 by estimating the corresponding optimal\nweights w\u2217 which are used to weight examples in DS. It is known that not all deep networks are\ngood density estimators [2]. Hence, our method does not just rely on the output confidence score for\nthe true label, as we describe next.\n\n2.1 Algorithm Description\nWe first train the complex model N on a data set DN and then freeze the resulting weights. Let U\nbe the set of logical units whose representations we will use to train probes, and let Ru(x) denote\nthe flattened representation after the logical unit u on input x to the trained network N. We train\nthe probe function Pu(\u00b7) = \u03c3(W \u00b7 + b), where W \u2208 R^{k\u00d7|Ru(x)|}, b \u2208 R^k, \u03c3(\u00b7) is the standard softmax\nfunction, and k is the number of classes, on the flattened representations Ru(x) to optimize the\ncross-entropy with the labels y in the training data set DN. For a label y among the class labels,\nPu(Ru(x))[y] \u2208 [0, 1] denotes the confidence score of the probe on label y.\nGiven that the simple model may have a certain performance, we do not want to use very low-level\nprobe confidence scores to convey hardness of examples to it. A teacher must be at least as good as\nthe student, and so we compute weights in Algorithm 1 only based on those probes that are roughly\nmore accurate than the simple model. We also have a parameter \u03b1 which can be thought of as a margin\nparameter determining how much better the weakest teacher should be. 
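Once the probe confidences and errors are in hand, the weight computation of Algorithms 1 and 2 (the margin-filtered index set I followed by per-sample averaging) reduces to a few array operations. A minimal numpy sketch, with illustrative names of our own and a precomputed m x k matrix of true-label confidences assumed:

```python
import numpy as np

def profweight_auc(conf, probe_errs, simple_err, alpha=0.0):
    """Sketch of the ProfWeight/AUC weighting.

    conf[i, u]    = P_u(R_u(x_i))[y_i], the true-label confidence of probe u
                    on sample i (shape m x k).
    probe_errs[u] = error of probe u on D_S.
    simple_err    = error of the unweighted simple model on D_S.

    Keep only probes at least `alpha` more accurate than the simple model
    (the index set I), then weight each sample by the mean (AUC) of its
    confidence scores over those probes."""
    probe_errs = np.asarray(probe_errs)
    I = np.where(probe_errs <= simple_err - alpha)[0]  # index set I
    if len(I) == 0:
        raise ValueError("no probe beats the simple model by margin alpha")
    # w_i = (1/|I|) * sum_{u in I} c_iu
    return conf[:, I].mean(axis=1)
```

The returned vector is then passed as per-sample weights when retraining the simple model (step 6 of Algorithm 1).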
The higher the \u03b1, the better the worst performing teacher will be. As we will see in the experiments,\nit is not always optimal to use only the best performing model as the teacher: if the teacher is highly\naccurate, all confidences will be at or near 1, which provides no useful information to the simple\nstudent model.\nOur main algorithm, ProfWeight2, is detailed in Algorithm 1. At a high level it can be described as\nperforming the following steps:\n\n\u2022 Attach and train probes on intermediate representations of a high-performing neural network.\n\u2022 Train a simple model on the original dataset.\n\u2022 Learn weights for examples in the dataset as a function (AUC or neural network) of the\nsimple model and the probes.\n\u2022 Retrain the simple model on the final weighted dataset.\n\nIn step (5), one can compute weights either as the AUC (Algorithm 2) of the confidence scores of\nthe selected probes or by learning a regularized neural network (Algorithm 3) that takes the same\nconfidence scores as input. In Algorithm 3, we set the regularization term R(w) = ((1/m) \u03a3_{i=1}^{m} wi \u2212 1)^2 to\nkeep the weights from all going to zero. Also, as is standard practice when training neural networks\n[15], we impose an \u21132 penalty on the weights so as to prevent them from diverging. Note that,\nwhile the neural network is trained using batches of data, the regularization is still a function of all\ntraining samples. Algorithm 3 alternates between minimizing two blocks of variables (w and \u03b2).\nWhen the subproblems have solutions and are differentiable, all limit points of (wk, \u03b2k) can be shown\nto be stationary points [16]. The final step of ProfWeight is to train the simple model on DS with\nthe corresponding learned weights.\n\n2.2 Theoretical Justification\n\nWe next provide a justification for the regularized optimization in Step 4 of Algorithm 3. 
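Before the formal justification, the alternating scheme of Algorithm 3 can be illustrated with a deliberately simplified sketch. Here the "simple model" is binary logistic regression and, for brevity, the weights w are optimized directly as free per-sample parameters rather than produced by a network over probe confidences as in the paper; the point is only the alternation and the ((1/m) Σ wi − 1)^2 regularizer that keeps the weights from collapsing to zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def alt_min_weights(X, y, iters=30, gamma=10.0, lr=0.5):
    """Toy sketch of Algorithm 3's alternating minimization (our own
    simplification: free per-sample weights, logistic 'simple model').

    Alternates between:
      (a) beta-step: fit model parameters on the w-weighted loss;
      (b) w-step: descend on w for the same loss plus the regularizer
          gamma * ((1/m) * sum(w) - 1)^2."""
    m, d = X.shape
    w = np.ones(m)
    beta = np.zeros(d)
    for _ in range(iters):
        # (a) a few gradient steps on the weighted logistic loss in beta
        for _ in range(20):
            p = sigmoid(X @ beta)
            beta -= lr * (X.T @ (w * (p - y))) / m
        # (b) gradient step in w: per-sample loss plus regularizer gradient
        p = sigmoid(X @ beta)
        losses = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        reg_grad = 2 * gamma * (w.mean() - 1) / m
        w = np.clip(w - lr * (losses / m + reg_grad), 0.0, None)
    return w, beta
```

As in the paper's intuition, samples with persistently high loss (the "truly hard" ones) end up downweighted, while the regularizer keeps the average weight near 1.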
Intuitively,\nwe have a pre-trained complex model that has high accuracy on a test data set Dtest. Consider the\nbinary classification setting. We assume that Dtest has samples drawn from a uniform mixture of two\nclass distributions: P(x|y = 0) and P(x|y = 1). We have another simple model which is trained on\na training set Dtrain and has a priori low accuracy on Dtest. We would like to modify the training\nprocedure of the simple model such that the test accuracy is improved.\nSuppose training the simple model on the training dataset Dtrain results in classifier M. We view\nthis training procedure of simple models through a different lens: it is equivalent to the optimal\nclassification algorithm trained on the following class distribution mixtures: PM(x|y = 1) and\nPM(x|y = 0). We refer to this distribution as \u02dcDtrain. If we knew PM, the ideal way to bias an entry\n(x, y) \u2208 \u02dcDtrain in order to boost test accuracy would be to use the importance sampling\nweights w(x, y) = P(x|y)/PM(x|y) to account for the covariate shift between \u02dcDtrain and Dtest\n(note that with these weights, E_{PM(x|y)}[w(x, y) L(x, y)] = E_{P(x|y)}[L(x, y)] for any loss L, which is\nexactly why such weights correct the shift). Motivated\nby this, we look at the following parameterized set of weights, wM\u2032(x, y) = P(x|y)/PM\u2032(x|y), for every M\u2032\nin the simple model class. We now have the following result (the proof can be found in the supplement):\nTheorem 2.1. If wM\u2032 corresponds to weights on the training samples, then the constraint\nE_{PM(x|y)}[wM\u2032(x, y)] = 1 implies that wM\u2032(x, y) = P(x|y)/PM(x|y).\n\nIt is well known that the error of the best classifier (Bayes optimal) on a distribution\nof class mixtures is given by the total variation distance between them. That is:\nLemma 2.2. 
[8] The error of the Bayes optimal classifier trained on a uniform mixture of two class\ndistributions is given by: min_\u03b8 E_D[L\u03b8(x, y)] = 1/2 \u2212 (1/2) DTV(P(x|y = 1), P(x|y = 0)), where L(\u00b7)\nis the 0-1 loss function and \u03b8 is a parameterization over a class of classifiers that includes the Bayes\noptimal classifier. DTV is the total variation distance between two distributions. P(x|y) are the\nclass distributions in dataset D.\n\n2Code is available at https://github.ibm.com/Karthikeyan-Shanmugam2/Transfer/blob/master/README.md\n\n\fFrom Lemma 2.2 and Theorem 2.1, where \u03b8 corresponds to the parametrization of the simple model,\nit follows that:\n\nmin_{M\u2032, \u03b8 s.t. E_{\u02dcDtrain}[wM\u2032] = 1} E_{\u02dcD}[wM\u2032(x, y) L\u03b8(x, y)] = 1/2 \u2212 (1/2) DTV(P(x|y = 1), P(x|y = 0))    (2)\n\nThe right-hand side is indeed the performance of the Bayes optimal classifier on the test dataset\nDtest. The left-hand side justifies the regularized optimization in Step 4 of Algorithm 3, which is\nimplemented as a least squares penalty. It also justifies the min-min optimization in Equation 1,\nwhich is with respect to the weights and the parameters of the simple model.\n\n3 Experiments\n\nIn this section we experiment on datasets from two different domains. The first is a public benchmark\nvision dataset named CIFAR-10. The other is a chip manufacturing dataset obtained from a large\ncorporation. In both cases, we see the power of our method, ProfWeight, in improving the simple\nmodel.\nWe compare our method with training the simple model on the original unweighted dataset (Standard).\nWe also compare with Distillation [14], which is a popular method for training relatively simpler\nneural networks. We lastly compare results with weighting instances just based on the output\nconfidence scores of the complex neural network (i.e. 
output of the last probe Pk) for the true label\n(ConfWeight). This can be seen as a special case of our method where \u03b1 is set to the difference in\nerrors between the simple model and the complex network.\nWe consistently see that our method outperforms these competitors. This showcases the power of our\napproach in terms of performance and generality, since the simple model may not be minimizing\ncross-entropy loss, as is usually assumed when using Distillation.\n\n3.1 CIFAR-10\n\nWe now describe our methods on the CIFAR-10 dataset 3. We report results for multiple \u03b1\u2019s of our\nProfWeight scheme, including ConfWeight, which is a special case of our method. Model training\ndetails beyond those that appear below are given in the supplementary materials.\nComplex Model: We use the popular implementation of the Residual Network Model available\nfrom the TensorFlow authors 4, where simple residual units are used (no bottleneck residual units\nare used). The complex model has 15 Resnet units in sequence. The basic blocks each consist of\ntwo consecutive 3 \u00d7 3 convolutional layers with either 64, 128, or 256 filters, and our model has\nfive of each of these units. The first Resnet unit is preceded by an initial 3 \u00d7 3 convolutional layer\nwith 16 filters. The last Resnet unit is succeeded by an average pooling layer followed by a fully\nconnected layer producing 10 logits, one for each class. Details of the 15 Resnet units are given in\nthe supplementary material.\nSimple Models: We now describe our simple models, which are smaller Resnets that use a subset of\nthe 15 Resnet units in the complex model. All simple models have the same initial convolutional layer\nand finish with the same average pooling and fully connected layers as in the complex model above.\nWe have four simple models with 3, 5, 7, and 9 ResNet units. The approximate relative sizes of these\nmodels to the complex neural network are 1/5, 1/3, 1/2, 2/3, respectively. 
Further details about the ResNet units in each model are given in the supplementary material.\nProbes Used: The set of units U (as defined in Section 2.1) whose representations are used to train\nthe probes are the units in Table 1 of the trained complex model. There are a total of 18 units.\nTraining-Test Split: We split the available 50000 training samples from the CIFAR-10 dataset into\ntraining set 1, consisting of 30000 examples, and training set 2, consisting of 20000 examples. We\nsplit the 10000-sample test set into a validation set of 500 examples and a holdout test set of 9500\nexamples. All final test accuracies of the simple models are reported with respect to this holdout test\nset. The validation set is used to tune all models and hyperparameters.\n\n3We used the python version from https://www.cs.toronto.edu/~kriz/cifar.html.\n4Code was obtained from: https://github.com/tensorflow/models/tree/master/research/resnet\n\nMethod | SM-3 | SM-5 | SM-7 | SM-9\nStandard | 73.15 (\u00b10.70) | 75.78 (\u00b10.50) | 78.76 (\u00b10.35) | 79.90 (\u00b10.34)\nConfWeight | 76.27 (\u00b10.48) | 78.54 (\u00b10.36) | 81.46 (\u00b10.50) | 82.09 (\u00b10.08)\nDistillation | 65.84 (\u00b10.60) | 70.09 (\u00b10.19) | 73.40 (\u00b10.64) | 77.30 (\u00b10.16)\nProfWeightReLU | 77.52 (\u00b10.01) | 78.24 (\u00b10.01) | 80.16 (\u00b10.01) | 81.65 (\u00b10.01)\nProfWeightAUC | 76.56 (\u00b10.62) | 79.25 (\u00b10.36) | 81.34 (\u00b10.49) | 82.42 (\u00b10.36)\n\nTable 1: Averaged accuracies (%) of the simple models trained with various weighting methods and\ndistillation. The complex model achieved 84.5% accuracy. Weighting methods that average confidence\nscores of higher-level probes perform best or on par with the best in all cases. In each case, the\nimprovement over the unweighted model is about 3\u22124% in test accuracy. Distillation performs\nuniformly worse in all cases.\n\nComplex Model Training: The complex model is trained on training set 1. 
We obtained a test\naccuracy of 0.845 and keep this as our complex model. We note that although this is suboptimal with\nrespect to the Resnet performances of today, we have used only 30000 samples for training.\nProbe Training: Linear probes Pu(\u00b7) are trained on the representations produced by the complex model\non training set 1, each for 200 epochs. The trained probe confidence scores Pu(Ru(x)) are evaluated\non samples in training set 2.\nSimple Models Training: Each of the simpler models is trained only on training set 2, consisting\nof 20000 samples, for 500 epochs. All training hyperparameters are set to be the same as in the\nprevious cases. We train each simple model in Table 2 for the following different cases. Standard\ntrains a simple unweighted model. ConfWeight trains a weighted model where the weights are\nthe true label\u2019s confidence score from the complex model\u2019s last layer. Distillation trains the simple\nmodel using cross-entropy loss with soft targets obtained from the softmax outputs of the complex\nmodel\u2019s last layer rescaled by temperature t = 0.5 (tuned with cross-validation), as in the distillation of\n[14]. ProfWeightAUC and ProfWeightReLU train using Algorithm 1 with Algorithms 2 and 3 for the\nweighting scheme, respectively. Results are for layer 14 as the lowest layer (the margin parameter \u03b1\nwas set small, and \u03b1 = 0 corresponded to layer 13). More details, along with results for different\ntemperatures in distillation and margins in ProfWeight, are given in the supplementary materials.\nTest accuracies (their means and standard deviations, each averaged over about 4 runs) of each\nof the 4 simple models in Table 2, trained in the different ways described above, are provided in Figure\n3. 
Their numerical values in tabular form are given in Table 1.\nResults: From Table 1, it is clear that in all cases, the weights corresponding to the AUC of the probe\ncon\ufb01dence scores from unit 13 or 14 and upwards are among the best in terms of test accuracies. They\nsigni\ufb01cantly outperform distillation-based techniques and, further, are better than the unweighted test\naccuracies by 3 \u2212 4%. This shows that our ProfWeight algorithm performs really well. We notice\nthat in this case, the con\ufb01dence scores from the last layer or \ufb01nal probe alone are quite competitive as\nwell. This is probably due to the complex model accuracy not being very high, having been trained on\nonly 30000 examples. This might seem counterintuitive, but a highly accurate model will \ufb01nd almost\nall examples easy to classify at the last layer leading to con\ufb01dence scores that are uniformly close to\n1. Weighting with such scores then, is almost equivalent to no weighting at all. This is somewhat\nwitnessed in the manufacturing example where the complex neural network had an accuracy in the\n90s and ConfWeight did not enhance the CART model to the extent ProfWeight did. In any case,\nweighting based on the last layer is just a special instance of our method ProfWeight, which is seen to\nperform quite well.\n\n3.2 Manufacturing\n\nWe now describe how our method not only improved the performance of a CART model, but produced\noperationally signi\ufb01cant results in a semi-conductor manufacturing setting.\nSetup: We consider an etching process in a semi-conductor manufacturing plant. The goal is to\npredict the quantity of metal etched on each wafer \u2013 which is a collection of chips \u2013 without having\nto explicitly measure it using high precision tools, which are not only expensive but also substantially\nslow down the throughput of a fab. 
If T denotes the required specification and \u03b3 the allowed variation,\nthe target we want to predict is quantized into three bins, namely: (\u2212\u221e, T \u2212 \u03b3), (T + \u03b3, \u221e), and\nwithin spec, which is T \u00b1 \u03b3.\n\n\fFigure 2: Above we show the performance of the different methods on the manufacturing dataset.\n\nWe thus have a three-class problem, and the engineer's goal is not only to\npredict these classes accurately but also to obtain insight into ways to improve the process.\nFor each wafer we have 5104 input measurements for this process. The inputs consist of acid\nconcentrations, electrical readings, metal deposition amounts, time of etching, time since last cleaning,\nglass fogging, and various gas flows and pressures. The number of wafers in our dataset was 100,023.\nSince these wafers were time ordered, we split the dataset sequentially: the first 70% was used\nfor training and the most recent 30% was used for testing. Sequential splitting is a very standard\nprocedure for testing models in this domain, as predicting on the most recent set of wafers is\nmore indicative of the model's performance in practice than testing with random train/test splits\nusing procedures such as 10-fold cross validation.\nModeling and Results: We built a neural network (NN) with an input layer and five fully connected\nhidden layers of size 1024 each, and a final softmax layer outputting the probabilities for the three\nclasses. The NN had an accuracy of 91.2%. The NN was, however, not the model of choice for the\nfab engineer, who was more familiar and comfortable using decision trees.\nGiven this, we trained a CART-based decision tree on the dataset. As seen in figure 2, its accuracy\nwas 74.3%. Given the big gap in performance between these two methods, the engineers wanted an\nimproved interpretable model whose insights they could trust. 
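Retraining the interpretable student on the weighted dataset needs no custom machinery: most decision-tree learners accept per-sample weights directly. A minimal sketch of this final ProfWeight step (assuming scikit-learn as the CART implementation; the function name is ours, and the paper does not specify this library):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def retrain_weighted_cart(X, y, w, max_depth=5):
    """Final ProfWeight step for a CART student (sketch): fit the tree on
    D_S with the learned per-sample weights w via sample_weight."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X, y, sample_weight=w)
    return tree
```

Low-weight (hard) examples then contribute little to the impurity computations, so the tree's splits concentrate on the examples the student can plausibly get right.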
We thus tested by weighting instances\nbased on the actual con\ufb01dence scores outputted by the NN and then retraining CART. This improved\nthe CART performance slightly to 77.1%. We then used ProfWeightAUC, where \u03b1 was set to zero, to\ntrain CART whose accuracy bumped up signi\ufb01cantly to 87.3%, which is a 13% lift. Similar gains\nwere seen for ProfWeightReLU where accuracy reached 87.4%. For Distillation we tried 10 different\ntemperature scalings in multiples of 2 starting with 0.5. The best distilled CART produced a slight\nimprovement in the base model increasing its accuracy to 75.6%. We also compared with the decision\ntree extraction (DEx) method [6] which had a performance of 77.5%.\nOperationally Signi\ufb01cant Human Actions: We reported the top features based on the improved\nmodel to the engineer. These features were certain pressures, time since last cleaning and certain\nacid concentrations. The engineer based on this insight started controlling the acid concentrations\nmore tightly. This improved the total number of within spec wafers by 1.3%. Although this is a small\nnumber, it has huge monetary impact in this industry, where even 1% increase in yield can amount to\nbillions of dollars in savings.\n\n4 Related Work and Discussion\n\nOur information transfer procedures based on con\ufb01dence measures are related to Knowledge Dis-\ntillation and learning with privileged information [22]. The key difference is in the way we use\ninformation. We weight training instances by functions, such as the average, of the con\ufb01dence pro\ufb01les\nof the training label alone. This approach, unlike Distillation [14, 26, 30], is applicable in broader\nsettings like when target models are classi\ufb01ers optimized using empirical risk (e.g., SVM) where risk\ncould be any loss function. By weighting instances, our method uses any available target training\nmethods. 
Distillation works best with cross-entropy loss and other losses specifically designed for this purpose. Moreover, distilled networks are typically quite deep; they would neither be interpretable nor able to respect tight resource constraints on sensors. In [24], the authors showed that primarily shallow networks can be deployed on memory-constrained devices. The only papers to our knowledge that do thin down CNNs came about prior to ResNet, and their memory requirements are higher even compared to our complex Resnet model (2.5 M vs. 0.27 M parameters) [26].
It is also interesting to note that calibrated scores of a highly accurate model do not imply good transfer. This is because, post-calibration, the majority of the confidence scores would still be high (say, >90%), and these scores may not reflect the true hardness of examples. Temperature scaling is one of the most popular methods for calibrating neural networks [17]. Distillation, which involves temperature scaling, showed subpar performance in our experiments.
There have been other strategies [9, 4, 6] for transferring information from bigger models to smaller ones; however, they are all similar in spirit to Distillation, where the complex model's predictions are used to train a simpler model. Weighting instances also has an intuitive justification: viewing the complex model as a teacher and the simple (target) model as a student, the teacher tells the student which aspects to focus on (i.e., easy instances) and which to ignore.
There are other strategies that weight examples, although their general setup and motivation are different, for instance curriculum learning (CL) [7] and boosting [13]. CL is a training strategy where easy examples are given to a learner first, followed by more complex ones. The determination of what is simple as opposed to complex is typically done by a human; there is usually no automatic, machine-driven gradation of examples.
Also, sometimes the complexity of the learner is increased during the training process so that it can accurately model more complex phenomena. In our case, however, the complexity of the simple model is assumed fixed, given applications in interpretability [12, 23, 25] and deployment in resource-limited settings [24, 11]. Moreover, we are searching for a single set of weights on the original inputs (not on some intermediate learned representations) such that the fixed simple model trained with them gives the best possible performance. Boosting is even more remotely related to our setup. In boosting there is no high-performing teacher, and one generally grows an ensemble of weak learners, which, as just mentioned, is not reasonable in our setting. There, hard examples w.r.t. a previous "weak" learner are highlighted for subsequent training to create diversity. In our case, hard examples are hard w.r.t. an accurate complex model, which means their labels are near random; hence, it is important to instead highlight the relatively easier examples when training the simple model.
In this work we proposed a strategy to improve simple models, whose complexity is fixed, with the help of a high-performing neural network. The crux of the idea was to weight examples by a function of the confidence scores for the true label obtained from intermediate representations of the neural network at various layers. We accomplished this by attaching probes to intermediate layers in order to obtain confidence scores. As observed in the experiments, our idea of weighting examples shows a lot of promise when we want to improve (interpretable) models trained using empirical risk minimization, or when we want to improve a (really) small neural network that must respect certain power and memory constraints.
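The probes mentioned above follow the linear-probe idea of [3]: a linear (softmax) classifier trained on a flattened intermediate representation, whose output probability for the true label serves as that layer's confidence score. A minimal numpy sketch, assuming representations H are already extracted and flattened (the training loop and names are ours, not the paper's):

```python
import numpy as np

def train_linear_probe(H, y, n_classes, lr=0.5, epochs=200):
    """Train a linear probe (softmax regression) on flattened intermediate
    representations H of shape (n_samples, d) with integer labels y, via
    plain gradient descent on the cross-entropy loss. Returns a function
    mapping representations to class-probability rows."""
    n, d = H.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                       # one-hot targets
    for _ in range(epochs):
        z = H @ W + b
        z = z - z.max(axis=1, keepdims=True)       # numerical stability
        P = np.exp(z)
        P = P / P.sum(axis=1, keepdims=True)
        G = (P - Y) / n                            # cross-entropy gradient
        W = W - lr * (H.T @ G)
        b = b - lr * G.sum(axis=0)

    def probe(H_new):
        z = H_new @ W + b
        z = z - z.max(axis=1, keepdims=True)
        P = np.exp(z)
        return P / P.sum(axis=1, keepdims=True)

    return probe
```

For each sample, the confidence score fed into the weighting is then the probe's output probability for that sample's true label.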
In such resource-constrained situations, Distillation seems to have limited impact in creating accurate models.
Our method could also be used in small-data settings, which would be analogous to our setup on CIFAR-10, where the training sets for the complex and simple models were distinct. In such a setting, we would obtain soft predictions from the probes of the complex model for the small data samples and use ProfWeight with these scores to weight the smaller training set. A complementary metric that would also be interesting to look at is the time (or number of epochs) it takes the simple model trained on weighted examples to reach the unweighted accuracy. If there are large savings in time, this would still be useful in power-constrained settings.
In the future, we would like to explore more adaptive schemes and hopefully understand them theoretically, as we have done in this work. Another potentially interesting future direction is to use a combination of the improved simple model and the complex model to make decisions. For instance, if we know that the simple model's performance (almost) matches that of the complex model on a part of the domain, then we could use it to make predictions for the corresponding examples and use the complex model otherwise. This could have applications in interpretability as well as in speeding up inference for real-time applications where the complex models could potentially be large.

Acknowledgement

We would like to thank the anonymous area chair and reviewers for their constructive comments.

References

[1] D. Agarwal, L. Li, and A. J. Smola. Linear-time estimators for propensity scores. In 14th Intl. Conference on Artificial Intelligence and Statistics (AISTATS), pages 93–100, 2011.

[2] A. Subramanya, S. Srinivas, and R. V. Babu. Confidence estimation in deep neural networks via density modelling. arXiv preprint arXiv:1707.07013, 2017.

[3] G. Alain and Y. Bengio.
Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.

[4] L. J. Ba and R. Caruana. Do deep nets really need to be deep? CoRR, abs/1312.6184, 2013.

[5] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.

[6] O. Bastani, C. Kim, and H. Bastani. Interpreting blackbox models via model extraction. arXiv preprint arXiv:1705.08504, 2017.

[7] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, 2009.

[8] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.

[9] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.

[10] P.-L. Carrier and A. Courville. Challenges in representation learning: Facial expression recognition challenge. ICML, 2013.

[11] Y.-H. Chen, J. Emer, and V. Sze. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture, 2016.

[12] A. Dhurandhar, V. Iyengar, R. Luss, and K. Shanmugam. TIP: Typifying the interpretability of procedures. arXiv preprint arXiv:1706.02952, 2017.

[13] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[14] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[15] I. Goodfellow, Y.
Bengio, and A. Courville. Deep Learning. MIT Press, 2016.

[16] L. Grippo and M. Sciandrone. On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. Oper. Res. Lett., 26(3):127–136, 2000.

[17] C. Guo, G. Pleiss, Y. Sun, and K. Weinberger. On calibration of modern neural networks. Intl. Conference on Machine Learning (ICML), 2017.

[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Intl. Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[19] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech. Report, 2009.

[20] M. Lindstrom. Small Data: The Tiny Clues that Uncover Huge Trends. St. Martin's Press, 2016.

[21] Z. C. Lipton. (Deep learning's deep flaws)'s deep flaws. In KDnuggets, 2015. https://www.kdnuggets.com/2015/01/deep-learning-flaws-universal-machine-learning.html.

[22] D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik. Unifying distillation and privileged information. In International Conference on Learning Representations (ICLR 2016), 2016.

[23] G. Montavon, W. Samek, and K.-R. Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 2017.

[24] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernandez-Lobato, G.-Y. Wei, and D. Brooks. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. IEEE Explore, 2016.

[25] M. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In ACM SIGKDD Intl. Conference on Knowledge Discovery and Data Mining, 2016.

[26] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2015.

[27] S. Lundberg and S.-I. Lee. Unified framework for interpretable methods.
In Advances in Neural Information Processing Systems, 2017.

[28] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. arXiv preprint arXiv:1610.02391, 2016.

[29] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.

[30] S. Tan, R. Caruana, G. Hooker, and Y. Lou. Auditing black-box models using transparent model distillation with side information. CoRR, 2017.

[31] S. Weiss, A. Dhurandhar, and R. Baseman. Improving quality control by early prediction of manufacturing outcomes. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2013.