{"title": "Launch and Iterate: Reducing Prediction Churn", "book": "Advances in Neural Information Processing Systems", "page_first": 3179, "page_last": 3187, "abstract": "Practical applications of machine learning often involve successive training iterations with changes to features and training examples. Ideally, changes in the output of any new model should only be improvements (wins) over the previous iteration, but in practice the predictions may change neutrally for many examples, resulting in extra net-zero wins and losses, referred to as unnecessary churn. These changes in the predictions are problematic for usability for some applications, and make it harder and more expensive to measure if a change is statistically significant positive. In this paper, we formulate the problem and present a stabilization operator to regularize a classifier towards a previous classifier. We use a Markov chain Monte Carlo stabilization operator to produce a model with more consistent predictions without adversely affecting accuracy. We investigate the properties of the proposal with theoretical analysis. Experiments on benchmark datasets for different classification algorithms demonstrate the method and the resulting reduction in churn.", "full_text": "Launch and Iterate: Reducing Prediction Churn\n\nQ. Cormier\nENS Lyon\n\n15 parvis Ren\u00e9 Descartes\n\nLyon, France\n\nM. Milani Fard, K. Canini, M. R. Gupta\n\nGoogle Inc.\n\n1600 Amphitheatre Parkway\nMountain View, CA 94043\n\nquentin.cormier@ens-lyon.fr\n\n{mmilanifard,canini,mayagupta}@google.com\n\nAbstract\n\nPractical applications of machine learning often involve successive training itera-\ntions with changes to features and training examples. Ideally, changes in the output\nof any new model should only be improvements (wins) over the previous iteration,\nbut in practice the predictions may change neutrally for many examples, resulting\nin extra net-zero wins and losses, referred to as unnecessary churn. 
These changes\nin the predictions are problematic for usability for some applications, and make it\nharder and more expensive to measure if a change is statistically significantly positive.\nIn this paper, we formulate the problem and present a stabilization operator to regularize\na classifier towards a previous classifier. We use a Markov chain Monte Carlo\nstabilization operator to produce a model with more consistent predictions without\nadversely affecting accuracy. We investigate the properties of the proposal with\ntheoretical analysis. Experiments on benchmark datasets for different classification\nalgorithms demonstrate the method and the resulting reduction in churn.\n\n1 The Curse of Version 2.0\n\nIn most practical settings, training and launching an initial machine-learned model is only the first\nstep: as new and improved features are created, additional training data is gathered, and the model\nand learning algorithm are improved, it is natural to launch a series of ever-improving models. Each\nnew candidate may bring wins, but also unnecessary changes. In practice, it is desirable to minimize\nany unnecessary changes for two key reasons. First, unnecessary changes can hinder usability\nand debuggability as they can be disorienting to users and follow-on system components. Second,\nunnecessary changes make it more difficult to measure with statistical confidence whether the change\nis truly an improvement. For both these reasons, there is great interest in making only those changes\nthat are wins, and minimizing any unnecessary changes, while making sure such a process does not\nhinder the overall accuracy objective.\nThere is already a large body of work in machine learning that treats the stability of learning\nalgorithms. These range from the early works of Devroye and Wagner [1] and Vapnik [2, 3] to more\nrecent studies of learning stability in more general hypothesis spaces [4, 5, 6]. 
Most of the literature\non this topic focuses on stability of the learning algorithm in terms of the risk or loss function and how\nsuch properties translate into uniform generalization with specific convergence rates. We build on\nthese notions, but the problem treated here is substantively different.\nWe address the problem of training consecutive classifiers to reduce unnecessary changes in the\npresence of realistic evolution of the problem domain and the training sets over time. The main\ncontributions of this paper include: (I) discussion and formulation of the \u201cchurn\u201d metric between\ntrained models, (II) design of stabilization operators for regularization towards a previous model, (III)\nproposing a Markov chain Monte Carlo (MCMC) stabilization technique, (IV) theoretical analysis of\nthe proposed stabilization in terms of churn, and (V) empirical analysis of the proposed methods on\nbenchmark datasets with different classification algorithms.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nTable 1: Win-loss ratio (WLR) needed to establish that a change is statistically significant at the p = 0.05\nlevel for k wins out of n diffs from a binomial distribution. The empirical WLR column shows the\nWLR one must actually see in the diffs. The true WLR column is the WLR the change must have so\nthat any random draw of diffs has at least a 95% chance of producing the needed empirical WLR.\n\n# Diffs | Min # Wins Needed | Max # Losses Allowed | Empirical WLR Needed | True WLR Needed\n10 | 9 | 1 | 9.000 | 26.195\n100 | 59 | 41 | 1.439 | 1.972\n1,000 | 527 | 473 | 1.114 | 1.234\n10,000 | 5,083 | 4,917 | 1.034 | 1.068\n\n1.1 Testing for Improvements\n\nIn the machine learning literature, it is common to compare classifiers on a fixed pre-labeled test set.\nHowever, a fixed test set has a few practical downsides. 
First, if many potential changes to the model\nare evaluated on the same dataset, it becomes difficult to avoid observing spurious positive effects that\nare actually due to chance. Second, the true test distribution may be evolving over time, meaning that\na fixed test set will eventually diverge from the true distribution of interest. Third, and most important\nto our discussion, any particular change may affect only a small subset of the test examples, leaving\ntoo small a sample of differences (diffs) to determine whether a change is statistically significant.\nFor example, suppose one has a fixed test set of 10,000 samples with which to evaluate a classifier.\nConsider a change to one of the features, say a Boolean string-similarity feature that causes the\nfeature to match more synonyms, and suppose that re-training a classifier with this small change to\nthis one feature impacts only 0.1% of random examples. Then only 10 of the 10,000 test examples\nwould be affected. As shown in the first row of Table 1, given only 10 diffs, there must be 9 or more\nwins to declare the change statistically significantly positive for p = 0.05.\nNote that cross-validation (CV), even in leave-one-out form, does not solve this issue. First, we are\nstill bound by the size of the training set which might not include enough diffs between the two\nmodels. Second, and more importantly, the model in the previous iteration has likely seen the entire\ndataset, which breaks the independence assumption needed for the statistical test.\nTo address these problems and ensure a fresh, sufficiently large test set for each comparison, practitioners\noften instead measure changes on a set of diffs for the proposed change. 
For example, to\ncompare classi\ufb01er A and B, each classi\ufb01er is evaluated on a billion unlabeled examples, and then the\nset of diffs is de\ufb01ned as those examples for which classi\ufb01ers A and B predict a different class.\n\n1.2 Churn\n\nWe de\ufb01ne the churn between two models as the expected percent of diffs sampled from the test\ndistribution. For a \ufb01xed accuracy gain, less churn is better. For example, if classi\ufb01er A has accuracy\n90% and classi\ufb01er B has accuracy 91%, then the best case is if classi\ufb01er B gets the same 90% of\nexamples correct as classi\ufb01er A, while correcting A\u2019s errors on 1% of the data. Churn is thus only\n1% in this case, and all diffs between A and B will be wins for B. Therefore the improvement of\nB over A will achieve statistical signi\ufb01cance after labelling a mere 10 diffs. The worst case is if\nclassi\ufb01er A is right on the 9% of examples that B gets wrong, and B is right on the 10% of examples\nthat A gets wrong. In this case, churn is 19%, and a given diff will only have probability of 10/19 of\nbeing a win for B, and almost 1,000 diffs will have to be labeled to be con\ufb01dent that B is better.\nOn Statistical Signi\ufb01cance: Throughout this paper, we assume that every diff is independent and\nidentically distributed with some probability of being a win for the test model vs. the base model.\nThus, the probability of k wins in n trials follows a binomial distribution. Con\ufb01dence intervals can\nprovide more information than a p-value, but p-values are a useful summary statistic to motivate the\nproblem and proposed solution, and are relevant in practice; for a longer discussion see e.g. [7].\n\n2 Reducing Churn for Classi\ufb01ers\n\nIn this paper, we propose a new training strategy for reducing the churn between classi\ufb01ers. One\nspecial case is how to train a classi\ufb01er B to be low-churn given a \ufb01xed classi\ufb01er A. 
We treat that\n\n2\n\n\fDe-Churning Markov Chain\n\nT1\n\nF \u2217\n\n1\n\nT2\n\nF \u2217\n\n2\n\n. . .\n\n. . .\n\nTK\n\nF \u2217\n\nK\n\nA\n\nTA\n\nA\u2217\n\nB\n\nTB\n\nB\u2217\n\nFigure 1: The orange nodes illustrate a Markov Chain, at each step the classi\ufb01er F \u2217\nt is regularized\nt\u22121 using the stabilization operator S, and each step trained on\ntowards the previous step\u2019s classi\ufb01er F \u2217\na different random training set Tt. We run K steps of this Markov chain, for K large enough so that\nthe distribution of F \u2217\nK, TA) is then\ndeployed. Later, some changes are proposed, and a new classi\ufb01er B\u2217 is trained on training set TB but\nregularized towards A\u2217 using B\u2217 = S(A\u2217, TB). We compare this proposal in terms of churn and\naccuracy to the green nodes, which do not use the proposed stabilization.\n\nk is close to a stationary distribution. The classi\ufb01er A\u2217 = S(F \u2217\n\nspecial case as well as a broader problem: a framework for training both classi\ufb01ers A and B so that\nclassi\ufb01er B is expected to have low-churn relative to classi\ufb01er A, though when we train A we do not\nyet know exactly the changes B will incorporate. We place no constraints on the kind of classi\ufb01ers or\nthe kind of future changes allowed.\nOur solution consists of two components: a stabilization operator that regularizes classi\ufb01er B to be\ncloser in predictions to classi\ufb01er A; and a randomization of the training set that attempts to mimic\nexpected future changes.\nWe consider a training set T = {(xi, yi)}m\ni=1 of m samples with each D-dimensional feature vector\nxi \u2208 X \u2286 RD and each label yi \u2208 Y = {\u22121, 1}. Samples are drawn i.i.d. from distribution D.\nDe\ufb01ne a classi\ufb01er f : RD \u2192 {\u22121, 1}, and the churn between two classi\ufb01ers f1 and f2 as:\n\n(1)\n\nC(f1, f2) =\n\nE\n\n(X,Y )\u223cD[1f1(X)f2(X)<0],\n\nwhere 1 is the indicator function. 
We are given training sets TA and TB to train the first and second\nversion of the model respectively. TB might add or drop features or examples compared to TA.\n\n2.1 Perturbed Training to Imitate Future Changes\nConsider a random training set drawn from a distribution P(TA), such that different draws may have\ndifferent training samples and different features. We show that one can train an initial classifier to be\nmore consistent in predictions for different realizations of the perturbed training set by iteratively\ntraining on a series of i.i.d. random draws T_1, T_2, . . . from P(TA). We choose P(TA) to model a\ntypical expected future change to the dataset. For example, if we think a likely future change will\nadd 5% more training data and one new feature, then we would define a random training set to be a\nrandom 95% of the m examples in TA, while dropping a feature at random.\n\n2.2 Stabilized Training Based On A Previous Model using a Markov Chain\n\nWe propose a Markov chain Monte Carlo (MCMC) approach to form a distribution over classifiers\nthat are consistent in predictions w.r.t. the distribution P(TA) on the training set. Let S denote\na regularized training that outputs a new classifier F\u2217_{t+1} = S(F\u2217_t, T_{t+1}) where F\u2217_t is a previous\nclassifier and T_{t+1} is the current training set. Applying S repeatedly to random training sets T_t forms\na Markov chain as shown in Figure 1. We expect this chain to produce a stationary peaked distribution\non classifiers robust to the perturbation P(TA). We sample a model from this resulting distribution\nafter K steps.\nWe end the proposed Markov chain with a classifier A\u2217 trained on the full training set TA, that is,\nA\u2217 = S(F\u2217_K, TA). Classifier A\u2217 is the initial launched model, and has been pre-trained to be robust\nto the kind of changes we expect to see in some future training set TB. 
Later, classifier B\u2217 should be\ntrained as B\u2217 = S(A\u2217, TB). We expect the chain to have reduced the churn C(A\u2217, B\u2217) compared to\nthe churn C(A, B) that would have resulted from training classifiers A and B without the proposed\nstabilization. See Figure 1 for an illustration. Note that this chain only needs to be run for the first\nversion of the model.\n\nOn Regularization Effect of Perturbed Training: One can view the perturbation of the dataset\nand random feature drops during the MCMC run as a form of regularization, resembling the dropout\ntechnique [8] now popular in deep, convolutional and recurrent neural networks (see e.g. [9] for a\nrecent survey). Such regularization can result in better generalization error, and our empirical results\nshow some evidence of such an effect. See further discussion in the experiments section.\nPerturbation Chain as Longitudinal Study: The chain in Figure 1 can also be viewed as a study\nof the stabilization operator upon several iterations of the model, with each trained and anchored\non the previous version. It can help us assess if the successive application of the operator has any\nadverse effect on the accuracy or if the resulting churn reduction diminishes over time.\n\n3 Stabilization Operators\n\nWe propose two stabilization operators: (I) Regress to Corrected Prediction (RCP) which turns the\nclassification problem into a regression towards corrected predictions of an older model, and (II) the\nDiplopia operator which regularizes the new model towards the older model using example weights.\n\n3.1 RCP Stabilization Operator\nWe propose a stabilization operator S(f, T) that can be used with almost any regression algorithm\nand any type of change. 
The RCP operator re-labels each classification training label y_j \u2208 {\u22121, 1}\nin T with a regularized label \u1ef9_j \u2208 R, using an anchor model f:\n\n\\tilde{y}_j = \\begin{cases} \\alpha f(x_j) + (1 - \\alpha) y_j & \\text{if } y_j f(x_j) \\geq 0 \\\\ \\epsilon y_j & \\text{otherwise,} \\end{cases} \\qquad (2)\n\nwhere \u03b1, \u03b5 \u2208 [0, 1] are hyperparameters of S that control the churn-accuracy trade-off, with larger\n\u03b1 corresponding to lower churn but less sensitivity to good changes. Denote the set of all re-labeled\nexamples T\u0303. The RCP stabilization operator S trains a regression model on T\u0303, using the user\u2019s choice\nof regression algorithm.\n\n3.2 Diplopia Stabilization Operator\n\nThe second stabilization operator, which we term Diplopia (double-vision), can be used with any\nclassification strategy that can output a probability estimate for each class, including algorithms like\nSVMs and random forests (calibrated with a method like Platt scaling [10] or isotonic regression\n[11]). This operator can be easily extended to multi-class problems.\nFor binary classification, the Diplopia operator copies each training example into two examples with\nlabels \u00b11, and assigns different weights to the two contradictorily labeled copies. If f(.) is the\nprobability estimate of class +1:\n\n(x_i, y_i) \\to \\begin{cases} (x_i, +1) & \\text{with weight } \\Lambda_i \\\\ (x_i, -1) & \\text{with weight } 1 - \\Lambda_i \\end{cases} \\qquad \\Lambda_i = \\begin{cases} \\alpha f(x_i) + (1 - \\alpha) \\mathbb{1}_{y_i \\geq 0} & \\text{if } y_i \\left( f(x_i) - \\frac{1}{2} \\right) \\geq 0 \\\\ \\frac{1}{2} + \\epsilon y_i & \\text{otherwise.} \\end{cases}\n\nThe formula always assigns the higher weight to the copy with the correct label. Notice that the roles\nof \u03b1 and \u03b5 are very similar to those in (2). To see the intuition behind this operator, note that\nwith \u03b1 = 1 and without the \u03b5-correction, stochastic f(.) 
maximizes the likelihood of the new dataset.\nThe RCP operator requires using a regressor, but our preliminary experiments showed that it often\ntrains faster (without the need to double the dataset size) and reduces churn better than the Diplopia\noperator. We therefore focus on the RCP operator for theoretical and empirical analysis.\n\n4 Theoretical Results\n\nIn this section we present some general bounds on smoothed churn, assuming that the perturbation\ndoes not remove any features, and that the training algorithm is symmetric in training examples (i.e.\nindependent of the order of the dataset). The analysis here assumes datasets for different models\nare sampled i.i.d., ignoring the dependency between consecutive re-labeled datasets (through the\nintermediate model). Proofs and further technical details are given in the supplemental material.\n\nFirst, note that we can rewrite the definition of the churn in terms of zero-one loss:\n\nC(f_1, f_2) = \\mathbb{E}_{(X,Y) \\sim \\mathcal{D}} \\left[ \\ell_{0,1}(f_1(X), f_2(X)) \\right] = \\mathbb{E}_{(X,Y) \\sim \\mathcal{D}} \\left[ \\left| \\ell_{0,1}(f_1(X), Y) - \\ell_{0,1}(f_2(X), Y) \\right| \\right]. \\qquad (3)\n\nWe define a relaxation of C that is similar to the loss used by [5] to study the stability of classification\nalgorithms; we call it smooth churn, and it is parameterized by the choice of \u03b3:\n\nC_\\gamma(f_1, f_2) = \\mathbb{E}_{(X,Y) \\sim \\mathcal{D}} \\left[ \\left| \\ell_\\gamma(f_1(X), Y) - \\ell_\\gamma(f_2(X), Y) \\right| \\right], \\qquad (4)\n\nwhere \u2113_\u03b3(y, y\u2032) = 1 if yy\u2032 \u2264 0, \u2113_\u03b3(y, y\u2032) = 1 \u2212 yy\u2032/\u03b3 for 0 \u2264 yy\u2032 \u2264 \u03b3, and \u2113_\u03b3(y, y\u2032) = 0 otherwise.\nSmooth churn can be interpreted as \u03b3 playing the role of a \u201cconfidence threshold\u201d of the classifier f\nsuch that |f(x)| \u226a \u03b3 means the classifier is not confident in its prediction. 
It is easy to verify that \u2113_\u03b3\nis (1/\u03b3)-Lipschitz continuous with respect to y, when y\u2032 \u2208 {\u22121, 1}.\nLet f_T(x) \u2192 R be a classifier discriminant function (which can be thresholded to form a classifier)\ntrained on set T. Let T^i be the same as T except with the ith training sample (x_i, y_i) replaced by\nanother sample. Then, as in [4], define training algorithm f.(.) to be \u03b2-stable if:\n\n\\forall x, T, T^i : \\left| f_T(x) - f_{T^i}(x) \\right| \\leq \\beta. \\qquad (5)\n\nMany algorithms such as SVM and classical regularization networks have been shown to be \u03b2-stable\nwith \u03b2 = O(1/m) [4, 5]. We can use \u03b2-stability of learning algorithms to get a bound on the expected\nchurn between independent runs of the algorithms on i.i.d. datasets:\nTheorem 1 (Expected Churn). Suppose f is \u03b2-stable, and is used to train classifiers on i.i.d. training\nsets T and T\u2032 sampled from D^m. We have:\n\n\\mathbb{E}_{T, T' \\sim \\mathcal{D}^m} \\left[ C_\\gamma(f_T, f_{T'}) \\right] \\leq \\frac{\\beta \\sqrt{\\pi m}}{\\gamma}. \\qquad (6)\n\nAssuming \u03b2 = O(1/m) this bound is of order O(1/\u221am), in line with most concentration bounds on\nthe generalization error. We can further show that churn is concentrated around its expectation:\nTheorem 2 (Concentration Bound on Churn). Suppose f is \u03b2-stable, and is used to train classifiers\non i.i.d. training sets T and T\u2032 sampled from D^m. We have:\n\n\\Pr_{T, T' \\sim \\mathcal{D}^m} \\left\\{ C_\\gamma(f_T, f_{T'}) > \\epsilon + \\frac{\\sqrt{\\pi m}\\, \\beta}{\\gamma} \\right\\} \\leq e^{-\\frac{\\epsilon^2 \\gamma^2}{m \\beta^2}}. \\qquad (7)\n\n\u03b2-stability results for learning algorithms often include a worst-case bound on the loss or on the Lipschitz constant\nof the loss function. Assuming we use the RCP operator with squared loss in a reproducing kernel\nHilbert space (RKHS), we can derive a distribution-dependent bound on the expected squared churn:\nTheorem 3 (Expected Squared Churn). 
Let F be a reproducing kernel Hilbert space with kernel k\nsuch that \u2200x \u2208 X : k(x, x) \u2264 \u03ba^2 < \u221e. Let f_T be a model trained on T = {(x_i, y_i)}_{i=1}^m defined by:\n\nf_T = \\arg\\min_{g \\in F} \\frac{1}{m} \\sum_{i=1}^{m} (g(x_i) - y_i)^2 + \\lambda \\|g\\|_k^2. \\qquad (8)\n\nFor models trained on i.i.d. training sets T and T\u2032:\n\n\\mathbb{E}_{T, T' \\sim \\mathcal{D}^m,\\, (X,Y) \\sim \\mathcal{D}} \\left[ \\left( \\ell_\\gamma(f_T(X), Y) - \\ell_\\gamma(f_{T'}(X), Y) \\right)^2 \\right] \\leq \\frac{2 \\kappa^4}{m \\lambda^2 \\gamma^2} \\, \\mathbb{E}_{T \\sim \\mathcal{D}^m} \\left[ \\frac{1}{m} \\sum_{i=1}^{m} (f_T(x_i) - y_i)^2 \\right]. \\qquad (9)\n\nWe can further use Chebyshev\u2019s inequality to get a concentration bound on the smooth churn C_\u03b3.\nUnlike the bounds in [4] and [5], the bound of Theorem 3 scales with the expected training error (note\nthat we must use \u1ef9_i in place of y_i when applying the theorem, since training data is re-labeled by\nthe stabilization operator). We can thus use the above bound to analyse the effect of \u03b1 and \u03b5 on the\nchurn, through their influence on the training error.\n\nTable 2: Description of the datasets used in the experimental analysis.\n\nDataset | Nomao [13] | News Popularity [14] | Twitter Buzz [15]\n# Features | 89 | 61 | 77\nTA | 4000 samples, 84 features | 8000 samples, 58 features | 4000 samples, 70 features\nTB | 5000 samples, 89 features | 10000 samples, 61 features | 5000 samples, 77 features\nValidation set | 1000 samples | 1000 samples | 1000 samples\nTesting set | 28465 samples | 28797 samples | 45402 samples\n\nSuppose the Markov chain described in Section 2.2 has reached a stationary distribution. Let F\u2217_k be a\nmodel sampled from the resulting stationary distribution, used with the RCP operator defined in (2)\nto re-label the dataset T_{k+1}. Since F\u2217_{k+1} is the minimizer of the objective in (8) on the re-labeled dataset,\nwe have:\n\n\\mathbb{E}_{T_{k+1}} \\left[ \\frac{1}{m} \\sum_{i=1}^{m} (F^*_{k+1}(x_i) - \\tilde{y}_i)^2 \\right] \\leq \\mathbb{E}_{T_{k+1}} \\left[ \\frac{1}{m} \\sum_{i=1}^{m} (F^*_k(x_i) - \\tilde{y}_i)^2 + \\lambda \\left( \\|F^*_k\\|_k^2 - \\|F^*_{k+1}\\|_k^2 \\right) \\right] = \\mathbb{E}_{T_{k+1}} \\left[ \\frac{1}{m} \\sum_{i=1}^{m} (F^*_k(x_i) - \\tilde{y}_i)^2 \\right], \\qquad (10)\n\nwhere line (10) is by the assumptions of stationary regime on F\u2217_k and F\u2217_{k+1} with similar dataset\nsampling distributions for T_k and T_{k+1}. If E is the set of examples that F\u2217_k got wrong, using the\ndefinition of the RCP operator we can replace \u1ef9_i to get this bound on the squared churn:\n\n\\frac{\\kappa^4}{m \\lambda^2 \\gamma^2} \\, \\mathbb{E}_{T_{k+1}} \\left[ \\frac{1 - \\alpha}{m} \\sum_{i \\notin E} (F^*_k(x_i) - y_i)^2 + \\frac{1}{m} \\sum_{i \\in E} (F^*_k(x_i) + \\epsilon)^2 \\right]. \\qquad (11)\n\nWe can see in Eqn. (11) that using an \u03b1 close to 1 can decrease the first part of the bound, but at the\nsame time it can negatively affect the error rate of the classifier, resulting in more samples in E and\nconsequently a larger second term. Decreasing \u03b5 can reduce the (F\u2217_k(x_i) + \u03b5)^2 term of the bound, but\ncan again cause an increase in the error rate. As shown in the experimental results, there is often a\ntrade-off between the amount of churn reduction and the accuracy of the resulting model. We can\nmeasure the accuracy on the training set or a validation set to make sure the choice of \u03b1 and \u03b5 does\nnot degrade the accuracy. 
To estimate churn reduction, we can use an unlabeled dataset.\n\n5 Experiments\n\nThis section demonstrates the churn reduction effect of the RCP operator for three UCI benchmark\ndatasets (see Table 2) with three regression algorithms: ridge regression, random forest regression, and\nsupport vector machine regression with RBF kernel, all implemented in Scikit-Learn [12] (additional\nresults for boosted stumps and linear SVM in the appendix). We randomly split each dataset into\nthree fixed parts: a training set, a validation set on which we optimized the hyper-parameters for\nall algorithms, and a testing set. We impute any missing values by the corresponding mean, and\nnormalize the data to have zero mean and variance 1 on the training set. See the supplementary\nmaterial for more experimental details.\nTo compare two models by computing the WLR on a reasonable number of diffs, we have made the\ntesting sets as large as possible, so that the expected number of diffs between two different models\nis large enough to derive accurate and statistically significant conclusions. Lastly, we note that the\nchurn metric does not require labels, so it can be computed on an unlabeled dataset.\n\n5.1 Experimental Set-up and Metrics\n\nWe assume an initial classifier is to be trained on TA, and a later candidate trained on TB will be\ntested against the initial classifier. For the baseline of our experiments, we train classifier A on TA\nand classifier B on TB independently and without any stabilization, as shown in Figure 1.\nFor the RCP operator comparison, we train A on TA, then train B+ = S(A, TB). For the MCMC\noperator comparison, we run the MCMC chain for k = 30 steps (empirically enough for convergence\nfor the datasets we considered, as seen in Figure 2) and set A\u2217 = S(F\u2217_k, TA) and B\u2217 = S(A\u2217, TB).\nThe dataset perturbation sub-samples 80% of the examples in TA and randomly drops 3-7 features.\nWe run 40 independent chains to measure the variability, and report the average outcome and standard\ndeviation. Figure 2 (left) plots the average and standard deviation of the churn along the 40 traces,\nand Figure 2 (right) shows the accuracy.\n\nFigure 2: Left: Churn between consecutive models during the MCMC run on Nomao Dataset, with\nand without stabilization. Right: Accuracy of the intermediate models, with and without stabilization.\nValues are averaged over 40 runs of the chain. Dotted lines show standard errors.\n\nFor each experiment we report the churn ratio Cr between the initial classifier and candidate change,\nthat is, Cr = C(B+, A)/C(B, A) for the RCP operator, and Cr = C(B\u2217, A\u2217)/C(B, A) for the\nMCMC operator, and Cr = C(B, A)/C(B, A) = 1 for the baseline experiment. The most important\nmetric in practice is how easy it is to tell if B is an improvement over A, which we quantify by the\nWLR between the candidate and initial classifier for each experiment. To help interpret the WLR,\nwe also report the resulting probability pwin that we would conclude that the candidate change is\npositive (p \u2264 0.05) with a random 100-example set of differences.\nLastly, to demonstrate that the proposed methods reduce the churn without adversely impacting the\naccuracy of the models, we also report the accuracy of the different trained models for a large test set,\nthough the point of this work is that a sufficiently large labeled test set may not be available in a real\nsetting (see Section 1.1), and note that even if available, using a fixed test set to test many different\nchanges will lead to overfitting.\n\n5.2 Results\n\nTable 3 shows results using reasonable default values of \u03b1 = 0.5 and \u03b5 = 0.5 for both RCP and the\nMCMC (for results with other values of \u03b1 and \u03b5 see Appendix D). 
As seen in the Cr rows of the table,\nRCP reduces churn over the baseline in all 9 cases, generally by 20%, but as much as 46% for ridge\nregression on the Nomao dataset. Similarly, running RCP in the Markov Chain also reduces the churn\ncompared to the baseline in all 9 cases, and by slightly more on average than with the one-step RCP.\n\nFigure 3: SVM on Nomao dataset. Left: Testing accuracy of A\u2217 and B\u2217 compared to A and B, and\nchurn ratio Cr as a function of \u03b5, for fixed \u03b1 = 0.7. Both the accuracy and the churn ratio tend to\nincrease with larger values of \u03b5. Right: Accuracies and the churn ratio versus \u03b1, for fixed \u03b5 = 0.1.\nThere is a sharp decrease in accuracy with \u03b1 > 0.8 likely due to divergence in the chain.\n\n[Figure 2 axes: iteration of the Markov chain vs. churn (%) between consecutive models (left) and test accuracy (%) (right). Figure 3 axes: RCP parameter \u03b5 (left) or \u03b1 (right) vs. test accuracy compared to baseline (%) and churn ratio.]\n\nTable 3: Experiment results on 3 domains with 3 different training algorithms for a single step RCP\nand the MCMC methods. For the MCMC experiment, we report the numbers with the standard\ndeviation over the 40 runs of the chain.\n\nDataset | Algorithm | Metric | Baseline (No Stabilization) | RCP (\u03b1 = 0.5, \u03b5 = 0.5) | MCMC, k = 30 (\u03b1 = 0.5, \u03b5 = 0.5)\nNomao | Ridge | WLR | 1.24 | 1.40 | 1.31\nNomao | Ridge | pwin | 26.5 | 49.2 | 36.5\nNomao | Ridge | Cr | 1.00 | 0.54 | 0.54 \u00b1 0.06\nNomao | Ridge | Acc V1 / V2 | 93.1 / 93.4 | 93.1 / 93.4 | 93.2 \u00b1 0.1 / 93.4 \u00b1 0.1\nNomao | RF | WLR | 1.02 | 1.13 | 1.09\nNomao | RF | pwin | 5.6 | 13.4 | 9.8\nNomao | RF | Cr | 1.00 | 0.83 | 0.83 \u00b1 0.05\nNomao | RF | Acc V1 / V2 | 94.8 / 94.8 | 94.8 / 95.0 | 94.9 \u00b1 0.2 / 95.0 \u00b1 0.2\nNomao | SVM | WLR | 1.70 | 2.51 | 2.32\nNomao | SVM | pwin | 82.5 | 99.7 | 99.2\nNomao | SVM | Cr | 1.00 | 0.75 | 0.69 \u00b1 0.06\nNomao | SVM | Acc V1 / V2 | 94.6 / 95.1 | 94.6 / 95.2 | 94.8 \u00b1 0.2 / 95.3 \u00b1 0.1\nNews | Ridge | WLR | 0.95 | 0.94 | 1.04\nNews | Ridge | pwin | 2.5 | 2.4 | 6.7\nNews | Ridge | Cr | 1.00 | 0.75 | 0.78 \u00b1 0.04\nNews | Ridge | Acc V1 / V2 | 65.1 / 65.0 | 65.1 / 65.0 | 65.0 \u00b1 0.1 / 65.1 \u00b1 0.1\nNews | RF | WLR | 1.07 | 1.02 | 1.10\nNews | RF | pwin | 8.5 | 5.7 | 10.8\nNews | RF | Cr | 1.00 | 0.69 | 0.67 \u00b1 0.04\nNews | RF | Acc V1 / V2 | 64.5 / 65.1 | 64.5 / 64.7 | 64.3 \u00b1 0.3 / 64.8 \u00b1 0.2\nNews | SVM | WLR | 1.17 | 1.26 | 1.24\nNews | SVM | pwin | 18.4 | 29.4 | 26.1\nNews | SVM | Cr | 1.00 | 0.77 | 0.86 \u00b1 0.02\nNews | SVM | Acc V1 / V2 | 64.9 / 65.4 | 64.9 / 65.4 | 64.8 \u00b1 0.1 / 65.4 \u00b1 0.1\nTwitter Buzz | Ridge | WLR | 1.71 | 3.54 | 1.53\nTwitter Buzz | Ridge | pwin | 83.1 | 100.0 | 66.4\nTwitter Buzz | Ridge | Cr | 1.00 | 0.85 | 0.65 \u00b1 0.05\nTwitter Buzz | Ridge | Acc V1 / V2 | 89.7 / 89.9 | 89.7 / 90.0 | 90.1 \u00b1 0.1 / 90.2 \u00b1 0.1\nTwitter Buzz | RF | WLR | 1.35 | 1.15 | 1.15\nTwitter Buzz | RF | pwin | 41.5 | 16.1 | 15.9\nTwitter Buzz | RF | Cr | 1.00 | 0.86 | 0.77 \u00b1 0.07\nTwitter Buzz | RF | Acc V1 / V2 | 96.2 / 96.4 | 96.2 / 96.3 | 96.3 \u00b1 0.1 / 96.3 \u00b1 0.1\nTwitter Buzz | SVM | WLR | 1.35 | 1.77 | 1.55\nTwitter Buzz | SVM | pwin | 42.2 | 86.6 | 68.4\nTwitter Buzz | SVM | Cr | 1.00 | 0.70 | 0.70 \u00b1 0.03\nTwitter Buzz | SVM | Acc V1 / V2 | 96.0 / 96.1 | 96.0 / 96.1 | 96.1 \u00b1 0.1 / 96.2 \u00b1 0.1\n\nIn some cases, the reduced churn has a huge impact on the WLR. 
For example, for the SVM on\nTwitter, the 30% churn reduction by RCP raised the WLR from 1.35 to 1.77, making it twice as\nlikely that labelling 100 differences would have verified the change was good (compare pwin values).\nMCMC provides a similar churn reduction, but the WLR increase is not as large.\nIn addition to the MCMC providing slightly more churn reduction on average than RCP, running\nthe Markov chain provides slightly higher accuracy on average as well, most notably for the ridge\nclassifier on the Twitter dataset, raising initial classifier accuracy by 0.4% over the baseline (89.7% to 90.1% in Table 3). We\nhypothesize this is due to the regularization effect of the perturbed training during the MCMC run,\nresembling the effect of dropout in neural networks.\nWe used fixed values of \u03b1 = 0.5 and \u03b5 = 0.5 for all the experiments in Table 3, but note that results\nwill vary with the choice of \u03b1 and \u03b5, and if they can be tuned with cross-validation or otherwise,\nresults can be substantially improved. Figure 3 illustrates the dependence on these hyper-parameters:\nthe left plot shows that small values of \u03b5 result in lower churn with reduced improvement on accuracy,\nand the right plot shows that increasing \u03b1 reduces churn, and also helps increase accuracy, but at\nvalues larger than 0.8 causes the Markov chain to diverge.\n\nReferences\n[1] L. Devroye and T. Wagner. Distribution-free performance bounds for potential function rules. Information\nTheory, IEEE Transactions on, 25(5):601\u2013604, 1979.\n\n[2] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag: New York, 1995.\n\n[3] V. N. Vapnik. Statistical Learning Theory. John Wiley: New York, 1998.\n\n[4] O. Bousquet and A. Elisseeff. Algorithmic stability and generalization performance. In Advances in Neural\nInformation Processing Systems 13: Proceedings of the 2000 Conference, volume 13, page 196. MIT Press,\n2001.\n\n[5] O. Bousquet and A. Elisseeff. 
Stability and generalization. Journal of Machine Learning Research, 2(Mar):499\u2013526, 2002.\n\n[6] S. Mukherjee, P. Niyogi, T. Poggio, and R. Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1-3):161\u2013193, 2006.\n\n[7] A. Reinhart. Statistics Done Wrong: The Woefully Complete Guide. No Starch Press, San Francisco, USA, 2015.\n\n[8] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929\u20131958, 2014.\n\n[9] L. Zhang and P. N. Suganthan. A survey of randomized algorithms for training neural networks. Information Sciences, 2016.\n\n[10] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61\u201374, 1999.\n\n[11] A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625\u2013632. ACM, 2005.\n\n[12] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825\u20132830, 2011.\n\n[13] L. Candillier and V. Lemaire. Design and analysis of the Nomao challenge active learning in the real-world. In Proceedings of the ALRA: Active Learning in Real-world Applications, Workshop ECML-PKDD, 2012.\n\n[14] K. Fernandes, P. Vinagre, and P. Cortez. A proactive intelligent decision support system for predicting the popularity of online news. In Progress in Artificial Intelligence: 17th Portuguese Conference on Artificial Intelligence, EPIA 2015, Coimbra, Portugal, September 8-11, 2015, Proceedings, pages 535\u2013546. Springer International Publishing, Cham, 2015.\n\n[15] F. Kawala, E. Gaussier, A. Douzal-Chouakria, and E. Diemert. Apprentissage d\u2019ordonnancement et influence de l\u2019ambigu\u00eft\u00e9 pour la pr\u00e9diction d\u2019activit\u00e9 sur les r\u00e9seaux sociaux. In Coria\u20192014, pages 1\u201315, Nancy, France, March 2014.", "award": [], "sourceid": 1580, "authors": [{"given_name": "Mahdi", "family_name": "Milani Fard", "institution": "Google"}, {"given_name": "Quentin", "family_name": "Cormier", "institution": "Google"}, {"given_name": "Kevin", "family_name": "Canini", "institution": "Google"}, {"given_name": "Maya", "family_name": "Gupta", "institution": "Google"}]}