{"title": "Minimizers of the Empirical Risk and Risk Monotonicity", "book": "Advances in Neural Information Processing Systems", "page_first": 7478, "page_last": 7487, "abstract": "Plotting a learner's average performance against the number of training samples results in a learning curve.  Studying such curves on one or more data sets is a way to get to a better understanding of the generalization properties of this learner.  The behavior of learning curves is, however, not very well understood and can display (for most researchers) quite unexpected behavior.  Our work introduces the formal notion of risk monotonicity, which asks the risk to not deteriorate with increasing training set sizes in expectation over the training samples.  We then present the surprising result that various standard learners, specifically those that minimize the empirical risk, can act nonmonotonically irrespective of the training sample size. We provide a theoretical underpinning for specific instantiations from classification, regression, and density estimation.  Altogether, the proposed monotonicity notion opens up a whole new direction of research.", "full_text": "Minimizers of the Empirical Risk\n\nand Risk Monotonicity\n\nDelft University of Technology & University of Copenhagen\n\nMarco Loog\n\nTom Viering\n\nDelft University of Technology\n\nAlexander Mey\n\nDelft University of Technology\n\nAbstract\n\nPlotting a learner\u2019s average performance against the number of training samples\nresults in a learning curve. Studying such curves on one or more data sets is a way\nto get to a better understanding of the generalization properties of this learner. The\nbehavior of learning curves is, however, not very well understood and can display\n(for most researchers) quite unexpected behavior. Our work introduces the formal\nnotion of risk monotonicity, which asks the risk to not deteriorate with increasing\ntraining set sizes in expectation over the training samples. 
We then present the\nsurprising result that various standard learners, speci\ufb01cally those that minimize the\nempirical risk, can act nonmonotonically irrespective of the training sample size.\nWe provide a theoretical underpinning for speci\ufb01c instantiations from classi\ufb01cation,\nregression, and density estimation. Altogether, the proposed monotonicity notion\nopens up a whole new direction of research.\n\n1\n\nIntroduction\n\nLearning curves are an important diagnostic tool that provide researchers and practitioners with\ninsight into a learner\u2019s generalization behavior [Shalev-Shwartz and Ben-David, 2014]. Learning\ncurves plot the (estimated) true performance against the number of training samples. Among other\nthings, they can be used to compare different learners to each other. This can highlight the differences\ndue to their complexity, with the simpler learners performing better in the small sample regime, while\nthe more complex learners perform best with large sample sizes. In combination with a plot of their\n(averaged) resubstitution error (or training error), they can also be employed to diagnose under\ufb01tting\nand over\ufb01tting. Moreover, they can aid when it comes to making decision about collecting more data\nor not by extrapolating them to sample sizes beyond the ones available.\nIt seems intuitive that learners become better (or at least do not deteriorate) with more training\ndata. With a bit more reservation, Shalev-Shwartz and Ben-David [2014] state, for instance, that\nthe learning curve \u201cmust start decreasing once the training set size is larger than the VC-dimension\u201d\n(page 153). The large majority of researchers and practitioners (that we talked to) indeed take it for\ngranted that learning curves show improved performance with more data. 
Any deviations from this\nthey attribute to the way the experiments are set up, to the \ufb01nite sample sizes one is dealing with, or\nto the limited number of cross-validation or bootstrap repetitions one carried out. It is expected that if\none could sample a training set ad libitum and measure the learner\u2019s true performance over all data,\nsuch behavior disappears. That is, if one could indeed get to the performance in expectation over all\ntest data and over all training samples of a particular size, performance supposedly improves with\nmore data.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fWe formalize this behavior of expected improved performance in Section 3. As we will typically\nexpress a learner\u2019s ef\ufb01ciency in terms of the expected loss, we will refer to this notion as risk\nmonotonicity. Section 4 then continues with the main contribution of this work and demonstrates that\nvarious well-known empirical risk minimizers can display nonmonotonic behavior. Moreover, we\nshow that for these learners this behavior can persist inde\ufb01nitely, i.e., it can occur at any sample size.\nNote: all proofs can be found in the supplement. Section 5 provides some experimental evidence for\nsome cases of interest that have, up to now, resisted any deeper theoretical analysis. Section 6 then\nprovides a discussion and concludes the work. In this last section, among others, we contrast our\nnotion of risk monotonicity to that of PAC-learnability, note that these are two essentially different\nconcepts, and consider various research questions of interest to further re\ufb01ne our understanding of\nlearning curves. Though many will probably \ufb01nd our \ufb01ndings surprising, counterintuitive behavior of\nthe learning curve has been reported before in various other settings. 
Section 2 goes through these\nand other related works and puts our contribution in perspective.\n\n2 Earlier Work and Its Relation to the Current\n\nWe split up our overview into the more regular works that characterize monotonic behavior and those\nthat identify the existence of nonmonotonic behavior.\n\n2.1 The Monotonic Character of Learning Curves\n\nMany of the studies into the behavior of learning curves stem from the end of the 1980s and the\nbeginning of the 1990s and were carried out by Tishby, Haussler, and others [Tishby et al., 1989, Levin\net al., 1990, Sompolinsky et al., 1990, Opper and Haussler, 1991, Seung et al., 1992, Haussler et al.,\n1992]. These early investigations were done in the context of neural networks and in their analyses\ntypically make use of tools from statistical mechanics. A statistical inference approach is studied by\nAmari et al. [1992] and Amari and Murata [1993], who demonstrate the typical power-law behavior\nof the asymptotic learning curve. Haussler et al. [1996] bring together many of the techniques and\nresults from the aforementioned works. At the same time, they advance the theory for learning curves\nand provide an overview of the rather diverse, though still monotonic, behavior they can exhibit. In\nparticular, the curve may display multiple steep and sudden drops in the risk.\nAlready in 1979, Micchelli and Wahba [1979] provide a lower bound for learning curves of Gaussian\nprocesses. Only at the end of the 1990s and beginning of the 2000s, the overall attention shifted from\nneural networks to Gaussian processes. In this period, various works were published that introduce\napproximations and bounds [Opper, 1998, Sollich, 1999, Opper and Vivarelli, 1999, Williams and\nVivarelli, 2000, Sollich and Halees, 2002]. Different types of techniques were employed in these\nanalyses, some of which again from statistical mechanics. 
The main caveat, when it comes to the\nresults obtained, is the assumption that the model is correctly speci\ufb01ed.\nThe focus of Cortes et al. [1994] is on support vector machines. They develop ef\ufb01cient procedures for\nan extrapolation of the learning curve, so that if only limited computational resources are available,\nthese can possibly be assigned to the most promising approaches. It is assumed that, for large enough\ntraining set sizes, the error rate converges towards a stable value following a power-law. This behavior\nwas established to hold in many of the aforementioned works. The ideas that Cortes et al. [1994] put\nforward have found use in speci\ufb01c applications (see, for instance, [Kolachina et al., 2012]) and can\ncount on renewed interest these days, especially in combination with \ufb02op gobbling neural networks\n(see, for instance, [Hestness et al., 2017]).\nAll of the aforementioned works study and derive learning curve behavior that shows no deterioration\nwith growing training set sizes, even though they may be described as \u201clearning curves with rather\ncurious and dramatic behavior\u201d [Haussler et al., 1996]. Our work identi\ufb01es aspects that are more\ncurious and more dramatic: with a larger training set, performance can deteriorate, even in expectation.\n\n2.2 Early Noted Nonmonotonic Behavior\n\nProbably the \ufb01rst to point out that learning curves can show nonmonotonic behavior was Duin [1995],\nwho looked at the error rate of so-called Fisher\u2019s linear discriminant. In this context, Fisher\u2019s linear\ndiscriminant is used as a classi\ufb01er and equivalent to the two-class linear classi\ufb01er that is obtained\nby optimizing the squared loss. This can be solved by regressing the input feature vectors onto\n\n2\n\n\fa \u22121/+1 encoding of the class labels. 
In case the number of training samples is smaller than or\nequal to the number of input dimensions, one needs to deal with the inverse of singular matrices and\ntypically resorts to the use of the Moore-Penrose pseudo-inverse. In this way, the minimum norm\nsolution is obtained [Smola et al., 2000]. It is exactly in this underdetermined setting, as the number\nof training samples approaches the dimensionality, that the error rate will be increasing. Around\nthe same time, Opper and Kinzel [1996] showed that in the context of neural networks a similar\nbehavior is observed for small samples. In particular, the error rate for the single layer perceptron\nis demonstrated to increase when the training set size goes towards the dimensionality of the data\n[Opper, 2001]. Subsequently, other examples of exactly this type of nonmonotonic behavior have\nbeen reported. Worth mentioning are classi\ufb01ers built based on the lasso [Kr\u00e4mer, 2009] and two\nrecent works that have triggered renewed attention to this subject in the neural networks community\n[Belkin et al., 2018, Spigler et al., 2018]. The classi\ufb01er reaching a maximum error rate when the\nsample size transits from an underspeci\ufb01ed to an overspeci\ufb01ed setting was originally referred to as\npeaking (see also [Duin, 2000]). The two recent works above rename it and use the terms double\ndescent and jamming.\nA completely different phenomenon, and yet another way in which learning curves can be nonmonotonic,\nis described by Loog and Duin [2012]. They show that there are learning problems for which speci\ufb01c\nclassi\ufb01ers attain their optimal expected 0-1 loss at a \ufb01nite sample size. That is, on such problems,\nthese classi\ufb01ers perform essentially worse with an in\ufb01nite amount of training data compared to some\n\ufb01nite training set sizes. The behavior is referred to as dipping, following the shape of the error rate\u2019s\nlearning curve. 
In the context of (safe) semi-supervised learning, Loog [2016] then argues that if one\ncannot even guarantee improvements in 0-1 loss when receiving more labeled data, this is certainly\nimpossible with unlabeled data. When evaluating in terms of the loss the model optimizes, however,\none can get to demonstrable improvements and essentially solve the safe semi-supervised learning\nproblem [Loog, 2016, Krijthe and Loog, 2017, 2018]. Our work shows, however, that also when one\nlooks at the loss the learner optimizes, there may be no performance guarantees.\nThe dipping behavior hinges both on the fact that the model is misspeci\ufb01ed (i.e., the Bayes-optimal\nestimate is not in the class of models considered) and that the classi\ufb01er does not optimize what it\nis ultimately evaluated with. That this setting can cause problems, e.g. convergence to the wrong\nsolution, had already been demonstrated for maximum likelihood by Devroye et al. [1996]. If\nthe model class is \ufb02exible enough, this discrepancy disappears in many a setting. This happens,\nfor instance, for the class of classi\ufb01cation-calibrated surrogate losses [Bartlett et al., 2006]. Note,\nhowever, that Devroye et al. [1996] conjecture that consistent rules that are expected to perform better\nwith increasing training sizes (so-called smart rules) do not exist. Ben-David et al. [2012] analyze the\nconsequence of the mismatch between surrogate and zero-one loss in some more detail and provide\nanother example of a problem distribution on which such classi\ufb01ers would dip.\nOur results strengthen or extend the above \ufb01ndings in the following ways. First of all, we show\nthat nonmonotonic behavior can occur in the setting where the complexity of the learner is small\ncompared to the training set size. Therefore, the reported behavior is not due to jamming or peaking.\nSecondly, we are going to evaluate our learners by means of the loss they actually optimize for. 
If\nwe look at the linear classi\ufb01er that optimizes the hinge loss, for instance, we will study its learning\ncurve for the hinge loss as well. In other words, there is no discrepancy between the objective used\nduring training and the loss used at test time. Therefore, possibly odd behavior cannot be explained\nby dipping. Thirdly, we do not only look at classi\ufb01cation and regression but also consider density\nestimation and (negative) log-likelihood estimation in particular.\n\n3 Risk Monotonicity\n\nWe come to a formal de\ufb01nition of the intuition that with one additional instance a learner should\nimprove its performance in expectation over the training set. The next section then studies various\nlearners with the notions developed here. First, however, some notations and prior de\ufb01nitions are\nprovided.\n\n3.1 Preliminaries\n\nWe let Sn = (z1, . . . ,zn) be a training set of size n, sampled i.i.d. from a distribution D over a general\ndomain Z . Also given is a hypothesis class H and a loss function \u2113 : Z \u00d7 H \u2192 R through which\nthe performance of a hypothesis h \u2208 H is measured. The objective is to minimize the expected loss\nor risk under the distribution D, which is given by\n\nR_D(h) := E_{z\u223cD} \u2113(z,h).   (1)\n\nA learner A is a particular mapping from the set of all samples S := Z \u222a Z^2 \u222a Z^3 \u222a . . . to elements\nfrom the prespeci\ufb01ed hypothesis class H . That is, A : S \u2192 H . We are particularly interested in\nlearners Aerm that provide a solution which minimizes the empirical risk R_{Sn} over the training set:\n\nAerm(Sn) := argmin_{h\u2208H} R_{Sn}(h),   (2)\n\nwith\n\nR_{Sn}(h) := (1/n) \u2211_{i=1}^{n} \u2113(zi,h).   (3)\n\nMost common classi\ufb01cation, regression, and density estimation problems can be formulated in such\nterms. 
Examples are the earlier mentioned Fisher\u2019s linear discriminant, support vector machines, and\nGaussian processes, but also maximum likelihood estimation, linear regression, and the lasso can be\ncast in similar terms.\n\n3.2 Degrees of Monotonicity\n\nThe basic de\ufb01nition is the following.\nDe\ufb01nition 1 (local monotonicity) A learner A is (D, \u2113,n)-monotonic with respect to a distribution\nD, a loss \u2113, and an integer n \u2208 N := {1,2, . . .} if\n\nE_{Sn+1\u223cD^{n+1}} [R_D(A(Sn+1)) \u2212 R_D(A(Sn))] \u2264 0.   (4)\n\nThis expresses exactly how we would expect a learner to behave locally (i.e., at a speci\ufb01c training\nsample size n): given one additional training instance, we expect the learner to improve. Based on our\nde\ufb01nition of local monotonicity, we can construct stronger desiderata that may be of more interest.\nThe two entities we would like to get rid of in the above de\ufb01nition are n and D. The former, because\nwe would like our learner to act monotonically irrespective of the sample size. The latter, because\nwe typically do not know the underlying distribution. For now, getting rid of the loss \u2113 is maybe\ntoo much to ask for. First of all, not all losses are compatible with one another, as they may act\non different types of z \u2208 Z and h \u2208 H . But even if they take the same types of input, a learner is\ntypically designed to minimize one speci\ufb01c loss and there seems to be no direct reason for it to be\nmonotonic in terms of another. It seems less likely, for example, that an SVM is risk monotonic in\nterms of the squared loss. (We will nevertheless brie\ufb02y return to this matter in Section 6.) We focus\non the empirical risk minimizers precisely because they seem to be the most appropriate candidates to behave\nmonotonically in terms of their own loss.\nThough we typically do not know D, we do know in which domain Z we are operating. 
Therefore,\nthe following de\ufb01nition is suitable.\nDe\ufb01nition 2 (local Z -monotonicity) A learner A is (locally) (Z , (cid:96),n)-monotonic with respect to a\nloss (cid:96) and an integer n \u2208 N if, for all distributions D on Z , it is (D, (cid:96),n)-monotonic.\nWhen it comes to n, the peaking phenomenon shows that, for some learners, it may be hopeless to\ndemand local monotonicity for all n \u2208 N. What we still can hope to \ufb01nd is an N \u2208 N, such that for all\nn \u2265 N, we \ufb01nd the learner to be locally risk monotonic. As properties like peaking may change with\nthe dimensionality\u2014the complexity of the classi\ufb01er is generally dependent on it, the choice for N\nwill typically have to depend on the domain.\nDe\ufb01nition 3 (weak Z -monotonicity) A learner A is weakly (Z , (cid:96),N)-monotonic with respect to a\nloss (cid:96) if there is an integer N \u2208 N such that for all n \u2265 N, the learner is locally (Z , (cid:96),n)-monotonic.\nGiven the domain, one may of course be interested in the smallest N for which weak Z -monotonicity\nis achieved. If it does turn out that N can be set to 1, the learner is said to be globally Z -monotonic.\nDe\ufb01nition 4 (global Z -monotonicity) A learner A is globally (Z , (cid:96))-monotonic with respect to a\nloss (cid:96) if for every integer n \u2208 N, the learner is locally (Z , (cid:96),n)-monotonic.\n\n4\n\n\f4 Theoretical Results\n\nWe consider the hinge loss, the squared loss, and the absolute loss and linear models that optimize\nthe corresponding empirical loss. In essence, we demonstrate that, there are various domains Z for\nwhich for any choice of N, these learners are not weakly (Z , (cid:96),N)-monotonic. For the log-likelihood,\nwe basically prove the same: there are standard learners for which the (negative) log-likelihood is not\nweakly (Z , (cid:96),N)-monotonic for any N. 
The \ufb01rst three losses can all be used to build classi\ufb01ers: the\n\ufb01rst is at the basis of SVMs, while the second gives rise to Fisher\u2019s linear discriminant in combination\nwith linear hypothesis classes. The second and third loss are of course also employed in regression.\nThe log-likelihood is standard in density estimation.\n\n4.1 Learners that Do Behave Monotonically\n\nBefore we actually move to our negative results, we \ufb01rst provide examples that point in a positive\ndirection. The \ufb01rst learner is provably risk monotonic over a large collection of domains. The second\nlearner, the memorize algorithm, is a monotonic learner taken from [Ben-David et al., 2011].\n\nFitting a normal distribution with \ufb01xed covariance and unknown mean. Let \u03a3 be an invertible\nd \u00d7 d-matrix,\n\nH := { z \u21a6 (1/\u221a((2\u03c0)^d |\u03a3|)) exp(\u2212(1/2) (z\u2212\u00b5)^T \u03a3^{\u22121} (z\u2212\u00b5)) | \u00b5 \u2208 R^d },   (5)\n\nZ \u2282 R^d, and take the loss to equal the negative log-likelihood.\nTheorem 1 If Z is bounded, the learner Aerm is globally (Z , \u2113)-monotonic.\nRemark 1 Using similar arguments, one can show that the learner with H = R^d and Mahalanobis\nloss \u2113(z,h) = ||z\u2212h||^2_\u03a3 := (z\u2212h)^T \u03a3 (z\u2212h), with \u03a3 a positive semi-de\ufb01nite matrix, is globally (Z , \u2113)-monotonic\nas well, as long as Z is bounded.\n\nThe memorize algorithm [Ben-David et al., 2011]. When evaluated on a test input object that\nis also present in the training set, this classi\ufb01er returns the label of said training object. In case\nmultiple training examples share the same input, the majority voted label is returned. In case the test\nobject is not present in the training set, a default label is returned. This learner is monotonic for any\ndistribution under the zero-one loss. 
Similarly, any histogram rule with \ufb01xed partitions is monotone,\nwhich is immediate from the properties of the binomial distribution [Devroye et al., 1996].\n\n4.2 Learners that Don\u2019t Behave\n\nTo show for various learners that they do not always behave risk monotonically, we construct speci\ufb01c\ndiscrete distributions for which we can explicitly prove nonmonotonicity. What leads to the sought-after\ncounterexamples in our case is a distribution where a small fraction of the density is located\nrelatively far away from the origin. In particular, shrinking the probability of this fraction towards 0\nleads us to the lemma below. It is used in the subsequent proofs, but is also of some interest in itself.\nLemma 1 Let Z := {a,b} be a domain with two elements from R, let\n\nS^k_{n\u2212k} := (a, . . . ,a [k elements], b, . . . ,b [n\u2212k elements])   (6)\n\nbe a training set with n samples, and let h^k_{n\u2212k} := Aerm(S^k_{n\u2212k}). If\n\n\u2212\u2113(b,h^0_{n+1}) + (n + 1)\u2113(b,h^1_n) \u2212 n\u2113(b,h^1_{n\u22121}) > 0,   (7)\n\nthen Aerm is not locally (Z , \u2113,n)-monotonic.\n\nRemark 2 For many losses, we have, in fact, that \u2113(b,h^0_n) = \u2113(b,h^0_{n+1}) = 0, which further simpli\ufb01es\nthe difference of interest to (n + 1)\u2113(b,h^1_n) \u2212 n\u2113(b,h^1_{n\u22121}).\n\nIn a way, the above lemma and remark show that if the learning of the single point b does not happen\nfast enough, local monotonicity cannot be guaranteed. Section 6 will brie\ufb02y return to this point.\n\nLinear hypotheses, squared loss, absolute loss, and hinge loss. We consider linear models without\nbias in d dimensions, so take Z = X \u00d7 Y \u2282 R^d \u00d7 R and H = R^d. Though not crucial to our\nargument, we select the minimum-norm solution in the underdetermined case. 
Aerm : S \u2192 R^d is the\ngeneral minimizer of the risk in this setting. For the squared loss, we have \u2113(z,h) = (x^T h \u2212 y)^2 for\nany z = (x,y) \u2208 Z . The absolute loss is given by \u2113(z,h) = |x^T h \u2212 y| and the hinge loss is de\ufb01ned as\n\u2113(z,h) = max(0, 1 \u2212 y x^T h). Both the absolute loss and the squared loss can be used for regression\nand classi\ufb01cation. The hinge loss is appropriate only for the classi\ufb01cation setting. Therefore, though\nthe rest of the setup remains the same, outputs are limited to the set Y = {\u22121, +1} for the hinge loss.\n\nTheorem 2 Consider a linear Aerm without intercept and assume it either optimizes the squared,\nthe absolute, or the hinge loss. Assume Y contains at least one nonzero element. If there exists\nan open ball B0 that contains the origin, such that B0 \u2282 X , then this risk minimizer is not weakly\n(Z , \u2113,N)-monotonic for any N \u2208 N.\n\nFitting a normal distribution with \ufb01xed mean and unknown variance (in one dimension). We\nfollow up on the example where we \ufb01tted a normal distribution with \ufb01xed covariance and unknown\nmean. We limit ourselves, however, to one dimension only and, more importantly, now take the\nvariance to be the unknown, while \ufb01xing the mean (to 0, arbitrarily). Speci\ufb01cally, let H := { z \u21a6 (1/\u221a(2\u03c0\u03c3^2)) exp(\u2212z^2/(2\u03c3^2)) | \u03c3 > 0 }, Z \u2282 R, and take the loss to equal the negative log-likelihood.\nTheorem 3 If there exists an open ball B0 that contains the origin, such that B0 \u2282 Z , then estimating\nthe variance of a one-dimensional normal density is not weakly (Z , \u2113,N)-monotonic for any N \u2208 N.\n\n5 Experimental Evidence\n\nOur results from the previous section already show cogently that the behavior of the learning curve\ncan be interesting to study. 
Here we complement our theoretical \ufb01ndings with a few illustrative\nexperiments to strengthen this point even further. The results can be found in Figure 1, which displays\n(numerically) exact learning curves for a couple of different settings.\nThe input space considered for all our examples is one-dimensional. The experiment in Sub\ufb01gure 1b\nrelies on the absolute loss, while all others make use of the squared loss. In addition, Sub\ufb01gures 1a,\n1b, and 1c consider distributions with two points: a = (1,1) and b = (1/10, 1) with the \ufb01rst coordinate\nthe input and the second the corresponding output. Different plots use different values for the\nprobability of observing a. For Sub\ufb01gure 1a, P(a) = 0.00001, Sub\ufb01gure 1b uses P(a) = 0.1, and\nSub\ufb01gure 1c takes P(a) = 0.01. For Sub\ufb01gure 1c, we also studied the effect of a small amount\nof standard L2-regularization decreasing with training size (\u03bb = 0.01/n), leading to the regularized\nsolution Areg. The distribution for Sub\ufb01gure 1d is slightly different and supported on three points:\na = (1,1), b = (1/10, \u22121), and c = (\u22121,1), with again the \ufb01rst coordinate as the input and the second\nthe corresponding output. In this case, P(a) = 0.01, P(b) = 0.01, and P(c) = 0.98. This last\nexperiment concerns least squares regression with a bias term: a setting we have not been able to\nanalyze theoretically up to this point.\nMost salient is probably the serrated and completely nonmonotonic behavior of the learning curve for\nthe absolute loss in Figure 1b. Of interest as well is that regularization does not necessarily solve the\nproblem. Sub\ufb01gure 1c even shows it can make it worse: Areg gives nonmonotonic behavior, while\nAerm is monotonic under the same distribution (cf. [Gr\u00fcnwald and Kot\u0142owski, 2011]). 
Sub\ufb01gure 1a\nillustrates clearly how dramatically the expected squared loss can grow with more data.\nIn the \ufb01nal example in Figure 1d, as already noted, we consider linear regression with the squared\nloss that includes a bias term in combination with the distribution supported on three points. This\nexample is of interest because the usual con\ufb01guration for standard learners includes such a bias term\nand one could get the impression from our theoretical results (and maybe in particular the proofs)\nthat the origin plays a major role in the bad behavior of some of the learners. But as can be observed\nhere, adding an intercept, and therefore taking away the possibly special status of the origin, does not\nmake risk nonmonotonicity go away.\n\nFigure 1: Learning curves (average risk against training set size) for some one-dimensional problems.\nSub\ufb01gure (a) is based on squared loss, no intercept; (b) on absolute loss, no intercept; (c) on squared\nloss, no intercept (with and without regularization); (d) on squared loss with intercept. The dashed\nline indicates the risk the learner attains in the limit of an in\ufb01nite training set size.\n\n6 Discussion and Conclusion\n\nIt should be clear that this paper does not get to the bottom of the learning-curve issue. In fact, one\nof the reasons for this work is to bring it to the attention of the community. We are convinced that it\nraises a lot of interesting and interrelated problems that may go far beyond the initial analyses we\noffer here. Further study should bring us to a better understanding of how learning curves can actually\nact, which, in turn, should enable practitioners to better interpret and anticipate their behavior.\nWhat this work does convey is that learning curves can (provably) show some rather counterintuitive\nand surprising behavior. 
In particular, we have demonstrated that least squares regression, regression\nwith the absolute loss, linear models trained with the hinge loss, and likelihood estimation of the\nvariance of a normal distribution can all suffer from nonmonotonic behavior, even when evaluated\nwith the loss they optimize for. All of these are standard learners, using standard loss functions.\nAnyone familiar with the theory of PAC learning may wonder how our results can be reconciled\nwith the bounds that come from this theory. At \ufb01rst glance, our observations may seem to contradict\nthis theory. Learning theory dictates that if the hypothesis class has \ufb01nite VC-dimension, the excess\nrisk \u03b5 of ERM will drop as \u03b5 = O(1/n) in the realizable case and as \u03b5 = O(1/\u221an) in the agnostic case\n[Vapnik, 1998, Shalev-Shwartz and Ben-David, 2014]. Thus PAC bounds give an upper bound\non the excess risk \u03b5 that will be tighter given more samples. PAC bounds hold with a particular\nprobability, but we are concerned with the risk in expectation. Even bounds that hold in expectation\nover the training sample will, however, not rule out nonmonotonic behavior. This is because in the\nend the guarantees from PAC learning are indeed merely bounds. Our analysis shows that within those\nbounds, we cannot always expect risk monotonic behavior. In fact, learning problems of all four\npossible combinations exist: not PAC-learnable and monotonic, PAC-learnable and not monotonic,\netc. For instance, the memorize algorithm (end of Subsection 4.1) is monotone, while it has in\ufb01nite\nVC-dimension and so is not PAC-learnable.\nIn light of the learning rates mentioned above, we wonder whether there are deeper links with Lemma\n1 (see also Remark 2). 
Rewrite Equation (7) to \ufb01nd that we do not have local monotonicity at n in\ncase\n\n(\u2212\u2113(b,h^0_{n+1})/(n + 1) + \u2113(b,h^1_n)) / \u2113(b,h^1_{n\u22121}) > n/(n + 1).   (8)\n\nWith n large enough, we can ignore the \ufb01rst term in the numerator. So if a learner, in this particular\nsetting, does not learn an instance b at least at a rate of n/(n + 1) in terms of the loss, it will display\nnonmonotonic behavior. According to learning theory, for agnostic learners, the fraction between two\nsubsequent losses is of the order \u221a(n/(n + 1)), which is always larger than n/(n + 1) for n > 0. Can one therefore\ngenerally expect nonmonotonic behavior for any agnostic learner? Our normal mean estimation\nproblem shows it cannot. But then, what is the link, if any?\nAs already hinted at in the introduction, our \ufb01ndings may also warrant revisiting the results obtained in\n[Loog, 2016, Krijthe and Loog, 2017, 2018]. These works show that there are some semi-supervised\nlearners that allow for essentially improved performance over the supervised learner, i.e., these\nare truly safe. Though this is the transductive setting, this may in a sense just show how strong\nthese results are. In the end, their estimation procedure is really rather different from empirical\nrisk minimization, but it does raise the question whether similar constructs can be used to get to risk\nmonotonic procedures in the supervised case.\nAnother question, related to the last remark above, seems of interest: could it be that the use of\nparticular losses at training time leads to monotonic behavior at test time? Or can regularization still\nlead to more monotonic behavior, e.g. by explicitly limiting H ? Maybe particular (upper-bounding)\nconvex losses could turn out to behave risk monotonically in terms of speci\ufb01c nonconvex losses? Dipping\nseems to show, however, that this may very well not be the case. 
Results concerning smart rules, i.e.,\nclassifiers that act monotonically in terms of the error rate [Devroye et al., 1996], seem to point in the\nsame direction. So should we expect it to be the other way round? Can nonconvex losses bring us\nmonotonicity guarantees for convex ones? Of course, monotonicity properties of nonconvex learners\nare also of interest to study in their own right.\nAn ultimate goal would of course be to fully characterize when one can have risk monotonic behavior\nand when not. At this point we do not have a clear idea to what extent this would at all be possible.\nWe were, for instance, not able to analyze some standard, seemingly simple cases, e.g. simultaneously\nestimating the mean and the variance of a normal model. And maybe we can only get to rather weak\nresults. Knowledge about the domain alone may turn out to be insufficient and we may need to make\nassumptions on the class of distributions D we are dealing with (leading to some notion of weakly\nD-monotonicity?). For a start, we could study likelihood estimation under correctly specified models,\nfor which generally there turn out to be remarkably few finite-sample results. One can also wonder\nwhether it is possible to find salient distributional properties that can be specifically related to the\noverall shape of the learning curve (see, for instance, [Haussler et al., 1996]).\nAll in all, we believe that our theoretical results, strengthened by some illustrative examples, show\nthat the monotonicity of learning curves is an interesting and nontrivial property to study.\n\nAcknowledgments\n\nWe received various suggestions and comments, among others based on an abstract presented earlier\n[Viering et al., 2019]. 
We particularly want to thank Peter Gr\u00fcnwald, Steve Hanneke, Wojciech\nKot\u0142owski, Jesse Krijthe, and David Tax for constructive feedback and discussions.\nThis work was funded in part by the Netherlands Organisation for Scientific Research (NWO) and\ncarried out under TOP grant project number 612.001.402.\n\nReferences\n\nShun-ichi Amari and Noboru Murata. Statistical theory of learning curves under entropic loss criterion. Neural Computation, 5(1):140\u2013153, 1993.\n\nShun-ichi Amari, Naotake Fujita, and Shigeru Shinomoto. Four types of learning curves. Neural Computation, 4(4):605\u2013618, 1992.\n\nPeter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138\u2013156, 2006.\n\nMikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.\n\nShai Ben-David, Nathan Srebro, and Ruth Urner. Universal learning vs. no free lunch results. In Philosophy and Machine Learning Workshop NIPS, 2011.\n\nShai Ben-David, David Loker, Nathan Srebro, and Karthik Sridharan. Minimizing the misclassification error rate using a surrogate convex loss. In Proceedings of the 29th International Conference on Machine Learning, pages 83\u201390, 2012.\n\nCorinna Cortes, Lawrence D. Jackel, Sara A. Solla, Vladimir N. Vapnik, and John S. Denker. Learning curves: Asymptotic values and rate of convergence. In Advances in Neural Information Processing Systems, pages 327\u2013334, 1994.\n\nLuc Devroye, L\u00e1szl\u00f3 Gy\u00f6rfi, and G\u00e1bor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.\n\nRobert P.W. Duin. Small sample size generalization. In Proceedings of the Scandinavian Conference on Image Analysis, volume 2, pages 957\u2013964, 1995.\n\nRobert P.W. Duin. Classifiers in almost empty spaces. 
In Proceedings of the 15th International Conference on Pattern Recognition, volume 2, pages 1\u20137. IEEE, 2000.\n\nPeter D. Gr\u00fcnwald and Wojciech Kot\u0142owski. Bounds on individual risk for log-loss predictors. In Proceedings of the 24th Annual Conference on Learning Theory, pages 813\u2013816, 2011.\n\nDavid Haussler, Michael Kearns, Manfred Opper, and Robert Schapire. Estimating average-case learning curves using Bayesian, statistical physics and VC dimension methods. In Advances in Neural Information Processing Systems, pages 855\u2013862, 1992.\n\nDavid Haussler, Michael Kearns, H. Sebastian Seung, and Naftali Tishby. Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25(2-3):195\u2013236, 1996.\n\nJoel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Mostofa Patwary, Mostofa Ali, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.\n\nPrasanth Kolachina, Nicola Cancedda, Marc Dymetman, and Sriram Venkatapathy. Prediction of learning curves in machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 22\u201330. Association for Computational Linguistics, 2012.\n\nNicole Kr\u00e4mer. On the peaking phenomenon of the lasso in model selection. arXiv preprint arXiv:0904.4416, 2009.\n\nJesse H. Krijthe and Marco Loog. Projected estimators for robust semi-supervised classification. Machine Learning, 106(7):993\u20131008, 2017.\n\nJesse H. Krijthe and Marco Loog. The pessimistic limits and possibilities of margin-based losses in semi-supervised learning. In Advances in Neural Information Processing Systems, pages 1790\u20131799, 2018.\n\nEsther Levin, Naftali Tishby, and Sara A. Solla. A statistical approach to learning and generalization in layered neural networks. 
Proceedings of the IEEE, 78(10):1568\u20131574, 1990.\n\nMarco Loog. Contrastive pessimistic likelihood estimation for semi-supervised classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):462\u2013475, 2016.\n\nMarco Loog and Robert P.W. Duin. The dipping phenomenon. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 310\u2013317. Springer, 2012.\n\nCharles A. Micchelli and Grace Wahba. Design problems for optimal surface interpolation. Technical Report 565, Department of Statistics, Wisconsin University, 1979.\n\nManfred Opper. Regression with Gaussian processes: Average case performance. In Theoretical aspects of neural computation: A multidisciplinary perspective, pages 17\u201323. Springer, 1998.\n\nManfred Opper. Learning to generalize. Frontiers of Life, 3(part 2):763\u2013775, 2001.\n\nManfred Opper and David Haussler. Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise. In Proceedings of the Fourth Annual Workshop on Computational Learning Theory, pages 75\u201387. Morgan Kaufmann Publishers Inc., 1991.\n\nManfred Opper and Wolfgang Kinzel. Statistical mechanics of generalization. In Models of Neural Networks III, pages 151\u2013209. Springer, 1996.\n\nManfred Opper and Francesco Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In Advances in Neural Information Processing Systems, pages 302\u2013308, 1999.\n\nH.S. Seung, Haim Sompolinsky, and Naftali Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056, 1992.\n\nShai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.\n\nAlexander J. Smola, Peter J. Bartlett, Dale Schuurmans, and Bernhard Sch\u00f6lkopf. 
Advances in Large Margin Classifiers. MIT Press, 2000.\n\nPeter Sollich. Learning curves for Gaussian processes. In Advances in Neural Information Processing Systems, pages 344\u2013350, 1999.\n\nPeter Sollich and Anason Halees. Learning curves for Gaussian process regression: Approximations and bounds. Neural Computation, 14(6):1393\u20131428, 2002.\n\nHaim Sompolinsky, Naftali Tishby, and H. Sebastian Seung. Learning from examples in large neural networks. Physical Review Letters, 65(13):1683, 1990.\n\nStefano Spigler, Mario Geiger, St\u00e9phane d\u2019Ascoli, Levent Sagun, Giulio Biroli, and Matthieu Wyart. A jamming transition from under- to over-parametrization affects loss landscape and generalization. arXiv preprint arXiv:1810.09665, 2018.\n\nNaftali Tishby, Esther Levin, and Sara A. Solla. Consistent inference of probabilities in layered networks: Predictions and generalization. In International Joint Conference on Neural Networks, volume 2, pages 403\u2013409, 1989.\n\nVladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998.\n\nTom Viering, Alexander Mey, and Marco Loog. Open problem: Monotonicity of learning. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 3198\u20133201, Phoenix, USA, 25\u201328 Jun 2019.\n\nChristopher K.I. Williams and Francesco Vivarelli. Upper and lower bounds on the learning curve for Gaussian processes. Machine Learning, 40(1):77\u2013102, 2000.\n", "award": [], "sourceid": 4074, "authors": [{"given_name": "Marco", "family_name": "Loog", "institution": "Delft University of Technology & University of Copenhagen"}, {"given_name": "Tom", "family_name": "Viering", "institution": "Delft University of Technology, Netherlands"}, {"given_name": "Alexander", "family_name": "Mey", "institution": "TU Delft"}]}