{"title": "Learning from queries for maximum information gain in imperfectly learnable problems", "book": "Advances in Neural Information Processing Systems", "page_first": 287, "page_last": 294, "abstract": null, "full_text": "Learning from queries for maximum \n\ninformation gain in imperfectly learnable \n\nproblems \n\nPeter Sollich David Saad \n\nDepartment of Physics, University of Edinburgh \n\nEdinburgh EH9 3JZ, U.K. \n\nP.Sollich~ed.ac.uk. D.Saad~ed.ac.uk \n\nAbstract \n\nIn supervised learning, learning from queries rather than from \nrandom examples can improve generalization performance signif(cid:173)\nicantly. We study the performance of query learning for problems \nwhere the student cannot learn the teacher perfectly, which occur \nfrequently in practice. As a prototypical scenario of this kind, we \nconsider a linear perceptron student learning a binary perceptron \nteacher. Two kinds of queries for maximum information gain, i.e., \nminimum entropy, are investigated: Minimum student space en(cid:173)\ntropy (MSSE) queries, which are appropriate if the teacher space \nis unknown, and minimum teacher space entropy (MTSE) queries, \nwhich can be used if the teacher space is assumed to be known, but \na student of a simpler form has deliberately been chosen. We find \nthat for MSSE queries, the structure of the student space deter(cid:173)\nmines the efficacy of query learning, whereas MTSE queries lead \nto a higher generalization error than random examples, due to a \nlack of feedback about the progress of the student in the way queries \nare selected. \n\n1 \n\nINTRODUCTION \n\nIn systems that learn from examples, the traditional approach has been to study \ngeneralization from random examples, where each example is an input-output pair \n\n\f288 \n\nPeter Sollich, David Saad \n\nwith the input chosen randomly from some fixed distribution and the corresponding \noutput provided by a teacher that one is trying to approximate. 
However, random examples contain less and less new information as learning proceeds. Therefore, generalization performance can be improved by learning from queries, i.e., by choosing the input of each new training example such that it will be, together with its expected output, in some sense 'maximally useful'. The most widely used measure of 'usefulness' is the information gain, i.e., the decrease in entropy of the post-training probability distributions in the parameter space of the student or the teacher. We shall call the resulting queries 'minimum (student or teacher space) entropy (MSSE/MTSE) queries'; their effect on generalization performance has recently been investigated for perfectly learnable problems, where student and teacher space are identical (Seung et al., 1992, Freund et al., 1993, Sollich, 1994), and was found to depend qualitatively on the structure of the teacher. For a linear perceptron, for example, one obtains a relative reduction in generalization error compared to learning from random examples which becomes insignificant as the number of training examples, p, tends to infinity. For a perceptron with binary output, on the other hand, minimum entropy queries result in a generalization error which decays exponentially as p increases, a marked improvement over the much slower algebraic decay with p in the case of random examples. \n\nIn practical situations, one almost always encounters imperfectly learnable problems, where the student can only approximate the teacher, but not learn it perfectly. Imperfectly learnable problems can arise for two reasons: Firstly, the teacher space (i.e., the space of models generating the data) might be unknown. Because the teacher space entropy is then also unknown, MSSE (and not MTSE) queries have to be used for query learning. 
Secondly, the teacher space may be known, but a student of a simpler structure might have deliberately been chosen to facilitate or speed up training, for example. In this case, MTSE queries could be employed as an alternative to MSSE queries. The motivation for doing this would be strongest if, as in the learning scenario that we consider below, it is known from analyses of perfectly learnable problems that the structure of the teacher space allows more significant improvements in generalization performance from query learning than the structure of the student space. \n\nWith the above motivation in mind, we investigate in this paper the performance of both MSSE and MTSE queries for a prototypical imperfectly learnable problem, in which a linear perceptron student is trained on data generated by a binary perceptron teacher. Both student and teacher are specified by an N-dimensional weight vector with real components, and we will consider the thermodynamic limit N → ∞, p → ∞, α = p/N = const. In Section 2 below we calculate the generalization error for learning from random examples. In Sections 3 and 4 we compare the result to MSSE and MTSE queries. Throughout, we only outline the necessary calculations; for details, we refer the reader to a forthcoming publication. We conclude in Section 5 with a summary and brief discussion of our results. \n\n2 LEARNING FROM RANDOM EXAMPLES \n\nWe denote students and teachers by N (for 'Neural network') and V (for 'element of the Version space', see Section 4), respectively, and their corresponding weight vectors by w_N and w_V. For an input vector x, the outputs of a given student and teacher are \n\ny_N = (1/√N) x^T w_N,    y_V = sgn((1/√N) x^T w_V). 
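As a numerical illustration (our own sketch, not part of the paper), the student and teacher outputs above, together with the closed-form generalization error of eq. (1) below, can be checked by Monte Carlo for large N, assuming inputs drawn uniformly from the hypersphere x² = N:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500

# Student weight vector w_N and teacher weight vector w_V; the teacher is
# scaled so that Q_V = w_V^2 / N = 1, matching the spherical prior.
w_N = 0.8 * rng.normal(size=N)
w_V = rng.normal(size=N)
w_V *= np.sqrt(N) / np.linalg.norm(w_V)

R = w_N @ w_V / N      # student-teacher overlap
Q_N = w_N @ w_N / N    # normalized squared student length

# Closed-form generalization error (eq. (1) with Q_V = 1), exact as N -> infinity.
eps_theory = 0.5 * (Q_N + 1.0 - 2.0 * np.sqrt(2.0 / np.pi) * R)

# Monte Carlo estimate over random test inputs from the hypersphere x^2 = N.
x = rng.normal(size=(200_000, N))
x *= np.sqrt(N) / np.linalg.norm(x, axis=1, keepdims=True)
y_N = x @ w_N / np.sqrt(N)            # linear student output
y_V = np.sign(x @ w_V / np.sqrt(N))   # binary teacher output
eps_mc = np.mean(0.5 * (y_N - y_V) ** 2)

print(eps_theory, eps_mc)  # agree to a few decimal places at this N
```

The √(2/π) factor comes from the Gaussian-field average ⟨u sgn(v)⟩ = R √(2/π) for the jointly Gaussian local fields of student and teacher.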
\n\nAssuming that inputs are drawn from a uniform distribution over the hypersphere \nx 2 = N, and taking as our error measure the standard squared output difference \n! (YN - Yv)2, the generalization error, i. e., the average error between student Nand \nteacher V when tested on random test inputs, is given by \n\n\u20acg(N, V) = '2 QN + 1- 2..ftTv;: \n\n, \n\n(1) \n\n1 [ \n\nR (2) 1/2] \n\nwhere we have set R = ];WrrWv,QN = ];w~,Qv = ];w~. \nAs our training algorithm we take stochastic gradient descent on the trammg \nerror E t , which for a training set e(p) = {( x~ ,Y~ = Yv( x~)), J.l = 1 ... p} is \nEt = ! 2:~(Y~ - YN(X~))2. A weight decay term !AW~ is added for regulariza(cid:173)\ntion, i.e., to prevent overfitting. Stochastic gradient descent on the resulting en(cid:173)\nergy function E = E t + !AW~ yields a Gibbs post-training distribution of students, \np(wNle(p)) ex exp(-E/T), where the training temperature T measures the amount \nof stochasticity in the training algorithm. For the linear perceptron students con(cid:173)\nsidered here, this distribution is Gaussian, with covariance matrix TM,N-l, where \n(IN denotes the N x N identity matrix) \n\nMN = AlN + ~ 2:~=1 x~(x~f\u00b7 \n\nSince the length ofthe teacher weight vector Wv does not affect the teacher outputs, \nwe assume a spherical prior on teacher space, P(w v ) ex 6(w~-N), for which Qv = 1. \nRestricting attention to the limit of zero training temperature, it is straightforward \nto calculate from eq . (1) the average generalization error obtained by training on \nrandom examples \n\n\u20acg -\n\n\u20acg,min =;: AoptG + A(Aopt - A) OA \nOG] \n\n1 [ \n\n' \n\nwith the function G = (]; tr M,N-1) P( {xl'}) given by (Krogh and Hertz, 1992) \n\nG = 2\\ [1 - a - A + )(1- a - A)2 + 4A] . \n\n(2) \n\n(3) \n\nIn eq. 
(2) we have explicitly subtracted the minimum achievable generalization error, ε_g,min = ½(1 − 2/π), which is nonzero since a linear perceptron cannot approximate a binary perceptron perfectly. At finite α, the generalization error is minimized when the weight decay is set to its optimal value λ = λ_opt = π/2 − 1. Note that since both G and ∂G/∂λ tend to zero as α → ∞, the generalization error for random examples approaches the minimum achievable generalization error in this limit. \n\n3 MINIMUM STUDENT SPACE ENTROPY QUERIES \n\nWe now calculate the generalization performance resulting from MSSE queries. For the training algorithm introduced in the last section, the student space entropy (normalized by N) is given by \n\nS_N = −(1/2N) ln det M_N, \n\nwhere we have omitted an unimportant constant which depends on the training temperature only. This entropy is minimized by choosing each new query along the direction corresponding to the minimal eigenvalue of the existing M_N (Sollich, 1994). The expression for the resulting average generalization error is given by eq. (2) with G replaced by its analogue for MSSE queries (Sollich, 1994), \n\nG = (1 − Δα)/(λ + [α]) + Δα/(λ + [α] + 1), \n\nwhere [α] is the greatest integer less than or equal to α and Δα = α − [α]. We define the improvement factor κ as the ratio of the generalization error (with the minimum achievable generalization error subtracted as in eq. (2)) for random examples to that for MSSE queries. \n\nFigure 1: Relative improvement κ in generalization error due to MSSE queries, for weight decay λ = 0.01, 0.1, 1. 
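The MSSE form of G, namely G = (1 − Δα)/(λ + [α]) + Δα/(λ + [α] + 1), can be checked in a small simulation (our sketch, not from the paper): each query along the minimal-eigenvalue eigendirection of M_N, scaled to x² = N, bumps the smallest eigenvalue of M_N by exactly one, which reproduces the [α], Δα structure:

```python
import numpy as np

N = 40
lam = 0.1

def G_query(alpha, lam):
    # MSSE analogue of G: [alpha] = integer part, d = alpha - [alpha].
    k = int(np.floor(alpha))
    d = alpha - k
    return (1.0 - d) / (lam + k) + d / (lam + k + 1)

# Simulate the query rule: each new input (scaled to x^2 = N) lies along
# the eigendirection of the current M_N with the smallest eigenvalue.
M = lam * np.eye(N)
p = int(2.5 * N)                      # alpha = p/N = 2.5
for _ in range(p):
    evals, evecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    x = np.sqrt(N) * evecs[:, 0]      # minimal-eigenvalue direction
    M += np.outer(x, x) / N           # rank-one update: min eigenvalue += 1

G_emp = np.trace(np.linalg.inv(M)) / N
print(G_emp, G_query(2.5, lam))
```

After p = 2.5 N queries, a fraction Δα = 0.5 of the eigenvalues sit at λ + 3 and the rest at λ + 2, so the empirical (1/N) tr M_N^{-1} matches the formula essentially to machine precision.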
Figure 1 shows κ(α) for several values of the weight decay λ. Comparing with existing results (Sollich, 1994), we find that κ is exactly the same as if our linear student were trying to approximate a linear teacher with additive noise of variance λ_opt on the outputs. For large α, one can show (Sollich, 1994) that κ = 1 + 1/α + O(1/α²), and hence the relative reduction in generalization error due to querying tends to zero as α → ∞. We investigate in the next section whether it is possible to improve generalization performance more significantly by using MTSE queries. \n\n4 MINIMUM TEACHER SPACE ENTROPY QUERIES \n\nWe now consider the generalization performance achieved by MTSE queries. We remind readers that such queries could be used if the teacher space is known, but a student of a simpler functional form has deliberately been chosen. The aim in using MTSE rather than MSSE queries would be to exploit the structure of the teacher space if this is known (for perfectly learnable problems) to make query learning very efficient compared to random examples. \n\nFor the case of noise free training data under consideration, the posterior probability distribution in teacher space given a certain training set is proportional to the prior distribution on the version space (the set of all teachers that could have produced the training set without error) and zero everywhere else. From this the (normalized) teacher space entropy can be derived to be, up to an additive constant, \n\nS_V = (1/N) ln V(p), \n\nwhere the version space volume V(p) is given by (θ(z) = 1 for z > 0 and 0 otherwise) \n\nV(p) = ∫ dw_V P(w_V) Π_{μ=1}^p θ((1/√N) y^μ w_V^T x^μ). \n\nIt can easily be verified that this entropy is minimized¹ by choosing queries x which 'bisect' the existing version space, i. 
e., for which the hyperplane perpendicular to x splits the version space into two equal halves (Seung et al., 1992, Freund et al., 1993). Such queries lead to an exponentially shrinking version space, V(p) = 2^{−p}, and hence a linear decrease of the entropy, S_V = −α ln 2. We consider instead queries which achieve qualitatively the same effect, but permit a much simpler analysis of the resulting student performance. They are similar to those studied in the context of a learnable problem by Watkin and Rau (1992), and are defined as follows. The (p+1)th query is obtained by first picking a random teacher vector w_p from the version space defined by the existing p training examples, and then picking the new training input x_{p+1} from the distribution of random inputs but under the constraint that x_{p+1}^T w_p = 0. \n\nFor the calculation of the student performance, i.e., the average generalization error, achieved by the approximate MTSE queries described above, we use an approximation based on the following observation. As the number of training examples, p, increases, the teacher vectors w_p from the version space will align themselves with the true teacher w_V^0; their components along the direction of w_V^0 will increase, whereas their components perpendicular to w_V^0 will decrease, varying widely across the (N−1)-dimensional hyperplane perpendicular to w_V^0. Following Watkin and Rau (1992), we therefore assume that the only significant effect of choosing queries x_{p+1} with x_{p+1}^T w_p = 0 is on the distribution of the component of x_{p+1} along w_V^0. Writing this component as x_{p+1}^0 = x_{p+1}^T w_V^0/|w_V^0|, its probability distribution can readily be shown to be \n\nP(x_{p+1}^0) ∝ exp(−½ (x_{p+1}^0/s_p)²),    (4) \n\nwhere s_p is the sine of the angle between w_p and w_V^0. For finite N, the value of s_p is dependent on the p previous training examples that define the existing version space and on the teacher vector w_p sampled randomly from this version space. 
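For small N, the approximate-MTSE query procedure just described can be sketched directly (our illustration; the paper instead works in the thermodynamic limit). Version-space sampling is done here by naive rejection, which is feasible only for tiny training sets:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
w_true = rng.normal(size=N)
w_true *= np.sqrt(N) / np.linalg.norm(w_true)

def sample_version_space(X, y, batch=20_000, max_batches=500):
    """Rejection-sample a teacher from the version space (spherical prior,
    all training signs reproduced). Feasible only for small p."""
    for _ in range(max_batches):
        W = rng.normal(size=(batch, N))
        W *= np.sqrt(N) / np.linalg.norm(W, axis=1, keepdims=True)
        if len(X) == 0:
            return W[0]
        ok = np.all(np.sign(W @ X.T) == y, axis=1)
        hits = np.flatnonzero(ok)
        if hits.size:
            return W[hits[0]]
    raise RuntimeError("version space too small for rejection sampling")

X, y = np.empty((0, N)), np.empty(0)
for p in range(10):
    w_p = sample_version_space(X, y)     # random teacher from version space
    x = rng.normal(size=N)               # random input ...
    x -= (x @ w_p) / (w_p @ w_p) * w_p   # ... constrained to x^T w_p = 0
    x *= np.sqrt(N) / np.linalg.norm(x)  # rescale to x^2 = N
    X = np.vstack([X, x])
    y = np.append(y, np.sign(x @ w_true))

# Any teacher sampled from the final version space reproduces all outputs.
w_check = sample_version_space(X, y)
print(np.all(np.sign(X @ w_check) == y))  # prints True
```

At such small N the fluctuations are large, but the exponential shrinkage of the version space already shows up as a rapidly falling acceptance rate of the rejection sampler.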
In \nthe thermodynamic limit, however, the variations of sp become vanishingly small \nand we can thus replace sp by its average value, which is a function of palone. \nIn the thermodynamic limit, this average value becomes a continuous function of \na = pIN , the number of training examples per weight, which we denote simply by \nsea) . The calculation can then be split into two parts: First, the function sea) is \nobtained from a calculation of the teacher space entropy using the replica method, \ngeneralizing the results of Gyorgi and Tishby (1990) . The average generalization \n\n1 More precisely, what is minimized is the value of the entropy after a new training \nexample (x, y) is added, averaged over the distribution of the unknown new training output \ny given the new training input x and the existing training set; see Sollich (1994). \n\n\f292 \n\nPeter Sollich, David Saad \n\n................\u2022 , ....... ,., ..... ,. \n\n--':''''''::-.::, ----\n-------(cid:173)\n' . \n...\u2022.....\u2022\u2022...... \n\n..... \n............ \n\n0 \n\n-1 \n\n-2 \n\n-3 \n\n-4 \n\n-5 \n\n0 \n\n2 \n\n3 \n\n4 \n\n5 a 6 \n\n7 \n\n8 \n\n9 \n\n10 \n\nFigure 2: MTSE queries: Teacher space entropy, Sv (with value for random exam(cid:173)\nples plotted for comparison), and In s, the log of the sine of the angle between the \ntrue teacher and a random teacher from the version space. \n\nerror can then be calculated by using an extension of the response function method \ndescribed in (Sollich, 1994b) or by another replica calculation (now in student space) \nas in (Dunmur and Wallace, 1993). \n\nFigure 2 shows the effects of (approximate) MTSE queries in teacher space. For large \na values, the teacher space entropy decreases linearly with a, with gradient c ::::::: 0.44, \nwhereas the entropy for random examples, also shown for comparison, decreases \nmuch more slowly (asymptotically like -In a, see (Gyorgi and Tishby, 1990)). 
The linear α-dependence of the entropy for queries corresponds to an average reduction of the version space volume with each new training example by a factor of exp(−c) ≈ 0.64, which is reasonably close to the factor ½ for proper bisection of the version space. This justifies our choice of analysing approximate MTSE queries rather than true MTSE queries, since the former achieve qualitatively the same results as the latter. \n\nBefore discussing the student performance achieved by (approximate) MTSE queries, we note from figure 2 that ln s(α) decreases linearly with α for large α, with the same gradient as the teacher space entropy. Hence s(α) ∝ exp(−cα) for large α, and MTSE queries force the average teacher from the version space to approach the true teacher exponentially quickly. It can easily be shown that if we were learning with a binary perceptron student, i.e., if the problem were perfectly learnable, then this would result in an exponentially decaying generalization error, ε_g ∝ exp(−cα). MTSE queries would thus lead to a marked improvement in generalization performance over random examples (for which ε_g ∝ 1/α, see (Gyorgi and Tishby, 1990)). It is this significant benefit (in teacher space) of query learning that provides the motivation for using MTSE queries in imperfectly learnable problems such as the one considered here. \n\nThe results plotted in Figure 3 for the average generalization error achieved by the linear perceptron student show, however, that MTSE queries do not have the desired effect. Far from translating the benefits in teacher space into improvements in generalization performance for the linear student, they actually lead to a deterioration of generalization performance, i.e., a larger generalization error than that obtained for random examples. \n\nFigure 3: Generalization error for MTSE queries (higher curves of each pair) and random examples (lower curves), for weight decay λ = 0.01, 0.1, 1. The curves for random examples (which are virtually indistinguishable from one another already at α = 10) converge to the minimum achievable generalization error ε_g,min (dotted line) as α → ∞. \n\nWorse still, they 'mislead' the student to such an extent that the minimum achievable generalization error is not reached even for an infinite number of training examples, α → ∞. How does this happen? It can be verified that the angle between the student and teacher weight vectors tends to zero for α → ∞ as expected, while Q_N, the normalized squared length of the student weight vector, approaches \n\nQ_N(α → ∞) = (2/π) [⟨s⟩/(λ + ⟨s²⟩)]²,    (5) \n\nwhere ⟨s⟩ = ∫₀^∞ dα s(α) and ⟨s²⟩ = ∫₀^∞ dα s²(α). Unless the weight decay parameter λ happens to be equal to ⟨s⟩ − ⟨s²⟩, this is different from the optimal asymptotic value, which is 2/π. This is the reason why in general the linear student does not reach the minimum possible generalization error even as α → ∞. The approach of Q_N to its non-optimal asymptotic value can cause an increase in the generalization error for large α and a corresponding minimum of the generalization error at some finite α, as can be seen in the plots for λ = 0.01 and 0.1 in Figure 3. For λ = 0, eq. (5) has the following intuitive interpretation: As α increases, the version space shrinks around the true teacher w_V^0, and hence MTSE queries become 'more and more orthogonal' to w_V^0. 
As a consequence, the distribution of training inputs along the direction of w_V^0 is narrowed down progressively (compare eq. (4)). Trying to find a best fit to the teacher's binary output function over this narrower range of inputs, the linear student learns a function which is steeper than the best fit over the range of random inputs (which would give minimum generalization error). This corresponds to a suboptimally large length of the student weight vector, in agreement with eq. (5): Q_N(α → ∞) > 2/π for λ = 0 because ⟨s²⟩ < ⟨s⟩. \n\nSummarizing the results of this section, we have found that although MTSE queries are very beneficial in teacher space, they are entirely misleading for the linear student, to the extent that the student does not learn to approximate the teacher optimally even for an infinite number of training examples. \n\n5 SUMMARY AND DISCUSSION \n\nWe have found in our study of an imperfectly learnable problem with a linear student and a binary teacher that queries for minimum student and teacher space entropy, respectively, have very different effects on generalization performance. Minimum student space entropy (MSSE) queries essentially have the same effect as for a linear student learning a noisy linear teacher, apart from a nonzero minimum value of the generalization error due to the unlearnability of the problem. Hence the structure of the student space is the dominating influence on the efficacy of query learning. Minimum teacher space entropy (MTSE) queries, on the other hand, perform worse than random examples, leading to a higher generalization error even for an infinite number of training examples. 
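The mechanism behind this failure, discussed in Section 4, reduces to a one-dimensional observation: the least-squares linear fit to a sign function steepens as the input distribution narrows. A quick numerical check (our illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def best_linear_slope(sigma, n=400_000):
    # Least-squares fit of a*x to sgn(x) for x ~ N(0, sigma^2):
    # a* = E[x sgn x] / E[x^2] = sqrt(2/pi) / sigma, steeper for narrow inputs.
    x = rng.normal(scale=sigma, size=n)
    return float(np.sum(x * np.sign(x)) / np.sum(x * x))

a_wide = best_linear_slope(1.0)    # about sqrt(2/pi), roughly 0.80
a_narrow = best_linear_slope(0.5)  # about twice as steep
print(a_wide, a_narrow)
```

Halving the input spread doubles the optimal slope, mirroring the suboptimally large student weight length Q_N(α → ∞) > 2/π found for MTSE queries at λ = 0.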
With the benefit of hindsight, we note that this makes intuitive sense since the teacher space entropy, according to which MTSE queries are selected, contains no feedback about the progress of the student in learning the required generalization task, and thus MTSE queries cannot be guaranteed to have a positive effect. \n\nOur results, then, are a mixture of good and bad news for query learning for maximum information gain in imperfectly learnable problems: The bad news is that MTSE queries, due to a lack of feedback information about student progress, are not enough to translate significant benefits in teacher space into similar improvements of student performance and may in fact yield worse performance than random examples. The good news is that for MSSE queries, we have found evidence that the structure of the student space is the key factor in determining the efficacy of query learning. If this result holds more generally, then statements about the benefits of query learning can be made on the basis of how one is trying to learn only, independently of what one is trying to learn, a result of great practical significance. \n\nReferences \n\nA P Dunmur and D J Wallace (1993). Learning and generalization in a linear perceptron stochastically trained with noisy data. J. Phys. A, 26:5767-5779. \n\nY Freund, H S Seung, E Shamir, and N Tishby (1993). Information, prediction, and query by committee. In S J Hanson, J D Cowan, and C Lee Giles, editors, NIPS 5, pages 483-490, San Mateo, CA, Morgan Kaufmann. \n\nG Gyorgi and N Tishby (1990). Statistical theory of learning a rule. In W Theumann and R Koberle, editors, Neural Networks and Spin Glasses, pages 3-36. Singapore, World Scientific. \n\nA Krogh and J A Hertz (1992). Generalization in a linear perceptron in the presence of noise. J. Phys. A, 25:1135-1147. \n\nP Sollich (1994). Query construction, entropy, and generalization in neural network models. 
Phys. Rev. E, 49:4637-4651. \n\nP Sollich (1994b). Finite-size effects in learning and generalization in linear perceptrons. J. Phys. A, 27:7771-7784. \n\nH S Seung, M Opper, and H Sompolinsky (1992). Query by committee. In Proceedings of COLT '92, pages 287-294, New York, ACM. \n\nT L H Watkin and A Rau (1992). Selecting examples for perceptrons. J. Phys. A, 25:113-121. \n", "award": [], "sourceid": 965, "authors": [{"given_name": "Peter", "family_name": "Sollich", "institution": null}, {"given_name": "David", "family_name": "Saad", "institution": null}]}