{"title": "A Method for Learning From Hints", "book": "Advances in Neural Information Processing Systems", "page_first": 73, "page_last": 80, "abstract": null, "full_text": "A Method for Learning from Hints \n\nYaser s. Abu-Mostafa \n\nDepartments of Electrical Engineering, Computer Science, \n\nand Computation and Neural Systems \n\nCalifornia Institute of Technology \n\nPasadena, CA 91125 \n\ne-mail: yaser@caltech.edu \n\nAbstract \n\nWe address the problem of learning an unknown function by \npu tting together several pieces of information (hints) that we know \nabout the function. We introduce a method that generalizes learn(cid:173)\ning from examples to learning from hints. A canonical representa(cid:173)\ntion of hints is defined and illustrated for new types of hints. All \nthe hints are represented to the learning process by examples, and \nexamples of the function are treated on equal footing with the rest \nof the hints. During learning, examples from different hints are \nselected for processing according to a given schedule. We present \ntwo types of schedules; fixed schedules that specify the relative em(cid:173)\nphasis of each hint, and adaptive schedules that are based on how \nwell each hint has been learned so far. Our learning method is \ncompatible with any descent technique that we may choose to use. \n\n1 \n\nINTRODUCTION \n\nThe use of hints is coming to the surface in a number of research communities dealing \nwith learning and adaptive systems. In the learning-from-examples paradigm, one \noften has access not only to examples of the function, but also to a number of \nhints (prior knowledge, or side information) about the function. The most common \ndifficulty in taking advantage of these hints is that they are heterogeneous and \ncannot be easily integrated into the learning process. This paper is written with the \nspecific goal of addressing this problem. 
The paper develops a systematic method for incorporating different hints in the usual learning-from-examples process. \n\nWithout such a systematic method, one can still take advantage of certain types of hints. For instance, one can implement an invariance hint by preprocessing the input to achieve the invariance through normalization. Alternatively, one can structure the learning model in a way that directly implements the invariance (Minsky and Papert, 1969). Whenever direct implementation is feasible, the full benefit of the hint is realized. This paper does not attempt to offer a superior alternative to direct implementation. However, when direct implementation is not an option, we prescribe a systematic method for incorporating practically any hint in any descent technique for learning. The goal is to automate the use of hints in learning to a degree where we can effectively utilize a large number of different hints that may be available in a practical situation. As the use of hints becomes routine, we are encouraged to exploit even the simplest observations that we may have about the function we are trying to learn. \n\nThe notion of hints is quite general and it is worthwhile to formalize what we mean by a hint as far as our method is concerned. Let f be the function that we are trying to learn. A hint is a property that f is known to have. Thus, all that is needed to qualify as a hint is a litmus test that f passes and that can be applied to different functions. Formally, a hint is a given subset of functions that includes f. \n\nWe start by introducing the basic nomenclature and notation. The environment X is the set on which the function f is defined. The points in the environment are distributed according to some probability distribution P. f takes on values from some set Y \n\nf: X -> Y \n\nOften, Y is just {0, 1} or the interval [0, 1]. 
The learning process takes pieces of information about (the otherwise unknown) f as input and produces a hypothesis g \n\ng: X -> Y \n\nthat attempts to approximate f. The degree to which a hypothesis g is considered an approximation of f is measured by a distance or 'error' \n\nE(g, f) \n\nThe error E is based on the disagreement between g and f as seen through the eyes of the probability distribution P. Two popular forms of the error measure are \n\nE = Pr[g(x) != f(x)] \n\nand \n\nE = E[(g(x) - f(x))^2] \n\nwhere Pr[.] denotes the probability of an event, and E[.] denotes the expected value of a random variable. The underlying probability distribution is P. E will always be a non-negative quantity, and we will take E(g, f) = 0 to mean that g and f are identical for all intents and purposes. We will also assume that when the set of hypotheses is parameterized by real-valued parameters (e.g., the weights in the case of a neural network), E will be well-behaved as a function of the parameters (in order to allow for derivative-based descent techniques). We make the same assumptions about the error measures that will be introduced in section 2 for the hints. \n\nIn this paper, the 'pieces of information' about f that are input to the learning process are more general than in the learning-from-examples paradigm. In that paradigm, a number of points x1, ..., xN are picked from X (usually independently according to the probability distribution P) and the values of f on these points are provided. Thus, the input to the learning process is the set of examples \n\n(x1, f(x1)), ..., (xN, f(xN)) \n\nand these examples are used to guide the search for a good hypothesis. We will consider the set of examples of f as only one of the available hints and denote it by H0. The other hints H1, ..., HM will be additional known facts about f, such as invariance properties for instance. 
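As a concrete illustration (ours, not the paper's), both forms of the error E can be estimated by sampling points according to P; the function names below are hypothetical:

```python
import random

def misclassification_error(g, f, xs):
    """Estimate E = Pr[g(x) != f(x)] by the fraction of sampled points
    on which the hypothesis g and the target f disagree."""
    return sum(1 for x in xs if g(x) != f(x)) / len(xs)

def squared_error(g, f, xs):
    """Estimate E = E[(g(x) - f(x))^2] by averaging over sampled points."""
    return sum((g(x) - f(x)) ** 2 for x in xs) / len(xs)

# Identical hypotheses yield zero error under either measure, on any sample.
rng = random.Random(0)
xs = [rng.uniform(0.0, 1.0) for _ in range(100)]
f = lambda x: x
assert misclassification_error(f, f, xs) == 0.0
assert squared_error(f, f, xs) == 0.0
```

Note that both estimators converge to the corresponding expectation under P as the sample grows, which is why the sample must be drawn according to P.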
\n\nThe paper is organized as follows. Section 2 develops a canonical way for representing different hints. This is the first step in dealing with any hint that we encounter in a practical situation. Section 3 develops the basis for learning from hints and describes our method, including specific learning schedules. \n\n2 REPRESENTATION OF HINTS \n\nAs we discussed before, a hint Hm is defined by a litmus test that f satisfies and that can be applied to the set of hypotheses. This definition of Hm can be extended to a definition of 'approximation of Hm' in several ways. For instance, g can be considered to approximate Hm within epsilon if there is a function h that strictly satisfies Hm for which E(g, h) <= epsilon. In the context of learning, it is essential to have a notion of approximation since exact learning is seldom achievable. Our definitions for approximating different hints will be part of the scheme for representing those hints. \n\nThe first step in representing Hm is to choose a way of generating 'examples' of the hint. For illustration, suppose that Hm asserts that \n\nf: [-1, +1] -> [-1, +1] \n\nis an odd function. An example of Hm would have the form \n\nf(-x) = -f(x) \n\nfor a particular x in [-1, +1]. To generate N examples of this hint, we generate x1, ..., xN and assert for each xn that f(-xn) = -f(xn). Suppose that we are in the middle of a learning process, and that the current hypothesis is g when the example f(-x) = -f(x) is presented. We wish to measure how much g disagrees with this example. This leads to the second component of the representation, the error measure em. For the oddness hint, em can be defined as \n\nem = (g(x) + g(-x))^2 \n\nso that em = 0 reflects total agreement with the example (i.e., g(-x) = -g(x)). 
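For concreteness, the oddness hint above can be sketched in code; this is our own minimal illustration (the function names are not from the paper):

```python
import random

def oddness_error(g, x):
    """Per-example error for the oddness hint: em = (g(x) + g(-x))**2,
    which is zero exactly when g(-x) = -g(x)."""
    return (g(x) + g(-x)) ** 2

def generate_oddness_examples(n, seed=0):
    """Pick n points x in [-1, +1] at which to assert f(-x) = -f(x)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

# An odd hypothesis agrees with every example of the hint; an even one does not.
xs = generate_oddness_examples(5)
assert all(oddness_error(lambda t: t, x) == 0.0 for x in xs)       # odd g
assert any(oddness_error(lambda t: t * t, x) > 0.0 for x in xs)    # even g
```

Note that em can be driven to zero without g being close to f; the oddness hint constrains the symmetry of g, nothing more.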
\nOnce the disagreement between g and an example of Hm has been quantified through em, the disagreement between g and Hm as a whole is automatically quantified through Em, where \n\nEm = E[em] \n\nThe expected value is taken w.r.t. the probability rule for picking the examples. Therefore, Em can be estimated by averaging em over a number of examples that are independently picked. \n\nThe choice of representation of Hm is not unique, and Em will depend on the form of examples, the probability rule for picking the examples, and the error measure em. A minimum requirement on Em is that it should be zero when E = 0. This requirement guarantees that a hypothesis for which E = 0 (perfect hypothesis) will not be excluded by the condition Em = 0. \n\nLet us illustrate how to represent different types of hints. Perhaps the most common type of hint is the invariance hint. This hint asserts that f(x) = f(x') for certain pairs x, x'. For instance, \"f is shift-invariant\" is formalized by the pairs x, x' that are shifted versions of each other. To represent the invariance hint, an invariant pair (x, x') is picked as an example. The error associated with this example is \n\nem = (g(x) - g(x'))^2 \n\nAnother related type of hint is the monotonicity hint (or inequality hint). The hint asserts for certain pairs x, x' that f(x) <= f(x'). For instance, \"f is monotonically nondecreasing in x\" is formalized by all pairs x, x' such that x < x'. To represent the monotonicity hint, an example (x, x') is picked, and the error associated with this example is given by \n\nem = (g(x) - g(x'))^2 if g(x) > g(x'), and em = 0 if g(x) <= g(x'). \n\nThe third type of hint we discuss here is the approximation hint. The hint asserts for certain points x in X that f(x) is in [ax, bx]. In other words, the value of f at x is known only approximately. 
The error associated with an example x of the approximation hint is \n\nem = (g(x) - ax)^2 if g(x) < ax, em = (g(x) - bx)^2 if g(x) > bx, and em = 0 if g(x) is in [ax, bx]. \n\nAnother type of hint arises when the learning model allows non-binary values for g where f itself is known to be binary. This gives rise to the binary hint. Let X' be the subset of X where f is known to be binary (for Boolean functions, X' is the set of binary input vectors). The binary hint is represented by examples of the form x, where x is in X'. The error function associated with an example x (assuming 0/1 binary convention, and assuming g(x) in [0, 1]) is \n\nem = g(x)(1 - g(x)) \n\nThis choice of em forces it to be zero when g(x) is either 0 or 1, while it would be positive if g(x) is between 0 and 1. \n\nIt is worth noting that the set of examples of f can be formally treated as a hint, too. Given (x1, f(x1)), ..., (xN, f(xN)), the examples hint asserts that these are the correct values of f at those particular points. Now, to generate an 'example' of this hint, we pick a number n from 1 to N and use the corresponding (xn, f(xn)). The error associated with this example is e0 (we fix the convention that m = 0 for the examples hint) \n\ne0 = (g(xn) - f(xn))^2 \n\nAssuming that the probability rule for picking n is uniform over {1, ..., N}, \n\nE0 = E[e0] = (1/N) sum_{n=1}^{N} (g(xn) - f(xn))^2 \n\nIn this case, E0 is also the best estimator of E = E[(g(x) - f(x))^2] given x1, ..., xN that are independently picked according to the original probability distribution P. This way of looking at the examples of f justifies their treatment exactly as one of the hints, and underlines the distinction between E and E0. \n\nIn a practical situation, we try to infer as many hints about f as the situation will allow. Next, we represent each hint according to the scheme discussed in this section. 
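The per-example error measures introduced in this section can be collected in one short sketch; this is our own illustration with hypothetical function names, assuming a real-valued hypothesis g:

```python
def invariance_error(g, x, x_prime):
    # Invariance hint f(x) = f(x'): penalize disagreement between g(x) and g(x').
    return (g(x) - g(x_prime)) ** 2

def monotonicity_error(g, x, x_prime):
    # Monotonicity hint f(x) <= f(x'): penalize only violations g(x) > g(x').
    gx, gxp = g(x), g(x_prime)
    return (gx - gxp) ** 2 if gx > gxp else 0.0

def approximation_error(g, x, a, b):
    # Approximation hint f(x) in [a, b]: penalize g(x) outside the interval.
    gx = g(x)
    if gx < a:
        return (gx - a) ** 2
    if gx > b:
        return (gx - b) ** 2
    return 0.0

def binary_error(g, x):
    # Binary hint: em = g(x)(1 - g(x)) vanishes iff g(x) is 0 or 1.
    return g(x) * (1.0 - g(x))

def examples_error(g, x, fx):
    # Examples hint (m = 0): squared error against the known value f(x).
    return (g(x) - fx) ** 2

# A monotone hypothesis incurs no monotonicity error on an ordered pair.
g = lambda x: 2.0 * x
assert monotonicity_error(g, 0.2, 0.7) == 0.0
assert binary_error(lambda x: 0.5, 0.0) == 0.25
```

Averaging any of these over independently picked examples gives an estimate of the corresponding Em.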
This leads to a list H0, H1, ..., HM of hints that are ready to produce examples upon the request of the learning algorithm. We now address how the algorithm should pick and choose between these examples as it moves along. \n\n3 LEARNING SCHEDULES \n\nIf the learning algorithm had complete information about f, it would search for a hypothesis g for which E(g, f) = 0. However, f being unknown means that the point E = 0 cannot be directly identified. The most any learning algorithm can do given the hints H0, H1, ..., HM is to reach a hypothesis g for which all the error measures E0, E1, ..., EM are zero. Indeed, we have required that E = 0 implies that Em = 0 for all m. \n\nIf that point is reached, regardless of how it is reached, the job is done. However, it is seldom the case that we can reach the zero-error point, because either (1) it does not exist (i.e., no hypothesis can satisfy all the hints simultaneously, which implies that no hypothesis can replicate f exactly), or (2) it is difficult to reach (i.e., the computing resources do not allow us to exhaustively search the space of hypotheses looking for that point). In either case, we will have to settle for a point where the Em's are 'as small as possible'. \n\nHow small should each Em be? A balance has to be struck, otherwise some Em's may become very small at the expense of the others. This situation would mean that some hints are over-learned while the others are under-learned. We will discuss learning schedules that use different criteria for balancing between the hints. The schedules are used by the learning algorithm to simultaneously minimize the Em's. Let us start by exploring how simultaneous minimization of a number of quantities is done in general. \n\nPerhaps the most common approach is that of penalty functions (Wismer and Chattergy, 1978). 
In order to minimize E0, E1, ..., EM, we minimize the penalty function \n\nsum_{m=0}^{M} alpha_m Em \n\nwhere each alpha_m is a non-negative number that may be constant (exact penalty function) or variable (sequential penalty function). Any descent technique can be employed to minimize the penalty function once the alpha_m's are selected. The alpha_m's are weights that reflect the relative emphasis or 'importance' of the corresponding Em's. The choice of the weights is usually crucial to the quality of the solution. \n\nEven if the alpha_m's are determined, we still do not have the explicit values of the Em's in our case (recall that Em is the expected value of the error em on an example of the hint). Instead, we will estimate Em by drawing several examples and averaging their error. Suppose that we draw Nm examples of Hm. The estimate for Em would then be \n\n(1/Nm) sum_{n=1}^{Nm} em^(n) \n\nwhere em^(n) is the error on the nth example. Consider a batch of examples consisting of N0 examples of H0, N1 examples of H1, ..., and NM examples of HM. The total error of this batch is \n\nsum_{m=0}^{M} sum_{n=1}^{Nm} em^(n) \n\nIf we take Nm proportional to alpha_m, this total error will be a proportional estimate of the penalty function \n\nsum_{m=0}^{M} alpha_m Em \n\nIn effect, we translated the weights into a schedule, where different hints are emphasized, not by magnifying their error, but by representing them with more examples. \n\nA batch of examples can be either a uniform batch that consists of N examples of one hint at a time, or, more generally, a mixed batch where examples of different hints are allowed within the same batch. If the descent technique is linear and the learning rate is small, a schedule that uses mixed batches is equivalent to a schedule that alternates between uniform batches (with frequency equal to the frequency of examples in the mixed batch). 
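One simple way to translate the weights alpha_m into example counts Nm proportional to alpha_m is sketched below; the rounding scheme is our own choice, since only the proportionality matters:

```python
def batch_sizes(alphas, batch_total):
    """Allocate a batch of batch_total examples across hints, with
    N_m as close as possible to proportional to alpha_m."""
    s = sum(alphas)
    raw = [batch_total * a / s for a in alphas]
    sizes = [int(r) for r in raw]  # floor of each ideal count
    # Hand out the remaining slots to the largest fractional parts.
    remainder = batch_total - sum(sizes)
    order = sorted(range(len(raw)), key=lambda m: raw[m] - sizes[m], reverse=True)
    for m in order[:remainder]:
        sizes[m] += 1
    return sizes

# Equal weight on hints 0 and 2, double weight on hint 1:
assert batch_sizes([1.0, 2.0, 1.0], 8) == [2, 4, 2]
```

Summing the raw (unaveraged) errors over such a batch then estimates the penalty function up to a constant factor.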
If we are using a nonlinear descent technique, it is generally more difficult to ascertain a direct translation from mixed batches to uniform batches, but there may be compelling heuristic correspondences. All schedules discussed here are expressed in terms of uniform batches for simplicity. \n\nThe implementation of a given schedule goes as follows: (1) The algorithm decides which hint (which m for m = 0, 1, ..., M) to work on next, according to some criterion; (2) The algorithm then requests a batch of examples of this hint; (3) It performs its descent on this batch; and (4) When it is done, it goes back to step (1). We make a distinction between fixed schedules, where the criterion for selecting the hint can be 'evaluated' ahead of time (whether time-invariant or time-varying, deterministic or stochastic), and adaptive schedules, where the criterion depends on what happens as the algorithm runs. Here are some fixed and adaptive schedules: \n\nSimple Rotation: This is the simplest possible schedule that tries to balance between the hints. It is a fixed schedule that rotates between H0, H1, ..., HM. Thus, at step k, a batch of N examples of Hm is processed, where m = k mod (M + 1). This simple-minded algorithm tends to do well in situations where the Em's are somewhat similar. \n\nWeighted Rotation: This is the next step in fixed schedules that tries to give different emphasis to different Em's. The schedule rotates between the hints, visiting Hm with frequency vm. The choice of the vm's can achieve balance by emphasizing the hints that are more important or harder to learn. \n\nMaximum Error: This is the simplest adaptive schedule that tries to achieve the same type of balance as simple rotation. At each step k, the algorithm processes the hint with the largest error Em. The algorithm uses estimates of the Em's to make its selection. 
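The three schedules above can be sketched as hint selectors; this is our own minimal illustration of one possible realization (the paper fixes only the visiting frequencies, not the implementation):

```python
import itertools

def simple_rotation(num_hints):
    """Fixed schedule: at step k, process hint m = k mod (M + 1)."""
    return itertools.cycle(range(num_hints))

def weighted_rotation(counts):
    """Fixed schedule visiting hint m counts[m] times per cycle, so that
    hint m is visited with frequency v_m = counts[m] / sum(counts)."""
    cycle = [m for m, c in enumerate(counts) for _ in range(c)]
    return itertools.cycle(cycle)

def maximum_error(estimates):
    """Adaptive schedule: process the hint with the largest estimated E_m."""
    return max(range(len(estimates)), key=lambda m: estimates[m])

sched = simple_rotation(3)
assert [next(sched) for _ in range(5)] == [0, 1, 2, 0, 1]
wsched = weighted_rotation([1, 2])
assert [next(wsched) for _ in range(6)] == [0, 1, 1, 0, 1, 1]
assert maximum_error([0.1, 0.7, 0.3]) == 1
```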
\n\nMaximum Weighted Error: This is the adaptive counterpart to weighted rotation. It selects the hint with the largest value of vm Em. The choice of the vm's can achieve balance by making up for disparities between the numerical ranges of the Em's. Again, the algorithm uses estimates of the Em's. \n\nAdaptive schedules attempt to answer the question: Given a set of values for the Em's, which hint is the most under-learned? The above schedules answer the question by comparing the individual Em's. Although this works well in simple cases, it does not take into consideration the correlation between different hints. As we deal with more and more hints, the correlation between the Em's becomes more significant. This leads us to the final schedule that achieves the balance between the Em's through their relation to the actual error E. \n\nAdaptive Minimization: Given the estimates of E0, E1, ..., EM, make M + 1 estimates of E, each based on all but one of the hints: \n\nE(*, E1, E2, ..., EM) \nE(E0, *, E2, ..., EM) \nE(E0, E1, *, ..., EM) \n... \nE(E0, E1, E2, ..., *) \n\nand choose the hint for which the corresponding estimate is the smallest. \n\nIn other words, E becomes the common thread between the Em's. Knowing that we are really trying to minimize E, and that the Em's are merely a vehicle to this end, the criterion for balancing the Em's should be based on what is happening to E as far as we can tell. \n\nCONCLUSION \n\nThis paper developed a systematic method for using different hints as input to the learning process, generalizing the case of invariance hints (Abu-Mostafa, 1990). The method treats all hints on equal footing, including the examples of the function. Hints are represented in a canonical way that is compatible with the common learning-from-examples paradigm. 
No restrictions are made on the learning model or the descent technique to be used. \n\nThe hints are captured by the error measures E0, E1, ..., EM, and the learning algorithm attempts to simultaneously minimize these quantities. The simultaneous minimization of the Em's gives rise to the idea of balancing between the different hints. A number of algorithms that minimize the Em's while maintaining this balance were discussed in the paper. Adaptive schedules in particular are worth noting because they automatically compensate against many artifacts of the learning process. \n\nIt is worthwhile to distinguish between the quality of the hints and the quality of the learning algorithm that uses these hints. The quality of the hints is determined by how reliably one can predict that the actual error E will be close to zero for a given hypothesis based on the fact that E0, E1, ..., EM are close to zero for that hypothesis. The quality of the algorithm is determined by how likely it is that the Em's will become nearly as small as they can be within a reasonable time. \n\nAcknowledgements \n\nThe author would like to thank Ms. Zehra Kok for her valuable input. This work was supported by the AFOSR under grant number F49620-92-J-0398. \n\nReferences \n\nAbu-Mostafa, Y. S. (1990), Learning from hints in neural networks, Journal of Complexity 6, 192-198. \n\nAl-Mashouq, K. and Reed, I. (1991), Including hints in training neural networks, Neural Computation 3, 418-427. \n\nMinsky, M. L. and Papert, S. A. (1969), \"Perceptrons,\" MIT Press. \n\nOmlin, C. and Giles, C. L. (1992), Training second-order recurrent neural networks using hints, Machine Learning: Proceedings of the Ninth International Conference (ML-92), D. Sleeman and P. Edwards (ed.), Morgan Kaufmann. \n\nSuddarth, S. and Holden, A. 
(1991), Symbolic neural systems and the use of hints for developing complex systems, International Journal of Man-Machine Studies 35, p. 291. \n\nWismer, D. A. and Chattergy, R. (1978), \"Introduction to Nonlinear Optimization,\" North Holland. \n", "award": [], "sourceid": 616, "authors": [{"given_name": "Yaser", "family_name": "Abu-Mostafa", "institution": null}]}