{"title": "Training Connectionist Networks with Queries and Selective Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 566, "page_last": 573, "abstract": null, "full_text": "566 \n\nAtlas, Cohn and Ladner \n\nTraining Connectionist Networks with \n\nQueries and Selective Sampling \n\nLes Atlas \nDept. of E.E. \n\nDavid Cohn \n\nDept. of C.S. & E. \n\nRichard Ladner \nDept. of C.S. & E. \n\nM.A. El-Sharkawi, R.J. Marks II, M.E. Aggoune, and D.C. Park \n\nDept. of E.E. \n\nUniversity of Washington, Seattle, WA 98195 \n\nABSTRACT \n\n\"Selective sampling\" is a form of directed search that can greatly \nincrease the ability of a connectionist network to generalize accu(cid:173)\nrately. Based on information from previous batches of samples, a \nnetwork may be trained on data selectively sampled from regions \nin the domain that are unknown. This is realizable in cases when \nthe distribution is known, or when the cost of drawing points from \nthe target distribution is negligible compared to the cost of label(cid:173)\ning them with the proper classification. The approach is justified \nby its applicability to the problem of training a network for power \nsystem security analysis. The benefits of selective sampling are \nstudied analytically, and the results are confirmed experimentally. \n\nIntroduction: Random Sampling vs. Directed Search \n\n1 \nA great deal of attention has been applied to the problem of generalization based \non random samples drawn from a distribution, frequently referred to as \"learning \nfrom examples.\" Many natural learning learning systems however, do not simply \nrely on this passive learning technique, but instead make use of at least some form \nof directed search to actively examine the problem domain. In many problems, \ndirected search is provably more powerful than passively learning from randomly \ngiven examples. 
\n\n\fTraining Connectionist Networks with Queries and Selective Sampling \n\n567 \n\nTypically, directed search consists of membership queries, where the learner asks for \nthe classification of specific points in the domain. Directed search via membership \nqueries may proceed simply by examining the information already given and deter(cid:173)\nmining a region of uncertainty, the area in the domain where the learner believes \nmis-classification is still possible. The learner then asks for examples exclusively \nfrom that region. \n\nThis paper discusses one form of directed search: selective sampling. In Section 2, \nwe describe theoretical foundations of directed search and give a formal definition \nof selective sampling. In Section 3 we describe a neural network implementation \nof this technique, and we discuss the resulting improvements in generalization on a \nnumber of tasks in Section 4. \n\n2 Learning and Selective Sampling \nFor some arbitrary domain learning theory defines a concept as being some subset of \npoints in the domain. For example, if our domain is ~2, we might define a concept \nas being all points inside a region bounded by some particular rectangle. \nA concept class is simply the set of concepts in some description language. \n\nA concept class of particular interest for this paper is that defined by neural network \narchitectures with a single output node. Architecture refers to the number and types \nof units in a network and their connectivity. The configuration of a network specifies \nthe weights on the connections and the thresholds of the units 1 . \n\nA single-output architecture plus configuration can be seen as a specification of \na concept classifier in that it classifies the set of all points producing a network \noutput above some threshold value. Similarly, an architecture may be seen as a \nspecification of a concept class. 
It consists of all concepts classified by configurations of the network that the learning rule can produce (figure 1). \n\nFigure 1: A network architecture as a concept class specification \n\n2.1 Generalization and formal learning theory \n\nAn instance, or training example, is a pair (x, f(x)) consisting of a point x in the domain, usually drawn from some distribution P, along with its classification according to some target concept f. A concept c is consistent with an instance (x, f(x)) if c(x) = f(x), that is, if the concept produces the same classification of point x as the target. The error(c, f, P) of a concept c, with respect to a target concept f and a distribution P, is the probability that c and f will disagree on a random sample drawn from P. \n\n^1 For the purposes of this discussion, a neural network will be considered to be a feedforward network of neuron-like components that compute a weighted sum of their inputs and modify that sum with a sigmoidal transfer function. The methods described, however, should be equally applicable to other, more general classifiers as well. \n\nThe generalization problem is posed by formal learning theory as: for a given concept class C, an unknown target f, and an arbitrary error rate ε, how many samples do we have to draw from an arbitrary distribution P in order to find a concept c ∈ C such that error(c, f, P) < ε with high confidence? This problem has been studied for neural networks in (Baum and Haussler, 1989) and (Haussler, 1989). \n\n2.2 R(s^m), the region of uncertainty \n\nIf we consider a concept class C and a set s^m of m instances, the classification of some regions of the domain may be implicitly determined; all concepts in C that are consistent with all of the instances may agree in these parts. 
What we are interested in here is what we define to be the region of uncertainty: \n\nR(s^m) = {x : ∃ c1, c2 ∈ C, c1 and c2 are consistent with all s ∈ s^m, and c1(x) ≠ c2(x)}. \n\nFor an arbitrary distribution P, we can define a measure on the size of this region as α = Pr[x ∈ R(s^m)]. In an incremental learning procedure, as we classify and train on more points, α will be monotonically non-increasing. A point that falls outside R(s^m) will leave it unchanged; a point inside will further restrict the region. Thus, α is the probability that a new, random point from P will reduce our uncertainty. A key point is that since R(s^m) serves as an envelope for consistent concepts, it also bounds the potential error of any consistent hypothesis we choose. If the error of our current hypothesis is ε, then ε < α. Since we have no basis for changing our current hypothesis without a contradicting point, ε is also the probability of an additional point reducing our error. \n\n2.3 Selective sampling is a directed search \n\nConsider the case when the cost of drawing a point from our distribution is small compared to the cost of finding the point's proper classification. Then, after training on n instances, if we have some inexpensive method of testing for membership in R(s^n), we can \"filter\" points drawn from our distribution, selecting, classifying and training on only those that show promise of improving our representation. \n\nMathematically, we can approximate this filtering by defining a new distribution P' that is zero outside R(s^n), but maintains the relative distribution of P. Since the next sample from P' would be guaranteed to land inside the region, it would have, with high confidence, the effect of at least 1/α samples drawn from P. \n\nThe filtering process can be applied iteratively. Start out with the distribution P_{0,n} = P. 
Inductively, train on n samples chosen from P_{i,n} to obtain a new region of uncertainty, R(s^{(i+1)n}), and define from it P_{i+1,n} = P'_{i,n}. The total number of training points used to calculate P_{i,n} is m = in. \n\nSelective sampling can be contrasted with random sampling in terms of efficiency. In random sampling, we can view training as a single, non-selective pass where n = m. As the region of uncertainty shrinks, so does the probability that any given additional sample will help. The efficiency of the samples decreases with the error. \n\nBy filtering out useless samples before committing resources to them, as we can do in selective sampling, the efficiency of the samples we do classify remains high. In the limit where n = 1, this regimen has the effect of querying: each sample is taken from a region based on the cumulative information from all previous samples, and each one will reduce the size of R(s^m). \n\n3 Training Networks with Selective Sampling \n\nA leading concern in connectionist research is how to achieve good generalization with a limited number of samples. This suggests that selective sampling, properly implemented, should be a useful tool for training neural networks. \n\n3.1 A naive neural network querying algorithm \n\nSince neural networks with real-valued outputs are generally trained to within some tolerance (say, less than 0.1 for a zero and greater than 0.9 for a one), one is tempted to use the part of the domain between these limits as R(s^m) (figure 2). \n\nFigure 2: The region of uncertainty captured by a naive neural network \n\nThe problem with applying this naive approach to neural networks is that when training, a network tends to become \"overly confident\" in regions that are still unknown. 
The R(s^m) chosen by this method will in general be a very small subset of the true region of uncertainty. \n\n3.2 Version-space search and neural networks \n\nMitchell (1978) describes a learning procedure based on the partial ordering in generality of the concepts being learned. One maintains two sets of plausible hypotheses: S and G. S contains all \"most specific\" concepts consistent with present information, and G contains all consistent \"most general\" concepts. The \"version space,\" which is the set of all plausible concepts in the class being considered, lies between these two bounding sets. Directed search proceeds by examining instances that fall in the difference of S and G. Specifically, the search region for a version-space search is equal to ∪{s Δ g : s ∈ S, g ∈ G}. If an instance in this region proves positive, then some s in S will have to generalize to accommodate the new information; if it proves negative, some g in G will have to be modified to exclude it. In either case, the version space, the space of plausible hypotheses, is reduced with every query. \n\nThis search region is exactly the R(s^m) that we are attempting to capture. Since S and G consist of the most specific and most general concepts in the class we are considering, their analogues are the most specific and most general networks consistent with the known data. \n\nThis search may be roughly implemented by training two networks in parallel. One network, which we will label N_S, is trained on the known examples as well as given a large number of random \"background\" patterns, which it is trained to classify as negative. The global minimum error for N_S is achieved when it classifies all positive training examples as positive and as much else as possible as negative. The result is a \"most specific\" configuration consistent with the training examples. 
\nSimilarly, N G is trained on the known examples and a large number of random \nbackground examples which it is to classify as positive. Its global minimum error is \nachieved when it classifies all negative training examples as negative and as much \nelse possible as positive. \nAssuming our networks Ns and NG converge to near-global minima, we can now de(cid:173)\nfine a region 'R.,t:.g, the symmetric difference of the outputs of Ns and NG. Because \nNs and NG lie near opposite extremes of'R.(sm), we have captured a well-defined \nregion of uncertainty to search (figure 3). \n\n3.3 Limitations of the technique \n\nThe neural network version-space technique is not without problems in general \napplication to directed search. One limitation of this implementation of version \n\n1nput \n\noutput \n\nFigure 3: 'R.,t:.g contains the difference between decision regions of N sand N G as \nwell as their own regions of uncertainty. \n\n\fTraining Connectionist Networks with Queries and Selective Sampling \n\n571 \n\nspace search is that a version space is bounded by a set of most general and most \nspecific concepts, while an S-G network maintains only one most general and most \nspecific network. As a result, n6~g will contain only a subset of the true n(sm). \nThis limitation is softened by the global minimizing tendency of the networks. As \nnew examples are added and the current N s (or N G) is forced to a more general \n(or specific) configuration, the network will relax to another, now more specific (or \ngeneral) configuration. The effect is that of a traversal of concepts in Sand G. If \nthe number of samples in each pass is kept sufficiently small, all \"most general\" and \nmost specific\" concepts in n(sm) may be examined without excessive sampling on \none particular configuration. 
\n\nThere is a remaining difficulty inherent in version-space search itself: Haussler \n(1987) points out that even in some very simple cases, the size of Sand G may \ngrow exponentially in the number of examples. \n\nA limitation inherent to neural networks is the necessary assumption that the net(cid:173)\nworks N sand N G will in fact converge to global minima, and that they will do so \nin a reasonable amount of time. This is not always a valid assumption; it has been \nshown that in (Blum and Rivest, 1989) and (Judd, 1988) that the network loading \nproblem is NP-complete, and that finding a global minimum may therefore take an \nexponential amount of time. \nThis concern is ameliorated by the fact that if the number of samples in each pass is \nkept small, the failure of one network to converge will only result in a small number \nof samples being drawn from a less useful area, but will not cause a large-scale \nfailure of the technique. \n\n4 Experimental Results \nExperiments were run on three types of problems: learning a simple square-shaped \nregion in ~2, learning a 25-bit majority function, and recognizing the secure region \nof a small power system. \n\n4.1 The square learner \n\nA two-input network with one hidden layer of 8 units was trained on a distribution \nof samples that were positive inside a square-shaped region at the center of the \ndomain and negative elsewhere. This task was chosen because of its intuitive visual \nappeal (figure 4). \nThe results of training an S-G network provide support for the method. As can be \nseen in the accompanying plots, the Ns plots a tight contour around the positive \ninstances, while NG stretches widely around the negative ones. \n\n4.2 Majority function \n\nSimulations training on a 25-bit majority function were run using selective sampling \nin 2, 3, 4 and 20 passes, as well as baseline simulations using random sampling for \nerror comparIson. 
\n\n\f572 \n\nAtlas, Cohn and Ladner \n\nFigure 4: Learning a square by selective sampling \n\nIn all cases, there was a significant improvement of the selective sampling passes \nover the random sampling ones (figure 5). The randomly sampled passes exhibited a \nroughly logarithmic generalization curve, as expected following Blumer et al (1988). \n\nThe selectively sampled passes, however, exhibited a steeper, more exponential drop \nin the generalization error, as would be expected from a directed search method. \nFurthermore, the error seemed to decrease as the sampling process was broken up \ninto smaller, more frequent passes, pointing at an increased efficiency of sampling \nas new information was incorporated earlier into the sampling process. \n\n0.5 \n~ 0.4 \n5 \nc \n.~ \n~ 0.2 \n... \n~ c \n~ \n0 \n\n0.3 \n\nN \n\n0.1 \n\n______ random sampling \n__ selective sampling \n\n(20 passes) \n\n... -..... ....... -\n\n0 \n0 \n\n50 \n\n100 \n\n150 \n\n200 \n\nNumber of training samples \n\n100 \n\nIS \n5 \n5 \n\u00b7a 10.1 \ni \n13 \nc \n~ \n0 \n\n10-2 \n\n0 \n\n50 \n\n100 \n\n150 \n\n200 \n\nNumber of training samples \n\nFigure 5: Error rates for random vs. selective sampling \n\n4.3 Power system security analysis \n\nIf various load parameters of a power system are within a certain range, the system \nis secure. Otherwise it risks thermal overload and brown-out. Previous research \n(Aggoune et aI, 1989) determined that this problem was amenable to neural network \nlearning, but that random sampling of the problem domain was inefficient in terms \nof samples needed. The fact that arbitrary points in the domain may be analyzed for \nstability makes the problem well-suited to learning by means of selective sampling. \n\nA baseline case was tested using 3000 data points representing power system con(cid:173)\nfigurations and compared with a two-pass, selectively-sampled data set. 
The latter \nwas trained on an initial 1500 points, then on a second 1500 derived from a S-G \nnetwork as described in the previous section. The error for the baseline case was \n0.86% while that of the selectively sampled case was 0.56%. \n\n\fTraining Connectionist Networks with Queries and Selective Sampling \n\n573 \n\n5 Discussion \nIn this paper we have presented a theory of selective sampling, described a connec(cid:173)\ntionist implementation of the theory, and examined the performance of the resulting \nsystem in several domains. \n\nThe implementation presented, the S-G network, is notable in that, even though \nit is an imperfect implementation of the theory, it marks a sharp departure from \nthe standard method of training neural networks. Here, the network itself decides \nwhat samples are worth considering and training on. The results appear to give \nnear-exponential improvements over standard techniques. \n\nThe task of active learning is an important one; in the natural world much learning \nis directed at least somewhat by the learner. We feel that this theory and these \nexperiments are just initial forays into the promising area of self-training networks. \n\nAcknowledgements \n\nThis work was supported by the National Science Foundation, the Washington \nTechnology Center, and the IBM Corporation. Part of this work was done while D. \nCohn was at IBM T.J. Watson Research Center, Yorktown Heights, NY 10598. \n\nReferences \n\nM. Aggoune, L. Atlas, D. Cohn, M. Damborg, M. EI-Sharkawi, and R. Marks II. Ar(cid:173)\ntificial neural networks for power system static security assessment. In Proceedings, \nInternational Symposium on Circuits and Systems, 1989. \n\nEric Baum and David Haussler. What size net gives valid generalization? In Neural \nInformation Processing Systems, Morgan Kaufmann 1989. \n\nAnselm Blumer, Andrej Ehrenfeucht, David Haussler, and Manfred Warmuth. Learn(cid:173)\nability and the Vapnik-Chervonenkis dimension. 
UCSC Tech Report UCSC-CRL-87-20, October 1988. \n\nAvrim Blum and Ronald Rivest. Training a 3-node neural network is NP-complete. In Neural Information Processing Systems, Morgan Kaufmann, 1989. \n\nDavid Haussler. Learning conjunctive concepts in structural domains. In Proceedings, AAAI '87, pages 466-470, 1987. \n\nDavid Haussler. Generalizing the PAC model for neural nets and other learning applications. UCSC Tech Report UCSC-CRL-89-30, September 1989. \n\nStephen Judd. On the complexity of loading shallow neural networks. Journal of Complexity, 4:177-192, 1988. \n\nTom Mitchell. Version spaces: an approach to concept learning. Tech Report CS-78-711, Dept. of Computer Science, Stanford Univ., 1978. \n\nLeslie Valiant. A theory of the learnable. Communications of the ACM, 27:1134-1142, 1984. \n", "award": [], "sourceid": 261, "authors": [{"given_name": "Les", "family_name": "Atlas", "institution": null}, {"given_name": "David", "family_name": "Cohn", "institution": null}, {"given_name": "Richard", "family_name": "Ladner", "institution": null}]}