{"title": "Learning Time-varying Concepts", "book": "Advances in Neural Information Processing Systems", "page_first": 183, "page_last": 189, "abstract": null, "full_text": "Learning Time-varying Concepts \n\nAnthony Kuh \nDept. of Electrical Eng. \nU. of Hawaii at Manoa \nHonolulu, HI 96822 \nkuh@wiliki.eng.hawaii.edu \n\nThomas Petsche \n\nSiemens Corp. Research \n755 College Road East \nPrinceton, NJ 08540 \n\npetsche\u00ae learning. siemens.com \n\nRonald L. Rivest \nLab. for Computer Sci. \nMIT \nCambridge, MA 02139 \nrivest@theory.lcs.mit.edu \n\nAbstract \n\nThis work extends computational learning theory to situations in which concepts \nvary over time, e.g., system identification of a time-varying plant. We have \nextended formal definitions of concepts and learning to provide a framework \nin which an algorithm can track a concept as it evolves over time. Given \nthis framework and focusing on memory-based algorithms, we have derived \nsome PAC-style sample complexity results that determine, for example, when \ntracking is feasible. We have also used a similar framework and focused on \nincremental tracking algorithms for which we have derived some bounds on \nthe mistake or error rates for some specific concept classes. \n\n1 INTRODUCTION \n\nThe goal of our ongoing research is to extend computational learning theory to include \nconcepts that can change or evolve over time. For example, face recognition is complicat(cid:173)\ned by the fact that a persons face changes slowly with age and more quickly with changes \nin make up, hairstyle, or facial hair. Speech recognition is complicated by the fact that \na speakers voice may change over time due to fatigue, illness, stress, or background \nnoise (Galletti and Abbott, 1989). \n\nTime varying systems often appear in adaptive control or signal processing applications. 
\nFor example, adaptive equalizers adjust the receiver and transmitter to compensate for \nchanges in the noise on a transmission channel (Lucky et at, 1968). The kinematics of \na robot arm can change when it picks up a heavy load or when the motors and drive \ntrain responses change due to wear. The output of a sensor may drift over time as the \ncomponents age or as the temperature changes. \n\n183 \n\n\f184 \n\nKuh, Petsche, and Rivest \n\nComputational learning theory as introduced by Valiant (1984) can make some useful \nstatements about whether a given class of concepts can be learned and provide proba(cid:173)\nbilistic bounds on the number of examples needed to learn a concept. Haussler, et al. \n(1987), and Littlestone (1989) have also shown that it is possible to bound the number of \nmistakes that a learner will make. However, while these analyses allow the concept to be \nchosen arbitrarily, that concept must remain fixed for all time. Littlestone and Warmuth \n(1989) considered concepts that may drift, but in the context of a different accuracy \nmeasure than we use. Our research seeks explore further modifications to existing theory \nto allow the analysis of performance when learning time-varying concept. \n\nIn the following, we describe two approaches we are exploring. Section 3 describes \nan extension of the PAC-model to include time-varying concepts and shows how this \nnew model applies to algorithms that base their hypotheses on a set of stored examples. \nSection 4 described how we can bound the mistake rate of an algorithm that updates its \nestimate based on the most recent example. In Section 2 we define some notation and \nterminology that is used in the remainder of the based. \n\n2 NOTATION & TERMINOLOGY \n\nFor a dichotomy that labels each instance as a positive or negative example of a concept, \nwe can formally describe the model as follows. 
Each instance x_j is drawn randomly, according to an arbitrary fixed probability distribution, from an instance space X. The concept c to be learned is drawn randomly, according to an arbitrary fixed probability distribution, from a concept class C. Associated with each instance is a label a_j = c(x_j) such that a_j = 1 if x_j is a positive example and a_j = 0 otherwise. The learner is presented with a sequence of examples (each example is a pair (x_j, a_j)) chosen randomly from X. The learner must form an estimate, ĉ, of c based on these examples.

In the time-varying case, we assume that there is an adversary who can change c over time, so we change notation slightly. The instance x_t is presented at time t. The concept c_t is active at time t if the adversary is using c_t to label instances at that time. The sequence of t active concepts, C_t = {c_1, ..., c_t}, is called a concept sequence of length t. The algorithm's task is to form an estimate Ĉ_t of the actual concept sequence C_t, i.e., at each time t, the tracker must use the sequence of randomly chosen examples to form an estimate ĉ_t of c_t. A set of length-t concept sequences is denoted by C(t), and we call a set of infinite-length concept sequences a concept sequence space and denote it by C. Since the adversary, if allowed to make arbitrary changes, can easily make the tracker's task impossible, it is usually restricted such that only small or infrequent changes are allowed. In other words, each C(t) is a small subset of C^t.

We consider two different types of \"tracking\" (learning) algorithms, memory-based (or batch) and incremental (or on-line). We analyze the sample complexity of batch algorithms and the mistake (or error) rate of incremental algorithms.

In the usual case where concepts are time-invariant, batch learning algorithms operate in two distinct phases. During the first phase, the algorithm collects a set of training examples.
Given this set, it then computes a hypothesis. In the second phase, this hypothesis is used to classify all future instances. The hypothesis is never again updated. In Section 3 we consider memory-based algorithms derived from batch algorithms.

When concepts are time-invariant, an on-line learning algorithm is one which constantly modifies its hypothesis. On each iteration, the learner (1) receives an instance; (2) predicts a label based on the current hypothesis; (3) receives the correct label; and (4) uses the correct label to update the hypothesis. In Section 4, we consider incremental algorithms based on on-line algorithms.

When studying learnability, it is helpful to define the Vapnik-Chervonenkis (VC) dimension (Vapnik and Chervonenkis, 1971) of a concept class: VCdim(C) is the cardinality of the largest set such that every possible labeling scheme is achieved by some concept in C. Blumer et al. (1989) showed that a concept class is learnable if and only if the VC-dimension is finite and derived an upper bound (that depends on the VC-dimension) for the number of examples needed to PAC-learn a learnable concept class.

3 MEMORY-BASED TRACKING

In this section, we will consider memory-based trackers which base their current hypothesis on a stored set of examples. We build on the definition of PAC-learning to define what it means to PAC-track a concept sequence. Our main result here is a lower bound on the maximum rate of change that can be PAC-tracked by a memory-based learner.

A memory-based tracker consists of (a) a function w(ε, δ); and (b) an algorithm L that produces the current hypothesis ĉ_t using the most recent w(ε, δ) examples. The memory-based tracker thus maintains a sliding window on the examples that includes the most recent w(ε, δ) examples. We do not require that L run in polynomial time.
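As an illustration of the sliding-window idea, here is a minimal sketch; it is not the paper's abstract algorithm L, but a hypothetical consistency learner for the closed-interval class C = {[a, b]} discussed later in this section, with the hypothesis recomputed from the most recent window of examples:

```python
from collections import deque

class MemoryBasedTracker:
    # Sliding-window tracker: the hypothesis is always recomputed
    # from the most recent window_size examples, where window_size
    # plays the role of w(eps, delta).
    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)

    def update(self, x, label):
        self.window.append((x, label))

    def hypothesis(self):
        # Smallest closed interval consistent with the windowed positives,
        # a consistency learner for C = {[a, b] : 0 <= a, b <= 1}.
        pos = [x for x, label in self.window if label == 1]
        if not pos:
            return None  # no positive evidence yet
        return (min(pos), max(pos))

    def predict(self, x):
        h = self.hypothesis()
        return 0 if h is None else int(h[0] <= x <= h[1])

tracker = MemoryBasedTracker(window_size=50)
for x in (0.2, 0.4, 0.6):       # positives from an active concept [0.2, 0.6]
    tracker.update(x, 1)
tracker.update(0.9, 0)          # a negative example stays outside the estimate
print(tracker.hypothesis())     # -> (0.2, 0.6)
```

Because stale examples fall out of the deque, an estimate based on a superseded concept is forgotten after at most window_size new examples, which is the mechanism the sample complexity analysis below relies on.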
\nFollowing the work of Valiant (1984) we say that an algorithm A PAC -tracks a concept \nsequence space C' ~ C if, for any c E C', any distribution D on X, any E,8 > 0, and \naccess to examples randomly selected from X according to D and labeled at time t by \nconcept Ct; for all t sufficiently large, with t' chosen unifonnly at random between 1 and \nt, it is true that \n\nPr(d(ctl\n\n, Ct l\n\n) ~ E) ~ 1 - 8. \n\nThe probability includes any randomization algorithm A may use as well as the random \nselection of t' and the random selection of examples according to the distribution D, \nand where d(c,c') = D(x : c(x) # c'(x)) is the probability that c and c' disagree on a \nrandomly chosen example. \n\nLearnability results often focus on learners that see only positive examples. For many \nconcept classes this is sufficient, but for others negative examples are also necessary. \nNatarajan (1987) showed that a concept class that is PAC-learnable can be learned using \nonly positive examples if the class is closed under intersection. \n\nWith this in mind, let's focus on a memory-based tracker that modifies its estimate \nusing only positive examples. Since PAC-tracking requires that A be able to PAC-learn \nindividual concepts, it must be true that A can PAC-track a sequence of concepts only if \nthe concept class is closed under intersection. However, this is not sufficient. \nObservation 1. Assume C is closed under intersection. If positive examples are drawn \nfrom CI E C prior to time to, and from C2 E C, CI ~ C2. after time to. then there exists an \nestimate of C2 that is consistent with all examples drawn from CI. \n\nThe proof of this is straightforward once we realize that if CI ~ C2, then all positive \n\n\f186 \n\nKuh, Petsche, and Rivest \n\nexamples drawn prior to time to from CI are consistent with C2. 
The problem is therefore equivalent to first choosing a set of examples from a subset of c_2 and then choosing more examples from all of c_2; this skews the probability distribution, but any estimate of c_2 will include all examples drawn from c_1.

Consider the set of closed intervals on [0, 1], C = {[a, b] | 0 ≤ a, b ≤ 1}. Assume that, for some c > b, c_t = c_1 = [a, b] for all t ≤ t_0 and c_t = c_2 = [a, c] for all t > t_0. All the examples drawn prior to t_0, {x_t : t < t_0}, are consistent with c_2, and it would be nice to use these examples to help estimate c_2. How much can these examples help?

Theorem 1. Assume C is closed under intersection and VCdim(C) is finite; c_2 ∈ C; and A has PAC-learned c_1 ∈ C at time t_0. Then, for some d such that VCdim(c_2) ≤ d ≤ VCdim(C), the maximum number of examples drawn after time t_0 required so that A can PAC-learn c_2 ∈ C is upper bounded by m(ε, δ) = max((4/ε) log_2(2/δ), (8d/ε) log_2(13/ε)).

In other words, if there is no prior information about c_2, then the number of examples required depends on VCdim(C). However, the examples drawn from c_1 can be used to shrink the concept space towards c_2. For example, when c_1 = [a, b] and c_2 = [a, c], in the limit where c_2 = c_1, the problem of learning c_2 reduces to learning a one-sided interval, which has VC-dimension 1 versus 2 for the two-sided interval. Since it is unlikely that c_2 = c_1, it will usually be the case that d > VCdim(c_2).

In order to PAC-track c, most of the time A must have m(ε, δ) examples consistent with the current concept. This implies that w(ε, δ) must be at least m(ε, δ). Further, since the concepts are changing, the consistent examples will be the most recent. Using a sliding window of size m(ε, δ), the tracker will have an estimate that is based on examples that are consistent with the active concept after collecting no more than m(ε, δ) examples after a change.
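Reading the bound of Theorem 1 with the usual Blumer et al. constants (an assumption on our part, since the constants were garbled in scanning), the required window size can be computed directly; note how shrinking the effective dimension d shrinks the window:

```python
import math

def m(eps, delta, d):
    # PAC sample bound max((4/eps) log2(2/delta), (8 d/eps) log2(13/eps)),
    # where d is the effective VC-dimension of the remaining uncertainty.
    return max((4 / eps) * math.log2(2 / delta),
               (8 * d / eps) * math.log2(13 / eps))

# d = 2: two-sided interval; d = 1: one-sided interval after
# reusing examples from the previous concept c_1.
print(math.ceil(m(0.1, 0.05, d=2)))
print(math.ceil(m(0.1, 0.05, d=1)))
```

The second window is roughly half the first, which is the quantitative payoff of Observation 1: old examples cannot be trusted as labels for c_2, but they can still halve the hypothesis space that remains to be searched.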
\n\nIn much of our analysis of memory-based trackers, we have focused on a concept se(cid:173)\nquence space C,\\ which is the set of all concept sequences such that, on average, each \nconcept is active for at least 1/), time steps before a change occurs. That is, if N ( c, t) is \nthe number of changes in the firstt time steps of c, C,\\ = {c : lim sUPC-400 N (c, t) /t < ).}. \nThe question then is, for what values of ). does there exist a PAC-tracker? \nTheorem 2. Let.c be a memory-based tracker with W(E, 8) = m(E,8/2) which draws \ninstances labeled according to some concept sequence c E C,\\ with each Ct E C and \nVCdim(C) < 00. For any E > 0 and 8> 0, A can UPAC track C if). < !m(E, 8/2). \nThis theorem provides a lower bound on the maximum rate of change that can be tracked \nby a batch tracker. Theorem 1 implies that a memory-based tracker can use examples \nfrom a previous concept to help estimate the active concept. The proof of theorem 2 \nassumes that some of the most recent m(E, 8) examples are not consistent with Ct until \nm (E, 8) examples from the active concept have been gathered. An algorithm that removes \ninconsistent examples more intelligently, e.g., by using conflicts between examples or \ninformation about allowable changes, will be able to track concept sequence spaces that \nchange more rapidly. \n\n\fLearning Time-varying Concepts \n\n187 \n\n4 INCREMENTAL TRACKING \n\nIncremental tracking is similar to the on-line learning case, but now we assume that there \nis an adversary who can change the concept such that Ct+l =fi Ct. At each iteration: \n\n1. the adversary chooses the active concept Ct; \n2. the tracker is given an unlabeled instance, Xt; \n3. the tracker predicts a label using the current hypothesis: at = Ct-l (Xt); \n4. the tracker is given the correct label at; \n5. the tracker forms a new hypothesis: ct = .c(Ct-l, (Xt,at}). 
\n\nWe have defined a number of different types of trackers and adversaries: A prudent \ntracker predicts that at = 1 if and only if Ct (Xt) = 1. A conservative tracker changes \nits hypothesis only if at =fi at. A benign adversary changes the concept in a way that \nis independent of the tracker's hypothesis while a malicious adversary uses information \nabout the tracker and its hypothesis to choose a Ct+l to cause an increase in the error \nrate. The most malicious adversary chooses Ct+l to cause the largest possible increase in \nerror rate on average. \n\nWe distinguish between the error of the hypothesis formed in step 5 above and a mistake \nmade in step 3 above. The instantaneous error rate of an hypothesis is et = d (Ct, ct ). \nIt is the probability that another randomly chosen instance labeled according to Ct will \nbe misclassified by the updated hypothesis. A mistake is a mislabeled instance, and we \ndefine a mistake indicator function Mt = 1 if Ct (Xt) =fi Ct-l (Xt). \nWe define the average error rate Ct = t L:~=l et and the asymptotic error rate is c = \nlim inft-+co Ct. The average mistake rate is the average value of the mistake indicator \nfunction, J.Lt = t L:~=l Mto and the asymptotic mistake rate is J.L = lim inft -+ co J.Lt\u00b7 \nWe are modeling the incremental tracking problems as a Markov process. Each state \nof the Markov process is labeled by a triple (c, C, a), and corresponds to an iteration in \nwhich C is the active concept, C is the active hypothesis, and a is the set of changes the \nadversary is allowed to make given c. We are still in the process of analyzing a general \nmodel, so the following presents one of the special cases we have examined. \n\nLet X be the set of all points on the unit circle. We use polar coordinates so that \nsince the radius is fixed we can label each point by an angle B, thus X = [0, 27r). \nNote that X is periodic. 
The concept class C is the set of all arcs of fixed length π radians, i.e., all semicircles that lie on the unit circle. Each c ∈ C can be written as c = [π(2θ − 1) mod 2π, 2πθ), where θ ∈ [0, 1). We assume that the instances are chosen uniformly from the circle.

The adversary may change the concept by rotating it around the circle; however, the maximum rotation is bounded such that, given c_t, c_{t+1} must satisfy d(c_{t+1}, c_t) ≤ γ. For the uniform case, this is equivalent to restricting θ_{t+1} = θ_t ± β mod 1, where 0 ≤ β ≤ γ/2.

The tracker is required to be conservative, but since we are satisfied to lower bound the error rate, we assume that every time the tracker makes a mistake, it is told the correct concept. Thus, ĉ_t = ĉ_{t−1} if no mistake is made, but ĉ_t = c_t whenever a mistake is made.

The worst-case or most malicious adversary for a conservative tracker always tries to maximize the tracker's error rate. Therefore, whenever the tracker deduces c_t (i.e., whenever the tracker makes a mistake), the adversary picks a direction by flipping a fair coin. The adversary then rotates the concept in that direction as far as possible on each iteration. Then we can define a random direction function s_t and write

s_t = +1, w.p. 1/2, if ĉ_{t−1} = c_{t−1};
s_t = −1, w.p. 1/2, if ĉ_{t−1} = c_{t−1};
s_t = s_{t−1}, if ĉ_{t−1} ≠ c_{t−1}.

Then the adversary chooses the new concept to be θ_t = θ_{t−1} + s_t γ/2.

Since the adversary always rotates the concept by γ/2, there are 2/γ distinct concepts that can occur. However, when θ(t + 1/γ) = θ(t) + 1/2 mod 1, the semicircles do not overlap and therefore, after at most 1/γ changes, a mistake will be made with probability one.
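The mistake process of this rotating-semicircle game reduces to a Markov chain over the offset between the concept and the tracker's last reset, as the analysis below shows. As a numerical check, the following sketch solves the chain's balance equations under the transition probabilities used in that analysis and compares the stationary mistake probability with sqrt(2γ/π); the function name and step sizes are our own:

```python
import math

def mistake_rate(gamma):
    # Offset chain with k = 1/gamma states s_0 .. s_{k-1}:
    # from s_i, advance to s_{i+1} w.p. 1 - (i+1)*gamma,
    # reset to s_0 (a mistake) w.p. (i+1)*gamma.
    k = round(1 / gamma)
    # Balance equations give pi_{i+1} = pi_i * (1 - (i+1)*gamma).
    pi = [1.0]
    for i in range(k - 1):
        pi.append(pi[-1] * max(0.0, 1.0 - (i + 1) * gamma))
    total = sum(pi)
    # Every transition into s_0 is a mistake, so in stationarity the
    # mistake rate equals the stationary probability of s_0.
    return pi[0] / total

gamma = 1e-4
print(mistake_rate(gamma), math.sqrt(2 * gamma / math.pi))
```

For small γ the two printed values agree to within about one percent, matching the closed-form approximation derived below.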
\nBecause at most 1/1 consecutive changes can be made before the mistake rate returns \n(}~, and because of \nto zero, because the probability of a mistake depends only on (}t -\ninherent symmetries, this system can be modeled by a Markov chain with k = 1/1 states. \nEach state Si corresponds to the case I(}t - Ot I = i I mod 1. The probability of a transition \nfrom state Si to state Si+l is P(si+1lsi) = 1 - (i + Ih. The probability of a transition \nfrom state Si to state So is P(sols;) = (i + Ih. All other transition probabilities are \nzero. This Markov chain is homogeneous, irreducible, aperiodic, and finite so it has an \ninvariant distribution. By solving the balance equations, for I sufficiently small, we find \nthat \n\n(1) \n\nSince we assume that I is small, the probability that no mistake will occur for each of \nk - 1 consecutive time steps after a mistake, P(sk-d, is very small and we can say that \nthe probability of a mistake is approximately P(so). Therefore, from equation I, for small \nI' it follows that JLmaJicious ~ ..)2// 1r. \nIf we drop the assumption that the adversary is malicious, and instead assume the the \nadversary chooses the direction randomly at each iteration, then a similar sort of analysis \nyields that JLbenign = 0 ( 1 2/ 3 ). \nSince the foregoing analysis assumes a conservative tracker that chooses the best hy(cid:173)\npothesis every time it makes a mistake, it implies that for this concept sequence space \nand any conservative tracker, the mistake rate is 0(rl/2) against a malicious adversary \nand 0(r2/3b) against a benign adversary. For either adversary, it can be shown that \nc = JL -I\u00b7 \n\n5 CONCLUSIONS AND FURTHER RESEARCH \n\nWe can draw a number of interesting conclusions fonn the work we have done so far. \nFirst, tracking sequences of concepts is possible when the individual concepts are learn(cid:173)\nable and change occurs \"slowly\" enough. 
Theorem 2 gives a weak upper bound on the rate of concept change that is sufficient to ensure that tracking is possible.

Theorem 1 implies that there can be some trade-off between the size (VC-dimension) of the changes and the rate of change. Thus, if the size of the changes is restricted, Theorems 1 and 2 together imply that the maximum rate of change can be faster than for the general case. It is significant that a simple tracker that maintains a sliding window on the most recent set of examples can PAC-track the new concept after a change as quickly as a static learner can if it starts from scratch. This suggests it may be possible to subsume detection so that it is implicit in the operation of the tracker. One obviously open problem is to determine d in Theorem 1, i.e., what is the appropriate dimension to apply to the concept changes?

The analysis of the mistake and error rates presented in Section 4 is for a special case with VC-dimension 1, but even so, it is interesting that the mistake and error rates are significantly worse than the rate of change. Preliminary analysis of other concept classes suggests that this continues to be true for higher VC-dimensions. We are continuing work to extend this analysis to other concept classes, including classes with higher VC-dimension; non-conservative learners; and other restrictions on concept changes.

Acknowledgments

Anthony Kuh gratefully acknowledges the support of the National Science Foundation through grant EET-8857711 and Siemens Corporate Research. Ronald L. Rivest gratefully acknowledges support from NSF grant CCR-8914428, ARO grant N00014-89-J-1988, and a grant from the Siemens Corporation.

References

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36(4):929-965.
\n\nGalletti, I. and Abbott, M. (1989). Development of an advanced airborne speech recog(cid:173)\n\nnizer for direct voice input. Speech Technology, pages 60-63. \n\nHaussler, D., Littlestone, N., and Warmuth, M. K. (1987). Expected mistake bounds for \n\non-line learning algorithms. (Unpublished). \n\nLittlestone, N. (1989). Mistake bounds and logarithmic linear-threshold learning algo(cid:173)\n\nrithms. Technical Report UCSC-CRL-89-11, Univ. of California at Santa Cruz. \n\nLittlestone, N. and Warmuth, M. K. (1989). The weighted majority algorithm. In Pro(cid:173)\nceedings 0/ IEEE FOCS Conference, pages 256-261. IEEE. (Extended abstract only.). \nLucky, R. W., Salz, 1., and Weldon, E. 1. (1968). Principles 0/ Data Communications. \n\nMcGraw-Hill, New York. \n\nNatarajan, B. K. (1987). On learning boolean functions. In Proceedings o/the Nineteenth \n\nAnnual ACM Symposium on Theory o/Computing, pages 296-304. \n\nValiant, L. (1984). A theory of the learnable. Communications o/the ACM, 27:1134-1142. \nVapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative \nfrequencies of events to their probabilities. Theory 0/ Probability and its Applications, \n16:264-280. \n\n\f", "award": [], "sourceid": 402, "authors": [{"given_name": "Anthony", "family_name": "Kuh", "institution": null}, {"given_name": "Thomas", "family_name": "Petsche", "institution": null}, {"given_name": "Ronald", "family_name": "Rivest", "institution": null}]}