{"title": "PAC Generalization Bounds for Co-training", "book": "Advances in Neural Information Processing Systems", "page_first": 375, "page_last": 382, "abstract": null, "full_text": "PAC Generalization Bounds for Co-training\n\nSanjoy Dasgupta\n\nAT&T Labs\u2013Research\n\nMichael L. Littman\nAT&T Labs\u2013Research\n\ndasgupta@research.att.com\n\nmlittman@research.att.com\n\nDavid McAllester\n\nAT&T Labs\u2013Research\ndmac@research.att.com\n\nAbstract\n\nThe rule-based bootstrapping introduced by Yarowsky, and its co-\ntraining variant by Blum and Mitchell, have met with considerable em-\npirical success. Earlier work on the theory of co-training has been only\nloosely related to empirically useful co-training algorithms. Here we give\na new PAC-style bound on generalization error which justi\ufb01es both the\nuse of con\ufb01dences \u2014 partial rules and partial labeling of the unlabeled\ndata \u2014 and the use of an agreement-based objective function as sug-\ngested by Collins and Singer. Our bounds apply to the multiclass case,\ni.e., where instances are to be assigned one of \n\nlabels for \u0002\u0001\u0004\u0003 .\n\n1 Introduction\n\nIn this paper, we study bootstrapping algorithms for learning from unlabeled data. The\ngeneral idea in bootstrapping is to use some initial labeled data to build a (possibly partial)\npredictive labeling procedure; then use the labeling procedure to label more data; then use\nthe newly labeled data to build a new predictive procedure and so on. This process can\nbe iterated until a \ufb01xed point is reached or some other stopping criterion is met. Here we\ngive PAC style bounds on generalization error which can be used to formally justify certain\nboostrapping algorithms.\n\nOne well-known form of bootstrapping is the EM algorithm (Dempster, Laird and Rubin,\n1977). This algorithm iteratively updates model parameters by using the current model\nto infer (a probability distribution on) labels for the unlabeled data and then adjusting the\nmodel parameters to \ufb01t the (distribution on) \ufb01lled-in labels. When the model de\ufb01nes a joint\nprobability distribution over observable data and unobservable labels, each iteration of the\nEM algorithm can be shown to increase the probability of the observable data given the\nmodel parameters. However, EM is often subject to local minima \u2014 situations in which\nthe \ufb01lled-in data and the model parameters \ufb01t each other well but the model parameters are\nfar from their maximum-likelihood values. Furthermore, even if EM does \ufb01nd the globally\noptimal maximum likelihood parameters, a model with a large number of parameters will\nover-\ufb01t the data. No PAC-style guarantee has yet been given for the generalization accuracy\nof the maximum likelihood model.\n\nAn alternative to EM is rule-based bootstrapping of the form used by Yarowsky (1995),\nin which one assigns labels to some fraction of a corpus of unlabeled data and then infers\n\n\fnew labeling rules using these assigned labels as training data. New labels lead to new\nrules which in turn lead to new labels, and so on. Unlike EM, rule-based bootstrapping\ntypically does not attempt to \ufb01ll in, or assign a distribution over, labels unless there is\ncompelling evidence for a particular label. One intuitive motivation for this is that by\navoiding training on low-con\ufb01dence \ufb01lled-in labels one might avoid the self-justifying local\noptima encountered by EM. Here we prove PAC-style generalization guarantees for rule-\nbased bootstrapping.\n\nOur\nresults are based on an independence assumption introduced by Blum and\nMitchell (1998) which is rather strong but is used by many successful applications. Con-\nsider, for example, a stochastic context-free grammar. If we generate a parse tree using\nsuch a grammar then the nonterminal symbol labeling a phrase separates the phrase from\nits context \u2014 the phrase and the context are statistically independent given the nontermi-\nnal symbol. More intuitively, in natural language the distribution of contexts into which a\ngiven phrase can be inserted is determined to some extent by the \u201ctype\u201d of the phrase. The\ntype includes the syntactic category but might also include semantic subclassi\ufb01cations, for\ninstance, whether a noun phrase refers to a person, organization, or location. If we think of\neach particular occurrence of a phrase as a triple\nis the phrase itself,\nis conditionally\n. The conditional independence can be made to hold precisely if\nis the syntactic\n\n\u0001\u0010\u0003\nindependent of\nwe generate such triples using a stochastic context free grammar where\ncategory of the phrase.\n\n\u0001\u000f\u0003\nis the context, then we expect that\n\nis the \u201ctype\u201d of the phrase, and\n\n\u0002\u0001\u0004\u0003\u0006\u0005\b\u0007\t\u0005\n\u0001\f\u000b\u000e\n\n, where\n\ngiven\n\n\u0001\t\u000b\n\n\u0001\f\u000b\n\nor entirely on\n\nwhich consists of functions predicting\n\nwhich consists of functions predicting\n\nby alternately selecting\nand Mitchell is that\nstances, they show that, given a weak predictor in\nlearn\n\nBlum and Mitchell introduce co-training as a general term for rule-based bootstrapping in\n. In other words, there are\nwhich each rule must be based entirely on\ntwo distinct hypothesis classes,\n, and\n. A co-training algorithm bootstraps\n. The principal assumption made by Blum\n. Under such circum-\n, and given an algorithm which can\nunder random misclassi\ufb01cation noise, it is possible to learn a good predictor in\n. This gives some degree of justi\ufb01cation for the co-training restriction on rule-based\n\u0011\u0016\u000b\nbootstrapping. However, it does not provide a bound on generalization error as a func-\ntion of empirically measurable quantities. Furthermore, there is no apparent relationship\nbetween this PAC-learnability theorem and the iterative co-training algorithm they suggest.\n\n\u0003\u0014\u0013\nis conditionally independent of\n\ngiven\n\nfrom\n\nfrom\n\nand\n\n\u0011\u0016\u000b\n\n\u000b\u0015\u0013\n\nCollins and Singer (1999) suggest a re\ufb01nement of the co-training algorithm in which one\nexplicitly optimizes an objective function that measures the degree of agreement between\nthe predictions based on\n. They describe methods for \u201cboosting\u201d\nthis objective function but do not provide any formal justi\ufb01cation for the objective function\nitself. Here we give a PAC-style performance guarantee in terms of this agreement rate.\nThis guarantee formally justi\ufb01es the Collins and Singer suggestion.\n\nand those based on\n\nIn this paper, we use partial classi\ufb01cation rules, which either output a class label or output\na special symbol\nindicating no opinion. The error of a partial rule is the probability that\nthe rule is incorrect given that it has an opinion. We work in the co-training setting where\nwe have a pair of partial rules\n\u0012\u0018\u000b\n(sometimes) predicts\ncan be \u201ccomposite rules\u201d, such\nas decision lists, where each composite rule contains a large set of smaller rules within it.\nWe give a bound on the generalization error of each of the rules\nin terms of the\nempirical agreement rate between the two rules. This bound formally justi\ufb01es both the use\n\n\u0012\t\u0003\n\u0012\u000f\u0003\n. Each of the rules\n\n(sometimes) predicts\n\nwhere\n\nfrom\n\nfrom\n\nand\n\nand\n\nand\n\nand\n\n\u0001\u0010\u0003\n\n\u0012\u000f\u0003\n\n\u0012\u0018\u000b\n\n\u0012\u0018\u000b\n\n\u0001\t\u000b\n\n\u0007\n\u0007\n\u0007\n\u0001\n\u0003\n\u0001\n\u000b\n\u0011\n\u0003\n\u0007\n\u0001\n\u0003\n\u0011\n\u000b\n\u0007\n\u0001\n\u000b\n\u0012\n\u0011\n\u0003\n\u0012\n\u0011\n\u000b\n\u0001\n\u0003\n\u0001\n\u000b\n\u0007\n\u0011\n\u0003\n\u0001\n\u0003\n\u0001\n\u000b\n\u0017\n\u0007\n\u0007\n\u0012\n\u0003\n\u0012\n\u000b\n\fh1\n\nX\n1\n\nX2\n\nh2\n\nY\n\nFigure 1: The co-training scenario with rules\n\nand\n\n.\n\n\u0012\u0018\u000b\n\n\u0012\u000f\u0003\n\nof agreement in the objective function and the use of partial rules. The bound shows the\npotential power of unlabeled data \u2014 low generalization error can be achieved for complex\nrules with a suf\ufb01cient quantity of unlabeled data. The use of partial rules is analogous to\nthe use of con\ufb01dence ratings \u2014 a partial rule is just a rule with two levels of con\ufb01dence.\nSo the bound can also be viewed as justifying the partial labeling aspect of rule-based\nbootstrapping, at least in the case of co-training where an independence assumption holds.\nThe generalization bound leads naturally to algorithms for optimizing the bound. A simple\ngreedy procedure for doing this is quite similar to the co-training algorithm suggested by\nCollins and Singer.\n\n2 The Main Result\n\n. For any statement\nand\n\nWe start with some basic de\ufb01nitions and observations. Let\nof individual samples\n\n,\n\n,\n\nbe an i.i.d. sample consisting\n\nwe let\n\nbe the subset\n\n\u000e\u0016\n\n\u0001\u000b\n\n\u0001\u0006\u0003\n\n\u0001\u0006\u0005\n. For any two statements\n\n\u0002\u0003\u0002\u0004\u0002\n\n!\u001c\u001a\nand\n\n\r\f\u000e\u0017\u001d\u000f\u0003\u001a\n\n\u0007\t\b\n\r\f\u000e\u0007\u001f\u001e \u0017\u001d\u000f\u0004\u001a\nsome distribution over triples\nwhere\nand \u0019\nsample\n\u0002\u0001\nsome labeled samples\nunlabeled data. A partial rule\nbe interested in pairs of partial rules\n\n\u000f)'\n\u0007\t\u0005\nof pairs\n\n\u0005\n\u0001\n+,\n\n+*\n\n\u0011\u0012\u0001\u0003\u0013\u0015\u0014\nto be\n. For the co-training bounds proved here we assume data is drawn from\n, and\n\nwe de\ufb01ne the empirical estimate\n\n\r\f\u000e\u0007\u0010\u000f\n\n, and\n\nwith\n\n\u0007\t\b\n\n\u0013#\"\nare conditionally independent given\n\n\u0005\u0004\u0002\u0003\u0002\u0004\u0002\n, that is, \u0019\n\n\u0005\n\u0007\f\u0005\n\u0007\u001c\u000f\nIn the co-training framework we are given an unlabeled\n\u0007\u001c\u000f\ndrawn i.i.d. from the underlying distribution, and possibly\n. We will mainly be interested in making inferences from the\n. We will\nwhich largely agree on the unlabeled data.\n\nis a mapping from\n\non a set\n\nand\n\n\u000f('\n\n\u0007\f\u0005\n\n\f\u000e\u0007\u001b\u001a\u001c\u0017\u001d\u000f\n\u0013&\"\n\n\u0011%$\n\nto\n\n\u0002\u0001\n.\n\n\f\u0002\u0001\n\n\u0011%$\n\n\u0005\u0004\u0002\u0003\u0002\u0004\u0002\n\n\u0017-\u0016\n\nThe conditional probability relationships in our scenario are depicted graphically in \ufb01g-\nure 1. Important intuition is given by the data-processing inequality of information theory\n(Cover and Thomas, 1991):\n. In other words, any mutual information\nagree to a large\nbetween\nextent, then they must reveal a lot about\nrequires no\nlabeled data at all. This simple observation is a major motivation for the proof, but things\nare complicated considerably by partial rules and by approximate agreement.\n\n\u0012\f\u000330\n.2\f\nmust be mediated through\n. And yet \ufb01nding such a pair\n\n\u0012\u0018\u000b\u0003\u000f\n. In particular, if\n\n\u0012\u000f\u000310\n\nand\n\nand\n\n\u0012\u0010\u0003\n\n\u0012\t\u0003\n\n./\f\n\n\u0012\u0018\u000b\n\n\u0012\u0018\u000b\n\n\u0007\u001c\u000f\n\n\u0012\u0010\u0003\n\n\u0012\u0018\u000b\n\nFor a given partial rule\n\n\u0003(4\n\nwith \u0019\n\u0017\u001d\u000f6587\n9:\f<;<\u000f='?>A@\u0006BDC E1F\n\u0003HGJIKGML\n\nde\ufb01ne a function\n\non\n\nby\n\n\u0011%$\n\n\u0005\u0003\u0002\u0004\u0002\u0004\u0002\n\n'ON\u0010\u001a\n\n\u0007P'?;<\u000f\n\nWe want\nwant \u0019\ninformation as\nlabels\nto high con\ufb01dence, that\n\n\u0012\t\u0003\n\u0012\t\u0003 'Q9:\f\u0002\u0007\u001c\u000f \u001a\u0010\u0012\t\u0003\n\u0005\u0004\u0002\u0003\u0002\u0004\u0002\n\nto be a nearly deterministic function of the actual label\nto be near one. We would also like\n\n; in other words, we\nto carry the same\n. This is equivalent to saying that\nshould be a permutation of the possible\n. Here we give a condition using only unlabeled data which guarantees, up\nis a permutation; this is the best we can hope for using unlabeled\n\n\u0017\u001d\u000f\n\n\u0012\u000f\u0003\n\n\u0011A$\n\n\n\u0001\n\u0013\n\u0007\n\u0017\n\u0018\n\u0019\n\u001a\n\u0003\n\u0001\n\u000b\n\n\u0001\n\u0003\n\u0003\n\u0005\n\u0007\n\u0013\n\u0005\n\n\u0016\n\u0001\n\u000b\n\u000b\n\u0001\n\u0003\n\u0001\n\u000b\n\u0007\n\f\n\u0001\n\u0003\n\u001a\n\u0001\n\u000b\n\u0019\n\u0003\n\u001a\n\f\n\u0001\n\u000b\n\u001a\n\u0001\n\u0003\n\u0019\n\f\n\u0001\n\u000b\n\u001a\n\u0003\n\u000b\n\n\u0012\n\"\n\"\n\u0005\n\n\u0005\n\u0001\n\u0007\n\u0007\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\n\u000f\n\u0012\n\u0003\n\f\n\u0012\n'\n9\n\u0005\n\n\u0016\n\u0019\n\f\n\u0012\n\u0003\n\u0007\n\f\n4\n'\n\u0007\n9\n\u0005\n\n\u0016\n9\n\f\u0012\u0018\u000b\n\n\u0012\f\u0003\n\n\u0017\u001d\u000f\n\n\u0005\u0015\u0012\n\nand\n\n\u0003D4\n\nso that\n\n\u001a:9:\f\u0002\u0007\u001c\u000f '\n\nis a permutation and\n\nis a permutation then\n\nusing only\ndata alone. We also bound the error rates \u0019\nunlabeled data. In the case of \nis either the identity\n\u0003 , if\nfunction or the function reversing the two possible values. We use the unlabeled data to\nselect\nhas low error rate. We can then use a\nsmaller amount of labeled data to determine which permutation we have found.\nWe now introduce a few de\ufb01nitions related to sampling issues. Some measure of the com-\nis needed; rather than VC dimension, we adopt a clean notion of\nplexity of rules\nbit length. We assume that rules are speci\ufb01ed in some rule language and write\nfor the\n. We assume that the rule language is pre\ufb01x-free\nnumber of bits used to specify the rule\n(no proper pre\ufb01x of the bit string specifying a rule is itself a legal rule speci\ufb01cation). A\nand\n. The \ufb01rst, as\n\u0012\u0018\u000b\nwe will see, is a bound on the sampling error for empirical probabilities conditioned upon\n.\n\n\u0005\u0004\u0002\u0004\u0002\u0003\u0002\n. The second is a sampling-adjusted disagreement rate between\n\nwe now de\ufb01ne the following functions of the sample\n\n. For given partial rules\n\n\u0012\t\u0003\nand\n\n\u0003\u0005\u0004\u0007\u0006\n\n\u0006\t\b\n\nand\n\nand\n\n\u0012\u000f\u0003\n\n\u0012\t\u0003\n\n\u0012\u0018\u000b\n\n\u0011A$\n\n'ON\n\npre\ufb01x free code satis\ufb01es the Kraft inequality\u0001\u0003\u0002\n\u0015\u000e+\n\r\u001f\u001e! #\"\u001a\u0019$\r\f%\n\u000f\u0012\u0011\u0017%$&'%\n\u000f\u0016\u0015(%\n\u0019)&*\u001e! \n\u001132\n\"-%\n.\u0007\r\u0010\u000f\n\u001b0/\u000e\u00131\u000f\n\u001b546\u00197%\n\u001b'46\u0019\u0012<\n\r\u0010\u000f\u0012\u00111\u001b0/\u0007%\u0017\u000f\u0016\u0015:\u001b;/\u000e\u0013\f\u000f)\u0011\n\u0012\u0018\u000b\u0006\u0005$C3\u000f\n, ifD\n\n\f\u000b\u000e\r\u0010\u000f\u0012\u0011\u0014\u0013\f\u000f\u0016\u0015\u0017\u0013\u000e\u0018\u001a\u0019\u001c\u001b\n\u000b\u000e\r\u0010\u000f\u0012\u0011\u0014\u0013\f\u000f\u0016\u0015\u0017\u0013\u000e\u0018\u001a\u0019\u001c\u001b\nthenD\n\nTheorem 1 With probability at least\nfor all\nall\n\n\u0005GC3\u000f\r5\n\nand\n,\n\n\u0012\f\u0003\u0006\u0005\n\nand\n\nfor\n\n\u0012\u000f\u0003\n\n\u0012\f\u0003\n\n\u0012\f\u0003\n\n\u0012\u0018\u000b\n\n\u0012\u0018\u000b\n\nNote that if the sample size is suf\ufb01ciently large (relative to\nis near zero. Also note that if\nnot\n\n\u0012\u0018\u000b%\u001a\nis near one. We can now state our main result.\n\n\u0012\u0010\u0003\n\nhave near perfect agreement when they both are\n\n\u0010\u000f)\u0011\n\n\u001b0/=%\u001a\u000f\u0016\u0015:\u001b0/\u000e\u0013>\u000f\u0012\u0011\n\nand\n\n\u001b543\u0019?<@\"\u0017\n\f\u000b\f\r\u0010\u000f\u0012\u0011\u0014\u0013\f\u000f\u0016\u0015\u0017\u0013A\u0018(\u0019\n) thenBHI\n\u0012\u0018\u000b\u0006\u0005$C3\u000f\n\n\u0012\f\u0003\u0006\u0005\n\n\u0003-4\n'?N6\u001a39:\f\n\n\u0003-4\nThe theorem states, in essence, that if the sample size is large, and\non the unlabeled data, then\n\n\u000f='ON\n\n\u0003-4\n\n\u0017\u001d\u000f\n\nlargely agree\nis a good estimate of the error rate\n\n\u0005GC3\u000f\n\nthen (a)\n\n$FE5C over the choice of the sample\n\u0017\u001d\u000fIH5B\n\u0005GC3\u000f\n\n\u0003-4\n'?N6\u001a\u0006\u0012\n\nand\n\n\u0012\t\u0003\n\n, we have that\nis a permutation and (b) for\n\n.\n\n\u0003\u00154\n\nN6\u001a\n\n'?N\n\n\u0003-4\n\n\u0017\u001d\u000f\n\n\u0003-4\n'?N6\u001a39:\f\n\n\u000f='ON\n\n\u0003-4\n\n\u0017\u001d\u000f\n\nThe theorem also justi\ufb01es the use of partial rules. Of course it is possible to convert a\npartial rule to a total rule by forcing a random choice when the rule would otherwise return\n. Converting a partial rule to a total rule in this way and then applying the above theorem\nis total and is\nto the total rule gives a weaker result. An interesting case is when \n. In this case the empirical error\na perfect copy of\nrate of the corresponding total rule \u2014 the rule that guesses when\nhas no opinion \u2014 will\nbe statistically indistinguishable from from 1/2. However, in this case theorem 1 can still\nestablish that the false positive and false negative rate of the partial rule\n\nhappens to be\n\nis near zero.\n\n, and \u0019\n\n$1!\tJ\n\n\u0003-4\n\n\u0017\u001d\u000f\n\n\u0012\u0004\u0003\n\n\u0003 ,\n\n3 The Analysis\n\nWe start with a general lemma about conditional probability estimation.\n\nLemma 2 For any i.i.d. sample\nin the sample, the following holds with probability at least\n\n, and any statements\n\nand\n\n\u0012\u0010\u0003\n\nabout individual instances\n\nNPO\n\n$KELD over the choice of\n!\u001aD\n\n\r\f\u000e\u0017\u001d\u000f\u0003\u001a\n\n.\n\n(1)\n\n\u001aA\u0017\u001d\u000f1E\n\n\u0017\u001d\u000f\n\n\f\n\u0012\n'\nN\nN\n\u0003\n4\n'\n'\n9\n9\n9\n\u001a\n\u0012\n\u001a\n\u0012\n\u0002\n$\nN\n\u0013\n\u0005\n\n\u0016\n\n\u0012\n\u000b\n\u0005\n\u0012\n\u0003\n4\n'\n\u0017\n\u0012\n\u0003\n\u0012\n\u000b\n\u001d\n,\n\u0015\n8\n\u0018\n9\n2\n\u0018\n9\n2\n2\n\u001a\n\u001a\n\u001a\n\f\n\u0017\nI\n\f\n\u0005\n\n\u0012\n\u000b\nI\n\f\n7\n$\n\b\nN\n\b\n\n9\n$\n\b\nN\n\b\n\n\u0019\n\f\n\u0012\n\u0007\n\u0005\n\u0012\n'\n\b\n\u0018\n\u0019\n\f\n\u0012\n\u000b\n'\nN\n\u0005\n\u0012\n'\nI\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\nD\nI\n\f\n\u0005\n\u0012\n\u000b\n\u0002\n\u0012\n\u0003\n\u0012\n\u000b\n\u0018\n\u0019\n\f\n\u0012\n'\n\u0012\n\u000b\n\u0005\n\u0012\n'\n\u0019\n\f\n\u0012\n\u0007\n\u0005\n\u0012\n'\n\u0017\n'\n\u0012\n\u000b\n\u0007\n\f\n\u0012\n'\n\u001a\n\n\u001a\n\n\u0007\n\u0017\n\nM\nM\nM\n\u0019\n\f\n\u0007\n\u0018\n\u0019\n\f\n\u0007\n\u001a\nM\nM\nM\n\b\n\u001d\n\u0003\n\u0003\n\u001a\n\fProof. We have the following where the third step follows by the Chernoff bound.\n\n\u0019\u0001\n\n\u001aA\u0017\u001d\u000f1E\n\n\r\f\u000e\u0017\u001d\u000f\u0003\u001a\n\n\r\f\u000e\u0017\u001d\u000f\u0003\u001a\n\n\u0017\u001d\u000f\n\n\u0019\u0017\u0016\n\n\u000f\u0015\u0014\n\u000f\u0015\u0014\u0014D\n\nTherefore, with probability at least\n\n5\u0003\u0002\n\n\f\u000e\u0007\n\n\u000b\u0007\u0006\t\b\n\n\f\u000b\u000e\r\u0010\u000f\f\u0006\u0012\u0011\n\u0017\u001d\u000f1E\n\n\f\u000e\u0007\n\n\u0002\u001c\u001b\n\n\u0003\t\u0004\u0007\u0006\n\n\u0002\u001e\u001d\n\n\u0003\u0016\u00041\u0006\n\n$KE\n\u0017\u001d\u000f1E\n\n\u000b\u0018\u0006\t\b\n\n\f\u000b\u000e\r\u0010\u000f>\u0006\n\n\r\f\n\n\u0017\u001d\u000f\u0004\u001a\n\n\u001a\u0019\n\n\u0017\u001d\u000f\n\n,\n\n\u0005GC3\u000f\n\n'?N\n\f<9:\f\n\n\u000f='!\u001f\n\n\u001a\u0006\u0012\n\n\u0017\u001d\u000f1E\n\n\u0003-4\n\n\u0003-4\n'ON\n\n\u0017\u001d\u000f\n\n\u0003-4\n'?N6\u001a\u0006\u0012\n\n'ON\n\u0003-4\n\n\f<9:\f\n\n\u0007\u001c\u000f='ON\n\n'ON\n\n\u0003-4\n\n\u0017\u001d\u000f\n\n\f\u000e9:\f\u0002\u0007\u001c\u000f='%\u001f\n\n\u0013\u001c#\n\n\u001a\u0006\u0012\u0018\u000b\n\n'ON\n\n\u0012\f\u0003\n\n\u0017\u001d\u000f\n\n\u0003-4\n\u001aA9:\f\u0002\u0007\u001c\u000f\nN6\u001a\n\n\u0017\u001d\u000f\n\n\u0003-4\n\n'!\u001f\n9:\f\u0002\u0007\u001c\u000f='!\u001f\n\u0007\u001c\u000f=' \u001f\n\u0007\u001c\u000f=' \u001f\n\n\u001a39:\f\n\n\u0012\f\u0003\r'%\u001f\n\n'ON6\u001a39:\f\u0002\u0007\u001c\u000f='?N\n\u0003-4\nN\u0010\u001a39:\f\u0002\u0007\u001c\u000f='?N\n\u0012\f\u0003\n'ON6\u001a39:\f\n\n\u0003-4\n\u0003-4\n\n\u0017\u001d\u000f\n\n\u0017\u001d\u000f\n\n\u0017\u001d\u000f\n\n\u0017\u001d\u000f\n\u0012\t\u0003\n\n\u0012\t\u0003\n\n\u0017\u001d\u000f\n\n\u0017\u001d\u000f\n\n(2)\n\n\u0005GC3\u000f\n\b\u0004\n\u0005GC3\u000f\u00155\n\nsuch that\n\nN\u0010\u001a\u0006\u0012\nfor any given\nprobability at least\n\n\u0003-4\n'ON\n. By the union bound and the Kraft inequality, we have that with\nthis must hold simultaneously for all\n.\n\n\u0017\u001d\u000f\u0004\u001a\nand\n\n'ON6\u001a\u0006\u0012\n\n, and all\n\n'ON\n\n,\n\n\u0003-4\n\n'?N\n\n\u001a39:\f\n\n\u000f='?N\n\nand\nis a permutation, and moreover, for any\n\nfor which equation (2) as well asD\n. We need to show that there exists some\u001f\n\u0012\u0018\u000b\u0006\u0005$C3\u000f\n\nand condition (2) we know\n\n\u0017\u001d\u000f\n\n\u0017\u001d\u000f\n\n$1!\n\n\u001a\u0006\u0012\u0018\u000b\n'?N\n'?N6\u001a\u0006\u0012\u0018\u000b\n\n\u0017\u001d\u000f\n\u0017\u001d\u000f\u00105\n\n\u0012\f\u0003\n\u0012\f\u0003\n\n\u0012\f\u0003\n\u0012\t\u0003\n\n'?N\n\n$\u0012!\n\n, it follows that \u0019\n\n\u0012\f\u0003\n\u0003 . Rewriting this\n\nLemma 3 Pick any rules\nhold for all\n\n. Then\n\nProof.\n\nPick any\n\n\u000f='?N\n\n9:\f\u000e\u001f\nSinceD\n\n\u001a\u0006\u0012\u0018\u000b\nby conditioning on\n\n\u0011%$\n\n\u0005\u0003\u0002\u0004\u0002\u0004\u0002\n\n\u0012\t\u0003\r'?N\n\u0012\f\u0003\n\n\u0012\t\u0003\n, we get\n\n. By the de\ufb01nition ofD\n\u0012\u0018\u000b\u0006\u0005GC3\u000f\r587\n\u0007P' \u001f\n\nG\u001c\u0013\u000bGML\n\n\u001a\u0006\u0012\n\n\u0003-4\n\n'ON\n\n\u0007P'!\u001f\nThe summation is a convex combination; therefore there must exist some \u001f\nthere must exist a\u001f with\n\n\u0003)4\nLemma 4 Pick any rules\n\n\u0017\u001d\u000f 5\nand\n\nis a permutation.\n\n\u0003 . So for each\n\n'QN \u001a\u000f\u0007\n\n'\u0017\u001f\n\nwhereby\n\n'ON\u0010\u001a\n\n\u0003-4\n\n\u0017\u001d\u000f\n\n\u0017\u001d\u000f\n\n$\u0012!\n\n$\u0012!\n\nsatisfying the conditions of the previous lemma. Then\n\nsuch that\n,\n\n9:\f\"\u001f%\u000f\u0015'QN\n\n\u0007\u001c\u000f='ON\n\f<9:\f\nProof.\ncontent as\n\n'ON\n\n\u0003-4\n\nis at leastD\n\n\u0017\u001d\u000f\nand\n\nBy the previous lemma\n\nis a permutation, so\n\nhas the same information\n\n. Therefore\n\nare conditionally independent given\n\n9:\f\n\n\u0007\u001c\u000f\n\n.\n\n\u0005$C3\u000f\n\n. For any\n\n,\n\n9:\f\u0002\u0007\u001c\u000f\n\nM\nM\nM\n\u0019\n\f\n\u0007\n\u0018\n\u0019\n\f\n\u0007\n\u001a\nM\nM\nM\n\u0004\n\u0005\n\u000b\n\u0006\n'\n\u0013\nL\n\u0019\n\f\n\u001a\n'\n\nM\nM\nM\n\u0019\n\u001a\n\u0018\n\u0019\n\u001a\nM\nM\nM\n5\n\u0002\n\u0004\n\u0005\n\u000b\n\u0006\nM\nM\nM\nM\n\u001a\n'\n\b\n\u0013\nL\n\u0019\n\f\n\u001a\n'\n\n'\nD\n\u0006\n\u0006\nC\n!\n\n\u001a\n\u0018\n\u0019\n\f\n\u0012\n\u0003\n'\n\u000b\n\u0005\n\u0012\n\u0003\n4\n'\n\u0019\n\f\n\u0012\n\u0003\n\u000b\n\u0005\n\u0012\n'\n\b\nB\nI\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\n$\nE\nC\n\u0012\n\u0003\n\u0012\n\u000b\n$\n\b\nN\n\u0012\n\u0003\n\u0012\n\u000b\nI\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\n7\n$\n\b\nN\n\b\n\n9\nN\n\u0019\n\f\n\u0012\n\u0003\n\u0007\n\u0005\n\u0012\n'\n5\n\u0003\n\u0002\nN\n\u0013\n\u0005\n\n\u0016\nI\n\u0019\n\f\n'\nN\n\u0005\n4\n'\nE\n\u0019\n\f\n4\n'\nN\n\u0005\n4\n'\n\u0001\nD\nI\n\f\n\u0005\nI\n\f\n\u0005\n\f\n\u0005\n4\n'\n\u0007\n\u0013\n\u0003\n\u0019\n\f\n\u000b\n\u0005\n\u0012\n'\n\u0019\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n'\n5\n\u0003\n\u0002\n\u0019\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n'\nN\n9\n\u0012\n\u0003\n\u0012\n\u000b\n\u0019\n\u001a\n\u0012\n\u000b\n\u0005\n\u0012\n'\nI\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\n9\n\u0007\n\u0001\n\u0003\n\u0001\n\u000b\nN\nD\nI\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\n\b\n\u0019\n\f\n\u0012\n\u0003\n\u001a\n\u0012\n\u000b\n'\nN\n\u0005\n\u0012\n'\n\u0019\n\f\n\u0012\n\u000b\n'\nN\n\u0005\n\u0012\n'\n'\n\u0013\n\u0013\n\u0019\n\u0007\n\u000b\n\u0005\n\u0012\n'\n\u0016\n\u0019\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n'\nE\n\u0019\n\f\n\u0012\n'\n\u0005\n\u0012\n'\n\u0019\n\b\n\u0019\n\u001a\n\u0012\n\u000b\n\u0005\n\u0012\n'\n\u0016\n\u0019\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n'\nE\n\u0019\n\f\n\u0012\n'\n\u0005\n\u0012\n\u0003\n4\n'\n\u0019\nH\n\u0013\n$\nI\n\u0019\n\u0005\n4\n'\n\u0016\n\u0019\n\f\n\u0005\n4\n'\nE\n\u0019\n\f\n\u0005\n4\n'\n\u0019\n\fwhere the second step involves conditioning on\nhave \u0019\nnegative, whereby\n\n9:\f\u0002\u0007\u001c\u000f\u0010'\n\n\u0017\u001d\u000f\n\n$\u0012!\n\n. Also by the previous lemma, we\n\u0003 so the second term in the above sum must be\n\n9:\f\u0002\u0007\u001c\u000f\n\n\f<9:\f\n\f<9:\f\n\n\u000f='ON6\u001a\u0006\u0012\n\u000f='ON6\u001a\u0006\u0012\n\n'?N\n'?N\n\n\u0003-4\n\u0003-4\n\n\u0017\u001d\u000f\n\u0017\u001d\u000f\n\n'ON\n'?N6\u001a\n\n9:\f\u0002\u0007\u001c\u000f='ON\n9:\f\u0002\u0007\u001c\u000f='\n\n\u0017\u001d\u000f\n\n\u0017\u001d\u000f\n\n\u0003-4\n\n\u001f&\u001a\n\n\u0005GC3\u000f\n\nUnder the conditions of these lemmas, we can derive the bounds on error rates:\n\n\u0003-4\n'ON\u0010\u001a39:\f\u0002\u0007\u001c\u000f='\n\n\u0003-4\n\n\u0017\u001d\u000f\n\n4 Bounding Total Error\n\n\u0003\u00154\n\nN6\u001a\n\n\f\u000e9:\f\u0002\u0007\u001c\u000f='ON6\u001a\u0006\u0012\n\u0012\f\u0003\n'?N6\u001a\u0006\u0012\u0018\u000b\n\n'?N\n\u000b\t'\n\n\u0012\t\u0003\n\n\u0012\f\u0003\n\n\u0017\u001d\u000f\n\u0017\u001d\u000f\n\n\u0017\u001d\u000fIH5BHI\n\u0005GC3\u000f\n\n\u0012\t\u0003\n\n\u0005GC3\u000f\n\nAssuming that we make a random guess when\nwritten as follows.\n\n\u0012\u0010\u0003 '\n\n, the total error rate of\n\ncan be\n\n\u0012\t\u0003\n\n\u0001\u0003\u0002\u0004\u0002\u0006\u0005\u0007\u0002\n\n\u0003-4\nTo give a bound on the total error rate we \ufb01rst de\ufb01ne \b\nerror rate for label\n\ngiven in theorem 1.\n\n9:\f\u0002\u0007\u001c\u000f\r\u001a\u0006\u0012\n\n\u0003-4\n\n\u0003-4\n\n\u0017\u001d\u000f\n\n\u000f='\n\n\u0017\u001d\u000f=H\n\n\u0012\t\u0003\n\n\u0012\f\u0003\n\n\u0012\u0018\u000b\u0006\u0005GC3\u000f\n\n\u0012\t\u0003\n\n'ON\n\n'ON\n\n\u0012\t\u0003\n\n\u0005GC3\u000f\n\nWe can now state a bound on the total error rate as a corollary to theorem 1.\n\n\u0017\u001d\u000f\n\nto be the bounds on the\n\nE8$\n\u0005GC3\u000f\n\u0017\u001d\u000fIH'BHI\n\n\u0012\t\u0003\u0006\u0005\n\n\u0012\u0018\u000b\u0006\u0005GC3\u000f\n\nCorollary 5 With probability at least\ning for all pairs of rules\n\nand\n\nsuch that for all\n\n\u0012\t\u0003\n\n\u0005GC\n\n\u0001\t\u0002\u0006\u0002\u0006\u0005\u0007\u0002\n\nE8$\u0006\u000f\n\n\u0012\u000f\u0003\n\n.\n\n\u0012\u0018\u000b\n\nC over the choice of\nwe haveD\u001cI\nB\u0012\f\n\nC E1F\n\n\u0005GC\n\u0017\u001d\u000fIH'B\u0012\f\n\n\u0005$C\n\nwe have the follow-\nand\n\n\u0012\t\u0003\n\n\u000f 5\n\n\u0005GC\n\n\u0005GC\n\n\u0017\u001d\u000f\u0007E\n\n\u0003(4\n\nE8$\nNPO\n\nProof. From our main theorem, we know that with probability at least\n.\nN-\u001aM9:\f\u0002\u0007\u001c\u000f\nprobability at least\n\nis bounded by \b\n\n\u0003#4\n\u0003 ,\n\n\u0003#4\n\n\u0017\u001d\u000f\n\n\u0003 , for all\n. This implies that with\n\n\u0012\f\u0003\n\n\u0017\u001d\u000f\n\nC E1F\n\n\u0013A\f\n\n\u0012\t\u0003\n\n\u0005GC\n\n\u000fIH\n\n(3)\n\n\u0012\f\u0003\n\n\u0017\u001d\u000f\n\n$KE\n\n\u0005$C\nE8$\n\nB\u0012\f\n\n\u0005$C3\u000f\n\n$KE0C\n\n\u0001\u0003\u0002\u0004\u0002\u0006\u0005\u0007\u0002\n\n\u0012\f\u0003\u0004\u000f\n\n\f\n\u0012\n\u0003\n'\n\u001f\n\u0005\n\u0012\n\u0003\n4\n'\n5\nD\nI\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\n\b\n\u0019\n\u0007\n\u000b\n\u0005\n\u0012\n'\n\u0016\n\u0019\n\f\n\u0012\n\u0003\n\u001a\n\u0005\n\u0012\n\u0003\n4\n'\nE\n\u0019\n\f\n\u0012\n\u0003\n4\nN\n\u0005\n\u0012\n'\n\u0019\n\b\n\u0019\n\u0007\n\u000b\n\u0005\n\u0012\n'\n\u0019\n\f\n\u0012\nN\n\u0005\n\u0012\n'\n\b\n\u0019\n\f\n\u0012\n'\n\u0012\n\u000b\n\u0005\n\u0012\n\u0003\n4\n'\n\u0019\nN\n\u0005\n4\n'\n\b\n\u0018\n\u0019\n\f\n4\n'\nN\n\u0005\n4\n'\n\f\n\u0005\n\u0012\n\u000b\nD\nI\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\n\u0017\n\f\n\u0012\n\u0003\n\u0019\n\f\n\u0012\n'\n\u0019\n\f\n\u0012\n'\n'\n\n\n\u0019\n\f\n\u0012\n\u0003\n'\nI\n\f\n\u0005\n\u0012\n\u000b\nN\n\b\nI\n\f\n\u0005\n'\n$\nD\nI\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\n\n\u0018\n\u0019\n\f\n4\n\u001a\n\u0012\n\u000b\n\u0005\n4\n'\n\f\n\u0011\n$\nE\n\nN\n\f\n\u0005\n\u0012\n\u000b\n!\n\u0003\n7\n\b\nI\n\f\n\u0005\n\u0012\n\u000b\n!\n\u0003\n\u000f\n\b\n\f\n\n!\n\n\f\n\u0012\n\u0003\n\u000f\n\b\n\n\u0018\n\u0019\n\f\n\u0012\n'\n\u001a\n\u0012\n\u0003\n\u001a\n!\n\u0003\n\u000f\n\u0011\n\u0013\n\b\n\u0013\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\n!\n\u0003\n\u000f\nH\n\n\n\n\u0018\n\u0019\n\f\n\u0012\n\u0003\n'\n\u001a\n\u0012\n\u0003\n\u001a\n!\n\u0003\n\u000f\n\u0011\n\n'\n\u001d\n\n\u0003\nH\nN\nO\n\u0003\n!\nC\n\u0003\n\u001a\n\n\u001a\n$\nE\nC\n!\nN\n\u0019\n\f\n\u0012\n'\n'\nN\n\u0005\n\u0012\n'\nI\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\n!\n\u0003\n\u000f\n!\n\f\n\b\n\u0019\n\f\n4\n'\n\u0013\n\b\n\u0005\n\u0012\n\u000b\n!\n\u0003\n\n\n\f\n\u0019\n\f\n4\n'\n\u000f\n\f\u00133\f\n\n\u0017\u001d\u000f\n\n\u0005GC\n\n\u0017\u001d\u000f\u0004\u001a\n\n\u0012\f\u0003\u0006\u0005\n\u0003-4\n\nequal to\n\nE8$\u0006\u000f\n\n\u0012\u0018\u000b\u0006\u0005$C\n\n\u0003 we have that\n\nWith probability at least\n\nis no larger\n. So by the union bound both of these conditions hold simultaneously with\n we have that the upper\n\nthanB\u0012\f\nprobability at least\n$KE0C . Since\nCPE3F%\u0013\nbound in (3) is maximized by setting \u0019\n\u0017\u001d\u000f\nCorollary 5 can be improved in a variety of ways. One could use a relative Chernoff\nin the case where this probability is small.\nbound to tighten the uncertainty in \u0019\n\u0003 4\nOne could also use the error rate bounds \b\n\n\f\u000e9:\f\u0002\u0007\u001c\u000f='\nby a convex combination.\n\u0003-4\nN6\u001a\n, e.g., the rule\nAnother approach is to use the error rate of a rule that combines\noutputs\n, and otherwise guesses a random\nvalue. This combined rule will have a lower error rate and it is possible to give bounds on\nthe error rate of the combined rule. We will not pursue these re\ufb01ned bounds here. It should\nbe noted, however, that the algorithm described in section 4 can be used with any bound on\ntotal error rate.\n\n\u0005GC\nto construct bounds on \u0019\n\u0005$C\n\n. One could then replace the max over \b\n\n, otherwise outputs\n\nB\u0012\f\n\n\u000b)4\n\nand\n\n\u0003&4\n\n\u0005$C\n\n\u0003(4\n\n\u0017\u001d\u000f\n\n\u0017\u001d\u000f\n\n\u0017\u001d\u000f\n\nif\n\nif\n\n.\n\n5 A Decision List Algorithm\n\nThis section suggests a learning algorithm inspired by both Yarowsky (1995) and Collins\nand Singer (1999) but modi\ufb01ed to take into account theorem 1 and Corollary 5. Corol-\nlary 5, or some more re\ufb01ned bound on total error, provides an objective function that can\nbe pursued by a learning algorithm \u2014 the objective is to \ufb01nd\nso as to minimize\nthe upper bound on the total error rate. Typically, however, the search space is enormous.\nFollowing Yarowsky, we consider the greedy construction of decision list rules.\n\nand\n\nLet\n\nand\nand for\n\nbe two \u201cfeature sets\u201d such that for\n\nand\n\nwe have\n\nand\n. We assume that\n\nwe have\n\n\u0005\u0004$\n\nand\n\n, i.e., a \ufb01nite sequence of the form\n\n\u000b\u0014\u0013)\"\n\u0011\u00067\ndecision list over the features in\nwhere\n\u0005\u0004\u0002\u0003\u0002\u0004\u0002\ndecision tree. More speci\ufb01cally, if\nis\nWe de\ufb01ne an empty decision list to have value\nfollows where \n\n\u00043I\n\u0002\u0003\u0002\u0004\u0002\n\u0006\u0001\u0018\u000b\u0006\u0005\u0007\u0004\u0006\u000b\u000e\r\nand otherwise equals the value of the list\n\u0003\u0001\u000f\u000b\u0006\u0005\b\u0004\u0006\u000b\nin\n\u0005\u0004$\u0012\u000f\n\f<7\n\n\u0001/I\nif\n\u0001\u000f\u00031\f\u0002\u0001\u000f\u0003\u0004\u000f\t'Q$\n\nis the number of feature-value pairs in\n\n\u0005\u0004$3\u0016\n\n\u0006\u0001\t\u0003\n\n\u0005\u0005\u0004\n\n\u0012\u000f\u0003\n\n\u0011A$\n\n\u0011\u00067\n\n;\n\n.\n\n. A decision list can be viewed as a right-branching\nis the list\n\n;\n\n;\n\n\u0005\u0005\u0004\n\u0003\u0001\u0010\u0003\n\u0006\u0001/L\n\u0005\u0005\u0004\n;\n;\nL\u0006\r\n\u0002\u0004\u0002\u0003\u0002\nwe can de\ufb01ne\n\n\u0003\u0001JL\n\n\u0005\u0005\u0004\n\n;\n\nis to be a\n;\n\n\f\u0002\u0001\n\u0005\u0005\u0004\n\u0006\u0001/L\n\u0002\u0003\u0002\u0004\u0002\nthen\n\u0012\t\u00031\f\n\u0001\t\u0003\u0003\u000f\non\n.\n\u0001\t\u0003\nas\n\u0012\u0010\u0003\n\n. ForB\n$KE0B\n. This implies the Kraft inequality\u0001\n\n\u0012\f\u0003\n\n\u0005\u000b\t\n\n\u0005\n\t\n\n\u00033\u001a\n\nIt is possible to show that \u0003\t\u00041\u0006\n\ngenerates the rule\nneeded in theorem 1 and corollary 5. We also assume that\nfeatures\n\nand de\ufb01ne\n\n\u0006 equals the probability that a certain stochastic process\n\nwhich is all that is\nis a decision list over the\n\n\u0003\u0005\u0004\u0007\u0006\n\nsimilarly.\n\n\u0006I\b\n\nFollowing Yarowsky we suggest growing the decision lists in a greedy manner adding one\nfeature value pair at a time. A natural choice of greedy heuristic might be a bound on the\ntotal error rate. However, in many cases the \ufb01nal objective function is not an appropriate\nchoice for the greedy heuristic in greedy algorithms. A* search, for example, might be\nviewed as a greedy heuristic where the heuristic function estimates the number of steps\nneeded to reach a high-value con\ufb01guration \u2014 a low value con\ufb01guration might be one step\naway from a high value con\ufb01guration. The greedy heuristic used in greedy search should\nestimate the value of the \ufb01nal con\ufb01guration. Here we suggest using\nas a heuristic estimate of the \ufb01nal total error rate \u2014 in the \ufb01nal con\ufb01guration we should\n\nCPE3F\n\n\u0005GC\n\n$\nE\nC\n!\n\u001a\n\u0018\n\u0019\n\f\n\u0012\n\u0003\n4\n'\nE\n\u0019\n\f\n\u0012\n\u0003\n4\n'\n\u001a\n\u0012\n\u0003\n\u001a\n!\n\u0003\n\u000f\n\b\n!\n\u0003\n\u000f\n\b\n\f\n\n!\n\f\n\u0012\n'\n\u0018\n\u0019\n\f\n\u0012\n'\nE\n\u001a\n\u0012\n\u0003\n\u001a\n!\n\u0003\n\u000f\n\f\n\u0012\n'\nI\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\n!\n\u0003\n\u000f\n\u0012\n'\nI\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\n!\n\u0003\n\u000f\n\u0012\n\u0003\n\u0012\n\u000b\n\u0012\n\u0003\n\u0012\n'\n\u0017\n\u0012\n\u000b\n\u0012\n'\n\u0017\n\u0012\n\u0003\n\u0012\n\u000b\n\n\u0003\n\n\u000b\n\u0001\n\u0013\n\n\u0003\n\u0001\n\u0003\n\u0013\n\"\n\u0003\n\u0001\n\u0003\n\u000f\n\u0013\n\u0016\n\u0002\n\u0013\n\n\u000b\n\u0001\n\u000b\n\u0002\n\f\n\u0001\n\u000b\n\u000f\n\u0013\n\u0012\n\u0003\n\n\u0003\n\u0003\n\nL\n\n\u0013\n\n\u0003\n\u0013\n\u0005\n\n\u0016\n\u0003\n\nL\n\n\u0004\n\u0003\n\n\u0017\n\u001a\n\u001a\n\u0012\n\u0003\n\u001a\n\u001a\n'\nN\n\u000b\n$\nB\nH\n\nN\n\u000b\n\n\u001a\n\n\u0002\n\u001b\n\u0012\n\u0003\n\u0002\n\u0002\n$\n\u0012\n\u000b\n\n\u000b\n\u001a\n\u0012\n\u000b\n\u001a\n\u0013\n\b\n\u0013\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\n!\n\u0003\n\u000f\n\f.\nhave that \u0019\n\u0003-4\nFor concreteness, we propose the following algorithm. Many variants of this algorithm also\nseem sensible.\n\nis reasonably large and the important term will be\n\nC E1F\n\n\u0017\u001d\u000f\n\n\u0005GC3\u000f\n\nto \u201cseed rule\u201d decision lists using domain-speci\ufb01c prior\n\n1. Initialize\n\n\u0012\t\u0003\nknowledge.\n\nand\n\n2. Until\n\n\u0012\u0018\u000b\nand\n\n\u0012\u0018\u000b\nboth rules, do the following.\n\n\u0012\t\u0003\n\n\u0017\u001d\u000f\ndenote\n\n(a) Let\n\n(b) IfD\n\nmost increases\n\nC\u0001\n(c) Otherwise extend\n\n\u0005GC\n\n\u0012\t\u0003\n\nif\n\n\u0017\u001d\u000f65\nfor some\n\n\u0005$C\n\nC E1FA\u0013\n\n\u0013A\f\n\n\u0012\t\u0003\n\n\u0005GC\n\nare both zero, or all features have been used in\n\n\u0017\u001d\u000f\n\n\u0017\u001d\u000f\n\n, then extend\n.\n\nand\n\notherwise.\n\nby the pair\n\nwhich\n\n\u0006\u0001\u0004\u0005\u0005\u0004\n\nby a single feature-value pair selected to minimize the\n.\n\n3. Prune the rules \u2014 iteratively remove the pair from the end of either\n\nthat\ngreedily minimizes the bound on total error until no such removal reduces the\nbound.\n\n\u0012\u0004\u0003\n\nor\n\n6 Future Directions\n\nWe have given some theoretical justi\ufb01cation for some aspects of co-training algorithms that\nhave been shown to work well in practice. The co-training assumption we have used in our\ntheorems are is at best only approximately true in practice. One direction for future research\nis to try to relax this assumption somehow. The co-training assumption states that\nand\n. This is equivalent to the statement that the mutual information\n\u0001\u0018\u000b\nis zero. We could relax this assumption by allowing some\nbetween\nsmall amount of mutual information between\nand giving bounds on error\nrates that involve this quantity of mutual information. Another direction for future work,\nof course, is the empirical evaluation of co-training and bootstrapping methods suggested\nby our theory.\n\nare independent given\ngiven\n\ngiven\n\nand\n\nand\n\n\u0001\u000f\u0003\n\n\u0001\f\u000b\n\nAcknowledgments\n\nThe authors wish to acknowledge Avrim Blum for useful discussions and give special\nthanks to Steve Abney for clarifying insights.\n\nLiterature cited\nBlum, A. & Mitchell, T. (1998) Combining labeled and unlabeled data with co-training. COLT.\nCollins, M. & Singer, Y. (1999) Unsupervised models for named entity classi\ufb01cation. EMNLP.\nCover, T. & Thomas, J. (1991) Elements of information theory. Wiley.\nDempster, A., Laird, N. & Rubin, D. (1977) Maximum-likelihood from incomplete data via the EM\n\nalgorithm. J. Royal Statist. Soc. Ser. B, 39:1-38.\n\nNigam, K. & Ghani, R. (2000) Analyzing the effectiveness and applicability of co-training. CIKM.\nYarowsky, D. (1995) Unsupervised word sense disambiguation rivaling supervised methods. ACL.\n\n\f\n\u0012\n'\n\u0013\n\b\n\u0013\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\n\u0018\n\u0019\n\f\n'\n\u0018\n\u0019\n\f\n'\nB\n\u0012\n\u0003\n\u0018\n\u0019\n\f\n\u0012\n\u0003\n'\n\u0018\n\u0019\n\f\n\u0012\n\u000b\n'\n\u0012\n\u000b\nI\n\f\n\u0005\n\u0012\n\u000b\n!\n\u0003\n\u000f\n\b\n7\nN\nB\n\nO\nI\nD\nI\n\f\n\u0012\n\u0003\n\u0005\n\u0012\n\u000b\n!\n\u0003\n\u000f\nB\n\b\n\u0005\n\u0012\n\u000b\n!\n\u0003\n\u000f\n\u0012\n\u000b\n\u0001\n\u0003\n\u0007\n\u0007\n\u0001\n\u0003\n\u0001\n\u000b\n\u0007\n\f", "award": [], "sourceid": 2040, "authors": [{"given_name": "Sanjoy", "family_name": "Dasgupta", "institution": null}, {"given_name": "Michael", "family_name": "Littman", "institution": null}, {"given_name": "David", "family_name": "McAllester", "institution": null}]}