{"title": "Learning from discriminative feature feedback", "book": "Advances in Neural Information Processing Systems", "page_first": 3955, "page_last": 3963, "abstract": "We consider the problem of learning a multi-class classifier from labels as well as simple explanations that we call \"discriminative features\". We show that such explanations can be provided whenever the target concept is a decision tree, or more generally belongs to a particular subclass of DNF formulas. We present an efficient online algorithm for learning from such feedback and we give tight bounds on the number of mistakes made during the learning process. These bounds depend only on the size of the target concept and not on the overall number of available features, which could be infinite. We also demonstrate the learning procedure experimentally.", "full_text": "Learning from discriminative feature feedback\n\nSanjoy Dasgupta, Akansha Dey, Nicholas Roberts\nDepartment of Computer Science and Engineering\n\nUniversity of California, San Diego\n\ndasgupta@eng.ucsd.edu,n3robert@ucsd.edu,a1dey@ucsd.edu\n\nSivan Sabato\n\nDepartment of Computer Science\nBen-Gurion University of the Negev\n\nsabatos@cs.bgu.ac.il\n\nAbstract\n\nWe consider the problem of learning a multi-class classi\ufb01er from labels as well\nas simple explanations that we call discriminative features. We show that such\nexplanations can be provided whenever the target concept is a decision tree, or\ncan be expressed as a particular type of multi-class DNF formula. We present an\nef\ufb01cient online algorithm for learning from such feedback and we give tight bounds\non the number of mistakes made during the learning process. These bounds depend\nonly on the representation size of the target concept and not on the overall number\nof available features, which could be in\ufb01nite. 
We also demonstrate the learning\nprocedure experimentally.\n\n1\n\nIntroduction\n\nCommunication between humans and machine learning systems has typically been restricted to labels\nalone. A human provides labels for a data set and in return gets a classi\ufb01er that predicts labels of\nnew instances. This is a lot more rigid than the way humans learn. The \ufb01eld of interactive learning\nexplores richer learning frameworks in which the machine engages with the human (or other sources\nof annotation) while learning is taking place, and the communication between the two is allowed to be\nmore varied. A key question in this enterprise is whether such interaction can overcome fundamental\nalgorithmic and statistical barriers to learning.\nThe one interactive framework that has perhaps been explored the most thoroughly is active learning\nof classi\ufb01ers. Here, the learning machine begins with just a pool of unlabeled data, and interacts with\na human by asking for labels of speci\ufb01c points. By adaptively focusing on maximally informative\npoints, it can sometimes dramatically reduce the amount of labeling effort. Indeed, over the past\ntwo decades, a number of active learning algorithms have been designed that provably require only\nlogarithmically as many labels as random querying, or otherwise reduce the labeling requirement\nconsiderably, in a variety of canonical settings, e.g. [1, 7].\nIn this paper, our interest is in feedback that goes beyond labels, to simple explanations of a particular\ntype. Imagine, for instance, that you have just \ufb01nished watching a movie in your living room, and\nyour electronic assistant\u2014Siri, Alexa, or one of their colleagues\u2014asks \u201cDid you like the movie?\u201d\nYou dutifully reply \u201cyes\u201d, in effect providing a labeled data point. But then you casually add \u201cI really\nlike John Hurt\u201d. This last piece of information is spontaneous and does not require extra effort. 
But it\ncan be far more useful, for the purposes of determining what movie to recommend next, than a mere\nthumbs-up or thumbs-down label. This kind of explanation could be called feature feedback, because\nit helps identify relevant features in a high-dimensional space of rationales for user preferences.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFeedback of this form has been used effectively in information retrieval [4, 13, 6] and in computer\nvision [10, 9, 8]. In general, however, these systems have been geared towards speci\ufb01c applications,\nand it is of interest to study such feedback rigorously, in a more abstract setting.\n\n1.1 Predictive versus discriminative feature feedback\n\nThere has been a signi\ufb01cant amount of work on what might be called predictive feature feedback.\nSuppose, for example, that in a document classi\ufb01cation task, a labeler assigns each document x to a\ncategory y (\u201csports\u201d, \u201cpolitics\u201d, and so on). While making this determination, the labeler might also\nbe able to highlight a few words that are highly indicative of the label (e.g. \u201cCongress\u201d, \u201c\ufb01libuster\u201d).\nThis kind of auxiliary feedback has been explored in a variety of empirical studies, for text and image,\nwith promising results [4, 13, 6, 14, 8]. More recently, the theoretical results of [12] have shown that\nsuch feedback can improve the rate of convergence of learning.\nIn this paper, we study an alternative setting that we call discriminative feature feedback. Consider\na computer vision system that is learning to recognize different animals. Whenever it makes a\nmistake\u2014classi\ufb01es a \u201czebra\u201d as a \u201chorse\u201d, say\u2014a human labeler (\u201cthe teacher\u201d) corrects it. While\ndoing this, the labeler can also, at little extra cost, highlight a part of the image (the stripes, for\ninstance) that distinguishes the two animals. 
Work on recognizing different species of birds, for\ninstance, has used this sort of feedback effectively [3].\nThis kind of discriminative feedback is quite different from the predictive feature feedback of earlier\nwork. In the document example, the feedback yields predictive features: the presence of the word\n\u201c\ufb01libuster\u201d is a moderately-strong predictor of the label \u201cpolitics\u201d. In contrast, discriminative features\nare not necessarily predictive for the entire class. In the animal example above, \u201cstripes\u201d are not\npredictive of the class of zebras, since many animals have stripes. But they do distinguish zebras from\nhorses. Thus discriminative feedback can be advantageous in a multi-class setting: the feature need\nonly differentiate between two classes, rather than separating one entire class from all the others. In\nour abstract model, we relax this even further by requiring only that a discriminative feature separate\none subcategory of a certain label from a subcategory of a different label.\n\n1.2 Contributions\n\nOur \ufb01rst contribution is to formalize one particular type of discriminative feature feedback. Human\nexplanations, even simple ones, are rife with ambiguity and thus it is important to design explanatory\nmechanisms that have precise semantics.\nNext, we present a simple and ef\ufb01cient learning algorithm that uses this feedback for multi-class\nclassi\ufb01cation. It operates in the online learning framework: at each point in time, a new data point\narrives, it is classi\ufb01ed by the learner\u2019s current model, and then feedback is received if this classi\ufb01cation\nis incorrect. We show that the algorithm provably converges to the correct concept, and we provide a\ntight mistake bound on the total number of errors it makes over the lifetime of the learning process.\nWhat concept classes are learned by our algorithm? 
We show that it can ef\ufb01ciently learn any multi-\nclass concept that can be expressed as a decision tree, with a mistake bound that is quadratic in the\nnumber of leaves of the tree. More generally, it works when the target concept is expressible using\na particular multi-class version of DNF (disjunctive normal form, OR-of-ANDs) formulas that we\ncall separable DNF. In the setting of binary classi\ufb01cation, disjunctive normal form concepts have\nproved to be computationally intractable to learn under standard supervised learning [11, 5], which is\na bit troubling since humans seem to use such concepts quite naturally. But the hardness results apply\nto situations where the only feedback on any example is its label. With richer feedback, learning\nbecomes easy and ef\ufb01cient.\nThe model learned by the algorithm is a logical combination of features obtained during feedback, a\nsort of decision list where each individual entry in the list is a conjunction. The mistake bound has\nno dependence on the overall number of available features, which could potentially be in\ufb01nite. As\na result this methodology can be used to build classi\ufb01ers based on vast pools of low-level named\nfeatures, which neural nets are increasingly able to provide.\nLastly, we demonstrate the learning procedure experimentally.\n\n2\n\n\f1.3 Other related work\n\nRelated to our work is the in\ufb01nite attribute model of Blum [2], which introduces an online learning\nframework in which the goal is to learn a standard classi\ufb01er, such as a linear separator, in situations\nwhere the total number of available features is in\ufb01nite. Earlier mistake bounds for online learning,\nsuch as those for the Perceptron and Winnow algorithms, had some dependence on the number of\nfeatures. It was shown in [2] that this dependence can be removed if, for any given example x, only\n\ufb01nitely many features are present. 
These can, for instance, be thought of as the features that are\nperceptually most salient. In our paper, we again consider an online learning scenario and give a\nmistake bound that has no dependence on the overall number of available features. In our case, this\narises naturally as a by-product of constructing classi\ufb01ers from feature feedback, which is a different\nmechanism technically but does ultimately connect back to perceptual salience.\nA different type of feedback was studied in [15], for the speci\ufb01c purpose of learning DNF formulas.\nThe learner is allowed to make queries in which it provides two data points from the same class and\nasks how many terms in the DNF formula they both satisfy.\n\n2 The feedback model\n\nWe introduce our formal model by means of an example. Let\u2019s say the goal is to learn a classi\ufb01er that\ntakes as an input example a description of an animal\u2014given by a set of features describing where\nit lives, what it eats, its appearance, and so on\u2014and then classi\ufb01es it as mammal, bird, reptile,\namphibian, or fish.\nThe learning takes place in rounds of interaction. On each round,\n\n\u2022 A new animal is presented, e.g.,\n\nseahorse\n\n\u2022 The learner classi\ufb01es this instance using its current model.\n\n\u2013 E.g., it (mistakenly) says:\n\n mammal\n\n\u2013 In addition, the learner provides an example of an animal it has seen previously, that it\nconsiders to be similar to the new animal, and that belongs to the predicted class. 
E.g.,\n\n> seahorse is similar to horse, which is a mammal.\n\n• The labeler responds if the prediction is incorrect.\n\n– The labeler provides the correct class.\n> Correct class: fish\n\n– In addition, the labeler provides a distinguishing feature between the instance being classified and the instance that the learner suggested as a similar example.\n> Distinguishing feature between seahorse and horse: lives-in-water\n\nHere the feature, lives-in-water, doesn’t distinguish all fish from all mammals—some mammals do live in water, for instance—but it does distinguish a group of fish that includes seahorses from a group of mammals that includes horses.\nWe now formalize the semantics of this kind of interaction.\n\n2.1 An abstract model of interaction\n\nLet c* be the target concept to be learned, where c* is a mapping from the input space X (e.g., animals) to a finite label space Y (e.g., {mammal, bird, reptile, amphibian, fish}). The learner has access to a set Φ of Boolean features on X, and expresses concepts in terms of these.
For instance, in the example above, one of the features φ ∈ Φ is lives-in-water.\nInformally, the concepts we can handle are those in which: the data has some unknown underlying clusters (e.g., “regular land mammals”, “egg-laying mammals”, “marine mammals”, “bony fish”, “cartilaginous fish”, etc.); each label corresponds to a union of some of these clusters; and any two clusters with different labels can be separated by a single feature.\n\n• A new instance xt arrives.\n• The learner supplies:\n\n– A prediction ŷt\n– A previously-seen instance x̂t with that label\n\nE.g.: “I think it is a mammal; it is similar to a horse”\n• If the prediction is correct: no feedback is obtained.\n• If the prediction is incorrect:\n\n– The teacher provides the correct label yt\n– The teacher also provides a feature φ that separates G(xt) from G(x̂t):\n\nφ(x) = true if x ∈ G(xt), and φ(x) = false if x ∈ G(x̂t).\n\nE.g.: “No, it is a fish: lives-in-water”\n\nFigure 1: Round t of interaction.\n\nFormally, we assume that X can be represented as the union of m sets, X = G1 ∪ G2 ∪ ··· ∪ Gm, possibly overlapping.
This representation (that is, the identity of the sets in the union) is unknown to the learner, and satisfies the following:\n\n• Each of the sets is pure in its label: for each i, there exists a label ℓ(Gi) ∈ Y such that c*(x) = ℓ(Gi) for all x ∈ Gi. (It follows that two sets Gi, Gj can intersect only if they have the same label.)\n• Any two sets Gi, Gj with ℓ(Gi) ≠ ℓ(Gj) have a discriminating feature: there is some φ ∈ Φ and b ∈ {0, 1} such that φ(x) = b if x ∈ Gi, and φ(x) = ¬b if x ∈ Gj. For instance, if Gi is the cluster of land mammals and Gj is the cluster of cartilaginous fish, then one possible separating feature is lives-in-water.\n\nAs we discuss below, a representation that satisfies the assumptions above naturally exists if the set of Boolean features is sufficiently rich. We place no restrictions on the number of features, which can even be infinite. Moreover, since the algorithm and the mistake bound have no dependence on the number of features in Φ, the requirement that a single feature be used to separate sets is not restrictive: one can always include in Φ “single features” that are combinations of other basic features.\n\n2.2 The learning protocol\n\nFigure 1 shows how the t-th round of interaction proceeds. In the protocol, G(x) is the set (one of G1, . . . , Gm) containing x. If there are multiple such sets, it is some particular choice. Thus G : X → {G1, . . . , Gm} and x ∈ G(x).\nTo reiterate, the labeler does not need to provide features that separate entire classes from one another. Rather, these features just need to separate a subgroup of one class, containing xt, from a subgroup of another class, containing x̂t. These subgroups might just be the singletons xt and x̂t, in which case the labeler need only distinguish these two specific instances.
But it is reasonable to expect that a labeler will attempt to find features that are fairly general, in which case these subgroups will be somewhat larger. The clusters in the feedback formalism reflect the level of categorization at which the labeler is operating. They are allowed to be of arbitrary size; however, the complexity of the algorithm we later present will depend upon the total number of clusters. If there are m clusters, the total number of mistakes made by the algorithm will be bounded by m².\n\n2.3 When do the assumptions hold?\n\nOur assumptions posit that there is some representation of the domain as purely-labeled sets that can be separated by single features, and that this representation is the one used implicitly by the teacher when providing feedback to the learner. An important question is when these assumptions hold. Moreover, since the number of sets in the representation determines the mistake bound of our learner, we would like to identify when this number is small.\nFirst, note that whenever the set of features is sufficiently rich that any two instances x, x′ differ on at least one feature, we can always have a representation based on singleton sets which satisfies the assumptions. In this sense, the model is non-parametric, and allows any target concept. However, the complexity of learning depends upon the number of sets, which we are denoting by m, and would be trivial if m = |X|. Therefore, we would like to identify concepts which admit m ≪ |X|.\nOne case in which this holds is when the target concept c* can be expressed as a decision tree with m leaves, using features in Φ. We can define the set of examples x ∈ X that reach the j-th leaf to be a cluster Gj, so that there are m clusters.
This way, any two clusters Gi and Gj can be separated by a single feature: the feature at the internal node that is the lowest common ancestor of leaves i and j.\nMore generally, discriminative feedback with m clusters is possible if and only if the target concept is expressible as a particular kind of multi-class disjunctive normal form formula over the features in Φ, that we call separable-DNF, and that has at most m terms.\n\nDefinition 1 (Separable-DNF concept) A separable-DNF concept over features Φ is a concept c* : X → Y such that each individual class y ∈ Y is characterized by a DNF formula Fy, where x satisfies Fy if and only if c*(x) = y, and:\n\n• The literals in each Fy are individual features from Φ or negations of such features.\n• For any y ≠ y′, denote Fy = Fy,1 ∨ Fy,2 ∨ ··· and Fy′ = Fy′,1 ∨ Fy′,2 ∨ ···, where the Fy,i and Fy′,i are conjunctions. Then for any Fy,i and Fy′,j, there is some feature φ ∈ Φ that is in both conjunctions, but with opposite polarity.\n\nWe say that the separable-DNF concept is of size m if the total number of conjunctions Fy,i is m.\n\nLemma 1 Target concept c* can be represented using G1, . . . , Gm which satisfy the assumptions in Section 2.1 if and only if c* is a separable-DNF concept of size m over features Φ.\n\nThe proof is deferred to Appendix A.\n\n2.4 DNF formulas for binary classification\n\nThere is a large body of work on learning disjunctive normal form formulas for binary classification. These concepts don’t exactly fall into the framework above because they are asymmetric: there is a DNF formula for positive instances, and everything else is a negative instance. In Appendix B, we provide a simple variant of our learning algorithm specifically for this case.
For a target DNF formula with m terms, each containing k literals, it makes at most m(k + 1) mistakes.\n\n3 A mistake-bounded learning algorithm\n\nWe now show that under the setting of discriminative feature feedback, there is an efficient learning algorithm that reaches the target concept c* after making at most m² mistakes.\n\n3.1 The algorithm\n\nThe algorithm is shown in Figure 2. It maintains:\n\n• a list L of some of the instances seen so far;\n• for each item x in this list, its label as well as a conjunction C[x] that holds true for all of the cluster G(x); and\n• a default instance and label (xo, yo) to apply to examples that violate all conjunctions of L.\n\nInitialization:\n\n• Get the label yo of the first example xo (these serve as a default prediction)\n• Initialize L to an empty list\n\nAt time t, given a new point xt:\n\n• If there exists x̂ ∈ L such that xt satisfies C[x̂]:\n– Predict label[x̂] and provide example x̂\n– If incorrect:\n* Get correct label yt and feature φ\n* C[x̂] := C[x̂] ∧ ¬φ\n• Else:\n– Predict default label yo and provide example xo\n– If incorrect:\n* Get correct label yt and feature φ\n* Add xt to L\n* Set label[xt] := yt and C[xt] := φ\n\nFigure 2: An algorithm that learns from discriminative feature feedback.\n\nThe conjunctions C[x] are built entirely out of features obtained from the teacher. We denote by CL = {C[x] : x ∈ L} the set of conjunctions for the examples in L.\n\n3.2 Mistake bounds\n\nTheorem 2 Suppose that the algorithm of Figure 2 is used to learn a target concept c*, and the teacher provides discriminative feature feedback corresponding to a representation G1, . . . , Gm which satisfies the assumptions in Section 2.1. Then the total number of mistakes made over the lifetime of the algorithm is at most m².\n\nWe prove this in several steps.
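Before the analysis, the update rule of Figure 2 can be made concrete with a minimal Python sketch. The encoding of instances as dictionaries of named Boolean features, the `Learner` interface, and all identifiers are illustrative assumptions, not the paper's notation; the teacher is assumed to hand back a feature that is true on the new instance's cluster and false on the suggested instance's cluster, as in Figure 1:

```python
# Minimal sketch of the learning algorithm of Figure 2 (illustrative encoding:
# an instance is a dict mapping feature names to booleans; a conjunction is a
# list of (feature, required_value) literals).

def satisfies(x, conjunction):
    """True iff instance x satisfies every literal of the conjunction."""
    return all(x[f] == v for f, v in conjunction)

class Learner:
    def __init__(self, x0, y0):
        # Default instance/label, used when no conjunction in L matches.
        self.x0, self.y0 = x0, y0
        self.L = []  # entries are [instance, label, conjunction]

    def predict(self, x):
        """Return (predicted label, index of matched entry, or None for the default)."""
        for i, (xhat, label, conj) in enumerate(self.L):
            if satisfies(x, conj):
                return label, i
        return self.y0, None

    def feedback(self, x, matched, y, feature):
        """The prediction on x was wrong: the teacher supplies the true label y and
        a feature that is true on x's cluster, false on the suggested instance's cluster."""
        if matched is not None:
            # Restrict the matched conjunction: C[x_hat] := C[x_hat] AND (not feature).
            self.L[matched][2].append((feature, False))
        else:
            # New entry with label y and the single-literal conjunction "feature".
            self.L.append([x, y, [(feature, True)]])
```

Each mistake either appends one literal to an existing conjunction or adds a new single-literal entry, which is exactly the accounting used in Lemma 4 below.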
First, we note two key invariants of the algorithm:\n\n(I-1) Any item x ∈ L has been seen in a previous round.\n(I-2) For any x ∈ L, label[x] is correct and every point in G(x) satisfies C[x].\n\nThe first invariant is trivial. The second invariant holds since the label is provided by the teacher, and the conjunction C[x] is a conjunction of literals taken from the teacher. These literals are all satisfied by G(x), as defined in the learning protocol in Figure 1.\nNext, we show that L contains at most one representative per group Gi.\n\nLemma 3 For any distinct x, x′ ∈ L, we have G(x) ≠ G(x′).\n\nPROOF: The only time a new x is added to L is in situations when it doesn’t satisfy any of the conjunctions in L, and is therefore not in any of the corresponding G(·), as per invariant (I-2). □\n\nThe above observations allow deriving a connection between the size of the conjunctions in CL and the number of mistakes the algorithm makes.\n\nLemma 4 Let B be an upper bound on the total number of literals in any conjunction in the list. Then the number of mistakes made by the algorithm is at most mB.\n\nPROOF: On each mistake of the algorithm, either: (i) an existing x ∈ L has its conjunction restricted by an additional literal, or (ii) a new item x is added to L with a conjunction of size 1. Thus, the total number of literals in conjunctions in L is equal to the number of mistakes made by the algorithm. By assumption, each conjunction in L has at most B literals, and by Lemma 3 there are at most m such conjunctions. □\n\nWe can now prove the mistake bound stated above.\n\nProof of Theorem 2: We show that each conjunction in L has at most m literals as follows. First, any conjunction C[x̂] always starts with a single literal. Subsequently, the conjunction is extended only when some instance xt appears that satisfies the conjunction and yet has G(xt) ≠ G(x̂).
In this case, one literal is added to C[x̂] and thereafter no instance in G(xt) satisfies the conjunction. Since there are m different sets Gi, it follows that there can be at most m − 1 rounds in which C[x̂] is extended. Thus C[x̂] has at most m literals. The mistake bound now follows from Lemma 4. □\n\nFurther, we show that this mistake bound is nearly tight, as stated in the following theorem and proved in Appendix C.\n\nTheorem 5 The worst-case mistake bound of the algorithm in Figure 2, assuming discriminative feature feedback with m clusters, is at least (m − 1)(m − 2)/2.\n\nIt is also possible to derive a mistake bound which depends on the total number of features that the teacher uses during the interaction. This bound is useful if the teacher uses a small number of different features to discriminate between clusters. This can happen if the teacher attempts to reuse features, or if the target conjunction is sparse.\n\nTheorem 6 Under the same assumptions as in Theorem 2, if the teacher uses at most k features during the running of the algorithm, then the total number of mistakes made over the lifetime of the algorithm is at most km.\n\nPROOF: If the teacher uses at most k different features during the run of the algorithm, then each conjunction in L uses at most k literals: due to invariant (I-2), each conjunction is satisfied by at least one instance, and thus it cannot include both a feature and its negation. The mistake bound follows from Lemma 4. □\n\n4 An illustrative experiment\n\nThe ZOO data set from the UCI ML repository contains information on 101 animals: for each, one of seven labels (mammal, bird, reptile, fish, amphibian, insect, other) as well as 21 Boolean features. The goal is to learn a classifier that predicts the label from the 21 features.\nThe learning algorithm of Figure 2 can potentially return different classifiers depending on the particular ordering of the data points.
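As a concrete illustration of how the learned model is used, the final decision list reported in Figure 3 can be applied with a few lines of Python. The encoding of animals as dictionaries of named Boolean features is an illustrative assumption; the rules themselves transcribe Figure 3, with the first matching conjunction firing and the default label applying otherwise:

```python
# Applying the final decision list of Figure 3 (illustrative encoding:
# an animal is a dict of named Boolean features; rules are checked in order,
# the first matching conjunction fires, otherwise the default label applies).

RULES = [
    ([('backbone', False), ('six-legs', False)], 'other'),        # worm
    ([('milk', True)], 'mammal'),                                 # girl
    ([('zero-legs', True), ('fins', True)], 'fish'),              # herring
    ([('two-legs', True)], 'bird'),                               # hawk
    ([('aquatic', False), ('six-legs', False)], 'reptile'),       # tortoise
    ([('backbone', False), ('airborne', False),
      ('breathes', False)], 'other'),                             # lobster
    ([('aquatic', False)], 'insect'),                             # housefly
    ([('tail', True), ('zero-legs', True)], 'reptile'),           # sea snake
]
DEFAULT = 'amphibian'  # frog

def classify(animal):
    """Return the label of the first rule whose conjunction the animal satisfies."""
    for conjunction, label in RULES:
        if all(animal.get(f, False) == v for f, v in conjunction):
            return label
    return DEFAULT
```

For example, a legless, finned animal with a backbone falls through to the third rule and is classified as a fish, while a fully aquatic, four-legged animal matches no rule and receives the default label amphibian.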
Figure 3 shows the result of one run. The starting example, and thus the default prediction, is frog (class amphibian), and a total of 14 mistakes are made before convergence. On each mistake, a separating feature is chosen at random from those that distinguish the two instances (xt and x̂ in the notation of Figure 2). A human teacher would perhaps make more judicious feature choices.\nThe first table in the figure shows the prediction and similar example provided by the learner (columns 2 and 3) in each mistake round, as well as the correct label and discriminative feature provided by the labeler (columns 4 and 5). The final decision list L is shown below it. Recall that an instance is classified by first going through all the conjunctions in L and then falling through to the default prediction if there is no match.\nNotice that the list L contains exactly 14 literals in all, one from each mistake. Also, we know from Lemma 3 that each of the “underlying groups” contributes at most one rule in L. Therefore, the method of choosing features implicitly divides the class reptile into at least two groups, which appear to correspond to land reptiles and aquatic reptiles, and divides the somewhat nebulous class other into two groups, exemplified by worm and lobster.\n\n5 Conclusions and further directions\n\nThere is potential for enhancing the scope, robustness, and ease-of-use of learning systems by having them learn from simple explanations, and in turn explain their predictions.
A crucial part of this enterprise is to identify and formalize simple explanatory mechanisms, and to study how they can be handled algorithmically. This paper introduces a novel type of explanation that is fairly intuitive, and shows that it lends itself to simple learning algorithms with rigorous performance guarantees. It is a first step, and a wide range of open problems remain.\n\nInstance | Prediction | Similar instance | True label | Discriminating feature\nfrog | — | — | amphibian | (default prediction)\nworm | amphibian | frog | other | not(backbone)\ngirl | amphibian | frog | mammal | milk\nherring | amphibian | frog | fish | zero-legs\nseasnake | fish | herring | reptile | not(fins)\nhawk | amphibian | frog | bird | two-legs\ntortoise | amphibian | frog | reptile | not(aquatic)\ntermite | other | worm | insect | six-legs\nlobster | amphibian | frog | other | not(backbone)\nladybird | reptile | tortoise | insect | six-legs\nhoneybee | other | lobster | insect | airborne\nhousefly | amphibian | frog | insect | not(aquatic)\nflea | other | lobster | insect | breathes\nseasnake | amphibian | frog | reptile | tail\nnewt | reptile | seasnake | amphibian | not(zero-legs)\n\nFinal decision list (L):\n\n• not(backbone) AND not(six-legs) ⇒ other (worm)\n• milk ⇒ mammal (girl)\n• zero-legs AND fins ⇒ fish (herring)\n• two-legs ⇒ bird (hawk)\n• not(aquatic) AND not(six-legs) ⇒ reptile (tortoise)\n• not(backbone) AND not(airborne) AND not(breathes) ⇒ other (lobster)\n• not(aquatic) ⇒ insect (housefly)\n• tail AND zero-legs ⇒ reptile (sea snake)\n• ELSE: amphibian (frog)\n\nFigure 3: The 14 mistakes made on the ZOO data and the final classifier.\n\nFor this particular feedback scheme, what happens if the labeler is noisy or careless?
We know from\nSection 2 that the scheme can in principle handle any concept, and thus noisy feedback is unlikely to\nderail convergence: if necessary, for instance, a noisily-labeled point can be in a cluster of its own. A\nmore subtle issue is how much the mistake bound can blow up as a result of noise.\nThe decision lists produced by the algorithm of Figure 2 are accompanied by reassuring guarantees.\nHowever, is it possible to retain those guarantees while producing a model that is more concise? This\nopen problem is particularly important in cases where the lists are intended to be interpretable.\n\nAcknowledgements\n\nThis research was supported by National Science Foundation grant CCF-1813160, and by a United-\nStates-Israel Binational Science Foundation (BSF) grant no. 2017641. Part of the work was done\nwhile SD and SS were at the \u201cFoundations of Machine Learning\u201d program at the Simons Institute for\nthe Theory of Computing, Berkeley.\n\n8\n\n\fReferences\n[1] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the\n\n23rd International Conference on Machine Learning, 2006.\n\n[2] A. Blum. Learning boolean functions in an in\ufb01nite attribute space. Machine Learning, 9(4):373\u2013\n\n386, 1992.\n\n[3] S. Branson, C. Wah, B. Babenko, F. Schroff, P. Welinder, P. Perona, and S. Belongie. Visual\n\nrecognition with humans in the loop. In European Conference on Computer Vision, 2010.\n\n[4] W.B. Croft and R. Das. Experiments with query acquisition and use in document retrieval\nsystems. In Proceedings of the 13th International Conference on Research and Development in\nInformation Retrieval, pages 349\u2013368, 1990.\n\n[5] A. Daniely and S. Shalev-Shwartz. Complexity theoretic limitations on learning dnfs. In\n\nConference on Learning Theory, pages 815\u2013830, 2016.\n\n[6] G. Druck, G. Mann, and A. McCallum. Learning from labeled features using generalized\nexpectation criteria. 
In Proceedings of ACM Special Interest Group on Information Retrieval,\n2008.\n\n[7] S. Hanneke. Rates of convergence in active learning. Annals of Statistics, 39(1):333\u2013361, 2011.\n[8] O. Mac Aodha, S. Su, Y. Chen, P. Perona, and Y. Yue. Teaching categories to human learners\nwith visual explanations. In IEEE Conference on Computer Vision and Pattern Recognition,\n2018.\n\n[9] D. Parikh and K. Grauman. Relative attributes. In Proceedings of the International Conference\n\non Computer Vision, 2011.\n\n[10] P. Perona. Visions of a Visipedia. Proceedings of the IEEE, 98(8):1526\u20131534, 2010.\n[11] L. Pitt and L.G. Valiant. Computational limitations on learning from examples. Journal of the\n\nACM, 35(4):965\u2013984, 1988.\n\n[12] S. Poulis and S. Dasgupta. Learning with feature feedback. In Twentieth International Confer-\n\nence on Arti\ufb01cial Intelligence and Statistics, 2017.\n\n[13] H. Raghavan, O. Madani, and R. Jones. Interactive feature selection. In Proceedings of the 19th\n\nInternational Joint Conference on Arti\ufb01cial Intelligence, pages 841\u2013846, 2005.\n\n[14] B. Settles. Closing the loop: fast, interactive semi-supervised annotation with queries on features\n\nand instances. In Empirical Methods in Natural Language Processing, 2011.\n\n[15] L. Yang, A. Blum, and J. Carbonell. Learnability of DNF with representation-speci\ufb01c queries.\n\nIn Innovations in Theoretical Computer Science, 2013.\n\n9\n\n\f", "award": [], "sourceid": 1949, "authors": [{"given_name": "Sanjoy", "family_name": "Dasgupta", "institution": "UC San Diego"}, {"given_name": "Akansha", "family_name": "Dey", "institution": "UCSD"}, {"given_name": "Nicholas", "family_name": "Roberts", "institution": "UC San Diego"}, {"given_name": "Sivan", "family_name": "Sabato", "institution": "Ben-Gurion University of the Negev"}]}