{"title": "A General Framework for Robust Interactive Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 7082, "page_last": 7091, "abstract": "We propose a general framework for interactively learning models, such as (binary or non-binary) classifiers, orderings/rankings of items, or clusterings of data points. Our framework is based on a generalization of Angluin's equivalence query model and Littlestone's online learning model: in each iteration, the algorithm proposes a model, and the user either accepts it or reveals a specific mistake in the proposal. The feedback is correct only with probability p > 1/2 (and adversarially incorrect with probability 1 - p), i.e., the algorithm must be able to learn in the presence of arbitrary noise. The algorithm's goal is to learn the ground truth model using few iterations. Our general framework is based on a graph representation of the models and user feedback. To be able to learn efficiently, it is sufficient that there be a graph G whose nodes are the models, and (weighted) edges capture the user feedback, with the property that if s, s* are the proposed and target models, respectively, then any (correct) user feedback s' must lie on a shortest s-s* path in G. Under this one assumption, there is a natural algorithm, reminiscent of the Multiplicative Weights Update algorithm, which will efficiently learn s* even in the presence of noise in the user's feedback. 
From this general result, we rederive with barely any extra effort classic results on learning of classifiers and a recent result on interactive clustering; in addition, we easily obtain new interactive learning algorithms for ordering/ranking.", "full_text": "A General Framework for Robust Interactive Learning*

Ehsan Emamjomeh-Zadeh†    David Kempe‡

Abstract

We propose a general framework for interactively learning models, such as (binary or non-binary) classifiers, orderings/rankings of items, or clusterings of data points. Our framework is based on a generalization of Angluin's equivalence query model and Littlestone's online learning model: in each iteration, the algorithm proposes a model, and the user either accepts it or reveals a specific mistake in the proposal. The feedback is correct only with probability p > 1/2 (and adversarially incorrect with probability 1 − p), i.e., the algorithm must be able to learn in the presence of arbitrary noise. The algorithm's goal is to learn the ground truth model using few iterations.
Our general framework is based on a graph representation of the models and user feedback. To be able to learn efficiently, it is sufficient that there be a graph G whose nodes are the models, and (weighted) edges capture the user feedback, with the property that if s, s* are the proposed and target models, respectively, then any (correct) user feedback s′ must lie on a shortest s-s* path in G. 
Under this one assumption, there is a natural algorithm, reminiscent of the Multiplicative Weights Update algorithm, which will efficiently learn s* even in the presence of noise in the user's feedback.
From this general result, we rederive with barely any extra effort classic results on learning of classifiers and a recent result on interactive clustering; in addition, we easily obtain new interactive learning algorithms for ordering/ranking.

1 Introduction

With the pervasive reliance on machine learning systems across myriad application domains in the real world, these systems frequently need to be deployed before they are fully trained. This is particularly true when the systems are supposed to learn a specific user's (or a small group of users') personal and idiosyncratic preferences. As a result, we are seeing an increased practical interest in online and interactive learning across a variety of domains.
A second feature of the deployment of such systems “in the wild” is that the feedback the system receives is likely to be noisy. Not only may individual users give incorrect feedback, but even if they do not, the preferences (and hence the feedback) across different users may vary. Thus, interactive learning algorithms deployed in real-world systems must be resilient to noisy feedback.
Since the seminal work of Angluin [2] and Littlestone [14], the paradigmatic application of (noisy) interactive learning has been online learning of a binary classifier when the algorithm is provided with feedback on samples it had previously classified incorrectly. However, beyond (binary or other) classifiers, there are many other models that must frequently be learned in an interactive manner. Two particularly relevant examples are the following: (1) Learning an ordering/ranking of items is a key part of personalized Web search or other information-retrieval systems (e.g., [12, 18]). The user is typically presented with an ordering of items, and from her clicks or lack thereof, an algorithm can infer items that are in the wrong order. (2) Interactively learning a clustering [6, 5, 4] is important in many application domains, such as interactively identifying communities in social networks or partitioning an image into distinct objects. The user will be shown a candidate clustering, and can express that two clusters should be merged, or that a cluster should be split into two.
In all three examples (classification, ranking, and clustering), the interactive algorithm proposes a model^4 (a classifier, ranking, or clustering) as a solution. The user then provides, explicitly or implicitly, feedback on whether the model is correct or needs to be fixed/improved. This feedback may be incorrect with some probability. Based on the feedback, the algorithm proposes a new and possibly very different model, and the process repeats. This type of interaction is the natural generalization of Angluin's equivalence query model [2, 3]. It is worth noting that, in contrast to active learning, in interactive learning (which is the focus of this work) the algorithm cannot “ask” direct questions; it can only propose a model and receive feedback in return.

* A full version is available on the arXiv at https://arxiv.org/abs/1710.05422. The present version omits all proofs and several other details and discussions.
† Department of Computer Science, University of Southern California, emamjome@usc.edu
‡ Department of Computer Science, University of Southern California, dkempe@usc.edu

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
The algorithm should minimize the number of user interactions, i.e., the number of times that the user needs to propose fixes. A secondary goal is to make the algorithm's internal computations efficient as well.
The main contribution of this article is a general framework for efficient interactive learning of models (even with noisy feedback), presented in detail in Section 2. We consider the set of all N models as nodes of a positively weighted undirected or directed graph G. The one key property that G must satisfy is the following: (*) If s is a proposed model, and the user (correctly) suggests changing it to s′, then the graph must contain the edge (s, s′); furthermore, (s, s′) must lie on a shortest path from s to the target model s* (which is unknown to the algorithm).
We show that this single property is enough to learn the target model s* using at most log N queries^5 to the user, in the absence of noise. When the feedback is correct with probability p > 1/2, the required number of queries gracefully deteriorates to O(log N); the constant depends on p. We emphasize that the assumption (*) is not an assumption about the user. We do not assume that the user somehow “knows” the graph G and computes shortest paths in order to find a response. Rather, (*) states that G was correctly chosen to model the underlying domain, so that correct answers by the user must in fact have the property (*). To illustrate the generality of our framework, we apply it to ordering, clustering, and classification:

1. For ordering/ranking, each permutation is a node in G; one permutation is the unknown target. If the user can point out only adjacent elements that are out of order, then G is an adjacent transposition “BUBBLE SORT” graph, which naturally has the property (*). 
If the user can pick any element and suggest that it should precede an entire block of elements it currently follows, then we can instead use an “INSERTION SORT” graph; interestingly, to ensure the property (*), this graph must be weighted. On the other hand, as we show in Section 3, if the user can propose two arbitrary elements that should be swapped, there is no graph G with the property (*).
Our framework directly leads to an interactive algorithm that will learn the correct ordering of n items in O(log(n!)) = O(n log n) queries; we show that this bound is optimal under the equivalence query model.

2. For learning a clustering of n items, the user can either propose merging two clusters, or splitting one cluster. In the interactive clustering model of [6, 5, 4], the user can specify that a particular cluster C should be split, but does not give a specific split. We show in Section 4 that there is a weighted directed graph with the property (*); then, if each cluster is from a “small” concept class of size at most M (such as one having low VC dimension), there is an algorithm finding the true clustering in O(k log M) queries, where k is the number of clusters (known ahead of time).

3. For binary classification, G is simply an n-dimensional hypercube (where n is the number of sample points that are to be classified). As shown in Section 5, one immediately recovers a close variant of standard online learning algorithms within this framework. An extension to classification with more than two classes is very straightforward.

^4 We avoid the use of the term “concept,” as it typically refers to a binary function, and is thus associated specifically with a classifier.
^5 Unless specified otherwise, all logarithms are base 2.

Due to space limits, all proofs and several other details and discussions are omitted. 
A full version is available on the arXiv at https://arxiv.org/abs/1710.05422.

2 Learning Framework

We define a framework for query-efficient interactive learning of different types of models. Some prototypical examples of models to be learned are rankings/orderings of items, (unlabeled) clusterings of graphs or data points, and (binary or non-binary) classifiers. We denote the set of all candidate models (permutations, partitions, or functions from the hypercube to {0, 1}) by Σ, and individual models^6 by s, s′, s*, etc. We write N = |Σ| for the number of candidate models.
We study interactive learning of such models in a natural generalization of the equivalence query model of Angluin [2, 3]. This model is equivalent to the more widely known online learning model of Littlestone [14], but more naturally fits the description of user interactions we follow here. It has also served as the foundation for the interactive clustering model of Balcan and Blum [6] and Awasthi et al. [5, 4].
In the interactive learning framework, there is an unknown ground truth model s* to be learned. In each round, the learning algorithm proposes a model s to the user. In response, with probability p > 1/2, the user provides correct feedback. In the remaining case (i.e., with probability 1 − p), the feedback is arbitrary; in particular, it could be arbitrarily and deliberately misleading.
Correct feedback is of the following form: if s = s*, then the algorithm is told this fact in the form of a user response of s. Otherwise, the user reveals a model s′ ≠ s that is “more similar” to s* than s was. The exact nature of “more similar,” as well as the possibly restricted set of suggestions s′ that the user can propose, depend on the application domain. 
Indeed, the strength of our proposed framework is that it provides strong query complexity guarantees under minimal assumptions about the nature of the feedback; to employ the framework, one merely has to verify that the following assumption holds.

Definition 2.1 (Graph Model for Feedback) Define a weighted graph G (directed or undirected) that contains one node for each model s ∈ Σ, and an edge (s, s′) with arbitrary positive edge length ω_(s,s′) > 0 if the user is allowed to propose s′ in response to s. (Choosing the lengths of the edges is an important part of using the framework.) G may contain additional edges not corresponding to any user feedback. The key property that G must satisfy is the following: (*) If the algorithm proposes s and the ground truth is s* ≠ s, then every correct user feedback s′ lies on a shortest path from s to s* in G with respect to the lengths ω_e. If there are multiple candidate nodes s′, then there is no guarantee as to which one the algorithm will be given by the user.

2.1 Algorithm and Guarantees

Our algorithms are direct reformulations and slight generalizations of algorithms recently proposed by Emamjomeh-Zadeh et al. [10], which are themselves a significant generalization of the natural “Halving Algorithm” for learning a classifier (e.g., [14]). They studied the search problem as an abstract problem they termed “Binary Search in Graphs,” without discussing any applications. Our main contribution here is the application of the abstract search problem to a large variety of interactive learning problems, and a framework that makes such applications easy. We begin with the simplest case p = 1, i.e., when the algorithm only receives correct feedback.
Algorithm 1 gives essentially best-possible general guarantees [10]. 
To state the algorithm and its guarantees, we need the notion of an approximate median node of the graph G. First, we denote by

N(s, s′) := {s} if s′ = s, and N(s, s′) := {ŝ | s′ lies on a shortest path from s to ŝ} if s′ ≠ s,

the set of all models ŝ that are consistent with a user feedback of s′ to a model s. In anticipation of the noisy case, we allow models to be weighted^7, and denote the node weights or likelihoods by µ(s) ≥ 0. If feedback is not noisy (i.e., p = 1), all the non-zero node weights are equal. For every subset of models S, we write µ(S) := ∑_{s∈S} µ(s) for the total node weight of the models in S. Now, for every model s, define

Φ_µ(s) := (1/µ(Σ)) · max_{s′≠s, (s,s′)∈G} µ(N(s, s′))

to be the largest fraction (with respect to node weights) of models that could still be consistent with a worst-case response s′ to a proposed model of s. For every subset of models S, we denote by µ_S the likelihood function that assigns weight 1 to every node s ∈ S and 0 elsewhere. For simplicity of notation, we use Φ_S(s) when the node weights are µ_S.

^6 When considering specific applications, we will switch to notation more in line with that used for the specific application.
^7 Edge lengths are part of the definition of the graph, but node weights will be assigned by our algorithm; they basically correspond to likelihoods.

The simple key insight of [10] can be summarized and reformulated as the following proposition:

Proposition 2.1 ([10], Proofs of Theorems 3 and 14) Let G be a (weighted) directed graph in which each edge e with length ω_e is part of a cycle of total edge length at most c · ω_e. 
Then, for every node weight function µ, there exists a model s such that Φ_µ(s) ≤ (c−1)/c. When G is undirected (and hence c = 2), for every node weight function µ, there exists an s such that Φ_µ(s) ≤ 1/2.

In Algorithm 1, we always have uniform node weight for all the models which are consistent with all the feedback received so far, and node weight 0 for models that are inconsistent with at least one response. Prior knowledge about candidates for s* can be incorporated by providing the algorithm with an input S_init ∋ s* to focus its search on; in the absence of prior knowledge, the algorithm can be given S_init = Σ.

Algorithm 1 LEARNING A MODEL WITHOUT FEEDBACK ERRORS (S_init)
1: S ← S_init.
2: while |S| > 1 do
3:   Let s be a model with a “small” value of Φ_S(s).
4:   Let s′ be the user's feedback model.
5:   Set S ← S ∩ N(s, s′).
6: return the only remaining model in S.

Line 3 is underspecified as “small.” Typically, an algorithm would choose the s with smallest Φ_S(s). But computational efficiency constraints or other restrictions (see Sections 2.2 and 5) may preclude this choice and force the algorithm to choose a suboptimal s. The guarantee of Algorithm 1 is summarized by the following Theorem 2.2, a straightforward generalization of Theorems 3 and 14 from [10].

Theorem 2.2 Let N0 = |S_init| be the number of initial candidate models. If each model s chosen in Line 3 of Algorithm 1 has Φ_S(s) ≤ β, then Algorithm 1 finds s* using at most log_{1/β} N0 queries.

Corollary 2.3 When G is undirected and the optimal s is used in each iteration, β = 1/2 and Algorithm 1 finds s* using at most log_2 N0 queries.

In the presence of noise, the algorithm is more complicated. The algorithm and its analysis are given in the full version. 
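Before turning to noise, the mechanics of the noise-free Algorithm 1 can be made concrete. The following is a minimal sketch in Python, assuming an unweighted, undirected model graph given as an adjacency dictionary; the path-graph demo and the helper names (`bfs_dist`, `consistent_set`, `learn`) are our own illustration, not part of the paper. On a path with 8 models, the query rule recovers classical binary search.

```python
from collections import deque

def bfs_dist(adj, src):
    # Unweighted shortest-path distances from src to every node.
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def consistent_set(dist, s, s_prime):
    # N(s, s'): all models t such that s' lies on a shortest s-t path.
    if s_prime == s:
        return {s}
    return {t for t in dist[s] if dist[s][t] == 1 + dist[s_prime][t]}

def learn(adj, answer):
    # Algorithm 1: query a model with small Phi_S, shrink S, repeat.
    dist = {u: bfs_dist(adj, u) for u in adj}
    S = set(adj)
    queries = 0
    while True:
        # Phi_S(s), up to the factor 1/|S|: size of the largest subset of S
        # still consistent with a worst-case reply to s.
        def phi(s):
            return max(len(S & consistent_set(dist, s, sp)) for sp in adj[s])
        s = min(sorted(adj), key=phi)
        queries += 1
        s_prime = answer(s)           # user feedback (noise-free here)
        if s_prime == s:              # user confirms: s is the target
            return s, queries
        S &= consistent_set(dist, s, s_prime)

# Demo: 8 models on a path; correct feedback points one step toward the target.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
target = 5
answer = lambda s: s if s == target else (s + 1 if target > s else s - 1)
found, queries = learn(adj, answer)
assert found == target and queries <= 3   # at most log2(8) = 3 queries
```

Note that, as in the paper, the queried median s need not itself belong to the surviving set S.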
The performance of the robust algorithm is summarized in Theorem 2.4.

Theorem 2.4 Let β ∈ [1/2, 1), define τ = βp + (1 − β)(1 − p), and let N0 = |S_init|. Assume that log(1/τ) > H(p), where H(p) = −p log p − (1 − p) log(1 − p) denotes the entropy. (When β = 1/2, this holds for every p > 1/2.)
If in each iteration, the algorithm can find a model s with Φ_µ(s) ≤ β, then with probability at least 1 − δ, the robust algorithm finds s* using at most (1−δ)/(log(1/τ) − H(p)) · log N0 + o(log N0) + O(log²(1/δ)) queries in expectation.

Corollary 2.5 When the graph G is undirected and the optimal s is used in each iteration, then with probability at least 1 − δ, the robust algorithm finds s* using at most (1−δ)/(1 − H(p)) · log_2 N0 + o(log N0) + O(log²(1/δ)) queries in expectation.

2.2 Computational Considerations and Sampling

Corollaries 2.3 and 2.5 require the algorithm to find a model s with small Φ_µ(s) in each iteration. In most learning applications, the number N of candidate models is exponential in a natural problem parameter n, such as the number of sample points (classification), or the number of items to rank or cluster. If computational efficiency is a concern, this precludes explicitly keeping track of the set S or the weights µ(s). It also rules out determining the model s to query by exhaustive search over all models that have not yet been eliminated.
In some cases, these difficulties can be circumvented by exploiting problem-specific structure. A more general approach relies on Monte Carlo techniques. 
We show that the ability to sample models s with probability (approximately) proportional to µ(s) (or approximately uniformly from S in the case of Algorithm 1) is sufficient to essentially achieve the results of Corollaries 2.3 and 2.5 with a computationally efficient algorithm. Notice that both in Algorithm 1 and in the robust algorithm with noisy feedback (omitted from this version), the node weights µ(s) are completely determined by all the query responses the algorithm has seen so far and the probability p.

Theorem 2.6 Let n be a natural measure of the input size and assume that log N is polynomial in n. Assume that G = (V, E) is undirected^8, all edge lengths are integers, and the maximum degree and diameter (both with respect to the edge lengths) are bounded by poly(n). Also assume w.l.o.g. that µ is normalized to be a distribution over the nodes^9 (i.e., µ(Σ) = 1).
Let 0 ≤ ∆ < 1/4 be a constant, and assume that there is an oracle that, given a set of query responses, runs in polynomial time in n and returns a model s drawn from a distribution µ′ with d_TV(µ, µ′) ≤ ∆. Also assume that there is a polynomial-time algorithm that, given a model s, decides whether or not s is consistent with every given query response.
Then, for every ε > 0, in time poly(n, 1/ε), an algorithm can find a model s with Φ_µ(s) ≤ 1/2 + 2∆ + ε, with high probability.

3 Application I: Learning a Ranking

As a first application, we consider the task of learning the correct order of n elements with supervision in the form of equivalence queries. 
This task is motivated by learning a user's preferences over web search results (e.g., [12, 18]), restaurant or movie orders (e.g., [9]), or many other types of entities. Using pairwise active queries (“Do you think that A should be ranked ahead of B?”), a learning algorithm could of course simulate standard O(n log n) sorting algorithms; this number of queries is necessary and sufficient. However, when using equivalence queries, the user must be presented with a complete ordering (i.e., a permutation π of the n elements), and the feedback will be a mistake in the proposed permutation. Here, we propose interactive algorithms for learning the correct ranking without additional information or assumptions.^10 We first describe results for a setting with simple feedback in the form of adjacent transpositions; we then show a generalization to more realistic feedback, of the kind one typically receives in applications such as search engines.

3.1 Adjacent Transpositions

We first consider “BUBBLE SORT” feedback of the following form: the user specifies that elements i and i + 1 in the proposed permutation π are in the wrong relative order. An obvious correction for an algorithm would be to swap the two elements and leave the rest of π intact. This algorithm would exactly implement BUBBLE SORT, and thus require Θ(n²) equivalence queries. Our general framework allows us to easily obtain an algorithm with O(n log n) equivalence queries instead. We define the undirected and unweighted graph G_BS as follows:

• G_BS contains N = n! 
nodes, one for each permutation π of the n elements;
• it contains an edge between π and π′ if and only if π′ can be obtained from π by swapping two adjacent elements.

^8 It is actually sufficient that for every node weight function µ : V → R+, there exists a model s with Φ_µ(s) ≤ 1/2.
^9 For Algorithm 1, µ is uniform over all models consistent with all feedback up to that point.
^10 For example, [12, 18, 9] map items to feature vectors and assume linearity of the target function(s).

Lemma 3.1 G_BS satisfies Definition 2.1 with respect to BUBBLE SORT feedback.

Hence, applying Corollary 2.3 and Theorem 2.4, we immediately obtain the existence of learning algorithms with the following properties:

Corollary 3.2 Assume that in response to each equivalence query on a permutation π, the user responds with an adjacent transposition (or states that the proposed permutation π is correct).
1. If all query responses are correct, then the target ordering can be learned by an interactive algorithm using at most log N = log n! ≤ n log n equivalence queries.
2. If query responses are correct with probability p > 1/2, the target ordering can be learned by an interactive algorithm with probability at least 1 − δ using at most (1−δ)/(1 − H(p)) · n log n + o(n log n) + O(log²(1/δ)) equivalence queries in expectation.

Up to constants, the bound of Corollary 3.2 is optimal: Theorem 3.3 shows that Ω(n log n) equivalence queries are necessary in the worst case. Notice that Theorem 3.3 does not immediately follow from the classical lower bound for sorting with pairwise comparisons: while the result of a pairwise comparison always reveals one bit, there are n − 1 different possible responses to an equivalence query, so up to O(log n) bits might be revealed. 
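Lemma 3.1 can also be sanity-checked computationally for small n: in G_BS, the distance between π and the target is the number of pairwise inversions (Kendall tau), and applying any correct BUBBLE SORT feedback reduces this distance by exactly one, i.e., the feedback lies on a shortest path. A brute-force check (our own illustration; helper names are not from the paper):

```python
from itertools import permutations

def inversions(pi, target):
    # Distance in G_BS: number of pairs ordered differently than in target.
    pos = {x: k for k, x in enumerate(target)}
    n = len(pi)
    return sum(1 for a in range(n) for b in range(a + 1, n)
               if pos[pi[a]] > pos[pi[b]])

def correct_feedbacks(pi, target):
    # Indices i such that elements i and i+1 of pi are in the wrong order.
    pos = {x: k for k, x in enumerate(target)}
    return [i for i in range(len(pi) - 1) if pos[pi[i]] > pos[pi[i + 1]]]

# Property (*): every correct feedback is one step along a shortest path.
target = (0, 1, 2, 3)
for pi in permutations(target):
    for i in correct_feedbacks(pi, target):
        swapped = list(pi)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        assert inversions(tuple(swapped), target) == inversions(pi, target) - 1
```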
For this reason, the proof of Theorem 3.3 explicitly constructs an adaptive adversary, and does not rely on a simple counting argument.

Theorem 3.3 With adversarial responses, any interactive ranking algorithm can be forced to ask Ω(n log n) equivalence queries. This is true even if the true ordering is chosen uniformly at random, and only the query responses are adversarial.

3.2 Implicit Feedback from Clicks

In the context of search engines, it has been argued (e.g., by [12, 18, 1]) that a user's clicking behavior provides implicit feedback of a specific form on the ranking. Specifically, since users will typically read the search results from first to last, when a user skips some links that appear earlier in the ranking, and instead clicks on a link that appears later, her action suggests that the later link was more informative or relevant.
Formally, when a user clicks on the element at index i, but did not previously click on any elements at indices j, j + 1, . . . , i − 1, this is interpreted as feedback that element i should precede all of elements j, j + 1, . . . , i − 1. Thus, the feedback is akin to an “INSERTION SORT” move. (The BUBBLE SORT feedback model is the special case in which j = i − 1 always.)
To model this more informative feedback, the new graph G_IS has more edges, and the edge lengths are non-uniform. It contains the same N nodes (one for each permutation). For a permutation π and indices 1 ≤ j < i ≤ n, π_{j←i} denotes the permutation that is obtained by moving the ith element in π before the jth element (and thus shifting elements j, j + 1, . . . , i − 1 one position to the right). In G_IS, for every permutation π and every 1 ≤ j < i ≤ n, there is an undirected edge from π to π_{j←i} with length i − j. 
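For intuition about these edge lengths: a single move π_{j←i} is equivalent to i − j adjacent transpositions, which is why G_IS must be weighted for the shortest-path property to hold. A small illustration (0-based indices; the helper names are ours, not the paper's):

```python
def move(pi, j, i):
    # pi_{j<-i}: move the element at index i to just before index j (j < i),
    # shifting elements j, ..., i-1 one position to the right.
    pi = list(pi)
    pi.insert(j, pi.pop(i))
    return tuple(pi)

def inversions_between(pi, rho):
    # Number of pairs ordered differently in pi and rho; this is exactly the
    # number of adjacent transpositions needed to turn pi into rho.
    pos = {x: k for k, x in enumerate(rho)}
    n = len(pi)
    return sum(1 for a in range(n) for b in range(a + 1, n)
               if pos[pi[a]] > pos[pi[b]])

pi = ("a", "b", "c", "d", "e")
rho = move(pi, 1, 4)                  # element at index 4 moves before index 1
assert rho == ("a", "e", "b", "c", "d")
assert inversions_between(pi, rho) == 4 - 1   # cost i - j = 3
```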
Notice that for i > j + 1, there is actually no user feedback corresponding to the edge from π_{j←i} to π; however, additional edges are permitted, and Lemma 3.4 establishes that G_IS does in fact satisfy the “shortest paths” property.

Lemma 3.4 G_IS satisfies Definition 2.1 with respect to INSERTION SORT feedback.

As in the case of G_BS, by applying Corollary 2.3 and Theorem 2.4, we immediately obtain the existence of interactive learning algorithms with the same guarantees as those of Corollary 3.2.

Corollary 3.5 Assume that in response to each equivalence query, the user responds with a pair of indices j < i such that element i should precede all elements j, j + 1, . . . , i − 1.
1. If all query responses are correct, then the target ordering can be learned by an interactive algorithm using at most log N = log n! ≤ n log n equivalence queries.
2. If query responses are correct with probability p > 1/2, the target ordering can be learned by an interactive algorithm with probability at least 1 − δ using at most (1−δ)/(1 − H(p)) · n log n + o(n log n) + O(log²(1/δ)) equivalence queries in expectation.

3.3 Computational Considerations

While Corollaries 3.2 and 3.5 imply interactive algorithms using O(n log n) equivalence queries, they do not guarantee that the internal computations of the algorithms are efficient. The naïve implementation requires keeping track of and comparing likelihoods on all N = n! nodes.
When p = 1, i.e., when the algorithm only receives correct feedback, it can be made computationally efficient using Theorem 2.6. To apply Theorem 2.6, it suffices to show that one can efficiently sample a (nearly) uniformly random permutation π consistent with all feedback received so far. 
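As a reference point for what is being sampled here (the polynomial-time samplers discussed next are nontrivial), a brute-force enumeration of the consistent permutations works for small n; this exponential-time sketch is our own illustration, not the paper's algorithm:

```python
import random
from itertools import permutations

def linear_extensions(n, precedes):
    # All orderings of 0..n-1 consistent with the feedback pairs (a, b),
    # each pair meaning "a must precede b".
    exts = []
    for pi in permutations(range(n)):
        pos = {x: k for k, x in enumerate(pi)}
        if all(pos[a] < pos[b] for a, b in precedes):
            exts.append(pi)
    return exts

# Feedback so far: 0 before 2, and 1 before 2 (a partial order on 3 items).
exts = linear_extensions(3, {(0, 2), (1, 2)})
assert sorted(exts) == [(0, 1, 2), (1, 0, 2)]
sample = random.choice(exts)   # uniform over the consistent permutations
assert sample in exts
```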
Since the feedback is assumed to be correct, the set of all pairs (i, j) such that the user implied that element i must precede element j must be acyclic, and thus must form a partial order. The sampling problem is thus exactly the problem of sampling a linear extension of a given partial order.
This is a well-known problem, and a beautiful result of Bubley and Dyer [8, 7] shows that the Karzanov-Khachiyan Markov Chain [13] mixes rapidly. Huber [11] shows how to modify the Markov Chain sampling technique to obtain an exactly (instead of approximately) uniformly random linear extension of the given partial order. For the purposes of our interactive learning algorithm, the sampling results can be summarized as follows:

Theorem 3.6 (Huber [11]) Given a partial order over n elements, let L be the set of all linear extensions, i.e., the set of all permutations consistent with the partial order. There is an algorithm that runs in expected time O(n³ log n) and returns a uniformly random sample from L.

The maximum node degree in G_BS is n − 1, while the maximum node degree in G_IS is O(n²). The diameter of both G_BS and G_IS is O(n²). Substituting these bounds and the bound from Theorem 3.6 into Theorem 2.6, we obtain the following corollary:

Corollary 3.7 Both under BUBBLE SORT feedback and INSERTION SORT feedback, if all feedback is correct, there is an efficient interactive learning algorithm using at most log n! ≤ n log n equivalence queries to find the target ordering.

The situation is significantly more challenging when feedback can be incorrect, i.e., when p < 1. In this case, the user's feedback is not always consistent and may not form a partial order. In fact, we prove the following hardness result.

Theorem 3.8 There exists a p (depending on n) for which the following holds. Given a set of user responses, let µ(π) be the likelihood of π given the responses, normalized so that ∑_π µ(π) = 1. Let 0 < ∆ < 1 be any constant. There is no polynomial-time algorithm to draw a sample from a distribution µ′ with d_TV(µ, µ′) ≤ 1 − ∆ unless RP = NP.

It should be noted that the value of p in the reduction is exponentially close to 1. In this range, incorrect feedback is so unlikely that with high probability, the algorithm will always see a partial order. It might then still be able to sample efficiently. On the other hand, for smaller values of p (e.g., constant p), sampling approximately from the likelihood distribution might be possible via a metropolized Karzanov-Khachiyan chain or a different approach. This problem is still open.

4 Application II: Learning a Clustering

Many traditional approaches to clustering optimize an (explicit) objective function or rely on assumptions about the data generation process. In interactive clustering, the algorithm repeatedly proposes a clustering, and obtains feedback that two proposed clusters should be merged, or that a proposed cluster should be split into two. There are n items, and a clustering C is a partition of the items into disjoint sets (clusters) C1, C2, . . .. It is known that the target clustering has k clusters, but in order to learn it, the algorithm can query clusterings with more or fewer clusters as well. The user feedback has the following semantics, as proposed by Balcan and Blum [6] and Awasthi et al. [5, 4]:

1. MERGE(Ci, Cj): Specifies that all items in Ci and Cj belong to the same cluster.
2. 
SPLIT(Ci): Specifies that cluster Ci needs to be split, but not into which subclusters.

Notice that feedback that two clusters be merged, or that a cluster be split (when the split is known), can be considered as adding constraints on the clustering (see, e.g., [21]); depending on whether feedback may be incorrect, these constraints are hard or soft.
We define a weighted and directed graph G_UC on all clusterings C; the number of nodes is N = B_n ≤ n^n, the nth Bell number. When C′ is obtained by a MERGE of two clusters in C, G_UC contains a directed edge (C, C′) of length 2. If C = {C1, C2, . . .} is a clustering, then for each Ci ∈ C, the graph G_UC contains a directed edge of length 1 from C to C \ {Ci} ∪ {{v} | v ∈ Ci}. That is, G_UC contains an edge from C to the clustering obtained by breaking Ci into singleton clusters of all its elements. While this may not be the "intended" split of the user, we can still associate this edge with the feedback.

Lemma 4.1 G_UC satisfies Definition 2.1 with respect to MERGE and SPLIT feedback.

G_UC is directed, and every edge makes up at least a 1/(3n) fraction of the total length of at least one cycle it participates in. Hence, Proposition 2.1 gives an upper bound of (3n − 1)/(3n) on the value of β in each iteration. A more careful analysis exploiting the specific structure of G_UC gives us the following:

Lemma 4.2 In G_UC, for every non-negative node weight function µ, there exists a clustering C with Φ_µ(C) ≤ 1/2.

In the absence of noise in the feedback, Lemmas 4.1 and 4.2 and Theorem 2.2 imply an algorithm that finds the true clustering using log N = log B_n = Θ(n log n) queries.
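To make the construction of G_UC concrete, the following sketch (ours, purely illustrative; the function name and the representation of clusterings as frozensets of frozensets are our own choices) enumerates the out-edges of a clustering: one MERGE edge of length 2 per pair of clusters, and one split-into-singletons edge of length 1 per non-singleton cluster.

```python
from itertools import combinations

def guc_out_edges(clustering):
    """Enumerate out-edges of a clustering in the feedback graph G_UC.

    A clustering is a frozenset of frozensets (the clusters). Merging two
    clusters yields an edge of length 2; replacing one cluster by the
    singletons of its elements yields an edge of length 1.
    """
    edges = []
    clusters = list(clustering)
    # MERGE edges: combine two clusters into one (length 2).
    for a, b in combinations(clusters, 2):
        merged = (clustering - {a, b}) | {a | b}
        edges.append((merged, 2))
    # Split edges: break one cluster into singletons (length 1).
    for c in clusters:
        if len(c) > 1:
            split = (clustering - {c}) | {frozenset({v}) for v in c}
            edges.append((split, 1))
    return edges
```

For example, the clustering {{1, 2}, {3}} has exactly two out-edges: the merge to {{1, 2, 3}} (length 2) and the split to {{1}, {2}, {3}} (length 1).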
Notice that this is worse than the "trivial" algorithm, which starts with each node as a singleton cluster and always executes the merge proposed by the user until it has found the correct clustering; hence, this bound is itself rather trivial.
Non-trivial bounds can be obtained when clusters belong to a restricted set, an approach also followed by Awasthi and Zadeh [5]. If there are at most M candidate clusters, then the number of clusterings is N_0 ≤ M^k. For example, if there is a set system F of VC dimension at most d such that each cluster is in the range space of F, then M = O(n^d) by the Sauer-Shelah Lemma [19, 20]. Combining Lemmas 4.1 and 4.2 with Theorems 2.2 and 2.4, we obtain the existence of learning algorithms with the following properties:

Corollary 4.3 Assume that in response to each equivalence query, the user responds with MERGE or SPLIT. Also, assume that there are at most M different candidate clusters, and the clustering has (at most) k clusters.

1. If all query responses are correct, then the target clustering can be learned by an interactive algorithm using at most log N = O(k log M) equivalence queries. Specifically, when M = O(n^d), this bound is O(kd log n). This result recovers the main result of [5].^11

2. If query responses are correct with probability p > 1/2, the target clustering can be learned with probability at least 1 − δ using at most (1 − δ) k log M / (1 − H(p)) + o(k log M) + O(log^2(1/δ)) equivalence queries in expectation.
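As a quick sanity check of the bound in part 2, the following snippet (ours, purely illustrative; the function names are our own) evaluates the leading term (1 − δ) k log M / (1 − H(p)), where H is the binary entropy function. With perfect feedback (p = 1), the leading term reduces to k log M, matching part 1 up to lower-order terms.

```python
import math

def binary_entropy(p):
    """H(p) = -p log2(p) - (1 - p) log2(1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def query_bound_leading_term(k, M, p, delta):
    """Leading term (1 - delta) * k * log2(M) / (1 - H(p)) of Corollary 4.3(2)."""
    return (1 - delta) * k * math.log2(M) / (1 - binary_entropy(p))
```

Note how the 1/(1 − H(p)) factor blows up as p approaches 1/2, where each response carries vanishing information.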
Our framework provides the noise tolerance "for free"; [5] instead obtain results for a different type of noise in the feedback.

5 Application III: Learning a Classifier

Learning a binary classifier is the original and prototypical application of the equivalence query model of Angluin [2], which has seen a large amount of follow-up work since (see, e.g., [16, 17]). Naturally, if no assumptions are made on the classifier, then n queries are necessary in the worst case. In general, applications therefore restrict the concept classes to smaller sets, such as assuming that they have bounded VC dimension. We use F to denote the set of all possible concepts, and write M = |F|; when F has VC dimension d, the Sauer-Shelah Lemma [19, 20] implies that M = O(n^d).
Learning a binary classifier for n points is an almost trivial application of our framework.^12 When the algorithm proposes a candidate classifier, the feedback it receives is a point with a corrected label (or the fact that the classifier was correct on all points).

^11 In fact, the algorithm in [5] is implicitly computing and querying a node with small Φ in G_UC.
^12 The results extend readily to learning a classifier with k ≥ 2 labels.

We define the graph G_CL to be the n-dimensional hypercube^13 with unweighted and undirected edges between every pair of nodes at Hamming distance 1. Because the distance between two classifiers C, C′ is exactly the number of points on which they disagree, G_CL satisfies Definition 2.1. Hence, we can apply Corollary 2.3 and Theorem 2.4 with S_init equal to the set of all M candidate classifiers, recovering the classic result on learning a classifier in the equivalence query model when feedback is perfect, and extending it to the noisy setting.

Corollary 5.1

1. With perfect feedback, the target classifier is learned using log M queries.^14

2.
When each query response is correct with probability p > 1/2, there is an algorithm learning the true binary classifier with probability at least 1 − δ using at most (1 − δ) log M / (1 − H(p)) + o(log M) + O(log^2(1/δ)) queries in expectation.

6 Discussion and Conclusions

We defined a general framework for interactive learning from imperfect responses to equivalence queries, and presented a general algorithm that achieves a small number of queries. We then showed how query-efficient interactive learning algorithms in several domains can be derived with practically no effort as special cases; these include some previously known results (classification and clustering) as well as new results on ranking/ordering.
Our work raises several natural directions for future work. Perhaps most importantly, for which domains can the algorithms be made computationally efficient (in addition to query-efficient)? We provided a positive answer for ordering with perfect query responses, but the question is open for ordering when feedback is imperfect. For classification, when the possible classifiers have VC dimension d, the running time is O(n^d), which is unfortunately still impractical for real-world values of d. Maass and Turán [15] show how to obtain better bounds specifically when the sample points form a d-dimensional grid; to the best of our knowledge, the question is open when the sample points are arbitrary. The Monte Carlo approach of Theorem 2.6 reduces the question to that of sampling a uniformly random hyperplane, where the uniformity is over the partition induced by the hyperplane (rather than some geometric representation).
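To make the query-efficient (if not time-efficient) classification algorithm concrete, the classic halving scheme that Corollary 5.1(1) recovers can be sketched as follows. This is our own minimal sketch, not the paper's implementation: the feedback callback stands in for the user, and candidates are explicit label vectors over the n points. Each counterexample eliminates at least half of the surviving candidates, giving the log M bound.

```python
def learn_classifier(candidates, feedback):
    """Halving-style learner for binary classifiers over n points.

    `candidates` is a list of tuples in {0, 1}^n (one label per point);
    `feedback(proposal)` returns None if the proposal is accepted, or an
    index i at which the proposal's label is wrong. With perfect feedback,
    each counterexample removes at least half of the candidates, so at
    most log2(len(candidates)) equivalence queries are needed.
    """
    candidates = list(candidates)
    n = len(candidates[0])
    while True:
        # "Improper" step: propose the pointwise majority vote, which need
        # not itself belong to the candidate set.
        proposal = tuple(
            int(2 * sum(c[i] for c in candidates) >= len(candidates))
            for i in range(n)
        )
        mistake = feedback(proposal)
        if mistake is None:
            return proposal
        # The corrected label at `mistake` is the opposite of the proposal's;
        # every candidate agreeing with the proposal there is eliminated.
        correct = 1 - proposal[mistake]
        candidates = [c for c in candidates if c[mistake] == correct]
```

Since the majority-vote proposal may lie outside the concept class, this sketch also illustrates the "improper" intermediate steps discussed below.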
For clustering, even less appears to be known.
It should be noted that our algorithms may incorporate "improper" learning steps: for instance, when trying to learn a hyperplane classifier, the algorithm in Section 5 may propose intermediate classifiers that are not themselves hyperplanes (though the final output is of course a hyperplane classifier). At an increase of a factor O(log d) in the number of queries, we can ensure that all steps are proper for hyperplane learning. An interesting question is whether similar bounds can be obtained for other concept classes, and for other problems (such as clustering).
Finally, our noise model is uniform. An alternative would be that the probability of an incorrect response depends on the type of response. In particular, false positives could be extremely likely, for instance, because the user did not try to classify a particular incorrectly labeled data point, or did not see an incorrect ordering of items far down in the ranking. Similarly, some wrong responses may be more likely than others; for example, a user proposing a merge of two clusters (or split of one) might be "roughly" correct, but miss out on a few points (the setting that [5, 4] studied). We believe that several of these extensions should be fairly straightforward to incorporate into the framework, and would mostly lead to additional complexity in notation and in the definition of various parameters. But a complete and principled treatment would be an interesting direction for future work.

Acknowledgments

Research supported in part by NSF grant 1619458. We would like to thank Sanjoy Dasgupta, Ilias Diakonikolas, Shaddin Dughmi, Haipeng Luo, Shanghua Teng, and anonymous reviewers for useful feedback and suggestions.

^13 When there are k labels, G_CL is a graph with k^n nodes.
^14 With k labels, this bound becomes (k − 1) log M.

References

[1] E. Agichtein, E. Brill, S. Dumais, and R.
Ragno. Learning user interaction models for predicting web search result preferences. In Proc. 29th Intl. Conf. on Research and Development in Information Retrieval (SIGIR), pages 3–10, 2006.

[2] D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.

[3] D. Angluin. Computational learning theory: Survey and selected bibliography. In Proc. 24th ACM Symp. on Theory of Computing, pages 351–369, 1992.

[4] P. Awasthi, M.-F. Balcan, and K. Voevodski. Local algorithms for interactive clustering. Journal of Machine Learning Research, 18:1–35, 2017.

[5] P. Awasthi and R. B. Zadeh. Supervised clustering. In Proc. 24th Advances in Neural Information Processing Systems, pages 91–99, 2010.

[6] M.-F. Balcan and A. Blum. Clustering with interactive feedback. In Proc. 19th Intl. Conf. on Algorithmic Learning Theory, pages 316–328, 2008.

[7] R. Bubley. Randomized Algorithms: Approximation, Generation, and Counting. Springer, 2001.

[8] R. Bubley and M. Dyer. Faster random generation of linear extensions. Discrete Mathematics, 201(1):81–88, 1999.

[9] K. Crammer and Y. Singer. Pranking with ranking. In Proc. 16th Advances in Neural Information Processing Systems, pages 641–647, 2002.

[10] E. Emamjomeh-Zadeh, D. Kempe, and V. Singhal. Deterministic and probabilistic binary search in graphs. In Proc. 48th ACM Symp. on Theory of Computing, pages 519–532, 2016.

[11] M. Huber. Fast perfect sampling from linear extensions. Discrete Mathematics, 306(4):420–428, 2006.

[12] T. Joachims. Optimizing search engines using clickthrough data. In Proc. 8th Intl. Conf. on Knowledge Discovery and Data Mining, pages 133–142, 2002.

[13] A. Karzanov and L. Khachiyan. On the conductance of order Markov chains. Order, 8(1):7–15, 1991.

[14] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm.
Machine Learning, 2:285–318, 1988.

[15] W. Maass and G. Turán. On the complexity of learning from counterexamples and membership queries. In Proc. 31st IEEE Symp. on Foundations of Computer Science, pages 203–210, 1990.

[16] W. Maass and G. Turán. Lower bound methods and separation results for on-line learning models. Machine Learning, 9(2):107–145, 1992.

[17] W. Maass and G. Turán. Algorithms and lower bounds for on-line learning of geometrical concepts. Machine Learning, 14(3):251–269, 1994.

[18] F. Radlinski and T. Joachims. Query chains: Learning to rank from implicit feedback. In Proc. 11th Intl. Conf. on Knowledge Discovery and Data Mining, pages 239–248, 2005.

[19] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147, 1972.

[20] S. Shelah. A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41(1):247–261, 1972.

[21] K. L. Wagstaff. Intelligent Clustering with Instance-Level Constraints. PhD thesis, Cornell University, 2002.