{"title": "Flattening a Hierarchical Clustering through Active Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 15289, "page_last": 15299, "abstract": "We investigate active learning by pairwise similarity over the leaves of trees originating from hierarchical clustering procedures. In the realizable setting, we provide a full characterization of the number of queries needed to achieve perfect reconstruction of the tree cut. In the non-realizable setting, we rely on known importance-sampling procedures to obtain regret and query complexity bounds. Our algorithms come with theoretical guarantees on the statistical error and, more importantly, lend themselves to linear-time implementations in the relevant parameters of the problem. We discuss such implementations, prove running time guarantees for them, and present preliminary experiments on real-world datasets showing the compelling practical performance of our algorithms as compared to both passive learning and simple active learning baselines.", "full_text": "Flattening a Hierarchical Clustering through Active Learning

Fabio Vitale
Department of Computer Science
INRIA Lille, France &
Sapienza University of Rome, Italy
fabio.vitale@inria.fr

Anand Rajagopalan
Google Research NY
New York, USA
anandbr@google.com

Claudio Gentile
Google Research NY
New York, USA
cgentile@google.com

Abstract

We investigate active learning by pairwise similarity over the leaves of trees originating from hierarchical clustering procedures. In the realizable setting, we provide a full characterization of the number of queries needed to achieve perfect reconstruction of the tree cut. In the non-realizable setting, we rely on known importance-sampling procedures to obtain regret and query complexity bounds.
Our algorithms come with theoretical guarantees on the statistical error and, more importantly, lend themselves to linear-time implementations in the relevant parameters of the problem. We discuss such implementations, prove running time guarantees for them, and present preliminary experiments on real-world datasets showing the compelling practical performance of our algorithms as compared to both passive learning and simple active learning baselines.

1 Introduction

Active learning is a learning scenario where labeled data are scarce and/or expensive to gather, as they require careful assessment by human labelers. This is often the case in several practical settings where machine learning is routinely deployed, from image annotation to document classification, from speech recognition to spam detection, and beyond. In all such cases, an active learning algorithm tries to limit human intervention by seeking as little supervision as possible, while still obtaining accurate predictions on unseen samples. This is an attractive learning framework offering substantial practical benefits, but also presenting statistical and algorithmic challenges.

Active learning is especially effective when combined with methods that exploit the cluster structure of data (e.g., [11, 21, 10], and references therein), where a cluster typically encodes some notion of semantic similarity across the involved data points. A ubiquitous approach to clustering is to organize data into a hierarchy, delivering clustering solutions at different levels of resolution. An (agglomerative) Hierarchical Clustering (HC) procedure is an unsupervised learning method parametrized by a similarity function over the items to be clustered and a linkage function that lifts similarity from items to clusters of items. Finding the "right" level of resolution amounts to turning a given HC into a flat clustering by cutting the resulting tree appropriately.
We would like to do so by resorting to human feedback in the form of pairwise similarity queries, that is, yes/no questions of the form "are these two products similar to one another?" or "are these two news items covering similar events?". It is well known that such queries are relatively easy to respond to, but are also intrinsically prone to subjectiveness and/or noise. More importantly, the hierarchy at hand need not be aligned with the similarity feedback we actually receive.

In this paper, we investigate the problem of cutting a tree originating from a pre-specified HC procedure through pairwise similarity queries generated by active learning algorithms. Since the tree is typically not consistent with the similarity feedback, that is to say, the feedback is noisy, we are led to tackle this problem under a variety of assumptions about the nature of this noise (from noiseless, to random but persistent, to general agnostic). Moreover, because different linkage functions applied to the very same set of items may give rise to widely different tree topologies, our study also focuses on characterizing active learning performance as a function of the structure of the tree at hand. Finally, because these hierarchies may in practice be sizeable (on the order of billions of nodes), scalability will be a major concern in our investigation.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

On motivation. It is often the case in big organizations that data processing pipelines are split into services, leaving Machine Learning solution providers constrained by the existing hardware/software infrastructure.
In the Active Learning applications that motivate this work, the hierarchy over the items to be clustered is provided by a third party, i.e., by an exogenous data processing tool that relies on side information on the items (e.g., word2vec mappings and associated distance functions), which is possibly generated by yet another service, and so on. In this modular environment, it is reasonable to assume that the tree is given to us as part of the input of our Active Learning problem. The human feedback our algorithms rely upon may or may not be consistent with the tree at hand, both because human feedback is generally noisy and because this feedback may originate from yet another source of data, e.g., another production team in the organization that was not in charge of building the original tree. In fact, the same tree over the data items may serve the clustering needs of different groups within the organization, having different goals and views on the same data. This also motivates why, when studying this problem, we are led to consider different noise scenarios, that is, the presence of noisy feedback and the possibility that the given tree is not consistent with the clustering corresponding to the received feedback.

Our contribution. In the realizable setting (both noiseless and persistent noisy, Section 3), we introduce algorithms whose expected number of queries scales with the average complexity of tree cuts, a notion introduced in this paper. A distinctive feature of these algorithms is that they are rather ad hoc in the way they deal with the structure of our problem. In particular, they cannot be seen as finding the query that splits the version space as evenly as possible, a common approach in many active learning papers (e.g., [12, 25, 15, 16, 27, 24], and references therein). We then show that, at least in the noiseless case, this average complexity measure characterizes the expected query complexity of the problem.
Our ad hoc analyses are beneficial in that they deliver sharper guarantees than those readily available from the above papers. In addition, and perhaps more importantly for practical usage, our algorithms admit linear-time implementations in the relevant parameters of the problem (like the number of items to be clustered). In the non-realizable setting (Section 4), we build on known results in importance-weighted active learning (e.g., [5, 6]) to devise a selective sampling algorithm working under more general conditions. While our statistical analysis follows by adapting available results, our goal here is rather to come up with fast implementations, so as to put the resulting algorithms on the same computational footing as those operating under (noisy) realizability assumptions. By leveraging the specific structure of our hypothesis space, we design a fast incremental algorithm for selective sampling whose running time per round is linear in the height of the tree. In turn, this effort paves the way for our experimental investigation (Section 5), where we compare the effectiveness of the two above-mentioned approaches (realizable with persistent noise vs. non-realizable) on real data originating from various linkage functions. Though preliminary in nature, these experiments seem to suggest that in practice the algorithms originating from the persistent noise assumption exhibit more attractive learning curves than those working in the more general non-realizable setting.

Related work. The literature on active learning is vast, and we can hardly do it justice here. In what follows we confine ourselves to the references which we believe are closest to our paper. Since our sample space is discrete (the set of all possible pairs of items from a finite set of size n), our realizable setting is essentially a pool-based active learning setting.
Several papers have considered greedy algorithms which generalize binary search [1, 20, 12, 25, 15, 24]. The query complexity can be measured either in the worst case or averaged over a prior distribution over all possible labeling functions in a given set. The query complexity of these algorithms can be analyzed by comparing it to the best possible query complexity achievable for that set of items. In [12] it is shown that if the probability mass of the version space is split as evenly as possible, then the approximation factor for its average query complexity is O(log(1/pm)), where pm is the minimal prior probability of any considered labeling function. [15] extended this result through a more general approach to approximate greedy rules, but with the worse factor O(log²(1/pm)). [20] observed that modifying the prior distribution always allows one to replace O(log(1/pm)) by the smaller factor O(log N), where N is the size of the set of labeling functions. Results of a similar flavor are contained in [25, 24]. In our case, N can be exponential in n (see Section 2), making these landmark results too broad to be tight for our specific setting. Furthermore, some of these papers (e.g., [12, 25, 24]) are of theoretical interest only because of their difficult algorithmic implementation. Interesting advances on this front are contained in the more recent paper [27], though when adapted to our specific setting, their results give rise to worse query bounds than ours. In the same vein are the papers by [7, 8], dealing with persistent noise.
Finally, in the non-realizable setting, our work fully relies on [6], which in turn builds on standard references like [9, 5, 17] – see, e.g., the comprehensive survey by [18]. Further references, specifically related to clustering with queries, are mentioned in Appendix A.

2 Preliminaries and learning models

We consider the problem of finding cuts of a given binary tree through pairwise similarity queries over its leaves. We are given in input a binary¹ tree T originating from, say, an agglomerative (i.e., bottom-up) HC procedure (single linkage, complete linkage, etc.) applied to a set of items L = {x1, . . . , xn}. Since T is the result of successive (binary) merging operations from bottom to top, T turns out to be a strongly binary tree² and the items in L are the leaves of T. We will denote by V the set of nodes in T, including its leaves L, and by r the root of T. The height of T will be denoted by h. When referring to a subtree T′ of T, we will use the notation V(T′), L(T′), r(T′), and h(T′), respectively. We also denote by T(i) the subtree of T rooted at node i, and by L(i) the set of leaves of T(i), so that L(i) = L(T(i)), and r(T(i)) = i. Moreover, par(i) will denote the parent of node i (in tree T), left(i) will be the left child of i, and right(i) its right child.

Figure 1: Left: A binary tree corresponding to a hierarchical clustering of the set of items L = {x1, . . . , x8}. The cut depicted in dashed green has two nodes above and the rest below. This cut induces over L the flat clustering C1 = {{x1, x2, x3, x4, x5}, {x6}, {x7, x8}} corresponding to the leaves of the subtrees rooted at the 3 green-bordered nodes just below the cut (the lower boundary of the cut). Clustering C1 is therefore realized by T. On the contrary, clustering C2 = {{x1, x2, x3}, {x4, x5, x6}, {x7, x8}} is not.
Close to each node i is also displayed the number N(i) of cuts realized by the subtree rooted at i. For instance, in this figure, 7 = 1 + 3·2, and 22 = 1 + 7·3, so that T admits overall N(T) = 22 cuts. Right: The same figure, where below each node i are the probabilities p(i) encoding a uniform prior distribution over cuts. Notice that p(i) = 1/N(i) so that, like all other cuts, the depicted green cut has probability (1 − 1/22)·(1 − 1/3)·(1/7)·1·(1/2) = 1/22.

A flat clustering C of L is a partition of L into disjoint (and non-empty) subsets. A cut c of T of size K is a set of K edges of T that partitions V into two disjoint subsets; we call them the nodes above c and the nodes below c. Cut c also uniquely induces a clustering over L, made up of the clusters L(i1), L(i2), . . . , L(iK), where i1, i2, . . . , iK are the nodes below c that the edges of c are incident to. We denote this clustering by C(c), and call the nodes i1, i2, . . . , iK the lower boundary of c. We say that clustering C0 is realized by T if there exists a cut c of T such that C(c) = C0. See Figure 1 (left) for a pictorial illustration.

Clearly enough, for a given L, and a given tree T with set of leaves L, not all possible clusterings over L are realized by T, as the number and shape of the clusterings realized by T are strongly influenced by T's structure. Let N(T) be the number of clusterings realized by T (notice that this is also equal to the number of distinct cuts admitted by T). Then N(T) can be computed through a simple recursive formula. If we let N(i) be the number of cuts realized by T(i), one can easily verify that N(i) = 1 + N(left(i))·N(right(i)), with N(xi) = 1 for all xi ∈ L. With this notation, we then have N(T) = N(r(T)).
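The recursion above is straightforward to implement. The following is a small illustrative sketch, not taken from the paper; the nested-tuple encoding and the exact bracketing of the example tree are our own assumptions, chosen to match the counts reported in Figure 1:

```python
# Sketch of the cut-counting recursion N(i) = 1 + N(left(i)) * N(right(i)),
# with N(x) = 1 at the leaves. Trees are encoded as nested tuples: a leaf is a
# string, an internal node is a pair (left_subtree, right_subtree). This encoding
# and the bracketing below are illustrative assumptions, not the paper's code.

def num_cuts(node):
    if isinstance(node, str):        # leaf: N(x) = 1
        return 1
    left, right = node
    return 1 + num_cuts(left) * num_cuts(right)

# A tree shape consistent with Figure 1: a 5-leaf subtree with N = 7 = 1 + 3*2
# and a 3-leaf subtree with N = 3, giving N(T) = 1 + 7*3 = 22.
T = ((("x1", ("x2", "x3")), ("x4", "x5")), ("x6", ("x7", "x8")))
```

With this shape, `num_cuts(T)` returns 22, matching the value of N(T) in Figure 1.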
If T has n leaves, N(T) ranges from n, when T is a degenerate line tree, to the exponential ⌊α^n⌋, when T is the full binary tree, where α ≈ 1.502 (e.g., http://oeis.org/A003095). See again Figure 1 (left) for a simple example.

¹ In fact, the trees we can handle are more general than binary: we are making the binary assumption throughout for presentational convenience only.

² A strongly binary tree is a rooted binary tree for which the root is adjacent to either zero or two nodes, and all non-root nodes are adjacent to either one or three nodes.

Figure 2: Cuts and corresponding values of dH(Σ, Ĉ) in four input trees for the same ground-truth Σ determined by the clustering {{x1, x2, x3, x4, x5}, {x6}, {x7, x8}}. Underneath each tree is the clustering Ĉ induced by the depicted cut. The color of a cluster is green if it corresponds to a cluster in the clustering of Σ, and is red otherwise. In T1 we have dH(Σ, Ĉ) = 0 (realizable case). The other 3 trees illustrate the non-realizable case with the cut minimizing dH(Σ, Ĉ) on the corresponding tree. Recalling that dH(Σ, Ĉ) counts ordered pairs, in T2 we have dH(Σ, Ĉ) = 10, because x6 now belongs to the first cluster according to clustering Ĉ. In T3 we have dH(Σ, Ĉ) = 2. Finally, in T4 it is easy to verify that dH(Σ, Ĉ) = 10.

A ground-truth matrix Σ is an n×n, ±1-valued symmetric matrix Σ = [σ(xi, xj)], i, j = 1, . . . , n, encoding a pairwise similarity relation over L. Specifically, if σ(xi, xj) = 1 we say that xi and xj are similar, while if σ(xi, xj) = −1 we say they are dissimilar.
Moreover, we always have σ(xi, xi) = 1 for all xi ∈ L. Notice that Σ need not be consistent with a given clustering over L, i.e., the binary relation defined by Σ over L need not be transitive.

Given T and its leaves L, an active learning algorithm A proceeds in a sequence of rounds. In a purely active setting, at round t, the algorithm queries a pair of items (xit, xjt), and observes the associated label σ(xit, xjt). In a selective sampling setting, at round t, the algorithm is presented with (xit, xjt) drawn from some distribution over L × L, and has to decide whether or not to query the associated label σ(xit, xjt). In both cases, the algorithm is stopped at some point, and is compelled to commit to a specific cut of T (inducing a flat clustering over L). Coarsely speaking, the goal of A is to come up with a good cut of T, by making as few queries as possible on the entries of Σ.

Noise Models. The simplest possible setting, called the noiseless realizable setting, is when Σ itself is consistent with a given clustering realized by T, i.e., when there exists a cut c* of T such that C(c*) = {L(i1), . . . , L(iK)}, for some nodes i1, . . . , iK ∈ V, that satisfies the following: For all r = 1, . . . , K, and for all pairs (xi, xj) ∈ L(ir) × L(ir) we have σ(xi, xj) = 1, while for all other pairs we have σ(xi, xj) = −1. We call (persistent) noisy realizable setting one where Σ is generated as follows. Start off from the noiseless ground-truth matrix, and call it Σ*. Then, in order to obtain Σ from Σ*, consider the set of all n(n−1)/2 pairs (xi, xj) with i < j, and pick uniformly at random a subset of size ⌊λ·n(n−1)/2⌋, for some λ ∈ [0, 1/2). Each such pair has its label flipped in Σ: σ(xi, xj) = −σ*(xi, xj).
This is then combined with the symmetry condition σ(xi, xj) = σ(xj, xi) and the reflexivity condition σ(xi, xi) = 1. We call λ the noise level. Notice that this kind of noise is random but persistent, in that if we query the same pair (xi, xj) twice we do obtain the same answer σ(xi, xj). Clearly, the special case λ = 0 corresponds to the noiseless setting. Finally, in the general non-realizable (or agnostic) setting, Σ is an arbitrary matrix that need not be consistent with any clustering over L, in particular, with any clustering over L realized by T.

Error Measure. If Σ is some ground-truth matrix over L, and ĉ is the cut output by A, with induced clustering Ĉ = C(ĉ), we let Σ_Ĉ = [σ_Ĉ(xi, xj)], i, j = 1, . . . , n, be the similarity matrix associated with Ĉ, i.e., σ_Ĉ(xi, xj) = 1 if xi and xj belong to the same cluster, and −1 otherwise. Then the Hamming distance dH(Σ, Ĉ) simply counts the number of (ordered) pairs (xi, xj) having inconsistent sign:

dH(Σ, Ĉ) = |{(xi, xj) ∈ L² : σ(xi, xj) ≠ σ_Ĉ(xi, xj)}| .

The same definition applies in particular to the case when Σ itself represents a clustering over L. The quantity dH, sometimes called correlation clustering distance, is closely related to the Rand index [26] – see, e.g., [23]. Figure 2 contains illustrative examples.

Figure 3: Left: The dotted green cut c* can be described by the set of values {y(i), i ∈ V}, shown below each node. In this tree, in order to query, say, node i2, it suffices to query any of the four pairs (x1, x3), (x1, x4), (x2, x3), or (x2, x4).
The baseline queries i1 through i6 in a breadth-first manner, and then stops, having identified c*. Right: This graph has N(T) = n. On the depicted cut, the baseline has to query all n − 1 internal nodes.

Prior distribution. Recall cut c* defined in the noiseless realizable setting and its associated Σ*. Depending on the specific learning model we consider (see below), the algorithm may have access to a prior distribution P(·) over c*, parametrized as follows. For i ∈ V, let p(i) be the conditional probability that i is below c* given that all of i's ancestors are above. If we denote by AB(c*) ⊆ V the nodes of T which are above c*, and by LB(c*) ⊆ V those on the lower boundary of c*, we can write

P(c*) = ( ∏_{i ∈ AB(c*)} (1 − p(i)) ) · ( ∏_{j ∈ LB(c*)} p(j) ) ,   (1)

where p(i) = 1 if i ∈ L. In particular, setting p(i) = 1/N(i) for all i yields the uniform prior P(c*) = 1/N(T) for all c* realized by T. See Figure 1 (right) for an illustration. A canonical example of a non-uniform prior is one that favors cuts close to the root, thereby inducing clusterings having few clusters. These can be obtained, e.g., by setting p(i) = α, for some constant α ∈ (0, 1).

Learning models. We consider two learning settings. The first setting (Section 3) is an active learning setting under a noisy realizability assumption with prior information. Let C* = C(c*) be the ground-truth clustering induced by cut c* before noise is added.
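As a quick sanity check of the claim in the Prior distribution paragraph that p(i) = 1/N(i) yields the uniform prior, one can enumerate every cut probability of a small tree under parametrization (1). A minimal sketch, under our own toy encoding (trees as nested tuples, leaves as strings), not the paper's code:

```python
# Enumerate P(c) for every cut c of a toy tree under the parametrization
# P(c) = prod_{i in AB(c)} (1 - p(i)) * prod_{j in LB(c)} p(j), with the
# uniform choice p(i) = 1/N(i). The tuple tree encoding is an illustrative
# assumption; cuts are enumerated recursively.

def num_cuts(node):
    if isinstance(node, str):
        return 1
    return 1 + num_cuts(node[0]) * num_cuts(node[1])

def cut_probs(node):
    # Returns the probabilities of all cuts of the subtree rooted at `node`.
    p = 1.0 / num_cuts(node)          # uniform prior: p(i) = 1/N(i); p = 1 at leaves
    if isinstance(node, str):
        return [1.0]
    probs = [p]                       # the cut whose lower boundary is this node
    for pl in cut_probs(node[0]):     # node above: combine cuts of the two subtrees
        for pr in cut_probs(node[1]):
            probs.append((1.0 - p) * pl * pr)
    return probs

T = (("x1", ("x2", "x3")), ("x4", "x5"))   # a 5-leaf tree with N(T) = 7
probs = cut_probs(T)                        # 7 cuts, each of probability 1/7
```

As claimed, all 7 cut probabilities come out equal to 1/N(T) = 1/7 and sum to one.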
Here, for a given prior P(c*), the goal of learning is to identify C* either exactly (when λ = 0) or approximately (when λ > 0), while bounding the expected number of queries (xit, xjt) made to the ground-truth matrix Σ, the expectation being over the noise, and possibly over P(c*). In particular, if Ĉ is the clustering produced by the algorithm after it stops, we would like to prove upper bounds on E[dH(Σ*, Ĉ)], as related to the number of active learning rounds, as well as to the properties of the prior distribution. The second setting (Section 4) is a selective sampling setting where the pairs (xit, xjt) are drawn i.i.d. according to an arbitrary and unknown distribution D over the n² entries of Σ, and the algorithm at every round can choose whether or not to query the label. After a given number of rounds the algorithm is stopped, and the goal is the typical goal of agnostic learning: no prior distribution over cuts is available anymore, and we would like to bound, with high probability over the sample (xi1, xj1), (xi2, xj2), . . ., the so-called excess risk of the clustering Ĉ produced by A, i.e., the difference

P_{(xi,xj)∼D}( σ(xi, xj) ≠ σ_Ĉ(xi, xj) ) − min_c P_{(xi,xj)∼D}( σ(xi, xj) ≠ σ_{C(c)}(xi, xj) ) ,   (2)

the minimum being over all possible cuts c realized by T. Notice that when D is uniform the excess risk reduces to (1/n²)·( dH(Σ, Ĉ) − min_c dH(Σ, C(c)) ). At the same time, we would like to bound with high probability the total number of labels the algorithm has queried.

3 Active learning in the realizable case

As a warm up, we start by considering the case where λ = 0 (no noise). The underlying cut c* can be conveniently described by assigning to each node i of T a binary value y(i) = 0 if i is above c*, and y(i) = 1 if i is below.
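To make the encoding concrete, here is a toy sketch under our own illustrative conventions (trees as nested tuples, nodes addressed by root-to-node left/right path strings, and a cut given by its set of lower-boundary paths) that materializes the labels y(i) induced by a cut:

```python
# Label every node of a tree with y(i) = 0 (above the cut) or y(i) = 1 (below).
# Conventions (ours, for illustration only): a leaf is a string, an internal
# node a pair; a node is addressed by a path from the root ('' = root, then
# '0' = left child, '1' = right child); the cut is its set of lower-boundary paths.

def y_labels(node, lower_boundary, path="", below=False):
    below = below or path in lower_boundary   # on/under the lower boundary => y = 1
    labels = {path: 1 if below else 0}
    if isinstance(node, tuple):
        labels.update(y_labels(node[0], lower_boundary, path + "0", below))
        labels.update(y_labels(node[1], lower_boundary, path + "1", below))
    return labels

# Toy tree and the cut with lower boundary {left child of the root, the two
# grandchildren on the right}: clustering {{x1, x2}, {x3}, {x4}}.
T = (("x1", "x2"), ("x3", "x4"))
y = y_labels(T, {"0", "10", "11"})   # root and its right child are above (y = 0)
```

Here the root and its right child get y = 0, and every node on or below the lower boundary gets y = 1, as in the description above.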
Then we can think of an active learning algorithm as querying nodes, instead of querying pairs of leaves. A query to node i ∈ V can be implemented by querying any pair (xiℓ, xir) ∈ L(left(i)) × L(right(i)). When doing so, we actually receive y(i), since for any such pair we clearly have y(i) = 1 if σ*(xiℓ, xir) = 1, and y(i) = 0 otherwise. An obvious baseline is then to perform a kind of breadth-first search in the tree: We start by querying the root r, and observe y(r); if y(r) = 1 we stop and output the clustering Ĉ = {L}; otherwise, we go down by querying both left(r) and right(r), and then proceed recursively. It is not hard to show that this simple algorithm will make at most 2K − 1 queries, with an overall running time of O(K), where K is the number of clusters of C(c*). See Figure 3 for an illustration. If we know beforehand that K is very small, then this baseline is a tough competitor. Yet, this is not the best we can do in general. Consider, for instance, the line graph in Figure 3 (right), where c* has K = n.

Ideally, for a given prior P(·), we would like to obtain a query complexity of the form log(1/P(c*)), holding in the worst case for all underlying c*. As we shall see momentarily, this is easily obtained when P(·) is uniform. We first describe a version space algorithm (One Third Splitting, OTS) that admits a fast implementation, and whose number of queries in the worst case is O(log N(T)). This will in turn pave the way for our second algorithm, Weighted Dichotomic Path (WDP). WDP leverages P(·), but its theoretical guarantees only hold in expectation over P(c*). WDP will then be extended to the persistent noisy setting through its variant Noisy Weighted Dichotomic Path (N-WDP).

We need a few ancillary definitions. First of all note that, in the noiseless setting, we have a clear hierarchical structure on the labels y(i) of the internal nodes of T: Whenever a query reveals a label y(i) = 0, we know that all of i's ancestors also have label 0. On the other hand, if we observe y(i) = 1, we know that all internal nodes of subtree T(i) have label 1. Hence, disclosing the label of some node indirectly entails disclosing the labels of either its ancestors or its descendants. Given T, a bottom-up path is any path connecting a node with one of its ancestors in T. In particular, we call a backbone path any bottom-up path having maximal length. Given i ∈ V, we denote by St(i) the version space at time t associated with T(i), i.e., the set of all cuts of T(i) that are consistent with the labels revealed so far. For any node j ≠ i, St(i) splits into S_t^{y(j)=0}(i) and S_t^{y(j)=1}(i), the subsets of St(i) obtained by imposing a further constraint on y(j).

OTS (One Third Splitting): For all i ∈ V, OTS maintains over time the value |St(i)|, i.e., the size of St(i), along with the forest F made up of all maximal subtrees T′ of T such that |V(T′)| > 1 and for which none of their node labels have been revealed so far. OTS initializes F to contain T only, and maintains F updated over time, by picking any backbone of any subtree T′ ∈ F, and visiting it in a bottom-up manner. See the details in Appendix B.1. The following theorem (proof in Appendix B.1) crucially relies on the fact that π is a backbone path of T′, rather than an arbitrary path.

Theorem 1 On a tree T with n leaves, height h, and number of cuts N, OTS finds c* by making O(log N) queries.
Moreover, an ad hoc data structure exists that makes the overall running time O(n + h log N) and the space complexity O(n).

Hence, Theorem 1 ensures that, for all c*, a time-efficient active learning algorithm exists whose number of queries is of the form log(1/P(c*)), provided P(c*) = 1/N(T) for all c*. This query bound is fully in line with well-known results on splittable version spaces [12, 25, 24], so we cannot make claims of originality. Yet, what is relevant here is that this splitting can be done very efficiently.

We complement the above result with a lower bound holding in expectation over prior distributions on c*. This lower bound depends in a detailed way on the structure of T. Given tree T, with set of leaves L, and cut c*, recall the definitions of AB(c*) and LB(c*) we gave in Section 2. Let T′_{c*} be the subtree of T whose nodes are (AB(c*) ∪ LB(c*)) \ L, and let K̃(T, c*) = |L(T′_{c*})| be the number of its leaves. For instance, in Figure 3 (left), T′_{c*} is made up of the six nodes i1, . . . , i6, so that K̃(T, c*) = 3, while in Figure 3 (right), T′_{c*} has nodes i1, . . . , in−1, hence K̃(T, c*) = 1. Notice that we always have K̃(T, c*) ≤ K, but for many trees T, K̃(T, c*) may be much smaller than K. A striking example is again provided by the cut in Figure 3 (right), where K̃(T, c*) = 1, but K = n. It is also helpful to introduce Ls(T), the set of all pairs of sibling leaves in T. For instance, in the tree of Figure 3 (left), we have |Ls(T)| = 3.
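Continuing with the toy encoding of trees as nested tuples and cuts as sets of lower-boundary paths (our own illustrative conventions, not the paper's code), K̃(T, c*) can be computed directly from its definition:

```python
# Compute K~(T, c*) = number of leaves of the subtree T'_{c*} spanned by
# (AB(c*) U LB(c*)) \ L, i.e., by the internal nodes of T that are not strictly
# below the cut. Conventions (illustrative): a leaf is a string, an internal
# node a pair; node paths are root-to-node '0'/'1' strings; the cut is given
# as the set of its lower-boundary paths.

def k_tilde(tree, lower_boundary):
    keep = set()                      # internal nodes in AB(c*) U LB(c*)

    def walk(node, path, strictly_below):
        if isinstance(node, tuple) and not strictly_below:
            keep.add(path)
        child_below = strictly_below or path in lower_boundary
        if isinstance(node, tuple):
            walk(node[0], path + "0", child_below)
            walk(node[1], path + "1", child_below)

    walk(tree, "", False)
    # leaves of T'_{c*}: kept nodes none of whose children were kept
    return sum(1 for p in keep if p + "0" not in keep and p + "1" not in keep)

# Line tree with the all-singletons cut (K = n = 4): T'_{c*} is a path, K~ = 1.
line = ((("x1", "x2"), "x3"), "x4")
```

On the line tree with the all-singletons cut, `k_tilde` returns 1 even though K = n, mirroring the example of Figure 3 (right).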
One can easily verify that, for all T, we have

max_{c*} K̃(T, c*) = |Ls(T)| ≤ log₂ N(T) .

We now show that there always exist families of prior distributions P(·) such that the expected number of queries needed to find c* is Ω(E[K̃(T, c*)]). The quantity E[K̃(T, c*)] is our notion of average (query) complexity. Since the lower bound holds in expectation, it also holds in the worst case. The proof can be found in Appendix B.2.

Theorem 2 In the noiseless realizable setting, for any tree T, any positive integer B ≤ |Ls(T)|, and any (possibly randomized) active learning algorithm A, there exists a prior distribution P(·) over c* such that the expected (over P(·) and A's internal randomization) number of queries A has to make in order to recover c* is lower bounded by B/2, while B ≤ E[K̃(T, c*)] ≤ 2B, the latter expectation being over P(·).

Next, we describe an algorithm that, unlike OTS, is indeed able to take advantage of the prior distribution, but it does so at the price of bounding the number of queries only in expectation.

WDP (Weighted Dichotomic Path): Recall prior distribution (1), collectively encoded through the values {p(i), i ∈ V}.

Figure 4: An example of input tree T before (left) and after (right) the first binary search of WDP. The green node is a dummy super-root. The nodes in yellow are the roots of the subtrees currently included in forest F. The numbers in red within each node i indicate the probabilities p(i), while the q(i) values are in blue, and viewed here as associated with edges (par(i), i). The magenta numbers at each leaf ℓ give the entropy H(π(ℓ, r(T′))), where r(T′) is the root of the subtree T′ ∈ F containing ℓ. Left: The input tree T at time t = 0. No labels are revealed, and no clusters of C(c*) have been found. Right: Tree T after a full binary search has been performed on the depicted light blue path. Before this binary search, that path connected a leaf of a subtree in F to its root (in this case, F contains only T). The selected path is the one maximizing entropy within the forest/tree on the left. The dashed line indicates the edge of c* found by the binary search. The red, blue and magenta numbers are updated according to the result of the binary search. The leaves enclosed in the grey ellipse are now known to form a cluster of C(c*).

As for OTS, we denote by F the forest made up of all maximal subtrees T′ of T such that |V(T′)| > 1 and for which none of their node labels have so far been revealed. F is updated over time, and initially contains only T. We denote by π(u, v) a bottom-up path in T having as terminal nodes u and v (hence v is an ancestor of u in T). For a given cut c*, and associated labels {y(i), i ∈ V}, any tree T′ ∈ F, and any node i ∈ V(T′), we define³

q(i) = P( y(i) = 1 ∧ y(par(i)) = 0 ) = p(i) · ∏_{j ∈ π(par(i), r(T′))} (1 − p(j)) .   (3)

We then associate with any backbone path of the form π(ℓ, r(T′)), where ℓ ∈ L(T′), the entropy H(π(ℓ, r(T′))) = −∑_{i ∈ π(ℓ, r(T′))} q(i) log₂ q(i). Notice that at the beginning we have ∑_{i ∈ π(ℓ, r(T))} q(i) = 1 for all ℓ ∈ L. This invariant will be maintained on all subtrees T′.
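Along a fixed bottom-up path, the q(i) values and the entropy H(π) just defined are simple to compute. A sketch (ours, not the paper's code), assuming the p(i) along the path are listed leaf-first, with p = 1 at the leaf end:

```python
import math

# q(i) = p(i) * prod of (1 - p(j)) over the strict ancestors j of i on the
# path pi(par(i), r(T')); with p = 1 at the bottom (leaf) end, the q's sum
# to 1, as the telescoping product shows. `ps` lists the p(i) along a
# bottom-up path, from the leaf up to r(T').

def path_q(ps):
    return [p * math.prod(1.0 - pj for pj in ps[k + 1:]) for k, p in enumerate(ps)]

def path_entropy(ps):
    # H(pi) = -sum_i q(i) log2 q(i), skipping q = 0 terms
    return -sum(q * math.log2(q) for q in path_q(ps) if q > 0.0)

qs = path_q([1.0, 0.5, 0.5])   # leaf, one internal node, root: q = [0.25, 0.25, 0.5]
```

For instance, with p = (1, 1/2, 1/2) along a 3-node path the q values are (1/4, 1/4, 1/2), which indeed sum to 1 and give entropy 1.5 bits.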
The prior probabilities p(i) will evolve during the algorithm's functioning into posterior probabilities based on the information revealed by the labels. Accordingly, the related values q(i), w.r.t. which the entropy H(·) is calculated, will also change over time.

Due to space limitations, WDP's pseudocode is given in Appendix B.3, but we have included an example of its execution in Figure 4. At each round, WDP finds the path whose entropy is maximized over all bottom-up paths π(ℓ, r′), with ℓ ∈ L and r′ = r(T′), where T′ is the subtree in F containing ℓ. WDP performs a binary search on such π(ℓ, r′) to find the edge of T′ which is cut by c*, taking into account the current values of q(i) over that path. Once a binary search terminates, WDP updates F and the probabilities p(i) at all nodes i in the subtrees of F. See Figure 4 for an example. Notice that the p(i) on the selected path become either 0 (if above the edge cut by c*) or 1 (if below). In turn, this causes updates on all probabilities q(i). WDP continues with the next binary search on the next maximum-entropy path at the current stage, discovering another edge cut by c*, and so on, until F becomes empty. Denote by P>0 the set of all priors P(·) such that for all cuts c of T we have P(c) > 0.
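To make the mechanics concrete, here is a minimal Python sketch of the quantities WDP maintains on a single leaf-to-root path: the q(i) values of (3), the path entropy, and a q-weighted binary search for the cut edge. This is an illustrative reimplementation under simplifying assumptions (the path is given as a list of priors from leaf to root, labels come from an oracle array), not the authors' code; the full algorithm additionally updates posteriors across the whole forest F after every search.

```python
import math

def path_q_values(p):
    """q values of eq. (3) along one bottom-up path.
    p[0] is the prior of the leaf, p[-1] that of the root; q[i] is the
    probability that node i is the topmost node labeled 1, i.e. that the
    cut c* crosses the edge (par(i), i)."""
    q = []
    for i, pi in enumerate(p):
        prod = 1.0
        for pj in p[i + 1:]:              # strict ancestors of i on the path
            prod *= (1.0 - pj)
        q.append(pi * prod)
    return q

def path_entropy(q):
    """H(pi) = -sum_i q(i) log2 q(i); WDP selects the max-entropy path."""
    return -sum(qi * math.log2(qi) for qi in q if qi > 0)

def pick_probe(q, lo, hi):
    """Index in (lo, hi] splitting the remaining q-mass roughly in half."""
    half = sum(q[lo:hi + 1]) / 2.0
    acc = 0.0
    for m in range(lo, hi + 1):
        acc += q[m]
        if acc >= half:
            return max(m, lo + 1)         # always make progress
    return hi

def weighted_binary_search(labels, q):
    """Locate the topmost node labeled 1 (labels are monotone 1...1 0...0
    from leaf to root); each query bisects the surviving posterior mass."""
    lo, hi = 0, len(labels) - 1           # y(leaf) = 1 is known a priori
    while lo < hi:
        m = pick_probe(q, lo, hi)
        if labels[m] == 1:                # cut edge is at or above node m
            lo = m
        else:
            hi = m - 1
    return lo
```

With the priors of the leftmost path of Figure 4 (leaf to root: 1, 0.9, 0.5, 0.25), the q values are 0.0375, 0.3375, 0.375, 0.25, summing to 1 as required by the invariant, and the entropy is ≈ 1.737, matching the magenta number in the figure.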
The proof of the following theorem is given in Appendix B.3.

Theorem 3 In the noiseless realizable setting, for any tree T of height h and any prior distribution P(·) over c* such that P(·) ∈ P>0, the expected number of queries made by WDP to find c* is O(E[K̃(T, c*)] log h), the expectations being over P(·).

For instance, in the line graph of Figure 3 (right), the expected number of queries is O(log n) for any prior P(·), while if T is a complete binary tree with n leaves, and we know that C(c*) has O(K) clusters, we can set p(i) in (1) as p(i) = 1/log K, which would guarantee E[K̃(T, c*)] = O(K), and a bound on the expected number of queries of the form O(K log log n). By comparison, observe that the results in [12, 15, 27] would give a query complexity which is at best O(K log² n), while those in [25, 24] yield at best O(K log n). In addition, we show below (Remark 1) that our algorithm has very compelling running time guarantees.

It is often the case that a linkage function generating T also tags each internal node i with a coherence level α_i of T(i), which is typically increasing as we move downwards from root to leaves. A common situation in hierarchical clustering is then to figure out the "right" level of granularity of the flat clustering we search for by defining parallel bands of nodes of similar coherence where c* is possibly located. For such cases, a slightly more involved guarantee for WDP is contained in Theorem 6 in Appendix B.3, where the query complexity depends in a more detailed way on the interplay between T and the prior P(·). In the above example, if we have b-many edge-disjoint bands, Theorem 6 replaces the factor log h of Theorem 3 by log b.

N-WDP (Noisy Weighted Dichotomic Path): This is a robust variant of WDP that copes with persistent noise. Whenever a label y(i) is requested, N-WDP determines its value by a majority vote over randomly selected pairs from L(left(i)) × L(right(i)). Due to space limitations, all details are contained in Appendix B.4. The next theorem quantifies N-WDP's performance in terms of a tradeoff between the expected number of queries and the distance to the noiseless ground-truth matrix Σ*.

Theorem 4 In the noisy realizable setting, given any input tree T of height h, any cut c* ∼ P(·) ∈ P>0, and any δ ∈ (0, 1/2), N-WDP outputs with probability ≥ 1 − δ (over the noise in the labels) a clustering Ĉ such that (1/n²) d_H(Σ*, Ĉ) = O((1/n) · (log(n/δ))^{3/2}/(1 − 2λ)³) by asking O((log(n/δ)/(1 − 2λ)²) E[K̃(T, c*)] log h) queries in expectation (over P(·)).

Remark 1 Compared to the query bound in Theorem 3, the one in Theorem 4 adds a factor due to noise. The very same extra factor is contained in the bound of [22]. Regarding the running time of WDP, the version we have described can be naively implemented to run in O(n E[K̃(T, c*)]) expected time overall. A more time-efficient variant of WDP exists for which Theorem 3 and Theorem 6 still hold, that requires O(n + h E[K̃(T, c*)]) expected time. Likewise, an efficient variant of N-WDP exists for which Theorem 4 holds, that takes O(n + (h + log² n/(1 − 2λ)²) E[K̃(T, c*)]) expected time.

³ For definiteness, we set y(par(r)) = 0, that is, we treat the parent of r(T) as a "dummy super-root" labeled 0 since time t = 0. Thus, according to this definition, q(r) = p(r).

4 Selective sampling in the non-realizable case

In the non-realizable case, we adapt to our clustering scenario the importance-weighted algorithm in [6]. The algorithm is a selective sampler that proceeds in a sequence of rounds t = 1, 2, .... In round t a pair (x_{i_t}, x_{j_t}) is drawn at random from distribution D over the entries of a given ground-truth matrix Σ, and the algorithm produces in response a probability value p_t = p_t(x_{i_t}, x_{j_t}). A Bernoulli variable Q_t ∈ {0, 1} is then generated with P(Q_t = 1) = p_t; if Q_t = 1 the label σ_t = σ(x_{i_t}, x_{j_t}) is queried and the algorithm updates its internal state, otherwise we skip to the next round. The way p_t is generated is described as follows. Given tree T, the algorithm maintains at each round t an importance-weighted empirical risk minimizer cut ĉ_t, defined as ĉ_t = argmin_c err_{t−1}(C(c)), where the "argmin" is over all cuts c realized by T, and

err_{t−1}(C) = (1/(t − 1)) Σ_{s=1}^{t−1} (Q_s/p_s) 1{σ_C(x_{i_s}, x_{j_s}) ≠ σ_s} ,

1{·} being the indicator function of the predicate at argument. This is paired up with a perturbed empirical risk minimizer

ĉ′_t = argmin_{c : σ_{C(c)}(x_{i_t}, x_{j_t}) ≠ σ_{C(ĉ_t)}(x_{i_t}, x_{j_t})} err_{t−1}(C(c)) ,

the "argmin" being over all cuts c realized by T that disagree with ĉ_t on the current pair (x_{i_t}, x_{j_t}). The value of p_t is a function of d_t = err_{t−1}(C(ĉ′_t)) − err_{t−1}(C(ĉ_t)), of the form

p_t = min{1, O(1/d_t² + 1/d_t) · log((N(T)/δ) log t)/t} ,   (4)

where N(T) is the total number of cuts realized by T (i.e., the size of our comparison class), and δ is the desired confidence parameter. Once stopped, say in round t_0, the algorithm outputs cut ĉ_{t_0+1} and the associated clustering C(ĉ_{t_0+1}). Let us call the resulting algorithm NR (Non-Realizable). Although N(T) can be exponential in n, there are very efficient ways of computing ĉ_t, ĉ′_t, and hence p_t at each round. In particular, an ad hoc procedure exists that incrementally computes these quantities by leveraging the sequential nature of NR. For a given T and constant K ≥ 1, consider the class C(T, K) of cuts inducing clusterings with at most K clusters.
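The sampling rule (4) is simple to state in code once d_t is available. The following is an illustrative sketch, not the authors' implementation: the constant `c0` standing in for the hidden O(·) factor, and the choice of natural logarithm, are assumptions.

```python
import math
import random

def query_probability(d_t, t, n_cuts, delta, c0=1.0):
    """Sampling probability of eq. (4):
    p_t = min{1, O(1/d_t^2 + 1/d_t) * log((N(T)/delta) * log t) / t}.
    n_cuts plays the role of N(T); c0 is the constant hidden in the O(.)."""
    if d_t <= 0.0:                       # perturbed ERM ties the ERM cut: query
        return 1.0
    conf = math.log((n_cuts / delta) * math.log(max(t, 2)))
    return min(1.0, c0 * (1.0 / d_t ** 2 + 1.0 / d_t) * conf / t)

def maybe_query(d_t, t, n_cuts, delta, rng=random):
    """Draw the Bernoulli coin Q_t with bias p_t; the label sigma_t is
    requested only when the coin comes up 1."""
    p_t = query_probability(d_t, t, n_cuts, delta)
    return rng.random() < p_t, p_t
```

When the empirical gap d_t is small, the two candidate cuts are hard to tell apart and p_t saturates at 1; as t grows with d_t bounded away from 0, p_t decays roughly as log t / t, which is what keeps the expected number of labels sublinear.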
Set R* = R*(T, D) = min_{c ∈ C(T,K)} P_{(x_i,x_j)∼D}(σ(x_i, x_j) ≠ σ_{C(c)}(x_i, x_j)), and B_δ(K, n) = K log n + log(1/δ). The following theorem is an adaptation of a result in [6]. See Appendix C.1 for a proof.

Theorem 5 Let T have n leaves and height h. Given confidence parameter δ, for any t ≥ 1, with probability at least 1 − δ, the excess risk (2) achieved by the clustering C(ĉ_{t+1}) computed by NR w.r.t. the best cut in class C(T, K) is bounded by O(√(B_δ(K, n) log t / t) + B_δ(K, n) log t / t), while the (expected) number of labels Σ_{s=1}^t p_s is bounded by O(θ (R* t + √(t B_δ(K, n) log t) + B_δ(K, n) log³ t)), where θ = θ(C(T, K), D) is the disagreement coefficient of C(T, K) w.r.t. distribution D. In particular, when D is uniform we have θ ≤ K.

Tree    Avg depth   Std. dev   BEST's error   BEST's K
SING    2950        1413.6     8.26%          4679
MED     186.4       41.8       8.51%          1603
COMP    17.1        3.3        8.81%          557

Table 1: Statistics of the trees used in our experiments. These trees result from applying the linkage functions SING, MED, and COMP to the MNIST dataset (first 10000 samples). Each tree has the same set of n = 10000 leaves. "Avg depth" is the average depth of the leaves in the tree, "Std. dev" is its standard deviation. For reference, we report the performance of BEST (i.e., the minimizer of d_H over all possible cuts realized by the trees), along with the associated number of clusters K.
Moreover, there exists a fast implementation of NR whose expected running time per round is E_{(x_i,x_j)∼D}[de(lca(x_i, x_j))] ≤ h, where de(lca(x_i, x_j)) is the depth in T of the lowest common ancestor of x_i and x_j.

5 Preliminary experiments

The goal of these experiments was to contrast active learning methods originating from the persistent noisy setting (specifically, N-WDP) to those originating from the non-realizable setting (specifically, NR). The comparison is carried out on the hierarchies produced by standard HC methods operating on the first n = 10000 datapoints in the well-known MNIST dataset from http://yann.lecun.com/exdb/mnist/, yielding a sample space of 10^8 pairs. We used Euclidean distance combined with the single linkage (SING), median linkage (MED), and complete linkage (COMP) functions. The n × n ground-truth matrix Σ is provided by the 10 class labels of MNIST.

We compared N-WDP with uniform prior and NR to two baselines: passive learning based on empirical risk minimization (ERM), and the active learning baseline performing breadth-first search from the root (BF, Section 3), made robust to noise as in N-WDP. For reference, we also computed for each of the three hierarchies the performance of the best cut in hindsight (BEST) on the entire matrix Σ. That is essentially the best one can hope for in each of the three cases. All algorithms except ERM are randomized and have a single parameter to tune. We let such parameters vary across suitable ranges and, for each algorithm, picked the best performing value on a validation set of 500 labeled pairs.

In Table 1, we have collected relevant statistics about the three hierarchies. In particular, the single linkage tree turned out to be very deep, while the complete linkage one is quite balanced. We evaluated test set error vs. number of queries after parameter tuning, excluding these 500 pairs.
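Hierarchies of this kind can be built with off-the-shelf tooling. The following schematic sketch uses SciPy with random vectors standing in for the 10000 flattened MNIST images (which would be loaded separately); it is a reproduction aid under those stated choices (Euclidean distance; single, median, complete linkage), not the pipeline used in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

# Stand-in for the first n MNIST datapoints (the experiments use n = 10000
# flattened 28x28 images); random vectors keep the sketch self-contained.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 784))

# Euclidean distance with single (SING), median (MED), complete (COMP) linkage.
trees = {name: linkage(X, method=name, metric="euclidean")
         for name in ("single", "median", "complete")}

# A flat clustering is a cut of the dendrogram, e.g. into K = 10 clusters:
flat = cut_tree(trees["complete"], n_clusters=10).ravel()
```

Each entry of `trees` is the standard (n − 1) × 4 linkage matrix encoding the dendrogram, from which any cut (and hence any flat clustering) can be extracted.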
For N-WDP, once a target number of queries was reached, we computed as current output the maximum-a-posteriori cut. In order to reduce variance, we repeated each experiment 10 times. The details of our empirical comparison are contained in Appendix C.3. Though our experiments are quite preliminary, some trends can be readily spotted (see Table 2 in Appendix C.3): i. N-WDP significantly outperforms NR. E.g., in COMP at 250 queries, the test set error of N-WDP is at 9.52%, while NR is at 10.1%. A similar performance gap at a low number of queries can be observed in SING and MED. This trend was expected: NR is very conservative, as it has been designed to work under more general conditions than N-WDP. We conjecture that, whenever the specific task at hand allows one to make an aggressive noise-free algorithm (like WDP) robust to persistent noise (as in N-WDP), this outcome is quite likely to occur. ii. BF is competitive only when BEST has few clusters. iii. N-WDP clearly outperforms ERM, while the comparison between NR and ERM yields mixed results.

Conclusions and ongoing activity. Beyond presenting new algorithms and analyses for pairwise similarity-based active learning, our goal was to put different approaches to active learning on the same footing for comparison on real data. The initial evidence emerging from our experiments is that active learning algorithms based on persistent noise can in practice be more effective than those making the more general non-realizable assumption. Notice that the similarities of the pairs of items have been generated by the MNIST class labels, hence they have virtually nothing to do with the trees we generated, which in turn do not rely on those labels at all. These initial trends suggested by our experiments clearly need a more thorough investigation. We are currently using other datasets, of different nature and size.
Further HC methods are also under consideration, like those based on k-means.

Acknowledgments

Fabio Vitale acknowledges support from the Google Focused Award "ALL4AI" and the ERC Starting Grant "DMAP 680153", awarded to the Department of Computer Science of Sapienza University.

References

[1] E. Arkin, H. Meijer, J. Mitchell, D. Rappaport, and S. Skiena. Decision trees for geometric models. In Proc. Symposium on Computational Geometry, pages 369–378, 1993.

[2] H. Ashtiani, S. Kushagra, and S. Ben-David. Clustering with same-cluster queries. In Proc. 30th NIPS, 2016.

[3] P. Awasthi, M. F. Balcan, and K. Voevodski. Local algorithms for interactive clustering. Journal of Machine Learning Research, 18, 2017.

[4] M. F. Balcan and A. Blum. Clustering with interactive feedback. In Proc. 19th International Conference on Algorithmic Learning Theory, pages 316–328, 2008.

[5] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Proc. ICML, pages 49–56. ACM, 2009.

[6] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In Proc. 23rd International Conference on Neural Information Processing Systems (NIPS), pages 199–207, 2010.

[7] Y. Chen, S. H. Hassani, A. Karbasi, and A. Krause. Sequential information maximization: When is greedy near-optimal? In Proc. 28th Conference on Learning Theory, PMLR 40, pages 338–363, 2015.

[8] Y. Chen, S. H. Hassani, and A. Krause. Near-optimal Bayesian active learning with correlated and noisy tests. In Proc. 20th International Conference on Artificial Intelligence and Statistics, 2017.

[9] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15:201–221, 1994.

[10] C. Cortes, G. DeSalvo, C. Gentile, M. Mohri, and N. Zhang.
Region-based active learning. In Proc. 22nd International Conference on Artificial Intelligence and Statistics, 2019.

[11] S. Dasgupta and D. Hsu. Hierarchical sampling for active learning. In Proc. 25th International Conference on Machine Learning, 2008.

[12] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems, pages 235–242, 2005.

[13] S. Davidson, S. Khanna, T. Milo, and S. Roy. Top-k and clustering with noisy comparisons. ACM Trans. Database Syst., 39(4):35:1–35:39, 2014.

[14] D. Firmani, B. Saha, and D. Srivastava. Online entity resolution using an oracle. Proc. VLDB Endow., 9(5):384–395, 2016.

[15] D. Golovin and A. Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. arXiv:1003.3967, 2017.

[16] A. Gonen, S. Sabato, and S. Shalev-Shwartz. Efficient active learning of halfspaces: An aggressive approach. Journal of Machine Learning Research, 14:2583–2615, 2013.

[17] S. Hanneke. A bound on the label complexity of agnostic active learning. In Proc. 24th International Conference on Machine Learning, pages 353–360, 2007.

[18] S. Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014.

[19] D. Harel and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338–355, 1984.

[20] S. Kosaraju, T. Przytycka, and R. Borgstrom. On an optimal split tree problem. In Proc. 6th International Workshop on Algorithms and Data Structures, pages 157–168, 1999.

[21] S. Kpotufe, R. Urner, and S. Ben-David. Hierarchical label queries with data-dependent partitions. In Proc. 28th Conference on Learning Theory, pages 1176–1189, 2015.

[22] A. Mazumdar and B. Saha.
Clustering with noisy queries. arXiv:1706.07510, 2017.

[23] M. Meila. Local equivalences of distances between clusterings—a geometric perspective. Machine Learning, 86(3):369–389, 2012.

[24] S. Mussmann and P. Liang. Generalized binary search for split-neighborly problems. In Proc. 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

[25] R. D. Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893–7906, 2011.

[26] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:846–850, 1971.

[27] C. Tosh and S. Dasgupta. Diameter-based active learning. In Proc. 34th International Conference on Machine Learning (ICML), 2017.

[28] N. Vesdapunt, K. Bellare, and N. Dalvi. Crowdsourcing algorithms for entity resolution. Proc. VLDB Endow., 7(12):1071–1082, 2014.

[29] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. Proc. VLDB Endow., 5(11):1483–1494, 2012.