{"title": "Active Information Retrieval", "book": "Advances in Neural Information Processing Systems", "page_first": 777, "page_last": 784, "abstract": null, "full_text": "Active Information Retrieval \n\nTommi Jaakkola \n\nMIT AI Lab \n\nCambridge, MA \ntommi@ai.mit.edu \n\nHava Siegelmann \n\nMIT LIDS \n\nCambridge, MA \nhava@mit.edu \n\nAbstract \n\nIn classical large information retrieval systems, the system responds \nto a user initiated query with a list of results ranked by relevance. \nThe users may further refine their query as needed. This process \nmay result in a lengthy correspondence without conclusion. We \npropose an alternative active learning approach, where the sys(cid:173)\ntem responds to the initial user's query by successively probing the \nuser for distinctions at multiple levels of abstraction. The system's \ninitiated queries are optimized for speedy recovery and the user \nis permitted to respond with multiple selections or may reject the \nquery. The information is in each case unambiguously incorporated \nby the system and the subsequent queries are adjusted to minimize \nthe need for further exchange. The system's initiated queries are \nsubject to resource constraints pertaining to the amount of infor(cid:173)\nmation that can be presented to the user per iteration. \n\n1 \n\nIntroduction \n\nAn IR system consists of a collection of documents and an engine that retrieves \ndocuments described by users queries. In large systems, such as the Web, queries \nare typically too vague, and hence, an iterative process in which the users refine their \nqueries gradually has to take place. Since much dissatisfaction of IR users stems \nfrom long, tedious repetitive search sessions, our research is targeted at shortening \nthe search session. We propose a new search paradigm of active information retrieval \nin which the user initiates only one query, and the subsequent iterative process is \nled by the engine/system. 
The active process exploits optimum experiment design to permit minimal effort on the part of the user. \n\nOur approach is related, but not identical, to the interactive search processes called relevance feedback. The primary differences pertain to the way in which the feedback is incorporated and queried from the user. In relevance feedback, the system has to deduce a set of \"features\" (words, phrases, etc.) that characterize the set of selected relevant documents, and use these features in formulating a new query (e.g., [5, 6]). In contrast, we cast the problem as one of estimation, and the goal is to recover the unknown document weights or relevance assessments. \n\nOur system also relates to the Scatter/Gather algorithm for browsing information systems [2], where the system initially scatters the document collection into a fixed number k of clusters whose summaries are presented to the user. The user selects clusters, which form a new sub-collection to be scattered again into k clusters, and so forth, until single documents are enumerated. In our approach, documents are not discarded but rather their weighting is updated appropriately. Like many other clustering methods, Scatter/Gather is based on hierarchical orderings. Overlapping clusters were recently proposed to better match real-life grouping and to allow natural summarizing and viewing [4]. \n\nThis short paper focuses on the underlying methodology of the active learning approach. \n\n2 Active search \n\nLet $X$ be the set of documents (elements) in the database and $C = \{C_1, ..., C_m\}$ a set of available clusters of documents for which appropriate summaries can be generated. The set of clusters typically includes individual documents and may come from a flat, hierarchical, or overlapping clustering method. The clustering need not be static, however, and could easily be defined dynamically in the search process. 
\n\nGiven the set of available clusters, we may choose a query set: a limited set of clusters that are presented to the user for selection at each iteration of the search process. The user is expected to choose the best matching cluster in this set or, alternatively, to annotate the clusters with relevant/irrelevant labels (select the relevant ones). We will address both modes of operation. \n\nThe active retrieval algorithm proceeds as follows: (1) it finds a small subset $S$ of clusters to present, along with their summaries, to the user; (2) waits until the user selects none, one, or more of the presented clusters; (3) uses the evidence from the user's selections to update the distribution over documents or relevance assessments; (4) outputs the top documents so far, ranked by their weights; and the iteration continues until terminated by the user or the system (based on any remaining uncertainty about the relevant documents or the implied ranking). \n\nThe following sections address three primary issues: the user model, how to incorporate the information from user selections, and how to optimize the query set presented to the user. All the algorithms should scale linearly with the database size (and the size of the query set). \n\n3 Contrastive selection model \n\nWe start with a contrastive selection model where the user is expected to choose only the best matching cluster in the query set. In case of multiple selections, we will interpret the marked clusters as a single redefined cluster of the query set. While this interpretation will result in sub-optimal choices for the query set if the user consistently selects multiple clusters, it nevertheless obviates the need for modeling the user's selection biases in this regard. An empty selection, on the other hand, suggests that the clusters outside the query set are deemed more likely. 
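As a concrete illustration of steps (1)-(4), the overall loop can be sketched as follows. This is our own schematic, not an implementation from the paper: the callables `select_query_set`, `user_feedback`, and `update_weights` are hypothetical stand-ins for the components developed in the later sections.

```python
# Skeleton of the active retrieval loop: probe the user, fold the answer
# into the document weights, and re-rank. All names are illustrative.

def active_retrieval(weights, clusters, user_feedback, select_query_set,
                     update_weights, max_iters=10, top_n=5):
    """weights: dict mapping document id -> current weight/relevance belief."""
    for _ in range(max_iters):
        query_set = select_query_set(weights, clusters)       # step (1)
        answer = user_feedback(query_set)                     # step (2): none, one, or more clusters
        if answer == "stop":                                  # user terminates the session
            break
        weights = update_weights(weights, query_set, answer)  # step (3)
    ranking = sorted(weights, key=weights.get, reverse=True)  # step (4): rank by weight
    return ranking[:top_n]
```

Each iteration costs whatever the plugged-in components cost; the paper's point is that all three can be made linear in the database size.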
\n\nFigure 1: a) A three-level hierarchical transform of a flat Dirichlet; b) dependence of mean retrieval time on the database size (log-scale); c) median ratio of retrieval times corresponding to doubling the query set size. \n\nTo capture the ranking implied by the user selections, we define weights (a distribution) $\{\theta_x\}$, $\sum_{x \in X} \theta_x = 1$, over the underlying documents. We assume that the user behavior is (probabilistically) consistent with one such weighting $\theta^*$. The goal of a retrieval algorithm is therefore to recover this underlying weighting through interactions with the user. The resulting (approximation to) $\theta^*$ can be used to correctly rank the documents or, for example, to display all the documents with sufficiently large weight (cf. coverage). Naturally, $\theta^*$ changes from one retrieval task to another and has to be inferred separately in each task. We might estimate a user-specific prior (model) over the document weights to reflect consistent biases that different users have across retrieval tasks. \n\nWe express our prior belief about the document weights in terms of a Dirichlet distribution: $P(\theta) = (1/Z) \prod_{x \in X} \theta_x^{\alpha_x - 1}$, where $Z = [\prod_{x \in X} \Gamma(\alpha_x)] / \Gamma(\sum_{x \in X} \alpha_x)$. \n\n3.1 Inference \n\nSuppose a flat Dirichlet distribution $P(\theta)$ over the document weights and a fixed query set $S = \{C_{s_1}, ..., C_{s_k}\}$. We evaluate here the posterior distribution $P(\theta|y)$ given the user response $y$. The key is to transform $P(\theta)$ into a hierarchical form so as to explicate the portion of the distribution potentially affected by the user response. The hierarchy, illustrated in Figure 1a), contains three levels: the selection of $S$ or $X \setminus S$; the choices within the query set $S$ (of most interest to us) and those under $X \setminus S$; and the selections within the clusters $C_{s_j}$ in $S$. 
For simplicity, the clusters are assumed to be either nested or disjoint, i.e., they can be organized hierarchically. \n\nWe use $\theta^{(1)}_i$, $i = 1, 2$ to denote the top-level parameters, $\theta^{(2)}_{1j}$, $j = 1, ..., k$ for the cluster choices within the query set, whereas $\theta^{(2)}_{2x}$, $x \notin S$ gives the document choices outside $S$. Finally, $\theta^{(3)}_{jx}$ for $x \in C_{s_j}$ indicates the parameters associated with the cluster $C_{s_j} \in S$. The original flat Dirichlet $P(\theta)$ can be written as a product $P(\theta^{(1)}) P(\theta^{(2)}_{1}) P(\theta^{(2)}_{2}) [\prod_{j=1}^{k} P(\theta^{(3)}_{j})]$ with the appropriate normalization constraints. If clusters in $S$ overlap, the expansion is carried out in terms of the disjoint subsets. The parameters governing the Dirichlet component distributions are readily obtained by gathering the appropriate parameters $\alpha_x$ of the original Dirichlet (cf. [3]). For example, $\alpha^{(1)}_1 = \sum_{x \in S} \alpha_x$; $\alpha^{(2)}_{1j} = \sum_{x \in C_{s_j}} \alpha_x$ for $j = 1, ..., k$; $\alpha^{(2)}_{2x} = \alpha_x$ for $x \notin S$; and $\alpha^{(3)}_{jx} = \alpha_x$ whenever $x \in C_{s_j}$, $j = 1, ..., k$. \n\nIf the user selects cluster $C_{s_y}$, we update $P(\theta^{(2)}_{1})$, which reduces to adjusting the counts $\alpha^{(2)}_{1y} \leftarrow \alpha^{(2)}_{1y} + 1$. The resulting new parameters give rise to the posterior distribution $P(\theta^{(2)}_{1}|y)$ and, by including the other components, to the overall posterior $P(\theta|y)$. If the user selects \"none of these items,\" only the first-level parameters $\theta^{(1)}$ will be updated. \n\n3.2 Query set optimization \n\nOur optimization criterion for choosing the query set $S$ is the information that we stand to gain from querying the user with it. Let $y$ indicate the user choice; the mutual information between $y$ and the parameters $\theta$ is given by (derivation omitted) \n\n$I(y; \theta) = \sum_{y=1}^{k} P(y) [\Psi(\alpha^{(2)}_{1y} + 1) - \Psi(\alpha_S + 1)] + H(y)$ \n\n(1-2) \n\nwhere $P(y) = \alpha^{(2)}_{1y} / (\sum_{i=1}^{k} \alpha^{(2)}_{1i})$ defines our current expectation about the user selection from $S$; $H(y) = -\sum_{y=1}^{k} P(y) \log P(y)$ is the entropy of the selections $y$; $\alpha_S = \sum_{x \in S} \alpha_x$; and $\Psi(\cdot)$ is the digamma function, defined as $\Psi(z) = d/dz \log \Gamma(z)$. Extending the criterion to \"no selection\" is trivial. 
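The criterion can be evaluated directly from the aggregated cluster counts. The following sketch is our own illustration (function names are ours, not the paper's): it computes $\sum_y P(y)[\Psi(\alpha_y + 1) - \Psi(\alpha_S + 1)] + H(y)$ for a candidate query set, with a standard stdlib-only digamma approximation (recurrence plus asymptotic series), and greedily grows the query set as described in the next section.

```python
import math

def digamma(z):
    """Psi(z) = d/dz log Gamma(z), via the recurrence Psi(z) = Psi(z+1) - 1/z
    and the asymptotic series for large z (accurate to ~1e-9 here)."""
    r = 0.0
    while z < 6.0:
        r -= 1.0 / z
        z += 1.0
    f = 1.0 / (z * z)
    return r + math.log(z) - 0.5 / z - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def query_set_information(alpha_clusters):
    """Mutual information I(y; theta) for a query set, given the per-cluster
    aggregated Dirichlet counts alpha_clusters[j] = sum of alpha_x over C_{s_j}."""
    a_S = sum(alpha_clusters)
    info = 0.0
    for a_y in alpha_clusters:
        p = a_y / a_S
        info += p * (digamma(a_y + 1.0) - digamma(a_S + 1.0))  # expected log-gain term
        info -= p * math.log(p)                                 # entropy H(y) term
    return info

def greedy_query_set(cluster_weights, k):
    """Greedily add the cluster that maximizes I(y; theta). Note any single
    cluster yields zero information, so the first pick is tie-broken arbitrarily."""
    remaining = dict(enumerate(cluster_weights))
    chosen = []
    while len(chosen) < k and remaining:
        best = max(remaining,
                   key=lambda i: query_set_information(
                       [cluster_weights[j] for j in chosen] + [remaining[i]]))
        chosen.append(best)
        del remaining[best]
    return chosen
```

A sanity check on the formula: a query set with a single cluster gives exactly zero information (the "selection" is forced), and for equally weighted clusters the criterion grows with the query set size, as expected.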
\n\nTo simplify, we expand the counts $\alpha^{(2)}_{1j}$ in terms of the original (flat) counts $\alpha_x$, and define for all clusters (whether or not they appear in the query set) the weights $a_i = \sum_{x \in C_i} \alpha_x$, $b_i = a_i \Psi(a_i + 1) - a_i \log a_i$. The mutual information criterion now depends only on $\alpha_S = \sum_{j=1}^{k} a_{s_j} = \sum_{x \in S} \alpha_x$, the overall weight of the query set, and $b_S = \sum_{j=1}^{k} b_{s_j}$, which provides an overall measure of how informative the individual clusters in $S$ are. With these changes, we obtain: \n\n$I(y; \theta) = b_S / \alpha_S + \log(\alpha_S) - \Psi(\alpha_S + 1)$ \n\n(3) \n\nWe can optimize the choice of $S$ with a simple greedy method that successively finds the next best cluster index $i$ to include in the information set. This algorithm scales as $O(km)$, where $m$ is the number of clusters in our database and $k$ is the maximal query set size in terms of the number of clusters. \n\nNote that this simple criterion excludes nested or overlapping clusters from $S$. In a more general context, the bookkeeping problem associated with the overlapping clusters is analogous to that of the Kikuchi expansion in statistical physics (cf. [7]). \n\n3.3 Projection back to a flat Dirichlet \n\nThe hierarchical posterior is no longer a flat Dirichlet. To maintain simplicity, we project it back into a flat Dirichlet in the KL-divergence sense: $\hat{P}_{\theta|y} = \arg\min_{Q_\theta} KL(P_{\theta|y} \| Q_\theta)$, where $P(\theta|y)$ is the hierarchical posterior expressed in terms of the original flat variables $\theta_x$, $x \in X$ (but no longer a flat Dirichlet). The transformation from hierarchical to flat variables is given by $\theta_x = \theta^{(1)}_{1} \theta^{(2)}_{1j} \theta^{(3)}_{jx}$ for $x \in C_{s_j}$, $j = 1, ..., k$, and $\theta_x = \theta^{(1)}_{2} \theta^{(2)}_{2x}$ for $x \in X \setminus S$. As a result, when $x \in C_{s_j}$ for some $j = 1, ..., k$ we get (derivation omitted) \n\n$E_{\theta|y} \log \theta_x = \Psi(\alpha_x) - \Psi(\alpha^{(2)}_{1j}) + \Psi(\alpha^{(2)}_{1j} + \delta_{jy}) - \Psi(\alpha_S + 1) + \Psi(\alpha_S) - \Psi(\sum_{z \in X} \alpha_z)$ \n\n(4) \n\nwhere $y$ denotes the user selection and $\delta_{jy} = 1$ if $j = y$ and $0$ otherwise. 
For $x \in X \setminus S$, $E_{\theta|y} \log \theta_x = \Psi(\alpha_x) - \Psi(\sum_{z \in X} \alpha_z)$. If we define $r_x = E_{\theta|y} \log \theta_x$ for all $x \in X$, then the counts $\beta_x$ corresponding to the flat approximation $Q_\theta$ can be found by minimizing \n\n$F(\beta) = \sum_{x \in X} \log \Gamma(\beta_x) - \log \Gamma(\sum_{x \in X} \beta_x) - \sum_{x \in X} \beta_x r_x$ \n\n(5) \n\nwhere we have omitted any terms not depending on $\beta_x$. This is a strictly convex optimization problem over the convex set $\beta_x \ge 0$, $x \in X$, and therefore admits a unique solution. Furthermore, we can efficiently apply second-order methods such as Newton-Raphson in this context due to the specific structure of the Hessian: $H = D - c \mathbf{1} \mathbf{1}^T$, where $D$ is a diagonal matrix containing the derivatives of the digamma function(1), $\Psi'(\beta_x) = d/d\beta_x \Psi(\beta_x)$, and $c = \Psi'(\sum_{x \in X} \beta_x)$. Each Newton-Raphson iteration requires only $O(m)$ space/time. \n\n3.4 Decreasing entropy \n\nSince the query set was chosen to maximize the mutual information between the user selection and the parameters $\theta$, we get the maximal reduction in the expected entropy of the parameters: $I(y; \theta) = H(P_\theta) - E_y H(P_{\theta|y})$. \n\nAs discussed in the previous section, we cannot maintain the true posterior but have to settle for a projection. It is therefore no longer obvious that the expected entropy of the projected posterior possesses any analogous guarantees; indeed, projections of this type typically increase the entropy. We can easily show, however, that the expected entropy is non-increasing: \n\n$E_y H(\hat{P}_{\theta|y}) = E_y [H(P_{\theta|y}) + KL(P_{\theta|y} \| \hat{P}_{\theta|y})] \le E_y [H(P_{\theta|y}) + KL(P_{\theta|y} \| P_\theta)] = H(P_\theta)$ \n\n(6) \n\nsince $\hat{P}_{\theta|y}$ is the minimizing argument. It is possible to make a stronger statement indicating that the expected entropy of the projected distribution decreases monotonically after each iteration. \n\nTheorem 1 For any $\epsilon > 0$, $E_y \{H(Q_{\theta|y})\} \le H(P_\theta) - \epsilon (k - 1)/\alpha_S + O(\epsilon^2)$, where $k$ is the size of the query set and $\alpha_S = \sum_{z \in S} \alpha_z$. \n\nWhile this result is not tight, it does demonstrate that the projection back into a flat Dirichlet still permits a semi-linear decrease in the entropy(2). 
The denominator of the first-order term, i.e., $\alpha_S$, can increase only by 1 at each iteration. \n\n(1) These derivatives can be evaluated efficiently on the basis of the highly accurate approximation to the digamma function. \n\n(2) Note that the entropy of a Dirichlet distribution is not bounded from below (it is bounded from above). The manner in which the Dirichlet updates are carried out (how the $\alpha_x$ change) still keeps the entropy a meaningful quantity. \n\n4 Annotation model \n\nThe contrastive selection approach discussed above operates a priori in a single-topic mode(3). The expectation that the user should select the best matching cluster in the query set also makes an inefficient use of the query set. We provide here an analogous development of the active learning approach under the assumption that the user classifies rather than contrasts the clusters. \n\nThe user responses are now assumed to be consistent with a noisy-OR model \n\n$P(y_c = 1 | r^*) = 1 - (1 - q_0) \prod_{x \in c} (1 - q r^*_x)$ \n\n(7) \n\nwhere $y_c$ is the binary relevance annotation (outcome) for a cluster $c$ in the query, $r^*_x \in \{0, 1\}$, $x \in X$ are the underlying task-specific relevance assignments to the elements in the database, $q$ is the probability that a relevant element in the cluster is caught by the user, and $q_0$ is the probability that a cluster is deemed \"relevant\" in the absence of any relevant elements. While the parameters $q_0$ and $q$ could easily be inferred from past searches, we assume here for simplicity that they are known to the search algorithm. The user annotations of different clusters in the query set are independent of each other, even for overlapping clusters. \n\nTo ensure that we can infer the unknown relevance assignments from the observables (cluster annotations), we require identifiability: the annotation probabilities $P(y_c = 1|r^*)$, for all $c \in C$, should uniquely determine $\{r^*_x\}$. 
Equivalently, knowing the number of relevant documents in each cluster should enable us to recover the underlying relevance assignments. This is a property of the cluster structure and holds trivially for any clustering with access to individual elements. \n\nThe search algorithm maintains a simple independent Bernoulli model over the unknown relevance assignments: $P(r|\theta) = \prod_{x \in X} \theta_x^{r_x} (1 - \theta_x)^{1 - r_x}$. This gives rise to a marginal noisy-OR model over cluster annotations: \n\n$P(y_c = 1|\theta) = \sum_r P(y_c = 1|r) P(r|\theta) = 1 - (1 - q_0) \prod_{x \in c} (1 - \theta_x q)$ \n\n(8) \n\nThe uncertainty about the relevance assignments $\{r_x\}$ makes the system's beliefs about the cluster annotations dependent on each other. The parameters (relevance probabilities) $\{\theta_x\}$ are, of course, specific to each search task. \n\n4.1 Inference and projection \n\nGiven $y_c \in \{0, 1\}$ for a single cluster $c$, we can evaluate the posterior $P(r|y_c, \theta)$ over the relevance assignments. Similarly to noisy-OR graphical models, this posterior can be (exponentially) costly to maintain, and we instead sequentially project the posterior back into the set of independent Bernoulli distributions. The projection here is in the moments sense (m-projection): $\hat{P}_{r|y_c} = \arg\min_{Q_r} KL(P_{r|y_c,\theta} \| Q_r)$, where $Q_r$ is an independent Bernoulli model. The m-projection preserves the posterior expectations $\theta_{x; y_c} = E_{r|y_c}\{r_x\}$ used for ranking the documents. \n\n(3) Dynamic redefinition of clusters partially avoids this problem. \n\nThe projection yields simple element-wise updates for the parameters(4): for $x \in c$, \n\n$\theta_{x; y_c=0} = \theta_x (1 - q) / (1 - \theta_x q)$, $\quad \theta_{x; y_c=1} = \theta_x [1 - (1 - q) p_0 / (1 - \theta_x q)] / (1 - p_0)$ \n\n(9) \n\nwhere $p_0 = P(y_c = 0|\theta) = (1 - q_0) \prod_{x \in c} (1 - \theta_x q)$ is the only parameter that depends on the cluster as a whole. 
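The marginal noisy-OR model and the element-wise posterior updates can be sketched numerically as follows. This is our own illustration (names are ours): it evaluates $P(y_c = 1|\theta)$ and the posterior means of $r_x$ for both possible annotations of a single cluster.

```python
# Marginal noisy-OR over a cluster annotation, and the posterior-mean updates
# for the Bernoulli relevance parameters theta_x of the annotated cluster.

def p_relevant(theta_c, q, q0):
    """P(y_c = 1 | theta) = 1 - (1 - q0) * prod_{x in c} (1 - theta_x * q)."""
    prod = 1.0
    for t in theta_c:
        prod *= 1.0 - t * q
    return 1.0 - (1.0 - q0) * prod

def update_cluster(theta_c, y_c, q, q0):
    """Posterior mean E[r_x | y_c] for each x in the annotated cluster c.
    p0 is the only quantity that depends on the cluster as a whole."""
    p0 = 1.0 - p_relevant(theta_c, q, q0)  # P(y_c = 0 | theta)
    if y_c == 0:
        # an "irrelevant" label shrinks each theta_x
        return [t * (1.0 - q) / (1.0 - t * q) for t in theta_c]
    # a "relevant" label boosts each theta_x
    return [t * (1.0 - (1.0 - q) * p0 / (1.0 - t * q)) / (1.0 - p0)
            for t in theta_c]
```

A useful check on these updates is the martingale property from the footnotes: averaging the two hypothetical outcomes with weights $p_0$ and $1 - p_0$ recovers the prior $\theta_x$ exactly.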
\n\n4.2 Query set optimization \n\nThe best single cluster $c \in C$ to query has the highest mutual information between the expected user response $y_c \in \{0, 1\}$ and the underlying relevance assignments $r = \{r_x\}_{x \in X}$, maximizing $I(y_c; r|\theta) = E_{y_c} \{KL(P_{r|\theta, y_c} \| P_{r|\theta})\}$. This mutual information cannot be evaluated in closed form, however. We use a lower bound: \n\n$I(y_c; r|\theta) \ge E_{y_c} \{ \sum_{x \in c} KL(\theta_{x; y_c} \| \theta_x) \} \stackrel{\mathrm{def}}{=} I_p(y_c; r|\theta)$ \n\n(10) \n\nwhere $\theta_{x; y_c}$, $x \in X$ are the parameters of the m-projected posterior and $KL(\theta_{x; y_c} \| \theta_x)$ is the KL-divergence between two Bernoulli distributions with mean parameters $\theta_{x; y_c}$ and $\theta_x$, respectively. \n\nTo alleviate the concern that the lower bound would prematurely terminate the search, we note that if $I_p(y_c; r|\theta) = 0$ for all $c \in C$, then $\theta_x \in \{0, 1\}$ for all $x \in X$. In other words, the search terminates only if we are already fully certain about the underlying relevance assignments. \n\nThe best $k$ clusters to query are those maximizing the analogous lower bound on the joint mutual information $I_p(y_{c_1}, \ldots, y_{c_k}; r|\theta)$. Finding the optimal query set under this criterion (even with the m-projections) involves $O(n k 2^k)$ operations. We select the clusters sequentially while maintaining an explicit dependence on the hypothetical outcome (classification) of only the previous cluster choice. More precisely, we combine the cluster selection with conditional projections: for $k > 1$, $c_k = \arg\max_c I_p(y_c, y_{c_{k-1}}; r|\theta^{k-1})$, $\theta^{k}_{x; y_{c_k}} = E\{\theta^{k-1}_{x; y_{c_{k-1}}, y_{c_k}} | y_{c_k}\}$. The mutual information terms do not, however, decompose additively with the elements in the clusters. The desired $O(kn)$ scaling of the selection algorithm requires a cached spline reconstruction(5). \n\n4.3 Sanity check results \n\nFigure 1b) gives the mean number of iterations of the query process as a function of the database size. Each point represents an average over 20 runs with parameters \n\n(4) The parameters $\theta_{x; y_{c_1}, y_{c_2}, \ldots
, y_{c_k}}$ resulting from $k$ successive projections define a martingale process: $E_{y_{c_1}, y_{c_2}, \ldots, y_{c_k}} \{\theta_{x; y_{c_1}, y_{c_2}, \ldots, y_{c_k}}\} = \theta_x$, $x \in X$, where the expectation is taken w.r.t. the posterior approximation. \n\n(5) The mutual information terms for select fixed values of $p_0$ can be cached additively relative to the cluster structure. The actual $p_0$ dependence is reconstructed (quadratically) from the cached values ($I_p$ is convex in $p_0$). \n\n$k = 5$, $q_0 = 0.05$, and $q = 0.95$. The user responses were selected on the basis of the same parameters and a randomly chosen (single) underlying element of interest. The search is terminated when the sought-after element in the database has the highest rank according to $\{\theta_x\}$, $x \in X$. The randomized cluster structures were relatively balanced and hierarchical. Similarly to the theoretically optimal system, the performance scales linearly with the log-database size. Results for random choice of the clusters in the query are far outside the figure. \n\nFigure 1c), on the other hand, demonstrates that increasing the query set size appropriately reduces the interaction time. Note that since all the clusters in the query set have to be chosen prior to getting feedback from any of the clusters, doubling the query set size cannot theoretically reduce the retrieval time to a half. \n\n5 Discussion \n\nThe active learning approach proposed here provides the basic methodology for optimally querying the user at multiple levels of abstraction. There are a number of extensions to the approach presented in this short paper. For example, we can encourage the user to provide confidence-rated selections/annotations among the presented clusters. Both user models can be adapted to handle such selections. Analyzing the fundamental trade-offs between the size of the query set (resource constraints) and the expected completion time of the retrieval process will also be addressed in later work. 
\n\nReferences \n\n[1] A. C. Atkinson and A. N. Donev, Optimum Experimental Designs, Clarendon Press, 1992. \n\n[2] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey, Scatter/Gather: A Cluster-Based Approach to Browsing Document Collections, In Proceedings of the Fifteenth Annual International ACM SIGIR Conference, Denmark, June 1992. \n\n[3] D. Heckerman, D. Geiger, and D. M. Chickering, Learning Bayesian Networks: The Combination of Knowledge and Statistical Data, Machine Learning, Vol. 20, 1995. \n\n[4] H. Lipson and H. T. Siegelmann, Geometric Neurons for Clustering, Neural Computation 12(10), August 2000. \n\n[5] J. J. Rocchio, Jr., Relevance Feedback in Information Retrieval, In The SMART System: Experiments in Automatic Document Processing, 313-323, Englewood Cliffs, NJ: Prentice Hall. \n\n[6] G. Salton and C. Buckley, Improving Retrieval Performance by Relevance Feedback, Journal of the American Society for Information Science, 41(4): 288-297, 1990. \n\n[7] J. S. Yedidia, W. T. Freeman, and Y. Weiss, Generalized Belief Propagation, Neural Information Processing Systems 13, 2001. \n\n", "award": [], "sourceid": 1954, "authors": [{"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}, {"given_name": "Hava", "family_name": "Siegelmann", "institution": null}]}