{"title": "Learning to Order Things", "book": "Advances in Neural Information Processing Systems", "page_first": 451, "page_last": 457, "abstract": "", "full_text": "Learning to Order Things \n\nWilliam W.  Cohen  Robert E. Schapire  Yoram Singer \n\nAT&T Labs, 180 Park Ave., Florham Park, NJ 07932 \n\n{ wcohen,schapire,singer} @research.att.com \n\nAbstract \n\nThere are many applications in which it is  desirable to order rather than classify \ninstances.  Here we consider the problem of learning how to order, given feedback \nin the form of preference judgments, i.e., statements to the effect that one instance \nshould be ranked ahead of another.  We outline a two-stage approach in which one \nfirst learns by conventional means a preference Junction, of the form PREF( u, v), \nwhich  indicates  whether  it is advisable  to  rank  u  before  v.  New  instances  are \nthen  ordered  so  as  to  maximize  agreements  with  the  learned  preference  func(cid:173)\ntion.  We  show  that  the  problem  of finding  the  ordering  that  agrees  best  with \na preference function  is  NP-complete,  even under very  restrictive assumptions. \nNevertheless, we describe a simple greedy algorithm that is guaranteed to find  a \ngood approximation.  We then discuss an on-line learning algorithm, based on the \n\"Hedge\" algorithm,  for  finding  a good linear combination of ranking \"experts.\" \nWe use the ordering algorithm combined with the on-line learning algorithm to \nfind a combination of \"search experts,\" each of which is a domain-specific query \nexpansion strategy for a WWW search engine,  and present experimental  results \nthat demonstrate the merits of our approach. \n\n1  Introduction \n\nMost previous work in inductive learning has concentrated on learning to classify.  However, \nthere are many applications in which it is  desirable to  order rather than classify  instances. \nAn  example might be  a personalized email filter  that gives  a priority  ordering  to  unread \nmail.  Here we will consider the problem of learning how to construct such orderings, given \nfeedback in the form of preference judgments, i.e., statements that one instance should be \nranked ahead of another. \n\nSuch  orderings  could  be  constructed  based  on  a  learned  classifier  or  regression  model, \nand in  fact often are.  For instance, it is  common practice in  information retrieval to rank \ndocuments  according  to  their  estimated  probability  of relevance  to  a  query  based  on  a \nlearned classifier for the concept \"relevant document.\"  An advantage of learning orderings \ndirectly is that preference judgments can be much easier to obtain than the labels required \nfor classification learning. \n\nFor  instance,  in  the  email  application  mentioned  above,  one  approach  might  be  to  rank \nmessages according to their estimated probability of membership in  the class of \"urgent\" \nmessages,  or by  some  numerical  estimate  of urgency  obtained  by  regression.  Suppose, \nhowever, that a user is presented with an ordered list of email messages, and elects to read \nthe third message first.  Given this election, it is not necessarily the case that message three \nis  urgent, nor is there sufficient information to estimate any  numerical urgency measures; \nhowever,  it  seems quite  reasonable  to  infer that  message  three  should  have  been  ranked \nahead of the others.  
Thus, in this setting, obtaining preference information may be easier and more natural than obtaining the information needed for classification or regression.

In the remainder of this paper, we will investigate the following two-stage approach to learning how to order. In stage one, we learn a preference function, a two-argument function PREF(u, v) which returns a numerical measure of how certain it is that u should be ranked before v. In stage two, we use the learned preference function to order a set of new instances U; to accomplish this, we evaluate the learned function PREF(u, v) on all pairs of instances u, v ∈ U, and choose an ordering of U that agrees, as much as possible, with these pairwise preference judgments. This general approach is novel; for related work in various fields see, for instance, references [2, 3, 1, 7, 10].

As we will see, given an appropriate feature set, learning a preference function can be reduced to a fairly conventional classification learning problem. On the other hand, finding a total order that agrees best with a preference function is NP-complete. Nevertheless, we show that there is an efficient greedy algorithm that always finds a good approximation to the best ordering. After presenting these results on the complexity of ordering instances using a preference function, we then describe a specific algorithm for learning a preference function. The algorithm is an on-line weight allocation algorithm, much like the weighted majority algorithm [9] and Winnow [8], and, more directly, Freund and Schapire's [4] "Hedge" algorithm. We then present some experimental results in which this algorithm is used to combine the results of several "search experts," each of which is a domain-specific query expansion strategy for a WWW search engine.

2  Preliminaries

Let X be a set of instances (possibly infinite). A preference function PREF is a binary function PREF : X × X → [0, 1]. A value of PREF(u, v) which is close to 1 (respectively 0) is interpreted as a strong recommendation that u should be ranked before (respectively after) v. A value close to 1/2 is interpreted as an abstention from making a recommendation. As noted above, the hypothesis of our learning system will be a preference function, and new instances will be ranked so as to agree as much as possible with the preferences predicted by this hypothesis.

In standard classification learning, a hypothesis is constructed by combining primitive features. Similarly, in this paper, a preference function will be a combination of other preference functions. In particular, we will typically assume the availability of a set of N primitive preference functions R_1, ..., R_N. These can then be combined in the usual ways, e.g., with a boolean or linear combination of their values; we will be especially interested in the latter combination method.

It is convenient to assume that the R_i's are well-formed in certain ways. To this end, we introduce a special kind of preference function called a rank ordering. Let S be a totally ordered set (that is, for all pairs of distinct elements s_1, s_2 ∈ S, either s_1 < s_2 or s_1 > s_2) with '>' as the comparison operator. An ordering function into S is a function f : X → S. The function f induces the preference function R_f, defined as

    R_f(u, v) = 1    if f(u) > f(v)
    R_f(u, v) = 0    if f(u) < f(v)
    R_f(u, v) = 1/2  otherwise.

We call R_f a rank ordering for X into S. If R_f(u, v) = 1, then we say that u is preferred to v, or u is ranked higher than v.

It is sometimes convenient to allow an ordering function to "abstain" and not give a preference for a pair u, v. Let ⊥ be a special symbol not in S, and let f be a function into S ∪ {⊥}. We will interpret the mapping f(u) = ⊥ to mean that u is "unranked," and let R_f(u, v) = 1/2 if either u or v is unranked.
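As a concrete illustration, here is a short Python sketch (ours, not part of the original paper) of a rank ordering induced by an ordering function, with None standing in for the special symbol ⊥:

def rank_ordering(f):
    """Return the preference function R_f induced by ordering function f.

    f maps an instance to a comparable score, or to None if the
    instance is unranked.
    """
    def R_f(u, v):
        fu, fv = f(u), f(v)
        if fu is None or fv is None:  # u or v is unranked: abstain
            return 0.5
        if fu > fv:
            return 1.0                # u is preferred to v
        if fu < fv:
            return 0.0                # v is preferred to u
        return 0.5                    # f(u) = f(v): abstain
    return R_f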
To give concrete examples of rank orderings, imagine learning to order documents based on the words that they contain. To model this, let X be the set of all documents in a repository, and for N words w_1, ..., w_N, let f_i(u) be the number of occurrences of w_i in u. Then R_{f_i} will prefer u to v whenever w_i occurs more often in u than in v. As a second example, consider a meta-search application in which the goal is to combine the rankings of several WWW search engines. For N search engines e_1, ..., e_N, one might define f_i so that R_{f_i} prefers u to v whenever u is ranked ahead of v in the list L_i produced by the corresponding search engine. To do this, one could let f_i(u) = -k for the document u appearing in the k-th position in the list L_i, and let f_i(u) = ⊥ for any document not appearing in L_i.

3  Ordering instances with a preference function

We now consider the complexity of finding the total order that agrees best with a learned preference function. To analyze this, we must first quantify the notion of agreement between a preference function PREF and an ordering. One natural notion is the following: Let X be a set, PREF be a preference function, and let ρ be a total ordering of X, expressed again as an ordering function (i.e., ρ(u) > ρ(v) iff u precedes v in the order). We define AGREE(ρ, PREF) to be the sum of PREF(u, v) over all pairs u, v such that u is ranked ahead of v by ρ:

    AGREE(ρ, PREF) = Σ_{u,v: ρ(u) > ρ(v)} PREF(u, v).    (1)
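To make Eq. (1) concrete, here is a minimal Python sketch (ours) evaluating AGREE for a total order given as a list, most-preferred instance first:

def agree(ordering, pref):
    """AGREE(rho, PREF): sum of pref(u, v) over pairs with u ranked ahead of v."""
    return sum(pref(u, v)
               for i, u in enumerate(ordering)
               for v in ordering[i + 1:])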
Ideally, one would like to find a ρ that maximizes AGREE(ρ, PREF). This general optimization problem is of little interest since in practice there are many constraints imposed by learning: for instance, PREF must be in some restricted class of functions, and will generally be a combination of relatively well-behaved preference functions R_i. A more interesting question is whether the problem remains hard under such constraints.

The theorem below gives such a result, showing that the problem is NP-complete even if PREF is restricted to be a linear combination of rank orderings. This holds even if all the rank orderings map into a set S with only three elements, one of which may or may not be ⊥. (Clearly, if S consists of more than three elements then the problem is still hard.)

Theorem 1  The following decision problem is NP-complete:
Input: A rational number κ; a set X; a set S with |S| ≥ 3; a collection of N ordering functions f_i : X → S; and a preference function PREF defined as PREF(u, v) = Σ_{i=1}^N w_i R_{f_i}(u, v), where w = (w_1, ..., w_N) is a weight vector in [0, 1]^N with Σ_{i=1}^N w_i = 1.
Question: Does there exist a total order ρ such that AGREE(ρ, PREF) ≥ κ?

The proof (omitted) is by reduction from CYCLIC-ORDERING [5, 6].

Although this problem is hard when |S| ≥ 3, it becomes tractable for linear combinations of rank orderings into a set S of size two. In brief, suppose one is given X, S and PREF as in Theorem 1, save that S is a two-element set, which we assume without loss of generality to be S = {0, 1}. Now define ρ(u) = Σ_i w_i f_i(u). It can be shown that the total order defined by ρ maximizes AGREE(ρ, PREF). (In case of a tie, i.e., ρ(u) = ρ(v) for distinct u and v, ρ defines only a partial order. The claim still holds in this case for any total order which is consistent with this partial order.) Of course, when |S| = 2, the rank orderings are really only binary classifiers. The fact that this special case is tractable underscores the fact that manipulating orderings can be computationally more difficult than performing the corresponding operations on binary classifiers.

Theorem 1 implies that we are unlikely to find an efficient algorithm that finds the optimal total order for a weighted combination of rank orderings. Fortunately, there do exist efficient algorithms for finding an approximately optimal total order. Figure 1 summarizes a greedy algorithm that produces a good approximation to the best total order, as we will shortly demonstrate. The algorithm is easiest to describe by thinking of PREF as a directed weighted graph where, initially, the set of vertices V is equal to the set of instances X, and each edge u → v has weight PREF(u, v). We assign to each vertex v ∈ V a potential value π(v), which is the weighted sum of the outgoing edges minus the weighted sum of the ingoing edges. That is, π(v) = Σ_{u∈V} PREF(v, u) − Σ_{u∈V} PREF(u, v). The greedy algorithm then picks some node t that has maximum potential, and assigns it a rank by setting ρ(t) = |V|, effectively ordering it ahead of all the remaining nodes. This node, together with all incident edges, is then deleted from the graph, and the potential values π of the remaining vertices are updated appropriately. This process is repeated until the graph is empty; notice that nodes removed in subsequent iterations will have progressively smaller and smaller ranks.

Algorithm Order-By-Preferences
Inputs: an instance set X; a preference function PREF
Output: an approximately optimal ordering function ρ
let V = X
for each v ∈ V do π(v) = Σ_{u∈V} PREF(v, u) − Σ_{u∈V} PREF(u, v)
while V is non-empty do
    let t = argmax_{u∈V} π(u)
    let ρ(t) = |V|
    let V = V − {t}
    for each v ∈ V do π(v) = π(v) + PREF(t, v) − PREF(v, t)
endwhile

Figure 1: A greedy ordering algorithm

The next theorem shows that this greedy algorithm comes within a factor of two of optimal. Furthermore, it is relatively simple to show that the approximation factor of 2 is tight.

Theorem 2  Let OPT(PREF) be the weighted agreement achieved by an optimal total order for the preference function PREF, and let APPROX(PREF) be the weighted agreement achieved by the greedy algorithm. Then APPROX(PREF) ≥ (1/2) OPT(PREF).
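A direct Python rendering of Figure 1 (a sketch, assuming the instance set is small enough to enumerate) follows:

def order_by_preferences(X, pref):
    """Greedy ordering algorithm of Figure 1.

    Returns the ordering function rho as a dict mapping each instance
    to its rank; a larger rho(v) means v is ordered earlier.
    """
    V = set(X)
    # potential of v: weighted outgoing edges minus weighted ingoing edges
    pi = {v: sum(pref(v, u) - pref(u, v) for u in V) for v in V}
    rho = {}
    while V:
        t = max(V, key=pi.get)   # pick a node of maximum potential
        rho[t] = len(V)          # rank t ahead of all remaining nodes
        V.remove(t)
        for v in V:              # update potentials after deleting t
            pi[v] += pref(t, v) - pref(v, t)
    return rho

Since each of the |X| iterations scans the remaining vertices once, the sketch runs in O(|X|^2) time, counting calls to pref as constant time.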
4  Learning a good weight vector

In this section, we look at the problem of learning a good linear combination of a set of preference functions. Specifically, we assume access to a set of ranking experts which provide us with preference functions R_i of a set of instances. The problem, then, is to learn a preference function of the form PREF(u, v) = Σ_{i=1}^N w_i R_i(u, v). We adopt the on-line learning framework first studied by Littlestone [8] in which the weight w_i assigned to each ranking expert R_i is updated incrementally.

Learning is assumed to take place in a sequence of rounds. On the t-th round, the learning algorithm is provided with a set X^t of instances to be ranked and with a set of N preference functions R_i^t over these instances. The learner may compute R_i^t(u, v) for any and all preference functions R_i^t and pairs u, v ∈ X^t before producing a final ordering ρ_t of X^t. Finally, the learner receives feedback from the environment. We assume that the feedback is an arbitrary set of assertions of the form "u should be preferred to v." That is, formally we regard the feedback on the t-th round as a set F^t of pairs (u, v) indicating such preferences.

The algorithm we propose for this problem is based on the "weighted majority algorithm" [9] and, more directly, on the "Hedge" algorithm [4]. We define the loss of a preference function R with respect to the user's feedback F as

    Loss(R, F) = (1/|F|) Σ_{(u,v)∈F} (1 − R(u, v)).    (2)

This loss has a natural probabilistic interpretation. If R is viewed as a randomized prediction algorithm that predicts that u will precede v with probability R(u, v), then Loss(R, F) is the probability of R disagreeing with the feedback on a pair (u, v) chosen uniformly at random from F.

We can now use the Hedge algorithm almost verbatim, as shown in Figure 2. The algorithm maintains a positive weight vector whose value at time t is denoted by w^t = (w_1^t, ..., w_N^t). If there is no prior knowledge about the ranking experts, we set all initial weights to be equal so that w_i^1 = 1/N. The weight vector w^t is used to combine the preference functions of the different experts to obtain the preference function PREF^t = Σ_{i=1}^N w_i^t R_i^t. This, in turn, is converted into an ordering ρ_t on the current set of elements X^t using the method described in Section 3. After receiving feedback F^t, the loss of each preference function, Loss(R_i^t, F^t), is evaluated as in Eq. (2) and the weight vector w^t is updated using the multiplicative rule w_i^{t+1} = w_i^t · β^{Loss(R_i^t, F^t)} / Z_t, where β ∈ [0, 1] is a parameter, and Z_t is a normalization constant, chosen so that the weights sum to one after the update. Thus, based on the feedback, the weights of the ranking experts are adjusted so that experts producing preference functions with relatively large agreement with the feedback are promoted.

Allocate Weights for Ranking Experts
Parameters: β ∈ [0, 1]; initial weight vector w^1 ∈ [0, 1]^N with Σ_{i=1}^N w_i^1 = 1;
N ranking experts; number of rounds T.
Do for t = 1, 2, ..., T
1. Receive a set of elements X^t and preference functions R_1^t, ..., R_N^t.
2. Use algorithm Order-By-Preferences to compute ordering function ρ_t which approximates PREF^t(u, v) = Σ_{i=1}^N w_i^t R_i^t(u, v).
3. Order X^t using ρ_t.
4. Receive feedback F^t from the user.
5. Evaluate losses Loss(R_i^t, F^t) as defined in Eq. (2).
6. Set the new weight vector w_i^{t+1} = w_i^t · β^{Loss(R_i^t, F^t)} / Z_t, where Z_t is a normalization constant, chosen so that Σ_{i=1}^N w_i^{t+1} = 1.

Figure 2: The on-line weight allocation algorithm.
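In code, the loss of Eq. (2) and one round of the multiplicative update look as follows (a sketch in our notation, with experts given as preference functions and feedback as a set of pairs):

def loss(R, F):
    """Eq. (2): average disagreement of R with the feedback pairs F."""
    return sum(1.0 - R(u, v) for (u, v) in F) / len(F)

def hedge_update(w, R_list, F, beta):
    """Step 6 of Figure 2: multiplicative update, then normalization."""
    w_new = [w_i * beta ** loss(R_i, F) for w_i, R_i in zip(w, R_list)]
    Z = sum(w_new)                     # normalization constant Z_t
    return [w_i / Z for w_i in w_new]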
We will briefly sketch the theoretical rationale behind this algorithm. Freund and Schapire [4] prove general results about Hedge which can be applied directly to this loss function. Their results imply almost immediately a bound on the cumulative loss of the preference function PREF^t in terms of the loss of the best ranking expert, specifically

    Σ_{t=1}^T Loss(PREF^t, F^t) ≤ a_β min_i Σ_{t=1}^T Loss(R_i^t, F^t) + c_β ln N

where a_β = ln(1/β) / (1 − β) and c_β = 1 / (1 − β). Thus, if one of the ranking experts has low loss, then so will the combined preference function PREF^t.

However, we are not interested in the loss of PREF^t (since it is not an ordering), but rather in the performance of the actual ordering ρ_t computed by the learning algorithm. Fortunately, the losses of these can be related using a kind of triangle inequality. It can be shown that, for any PREF, F and ρ:

    Loss(R_ρ, F) ≤ DISAGREE(ρ, PREF) / |F| + Loss(PREF, F)    (3)

where, similar to Eq. (1), DISAGREE(ρ, PREF) = Σ_{u,v: ρ(u) > ρ(v)} (1 − PREF(u, v)). Not surprisingly, maximizing AGREE is equivalent to minimizing DISAGREE.

So, in sum, we use the greedy algorithm of Section 3 to minimize (approximately) the first term on the right-hand side of Eq. (3), and we use the learning algorithm Hedge to minimize the second term.

5  Experimental results for metasearch

We now present some experiments in learning to combine the results of several WWW searches. We note that this problem exhibits many facets that require a general approach such as ours. For instance, approaches that learn to combine similarity scores are not applicable since the similarity scores of WWW search engines are often unavailable.

We chose to simulate the problem of learning a domain-specific search engine. As test cases we picked two fairly narrow classes of queries: retrieving the home pages of machine learning researchers (ML), and retrieving the home pages of universities (UNIV). We obtained a listing of machine learning researchers, identified by name and affiliated institution, together with their home pages, and a similar list for universities, identified by name and (sometimes) geographical location. Each entry on a list was viewed as a query, with the associated URL the sole relevant document.

We then constructed a series of special-purpose "search experts" for each domain. These were implemented as query expansion methods which converted a name, affiliation pair (or a name, location pair) to a likely-seeming AltaVista query. For example, one expert for the ML domain was to search for all the words in the person's name plus the words "machine" and "learning," and to further enforce a strict requirement that the person's last name appear. Overall we defined 16 search experts for the ML domain and 22 for the UNIV domain. Each search expert returned the top 30 ranked documents. In the ML domain there were 210 searches for which at least one search expert returned the named home page; for the UNIV domain, there were 290 such searches.

For each query t, we first constructed the set X^t consisting of all documents returned by all of the expanded queries defined by the search experts. Next, each search expert i computed a preference function R_i^t. We chose these to be rank orderings defined with respect to an ordering function f_i^t in the natural way: we assigned a rank of f_i^t = 30 to the first listed document, f_i^t = 29 to the second-listed document, and so on, finally assigning a rank of f_i^t = 0 to every document not retrieved by the expanded query associated with expert i.
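For instance, a Python sketch of this construction (ours; `results` is a hypothetical ranked list of documents returned by one expert):

def ordering_from_results(results, depth=30):
    """Ordering function f_i for one search expert.

    Assigns rank `depth` to the first listed document, depth - 1 to the
    second, and so on; documents not retrieved get rank 0.
    """
    rank = {doc: depth - k for k, doc in enumerate(results[:depth])}
    return lambda doc: rank.get(doc, 0)

The induced preference function R_i^t is then the rank ordering of this function, e.g. rank_ordering(ordering_from_results(results)) in terms of the earlier sketch; note that unretrieved documents here receive rank 0 rather than ⊥, so the expert never abstains.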
To encode feedback, we considered two schemes. In the first we simulated complete relevance feedback; that is, for each query, we constructed feedback in which the sole relevant document was preferred to all other documents. In the second, we simulated the sort of feedback that could be collected from "click data," i.e., from observing a user's interactions with a metasearch system. For each query, after presenting a ranked list of documents, we noted the rank of the one relevant document. We then constructed a feedback ranking in which the relevant document is preferred to all preceding documents. This would correspond to observing which link the user actually followed, and making the assumption that this link was preferred to previous links.
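A minimal sketch of the two feedback schemes (ours; assumes exactly one relevant document per query, appearing in the presented list `ranked_docs`):

def full_feedback(ranked_docs, relevant):
    """Complete relevance feedback: prefer the relevant document to all others."""
    return {(relevant, d) for d in ranked_docs if d != relevant}

def click_feedback(ranked_docs, relevant):
    """Click-data feedback: prefer the relevant document to those shown before it."""
    return {(relevant, d) for d in ranked_docs[:ranked_docs.index(relevant)]}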
To evaluate the expected performance of a fully-trained system on novel queries in this domain, we employed leave-one-out testing. For each query q, we removed q from the query set, and recorded the rank of q after training (with β = 0.5) on the remaining queries. For click data feedback, we recorded the median rank over 100 randomly chosen permutations of the training queries.

We then computed an approximation to average rank by artificially assigning a rank of 31 to every document that was either unranked, or ranked above rank 30. (The latter case is to be fair to the learned system, which is the only one for which a rank greater than 30 is possible.) A summary of these results is given in Table 1, together with some additional data on "top-k performance": the number of times the correct homepage appears at rank no higher than k. In the table we give the top-k performance (for three values of k) and average rank for several ranking systems: the two learned systems, the naive query (the person or university's name), and the single search expert that performed best with respect to each performance measure.

                                         ML Domain                      University Domain
                                Top 1  Top 10  Top 30  Av. rank   Top 1  Top 10  Top 30  Av. rank
Learned System (Full Feedback)    114     185     198       4.9     111     225     253       7.8
Learned System ("Click Data")      93     185     198       4.9      87     229     259       7.8
Naive                              89     165     176       7.7      79     157     191      14.4
Best (Top 1)                      119     170     184       6.7     112     221     247       8.2
Best (Top 10)                     114     182     190       5.3     111     223     249       8.0
Best (Top 30)                      97     181     194       5.6     111     223     249       8.0
Best (Av. Rank)                   114     182     190       5.3     111     223     249       8.0

Table 1: Comparison of learned systems and individual search queries

The table illustrates the robustness of the learned systems, which are nearly always competitive with the best expert for every performance measure listed; the only exception is that the system trained on click data trails the best expert in top-k performance for small values of k. It is also worth noting that in both domains, the naive query (simply the person or university's name) is not very effective. Even with the weaker click data feedback, the learned system achieves a 36% decrease in average rank over the naive query in the ML domain, and a 46% decrease in the UNIV domain.

To summarize the experiments: on these domains, the learned system not only performs much better than naive search strategies; it also consistently performs at least as well as, and perhaps slightly better than, any single domain-specific search expert. Furthermore, the performance of the learned system is almost as good with the weaker "click data" training as with complete relevance feedback.

References

[1] D.S. Hochbaum (Ed.). Approximation Algorithms for NP-hard Problems. PWS Publishing Company, 1997.
[2] O. Etzioni, S. Hanks, T. Jiang, R.M. Karp, O. Madani, and O. Waarts. Efficient information gathering on the internet. In 37th Annual Symposium on Foundations of Computer Science, 1996.
[3] P.C. Fishburn. The Theory of Social Choice. Princeton University Press, Princeton, NJ, 1973.
[4] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.
[5] Z. Galil and N. Megiddo. Cyclic ordering is NP-complete. Theoretical Computer Science, 5:179-182, 1977.
[6] M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-completeness. W.H. Freeman and Company, New York, 1979.
[7] P.B. Kantor. Decision level data fusion for routing of documents in the TREC3 context: a best case analysis of worst case results. In TREC-3, 1994.
[8] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4), 1988.
[9] N. Littlestone and M.K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994.
[10] K.E. Lochbaum and L.A. Streeter. Comparing and combining the effectiveness of latent semantic indexing and the ordinary vector space model for information retrieval. Information Processing and Management, 25(6):665-676, 1989.
", "award": [], "sourceid": 1431, "authors": [{"given_name": "William", "family_name": "Cohen", "institution": null}, {"given_name": "Robert", "family_name": "Schapire", "institution": null}, {"given_name": "Yoram", "family_name": "Singer", "institution": null}]}