{"title": "Learning from User Feedback in Image Retrieval Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 977, "page_last": 986, "abstract": null, "full_text": "Learning from user feedback in image retrieval \n\nsystems \n\nNuno Vasconcelos \n\nAndrew Lippman \n\nMIT Media Laboratory, 20 Ames St, E15-354, Cambridge, MA 02139, \n\n{nuno,lip} @media.mit.edu, \n\nhttp://www.media.mit.edwnuno \n\nAbstract \n\nWe  formulate  the  problem  of retrieving  images  from  visual  databases \nas  a problem of Bayesian inference.  This leads to  natural and effective \nsolutions for two of the most challenging issues in the design of a retrieval \nsystem:  providing  support  for  region-based  queries  without  requiring \nprior  image  segmentation,  and  accounting  for  user-feedback  during  a \nretrieval  session.  We  present  a  new  learning  algorithm  that  relies  on \nbelief propagation to account for both positive and negative examples of \nthe user's interests. \n\n1  Introduction \n\nDue to  the  large amounts of imagery that can now be accessed and managed via comput(cid:173)\ners, the problem of content-based image retrieval (CBIR) has recently attracted significant \ninterest among the vision community [1, 2, 5].  Unlike most traditional vision applications, \nvery  few  assumptions about the content of the images to  be analyzed are allowable in  the \ncontext of CBIR. This implies that the space of valid image representations is  restricted to \nthose  of a generic nature (and typically  of low-level) and consequently the image  under(cid:173)\nstanding problem becomes even more complex.  On  the  other hand,  CBIR systems have \naccess to feedback from their users that can be exploited to simplify the task of finding the \ndesired images.  There are, therefore, two fundamental problems to be addressed.  
First, the design of the image representation itself and, second, the design of learning mechanisms to facilitate the interaction. The two problems cannot, however, be solved in isolation, as a careless choice of representation will make learning more difficult, and vice-versa.

The impact of a poor image representation on the difficulty of the learning problem is visible in CBIR systems that rely on holistic metrics of image similarity, forcing user feedback to be relative to entire images. In response to a query, the CBIR system suggests a few images and the user rates those images according to how well they satisfy the goals of the search. Because each image usually contains several different objects or visual concepts, this rating is both difficult and inefficient. How can the user rate an image that contains the concept of interest, but in which this concept only occupies 30% of the field of view, the remaining 70% being filled with completely unrelated material? And how many example images will the CBIR system have to see in order to figure out what the concept of interest is?

A much better interaction paradigm is to let the user explicitly select the regions of the image that are relevant to the search, i.e. user feedback at the region level. However, region-based feedback requires sophisticated image representations. The problem is that the most obvious choice, object-based representations, is difficult to implement because it is still too hard to segment arbitrary images in a meaningful way. We have argued that a better formulation is to view the problem as one of Bayesian inference and to rely on probabilistic image representations.
In this paper we show that this formulation naturally leads to 1) representations with support for region-based interaction without segmentation and 2) intuitive mechanisms to account for both positive and negative user feedback.

2 Retrieval as Bayesian inference

The standard interaction paradigm for CBIR is the so-called \"query by example\", where the user provides the system with a few examples, and the system retrieves from the database images that are visually similar to these examples. The problem is naturally formulated as one of statistical classification: given a representation (or feature) space F, the goal is to find a map g : F -> M = {1, ..., K} from F to the set M of image classes in the database. K, the cardinality of M, can be as large as the number of items in the database (in which case each item is a class by itself), or smaller. If the goal of the retrieval system is to minimize the probability of error, it is well known that the optimal map is the Bayes classifier [3]

g*(x) = arg max_i P(Si = 1|x) = arg max_i P(x|Si = 1) P(Si = 1),   (1)

where x are the example features provided by the user and Si is a binary variable indicating the selection of class i. In the absence of any prior information about which class is best suited for the query, an uninformative prior can be used and the optimal decision is the maximum likelihood criterion

g*(x) = arg max_i P(x|Si = 1).   (2)

Besides theoretical soundness, Bayesian retrieval has two distinguishing properties of practical relevance. First, because the features x in equation (1) can be any subset of a given query image, the retrieval criterion is valid for both region-based and image-based queries.
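As a concrete illustration, the maximum likelihood criterion of equation (2) can be sketched in a few lines. The single-Gaussian, one-dimensional class models below are a hypothetical simplification (the paper estimates a mixture of Gaussians per image), and treating feature vectors as independent given the class is an assumption made for the sketch:

```python
import math

def gaussian_log_pdf(x, mean, var):
    # log of a univariate Gaussian density; a stand-in for the paper's
    # mixture-of-Gaussians class-conditional models (simplified here)
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def ml_retrieval(query_features, class_models):
    """Equation (2): g*(x) = arg max_i P(x | S_i = 1).

    Because query_features may be any subset of the blocks of a query
    image, the same criterion serves region-based and whole-image queries.
    Feature vectors are assumed independent given the class, so the query
    log-likelihood is a sum over feature vectors.
    """
    scores = {
        label: sum(gaussian_log_pdf(x, m, v) for x in query_features)
        for label, (m, v) in class_models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# two hypothetical image classes with 1-D features
models = {"A": (0.0, 1.0), "B": (5.0, 1.0)}
best, scores = ml_retrieval([4.6, 5.3, 4.9], models)
# features clustered near 5 make class B the maximum likelihood match
```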
\nSecond,  due  to  its  probabilistic  nature,  the  criteria  also  provides  a  basis  for  designing \nretrieval systems that can account for user-feedback through belief propagation. \n\n3  Bayesian relevance feedback \n\nSuppose  that  instead  of a  single  query  x  we  have  a  sequence  of t  queries  {XI, ... , Xt}, \nwhere t is a time stamp.  By  simple application of Bayes rule \n\nP(Si = llxl,'\"  ,Xt) = 'YtP(XtISi  = I)P(Si = IlxI,'\"  ,Xt-d, \n\n(3) \nwhere 'Yt  is  a nonnalizing constant and we  have assumed that,  given the knowledge of the \ncorrect image class, the current query Xt  is independent of the previous ones.  This basically \nmeans that the user provides the retrieval system with new infonnation at each iteration of \nthe  interaction.  Equation (3) is a simple but  intuitive mechanism to  integrate infonnation \nover time.  It states that the system's beliefs about the user's interests at time t - 1 simply \nbecome the  prior  beliefs for iteration t.  New data provided by  the  user  at  time t is  then \nused  to  update  these  beliefs,  which  in  turn  become  the  priors  for  iteration  t + 1.  From \na computational standpoint the procedure is  very  efficient since the only quantity that has \nto  be computed at each time step is  the  likelihood of the data in  the corresponding query. \nNotice that this is exactly equation (2) and would have to be computed even in the absence \nof any learning. 
\n\nBy  taking logarithms and solving for the recursion, equation (3) can also be written as \nlog P(Si = I1xJ,'\"  , Xt)  = 2: log 'Yt-k + 2: log P(Xt-k lSi  = I) + log P(Si = 1) , \n(4) \n\nt-I \n\nk=O \n\nt-J \n\nk=O \n\n\fLearning from  User Feedback in Image Retrieval Systems \n\n979 \n\nexposing the  main  limitation  of the  belief propagation  mechanism:  for  large  t  the  con(cid:173)\ntribution,  to  the  right-hand side of the  equation,  of the new  data provided by the  user is \nvery small, and the posterior probabilities tend to remain constant.  This can be avoided by \npenalizing older tenns with a decay factor at-k \n\nt-l \n\nt-l \n\nL at-k log,t-k + L at-k 10gP(xt-kISi =  1) + \nk=O \naologP(Si =  1), \n\nk=O \n\nwhere at is a monotonically decreasing sequence.  In particular, if at-k =  a( 1 - a)k, a  E \n(0,1] we have \n\n10gP(Si = llxl, ... ,Xt)  =  log,: +alogP(xtISi = 1) + \n\n(1  - a) log P(Si = llxl, ... , Xt-l). \n\nBecause,: does not depend on i, the optimal class is \n\nS; = argm~{alogP(xtISi =  1) + (1- a) 10gP(Si = llxl, ... ,Xt-J}}. \n\nt \n\n(5) \n\n4  Negative feedback \n\nIn  addition  to  positive  feedback,  there are  many  situations in  CBIR where it is  useful  to \nrely on negative user-feedback.  One example is  the case of image classes characterized by \noverlapping densities.  This  is  illustrated in Figure  1 a)  where  we  have two  classes  with \na common attribute (e.g.  regions of blue sky) but different in other aspects  (class A  also \ncontains regions of grass while class B contains regions of white snow).  If the user starts \nwith an image of class  B  (e.g.  a picture of a  snowy  mountain),  using  regions of sky  as \npositive examples is not likely to quickly take hirnlher to the images of class A.  In fact, all \nother factors  being equal,  there is  an equal likelihood that the retrieval system will return \nimages from the two classes.  
On the other hand, if the user can explicitly indicate interest in regions of sky but not in regions of snow, the likelihood that only images from class A will be returned increases drastically.

Figure 1: a) two overlapping image classes. b) and c) two images in the tile database. d) three examples of pairs of visually similar images that appear in different classes.

Another example of the importance of negative feedback is the local minima of the search space. These happen when, in response to user feedback, the system returns exactly the same images as in a previous iteration. Assuming that the user has already given the system all the possible positive feedback, the only way to escape from such minima is to choose some regions that are not desirable and use them as negative feedback. In the case of the example above, if the user gets stuck with a screen full of pictures of white mountains, he/she can simply select some regions of snow to escape the local minimum.

In order to account for negative examples, we must penalize the classes under which these score well, while favoring the classes that assign a high score to the positive examples. Unlike positive examples, for which the likelihood is known, it is not straightforward to estimate the likelihood of a particular negative example given that the user is searching for a certain image class. We assume that the likelihood with which y will be used as a negative example given that the target is class i is equal to the likelihood with which it will be used as a positive example given that the target is any other class. Denoting the use of y as a negative example by ȳ, this can be written as

P(ȳ|Si = 1) = P(y|Si = 0).
\n\n(6) \nThis  assumption captures  the  intuition that  a good negative example when  searching for \nclass i, is one that would be a good positive example if the user were looking for any class \nother than i.  E.g.  if class i  is  the only one in the database that does not contain regions of \nsky, using pieces of sky as negative examples will quickly eliminate the other images in the \ndatabase. \n\nUnder this assumption, negative examples can be incorporated into the learning by simply \nchoosing  the  class  i  that  maximizes  the  posterior odds  ratio  [4]  between  the  hypotheses \n\"class i  is  the target\" and \"class i  is  not the target\" \n\nS* \ni  = arg max \n\nP(Si =  lIXt\",.,Xl,yt, ... ,Yl) \nt  P(Si=Olxt\",.,XI,Yt, .. . ,YJ) \n\nP(Si =  llxt, ... ,Xl) \n= arg max ----O~--'---'--'----7 \nt  P(Si=OIYt, ... ,YI) \n\nwhere x  are the positive and Y the negative examples, and we have assumed that, given the \npositive (negative) examples, the posterior probability of a given class being (not being) the \ntarget is independent of the negative (positive) examples.  Once again, the procedure of the \nprevious section can be used to  obtain a recursive version  of this  equation and  include a \ndecay factor which penalizes ancient terms \n\nS* \ni  = arg m~x  a  og \n\n{  1  P(xtI Si=l) \nP(YtISi = 0) \n\nt \n\n( \n\n+  1 - a  og \n\n)1  P(Si=I IXl, ... ,Xt-l)} \n. \n\nP(Si = 0IYl, ... , Yt-d \n\nUsing equations (4) and (6) \n\nP(Si = 0IYI , ... ,yt) \n\n<X \n\nIT P(YkISi = 0)  = IT P(YkISi = 1) \n\nk \n\nk \n<X  P(Si=IIYl, ... ,Yt), \n\nwe obtain \n\nS* \ni  =  arg m~x  a  og \n\n{  1  P(XtISi = 1) \nP(YtISi=l) \n\n(1 \n\n+ \n\n- a  og \n\n)1  P(Si = l lxl, ... ,Xt-l)} \nP(Si=IIYl, ... ,Yt-d \n\n_ .  
\n\n(7) \n\nt \n\nWhile  maximizing  the  ratio  of  posterior  probabilities  is  a  natural  way  to  favor  image \nclasses  that explain  well  the  positive  examples  and  poorly  the  negative  ones,  it  tends  to \nover-emphasize the  importance  of negative examples.  In  particular,  any  class  with  zero \nprobability of generating the negative examples will lead to a ratio of 00, even if it explains \nvery poorly the positive examples.  To avoid this problem we proceed in two steps: \n\n\u2022  start  by  solving  equation  (5),  i.e.  sort  the  classes  according  to  how  well  they \n\nexplain the positive examples. \n\n\u2022  select the subset of the best N  classes and solve equation (7) considering only the \n\nclasses in this subset. \n\n5  Experimental evaluation \n\nWe  performed experiments  to  evaluate  1)  the  accuracy  of Bayesian  retrieval  on  region(cid:173)\nbased queries and 2) the improvement in  retrieval performance achievable with relevance \n\n\fLeamingfrom User Feedback in Image Retrieval Systems \n\n981 \n\nfeedback. Because in a normal browsing scenario it is difficult to know the ground truth for \nthe retrieval operation (at least without going through the tedious process of hand-labeling \nall images in the database), we relied instead on a controlled experimental set up for which \nground truth is available.  All experiments reported on this section are based on the widely \nused Brodatz texture database which contains images of 112 textures, each of them being \nrepresented by  9  different patches,  in a  total  of 1008 images.  These were split into two \ngroups, a small one with 112 images (one example of each texture), and a larger one with \nthe  remaining 896.  We  call  the first  group the test database and the second the Brodatz \ndatabase.  A  synthetic database with 2000 images was  then created from the larger set by \nrandomly selecting 4 images at a time and making a 2 x  2 tile out of them.  
Figure 1 b) and c) are two examples of these tiles. We call this set the tile database.

5.1 Region-based queries

We performed two sets of experiments to evaluate the performance of region-based queries. In both cases the test database was used as a test set, and the image features were the coefficients of the discrete cosine transform (DCT) of an 8 × 8 block-wise image decomposition over a grid containing every other image pixel. The first set of experiments was performed on the Brodatz database, while the tile database was used in the second. A mixture of 16 Gaussians was estimated, using EM, for each of the images in the two databases.

In both sets of experiments, each query consisted of selecting a few image blocks from an image in the test set, evaluating equation (2) for each of the classes, and returning those that best explained the query. Performance was measured in terms of precision (the percentage of retrieved images that are relevant to the query) and recall (the percentage of relevant images that are retrieved), averaged over the entire test set. The query images contained a total of 256 non-overlapping blocks. The number of these used in each query varied between 1 (0.3% of the image size) and 256 (100%). Figure 2 depicts precision-recall plots as a function of this number.

Figure 2: Precision-recall curves as a function of the number of feature vectors included in the query. Left: Brodatz database. Right: Tile database.

The graph on the left is relative to the Brodatz database. Notice that precision is generally high even for large values of recall, and performance increases quickly with the percentage of feature vectors included in the query.
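For reference, the two metrics can be computed as follows; the retrieved and relevant sets in the example are made-up numbers, not values read off Figure 2:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved images that are relevant.
    Recall: fraction of relevant images that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# hypothetical query: 10 images returned, 8 of the 16 relevant ones found
precision, recall = precision_recall(range(10), range(2, 18))
# precision = 8/10, recall = 8/16
```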
In particular, 25% of the texture patch (64 blocks) is enough to achieve results very close to those obtained with all pixels. This shows that the retrieval criterion is robust to missing data. The graph on the right presents similar results for the tile database. While there is some loss in performance, this loss is not dramatic: a decrease between 10 and 15% in precision for any given recall. In fact, the results are still good: when a reasonable number of feature vectors is included in the query, about 8.5 out of the 10 top retrieved images are, on average, relevant. Once again, performance improves rapidly with the number of feature vectors in the query, and 25% of the image is enough for results comparable to the best. This confirms the argument that Bayesian retrieval leads to effective region-based queries even for imagery composed of multiple visual stimuli.

5.2 Learning

The performance of the learning algorithm was evaluated on the tile database. The goal was to determine if it is possible to reach a desired target image by starting from a weakly related one and providing positive and negative feedback to the retrieval system. This simulates the interaction between a real user and the CBIR system and is an iterative process, where each iteration consists of selecting a few examples, using them as queries for retrieval, and examining the top M retrieved images to find examples for the next iteration. M should be small, since most users are not willing to go through lots of false positives to find the next query. In all experiments we set M = 10, corresponding to one screenful of images.

The most complex problem in testing is to determine a good strategy for selecting the examples to be given to the system.
The closer this strategy is to what a real user would do, the higher the practical significance of the results. However, even when there is clear ground truth for the retrieval (as is the case of the tile database), it is not completely clear how to make the selection. While it is obvious that regions of texture classes that appear in the target should be used as positive feedback, it is much harder to determine automatically what good negative examples are. As Figure 1 d) illustrates, there are cases in which textures from two different classes are visually similar. Selecting images from one of these classes as a negative example for the other will be a disservice to the learner.

While real users tend not to do this, it is hard to avoid such mistakes in an automatic setting, unless one does some sort of pre-classification of the database. Because we wanted to avoid such pre-classification, we decided to stick with a simple selection procedure and live with these mistakes. At each step of the iteration, examples were selected in the following way: among the 10 top images returned by the retrieval system, the one with the most patches from texture classes also present in the target image was selected to be the next query. One block from each patch in the query was then used as a positive (negative) example if the class of that patch was also (was not) represented in the target image.

This strategy is a worst-case scenario. First, the learner might be confused by conflicting negative examples. Second, as seen above, better retrieval performance can be achieved if more than one block from each region is included in the queries. However, using only one block reduced the computational complexity of each iteration, allowing us to average results over several runs of the learning process. We performed 100 runs with randomly selected target images.
In all cases, the initial query image was the first in the database containing one class in common with the target.

The performance of the learning algorithm can be evaluated in various ways. We considered two metrics: the percentage of the runs which converged to the right target, and the number of iterations required for convergence. Because, to prevent the learner from entering loops, any given image could only be used once as a query, the algorithm can diverge in two ways. Strong divergence occurs when, at a given time step, the images (among the top 10) that can be used as queries do not contain any texture class in common with the target. In such a situation, a real user will tend to feel that the retrieval system is incoherent and abort the search. Weak divergence occurs when all the top 10 images have previously been used. This is a less troublesome situation, because the user could simply look up more images (e.g. the next 10) to get new examples.

We start by analyzing the results obtained with positive feedback only. Figure 3 a) and b) present plots of the convergence rate and median number of iterations as a function of the decay factor α. While only 43% of the runs converge when there is no learning (α = 1), the convergence rate is always higher when learning takes place, and for a significant range of α (α ∈ [0.5, 0.8]) it is above 60%. This not only confirms that learning can lead to significant improvements in retrieval performance, but also shows that a precise selection of α is not crucial. Furthermore, when convergence occurs it is usually very fast, taking from 4 to 6 iterations. On the other hand, a significant percentage of runs do not converge, and the majority of these are cases of strong divergence.
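The runs that use negative feedback rely on the two-step procedure of Section 4. A sketch of that reranking step is given below; the log-likelihood arrays are hypothetical stand-ins for the accumulated scores of equations (5) and (7).

```python
def rerank_with_negatives(pos_ll, neg_ll, n_best=10):
    """Two-step procedure of Section 4 (a sketch).

    pos_ll[i] / neg_ll[i] stand for the accumulated log-likelihoods of the
    positive and negative examples under class i.  Classes are first sorted
    by how well they explain the positive examples (equation (5)); only the
    best N are then reranked by the log posterior odds of equation (7).
    Restricting the odds ratio to the N-best shortlist keeps classes that
    merely avoid the negative examples from dominating the ranking.
    """
    order = sorted(range(len(pos_ll)), key=lambda i: pos_ll[i], reverse=True)
    shortlist = order[:n_best]
    return sorted(shortlist, key=lambda i: pos_ll[i] - neg_ll[i], reverse=True)

# class 0 explains the positives best but also explains the negatives;
# class 1 explains the positives nearly as well and the negatives poorly;
# class 2 has a huge odds ratio but explains the positives terribly
pos = [-1.0, -1.5, -20.0]
neg = [-2.0, -30.0, -50.0]
ranking = rerank_with_negatives(pos, neg, n_best=2)
# class 1 wins; class 2 never enters the shortlist despite its odds ratio
```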
\n\nAs  illustrated  by  Figure 3  c)  and  d),  this  percentage decreases  significantly  when  both \npositive and negative examples are allowed.  The rate of convergence is in this case usually \nbetween  80 and 90  %  and  strong  divergence  never  occurs.  And  while  the  number of \niterations for convergence increases, convergence is still fast (usually below 10 iterations). \nThis is indeed the great advantage of negative examples:  they encourage some exploration \nof the database  which avoids local minima and leads to convergence.  Notice that,  when \nthere is  no learning,  the convergence rate is  high  and learning  can actually  increase  the \nrate of divergence.  We  believe that this  is  due to the inconsistencies associated  with  the \nnegative example selection strategy.  However, when convergence occurs, it is always faster \nif learning is employed. \n\n,\" \n\n- ~ - ~  -\n\n- _ .  -\n\nIT \n\n-\n\n-\n\n-\n\n_  . .  0 \n\n-..... ......-\n-1,-,  ==-1 \n-a) \n\nI \u2022 \u2022 \u2022   It \n\n.A \n\n1ft \n\n\u2022 \n\n.... \n\n' 0 -\n\n\" \n\n.ft \n\n-c) \n\n'.1 \n\n... \n\nI.e \n\n... \n\nI \n\n., \n\n.ft \n\n-b) \n\n. . . . . . . . . .   1M \n\nI \n\nI\u00b7.\u00b7  ==-1 \n\nU \n\nUI -d) \n\nFigure 3:  Learning performance as  a function of 0_  Left:  Percent of runs which converged.  Right: \nMedian number of iterations.  Top:  positive examples.  Bottom:  positive and negative examples. \nReferences \n\n[1]  S. Belongie, C. Carson, H. Greenspan, and J. Malik. Color-and texture-based image segmentation \nusing  EM  and  its application to  content-based image retrieval.  In  International  Conference  on \nComputer Vision, pages 675-682, Bombay, India,  1998. \n\n[2]  I. Cox, M.  Miller, S. Omohundro, and P.  Yianilos.  Pic Hunter:  Bayesian Relevance Feedback for \n\nImage Retrieval.  In Int.  Con! on Pattern Recognition, Vienna, Austria,  1996. \n\n[3]  L.  Devroye, L.  Gyorfi, and G.  Lugosi.  
A  Probabilistic Theory of Pattern Recognition.  Springer(cid:173)\n\nVerlag,  1996. \n\n[4]  A.  Gelman, J. Carlin, H.  Stem, and D. Rubin.  Bayesian Data Analysis.  Chapman Hall,  1995. \n[5]  A.  Pentland,  R.  Picard,  and  S.  Sclaroff.  Photobook:  Content-based  Manipulation  of Image \n\nDatabases.  International Journal of Computer Vision, Vol.  18(3):233-254, June  1996. \n\n\f\fPART IX \n\nCONTROL, NAVIGATION AND  PLANNING \n\n\f\f", "award": [], "sourceid": 1761, "authors": [{"given_name": "Nuno", "family_name": "Vasconcelos", "institution": null}, {"given_name": "Andrew", "family_name": "Lippman", "institution": null}]}