{"title": "Inferring Ground Truth from Subjective Labelling of Venus Images", "book": "Advances in Neural Information Processing Systems", "page_first": 1085, "page_last": 1092, "abstract": null, "full_text": "Inferring Ground Truth from Subjective Labelling of Venus Images \n\nPadhraic Smyth, Usama Fayyad \nJet Propulsion Laboratory 525-3660, \nCaltech, 4800 Oak Grove Drive, \nPasadena, CA 91109 \n\nMichael Burl, Pietro Perona \nDepartment of Electrical Engineering \nCaltech, MS 116-81, \nPasadena, CA 91125 \n\nPierre Baldi* \nJet Propulsion Laboratory 303-310, \nCaltech, 4800 Oak Grove Drive, \nPasadena, CA 91109 \n\n* and Division of Biology, California Institute of Technology \n\nAbstract \n\nIn remote sensing applications, \"ground-truth\" data is often used as the basis for training pattern recognition algorithms to generate thematic maps or to detect objects of interest. In practice, experts may visually examine the images and provide a subjective, noisy estimate of the truth. Calibrating the reliability and bias of expert labellers is a non-trivial problem. In this paper we discuss some of our recent work on this topic in the context of detecting small volcanoes in Magellan SAR images of Venus. Empirical results (using the Expectation-Maximization procedure) suggest that accounting for subjective noise can be quite significant when quantifying both human and algorithm detection performance. \n\n1 Introduction \n\nIn certain pattern recognition applications, particularly in remote sensing and medical diagnosis, the standard assumption that the labelling of the data has been carried out in a reasonably objective and reliable manner may not be appropriate. Instead of \"ground truth\" one may only have the subjective opinion(s) of one or more experts. 
For example, medical data or image data may be collected off-line, and some time later a set of experts analyzes the data and produces a set of class labels. The central problem is that of trying to infer the \"ground truth\" given the noisy subjective estimates of the experts. When one wishes to apply a supervised learning algorithm to such data, the problem is primarily twofold: (i) how to evaluate the relative performance of experts and algorithms, and (ii) how to train a pattern recognition system in the absence of absolute ground truth. \nIn this paper we focus on problem (i), the performance evaluation issue, and in particular we discuss the application of a latent-variable modelling technique to the problem of counting volcanoes on the surface of Venus. For problem (ii), in previous work we have shown that when the inferred labels have a probabilistic interpretation, a simple mixture model argument leads to straightforward modifications of various learning algorithms [1]. \nIt should be noted that the issue of inferring ground truth from subjective labels has appeared in the literature under various guises. French [2] provides a Bayesian perspective on the problem of combining multiple opinions. In the field of medical diagnosis there is a significant body of work on latent variable models for inferring hidden \"truth\" from subjective diagnoses (e.g., see Uebersax [3]). More abstract theoretical models have also been developed under assumptions of specific labelling patterns (e.g., Lugosi [4] and references therein). The contribution of this paper is twofold: (i) as far as we are aware, this is the first application of latent-variable subjective-rating models to a large-scale image analysis problem, and (ii) the focus of our work is on the pattern recognition aspect of the problem, i.e., comparing human and algorithmic performance as opposed to simply comparing humans to each other. 
\n\n2 Background: Automated Detection of Volcanoes in Radar Images of Venus \n\nAlthough modern remote-sensing and sky-telescope technology has made rapid recent advances in terms of data collection capabilities, image analysis often remains a strictly manual process, and much investigative work is carried out using hardcopy photographs. The Magellan Venus data set is a typical example: between 1991 and 1994 the Magellan spacecraft transmitted back to Earth a data set consisting of over 30,000 high-resolution (75 m per pixel) synthetic aperture radar (SAR) images of the Venusian surface [5]. This data set is greater than that gathered by all previous planetary missions combined; planetary scientists are literally swamped by data. There are estimated to be on the order of 10^6 small (less than 15 km in diameter) visible volcanoes scattered throughout the 30,000 images [6]. It has been estimated that manually locating all of these volcanoes would require on the order of 10 man-years of a planetary geologist's time; our experience has been that even a few hours of image analysis severely taxes the concentration abilities of human labellers. \n\nFrom a scientific viewpoint, the ability to accurately locate and characterize the many volcanoes is a necessary prerequisite for more advanced planetary geology studies: analysis of spatial clustering patterns, correlation with other geologic features, and so forth. From an engineering viewpoint, automation of the volcano detection task presents a significant challenge to current capabilities in computer vision and pattern recognition, due to the variability of the volcanoes and the significant background \"clutter\" present in most of the images. Figure 1 shows a Magellan subimage of size 30 km square containing at least 10 small volcanoes. 
\n\nFigure 1: A 30 km x 30 km region from the Magellan SAR data, which contains a number of small volcanoes. \n\nThe purpose of this paper is not to describe pattern recognition methods for volcano detection but rather to discuss some of the issues involved in collecting and calibrating labelled training data from experts. Details of a volcano detection method using matched filtering, SVD projections and a Gaussian classifier are provided in [7]. \n\n3 Volcano Labelling \n\nTraining examples are collected by having the planetary geologists examine an image on the computer screen and then use a mouse to indicate where they think the volcanoes (if any) are located. Typically it can take from 15 minutes to 1 hour to label an image (depending on how many volcanoes are present), where each image represents a 75 km square patch on the surface of Venus. An image may contain on the order of 100 volcanoes, although a more typical number is between 30 and 40. \nThere can be considerable ambiguity in volcano labelling: for the same image, different scientists can produce different label lists, and even the same scientist can produce different lists over time. To address this problem we introduced the notion of having the scientists label training examples into quantized probability bins or \"types\", where the probability bins correspond to visually distinguishable sub-categories of volcanoes. In particular, we have used 5 types: \n(1) summit pit, bright-dark radar pair, and apparent topographic slope all clearly visible; probability 0.98. \n(2) only 2 of the 3 criteria of type 1 visible; probability 0.80. \n(3) no summit pit visible, evidence of flanks or circular outline; probability 0.60. \n(4) only a summit pit visible; probability 0.50. \n(5) no volcano-like features visible; probability 0.0. 
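For concreteness, the type-to-probability mapping above can be written down directly. The following minimal sketch (the dictionary encoding and the helper function are ours, not part of the original system) also anticipates the mixing rule of Equation 1:

```python
# Subjective probability that a volcano is present given the visual "type",
# as elicited from the geologists (Section 3). The values are from the paper;
# the dictionary encoding and the helper are illustrative.
P_V_GIVEN_T = {1: 0.98, 2: 0.80, 3: 0.60, 4: 0.50, 5: 0.0}

def p_volcano(p_t_given_l):
    """Mix type posteriors into a volcano probability:
    p(v|l) = sum_t p(v|t) p(t|l).

    `p_t_given_l` is a hypothetical dict {type: posterior probability}.
    """
    return sum(P_V_GIVEN_T[t] * p for t, p in p_t_given_l.items())

# Example: an ROI judged 2/3 type 2, 1/3 type 3.
print(round(p_volcano({2: 2/3, 3: 1/3}), 3))  # 0.733
```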
These subjective probabilities correspond to the mean probability that a volcano exists at a particular location given that it belongs to a particular type, and were elicited after considerable discussion with the planetary geologists. Thus, the observed data for each ROI consist of labels l, which are noisy estimates of the true \"type\" t, which in turn is probabilistically related to the hidden event of interest, v, the presence of a volcano: \n\np(v|l) = \\sum_{t=1}^{T} p(v|t) p(t|l)    (1) \n\nwhere T is the number of types (and labels). The subjective probabilities described above correspond to p(v|t); to be able to infer the probability of a volcano given a set of labels l, it remains to estimate the p(t|l) terms. \n\n4 Inferring the Label-Type Parameters via the EM Procedure \n\nWe follow a general model for subjective labelling originally proposed by Dawid and Skene [8] and apply it to the image labelling problem; more details on this overall approach are provided in [9]. Let N be the number of local regions of interest (ROIs) in the database (these are 15-pixel-square image patches for the volcano application). For simplicity we consider the case of just a single labeller who labels a given set of ROIs a number of times; the extension to multiple labellers is straightforward assuming conditional independence of the labellings given the true type. Let n_{il} be the number of times that ROI i is labelled with label l. Let y_{it} denote a binary variable which takes value 1 if the true type of ROI i is t, and is 0 otherwise. We assume that labels are assigned independently to a given ROI from one labelling to the next, given that the type is known. 
If the true type t* is known, then \n\np(observed labels | t*, i) \\propto \\prod_{l=1}^{T} p(l|t*)^{n_{il}}.    (2) \n\nThus, unconditionally, we have \n\np(observed labels, t* | i) \\propto \\prod_{t=1}^{T} \\left( p(t) \\prod_{l=1}^{T} p(l|t)^{n_{il}} \\right)^{y_{it}},    (3) \n\nwhere y_{it} = 1 if t = t* and 0 otherwise. Assuming that each ROI is labelled independently of the others (no spatial correlation in terms of labels), \n\np(observed labels, t*) \\propto \\prod_{i=1}^{N} \\prod_{t=1}^{T} \\left( p(t) \\prod_{l=1}^{T} p(l|t)^{n_{il}} \\right)^{y_{it}}.    (4) \n\nStill assuming that the types t for each ROI are known (the y_{it}), the maximum likelihood estimators of p(l|t) and p(t) are \n\n\\hat{p}(l|t) = \\frac{\\sum_i y_{it} n_{il}}{\\sum_l \\sum_i y_{it} n_{il}}    (5) \n\nand \n\n\\hat{p}(t) = \\frac{\\sum_i y_{it}}{N}.    (6) \n\nFrom Bayes' rule one can then show that \n\np(y_{it} = 1 | observed data) = C \\prod_{l=1}^{T} p(l|t)^{n_{il}} \\, p(t),    (7) \n\nwhere C is a normalization constant. Thus, given the observed data n_{il} and the parameters p(l|t) and p(t), one can infer the posterior probabilities of each type via Equation 7. \nHowever, without knowing the y_{it} values we cannot infer the parameters p(l|t) and p(t). One can treat the y_{it} as hidden and thus apply the well-known Expectation-Maximization (EM) procedure to find a local maximum of the likelihood function: \n\n1. Obtain initial estimates of the expected values of y_{it}, e.g., \n\nE[y_{it}] = \\frac{n_{it}}{\\sum_l n_{il}}.    (8) \n\n2. M-step: choose the values of p(l|t) and p(t) which maximize the likelihood function (according to Equations 5 and 6), using E[y_{it}] in place of y_{it}. \n\n3. E-step: calculate the conditional expectation of y_{it}, E[y_{it} | data] = p(y_{it} = 1 | data) (Equation 7). \n\n4. Repeat Steps 2 and 3 until convergence. \n\n5 Experimental Results \n\n5.1 Combining Multiple Expert Opinions \n\nLabellings from 4 geologists on the 4 images resulted in 269 possible volcanoes (ROIs) being identified. 
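The EM procedure of Section 4 can be sketched in a few lines. The following is a toy illustration with hypothetical label counts, not the implementation used in the paper:

```python
import numpy as np

# Hypothetical counts: n[i, l] = number of times ROI i received label l,
# for a single labeller who labels each ROI several times (Section 4).
n = np.array([
    [3, 1, 0, 0, 0],   # an ROI labelled mostly 1
    [0, 0, 2, 2, 0],   # an ROI split between labels 3 and 4
    [0, 0, 0, 1, 3],   # an ROI labelled mostly 5
], dtype=float)
N, T = n.shape  # number of ROIs, number of types (= number of labels)

# Step 1: initialize E[y_it] from the observed label proportions (Eq. 8).
y = n / n.sum(axis=1, keepdims=True)

for _ in range(50):
    # M-step: maximum-likelihood estimates of p(l|t) and p(t) (Eqs. 5-6).
    p_l_given_t = y.T @ n                      # entry [t, l] = sum_i y_it n_il
    p_l_given_t /= p_l_given_t.sum(axis=1, keepdims=True)
    p_t = y.sum(axis=0) / N

    # E-step: posterior p(y_it = 1 | data) via Bayes' rule (Eq. 7),
    # computed in log space for numerical stability.
    log_post = n @ np.log(p_l_given_t.T + 1e-12) + np.log(p_t + 1e-12)
    y = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    y /= y.sum(axis=1, keepdims=True)

print(np.round(y, 3))  # posterior type probabilities p(t | labels) per ROI
```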
Application of the EM procedure resulted in label-type probability matrices as shown in Table 1 for Labeller C. The diagonal elements provide an indication of the reliability of the labeller. There is significant miscalibration for label 3's: according to the model, a label 3 from Labeller C is most likely to correspond to type 2. The label-type matrices of all 4 labellers (not shown) indicated that the model placed more weight on the conservative labellers (C and D) than the aggressive ones (A and B). \nThe determination of posterior probabilities for each of the ROIs is a fundamental step in any quantitative analysis of the volcano data: p(v|l) = \\sum_{t=1}^{T} p(v|t) p(t|l), where the p(t|l) terms are the posterior probabilities of type given labels provided by the EM procedure, and the p(v|t) terms are the subjective volcano-type probabilities discussed in Section 3. \n\nTable 1: Type-Label Probabilities for Individual Labellers as estimated via the EM Procedure \n\nProbability(type|label), Labeller C \n         Type 1  Type 2  Type 3  Type 4  Type 5 \nLabel 1  1.000   0.000   0.000   0.000   0.000 \nLabel 2  0.000   0.977   0.019   0.000   0.004 \nLabel 3  0.094   0.667   0.000   0.065   0.175 \nLabel 4  0.000   0.000   0.233   0.725   0.042 \nLabel 5  0.000   0.000   0.611   0.000   0.389 \n\nTable 2: 10 ROIs from the database: original scientist labels shown with posterior probabilities estimated via the EM procedure \n\n      Scientist Labels    Posterior Probabilities (EM), p(t|l) \nROI   A  B  C  D    Type 1  Type 2  Type 3  Type 4  Type 5   p(v|l) \n1     4  4  4  5    0.000   0.000   0.000   0.816   0.184    0.408 \n2     4  1  4  2    0.000   0.000   0.000   0.991   0.009    0.496 \n3     1  2  1  2    0.023   0.977   0.000   0.000   0.000    0.804 \n4     1  5  3  3    0.000   0.000   1.000   0.000   0.000    0.600 \n5     1  3  3  3    0.000   0.536   0.452   0.012   0.000    0.706 \n6     2  2  2  4    0.000   1.000   0.000   0.000   0.000    0.800 \n7     1  3  5  5    0.000   0.000   1.000   0.000   0.000    0.600 \n8     1  4  2  4    0.000   0.000   0.000   0.999   0.000    0.500 \n9     2  3  5  3    0.000   0.000   0.992   0.000   0.008    0.595 \n10    4  4  4  4    0.000   0.000   0.000   0.996   0.004    0.498 \n\nAs shown in Table 2, the posterior probabilities for the volcanoes generally are in agreement with intuition and often correspond to taking the majority vote or the \"average\" of the C and D labels (the conservative labellers). However, some p(v|l) estimates could not easily be derived by any simple averaging or voting scheme; see, e.g., ROIs 3, 5 and 7 in the table. \n\n5.2 Experiment on Comparing Human and Algorithm Performance \n\nThe standard receiver operating characteristic (ROC) plots detections versus false alarms [10]. The ROCs shown here differ in two significant ways [11]: (1) the false alarm axis is normalized relative to the number of true positives (necessary since the total number of possible false alarms is not well defined for object detection in images), and (2) the reference labels used in scoring are probabilistic: a detection \"scores\" p(v) on the detection axis and 1 - p(v) on the false alarm axis. 
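This modified scoring rule can be sketched as follows; the variable names and the simple list-based alignment of detections to reference ROIs are our own illustrative assumptions:

```python
# Sketch of the probabilistic ROC scoring of Section 5.2: a detection at a
# reference location with volcano probability p scores p on the detection
# axis and 1 - p on the false alarm axis, with both axes normalized by the
# total probability mass of true volcanoes in the reference list.

def weighted_roc_point(reference_probs, detected_flags):
    """reference_probs: p(v) for each labelled reference ROI.
    detected_flags: whether the detector fired on that ROI (the spatial
    matching of detections to ROIs is assumed done elsewhere)."""
    total = sum(reference_probs)
    detection = sum(p for p, hit in zip(reference_probs, detected_flags) if hit)
    false_alarm = sum(1 - p for p, hit in zip(reference_probs, detected_flags) if hit)
    # Return (detection rate %, false alarm rate %), as plotted in Figure 2.
    return 100 * detection / total, 100 * false_alarm / total

# Example: three reference ROIs, detector fires on the first two.
print(weighted_roc_point([0.8, 0.5, 0.6], [True, True, False]))
```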
\n\nFigure 2: Modified ROCs for both scientists and algorithms: (a) without the labelling or type uncertainty, (b) with the full uncertainty model factored in. (Axes: mean detection rate [%] versus mean false alarm rate, expressed as a % of the total number of volcanoes; curves shown for the SVD algorithm and Scientists A-D.) \n\nAs before, data came from 4 images, and there were 269 labelled local regions. The SVD-Gaussian algorithm was evaluated in cross-validation mode (train on 3 images, test on the 4th) and the results combined. The first ROC (Figure 2(a)) does not take into account either label-type or type-volcano probabilities, i.e., the reference list (for algorithm training and overall evaluation) is a consensus list (2 scientists working together) where the distinctions among labels 1-4 are ignored and all labelled items are counted equally as volcanoes. The individual labellers and algorithm are then scored in the standard \"non-weighted\" ROC fashion. This curve is optimistic in terms of depicting the accuracy of the detectors, since it ignores the underlying probabilistic nature of the labels. Even with this optimistic curve, volcano labelling is relatively inaccurate by either man or machine. \nFigure 2(b) shows a weighted ROC: for each of the 4 scientists, the probabilistic \"reference labels\" were derived via the EM procedure (as in Table 2) from the other 3 scientists, and the detections of each scientist were scored according to each such reference set. 
Performance of the algorithm (the SVD-Gaussian method) was evaluated relative to the EM-derived label estimates of all 4 scientists. Accounting for all of the uncertainty in the data results in a more realistic, if less flattering, set of performance characteristics. The algorithm's performance degrades more than the scientists' performance (at low false alarm rates, compared to Figure 2(a)) when the full noise model is used. The algorithm estimates the posterior probabilities of volcanoes rather poorly, and the complete uncertainty model is more sensitive to this fact. This is a function of the SVD feature space rather than the Gaussian classification model. \n\n6 Conclusion \n\nIgnoring subjective uncertainty in image labelling can lead to significant over-confidence in performance estimation (for both humans and machines). For the volcano detection task, a simple model for uncertainty in the class labels provided insight into the performance of both human and algorithmic detectors. An obvious extension of the maximum likelihood framework outlined here is a Bayesian approach [12]: accounting for parameter uncertainty in the model, given the limited amount of training data available, is worth investigating. \n\nAcknowledgements \n\nThe research described in this paper was carried out by the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration, and was supported in part by ARPA under grant number N00014-92-J-1860. \n\nReferences \n\n1. P. Smyth, \"Learning with probabilistic supervision,\" in Computational Learning Theory and Natural Learning Systems 3, T. Petsche, M. Kearns, S. Hanson, R. Rivest (eds.), Cambridge, MA: MIT Press, to appear. \n\n2. S. 
French, \"Group consensus probability distributions: a critical survey,\" in Bayesian Statistics 2, J. M. Bernardo, M. H. DeGroot, D. V. Lindley, A. F. M. Smith (eds.), Elsevier Science Publishers, North-Holland, pp. 183-202, 1985. \n\n3. J. S. Uebersax, \"Statistical modeling of expert ratings on medical treatment appropriateness,\" J. Amer. Statist. Assoc., vol. 88, no. 422, pp. 421-427, 1993. \n\n4. G. Lugosi, \"Learning with an unreliable teacher,\" Pattern Recognition, vol. 25, no. 1, pp. 79-87, 1992. \n\n5. Science, special issue on Magellan data, April 12, 1991. \n\n6. J. C. Aubele and E. N. Slyuta, \"Small domes on Venus: characteristics and origins,\" Earth, Moon and Planets, 50/51, pp. 493-532, 1990. \n\n7. M. C. Burl, U. M. Fayyad, P. Perona, P. Smyth, and M. P. Burl, \"Automating the hunt for volcanoes on Venus,\" in Proceedings of the 1994 Computer Vision and Pattern Recognition Conference: CVPR-94, Los Alamitos, CA: IEEE Computer Society Press, pp. 302-309, 1994. \n\n8. A. P. Dawid and A. M. Skene, \"Maximum likelihood estimation of observer error-rates using the EM algorithm,\" Applied Statistics, vol. 28, no. 1, pp. 20-28, 1979. \n\n9. P. Smyth, M. C. Burl, U. M. Fayyad, P. Perona, \"Knowledge discovery in large image databases: dealing with uncertainties in ground truth,\" in Knowledge Discovery in Databases 2, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (eds.), AAAI/MIT Press, to appear, 1995. \n\n10. M. S. Chesters, \"Human visual perception and ROC methodology in medical imaging,\" Phys. Med. Biol., vol. 37, no. 7, pp. 1433-1476, 1992. \n\n11. M. C. Burl, U. M. Fayyad, P. Perona, P. Smyth, \"Automated analysis of radar imagery of Venus: handling lack of ground truth,\" in Proceedings of the IEEE Conference on Image Processing, Austin, TX, November 1994. \n\n12. W. 
Buntine, \"Operations for learning with graphical models,\" Journal of Artificial Intelligence Research, 2, pp. 159-225, 1994. \n", "award": [], "sourceid": 949, "authors": [{"given_name": "Padhraic", "family_name": "Smyth", "institution": null}, {"given_name": "Usama", "family_name": "Fayyad", "institution": null}, {"given_name": "Michael", "family_name": "Burl", "institution": null}, {"given_name": "Pietro", "family_name": "Perona", "institution": null}, {"given_name": "Pierre", "family_name": "Baldi", "institution": null}]}