{"title": "Improving Human Judgments by Decontaminating Sequential Dependencies", "book": "Advances in Neural Information Processing Systems", "page_first": 1705, "page_last": 1713, "abstract": "For over half a century, psychologists have been struck by how poor people are at expressing their internal sensations, impressions, and evaluations via rating scales. When individuals make judgments, they are incapable of using an absolute rating scale, and instead rely on reference points from recent experience. This relativity of judgment limits the usefulness of responses provided by individuals to surveys, questionnaires, and evaluation forms. Fortunately, the cognitive processes that transform internal states to responses are not simply noisy, but rather are influenced by recent experience in a lawful manner. We explore techniques to remove sequential dependencies, and thereby decontaminate a series of ratings to obtain more meaningful human judgments. In our formulation, decontamination is fundamentally a problem of inferring latent states (internal sensations) which, because of the relativity of judgment, have temporal dependencies. We propose a decontamination solution using a conditional random field with constraints motivated by psychological theories of relative judgment. Our exploration of decontamination models is supported by two experiments we conducted to obtain ground-truth rating data on a simple length estimation task. Our decontamination techniques yield an over 20% reduction in the error of human judgments.", "full_text": "Decontaminating Human Judgments\nby Removing Sequential Dependencies\n\nMichael C. Mozer,? Harold Pashler,\u2020 Matthew Wilder,?\nRobert V. Lindsey,? Matt C. Jones,\u25e6 & Michael N. Jones\u2021\n\n? Dept. of Computer Science, University of Colorado\n\n\u2020Dept. of Psychology, UCSD\n\n\u25e6Dept. of Psychology, University of Colorado\n\n\u2021Dept. 
of Psychological and Brain Sciences, Indiana University\n\nAbstract\n\nFor over half a century, psychologists have been struck by how poor people are at\nexpressing their internal sensations, impressions, and evaluations via rating scales.\nWhen individuals make judgments, they are incapable of using an absolute rating\nscale, and instead rely on reference points from recent experience. This relativity\nof judgment limits the usefulness of responses provided by individuals to surveys,\nquestionnaires, and evaluation forms. Fortunately, the cognitive processes that\ntransform internal states to responses are not simply noisy, but rather are in\ufb02u-\nenced by recent experience in a lawful manner. We explore techniques to remove\nsequential dependencies, and thereby decontaminate a series of ratings to obtain\nmore meaningful human judgments. In our formulation, decontamination is fun-\ndamentally a problem of inferring latent states (internal sensations) which, be-\ncause of the relativity of judgment, have temporal dependencies. We propose a\ndecontamination solution using a conditional random \ufb01eld with constraints mo-\ntivated by psychological theories of relative judgment. Our exploration of de-\ncontamination models is supported by two experiments we conducted to obtain\nground-truth rating data on a simple length estimation task. 
Our decontamination
techniques yield an over 20% reduction in the error of human judgments.

1 Introduction

Suppose you are asked to make a series of moral judgments by rating, on a 1–10 scale, various
actions, with a rating of 1 indicating ‘not particularly bad or wrong’ and a rating of 10 indicating
‘extremely evil.’ Consider the series of actions on the left.

(1) Stealing a towel from a hotel
(2) Keeping a dime you find on the ground
(3) Poisoning a barking dog

(1′) Testifying falsely for pay
(2′) Using guns on striking workers
(3′) Poisoning a barking dog

Now consider that instead you had been shown the series on the right. Even though individuals are
asked to make absolute judgments, the mean rating of statement (3) in the first context is reliably
higher than the mean rating of the identical statement (3′) in the second context (Parducci, 1968).
The classic explanation of this phenomenon is cast in terms of anchoring or primacy: information
presented early in time serves as a basis for making judgments later in time (Tversky & Kahneman,
1974). In the Netflix contest, significant attention was paid to anchoring effects by considering that
an individual who gives high ratings early in a session is likely to be biased toward higher ratings
later in a session (Koren, August 2009; Ellenberg, March 2008).
The need for anchors comes from the fact that individuals are poor at or incapable of making absolute
judgments and instead must rely on reference points to make relative judgments (e.g., Laming, 1984;
Parducci, 1965, 1968; Stewart, Brown, & Chater, 2005). Where do these reference points come
from? There is a rich literature in experimental and theoretical psychology exploring sequential
dependencies suggesting that reference points change from one trial to the next in a systematic
manner.
(We use the psychological jargon ‘trial’ to refer to a single judgment or rating in a series.)
Sequential dependencies occur in many common tasks in which an individual is asked to make
a series of responses, such as filling out surveys, questionnaires, and evaluations (e.g., usability
ratings, pain assessment inventories). Every faculty member is aware of drift in grading that
necessitates comparing papers graded early on a stack with those graded later. Recency effects have been
demonstrated in domains as varied as legal reasoning and jury evidence interpretation (Furnham,
1986; Hogarth & Einhorn, 1992) and clinical assessments (Mumma & Wilson, 2006).
However, the most carefully controlled laboratory studies of sequential dependencies, dating back
to the 1950s (discussed by Miller, 1956), involve the rating of unidimensional stimuli, such as
the loudness of a tone or the length of a line. Human performance at rating stimuli is surprisingly
poor compared to an individual’s ability to discriminate the same stimuli. Regardless of the domain,
responses convey not much more than 2 bits of mutual information with the stimulus (Stewart et
al., 2005). Different types of judgment tasks have been studied, including absolute identification,
in which the individual’s task is to specify the distinct stimulus level (e.g., 10 levels of loudness);
magnitude estimation, in which the task is to estimate the magnitude of a stimulus which may vary
continuously along a dimension; and categorization, which is a hybrid task requiring individuals to
label stimuli by range. Because the number of responses in absolute identification and categorization
tasks is often quite large, and because individuals are often not aware of the discreteness of stimuli in
absolute identification tasks, there isn’t a qualitative difference among tasks.
Feedback is typically
provided, especially in absolute identification and categorization tasks. Without feedback, there are
no explicit anchors against which stimuli can be assessed.
The pattern of sequential effects observed is complex. Typically, on experimental trial t, trial t−1 has a
large influence on ratings, and trials t−2, t−3, etc., have successively diminishing influences. The
influence of recent trials is exerted by both the stimuli and responses, a fact which makes sense in
light of the assumption that individuals form their response on the current trial by analogy to recent
trials (i.e., they determine a response to the current stimulus that has the same relationship as the
previous response had to the previous stimulus). Both assimilation and contrast effects occur: an
assimilative response on trial t occurs when the response moves in the direction of the stimulus or
response on trial t − k; a contrastive response is one that moves away. Interpreting recency effects
in terms of assimilation and contrast is nontrivial and theory dependent (DeCarlo & Cross, 1990).
Many mathematical models have been developed to explain the phenomena of sequential effects in
judgment tasks. All adopt the assumption that the transduction of a stimulus to its internal
representation is veridical. We refer to this internal representation as the sensation, as distinguished from the
external stimulus. (For judgments of nonphysical quantities such as emotional states and affinities,
perhaps the terms impression or evaluation would be more appropriate than sensation.) Sequential
dependencies and other corruptions of the representation occur in the mapping of the sensation to a
response. According to all theories, this mapping requires reference to previous sensation-response
pairings. However, the theories differ with respect to the reference set. At one extreme, the theory of
Stewart et al.
(2005) assumes that only the previous sensation-response pair matters. Other theories\nassume that multiple sensation-response anchors are required, one \ufb01xed and unchanging and another\nvarying from trial to trial (e.g., DeCarlo & Cross, 1990). And in categorization and absolute identi\ufb01-\ncation tasks, some theories posit anchors for each distinct response, which are adjusted trial-to-trial\n(e.g., Petrov & Anderson, 2005). Range-frequency theory (Parducci, 1965) claims that sequential\neffects arise because the sensation-response mapping is adjusted to utilize the full response range,\nand to produce roughly an equal number of responses of each type. This effect is the consequence\nof many other theories, either explicitly or implicitly.\nBecause recent history interacts with the current stimulus to determine an individual\u2019s response,\nresponses have a complex relationship with the underlying sensation, and do not provide as much\ninformation about the internal state of the individual as one would hope. In the applied psychology\nliterature, awareness of sequential dependencies has led some researchers to explore strategies that\nmitigate relativity of judgment, such as increasing the number of response categories and varying\nthe type and frequency of anchors (Mumma & Wilson, 2006; Wedell, Parducci, & Lane, 1990).\nIn contrast, our approach to extracting more information from human judgments is to develop auto-\nmatic techniques that recover the underlying sensation from a response that has been contaminated\n\n2\n\n\fby cognitive processes producing the response. We term this recovery process decontamination. As\nwe mentioned earlier, there is some precedent in the Net\ufb02ix competition for developing empirical\napproaches to decontamination. However, to the best of our knowledge, the competitors were not\nfocused on trial-to-trial effects, and their investigation was not systematic. 
Systematic investigation
requires ground truth knowledge of the individuals’ sensations.

2 Experiments

To collect ground-truth data for use in the design of decontamination techniques, we conducted two
behavioral experiments using stimuli whose magnitudes could be objectively determined. In both
experiments, participants were asked to judge the horizontal gap between two vertically aligned
dots on a computer monitor. The position of the dots on the monitor shifted randomly from trial
to trial. Participants were asked to respond to each dot pair using a 10-point rating scale, with 1
corresponding to the smallest gap they would see, and 10 corresponding to the largest.
The task requires absolute identification of 10 distinct gaps. The participants were only told that
their task was to judge the distance between the dots. They were not told that only 10 unique stimuli
were presented, and were likely unaware of this fact (memory of exact absolute gaps is too poor), and
thus the task is indistinguishable from a magnitude estimation or categorization task in which the gap
varied continuously. The experiment began with a practice block of ten trials. During the practice
block, participants were shown every one of the ten gaps in random order, and simultaneous with the
stimulus they were told—via text on the screen below the dots—the correct classification. After the
practice block, no further feedback was provided. Although the psychology literature is replete with
line-length judgment studies (two recent examples: Lacouture, 1997; Petrov & Anderson, 2005), the
vast majority provide feedback to participants on at least some trials beyond the practice block. We
wanted to avoid the anchoring provided by feedback so that the task is more analogous to
the type of survey tasks we wish to decontaminate, e.g., the Netflix movie scores. Another
Another\ndistinction between our experiments and previous experiments is an attempt to carefully control the\nsequence structure, as described next.\n\n2.1 Experiment Methodology\n\nIn Experiment 1, the practice block was followed by 2 blocks of 90 trials. Within a block, the trial\nsequence was arranged such that each gap was preceded exactly once by each other gap, with the\nexception that no repetitions occurred. Further, every ten trials in a block consisted of exactly one\npresentation of each gap. In Experiment 2, the practice block was followed by 2 blocks of 100 trials.\nThe constraint on the sequence in Experiment 2 was looser than in Experiment 1: within a block,\neach gap occurred exactly once preceded by each other gap. However, repetitions were included, and\nthere was no constraint on the subblocks of ten trials. The other key difference between experiments\nwas the gap lengths. In Experiment 1, gap g, with g \u2208 {1, 2, ...10} spanned a proportion .08g of the\nscreen width. In Experiment 2, gap g spanned a proportion .061 + .089g of the screen width. The\nmain reason for conducting Experiment 2 was that we found the gaps used in Experiment 1 resulted\nin low error rates and few sequential effects for the smaller gaps. Other motivations for Experiment\n2 will be explained later.\nBoth experiments were conducted via the web, using a web portal set up for psychology studies.\nParticipants were prescreened for their ability to understand English instructions, and were paid $4\nfor the 10\u201315 minutes required to complete the experiment. Two participants in Experiment 1 and\none participant in Experiment 2 were excluded from data analysis because their accuracy was below\n20%. The portal was opened for long enough to obtain good data from 76 participants in each\nExperiment. 
Individuals were allowed to participate in only one of the two experiments.\n\n2.2 Results and Discussion of Human Experiments\n\nFigure 1 summarizes the data from Experiments 1 and 2 (top and bottom rows, respectively). All\ngraphs depict the error on a trial, de\ufb01ned as the signed difference Rt \u2212 St between the current\nresponse, Rt, and the current stimulus level St. The left column plots the error on trial t as a function\nof St\u22121 (along the abscissa) and St (the different colored lines, as speci\ufb01ed by the key between the\ngraphs). Pairs of stimulus gaps (e.g., G1 and G2) have been grouped together to simplify the graph.\n\n3\n\n\fFigure 1: Human data from Experiments 1 (top row) and 2 (bottom row).\n\nThe small bars around the point indicate one standard error of the mean. The variation along the\nabscissa re\ufb02ects sequential dependencies: assimilation is indicated by pairs of points with positive\nslopes (larger values of St\u22121 result in larger Rt), and contrast is indicated by negative slopes. The\npattern of results across the two experiments is remarkably consistent.\nThe middle column shows another depiction of sequential dependencies by characterizing the distri-\nbution of errors (Rt\u2212 St \u2208 {> 1, 1, 0,\u22121, < \u22121}) as a function of St\u2212 St\u22121. The predominance of\nassimilative responses is re\ufb02ected in more Rt > St responses when St \u2212 St\u22121 < 0, and vice-versa.\nThe rightmost column presents the lag pro\ufb01le that characterizes how the stimulus on trial t \u2212 k for\nk = 1...5 in\ufb02uences the response on trial t. The bars on each point indicate one standard error of\nthe mean. For the purpose of the current work, most relevant is that sequential dependencies in this\ntask may stretch back two or three trials.\n\n3 Approaches To Decontamination\n\nFrom a machine learning perspective, decontamination can be formulated in at least three different\nways. 
First, it could be considered an unsupervised infomax problem of determining a sensation\nassociated with each distinct stimulus such that the sensation sequence has high mutual information\nwith the response sequence. Second, it could be considered a supervised learning problem in which\na specialized model is constructed for each individual, using some minimal amount of ground-truth\ndata collected from that individual. Here, the ground truth is the stimulus-sensation correspondence,\nwhich can be obtained\u2014in principle, even with unknown stimuli\u2014by laborious data collection tech-\nniques, such as asking individuals to provide a full preference ordering or multiple partial orderings\nover sets of stimuli, or asking individuals to provide multiple ratings of a stimulus in many different\ncontexts, so as to average out sequential effects. Third, decontamination models could be built based\non ground-truth data for one group of individuals and then tested on another group. In this paper,\nwe adopt this third formulation of the problem.\nFormally, the decontamination problem involves inferring the sequence of (unobserved) sensations\ngiven the complete response sequence. 
To introduce some notation, let R^p_{t1,t2} denote the sequence
of responses made by participant p on trials t1 through t2 when shown a sequence of stimuli that
evoke the sensation sequence S^p_{t1,t2}.[1] Decontamination can be cast as computing the expectation or
probability over S^p_{1,T} given R^p_{1,T}, where T is the total number of judgments made by the individual.
Although psychological theories of human judgment address an altogether different problem—that
of predicting R^p_t, the response on trial t, given S^p_{1,t} and R^p_{1,t−1}—they can inspire decontamination
techniques. Two classes of psychological theories correspond to two distinct function approximation
techniques. Many early models of sequential dependencies, culminating in the work of DeCarlo and
Cross (1990), are framed in terms of autoregression. In contrast, other models favor highly flexible,
nonlinear approaches that allow for similarity-based assimilation and contrast, and independent
representations for each response label (e.g., Petrov & Anderson, 2005).
Given the discrete stimuli and
responses, a lookup table seems the most general characterization of these models.
We explore a two-dimensional space of decontamination techniques. The first dimension of this
space is the model class: regression, lookup table, or an additive hybrid. We define our regression
model estimating St as:

    REGt(m, n) = α + β · R_{t−m+1,t} + γ · S_{t−n,t−1},    (1)

where the model parameters β and γ are vectors, and α is a scalar. Similarly, we define our lookup
table LUTt(m, n) to produce an estimate of St by indexing over the m responses R_{t−m+1,t} and the
n sensations S_{t−n,t−1}. Finally, we define an additive hybrid, REG⊕LUT(m, n), by first constructing
a regression model, and then building a lookup table on the residual error, St − REGt(m, n). The
motivation for the hybrid is the complementarity of the two models, the regression model capturing
linear regularities and the lookup table representing arbitrary nonlinear relationships.
The second dimension in our space of decontamination techniques specifies how inference is handled.
Decontamination is fundamentally a problem of inferring unobserved states. To utilize any
of the models above for n > 0, the sensations S_{t−n,t−1} must be estimated. Although time flows in
one direction, inference flows in two: in psychological models, Rt is influenced by both St and
St−1; this translates to a dependence of St on both St−1 and St+1 when conditioned on R_{1,T}. To
handle inference properly, we construct a linear-chain conditional random field (Lafferty, McCallum,
& Pereira, 2001; Sutton & McCallum, 2007). As an alternative to the conditional random field
(hereafter, CRF), we also consider a simple approach in which we set n = 0 and discard the
sensation terms in our regression and lookup tables.
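To make the model classes concrete, the SIMPLE variants REG(2, 0) and LUT(2, 0) can be sketched in a few lines of Python. This is our own illustrative reconstruction, not the authors' code; the function names, the data format (a list of (stimuli, responses) sequence pairs, one per participant), and the synthetic assimilation model in the demo are all hypothetical.

```python
import random
from collections import defaultdict

def fit_lut(train):
    """LUT(2, 0): each table entry is the mean training sensation observed
    for a given (R_{t-1}, R_t) response pair."""
    sums, counts = defaultdict(float), defaultdict(int)
    for stimuli, responses in train:
        for t in range(1, len(responses)):
            key = (responses[t - 1], responses[t])
            sums[key] += stimuli[t]
            counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}

def fit_reg(train):
    """REG(2, 0): least-squares fit of S_t = alpha + b1*R_t + b2*R_{t-1},
    solved via the 3x3 normal equations with Gaussian elimination."""
    X, y = [], []
    for stimuli, responses in train:
        for t in range(1, len(responses)):
            X.append([1.0, responses[t], responses[t - 1]])
            y.append(float(stimuli[t]))
    n = 3
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    for col in range(n):  # forward elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, n))) / A[r][r]
    return w  # [alpha, coefficient on R_t, coefficient on R_{t-1}]

# Demo on synthetic ratings contaminated by assimilation toward the previous
# response (a hypothetical generative model, not the paper's data):
random.seed(1)
train = []
for _ in range(20):
    S = [random.randint(1, 10) for _ in range(60)]
    R = [S[0]]
    for t in range(1, len(S)):
        R.append(max(1, min(10, round(0.8 * S[t] + 0.2 * R[t - 1]))))
    train.append((S, R))
w = fit_reg(train)    # the R_{t-1} coefficient partials out the assimilation
lut = fit_lut(train)  # nonlinear counterpart of the same correction
```

The hybrid REG⊕LUT(2, 0) would fit `fit_reg` first and then apply a `fit_lut`-style table to the residuals St − REGt.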
At the other extreme, we can assume an oracle
that provides S_{t−n,t−1}; this oracle approach offers an upper bound on achievable performance.
We explore the full Cartesian product of approaches consisting of models chosen from
{REG, LUT, REG⊕LUT} and inference techniques chosen from {SIMPLE, CRF, ORACLE}. The
SIMPLE and ORACLE approaches are straightforward classic statistics, but we need to explain how
the different models are incorporated into a CRF. The linear-chain CRF is a distribution

    P(S_{1,T} | R_{1,T}) = (1 / Z(R_{1,T})) exp{ Σ_{t=1..T} Σ_{k=1..K} λk fk(t, S_{t−1,t}, R_{1,T}) }    (2)

with a given set of feature functions, {fk}. The linear combination of these functions determines the
potential at some time t, denoted Φt, where a higher potential reflects a more likely configuration
of variables. To implement a CRF-REG model, we would like the potential to be high when the
regression equation is satisfied, e.g., Φt = −(REGt(m, n) − St)². Simply expanding this error
yields a collection of first and second order terms. Folding the terms not involving the sensations
into the normalization constant, the following terms remain for REG(2, 1): St, RtSt, St², RtSt−1,
Rt−1St, and StSt−1.[2] The regression potential function can be obtained by making each of these
terms into a real-valued feature, and determining the λ parameters in Equation 2 to yield the α, β,
and γ parameters in Equation 1.[3]
The CRF-LUT model could be implemented using indicator features, as is common in CRF models,
but this approach yields an explosion of free parameters: a feature would be required for each cell of
the table and each value of St, yielding 10^4 free parameters for a gap detection task with a modest
CRF-LUT(2, 1). Instead, we opted for the direct analog of the CRF-REG: encouraging configurations
in which St is consistent with LUTt(m, n) via the potential Φt = −(LUTt(m, n) − St)². This approach
yields three real-valued features: LUTt(m, n)², St², and LUTt(m, n)St. (Remember that lookup
table values are indexed by St−1, and therefore cannot be folded into the normalization constant.)
Finally, the CRF-REG⊕LUT is a straightforward extension of the models we’ve described, based on
the potential Φt = −(REGt(m, n) + LUTt(m, n) − St)², which still has only quadratic terms in
St and St−1. Having now described a 3 × 3 space of decontamination approaches, we turn to the
details of our decontamination experiments.

[1] We are switching terminology: in the discussion of our experiment, S refers to the stimulus. In the
discussion of decontamination, S will refer to the sensation. The difference is minor because the stimulus and
sensation are in one-to-one correspondence.
[2] The terms Rt−1St−1 and St−1² are omitted because they correspond to RtSt and St², respectively.
[3] As we explain shortly, the {λk} are determined by CRF training; our point here is that the CRF has the
capacity to represent a least-squares regression solution.

3.1 Debiasing and Decompressing

Although our focus is on decontaminating sequential dependencies, or desequencing, the quality
of human judgments can be reduced by at least three other factors. First, individuals may have an
overall bias toward smaller or larger ratings. Second, individuals may show compression, possibly
nonlinear, of the response range. Third, there may be slow drift in the center or spread of the
response range, on a relatively long time scale. All of these factors are likely to be caused at least in
part by trial-to-trial sequential effects.
For example, compression will be a natural consequence of
assimilation because the endpoints of the response scale will move toward the center. Nonetheless
we find it useful to tease apart the factors that are easy to describe (bias, compression) from those
that are more subtle (assimilation, contrast).
In the data from our two experiments, we found no evidence of drift, as determined by the fact that
regression models with moving averages of the responses did not improve predictions. This finding
is not terribly surprising given that the entire experiment took only 10–15 minutes to complete.
We briefly describe how we remove bias and compression from our data. Decompression can be
achieved with a LUT(1, 0), which maps each response into the expected sensation. For example, in
Experiment 1, the shortest stimuli were reported as G1 and G2 with high accuracy, but the longest
stimuli tended to be underestimated by all participants. The LUT(1, 0) compensates for this compression
by associating responses G8 and G9 with higher sensation levels if the table entries are filled based
on the training data according to: LUTt(1, 0) ≡ E[St | Rt]. All of the higher order lookup tables,
LUT(m, n), for m ≥ 1 and n ≥ 0, will also perform nonlinear decompression in the same manner.
The REG models alone will also achieve decompression, though only linear decompression.
We found ample evidence of individual biases in the use of the response scale. To debias the data,
we compute the mean response of a particular participant p, R̄^p ≡ (1/T) Σt R^p_t, and ensure the means
are homogeneous via the constraint R^p_t − R̄^p = S^p_t − S̄^p. Assuming that the mean sensation is
identical for all participants—as it should be in our experiments—debiasing can be incorporated
into the lookup tables by storing not E[St | Rt, ...], but rather E[S^p_t + R̄^p | Rt, ...], and recovering the
sensation for a particular individual using LUT(m, n) − R̄^p. (This trick is necessary to index into
the lookup table with discrete response levels. Simply normalizing individuals’ responses would yield
noninteger responses.) Debiasing of the regression models can be achieved by adding an R̄^p term to
the regression. Note that this extra term—whether in the lookup table retrieval or the regression—
results in additional features involving combinations of R̄^p and St, St−1, and LUT(m, n) being
added to the three CRF models.

3.2 Modeling Methodology

In all the results we report on, we use a one-back response history, i.e., m = 2. Therefore, the
SIMPLE models are REG(2, 0), LUT(2, 0), and REG⊕LUT(2, 0), and the ORACLE and CRF models are
REG(2, 1), LUT(2, 1), and REG⊕LUT(2, 1). In the ORACLE models, St−1 is assumed to be known
when St is estimated; in the CRF models, the sensations are all inferred. The models are trained
via multiple splits of the available data into equal-sized training and test sets (38 participants per
set). Parameters of the SIMPLE-REG and ORACLE-REG models are determined by least-squares
regression on the training set. Entries in the SIMPLE-LUT and ORACLE-LUT are the expectation over
trials and participants: E[S^p_t + R̄^p | Rt, Rt−1, ...]. The SIMPLE-REG⊕LUT and ORACLE-REG⊕LUT
models are trained first by obtaining the regression coefficients, and then filling lookup table entries
with the expected residual, E[S^p_t − REG^p_t | Rt, Rt−1, ...]. For the CRF models, the feature coefficients
{λk} are obtained via gradient descent and the forward-backward algorithm, as detailed in Sutton
and McCallum (2007). The lookup tables used in the CRF-LUT and CRF-REG⊕LUT are the same
as those in the ORACLE-LUT and ORACLE-REG⊕LUT models. The CRF λ parameters are initialized
to be consistent with our notion of the potential as the negative squared error, using initialization
values obtained from the regression coefficients of the ORACLE-REG model. This initialization is
extremely useful because it places the parameters in easy reach of an effective local minimum. No
regularization is used on the CRF because of the small number of free parameters (7 for CRF-REG,
5 for CRF-LUT, and 14 for CRF-REG⊕LUT). Each model is used to determine the expected value of
St. We had initially hoped that a Viterbi decoding of the CRF might yield useful predictions, but the
expectation proved far superior, most likely because there is not a single path through the CRF that
is significantly better than others due to the high level of noise in the data.

Figure 2: Results from Experiment 1 (left column) and Experiment 2 (right column). The top row
compares the reduction in prediction error for different types of decontamination. The bottom row
compares reduction in prediction error for different desequencer algorithms.

Beyond the primary set of models described above, we explored several other models. We tested
models in which the sensation and/or response values are log transformed, because sensory
transduction introduces logarithmic compression. However, these models do not reliably improve
decontamination. We examined higher-order regression models, i.e., m > 2. These models are helpful
for Experiment 1, but only because we inadvertently introduced structure into the sequences via the
constraint that each stimulus had to be presented once before it could be repeated. The consequence
of this constraint is that a series of small gaps predicted a larger gap on the next trial, and vice-versa.
One reason for conducting Experiment 2 was to eliminate this constraint. It also eliminated
the benefit of higher-order regression models.
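The debiasing and decompression corrections of Section 3.1, and the RMSE comparison used to evaluate them, can be sketched as follows. This is an illustrative reconstruction under simplifying assumptions, not the authors' code: it implements only the LUT(1, 0) decompression table with the E[St + R̄p | Rt] debiasing trick, and the names and data format are hypothetical.

```python
import random
from collections import defaultdict
from math import sqrt

def fit_debiased_lut(train):
    """LUT(1, 0) with the debiasing trick: store E[S_t + Rbar_p | R_t] over
    training trials, so the table is still indexed by raw integer responses."""
    sums, counts = defaultdict(float), defaultdict(int)
    for stimuli, responses in train:
        rbar = sum(responses) / len(responses)
        for s, r in zip(stimuli, responses):
            sums[r] += s + rbar
            counts[r] += 1
    return {r: sums[r] / counts[r] for r in sums}

def decontaminate(lut, responses):
    """Recover sensations for one participant as LUT(R_t) - Rbar_p.
    Unseen response levels fall back to the debiased response itself."""
    rbar = sum(responses) / len(responses)
    return [lut.get(r, r + rbar) - rbar for r in responses]

def rmse(pred, truth):
    return sqrt(sum((p - s) ** 2 for p, s in zip(pred, truth)) / len(truth))

# Demo: participants whose only contamination is a constant response bias
random.seed(0)
def make_participant(bias, T=200):
    S = [random.randint(2, 9) for _ in range(T)]
    return S, [s + bias for s in S]
train = [make_participant(b) for b in (-1, 1, -1, 1, 0, 0)]
test = [make_participant(b) for b in (1, -1)]
lut = fit_debiased_lut(train)
baseline = sum(rmse(R, S) for S, R in test) / len(test)  # prediction = raw R
debiased = sum(rmse(decontaminate(lut, R), S) for S, R in test) / len(test)
```

On this synthetic data the debiased table removes most of the per-participant offset, so `debiased` falls well below `baseline`; in the paper's pipeline the same table entries would additionally be conditioned on Rt−1 for desequencing.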
We also examined switched regression models whose
parameters were contingent on the current response. These models do not significantly outperform
the REG⊕LUT models.

4 Results

Figure 2 shows the root mean squared error (RMSE) between the ground-truth sensation and the
model-estimated sensation over the set of validation subjects for 100 different splits of the data. The
left and right columns present results for Experiments 1 and 2, respectively. In the top row of the
figure, we compare baseline performance with no decontamination—where the sensation prediction
is simply the participant’s actual response (pink bar)—against decompression alone (magenta bar),
debiasing alone (red bar), debiasing and decompression (purple bar), and the best full decontamination
model, which includes debiasing, decompression, and desequencing (blue bar). The difference
between each pair of these results is highly reliable, indicating that bias, compression, and recency
effects all contribute to the contamination of human judgments.
The reduction of error due to debiasing is 14.8% and 11.1% in Experiments 1 and 2,
respectively. The further reduction in error when decompression is incorporated is 4.8% and 3.4% in Experiments 1 and 2. Finally, the further reduction in error when desequencing is incorporated is 5.0% and 4.1% in Experiments 1 and 2. We reiterate that bias and compression likely have at least part of their basis in sequential dependencies. Indeed, models like CRF-REG⊕LUT perform nearly as well even without separate debiasing and decompression corrections.

The bottom row of Figure 2 examines the relative performance of the nine models defined by the Cartesian product of model type (REG, LUT, and REG⊕LUT) and inference type (SIMPLE, CRF, and ORACLE). The joint model REG⊕LUT, which exploits both the regularity of the regression model and the flexibility of the lookup table, clearly works better than either REG or LUT in isolation. Comparing SIMPLE, which ignores the mutual constraints provided by the inferred sensations, to CRF, which exploits bidirectional temporal constraints, we see that CRF inference produces reliably better results in five of six cases, as evaluated by paired t-tests. We do not have a good explanation for the advantage of SIMPLE-LUT over CRF-LUT in Experiment 1, although there are some minor differences in how the lookup tables for the two models are constructed, and we are investigating whether those differences might be responsible. We included the ORACLE models to give us a sense of how much improvement we might potentially obtain, and clearly there is still some potential gain, as indicated by ORACLE-REG⊕LUT.

5 Discussion

Psychologists have long been struck by the relativity of human judgments and have noted that relativity limits how well individuals can communicate their internal sensations, impressions, and evaluations via rating scales.
We\u2019ve shown that decontamination techniques can improve the quality of\njudgments, reducing error by over 20% Is a 20% reduction signi\ufb01cant? In the Net\ufb02ix competition, if\nthis improvement in the reliability of the available ratings translated to a comparable improvement\nin the collaborative \ufb01ltering predictions, it would have been of critical signi\ufb01cance.\nIn this paper, we explored a fairly mundane domain: estimating the gap between pairs of dots on\na computer monitor. The advantage of starting our explorations in this domain is that it provided\nus with ground truth data for training and evaluation of models. Will our conclusions about this\nsensory domain generalize to more subjective and emotional domains such as movies and art? We\nare currently designing a study in which we will collect liking judgments for paintings. Using the\nmodels we developed for this study, we can obtain a decontamination of the ratings and identify\npairs of paintings where the participant\u2019s ratings con\ufb02ict with the decontaminated impressions. Via\na later session in which we ask participants for pairwise preferences, we can determine whether\nthe decontaminator or the raw ratings are more reliable. We have reason for optimism because all\nevidence in the psychological literature suggests that corruption occurs in the mapping of internal\nstates to responses, and there\u2019s no reason to suspect that the mapping is different for different types\nof sensations. Indeed, it seems that if even responses to simple visual stimuli are contaminated,\nresponses to more complex stimuli with a more complex judgment task will be even more vulnerable.\nOne key limitation of the present work is that it examines unidimensional stimuli, and any interesting\ndomain will involve multidimensional stimuli, such as movies, that could be rated in many different\nways depending on the current focus of the evaluator. 
Anchoring likely determines the relevant dimensions as well as the reference points along those dimensions, and it may require a separate analysis to decontaminate this type of anchor.

On the positive side, the domain is ripe for further exploration, and our work suggests many directions for future development. For instance, one might better leverage the CRF's ability to predict not just the expected sensation, but the distribution over sensations. Alternatively, one might pay closer attention to the details of psychological theory in the hope that it provides helpful constraints. One such hint is the finding that systematic effects of sequences have been observed on response latencies in judgment tasks (Lacouture, 1997); therefore, latencies may prove useful for decontamination.

A Wired Magazine article on the Netflix competition was entitled, "This psychologist might outsmart the math brains competing for the Netflix prize" (Ellenberg, March 2008). This provocative title didn't turn out to be true, but the title did suggest, consistent with the findings of our research, that the math brains may do well to look inward at the mechanisms of their own brains.

Acknowledgments

This research was supported by NSF grants BCS-0339103, BCS-720375, and SBE-0518699. The fourth author was supported by an NSF Graduate Student Fellowship. We thank Owen Lewis for conducting initial investigations and discussions that allowed us to better understand the various cognitive models, and Dr. Dan Crumly for lifesaving advice on numerical optimization techniques.

References

DeCarlo, L. T., & Cross, D. V. (1990). Sequential effects in magnitude scaling: Models and theory. Journal of Experimental Psychology: General, 119, 375–396.

Ellenberg, J. (March 2008).
This psychologist might outsmart the math brains competing for the Netflix prize. Wired Magazine, 16. (http://www.wired.com/techbiz/media/magazine/16-03/mf_netflix?currentPage=all#)

Furnham, A. (1986). The robustness of the recency effect: Studies using legal evidence. Journal of General Psychology, 113, 351–357.

Hogarth, R. M., & Einhorn, H. J. (1992). Order effects in belief updating: The belief adjustment model. Cognitive Psychology, 24, 1–55.

Koren, Y. (August 2009). The BellKor solution to the Netflix Grand Prize.

Lacouture, Y. (1997). Bow, range, and sequential effects in absolute identification: A response-time analysis. Psychological Research, 60, 121–133.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (pp. 282–289). San Mateo, CA: Morgan Kaufmann.

Laming, D. R. J. (1984). The relativity of "absolute" judgements. Journal of Mathematical and Statistical Psychology, 37, 152–183.

Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for information processing. Psychological Review, 63, 81–97.

Mumma, G. H., & Wilson, S. B. (2006). Procedural debiasing of primacy/anchoring effects in clinical-like judgments. Journal of Clinical Psychology, 51, 841–853.

Parducci, A. (1965). Category judgment: A range-frequency model. Psychological Review, 72, 407–418.

Parducci, A. (1968). The relativism of absolute judgment. Scientific American, 219, 84–90.

Petrov, A. A., & Anderson, J. R. (2005). The dynamics of scaling: A memory-based anchor model of category rating and identification. Psychological Review, 112, 383–416.

Stewart, N., Brown, G. D. A., & Chater, N. (2005).
Absolute identification by relative judgment. Psychological Review, 112, 881–911.

Sutton, C., & McCallum, A. (2007). An introduction to conditional random fields for relational learning. In L. Getoor & B. Taskar (Eds.), Introduction to statistical relational learning. Cambridge, MA: MIT Press.

Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124–1131.

Wedell, D. H., Parducci, A., & Lane, M. (1990). Reducing the dependence of clinical judgment on the immediate context: Effects of number of categories and type of anchors. Journal of Personality and Social Psychology, 58, 319–329.