{"title": "Adaptive Design Optimization in Experiments with People", "book": "Advances in Neural Information Processing Systems", "page_first": 234, "page_last": 242, "abstract": "In cognitive science, empirical data collected from participants are the arbiters in model selection. Model discrimination thus depends on designing maximally informative experiments.  It has been shown that adaptive design optimization (ADO) allows one to discriminate models as efficiently as possible in simulation experiments.  In this paper we use ADO in a series of experiments with people to discriminate the Power, Exponential, and Hyperbolic models of memory retention, which has been a long-standing problem in cognitive science, providing an ideal setting in which to test the application of ADO for addressing questions about human cognition. Using an optimality criterion based on mutual information, ADO is able to find designs that are maximally likely to increase our certainty about the true model upon observation of the experiment outcomes.  Results demonstrate the usefulness of ADO and also reveal some challenges in its implementation.", "full_text": "Adaptive Design Optimization in Experiments with\n\nPeople\n\nDaniel R. Cavagnaro\n\nDepartment of Psychology\n\nOhio State University\n\ncavagnaro.2@osu.edu\n\nMark A. Pitt\n\nDepartment of Psychology\n\nOhio State University\npitt.2@osu.edu\n\nJay I. Myung\n\nDepartment of Psychology\n\nOhio State University\nmyung.1@osu.edu\n\nAbstract\n\nIn cognitive science, empirical data collected from participants are the arbiters\nin model selection. Model discrimination thus depends on designing maximally\ninformative experiments.\nIt has been shown that adaptive design optimization\n(ADO) allows one to discriminate models as ef\ufb01ciently as possible in simulation\nexperiments. 
In this paper we use ADO in a series of experiments with people to\ndiscriminate the Power, Exponential, and Hyperbolic models of memory retention,\nwhich has been a long-standing problem in cognitive science, providing an ideal\nsetting in which to test the application of ADO for addressing questions about\nhuman cognition. Using an optimality criterion based on mutual information, ADO\nis able to find designs that are maximally likely to increase our certainty about the\ntrue model upon observation of the experiment outcomes. Results demonstrate\nthe usefulness of ADO and also reveal some challenges in its implementation.\n\n1 Introduction\n\nFor better or worse, human memory is not perfect, causing us to forget. Over a century of research\non memory has consistently shown that a person\u2019s ability to remember information just learned\n(e.g., from studying a list of words) drops precipitously for a short time immediately after learning,\nbut then quickly decelerates, leveling off to a very low rate as more and more time elapses. The\nsimplicity of this data pattern has led to the introduction of a number of models to describe the rate\nat which information is retained in memory.\nYears of experimentation with humans (and animals) have resulted in a handful of models proving to\nbe superior to the rest of the field, but also proving to be increasingly difficult to discriminate [1, 2].\nThree strong competitors are the power model (POW), the exponential model (EXP), and the\nhyperbolic model (HYP). Their equations are given in Table 1. Despite the best efforts of researchers to\ndesign studies that were intended to discriminate among them, the results have not yielded decisive\nevidence that favors one model, let alone consistency across studies [2, 3].\nIn these and other studies, well-established methods were used to increase the power of an\nexperiment, and thus improve model discriminability. 
They included testing large numbers of participants\nto reduce measurement error, testing memory at more retention intervals (i.e., the time between the\nend of the study phase and when memory is probed) after the study phase (e.g., 8 instead of 5) so as\nto obtain a more accurate description of the rate of retention, and replicating the experiment using a\nrange of different tasks or participant populations.\n\nModel              Equation\nPower (POW)        p = a(t + 1)^{-b}\nExponential (EXP)  p = a e^{-bt}\nHyperbolic (HYP)   p = a / (1 + bt)\n\nTable 1: Three quantitative models of memory retention. In each equation, the symbol p (0 < p <\n1) denotes the predicted probability of correct recall as a function of time interval t with model\nparameters a and b.\n\nIn the present study, we used Bayesian adaptive design optimization (ADO) [4, 5, 6, 7] on groups\nof people to achieve the same goal. Specifically, a retention experiment was repeated four times\non groups of people, and the set of retention intervals at which memory was probed was optimized\nfor each repetition using data collected in prior repetitions. The models in Table 1 were compared.\nBecause model predictions can differ significantly across retention intervals, our intent was to exploit\nthis information to the fullest using ADO, with the aim of providing some clarity on the form of the\nretention function in humans.\nWhile previous studies have demonstrated the potential of ADO to discriminate retention functions\nin computer simulations [4, 5], this is the first study to utilize the methodology in experiments with\npeople. Although seemingly trivial, the application of such a methodology comes with challenges\nthat can severely restrict its usefulness. Success in applying ADO to a relatively simple design is a\nnecessary first step in assessing its ability to aid in model discrimination and its broader applicability.\nWe begin by reviewing ADO. 
This is followed by a series of retention experiments using the\nalgorithm. We conclude with discussions of the implications of the empirical findings and of the benefits\nand challenges of using ADO in laboratory experiments.\n\n2 Adaptive optimal design\n\n2.1 Bayesian framework\n\nBefore data collection can even begin in an experiment, many choices about its design must be made.\nIn particular, design parameters such as the sample size, the number of treatments (i.e., conditions\nor levels of the independent variable) to study, and the proportion of observations to be allocated\nto each treatment group must be chosen. These choices impact not only the statistical value of the\nresults, but also the cost of the experiment. For example, basic statistics tells us that increasing the\nsample size would increase the statistical power of the experiment, but it would also increase its\ncost (e.g., number of participants, amount of testing). An optimal experimental design is one that\nmaximizes the informativeness of the experiment, while being cost effective for the experimenter.\nA principled approach to the problem of finding optimal experimental designs can be found in the\nframework of Bayesian decision theory [8]. In this framework, each potential design is treated as a\ngamble whose payoff is determined by the outcome of an experiment carried out with that design.\nThe idea is to estimate the utilities of hypothetical experiments carried out with each design, so that\nan \u201cexpected utility\u201d of each design can be computed. This is done by considering every possible\nobservation that could be obtained from an experiment with each design, and then evaluating the\nrelative likelihoods and statistical values of these observations. 
The design with the highest expected\nutility value is then chosen as the optimal design.\nIn the case of adaptively designed experiments, in which testing proceeds over the course of several\nstages (i.e., periods of data collection), the information gained from all prior stages can be used to\nimprove the design at the current stage. Thus, the problem to be solved in adaptive design\noptimization (ADO) is to identify the most informative design at each stage of the experiment, taking\ninto account the results of all previous stages, so that one can infer the underlying model and its\nparameter values in as few steps as possible.\nFormally, ADO for model discrimination entails finding an optimal design at each stage that\nmaximizes a utility function U(d),\n\nd^* = \\arg\\max_{d} U(d), (1)\n\nwith the utility function defined as\n\nU(d) = \\sum_{m=1}^{K} p(m) \\iint u(d, \\theta_m, y) p(y|\\theta_m, d) p(\\theta_m) dy d\\theta_m, (2)\n\nwhere m \u2208 {1, 2, . . . , K} is one of a set of K models being considered, d is a design, y is the\noutcome of an experiment with design d under model m, and \u03b8m is a parameterization of model m.\nWe refer to the function u(d, \u03b8m, y) in Equation (2) as the local utility of the design d. It measures\nthe utility of a hypothetical experiment carried out with design d when the data generating model\nis m, the parameters of the model take the value \u03b8m, and the outcome y is observed. 
Thus, U(d)\nrepresents the expected value of the local utility function, where the expectation is taken over (1)\nall models under consideration, (2) the full parameter space of each model, and (3) all possible\nobservations given a particular model-parameter pair, with respect to the model prior probability\np(m), the parameter prior distribution p(\u03b8m), and the sampling distribution p(y|\u03b8m, d), respectively.\n\n2.2 Mutual information utility function\n\nSelection of a utility function that adequately captures the goals of the experiment is an integral,\noften crucial, part of design optimization. For the goal of discriminating among competing models,\none reasonable choice would be a utility function based on a statistical model selection criterion,\nsuch as sum-of-squares error (SSE) or minimum description length (MDL) [9], as shown by\n[10]. Another reasonable choice would be a utility function based on the expected Bayes factor\nbetween pairs of competing models [11]. Both of these approaches rely on pairwise model\ncomparisons, which can be problematic when there are more than two models under consideration.\nHere, we use an information theoretic utility function based on mutual information [12]. It is an\nideal measure for quantifying the value of an experiment design because it quantifies the reduction\nin uncertainty about one variable that is provided by knowledge of the value of another random\nvariable. Formally, the mutual information of a pair of random variables P and Q, taking values in\nX, is given by\n\nI(P; Q) = H(P) - H(P|Q), (3)\n\nwhere H(P) = -\\sum_{x \\in X} p(x) \\log p(x) is the entropy of P, and H(P|Q) = \\sum_{x \\in X} p(x) H(P|Q = x)\nis the conditional entropy of P given Q. A high mutual information indicates a large reduction\nin uncertainty about P due to knowledge of Q. 
For example, if the distributions of P and Q were\nperfectly correlated, meaning that knowledge of Q allowed perfect prediction of P, then the\nconditional distribution would be degenerate, having entropy zero. Thus, the mutual information of P\nand Q would be H(P), meaning that all of the entropy of P was eliminated through knowledge of\nQ. Mutual information is symmetric in the sense that I(P; Q) = I(Q; P).\nMutual information can be implemented as an optimality criterion in ADO for model\ndiscrimination at each stage s (= 1, 2, . . .) of experimentation in the following way. (For simplicity, we omit\nthe subscript s in the equations below.) Let M be a random variable defined over a model set\n{1, 2, . . . , K}, representing uncertainty about the true model, and let Y be a random variable\ndenoting an experiment outcome. Hence Prob(M = m) = p(m) is the prior probability of model m, and\nProb(Y = y|d) = \\sum_{m=1}^{K} p(y|d, m) p(m), where p(y|d, m) = \\int p(y|\\theta_m, d) p(\\theta_m) d\\theta_m, is the\nassociated prior over experimental outcomes given design d. Then I(M; Y|d) = H(M) \u2212 H(M|Y, d)\nmeasures the decrease in uncertainty about which model drives the process under investigation given\nthe outcome of an experiment with design d. Since H(M) is independent of the design d,\nmaximizing I(M; Y|d) on each stage of ADO is equivalent to minimizing H(M|Y, d), which is the expected\nposterior entropy of M given d.\nImplementing this ADO criterion requires identification of an appropriate local utility function\nu(d, \u03b8m, y) in Equation (2); specifically, a function whose expectation over models, parameters,\nand observations is I(M; Y|d). Such a function can be found by writing\n\nI(M; Y|d) = \\sum_{m=1}^{K} p(m) \\iint p(y|\\theta_m, d) p(\\theta_m) \\log \\frac{p(m|y, d)}{p(m)} dy d\\theta_m, (4)\n\nfrom which it follows that setting u(d, \u03b8m, y) = log [p(m|y, d) / p(m)]\nyields U(d) = I(M; Y|d). Thus, the\nlocal utility of a design for a given model and experiment outcome is the log ratio of the posterior\nprobability to the prior probability of that model. Put another way, the above utility function\nprescribes that a design that increases our certainty about the model upon the observation of an outcome\nis more valued than a design that does not.\nA highly desirable property of this utility function is that it is suitable for comparing more than\ntwo models, because it does not rely on pairwise comparisons of the models under consideration.\nFurther, as noted by [5], it can be seen as a natural extension of the Bayes factor for comparing more\nthan two models. To see this, notice that the local utility function can be rewritten, applying Bayes\nrule, as\n\nu(d, \\theta_m, y) = -\\log \\sum_{k=1}^{K} p(k) \\frac{p(y|k)}{p(y|m)},\n\nin which each ratio p(y|k)/p(y|m) is the Bayes factor of model k over model m.\n\n2.3 Computational methods\n\nFinding optimal designs for discriminating nonlinear models, such as POW, EXP and HYP, is a\nnontrivial task, as the computation requires simultaneous optimization and high-dimensional\nintegration. For a solution, we draw on a recent breakthrough in stochastic optimization [13]. The\nbasic idea is to recast the problem as a probability density simulation in which the optimal design\ncorresponds to the mode of the distribution. This allows one to find the optimal design without\nhaving to evaluate the integration and optimization directly. The density is simulated by Markov\nchain Monte Carlo [14], and the mode is sought by gradually \u201csharpening\u201d the distribution with a\nsimulated annealing procedure [15]. 
Details of the algorithm can be found in [10, 16].\nThe model and parameter priors are updated at each stage s = {1, 2, . . .} of experimentation. Upon\nthe specific outcome zs observed at stage s of an actual experiment carried out with design ds, the\nmodel and parameter priors to be used to find an optimal design at the next stage are updated via\nBayes rule and Bayes factor calculation [e.g., 17] as\n\np_{s+1}(\\theta_m) = \\frac{p(z_s|\\theta_m, d_s) p_s(\\theta_m)}{\\int p(z_s|\\theta_m, d_s) p_s(\\theta_m) d\\theta_m} (5)\n\np_{s+1}(m) = \\frac{p_0(m)}{\\sum_{k=1}^{K} p_0(k) BF_{(k,m)}(z_s)_{p_s(\\theta)}} (6)\n\nwhere BF_{(k,m)}(z_s)_{p_s(\u03b8)} denotes the Bayes factor defined as the ratio of the marginal likelihood\nof model k to that of model m given the realized outcome zs, where the marginal likelihoods are\ncomputed with the updated priors from the preceding stage. The above updating scheme is\napplied successively at each stage of experimentation, after an initialization with equal model priors\np_0(m) = 1/K and a parameter prior p_0(\u03b8m).\n\n3 Discriminating retention models using ADO\n\nRetention experiments with people were performed using ADO to discriminate the three retention\nmodels in Table 1. The number of retention intervals was fixed at three, and ADO was used to\noptimize the experiment with respect to the selection of the specific retention intervals. The\nmethodology paralleled very closely that of Experiment 1 from [3, 18]. Details of the implementation are\ndescribed next.\n\n3.1 Experiment methodology\n\nA variant of the Brown-Peterson task [19, 20] was used. In each trial, a target list of six words was\nrandomly drawn from a pool of high frequency, monosyllabic nouns. These words were presented\non a computer screen at a rate of two words per second, and served as the material that participants\n(undergraduates) had to remember. 
Five seconds of rehearsal followed, after which the target list\nwas hidden and distractor words were presented, one at a time at a rate of one word per second,\nfor the duration of the retention interval. Participants had to say each distractor word out loud as it\nappeared on the computer screen. The purpose of the distractor task was to occupy the participant\u2019s\nverbal memory in order to prevent additional rehearsal of the target list during the retention interval.\nThe distractor words were drawn from a separate pool of 2000 monosyllabic nouns, verbs, and\nadjectives. At the conclusion of the retention interval, participants were given up to 60 seconds\nfor free recall of the words (typed responses) from the target list. A word was counted as being\nremembered only if it was typed correctly.\n\nFigure 1: Best fits of POW, EXP, and HYP at the conclusion of each experiment. Each data point\nrepresents the observed proportion of correct responses out of 54 trials from one participant. The\nlevel of noise is consistent with the assumption of binomial error. The clustering of retention\nintervals around the regions where the best fitting models are visually discernible hints at the tendency\nfor ADO to favor points at which the predictions of the models are most distinct.\n\nWe used a method of moments [e.g., 21] to construct informative prior distributions for the model\nparameters. Independent Beta distributions were constructed to match the mean and variance of the\nbest fitting parameters for the individual participant data from Experiment 1 from [3, 18].\nWe conducted four replications of the experiment to assess consistency across participants. Each\nexperiment was carried out across five ADO stages using a different participant at each stage (20\nparticipants total). 
At the first stage of an experiment, an optimal set of three retention intervals,\neach between 1 and 40 seconds, was computed using the ADO algorithm based on the priors at that\nstage. There were nine trials (of six words each) at each time interval per stage, yielding 54 Bernoulli observations at\neach of the three retention intervals. At the end of a stage, priors were updated before beginning the\nnext stage. For example, the prior for stage 2 of experiment 1 was obtained by updating the prior\nfor stage 1 of experiment 1 based on the results obtained in stage 1 in experiment 1. There was no\nsharing of information between experiments.\n\n3.2 Results and analysis\n\nBefore presenting the Bayesian analysis, we begin with a brief preliminary analysis in order to\nhighlight a few points about the quality of the data. Figure 1 depicts the raw data from each of the\nfour experiments, along with the best fitting parameterization of each model. These graphs reveal\ntwo important points. First, the noise level in the measure of memory (number of correct responses)\nis high, but not inconsistent with the assumption of binomial variance. Moreover, the variation does\nnot exceed that in [3], the data from which our prior distributions were constructed. Second, the\nretention intervals chosen by ADO are spread across their full range (1 to 40 seconds), but they are\nespecially clustered around the regions where the best fitting models are most discernible visually\n(e.g., 5-15, 35-40). This hints at the tendency for ADO to favor retention intervals at which the\nmodels are most distinct given current beliefs about their parameterizations.\nA simple comparison of the fits of each model does not reveal a clear-cut winner. The fits are bad\nand often similar across experiments. This is not surprising since such an analysis does not take into\naccount the noise in the models, nor does it take into account the complexity of the models. 
Both\nare addressed in the following Bayesian analysis.\nWhen comparing three or more models, it can be useful to consider the probability of each model m\nrelative to all other models under consideration, given the data y [22, 23]. Formally, this is given by\n\np(m) = \\frac{p(m|y)}{\\sum_{k=1}^{K} p(k|y)} (7)\n\nwhich is simply a reformulation of Equation (6). Table 2 lists these relative posterior probabilities\nfor each of the three models at the conclusion of each of the four experiments. Scanning across\nthe table, two patterns are visible in the data. In Experiments 2 and 4, the data clearly favor the\npower model. The posterior probabilities of the power model (0.886 and 0.992, respectively) greatly\nexceed those for the other two models. Using the Bayes factor as a measure of support for a model,\ncomparisons of POW over EXP and POW over HYP yield values of 30.6 and 10.4. This can be\ninterpreted as strong evidence for POW as the correct model according to the scale given by Jeffreys\n(1961). Conclusions from Experiment 4 are even stronger. With Bayes factors of 336 for POW over\nEXP and 992 for POW over HYP, the evidence is decisively in support of the power model.\n\n              POW    EXP    HYP\nExperiment 1  0.093  0.525  0.382\nExperiment 2  0.886  0.029  0.085\nExperiment 3  0.151  0.343  0.507\nExperiment 4  0.996  0.001  0.003\n\nTable 2: Relative posterior probabilities of each model at the conclusion of each experiment.\nExperiments 2 and 4 provide strong evidence in favor of POW, while experiments 1 and 3 are inconclusive,\nneither favoring, nor ruling out, any model.\n\nThe results in the other two experiments are equivocal. 
In contrast to Experiments 2 and 4, POW\nhas the lowest posterior probability in both Experiments 1 and 3 (0.093 and 0.151, respectively).\nEXP has the highest probability in Experiment 1 (0.525), and HYP has the highest in Experiment 3\n(0.507). When Bayes factors are computed between models, not only is there no decisive winner,\nbut the evidence is not strong enough to rule out any model. For example, in Experiment 1, EXP\nover POW, the largest difference in posterior probability, yields a Bayes factor of only 5.6. The\ncorresponding comparison in Experiment 3, HYP over POW, yields a value of 3.3.\nInspection of the model predictions at consecutive stages of an experiment provides insight into\nthe workings of the ADO algorithm, and provides visual confirmation that the algorithm chooses\ntime points that are intended to be maximally discriminating. Figure 2 contains the predictions of\neach of the three models for the first two stages of Experiments 2 and 3. The columns of density\nplots corresponding to stage 1 show the predictions for each model based on the prior parameter\ndistributions. Based on these predictions, the ADO algorithm finds an optimal set of retention\nintervals to be 1 second, 7 seconds, and 12 seconds. It is easy to see that POW predicts a much\nsteeper decline in retention for these three retention intervals than do EXP and HYP. Upon observing\nthe number of correct responses at each of those intervals in stage 1 (depicted by the blue dots in\nthe graphs), the algorithm computes the posterior likelihood of each model. In experiment 2, for\nexample, the observed numbers of correct responses for that participant lie in regions that are much\nmore likely under POW than under EXP or HYP, hence the posterior probability of POW is increased\nfrom 0.333 to 0.584 after stage 1, whereas the posteriors for EXP and HYP are decreased to 0.107\nand 0.309, respectively. 
The data from stage 1 of experiment 3 similarly favor POW.\nAt the start of stage 2, the parameter priors are updated based on the results from stage 1, hence\nthe ranges of likely outcomes for each model are much narrower than they were in stage 1, and\nconcentrated around the results from stage 1. Based on these updated parameter priors, the ADO\nalgorithm finds 1 second, 11 seconds, and 36 seconds to be an optimal set of retention intervals to test\nin stage 2 of Experiment 2, and 1 second, 9 seconds, and 35 seconds to be an optimal set of retention\nintervals to test in stage 2 of Experiment 3. The difference between these two designs reflects the\ndifference between the updated beliefs about the models, which can be seen by comparing the\nstage-2 density plots for the respective experiments in Figure 2.\nAs hoped for with ADO, testing in stage 2 produced results that begin to discriminate the models.\nWhat is somewhat surprising, however, is that the results favor different models, with POW having\nthe highest probability (0.911) in Experiment 2, and HYP (0.566) in Experiment 3. The reason\nfor this is the very different patterns of data produced by the participants in the two experiments.\nThe participant in Experiment 2 remembered more words overall than the participant in Experiment\n3, especially at the longest retention interval. These two factors together combine to yield very\ndifferent posterior probabilities across models.\n\nFigure 2: Predictions of POW, EXP and HYP based on the prior parameter distributions in the\nfirst two stages of Experiments 2 and 3. Darker colors indicate higher probabilities. 
Light blue dots\nmark the observations at the given stage, and dark blue dots mark observations from previous stages.\nRelative posterior model probabilities based on all observations up to the current stage are given in\nthe lower left corner of each plot.\n\n4 Discussion\n\nThe results of the current study demonstrate that ADO can work as advertised. Over a series of\ntesting stages, the algorithm updated the experiment\u2019s design (with new retention intervals) on the\nbasis of participant data to determine the form of the retention function, yielding final posterior\nprobabilities in Experiments 2 and 4 that unambiguously favor the power model. Like the results of Wixted and\nEbbesen (1991), these results champion the power model, and they do so much more definitively\nthan any experiment that we know of.\nThe failure to replicate these results in Experiments 1 and 3 tempers such strong conclusions about\nthe superiority of the power model and can raise doubts about the usefulness of ADO. The data in\nFigure 2 (also Figure 1) hint at a likely reason for the observed inconsistencies: participant\nvariability. In Figure 2, the variability in performance at stage 2 of Experiments 2 and 3 is very large,\nnear the upper limit of what one would expect from binomial noise. If the variability in the data\nwere to exceed the variability predicted by the models, then the more extreme data points could be\nincorrectly interpreted as evidence in favor of the wrong model, rather than being attributed to the\nintrinsic noise in the true model. Moreover, even when the noise is taken into account accurately,\nADO does not guarantee that an experiment will generate data that discriminates the models; it\nmerely sets up ideal conditions for that to occur. It is up to the participants to provide discriminating\ndata points.\nThe inconsistencies across experiments reveal one of the challenges of using ADO. 
It is designed\nto be highly sensitive to participant performance, and this sensitivity can also be a weakness under\ncertain conditions. If the variability noted above is uninteresting noise, then by testing the same\nparticipant at each stage (a within-subject design), we should be able to reduce the problem. On the\nother hand, the inconclusiveness of the data in Experiments 1 and 3 may point to a more interesting\npossibility: a minority of participants may retain information at a rate that is best described by an\nexponential or hyperbolic function. Such individual differences would be identifiable with the use\nof a within-subject design.\nAs with any search-based methodology, the application of ADO requires a number of decisions to\nbe made. Although there are too many to cover here, we conclude the paper by touching on the most\nimportant ones.\nWhen running an experiment with ADO, any model that is expected to be a serious competitor\nshould be included in the analysis from the start of experimentation. In the present study, we\nconsidered three retention functions with strong theoretical motivations, which have outperformed others\nin previous experiments [2, 3]. 
The current methodology does not preclude considering a larger set\nof models (the only practical limitations are computing time and the patience of the experimenter).\nHowever, once that set of models is decided, the designs chosen by ADO are optimal for\ndiscriminating those, and only those, models. Thus, the designs that we found and the data we have\ncollected in these experiments are not necessarily optimal for discriminating between, say, a power\nmodel and a logarithmic model. Therefore, ADO is best used as a tool for confirmatory rather than\nexploratory analyses. That is, it is best suited for situations in which the field of potential models\ncan be narrowed to a few of the strongest competitors.\nAnother important choice to be made before using ADO is which prior distributions to use.\nUsing informative priors is very helpful but not necessarily essential to implementing ADO. Since\nthe parameter distributions are updated sequentially, the data will quickly trump all but the most\npathological prior distributions. Therefore, using a different prior distribution should not affect the\nconclusions of the sequential experiment. The ideal approach would be to use an informative prior\nthat accurately reflects individual performance. In the absence of reliable information from which\nto construct such a prior, any vague prior that does not give appreciably different densities to those\nregions of the parameter space where there is a reasonable fit would do [22]. However, constructing\nsuch priors can be difficult due to the nonlinearity of the models.\nFinally, in the current study, we applied ADO to just one property of the experiment design: the\nlengths of the retention intervals. This leaves several other design variables open to subjective\nmanipulation. 
Two such variables that are crucial to the timely and successful completion of the\nexperiment are the number of retention intervals, and the number of trials allotted to each interval.\nIn theory, one could allot all of the trials in each stage to just one interval.1 In practice, however, this\napproach would require more stages, and consequently more participants, to collect observations\nat the same number of intervals as an approach that allotted trials to multiple intervals in each\nstage. Such an approach could be disadvantageous if observations at several different intervals\nwere essential for discriminating the models under consideration. On the other hand, increasing the\nnumber of intervals at which to test in each stage greatly increases the complexity of the design\nspace, thus increasing the length of the computation needed to find an optimal design. Extending\nthe ADO algorithm to address these multiple design variables simultaneously would be a useful\ncontribution.\n\n5 Conclusion\n\nIn the current study, ADO was successfully applied in a laboratory experiment with people, the\npurpose of which was to discriminate models of memory retention. The knowledge learned from\nits application contributes to our understanding of human memory. Although challenges remain in\nthe implementation of ADO, the present success is an encouraging sign. The goals of future work\ninclude applying ADO to more complex experimental designs and to other research questions in\ncognitive science (e.g., numerical representation in children).\n\n1 Testing at one interval per stage is not possible with a utility function based on statistical model selection\ncriteria, such as MDL, which require computation of the maximum likelihood estimate [10]. However, it can\nbe done with a utility function based on mutual information [5].\n\nReferences\n[1] D. J. Navarro, M. A. Pitt, and I. J. Myung. Assessing the distinguishability of models and the\ninformativeness of data. 
Cognitive Psychology, 49:47\u201384, 2004.\n\n[2] D. C. Rubin and A. E. Wenzel. One hundred years of forgetting: A quantitative description of\nretention. Psychological Review, 103(4):734\u2013760, 1996.\n\n[3] J. T. Wixted and E. B. Ebbesen. On the form of forgetting. Psychological Science, 2(6):409\u2013415, 1991.\n\n[4] D. R. Cavagnaro, J. I. Myung, M. A. Pitt, and Y. Tang. Better data with fewer participants and\ntrials: improving experiment efficiency with adaptive design optimization. In N. A. Taatgen\nand H. Van Rijn, editors, Proceedings of the 31st Annual Conference of the Cognitive Science\nSociety, pages 93\u201398. Cognitive Science Society, 2009.\n\n[5] D. R. Cavagnaro, J. I. Myung, M. A. Pitt, and J. V. Kujala. Adaptive design optimization:\nA mutual information based approach to model discrimination in cognitive science. Neural\nComputation, 2009. In press.\n\n[6] J. V. Kujala and T. J. Lukka. Bayesian adaptive estimation: The next dimension. Journal of\nMathematical Psychology, 50(4):369\u2013389, 2006.\n\n[7] J. Lewi, R. Butera, and L. Paninski. Sequential optimal design of neurophysiology experiments.\nNeural Computation, 21:619\u2013687, 2009.\n\n[8] K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statistical Science,\n10(3):273\u2013304, 1995.\n\n[9] P. Gr\u00fcnwald. A tutorial introduction to the minimum description length principle. In\nP. Gr\u00fcnwald, I. J. Myung, and M. A. Pitt, editors, Advances in Minimum Description Length:\nTheory and Applications. The M.I.T. Press, 2005.\n\n[10] J. I. Myung and M. A. Pitt. Optimal experimental design for model discrimination.\nPsychological Review, in press.\n\n[11] A. Heavens, T. Kitching, and L. Verde. On model selection forecasting, dark energy and\nmodified gravity. Monthly Notices of the Royal Astronomical Society, 380(3):1029\u20131035, 2007.\n\n[12] T. M. Cover and J. A. Thomas. Elements of Information Theory. 
John Wiley & Sons, Inc., 1991.\n\n[13] P. M\u00fcller, B. Sanso, and M. De Iorio. Optimal Bayesian design by inhomogeneous Markov\nchain simulation. Journal of the American Statistical Association, 99(467):788\u2013798, 2004.\n\n[14] W. R. Gilks, S. Richardson, and D. Spiegelhalter. Markov Chain Monte Carlo in Practice.\nChapman & Hall, 1996.\n\n[15] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science,\n220:671\u2013680, 1983.\n\n[16] B. Amzal, F. Y. Bois, E. Parent, and C. P. Robert. Bayesian-optimal design via interacting\nparticle systems. Journal of the American Statistical Association, 101(474):773\u2013785, 2006.\n\n[17] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman &\nHall, 2004.\n\n[18] J. T. Wixted and E. B. Ebbesen. Genuine power curves in forgetting: A quantitative analysis\nof individual subject forgetting functions. Memory & Cognition, 25(5):731\u2013739, 1997.\n\n[19] J. A. Brown. Some tests of the decay theory of immediate memory. Quarterly Journal of\nExperimental Psychology, 10:12\u201321, 1958.\n\n[20] L. R. Peterson and M. J. Peterson. Short-term retention of individual verbal items. Journal of\nExperimental Psychology, 58:193\u2013198, 1959.\n\n[21] S. D. Guikema. Formulating informative, data-based priors for failure probability estimation\nin reliability analysis. Reliability Engineering & System Safety, 92:490\u2013502, 2007.\n\n[22] M. D. Lee. A Bayesian analysis of retention functions. Journal of Mathematical Psychology,\n48:310\u2013321, 2004.\n\n[23] B. P. Carlin and T. A. Louis. Bayes and Empirical Bayes Methods for Data Analysis, 2nd ed.\nChapman & Hall, 2000.\n", "award": [], "sourceid": 735, "authors": [{"given_name": "Daniel", "family_name": "Cavagnaro", "institution": null}, {"given_name": "Jay", "family_name": "Myung", "institution": null}, {"given_name": "Mark", "family_name": "Pitt", "institution": null}]}