{"title": "Optimizing Instructional Policies", "book": "Advances in Neural Information Processing Systems", "page_first": 2778, "page_last": 2786, "abstract": "Psychologists are interested in developing instructional policies that boost student learning. An instructional policy specifies the manner and content of instruction. For example, in the domain of concept learning, a policy might specify the nature of exemplars chosen over a training sequence. Traditional psychological studies compare several hand-selected policies, e.g., contrasting a policy that selects only difficult-to-classify exemplars with a policy that gradually progresses over the training sequence from easy exemplars to more difficult (known as {\\em fading}). We propose an alternative to the traditional methodology in which we define a parameterized space of policies and search this space to identify the optimum policy. For example, in concept learning, policies might be described by a fading function that specifies exemplar difficulty over time. We propose an experimental technique for searching policy spaces using Gaussian process surrogate-based optimization and a generative model of student performance. Instead of evaluating a few experimental conditions each with many human subjects, as the traditional methodology does, our technique evaluates many experimental conditions each with a few subjects. Even though individual subjects provide only a noisy estimate of the population mean, the optimization method allows us to determine the shape of the policy space and identify the global optimum, and is as efficient in its subject budget as a traditional A-B comparison. We evaluate the method via two behavioral studies, and suggest that the method has broad applicability to optimization problems involving humans in domains beyond the educational arena.", "full_text": "Optimizing Instructional Policies\n\nRobert V. Lindsey(cid:63), Michael C. Mozer(cid:63), William J. 
Huggins(cid:63), Harold Pashler\u2020\n\n(cid:63) Department of Computer Science, University of Colorado, Boulder\n\n\u2020 Department of Psychology, University of California, San Diego\n\nAbstract\n\nPsychologists are interested in developing instructional policies that boost\nstudent learning. An instructional policy speci\ufb01es the manner and content\nof instruction. For example, in the domain of concept learning, a policy\nmight specify the nature of exemplars chosen over a training sequence. Tra-\nditional psychological studies compare several hand-selected policies, e.g.,\ncontrasting a policy that selects only di\ufb03cult-to-classify exemplars with a\npolicy that gradually progresses over the training sequence from easy ex-\nemplars to more di\ufb03cult (known as fading). We propose an alternative to\nthe traditional methodology in which we de\ufb01ne a parameterized space of\npolicies and search this space to identify the optimal policy. For example,\nin concept learning, policies might be described by a fading function that\nspeci\ufb01es exemplar di\ufb03culty over time. We propose an experimental tech-\nnique for searching policy spaces using Gaussian process surrogate-based\noptimization and a generative model of student performance. Instead of\nevaluating a few experimental conditions each with many human subjects,\nas the traditional methodology does, our technique evaluates many exper-\nimental conditions each with a few subjects. Even though individual sub-\njects provide only a noisy estimate of the population mean, the optimization\nmethod allows us to determine the shape of the policy space and to identify\nthe global optimum, and is as e\ufb03cient in its subject budget as a traditional\nA-B comparison. We evaluate the method via two behavioral studies, and\nsuggest that the method has broad applicability to optimization problems\ninvolving humans outside the educational arena.\n\n1 Introduction\n\nWhat makes a teacher e\ufb00ective? 
A critical factor is their instructional policy, which specifies the manner and content of instruction. Electronic tutoring systems have been constructed that implement domain-specific instructional policies (e.g., J. R. Anderson, Conrad, & Corbett, 1989; Koedinger & Corbett, 2006; Martin & VanLehn, 1995). A tutoring system decides at every point in a session whether to present some new material, provide a detailed example to illustrate a concept, pose new problems or questions, or lead the student step-by-step to discover an answer. Prior efforts have focused on higher cognitive domains (e.g., algebra) in which policies result from an expert-systems approach involving careful handcrafted analysis and design followed by iterative evaluation and refinement. As a complement to these efforts, we are interested in addressing fundamental questions in the design of instructional policies that pertain to basic cognitive skills.\n\nConsider a concrete example: training individuals to discriminate between two perceptual or conceptual categories, such as determining whether mammogram x-ray images are negative or positive for an abnormality. In training from examples, should the instructor tend to alternate between categories—as in pnpnpnpn for positive and negative examples—or present a series of instances from the same category—ppppnnnn (Goldstone & Steyvers, 2001)? Both of these strategies—interleaving and blocking, respectively—are adopted by human instructors (Khan, Zhu, & Mutlu, 2011). 
Reliable advantages of one strategy over the other have been observed (Kang & Pashler, 2011; Kornell & Bjork, 2008), and factors influencing the relative effectiveness of each have been explored (Carvalho & Goldstone, 2011).\n\nEmpirical evaluation of blocking and interleaving policies involves training a set of human subjects with a fixed-length sequence of exemplars drawn from one policy or the other. During training, exemplars are presented one at a time, and typically subjects are asked to guess the category label associated with the exemplar, after which they are told the correct label. Following training, mean classification accuracy is evaluated over a set of test exemplars. Such an experiment yields an intrinsically noisy evaluation of the two policies, limited by the number of subjects and inter-individual variability. Consequently, the goal of a typical psychological experiment is to find a statistically reliable difference between the training conditions, allowing the experimenter to conclude that one policy is superior.\n\nBlocking and interleaving are but two points in a space of policies that could be parameterized by the probability, ρ, that the exemplar presented on trial t + 1 is drawn from the same category as the exemplar on trial t. Blocking and interleaving correspond to ρ near 1 and 0, respectively. (There are many more interesting ways of constructing a policy space that includes blocking and interleaving, e.g., ρ might vary with t or with a student's running-average classification accuracy, but we will use the simple fixed-ρ policy space for illustration.) 
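As an illustration, the fixed-ρ policy family is easy to simulate; the following sketch (the function name and labels are our own, not from the study) samples a category-label sequence for a given ρ:

```python
import random

def sample_sequence(rho, num_trials, seed=0):
    """Sample a category-label sequence under a fixed-rho policy: with
    probability rho, trial t+1 repeats the category of trial t."""
    rng = random.Random(seed)
    labels = [rng.choice("pn")]  # first trial: either category
    for _ in range(num_trials - 1):
        if rng.random() < rho:
            labels.append(labels[-1])                         # repeat
        else:
            labels.append("n" if labels[-1] == "p" else "p")  # switch
    return "".join(labels)
```

A ρ near 1 yields blocked runs such as ppppnnnn; a ρ near 0 yields interleaved runs such as pnpnpnpn.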
Although one would ideally like to explore the policy space exhaustively, limits on the availability of experimental subjects and laboratory resources make it challenging to conduct studies evaluating more than a few candidate policies to the degree necessary to obtain statistically significant differences.\n\n2 Optimizing an instructional policy\n\nOur goal is to discover the optimum in policy space—the policy that maximizes mean accuracy or another measure of performance over a population of students. (We focus on optimizing for a population but later discuss how our approach might be used to address individual differences.) Our challenge is performing optimization on a budget: each subject tested imposes a time or financial cost. Evaluating a single policy with any degree of certainty requires testing many subjects to reduce sampling variance due to individual differences, factors outside of experimental control (e.g., alertness), and imprecise measurement obtained from brief evaluations and discrete (e.g., correct or incorrect) responses. Consequently, exhaustive search over the set of distinguishable policies is not feasible.\n\nPast research on optimal teaching (Chi, VanLehn, Litman, & Jordan, 2011; Rafferty, Brunskill, Griffiths, & Shafto, 2011; Whitehill & Movellan, 2010) has investigated reinforcement learning and POMDP approaches. These approaches are intriguing but are not typically touted for their data efficiency. To avoid exceeding a subject budget, the flexibility of the POMDP framework demands additional bias, imposed via restrictions on the class of candidate policies and strong assumptions about the learner. The approach we will propose likewise requires specification of a constrained policy space, but does not make assumptions about the internal state of the learner or the temporal dynamics of learning. 
In contrast to POMDP approaches, the cognitive agnosticism of our approach allows it to be readily applied to arbitrary policy optimization problems. Direct optimization methods that accommodate noisy function evaluations have also been proposed, but experimentation with one such technique (E. J. Anderson & Ferris, 2001) convinced us that the method we propose here is orders of magnitude more efficient in its required subject budget.\n\nNeither the POMDP nor the direct-optimization approach models the policy space explicitly. In contrast, we propose an approach based on function approximation. From a function-approximation perspective, the goal is to determine the shape and optimum of the function that maps policies to performance—call this the policy performance function or PPF. What sort of experimental design should be used to approximate the PPF? Traditional experimental design—which aims to show a statistically reliable difference between two alternative policies—requires testing many subjects for each policy. However, if our goal is to determine the shape of the PPF, we may get better value from data collection by evaluating a large\n\nFigure 1: A hypothetical 1D instructional policy space. The solid black line represents an (unknown) policy performance function. The grey disks indicate the noisy outcome of single-subject experiments conducted at specified points in policy space. (The diameter of a disk represents the number of data points occurring at the disk's location.) The dashed black line depicts the GP posterior mean, and the coloring of each vertical strip represents the cumulative distribution function for the posterior.\n\nnumber of points in policy space each with few subjects instead of a small number of points each with many subjects. This possibility suggests a new paradigm for experimental design in psychological science. 
Our vision is a completely automated system that selects points in policy space to evaluate, runs an experiment—an evaluation of some policy with one or a small number of subjects—and repeats until a budget for data collection is exhausted.\n\n2.1 Surrogate-based optimization using Gaussian process regression\n\nIn surrogate-based optimization (e.g., Forrester & Keane, 2009), experimental observations serve to constrain a surrogate model that approximates the function being optimized. This surrogate is used both to select additional experiments to run and to estimate the optimum. Gaussian process regression (GPR) has long been used as the surrogate for solving low-dimensional stochastic optimization problems in engineering fields (Forrester & Keane, 2009; Sacks, Welch, Mitchell, & Wynn, 1989). Like other Bayesian models, GPR makes efficient use of limited data, which is particularly critical to us because our budget is expressed in terms of the number of subjects required. Further, GPR provides a principled approach to handling measurement uncertainty, which is a problem in any experimental context but is particularly striking in human experimentation due to the range of factors influencing performance. The primary constraint imposed by the Gaussian process prior—that of function smoothness—can readily be satisfied with appropriate design of the policy space. To illustrate GPR in surrogate-based optimization, Figure 1 depicts a hypothetical 1D instructional policy space, along with the true PPF and the GPR posterior conditioned on the outcome of a set of single-subject experiments at various points in policy space.\n\n2.2 Generative model of student performance\n\nEach instructional policy is presumed to have an inherent effectiveness for a population of individuals. 
However, a policy's effectiveness can be observed only indirectly through measurements of subject performance such as the number of correct responses. To determine the most effective policy from noisy observations, we must specify a generative model of student performance which relates the inherent effectiveness of instruction to observed performance.\n\nFormally, each subject s is trained under a policy x_s and then tested to evaluate their performance. We posit that each training policy x has a latent population-wide effectiveness f_x ∈ R and that how well a subject performs on the test is a noisy function of f_{x_s}. We are interested in predicting the effectiveness of a policy x′ across a population of students given the observed test scores of S subjects trained under the policies x_{1:S}. Conceptually, this involves first inferring the effectiveness f of policies x_{1:S} from the noisy test data, then interpolating from f to f_{x′}.\n\nUsing a standard Bayesian nonparametric approach, we place a mean-zero Gaussian process prior over the function f_x. For the finite set of S observations, this corresponds to the multivariate normal distribution f ~ MVN(0, Σ), where Σ is a covariance matrix prescribing how smoothly varying we expect f to be across policies. We use the squared-exponential covariance function, so that Σ_{s,s′} = σ² exp(−‖x_s − x_{s′}‖² / (2ℓ²)), with σ² and ℓ as free parameters.\n\nHaving specified a prior over policy effectiveness, we turn to specifying a distribution over observable measures of subject learning conditioned on effectiveness. In this paper, we measure learning by administering a multiple-choice test to each subject s and observing the number of correct responses subject s makes, c_s, out of n_s questions. 
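A minimal sketch of the squared-exponential covariance computation (pure Python, one-dimensional policies; the function name and parameter values are illustrative, not fitted):

```python
import math

def sq_exp_cov(xs, sigma2, ell):
    """Sigma[i][j] = sigma2 * exp(-(x_i - x_j)^2 / (2 * ell^2)):
    nearby policies have strongly correlated effectiveness."""
    return [[sigma2 * math.exp(-((xi - xj) ** 2) / (2.0 * ell ** 2))
             for xj in xs] for xi in xs]
```

A larger length scale ℓ encodes the prior belief that the policy performance function is smoother.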
We assume the probability that subject s answers any question correctly is a random variable μ_s whose expected value is related to the policy's effectiveness via the logistic transform: E[μ_s] = logistic(o + f_{x_s}), where o is a constant. This is consistent with the observation model\n\nμ_s | f_{x_s}, o, γ ~ Beta(γ, γ e^{−(o + f_{x_s})}),    c_s | μ_s ~ Binomial(g + (1 − g) μ_s; n_s)    (1)\n\nwhere γ controls inter-subject variability in μ_s and g is the probability of answering a question correctly by random guessing. In this paper, we assume g = .5. For this special case, the analytic marginalization over μ_s yields\n\nP(c_s | f_{x_s}, γ, o, g = .5) = 2^{−n_s} (n_s choose c_s) Σ_{i=0}^{c_s} (c_s choose i) B(γ + i, n_s − c_s + γ e^{−(o + f_{x_s})}) / B(γ, γ e^{−(o + f_{x_s})})    (2)\n\nwhere B(a, b) = Γ(a)Γ(b)/Γ(a + b) is the beta function. Parameters θ ≡ {γ, o, σ², ℓ} are given vague uniform priors. The effectiveness of a policy x′ is estimated via p(f_{x′} | c) ≈ (1/M) Σ_{m=1}^{M} p(f_{x′} | f^{(m)}, θ^{(m)}), where p(f_{x′} | f^{(m)}, θ^{(m)}) is Gaussian with mean and variance determined by the mth sample from the posterior p(f, θ | c). Posterior samples are drawn via elliptical slice sampling, a technique well-suited for models with highly correlated latent Gaussian variables (Murray, Adams, & MacKay, 2010).\n\nWe have also explored a more general framework that relaxes the relationship between chance-guessing and test performance and allows for multiple policies to be evaluated per subject. With regard to the latter, subjects may undergo multiple randomly ordered blocks 
of trials, where in each block b subject s is trained under a policy x^b_s and then tested. The observation model is altered so that the score in a block is given by c^b_s | μ^b_s ~ Binomial(μ^b_s; n^b_s), where μ^b_s ≡ logistic(o′ + α_s + f_{x^b_s}), the factor α_s ~ Normal(0, τ_α^{−1}) represents the ability of subject s across blocks, and the constant o′ subsumes the roles of o and g from the original model. In the spirit of item-response theory (Boeck & Wilson, 2004), the model could be extended further to include factors that represent the difficulty of individual test questions and interactions between subject ability and question difficulty.\n\n2.3 Active selection\n\nGP optimization requires a strategy for actively selecting the next experiment. (We refer to this as a 'strategy' instead of as a 'policy' to avoid confusion with instructional policies.) Many heuristic strategies have been proposed (Forrester & Keane, 2009), including: grid sampling over the policy space; expanding or contracting a trust region; and goal-setting approaches that identify regions of policy space where performance is likely to attain some target level or beat out the current best experiment result. In addition, greedy versus k-step predictive planning has been considered (Osborne, Garnett, & Roberts, 2009).\n\nEvery strategy faces an exploration/exploitation trade-off. Exploration involves searching regions of the function with the maximum uncertainty; exploitation involves concentrating on the regions of the function that currently appear to be most promising. Each has a cost. A focus on exploration rapidly exhausts the budget for subjects. 
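The marginal likelihood in Eq. (2) above is straightforward to implement and sanity-check numerically. A sketch using only the standard library (function names and the parameter values in the usage note are ours, chosen arbitrarily):

```python
import math
from math import lgamma

def log_beta(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def score_likelihood(c, n, f, gamma, o):
    """P(c of n test questions correct | effectiveness f), Eq. (2) with
    g = .5; the Beta-distributed per-subject accuracy is marginalized out."""
    b = gamma * math.exp(-(o + f))  # second Beta shape parameter
    total = sum(math.comb(c, i) *
                math.exp(log_beta(gamma + i, n - c + b) - log_beta(gamma, b))
                for i in range(c + 1))
    return 2.0 ** (-n) * math.comb(n, c) * total
```

Because Eq. (2) is an exact marginalization, the probabilities sum to one over c = 0, ..., n (e.g., with n = 8, γ = 2, o = 0), and a larger effectiveness f shifts mass toward higher scores.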
A focus on exploitation leads to selection of local optima.\n\nThe upper-confidence bound (UCB) strategy (Forrester & Keane, 2009; Srinivas, Krause, Kakade, & Seeger, 2010) attempts to avoid these two costs by starting in an exploratory mode and shifting to exploitation. This strategy chooses the most-promising experiment from an upper-confidence bound on the GPR: x_t = argmax_x μ̂_{t−1}(x) + η_t σ̂_{t−1}(x), where t is a time index, μ̂ and σ̂ are the mean and standard deviation of the GPR, and η_t controls the exploration/exploitation trade-off. A large η_t focuses the search on regions with the greatest uncertainty, but as η_t → 0, the focus shifts to exploitation in the neighborhood of the current best policy. Annealing η_t as a function of t will yield exploration initially, shifting toward exploitation.\n\nWe adapt the UCB strategy by transforming the UCB based on the GPR to an expression based on the population accuracy (proportion correct) via x_t = argmax_x P(c_s/n_s > ν_t | f_x), where ν_t is an accuracy level determining the exploration/exploitation trade-off. In simulations, we found that setting ν_t = .999 was effective. Note that in applying the UCB selection strategy, we must search over a set of candidate policies.\n\nFigure 2: (a) Experiment 1 training display; (b) Selected Experiment 2 stimuli and their graspability ratings.\n\nFigure 3: Experiment 1 results. (a) Posterior density of the PPF with 100 subjects. Light grey squares with error bars indicate the results of a traditional comparison among conditions. (b) Prediction of optimum presentation duration as more subjects are run; dashed line is asymptotic value.\n\n
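In its generic form, UCB selection over a gridded policy space reduces to a one-liner; the sketch below uses the standard μ + η σ form with made-up posterior values (the experiments in this paper instead threshold predicted population accuracy at ν_t):

```python
def ucb_select(grid, mu, sigma, eta):
    """Return the candidate policy maximizing the upper confidence bound
    mu(x) + eta * sigma(x) of the GP posterior."""
    best = max(range(len(grid)), key=lambda i: mu[i] + eta * sigma[i])
    return grid[best]

# Toy posterior over five candidate policies (illustrative numbers only).
grid  = [0.0, 0.25, 0.5, 0.75, 1.0]
mu    = [0.60, 0.70, 0.75, 0.72, 0.55]  # posterior mean performance
sigma = [0.20, 0.05, 0.04, 0.05, 0.25]  # posterior standard deviation
```

With a large η the rule explores the most uncertain candidate; as η shrinks toward 0 it exploits the current best posterior mean.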
We applied a fine uniform grid search over policy space to perform this selection.\n\n3 Experiment 1: Optimizing presentation rate\n\nde Jonge, Tabbers, Pecher, and Zeelenberg (2012) studied the effect of presentation rate on word-pair learning. During training, each pair was viewed for a total of 16 sec. Viewing was divided into 16/d trials, each with a duration of d sec, where d ranged from 1 sec (viewing the pair 16 times) to 16 sec (viewing the pair once). de Jonge et al. found that an intermediate duration yielded better cued recall performance both immediately and following a delay.\n\nWe explored a variant of this experiment in which subjects were asked to learn the favorite sporting team of six individuals. During training, each individual's face was shown along with their favorite team—either Jets or Sharks (Figure 2a). The training policy specifies the duration d of each face-team pair. Training was over a 30 second period, with a total of 30/d trials and an average of 5/d presentations per face-team pair. Presentation sequences were blocked, where a block consists of all six individuals in random order. Immediately following training, subjects were tested on each of the six faces in random order and were asked to select the corresponding team. The training/testing procedure was repeated for eight rounds, each using different faces. In total, each subject responded to 48 faces. The faces were balanced across ethnicity, age, and gender (provided by Minear & Park, 2004).\n\nUsing Mechanical Turk, we recruited 100 subjects who were paid $0.30 for their participation. The policy space was defined to be in the logarithm of the duration, i.e., d = e^x, where x ∈ [ln(.25), ln(5)]. The space included only values of x such that 30/d is an integer; i.e., we ensured that no trials were cut short by the 30 second time limit. 
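The admissible policies can be enumerated directly: every duration d = 30/k for integer k with 0.25 ≤ d ≤ 5 sec. A sketch (our own helper, not code from the study):

```python
import math

def admissible_durations(total_time=30.0, d_min=0.25, d_max=5.0):
    """Durations d in [d_min, d_max] such that total_time / d is an integer
    number of trials, so no trial is cut short by the time limit."""
    durations = []
    for k in range(int(total_time / d_max), int(total_time / d_min) + 1):
        d = total_time / k
        if d_min <= d <= d_max:
            durations.append(d)
    return durations

# Policy coordinates live in log-duration space: x = ln d.
admissible_x = [math.log(d) for d in admissible_durations()]
```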
Subject 1's training policy, x_1, was set to the median of the range of admissible values (857 ms). After each subject t completed the experiment, the PPF posterior was reestimated, and the upper-confidence bound strategy was used to select the policy for subject t + 1, x_{t+1}.\n\nFigure 3a shows the PPF posterior based on 100 subjects. (We include a movie showing the evolution of the PPF over subjects in the Supplementary Materials.) The diameter of a grey disk indicates the number of data points observed at that location in the space. The optimum of the PPF mean is at 1.15 sec, at which duration each face-team pair will be shown on expectation 4.33 times during training. Though the result seems intuitive, when we polled colleagues, predictions for the peak ranged from below 1 sec to 2.5 sec. Figure 3b uses the PPF mean to estimate the optimum duration, and this duration is plotted against 
Figure 4: Expt. 2, trial-dependent fading and repetition policies (left and right, respectively). 
Colored lines represent specific policies.\n\nthe number of subjects. Our procedure yields an estimate for the optimum duration that is quite stable after about 40 subjects.\n\nIdeally, one would like to compare the PPF posterior to ground truth. However, obtaining ground truth requires a massive data collection effort. As an alternative, we contrast our result with a more traditional experimental study based on the same number of subjects. We ran 100 additional subjects in a standard experimental design involving evaluation of five alternative policies, d ∈ {1, 1.25, 1.667, 2.5, 5}, with 20 subjects per policy. (These durations correspond to 1-5 presentations of each face-team pair during training.) The mean score for each policy is plotted in Figure 3a as light grey squares with bars indicating ±2 standard errors of the mean. The result of the traditional experiment is coarsely consistent with the PPF posterior, but the budget of 100 subjects places a limitation on the interpretability of the results. When matched on budget, the optimization procedure appears to produce results that are more interpretable and less sensitive to noise in the data. Note that we have biased this comparison in favor of the traditional design by restricting the exploration of the policy space to the region 1 sec ≤ d ≤ 5 sec. Nonetheless, no clear pattern emerges in the shape of the PPF based on the outcome of the traditional design.\n\n4 Experiment 2: Optimizing training example sequence\n\nIn Experiment 2, we study concept learning from examples. Subjects are told that Martians will teach them the meaning of a Martian adjective, glopnor, by presenting a series of example objects, some of which have the property glopnor and others of which do not. During a training phase, objects are presented one at a time and subjects must classify each object as glopnor or not-glopnor. They then receive feedback as to the correctness of their response. 
On each trial, the object from the previous trial is shown in the corner of the display along with its correct classification, in order to facilitate comparing and contrasting objects. Following 25 training trials, 24 test trials are administered in which the subject makes a classification but receives no feedback. The training and test trials are roughly balanced in the number of positive and negative examples.\n\nThe stimuli in this experiment are drawn from a set of 320 objects normed by Salmon, McMullen, and Filliter (2010) for graspability, i.e., how manipulable an object is according to how easy it is to grasp and use the object with one hand. They polled 57 individuals, each of whom rated each of the objects multiple times using a 1–5 scale, where 1 means not graspable and 5 means highly graspable. Figure 2b shows several objects and their ratings. We divided the objects into two groups by their mean rating, with the not-glopnor group having ratings in [1, 2.75] and the glopnor group having ratings in [3.25, 5]. (We discarded objects with ratings in [2.75, 3.25] because they are too difficult even if one knows the concept.) The classification task is easy if one knows that the concept is graspability. However, inferring the concept is extremely difficult because there are many dimensions along which these objects vary, and any one—or more—could be the classification dimension(s).\n\nWe defined an instructional policy space characterized by two dimensions: fading and blocking. Fading refers to the notion from the animal learning literature that learning is facilitated by presenting exemplars far from the category boundary initially, and gradually transitioning toward more difficult exemplars over time. 
Exemplars far from the boundary may help individuals to attend to the dimension of interest; exemplars near the boundary may help individuals determine where the boundary lies (Pashler & Mozer, in press). Theorists have 
also made computational arguments for the benefit of fading (Bengio, Louradour, Collobert, & Weston, 2009; Khan et al., 2011). Blocking refers to the issue discussed in the Introduction concerning the sequence of category labels: Should training exemplars be blocked or interleaved? That is, should the category label on one trial tend to be the same as or different from the label on the previous trial?\n\nFor fading, we considered a family of trial-dependent functions that specify the distance of the chosen exemplar from the category boundary (left panel of Figure 4). This family is parameterized by a single policy variable x_2, 0 ≤ x_2 ≤ 1, that relates to the distance of an exemplar from the category boundary, d, as follows: d(t, x_2) = min(1, 2x_2) − (1 − |2x_2 − 1|)(t − 1)/(T − 1), where T is the total number of training trials and t is the current trial. 
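This fading family is compact enough to state in code; a sketch (the function name is ours):

```python
def fading_distance(t, x2, T=25):
    """Distance of the trial-t exemplar from the category boundary under
    fading policy x2 (1 = far/easy, 0 = near/hard), for t = 1..T."""
    return min(1.0, 2.0 * x2) - (1.0 - abs(2.0 * x2 - 1.0)) * (t - 1) / (T - 1)
```

x2 = 0 gives all-hard sequences (d ≡ 0), x2 = 1 gives all-easy sequences (d ≡ 1), and x2 = 0.5 fades linearly from the easiest to the hardest exemplars.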
For blocking, we also considered a family of trial-dependent functions that vary the probability of a category label repetition over trials (right panel of Figure 4). This family is parameterized by the policy variable x1, 0 ≤ x1 ≤ 1, that relates to the probability of repeating the category label of the previous trial, r, as follows: r(t, x1) = x1 + (1 − 2x1)(t − 1)/(T − 1).

Figure 5a provides a visualization of sample training trial sequences for different points in the 2D policy space. Each graph represents an instance of a specific (probabilistic) policy. The abscissa of each graph is an index over the 25 training trials; the ordinate represents the category label and its distance from the category boundary. Policies in the top and bottom rows show sequences of all-easy and all-hard examples, respectively; intermediate rows achieve fading in various forms. Policies in the leftmost column begin training with many repetitions and end training with many alternations; policies in the rightmost column begin with alternations and end with repetitions; policies in the middle column have a time-invariant repetition probability of 0.5.

Regardless of the training sequence, the set of test objects was the same for all subjects. The test objects spanned the spectrum of distances from the category boundary. During test, subjects were required to make a forced-choice glopnor/not-glopnor judgment.

We seeded the optimization process by running 10 subjects at each of the four corners of policy space as well as at the center point of the space. We then ran 150 additional subjects using GP-based optimization. Figure 5 shows the PPF posterior mean over the 2D policy space, along with the selection in policy space of the 200 subjects. Contour map colors indicate the expected accuracy of the corresponding policy (in contrast to the earlier colored graphs, in which the coloring indicates the cdf).
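The repetition-probability family, and the probabilistic label sequences it induces of the kind visualized in Figure 5a, can be sketched in the same way. This is an illustration with our own names; random sampling stands in for an instance of the probabilistic policy:

```python
import random

def repetition_probability(t, x1, T=25):
    """Probability that trial t repeats the previous trial's category label.

    Implements r(t, x1) = x1 + (1 - 2*x1) * (t - 1)/(T - 1).
    x1 = 0 moves from alternation to repetition over training, x1 = 1 does
    the reverse, and x1 = 0.5 repeats with constant probability 0.5.
    """
    return x1 + (1 - 2 * x1) * (t - 1) / (T - 1)

def sample_label_sequence(x1, T=25, rng=random):
    """Draw one category-label sequence (1 = glopnor, 0 = not-glopnor)."""
    labels = [rng.randint(0, 1)]  # the first label is arbitrary
    for t in range(2, T + 1):
        repeat = rng.random() < repetition_probability(t, x1, T)
        labels.append(labels[-1] if repeat else 1 - labels[-1])
    return labels
```

Each call to sample_label_sequence yields one instance of the policy at x1, analogous to one graph in Figure 5a.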
The optimal policy is located at x* = (1, .66). To validate the outcome of this exploration, we ran 50 subjects at x* as well as at policies in the upper corners and the center of Figure 5. Consistent with the prediction of the PPF posterior, mean accuracy at x* is 68.6%, compared to 60.9% for (0, 1), 65.7% for (1, 0), and 66.6% for (.5, .5). Unfortunately, only one of the paired comparisons was statistically reliable by a two-tailed Bonferroni-corrected t-test: (0, 1) versus x* (p = .027). However, a post-hoc power computation revealed that with 50 subjects and the variability inherent in the data, the probability of observing a reliable 2% difference in the means is only .10. Running an additional 50 subjects would raise the power to only .17. Thus, although we did not observe a statistically significant improvement at the inferred optimum compared to sensible alternative policies, the results are consistent with our inferred optimum being an improvement over the type of policies one might have proposed a priori.

5 Discussion

The traditional experimental paradigm in psychology involves comparing a few alternative conditions by testing a large number of subjects in each condition. We've described a novel paradigm in which a large number of conditions are evaluated, each with only one or a few subjects. Our approach achieves an understanding of the functional relationship between conditions and performance, and it lends itself to discovering the conditions that attain optimal performance.

We've focused on the problem of optimizing instruction, but the method described here has broad applicability across issues in the behavioral sciences.
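The condition-selection step at the heart of this paradigm is itself small. The sketch below illustrates UCB selection over a grid of candidate policies, assuming posterior means and standard deviations are supplied by any GP regression routine; the function names and the annealing schedule are ours, not the paper's:

```python
def ucb_select(candidates, mu, sigma, eta):
    """Pick the candidate maximizing the upper-confidence bound mu + eta*sigma."""
    scores = [m + eta * s for m, s in zip(mu, sigma)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]

def annealed_eta(t, eta0=2.0, decay=0.97):
    """Shrink eta over subjects t = 1, 2, ...: explore early, exploit late."""
    return eta0 * decay ** (t - 1)
```

With a large eta, the most uncertain candidate wins; as eta approaches 0, selection collapses onto the current best posterior mean, mirroring the exploration-to-exploitation shift described in Section 2.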
For example, one might attempt to maximize a worker's motivation by manipulating rewards, task difficulty, or time pressure. Motivation might be studied in an experimental context with voluntary time on task as a measure of intrinsic interest level.

Figure 5: Experiment 2 (a) policy space and (b) policy performance function at 200 subjects.

Consider problems in a quite different domain, human vision. Optimization approaches might be used to determine optimal color combinations in a manner more efficient and feasible than exhaustive search (Schloss & Palmer, 2011). Also in the vision domain, one might search for optimal sequences and parameterizations of image transformations that would support complex visual tasks performed by experts (e.g., x-ray mammography screening) or ordinary visual tasks performed by the visually impaired.

From a more applied angle, A-B testing has become an extremely popular technique for fine-tuning web site layout, marketing, and sales (Christian, 2012). With a large web population, two competing alternatives can quickly be evaluated. Our approach offers a more systematic alternative in which a space of alternatives can be explored efficiently, leading to the discovery of solutions that might not have been conceived of as candidates a priori.

The present work did not address individual differences or high-dimensional policy spaces, but our framework can readily be extended. Individual differences can be accommodated via policies that are parameterized by individual variables (e.g., age, education level, performance on related tasks, recent performance on the present task). For example, one might adopt a fading policy in which the rate of fading depends in a parametric manner on a running average of performance. High-dimensional spaces are in principle no challenge for GPR given a sensible distance metric.
The challenge of high-dimensional spaces comes primarily from the computational overhead of selecting the next policy to evaluate. However, this computational burden can be greatly reduced by switching from a global optimization perspective to a local one: instead of considering candidate policies in the entire space, active selection might consider only policies in the neighborhood of previously explored policies.

Acknowledgments

This research was supported by NSF grants BCS-0339103 and BCS-720375 and by an NSF Graduate Research Fellowship to R. L. We thank Ron Kneusel and Ali Alzabarah for their invaluable assistance with IT support, and Ponesadat Mortazavi, Vanja Dukic, and Rosie Cowell for helpful discussions and advice on this work.

References

Anderson, E. J., & Ferris, M. C. (2001). A direct search algorithm for optimization with noisy function evaluations. SIAM Journal on Optimization, 11, 837–857.

Anderson, J. R., Conrad, F. G., & Corbett, A. T. (1989). Skill acquisition and the LISP tutor. Cognitive Science, 13, 467–506.

Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009, June). Curriculum learning. In L. Bottou & M. Littman (Eds.), Proceedings of the 26th International Conference on Machine Learning (pp. 41–48). Montreal: Omnipress.

Boeck, P. D., & Wilson, M. (2004). Explanatory item response models: A generalized linear and nonlinear approach. New York: Springer.

Carvalho, P. F., & Goldstone, R. L. (2011, November). Stimulus similarity relations modulate benefits for blocking versus interleaving during category learning.
(Presentation at the 52nd Annual Meeting of the Psychonomic Society, Seattle, WA)

Chi, M., VanLehn, K., Litman, D., & Jordan, P. (2011). Empirically evaluating the application of reinforcement learning to the induction of effective and adaptive pedagogical strategies. User Modeling and User-Adapted Interaction: Special Issue on Data Mining for Personalized Educational Systems, 21, 137–180.

Christian, B. (2012). The A/B test: Inside the technology that's changing the rules of business. Wired, 20(4).

de Jonge, M., Tabbers, H. K., Pecher, D., & Zeelenberg, R. (2012). The effect of study time distribution on learning and retention: A goldilocks principle for presentation rate. J. Exp. Psych.: Learning, Mem., & Cog., 38, 405–412.

Forrester, A. I. J., & Keane, A. J. (2009). Recent advances in surrogate-based optimization. Progress in Aerospace Sciences, 45, 50–79.

Goldstone, R. L., & Steyvers, M. (2001). The sensitization and differentiation of dimensions during category learning. Journal of Experimental Psychology: General, 130, 116–139.

Kang, S. H. K., & Pashler, H. (2011). Learning painting styles: Spacing is advantageous when it promotes discriminative contrast. Applied Cognitive Psychology, 26, 97–103.

Khan, F., Zhu, X. J., & Mutlu, B. (2011). How do humans teach: On curriculum learning and teaching dimension. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, & K. Weinberger (Eds.), Adv. in NIPS 24 (pp. 1449–1457). La Jolla, CA: NIPS Found.

Koedinger, K. R., & Corbett, A. T. (2006). Cognitive tutors: Technology bringing learning science to the classroom. In K. Sawyer (Ed.), The Cambridge Handbook of the Learning Sciences (pp. 61–78). Cambridge, UK: Cambridge University Press.

Kornell, N., & Bjork, R. A. (2008). Learning concepts and categories: Is spacing the enemy of induction? Psychological Science, 19, 585–592.

Martin, J., & VanLehn, K.
(1995). Student assessment using Bayesian nets. International Journal of Human-Computer Studies, 42, 575–591.

Minear, M., & Park, D. C. (2004). A lifespan database of adult facial stimuli. Behavior Research Methods, Instruments, and Computers, 36, 630–633.

Murray, I., Adams, R. P., & MacKay, D. J. (2010). Elliptical slice sampling. J. of Machine Learn. Res., 9, 541–548.

Osborne, M. A., Garnett, R., & Roberts, S. J. (2009, January). Gaussian processes for global optimization. In 3rd Intl. Conf. on Learning and Intelligent Optimization. Trento, Italy.

Pashler, H., & Mozer, M. C. (in press). Enhancing perceptual category learning through fading: When does it help? J. of Exptl. Psych.: Learning, Mem., & Cog.

Rafferty, A. N., Brunskill, E. B., Griffiths, T. L., & Shafto, P. (2011). Faster teaching by POMDP planning. In Proc. of the 15th Intl. Conf. on AI in Education.

Sacks, J., Welch, W. J., Mitchell, T. J., & Wynn, H. P. (1989). Design and analysis of computer experiments. Statistical Science, 4, 409–435.

Salmon, J. P., McMullen, P. A., & Filliter, J. H. (2010). Norms for two types of manipulability (graspability and functional usage), familiarity, and age of acquisition for 320 photographs of objects. Behavior Research Methods, 42, 82–95.

Schloss, K. B., & Palmer, S. E. (2011). Aesthetic response to color combinations: preference, harmony, and similarity. Attention, Perception, & Psychophysics, 73, 551–571.

Srinivas, N., Krause, A., Kakade, S., & Seeger, M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning. Haifa, Israel.

Whitehill, J., & Movellan, J. R. (2010). Optimal teaching machines (Tech. Rep.). La Jolla, CA: Department of Computer Science, UCSD.