{"title": "Modeling the spacing effect in sequential category learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1159, "page_last": 1167, "abstract": "We develop a Bayesian sequential model for category learning. The sequential model updates two category parameters, the mean and the variance, over time. We define conjugate temporal priors to enable closed form solutions to be obtained. This model can be easily extended to supervised and unsupervised learning involving multiple categories. To model the spacing effect, we introduce a generic prior in the temporal updating stage to capture a learning preference, namely, less change for repetition and more change for variation. Finally, we show how this approach can be generalized to efficiently perform model selection to decide whether observations are from one or multiple categories.", "full_text": "Modeling the spacing effect in sequential category learning\n\nHongjing Lu\n\nDepartment of Psychology & Statistics\n\nHongjing@ucla.edu\n\nMatthew Weiden\n\nDepartment of Psychology\n\nmweiden@ucla.edu\n\nAlan Yuille\n\nDepartment of Statistics, Computer Science & Psychology\n\nUniversity of California, Los Angeles\n\nLos Angeles, CA 90095\n\nyuille@stat.ucla.edu\n\nAbstract\n\nWe develop a Bayesian sequential model for category learning. The sequential model updates two category parameters, the mean and the variance, over time. We define conjugate temporal priors to enable closed form solutions to be obtained. This model can be easily extended to supervised and unsupervised learning involving multiple categories. To model the spacing effect, we introduce a generic prior in the temporal updating stage to capture a learning preference, namely, less change for repetition and more change for variation. 
Finally, we show how this approach can be generalized to efficiently perform model selection to decide whether observations are from one or multiple categories.\n\n1 Introduction\n\nInductive learning - the process by which a new concept or category is acquired through observation of exemplars - poses a fundamental theoretical problem for cognitive science. When exemplars are encountered sequentially, as is typical in everyday learning, learning is influenced in systematic ways by presentation order. One pervasive phenomenon is the spacing effect, manifested in the finding that, given a fixed amount of total study time with a given item, learning is facilitated when presentations of the item are spread across a longer time interval rather than massed into a continuous study period. In category learning, for example, exemplars of two categories can be spaced by presenting them in an interleaved manner (e.g., A1B1A2B2A3B3), or massed by presenting them in consecutive blocks (e.g., A1A2A3B1B2B3). Kornell & Bjork [1] show that when tested later on classification of novel category members, spaced presentation yields superior performance relative to massed presentation. Similar spacing effects have been obtained in studies of item learning [2] and motor learning [3]. Moreover, spacing effects are found not only in human learning, but also in various types of learning in other species, including rats and Aplysia [4][5].\n\nIn the present paper we focus on spacing effects in the context of sequential category learning. Standard statistical methods based on summary information are unable to deal with order effects, including the performance difference between spaced and massed conditions. 
From a computational perspective, a sequential learning model is needed to construct category representations from training examples and dynamically update the parameters of these representations from trial to trial. Bayesian sequential models have been successfully applied to model causal learning and animal conditioning [6][7]. In the context of category learning, if we assume that the representation of each category can be specified by a Gaussian distribution in which the mean µ and the variance σ² are both random variables [8], then the learning model must compute the posterior distribution of the category parameters given all the observations from trial 1 to trial t, P(µ, σ²|X_t).\n\nHowever, given that both the mean and the variance of a category are random variables, standard Kalman filtering [9] is not directly applicable in this case, since it assumes a known variance, which is not warranted in the current application.\n\n
Finally, we will show how this approach can be generalized to efficiently perform model selection.\n\nIn this paper, we extend traditional Kalman filtering in order to update two category parameters, the mean and the variance, over time in the context of category learning. We define conjugate temporal priors to enable closed form solutions to be obtained in this learning model with two unknown parameters. We illustrate how the learning model can be easily extended to learning situations involving multiple categories, either with supervision (i.e., learners are informed of the category membership of each training observation) or without supervision (i.e., category membership is not provided to learners). Surprisingly, we can also derive closed form solutions in the latter case. This reduces the need to employ particle filters as an approximation to exact inference, as is common in unsupervised learning [10]. To model the spacing effect, we introduce a generic prior in the temporal updating stage.\n\nThe organization of the paper is as follows. In Section 2 we introduce the Bayesian sequential learning framework in the context of category learning and discuss the conjugacy property of the model. Sections 3 and 4 demonstrate how to develop supervised and unsupervised learning models, which can be compared with human performance. We draw general conclusions in Section 5.\n\n2 Bayesian sequential model\n\nWe adopt the framework of Bayesian sequential learning [11], termed Bayes-Kalman, a probabilistic model in which learning is assumed to be a Markov process with unobserved states. The exemplars in training are directly observable, but the representations of categories are hidden and unobservable. In this paper, we assume that categories can be represented as Gaussian distributions with two unknown parameters, means and variances. These two unknown parameters need to be learned from a limited number of exemplars (e.g., fewer than ten).\n\nWe now state the general framework and give the update rule for the simplest situation, where the training data is generated by a single category specified by a mean m and precision r; the precision is the inverse of the variance and is used to simplify the algebra. Our model assumes that the mean can change over time and is denoted by m_t, where t is the time step. The model is specified by the prior distribution P(m_0, r), the likelihood function P(x|m_t, r) for generating the observations, and the temporal prior P(m_{t+1}|m_t) specifying how m_t can vary over time. Note that the precision r is estimated over time, which differs from standard Kalman filtering, where it is assumed to be known. Bayes-Kalman [11] gives iterative equations to determine the posterior P(m_t, r|X_t) after a sequence of observations X_t = {x_1, ..., x_t}. 
The update equations are divided into two stages, prediction and correction:\n\nP(m_{t+1}, r|X_t) = ∫ dm_t P(m_{t+1}|m_t) P(m_t, r|X_t),  (1)\n\nP(m_{t+1}, r|X_{t+1}) = P(m_{t+1}, r|x_{t+1}, X_t) = P(x_{t+1}|m_{t+1}, r) P(m_{t+1}, r|X_t) / P(x_{t+1}|X_t).  (2)\n\nIntuitively, Bayes-Kalman first predicts the distribution P(m_{t+1}, r|X_t) and then uses this as a prior to correct for the new observation x_{t+1} and determine the new posterior P(m_{t+1}, r|X_{t+1}). Note that the temporal prior P(m_{t+1}|m_t) implies that the model automatically pays most attention to recent data and does not memorize the data, thus exhibiting sensitivity to data ordering.\n\n2.1 Conjugate priors\n\nThe distributions P(m_0, r), P(x|m_t, r), P(m_{t+1}|m_t) are chosen to be conjugate, so that the distribution P(m_t, r|X_t) takes the same functional form as P(m_0, r). As shown in the following section, this reduces the Bayes-Kalman equations to closed form update rules for the parameters of the distributions. The distributions are specified in terms of Gamma and Gaussian distributions:\n\ng(r : α, β) = (β^α / Γ(α)) r^(α−1) exp{−βr}, r ≥ 0. (Gamma)  (3)\n\nG(x : µ, ρ) = (ρ/2π)^(1/2) exp{−(ρ/2)(x − µ)²}. (Gaussian)  (4)\n\nWe specify the prior P(m_0, r) as the product of a Gaussian P(m_0|r) and a Gamma P(r):\n\nP(m_0|r) = G(m_0 : µ, τr), P(r) = g(r : α, β),  (5)\n\nwhere µ, τ, α, β are the parameters of the distribution. 
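As a concrete reference, the Gamma, Gaussian, and Gamma-Gaussian densities above can be written out directly. This is a sketch with illustrative parameter values of our own choosing; the function names are ours, not the paper's.

```python
import math

def gamma_pdf(r, alpha, beta):
    # g(r : alpha, beta) = beta^alpha / Gamma(alpha) * r^(alpha-1) * exp(-beta r)
    return beta ** alpha / math.gamma(alpha) * r ** (alpha - 1) * math.exp(-beta * r)

def gauss_pdf(x, mu, rho):
    # G(x : mu, rho), a Gaussian parameterized by precision rho
    return math.sqrt(rho / (2 * math.pi)) * math.exp(-0.5 * rho * (x - mu) ** 2)

def gamma_gaussian_pdf(m, r, mu, tau, alpha, beta):
    # Gamma-Gaussian prior P(m, r) = G(m : mu, tau * r) * g(r : alpha, beta)
    return gauss_pdf(m, mu, tau * r) * gamma_pdf(r, alpha, beta)

# Sanity check: the Gamma marginal integrates to ~1 (midpoint rule on [0, 40]).
dr = 0.001
total = sum(gamma_pdf(0.0005 + i * dr, alpha=2.0, beta=1.0) * dr for i in range(40000))
print(total)
```

Evaluating `gamma_gaussian_pdf` at any point gives the joint prior density used throughout the paper's updates.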
For simplicity, we call this a Gamma-Gaussian distribution with parameters µ, τ, α, β.\n\nThe likelihood function and temporal prior are both Gaussians:\n\nP(x_t|m_t, r) = G(x_t : m_t, ζr), P(m_{t+1}|m_t) = G(m_{t+1} : m_t, γr),  (6)\n\nwhere ζ, γ are constants.\n\nThe conjugacy of the distributions ensures that the posterior distribution P(m_t, r|X_t) will also be a Gamma-Gaussian distribution with parameters µ_t, τ_t, α_t, β_t, where the update rules for these parameters are specified in the next section.\n\n2.2 Update rules for the model parameters\n\nThe update rules for the model parameters follow from substituting the distributions into the Bayes-Kalman equations (1,2). We sketch how these update rules are obtained, assuming that P(m_t, r|X_t) is a Gamma-Gaussian with parameters µ_t, τ_t, α_t, β_t, which is true for t = 0 by equations (5,6). The form of the prediction equation and the temporal prior, see equations (1,6), ensures that P(m_{t+1}, r|X_t) is also a Gamma-Gaussian distribution, with parameters µ_t, τ_t^p, α_t, β_t, where\n\nτ_t^p = τ_t γ / (τ_t + γ).  (7)\n\nThe correction equation and the likelihood function, see equations (2,6), ensure that P(m_{t+1}, r|X_{t+1}) is also Gamma-Gaussian, with parameters µ_{t+1}, τ_{t+1}, α_{t+1}, β_{t+1} given by:\n\nα_{t+1} = α_t + 1/2, β_{t+1} = β_t + ζ τ_t^p (x_{t+1} − µ_t)² / (2(ζ + τ_t^p)), µ_{t+1} = (ζ x_{t+1} + τ_t^p µ_t) / (ζ + τ_t^p), τ_{t+1} = ζ + τ_t^p.  (8)\n\nIntuitively, the prediction only reduces the precision of m but makes no change to its mean or to the distribution over r. 
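The closed form updates in equations (7)-(8) are straightforward to implement. A minimal sketch follows; the hyperparameter values (ζ = 1, γ = 10, unit prior parameters) are illustrative choices of ours, not values from the paper.

```python
def predict(tau, gamma):
    # Prediction stage (equation 7): the temporal prior only shrinks tau.
    return tau * gamma / (tau + gamma)

def correct(params, x, zeta):
    # Correction stage (equation 8): fold in a new observation x.
    mu, tau, alpha, beta = params
    alpha_new = alpha + 0.5
    beta_new = beta + zeta * tau * (x - mu) ** 2 / (2 * (zeta + tau))
    mu_new = (zeta * x + tau * mu) / (zeta + tau)
    tau_new = zeta + tau
    return (mu_new, tau_new, alpha_new, beta_new)

def update(params, x, zeta=1.0, gamma=10.0):
    # One full Bayes-Kalman step: predict, then correct.
    mu, tau, alpha, beta = params
    tau_p = predict(tau, gamma)
    return correct((mu, tau_p, alpha, beta), x, zeta)

params = (0.0, 1.0, 1.0, 1.0)   # (mu, tau, alpha, beta) prior
for x in [0.3, 0.5, 0.4]:
    params = update(params, x)
print(params)
```

After three observations near 0.4 the posterior mean has moved toward the data, α has grown by 3/2, and τ has accumulated precision from each correction.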
By contrast, the new observation alters the mean of m (moving it closer to the new observation x_{t+1}) and also increases its precision, which sharpens the distribution on r.\n\n2.3 Model evidence\n\nWe also need to compute the probability of the observation sequence X_t under the model (which will be used later for model selection). This can be expressed recursively as:\n\np(X_t) = p(x_t|X_{t−1}) p(x_{t−1}|X_{t−2}) ··· p(x_1).  (9)\n\nThis computation is also simplified because we use conjugate distributions. The terms in equation (9) can be expressed as P(x_{t+1}|X_t) = ∫ dm_{t+1} dr P(x_{t+1}|m_{t+1}, r) P(m_{t+1}, r|X_t), and these integrals can be calculated analytically, yielding:\n\nP(x_{t+1}|X_t) = {β_t + (ζτ_t / (2(ζ + τ_t))) (x_{t+1} − µ_t)²}^(−(α_t + 1/2)) · {(1/2π) ζτ_t / (ζ + τ_t)}^(1/2) · β_t^(α_t) Γ(α_t + 1/2) / Γ(α_t).  (10)\n\n3 Supervised category learning\n\nAlthough the learning model has been presented for one category, it can easily be extended to learning multiple categories when the category membership of the training data is known (i.e., under supervision). In this section, we first describe an experiment with two categories to show how the category representations change over time; we then simulate learning with six categories and compare model predictions with human data from psychological experiments.\n\n3.1 Two-category learning with supervision\n\nWe first conduct a synthetic experiment with two categories under supervision. We generate six training observations from two one-dimensional Gaussian distributions (representing categories A and B, respectively) with means [−0.4, 0.4] and standard deviation 0.4. 
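A training set of this kind can be generated in a few lines. This is an illustrative sketch: the random seed, labels, and helper names are our own choices, and the two presentation orders produced here correspond to the conditions described next.

```python
import random

random.seed(0)  # fixed seed, so the sketch is reproducible

def sample(category):
    # Categories A and B are Gaussians with means -0.4 and 0.4, sd 0.4.
    means = {"A": -0.4, "B": 0.4}
    return random.gauss(means[category], 0.4)

massed_order = ["A", "A", "A", "B", "B", "B"]
spaced_order = ["A", "B", "A", "B", "A", "B"]

massed = [(c, sample(c)) for c in massed_order]
spaced = [(c, sample(c)) for c in spaced_order]
print(massed)
print(spaced)
```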
Two training conditions are included: a massed condition with the data presentation order AAABBB, and a spaced condition with the order ABABAB.\n\nTo model the acquisition of category representations during training, we employ the Bayesian learning model described in the previous section. In the correction stage of each trial, the model updates the parameters of the category that produced the observation, as determined by the supervision (i.e., known category membership), following equation (8).\n\nIn the prediction stage, however, different values of a fixed model parameter γ are introduced to incorporate a generic prior that controls how much the learner is willing to update category representations from one trial to the next. The basic hypothesis is that learners will have greater confidence in their knowledge of a category presented on trial t than of a category absent on trial t. As a consequence, the learner is willing to accept more change in a category representation if the observation on the previous trial was drawn from a different category. This generic prior shares some conceptual similarity with a model developed by Kording et al. [13], which assumes that the moment-to-moment variance of the states is higher for faster timescales (p. 779).\n\nMore specifically, if the observation on trial t is from the first category, in the prediction phase we update the τ parameters for the two categories, τ_t^1 and τ_t^2, as:\n\nτ_t^1 ↦ τ_t^1 γ_s / (τ_t^1 + γ_s), τ_t^2 ↦ τ_t^2 γ_d / (τ_t^2 + γ_d),  (11)\n\nin which γ_s > γ_d. 
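The asymmetry in equation (11) is easy to see numerically: with a large γ for the just-seen category and a small γ for the absent one, the absent category's precision decays much faster. A sketch, using the γ values reported for the simulation (the starting τ is an arbitrary choice of ours):

```python
def temporal_update(tau, gamma):
    # Prediction-stage precision decay, equation (11) form.
    return tau * gamma / (tau + gamma)

gamma_s, gamma_d = 50.0, 0.5   # same-category vs different-category values
tau = 5.0
tau_same = temporal_update(tau, gamma_s)   # category seen on the current trial
tau_diff = temporal_update(tau, gamma_d)   # category absent on the current trial
print(tau_same, tau_diff)
```

The seen category keeps nearly all of its precision (5 → 250/55 ≈ 4.55), while the absent category loses most of it (5 → 5/11 ≈ 0.45), making its representation more malleable on the next trial.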
In the simulation, we used γ_s = 50 and γ_d = 0.5.\n\n[Figure 1 appears here.]\n\nFigure 1: Posterior distributions of means P(m_t|X_t) and precisions P(r_t|X_t) updated on training trials in two-category supervised learning. Blue lines indicate category parameters for the first category; red lines indicate parameters for the second category. The top panel shows the results for the massed condition (i.e., AAABBB), and the bottom panel shows the results for the spaced condition (i.e., ABABAB). Please see in colour. We show the distributions only on even trials to save space. See section 3.1.\n\nFigure (1) shows the change in the posterior distributions of the two unknown category parameters, the means P(m_t|X_t) and precisions P(r_t|X_t), over training trials. Figure (2) shows the category representation in the form of the posterior distribution P(x_t|X_t). In the massed condition (i.e., AAABBB), the variance of the first category decreases over the first three trials and then increases over the second three trials, because the observations are from the second category. The increase in category variance reflects the forgetting that occurs if no new observations are provided for a particular category after a long interval. 
This type of forgetting does not occur in the spaced condition, as the interleaved presentation order ABABAB ensures that each category recurs after a short interval.\n\nBased upon the learned category representations, we can compute accuracy (the ability to discriminate between the two learnt distributions) using the posterior distributions of the two categories. Over 100 simulations, the average accuracy in the massed condition is 0.78, which is lower than the 0.84 accuracy in the spaced condition. Thus our model is able to predict the spacing effect found in two-category supervised learning.\n\n[Figure 2 appears here.]\n\nFigure 2: Posterior distribution of each category, P(x_t|X_t), updated on training trials in two-category supervised learning. Same conventions as in figure (1). See section 3.1.\n\n3.2 Modeling the spacing effect in six-category learning\n\nKornell and Bjork [1] asked human subjects to study six paintings by each of six different artists, with a given artist's paintings presented consecutively (massed) or interleaved with other artists' paintings (spaced). In the training phase, subjects were informed which artist created each training painting. The same 36 paintings were studied in the training phase in both conditions, but with different presentation orders in the massed and spaced conditions. 
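The two presentation schedules for six categories can be sketched as follows; the single-letter artist labels are illustrative, and each artist contributes six paintings as in the experiment.

```python
artists = ["A", "B", "C", "D", "E", "F"]

# Massed: all six paintings by one artist before moving to the next.
massed = [a for a in artists for _ in range(6)]

# Spaced: interleaved, cycling through the six artists six times.
spaced = [a for _ in range(6) for a in artists]

print(len(massed), len(spaced))
```

Both schedules contain the same 36 items; only the order differs, which is exactly the manipulation that produces the spacing effect.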
In the subsequent test phase, subjects had to identify which artist had painted each of a series of new paintings, with one new painting per artist in each test block. Four test blocks were administered, with random presentation order for the artists. In each test block, participants were given feedback after making an identification response. Paintings presented in one test block thus served as training examples for the subsequent test block. Human results are shown in figure (4). Subjects showed significantly better test performance after spaced than after massed training. Given that feedback was provided and one painting from each artist appeared in each test block, it is not surprising that test performance increased across test blocks and that the spacing effect decreased with more test blocks.\n\nTo simulate the data, we generated training and test data from six one-dimensional Gaussian distributions with means [−2, −1.2, −0.4, 0.4, 1.2, 2] and standard deviation 0.4. Figure (3) shows the learned category representations in terms of posterior distributions. Depending on the presentation order of the training data (massed or spaced), the learned distributions differ in the means and variances of each category. To compare with the human performance reported by Kornell and Bjork, the model estimates accuracy in terms of discrimination between categories based upon the learned distributions. Figure (4) shows the average accuracy from 1000 simulations. The plot illustrates that the model predictions match human performance well.\n\n4 Unsupervised category learning\n\nBoth humans and animals can learn without supervision. For example, in the animal conditioning literature, various studies have shown that exposing two stimuli in blocks (equivalent to a massed condition) is less effective in producing generalization [12]. Balleine et al. 
[4] found that, with rats, preexposure to two stimuli A and B (massed or spaced) determines the degree to which backward blocking is subsequently obtained: backward blocking occurs if the preexposure is spaced but not if it is massed. They conclude that with massed preexposure the rats are unable to distinguish two separate categories for A and B, and therefore treat them as members of a single category. By contrast, they conclude that rats can distinguish the categories A and B with spaced preexposure.\n\n[Figure 3 appears here.]\n\nFigure 3: Posterior distribution of each category, P(x_t|X_t), updated on training trials in six-category supervised learning. Same conventions as in figure (1). See section 3.2.\n\n[Figure 4 appears here.]\n\nFigure 4: Human performance (left) and model prediction (right). Proportion correct as a function of presentation training conditions (massed and spaced) and test block. See section 3.2.\n\nIn this section, we generalize the sequential category model to unsupervised learning, in which the category membership of each training example is not provided to observers. 
We first derive the extension of the sequential model to this case (surprisingly, showing that we can obtain all results in closed form). We then determine whether massed and spaced stimuli (as in Balleine et al.'s experiment [4]) are more likely to have been generated by a single category or by two categories. We also assess the importance of supervision in training by comparing performance after unsupervised learning with that after supervised learning.\n\nWe consider a model with two hidden categories. Each category can be represented as a Gaussian distribution with a mean and precision, m^1, r^1 and m^2, r^2. The likelihood function assumes that the data is generated by either category with equal probability, since category membership is not provided:\n\nP(x|m^1, r^1, m^2, r^2) = (1/2) P(x|m^1, r^1) + (1/2) P(x|m^2, r^2),  (12)\n\nwith P(x|m^1, r^1) = G(x : m^1, ζr^1) and P(x|m^2, r^2) = G(x : m^2, ζr^2).  (13)\n\nWe specify prior distributions and temporal priors as before:\n\nP(m_0^1, r^1) = G(m_0^1 : µ^1, τr^1), P(m_0^2, r^2) = G(m_0^2 : µ^2, τr^2),  (14)\n\nP(m_{t+1}^1|m_t^1) = G(m_{t+1}^1 : m_t^1, γr^1), P(m_{t+1}^2|m_t^2) = G(m_{t+1}^2 : m_t^2, γr^2).  (15)\n\nThe joint posterior distribution P(m_t^1, r^1, m_t^2, r^2|X_t) after observations X_t can be formally obtained by applying the Bayes-Kalman update rules to the joint distribution, i.e., by replacing (m_t, r) with (m_t^1, r^1, m_t^2, r^2) in equations (1,2). But this update is more complicated, because we do not know whether the new observation x_t should be assigned to category 1 or category 2. Instead we have to sum over all possible assignments of the observations to the categories, which gives 2^t possible assignments at time t. This can be performed efficiently in a recursive manner. Let A_t denote the set
Let At denote the set\nof possible assignments at time t where each assignment is a string (a1, ..., at) of binary variables\n\n6\n\n\fof length t, where (1, ..., 1) is the assignment where all the observations are assigned to category 1,\n(2, 1, ..., 1) assigns the \ufb01rst observation to category 2 and the remainder to category 1, and so on.\nBy substituting equations (12,14,15) into Bayes-Kalman we can obtain an iterative update equation\nfor P (m1\n\nt , r2|Xt). At time t we represent:\n\nt , r1, m2\n\nP (m1\n\nt , r1, m2\n\nt , r2|Xt) = X(a1,...,at)\u2208At\n\nP (m1, r|~\u03b11\n\na1,...,at)P (m2, r|~\u03b12\n\n(a1,...,at))P (a1, ..., at|Xt),\n\n(16)\nwhere \u03b1i\n(a1,...,at) denotes the values of the parameters ~\u03b1 = (\u03b1, \u03b2, \u00b5, \u03c4 ) for category i (i \u2208 {1, 2})\nfor observation sequence (a1, ..., at) and P (a1, ..., at) is the probability of assignment (a1, ..., at).\nAt t = 0 there is no observation sequence and P (m1\n0, r2|Xt) = P (m1, r|~\u03b11)P (m2, r|~\u03b12)\nwhich corresponds to A0 containing a single element which has probability one.\nThe prediction stage updates the \u03c4 component of ~\u03b1i(a1, ..., at) by:\n\u03b3i(at)\u03c4 i(a1, ..., at)\n\n0, r1, m2\n\n\u03c4 i(a1, ..., at) 7\u2192\n\n\u03b3i(at) + \u03c4 i(a1, ..., at)\n\n.\n\n(17)\n\nWe de\ufb01ne \u03b3i(at) as larger if i = at and smaller if i 6= at, as speci\ufb01ed in equation (11) to incorporate\nthe generic prior described in section 3.1.\n\nThe correction stage at time t + 1 introduces another observation, which must be assigned to the\ntwo categories. 
This gives a new set A_{t+1} of 2^(t+1) assignments of the form (a_1, ..., a_{t+1}) and a new posterior:\n\nP(m_{t+1}^1, r^1, m_{t+1}^2, r^2|X_{t+1}) = Σ_{(a_1,...,a_{t+1})∈A_{t+1}} P(m^1, r^1|~α^1_{(a_1,...,a_{t+1})}) P(m^2, r^2|~α^2_{(a_1,...,a_{t+1})}) P(a_1, ..., a_{t+1}|X_{t+1}),\n\nwhere we compute ~α^i_{(a_1,...,a_{t+1})} for i ∈ {1, 2} by:\n\nα^i_{(a_1,...,a_{t+1})} = α^i_{(a_1,...,a_t)} + 1/2, β^i_{(a_1,...,a_{t+1})} = β^i_{(a_1,...,a_t)} + ζ τ^i_{(a_1,...,a_t)} (x_{t+1} − µ^i_{(a_1,...,a_t)})² / (2(ζ + τ^i_{(a_1,...,a_t)})),  (18)\n\nµ^i_{(a_1,...,a_{t+1})} = (ζ x_{t+1} + τ^i_{(a_1,...,a_t)} µ^i_{(a_1,...,a_t)}) / (ζ + τ^i_{(a_1,...,a_t)}), τ^i_{(a_1,...,a_{t+1})} = ζ + τ^i_{(a_1,...,a_t)},  (19)\n\nand we compute P(a_1, ..., a_{t+1}|X_{t+1}) by:\n\nP(a_1, ..., a_{t+1}|X_{t+1}) = P(x_{t+1}|~α^{a_{t+1}}_{(a_1,...,a_t)}) P(a_1, ..., a_t|X_t) / Σ_{(a_1,...,a_t)} P(x_{t+1}|~α^{a_{t+1}}_{(a_1,...,a_t)}) P(a_1, ..., a_t|X_t),  (20)\n\nwhere\n\nP(x_{t+1}|~α^{a_{t+1}}_{(a_1,...,a_t)}) = ∫ dm^{a_{t+1}} dr^{a_{t+1}} P(x_{t+1}|m^{a_{t+1}}, r^{a_{t+1}}) P(m^{a_{t+1}}, r^{a_{t+1}}|~α_{(a_1,...,a_t)}).  (21)\n\nModel selection can, as before, be based on the evidence p(X_t) = P(x_t|X_{t−1}) P(x_{t−1}|X_{t−2}) ··· P(x_1), where\n\nP(x_{t+1}|X_t) = Σ_{(a_1,...,a_t)∈A_t} P(x_{t+1}|~α^{a_{t+1}}_{(a_1,...,a_t)}) P(a_1, ..., a_t|X_t).  (22)\n\nWe can now address the problem posed by Balleine et al.'s preexposure experiments [4]: why do rats identify a single category for the massed stimuli but two categories for the spaced stimuli? We treat this as a model selection problem. We compare the evidence for the sequential model with one category, see equations (9,10), with the evidence for the model with two categories, see equations (9,22), for the two cases AAABBB (massed) and ABABAB (spaced).\n\nWe use the same data as described in section (3.1), but without providing category membership for any of the training data. 
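To illustrate how evidence enters this comparison, the one-category evidence (equations 9-10) is simply a running product of closed-form predictive densities, with the Gamma-Gaussian parameters updated after every observation. This is a sketch with our own hyperparameter choices, not the paper's fitted values.

```python
import math

def predictive(x, mu, tau, alpha, beta, zeta):
    # Closed-form predictive density, equation (10) form.
    c = zeta * tau / (zeta + tau)
    kernel = (beta + 0.5 * c * (x - mu) ** 2) ** (-(alpha + 0.5))
    return kernel * math.sqrt(c / (2 * math.pi)) * beta ** alpha \
        * math.gamma(alpha + 0.5) / math.gamma(alpha)

def step(mu, tau, alpha, beta, x, zeta, gamma):
    # One prediction-correction cycle (equations 7-8).
    tau_p = tau * gamma / (tau + gamma)
    beta_n = beta + zeta * tau_p * (x - mu) ** 2 / (2 * (zeta + tau_p))
    mu_n = (zeta * x + tau_p * mu) / (zeta + tau_p)
    return mu_n, zeta + tau_p, alpha + 0.5, beta_n

def log_evidence(xs, mu=0.0, tau=1.0, alpha=1.0, beta=1.0, zeta=1.0, gamma=10.0):
    # log p(X_t) = sum of log predictive densities (equation 9).
    total = 0.0
    for x in xs:
        tau_p = tau * gamma / (tau + gamma)   # predictive uses the predicted tau
        total += math.log(predictive(x, mu, tau_p, alpha, beta, zeta))
        mu, tau, alpha, beta = step(mu, tau, alpha, beta, x, zeta, gamma)
    return total

# A tightly clustered sequence should earn higher one-category evidence
# than a widely dispersed, alternating one.
e_clustered = log_evidence([0.1, 0.0, 0.2, -0.1, 0.1, 0.0])
e_dispersed = log_evidence([-3.0, 3.0, -3.0, 3.0, -3.0, 3.0])
print(e_clustered, e_dispersed)
```

The two-category evidence (equation 22) replaces the single predictive term by a sum over assignment strings; comparing the two evidences yields the ratio plotted in figure (5).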
The left plot in figure (5) shows the result obtained by comparing the model evidence for the one-category model with the model evidence for the two-category model. A greater ratio value indicates greater support for the one-category account. As shown in figure (5), the model decides that all training observations are from one category in the massed condition, but from two different categories in the spaced condition (using zero as the decision threshold). These predictions agree with Balleine et al.'s findings.\n\n[Figure 5 appears here.]\n\nFigure 5: Model selection and accuracy results. Left, model selection results as a function of presentation training conditions (massed and spaced). A greater ratio indicates more support for the one-category account. Error bars indicate the standard error from 100 simulations. Right, comparison of supervised and unsupervised learning in terms of accuracy. See section 4.\n\nTo assess the influence of supervision on learning, we compare performance for supervised learning (described in section (3.1)) with unsupervised learning (described in this section). To make the comparison, we assume that learners are provided with the same training data and are informed that the data are from two different categories, with either known category membership (supervised) or unknown category membership (unsupervised) for each training observation. Accuracy, measured by discrimination between the two categories, is compared in the right plot of figure (5). The model predicts higher accuracy given supervised than unsupervised learning. 
Furthermore, the model predicts a spacing effect for both types of learning, although the effect is reduced with unsupervised learning.\n\n5 Conclusions\n\nIn this paper, we develop a Bayesian sequential model for category learning by updating category representations over time based on two category parameters, the mean and the variance. Analytic updating rules are obtained by defining conjugate temporal priors that enable closed form solutions. A generic prior in the temporal updating stage is introduced to model the spacing effect. Parameter estimation and model selection can be performed on the basis of the updating rules. The current work extends standard Kalman filtering, and is able to predict learning phenomena that have been observed in humans and other animals.\n\nIn addition to explaining the spacing effect, our model predicts that subjects will become less certain about their knowledge of learned categories as time passes; see the increase in category variance in Figure 2. But our model is not a standard Kalman filter (since the measurement variance is unknown), so we do not predict exponential decay. Instead, as shown in Equation 10, our model predicts the pattern of power-law forgetting that is fairly universal in human memory [14].\n\nFor a small number of observations, our model is extremely efficient because we can derive analytic solutions. For example, the analytic solutions for unsupervised learning require only 0.2 seconds for six observations, while numerical integration takes 18 minutes. However, our model scales exponentially with the number of observations in unsupervised learning. Future work will include a pruning strategy to keep the complexity practical.\n\nAcknowledgement\n\nThis research was supported by a grant from the Air Force, FA 9550-08-1-0489.\n\nReferences\n\n[1] Kornell, N., & Bjork, R. A. (2008a). Learning concepts and categories: Is spacing the “enemy of induction”? Psychological Science, 19, 585-592.\n\n[2] Bahrick, H. P., Bahrick, L. E., Bahrick, A. S., & Bahrick, P. E. (1993). Maintenance of foreign language vocabulary and the spacing effect. Psychological Science, 4, 316-321.\n\n[3] Shea, J. B., & Morgan, R. L. (1979). Contextual interference effects on the acquisition, retention, and transfer of a motor skill. Journal of Experimental Psychology: Human Learning and Memory, 5, 179-187.\n\n[4] Balleine, B. W., Espinet, A., & Gonzalez, F. (2005). Perceptual learning enhances retrospective revaluation of conditioned flavor preferences in rats. Journal of Experimental Psychology: Animal Behavior Processes, 31(3), 341-350.\n\n[5] Carew, T. J., Pinsker, H. M., & Kandel, E. R. (1972). Long-term habituation of a defensive withdrawal reflex in Aplysia. Science, 175, 451-454.\n\n[6] Daw, N., Courville, A. C., & Dayan, P. (2007). Semi-rational models of conditioning: The case of trial order. In M. Oaksford and N. Chater (Eds.), The probabilistic mind: Prospects for rational models of cognition. Oxford: Oxford University Press.\n\n[7] Dayan, P., & Kakade, S. (2000). Explaining away in weight space. In T. K. Leen et al. (Eds.), Advances in neural information processing systems (Vol. 13, pp. 451-457). Cambridge, MA: MIT Press.\n\n[8] Fried, L. S., & Holyoak, K. J. (1984). Induction of category distributions: A framework for classification learning. Journal of Experimental Psychology: Learning, Memory and Cognition, 10, 234-257.\n\n[9] Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME-Journal of Basic Engineering, 82, 35-45.\n\n[10] Schubert, J., & Sidenbladh, H. (2005). Sequential clustering with particle filters: Estimating the number of clusters from data. 7th International Conference on Information Fusion (FUSION).\n\n[11] Ho, Y-C., & Lee, R. C. K. (1964). 
A Bayesian approach to problems in stochastic estimation and control. IEEE Transactions on Automatic Control, 9, 333-339.\n\n[12] Honey, R. C., Bateson, P., & Horn, G. (1994). The role of stimulus comparison in perceptual learning: An investigation with the domestic chick. Quarterly Journal of Experimental Psychology: Comparative and Physiological Psychology, 47(B), 83-103.\n\n[13] Kording, K. P., Tenenbaum, J. B., & Shadmehr, R. (2007). The dynamics of memory as a consequence of optimal adaptation to a changing body. Nature Neuroscience, 10, 779-786.\n\n[14] Anderson, J. R., & Schooler, L. J. (1991). Reflections of the environment in memory. Psychological Science, 2, 395-408.\n", "award": [], "sourceid": 1010, "authors": [{"given_name": "Hongjing", "family_name": "Lu", "institution": null}, {"given_name": "Matthew", "family_name": "Weiden", "institution": null}, {"given_name": "Alan", "family_name": "Yuille", "institution": null}]}