{"title": "Modeling human function learning with Gaussian processes", "book": "Advances in Neural Information Processing Systems", "page_first": 553, "page_last": 560, "abstract": "Accounts of how people learn functional relationships between continuous variables have tended to focus on two possibilities: that people are estimating explicit functions, or that they are simply performing associative learning supported by similarity. We provide a rational analysis of function learning, drawing on work on regression in machine learning and statistics. Using the equivalence of Bayesian linear regression and Gaussian processes, we show that learning explicit rules and using similarity can be seen as two views of one solution to this problem. We use this insight to define a Gaussian process model of human function learning that combines the strengths of both approaches.", "full_text": "Modeling human function learning\n\nwith Gaussian processes\n\nThomas L. Grif\ufb01ths Christopher G. Lucas Joseph J. Williams\n\nDepartment of Psychology\n\nUniversity of California, Berkeley\n\n{tom griffiths,clucas,joseph williams}@berkeley.edu\n\nBerkeley, CA 94720-1650\n\nMichael L. Kalish\n\nInstitute of Cognitive Science\n\nUniversity of Louisiana at Lafayette\n\nLafayette, LA 70504-3772\nkalish@lousiana.edu\n\nAbstract\n\nAccounts of how people learn functional relationships between continuous vari-\nables have tended to focus on two possibilities: that people are estimating explicit\nfunctions, or that they are performing associative learning supported by similarity.\nWe provide a rational analysis of function learning, drawing on work on regres-\nsion in machine learning and statistics. Using the equivalence of Bayesian linear\nregression and Gaussian processes, we show that learning explicit rules and us-\ning similarity can be seen as two views of one solution to this problem. 
We use\nthis insight to de\ufb01ne a Gaussian process model of human function learning that\ncombines the strengths of both approaches.\n\n1 Introduction\n\nMuch research on how people acquire knowledge focuses on discrete structures, such as the nature\nof categories or the existence of causal relationships. However, our knowledge of the world also\nincludes relationships between continuous variables, such as the difference between linear and ex-\nponential growth, or the form of causal relationships, such as how pressing the accelerator of a car\nin\ufb02uences its velocity. Research on how people learn relationships between two continuous vari-\nables \u2013 known in the psychological literature as function learning \u2013 has tended to emphasize two\ndifferent ways in which people could be solving this problem. One class of theories (e.g., [1, 2, 3])\nsuggests that people are learning an explicit function from a given class, such as the polynomials\nof degree k. This approach attributes rich representations to human learners, but has traditionally\ngiven limited treatment to the question of how such representations could be acquired. A second\napproach (e.g., [4, 5]) emphasizes the possibility that people learn by forming associations between\nobserved values of input and output variables, and generalize based on the similarity of new inputs\nto old. This approach has a clear account of the underlying learning mechanisms, but faces chal-\nlenges in explaining how people generalize so broadly beyond their experience, making predictions\nabout variable values that are signi\ufb01cantly removed from their previous observations. Most recently,\nhybrids of these two approaches have been proposed (e.g., [6, 7]), with explicit functions being\nrepresented, but acquired through associative learning.\nPrevious models of human function learning have been oriented towards understanding the psycho-\nlogical processes by which people solve this problem. 
In this paper, we take a different approach,\npresenting a rational analysis of function learning, in the spirit of [8]. This rational analysis provides\na way to understand the relationship between the two approaches that have dominated previous work\n\u2013 rules and similarity \u2013 and suggests how they might be combined. The basic strategy we pursue\nis to consider the abstract computational problem involved in function learning, and then to explore\noptimal solutions to that problem with the goal of shedding light on human behavior. In particular,\nthe problem of learning a functional relationship between two continuous variables is an instance of\nregression, and has been extensively studied in machine learning and statistics.\nThere are a variety of solutions to regression problems, but we focus on methods related to Bayesian\nlinear regression (e.g., [9]), which allow us to make the expectations of learners about the form of\nfunctions explicit through a prior distribution. Bayesian linear regression is also directly related to\na nonparametric approach known as Gaussian process prediction (e.g., [10]), in which predictions\nabout the values of an output variable are based on the similarity between values of an input variable.\nWe use this relationship to connect the two traditional approaches to modeling function learning, as\nit shows that learning rules that describe functions and specifying the similarity between stimuli for\nuse in associative learning are not mutually exclusive alternatives, but rather two views of the same\nsolution to this problem. 
We exploit this fact to de\ufb01ne a rational model of human function learning\nthat incorporates the strengths of both approaches.\n\n2 Models of human function learning\n\nIn this section we review the two traditional approaches to modeling human function learning \u2013 rules\nand similarity \u2013 and some more recent hybrid approaches that combine the two.\n\n2.1 Representing functions with rules\n\nThe idea that people might represent functions explicitly appears in one of the \ufb01rst papers on human\nfunction learning [1]. This paper proposed that people assume a particular class of functions (such\nas polynomials of degree k) and use the available observations to estimate the parameters of those\nfunctions, forming a representation that goes beyond the observed values of the variables involved.\nConsistent with this hypothesis, people learned linear and quadratic functions better than random\npairings of values for two variables, and extrapolated appropriately. Similar assumptions guided\nsubsequent work exploring the ease with which people learn functions from different classes (e.g.,\n[2]), and papers have tested statistical regression schemes as potential models of learning, examining\nhow well human responses were described by different forms of nonlinear regression (e.g., [3]).\n\n2.2 Similarity and associative learning\n\nAssociative learning models propose that people do not learn relationships between continuous vari-\nables by explicitly learning rules, but by forging associations between observed variable pairs and\ngeneralizing based on the similarity of new variable values to old. The \ufb01rst model to implement this\napproach was the Associative-Learning Model (ALM; [4, 5]), in which input and output arrays are\nused to represent a range of values for the two variables between which the functional relationship\nholds. 
Presentation of an input activates input nodes close to that value, with activation falling off\nas a Gaussian function of distance, explicitly implementing a theory of similarity in the input space.\nLearned weights determine the activation of the output nodes, being a weighted linear function of the\nactivation of the input nodes. Associative learning for the weights is performed by applying gradient\ndescent on the squared error between current output activation and the correct value. In practice, this\napproach performs well when interpolating between observed values, but poorly when extrapolating\nbeyond those values. As a consequence, the same authors introduced the Extrapolation-Association\nModel (EXAM), which constructs a linear approximation to the output of the ALM when selecting\nresponses, producing a bias towards linearity that better matches human judgments.\n\n2.3 Hybrid approaches\n\nSeveral papers have explored methods for combining rule-like representations of functions with\nassociative learning. One example of such an approach is the set of rule-based models explored in\n[6]. These models used the same kind of input representation as ALM and EXAM, with activation\nof a set of nodes similar to the input value. However, the models also feature a set of hidden units,\nwhere each hidden unit corresponds to a different parameterization of a rule from a given class\n(polynomial, Fourier, or logistic). The values of the hidden nodes \u2013 corresponding to the values\nof the rules they instantiate \u2013 are combined linearly to obtain output predictions, with the weight\nof each hidden node being learned through gradient descent (with a penalty for the curvature of\nthe functions involved). 
A more complex instance of this kind of approach is the Population of\nLinear Experts (POLE) model [7], in which hidden units each represent different linear functions,\nbut the weights from input to hidden nodes indicate which linear function should be used to make\npredictions for particular input values. As a consequence, the model can learn non-linear functions\nby identifying a series of local linear approximations, and can even model situations in which people\nseem to learn different functions in different parts of the input space.\n\n3 Rational solutions to regression problems\n\nThe models outlined in the previous section all aim to describe the psychological processes involved\nin human function learning. In this section, we consider the abstract computational problem under-\nlying this task, using optimal solutions to this problem to shed light on both previous models and\nhuman learning. Viewed abstractly, the computational problem behind function learning is to learn\na function f mapping from x to y from a set of real-valued observations xn = (x1, . . . , xn) and\ntn = (t1, . . . , tn), where ti is assumed to be the true value yi = f(xi) obscured by additive noise.1\nIn machine learning and statistics, this is referred to as a regression problem. In this section, we dis-\ncuss how this problem can be solved using Bayesian statistics, and how the result of this approach\nis related to Gaussian processes. Our presentation follows that in [10].\n\n3.1 Bayesian linear regression\n\nIdeally, we would seek to solve our regression problem by combining some prior beliefs about the\nprobability of encountering different kinds of functions in the world with the information provided\nby x and t. 
We can do this by applying Bayes\u2019 rule, with\n\n$$p(f \\mid x_n, t_n) = \\frac{p(t_n \\mid f, x_n)\\, p(f)}{\\int_F p(t_n \\mid f, x_n)\\, p(f)\\, df}, \\tag{1}$$\n\nwhere p(f) is the prior distribution over functions in the hypothesis space F, p(tn|f, xn) is the\nprobability of observing the values of tn if f were the true function, known as the likelihood, and\np(f|xn, tn) is the posterior distribution over functions given the observations xn and tn. In many\ncases, the likelihood is de\ufb01ned by assuming that the values of ti are independent given f and xi,\nbeing Gaussian with mean yi = f(xi) and variance $\\sigma_t^2$. Predictions about the value of the function\nf for a new input xn+1 can be made by integrating over the posterior distribution,\n\n$$p(y_{n+1} \\mid x_{n+1}, t_n, x_n) = \\int_f p(y_{n+1} \\mid f, x_{n+1})\\, p(f \\mid x_n, t_n)\\, df, \\tag{2}$$\n\nwhere p(yn+1|f, xn+1) is a delta function placing all of its mass on yn+1 = f(xn+1).\nPerforming the calculations outlined in the previous paragraph for a general hypothesis space F is\nchallenging, but becomes straightforward if we limit the hypothesis space to certain speci\ufb01c classes\nof functions. If we take F to be all linear functions of the form y = b0 + xb1, then our problem takes\nthe familiar form of linear regression. To perform Bayesian linear regression, we need to de\ufb01ne a\nprior p(f) over all linear functions. Since these functions are identi\ufb01ed by the parameters b0 and\nb1, it is suf\ufb01cient to de\ufb01ne a prior over b = (b0, b1), which we can do by assuming that b follows\na multivariate Gaussian distribution with mean zero and covariance \u03a3b. 
Applying Equation 1 then\nresults in a multivariate Gaussian posterior distribution on b (see [9]) with\n\n$$E[b \\mid x_n, t_n] = \\left(\\sigma_t^2 \\Sigma_b^{-1} + X_n^T X_n\\right)^{-1} X_n^T t_n \\tag{3}$$\n\n$$\\mathrm{cov}[b \\mid x_n, t_n] = \\left(\\Sigma_b^{-1} + \\frac{1}{\\sigma_t^2} X_n^T X_n\\right)^{-1} \\tag{4}$$\n\nwhere Xn = [1n xn] (i.e., a matrix with a vector of ones horizontally concatenated with xn). Since\nyn+1 is simply a linear function of b, applying Equation 2 yields a Gaussian predictive distribution,\nwith yn+1 having mean [1 xn+1]E[b|xn, tn] and variance [1 xn+1]cov[b|xn, tn][1 xn+1]T . The\npredictive distribution for tn+1 is similar, but with the addition of $\\sigma_t^2$ to the variance.\nWhile considering only linear functions might seem overly restrictive, linear regression actually\ngives us the basic tools we need to solve this problem for more general classes of functions. Many\nclasses of functions can be described as linear combinations of a small set of basis functions. For\nexample, all kth degree polynomials are linear combinations of functions of the form 1 (the constant\nfunction), x, x2, . . . , xk. Letting \u03c6(1), . . . , \u03c6(k) denote a set of functions, we can de\ufb01ne a prior\non the class of functions that are linear combinations of this basis by expressing such functions in\nthe form f(x) = b0 + \u03c6(1)(x)b1 + . . . + \u03c6(k)(x)bk and de\ufb01ning a prior on the vector of weights\nb. If we take the prior to be Gaussian, we reach the same solution as outlined in the previous\nparagraph, substituting \u03a6 = [1n \u03c6(1)(xn) . . . \u03c6(k)(xn)] for X and [1 \u03c6(1)(xn+1) . . . \u03c6(k)(xn+1)]\nfor [1 xn+1], where \u03c6(xn) = [\u03c6(x1) . . . \u03c6(xn)]T .\n\n1Following much of the literature on human function learning, we consider only one-dimensional functions,\nbut this approach generalizes naturally to the multi-dimensional case.\n\n3.2 Gaussian processes\n\nIf our goal were merely to predict yn+1 from xn+1, yn, and xn, we might consider a different\napproach, simply de\ufb01ning a joint distribution on yn+1 given xn+1 and conditioning on yn. For\nexample, we might take the yn+1 to be jointly Gaussian, with covariance matrix\n\n$$K_{n+1} = \\begin{pmatrix} K_n & k_{n,n+1} \\\\ k_{n,n+1}^T & k_{n+1} \\end{pmatrix} \\tag{5}$$\n\nwhere Kn depends on the values of xn, kn,n+1 depends on xn and xn+1, and kn+1 depends only\non xn+1. If we condition on yn, the distribution of yn+1 is Gaussian with mean $k_{n,n+1}^T K_n^{-1} y_n$\nand variance $k_{n+1} - k_{n,n+1}^T K_n^{-1} k_{n,n+1}$. This approach to prediction uses a Gaussian process, a\nstochastic process that induces a Gaussian distribution on y based on the values of x. This approach\ncan also be extended to allow us to predict yn+1 from xn+1, tn, and xn by adding $\\sigma_t^2 I_n$ to Kn,\nwhere In is the n \u00d7 n identity matrix, to take into account the additional variance associated with\ntn.\nThe covariance matrix Kn+1 is speci\ufb01ed using a two-place function in x known as a kernel, with\nKij = K(xi, xj). Any kernel that results in an appropriate (symmetric, positive-de\ufb01nite) covariance\nmatrix for all x can be used. Common kinds of kernels include radial basis functions, e.g.,\n\n$$K(x_i, x_j) = \\theta_1^2 \\exp\\left(-\\frac{1}{\\theta_2^2}(x_i - x_j)^2\\right) \\tag{6}$$\n\nwith values of y for which values of x are close being correlated, and periodic functions, e.g.,\n\n$$K(x_i, x_j) = \\theta_3^2 \\exp\\left(\\theta_4^2 \\cos\\left(\\frac{2\\pi}{\\theta_5}[x_i - x_j]\\right)\\right) \\tag{7}$$\n\nindicating that values of y for which values of x are close relative to the period \u03b85 are likely to be\nhighly correlated. 
Gaussian processes thus provide a \ufb02exible approach to prediction, with the kernel\nde\ufb01ning which values of x are likely to have similar values of y.\n\n3.3 Two views of regression\n\nBayesian linear regression and Gaussian processes appear to be quite different approaches. In\nBayesian linear regression, a hypothesis space of functions is identi\ufb01ed, a prior on that space is\nde\ufb01ned, and predictions are formed averaging over the posterior, while Gaussian processes simply\nuse the similarity between different values of x, as expressed through a kernel, to predict correlations\nin values of y. It might thus come as a surprise that these approaches are equivalent.\nShowing that Bayesian linear regression corresponds to Gaussian process prediction is straight-\nforward. The assumption of linearity means that the vector yn+1 is equal to Xn+1b. It follows\nthat p(yn+1|xn+1) is a multivariate Gaussian distribution with mean zero and covariance matrix\n$X_{n+1} \\Sigma_b X_{n+1}^T$. Bayesian linear regression thus corresponds to prediction using Gaussian pro-\ncesses, with this covariance matrix playing the role of Kn+1 above (i.e., using the kernel func-\ntion K(xi, xj) = [1 xi]\u03a3b[1 xj]T ). Using a richer set of basis functions corresponds to taking\n$K_{n+1} = \\Phi_{n+1} \\Sigma_b \\Phi_{n+1}^T$ (i.e., K(xi, xj) = [1 \u03c6(1)(xi) . . . \u03c6(k)(xi)]\u03a3b[1 \u03c6(1)(xj) . . . \u03c6(k)(xj)]T ).\nIt is also possible to show that Gaussian process prediction can always be interpreted as Bayesian\nlinear regression, albeit with potentially in\ufb01nitely many basis functions. Just as we can express\na covariance matrix in terms of its eigenvectors and eigenvalues, we can express a given kernel\nK(xi, xj) in terms of its eigenfunctions \u03c6 and eigenvalues \u03bb, with\n\n$$K(x_i, x_j) = \\sum_{k=1}^{\\infty} \\lambda_k \\phi^{(k)}(x_i) \\phi^{(k)}(x_j) \\tag{8}$$\n\nfor any xi and xj. 
Using the results from the previous paragraph, any kernel can be viewed as the\nresult of performing Bayesian linear regression with a set of basis functions corresponding to its\neigenfunctions, and a prior with covariance matrix \u03a3b = diag(\u03bb).\nThese results establish an important duality between Bayesian linear regression and Gaussian pro-\ncesses: for every prior on functions, there exists a corresponding kernel, and for every kernel, there\nexists a corresponding prior on functions. Bayesian linear regression and prediction with Gaussian\nprocesses are thus just two views of the same solution to regression problems.\n\n4 Combining rules and similarity through Gaussian processes\n\nThe results outlined in the previous section suggest that learning rules and generalizing based on\nsimilarity should not be viewed as con\ufb02icting accounts of human function learning. In this section,\nwe brie\ufb02y highlight how previous accounts of function learning connect to statistical models, and\nthen use this insight to de\ufb01ne a model that combines the strengths of both approaches.\n\n4.1 Reinterpreting previous accounts of human function learning\n\nThe models presented above were chosen because the contrast between rules and similarity in\nfunction learning is analogous to the difference between Bayesian linear regression and Gaussian\nprocesses. The idea that human function learning can be viewed as a kind of statistical regres-\nsion [1, 3] clearly connects directly to Bayesian linear regression. While there is no direct formal\ncorrespondence, the basic ideas behind Gaussian process regression with a radial basis kernel and\nsimilarity-based models such as ALM are closely related. In particular, ALM has many common-\nalities with radial-basis function neural networks, which are directly related to Gaussian processes\n[11]. 
Gaussian processes with radial-basis kernels can thus be viewed as implementing a simple\nkind of similarity-based generalization, predicting similar y values for stimuli with similar x values.\nFinally, the hybrid approach to rule learning taken in [6] is also closely related to Bayesian linear\nregression. The rules represented by the hidden units serve as a basis set that specify a class of\nfunctions, and applying penalized gradient descent on the weights assigned to those basis elements\nserves as an online algorithm for \ufb01nding the function with highest posterior probability [12].\n\n4.2 Mixing functions in a Gaussian process model\n\nThe relationship between Gaussian processes and Bayesian linear regression suggests that we\ncan de\ufb01ne a single model that exploits both similarity and rules in forming predictions.\nIn\nparticular, we can do this by taking a prior that covers a broad class of functions \u2013 including\nthose consistent with a radial basis kernel \u2013 or, equivalently, modeling y as being produced by\na Gaussian process with a kernel corresponding to one of a small number of types. Speci\ufb01-\ncally, we assume that observations are generated by choosing a type of function from the set\n{Positive Linear, Negative Linear, Quadratic, Nonlinear}, where the probabilities of these alterna-\ntives are de\ufb01ned by the vector \u03c0, and then sampling y from a Gaussian process with a kernel corre-\nsponding to the appropriate class of functions. 
The relevant kernels are introduced in the previous\nsections (taking \u201cNonlinear\u201d to correspond to the radial basis kernel), with the \u201cPositive Linear\u201d and\n\u201cNegative Linear\u201d kernels being derived in a similar way to the standard linear kernel but with the\nmean of the prior on b being [0 1] and [1 \u22121] rather than simply zero.\nUsing this Gaussian process model allows a learner to make an inference about the type of function\nfrom which their observations are drawn, as well as the properties of the function of that type. In\npractice, we perform probabilistic inference using a Markov chain Monte Carlo (MCMC) algorithm\n(see [13] for an introduction). This algorithm de\ufb01nes a Markov chain for which the stationary\ndistribution is the distribution from which we wish to sample. In our case, this is the posterior\ndistribution over types and the hyperparameters for the kernels \u03b8 given the observations x and t.\nThe hyperparameters include \u03b81 and \u03b82 de\ufb01ned above and the noise in the observations $\\sigma_t^2$. Our\nMCMC algorithm repeats two steps. The \ufb01rst step is sampling the type of function conditioned on\nx, t, and the current value of \u03b8, with the probability of each type being proportional to the product of\np(tn|xn) for the corresponding Gaussian process and the prior probability of that type as given by \u03c0.\nThe second step is sampling the value of \u03b8 given xn, tn, and the current type, which is done using\na Metropolis-Hastings procedure (see [13]), proposing a value for \u03b8 from a Gaussian distribution\ncentered on the current value and deciding whether to accept that value based on the product of the\nprobability it assigns to tn given xn and the prior p(\u03b8). 
We use an uninformative prior on \u03b8.\n\n5 Testing the Gaussian process model\n\nFollowing a recent review of computational models of function learning [6], we look at two quanti-\ntative tests of Gaussian processes as an account of human function learning: reproducing the order\nof dif\ufb01culty of learning functions of different types, and extrapolation performance. As indicated\nearlier, there is a large literature consisting of both models and data concerning human function\nlearning, and these simulations are intended to demonstrate the potential of the Gaussian process\nmodel rather than to provide an exhaustive test of its performance.\n\n5.1 Dif\ufb01culty of learning\n\nA necessary criterion for a theory of human function learning is accounting for which functions\npeople learn readily and which they \ufb01nd dif\ufb01cult \u2013 the relative dif\ufb01culty of learning various func-\ntions. Table 1 is an augmented version of results presented in [6] which compared several models\nto the empirically observed dif\ufb01culty of learning a range of functions. Each entry in the table is the\nmean absolute deviation (MAD) of human or model responses from the actual value of the function,\nevaluated over the stimuli presented in training. The MAD provides a measure of how dif\ufb01cult it is\nfor people or a given model to learn a function. The data reported for each set of studies are ordered\nby increasing MAD (corresponding to increasing dif\ufb01culty). 
In addition to reproducing the MAD\nfor the models in [6], the table includes results for seven Gaussian process (GP) models.\nThe seven GP models incorporated different kernel functions by adjusting their prior probability.\nDrawing on the {Positive Linear, Negative Linear, Quadratic, Nonlinear} set of kernel functions, the\nmost comprehensive model took \u03c0 = (0.5, 0.4, 0.09, 0.01).2 Six other GP models were examined\nby assigning certain kernel functions zero prior probability and re-normalizing the modi\ufb01ed value\nof \u03c0 so that the prior probabilities summed to one. The seven distinct GP models are presented in\nTable 1 and labeled by the kernel functions with non-zero prior probability: Linear (Positive Linear\nand Negative Linear), Quadratic, Nonlinear (Radial Basis Function), Linear and Quadratic, Linear\nand Nonlinear, Quadratic and Nonlinear, and Linear, Quadratic, and Nonlinear. The last two rows of\nTable 1 give the correlations between human and model performance across functions, expressing\nquantitatively how well each model captured the pattern of human function learning behavior. The\nGP models perform well according to this metric, providing a closer match to the human data than\nany of the models considered in [6], with the quadratic kernel and the models with a mixture of\nkernels tending to provide a closer match to human behavior.\n\n5.2 Extrapolation performance\n\nPredicting and explaining people\u2019s capacity for generalization \u2013 from stimulus-response pairs to\njudgments about a functional relationship between variables \u2013 is the second key component of our\naccount. This capacity is assessed in the way in which people extrapolate, making judgments about\nstimuli they have not encountered before. Figure 1 shows mean human predictions for a linear, expo-\nnential, and quadratic function (from [4]), together with the predictions of the most comprehensive\nGP model (with Linear, Quadratic and Nonlinear kernel functions). 
The regions to the left and right\nof the vertical lines represent extrapolation regions, being input values for which neither people nor\n\n2The selection of these values was guided by results indicating the order of dif\ufb01culty of learning functions\nof these different types for human learners, but we did not optimize \u03c0 with respect to the criteria reported here.\n\n6\n\n\fFunction\nByun (1995, Expt 1B)\n\nLinear\nSquare root\n\nByun (1995, Expt 1A)\n\nLinear\nPower, pos. acc.\nPower, neg. acc.\nLogarithmic\nLogistic\n\nByun (1995, Expt 2)\n\nLinear\nQuadratic\nCyclic\n\nLinear\nExponential\nQuadratic\n\nLinear\nRank-order\n\nDelosh, Busemeyer, & McDaniel (1997)\n\nCorrelation of human and model performance\n\nHuman ALM Poly Fourier Logistic\n\nHybrid models\n\nGaussian process models\n\nLinear Quad RBF\n\nLQ\n\nLR\n\nQR\n\nLQR\n\n.20\n.35\n\n.15\n.20\n.23\n.30\n.39\n\n.04\n.05\n\n.10\n.12\n.12\n.14\n.18\n\n.18\n.28\n.68\n\n.10\n.15\n.24\n\n1.0\n1.0\n\n.01\n.03\n.32\n\n.04\n.05\n.07\n\n.83\n.55\n\n.04\n.06\n\n.33\n.37\n.36\n.41\n.51\n\n.18\n.31\n.41\n\n.11\n.17\n.27\n\n.45\n.51\n\n.05\n.06\n\n.33\n.37\n.36\n.41\n.52\n\n.19\n.31\n.40\n\n.11\n.17\n.27\n\n.45\n.51\n\n.16\n.19\n\n.17\n.24\n.19\n.19\n.33\n\n.12\n.24\n.68\n\n.04\n.02\n.11\n\n.93\n.77\n\n.0002\n.06\n\n.0003\n.11\n.06\n.10\n.20\n\n.0003\n.20\n.50\n\n.0005\n.03\n.1\n\n.93\n.76\n\n.004\n.02\n\n.004\n.004\n.02\n.04\n.20\n\n.005\n.09\n.50\n\n.005\n.01\n.06\n\n.92\n.80\n\n.06\n.05\n\n.04\n.08\n.05\n.07\n.22\n\n.05\n.14\n.50\n\n.03\n.02\n.07\n\n.92\n.75\n\n.0002 .0002\n.02\n.03\n\n.001\n.02\n\n.0001\n.02\n\n.0002 .0002 .0009 .0001\n.003\n.004\n.02\n.02\n.03\n.04\n.20\n.18\n\n.003\n.02\n.03\n.18\n\n.05\n.03\n.05\n.18\n\n.0003 .0002\n.12\n.09\n.50\n.49\n\n.0005 .0003\n.01\n.02\n.06\n.06\n\n.93\n.83\n\n.93\n.83\n\n.001\n.04\n.49\n\n.002\n.009\n.04\n\n.92\n.82\n\n.0002\n.04\n.49\n\n.0004\n.01\n.04\n\n.92\n.83\n\nTable 1: Dif\ufb01culty of learning results. 
Rows correspond to functions learned in experiments re-\nviewed in [6]. Columns give the mean absolute deviation (MAD) from the true functions for human\nlearners and different models (Gaussian process models with multiple kernels are denoted by the\ninitials of their kernels, e.g., LQR = Linear, Quadratic, and Radial Basis Function). Human MAD\nvalues represent sample means (for a single subject over trials, then over subjects), and re\ufb02ect both\nestimation and production errors, being higher than model MAD values which are computed using\ndeterministic model predictions and thus re\ufb02ect only estimation error. The last two rows give the\nlinear and rank-order correlations of the human and model MAD values, providing an indication of\nhow well the model matches the dif\ufb01culty people have in learning different functions.\n\n[Figure 1, panels (a) and (b): Human / Model predictions for the Linear, Exponential, and Quadratic functions; plots not reproduced.]\n\n(c)\nModel   Linear  Exponential  Quadratic\nEXAM    .999    .997         .961\nLinear  .999    .989         .470\nQuad    .997    .997         .901\nRBF     .999    .997         .882\nLQ      .999    .997         .886\nLR      .999    .997         .892\nQR      .998    .994         .878\nLQR     .999    .995         .877\n\nFigure 1: Extrapolation performance. (a)-(b) Mean predictions on linear, exponential, and quadratic\nfunctions for (a) human participants (from [4]) and (b) a Gaussian process model with Linear,\nQuadratic, and Nonlinear kernels. Training data were presented in the region between the verti-\ncal lines, and extrapolation performance was evaluated outside this region. (c) Correlations between\nhuman and model extrapolation. Gaussian process models are denoted as in Table 1.\n\nthe model were trained. 
Both people and the model extrapolate near optimally on the linear func-\ntion, and reasonably accurate extrapolation also occurs for the exponential and quadratic function.\nHowever, there is a bias towards a linear slope in the extrapolation of the exponential and quadratic\nfunctions, with extreme values of the quadratic and exponential function being overestimated.\nQuantitative measures of extrapolation performance are shown in Figure 1 (c), which gives the\ncorrelation between human and model predictions for EXAM [4, 5] and the seven GP models. While\nnone of the GP models produce quite as high a correlation as EXAM on all three functions, all of\nthe models except that with just the linear kernel produce respectable correlations. It is particularly\nnotable that this performance is achieved without the optimization of any free parameters, while the\npredictions of EXAM were the result of optimizing two parameters for each of the three functions.\n\n6 Conclusions\n\nWe have presented a rational account of human function learning, drawing on ideas from machine\nlearning and statistics to show that the two approaches that have dominated previous work \u2013 rules and\nsimilarity \u2013 can be interpreted as two views of the same kind of optimal solution to this problem. Our\nGaussian process model combines the strengths of both approaches, using a mixture of kernels to\nallow systematic extrapolation as well as sensitive non-linear interpolation. Tests of the performance\nof this model on benchmark datasets show that it can capture some of the basic phenomena of human\nfunction learning, and is competitive with existing process models. In future work, we aim to extend\nthis Gaussian process model to allow it to produce some of the more complex phenomena of human\nfunction learning, such as non-monotonic extrapolation (via periodic kernels) and learning different\nfunctions in different parts of the input space (via mixture modeling).\nAcknowledgments. 
This work was supported by grant FA9550-07-1-0351 from the Air Force Of\ufb01ce of Scien-\nti\ufb01c Research and grants 0704034 and 0544705 from the National Science Foundation.\n\nReferences\n\n[1] J. D. Carroll. Functional learning: The learning of continuous functional mappings relating stimulus and\nresponse continua. Educational Testing Service, Princeton, NJ, 1963.\n\n[2] B. Brehmer. Hypotheses about relations between scaled variables in the learning of probabilistic inference\ntasks. Organizational Behavior and Human Decision Processes, 11:1\u201327, 1974.\n\n[3] K. Koh and D. E. Meyer. Function learning: Induction of continuous stimulus-response relations. Journal\nof Experimental Psychology: Learning, Memory, and Cognition, 17:811\u2013836, 1991.\n\n[4] E. L. DeLosh, J. R. Busemeyer, and M. A. McDaniel. Extrapolation: The sine qua non of abstraction in\nfunction learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23:968\u2013986,\n1997.\n\n[5] J. R. Busemeyer, E. Byun, E. L. DeLosh, and M. A. McDaniel. Learning functional relations based\non experience with input-output pairs by humans and arti\ufb01cial neural networks. In K. Lamberts and\nD. Shanks, editors, Concepts and Categories, pages 405\u2013437. MIT Press, Cambridge, 1997.\n\n[6] M. A. McDaniel and J. R. Busemeyer. The conceptual basis of function learning and extrapolation:\nComparison of rule-based and associative-based models. Psychonomic Bulletin and Review, 12:24\u201342,\n2005.\n\n[7] M. Kalish, S. Lewandowsky, and J. Kruschke. Population of linear experts: Knowledge partitioning and\nfunction learning. Psychological Review, 111:1072\u20131099, 2004.\n\n[8] J. R. Anderson. The adaptive character of thought. Erlbaum, Hillsdale, NJ, 1990.\n\n[9] J. M. Bernardo and A. F. M. Smith. Bayesian theory. Wiley, New York, 1994.\n\n[10] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and\nbeyond. In M. I. 
Jordan, editor, Learning in Graphical Models, pages 599\u2013621. MIT Press, Cambridge,\nMA, 1998.\n\n[11] R. M. Neal. Priors for in\ufb01nite networks. Technical Report CRG-TR-94-1, Department of Computer\nScience, University of Toronto, 1994.\n\n[12] D. J. C. MacKay. Probable networks and plausible predictions - a review of practical Bayesian methods for\nsupervised neural networks. Network: Computation in Neural Systems, 6:469\u2013505, 1995.\n\n[13] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice.\nChapman and Hall, Suffolk, UK, 1996.", "award": [], "sourceid": 713, "authors": [{"given_name": "Thomas", "family_name": "Griffiths", "institution": null}, {"given_name": "Chris", "family_name": "Lucas", "institution": null}, {"given_name": "Joseph", "family_name": "Williams", "institution": null}, {"given_name": "Michael", "family_name": "Kalish", "institution": null}]}