{"title": "Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1025, "page_last": 1032, "abstract": null, "full_text": " Log-concavity results on Gaussian process\n methods for supervised and unsupervised\n learning\n\n\n\n Liam Paninski\n Gatsby Computational Neuroscience Unit\n University College London\n liam@gatsby.ucl.ac.uk\n http://www.gatsby.ucl.ac.uk/liam\n\n Abstract\n\n\n Log-concavity is an important property in the context of optimization,\n Laplace approximation, and sampling; Bayesian methods based on Gaus-\n sian process priors have become quite popular recently for classification,\n regression, density estimation, and point process intensity estimation.\n Here we prove that the predictive densities corresponding to each of these\n applications are log-concave, given any observed data. We also prove\n that the likelihood is log-concave in the hyperparameters controlling the\n mean function of the Gaussian prior in the density and point process in-\n tensity estimation cases, and the mean, covariance, and observation noise\n parameters in the classification and regression cases; this result leads to\n a useful parameterization of these hyperparameters, indicating a suitably\n large class of priors for which the corresponding maximum a posteriori\n problem is log-concave.\n\n\n\n\n\nIntroduction\n\n\nBayesian methods based on Gaussian process priors have recently become quite popular\nfor machine learning tasks (1). 
These techniques have enjoyed a good deal of theoretical examination, documenting their learning-theoretic (generalization) properties (2), and developing a variety of efficient computational schemes (e.g., (3-5), and references therein). We contribute to this theoretical literature here by presenting results on the log-concavity of the predictive densities and likelihood associated with several of these methods, specifically techniques for classification, regression, density estimation, and point process intensity estimation. These results, in turn, imply that it is relatively easy to tune the hyperparameters for, approximate the posterior distributions of, and sample from these models.

Our results are based on methods which we believe will be applicable more widely in machine learning contexts, and so we give all necessary details of the (fairly straightforward) proof techniques used here.

Log-concavity background

We begin by discussing the log-concavity property: its uses, some examples of log-concave (l.c.) functions, and the key theorem on which our results are based. Log-concavity is perhaps most important in a maximization context: given a real function f of some vector parameter θ, if g(f(θ)) is concave for some invertible function g, and the parameters live in some convex set, then f is unimodal, with no non-global local maxima. (Note that in this case a global maximum, if one exists, is not necessarily unique, but maximizers of f do form a convex set, and hence maxima are essentially unique in a sense.) Thus ascent procedures for maximization can be applied without fear of being trapped in local maxima; this is extremely useful when the space to be optimized over is high-dimensional.
This logic clearly holds for any arbitrary rescaling g; of course, we are specifically interested in g(t) = log t, since logarithms are useful in the context of taking products (in a probabilistic context, read conditional independence): log-concavity is preserved under multiplication, since the logarithm converts multiplication into addition and concavity is preserved under addition.

Log-concavity is also useful in the context of Laplace (central limit theorem-type) approximations (3), in which the logarithm of a function (typically a probability density or likelihood function) is approximated via a second-order (quadratic) expansion about its maximum or mean (6); this log-quadratic approximation is a reasonable approach for functions whose logs are known to be concave.

Finally, l.c. distributions are in general easier to sample from than arbitrary distributions, as discussed in the context of adaptive rejection and slice sampling (7, 8) and the random-walk-based samplers analyzed in (9).

We should note that log-concavity is not a generic property: l.c. probability densities necessarily have exponential tails (ruling out power law tails, and more generally distributions with any infinite moments). Log-concavity also induces a certain degree of smoothness; for example, l.c. densities must be continuous on the interior of their support. See, e.g., (9) for more detailed information on the various special properties implied by log-concavity.

A few simple examples of l.c. functions are as follows: the Gaussian density in any dimension; the indicator of any convex set (e.g., the uniform density over any convex, compact set); the exponential density; the linear half-rectifier. More interesting well-known examples include the determinant of a matrix, or the inverse partition function of an energy-based probabilistic model (e.g., an exponential family), Z^{-1}(θ) = (∫ e^{f(x,θ)} dx)^{-1}, l.c. in θ whenever f(x, θ) is convex in θ for all x.
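These example claims are easy to sanity-check numerically. The sketch below (our own illustration, not part of the paper) tests discrete concavity of log f via second differences for a Gaussian, a Laplacian (exponential-tailed) density, and a sum of two widely-separated Gaussians; the first two pass, while the mixture fails, anticipating the remark that sums of l.c. functions need not be l.c.

```python
import numpy as np

def is_log_concave(f, xs):
    """Check discrete concavity of log f on the grid xs via second differences."""
    logf = np.log(f(xs))
    d2 = np.diff(logf, 2)  # second differences; all <= 0 iff log f is concave on the grid
    return bool(np.all(d2 <= 1e-9))

xs = np.linspace(-5, 5, 2001)
gauss = lambda x: np.exp(-x**2 / 2)               # Gaussian: l.c.
laplace = lambda x: np.exp(-np.abs(x))            # Laplacian: l.c. (exponential tails)
mix = lambda x: np.exp(-(x - 3)**2) + np.exp(-(x + 3)**2)  # sum of l.c. terms: not l.c.

assert is_log_concave(gauss, xs)
assert is_log_concave(laplace, xs)
assert not is_log_concave(mix, xs)
```

The mixture fails because log f has a convex dip between the two modes, so its second difference turns positive near the midpoint.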
Finally, log-concavity is preserved under taking products (as noted above), affine translations of the domain, and/or pointwise limits, since concavity is preserved under addition, affine translations, and pointwise limits, respectively.

Sums of l.c. functions are not necessarily l.c., as is easily shown (e.g., a mixture of Gaussians with widely-separated means, or the indicator of the union of disjoint convex sets). However, a key theorem (10, 11) gives:

Theorem (Integrating out preserves log-concavity). If f(x, y) is jointly l.c. in (x, y), for x and y finite dimensional, then

    f0(x) ≡ ∫ f(x, y) dy

is l.c. in x.

Think of y as a latent variable or hyperparameter we want to marginalize over. This very useful fact has seen applications in various branches of statistics and operations research, but does not seem well-known in the machine learning community. The theorem implies, for example, that convolutions of l.c. functions are l.c.; thus the random vectors with l.c. densities form a vector space. Moreover, indefinite integrals of l.c. functions are l.c.; hence the error function, and more generally the cumulative distribution function of any l.c. density, is l.c., which is useful in the setting of generalized linear models (12) for classification. Finally, the mass under a l.c. probability measure of a convex set which is translated in a convex manner is itself a l.c. function of the convex translation parameter (11).

Gaussian process methods background

We now give a brief review of Gaussian process methods. Our goals are modest; we will do little more than define notation. See, e.g., (1) and references for further details. Gaussian process methods are based on a Bayesian "latent variable" approach: dependencies between the observed input and output data {ti} and {yi} are modeled as arising through a hidden (unobserved) Gaussian process G(t).
Recall that a Gaussian process is a stochastic process whose finite-dimensional projections are all multivariate Gaussian, with means and covariances defined consistently for all possible projections, and is therefore specified by its mean μ(t) and covariance function C(t1, t2).

The applications we will consider may be divided into two settings: "supervised" and "unsupervised" problems. We discuss the somewhat simpler unsupervised case first (however, it should be noted that the supervised cases have received significantly more attention in the machine learning literature to date, and might be considered of more importance to this community).

Density estimation: We are given unordered data {ti}; the setup is valid for any sample space, but assume ti ∈ R^d, d < ∞, for concreteness. We model the data as i.i.d. samples from an unknown distribution p. The prior over these unknown distributions, in turn, is modeled as a conditioned Gaussian process, p ~ G(t): p is drawn from a Gaussian process G(t) of mean μ(t) and covariance C (to ensure that the resulting random measures are well-defined, we will assume throughout that G is moderately well-behaved; almost-sure local Lebesgue integrability is sufficient), conditioned to be nonnegative and to integrate to one over some arbitrarily large compact set (the latter by an obvious limiting argument, to prevent conditioning on a set of measure zero; the introduction of the compact set is to avoid problems of the sort encountered when trying to define uniform probability measures on unbounded spaces) with respect to some natural base measure on the sample space (e.g., Lebesgue measure in R^d) (13).
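As a concrete (and deliberately crude) illustration of this prior, the sketch below discretizes the construction on a grid: it draws G from a Gaussian with constant mean and squared-exponential covariance (both our own illustrative choices; the text does not fix them), imposes nonnegativity by rejection rather than by conditioning, and renormalizes by a Riemann sum. This is only a finite-dimensional caricature of the limiting construction described above.

```python
import numpy as np

rng = np.random.default_rng(0)
ts = np.linspace(0.0, 1.0, 50)
dt = ts[1] - ts[0]

# GP prior: constant mean mu(t) = 2 and squared-exponential covariance
# (illustrative choices; any suitable mean/covariance would do)
mu = 2.0 * np.ones_like(ts)
C = np.exp(-(ts[:, None] - ts[None, :]) ** 2 / (2 * 0.1 ** 2))
L = np.linalg.cholesky(C + 1e-6 * np.eye(len(ts)))  # jitter for numerical stability

def draw_random_density(max_tries=1000):
    """Rejection stand-in for conditioning: draw G ~ N(mu, C) on the grid,
    keep only draws that are nonnegative everywhere, then renormalize."""
    for _ in range(max_tries):
        G = mu + L @ rng.standard_normal(len(ts))
        if np.all(G >= 0):
            return G / (G.sum() * dt)  # enforce unit integral (Riemann sum)
    raise RuntimeError("no nonnegative draw found")

p = draw_random_density()
assert np.all(p >= 0)
assert abs(p.sum() * dt - 1.0) < 1e-10
```

With the mean set well above zero, most draws are already nonnegative, so rejection is cheap here; the paper's conditioning argument is what makes the construction sensible in the continuum limit.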
It is worth emphasizing that this setup differs somewhat from some earlier proposals (5, 14, 15), which postulated that nonnegativity be enforced by, e.g., modeling log p or √p as Gaussian, instead of the Gaussian p here; each approach has its own advantages, and it is unclear at the moment whether our results can be extended to this context (as will be clear below, the roadblock is in the normalization constraint, which is transformed nonlinearly along with the density in the nonlinear warping setup).

Point process intensity estimation: A nearly identical setup can be used if we assume the data {ti} represent a sample from a Poisson process with an unknown underlying intensity function (16-18); the random density above is simply replaced by the random intensity function here (this type of model is known as a Cox, or doubly-stochastic, process in the point-process literature). The only difference is that intensity functions are not required to be normalized, so we need only condition the Gaussian process G(t) from which we draw the intensity functions to be nonnegative. It turns out we will be free to use any l.c. and convex warping of the range space of the Gaussian process G(t) to enforce positivity; suitable warpings include exponentiation (corresponding to modeling the logarithm of the intensity as Gaussian (17)) or linear half-rectification.

The supervised cases require a few extra ingredients. We are given paired data, inputs {ti} with corresponding outputs {yi}. We model the outputs as noise-corrupted observations from the Gaussian process G(t) at the points {ti}; denote the additional hidden "observation" noise process as {n(ti)}.
This noise process is not always taken to be Gaussian; for computational reasons, {n(ti)} is typically assumed i.i.d., and also independent of G(t), but both of these assumptions will be unnecessary for the results stated below.

Regression: We assume y(ti) = G(ti) + σ_i n(ti); in words, draw G(t) from a Gaussian process of mean μ(t) and covariance C; the outputs are then obtained by sampling this function G(t) at ti and adding noise n(ti) of scale σ_i.

Classification: y(ti) = 1(G(ti) + σ_i n(ti) > 0), where 1(·) denotes the indicator function of an event. This case is as in the regression model, except we only observe a binary-thresholded version of the real output.

Results

Our first result concerns the predictive densities associated with the above models: the posterior density of any continuous linear functional of G(t), given observed data D = {ti} and/or {yi}, under the Gaussian process prior for G(t). The simplest and most important case of such a linear projection is the projection onto a finite collection of coordinates, {tpred}, say; in this special case, the predictive density is the posterior density p({G(tpred)}|D). It turns out that all we need to assume is the log-concavity of the distribution p(G, n); this is clearly more general than what is needed for the strictly Gaussian cases considered above (for example, Laplacian priors on G are permitted, which could lead to more robust performance). Also note that dependence between G and n is allowed; this permits, for example, coupling of the effective scales of the observation noise ni = n(ti) for nearby points ti. Additionally, we allow nonstationarity and anisotropic correlations in G. The result applies for any of the applications discussed above.

Proposition 1 (Predictive density). Given a l.c. prior on (G, n), the predictive density is always l.c., for any data D.

In other words, conditioning on data preserves these l.c. processes (where an l.c.
process, like a Gaussian process, is defined by the log-concavity of its finite-dimensional projections). This represents a significant generalization of the obvious fact that in the regression setup under Gaussian noise, conditioning preserves Gaussian processes.

Our second result applies to the likelihood of the hyperparameters corresponding to the above applications: the mean function μ, the covariance function C, and the observation noise scales {σ_i}. We first state the main result in some generality, then provide some useful examples and interpretation below. For each j > 0, let A_{j,θ} denote a family of linear maps from some finite-dimensional vector space G_j to R^{N dG}, where dG = dim(G(ti)) and N is the number of observed data points. Our main assumptions are as follows: first, assume A_{j,θ}^{-1} may be written A_{j,θ}^{-1} = Σ_k θ_k K_{j,k}, where {K_{j,k}} is a fixed set of matrices and the inverse is defined as a map from range(A_{j,θ}) to G_j/ker(A_{j,θ}). Second, assume that dim(A_{j,θ}^{-1}(V)) is constant in θ for any set V. Finally, equip the (doubly) latent space G_j × R^{N dG} = {(G_L, n)} with a translation family of l.c. measures p_{j,μ_L}(G_L, n) indexed by the mean parameter μ_L, i.e., p_{j,μ_L}(G_L, n) = p_j((G_L, n) − μ_L), for some fixed measure p_j(·). Then, if the sequence p_j(G, n) induced by p_j and A_{j,θ} converges pointwise to the joint density p(G, n), we have:

Proposition 2 (Likelihood). In the supervised cases, the likelihood is jointly l.c. in the latent mean function, covariance parameters, and inverse noise scales (μ_L, θ, {σ_i^{-1}}), for all data D. In the unsupervised cases, the likelihood is l.c.
in the mean function μ.

Note that the mean function μ(t) is induced in a natural way by μ_L and A_{j,θ}, and that we allow the noise scale parameters {σ_i} to vary independently, increasing the robustness of the supervised methods (19) (since outliers can be "explained," without large perturbations of the underlying predictive distributions of G(t), by simply increasing the corresponding noise scale σ_i). Of course, in practice, it is likely that to avoid overfitting one would want to reduce the effective number of free parameters by representing μ(t) and θ in finite-dimensional spaces, and restricting the freedom of the inverse noise scales {σ_i}. The log-concavity in the mean function μ(t) demonstrated here is perhaps most useful in the point process setting, where μ(t) can model the effect of excitatory or inhibitory inputs on the intensity function, with spatially- or temporally-varying patterns of excitation, and/or self-excitatory interactions between observation sites ti (by letting μ(t) depend on the observed points ti (16, 20)).

In the special case that the l.c. prior measure p_j is taken to be Gaussian with covariance C0, the main assumption here is effectively on the parameterization of the covariance C; ignoring the (technical) limiting operation in j for the moment, we are assuming roughly that there exists a single basis in which, for all allowed θ, the covariance may be written C = A_θ C0 A_θ^t, where A_θ is of the special form described above.

We may simplify further by assuming that C0 is white and stationary.
One important example of a suitable two-parameter family of covariance kernels satisfying the conditions of Proposition 2 is provided by the Ornstein-Uhlenbeck kernels (which correspond to exponentially-filtered one-dimensional white noise):

    C(t1, t2) = σ^2 e^{-2|t1-t2|/τ}.

For this kernel, one can parameterize C = A_θ A_θ^t, with A_θ^{-1} = θ_1 I − θ_2 D, where I and D denote the identity and differential operators, respectively, and θ_k > 0 to ensure that C is positive-definite. (To derive this reparameterization, note that C(|t1 − t2|) solves (I − aD^2)C(|t1 − t2|) = bδ(t), for suitable constants a, b.) Thus Proposition 2 generalizes a recent neuroscientific result: the likelihood for a certain neural model (the leaky integrate-and-fire model driven by Gaussian noise, for which the corresponding covariance is Ornstein-Uhlenbeck) is l.c. (21, 22) (of course, in this case the model was motivated by biophysical instead of learning-theoretic concerns).

In addition, multidimensional generalizations of this family are straightforward: corresponding kernels solve the Helmholtz problem,

    (I − aΔ)C(t) = bδ(t),

with Δ denoting the Laplacian. Solutions to this problem are well-known: in the isotropic case, we obtain a family of radial Bessel functions, with a, b again setting the overall magnitude and correlation scale of C(t1, t2) = C(||t1 − t2||_2). Generalizing in a different direction, we could let A_θ include higher-order differential terms, A_θ^{-1} = Σ_{k≥0} θ_k D^k; the resulting covariance kernels correspond to higher-order autoregression process priors.

An even broader class of kernel parameterizations may be developed in the spectral domain: still assuming stationary white noise inputs, we may diagonalize C in the Fourier basis, that is, C(ω) = O^t P(ω) O, with O the (unitary) Fourier transform operator and P(ω) the power spectral density.
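The ODE behind the Ornstein-Uhlenbeck reparameterization above can be checked numerically: with a = τ²/4, the kernel C(t) = σ²e^{-2|t|/τ} satisfies (I − aD²)C = 0 away from the δ at the origin. The sketch below (constants are arbitrary choices of ours) verifies this by finite differences.

```python
import numpy as np

sigma, tau = 1.3, 0.7
a = tau**2 / 4  # since C'' = (4/tau^2) C for t != 0

ts = np.linspace(0.05, 3.0, 500)   # stay away from the kink at t = 0
h = ts[1] - ts[0]
C = sigma**2 * np.exp(-2 * ts / tau)

# second derivative by central differences on interior grid points
C2 = (C[2:] - 2 * C[1:-1] + C[:-2]) / h**2
residual = C[1:-1] - a * C2        # (I - a D^2) C, which should vanish off the origin

assert np.max(np.abs(residual)) < 1e-3 * C.max()
```

The residual is limited only by the O(h²) truncation error of the finite-difference stencil, confirming the stated relation between (a, b) and (σ, τ) off the origin.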
Thus, comparing to the conditions above, if the spectral density may be written as P(ω)^{-1} = |Σ_k θ_k h_k(ω)|^2 (where |·| denotes complex magnitude), for θ_k > 0 and functions h_k(ω) such that sign(real(h_k(ω))) is constant in k for any ω, then the likelihood will be l.c. in θ; A_θ here may be taken as the multiplication operator O^t (Σ_k θ_k h_k(ω))^{-1} O. Remember that the smoothness of the sample paths of G(t) depends on the rate of decay of the spectral density (1, 23); thus we may obtain smoother (or rougher) kernel families by choosing Σ_k θ_k h_k(ω) as more rapidly- (or slowly-)increasing.

Proofs

Predictive density. This proof is a straightforward application of the Prekopa theorem (10). Write the predictive distributions as

    p({L_k G}|D) = K(D) ∫ p({L_k G}, {G(ti), n(ti)}) p({yi, ti}|{L_k G}, {G(ti), n(ti)}),

where {L_k} is a finite set of continuous linear functionals of G, K(D) is a constant that depends only on the data, the integral is over all {G(ti), n(ti)}, and {ni, yi} is ignored in the unsupervised case. Now we need only prove that the multiplicands on the right hand side above are l.c. The log-concavity of the left term is assumed; the right term, in turn, can be rewritten as

    p({yi, ti}|{L_k G}, {G(ti), n(ti)}) = p({yi, ti}|{G(ti), n(ti)}),

by the Markovian nature of the models. We prove the log-concavity of the right individually for each of our applications.

In the supervised cases, {ti} is given and so we only need to look at p({yi}|{G(ti), n(ti)}). In the classification case, this is simply an indicator of the set

    ∩_i { G(ti) + σ_i n_i ≤ 0 if yi = 0; G(ti) + σ_i n_i > 0 if yi = 1 },

which is jointly convex in {G(ti), n(ti)}, completing the proof in this case.

The regression case is proven in a similar fashion: write p({yi}|{G(ti), n(ti)}) as the limit as ε → 0 of the indicator of the convex set

    ∩_i { |G(ti) + σ_i n_i − yi| < ε },

then use the fact that pointwise limits preserve log-concavity. (The predictive distributions of {y(t)} will also be l.c.
here, by a nearly identical argument.)

In the density estimation case, the term

    p({ti}|{G(ti)}) = Π_i G(ti)

is obviously l.c. in {G(ti)}. However, recall that we perturbed the distribution of G(t) in this case as well, by conditioning G(t) to be positive and normalized. The fact that p({L_k G}, {G(ti)}) is l.c. follows upon writing this term as a marginalization of densities which are products of l.c. densities with indicators of convex sets (enforcing the linear normalization and positivity constraints).

Finally, for the point process intensity case, write the likelihood term, as usual,

    p({ti}|{G(ti)}) = e^{-∫ f(G(t)) dt} Π_i f(G(ti)),

where f is the scalar warping function that takes the original Gaussian function G(t) into the space of intensity functions. This term is clearly l.c. whenever f(s) is both convex and l.c. in s; for more details on this class of functions, see e.g. (20).

Likelihood. We begin with the unsupervised cases. In the density estimation case, write the likelihood as

    L(μ) = ∫ dp_μ(G) 1_C({G(t)}) Π_i G(ti),

with p_μ(G) the probability of G under μ. Here 1_C is the (l.c.) indicator function of the convex set enforcing the linear constraints (positivity and normalization) on G. All three terms in the integrand on the right are clearly jointly l.c. in (G, μ). In the point process case,

    L(μ) = ∫ dp_μ(G) e^{-∫ f(G(t)) dt} Π_i f(G(ti));

the joint log-concavity of the three multiplicands on the right is again easily demonstrated.

The Prekopa theorem cannot be directly applied here, since the functions 1_C(·)
and e^{-∫ f(·)} depend in an infinite-dimensional way on G and μ; however, we can apply the Prekopa theorem to any finite-dimensional approximation of these functions (e.g., by approximating the normalization condition and exponential integral by Riemann sums and the positivity condition at a finite number of points), then obtain the theorem in the limit as the approximation becomes infinitely fine, using the fact that pointwise limits preserve log-concavity.

For the supervised cases, write

    L(μ_L, θ, {σ_i^{-1}}) = lim_j ∫ dp_j(G_L, n) 1( A_{j,θ}(G_L + G_{μ_L}) + σ.(n + n_{μ_L}) ∈ V )

                          = lim_j ∫ dp_j(G_L, n) 1( (G_L, n) ∈ (Σ_k θ_k K_{j,k} V, σ^{-1}.V) + μ_L ),

with V an appropriate convex constraint set (or limit thereof) defined by the observed data {yi}, G_{μ_L} and n_{μ_L} the projections of μ_L into G_j and R^{N dG}, respectively, and "." denoting pointwise operations on vectors. The result now follows immediately from Rinott's theorem on convex translations of sets under l.c. probability measures (11, 22).

Again, we have not assumed anything more about p(G_L, n) than log-concavity; as before, this allows dependence of G and n, anisotropic correlations, etc. It is worth noting, though, that the above result is somewhat stronger in the supervised case than the unsupervised; the proof of log-concavity in the covariance parameters does not seem to generalize easily to the unsupervised setup (briefly, because log(Σ_k θ_k y_k) is not jointly concave in (θ_k, y_k) for all (θ_k, y_k) with Σ_k θ_k y_k > 0, precluding a direct application of the Prekopa or Rinott theorems in the unsupervised case). Extensions to ensure that the unsupervised likelihood is l.c. in θ are possible, but require further restrictions on the form of p(G|θ) and will not be pursued here.

Discussion

We have provided some useful results on the log-concavity of the predictive densities and likelihoods associated with several common Gaussian process methods for machine learning.
In particular, our results preclude the existence of non-global local maxima in these\nfunctions, for any observed data; moreover, Laplace approximations of these functions will\nnot, in general, be disastrous, and efficient sampling methods are available.\n\nPerhaps the main practical implication of our results stems from our proposition on the\nlikelihood; we recommend a certain simple way to obtain parameterized families of ker-\nnels which respect this log-concavity property. Kernel families which may be obtained\nin this manner can range from extremely smooth to singular, and may model anisotropies\n\n\f\nflexibly. Finally, these results indicate useful classes of constraints (or more generally, reg-\nularizing priors) on the hyperparameters; any prior which is l.c. (or any constraint set which\nis convex) in the parameterization discussed here will lead to l.c. a posteriori problems.\n\nMore generally, we have introduced some straightforward applications of a useful and in-\nteresting theorem. We expect that further applications in machine learning (e.g., in latent\nvariable models, marginalization of hyperparameters, etc.) will be easy to find.\n\nAcknowledgements: We thank Z. Ghahramani and C. Williams for many helpful conver-\nsations. LP is supported by an International Research Fellowship from the Royal Society.\n\n\nReferences\n\n 1. M. Seeger, International Journal of Neural Systems 14, 1 (2004).\n\n 2. P. Sollich, A. Halees, Neural Computation 14, 1393 (2002).\n\n 3. C. Williams, D. Barber, IEEE PAMI 20, 1342 (1998).\n\n 4. M. Gibbs, D. MacKay, IEEE Transactions on Neural Networks 11, 1458 (2000).\n\n 5. L. Csato, Gaussian processes - iterative sparse approximations, Ph.D. thesis, Aston U.\n (2002).\n\n 6. T. Minka, A family of algorithms for approximate bayesian inference, Ph.D. thesis,\n MIT (2001).\n\n 7. W. Gilks, P. Wild, Applied Statistics 41, 337 (1992).\n\n 8. R. Neal, Annals of Statistics 31, 705 (2003).\n\n 9. L. Lovasz, S. 
Vempala, The geometry of logconcave functions and an O(n^3) sampling algorithm, Tech. Rep. 2003-04, Microsoft Research (2003).

10. A. Prekopa, Acta Sci. Math. 34, 335 (1973).

11. Y. Rinott, Annals of Probability 4, 1020 (1976).

12. P. McCullagh, J. Nelder, Generalized Linear Models (Chapman and Hall, London, 1989).

13. J. Oakley, A. O'Hagan, Biometrika under review (2003).

14. I. Good, R. Gaskins, Biometrika 58, 255 (1971).

15. W. Bialek, C. Callan, S. Strong, Physical Review Letters 77, 4693 (1996).

16. D. Snyder, M. Miller, Random Point Processes in Time and Space (Springer-Verlag, 1991).

17. J. Moller, A. Syversveen, R. Waagepetersen, Scandinavian Journal of Statistics 25, 451 (1998).

18. I. DiMatteo, C. Genovese, R. Kass, Biometrika 88, 1055 (2001).

19. R. Neal, Monte Carlo implementation of Gaussian process models for Bayesian regression and classification, Tech. Rep. 9702, University of Toronto (1997).

20. L. Paninski, Network: Computation in Neural Systems 15, 243 (2004).

21. J. Pillow, L. Paninski, E. Simoncelli, NIPS 17 (2003).

22. L. Paninski, J. Pillow, E. Simoncelli, Neural Computation 16, 2533 (2004).

23. H. Dym, H. McKean, Fourier Series and Integrals (Academic Press, New York, 1972).
", "award": [], "sourceid": 2590, "authors": [{"given_name": "Liam", "family_name": "Paninski", "institution": null}]}