{"title": "Construction of Nonparametric Bayesian Models from Parametric Bayes Equations", "book": "Advances in Neural Information Processing Systems", "page_first": 1392, "page_last": 1400, "abstract": "We consider the general problem of constructing nonparametric Bayesian models on infinite-dimensional random objects, such as functions, infinite graphs or infinite permutations. The problem has generated much interest in machine learning, where it is treated heuristically, but has not been studied in full generality in nonparametric Bayesian statistics, which tends to focus on models over probability distributions. Our approach applies a standard tool of stochastic process theory, the construction of stochastic processes from their finite-dimensional marginal distributions. The main contribution of the paper is a generalization of the classic Kolmogorov extension theorem to conditional probabilities. This extension allows a rigorous construction of nonparametric Bayesian models from systems of finite-dimensional, parametric Bayes equations. Using this approach, we show (i) how existence of a conjugate posterior for the nonparametric model can be guaranteed by choosing conjugate finite-dimensional models in the construction, (ii) how the mapping to the posterior parameters of the nonparametric model can be explicitly determined, and (iii) that the construction of conjugate models in essence requires the finite-dimensional models to be in the exponential family. 
As an application of our constructive framework, we derive a model on infinite permutations, the nonparametric Bayesian analogue of a model recently proposed for the analysis of rank data.", "full_text": "Construction of Nonparametric Bayesian Models from Parametric Bayes Equations

Peter Orbanz
University of Cambridge and ETH Zurich
p.orbanz@eng.cam.ac.uk

Abstract

We consider the general problem of constructing nonparametric Bayesian models on infinite-dimensional random objects, such as functions, infinite graphs or infinite permutations. The problem has generated much interest in machine learning, where it is treated heuristically, but has not been studied in full generality in nonparametric Bayesian statistics, which tends to focus on models over probability distributions. Our approach applies a standard tool of stochastic process theory, the construction of stochastic processes from their finite-dimensional marginal distributions. The main contribution of the paper is a generalization of the classic Kolmogorov extension theorem to conditional probabilities. This extension allows a rigorous construction of nonparametric Bayesian models from systems of finite-dimensional, parametric Bayes equations. Using this approach, we show (i) how existence of a conjugate posterior for the nonparametric model can be guaranteed by choosing conjugate finite-dimensional models in the construction, (ii) how the mapping to the posterior parameters of the nonparametric model can be explicitly determined, and (iii) that the construction of conjugate models in essence requires the finite-dimensional models to be in the exponential family.
As an application of our constructive framework, we derive a model on infinite permutations, the nonparametric Bayesian analogue of a model recently proposed for the analysis of rank data.

1 Introduction

Nonparametric Bayesian models are now widely used in machine learning. Common models, in particular the Gaussian process (GP) and the Dirichlet process (DP), were originally imported from statistics, but the nonparametric Bayesian idea has since been adapted to the needs of machine learning. As a result, the scope of Bayesian nonparametrics has expanded significantly: Whereas traditional nonparametric Bayesian statistics mostly focuses on models on probability distributions, machine learning researchers are interested in a variety of infinite-dimensional objects, such as functions, kernels, or infinite graphs. Initially, existing DP and GP approaches were modified and combined to derive new models, including the Infinite Hidden Markov Model [2] or the Hierarchical Dirichlet Process [15]. More recently, novel stochastic process models have been defined from scratch, such as the Indian Buffet Process (IBP) [8] and the Mondrian Process [13]. This paper studies the construction of new nonparametric Bayesian models from finite-dimensional distributions: To construct a model on a given type of infinite-dimensional object (for example, an infinite graph), we start out from available probability models on the finite-dimensional counterparts (probability models on finite graphs), and translate them into a model on infinite-dimensional objects using methods of stochastic process theory.
We then ask whether interesting statistical properties of the finite-dimensional models used in the constructions, such as conjugacy of priors and posteriors, carry over to the stochastic process model.

In general, the term nonparametric Bayesian model refers to a Bayesian model on an infinite-dimensional parameter space. Unlike parametric models, for which the number of parameters is bounded by a constant regardless of sample size, nonparametric models allow the number of parameters to grow with the number of observations. To accommodate a variable and asymptotically unbounded number of parameters within a single parameter space, the dimension of the space has to be infinite, and nonparametric models can be defined as statistical models with infinite-dimensional parameter spaces [17]. For a given sample of finite size, the model will typically select a finite subset of the available parameters to explain the observations. A Bayesian nonparametric model places a prior distribution on the infinite-dimensional parameter space.

Many nonparametric Bayesian models are defined in terms of their finite-dimensional marginals: For example, the Gaussian process and Dirichlet process are characterized by the fact that their finite-dimensional marginals are, respectively, Gaussian and Dirichlet distributions [11, 5]. The probability-theoretic construction result underlying such definitions is the Kolmogorov extension theorem [1], described in Sec. 2 below. In stochastic process theory, the theorem is used to study the properties of a process in terms of its marginals, and hence by studying the properties of finite-dimensional distributions. Can the statistical properties of a nonparametric Bayesian model, i.e. of a parameterized family of distributions, be treated in a similar manner, by considering the model's marginals?
For example, can a nonparametric Bayesian model be guaranteed to be conjugate if the marginals used in its construction are conjugate? Techniques such as the Kolmogorov theorem construct individual distributions, whereas statistical properties are properties of parameterized families of distributions. In Bayesian estimation, such families take the form of conditional probabilities. The treatment of the statistical properties of nonparametric Bayesian models in terms of finite-dimensional Bayes equations therefore requires an extension result similar to the Kolmogorov theorem that is applicable to conditional distributions. The main contribution of this paper is to provide such a result.

We present an analogue of the Kolmogorov theorem for conditional probabilities, which permits the direct construction of conditional stochastic process models on countable-dimensional spaces from finite-dimensional conditional probabilities. Application to conjugate models shows how a conjugate nonparametric Bayesian model can be constructed from conjugate finite-dimensional Bayes equations, including the mapping to the posterior parameters. The converse is also true: To construct a conjugate nonparametric Bayesian model, the finite-dimensional models used in the construction all have to be conjugate. The construction of stochastic process models from exponential family marginals is almost generic: The model is completely described by the mapping to the posterior parameters, which has a generic form as a function of the infinite-dimensional counterpart of the model's sufficient statistic. We discuss how existing models fit into the framework, and derive the nonparametric Bayesian version of a model on infinite permutations suggested by [9].
By essentially providing a construction recipe for conjugate models of countable dimension, our theoretical results have clear practical implications for the derivation of novel nonparametric Bayesian models.

2 Formal Setup and Notation

Infinite-dimensional probability models cannot generally be described with densities and therefore require some basic notions of measure-theoretic probability. In this paper, the required concepts are measures on product spaces and abstract conditional probabilities (see e.g. [3] or [1] for general introductions). Randomness is described by means of an abstract probability space (Ω, A, P). Here, Ω is a space of points ω, which represent atomic random events, A is a σ-algebra of events on Ω, and P a probability measure defined on the σ-algebra. A random variable is a measurable mapping from Ω into some space of observed values, such as X : Ω → Ω_x. The distribution of X is the image measure P_X := X(P) = P ∘ X^{-1}. Roughly speaking, the events ω ∈ Ω represent abstract states of nature, i.e. knowing the value of ω completely describes all probabilistic aspects of the model universe, and all random aspects are described by the probability measure P. However, Ω, A and P are never known explicitly; rather, they constitute the modeling assumption that any explicitly known distribution P_X is derived from one and the same probability measure P through some random variable X.

Multiple dimensions of random variables are formalized by product spaces. We will generally deal with an infinite-dimensional space such as Ω^E_x, where E is an infinite index set and Ω^E_x is the E-fold product of Ω_x with itself. The set of finite subsets of E will be denoted F(E), such that Ω^I_x with I ∈ F(E) is a finite-dimensional subspace of Ω^E_x. Each product space Ω^I_x is equipped with the product Borel σ-algebra B^I_x. Random variables with values on these spaces have product structure, such as X^I = ⊗_{i∈I} X^{i}. Note that this does not imply that the corresponding measure P^I_X := X^I(P) is a product measure; the individual components of X^I may be dependent. The elements of the infinite-dimensional product space Ω^E_x can be thought of as functions of the form E → Ω_x. For example, the space R^R contains all real-valued functions on the line.

Product spaces Ω^I_x ⊂ Ω^J_x of different dimensions are linked by a projection operator π_JI, which restricts a vector x^J ∈ Ω^J_x to x^I, the subset of entries of x^J that are indexed by I ⊂ J. For a set A_I ⊂ Ω^I_x, the preimage π^{-1}_JI A_I under projection is called a cylinder set with base A_I. The projection operator can be applied to measures as [π_JI P^J_X] := P^J_X ∘ π^{-1}_JI, so for an I-dimensional event A_I ∈ B^I_x, we have [π_JI P^J_X](A_I) = P^J_X(π^{-1}_JI A_I). In other words, a probability is assigned to the I-dimensional set A_I by applying the J-dimensional measure P^J_X to the cylinder with base A_I. The projection of a measure is just its marginal, that is, [π_JI P^J_X] is the marginal of the measure P^J_X on the lower-dimensional subspace Ω^I_x.

We denote observation variables (data) by X^I, parameters by Θ^I and hyperparameters by Ψ^I. The corresponding measures and spaces are indexed accordingly, as P_X, P_Θ, Ω_θ etc. The likelihoods and posteriors that occur in Bayesian estimation are conditional probability distributions.
Since densities are not generally applicable in infinite-dimensional spaces, the formulation of Bayesian models on such spaces draws on the abstract conditional probabilities of measure-theoretic probability, which are derived from Kolmogorov's implicit formulation of conditional expectations [3]. We will write e.g. P_X(X|Θ) for the conditional probability of X given Θ. For the reader familiar with the theory, we note that all spaces considered here are Borel spaces, such that regular versions of conditionals always exist, and we hence assume all conditionals to be regular conditional probabilities (Markov kernels). Introducing abstract conditional probabilities here is far beyond the possible scope of this paper. A reader not familiar with the theory should simply read P_X(X|Θ) as a conditional distribution, but take into account that these abstract objects are only uniquely defined almost everywhere. That is, the probability P_X(X|Θ = θ) can be changed arbitrarily for those values of θ within some set of exceptions, provided that this set has measure zero. While not essential for understanding most of our results, this fact is the principal reason that limits the results to countable dimensions.

Example: GP. Assume that P^E_X(X^E|Θ^E) is to represent a Gaussian process with fixed covariance function. Then X^E is function-valued, and if for example E := R_+ and Ω_x := R, the product space Ω^E_x = R^{R_+} contains all functions x^E of the form x^E : R_+ → R. Each axis label i ∈ E in the product space is a point on the real line, and a finite index set I ∈ F(E) is a finite collection of points I = (i_1, ..., i_m). The projection π_EI x^E of a function in Ω^E_x is then the vector x^I := (x^E(i_1), ..., x^E(i_m)) of function values at the points in I.
The parameter variable Θ^E represents the mean function of the process, and so we would choose Ω^E_θ := Ω^E_x = R^{R_+}.

Example: DP. If P^E_X(X^E|Θ^E) is a Dirichlet process, the variable X^E takes values x^E in the set of probability measures over a given domain, such as R. A probability measure on R (with its Borel algebra B(R)) is in particular a set function B(R) → [0, 1], so we could choose E = B(R) and Ω_x = [0, 1]. The parameters of a Dirichlet process DP(α, G_0) are a scalar concentration parameter α ∈ R_+, and a probability measure G_0 with the same domain as the randomly drawn measure x^E. The parameter space would therefore be chosen as R_+ × [0, 1]^{B(R)}.

2.1 Construction of Stochastic Processes from their Marginals

Suppose that a family P^I_X of probability measures are the finite-dimensional marginals of an infinite-dimensional measure P^E_X (a "stochastic process"). Each measure P^I_X lives on the finite-dimensional subspace Ω^I_x of Ω^E_x. As marginals of one and the same measure, the measures must be marginals of each other as well:

P^I_X = P^J_X ∘ π^{-1}_JI    whenever I ⊂ J.    (1)

Any family of probability measures satisfying (1) is called a projective family. The marginals of a stochastic process measure are always projective. A famous theorem by Kolmogorov states that the converse is also true: Any projective family on the finite-dimensional subspaces of an infinite-dimensional product space Ω^E_x uniquely defines a stochastic process on the space Ω^E_x [1]. The only assumption required is that the "axes" Ω_x of the product space are so-called Polish spaces, i.e. topological spaces that are complete, separable and metrizable.
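The projectivity condition (1) can be checked concretely in the Gaussian case; the following is a minimal numerical sketch (assuming NumPy, with hypothetical mean and covariance values): marginalizing a Gaussian on the index set J = {0, 1, 2} by pushing samples through the coordinate projection π_JI reproduces the directly defined marginal on I = {0, 2}.

```python
import numpy as np

# A Gaussian on the index set J = {0, 1, 2}: hypothetical illustration values.
mu_J = np.array([0.5, -1.0, 2.0])
Sigma_J = np.array([[1.0, 0.3, 0.1],
                    [0.3, 2.0, 0.4],
                    [0.1, 0.4, 1.5]])

# Projection pi_JI onto I = {0, 2}: keep only the indexed coordinates.
I = [0, 2]
mu_I = mu_J[I]                     # marginal mean
Sigma_I = Sigma_J[np.ix_(I, I)]   # marginal covariance

# Empirical check of P^I_X = P^J_X o pi_JI^{-1} (Eq. 1): project samples of the
# J-dimensional Gaussian and compare moments with the directly defined marginal.
rng = np.random.default_rng(0)
samples_J = rng.multivariate_normal(mu_J, Sigma_J, size=200_000)
samples_I = samples_J[:, I]        # pathwise application of pi_JI

assert np.allclose(samples_I.mean(axis=0), mu_I, atol=0.02)
assert np.allclose(np.cov(samples_I.T), Sigma_I, atol=0.02)
```

The assertions pass because coordinate projection of a Gaussian is again Gaussian with projected mean and covariance, which is exactly why the Gaussian family in the GP example above is projective.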
Examples include Euclidean spaces, separable Banach or Hilbert spaces, countable discrete spaces, and countable products of spaces that are themselves Polish.

Theorem 1 (Kolmogorov Extension Theorem). Let E be an arbitrary infinite set. Let Ω_x be a Polish space, and let {P^I_X | I ∈ F(E)} be a family of probability measures on the spaces (Ω^I_x, B^I_x). If the family is projective, there exists a uniquely defined probability measure P^E_X on Ω^E_x with the measures P^I_X as its marginals.

The infinite-dimensional measure P^E_X constructed in Theorem 1 is called the projective limit of the family P^I_X. Intuitively, the theorem is a regularity result: The marginals determine the values of P^E_X on a subset of events (namely on those events involving only a finite subset of the random variables, which are just the cylinder sets with finite-dimensional base). The theorem then states that a probability measure is such a regular object that knowledge of these values determines the measure completely, in a similar manner as continuous functions on the line are completely determined by their values on a countable dense subset. The statement of the Kolmogorov theorem is deceptive in its generality: It holds for any index set E, but if E is not countable, the constructed measure P^E_X is essentially useless, even though the theorem still holds, and the measure is still uniquely defined. The problem is that the measure P^E_X, as a set function, is not defined on the space Ω^E_x, but on the σ-algebra B^E_x (the product σ-algebra on Ω^E_x). If E is uncountable, this σ-algebra is too coarse to resolve events of interest¹. In particular, it does not contain the singletons (one-point sets), such that the measure P^E_X is incapable of assigning a probability to an event of the form {X^E = x^E}.

3 Extension of Conditional and Bayesian Models

According to the Kolmogorov extension theorem, the properties of a stochastic process can be analyzed by studying its marginals. Can we, analogously, use a set of finite-dimensional Bayes equations to represent a nonparametric Bayesian model? The components of a Bayesian model are conditional distributions. Even though these conditionals are probability measures for (almost) each value of the condition variable, the Kolmogorov theorem cannot simply be applied to extend conditional models: Conditional probabilities are functions of two arguments, and have to satisfy a measurability requirement in the second argument (the condition). Application of the extension theorem to each value of the condition need not yield a proper conditional distribution on the infinite-dimensional space, as it disregards the properties of the second argument. But since the second argument takes the role of a parameter in statistical estimation, these properties determine the statistical properties of the model, such as sufficiency, identifiability, or conjugacy. In order to analyze the properties of an infinite-dimensional Bayesian model in terms of finite-dimensional marginals, we need a theorem that establishes a correspondence between the finite-dimensional and infinite-dimensional conditional distributions. Though a number of extension theorems based on conditional distributions are available in the literature, these results focus on the construction of sequential stochastic processes from a sequence of conditionals (see [10] for an overview). Theorem 2 below provides a result that, like the Kolmogorov theorem, is applicable on product spaces.

To formulate the result, the projector used to define the marginals has to be generalized from measures to conditionals. The natural way to do so is the following: If P^J_X(X^J|Θ^J) is a conditional probability on the product space Ω^J_x, and I ⊂ J, define

[π_JI P^J_X](·|Θ^J) := P^J_X(π^{-1}_JI ·|Θ^J).    (2)

This definition is consistent with that of the projector above, in the sense that it coincides with the standard projector applied to the measure P^J_X(·|Θ^J = θ^J) for any fixed value θ^J of the parameter. As with projective families of measures, we then define projective families of conditional probabilities.

Definition 1 (Conditionally Projective Probability Models). Let P^I_X(X^I|Θ^I) be a family of regular conditional probabilities on product spaces Ω^I_x, for all I ∈ F(E). The family will be called conditionally projective if [π_JI P^J_X](·|Θ^J) =_a.e. P^I_X(·|Θ^I) whenever I ⊂ J.

As conditional probabilities are unique almost everywhere, the equality is only required to hold almost everywhere as well. In the jargon of abstract conditional probabilities, the definition requires that P^I_X(·|Θ^I) is a version of the projection of P^J_X(·|Θ^J). Theorem 2 states that a conditional probability on a countably-dimensional product space is uniquely defined (up to a.e.-equivalence) by a conditionally projective family of marginals.

¹This problem is unfortunately often neglected in the statistics literature, and measures in uncountable dimensions are "constructed" by means of the extension theorem (such as in the original paper [5] on the Dirichlet process). See e.g. [1] for theoretical background, and [7] for a rigorous construction of the DP.
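Definition 1 can also be illustrated numerically. The following minimal sketch assumes a hypothetical Gaussian likelihood with mean parameter θ^J and fixed variance (all concrete values invented for illustration): projecting the J-dimensional conditional as in (2) agrees in distribution with the I-dimensional model evaluated at the projected parameter π_JI θ^J, so the projected conditional depends on θ^J only through θ^I.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Gaussian likelihood P^J_X(.|Theta^J = theta_J): mean parameter
# theta_J, fixed variance sigma2 per coordinate, on the index set J = {0, 1, 2}.
theta_J = np.array([1.0, -2.0, 0.5])
sigma2 = 0.7
I = [0, 2]  # I subset of J

def sample_conditional(theta, n):
    """Draw n samples from the Gaussian conditional with mean theta."""
    return theta + np.sqrt(sigma2) * rng.standard_normal((n, len(theta)))

# Projected conditional [pi_JI P^J_X](.|theta_J): push samples through pi_JI.
x_I_projected = sample_conditional(theta_J, 100_000)[:, I]

# Directly defined marginal model P^I_X(.|Theta^I = pi_JI theta_J).
theta_I = theta_J[I]
x_I_direct = sample_conditional(theta_I, 100_000)

# Conditional projectivity (Def. 1): both constructions agree in distribution.
assert np.allclose(x_I_projected.mean(axis=0), x_I_direct.mean(axis=0), atol=0.02)
assert np.allclose(x_I_projected.var(axis=0), x_I_direct.var(axis=0), atol=0.02)
```

The design point of the sketch is the one made in the text: the projection in (2) acts on the first argument (the event), but for a conditionally projective family the result is measurable with respect to, and parameterized by, the projected second argument.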
In particular, if we can define a parametric model on each finite-dimensional space Ω^I_x for I ∈ F(E) such that these models are conditionally projective, the models determine an infinite-dimensional parametric model (a "nonparametric" model) on the overall space Ω^E_x.

Theorem 2 (Extension of Conditional Probabilities). Let E be a countable index set. Let P^I_X(X^I|Θ^I) be a family of regular conditional probabilities on the product spaces Ω^I_x. Then if the family is conditionally projective, there exists a regular conditional probability P^E_X(X^E|C^E) on the infinite-dimensional space Ω^E_x with the P^I_X(X^I|Θ^I) as its conditional marginals. P^E_X(X^E|C^E) is measurable with respect to the σ-algebra C^E := σ(∪_{I∈F(E)} σ(Θ^I)). In particular, if the parameter variables satisfy π_JI Θ^J = Θ^I, then P^E_X(X^E|C^E) can be interpreted as the conditional probability P^E_X(X^E|Θ^E) with Θ^E := ⊗_{i∈E} Θ^{i}.

Proof Sketch². We first apply the Kolmogorov theorem separately for each setting of the parameters that makes the measures P^I_X(X^I|Θ^I = θ^I) projective. For any given ω ∈ Ω (the abstract probability space), projectiveness holds if θ^I = Θ^I(ω) for all I ∈ F(E). However, for any conditionally projective family, there is a set N ⊂ Ω of possible exceptions (for which projectiveness need not hold), due to the fact that conditional probabilities and conditional projections are only unique almost everywhere. Using the countability of the dimension set E, we can argue that N is always a null set; the resulting set of constructed infinite-dimensional measures is still a valid candidate for a regular conditional probability.
We then show that if this set of measures is assembled into a function of the parameter, it satisfies the measurability conditions of a regular conditional probability: We first use the properties of the marginals to show measurability on the subset of events which are preimages under projection of finite-dimensional events (the cylinder sets), and then use the π-λ theorem [3] to extend measurability to all events.

4 Conjugacy

The posterior of a Dirichlet process is again a Dirichlet process, and the posterior parameters can be computed as a function of the data and the prior parameters. This property is known as conjugacy, in analogy to conjugacy in parametric Bayesian models, and makes Dirichlet process inference tractable. Virtually all known nonparametric Bayesian models, including Gaussian processes, Pólya trees, and neutral-to-the-right processes, are conjugate [16]. In the Bayesian and exponential family literature, conjugacy is often defined as "closure under sampling", i.e. for a given likelihood and a given class of priors, the posterior is again an element of the prior class [12]. This definition does not imply tractability of the posterior: In particular, the set of all probability measures (used as priors) is conjugate for any possible likelihood, but obviously this does not facilitate computation of the posterior. In the following, we call a prior and a likelihood of a Bayesian model conjugate if the posterior (i) is parameterized and (ii) there is a measurable mapping T from the data x and the prior parameter ψ to the parameter ψ' = T(x, ψ) which specifies the corresponding posterior. In the definition below, the conditional probability k represents the parametric form of the posterior.
The definition is applicable to "nonparametric" models, in which case the parameter simply becomes infinite-dimensional.

Definition 2 (Conjugacy and Posterior Index). Let P_X(X|Θ) and P_Θ(Θ|Ψ) be regular conditional probabilities. Let P_Θ(Θ|X, Ψ) be the posterior of the model P_X(X|Θ) under prior P_Θ(Θ|Ψ). Model and prior are called conjugate if there exists a regular conditional probability k : B_θ × Ω_t → [0, 1], parameterized on a measurable Polish space (Ω_t, B_t), and a measurable map T : Ω_x × Ω_ψ → Ω_t, such that

P_Θ(A|X = x, Ψ = ψ) = k(A, T(x, ψ))    for all A ∈ B_θ.    (3)

The mapping T is called the posterior index of the model.

The definition becomes trivial for Ω_t = Ω_x × Ω_ψ and T chosen as the identity mapping; it is meaningful if T is reasonably simple to evaluate, and its complexity does not increase with sample size. Theorem 3 below shows that, under suitable conditions, the structure of the posterior index carries over to the projective limit model: If the finite-dimensional marginals admit a tractable posterior index, then so does the projective limit model.

Example (Posterior Indices in Exponential Families). Suppose that P_X(X|Θ) is an exponential family model with sufficient statistic S and density p(x|θ) = exp(⟨S(x), θ⟩ − γ(x) − φ(θ)). Choose P_Θ(Θ|Ψ) as the "natural conjugate prior" with parameters ψ = (α, y). Its density, w.r.t. a suitable measure ν_Θ on parameter space, is of the form q(θ|α, y) = K(α, y)^{-1} exp(⟨θ, y⟩ − α φ(θ)).

²Complete proofs for both theorems in this paper are provided as supplementary material.
The posterior P_Θ(Θ|X, Ψ) is conjugate in the sense of Def. 2, and its density is q(θ|α + 1, y + S(x)). The probability kernel k is given by k(A, (t_1, t_2)) := ∫_A q(θ|t_1, t_2) dν_Θ(θ), and the posterior index is T(x, (α, y)) := (α + 1, y + S(x)).

The main result of this section is Theorem 3, which explains how conjugacy carries over from the finite-dimensional to the infinite-dimensional case, and vice versa. Both extension theorems discussed so far require a projection condition on the measures and models involved. A similar condition is now required for the mappings T^I: The preimages T^{I,-1} of the posterior indices T^I must commute with the preimage under projection,

(π_EI ∘ T^E)^{-1} = (T^I ∘ π_EI)^{-1}    for all I ∈ F(E).    (4)

The posterior indices of all well-known exponential family models, such as Gaussians and Dirichlets, satisfy this condition. The following theorem states that (i) stochastic process Bayesian models that are constructed from conjugate marginals are conjugate if the projection equation (4) is satisfied, and that (ii) such conjugate models can only be constructed from conjugate marginals.

Theorem 3 (Functional Conjugacy of Projective Limit Models). Let E be a countable index set and Ω^E_x and Ω^E_θ be Polish product spaces. Assume that there is a Bayesian model on each finite-dimensional subspace Ω^I_x, such that the families of all priors, all observation models and all posteriors are conditionally projective. Let P^E_Θ(Θ^E), P^E_X(X^E|Θ^E) and P^E_Θ(Θ^E|X^E) denote the respective projective limits. Then P^E_Θ(Θ^E|X^E) is a posterior for the infinite-dimensional Bayesian model defined by P^E_X(X^E|Θ^E) with prior P^E_Θ(Θ^E), and the following holds:

(i) Assume that each finite-dimensional posterior P^I_Θ(Θ^I|X^I) is conjugate w.r.t. its respective Bayesian model, with posterior index T^I and probability kernel k^I. Then if there is a measurable mapping T : Ω^E_x → Ω^E_t satisfying the projection condition (4), the projective limit posterior P^E_Θ(Θ^E|X^E) is conjugate with posterior index T.

(ii) Conversely, if the infinite-dimensional posterior P^E_Θ(Θ^E|X^E) is conjugate with posterior index T^E and probability kernel k^E, then each marginal posterior P^I_Θ(Θ^I|X^I) is conjugate, with posterior index T^I := π_EI ∘ T^E ∘ π^{-1}_EI. The corresponding probability kernels k^I are given by

k^I(A_I, t_I) := k^E(π^{-1}_EI A_I, t)    for any t ∈ π^{-1}_EI t_I.    (5)

The theorem is not stated here in full generality, but under two simplifying assumptions: We have omitted the use of hyperparameters, such that the posterior indices depend only on the data, and all involved spaces (observation space, parameter space etc.) are assumed to have the same dimension for each Bayesian model. Generalizing the theorem beyond both assumptions is technically not difficult, but the additional parameters and notation for book-keeping on dimensions reduce readability.

Proof Sketch². Part (i): We define a candidate for the probability kernel k^E representing the projective limit posterior, and then verify that it makes the model conjugate when combined with the mapping T given by assumption. To do so, we first construct the conditional probabilities P^I_Θ(Θ^I|T^I), show that they form a conditionally projective family, and take their conditional projective limit using Theorem 2. This projective limit is used as a candidate for k^E. To show that k^E indeed represents the posterior, we show that the two coincide on the cylinder sets (events which are preimages under projection of finite-dimensional events).
From this, equality for all events follows by the Carathéodory theorem [1].

Part (ii): We only have to verify that the mappings T^I and probability kernels k^I indeed satisfy the definition of conjugacy, which is a straightforward computation.

5 Construction of Nonparametric Bayesian Models

Theorem 3(ii) states that conjugate models have conjugate marginals. Since, in the finite-dimensional case, conjugate Bayesian models are essentially limited to exponential families and their natural conjugate priors³, a consequence of the theorem is that we can only expect a nonparametric Bayesian model to be conjugate if it is constructed from exponential family marginals, assuming that the construction is based on a product space approach.

When an exponential family model and its conjugate prior are used in the construction, the form of the resulting model becomes generic: The posterior index T of a conjugate exponential family Bayesian model is always given by the sufficient statistic S in the form T(x, (α, y)) := (α + 1, y + S(x)). Addition commutes with projection, and hence the posterior indices T^I of a family of such models over all dimensions I ∈ F(E) satisfy the projection condition (4) if and only if the same condition is satisfied by the sufficient statistics S^I of the marginals. Accordingly, the infinite-dimensional posterior index T^E in Theorem 3 exists if and only if there is an infinite-dimensional "extension" S^E of the sufficient statistics S^I satisfying (4). If that is the case, T^E(x^E, (α, y^E)) := (α + 1, y^E + S^E(x^E)) is a posterior index for the infinite-dimensional projective limit model.
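The generic posterior index and the projection condition (4) can be sketched coordinatewise. The following is a minimal illustration under the assumption of a hypothetical Bernoulli-type model in which the sufficient statistic acts per coordinate as S(x) = x (all numerical values invented): updating the posterior parameter in dimension J and then projecting onto I coincides with projecting the data and prior parameter first and updating in dimension I.

```python
import numpy as np

# Hypothetical coordinatewise model on the index set J = {0, ..., 4}: the
# sufficient statistic S(x) = x acts per coordinate, so S commutes with
# projection, and so does the posterior index T(x, (alpha, y)) = (alpha+1, y+S(x)).
def posterior_index(x, alpha, y):
    return alpha + 1, y + x        # S(x) = x in this illustration

alpha, y_J = 2.0, np.array([1.0, 0.5, 3.0, 2.0, 0.0])
x_J = np.array([1.0, 0.0, 1.0, 1.0, 0.0])   # one J-dimensional observation
I = [1, 3]                                   # I subset of J

# Route 1: update in dimension J, then project the posterior parameter.
alpha_J, y_J_post = posterior_index(x_J, alpha, y_J)
projected_after = (alpha_J, y_J_post[I])

# Route 2: project data and prior parameter first, then update in dimension I.
alpha_I, y_I_post = posterior_index(x_J[I], alpha, y_J[I])
updated_after = (alpha_I, y_I_post)

# Projection condition (4): the two routes coincide.
assert projected_after[0] == updated_after[0]
assert np.array_equal(projected_after[1], updated_after[1])
```

The two routes agree precisely because addition commutes with coordinate projection, which is the observation the text uses to reduce condition (4) on T^I to the same condition on the sufficient statistics S^I.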
In the case of countable dimensions, Theorem 3 therefore implies a construction recipe for nonparametric Bayesian models from exponential family marginals: constructing the model boils down to checking whether the models selected as finite-dimensional marginals are conditionally projective, and whether the sufficient statistics satisfy the projection condition. An example construction, for a model on infinite permutations, is given below. The following table summarizes some stochastic process models from the conjugate extension point of view:

Marginals (d-dim)       | Projective limit model | Observations (limit)
Bernoulli/Beta          | Beta process; IBP      | Binary arrays
Multinomial/Dirichlet   | DP; CRP                | Discrete distributions
Gaussian/Gaussian       | GP/GP                  | (Continuous) functions
Mallows/conjugate       | Example below          | Bijections N → N

A Construction Example. The analysis of preference data, in which preferences are represented as permutations, has motivated the definition of distributions on permutations of an infinite number of items [9]. A finite permutation on r items always implies a question such as "rank your favorite movies out of r movies". A nonparametric approach can generalize the question to "rank your favorite movies". Meilă and Bao [9] derived a model on infinite permutations, that is, on bijections of the set N. We construct a nonparametric Bayesian model on bijections, with a likelihood component P^E_X(X^E | Θ^E) equivalent to the model of Meilă and Bao.
Choice of marginals. The finite-dimensional marginals are probability models of rankings of a finite number of items, introduced by Fligner and Verducci [6].
For permutations τ ∈ S_r of length r, the model is defined by the exponential family density p(τ | σ, θ) := Z(θ)^{-1} exp(⟨S(τσ^{-1}), θ⟩), where the sufficient statistic is the vector S_r(τ) := (S_1(τ), ..., S_r(τ)) with components

    S_j(τ) := Σ_{l=j+1}^{r} I{τ^{-1}(j) > τ^{-1}(l)}.

Roughly speaking, the model is a location-scale model, and the permutation σ defines the distribution's mean. If all entries of θ are chosen identical as some constant, this constant acts as a concentration parameter, and the scalar product is equivalent to the Kendall metric on permutations. This metric measures the distance between permutations as the minimum number of adjacent transpositions (i.e. swaps of neighboring entries) required to transform one permutation into the other. If the entries of θ differ, they can be regarded as weights specifying the relevance of each position in the ranking [6].
Definition of marginals. In the product space context, each finite set I ∈ F(E) of axis labels is a set of items to be permuted, and the marginal P^I(τ^I | σ^I, θ^I) is a model on the corresponding finite permutation group S_I on the elements of I. The sufficient statistic S^I maps each permutation to a vector of integers, and thus embeds the group S_I into R^I. The mapping is one-to-one [6]. Projections, i.e. restrictions, on the group mean deletion of elements. A permutation τ^J is restricted to a subset I ⊂ J of indices by deleting all items indexed by J \ I, producing the restriction τ^J|_I. We overload notation and write π_JI for both the restriction in the group and the axes-parallel projection in the Euclidean space R^I, into which the sufficient statistic S^I embeds S_I.
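The sufficient statistic above, and its link to the Kendall metric when all entries of θ coincide, can be sketched directly. The code below is an illustration of the definition (the helper names and the one-line representation of τ are my own choices, not the paper's): the components of S_r(τ) sum to the number of inversions of τ, i.e. the minimum number of adjacent transpositions separating τ from the identity.

```python
from itertools import combinations

def suff_stat(tau):
    """S_r(tau), with S_j(tau) = sum_{l=j+1}^r 1{tau^{-1}(j) > tau^{-1}(l)}.

    tau is given in one-line notation: tau[k] is the item placed at rank
    k+1, so the (0-based) rank of item j is the position of j in tau.
    """
    r = len(tau)
    rank = {item: pos for pos, item in enumerate(tau)}  # tau^{-1}, 0-based
    return tuple(sum(1 for l in range(j + 1, r + 1) if rank[j] > rank[l])
                 for j in range(1, r + 1))

def kendall_to_identity(tau):
    """Kendall distance to the identity: the number of inversions of tau."""
    return sum(1 for a, b in combinations(range(len(tau)), 2)
               if tau[a] > tau[b])

tau = (2, 4, 1, 3)   # item 2 is ranked first, item 4 second, and so on
S = suff_stat(tau)
# With all entries of theta equal, <S(tau), theta> reduces to a constant
# times the Kendall distance, since the components of S sum to it.
assert sum(S) == kendall_to_identity(tau)
```

Running this on τ = (2, 4, 1, 3) gives S = (2, 0, 1, 0), matching the three inversions of τ.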
It follows from the definition of S^I that, whenever π_JI τ^J = τ^I, then π_JI S^J(τ^J) = S^I(τ^I). In other words, π_JI ∘ S^J = S^I ∘ π_JI, which is a stronger form of the projection condition S^{J,-1} ∘ π^{-1}_JI = π^{-1}_JI ∘ S^{I,-1} given in Eq. 4. We will define a nonparametric Bayesian model that puts a prior on the infinite-dimensional analogue of θ, i.e. on the weight function θ^E. For I ∈ F(N), the marginal of the likelihood component is given by the density p^I(τ^I | σ^I, θ^I) := Z^I(θ^I)^{-1} exp(⟨S^I(τ^I(σ^I)^{-1}), θ^I⟩). The corresponding natural conjugate prior on θ^I has density q^I(θ^I | α, y^I) ∝ exp(⟨θ^I, y^I⟩ − α log Z^I(θ^I)). Since the model is an exponential family model, the posterior index is of the form T^I((α, y^I), τ^I) = (α + 1, y^I + S^I(τ^I)), and since S^I is projective in the sense of Eq. 4, so is T^I. The prior and likelihood densities above define two families P^I(X^I | Θ^I) and P^I(Θ^I | Ψ) of measures over all finite dimensions I ∈ F(E). It is reasonably straightforward to show that both families are conditionally projective, and so is the family of the corresponding posteriors. Each therefore has a projective limit, and the projective limit of the posteriors is the posterior of the projective limit P^E(X^E | Θ^E) under prior P^E(Θ^E).

³Mixtures of conjugate priors are conjugate in the sense of closure under sampling [4], but the posterior index in Def. 2 has to be evaluated for each mixture component individually. An example of a conjugate model not in the exponential family is the uniform distribution on [0, θ] with a Pareto prior [12].

Posterior index.
The posterior index of the infinite-dimensional model can be derived by means of Theorem 3: to get rid of the hyperparameters, we first fix a value ψ^E := (α, y^E) of the infinite-dimensional hyperparameter, and only consider the corresponding infinite-dimensional prior P^E_Θ(Θ^E | Ψ^E = ψ^E), with its marginals P^I_Θ(Θ^I | Ψ^I = π_EI ψ^E). Now define a function S^E on the bijections of N as follows. For each bijection τ : N → N, and each j ∈ N, set

    S^E_j(τ) := Σ_{l=j+1}^{∞} I{τ^{-1}(j) > τ^{-1}(l)}.

Since τ^{-1}(j) is a finite number for any j ∈ N, the indicator function is non-zero only for a finite number of indices l, such that the entries of S^E are always finite. Then S^E satisfies the projection condition S^{E,-1} ∘ π^{-1}_EI = π^{-1}_EI ∘ S^{I,-1} for all I ∈ F(E). As candidate posterior index, we define the function T^E((α, y^E), τ^E) = (α + 1, y^E + S^E(τ^E)) for y^E ∈ Ω^N_θ. Then T^E also satisfies the projection condition (4) for any I ∈ F(E). By Theorem 3, this makes T^E a posterior index for the projective limit model.

6 Discussion and Conclusion
We have shown how nonparametric Bayesian models can be constructed from finite-dimensional Bayes equations, and how conjugacy properties of the finite-dimensional models carry over to the infinite-dimensional, nonparametric case. We have also argued that conjugate nonparametric Bayesian models arise from exponential families.

A number of interesting questions could not be addressed within the scope of this paper, including (1) the extension to model properties other than conjugacy and (2) the generalization to uncountable dimensions.
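Before turning to these open questions, the finiteness argument for S^E in the construction example above can be made concrete for bijections of N that move only finitely many elements. The finite-support dictionary representation below is my own illustration, not the paper's: since only finitely many items are placed at ranks below τ^{-1}(j), each entry S^E_j(τ) is a finite sum.

```python
def SE_component(tau, j):
    """S^E_j(tau) = sum_{l>j} 1{tau^{-1}(j) > tau^{-1}(l)} for a bijection
    of N, given as a dict holding only its non-identity values
    (tau(n) = n wherever n is absent from the dict).

    Because tau^{-1}(j) is finite, only the finitely many items placed at
    ranks 1, ..., tau^{-1}(j) - 1 can contribute, so the sum is finite.
    """
    inv = {v: k for k, v in tau.items()}   # non-identity part of tau^{-1}
    rank = lambda n: inv.get(n, n)         # tau^{-1}(n)
    rj = rank(j)
    # items ranked strictly before item j:
    before = (tau.get(r, r) for r in range(1, rj))
    return sum(1 for l in before if l > j)

# tau swaps items 1 and 2 and fixes everything else: tau = (2, 1, 3, 4, ...)
tau = {1: 2, 2: 1}
assert SE_component(tau, 1) == 1   # item 2 > 1 is ranked before item 1
assert SE_component(tau, 2) == 0
assert SE_component({}, 5) == 0    # identity bijection: all entries vanish
```

For the identity, every entry of S^E is zero, and a finite perturbation changes only finitely many entries, consistent with S^E extending the finite-dimensional statistics S^I.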
For example, a model property which is closely related to conjugacy is sufficiency [14]. In this case, we would ask whether the existence of sufficient statistics for the finite-dimensional marginals implies the existence of a sufficient statistic for the nonparametric Bayesian model, and whether the infinite-dimensional sufficient statistic can be explicitly constructed. Second, the results presented here are restricted to the case of countable dimensions. This restriction is inconvenient, since the natural product space representations of, for example, Gaussian and Dirichlet processes on the real line have uncountable dimensions. The GP (on continuous functions) and the DP are within the scope of our results, as both can be derived by means of countable-dimensional surrogate constructions: since continuous functions on R are completely determined by their values on Q, a GP can be constructed on the countable-dimensional product space R^Q. Analogous constructions have been proposed for the DP [7]. The drawback of this approach is that the actual random draw is just a partial version of the object of interest, and formally has to be completed, e.g. into a continuous function or a probability measure, after it is sampled. On the other hand, uncountable product space constructions are subject to all the subtleties of stochastic process theory, many of which do not occur in countable dimensions. The application of construction methods to conditional probabilities also becomes more complicated (roughly speaking, the point-wise application of the Kolmogorov theorem in the proof of Theorem 2 is not possible if the dimension is uncountable).

Product space constructions are by far not the only way to define nonparametric Bayesian models. A Pólya tree model [7], for example, is much more intuitive to construct by means of a binary partition argument than from marginals in product space.
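The surrogate construction on R^Q rests on the finite-dimensional Gaussian marginals of a GP forming a projective family: marginalizing the Gaussian on a finite point set J onto a subset I just selects the corresponding sub-vector of the mean and sub-matrix of the covariance. A minimal numpy sketch of this consistency (the squared-exponential kernel and the point sets are my own illustrative choices):

```python
import numpy as np

def sq_exp_kernel(s, t, ell=1.0):
    """Squared-exponential covariance k(s, t), a standard GP kernel."""
    return np.exp(-0.5 * ((s - t) / ell) ** 2)

def gp_marginal(points):
    """Mean and covariance of a zero-mean GP marginal on finitely many points."""
    pts = np.asarray(points, dtype=float)
    mean = np.zeros(len(pts))
    cov = sq_exp_kernel(pts[:, None], pts[None, :])
    return mean, cov

# Finite sets of rational evaluation points, with I a subset of J.
J = [0.0, 0.5, 1.25, 2.0]
I_idx = [0, 2]                              # I = {0.0, 1.25}
mean_J, cov_J = gp_marginal(J)
mean_I, cov_I = gp_marginal([J[i] for i in I_idx])

# Kolmogorov consistency: projecting the J-marginal onto I reproduces the
# I-marginal exactly (sub-vector of the mean, sub-matrix of the covariance).
assert np.allclose(mean_J[I_idx], mean_I)
assert np.allclose(cov_J[np.ix_(I_idx, I_idx)], cov_I)
```

This is the Gaussian/Gaussian row of the table above in miniature; the same check with any positive definite kernel illustrates why a GP on the countable index set Q is well defined.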
As far as characterization results are concerned, such as which models can be conjugate, our results are still applicable, since the set of Pólya trees can be embedded into a product space. However, the marginals may then not be the marginals in terms of which we "naturally" think about the model. Nonetheless, we have hopefully demonstrated that the theoretical results are applicable for the construction of an interesting and practical range of nonparametric Bayesian models.
Acknowledgments. I am grateful to Joachim M. Buhmann, Zoubin Ghahramani, Finale Doshi-Velez and the reviewers for helpful comments. This work was in part supported by EPSRC grant EP/F028628/1.

References

[1] H. Bauer. Probability Theory. W. de Gruyter, 1996.
[2] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems, 2001.
[3] P. Billingsley. Probability and Measure. Wiley, 1995.
[4] S. R. Dalal and W. J. Hall. Approximating priors by mixtures of natural conjugate priors. Journal of the Royal Statistical Society B, 45(2):278–286, 1983.
[5] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2), 1973.
[6] M. A. Fligner and J. S. Verducci. Distance based ranking models. Journal of the Royal Statistical Society B, 48(3):359–369, 1986.
[7] J. K. Ghosh and R. V. Ramamoorthi. Bayesian Nonparametrics. Springer, 2002.
[8] T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, 2005.
[9] M. Meilă and L. Bao. Estimation and clustering with infinite rankings. In Uncertainty in Artificial Intelligence, 2008.
[10] M. M. Rao. Conditional Measures and Applications. Chapman & Hall, second edition, 2005.
[11] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning.
MIT Press, 2006.
[12] C. P. Robert. The Bayesian Choice. Springer, 1994.
[13] D. M. Roy and Y. W. Teh. The Mondrian process. In Advances in Neural Information Processing Systems, 2009.
[14] M. J. Schervish. Theory of Statistics. Springer, 1995.
[15] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[16] S. G. Walker, P. Damien, P. W. Laud, and A. F. M. Smith. Bayesian nonparametric inference for random distributions and related functions. Journal of the Royal Statistical Society B, 61(3):485–527, 1999.
[17] L. Wasserman. All of Nonparametric Statistics. Springer, 2006.