{"title": "Learning with Tree-Averaged Densities and Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 761, "page_last": 768, "abstract": null, "full_text": "Learning with Tree-Averaged Densities and\n\nDistributions\n\nAICML and Dept of Computing Science\n\nUniversity of Alberta\n\nSergey Kirshner\n\nEdmonton, Alberta, Canada T6G 2E8\n\nsergey@cs.ualberta.ca\n\nAbstract\n\nWe utilize the ensemble of trees framework, a tractable mixture over super-\nexponential number of tree-structured distributions [1], to develop a new model\nfor multivariate density estimation. The model is based on a construction of tree-\nstructured copulas \u2013 multivariate distributions with uniform on [0, 1] marginals.\nBy averaging over all possible tree structures, the new model can approximate\ndistributions with complex variable dependencies. We propose an EM algorithm\nto estimate the parameters for these tree-averaged models for both the real-valued\nand the categorical case. Based on the tree-averaged framework, we propose a\nnew model for joint precipitation amounts data on networks of rain stations.\n\n1 Introduction\n\nMultivariate real-valued data appears in many real-world data sets, and a lot of research is being\nfocused on the development of multivariate real-valued distributions. One of the challenges in con-\nstructing such distributions is that univariate continuous distributions commonly do not have a clear\nmultivariate generalization. The most studied exception is the multivariate Gaussian distribution ow-\ning to properties such as closed form density expression with a convenient generalization to higher\ndimensions and closure over the set of linear projections. 
However, not all problems can be addressed well with Gaussians (e.g., mixtures, multimodal distributions, heavy-tailed distributions), and new approaches are needed for such problems.

While modeling multivariate distributions is in general difficult due to complicated functional forms and the curse of dimensionality, learning models for individual variables (univariate marginals) is often straightforward. Once the univariate marginals are known (or assumed known), the rest can be modeled using copulas, multivariate distributions with all univariate marginals equal to uniform distributions on [0, 1] (e.g., [2, 3]). A large portion of copula research has concentrated on bivariate copulas, as extensions to higher dimensions are often difficult. Thus, if the desired distribution decomposes into its univariate marginals and only bivariate distributions, the machinery of copulas can be effectively utilized.

Distributions with undirected tree-structured graphical models (e.g., [4]) have exactly these properties, as probability density functions over variables with tree-structured conditional independence graphs can be written as a product involving univariate marginals and bivariate marginals corresponding to the edges of the tree. While tree-structured dependence is perhaps too restrictive, richer variable dependence can be obtained by averaging over a small number of different tree structures [5] or over all possible tree structures; the latter can be done analytically for categorical-valued distributions with an ensemble-of-trees model [1]. In this paper, we extend this tree-averaged model to continuous variables with the help of copulas and derive a learning algorithm to estimate the parameters within the maximum likelihood framework with EM [6]. 
Within this framework, the parameter estimation for tree-structured and tree-averaged models requires optimization over only univariate and bivariate densities, potentially avoiding the curse of dimensionality, a property not shared by alternative models that relax the dependence restriction of trees (e.g., vines [7]).

The main contributions of the paper are the new tree-averaged model for multivariate copulas, a parameter estimation algorithm for the tree-averaged framework (for both categorical and real-valued complete data), and a new model for multi-site daily precipitation amounts, an important application in hydrology. In the process, we introduce a previously unexplored tree-structured copula density and an algorithm for estimation of its structure and parameters. The paper is organized as follows. First, we describe copulas, their densities, and some of their useful properties (Section 2). We then construct multivariate copulas with tree-structured dependence from bivariate copulas (Section 3.1) and show how to estimate the parameters of the bivariate copulas and perform the edge selection. To allow more complex dependencies between the variables, we describe a tree-averaged copula, a novel copula object constructed by averaging over all possible spanning trees for tree-structured copulas, and derive a learning algorithm for the estimation of the parameters of tree-averaged copulas from data (Section 4). We apply our new method to a benchmark data set (Section 5.1); we also develop a new model for multi-site precipitation amounts, a problem involving both binary (rain/no rain) and continuous (how much rain) variables (Section 5.2).

2 Copulas

Let $X = (X_1, \ldots, X_d)$ be a vector random variable with corresponding probability distribution $F$ (cdf) defined on $\mathbb{R}^d$. We denote by $V$ the set of $d$ components (variables) of $X$ and refer to individual variables as $X_v$ for $v \in V$. 
For simplicity, we will refer to assignments to random variables by lower case letters, e.g., $X_v = x_v$ will be denoted by $x_v$. Let $F_v(x_v) = F(X_v = x_v,\ X_u = \infty : u \in V \setminus \{v\})$ denote a univariate marginal of $F$ over the variable $X_v$. Let $p_v(x_v)$ denote the probability density function (pdf) of $X_v$. Let $a_v = F_v(x_v)$, and let $a = (a_1, \ldots, a_d)$, so $a$ is a vector of quantiles of the components of $x$ with respect to the corresponding univariate marginals. Next, we define the copula, a multivariate distribution over vectors of quantiles.

Definition 1. The copula associated with $F$ is a distribution function $C : [0,1]^d \to [0,1]$ that satisfies
$$F(x) = C(F_1(x_1), \ldots, F_d(x_d)), \quad x \in \mathbb{R}^d. \qquad (1)$$
If $F$ is a continuous distribution on $\mathbb{R}^d$ with univariate marginals $F_1, \ldots, F_d$, then $C(a) = F\left(F_1^{-1}(a_1), \ldots, F_d^{-1}(a_d)\right)$ is the unique choice for (1).

Assuming that $F$ has $d$-th order partial derivatives, the probability density function (pdf) can be obtained from the distribution function via differentiation and expressed in terms of a derivative of a copula:
$$p(x) = \frac{\partial^d F(x)}{\partial x_1 \cdots \partial x_d} = \frac{\partial^d C(a)}{\partial x_1 \cdots \partial x_d} = \frac{\partial^d C(a)}{\partial a_1 \cdots \partial a_d} \prod_{v \in V} \frac{\partial a_v}{\partial x_v} = c(a) \prod_{v \in V} p_v(x_v) \qquad (2)$$
where $c(a) = \frac{\partial^d C(a)}{\partial a_1 \cdots \partial a_d}$ is referred to as a copula density function.

Suppose we are given a complete data set $D = \{x^1, \ldots, x^N\}$ of $d$-component real-valued vectors $x^n = (x^n_1, \ldots, x^n_d)$ under an i.i.d. assumption. A maximum likelihood (ML) estimate for the parameters of $c$ (or $p$) from data can be obtained by maximizing the log-likelihood of $D$:
$$\ln p(D) = \sum_{v \in V} \sum_{n=1}^{N} \ln p_v(x^n_v) + \sum_{n=1}^{N} \ln c\left(F_1(x^n_1), \ldots, F_d(x^n_d)\right). \qquad (3)$$

The first term of the log-likelihood corresponds to the total log-likelihood of all univariate marginals of $p$, and the second term to the log-likelihood of its $d$-variate copula. These terms are not independent, as the second term is defined in terms of the probability expressions in the first summand; except for a few special cases, a direct optimization of (3) is prohibitively complicated. However, a useful (and asymptotically consistent) heuristic is first to maximize the log-likelihood for the marginals (first term only), and then to estimate the parameters for the copula given the solution for the marginals. The univariate marginals can be accurately estimated either by fitting the parameters of some appropriately chosen univariate distributions or by applying non-parametric methods,^1 as the marginals are estimated independently of each other and do not suffer from the curse of dimensionality. Let $\hat{p}_v(x_v)$ be the estimated pdf for component $v$, and $\hat{F}_v$ be the corresponding cdf. Let $A = \{a^1, \ldots, a^N\}$ where $a^n = (a^n_1, \ldots, a^n_d) = (\hat{F}_1(x^n_1), \ldots, \hat{F}_d(x^n_d))$ be a set of estimated quantiles. Under the above heuristic, the ML estimate for the copula density $c$ is computed by maximizing $\ln c(A) = \sum_{n=1}^{N} \ln c(a^n)$.

3 Exploiting Tree-Structured Dependence

Joint probability distributions are often modeled with probabilistic graphical models where the structure of the graph captures the conditional independence relations of the variables. The joint distribution is then represented as a product of functions over subsets of variables. We would like to keep the number of variables for each of the functions small, as the number of parameters and the number of points needed for parameter estimation often grows exponentially with the number of variables. 
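The two-stage heuristic just described (estimate the marginals first, then transform the data to quantiles before fitting the copula) can be sketched in a few lines. This is illustrative code, not the paper's: rescaled empirical CDFs stand in for the fitted or KDE marginals, and all names are assumptions.

```python
import numpy as np

def empirical_quantiles(D):
    """Map each column of the N x d data matrix to estimated quantiles
    a^n_v = F_hat_v(x^n_v), using the empirical CDF of that column.
    Dividing ranks by N + 1 keeps all quantiles strictly inside (0, 1)."""
    N, d = D.shape
    A = np.empty_like(D, dtype=float)
    for v in range(d):
        # rank of each observation among the N values of component v (1..N)
        ranks = np.argsort(np.argsort(D[:, v])) + 1.0
        A[:, v] = ranks / (N + 1.0)
    return A

rng = np.random.default_rng(0)
D = rng.normal(size=(500, 3)) ** 3   # toy data with non-Gaussian marginals
A = empirical_quantiles(D)           # pseudo-observations in (0, 1)
```

Any copula density can then be fitted to `A` alone, independently of how the marginals were estimated; with parametric marginals this is the IFM heuristic of footnote 1, with empirical marginals it is CML.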
Thus, we focus on copulas with tree dependence. Trees play an important role in probabilistic graphical models as they allow for efficient exact inference [10] as well as structure and parameter learning [4]. They can also be placed in a fully Bayesian framework with decomposable priors, allowing one to compute expected values (over all possible spanning trees) of products of functions defined on the edges of the trees [1]. As we will see later in this section, under tree-structured dependence, a copula density can be computed as a product of bivariate copula densities over the edges of the graph. This property allows us to estimate the parameters for the edge copulas independently.

3.1 Tree-Structured Copulas

We consider tree-structured Markov networks, i.e., undirected graphs that do not have loops. For a distribution $F$ admitting a tree-structured Markov network (referred to from now on as a tree-structured distribution), assuming that $p(x) > 0$ and $p(x) < \infty$ for $x \in R \subseteq \mathbb{R}^d$, the density (for $x \in R$) can be rewritten as
$$p(x) = \left[\prod_{v \in V} p_v(x_v)\right] \prod_{\{u,v\} \in E} \frac{p_{uv}(x_u, x_v)}{p_u(x_u)\, p_v(x_v)}. \qquad (4)$$
This formulation easily follows from the Hammersley-Clifford theorem [11]. Note that for $\{u,v\} \in E$, a copula density $c_{uv}(a_u, a_v)$ for $F(x_u, x_v)$ can be computed using Equation 2:
$$c_{uv}(a_u, a_v) = \frac{p_{uv}(x_u, x_v)}{p_u(x_u)\, p_v(x_v)}. \qquad (5)$$
Using Equations 2, 4, and 5, the copula density $c_p(a)$ for $F(x)$ can be computed as
$$c_p(a) = \frac{p(x)}{\prod_{v \in V} p_v(x_v)} = \prod_{\{u,v\} \in E} \frac{p_{uv}(x_u, x_v)}{p_u(x_u)\, p_v(x_v)} = \prod_{\{u,v\} \in E} c_{uv}(a_u, a_v). \qquad (6)$$
Equation 6 states that a copula density for a tree-structured distribution decomposes as a product of bivariate copulas over its edges. The converse is true as well; a tree-structured copula can be constructed by specifying copulas for the edges of the tree.

Theorem 1. 
Given a tree or a forest $G = (V, E)$ and copula densities $c_{uv}(a_u, a_v)$ for $\{u,v\} \in E$,
$$c_E(a) = \prod_{\{u,v\} \in E} c_{uv}(a_u, a_v)$$
is a valid copula density.

For a tree-structured density, the copula log-likelihood can be rewritten as
$$\ln c(A) = \sum_{\{u,v\} \in E} \sum_{n=1}^{N} \ln c_{uv}(a^n_u, a^n_v),$$
and the parameters can be fitted by maximizing $\sum_{n=1}^{N} \ln c_{uv}(a^n_u, a^n_v)$ independently for different pairs $\{u,v\} \in E$. The tree structure can be learned from the data as well, as in the Chow-Liu algorithm [4]. The full algorithm can be found in an extended version of the paper [12].

^1 These approaches for copula estimation are referred to as inference for the margins (IFM) [8] and canonical maximum likelihood (CML) [9] for parametric and non-parametric forms of the marginals, respectively.

4 Tree-Averaged Copulas

While the framework from Section 3.1 is computationally efficient and convenient for implementation, the imposed tree-structured dependence is too restrictive for real-world problems. Vines [7], for example, deal with this problem by allowing recursive refinements of the bivariate probabilities over variables not connected by tree edges. However, vines require estimation of additional characteristics of the distribution (e.g., conditional rank correlations), requiring estimation over large sets of variables, which is not advisable when the amount of available data is not large. Our proposed method only requires optimization of the parameters of bivariate copulas from the corresponding two components of weighted data vectors. 
Using the Bayesian framework for spanning trees from [1], it is possible to construct an object constituting a convex combination over all possible spanning trees, allowing a much richer set of conditional independencies than a single tree.

Meilă and Jaakkola [1] proposed a decomposable prior over all possible spanning tree structures. Let $\beta$ be a symmetric matrix of non-negative weights for all pairs of distinct variables, with zeros on the diagonal. Let $\mathcal{E}$ be the set of all possible spanning trees over $V$. The probability distribution over all spanning tree structures over $V$ is defined as
$$P(E \in \mathcal{E} \mid \beta) = \frac{1}{Z} \prod_{\{u,v\} \in E} \beta_{uv} \quad \text{where} \quad Z = \sum_{E \in \mathcal{E}} \prod_{\{u,v\} \in E} \beta_{uv}. \qquad (7)$$
Even though the sum is over $|\mathcal{E}| = d^{d-2}$ trees, $Z$ can be efficiently computed in closed form using a weighted generalization of Kirchhoff's Matrix Tree Theorem (e.g., [1]).

Theorem 2. Let $P(E)$ be a distribution over spanning tree structures defined by (7). Then the normalization constant $Z$ is equal to the determinant $|L^\star(\beta)|$, with the matrix $L^\star(\beta)$ consisting of the first $(d-1)$ rows and columns of the matrix $L(\beta)$ given by:
$$L_{uv}(\beta) = L_{vu}(\beta) = \begin{cases} -\beta_{uv} & u, v \in V,\ u \neq v; \\ \sum_{w \in V} \beta_{vw} & u, v \in V,\ u = v. \end{cases}$$

$\beta$ is a generalization of an adjacency matrix, and $L(\beta)$ is a generalization of the Laplacian matrix. The decomposability property of the tree prior (Equation 7) allows us to compute the average of tree-structured distributions over all $d^{d-2}$ tree structures. In [1], such averaging was applied to tree-structured distributions over categorical variables. 
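Theorem 2 is easy to verify numerically. The sketch below (illustrative code, not the paper's) computes $Z$ as the determinant of the truncated weighted Laplacian and checks it by brute-force enumeration of spanning trees on $d = 4$ vertices, where there are only $4^{4-2} = 16$ of them.

```python
import numpy as np
from itertools import combinations

def partition_function(beta):
    """Z = |L*(beta)|: determinant of the weighted Laplacian with the
    last row and column removed (weighted Matrix Tree Theorem)."""
    L = np.diag(beta.sum(axis=1)) - beta
    return np.linalg.det(L[:-1, :-1])

# d = 4 vertices, symmetric positive weights, zero diagonal
rng = np.random.default_rng(1)
W = rng.uniform(0.5, 2.0, size=(4, 4))
beta = np.triu(W, 1) + np.triu(W, 1).T

def is_spanning_tree(edges, d):
    """A set of d - 1 edges is a spanning tree iff it creates no cycle."""
    parent = list(range(d))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False
        parent[ru] = rv
    return True

all_edges = list(combinations(range(4), 2))
Z_brute = sum(
    np.prod([beta[u, v] for u, v in tree])
    for tree in combinations(all_edges, 3)
    if is_spanning_tree(list(tree), 4)
)
```

With all weights set to 1, `partition_function` simply counts spanning trees, recovering Cayley's $d^{d-2}$.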
Similarly, we define a tree-averaged copula density as a convex combination of copula densities of the form (6):
$$r(a) = \sum_{E \in \mathcal{E}} P(E \mid \beta)\, c(a) = \frac{1}{Z} \sum_{E \in \mathcal{E}} \left[\prod_{\{u,v\} \in E} \beta_{uv}\right] \left[\prod_{\{u,v\} \in E} c_{uv}(a_u, a_v)\right] = \frac{|L^\star(\beta c(a))|}{|L^\star(\beta)|}$$
where entry $(uv)$ of the matrix $\beta c(a)$ denotes $\beta_{uv} c_{uv}(a_u, a_v)$. A finite convex combination of copulas is a copula, so $r(a)$ is a copula density.

4.1 Parameter Estimation

Given a set of estimated quantile values $A$, suitable parameter values $\beta$ (edge weight matrix) and $\theta$ (parameters for bivariate edge copulas) can be found by maximizing the log-likelihood of $A$:
$$l(\beta, \theta) = \ln r(A \mid \beta, \theta) = \sum_{n=1}^{N} \ln r(a^n \mid \beta, \theta) = \sum_{n=1}^{N} \ln |L^\star(\beta c(a^n \mid \theta))| - N \ln |L^\star(\beta)|. \qquad (8)$$
However, the parameter optimization of $l(\beta, \theta)$ cannot be done analytically. Instead, noticing that we are dealing with a mixture model (granted, one where the number of mixture components is super-exponential), we propose performing the parameter optimization with the EM algorithm [6].^2

^2 The possibility of an EM algorithm for ensemble-of-trees with categorical data was mentioned in [1], but the idea was abandoned due to a concern about the M-step.

Algorithm TREEAVERAGEDCOPULADENSITY(D, c)
Inputs: A complete data set D of d-component real-valued vectors; a set of bivariate parametric copula densities $c = \{c_{uv} : u, v \in V\}$

1. Estimate univariate marginals $\hat{F}_v(X_v)$ for all components $v \in V$, treating all components independently.
2. Replace D with A consisting of vectors $a^n = (\hat{F}_1(x^n_1), \ldots, \hat{F}_d(x^n_d))$ for each vector $x^n$ in D.
3. Initialize $\beta$ and $\theta$.
4. 
Run until convergence (as determined by the change in log-likelihood, Equation 8):
• E-step: For all vectors $a^n$ and pairs $\{u,v\}$, compute $P(\{u,v\} \in E \mid a^n, \beta, \theta)$.
• M-step:
– Update $\beta$ with gradient ascent.
– Update $\theta_{uv}$ for all pairs by setting the partial derivative with respect to the parameters of $\theta_{uv}$ (Equation 9) to zero and solving the corresponding equations.

Output: Denoting $a_u = \hat{F}_u(x_u)$ and $a_v = \hat{F}_v(x_v)$, $\hat{p}(x) = \left[\prod_{v \in V} \hat{p}_v(x_v)\right] \frac{|L^\star(\beta c(a))|}{|L^\star(\beta)|}$.

Figure 1: Algorithm for estimation of a pdf with tree-averaged copulas.

While there are $d^{d-2}$ possible mixture components (spanning trees), in the E-step we only need to compute the posterior probabilities for $d(d-1)/2$ edges. Each step of EM consists of finding parameters $\beta'$, $\theta'$ maximizing the expected joint log-likelihood $M(\beta', \theta'; \beta, \theta)$ given current parameter values $\beta$, $\theta$, where
$$M(\beta', \theta'; \beta, \theta) = \sum_{n=1}^{N} \sum_{E^n \in \mathcal{E}} P(E^n \mid a^n, \beta, \theta) \ln\left[P(E^n \mid \beta')\, c(a^n \mid E^n, \theta')\right]$$
$$= \sum_{\{u,v\}} \sum_{n=1}^{N} s_n(\{u,v\}) \left(\ln \beta'_{uv} + \ln c_{uv}(a^n_u, a^n_v \mid \theta'_{uv})\right) - N \ln\left|L^\star(\beta')\right|;$$
$$s_n(\{u,v\}) = \sum_{E \in \mathcal{E}:\ \{u,v\} \in E} P(E \mid a^n, \beta, \theta), \quad P(E \mid a^n, \beta, \theta) = \frac{\prod_{\{u,v\} \in E} \beta_{uv}\, c_{uv}(a^n_u, a^n_v \mid \theta_{uv})}{|L^\star(\beta c(a^n))|}.$$
The probability distribution $P(E \mid a^n, \beta, \theta)$ is of the same form as the tree prior, so to compute $s_n(\{u,v\})$ one needs to compute the sum of the probabilities of all trees containing edge $\{u,v\}$.

Theorem 3. Let $P(E \mid \beta)$ be the tree prior defined in Equation 7. Let $Q(\beta) = (L^\star(\beta))^{-1}$ where $L^\star$ is obtained by removing row and column $w$ from $L$. 
Then
$$\sum_{E \in \mathcal{E}:\ \{u,v\} \in E} P(E \mid \beta) = \begin{cases} \beta_{uv}\left(Q_{uu}(\beta) + Q_{vv}(\beta) - 2Q_{uv}(\beta)\right) & u \neq v,\ u \neq w,\ v \neq w, \\ \beta_{uw}\, Q_{uu}(\beta) & v = w, \\ \beta_{wv}\, Q_{vv}(\beta) & u = w. \end{cases}$$

As a consequence of Theorem 3, for each $a^n$, all $d(d-1)/2$ edge probabilities $s_n(\{u,v\})$ can be computed simultaneously with the time complexity of a single $(d-1) \times (d-1)$ matrix inversion, $O(d^3)$. Assuming a candidate bivariate copula $c_{uv}$ has one free parameter $\theta_{uv}$, $\theta_{uv}$ can be optimized by setting
$$\frac{\partial M(\beta', \theta'; \beta, \theta)}{\partial \theta'_{uv}} = \sum_{n=1}^{N} s_n(\{u,v\})\, \frac{\partial \ln c_{uv}(a^n_u, a^n_v;\, \theta'_{uv})}{\partial \theta'_{uv}} \qquad (9)$$
to 0. (See [12] for more details.) The parameters of the tree prior can be updated by maximizing
$$\sum_{\{u,v\}} \left(\frac{1}{N} \sum_{n=1}^{N} s_n(\{u,v\})\right) \ln \beta'_{uv} - \ln\left|L^\star(\beta')\right|,$$
an expression concave in $\ln \beta'_{uv}\ \forall \{u,v\}$. $\beta'$ can be updated using a gradient ascent algorithm on $\ln \beta'_{uv}\ \forall \{u,v\}$, with time complexity $O(d^3)$ per iteration. The outline of the EM algorithm is shown in Figure 1. Assuming the complexity of each bivariate copula update is $O(N)$, the time complexity of each EM iteration is $O(N d^3)$.

The EM algorithm can easily be transferred to tree averaging for categorical data. The E-step does not change, and in the M-step, the parameters for the univariate marginals are updated ignoring bivariate terms. Then, the parameters of the bivariate distributions for each edge are updated constrained on the new values of the parameters for the univariate distributions. 
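The edge probabilities of Theorem 3 can be sketched as follows (illustrative code, not the paper's; the removed vertex w is taken to be the last one). A useful sanity check is that the edge probabilities of any spanning-tree distribution must sum to d − 1, the number of edges in every tree.

```python
import numpy as np

def edge_probabilities(beta):
    """P({u,v} in E | beta) for all pairs, via Theorem 3, with
    Q = inverse of the Laplacian with row/column w removed (w = last vertex)."""
    d = beta.shape[0]
    L = np.diag(beta.sum(axis=1)) - beta
    Q = np.linalg.inv(L[:-1, :-1])
    P = np.zeros((d, d))
    for u in range(d):
        for v in range(u + 1, d):
            if v == d - 1:
                # case v = w of Theorem 3
                P[u, v] = beta[u, v] * Q[u, u]
            else:
                # case u != w, v != w
                P[u, v] = beta[u, v] * (Q[u, u] + Q[v, v] - 2 * Q[u, v])
            P[v, u] = P[u, v]
    return P

rng = np.random.default_rng(2)
W = rng.uniform(0.5, 2.0, size=(5, 5))
beta = np.triu(W, 1) + np.triu(W, 1).T
P = edge_probabilities(beta)
```

In the E-step one would call this with the matrix $\beta c(a^n)$ in place of $\beta$ to obtain all $s_n(\{u,v\})$ for a data vector at the cost of one matrix inversion. For the uniform prior on 3 vertices, each edge appears in 2 of the 3 spanning trees, so every edge probability is 2/3.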
While the algorithm does not guarantee maximization of the expected log-likelihood, it nonetheless worked well in our experiments.

5 Experiments

5.1 MAGIC Gamma Telescope Data Set

First, we tested our tree-averaged density estimator on the MAGIC Gamma Telescope Data Set from the UCI Machine Learning Repository [13]. We considered only the examples from the class gamma (signal); this set consists of 12332 vectors of d = 10 real-valued components. The univariate marginals are not Gaussian (some are bounded; some have multiple modes). Fig. 2 shows the average log-likelihood of models trained on training sets with N = 50, 100, 200, 500, 1000, 2000, 5000, 10000 and evaluated on 2000-example test sets (averaged over 10 training and test sets). The marginals were estimated using Gaussian kernel density estimators (KDE) with rule-of-thumb bandwidth selection. All of the models except for the full Gaussian have the same marginals and differ only in the multivariate dependence (copula). As expected from the curse of dimensionality, product KDE improves logarithmically with the amount of data. Not only are the marginals not Gaussian (evidenced by a Gaussian copula with KDE marginals outperforming a Gaussian distribution), the multivariate dependence is also not Gaussian, evidenced by a tree-structured Frank copula outperforming a tree-structured and a full Gaussian copula. However, model averaging even with the wrong dependence model (tree-averaged Gaussian copula) yields superior performance.

5.2 Multi-Site Precipitation Modeling

We applied the tree-averaged framework to the problem of modeling daily rainfall amounts for a regional spatial network of stations. The task is to build a generative model capturing the spatial and temporal properties of the data. 
This model can be used in at least two ways: first, to sample sequences from it and use them as inputs for other models, e.g., crop models; and second, as a descriptive model of the data. Hidden Markov models (possibly with non-homogeneous transitions) are frequently used for this task (e.g., [14]), with the transition distribution responsible for modeling the temporal dependence, and the emission distributions capturing most of the spatial dependence. Additionally, HMMs can be viewed as assigning daily rainfall patterns to \u201cweather states\u201d (or corresponding emission components), and both these states (as described by either their parameters or the statistics of the patterns associated with them) and their temporal evolution often offer useful synoptic insight. We will use HMMs as the wrapper model with tree-averaged (and tree-structured) distributions to model the emission components.

The distribution of daily rainfall amounts for any given station can be viewed as a non-overlapping mixture with one component corresponding to zero precipitation, and the other component to positive precipitation. 
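The single-station mixed distribution described above can be sketched directly. This is illustrative code, not the paper's, assuming an exponential density for positive amounts (the choice the paper makes below); all names are assumptions.

```python
import math
import random

def station_density(r, pi_v, lam_v):
    """Mixed density for daily rainfall at one station: a point mass
    1 - pi_v at r = 0 (dry day), and pi_v * lam_v * exp(-lam_v * r)
    for r > 0 (wet day with exponential amount)."""
    if r == 0.0:
        return 1.0 - pi_v
    return pi_v * lam_v * math.exp(-lam_v * r)

def sample_station(rng, pi_v, lam_v):
    """Draw one day of rainfall from the mixture."""
    if rng.random() >= pi_v:
        return 0.0                  # dry day
    return rng.expovariate(lam_v)   # wet day: exponential amount

rng = random.Random(0)
draws = [sample_station(rng, 0.3, 2.0) for _ in range(20000)]
wet_fraction = sum(1 for x in draws if x > 0) / 20000
```

Sampling and density agree by construction: the fraction of wet days in a long simulation approaches pi_v.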
For a station $v$, let $r_v$ be the precipitation amount, $\pi_v$ be the probability of positive precipitation, and let $f_v(r_v \mid \lambda_v)$ be a probability density function for amounts given positive precipitation:
$$p(r_v \mid \pi_v, \lambda_v) = \begin{cases} 1 - \pi_v & r_v = 0, \\ \pi_v f_v(r_v \mid \lambda_v) & r_v > 0. \end{cases}$$
For a pair of stations $\{u,v\}$, let $\pi_{uv}$ denote the probability of simultaneous positive amounts and $c_{uv}(F_u(r_u \mid \lambda_u), F_v(r_v \mid \lambda_v) \mid \theta_{uv})$ denote the copula density for simultaneous positive amounts; then
$$p(r_u, r_v \mid \pi_u, \pi_v, \pi_{uv}, \lambda_u, \lambda_v) = \begin{cases} 1 - \pi_u - \pi_v + \pi_{uv} & r_u = 0,\ r_v = 0, \\ (\pi_v - \pi_{uv}) f_v(r_v \mid \lambda_v) & r_u = 0,\ r_v > 0, \\ (\pi_u - \pi_{uv}) f_u(r_u \mid \lambda_u) & r_u > 0,\ r_v = 0, \\ \pi_{uv} f_u(r_u) f_v(r_v)\, c_{uv}(F_u(r_u), F_v(r_v)) & r_u > 0,\ r_v > 0. \end{cases}$$
We can now define tree-structured and tree-averaged probability distributions, $p_t(r)$ and $p_{ta}(r)$, respectively, over the amounts:
$$\omega_{uv}(r) = \frac{p(r_u, r_v \mid \pi_u, \pi_v, \pi_{uv}, \lambda_u, \lambda_v)}{p(r_u \mid \pi_u, \lambda_u)\, p(r_v \mid \pi_v, \lambda_v)}, \quad p_t(r \mid \pi, \lambda, \theta, E) = \left[\prod_{v \in V} p(r_v \mid \pi_v)\right] \prod_{\{u,v\} \in E} \omega_{uv}(r),$$
$$p_{ta}(r \mid \pi, \lambda, \theta, \beta) = \sum_{E \in \mathcal{E}} P(E \mid \beta)\, p_t(r \mid \pi, \lambda, \theta, E) = \left[\prod_{v \in V} p(r_v \mid \pi_v)\right] \frac{|L^\star(\beta \omega(r))|}{|L^\star(\beta)|}.$$
We employ univariate exponential distributions $f_v(r_v) = \lambda_v e^{-\lambda_v r_v}$ and bivariate Gaussian copulas
$$c_{uv}(a_u, a_v) = \frac{1}{\sqrt{1 - \theta_{uv}^2}} \exp\left(-\frac{\theta_{uv}^2 \Phi^{-1}(a_u)^2 + \theta_{uv}^2 \Phi^{-1}(a_v)^2 - 2\theta_{uv} \Phi^{-1}(a_u)\Phi^{-1}(a_v)}{2(1 - \theta_{uv}^2)}\right).$$
We applied the models to a data set collected from 30 stations in a region of Southeastern Australia (Fig. 3) for 1986-2005, April-October (20 sequences of 214 30-dimensional vectors each). 
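The bivariate Gaussian copula density above can be cross-checked numerically: by Equation 2 it must equal the bivariate standard normal pdf divided by the product of its univariate standard normal pdfs. A sketch using only the Python standard library (illustrative, not the paper's code):

```python
import math
from statistics import NormalDist

std_normal = NormalDist()

def gaussian_copula_density(a_u, a_v, theta):
    """Bivariate Gaussian copula density with correlation parameter theta."""
    x, y = std_normal.inv_cdf(a_u), std_normal.inv_cdf(a_v)
    num = theta**2 * x**2 + theta**2 * y**2 - 2 * theta * x * y
    return math.exp(-num / (2 * (1 - theta**2))) / math.sqrt(1 - theta**2)

def bivariate_normal_pdf(x, y, rho):
    """Standard bivariate normal pdf with correlation rho."""
    z = (x**2 - 2 * rho * x * y + y**2) / (2 * (1 - rho**2))
    return math.exp(-z) / (2 * math.pi * math.sqrt(1 - rho**2))

# cross-check c(a_u, a_v) = p(x, y) / (p(x) p(y))  (Equation 2)
theta, x, y = 0.6, 0.4, -1.1
a_u, a_v = std_normal.cdf(x), std_normal.cdf(y)
ratio = bivariate_normal_pdf(x, y, theta) / (std_normal.pdf(x) * std_normal.pdf(y))
```

At theta = 0 the expression reduces to the independence copula, whose density is identically 1.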
We used a 5-state HMM with three different types of emission distributions: tree-averaged ($p_{ta}$), tree-structured ($p_t$), and conditionally independent (the first term of $p_t$ and $p_{ta}$). We will refer to these models as HMM-TA, HMM-Tree, and HMM-CI, respectively. For HMM-TA, we reduced the number of free parameters by only allowing edges between stations adjacent to each other as determined by the Delaunay triangulation (Fig. 3). We also did not learn the edge weights ($\beta$), setting them to 1 for selected edges and to 0 for the rest. To make sure that the models do not overfit, we computed their out-of-sample log-likelihood with cross-validation, leaving out one year at a time (not shown). (5 states were chosen because the leave-one-year-out log-likelihood starts to flatten out for HMM-TA at 5 states.) The resulting log-likelihoods divided by the number of days and stations are -0.9392, -0.9522, and -1.0222 for HMM-TA, HMM-Tree, and HMM-CI, respectively. To see how well the models capture the properties of the data, we trained each model on the whole data set (with 50 restarts of EM), and then simulated 500 sequences of length 214. We are particularly interested in how well they measure pairwise dependence; we concentrate on two measures: the log-odds ratio for occurrence and Kendall's $\tau$ measure of concordance for pairs when both stations had positive amounts. Both are shown in Fig. 4. 
Both plots suggest that HMM-CI underestimates the pairwise dependence for strongly dependent pairs (as indicated by its trend to predict lower absolute values for log-odds and concordance); HMM-Tree estimates the dependence correctly mostly for strongly dependent pairs (as indicated by good prediction for high values), but underestimates it for moderate dependence; and HMM-TA performs the best for most pairs except for the ones with very strong dependence.

Acknowledgements

This work has been supported by the Alberta Ingenuity Fund through the AICML. We thank Stephen Charles (CSIRO, Australia) for providing us with precipitation data.

References

[1] M. Meilă and T. Jaakkola. Tractable Bayesian learning of tree belief networks. Statistics and Computing, 16(1):77-92, 2006.
[2] H. Joe. Multivariate Models and Dependence Concepts, volume 73 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, 1997.
[3] R. B. Nelsen. An Introduction to Copulas. Springer Series in Statistics. Springer, 2nd edition, 2006.
[4] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT-14(3):462-467, May 1968.
[5] M. Meilă and M. I. Jordan. Learning with mixtures of trees. 
Journal of Machine Learning Research, 1(1):1-48, October 2000.

Figure 2: Averaged test set per-feature log-likelihood for MAGIC data: independent KDE (black solid), product KDE (blue dashed o), Gaussian (brown solid diamond), Gaussian copula (orange solid +), Gaussian tree-copula (magenta dashed x), Frank tree-copula (blue dashed square), Gaussian tree-averaged copula (red solid x).

Figure 3: Station map with station locations (red dots), the coastline, and the pairs of stations selected according to the Delaunay triangulation (dotted lines).

Figure 4: Scatter-plots of log-odds ratios for occurrence (left) and Kendall's tau measure of concordance (right) for all pairs of stations, for the historical data vs. HMM-TA (red o), HMM-Tree (blue x), and HMM-CI (green dots).

[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B-Methodological, 39(1):1-38, 1977.
[7] T. Bedford and R. M. Cooke. Vines – a new graphical model for dependent random variables. The Annals of Statistics, 30(4):1031-1068, 2002.
[8] H. Joe and J. J. Xu. The estimation method of inference functions for margins for multivariate models. Technical report, Department of Statistics, University of British Columbia, 1996.
[9] C. Genest, K. Ghoudi, and L.-P. Rivest. A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika, 82:543-552, 1995.
[10] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc., San Francisco, California, 1988.
[11] J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society Series B-Methodological, 36(2):192-236, 1974.
[12] S. Kirshner. 
Learning with tree-averaged densities and distributions. Technical Report TR 08-01, Department of Computing Science, University of Alberta, 2008.
[13] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.
[14] E. Bellone. Nonhomogeneous Hidden Markov Models for Downscaling Synoptic Atmospheric Patterns to Precipitation Amounts. PhD thesis, Department of Statistics, University of Washington, 2000.
", "award": [], "sourceid": 1080, "authors": [{"given_name": "Sergey", "family_name": "Kirshner", "institution": null}]}