{"title": "Distance-Based Network Recovery under Feature Correlation", "book": "Advances in Neural Information Processing Systems", "page_first": 775, "page_last": 783, "abstract": "We present an inference method for Gaussian graphical models when only pairwise distances of n objects are observed. Formally, this is a problem of estimating an n x n covariance matrix from the Mahalanobis distances dMH(xi, xj), where object xi lives in a latent feature space. We solve the problem in fully Bayesian fashion by integrating over the Matrix-Normal likelihood and a Matrix-Gamma prior; the resulting Matrix-T posterior enables network recovery even under strongly correlated features. Hereby, we generalize TiWnet, which assumes Euclidean distances with strict feature independence. In spite of the greatly increased flexibility, our model neither loses statistical power nor entails more computational cost. We argue that the extension is highly relevant as it yields significantly better results in both synthetic and real-world experiments, which is successfully demonstrated for a network of biological pathways in cancer patients.", "full_text": "Distance-Based Network Recovery\n\nunder Feature Correlation\n\nDavid Adametz, Volker Roth\n\nDepartment of Mathematics and Computer Science\n{david.adametz,volker.roth}@unibas.ch\n\nUniversity of Basel, Switzerland\n\nAbstract\n\nWe present an inference method for Gaussian graphical models when only pair-\nwise distances of n objects are observed. Formally, this is a problem of esti-\nmating an n \u00d7 n covariance matrix from the Mahalanobis distances dMH(xi, xj),\nwhere object xi lives in a latent feature space. We solve the problem in fully\nBayesian fashion by integrating over the Matrix-Normal likelihood and a Matrix-\nGamma prior; the resulting Matrix-T posterior enables network recovery even\nunder strongly correlated features. 
Hereby, we generalize TiWnet [19], which assumes Euclidean distances with strict feature independence. In spite of the greatly increased flexibility, our model neither loses statistical power nor entails more computational cost. We argue that the extension is highly relevant as it yields significantly better results in both synthetic and real-world experiments, which is successfully demonstrated for a network of biological pathways in cancer patients.

1 Introduction

In this paper we introduce the Translation-invariant Matrix-T process (TiMT) for estimating Gaussian graphical models (GGMs) from pairwise distances. The setup is particularly interesting, as many applications only allow distances to be observed in the first place. Hence, our approach is capable of inferring a network of probability distributions, of strings, graphs or chemical structures.

We begin by stating the setup of classical GGMs: the basic building block is the matrix X̃ ∈ R^{n×d} which follows the Matrix-Normal distribution [8]

X̃ ∼ N(M, Ψ ⊗ I_d).    (1)

The goal is to identify Ψ^{-1}, which encodes the desired dependence structure. More specifically, two objects (= rows) are conditionally independent given all others if and only if Ψ^{-1} has a corresponding zero element. This is often depicted as an undirected graph (see Figure 1), where the objects are vertices and (missing) edges represent their conditional (in)dependencies.

Figure 1: Precision matrix Ψ^{-1} and its interpretation as a graph (self-loops are typically omitted).

Prabhakaran et al. [19] formulated the Translation-invariant Wishart Network (TiWnet), which treats X̃ as a latent matrix and only requires the squared Euclidean distances D_ij = d_E(x̃_i, x̃_j)^2, where x̃_i ∈ R^d is the ith row of X̃. 
Also, S_E = X̃X̃^T refers to the n × n inner-product matrix, which is linked via D_ij = S_E,ii + S_E,jj − 2 S_E,ij. Importantly, the transition to distances implies that means of the form M = 1_n w^T with w ∈ R^d are not identifiable anymore. In contrast to the above, we start off by assuming a matrix

X := X̃ Σ^{1/2} ∼ N(M, Ψ ⊗ Σ),    (2)

where the columns (= features) are correlated as defined by Σ ∈ R^{d×d}. Due to this change, the inner-product becomes S_MH = XX^T = X̃ΣX̃^T. If we directly observed X as in classical GGMs, then Σ could be removed to recover X̃; however, in the case of distances, the impact of Ψ and Σ is inevitably mixed. A suitable assumption is therefore the squared Mahalanobis distance

D_ij = d_MH(x_i, x_j)^2 = (x̃_i − x̃_j)^T Σ (x̃_i − x̃_j),    (3)

which dramatically increases the degrees of freedom for inference about Ψ. Recall that in our setting only D is observed and the following is latent: d, X, X̃, S := S_MH, Σ and M = 1_n w^T.

The main difficulty comes from the inherent mixing of Ψ and Σ in the distances, which obscures what is relevant in GGMs. For example, if we naively enforce Σ = I_d, then all of the information is solely attributed to Ψ. However, in applications where the true Σ ≠ I_d, we would consequently infer false structure, up to a degree where the result is completely misled by feature correlation.

In pure Bayesian fashion, we specify a prior belief for Σ and average over all realizations weighted by the Gaussian likelihood. For a conjugate prior, this leads to the Matrix-T distribution, which forms the core part of our approach. 
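As a quick illustration of the setup (this is our own sketch, not code from the paper), the squared Mahalanobis distances of Eq. (3) are exactly the distances induced by the inner-product matrix S_MH = X̃ΣX̃^T via D_ij = S_ii + S_jj − 2 S_ij:

```python
import numpy as np

# Sketch: verify that Eq. (3) agrees with D_ij = S_ii + S_jj - 2 S_ij
# for S = X_tilde @ Sigma @ X_tilde.T. All names are ours.
rng = np.random.default_rng(0)
n, d = 5, 8

X_tilde = rng.standard_normal((n, d))   # latent object matrix
G = rng.standard_normal((d, d))
Sigma = G @ G.T / d                     # positive semi-definite feature covariance

S = X_tilde @ Sigma @ X_tilde.T         # inner-product matrix S_MH (n x n)
D = np.diag(S)[:, None] + np.diag(S)[None, :] - 2.0 * S

# direct evaluation of Eq. (3): (x_i - x_j)^T Sigma (x_i - x_j)
diff = X_tilde[:, None, :] - X_tilde[None, :, :]
D_direct = np.einsum('ijk,kl,ijl->ij', diff, Sigma, diff)

assert np.allclose(D, D_direct)
```

The check also makes the information loss explicit: D is invariant under row translations of X̃, which is why M = 1_n w^T can never be recovered.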
The resulting model generalizes TiWnet and is flexible enough to account for arbitrary feature correlation. In the following, we briefly describe a practical application with all the above properties.

Example: A Network of Biological Pathways  Using DNA microarrays, it is possible to measure the expression levels of thousands of genes in a patient simultaneously; however, each gene is highly prone to noise and only weakly informative when analyzed on its own. To solve this problem, the focus is shifted towards pathways [5], which can be seen as (non-disjoint) groups of genes that contribute to high-level biological processes. The underlying idea is that genes exhibit visible patterns only when paired with functionally related entities. Hence, every pathway has a characteristic distribution of gene expression values, which we compare via the so-called Bhattacharyya distance [2, 11]. Our goal is then to derive a network between pathways, but what if the patients (= features) from whom we obtained the cells were correlated (sex, age, treatment, ...)?

Figure 2: The big picture. Different assumptions about M and Σ lead to different models.

Related work  Inference in GGMs is generally aimed at Ψ^{-1} and therefore every approach relies on Eq. (1) or (2); however, they differ in their assumptions about M and Σ. Figure 2 puts our setting into a larger context and describes all possible configurations in a single scheme. Throughout the paper, we assume there are n objects and an unknown number of d latent features. Since our inputs are pairwise distances D, the mean is of the form M = 1_n w^T, but at the same time, we do not impose any restriction on Σ. 
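For concreteness, when each pathway is summarized by the mean and variance of its gene expression values (as in Section 4.2), the Bhattacharyya distance between two such univariate Gaussians has a simple closed form. A minimal sketch (function name is ours):

```python
import math

def bhattacharyya_gauss(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two univariate Gaussians
    N(mu1, var1) and N(mu2, var2)."""
    s = var1 + var2
    return (0.25 * (mu1 - mu2) ** 2 / s
            + 0.5 * math.log(s / (2.0 * math.sqrt(var1 * var2))))

# identical distributions have distance 0, and the distance is symmetric
print(bhattacharyya_gauss(0.0, 1.0, 0.0, 1.0))  # 0.0
```

Collecting these values for all pathway pairs yields exactly the kind of distance matrix D the method takes as its only input.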
A complementary assumption is made in TiWnet [19], which enforces strict feature independence.

For the models based on matrix X, the mean matrix is defined as M = v 1_d^T with v ∈ R^n. This choice is neither better nor worse; it does not rely on pairwise distances and hence addresses a different question. By further assuming Σ = I_d, we arrive at the graphical LASSO (gL) [7] that optimizes the likelihood under an L1 penalty. The Transposable Regularized Covariance Model (TRCM) [1] is closely related, but additionally allows arbitrary Σ and alternates between estimating Ψ^{-1} and Σ^{-1}. The basic configuration for S, M = 0_{n×d} and Σ = I_d, also leads to the model of gL; however, this rarely occurs in practice.

2 Model

On the most fundamental level, our task deals with incorporating invariances into the Gaussian model, meaning it must not depend on any unrecoverable feature information, i.e. Σ, M = 1_n w^T (vanishes for distances) and d. The starting point is the log-likelihood of Eq. (2),

ℓ(W, Σ, M; X) = (d/2) log|W| − (n/2) log|Σ| − (1/2) tr(W (X − M) Σ^{-1} (X − M)^T),    (4)

where we used the shorthand W := Ψ^{-1}. In the literature, there exist two conceptually different approaches to achieve invariances: the first is the classical marginal likelihood [12], closely related to the profile likelihood [16], where a nuisance parameter is either removed by a suitable statistic or replaced by its corresponding maximum likelihood estimate [9]. The second approach follows the Bayesian marginal likelihood by introducing a prior and integrating over the product. Hereby, the posterior is a weighted average, where the weights are distributed according to prior belief. The following sections will discuss the required transformations of Eq. 
(4).

2.1 Marginalizing the Latent Feature Correlation

2.1.1 Classical Marginal Likelihood

Let us begin with the attempt to remove Σ by explicit reconstruction, as done in McCullagh [13]. Computing the derivative of Eq. (4) with respect to Σ and setting it to zero, we arrive at the maximum likelihood estimate Σ̂ = (1/n)(X − M)^T W (X − M), which leads to

ℓ(W, M; X, Σ̂) = (d/2) log|W| − (n/2) log|Σ̂| − (1/2) tr(W (X − M) Σ̂^{-1} (X − M)^T)    (5)
             = (d/2) log|W| − (n/2) log|W (X − M)(X − M)^T|.    (6)

Eq. (6) does not depend on Σ anymore; however, note that there is a hidden implication in Eq. (5): Σ̂^{-1} only exists if Σ̂ has full rank, or equivalently, if d ≤ n. Further, even d = n must be excluded, since Eq. (6) would become independent of X otherwise. McCullagh [13] analyzed the Fisher information for varying d and concluded that this model is "a complete success" for d ≪ n, but "a spectacular failure" if d → n. Since distance matrices typically require d ≥ n, the approach does not qualify.

2.1.2 Bayesian Marginal Likelihood

Iranmanesh et al. [10] analyzed the Matrix-Normal likelihood in Eq. (4) in conjunction with an Inverse Matrix-Gamma (IMG) prior, the latter being a generalization of an inverse Wishart prior. It is denoted by Σ ∼ IMG(α, β, Ω), where α > (1/2)(d − 1) and β > 0 are shape and scale parameters, respectively. Ω is a d × d positive-definite matrix reflecting the expectation of Σ. 
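The rank argument above is easy to verify numerically. A small sketch (our own, with W = I_n and M = 0 for simplicity) showing that Σ̂ cannot have full rank once d > n:

```python
import numpy as np

# Sigma_hat = (1/n) (X - M)^T W (X - M) is d x d but has rank at most n,
# so for d > n it is singular and the classical marginal likelihood breaks down.
# Here W = I and M = 0 (our simplifying assumptions).
rng = np.random.default_rng(1)
n, d = 5, 20
X = rng.standard_normal((n, d))

Sigma_hat = X.T @ X / n                  # 20 x 20, but rank <= 5
print(np.linalg.matrix_rank(Sigma_hat))  # at most n = 5
```

Since distance data typically corresponds to d ≥ n, this confirms why the classical route is not viable here.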
This combination leads to the so-called (Generalized) Matrix T-distribution¹ X ∼ T(α, β, M, W, Ω) with likelihood

ℓ(W, M; α, β, X, Ω) = (d/2) log|W| − (α + n/2) log|I_n + (β/2) W (X − M) Ω^{-1} (X − M)^T|.    (7)

Compared to the classical marginal likelihood, the obvious differences are I_n and the scalar β, which can be seen as regularization. The limit of β → ∞ implies that no regularization takes place and, interestingly, this likelihood resembles Eq. (6). The other extreme, β → 0, leads to a likelihood that is independent of X. Another observation is that the regularization ensures full rank of I_n + (β/2) W (X − M) Ω^{-1} (X − M)^T, hence any d ≥ 1 is valid.

At this point, the Bayesian approach reveals a fundamental advantage: for TiWnet, the distance matrix enforced independent features, but now we are in a position to maintain the full model while adjusting the hyperparameters instead. 

¹Choosing an inverse Wishart prior for Σ results in the standard Matrix T-distribution; however, its variance can only be controlled by an integer. This is why the Generalized Matrix T-distribution is preferred.
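Eq. (7) only requires log-determinants, so it is cheap to evaluate. A sketch of a direct evaluation (function and parameter names are ours, not from the paper):

```python
import numpy as np

def matrix_t_loglik(W, X, M, Omega, alpha, beta):
    """Log-likelihood of Eq. (7), up to additive constants."""
    n, d = X.shape
    R = X - M
    A = np.eye(n) + 0.5 * beta * W @ R @ np.linalg.inv(Omega) @ R.T
    return (0.5 * d * np.linalg.slogdet(W)[1]
            - (alpha + 0.5 * n) * np.linalg.slogdet(A)[1])

rng = np.random.default_rng(2)
n, d = 4, 6
X = rng.standard_normal((n, d))
ll = matrix_t_loglik(np.eye(n), X, np.zeros((n, d)), np.eye(d),
                     alpha=0.5 * d, beta=1.0)  # alpha > (d - 1)/2 as required
print(np.isfinite(ll))  # True
```

Note how the I_n term keeps the determinant well-defined for any d ≥ 1, in contrast to Eq. (6).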
We propose Ω ≡ I_d, meaning the prior of Σ will be centered at independent latent features, which is a common and plausible choice before observing any data. The flexibility ultimately comes from α and β when defining a flat prior, which means deviations from independent features are explicitly allowed.

2.2 Marginalizing the Latent Means

The fact that we observe a distance matrix D implies that information about the (feature) coordinate system is irrevocably lost, namely M = 1_n w^T, which is why the means must be marginalized. We briefly discuss the necessary steps, but for an in-depth review please refer to [19, 14, 17]. Following the classical marginalization, it suffices to define a projection L ∈ R^{(n−1)×n} with the property L 1_n = 0_{n−1}. In other words, all biases of the form 1_n w^T are mapped to the nullspace of L. The Matrix T-distribution under affine transformations [10, Theorem 3.2] reads LX ∼ T(α, β, LM, L Ψ L^T, Ω) and in our case (Ω = I_d, LM = L 1_n w^T = 0_{(n−1)×d}), we have

ℓ(Ψ; α, β, LX) = −(d/2) log|L Ψ L^T| − (α + (n−1)/2) log|I_n + (β/2) L^T (L Ψ L^T)^{-1} L X X^T|.    (8)

Note that due to the statistic LX, the likelihood is constant over all X (or S) mapping to the same D. As we are not interested in any specifics about L other than its nullspace, we replace the image with the kernel of the projection and define the matrix Q := I_n − (1_n^T W 1_n)^{-1} 1_n 1_n^T W. Using the identities Q S Q^T = −(1/2) Q D Q^T and Q^T W Q = W Q, we can finally write the likelihood as

ℓ(W; α, β, D, 1_n) = (d/2) log|W| − (d/2) log(1_n^T W 1_n) − (α + (n−1)/2) log|I_n − (β/4) W Q D|,    (9)

which accounts for arbitrary latent feature correlation Σ and all mean matrices M = 1_n w^T.

In hindsight, the combination of Bayesian and classical marginal likelihood might appear arbitrary, but both strategies have their individual strengths. The mean matrix M, for example, is limited to a single direction in an n-dimensional space, therefore the statistic LX represents a convenient solution. In contrast, the rank-d matrix Σ affects a much larger spectrum that cannot be handled in the same fashion; ignoring this leads to a degenerate likelihood as previously shown. The problem is only tractable when specifying a prior belief for Bayesian marginalization. On a side note, the Bayesian posterior includes the classical marginal likelihood for the choice of an improper prior [4], which could be seen in the Matrix-T likelihood, Eq. (7), in the limit of β → ∞.

3 Inference

The previous section developed a likelihood for GGMs that conforms to all aspects of information loss inherent to distance matrices. As our interest lies in the network-defining W, the following will discuss Bayesian inference using a Markov chain Monte Carlo (MCMC) sampler.

Hyperparameters α, β and d  At some point in every Bayesian analysis, all hyperparameters need to be specified in a sensible manner. Currently, the occurrence of d in Eq. (9) is particularly problematic, since (i) the number of latent features is unknown and (ii) it critically affects the balance between determinants. To resolve this issue, recall that α must satisfy α > (1/2)(d − 1), which can alternatively be expressed as α = (1/2)(vd − n + 1) with v > 1 + (n−2)/d. Thereby, we arrive at

ℓ(W; v, β, D, 1_n) = (d/2) log|W| − (d/2) log(1_n^T W 1_n) − (vd/2) log|I_n − (β/4) W Q D|,    (10)

where d now influences the likelihood on a global level and can be used as a temperature reminiscent of simulated annealing techniques for optimization. In more detail, we initialize the MCMC sampler with a small value of d and increase it slowly, until the acceptance ratio is below, say, 1 percent. After that event, all samples of W are averaged to obtain the final network.

Algorithm 1 One loop of the MCMC sampler
Input: distance matrix D, temperature d and fixed v > 1 + (n−2)/d
for i = 1 to n do
    W^(p) ← W    ((p) refers to proposal)
    Uniformly select node k ≠ i and sample element W^(p)_ik from {−1, 0, +1}
    Set W^(p)_ki ← W^(p)_ik and update W^(p)_ii and W^(p)_kk accordingly
    Compute posterior in Eq. (12) and acceptance of W^(p)
    if u ∼ U(0, 1) < acceptance then W ← W^(p) end if
end for
Sample proposal β^(p) ∼ Γ(β_shape, β_scale)
Compute posterior in Eq. (12) and acceptance of β^(p)
if u ∼ U(0, 1) < acceptance then β ← β^(p) end if

Parameters v and β still play a crucial role in the process of inference, as they distribute the probability mass across all latent feature correlations and effectively control the scope of plausible Σ. Upon closer inspection, we gain more insight by the variance of the Matrix-T distribution,

2(Ψ ⊗ Ω) / (β(vd − 2n + 1)),    (11)

which is maximal when β and v are jointly small. 
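The marginal likelihood in Eq. (10) can be evaluated directly with standard linear algebra; the following naive O(n³) sketch (our own code, with illustrative parameter values) is the quantity the sampler scores each proposal with:

```python
import numpy as np

def loglik_eq10(W, D, beta, v, d):
    """Naive evaluation of Eq. (10); the paper's sampler replaces the
    determinants by cheaper rank-1 QR updates."""
    n = W.shape[0]
    one = np.ones(n)
    s = one @ W @ one                              # 1^T W 1
    Q = np.eye(n) - np.outer(one, one @ W) / s     # Q = I - (1^T W 1)^{-1} 1 1^T W
    logdet = np.linalg.slogdet(np.eye(n) - 0.25 * beta * W @ Q @ D)[1]
    return (0.5 * d * np.linalg.slogdet(W)[1]
            - 0.5 * d * np.log(s)
            - 0.5 * v * d * logdet)

rng = np.random.default_rng(3)
n = 6
X = rng.standard_normal((n, 3))
S = X @ X.T
D = np.diag(S)[:, None] + np.diag(S)[None, :] - 2 * S  # squared Euclidean distances

ll = loglik_eq10(np.eye(n), D, beta=0.01, v=1.1, d=300)  # v > 1 + (n-2)/d
print(np.isfinite(ll))  # True
```

Note that the input is solely the distance matrix D; no latent X, Σ or M enters the computation.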
We aim for the most flexible solution, thus v is fixed at the smallest possible value and β is stochastically integrated out in a Metropolis-Hastings step. A suitable choice is a Gamma prior β ∼ Γ(β_shape, β_scale); its shape and scale must be chosen to be sufficiently flexible on the scale of the distance matrix at hand.

Priors for W  The prior for W is first and foremost required to be sparse and flexible. There are many valid choices, like spike and slab [15] or partial correlation [3], but we adapt the two-component scheme of TiWnet, which has computational advantages and enables symmetric random walks. The following briefly explains the construction:

Prior p1(W) defines a symmetric random matrix, where off-diagonal elements W_ij are uniform on {−1, 0, +1}, i.e. an edge with positive/negative weight or no edge. The diagonal is chosen such that W is positive definite: W_ii ← ε + Σ_{j≠i} |W_ij|. Although this only allows 3 levels, it proved to be sufficiently flexible in practice. Replacing it with more levels is possible, but conceptually identical.

The second component is a Laplacian p2(W | λ) ∝ exp(−λ Σ_{i=1}^n (W_ii − ε)) and induces sparsity. Here, the total number of edges in the network is penalized by the parameter λ > 0. Combining the likelihood of Eq. (10) and the above priors, the final posterior reads:

p(W | •) ∝ p(D | W, β, 1_n) p1(W) p2(W | λ) p3(β | β_shape, β_scale).    (12)

The full scheme of the MCMC sampler is reported in Algorithm 1.

Complexity Analysis  The runtime of Algorithm 1 is primarily determined by the repeated evaluation of the posterior in Eq. (12), which would require O(n^4) in the naive case of fully recomputing the determinants. 
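The diagonal rule W_ii ← ε + Σ_{j≠i} |W_ij| makes every draw from p1 strictly diagonally dominant and therefore positive definite (by Gershgorin's circle theorem, every eigenvalue is at least ε). A quick numerical check of this property (ε and sizes are our choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, eps = 10, 0.1

# symmetric off-diagonal entries from {-1, 0, +1}, as in prior p1
W = rng.integers(-1, 2, size=(n, n))
W = np.triu(W, 1)
W = (W + W.T).astype(float)

# diagonal rule of p1: W_ii = eps + sum_{j != i} |W_ij|
W[np.diag_indices(n)] = eps + np.abs(W).sum(axis=1)

print(np.linalg.eigvalsh(W).min() > 0)  # True (strict diagonal dominance)
```

This is what allows the sampler to flip single edges freely: after adjusting the two affected diagonal entries, the proposal W^(p) is guaranteed to remain a valid precision matrix.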
Every flip of an edge, however, only changes a maximum of 4 elements² in W, which gives rise to an elegant update scheme building on the QR decomposition.

²This also holds for more than 3 edge levels.

Theorem. One full loop in Algorithm 1 requires O(n³).

Proof. Due to the 3-level prior, there are only 6 possible flip configurations depending on the current edge between object i and j (2 examples depicted here for i = 1, j = 3):

ΔW := W^(p) − W  ∈  { [ −1 0 +1 ; 0 0 0 ; +1 0 −1 ], ..., [ 0 0 +2 ; 0 0 0 ; +2 0 0 ] }    (13)

An important observation is that ΔW can solely be expressed in terms of rank-1 matrices, in particular either uv^T or uv^T + ab^T. If we know the QR decomposition of W, then the decomposition of W^(p) can be found in O(n²). Consequently, its determinant is obtained by det(QR) = Π_{i=1}^n R_ii in O(n). Our goal is to exploit this property and express both determinants of the posterior as rank-1 updates to their existing QR decompositions. Restating the likelihood, we have

ℓ(W^(p); •) = (d/2) log|W^(p)| − (d/2) log(1_n^T W^(p) 1_n) − (vd/2) log|I_n − (β/4) W^(p) Q D|,    (14)

with det1 := |W^(p)| and det2 := |I_n − (β/4) W^(p) Q D|, where Q is built from the proposal W^(p). Updating det1 corresponds to either W^(p) = W + uv^T or W^(p) = W + uv^T + ab^T as explained in Eq. (13), thus leading to O(n²). We reformulate det2 to follow the same scheme. With the shorthands c := W 1_n + (v^T 1_n) u + (b^T 1_n) a and

γ := 1 / (1_n^T W^(p) 1_n) = 1 / (1_n^T W 1_n + (1_n^T u)(v^T 1_n) + (1_n^T a)(b^T 1_n)),

we obtain

det2 = | I_n − (β/4) W Q D
        − (β/4) [ (1_n^T W 1_n)^{-1} W 1_n − γ c ] (D W 1_n)^T
        − (β/4) [ u − γ (1_n^T u) c ] (D v)^T
        − (β/4) [ a − γ (1_n^T a) c ] (D b)^T |,    (15)

where the Q in the first line uses the current W. Note that the determinant of the first term in Eq. (15) is already known (i.e. its QR decomposition) and the following 3 terms are only rank-1 updates as indicated by the brackets. Therefore, det2 is computed in 3 steps, each consuming O(n²). For some of the 6 flip configurations, we even have a = b = 0_n, which renders the last term in Eq. (15) obsolete and simplifies the remaining ones. Since the for loop covers n flips, all updates contribute as n · O(n²). There is no shortcut to evaluate the proposal β^(p) given β, thus its posterior is recomputed from scratch in O(n³). 
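The core of this update scheme is that a rank-1 change shifts a determinant in closed form. As a minimal illustration (using the matrix determinant lemma instead of the paper's QR bookkeeping), det(W + uv^T) = det(W)(1 + v^T W^{-1} u):

```python
import numpy as np

# Matrix determinant lemma: det(W + u v^T) = det(W) * (1 + v^T W^{-1} u).
# This is the kind of identity that lets det1 (and, term by term, det2)
# be refreshed without an O(n^3) recomputation.
rng = np.random.default_rng(5)
n = 8
A = rng.standard_normal((n, n))
W = A @ A.T + n * np.eye(n)       # well-conditioned symmetric matrix
u = rng.standard_normal(n)
v = rng.standard_normal(n)

lhs = np.linalg.det(W + np.outer(u, v))
rhs = np.linalg.det(W) * (1.0 + v @ np.linalg.solve(W, u))

assert np.allclose(lhs, rhs)
```

In practice, maintaining a QR decomposition (as the proof does) is numerically preferable to repeatedly applying the lemma, but the O(n²) cost per rank-1 change is the same.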
Therefore, Algorithm\n1 has an overall complexity of O(n3), which is the same as TiWnet.\n\n4 Experiments\n\n4.1 Synthetic Data\n\nWe \ufb01rst look at synthetic data and compare how well the recovered network matches the true one.\nHereby, the accuracy is measured by the f-score using the edges (positive/negative/zero).\n\nIndependent Latent Features Since TiMT is a generalization for arbitrary \u03a3, it must also cover\n\u03a3 \u2261 Id, thus, we generate a set of 100 Gaussian-distributed matrices X with known W and \u03a3 = Id,\nwhere n = 30 and d = 300. Next, we add column translations 1nw(cid:62) with elements in w \u2208 Rd\nbeing Gamma distributed, however these do not enter D by de\ufb01nition. As TRCM does not account\nfor column shifts, it is used in conjunction with the true, unshifted matrix X (hence TRCM.u).\nAll methods require a regularization parameter, which obviously determines the outcome. In par-\nticular, TiWnet and TiMT use the same, constant parameter throughout all 100 distance matrices\nand obtain the \ufb01nal W via annealing. Concerning TRCM and gL, we evaluate each X on a set of\nparameters and only report the highest f-score per data set. This is in strong favor of the competition.\nBoxplots of the achieved f-scores and the false positive rates are depicted in Figure 3, left. As\ncan be seen, TiMT and TiWnet score as high as TRCM.u without knowledge of features or feature\ntranslations. We omit gL from the comparison due to a model mismatch regarding M, meaning it\nwill naturally fall short. Instead, the interested reader is pointed to extensive results in [19].\nThe gist of this experiment is that all methods work well when the model requirements are met.\nAlso, translating the individual features and obscuring them does not impair TiWnet and TiMT.\n\nCorrelated Latent Features The second experiment is similar to the \ufb01rst one (n = 30, d =\n300 and column shifts), but it additionally introduces feature correlation. 
Here, Σ is generated by sampling a matrix G ∼ N(0_{d×5d}, I_d ⊗ I_{5d}) and adding a Gamma-distributed vector a ∈ R^{5d} to randomly selected rows of G. The final feature covariance matrix is given by Σ = (1/(5d)) G G^T.

Figure 3: Results for synthetic data. Translations do not apply to TRCM.u. Models with violated assumptions (M and/or Σ) are highlighted with gray bars.

Due to the dramatically increased degrees of freedom, all methods are impacted by lower f-scores (see Figure 3, right). As expected, TRCM.u performs best in terms of f-score, which is based on the unshifted full data matrix X with an individually optimized regularization parameter. TiMT, however, follows by a slim margin. On the contrary, TiWnet explains the similarities exclusively by adding more (unnecessary) edges, which is reflected in its increased, but strongly consistent false positive rate. This issue leads to a comparatively low f-score that is even below the remaining contenders. Finally, Figure 4 shows an example network and its reconstruction. Keeping in mind the drastic information loss between the true X_{30×300} and D_{30×30}, TiMT performs extremely well.

Figure 4: An example for synthetic data with feature correlation. The network inferred by TiMT (center) is relatively close to ground truth (left); however, TiWnet (right) is apparently misled by Σ. Black/red edges refer to +/− edge weight.

4.2 Real-World Data: A Network of Biological Pathways

In order to demonstrate the scalability of TiMT, we apply it to the publicly available colon cancer dataset of Sheffer et al. 
[20], which is comprised of 13 437 genes measured across 182 patients. Using the latest gene sets from the KEGG³ database, we arrive at n = 276 distinct pathways. After learning the mean and variance of each pathway as the distribution of its gene expression values across patients, the Bhattacharyya distances [11] are computed as a 276 × 276 matrix D. The pathways are allowed to overlap via common genes, thus leading to similarities; however, it is unclear how and to what degree the correlation of patients affects the inferred network. For this purpose, we run TiMT alongside TiWnet with identical parameters for 20 000 samples and report the annealed networks in Figure 5. Again, the difference in topology is only due to latent feature correlation. Runtime on a standard 3 GHz PC was 3:10 hours for TiMT, while a naive implementation in O(n^4) finished after ∼20 hours. TiWnet performed slightly better at around 3 hours, since the model does not have the hyperparameter β to control feature correlation.

³http://www.genome.jp/kegg/, accessed in May 2014

Figure 5: A network of pathways in colon cancer 
patients, where each vertex represents one pathway. From both results, we extract a subgraph of 3 pathways including all neighbors in reach of 2 edges. The matrix on the bottom shows external information on pathway similarity based on their relative number of protein-protein interactions. Black/red edges refer to +/− edge weight.

Without side information it is not possible to confirm either result, hence we resort to expert knowledge for protein-protein interactions from the BioGRID⁴ database and compute the strength of connection between pathways as the number of interactions relative to their theoretical maximum. Using this, we can easily check subnetworks for plausibility (see Figure 5, center): the black vertices 96, 98 and 114 correspond to base excision repair, mismatch repair and cell cycle, which are particularly interesting as they play a key role in DNA mutation. These pathways are known to be strongly dysregulated in colon cancer and indicate an elevated susceptibility [18, 6]. The topology of these 3 pathways for TiMT is fully supported by protein interactions, i.e. 98 is the link between 114 and 96, and removing it renders 96 and 98 independent. TiWnet, on the contrary, overestimates the network and produces a highly-connected structure contradicting the evidence. This is a clear indicator for latent feature correlation.

5 Conclusion

We presented the Translation-invariant Matrix-T process (TiMT) as an elegant way to make inference in Gaussian graphical models when only pairwise distances are available. Previously, the inherent information loss about underlying features appeared to prevent any conclusive statement about their correlation; however, we argue that neither assumed full independence nor maximum likelihood estimation is reasonable in this context.

Our contribution is threefold: (i) A Bayesian relaxation solves the issue of strict feature independence in GGMs. 
The assumption is now shifted into the prior, but flat priors are possible. (ii) The approach generalizes TiWnet, but maintains the same complexity; thus, there is no reason to retain the simplified model. (iii) TiMT for the first time accounts for all latent parameters of the Matrix-Normal without access to the latent data matrix X. The distances D are fully sufficient.

In synthetic experiments, we observed a substantial improvement over TiWnet, which highly overestimated the networks and falsely attributed all information to the topological structure. At the same time, TiMT performed almost on par with TRCM(.u), which operates under hypothetical, optimal conditions. This demonstrates that all aspects of information loss can be handled exceptionally well. Finally, the network of biological pathways provided promising results for a domain of non-vectorial objects, which effectively precludes all methods except for TiMT and TiWnet. Comparing these two, the considerable difference in network topology only goes to show that invariance against latent feature correlation is indispensable, especially pertaining to distances.

⁴http://thebiogrid.org, version 
3.2

References
[1] G. Allen and R. Tibshirani. Transposable Regularized Covariance Models with an Application to Missing Data Imputation. The Annals of Applied Statistics, 4:764–790, 2010.
[2] A. Bhattacharyya. On a Measure of Divergence between Two Statistical Populations Defined by Their Probability Distributions. Bulletin of the Calcutta Mathematical Society, 35:99–109, 1943.
[3] M. Daniels and M. Pourahmadi. Modeling Covariance Matrices via Partial Autocorrelations. Journal of Multivariate Analysis, 100(10):2352–2363, 2009.
[4] A. de Vos and M. Francke. 
Bayesian Unit Root Tests and Marginal Likelihood. Technical report, Department of Econometrics and Operations Research, VU University Amsterdam, 2008.

[5] L. Ein-Dor, O. Zuk, and E. Domany. Thousands of Samples are Needed to Generate a Robust Gene List for Predicting Outcome in Cancer. Proceedings of the National Academy of Sciences, pages 5923–5928, 2006.

[6] P. Fortini, B. Pascucci, E. Parlanti, M. D'Errico, V. Simonelli, and E. Dogliotti. The Base Excision Repair: Mechanisms and its Relevance for Cancer Susceptibility. Biochimie, 85(11):1053–1071, 2003.

[7] J. Friedman, T. Hastie, and R. Tibshirani. Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics, 9(3):432–441, 2008.

[8] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. PMS Series. Addison-Wesley Longman, 1999.

[9] D. Harville. Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems. Journal of the American Statistical Association, 72(358):320–338, 1977.

[10] A. Iranmanesh, M. Arashi, and S. Tabatabaey. On Conditional Applications of Matrix Variate Normal Distribution. Iranian Journal of Mathematical Sciences and Informatics, pages 33–43, 2010.

[11] T. Jebara and R. Kondor. Bhattacharyya and Expected Likelihood Kernels. In Conference on Learning Theory, 2003.

[12] J. Kalbfleisch and D. Sprott. Application of Likelihood Methods to Models Involving Large Numbers of Parameters. Journal of the Royal Statistical Society. Series B (Methodological), 32(2):175–208, 1970.

[13] P. McCullagh. Marginal Likelihood for Parallel Series. Bernoulli, 14:593–603, 2008.

[14] P. McCullagh. Marginal Likelihood for Distance Matrices. Statistica Sinica, 19:631–649, 2009.

[15] T. Mitchell and J. Beauchamp. Bayesian Variable Selection in Linear Regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.

[16] S. Murphy and A.
van der Vaart. On Profile Likelihood. Journal of the American Statistical Association, 95:449–465, 2000.

[17] H. Patterson and R. Thompson. Recovery of Inter-Block Information when Block Sizes are Unequal. Biometrika, 58(3):545–554, 1971.

[18] P. Peltomäki. DNA Mismatch Repair and Cancer. Mutation Research, 488(1):77–85, 2001.

[19] S. Prabhakaran, D. Adametz, K. J. Metzner, A. Böhm, and V. Roth. Recovering Networks from Distance Data. JMLR, 92:251–283, 2013.

[20] M. Sheffer, M. D. Bacolod, O. Zuk, S. F. Giardina, H. Pincas, F. Barany, P. B. Paty, W. L. Gerald, D. A. Notterman, and E. Domany. Association of Survival and Disease Progression with Chromosomal Instability: A Genomic Exploration of Colorectal Cancer. Proceedings of the National Academy of Sciences, pages 7131–7136, 2009.