{"title": "Bayesian Pedigree Analysis using Measure Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 2897, "page_last": 2905, "abstract": "Pedigrees, or family trees, are directed graphs used to identify sites of the genome that are correlated with the presence or absence of a disease. With the advent of genotyping and sequencing technologies, there has been an explosion in the amount of data available, both in the number of individuals and in the number of sites. Some pedigrees number in the thousands of individuals. Meanwhile, analysis methods have remained limited to pedigrees of fewer than 100 individuals, which limits analyses to many small independent pedigrees. Disease models, such as those used for the linkage analysis log-odds (LOD) estimator, have similarly been limited. This is because linkage analysis was originally designed with a different task in mind, that of ordering the sites in the genome, before there were technologies that could reveal the order. LODs are difficult to interpret and nontrivial to extend to consider interactions among sites. These developments and difficulties call for the creation of modern methods of pedigree analysis. Drawing from recent advances in graphical model inference and transducer theory, we introduce a simple yet powerful formalism for expressing genetic disease models. We show that these disease models can be turned into accurate and efficient estimators. The technique we use for constructing the variational approximation has potential applications to inference in other large-scale graphical models.
This method allows inference on larger pedigrees than previously analyzed in the literature, which improves disease site prediction.", "full_text": "Bayesian Pedigree Analysis using Measure Factorization\n\nAlexandre Bouchard-C\u00f4t\u00e9\n\nStatistics Department\n\nUniversity of British Columbia\nbouchard@stat.ubc.ca\n\nBonnie Kirkpatrick\n\nComputer Science Department\nUniversity of British Columbia\n\nbbkirk@cs.ubc.ca\n\nAbstract\n\nPedigrees, or family trees, are directed graphs used to identify sites of the genome\nthat are correlated with the presence or absence of a disease. With the advent\nof genotyping and sequencing technologies, there has been an explosion in the\namount of data available, both in the number of individuals and in the number\nof sites. Some pedigrees number in the thousands of individuals. Meanwhile,\nanalysis methods have remained limited to pedigrees of fewer than 100 individuals, which\nlimits analyses to many small independent pedigrees.\nDisease models, such as those used for the linkage analysis log-odds (LOD) estimator,\nhave similarly been limited. This is because linkage analysis was originally\ndesigned with a different task in mind, that of ordering the sites in the genome,\nbefore there were technologies that could reveal the order. LODs are difficult to\ninterpret and nontrivial to extend to consider interactions among sites. These\ndevelopments and difficulties call for the creation of modern methods of pedigree\nanalysis.\nDrawing from recent advances in graphical model inference and transducer theory,\nwe introduce a simple yet powerful formalism for expressing genetic disease\nmodels. We show that these disease models can be turned into accurate and\ncomputationally efficient estimators. The technique we use for constructing the\nvariational approximation has potential applications to inference in other large-scale\ngraphical models.
This method allows inference on larger pedigrees than previously\nanalyzed in the literature, which improves disease site prediction.\n\n1 Introduction\n\nFinding genetic correlates of disease is a long-standing important problem with potential contributions\nto diagnostics and treatment of disease. The pedigree model for inheritance is one of the best-defined\nmodels in biology, and it has been an area of active statistical and biological research for\nover a hundred years.\nThe most commonly used method to analyze genetic correlates of disease is quite old. After Mendel\nintroduced, in 1866, the basic model for the inheritance of genomic sites [1], Sturtevant was the first,\nin 1913, to provide a method for ordering the sites of the genome [2]. The method of Sturtevant\nbecame the foundation for linkage analysis with pedigrees [3, 4, 5, 6]. The problem can be thought\nof in Sturtevant\u2019s framework as that of finding the position of a disease site relative to a map of\nexisting sites. The resulting method is the log-odds (LOD) estimator for linkage analysis, a likelihood ratio\ntest described in more detail below.\nThe genomic data available now is quite different from the type of data available when LOD was initially\ndeveloped. Genomic sites are becoming considerably denser in the genome, and technologies\nallow us to interrogate the genome for the position of sites [7]. Additionally, most current pedigree\nanalysis methods are exponential either in the number of sites or in the number of individuals. This\nlimits the size of the pedigrees under consideration to fewer than about 100 individuals. This\nis in contrast to the size of pedigrees being collected: for example, the work of [8] includes a connected\nhuman pedigree containing 13 generations and 1623 individuals, and the work of [9] includes\na connected non-human data set containing thousands of breeding dogs.
Apart from the issues of\npedigree size, the LOD value is difficult to interpret, since there are few models for the distribution\nof the statistic. These developments and difficulties call for the creation of modern methods of\npedigree analysis.\nIn this work, we propose a new framework for expressing genetic disease models. The key component\nof our models, the Haplotype-Phenotype Transducer (HPT), draws from recent advances in\ngraphical model inference and transducer theory [10], and provides a simple and flexible formalism\nfor building genetic disease models. The output of inference over HPT models is a posterior\ndistribution over disease sites, which is easier to interpret than LOD scores.\nThe cost of this modeling flexibility is that the graphical model corresponding to the inference\nproblem is larger and has more loops than traditional pedigree graphical models. Our solution to\nthis challenge is based on the observation that the difficult graphical model can be covered by a\ncollection of tractable forest graphical models. We use a method based on measure factorization [11]\nto efficiently combine these approximations. Our approach is applicable to other dense graphical\nmodels, and we show empirically that it gives accurate approximations in dense graphical models\ncontaining millions of nodes as well as short and long cycles. Our approximation can be refined by\nadding more trees in the forest, with a cost linear in the number of forests used in the cover. We\nshow that considerable gains in accuracy can be obtained this way. In contrast, methods such as [12]\ncan suffer from an exponential increase in running time when larger clusters are considered.\nOur framework can be specialized to create analogues of classical penetrance disease models [13].\nWe focus on these special cases here to compare our method with classical ones.
Our experiments\nshow that even for these simpler cases, our approach can achieve significant gains in disease site\nidentification accuracy compared to the most commonly used method, Merlin\u2019s implementation of\nLOD scores [3, 5]. Moreover, our inference method allows us to perform experiments on unprecedented\npedigree sizes, well beyond the capacity of Merlin and other pedigree analysis tools typically\nused in practice.\nWhile graphical models have played an important role in the development of pedigree analysis\nmethods [14, 15], only recently were variational methods applied to the problem [6]. However,\nthis previous work is based on the same graphical model as classical LOD methods, while ours\ndiffers significantly.\nMost current work on more advanced disease models has focused on a very different type of data,\npopulation data, for genome-wide association studies (GWAS) [16]. Similarly, state-of-the-art work\non the related task of imputation generally makes similar population assumptions [17].\n\n2 Background\n\nEvery individual has two copies of each chromosome: one copy is a collage of the mother\u2019s two\nchromosomes, while the other is a collage of the father\u2019s two chromosomes. The point at which\nthe copying of the chromosomes switches from one of the grand-maternal (grand-paternal) chromosomes\nto the other is called a recombination breakpoint. A site is a particular position in the genome\nat which we can obtain measurable values. For the purposes of this paper, an allele is the nucleotide\nat a particular site on a particular chromosome. A haplotype is the sequence of alleles that appear\ntogether on the same chromosome.\nIf we had complete data, we would know the positions of all of the haplotypes, all of the recombination\nbreakpoints, as well as which allele came from which parent. This information is not obtainable\nfrom any known experiment.
Instead, we have genotype data, which is the set of nucleotides that\nappear in an individual\u2019s genome at a particular site. Given that the genotype is a set, it is unordered,\nand we do not know which allele came from which parent. All of this and the recombination breakpoints\nmust be inferred. An example is given in the Supplement.\n\nA pedigree is a directed acyclic graph with individuals as nodes, where boxes are males and circles\nare females, and edges are directed downward from parent to child. Every individual must have either\nno parents or one parent of each gender. The individuals without parents in the graph are called\nfounders, and the individuals with parents are non-founders. The pedigree encodes a set of relationships\nthat constrain the allowed inheritance options. These inheritance options define a probability\ndistribution which is investigated during pedigree analysis.\nAssume a single-site disease model, where a diploid genotype, G_D, determines the affection status\n(phenotype), P \u2208 {\u2018h\u2019, \u2018d\u2019}, according to the penetrance probabilities: f2 = P(P = \u2018d\u2019 | G_D = 11),\nf1 = P(P = \u2018d\u2019 | G_D = 10), f0 = P(P = \u2018d\u2019 | G_D = 00). Here the disease site usually has a disease\nallele, 1, that confers greater risk of having the disease. For convenience, we denote the penetrance\nvector as f = (f2, f1, f0).\nLet the pedigree model for n individuals be specified by a pedigree graph, a disease model f, and\nthe minor allele frequency, \u00b5, for a single site of interest, k. Let P = (P_1, P_2, ..., P_n) be a vector\ncontaining the affection status of each individual. Let G = (G_1, G_2, ..., G_n) be the genotype data for\neach individual. Between the disease site and site k, we model the per chromosome, per generation\nrecombination fraction, \u03c1, which is the frequency with which recombinations occur between those\ntwo sites.
Other sites linked to k can contribute to our estimate via their arrangement in a single first-order\nMarkov chain, with some sites falling to the left of the disease site and others to the right of the\nsite of interest. Previous work has shown that given a pedigree model, affection data, and genotype\ndata, we can estimate \u03c1.\nWe define the likelihood as L(\u03c1) = P(P = p, G = g | \u03c1, f, \u00b5), where \u03c1 is the recombination probability\nbetween the disease site and the first site, \u00b5 is the founder (minor) allele frequency, and f are the\npenetrance probabilities. To test for linkage between the disease site and the other sites, we maximize\nthe likelihood to obtain the optimal recombination fraction \u03c1\u2217 = argmax_\u03c1 L(\u03c1)/L(1/2). The\ntest we use is the likelihood ratio test where the null hypothesis is that of no linkage (\u03c1 = 1/2).\nGenerally referred to as the log-odds score (or LOD score), the log of this likelihood ratio is\nlog L(\u03c1\u2217) \u2212 log L(1/2).\n\n3 Methods\n\nIn this section, we describe our model for inferring relationships between phenotypes and genotyped\npedigree datasets. We start by giving a high-level description of the generative process.\nThe first step in this generative process consists in sampling a collection of disease model (DM) variables,\nwhich encode putative relationships between the genetic sites and the observed phenotypes.\nThere is one disease model variable for each site, s, and to a first approximation, D_s can be thought\nof as taking values zero or one, depending on whether site s is the closest to the primary genetic factor\ninvolved in a disease (a more elaborate example is presented in the Supplement). We use C to denote\nthe set of values D_s can take.\nThe second generative step consists in sampling the chromosomes or haplotypes of a collection of\nrelated individuals.
We denote these variables by H_{i,s,x}, where, from now on, i is used to index\nindividuals, s to index sites, and x \u2208 { \u2018father\u2019, \u2018mother\u2019 } to index chromosome parental origin.\nFor SNP data, the set of values H that H_{i,s,x} can take generally contains two elements (alleles). A\nrelated set of variables, the inheritance variables R_{i,s,x}, will be sampled jointly with the H_{i,s,x}\u2019s to keep\ntrack of the grand-parental origin of each chromosome segment. See Figure 1(a) for a factor graph\nrepresentation of the random variables.\nFinally, the phenotype P_i, which we assume is taken from a finite set P, can be sampled for each\nindividual i in the pedigree. We will define the distribution of P_i conditionally on the haplotype of\nthe individual in question, H_i, and on the global disease model D. Note that variables with missing\nindices are used to denote random vectors or matrices, for example D = (D_1, . . . , D_S), where S\ndenotes the number of sites.\nTo summarize this high-level view of the process, and to introduce notation for the distributions\ninvolved:\n\nD \u223c DM(\u00b7)\nR_i \u223c Recomb(\u00b7) for all i\n\nFigure 1: (a) The pedigree graphical model for independent sites. There are two plates, one for each individual\nand one for each site. The nodes are labeled as follows: M for the marriage node which enforces the Mendelian\ninheritance constraints, H for haplotype, L and L\u2032 for the two alleles, D(1) for the disease site indicator, and\nD(2) for the disease allele value. (b) The transducer for DM(\u00b7) has three nodes with the start node indicated\nby an in-arrow and the end node indicated by an out-arrow. The transducer for Recomb(\u00b7) has recombination\nparameter \u03b8. This assumes a constant recombination rate across sites, but non-constant rates can be obtained\nwith a bigger automaton.
The transducer for HPT(\u00b7) models a recessive disease where the input at each state\nis the disease (top) and haplotype alleles (bottom). For these last two transducers any node can be the start or\nend node.\n\nThe remaining variables (the non-founder individuals\u2019 haplotype variables) are obtained deterministically\nfrom the values of the founders and the inheritance: H_{i,s,x} = H_{x(i), s, R_{i,s,x}}, where x(i)\ndenotes the index of the father (mother) of i if x = \u2018father\u2019 (\u2018mother\u2019). The distribution on the\nfounder haplotypes is a product of independent Bernoulli distributions, one for each site (the parameters\nof these Bernoulli distributions are not restricted to be identical across sites and can be\nestimated [3]). Each genotype variable G_s is obtained via a deterministic function of H. Having\ngenerated all the haplotypes and disease variables, we denote the conditional distribution of the\nphenotypes as follows:\n\nP_i | (D, H_i) \u223c HPT( \u00b7 ; D, H_i),\n\nwhere HPT stands for a Haplotype-Phenotype Transducer.\nWe now turn to the description of these distributions, starting with the most important one,\nHPT( \u00b7 ; D, H_i). Formally, this distribution on phenotypes is derived from a weighted automaton,\nwhere we view the vectors D and H_i as an input string of length S, the s-th character of which\nis the triplet (D_s, H_{i,s,\u2018father\u2019}, H_{i,s,\u2018mother\u2019}). We view each of the sampled phenotypes as a length-one\noutput from a weighted transducer given the input D, H_i. Longer outputs could potentially be used\nfor more complex phenotypes or diseases.\nTo illustrate this construction, we show that classical Mendelian models such as recessive phenotypes\nare a special case of this formalism.
We also make two simplifications to facilitate exposition:\nfirst, that the disease site is one of the observed sites, and second, that the disease allele is the less\nfrequent (minor) allele (we show in the Supplement a slightly more complicated transducer that does\nnot make these assumptions).\nUnder the two above assumptions, we claim that the state diagrams in Figure 1(b) specify an HPT\ntransducer for a recessive disease model. Each oval corresponds to a hidden transducer state, and the\nannotation inside the oval encodes the tuple of input symbols that the corresponding state consumes.\nThe emission is depicted on top of the states; for example, \u2018d\u2019: 1.0 denotes that a disease\nindicator is emitted with weight one. We use \u2018h\u2019 for the non-disease (healthy) indicator, and \u03b5 for\nthe null emission.\nThe probability mass function of the HPT is defined as:\n\nHPT(p; c, h) = [ \u2211_{z \u2208 Z_HPT(h, c \u2192 p)} w_HPT(z) ] / [ \u2211_{z\u2032 \u2208 Z_HPT(h, c \u2192 \u22c6)} w_HPT(z\u2032) ],\n\nwhere h \u2208 H^S, c \u2208 C^S, p \u2208 P, and Z_HPT(h, c \u2192 p) denotes the set of valid paths in the space\nZ of hidden states. The valid paths are sequences of hidden states (depicted by black circles in\nFigure 1(b)) starting at the source and ending at the sink, consuming c, h and emitting p along the\nway. The star in the denominator of the above equation is used to denote unconstrained emissions.\nIn other words, the denominator is the normalization of the weighted transducer [10].
The set of\nvalid paths is implicitly encoded in the transition diagram of the transducer, and the weight function\nw_HPT : Z\u2217 \u2192 [0,\u221e) can similarly be compactly represented by only storing weights for individual\ntransitions and multiplying them to get a path weight.\nThe set of valid paths along with their weights can be thought of as encoding a parametric disease\nmodel. For example, with a recessive disease, shown in Figure 1(b), we can see that if the transducer\nis at the site of the disease (encoded as the current symbol in c being equal to 1) then only an input\nhomozygous haplotype \u2018AA\u2019 will lead to an output disease phenotype \u2018d\u2019. This formalism gives a\nconsiderable amount of flexibility to the modeler, who can go beyond simple Mendelian disease\nmodels by constructing different transducers.\nThe DM distribution is defined using the same machinery as for the HPT distribution. We show\nin Figure 1(b) a weighted automaton that encodes the prior that exactly one site is involved in the\ndisease, with an unknown, uniformly distributed location in the genome. The probability mass\nfunction of the distribution is given by:\n\nDM(c) = [ \u2211_{z \u2208 Z_DM(\u2192 c)} w_DM(z) ] / [ \u2211_{z\u2032 \u2208 Z_DM(\u2192 \u22c6)} w_DM(z\u2032) ],\n\nwhere Z_DM(\u2192 c) and Z_DM(\u2192 \u22c6) are direct analogues to the HPT case, with the difference being\nthat no input is read in the DM case.\nThe last distribution in our model, Recomb, is standard, but we present it in the new light of the\ntransducer formalism. Refer to Figure 1(b) for an example based on the standard recombination\nmodel derived from the marginals of a Poisson process.
We use the analogous notation:\n\nRecomb(r) = [ \u2211_{z \u2208 Z_Recomb(\u2192 r)} w_Recomb(z) ] / [ \u2211_{z\u2032 \u2208 Z_Recomb(\u2192 \u22c6)} w_Recomb(z\u2032) ].\n\n4 Computational Aspects\n\nProbabilistic inference in our model is computationally challenging: the variables L, H alone induce\na loopy graph [18], and the addition of the variables D, P introduces more loops as well as\ndeterministic constraints, which further complicate the situation. After explaining in more detail\nthe graphical model of interest, we discuss in this section the approximation algorithm that we have\nused to infer haplotypes, disease loci, and other disease statistics.\nWe show in Figure 1(a) the factor graph obtained after turning the observed variables (genotypes\nand phenotypes) into potentials (we show a more detailed version in the Supplement). We have also\ntaken the pointwise product of potentials whenever possible (in the case of the transducer potentials,\nhow this pointwise product is implemented is discussed in [10]). Note that our graphical model\nhas more cycles than standard pedigree graphical models [19]; even if we assumed the sites to be\nindependent and the pedigree to be acyclic, our graphical model would still be cyclic.\nOur inference method is based on the following observation: if we kept only one subtype of factors\nin the Supplement, say only those connected to the recombination variables R, then inference could\nbe done easily.
More precisely, inference would reduce to a collection of small, standard HMM\ninference problems, which can be solved using existing software.\nSimilarly, by covering the pedigree graph with a collection of subtrees, and removing the factors\nfor disease and recombination, we can get a collection of acyclic pedigrees, one for each site, and\nhence a tractable problem (the sum-product algorithm in this case is called the Elston-Stewart\nalgorithm [14] in the pedigree literature).\nWe are therefore in a situation where we have several restricted views on our graphical model yielding\nefficiently solved subproblems. How to combine the solutions of these tractable subproblems is\nthe question we address in the remainder of this section.\nThe most common way this is approached, in pedigrees [20] and elsewhere [21], is via block Gibbs\nsampling. However, block Gibbs sampling does not apply readily to our model. The main difficulty\narises when attempting to resample D: because of the deterministic constraints that arise even in\nthe simplest disease model, it is necessary to sample D in a block also containing a large subset of\nR and H. However, this cannot be done efficiently since D is connected to all individuals in the\npedigree. More formally, the difficulty is that some of the components we wish to resample are\nb-acyclic (barely acyclic) [22]. Another method, closer to ours, is the EP algorithm of [23], which\nhowever considers a single tree approximant, while we can accommodate several at once.
As we\nshow in the empirical section, it is advantageous to do so in pedigrees.\nAn important feature that we will exploit in the development of our method is the forest cover property\nof the tractable subproblems: we view each tractable subproblem as a subgraph of the initial factor\ngraph, and ask that the union of these subgraphs coincide with the original factor graph.\nPrevious variational approaches have been proposed to exploit such forest covers. The most well-known\nexample, the structured mean field approximation, is unfortunately non-trivial to optimize in\nthe b-acyclic case [22]. Tree reweighted belief propagation [24] has an objective function derived\nfrom a forest distribution; however, the corresponding algorithms are based on local message passing\nrather than large subproblems.\nWe propose an alternative based on the measure factorization framework [11]. As we will see,\nthis yields an easy-to-implement variational approximation that can efficiently exploit arbitrary forest\ncover approximations. Since the measure factorization interpretation of our approach is not specific\nto pedigrees, we present it in the context of a generic factor graph over a discrete space, viewed as\nan exponential family with sufficient statistics \u03c6, log normalization A, and parameters \u03b8:\n\nP(X = x) = exp{\u27e8\u03c6(x), \u03b8\u27e9 \u2212 A(\u03b8)}.   (1)\n\nTo index the factors, we use \u03d5 \u2208 F = {1, ..., F}, and v to index the V variables in the factor graph.\nWe start by reparameterizing the exponential family in terms of a larger vector y of variables. Let\nus also denote the number of nodes connected to factor \u03d5 by n_\u03d5. This vector y has N = \u2211_\u03d5 n_\u03d5\ncomponents, each corresponding to a pair containing a factor and a node index attached to it, and\ndenoted by y_{\u03d5,v}.
The reparameterization is given by:\n\nP(Y = y) = exp{\u27e8\u03c6(y), \u03b8\u27e9 \u2212 A\u2032(\u03b8)} \u220f_{\u03d5,\u03d5\u2032 \u2208 F} \u220f_v 1[y_{\u03d5,v} = y_{\u03d5\u2032,v}].   (2)\n\nBecause of the indicator variables in the right hand side of Equation 2, the set of y\u2019s with P(Y =\ny) > 0 is in bijection with the set of x\u2019s with P(X = x) > 0. It is therefore well-defined to overload\nthe variable \u03c6 in the same equation. Similarly, we have that A\u2032 = A. This reparameterization is\ninspired by the auxiliary variables used to construct the sampler of Swendsen-Wang [25].\nNext, suppose that the sets F_1, . . . , F_K form a forest cover of the factor graph, F_k \u2282 F. Then, for\nk \u2208 {1, . . . , K}, we build as follows the super-partitions required for the measure factorization to\napply (as defined in [11]):\n\nA_k(\u03b8) = \u2211_y exp{\u27e8\u03c6(y), \u03b8\u27e9} \u220f_{\u03d5,\u03d5\u2032 \u2208 F_k} \u220f_v 1[y_{\u03d5,v} = y_{\u03d5\u2032,v}].   (3)\n\nNote that computing each A_k is tractable: it corresponds to computing the normalization of one of\nthe forests covering the graphical model. Similarly, gradients of A_k can be computed as the moments\nof a tree-shaped graphical model. Also, the product over k of the base measures in Equation 3 is equal\nto the base measure of Equation 2. We have therefore constructed a valid measure factorization.\nWith this construction in hand, it is then easy to apply the measure factorization framework to get a\nprincipled way for the different subproblem views to exchange messages [11].\n\n5 Experiments\n\nWe did two sets of experiments. Haplotype reconstructions were used to assess the quality of the\nvariational approximation. Disease predictions were used to validate the HPT disease model.\n\nSimulations. Pedigree graphs were simulated using a Wright-Fisher model [26].
In this model\nthere is a fixed number of male individuals, n, and female individuals, n, per generation, making\nthe population size 2n. The pedigree is built starting from the oldest generation. Each successively\nmore recent generation is built by having each individual in that generation choose uniformly at\nrandom one female parent and one male parent. Notice that this process allows inbreeding.\n\n[Figure 2 panels: (a) Forest-Cover Factors, (b) Recombination Factors, (c) Recombination Parameter; x-axis: No. Iterations; y-axis: Haplotype Metric \u03c6]\n\nFigure 2: The pedigree was generated with the following parameters: 20 generations and n = 15,\nwhich resulted in a pedigree with 424 individuals, 197 marriage nodes, and 47 founders. We simulated 1000\nmarkers. The metric used for all panels is the haplotype reconstruction metric. Panel (a) shows the effect of\nremoving factors from the forest cover of the pedigree, where the lines are labeled with the number of factors\nthat each experiment contains. Panel (b) shows the effect of removing the recombination factor (false) or using\nit (true). Together, panels (a-b) show that having more factors helps inference. Panel (c) shows the effect of an\nincorrect recombination parameter on inference. The correct parameter, with which the data was generated, is\nline 0.0005. Two incorrect parameters are shown: 0.00005 and 0.005. This panel shows that the recombination\nparameter can be off by an order of magnitude while the haplotype reconstruction remains robust.\n\nGenotype data were simulated in the simulated pedigree graph. The founder haplotypes were drawn\nfrom an empirical distribution (see Supplement for details). The recombination parameters used\nfor inheritance are given in the Supplement. We then simulated the inheritance and recombination\nprocess to obtain the haplotypes of the descendants using the external program [27].
We used two\ndistributions for the founder haplotypes, corresponding to two data sets.\nIndividuals with missing data were sampled, where each individual either has all their genetic data\nmissing or not. A random 50% of the non-founder individuals have missing data. An independent\n50% of individuals have missing phenotypes for the disease prediction comparison.\n\nHaplotype Reconstruction. For the haplotype reconstruction, the inference being scored is, for\neach individual, the maximum a posteriori haplotype predicted by the marginal haplotype distri-\nbution. These haplotypes are not necessarily Mendelian consistent, meaning that it is possible for\na child to have an allele on the maternal haplotype that could not possibly be inherited from the\nmother according to the mother\u2019s marginal distribution. However, transforming the posterior dis-\ntribution over haplotypes into a set of globally consistent haplotypes is somewhat orthogonal to\nthe methods in this paper, and there exist methods for this task [28]. The goal of this comparison\nis threefold: 1) to see if adding more factors improves inference, 2) to see if more iterations of\nthe measure factorization algorithm help, and 3) to see if there is robustness of the results to the\nrecombination parameters.\nSynthetic founder haplotypes were simulated, see Supplement for details. Each experiment was\nreplicated 10 times where for each replicate the founder haplotypes were sampled with a different\nrandom seed. We computed a metric \u03c6 which is a normalized count of the number of sites that differ\nbetween the held-out haplotype and the predicted haplotype. See the Supplement for details.\nFigure 2 shows the results for the haplotype reconstruction. Panels (a) and (b) show that adding more\nfactors helps inference accuracy. Panel (c) shows that inference accuracy is robust to an incorrect\nrecombination parameter.\n\nDisease Prediction. 
For disease prediction, the inference being scored is the ranking of the sites\ngiven by our Bayesian method as compared with LOD estimates computed by Merlin [3]. The disease\nmodels we consider are recessive f = (0.95, 0.05, 0.05) and dominant f = (0.95, 0.95, 0.05).\nThe disease site is one of the sites chosen uniformly at random. The goal of this comparison is to\nsee whether our disease model performs at least as well as the LOD estimator used by Merlin.\n\nPedigree                          Disease model     HPT             LOD [3]\nGenerations  Leaves  Individuals  f2    f1    f0    Mean \u03c8  SD \u03c8    Mean \u03c8  SD \u03c8\n3            8       22           0.95  0.05  0.05  0.08    (0.09)  0.25    (0.20)\n3            10      25           0.95  0.05  0.05  0.07    (0.09)  0.52    (0.44)\n3            12      34           0.95  0.05  0.05  0.04    (0.04)  0.45    (0.23)\n\n3            6       16           0.95  0.05  0.05  0.04    (0.05)  0.27    (0.31)\n4            6       20           0.95  0.05  0.05  0.08    (0.09)  0.35    (0.31)\n5            6       24           0.95  0.05  0.05  0.14    (0.16)  0.20    (0.22)\n\n5            100     418          0.95  0.05  0.05  1e-3    (2e-3)  Out of memory\n5            200     882          0.95  0.05  0.05  4e-4    (1e-3)  Out of memory\n5            300     1276         0.95  0.05  0.05  6e-4    (1e-3)  Out of memory\n\n3            8       22           0.95  0.95  0.05  0.14    (0.15)  0.22    (0.23)\n3            10      25           0.95  0.95  0.05  0.11    (0.14)  0.33    (0.40)\n3            12      34           0.95  0.95  0.05  0.12    (0.22)  0.22    (0.16)\n\nTable 1: This table gives the performance of our method and Merlin for recessive and dominant diseases as\nmeasured by the disease prediction metric. The sizes of the simulated pedigrees are given in the first three\ncolumns, the disease model in the next three columns, and the performance of our method and that of Merlin in\nthe final four columns. In all instances, our method outperforms Merlin, sometimes by an order of magnitude.\nResults suggest that the standard deviation of our method is smaller than that of Merlin.
Notably, Merlin cannot even analyze the largest pedigrees, because Merlin does exact inference.\n\nThe founder haplotypes were taken from the phased haplotypes of the JPT+CHB HapMap [29] populations; see the Supplement for details. Each experiment was replicated 10 times, where for each replicate the founder haplotypes were sampled with a different random seed. We computed a metric \u03c8, which is roughly the rank of the disease site in the sorted list of predictions given by each method.\nTable 1 compares the performance of our method against that of Merlin. In every case our method has better accuracy. The results suggest that our method has a lower standard deviation. Within each delineated row of the table, the mean \u03c8 values are not comparable because the pedigrees might be of different complexities. Between delineated rows of the table, we can compare the effect of pedigree size, and we observe that larger pedigrees aid in disease site prediction. Indeed, the largest pedigree of 1276 individuals reaches a mean \u03c8 of 6e-4. To our knowledge, this is the largest pedigree analyzed in the literature.\n\n6 Discussion\n\nThis paper introduces a new disease model and a new variational inference method, which are applied to find a Bayesian solution to the disease-site correlation problem. This is in contrast to traditional linkage analysis, where a likelihood ratio statistic is computed to find the position of the disease site relative to a map of existing sites. Instead, our approach is to use a Haplotype-Phenotype Transducer to obtain the posterior probability that each site is the disease site. This approach is well-suited to modern data, which is very dense in the genome. 
Particularly with sequencing data, it is likely that either the disease site or a nearby site will be observed.\nOur method performs well in practice both for genotype prediction and for disease site prediction. In the presence of missing data, where for some individuals the whole genome is missing, our method is able to infer the missing genotypes with high accuracy. Compared with the LOD linkage analysis method, our method was better able to predict the disease site when one observed site was responsible for the disease.\n\nReferences\n[1] G. Mendel. Experiments in plant-hybridisation. In English Translation and Commentary by R. A. Fisher, J. H. Bennett, ed. Oliver and Boyd, Edinburgh 1965, 1866.\n\n[2] A. H. Sturtevant. The linear arrangement of six sex-linked factors in Drosophila, as shown by their mode of association. Journal of Experimental Zoology, 14:43\u201359, 1913.\n\n[3] GR Abecasis, SS Cherny, WO Cookson, et al. Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nature Genetics, 30:97\u2013101, 2002.\n\n[4] M. Silberstein, A. Tzemach, N. Dovgolevsky, M. Fishelson, A. Schuster, and D. Geiger. On-line system for faster linkage analysis via parallel execution on thousands of personal computers. American Journal of Human Genetics, 78(6):922\u2013935, 2006.\n\n[5] D. Geiger, C. Meek, and Y. Wexler. Speeding up HMM algorithms for genetic linkage analysis via chain reductions of the state space. Bioinformatics, 25(12):i196, 2009.\n\n[6] C. A. Albers, M. A. R. Leisink, and H. J. Kappen. The cluster variation method for efficient linkage analysis on extended pedigrees. BMC Bioinformatics, 7(S-1), 2006.\n\n[7] M. L. Metzker. Sequencing technologies\u2013the next generation. Nat Rev Genet, 11(1):31\u201346, January 2010.\n\n[8] M. Abney, C. Ober, and M. S. McPeek. 
Quantitative-trait homozygosity and association mapping and empirical genome wide significance in large, complex pedigrees: Fasting serum-insulin level in the Hutterites. American Journal of Human Genetics, 70(4):920\u2013934, 2002.\n\n[9] N. B. Sutter et al. A Single IGF1 Allele Is a Major Determinant of Small Size in Dogs. Science, 316(5821):112\u2013115, 2007.\n\n[10] M. Mohri. Handbook of Weighted Automata, chapter 6. Monographs in Theoretical Computer Science. Springer, 2009.\n\n[11] A. Bouchard-C\u00f4t\u00e9 and M. I. Jordan. Variational Inference over Combinatorial Spaces. In Advances in Neural Information Processing Systems 23 (NIPS), 2010.\n\n[12] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Bethe free energy, Kikuchi approximations and belief propagation algorithms. In Advances in Neural Information Processing Systems (NIPS), 2001.\n\n[13] E. M. Wijsman. Penetrance. John Wiley & Sons, Ltd, 2005.\n\n[14] R. C. Elston and J. Stewart. A general model for the analysis of pedigree data. Human Heredity, 21:523\u2013542, 1971.\n\n[15] E. S. Lander and P. Green. Construction of multilocus genetic linkage maps in humans. Proceedings of the National Academy of Sciences, 84(5):2363\u20132367, 1987.\n\n[16] J. Marchini, P. Donnelly, and L. R. Cardon. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat. Genet., 37(4):413\u2013417, 2005.\n\n[17] Y. W. Teh, C. Blundell, and L. T. Elliott. Modelling genetic variations with fragmentation-coagulation processes. In Advances In Neural Information Processing Systems, 2011.\n\n[18] A. Piccolboni and D. Gusfield. On the complexity of fundamental computational problems in pedigree analysis. Journal of Computational Biology, 10(5):763\u2013773, 2003.\n\n[19] S. L. Lauritzen and N. A. Sheehan. Graphical models for genetic analysis. Statistical Science, 18(4):489\u2013514, 2003.\n\n[20] A. Thomas, A. Gutin, V. Abkevich, and A. Bansal. 
Multilocus linkage analysis by blocked Gibbs sampling. Statistics and Computing, 10(3):259\u2013269, July 2000.\n\n[21] G. O. Roberts and S. K. Sahu. Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(2):291\u2013317, 1997.\n\n[22] A. Bouchard-C\u00f4t\u00e9 and M. I. Jordan. Optimization of structured mean field objectives. In Proceedings of the Twenty-Fifth Annual Conference on Uncertainty in Artificial Intelligence (UAI-09), pages 67\u201374, Corvallis, Oregon, 2009. AUAI Press.\n\n[23] T. Minka and Y. Qi. Tree-structured approximations by expectation. In Advances in Neural Information Processing Systems (NIPS), 2003.\n\n[24] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching. In AISTATS, 2003.\n\n[25] R. H. Swendsen and J.-S. Wang. Nonuniversal critical dynamics in Monte Carlo simulations. Phys. Rev. Lett., 58:86\u201388, Jan 1987.\n\n[26] J. Wakeley. Coalescent Theory: An Introduction. Roberts & Company Publishers, 1st edition, June 2008.\n\n[27] B. Kirkpatrick, E. Halperin, and R. M. Karp. Haplotype inference in complex pedigrees. Journal of Computational Biology, 17(3):269\u2013280, 2010.\n\n[28] C. A. Albers, T. Heskes, and H. J. Kappen. Haplotype inference in general pedigrees using the cluster variation method. Genetics, 177(2):1101\u20131116, October 2007.\n\n[29] The International HapMap Consortium. The international HapMap project. Nature, 426:789\u2013796, 2003.\n", "award": [], "sourceid": 1315, "authors": [{"given_name": "Bonnie", "family_name": "Kirkpatrick", "institution": null}, {"given_name": "Alexandre", "family_name": "Bouchard-c\u00f4t\u00e9", "institution": null}]}