{"title": "Clustering sequence sets for motif discovery", "book": "Advances in Neural Information Processing Systems", "page_first": 970, "page_last": 978, "abstract": "Most of existing methods for DNA motif discovery consider only a single set of sequences to find an over-represented motif. In contrast, we consider multiple sets of sequences where we group sets associated with the same motif into a cluster, assuming that each set involves a single motif. Clustering sets of sequences yields clusters of coherent motifs, improving signal-to-noise ratio or enabling us to identify multiple motifs. We present a probabilistic model for DNA motif discovery where we identify multiple motifs through searching for patterns which are shared across multiple sets of sequences. Our model infers cluster-indicating latent variables and learns motifs simultaneously, where these two tasks interact with each other. We show that our model can handle various motif discovery problems, depending on how to construct multiple sets of sequences. Experiments on three different problems for discovering DNA motifs emphasize the useful behavior and confirm the substantial gains over existing methods where only single set of sequences is considered.", "full_text": "Clustering Sequence Sets for Motif Discovery\n\nJong Kyoung Kim and Seungjin Choi\n\nDepartment of Computer Science\n\nPohang University of Science and Technology\n\nSan 31 Hyoja-dong, Nam-gu\n\nPohang 790-784, Korea\n\nfblkimjk,seungjing@postech.ac.kr\n\nAbstract\n\nMost of existing methods for DNA motif discovery consider only a single set of\nsequences to (cid:2)nd an over-represented motif. In contrast, we consider multiple\nsets of sequences where we group sets associated with the same motif into a clus-\nter, assuming that each set involves a single motif. Clustering sets of sequences\nyields clusters of coherent motifs, improving signal-to-noise ratio or enabling us\nto identify multiple motifs. 
We present a probabilistic model for DNA motif discovery where we identify multiple motifs by searching for patterns which are shared across multiple sets of sequences. Our model infers cluster-indicating latent variables and learns motifs simultaneously, where these two tasks interact with each other. We show that our model can handle various motif discovery problems, depending on how the multiple sets of sequences are constructed. Experiments on three different problems for discovering DNA motifs demonstrate the useful behavior and confirm substantial gains over existing methods where only a single set of sequences is considered.\n\n1 Introduction\n\nDiscovering how DNA-binding proteins called transcription factors (TFs) regulate gene expression programs in living cells is fundamental to understanding the transcriptional regulatory networks controlling development, cancer, and many human diseases. TFs that bind to specific cis-regulatory elements in DNA sequences are essential for mediating this transcriptional control. The first step toward deciphering this complex network is to identify functional binding sites of TFs, referred to as motifs.\n\nWe address the problem of discovering sequence motifs that are enriched in a given target set of sequences, compared to a background model (or a set of background sequences). There has been extensive research on statistical modeling of this problem (see [1] for a review), and recent works have focused on improving the motif-finding performance by integrating additional information into comparative [2] and discriminative motif discovery [3].\n\nDespite the relatively long history and the critical role of motif discovery in bioinformatics, many issues remain unsolved and controversial. First, the target set of sequences is assumed to have only one motif, but this assumption is often incorrect. 
For example, a recent study examining the binding specificities of 104 mouse TFs observed that nearly half of the TFs recognize multiple sequence motifs [4]. Second, it is unclear how to select the target set on which over-represented motifs are returned. The target set of sequences is often constructed from genome-wide binding location data (ChIP-chip or ChIP-seq) or gene expression microarray data. However, there is no clear way to partition the data into target and background sets in general. Third, a unified algorithm which is applicable to diverse motif discovery problems is sorely needed to provide a principled framework for developing more complex models.\n\nFigure 1: Notation illustration.\n\nThese considerations motivate us to develop a generative probabilistic framework for learning multiple motifs on multiple sets of sequences. One can view our framework as an extension of classic sequence models such as the two-component mixture (TCM) [5] and the zero or one occurrence per sequence (ZOOPS) [6] models, in which sequences are partitioned into two clusters depending on whether or not they contain a motif. 
In this paper, we make use of a finite mixture model to partition the multiple sequence sets into clusters having distinct sequence motifs, which improves the motif-finding performance over the classic models by enhancing the signal-to-noise ratio of input sequences. We also show how our algorithm can be applied to three different problems by simply changing the way multiple sets are constructed from input sequences, without any algorithmic modifications.\n\n2 Problem formulation\n\nWe are given M sets of DNA sequences S = {S_1, ..., S_M} to be grouped according to the type of motif involved, in which each set is associated with only a single motif but multiple binding sites may be present in each sequence. A set of DNA sequences S_m = {s_{m,1}, ..., s_{m,L_m}} is a collection of strings s_{m,i} of length |s_{m,i}| over the alphabet Σ = {A, C, G, T}. To allow for a variable number of binding sites per sequence, we represent each sequence s_{m,i} as a set of overlapping subsequences s^W_{m,ij} = (s_{m,ij}, s_{m,i(j+1)}, ..., s_{m,i(j+W−1)}) of length W starting at position j ∈ I_{m,i}, where s_{m,ij} denotes the letter at position j and I_{m,i} = {1, ..., |s_{m,i}| − W + 1}, as shown in Fig. 1. We introduce a latent variable matrix z_{m,i} ∈ R^{2×|I_{m,i}|} in which the jth column vector z_{m,ij} is a 2-dimensional binary random vector [z_{m,ij1}, z_{m,ij2}]^T such that z_{m,ij} = [0, 1]^T if a binding site starts at position j ∈ I_{m,i}, and otherwise z_{m,ij} = [1, 0]^T. We also introduce K-dimensional binary random vectors t_m ∈ R^K (t_{m,k} ∈ {0, 1} and Σ_k t_{m,k} = 1) for m = 1, ..., M, which partition the sequence sets S into K disjoint clusters, where sets in the same cluster are associated with the same common motif.\n\nFor a motif model, we use a position-frequency matrix whose entries correspond to probability distributions (over the alphabet Σ) of each position within a binding site. 
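As an illustrative sketch only (the toy matrix, alphabet encoding, and function names are assumptions for exposition, not the authors' implementation), a position-frequency matrix of width W and a 0th-order background model assign likelihoods to a subsequence as follows:

```python
import numpy as np

# Toy alphabet encoding over {A, C, G, T} (an assumption for illustration).
ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

def motif_likelihood(subseq, theta):
    """p(s | Theta): product over positions w of Theta[w, letter at w]."""
    return float(np.prod([theta[w, ALPHABET[c]] for w, c in enumerate(subseq)]))

def background_likelihood(subseq, theta0):
    """p(s | theta0): 0th-order background, one distribution for all positions."""
    return float(np.prod([theta0[ALPHABET[c]] for c in subseq]))

W = 4
theta = np.full((W, 4), 0.1)          # toy position-frequency matrix
theta[np.arange(W), [ALPHABET[c] for c in "ACGT"]] = 0.7  # prefers "ACGT"
theta0 = np.full(4, 0.25)             # uniform background

# The consensus subsequence scores far higher under the motif model
# than under the background: 0.7^4 versus 0.25^4.
print(motif_likelihood("ACGT", theta), background_likelihood("ACGT", theta0))
```

The ratio of these two likelihoods is what drives the binding-site indicator updates later in the paper.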
We denote by Θ_k ∈ R^{W×4} the kth motif model of length W over Σ, where Θ_{k,w}^T represents row w; each entry is non-negative, Θ_{k,wl} ≥ 0 for all w, l, and Σ^4_{l=1} Θ_{k,wl} = 1 for all w. The background model θ_0, which describes frequencies over the alphabet within non-binding sites, is defined by a Pth-order Markov chain (represented by a (P+1)-dimensional conditional probability table).\n\nOur goal is to construct a probabilistic model for DNA motif discovery where we identify multiple motifs by searching for patterns which are shared across multiple sets of sequences. Our model infers cluster-indicating latent variables (to find a good partition of S) and learns motifs (inferring binding site-indicating latent variables z_{m,i}) simultaneously, where these two tasks interact with each other.\n\nFigure 2: Graphical representation of our mixture model for M sequence sets.\n\n3 Mixture model for motif discovery\n\nWe assume that the distribution of S is modeled as a mixture of K components, where it is not known in advance which mixture component underlies a particular set of sequences. We also assume that the conditional distribution of the subsequence s^W_{m,ij} given t_m is modeled as a mixture of two components, each of which corresponds to the motif and the background models, respectively. Then, the joint distribution of observed sequence sets S and (unobserved) latent variables Z and T conditioned on parameters Φ is written as:\n\np(S, Z, T | Φ) = ∏_{m=1}^{M} p(t_m | Φ) ∏_{i=1}^{L_m} ∏_{j∈I_{m,i}} p(s^W_{m,ij} | z_{m,ij}, t_m, Φ) p(z_{m,ij} | Φ),   (1)\n\nwhere Z = {z_{m,ij}} and T = {t_m}. The graphical model associated with (1) is shown in Fig. 
2.\n\nThe generative process for subsequences s^W_{m,ij} is described as follows. We first draw mixture weights v = [v_1, ..., v_K]^T (involving set clusters) from the Dirichlet distribution:\n\np(v | α) ∝ ∏_{k=1}^{K} v_k^{α_k/K − 1},   (2)\n\nwhere α = [α_1, ..., α_K]^T are the hyperparameters. Given mixture weights, we choose the cluster indicator t_m for S_m according to the multinomial distribution p(t_m | v) = ∏_{k=1}^{K} v_k^{t_{m,k}}. The chosen kth motif model Θ_k is drawn from a product of Dirichlet distributions:\n\np(Θ_k | β) = ∏_{w=1}^{W} p(Θ_{k,w} | β) ∝ ∏_{w=1}^{W} ∏_{l=1}^{4} Θ_{k,wl}^{β_l − 1},   (3)\n\nwhere β = [β_1, ..., β_4]^T are the hyperparameters. The latent variables z_{m,ij} indicating the starting positions of binding sites are governed by the prior distribution specified by:\n\np(z_{m,ij} | π) = ∏_{r=1}^{2} π_r^{z_{m,ijr}},   (4)\n\nwhere the mixture weights π = [π_1, π_2]^T satisfy π_1, π_2 ≥ 0 and π_1 + π_2 = 1. Finally, the subsequences s^W_{m,ij} are drawn from the following conditional distribution:\n\np(s^W_{m,ij} | t_m, z_{m,ij}, {Θ_k}_{k=1}^{K}, θ_0) = p(s^W_{m,ij} | θ_0)^{z_{m,ij1}} ∏_{k=1}^{K} (p(s^W_{m,ij} | Θ_k)^{z_{m,ij2}})^{t_{m,k}},   (5)\n\nwhere\n\np(s^W_{m,ij} | θ_0) = ∏_{w=1}^{W} ∏_{l=1}^{4} θ_{0l}^{δ(l, s_{m,i(j+w−1)})},   p(s^W_{m,ij} | Θ_k) = ∏_{w=1}^{W} ∏_{l=1}^{4} Θ_{k,wl}^{δ(l, s_{m,i(j+w−1)})},\n\nand δ(l, s_{m,i(j+w−1)}) is an indicator function which equals 1 if s_{m,i(j+w−1)} = l, and 0 otherwise. Here, the background model is specified by the 0th-order Markov chain for notational simplicity. Several assumptions simplify this generative model. 
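As an illustrative sketch only (toy dimensions, variable names, and the use of NumPy are assumptions, not the authors' code), the generative process in (2)-(5) can be simulated as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions assumed for illustration: K clusters, M sets, motif width W.
K, M, W = 2, 3, 4
alpha, beta = np.ones(K), np.full(4, 0.5)
pi = np.array([0.9, 0.1])                 # (4): [background, motif] prior

v = rng.dirichlet(alpha)                  # (2): set-cluster mixture weights
t = rng.choice(K, size=M, p=v)            # cluster indicator per sequence set
theta = rng.dirichlet(beta, size=(K, W))  # (3): K position-frequency matrices
theta0 = np.full(4, 0.25)                 # 0th-order background model

def draw_subsequence(m):
    """(4)-(5): pick motif vs. background for one position, emit W letters."""
    z = rng.choice(2, p=pi)
    probs = theta[t[m]] if z == 1 else np.tile(theta0, (W, 1))
    return "".join("ACGT"[rng.choice(4, p=probs[w])] for w in range(W)), z

seq, z = draw_subsequence(0)
print(seq, z)
```

Each sampled subsequence thus comes either from the background or from the motif model of its set's cluster, mirroring the mixture structure of (5).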
First, the width W of the motif model and the number K of set clusters are assumed to be known and fixed. Second, the mixture weights π together with the background model θ_0 are treated as parameters to be estimated. We assume the hyperparameters α and β are set to fixed and known constants. The full set of parameters and hyperparameters will be denoted by Φ = {α, β, π, θ_0}. Extension to double-stranded DNA sequences is straightforward and omitted here due to the lack of space.\n\nOur model builds upon the existing TCM model proposed by [5], where the EM algorithm is applied to learn a motif on a single target set. This model actually generates subsequences instead of sequences themselves. An alternative model which explicitly generates sequences has been proposed based on Gibbs sampling [7, 8]. Note that our model reduces to the TCM model if K, the number of set clusters, is set to one.\n\nOur model shares some similarities with the recent Bayesian hierarchical model in [9], which also uses a mixture model to cluster discovered motifs. The main difference is that they focus on clustering motifs already discovered, whereas in our formulation we try to cluster sequence sets and discover motifs simultaneously.\n\n4 Inference by Gibbs sampling\n\nWe find the configurations of Z and T by maximizing the posterior distribution over latent variables:\n\nZ*, T* = argmax_{Z,T} p(Z, T | S, Φ).   (6)\n\nTo this end, we use Gibbs sampling to find the posterior modes by drawing samples repeatedly from the posterior distribution over Z and T. We will derive a Gibbs sampler for our generative model in which the set mixture weights v and motif models {Θ_k}_{k=1}^{K} are integrated out to improve the convergence rate and the cost per iteration [8]. The critical quantities needed to implement the Gibbs sampler are the full conditional distributions for Z and T. 
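As a small illustrative sketch (the helper name and symmetric-prior parameterization are assumptions, not the authors' implementation), the predictive probability of a cluster indicator with the mixture weights v integrated out, of the form used in the collapsed sampler below, can be computed as:

```python
import numpy as np

def cluster_predictive(t, m, K, alpha=1.0):
    """Predictive p(t_m = k | all other cluster indicators) under a symmetric
    Dirichlet(alpha/K) prior: (N_k^{-m} + alpha/K) / (M - 1 + alpha),
    where N_k^{-m} counts sets currently in cluster k, excluding set m."""
    t = np.asarray(t)
    M = len(t)
    counts = np.bincount(np.delete(t, m), minlength=K)  # N_k^{-m}
    return (counts + alpha / K) / (M - 1 + alpha)

# Removing set 0 leaves counts [1, 3] over K=2 clusters,
# so the predictive is (1.5/5, 3.5/5) = (0.3, 0.7).
p = cluster_predictive([0, 0, 1, 1, 1], m=0, K=2)
print(p)
```

Clusters that already contain more sets are favored a priori; the likelihood term of the full conditional then reweights this by how well set m's binding sites match each cluster's motif counts.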
We first derive the relevant full conditional distribution over t_m conditioned on the set cluster assignments of all other sets, T_{\m}, the latent positions Z, and the observed sets S. By applying Bayes' rule, Fig. 2 implies that this distribution factorizes as follows:\n\np(t_{m,k} = 1 | T_{\m}, S, Z, Φ) ∝ p(t_{m,k} = 1 | T_{\m}, α) p(S_m, Z_m | T, S_{\m}, Z_{\m}, Φ),   (7)\n\nwhere Z_{\m} denotes the entries of Z other than Z_m = {z_{m,i}}_{i=1}^{L_m}, and S_{\m} is similarly defined. The first term represents the predictive distribution of t_m given the other set cluster assignments T_{\m}, and is given by marginalizing the set mixture weights v:\n\np(t_{m,k} = 1 | T_{\m}, α) = ∫ p(t_{m,k} = 1 | v) p(v | T_{\m}, α) dv = (N_k^{−m} + α_k/K) / (M − 1 + α_k),   (8)\n\nwhere N_k^{−m} = Σ_{n≠m} δ(t_{n,k}, 1) counts the number of sets currently assigned to the kth set cluster excluding the mth set. The model's Markov structure implies that the second term of (7) depends on the current assignments T_{\m} as follows:\n\np(S_m, Z_m | t_{m,k} = 1, T_{\m}, S_{\m}, Z_{\m}, Φ) = ∫ p(S_m, Z_m | t_{m,k} = 1, Θ_k, Φ) p(Θ_k | {S_n, Z_n | t_{n,k} = 1, n ≠ m}, Φ) dΘ_k = [∏_{i=1}^{L_m} ∏_{j∈I_{m,i}} p(z_{m,ij} | π) ∏_{j∈I_{m,i}, z_{m,ij2}=0} p(s^W_{m,ij} | z_{m,ij2} = 0, θ_0)] × [∏_{w=1}^{W} { Γ(Σ_l (N_{wl}^{−m} + β_l)) / ∏_l Γ(N_{wl}^{−m} + β_l) } { ∏_l Γ(N_{wl} + β_l) / Γ(Σ_l (N_{wl} + β_l)) }],   (9)\n\nwhere N_{wl} = N_{wl}^{−m} + N_{wl}^{m} and\n\nN_{wl}^{−m} = Σ_{t_{n,k}=1, n≠m} Σ_{i=1}^{L_n} Σ_{j∈I_{n,i}, z_{n,ij2}=1} δ(s_{n,i(j+w−1)}, l),   N_{wl}^{m} = Σ_{i=1}^{L_m} Σ_{j∈I_{m,i}, z_{m,ij2}=1} δ(s_{m,i(j+w−1)}, l).\n\nNote that N_{wl}^{−m} counts the number of letter l at position w within currently assigned binding sites excluding the ones of the mth set. 
Similarly, N_{wl}^{m} denotes the number of letter l at position w within binding sites of the mth set.\n\nWe next derive the full conditional distribution of z_{m,ij} given the remainder of the variables. Integrating over the motif model Θ_k, we then have the following factorization:\n\np(z_{m,ij} | Z_{\m,ij}, S, t_{m,k} = 1, T_{\m}, Φ) ∝ ∫ ∏_{t_{n,k}=1} p(Z_n, S_n | Θ_k) p(Θ_k | β) dΘ_k ∝ [∏_{t_{n,k}=1} ∏_{i=1}^{L_n} ∏_{j=1}^{|I_{n,i}|} p(z_{n,ij} | π) ∏_{i=1}^{L_n} ∏_{j∈I_{n,i}, z_{n,ij2}=0} p(s^W_{n,ij} | θ_0)] × [∏_{w=1}^{W} ∏_l Γ(N_{wl} + β_l) / Γ(Σ_l (N_{wl} + β_l))],   (10)\n\nwhere Z_{\m,ij} denotes the entries of Z other than z_{m,ij}. For the purpose of sampling, the ratio of the posterior distribution of z_{m,ij} is given by:\n\np(z_{m,ij2} = 1 | Z_{\m,ij}, S, T, Φ) / p(z_{m,ij2} = 0 | Z_{\m,ij}, S, T, Φ) = π_2 / (π_1 p(s^W_{m,ij} | θ_0)) ∏_{w=1}^{W} [Σ_{l=1}^{4} (N_{wl}^{−m,ij} + β_l) δ(s_{m,i(j+w−1)}, l)] / [Σ_{l=1}^{4} (N_{wl}^{−m,ij} + β_l)],\n\nwhere N_{wl}^{−m,ij} = Σ_{t_{n,k}=1} Σ_{i=1}^{L_n} Σ_{j'∈I_{n,i}, j'≠j, z_{n,ij'2}=1} δ(s_{n,i(j'+w−1)}, l). 
Note that N_{wl}^{−m,ij} denotes the number of letter l at position w within currently assigned binding sites other than z_{m,ij}. Combining (7) with (10) is sufficient to define the Gibbs sampler for our finite mixture model. To provide a convergence measure, we derive the following objective function based on the log of the posterior distribution:\n\nlog p(Z, T | S, Φ) ∝ log p(Z, T, S | Φ) ∝ Σ_{m=1}^{M} Σ_{i=1}^{L_m} Σ_{j=1}^{|I_{m,i}|} log p(z_{m,ij} | π) + Σ_{m=1}^{M} Σ_{i=1}^{L_m} Σ_{j∈I_{m,i}, z_{m,ij2}=0} log p(s^W_{m,ij} | θ_0) + Σ_{k=1}^{K} Σ_{w=1}^{W} { Σ_l log Γ(N_{wl}^{k} + β_l) − log Γ(Σ_l (N_{wl}^{k} + β_l)) } + Σ_{k=1}^{K} log Γ(N_k + α_k/K),   (11)\n\nwhere N_k = Σ_m δ(t_{m,k}, 1) and N_{wl}^{k} = Σ_{t_{m,k}=1} Σ_{i=1}^{L_m} Σ_{j∈I_{m,i}, z_{m,ij2}=1} δ(s_{m,i(j+w−1)}, l).\n\n5 Results\n\nWe evaluated our motif-finding algorithm on three different tasks: (1) filtering out undesirable noisy sequences, (2) incorporating evolutionary conservation information, and (3) clustering DNA sequences based on the learned motifs (Fig. 3). In all experiments, we fixed the hyperparameters so that α_k = 1 and β_l = 0.5.\n\n5.1 Data sets and evaluation criteria\n\nWe first examined the yeast ChIP-chip data published by [10] to investigate the effect of filtering out noisy sequences from input sequences on identifying true binding sites. We compiled 156 sequence-sets by choosing TFs having consensus motifs in the literature [11]. For each sequence-set, we defined its sequences to be probe sequences that are bound with P-value ≤ 0.001.\n\nFigure 3: Three different ways of constructing multiple sequence sets. (a) Filtering out noisy sequences, (b) Evolutionary conservation, (c) Motif-based clustering. Black rectangles: sequence sets, blue bars: sequences, red dashed rectangles: set clusters, red and green rectangles: motifs.\n\nTo apply our algorithm to the comparative motif discovery problem, we compiled orthologous sequences for each probe sequence of the yeast ChIP-chip data based on the multiple alignments of seven species of Saccharomyces (S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, S. bayanus, S. castelli, and S. 
kluyveri) [12]. In the experiments using the ChIP-chip data, the motif width was set to 8 and a fifth-order Markov chain estimated from the whole yeast intergenic sequences was used to describe the background model. We fixed the mixture weights π so that π_2 = 0.001.\n\nWe next constructed the ChIP-seq data for human neuron-restrictive silencer factor (NRSF) to determine whether our algorithm can be applied to partition DNA sequences into biologically meaningful clusters [13]. The data consist of 200 sequence segments of length 100 from all peak sites with the top 10% binding intensity (≥ 500 ChIP-seq reads), where most sequences have canonical NRSF-binding sites. We also added 13 sequence segments extracted from peak sites (≥ 300 reads) known to have noncanonical NRSF-binding sites, resulting in 213 sequences. In the experiment using the ChIP-seq data, the motif width was set to 30 and a zero-order Markov chain estimated from the 213 sequence segments was used to describe the background model. We fixed the mixture weights π so that π_2 = 0.005.\n\nIn the experiments using the yeast ChIP-chip data, we used the inter-motif distance to measure the quality of discovered motifs [10]. Specifically, an algorithm is called successful on a sequence set only if at least one of the position-frequency matrices constructed from the identified binding sites is at a distance less than 0.25 from the literature consensus [14].\n\n5.2 Filtering out noisy sequences\n\nSelecting target sequences from the ChIP-chip measurements is largely left to users and this choice is often unclear. Our strategy of constructing sequence-sets based on the binding P-value cutoff risks including many irrelevant sequences. In practice, the inclusion of noisy sequences in the target set is a serious obstacle to the success of motif discovery. 
One possible solution is to cluster input sequences into two smaller sets of target and noisy sequences based on sequence similarity, and predict motifs from the clustered target sequences with the improved signal-to-noise ratio. This two-step approach has been applied only to protein sequences, because DNA sequences do not share enough similarity for effective clustering [15].\n\nAn alternative approach is to seek a better sequence representation based on motifs. To this end, we constructed multiple sets by treating each sequence of a particular yeast ChIP-chip sequence-set as one set (Fig. 3(a)). We examined the ability of our algorithm to find a correct motif with two different numbers of clusters: K = 1 (without filtering) and K = 2 (clustering into two subsets of true and noisy sequences). We ran each experiment five times with different initializations and report means with ±1 standard error. Figure 4 shows that the filtering approach (K = 2) generally outperforms the baseline method (K = 1) as the P-value cutoff increases. Note that the ZOOPS and TCM models can also handle noisy sequences by modeling them with only a background model [5, 6]. In contrast, we allow noisy sequences to have a decoy motif (randomly occurring sequence patterns or repeating elements) which is modeled with a motif model.\n\nFigure 4: Effect of filtering out noisy sequences on the number of successfully identified motifs on the yeast ChIP-chip data. K = 1: without filtering, K = 2: clustering into two subsets.\n\nBecause our model reduces to these classic models by setting K = 1, we conclude that noisy sequences are better represented by our clustering approach than by the previous ones using the background model (Fig. 4). Two additional lines of evidence indicate that our filtering approach enhances the signal-to-noise ratio of the target set. 
First, we compared the results of our filtering approach with those of other baseline methods (AlignAce [16], MEME [6], MDScan [17], and PRIORITY-U [11]) on the same yeast ChIP-chip data. For AlignAce, MEME, and MDScan, we used the results reported by [14]; for PRIORITY-U, we used two different results reported by [14, 11] according to different sampling strategies. We expected our model to perform better than these four methods because they try to remove noisy sequences based on the classic models. Comparing the results of Fig. 4 and Table 1, we see that our algorithm still performs better. Second, we also compared our model with DRIM, which is specifically designed to dynamically select the target set from the list of sequences sorted according to the binding P-values of ChIP-chip measurements. For DRIM, we used the result reported by [18]. Because DRIM does not produce any motifs when they are not statistically enriched at the top of the ranked list, we counted the number of successfully identified motifs on the sequence-sets where DRIM generated significant motifs. Our method (16 successes) was slightly better than DRIM (15 successes).\n\n5.3 Detecting evolutionarily conserved motifs\n\nThe comparative approach using evolutionary conservation information has been widely used to improve the performance of motif-finding algorithms because functional TF binding sites are likely to be conserved in orthologous sequences. To incorporate conservation information into our clustering framework, orthologous sequences of each sequence of a particular yeast ChIP-chip sequence-set were considered as one set and the number of clusters was set to 2 (Fig. 3(b)). The constructed sets contain at most 7 sequences because we only used seven species of Saccharomyces. 
We used the single result with the highest objective function value of (11) among five runs and compared it with the results of five conservation-based motif-finding algorithms on the same data set: MEME_c [10], PhyloCon [19], PhyME [20], PhyloGibbs [21], and PRIORITY-C [11]. For the five methods, we used the results reported by [11]. We did not compare with discriminative methods, which are known to perform better on this data set, because our model does not use negative sequences. Table 1 presents the motif-finding performance in terms of the number of correctly identified motifs for each algorithm. Our algorithm greatly outperforms the four alignment-based methods which rely on multiple or pairwise alignments of orthologous sequences to search for motifs that are conserved across the aligned blocks of orthologous sequences. In our opinion, this is because diverged regions other than the short conserved binding sites may prevent a correct alignment. Moreover, our algorithm performs somewhat better than PRIORITY-C, a recent alignment-free method. We believe this is because the signal-to-noise ratio of the input target set is enhanced by clustering.\n\n5.4 Clustering DNA sequences based on motifs\n\nTo examine the ability of our algorithm to partition DNA sequences into biologically meaningful clusters, we applied our algorithm to the NRSF ChIP-seq data which are assumed to have two\n\nTable 1: Comparison of the number of successfully identified motifs on the yeast ChIP-chip data for different methods. 
NC: Non-conservation, EC: Evolutionary conservation, A: Alignment-based, AF: Alignment-free, C: Clustering.\n\nMethod        Description    # of successes\nAlignAce      NC             16\nMEME          NC             35\nMDScan        NC             54\nPRIORITY-U    NC             46-58\nMEME_c        EC + A         49\nPhyloCon      EC + A         19\nPhyME         EC + A         21\nPhyloGibbs    EC + A         54\nPRIORITY-C    EC + AF        69\nThis work     EC + AF + C    75\n\nFigure 5: Sequence logos of discovered NRSF motifs. (a) Canonical NRSF motif, (b) Noncanonical NRSF motif.\n\ndifferent NRSF motifs (Fig. 3(c)). In this experiment, the number of clusters is already known (K = 2). We ran our algorithm five times with different initializations and report the run with the highest objective function value. Position-frequency matrices of the two clusters are shown in Fig. 5. The two motifs correspond directly to the previously known motifs (canonical and noncanonical NRSF motifs). However, other motif-finding algorithms such as MEME could not return the noncanonical motif enriched in a very small set of sequences. These observations suggest that our motif-driven clustering approach is effective at inferring latent clusters of DNA sequences and can be used to find unexpected novel motifs.\n\n6 Conclusions\n\nIn this paper, we have presented a generative probabilistic framework for DNA motif discovery using multiple sets of sequences where we cluster DNA sequences and learn motifs interactively. We have presented a finite mixture model with two different types of latent variables, in which one is associated with cluster indicators and the other corresponds to motifs (transcription factor binding sites). These two types of latent variables are inferred alternately using multiple sets of sequences. Our empirical results show that the proposed method can be applied to various motif discovery problems, depending on how the multiple sets are constructed. In the future, we will explore several other extensions. 
For example, it would be interesting to examine the possibility of learning the number of clusters from data based on Dirichlet process mixture models, or to extend our probabilistic framework to discriminative motif discovery.\n\nAcknowledgments: We thank Raluca Gordân for providing the literature consensus motifs and the script to compute the inter-motif distance. This work was supported by the National Core Research Center for Systems Bio-Dynamics funded by Korea NRF (Project No. 2009-0091509) and the WCU Program (Project No. R31-2008-000-10100-0). JKK was supported by a Microsoft Research Asia fellowship.\n\nReferences\n[1] G. D. Stormo. DNA binding sites: representation and discovery. Bioinformatics, 16:16–23, 2000.\n[2] W. W. Wasserman and A. Sandelin. Applied bioinformatics for the identification of regulatory elements. Nature Reviews Genetics, 5:276–287, 2004.\n[3] E. Segal, Y. Barash, I. Simon, N. Friedman, and D. Koller. From promoter sequence to expression: a probabilistic framework. In Proceedings of the International Conference on Research in Computational Molecular Biology, pages 263–272, 2002.\n[4] G. Badis, M. F. Berger, A. A. Philippakis, S. Talukder, A. R. Gehrke, S. A. Jaeger, E. T. Chan, G. Metzler, A. Vedenko, X. Chen, H. Kuznetsov, C. F. Wang, D. Coburn, D. E. Newburger, Q. Morris, T. R. Hughes, and M. L. Bulyk. Diversity and complexity in DNA recognition by transcription factors. Science, 324:1720–1723, 2009.\n[5] T. L. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 1994.\n[6] T. L. Bailey and C. Elkan. The value of prior knowledge in discovering motifs with MEME. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 1995.\n[7] C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. 
S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208–214, 1993.\n[8] J. S. Liu, A. F. Neuwald, and C. E. Lawrence. Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. Journal of the American Statistical Association, 90:1156–1170, 1995.\n[9] S. T. Jensen and J. S. Liu. Bayesian clustering of transcription factor binding motifs. Journal of the American Statistical Association, 103:188–200, 2008.\n[10] C. T. Harbison, D. B. Gordon, T. I. Lee, N. J. Rinaldi, K. D. Macisaac, T. W. Danford, N. M. Hannett, J. B. Tagne, D. B. Reynolds, J. Yoo, E. G. Jennings, J. Zeitlinger, D. K. Pokholok, M. Kellis, P. A. Rolfe, K. T. Takusagawa, E. S. Lander, D. K. Gifford, E. Fraenkel, and R. A. Young. Transcriptional regulatory code of a eukaryotic genome. Nature, 431:99–104, 2004.\n[11] R. Gordan, L. Narlikar, and A. J. Hartemink. A fast, alignment-free, conservation-based method for transcription factor binding site discovery. In Proceedings of the International Conference on Research in Computational Molecular Biology, pages 98–111, 2008.\n[12] A. Siepel, G. Bejerano, J. S. Pedersen, A. S. Hinrichs, M. Hou, K. Rosenbloom, H. Clawson, J. Spieth, L. W. Hillier, S. Richards, G. M. Weinstock, R. K. Wilson, R. A. Gibbs, W. J. Kent, W. Miller, and D. Haussler. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research, 15:1034–1050, 2005.\n[13] D. S. Johnson, A. Mortazavi, R. M. Myers, and B. Wold. Genome-wide mapping of in vivo protein-DNA interactions. Science, 316:1497–1502, 2007.\n[14] L. Narlikar, R. Gordan, and A. J. Hartemink. Nucleosome occupancy information improves de novo motif discovery. In Proceedings of the International Conference on Research in Computational Molecular Biology, pages 107–121, 2007.\n[15] S. Kim, Z. 
Wang, and M. Dalkilic. iGibbs: improving Gibbs motif sampler for proteins by sequence clustering and iterative pattern sampling. Proteins, 66:671–681, 2007.\n[16] F. P. Roth, J. D. Hughes, P. W. Estep, and G. M. Church. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nature Biotechnology, 16:939–945, 1998.\n[17] X. S. Liu, D. L. Brutlag, and J. S. Liu. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nature Biotechnology, 20:835–839, 2002.\n[18] E. Eden, D. Lipson, S. Yogev, and Z. Yakhini. Discovering motifs in ranked lists of DNA sequences. PLoS Computational Biology, 3:e39, 2007.\n[19] T. Wang and G. D. Stormo. Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics, 19:2369–2380, 2003.\n[20] S. Sinha, M. Blanchette, and M. Tompa. PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics, 5:170, 2004.\n[21] R. Siddharthan, E. D. Siggia, and E. van Nimwegen. PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Computational Biology, 1:e67, 2005.\n", "award": [], "sourceid": 774, "authors": [{"given_name": "Jong", "family_name": "Kim", "institution": null}, {"given_name": "Seungjin", "family_name": "Choi", "institution": null}]}