{"title": "Probabilistic Inference of Alternative Splicing Events in Microarray Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1241, "page_last": 1248, "abstract": null, "full_text": "Probabilistic Inference of Alternative Splicing Events in Microarray Data\n\nOfer Shai, Brendan J. Frey, and Quaid D. Morris\nDept. of Electrical & Computer Engineering\nUniversity of Toronto, Toronto, ON\n\nQun Pan, Christine Misquitta, and Benjamin J. Blencowe\nBanting & Best Dept. of Medical Research\nUniversity of Toronto, Toronto, ON\n\n\nAbstract\n\nAlternative splicing (AS) is an important and frequent step in mammalian gene expression that allows a single gene to specify multiple products, and is crucial for the regulation of fundamental biological processes. The extent of AS regulation, and the mechanisms involved, are not well understood. We have developed a custom DNA microarray platform for surveying AS levels on a large scale. We present here a generative model for the AS Array Platform (GenASAP) and demonstrate its utility for quantifying AS levels in different mouse tissues. Learning is performed using a variational expectation maximization algorithm, and the parameters are shown to correctly capture expected AS trends. A comparison with results obtained by a well-established but low-throughput experimental method demonstrates that AS levels obtained from GenASAP are highly predictive of AS levels in mammalian tissues.\n\n\n1 Biological diversity through alternative splicing\n\nCurrent estimates place the number of genes in the human genome at approximately 30,000, which is a surprisingly small number when one considers that the genome of yeast, a single-celled organism, has 6,000 genes. The number of genes alone cannot account for the complexity and cell specialization exhibited by higher eukaryotes (e.g., 
mammals and plants). Some of that added complexity can be achieved through the use of alternative splicing, whereby a single gene can be used to code for a multitude of products.\n\nGenes are segments of the double-stranded DNA that contain the information required by the cell for protein synthesis. That information is coded using an alphabet of four letters (A, C, G, and T), corresponding to the four nucleotides that make up the DNA. In what is known as the central dogma of molecular biology, DNA is transcribed to RNA, which in turn is translated into proteins. Messenger RNA (mRNA) is synthesized in the nucleus of the cell and carries the genomic information to the ribosome. In eukaryotes, genes are generally composed of both exons, which contain the information needed by the cell to synthesize proteins, and introns, sometimes referred to as spacer DNA, which are spliced out of the pre-mRNA to create mature mRNA. An estimated 35%-75% of human genes [1] can be spliced to yield different combinations of exons (called isoforms), a phenomenon referred to as alternative splicing (AS). There are four major types of AS, as shown in Figure 1. Many multi-exon genes may undergo more than one alternative splicing event, resulting in many possible isoforms from a single gene [2].\n\nFigure 1: Four types of AS. Boxes represent exons and lines represent introns, with the possible splicing alternatives indicated by the connectors. (a) Single cassette exon inclusion/exclusion. C1 and C2 are constitutive exons (exons that are included in all isoforms) and flank a single alternative exon (A). The alternative exon is included in one isoform and excluded in the other. (b) Alternative 3' (acceptor) and alternative 5' (donor) splice sites. Both exons are constitutive, but may contain alternative donor and/or acceptor splice sites. (c) Mutually exclusive exons. One of the two alternative exons (A1 and A2) may be included in the isoform, but not both. (d) Intron inclusion. An intron may be included in the mature mRNA strand.\n\nIn addition to adding to the genetic repertoire of an organism by enabling a single gene to code for more than one protein, AS has been shown to be critical for gene regulation, contributing to tissue specificity, and facilitating evolutionary processes. Despite the evident importance of AS, its regulation and impact on specific genes remain poorly understood. The work presented here is concerned with the inference of single cassette exon AS levels (Figure 1a) based on data obtained from RNA expression arrays, also known as microarrays.\n\n\n1.1 An exon microarray data set that probes alternative splicing events\n\nAlthough it is possible to directly analyze the proteins synthesized by a cell, it is easier, and often more informative, to instead measure the abundance of mRNA present. Traditionally, gene expression (abundance of mRNA) has been studied using low-throughput techniques (such as RT-PCR or Northern blots), which are limited to studying a few sequences at a time and make large-scale analysis nearly impossible.\n\nIn the early 1990s, microarray technology emerged as a method capable of measuring the expression of thousands of DNA sequences simultaneously. Sequences of interest are deposited on a substrate the size of a small microscope slide, to form probes. The mRNA is extracted from the cell and reverse-transcribed back into DNA, which is labelled with green and red fluorescent dye molecules (Cy3 and Cy5, respectively). 
When the sample of tagged DNA is washed over the slide, complementary strands of DNA from the sample hybridize to the probes on the array, forming A-T and C-G pairings. The slide is then scanned and the fluorescent intensity is measured at each probe. It is generally assumed that the intensity measured at the probe is linearly related to the abundance of mRNA in the cell over a wide dynamic range.\n\nDespite significant improvements in microarray technologies in recent years, microarray data still present some difficulties in analysis. Low measurements tend to have an extremely low signal-to-noise ratio (SNR) [7], and probes often bind to sequences that are very similar, but not identical, to the one for which they were designed (a process referred to as cross-hybridization). Additionally, probes exhibit somewhat varying hybridization efficiency, and sequences exhibit varying labelling efficiency.\n\nFigure 2: Each alternative splicing event is studied using six probes. Three body probes measure the expression levels of each of the three exons involved in the event (C1, A, and C2). Additionally, three probes target the junctions formed by each of the two isoforms: two inclusion junction probes (C1:A and A:C2) and one exclusion junction probe (C1:C2). The inclusion isoform expresses the junctions formed by C1 and A, and by A and C2, while the exclusion isoform expresses the junction formed by C1 and C2.\n\nTo design our data sets, we mined public sequence databases and identified exons that were strong candidates for exhibiting AS (the details of that analysis are provided elsewhere [4, 3]). Of the candidates, 3,126 potential AS events in 2,647 unique mouse genes were selected for the design of an Agilent custom oligonucleotide microarray. 
The arrays were hybridized with unamplified mRNA samples extracted from 10 wild-type mouse tissues (brain, heart, intestine, kidney, liver, lung, salivary gland, skeletal muscle, spleen, and testis). Each AS event has six target probes on the arrays, chosen from regions of the C1 exon, C2 exon, A exon, C1:A splice junction, A:C2 splice junction, and C1:C2 splice junction, as shown in Figure 2.\n\n\n2 Unsupervised discovery of alternative splicing\n\nWith the exception of the probe measuring the alternative exon, A (Figure 2), all probes measure sequences that occur in both isoforms. For example, while the probe for the junction C1:A is designed to measure the inclusion isoform, half of its sequence is also found in the exclusion isoform. We can therefore safely assume that the measured intensity at each probe is a result of a certain amount of both isoforms binding to the probe. Due to the generally assumed linear relationship between the abundance of mRNA hybridized at a probe and the fluorescent intensity measured, we model the measured intensity as a weighted sum of the overall abundance of the two isoforms.\n\nA stronger assumption is that of a single, consistent hybridization profile for both isoforms across all probes and all slides. Ideally, one would prefer to estimate an individual hybridization profile for each AS event studied across all slides. However, in our current setup, the number of tissues is small (10), resulting in two difficulties. First, the number of parameters is very large compared to the number of data points under this model, and second, a portion of the events do not exhibit tissue-specific alternative splicing within our small set of tissues. 
While the first hurdle could be accounted for using Bayesian parameter estimation, the second cannot.\n\n\n2.1 GenASAP - a generative model for the alternative splicing array platform\n\nUsing the setup described above, the expression vector $x$, containing the six microarray measurements as real numbers, can be decomposed as a linear combination of the abundances of the two splice isoforms, represented by the real vector $s$, with some added noise: $x = \\Lambda s + \\text{noise}$, where $\\Lambda$ is a $6 \\times 2$ weight matrix containing the hybridization profiles for the two isoforms across the six probes. Note that we may not have a negative amount of a given isoform, nor can the presence of an isoform deduct from the measured expression, and so both $s$ and $\\Lambda$ are constrained to be non-negative.\n\nFigure 3: Graphical model for alternative splicing. Each measurement in the observed expression profile, $x$, is generated by either using a scale factor, $r$, on a linear combination of the isoforms, $s$, or drawing randomly from an outlier model. For a detailed description of the model, see text.\n\nExpression levels measured by microarrays have previously been modelled as having expression-dependent noise [7]. To address this, we rewrite the above formulation as\n\n$$x = r(\\Lambda s + \\epsilon), \\qquad (1)$$\n\nwhere $r$ is a scale factor and $\\epsilon$ is a zero-mean normally distributed random variable with a diagonal covariance matrix, $\\Psi$, denoted as $p(\\epsilon) = N(\\epsilon; 0, \\Psi)$. The prior distribution for the abundance of the splice isoforms is given by a truncated normal distribution, denoted as $p(s) \\propto N(s; 0, I)[s \\geq 0]$, where $[\\cdot]$ is an indicator function such that $[s \\geq 0] = 1$ if $\\forall i, s_i \\geq 0$, and $[s \\geq 0] = 0$ otherwise.\n\nLastly, there is a need to account for aberrant observations (e.g. due to faulty probes, flakes of dust, etc.) 
with an outlier model. The complete GenASAP model (shown in Figure 3) accounts for the observations as the outcome of either applying equation (1) or an outlier model. To avoid degenerate cases and ensure meaningful and interpretable results, the number of faulty probes considered for each AS event may not exceed two, as indicated by the filled-in square constraint node in Figure 3.\n\nThe distribution of $x$ conditional on the latent variables, $s$, $r$, and $o$, is:\n\n$$p(x | s, r, o) = \\prod_i N(x_i; r\\Lambda_i s, r^2\\Psi_i)^{[o_i=0]} N(x_i; E_i, V_i)^{[o_i=1]}, \\qquad (2)$$\n\nwhere $\\Lambda_i$ is the $i$-th row of $\\Lambda$, $\\Psi_i$ is the $i$-th diagonal element of $\\Psi$, and $o_i \\in \\{0, 1\\}$ is a Bernoulli random variable indicating whether the measurement at probe $x_i$ is the result of the AS model or the outlier model, with prior $p(o_i = 1) = \\gamma_i$. The parameters of the outlier model, $E$ and $V$, are not optimized and are set to the mean and variance of the data.\n\n\n2.2 Variational learning in the GenASAP model\n\nTo infer the posterior distribution over the splice isoform abundances while at the same time learning the model parameters, we use a variational expectation-maximization (EM) algorithm. EM maximizes the log likelihood of the data by iteratively estimating the posterior distribution of the latent variables given the data in the expectation (E) step, and maximizing the expected complete-data log likelihood with respect to the parameters, while keeping the posterior fixed, in the maximization (M) step. Variational EM is used when, as in the case of GenASAP, the exact posterior is intractable. 
Variational EM minimizes the free energy of the model, defined as the KL divergence between the approximation to the posterior and the joint distribution of the latent and observed variables under the model parameters [5, 6].\n\nWe approximate the true posterior using the Q distribution given by\n\n$$Q(\\{s^{(t)}\\}, \\{o^{(t)}\\}, \\{r^{(t)}\\}) = \\prod_{t=1}^{T} Q(r^{(t)}) Q(o^{(t)} | r^{(t)}) \\prod_i Q(s_i^{(t)} | o^{(t)}, r^{(t)}) = \\prod_{t=1}^{T} (Z^{(t)})^{-1} \\rho^{(t)} \\omega^{(t)} N(s^{(t)}; \\mu_{ro}^{(t)d}, \\Sigma_{ro}^{(t)d}) [s^{(t)} \\geq 0], \\qquad (3)$$\n\nwhere $Z$ is a normalization constant, the superscript $d$ indicates that $\\Sigma$ is constrained to be diagonal, and there are $T$ iid AS events. For computational efficiency, $r$ is selected from a finite set, $r \\in \\{r_1, r_2, \\ldots, r_C\\}$, with uniform probability. The variational free energy is given by\n\n$$F(Q, P) = \\sum_r \\sum_o \\int_s Q(\\{s^{(t)}\\}, \\{o^{(t)}\\}, \\{r^{(t)}\\}) \\log \\frac{Q(\\{s^{(t)}\\}, \\{o^{(t)}\\}, \\{r^{(t)}\\})}{P(\\{s^{(t)}\\}, \\{o^{(t)}\\}, \\{r^{(t)}\\}, \\{x^{(t)}\\})}. \\qquad (4)$$\n\nVariational EM minimizes the free energy by iteratively updating the Q distribution's variational parameters ($\\rho^{(t)}$, $\\omega^{(t)}$, $\\mu_{ro}^{(t)d}$, and $\\Sigma_{ro}^{(t)d}$) in the E-step, and the model parameters ($\\Lambda$, $\\Psi$, $\\{r_1, r_2, \\ldots, r_C\\}$, and $\\gamma$) in the M-step. The resulting updates are too long to be shown in the context of this paper and are discussed in detail elsewhere [3]. A few particular points regarding the E-step are worth covering in detail here.\n\nIf the prior on $s$ were a full normal distribution, there would be no need for a variational approach, and exact EM would be possible. For a truncated normal distribution, however, the mixing proportions, $Q(r)Q(o|r)$, cannot be calculated analytically except for the case where $s$ is scalar, necessitating the diagonality constraint. 
Note that if $\\Sigma$ were allowed to be a full covariance matrix, equation (3) would be the true posterior, and we could find the sufficient statistics of $Q(s^{(t)} | o^{(t)}, r^{(t)})$:\n\n$$\\mu_{ro}^{(t)} = (I + \\Lambda^T (I - O^{(t)})^T \\Psi^{-1} (I - O^{(t)}) \\Lambda)^{-1} \\Lambda^T (I - O^{(t)})^T \\Psi^{-1} x^{(t)} (r^{(t)})^{-1} \\qquad (5)$$\n\n$$(\\Sigma_{ro}^{(t)})^{-1} = I + \\Lambda^T (I - O^{(t)})^T \\Psi^{-1} (I - O^{(t)}) \\Lambda \\qquad (6)$$\n\nwhere $O$ is a diagonal matrix with elements $O_{i,i} = o_i$. Furthermore, it can be easily shown that the optimal settings for $\\mu^d$ and $\\Sigma^d$ approximating a normal distribution with full covariance $\\Sigma$ and mean $\\mu$ are\n\n$$\\mu^d_{\\text{optimal}} = \\mu \\qquad (7)$$\n\n$$(\\Sigma^d_{\\text{optimal}})^{-1} = \\mathrm{diag}(\\Sigma^{-1}) \\qquad (8)$$\n\nIn the truncated case, equation (8) is still true. Equation (7) does not hold, though, and $\\mu^d_{\\text{optimal}}$ cannot be found analytically. In our experiments, we found that using equation (7) still decreases the free energy every E-step, and it is significantly more efficient than using, for example, a gradient descent method to compute the optimal $\\mu^d$.\n\nFigure 4: (a) An intuitive set of weights (panel title: Intuitive Weight Matrix). Based on the biological background, one would expect to see the inclusion isoform hybridize to the probes measuring C1, C2, A, C1:A, and A:C2, while the exclusion isoform hybridizes to C1, C2, and C1:C2. (b) The learned set of weights (panel title: Optimal Weight Matrix) closely agrees with the intuition, and captures cross-hybridization between the probes.\n\n[Figure 5 panels show, for each example, the original data, the AS model fit, and the contributions of the exclusion and inclusion isoforms. RT-PCR measurement vs. AS model prediction (% exclusion): (a) 14% vs. 27%; (b) 72% vs. 70%; (c) 8% vs. 22%.]\n\nFigure 5: Three examples of data cases and their predictions. 
(a) The data do not follow our notion of single cassette exon AS, but the AS level is predicted accurately by the model. (b) The probe C1:A is marked as an outlier, allowing the model to predict the other probes accurately. (c) Two probes are marked as outliers, and the model is still successful in predicting the AS levels.\n\n\n3 Making biological predictions about alternative splicing\n\nThe results presented in this paper were obtained using two stages of learning. In the first stage, the weight matrix, $\\Lambda$, is learned on a subset of the data that is selected for quality. Two selection criteria were used: (a) sequencing data was used to select those cases for which, with high confidence, no other AS event is present (Figure 1) and (b) probe sets were selected for high expression, as determined by a set of negative controls. The second selection criterion is motivated by the common assumption that low intensity measurements are of lesser quality (see Section 1.1). In the second stage, $\\Lambda$ is kept fixed, and we introduce the additional constraint that the noise is isotropic ($\\Psi = \\psi I$) and learn on the entire data set. The constraint on the noise is introduced to prevent the model from using only a subset of the six probes for making the final set of predictions.\n\nWe show a typical learned set of weights in Figure 4. The weights fit well with our intuition of what they should be to capture the presence of the two isoforms. Moreover, the learned weights account for the specific trends in the data. Examples of model predictions based on the microarray data are shown in Figure 5.\n\nDue to the nature of the microarray data, we do not expect all the inferred abundances to be equally good, and we devised a scoring criterion that ranks each AS event based on its fit to the model. 
Intuitively, given two input vectors that are equivalent up to a scale factor, with inferred MAP estimations that are equal up to the same scale factor, we would like their scores to be identical. The scoring criterion used, therefore, is $\\sum_k (x_k - r\\Lambda_k s)^2 / (x_k + r\\Lambda_k s)^2$, where the MAP estimations for $r$ and $s$ are used. This scoring criterion can be viewed as proportional to the sum of noise-to-signal ratios, as estimated using the two values given by the observation and the model's best prediction of that observation.\n\nRank     Pearson's correlation coefficient   False positive rate\n500      0.94                                0.11\n1000     0.95                                0.08\n2000     0.95                                0.05\n5000     0.79                                0.2\n10000    0.79                                0.25\n15000    0.78                                0.29\n20000    0.75                                0.32\n30000    0.65                                0.42\n\nTable 1: Model performance evaluated at various ranks. Using 180 RT-PCR measurements, we are able to predict the model's performance at various ranks. Two evaluation criteria are used: Pearson's correlation coefficient between the model's predictions and the RT-PCR measurements, and false positive rate, where a prediction is considered to be a false positive if it is more than 15% away from the RT-PCR measurement.\n\nSince it is the relative amount of the isoforms that is of most interest, we need to use the inferred distribution of the isoform abundances to obtain an estimate for the relative levels of AS. It is not immediately clear how this should be done. We do, however, have RT-PCR measurements for 180 AS events to guide us (see Figure 6 for details). 
Using the top 50 ranked RT-PCR measurements, we fit three parameters, $\\{a_1, a_2, a_3\\}$, such that the proportion of excluded isoform present, $p$, is given by $p = a_1 \\frac{s_2}{s_1 + a_2 s_2} + a_3$, where $s_1$ is the MAP estimation of the abundance of the inclusion isoform, $s_2$ is the MAP estimation of the abundance of the exclusion isoform, and the RT-PCR measurements are used as the target $p$. The parameters are fitted using gradient descent on a least squared error (LSE) evaluation criterion.\n\nWe used two criteria to evaluate the quality of the AS model predictions. Pearson's correlation coefficient (PCC) is used to evaluate the overall ability of the model to correctly estimate trends in the data. PCC is invariant to affine transformation and so is independent of the transformation parameters $a_1$ and $a_3$ discussed above, while the parameter $a_2$ was found to affect PCC very little. The PCC stays above 0.75 for the top two thirds of the ranked predictions. The second evaluation criterion used is the false positive rate, where a prediction is considered to be a false positive if it is more than 15% away from the RT-PCR measurement. This allows us to say, for example, that if a prediction is within the top 10000, we are 75% confident that it is within 15% of the actual levels of AS.\n\n\n4 Summary\n\nWe designed a novel AS model for the inference of the relative abundance of two alternatively spliced isoforms from six measurements. Unsupervised learning in the model is performed using a structured variational EM algorithm, which correctly captures the underlying structure of the data, as suggested by its biological nature. 
The AS model, though presented here for cassette exon AS events, can be used to learn any type of AS, and with a simple adjustment, multiple types.\n\nThe predictions obtained from the AS model are currently being used to verify various claims about the role of AS in evolution and functional genomics, and to help identify sequences that affect the regulation of AS.\n\n[Figure 6(a): gel lanes for nine tissues (intestine, testis, kidney, salivary gland, brain, spleen, liver, muscle, lung). % exclusion isoform per lane, RT-PCR measurements: 14 22 27 32 47 46 66 78 63; AS model predictions: 27 24 26 26 51 75 60 85 100. Figure 6(b): scatter plot titled RT-PCR measurement vs. AS model predictions, with RT-PCR measurement on the horizontal axis and AS model prediction on the vertical axis, both from 0 to 100 (% exclusion isoform).]\n\nFigure 6: (a) Sample RT-PCR. RNA extracted from the cell is reverse-transcribed to DNA, amplified, and labelled with radioactive or fluorescent molecules. The sample is pulled through a viscous gel in an electric field (DNA, being an acid, is negatively charged). Shorter strands travel further through the gel than longer ones, resulting in two distinct bands, corresponding to the two isoforms, when exposed to a photosensitive or x-ray film. (b) A scatter plot showing the RT-PCR measurements as compared to the AS model predictions. The plot shows all available RT-PCR measurements with a rank of 8000 or better.\n\nThe AS model presented assumes a single weight matrix for all data cases. This is an oversimplified view of the data, and current work is being carried out on identifying probe-specific expression profiles. However, due to the low dimensionality of the problem (10 tissues, six probes per event), care must be taken to avoid overfitting and to ensure meaningful interpretations.\n\n\nAcknowledgments\n\nWe would like to thank Wen Zhang, Naveed Mohammad, and Timothy Hughes for their contributions in generating the data set. 
This work was funded in part by operating and infrastructure grants from the CIHR and CFI, operating grants from NSERC, and a Premier's Research Excellence Award.\n\n\nReferences\n\n[1] J. M. Johnson et al. Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science, 302:2141-44, 2003.\n\n[2] L. Cartegni et al. Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nature Gen. Rev., 3:285-98, 2002.\n\n[3] Q. Pan et al. Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform. Molecular Cell, 16(6):929-41, 2004.\n\n[4] Q. Pan et al. Alternative splicing of conserved exons is frequently species specific in human and mouse. Trends Gen., In Press, 2005.\n\n[5] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.\n\n[6] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models. Cambridge, MIT Press, 1998.\n\n[7] D. M. Rocke and B. Durbin. A model for measurement error for gene expression arrays. Journal of Computational Biology, 8(6):557-69, 2001.\n", "award": [], "sourceid": 2678, "authors": [{"given_name": "Ofer", "family_name": "Shai", "institution": null}, {"given_name": "Brendan", "family_name": "Frey", "institution": null}, {"given_name": "Quaid", "family_name": "Morris", "institution": null}, {"given_name": "Qun", "family_name": "Pan", "institution": null}, {"given_name": "Christine", "family_name": "Misquitta", "institution": null}, {"given_name": "Benjamin", "family_name": "Blencowe", "institution": null}]}