{"title": "Gradients of Generative Models for Improved Discriminative Analysis of Tandem Mass Spectra", "book": "Advances in Neural Information Processing Systems", "page_first": 5724, "page_last": 5733, "abstract": "Tandem mass spectrometry (MS/MS) is a high-throughput technology used to identify the proteins in a complex biological sample, such as a drop of blood. A collection of spectra is generated at the output of the process, each spectrum of which is representative of a peptide (protein subsequence) present in the original complex sample. In this work, we leverage the log-likelihood gradients of generative models to improve the identification of such spectra. In particular, we show that the gradient of a recently proposed dynamic Bayesian network (DBN) may be naturally employed by a kernel-based discriminative classifier. The resulting Fisher kernel substantially improves upon recent attempts to combine generative and discriminative models for post-processing analysis, outperforming all other methods on the evaluated datasets. We extend the improved accuracy offered by the Fisher kernel framework to other search algorithms by introducing Theseus, a DBN representing a large number of widely used MS/MS scoring functions. Furthermore, with gradient ascent and max-product inference at hand, we use Theseus to learn model parameters without any supervision.", "full_text": "Gradients of Generative Models for Improved\n\nDiscriminative Analysis of Tandem Mass Spectra\n\nJohn T. Halloran\n\nDepartment of Public Health Sciences\n\nUniversity of California, Davis\njthalloran@ucdavis.edu\n\nDavid M. Rocke\n\nDepartment of Public Health Sciences\n\nUniversity of California, Davis\ndmrocke@ucdavis.edu\n\nAbstract\n\nTandem mass spectrometry (MS/MS) is a high-throughput technology used to\nidentify the proteins in a complex biological sample, such as a drop of blood. 
A collection of spectra is generated at the output of the process, each spectrum of which is representative of a peptide (protein subsequence) present in the original complex sample. In this work, we leverage the log-likelihood gradients of generative models to improve the identification of such spectra. In particular, we show that the gradient of a recently proposed dynamic Bayesian network (DBN) [7] may be naturally employed by a kernel-based discriminative classifier. The resulting Fisher kernel substantially improves upon recent attempts to combine generative and discriminative models for post-processing analysis, outperforming all other methods on the evaluated datasets. We extend the improved accuracy offered by the Fisher kernel framework to other search algorithms by introducing Theseus, a DBN representing a large number of widely used MS/MS scoring functions. Furthermore, with gradient ascent and max-product inference at hand, we use Theseus to learn model parameters without any supervision.

1 Introduction

In the past two decades, tandem mass spectrometry (MS/MS) has become an indispensable tool for identifying the proteins present in a complex biological sample. At the output of a typical MS/MS experiment, a collection of spectra is produced, on the order of tens to hundreds of thousands, each of which is representative of a protein subsequence, called a peptide, present in the original complex sample. The main challenge in MS/MS is accurately identifying the peptides responsible for generating each output spectrum.

The most accurate identification methods search a database of peptides: each candidate peptide is scored against the observed spectrum, the candidates are ranked, and the top-ranking peptide is returned. The pair consisting of a scored candidate peptide and an observed spectrum is typically referred to as a peptide-spectrum match (PSM). 
However, PSM scores returned by such database-search methods are often difficult to compare across different spectra (i.e., they are poorly calibrated), limiting the number of spectra identified per search [15]. To combat such poor calibration, post-processors are typically used to recalibrate PSM scores [13, 19, 20].

Recent work has attempted to exploit generative scoring functions for use with discriminative classifiers to better recalibrate PSM scores: features heuristically derived by parsing a DBN's Viterbi path (i.e., its most probable sequence of random variables) were shown to improve discriminative recalibration using support vector machines (SVMs). Rather than relying on heuristics, we look towards the more principled approach of a Fisher kernel [11]. Fisher kernels allow one to exploit the sequential-modeling strengths of generative models such as DBNs, which offer vast design flexibility for representing data sequences of varying length, for use with discriminative classifiers such as SVMs, which offer superior accuracy but often require feature vectors of fixed length.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: Example tandem mass spectrum with precursor charge c(s) = 2 and generating peptide x = LWEPLLDVLVQTK. Plotted in red and blue are, respectively, b- and y-ion peaks (discussed in Section 2.1.1), while spurious observed peaks (called insertions) are colored gray. Note y1, b1, b4, and b12 are absent fragment ions (called deletions).

Although the number of variables in a DBN may vary given different observed sequences, a Fisher kernel utilizes the fixed-length gradient of the log-likelihood (i.e., the Fisher score) in the feature-space of a kernel-based classifier. 
Deriving the Fisher scores of a DBN for Rapid Identification of Peptides (DRIP) [7], we show that the DRIP Fisher kernel greatly improves upon the previous heuristic approach; at a strict FDR of 1% for the presented datasets, the heuristically derived DRIP features improve accuracy over the base feature set by an average of 6.1%, while the DRIP Fisher kernel raises this average improvement to 11.7% (Table 2 in [9]), thus nearly doubling the total accuracy gain of DRIP post-processing.

Motivated by the improvements offered by the DRIP Fisher kernel, we extend this approach to other models by defining a generative model representative of a large class of existing scoring functions [2, 5, 6, 16, 10, 22, 17]. In particular, we define a DBN (called Theseus1) which, given an observed spectrum, evaluates the universe of all possible PSM scores. In this work, we use Theseus to model PSM score distributions with respect to the widely used XCorr scoring function [5]. The resulting Fisher kernel once again improves discriminative post-processing accuracy. Furthermore, with the generative model in place, we explore inferring the parameters of the modeled scoring function using max-product inference and gradient-based learning. The resulting coordinate ascent learning algorithm outperforms standard maximum-likelihood learning. Most importantly, this learning algorithm is unsupervised; to the authors' knowledge, it is the first MS/MS scoring function parameter estimation procedure that does not rely on any supervision. We note that this training procedure may be adopted by the many MS/MS search algorithms whose scoring functions lie in the class modeled by Theseus.

The paper is organized as follows. We discuss background information in Section 2, including the process by which MS/MS spectra are produced, the means by which spectra are identified, and related previous work. 
In Section 3, we discuss the log-likelihood of the DRIP model in detail and derive its Fisher scores. In Section 4, we introduce Theseus and derive the gradients of its log-likelihood. We then discuss gradient-based unsupervised learning of Theseus parameters and present an efficient, monotonically convergent coordinate ascent algorithm. Finally, in Section 5, we show that the DRIP and Theseus Fisher kernels substantially improve spectrum identification accuracy, and that Theseus' coordinate ascent learning algorithm provides accurate unsupervised parameter estimation. For convenience, a table of the notation used in this paper may be found in [9].

2 Background

A typical tandem mass spectrometry experiment begins by cleaving proteins into peptides using a digesting enzyme. The resulting peptides are then separated via liquid chromatography and subjected to two rounds of mass spectrometry. The first round measures the mass and charge of the intact peptide, called the precursor mass and precursor charge, respectively. Peptides are then fragmented, and the fragments undergo a second round of mass spectrometry, the output of which is an observed spectrum indicative of the fragmented peptide. The x-axis of this observed spectrum denotes mass-to-charge (m/z), measured in thomsons (Th), and the y-axis is a unitless intensity measure, roughly proportional to the abundance of a single fragment ion with a given m/z value. A sample observed spectrum is illustrated in Figure 1.

1In honor of Shannon's magnetic mouse, which could learn to traverse a small maze.

2.1 MS/MS Database Search

Let s be an observed spectrum with precursor mass m(s) and precursor charge c(s). In order to identify s, we search a database of peptides, as follows. 
Let P be the set of all possible peptide sequences. Each peptide x ∈ P is a string x = x_1 x_2 ... x_n comprised of characters called amino acids. Given a peptide database D ⊆ P, we wish to find the peptide x ∈ D responsible for generating s. Using the precursor mass and charge, the set of peptides to be scored is constrained by setting a mass tolerance threshold, w, so that we score the set of candidate peptides

\[ D(s, \mathcal{D}, w) = \left\{ x : x \in \mathcal{D},\ \left| \frac{m(x)}{c(s)} - m(s) \right| \le w \right\}, \]

where m(x) denotes the mass of peptide x. Note that we've overloaded m(·) to return either a peptide's or an observed spectrum's precursor mass; we similarly overload c(·). Given s and denoting an arbitrary scoring function as ψ(x, s), the output of a search algorithm is thus x* = argmax_{x ∈ D(s, D, w)} ψ(x, s), the top-scoring PSM.

2.1.1 Theoretical Spectra

In order to score a candidate peptide x, fragment ions corresponding to prefix masses (called b-ions) and suffix masses (called y-ions) are collected into a theoretical spectrum. The annotated b- and y-ions of the generating peptide for an observed spectrum are illustrated in Figure 1. Varying based on the value of c(s), the kth respective b- and y-ions of x are

\[ b(x, c_b, k) = \frac{\sum_{i=1}^{k} m(x_i) + c_b}{c_b}, \qquad y(x, c_y, k) = \frac{\sum_{i=n-k+1}^{n} m(x_i) + 18 + c_y}{c_y}, \]

where c_b is the charge of the b-ion and c_y is the charge of the y-ion. For c(s) = 1, we have c_b = c_y = 1, since these are the only possible, detectable fragment ions. For higher observed charge states c(s) ≥ 2, it is unlikely for a single fragment ion to consume the entire charge, so that we have c_b + c_y = c(s), where c_b, c_y ∈ [1, c(s) − 1]. 
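To make the fragment-ion formulas concrete, here is a minimal sketch (the function names are ours, and the residue masses below are arbitrary placeholders rather than real amino acid masses; the 18 is the water-mass offset used above):

```python
def b_ion_mz(residue_masses, c_b, k):
    # b(x, c_b, k): (sum of the first k residue masses + c_b) / c_b
    return (sum(residue_masses[:k]) + c_b) / c_b

def y_ion_mz(residue_masses, c_y, k):
    # y(x, c_y, k): (sum of the last k residue masses + 18 + c_y) / c_y
    return (sum(residue_masses[-k:]) + 18 + c_y) / c_y

# Toy peptide of four residues with placeholder masses.
masses = [100.0, 110.0, 120.0, 130.0]
print(b_ion_mz(masses, 1, 2))  # (100 + 110 + 1) / 1 = 211.0
print(y_ion_mz(masses, 1, 2))  # (120 + 130 + 18 + 1) / 1 = 269.0
```

For a charge-2 precursor, pairing c_b = 1 with c_y = 1 recovers the singly-charged fragment series illustrated in Figure 1.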
The b-ion offset corresponds to the mass of a c_b-charged hydrogen atom, while the y-ion offset corresponds to the mass of a water molecule plus a c_y-charged hydrogen atom.

Further fragment ions may occur, each corresponding to the loss of a molecular group from a b- or y-ion. Called neutral losses, these correspond to the loss of either water, ammonia, or carbon monoxide. These fragment ions are commonly collected into a vector v, whose elements are weighted based on their corresponding fragment ion. For instance, XCorr [5] assigns all b- and y-ions a weight of 50 and all neutral losses a weight of 10.

2.2 Previous Work

Many scoring functions have been proposed for use in search algorithms. They range from simple dot-product scoring functions (X!Tandem [2], Morpheus [22]), to cross-correlation-based scoring functions (XCorr [5]), to exact p-values over linear scoring functions calculated using dynamic programming (MS-GF+ [16] and XCorr p-values [10]). The recently introduced DRIP [7] scores candidate peptides without quantizing m/z measurements and allows learning the expected locations of theoretical peaks given high-quality, labeled training data. In order to avoid quantization of the m/z axis, DRIP employs a dynamic alignment strategy wherein two types of prevalent phenomena are explicitly modeled: spurious observed peaks, called insertions, and absent theoretical peaks, called deletions (examples of both are displayed in Figure 1). DRIP then uses max-product inference to calculate the most probable sequences of insertions and deletions to score candidate peptides, and was shown to achieve state-of-the-art performance on a variety of datasets.

In practice, scoring functions are often poorly calibrated (i.e., PSM scores from different spectra are difficult to compare to one another), leaving potentially identifiable spectra unidentified during statistical analysis. 
In order to properly recalibrate such PSM scores, several semi-supervised post-processing methods have been proposed [13, 19, 20]. The most popular such method is Percolator [13], which, given the output target and decoy PSMs (discussed in Section 5) of a scoring algorithm and features detailing each PSM, utilizes an SVM to learn a discriminative classifier between target and decoy PSMs. PSM scores are then recalibrated using the learned decision boundary.

Recent work has attempted to leverage the generative nature of the DRIP model for discriminative use by Percolator [8]. As mentioned earlier, the output of DRIP is the most probable sequence of insertions and deletions, i.e., the Viterbi path. However, DRIP's observations are the sequences of observed spectrum m/z and intensity values, so the lengths of PSMs' Viterbi paths vary with the number of observed spectrum peaks. In order to exploit DRIP's output in the feature-space of a discriminative classifier, PSM Viterbi paths were heuristically mapped to a fixed-length vector of features. The resulting heuristic features were shown to dramatically improve Percolator's ability to discriminate between PSMs.

2.3 Fisher Kernels

Using generative models to extract features for discriminative classifiers has been used to great effect in many problem domains via Fisher kernels [11, 12, 4]. Assume a generative model with a set of parameters θ and likelihood p(O|θ) = Σ_H p(O, H|θ), where O is a sequence of observations and H is the set of hidden variables; the Fisher score is then U_O = ∇_θ log p(O|θ). Given observations O_i and O_j of differing length (and, thus, different underlying graphs in the case of dynamic graphical models), a kernel-based classifier over these instances is trained using U_{O_i} and U_{O_j} in the feature-space. 
Thus, a similarity measure is learned in the gradient space, under the intuition that objects which induce similar likelihoods will induce similar gradients.

3 DRIP Fisher Scores

Figure 2: Graph of DRIP, the frames (i.e., time instances) of which correspond to observed spectrum peaks. Shaded nodes represent observed variables and unshaded nodes represent hidden variables. Given an observed spectrum, the middle frame (the chunk) dynamically expands to represent the second observed peak through the penultimate observed peak.

We first define, in detail, DRIP's log-likelihood, followed by the Fisher score derivation for DRIP's learned parameters. For discussion of the DRIP model outside the scope of this work, readers are directed to [7, 8]. Denoting an observed peak as a pair (O^{mz}, O^{in}) consisting of an m/z measurement and an intensity measurement, respectively, let s = ((O^{mz}_1, O^{in}_1), (O^{mz}_2, O^{in}_2), ..., (O^{mz}_T, O^{in}_T)) be an MS/MS spectrum of T peaks and let x be a candidate peptide (which, given s, we'd like to score). We denote the theoretical spectrum of x, consisting of its unique b- and y-ions sorted in ascending order, as the length-d vector v. The graph of DRIP is displayed in Figure 2, where variables which control the traversal of the theoretical spectrum are highlighted in blue and variables which control the scoring of observed peak measurements are highlighted in red. Groups of variables are collected into time instances called frames. The frames of DRIP correspond to the observed peak m/z and intensity observations, so that there are T frames in the model.

Unless otherwise specified, let t be an arbitrary frame, 1 ≤ t ≤ T. δ_t is a multinomial random variable which dictates the number of theoretical peaks traversed in a frame. 
The random variable K_t, which denotes the index of the current theoretical peak, is a deterministic function of its parents, such that p(K_t = K_{t−1} + δ_t | K_{t−1}, δ_t) = 1. Thus, δ_t > 1 corresponds to the deletion of δ_t − 1 theoretical peaks. The parents of δ_t ensure that DRIP does not attempt to increment past the last theoretical peak, i.e., p(δ_t > d − K_{t−1} | d, K_{t−1}, i_{t−1}) = 0. Subsequently, the theoretical peak value v(K_t) is used to access a Gaussian from a collection (the mean of each Gaussian corresponds to a position along the m/z axis, learned using the EM algorithm [3]) with which to score observations. Hence, the state-space of the model is all possible traversals, from left to right, of the theoretical spectrum, accounting for all possible deletions.

When scoring observed peak measurements, the Bernoulli random variable i_t denotes whether a peak is scored using the learned Gaussians (when i_t = 0) or considered an insertion and scored using an insertion penalty (when i_t = 1). When scoring m/z observations, we thus have p(O^{mz}_t | v(K_t), i_t = 0) = f(O^{mz}_t | μ^{mz}(v(K_t)), σ²) and p(O^{mz}_t | v(K_t), i_t = 1) = a^{mz}, where μ^{mz} is a vector of Gaussian means and σ² the m/z Gaussian variance. Similarly, when scoring intensity observations, we have p(O^{in}_t | i_t = 0) = f(O^{in}_t | μ^{in}, σ̄²) and p(O^{in}_t | i_t = 1) = a^{in}, where μ^{in} and σ̄² are the intensity Gaussian mean and variance, respectively. Let i_0 = K_0 = ∅ and let 1{·} denote the indicator function. 
Denoting DRIP's Gaussian parameters as θ, the likelihood is thus

\[
\begin{aligned}
p(s|x, \theta) &= \prod_{t=1}^{T} p(\delta_t | K_{t-1}, d, i_{t-1})\, 1\{K_t = K_{t-1} + \delta_t\}\, p(O^{mz}_t | K_t)\, p(O^{in}_t) \\
&= \prod_{t=1}^{T} p(\delta_t | K_{t-1}, d, i_{t-1})\, 1\{K_t = K_{t-1} + \delta_t\} \Big( \sum_{i_t=0}^{1} p(i_t)\, p(O^{mz}_t | K_t, i_t) \Big) \Big( \sum_{i_t=0}^{1} p(i_t)\, p(O^{in}_t | i_t) \Big) \\
&= \prod_{t=1}^{T} \phi(\delta_t, K_{t-1}, i_t, i_{t-1}).
\end{aligned}
\]

The only stochastic variables in the model are i_t and δ_t; all other random variables are either observed or deterministic given the sequences i_{1:T} and δ_{1:T}. Thus, we may equivalently write p(s|x, θ) = p(i_{1:T}, δ_{1:T}|θ). The Fisher score of the kth m/z mean is thus \( \frac{\partial}{\partial \mu^{mz}(k)} \log p(s|x,\theta) = \frac{1}{p(s|x,\theta)} \frac{\partial}{\partial \mu^{mz}(k)} p(s|x,\theta) \), and we have (please see [9] for the full derivation)

\[
\begin{aligned}
\frac{\partial}{\partial \mu^{mz}(k)} p(s|x,\theta) &= \frac{\partial}{\partial \mu^{mz}(k)} \sum_{i_{1:T}, \delta_{1:T}} p(i_{1:T}, \delta_{1:T}|\theta) = \sum_{i_{1:T}, \delta_{1:T} : K_t = k,\, 1 \le t \le T} \frac{\partial}{\partial \mu^{mz}(k)} p(i_{1:T}, \delta_{1:T}|\theta) \\
&= \sum_{i_{1:T}, \delta_{1:T}} 1\{K_t = k\}\, p(s|x,\theta) \Big( \prod_{t : K_t = k} \frac{1}{p(O^{mz}_t | K_t)} \Big) \Big( \frac{\partial}{\partial \mu^{mz}(k)} \prod_{t : K_t = k} p(O^{mz}_t | K_t) \Big) \\
\Rightarrow \frac{\partial}{\partial \mu^{mz}(k)} \log p(s|x,\theta) &= \sum_{t=1}^{T} p(i_t, K_t = k | s, \theta)\, p(i_t = 0 | K_t, O^{mz}_t)\, \frac{O^{mz}_t - \mu^{mz}(k)}{\sigma^2}. \qquad (1)
\end{aligned}
\]

Note that the posterior in Equation 1, and thus the Fisher score, may be efficiently computed using sum-product inference. 
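As an illustration of Equation 1, the following sketch evaluates the m/z-mean Fisher scores given per-frame posterior weights, which are simply assumed here as precomputed inputs (in DRIP they would come from sum-product inference, and the two posterior factors in Equation 1 are folded into a single weight); all names and array shapes are illustrative:

```python
import numpy as np

def mz_mean_fisher_scores(obs_mz, post, mu_mz, sigma2):
    """Evaluate Eq. (1) for every m/z Gaussian mean mu_mz[k].

    post[t, k] is the combined posterior weight
    p(i_t, K_t = k | s, theta) * p(i_t = 0 | K_t, O^mz_t),
    assumed precomputed (e.g., by sum-product inference).
    """
    obs_mz = np.asarray(obs_mz)   # shape (T,)
    post = np.asarray(post)       # shape (T, d)
    mu_mz = np.asarray(mu_mz)     # shape (d,)
    # Sum over frames t of post[t, k] * (O^mz_t - mu^mz(k)) / sigma^2.
    return (post * (obs_mz[:, None] - mu_mz[None, :])).sum(axis=0) / sigma2

# Two observed peaks, two theoretical peaks, hard (0/1) posteriors.
scores = mz_mean_fisher_scores([101.0, 198.0],
                               [[1.0, 0.0], [0.0, 1.0]],
                               [100.0, 200.0], 1.0)
print(scores)  # [ 1. -2.]
```

As expected for a Gaussian mean gradient, the score vanishes when each mean sits exactly on its posterior-weighted observations.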
Through similar steps, we have

\[
\begin{aligned}
\frac{\partial}{\partial \sigma^2(k)} \log p(s|x,\theta) &= \sum_{t} p(i_t, K_t = k | s, \theta)\, p(i_t = 0 | K_t, O^{mz}_t) \Big( \frac{(O^{mz}_t - \mu^{mz}(k))^2}{2\sigma^4} - \frac{1}{2\sigma^2} \Big) \qquad (2) \\
\frac{\partial}{\partial \mu^{in}} \log p(s|x,\theta) &= \sum_{t} p(i_t, K_t | s, \theta)\, p(i_t = 0 | O^{in}_t)\, \frac{O^{in}_t - \mu^{in}}{\bar{\sigma}^2} \qquad (3) \\
\frac{\partial}{\partial \bar{\sigma}^2} \log p(s|x,\theta) &= \sum_{t} p(i_t, K_t | s, \theta)\, p(i_t = 0 | O^{in}_t) \Big( \frac{(O^{in}_t - \mu^{in})^2}{2\bar{\sigma}^4} - \frac{1}{2\bar{\sigma}^2} \Big), \qquad (4)
\end{aligned}
\]

where ∂/∂σ²(k) denotes the partial derivative with respect to the variance of the kth m/z Gaussian with mean μ^{mz}(k).

Let U_μ = ∇_{μ^{mz}} log p(s, x|θ) and U_{σ²} = ∇_{σ²} log p(s, x|θ). U_μ and U_{σ²} are length-d vectors corresponding to the mapping of a peptide's sequence of b- and y-ions into r-dimensional space (i.e., dimension equal to an m/z-discretized observed spectrum). Let 1 be the length-r vector of ones. Defining z^{mz}, z^{in} ∈ R^r, the elements of which are the quantized observed spectrum m/z and intensity values, respectively, we use the following DRIP gradient-based features for SVM training in Section 5: |U_μ|_1, |U_{σ²}|_1, U_μ^T z^{mz}, U_{σ²}^T z^{in}, U_μ^T 1, U_{σ²}^T 1, ∂/∂μ^{in} log p(s, x|θ), and ∂/∂σ̄² log p(s, x|θ).

4 Theseus

Given an observed spectrum s, we focus on representing the universe of linear PSM scores using a DBN. Let z denote the vector resulting from preprocessing the observed spectrum, s. As a modeling example, we look to represent the popular XCorr scoring function. 
Using subscript τ to denote a vector whose elements are shifted τ units, XCorr's scoring function is defined as

\[ \mathrm{XCorr}(s, x) = v^T z - \sum_{\tau=-75}^{75} v^T z_\tau = v^T \Big( z - \sum_{\tau=-75}^{75} z_\tau \Big) = v^T z', \]

where \( z' = z - \sum_{\tau=-75}^{75} z_\tau \). Let θ ∈ R^l be a vector of XCorr weights for the various types of possible fragment ions (described in Section 2.1.1). As described in [10], given c(s), we reparameterize z' into a vector z_θ such that XCorr(x, s) is rendered as a dot product between z_θ and a boolean vector u in the reparameterized space. This reparameterization readily applies to any linear MS/MS scoring function. The ith element of z_θ is \( z_\theta(i) = \sum_{j=1}^{l} \theta(j) z_j(i) \), where z_j is a vector whose element z_j(i) is the sum of all higher-charged fragment ions added into the singly-charged fragment ions for the jth fragment-ion type. The nonzero elements of u correspond to the singly-charged b-ions of x, and we have \( u^T z_\theta = \sum_{i=1}^{n} z_\theta(m(x_{1:i}) + 1) = \sum_{i=1}^{n} \sum_{j=1}^{l} \theta(j) z_j(m(x_{1:i}) + 1) = v^T z' = \mathrm{XCorr}(s, x) \).

Figure 3: Graph of Theseus. Shaded nodes are observed random variables and unshaded nodes are hidden (i.e., stochastic). The model is unrolled for n + 1 frames, including B_0 in frame zero. Plate notation denotes M repetitions of the model, where M is the number of discrete precursor masses allowed by the precursor-mass tolerance threshold, w.

Our generative model is illustrated in Figure 3. n is the maximum possible peptide length and m is one of M discrete precursor masses (dictated by the precursor-mass tolerance threshold, w, and m(s)). A hypothesis is an instantiation of the random variables across all frames in the model, i.e., for the set of all possible sequences of X_i random variables, X_{1:n} = X_1, X_2, . . . 
, X_n, a hypothesis is x_{1:n} ∈ X_{1:n}. In our case, each hypothesis corresponds to a peptide, and the corresponding log-likelihood to its XCorr score. Each frame after the first contains an amino acid random variable, so that we accumulate b-ions in successive frames and access the score contribution of each such ion.

For frame i, X_i is a random amino acid and B_i the accumulated mass up to the current frame. B_0 and B_n are observed to zero and m, respectively, enforcing the boundary conditions that all length-n PSMs considered begin with mass zero and end at a particular precursor mass. For i > 0, B_i is a deterministic function of its parents, p(B_i | B_{i−1}, X_i) = p(B_i = B_{i−1} + m(X_i)) = 1. Thus, hypotheses which do not respect these mass constraints receive probability zero, i.e., p(B_n ≠ m | B_{n−1}, X_n) = 0. m is observed to the value of the current precursor mass being considered.

Let A be the set of amino acids, where |A| = 20. Given B_i and m, the conditional distribution of X_i changes such that p(X_i ∈ A | B_{i−1} < m) = α U{A} and p(X_i = κ | B_{i−1} ≥ m) = 1, where U{·} is the uniform distribution over the input set and κ ∉ A, m(κ) = 0. Thus, when the accumulated mass is less than m, X_i is a random amino acid; otherwise, X_i deterministically takes on a value with zero mass. To recreate XCorr scores, α = 1/|A|, though, in general, any desired mass function may be used for p(X_i ∈ A | B_{i−1} < m).

S_i is a virtual evidence child [18], i.e., a leaf node whose conditional distribution need not be normalized to compute probabilistic quantities of interest in the DBN. For our model, we have \( p(S_i | B_i < m, \theta) = \exp(z_\theta(B_i)) = \exp\big(\sum_{j=1}^{|\theta|} \theta_j z_j(B_i)\big) \) and p(S_i | B_i ≥ m, θ) = 1. Let t′ denote the first frame in which the accumulated mass reaches m, i.e., m(X_{1:t′}) ≥ m. 
The log-likelihood is then

\[
\begin{aligned}
\log p(s, X_{1:n}|\theta) &= \log p(X_{1:n}, B_{0:n}, S_{1:n-1}) \\
&= \log \Big( 1\{B_0 = 0\} \Big( \prod_{i=1}^{n-1} p(X_i | m, B_{i-1})\, p(B_i = B_{i-1} + m(X_i))\, p(S_i | m, B_i, \theta) \Big) 1\{B_{n-1} + m(X_n) = m\} \Big) \\
&= \log 1\{B_0 = 0 \,\wedge\, m(X_{1:n}) = m\} + \log \Big( \prod_{i=1}^{t'} p(X_i | m, B_{i-1})\, p(B_i = B_{i-1} + m(X_i))\, p(S_i | m, B_i, \theta) \Big) \\
&\quad + \log \Big( \prod_{i=t'+1}^{n} p(X_i | m, B_{i-1})\, p(B_i = B_{i-1} + m(X_i))\, p(S_i | m, B_i, \theta) \Big) \\
&= \log 1\{B_0 = 0 \,\wedge\, m(X_{1:n}) = m\} + \log \Big( \prod_{i=1}^{t'} \exp(z_\theta(B_i)) \Big) + \log 1 \\
&= \log 1\{B_0 = 0 \,\wedge\, m(X_{1:n}) = m\} + \sum_{i=1}^{t'} z_\theta(B_i) = \log 1\{B_0 = 0 \,\wedge\, m(X_{1:n}) = m\} + \mathrm{XCorr}(X_{1:t'}, s).
\end{aligned}
\]

The ith element of Theseus' Fisher score is thus

\[
\begin{aligned}
\frac{\partial}{\partial \theta(i)} \log p(s|\theta) &= \frac{\partial}{\partial \theta(i)} \log \sum_{x_{1:n}} p(s, x_{1:n}|\theta) = \frac{1}{p(s|\theta)} \sum_{x_{1:n}} \frac{\partial}{\partial \theta(i)} p(s, x_{1:n}|\theta) \\
&= \frac{1}{p(s|\theta)} \sum_{x_{1:n}} 1\{b_0 = 0 \,\wedge\, m(x_{1:n}) = m\} \Big( \sum_{j=1}^{t'} z_i(b_j) \Big) \prod_{j=1}^{t'} \exp(z_\theta(b_j)). \qquad (5)
\end{aligned}
\]

While Equation 5 is generally difficult to compute, we calculate it efficiently using sum-product inference. Note that when the peptide sequence is observed, i.e., X_{1:n} = x̂, we have \( \frac{\partial}{\partial \theta(i)} \log p(s, \hat{x}|\theta) = \sum_{j} z_i(m(\hat{x}_{1:j})) \).

Using the model's Fisher scores, Theseus' parameters θ may be learned via maximum likelihood estimation. Given a dataset of spectra s_1, s_2, ..., s_n, we present an alternate learning algorithm (Algorithm 1) which converges monotonically to a local optimum (proven in [9]). Within each iteration, Algorithm 1 uses max-product inference to efficiently infer the most probable PSMs, mitigating the need for training labels. 
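For the observed-peptide case just noted, the Fisher score reduces to summing each fragment-ion type's reparameterized spectrum entries at the peptide's discretized prefix masses; a minimal sketch under assumed array shapes (all names, and the toy values, are illustrative):

```python
import numpy as np

def theseus_observed_fisher_score(z, prefix_bins):
    """d/d theta(i) log p(s, x_hat | theta) = sum_j z_i(m(x_hat_{1:j})).

    z: (l, r) array; row i is the reparameterized spectrum for fragment-ion type i.
    prefix_bins: discretized prefix masses m(x_hat_{1:j}), used as column indices.
    """
    z = np.asarray(z)
    return z[:, prefix_bins].sum(axis=1)

# Two ion types over a toy 4-bin spectrum; prefix masses fall in bins 1 and 3.
z = np.array([[0.0, 2.0, 0.0, 1.0],
              [1.0, 0.0, 3.0, 0.0]])
print(theseus_observed_fisher_score(z, [1, 3]))  # [3. 0.]
```

The resulting length-l vector is exactly the per-ion-type gradient information used as PSM features in Section 5.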
θ is maximized in each iteration using gradient ascent.

Algorithm 1 Theseus Unsupervised Learning Algorithm
1: while not converged do
2:   for i = 1, . . . , n do
3:     x̂_i ← argmax_{x_i ∈ P} log p(s_i, x_i | θ)
4:   end for
5:   θ ← argmax_θ Σ_{i=1}^{n} log p(s_i, x̂_i | θ)
6: end while

5 Results

Measuring peptide identification performance is complicated by the fact that ground truth is unavailable for real-world data. Thus, in practice, it is most common to estimate the false discovery rate (FDR) [1] by searching a decoy database of peptides which are unlikely to occur in nature, typically generated by shuffling entries in the target database [14]. For a particular score threshold, t, the FDR is then calculated as the ratio of the number of decoys scoring better than t to the number of targets scoring better than t. Once the target and decoy PSMs are calculated, a curve displaying the FDR threshold versus the number of correctly identified targets at each given threshold may be calculated. In place of FDR along the x-axis, we use the q-value [14], defined to be the minimum FDR threshold at which a given score is deemed significant. As many applications require that a search algorithm perform well at low thresholds, we only plot q ∈ [0, 0.1].

The same datasets and search settings used to evaluate DRIP's heuristically derived features in [8] are adopted in this work. MS-GF+ (one of the most accurate search algorithms in wide use, plotted for reference) was run using version 9980, with PSMs ranked by E-value and Percolator features calculated using msgf2pin. All database searches were run using a ±3.0 Th mass tolerance, with XCorr flanking peaks not allowed in Crux searches, and all other search algorithm settings left at their defaults. 
Peptides were derived from the protein databases using trypsin cleavage rules without suppression of proline, and a single fixed carbamidomethyl modification was included.

Gradient-based feature representations derived from DRIP and XCorr were used to train an SVM classifier [13] and recalibrate PSM scores. Theseus training and computation of XCorr Fisher scores were performed using a customized version of Crux v2.1.17060 [17]. For an XCorr PSM, a feature representation is derived directly using both ∇_θ log p(s|θ) and ∇_θ log p(s, x|θ), as defined in Section 4, representing gradient information for the distribution of PSM scores and for the individual PSM score, respectively. DRIP gradient-based features, as defined in Section 3, were derived using a customized version of the DRIP Toolkit [8]. Figure 4 displays the resulting search accuracy for four worm and four yeast datasets. For the uncalibrated search results in Figure 5, we show that XCorr parameters may be learned without supervision using Theseus, and that the presented coordinate ascent algorithm (which estimates the most probable PSMs in order to take a step in the objective space) converges to a much better local optimum than maximum likelihood estimation.

Figure 4: Search accuracy plots, measured by q-value versus number of spectra identified, for worm (C. elegans) and yeast (Saccharomyces cerevisiae) datasets (panels Worm-1 through Worm-4 and Yeast-1 through Yeast-4). All methods are post-processed using the Percolator SVM classifier [13]. "DRIP" augments the standard set of DRIP features with DRIP-Viterbi-path parsed PSM features (described in [8]) and "DRIP Fisher" augments the heuristic set with gradient-based DRIP features. 
"XCorr," "XCorr p-value," and "MS-GF+" use their standard sets of Percolator features (described in [8]), while "XCorr p-value Fisher" and "XCorr Fisher" augment the standard XCorr feature sets with gradient-based Theseus features.

Figure 5: Search accuracy of Theseus' learned scoring-function parameters (panels shown include Yeast-1 and Yeast-2). Coordinate ascent parameters are learned using Algorithm 1, and MLE parameters are learned using gradient ascent.

5.1 Discussion

DRIP gradient-based post-processing improves upon the heuristically derived features in all cases, and does so substantially on a majority of the datasets. In the case of the yeast datasets, this distinguishes DRIP post-processing performance from all competitors and leads to state-of-the-art identification accuracy. Furthermore, we note that both XCorr and XCorr p-value post-processing performance are greatly improved using the gradient-based features derived using Theseus, raising performance above the highly similar MS-GF+ in several cases. Particularly noteworthy is the substantial improvement in XCorr accuracy which, using gradient-based information, is nearly competitive with its p-value counterpart. 
Given the respective runtimes of the underlying search algorithms, this presents a tradeoff between search time and accuracy. In practice, the DRIP and XCorr p-value computations are at least an order of magnitude slower than XCorr computation in Crux [21]. Thus, the presented work not only improves state-of-the-art accuracy, but also improves the accuracy of simpler, yet significantly faster, search algorithms.

Owing to max-product inference in graphical models, we also show that Theseus may be used to effectively learn XCorr model parameters (Figure 5) without supervision. Furthermore, we show that XCorr p-values are also made more accurate by training the underlying scoring function for which p-values are computed. This marks a novel step towards unsupervised training of uncalibrated scoring functions: unsupervised learning has been extensively explored for post-processor recalibration, but has remained an open problem for MS/MS database-search scoring functions. The presented learning framework, as well as the presented XCorr gradient-based feature representation, may be adapted by many of the widely used scoring functions represented by Theseus [2, 5, 6, 16, 10, 22, 17].

Many exciting avenues are open for future work. Leveraging the large breadth of graphical models research, we plan to explore other learning paradigms using Theseus (for instance, estimating other PSMs using k-best Viterbi in order to discriminatively learn parameters using algorithms such as max-margin learning). Perhaps most exciting, we plan to further investigate the peptide-to-observed-spectrum mapping derived from DRIP Fisher scores.
Under this mapping, we plan to explore learning distance metrics between PSMs in order to identify proteins from peptides.

Acknowledgments: This work was supported by the National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, through grant UL1 TR001860.

References

[1] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57:289–300, 1995.

[2] R. Craig and R. C. Beavis. TANDEM: matching proteins with tandem mass spectra. Bioinformatics, 20:1466–1467, 2004.

[3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39:1–22, 1977.

[4] Charles Elkan. Deriving TF-IDF as a Fisher kernel. In International Symposium on String Processing and Information Retrieval, pages 295–300. Springer, 2005.

[5] J. K. Eng, A. L. McCormack, and J. R. Yates, III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5:976–989, 1994.

[6] Jimmy K. Eng, Tahmina A. Jahan, and Michael R. Hoopmann. Comet: an open-source MS/MS sequence database search tool. Proteomics, 13(1):22–24, 2013.

[7] John T. Halloran, Jeff A. Bilmes, and William S. Noble. Learning peptide-spectrum alignment models for tandem mass spectrometry. In Uncertainty in Artificial Intelligence (UAI), Quebec City, Quebec, Canada, July 2014.
AUAI.

[8] John T. Halloran, Jeff A. Bilmes, and William S. Noble. Dynamic Bayesian network for accurate detection of peptides from tandem mass spectra. Journal of Proteome Research, 15(8):2749–2759, 2016.

[9] John T. Halloran and David M. Rocke. Gradients of generative models for improved discriminative analysis of tandem mass spectra: supplementary materials, 2017.

[10] J. Jeffry Howbert and William S. Noble. Computing exact p-values for a cross-correlation shotgun proteomics score function. Molecular & Cellular Proteomics, pages mcp–O113, 2014.

[11] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems, Cambridge, MA, 1998. MIT Press.

[12] Tommi S. Jaakkola, Mark Diekhans, and David Haussler. Using the Fisher kernel method to detect remote protein homologies. In ISMB, volume 99, pages 149–158, 1999.

[13] L. Käll, J. Canterbury, J. Weston, W. S. Noble, and M. J. MacCoss. A semi-supervised machine learning technique for peptide identification from shotgun proteomics datasets. Nature Methods, 4:923–25, 2007.

[14] Uri Keich, Attila Kertesz-Farkas, and William Stafford Noble. Improved false discovery rate estimation procedure for shotgun proteomics. Journal of Proteome Research, 14(8):3148–3161, 2015.

[15] Uri Keich and William Stafford Noble. On the importance of well-calibrated scores for identifying shotgun proteomics spectra. Journal of Proteome Research, 14(2):1147–1160, 2014.

[16] Sangtae Kim and Pavel A. Pevzner. MS-GF+ makes progress towards a universal database search tool for proteomics. Nature Communications, 5, 2014.

[17] Sean McIlwain, Kaipo Tamura, Attila Kertesz-Farkas, Charles E. Grant, Benjamin Diament, Barbara Frewen, J. Jeffry Howbert, Michael R. Hoopmann, Lukas Käll, Jimmy K. Eng, et al. Crux: rapid open source protein tandem mass spectrometry analysis.
Journal of Proteome Research, 2014.

[18] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[19] M. Spivak, J. Weston, L. Bottou, L. Käll, and W. S. Noble. Improvements to the Percolator algorithm for peptide identification from shotgun proteomics data sets. Journal of Proteome Research, 8(7):3737–3745, 2009. PMC2710313.

[20] M. Spivak, J. Weston, D. Tomazela, M. J. MacCoss, and W. S. Noble. Direct maximization of protein identifications from tandem mass spectra. Molecular and Cellular Proteomics, 11(2):M111.012161, 2012. PMC3277760.

[21] Shengjie Wang, John T. Halloran, Jeff A. Bilmes, and William S. Noble. Faster and more accurate graphical model identification of tandem mass spectra using trellises. Bioinformatics, 32(12):i322–i331, 2016.

[22] C. D. Wenger and J. J. Coon. A proteomics search algorithm specifically designed for high-resolution tandem mass spectra. Journal of Proteome Research, 2013.