{"title": "Learning Concave Conditional Likelihood Models for Improved Analysis of Tandem Mass Spectra", "book": "Advances in Neural Information Processing Systems", "page_first": 5420, "page_last": 5430, "abstract": "The most widely used technology to identify the proteins present in a complex biological sample is tandem mass spectrometry, which quickly produces a large collection of spectra representative of the peptides (i.e., protein subsequences) present in the original sample. In this work, we greatly expand the parameter learning capabilities of a dynamic Bayesian network (DBN) peptide-scoring algorithm, Didea, by deriving emission distributions for which its conditional log-likelihood scoring function remains concave. We show that this class of emission distributions, called Convex Virtual Emissions (CVEs), naturally generalizes the log-sum-exp function while rendering both maximum likelihood estimation and conditional maximum likelihood estimation concave for a wide range of Bayesian networks. Utilizing CVEs in Didea allows efficient learning of a large number of parameters while ensuring global convergence, in stark contrast to Didea\u2019s previous parameter learning framework (which could only learn a single parameter using a costly grid search) and other trainable models (which only ensure convergence to local optima). The newly trained scoring function substantially outperforms the state-of-the-art in both scoring function accuracy and downstream Fisher kernel analysis. Furthermore, we significantly improve Didea\u2019s runtime performance through successive optimizations to its message passing schedule and derive explicit connections between Didea\u2019s new concave score and related MS/MS scoring functions.", "full_text": "Learning Concave Conditional Likelihood Models for\n\nImproved Analysis of Tandem Mass Spectra\n\nJohn T. Halloran\n\nDepartment of Public Health Sciences\n\nUniversity of California, Davis\njthalloran@ucdavis.edu\n\nDavid M. 
Rocke\n\nDepartment of Public Health Sciences\n\nUniversity of California, Davis\ndmrocke@ucdavis.edu\n\nAbstract\n\nThe most widely used technology to identify the proteins present in a complex\nbiological sample is tandem mass spectrometry, which quickly produces a large\ncollection of spectra representative of the peptides (i.e., protein subsequences)\npresent in the original sample. In this work, we greatly expand the parameter\nlearning capabilities of a dynamic Bayesian network (DBN) peptide-scoring al-\ngorithm, Didea [25], by deriving emission distributions for which its conditional\nlog-likelihood scoring function remains concave. We show that this class of emis-\nsion distributions, called Convex Virtual Emissions (CVEs), naturally generalizes\nthe log-sum-exp function while rendering both maximum likelihood estimation and\nconditional maximum likelihood estimation concave for a wide range of Bayesian\nnetworks. Utilizing CVEs in Didea allows ef\ufb01cient learning of a large number of\nparameters while ensuring global convergence, in stark contrast to Didea\u2019s previous\nparameter learning framework (which could only learn a single parameter using\na costly grid search) and other trainable models [12, 13, 14] (which only ensure\nconvergence to local optima). The newly trained scoring function substantially\noutperforms the state-of-the-art in both scoring function accuracy and downstream\nFisher kernel analysis. Furthermore, we signi\ufb01cantly improve Didea\u2019s runtime\nperformance through successive optimizations to its message passing schedule and\nderive explicit connections between Didea\u2019s new concave score and related MS/MS\nscoring functions.\n\n1\n\nIntroduction\n\nA fundamental task in medicine and biology is identifying the proteins present in a complex biological\nsample, such as a drop of blood. 
The most widely used technology to accomplish this task is tandem\nmass spectrometry (MS/MS), which quickly produces a large collection of spectra representative of\nthe peptides (i.e., protein subsequences) present in the original sample. A critical problem in MS/MS\nanalysis, then, is the accurate identi\ufb01cation of the peptide generating each observed spectrum.\nThe most accurate methods which solve this problem search a database of peptides derived from the\nmapped organism of interest. Such database-search algorithms score peptides from the database and\nreturn the top-ranking peptide per spectrum. The pair consisting of an observed spectrum and scored\npeptide are typically referred to as a peptide-spectrum match (PSM). Many scoring functions have\nbeen proposed, ranging from simple dot-products [2, 28], to cross-correlation based [8], to p-value\nbased [10, 19, 16]. Recently, dynamic Bayesian networks (DBNs) have been shown to achieve the\nstate-of-the-art in both PSM identi\ufb01cation accuracy and post-search discriminative analysis, owing to\ntheir temporal modeling capabilities, parameter learning capabilities, and generative nature.\nThe \ufb01rst such DBN-based scoring function, Didea [25], used sum-product inference to ef\ufb01ciently\ncompute a log-posterior score for highly accurate PSM identi\ufb01cation. However, Didea utilized a\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00b4eal, Canada.\n\n\fFigure 1: Example tandem mass spectrum with precursor charge cs = 2 and generating peptide x =\nTGPSPQPESQGSFYQR. Plotted in red and blue are, respectively, b- and y-ion peaks (discussed in Section 2.1),\nwhile unidenti\ufb01ed peaks are colored gray.\n\ncomplicated emission distribution for which only a single parameter could be learned through a\ncostly grid search. 
Subsequently, a DBN for Rapid Identification of Peptides (DRIP) [13, 11] was shown in [12] to outperform Didea due to its ability to generatively learn a large number of model parameters. Most recently, DRIP's generative nature was further exploited to derive log-likelihood gradients detailing the manner in which peptides align with observed spectra [14]. Combining these gradients with a discriminative postprocessor [17], the resulting DRIP Fisher kernel substantially improved upon all state-of-the-art methods for downstream analysis on a large number of datasets.
However, while DRIP significantly improves several facets of MS/MS analysis due to its parameter learning capabilities, these improvements come at high runtime cost. In practice, DRIP inference is slow due to its large model complexity (the state-space grows exponentially in the lengths of both the observed spectrum and peptide). For instance, DRIP search required an order of magnitude longer than the slowest implementation of Didea for the timing tests in Section 5.2. Herein, we greatly improve upon all the analysis strengths provided by DRIP using the much faster Didea model. Furthermore, we optimize Didea's message passing schedule for a 64.2% speed improvement, leading to runtimes two orders of magnitude faster than DRIP and comparable to less accurate (but widely used) methods. Thus, the work described herein not only improves upon state-of-the-art DBN analysis for effective parameter learning, scoring function accuracy, and downstream Fisher kernel recalibration, but also renders such analysis practical by significantly decreasing state-of-the-art DBN inference time.
In this work, we begin by discussing relevant MS/MS background and previous work. 
We then greatly\nexpand the parameter learning capabilities of Didea by deriving a class of Bayesian network (BN)\nemission distributions for which both maximum likelihood learning and, most importantly, conditional\nmaximum likelihood learning are concave. Called Convex Virtual Emissions (CVEs), we show that\nthis class of emission distributions generalizes the widely used log-sum-exp function and naturally\narises from the solution of a nonlinear differential equation representing convex conditions for general\nBN emissions. We incorporate CVEs into Didea to quickly and ef\ufb01ciently learn a substantial number\nof model parameters, considerably improving upon the previous learning framework. The newly\ntrained model drastically improves PSM identi\ufb01cation accuracy, outperforming all state-of-the-art\nmethods over the presented datasets; at a strict FDR of 1% and averaged over the presented datasets,\nthe trained Didea scoring function identi\ufb01es 16% more spectra than DRIP and 17.4% more spectra\nthan the highly accurate and widely used MS-GF+ [19]. Under the newly parameterized model, we\nthen derive a bound explicitly relating Didea\u2019s score to the popular XCorr scoring function, thus\nproviding potential avenues to train XCorr using the presented parameter learning work.\nWith ef\ufb01cient parameter learning in place, we next utilize the new Didea model to improve MS/MS\nrecalibration performance. We use gradient information derived from Didea\u2019s conditional log-\nlikelihood in the feature-space of a kernel-based classi\ufb01er [17]. 
Training the resulting conditional\nFisher kernel substantially improves upon the state-of-the-art recalibration performance previously\nachieved by DRIP; at a strict FDR of 1%, discriminative recalibration using Didea\u2019s conditional\nFisher kernel results in an average 11.3% more identi\ufb01cations than using the DRIP Fisher kernel.\nFinally, we conclude with a discussion of several avenues for future work.\n\n2\n\n\f2 Tandem mass spectrometry\n\nWith a complex sample as input, a typical MS/MS experiment begins by cleaving the proteins of\nthe sample into peptides using a digesting enzyme, such as trypsin. The digested peptides are then\nseparated via liquid chromatography and undergo two rounds of mass spectrometry. The \ufb01rst round\nof mass spectrometry measures the mass and charge of the intact peptide, referred to as the precursor\nmass and precursor charge, respectively. Peptides are then fragmented into pre\ufb01x and suf\ufb01x ions.\nThe mass-to-charge (m/z) ratios of the resulting fragment ions are measured in the second round of\nmass spectrometry, producing an observed spectrum of m/z versus intensity values representative\nof the fragmented peptide. The output of this overall process is a large collection of spectra (often\nnumbering in the hundreds-of-thousands), each of which is representative of a peptide from the\noriginal complex sample and requires identi\ufb01cation. The x-axis of such observed spectra denotes m/z,\nmeasured in thomsons (Th), and y-axis measures the intensity at a particular m/z value. A sample\nsuch observed spectrum is illustrated in Figure 1.\n\n2.1 Database search and theoretical spectra\nLet s \u2208 S be an observed spectrum with precursor m/z ms and precursor charge cs, where S is the\nuniverse of tandem mass spectra. The generating peptide of s is identi\ufb01ed by searching a database\nof peptides, as follows. Let P be the universe of all peptides and x \u2208 P be an arbitrary peptide of\nlength l. x = x1 . . . 
$x_l$ is a string comprised of characters called amino acids, the dictionary size of which is 20. We denote peptide substrings as $x_{i:j} = x_i, \ldots, x_j$, where $i > 0$, $j \le l$, $i < j$, and the mass of $x$ as $m(x)$. Given a peptide database $\mathcal{D} \subseteq \mathcal{P}$, the set of peptides considered is constrained to those within a precursor mass tolerance window $w$ of $m_s$. The set of candidate peptides to be scored is thus $D(s, \mathcal{D}, w) = \{x : x \in \mathcal{D},\ |\frac{m(x)}{c_s} - m_s| \le w\}$. Using a scoring function $\psi : \mathcal{P} \times \mathcal{S} \to \mathbb{R}$, a database search outputs the top-scoring PSM, $x^* = \mathrm{argmax}_{x \in D} \psi(x, s)$.
In order to score a PSM, the idealized fragment ions of $x$ are first collected into a theoretical spectrum. The most commonly encountered fragment ions are called b-ions and y-ions. B- and y-ions correspond to prefix and suffix mass pairs, respectively, such that the precursor charge $c_s$ is divided amongst the pair. For b-ion charge $c_b \le c_s$, the $k$th b-ion and the accompanying y-ion are then, respectively,

$$b(m(x_{1:k}), c_b) = \frac{m(x_{1:k}) + c_b}{c_b} = \frac{\left[\sum_{i=1}^{k} m(x_i)\right] + c_b}{c_b}, \qquad y(m(x_{k+1:l}), c_y) = \frac{\left[\sum_{i=k+1}^{l} m(x_i)\right] + 18 + c_y}{c_y},$$

where $c_y$ is the y-ion charge, the b-ion offset corresponds to a $c_b$ charged hydrogen atom, and the y-ion offset corresponds to a $c_y$ charged hydrogen atom plus a water molecule. For singly charged spectra $c_s = 1$, only singly charged fragment ions are detectable, so that $c_b = c_y = 1$. For higher precursor charge states $c_s \ge 2$, the total charge is split between each b- and y-ion pair, so that $0 < c_b < c_s$ and $c_y = c_s - c_b$. The annotated b- and y-ions of an identified observed spectrum are illustrated in Figure 1.

3 Previous work

Many database search scoring algorithms have been proposed, each of which is characterized by the scoring function they employ. 
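As a concrete illustration, the b-/y-ion equations of Section 2.1 can be sketched in a few lines. This is a hypothetical helper, not part of any of the cited tools; the residue-mass table is abbreviated to a handful of amino acids with nominal integer masses, purely for illustration.

```python
# Sketch of the b-/y-ion m/z equations from Section 2.1 (hypothetical helper).
# Following the paper's convention, the b-ion offset is c_b (a c_b-charged
# hydrogen) and the y-ion offset is 18 + c_y (water plus a c_y-charged hydrogen).

RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "T": 101, "R": 156}  # toy subset

def b_ion(peptide, k, c_b=1):
    """m/z of the k-th b-ion: (sum_{i<=k} m(x_i) + c_b) / c_b."""
    prefix_mass = sum(RESIDUE_MASS[aa] for aa in peptide[:k])
    return (prefix_mass + c_b) / c_b

def y_ion(peptide, k, c_y=1):
    """m/z of the accompanying y-ion: (sum_{i>k} m(x_i) + 18 + c_y) / c_y."""
    suffix_mass = sum(RESIDUE_MASS[aa] for aa in peptide[k:])
    return (suffix_mass + 18 + c_y) / c_y
```

For a singly charged spectrum ($c_s = 1$), $c_b = c_y = 1$ and each prefix/suffix split of the peptide yields one b-/y-ion pair, as in Figure 1.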
These scoring functions have ranged from dot-products (X!Tandem [2] and Morpheus [28]), to cross-correlation based (XCorr [8]), to exact p-values computed over linear scores [19, 16]. Recently, DBNs have been used to substantially improve upon the accuracy of previous approaches.
In the first such DBN, Didea [25], the time series being modeled is the sequence of a peptide's amino acids (i.e., an amino acid is observed in each frame) and the quantized observed spectrum is observed in each frame. In successive frames, the sequence of b- and y-ions is computed and used as indices into the observed spectrum via virtual evidence [24]. A hidden variable in the first frame, corresponding to the amount by which to shift the observed spectrum, is then marginalized in order to compute a conditional log-likelihood consisting of a foreground score minus a background score, similar in form to XCorr (described in Section 4.2.1). The resulting scoring function outperformed the most accurate scoring algorithms at the time (including MS-GF+, then called MS-GFDB) on a majority of datasets. However, parameter learning in the model was severely limited and inefficient; a single hyperparameter controlling the reweighting of peak intensities was learned via an expensive grid search, requiring repeated database searches over a dataset.
Subsequent work saw the introduction of DRIP [12, 13], a DBN with substantial parameter learning capabilities. In DRIP, the time series being modeled is the sequence of observed spectrum peaks (i.e., each frame in DRIP corresponds to an observed peak), and two types of prevalent phenomena are explicitly modeled via sequences of random variables: spurious observed peaks (called insertions) and absent theoretical peaks (called deletions). A large collection of Gaussians parameterizing the m/z axis are generatively learned, via expectation-maximization (EM) [4], and used to score observed peaks. 
DRIP then uses max-product inference to calculate the most probable sequences of insertions and deletions in order to score PSMs.
In practice, the majority of the PSM scoring functions discussed are poorly calibrated, i.e., it is often difficult to compare PSM scores across different spectra. In order to combat such poor calibration, postprocessors are commonly employed to recalibrate PSM scores [17, 26, 27]. In recent work, DRIP's generative framework was further exploited to calculate highly accurate features based on the log-likelihood gradients of its learnable parameters. Combining these new gradient-based features with a popular kernel-based classifier for recalibrating PSM scores [17], the resulting Fisher kernel was shown to significantly improve postprocessing accuracy [14].

4 Didea

Figure 2: Graph of Didea. Unshaded nodes are hidden, shaded nodes are observed, and edges denote deterministic functions of parent variables.

We now derive Didea's scoring function in detail. The graph of Didea is displayed in Figure 2. Shaded variables are observed and unshaded variables are hidden (random). Groups of variables are collected into time instances called frames, where the first frame is called the prologue, the final frame is called the epilogue, and the chunk dynamically expands to fill all frames in between. Let $0 \le t \le l$ be an arbitrary frame. The amino acids of a peptide are observed in each frame after the prologue. The variable $M_t$ successively accumulates the prefix masses of the peptide, such that $p(M_0 = 0) = 1$ and $p(M_t = M_{t-1} + m(x_t) \mid M_{t-1}, x_t) = 1$, while the variable $M'_t$ successively accumulates the suffix masses of the peptide, such that $p(M'_l = 0) = 1$ and $p(M'_t = M'_{t+1} + m(x_{t+1}) \mid M'_{t+1}, x_{t+1}) = 1$. Denoting the maximum spectra shift as $L$, the shift variable $\tau_0 \in [-L, L]$ is hidden, uniform, and deterministically copied by its descendants in successive frames, such that $p(\tau_t = \bar\tau \mid \tau_0 = \bar\tau) = 1$ for $t > 1$.
Let $s \in \mathbb{R}^{\bar o + 1}$ be the binned observed spectrum, i.e., a vector of length $\bar o + 1$ whose $i$th element is $s(i)$, where $\bar o$ is the maximum observable discretized m/z value. Shifted versions of the $t$th b- and y-ion pair (where the shift is denoted by subscript) are deterministic functions of the shift variable as well as the prefix and suffix masses, i.e., $p(B_t = b_{\tau_t}(M_t, 1) \mid M_t, \tau_t) = p(B_t = \min(\max(b(M_t, 1) - \tau_t, 0), \bar o) \mid M_t, \tau_t)$ and $p(Y_t = y_{\tau_t}(m(x_{t+1:l}), 1) \mid M'_t, \tau_t) = p(Y_t = \min(\max(y(m(x_{t+1:l}), 1) - \tau_t, 0), \bar o) \mid M'_t, \tau_t)$, respectively, where each shifted index is clamped into $[0, \bar o]$. $\xi_b$ and $\xi_y$ are virtual evidence children, i.e., leaf nodes whose conditional distribution need not be normalized (only non-negative) to compute posterior probabilities in the DBN. A comprehensive overview of virtual evidence is available in [15]. $\xi_b$ and $\xi_y$ compare the b- and y-ions, respectively, to the observed spectrum, such that $p(\xi_b \mid B_t) = f(s(b_{\tau_t}(m(x_{1:t}), 1)))$ and $p(\xi_y \mid Y_t) = f(s(y_{\tau_t}(m(x_{t+1:l}), 1)))$, where $f$ is a non-negative emission function.
Let $\mathbf{1}\{\cdot\}$ denote the indicator function. 
Didea's log-likelihood is then

$$\begin{aligned}
\log p(\tau_0 = \bar\tau, x, s) &= \log p(\tau_0 = \bar\tau)\, p(M_0)\, p(M'_0 \mid M'_1, x_1)\, p(M'_l)\, p(M_l \mid M_{l-1}, x_l)\\
&\quad + \log \prod_{t=1}^{l-1} \big[\, p(\tau_t \mid \tau_{t-1})\, p(M_t \mid M_{t-1}, x_t)\, p(M'_t \mid M'_{t+1}, x_{t+1})\, p(B_t \mid M_t, \tau_t)\, p(Y_t \mid M'_t, \tau_t)\, p(\xi_b \mid B_t)\, p(\xi_y \mid Y_t)\, \big]\\
&= \log p(\tau_0 = \bar\tau) + \log \prod_{t=1}^{l-1} \big( \mathbf{1}\{\tau_t = \bar\tau \wedge M_t = m(x_{1:t}) \wedge M'_t = m(x_{t+1:l})\}\, p(B_t \mid M_t, \tau_t)\, p(Y_t \mid M'_t, \tau_t)\, p(\xi_b \mid B_t)\, p(\xi_y \mid Y_t) \big)\\
&= \log p(\tau_0 = \bar\tau) + \log \prod_{t=1}^{l-1} p(\xi_b \mid b_{\bar\tau}(M_t, 1))\, p(\xi_y \mid y_{\bar\tau}(M'_t, 1))\\
&= \log p(\tau_0 = \bar\tau) + \sum_{t=1}^{l-1} \big( \log f(s_{\bar\tau}(b(m(x_{1:t}), 1))) + \log f(s_{\bar\tau}(y(m(x_{t+1:l}), 1))) \big).
\end{aligned}$$

In order to score PSMs, Didea computes the conditional log-likelihood

$$\psi(s, x) = \log p(\tau_0 = 0 \mid x, s) = \log p(\tau_0 = 0, x, s) - \log \sum_{\bar\tau = -L}^{L} p(\tau_0 = \bar\tau)\, p(x, s_{\bar\tau} \mid \tau_0 = \bar\tau) = \log p(\tau_0 = 0, x, s) - \log \frac{1}{|\tau_0|} \sum_{\bar\tau = -L}^{L} p(x, s_{\bar\tau} \mid \tau_0 = \bar\tau). \qquad (1)$$

As previously mentioned, $\psi(s, x)$ is a foreground score minus a background score, where the background score consists of averaging over $|\tau_0|$ shifted versions of the foreground score, much like the XCorr scoring function. 
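The foreground-minus-background structure of Equation 1 can be sketched as follows. This is a minimal illustration only, not the Didea implementation: the binned spectrum is a plain list, `b_ions`/`y_ions` are precomputed bin indices, the emission is the exponential $f(v) = e^{\theta v}$ adopted later in Section 4.2, and all names are hypothetical.

```python
import math

def didea_score(spectrum, b_ions, y_ions, L=5, theta=1.0):
    """Foreground score at shift 0 minus the log-average of the foreground
    scores over all shifts in [-L, L] (cf. Equation 1)."""
    o_max = len(spectrum) - 1

    def foreground(tau):
        # sum over frames of log f(s_tau(.)) with log f(v) = theta * v,
        # clamping each shifted bin index into [0, o_max]
        return sum(theta * spectrum[min(max(i - tau, 0), o_max)]
                   for i in b_ions + y_ions)

    shifts = list(range(-L, L + 1))
    background = math.log(
        sum(math.exp(foreground(t)) for t in shifts) / len(shifts))
    return foreground(0) - background
```

A peptide whose fragment ions line up with intense bins at shift 0 receives a large positive score, while a peptide matching only a shifted copy of the spectrum is penalized by the background term, mirroring XCorr's cross-correlation penalty.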
Thus, Didea may be thought of as a probabilistic analogue of XCorr.

4.1 Convex Virtual Emissions for Bayesian networks

Consider an arbitrary Bayesian network where the observed variables are leaf nodes, as is common in a large number of time-series models such as hidden Markov models (HMMs), hierarchical HMMs [22], DBNs for speech recognition [5], hybrid HMMs/DBNs [3], as well as DRIP and Didea. Let $E$ be the set of observed random variables, $H$ be the hypothesis space composed of the cross-product of the $n$ hidden discrete random variables in the network, and $h \in H$ be an arbitrary hypothesis (i.e., an instantiation of the hidden variables). As is the case in Didea, often desired is the log-posterior probability $\log p(h|E) = \log \frac{p(h,E)}{p(E)} = \log \frac{p(h,E)}{\sum_{\bar h \in H} p(\bar h, E)} = \log p(h, E) - \log \sum_{h \in H} p(h)\, p(E|h)$.
Under general assumptions, we'd like to find emission functions for which $\log p(h|E)$ is concave. Assume $p(h)$ and $p(E|h)$ are non-negative, that the emission density $p(E|h)$ is parameterized by $\theta$ (which we'd like to learn), and that there is a parameter $\theta_h$ to be learned for every hypothesis of latent variables (though if we have fewer parameters, parameter estimation becomes strictly easier). We make this parameterization explicit by denoting the emission distributions of interest as $p_{\theta_h}(E|h)$. Assume that $p_{\theta_h}(E|h)$ is smooth on $\mathbb{R}$ for all $h \in H$. Applying virtual evidence for such models, $p_{\theta_h}(E|h)$ need not be normalized for posterior inference (as well as Viterbi inference and comparative inference between sets of observations; an extensive review of virtual evidence may be found in [15]). Given the factorization of the joint distribution described by the BN, the quantity $p_{\theta_h}(h, E) = p(h)\, p_{\theta_h}(E|h)$ may often be efficiently computed for any given $h$. 
Thus, the computationally difficult portion of $\log p(h|E)$ is the calculation of the log-likelihood in the denominator, wherein all hidden variables are marginalized over. We therefore first seek emission functions for which the log-likelihood $\log p(E) = \log \sum_{h \in H} p(h)\, p_{\theta_h}(E|h)$ is convex. For such emission functions, we have the following theorem.

Theorem 1. The unique convex functions of the form $\log \sum_{h \in H} p(h)\, p_{\theta_h}(E|h)$, such that $(p'_{\theta_h}(E|h))^2 - p''_{\theta_h}(E|h)\, p_{\theta_h}(E|h) = 0$, are $\log \sum_{h \in H} \alpha_h e^{\beta_h \theta_h}$, where $\alpha_h = p(h) a_h$ and $a_h$, $\beta_h$ are hyperparameters.

The full proof of Theorem 1 is given in [15]. The nonlinear differential equation $(p'_{\theta_h}(E|h))^2 - p''_{\theta_h}(E|h)\, p_{\theta_h}(E|h) = 0$ describes the curvature of the desired emission functions and arises from the necessary and sufficient conditions for twice-differentiable convex functions (i.e., the Hessian must be p.s.d.) and the Cauchy-Schwarz inequality. Particular values of the hyperparameters $a_h$ and $\beta_h$ correspond to unique initial conditions for this nonlinear differential equation. Note that when $\alpha_h = 1$, $p_{\theta_h}(E|h) = e^{\theta_h}$, we have the well-known log-sum-exp (LSE) convex function. Thus, this result generalizes the LSE function to a broader class of convex functions.
We call the unique class of convex functions which arise from solving the nonlinear differential equation in Theorem 1, $p_{\theta_h}(E|h) = a_h e^{\beta_h \theta_h}$, Convex Virtual Emissions (CVEs). Note that, utilizing CVEs, maximum likelihood estimation (i.e., $\mathrm{argmax}_\theta - \log \sum_{h \in H} p(h)\, p_{\theta_h}(E|h)$) is thus concave and guaranteed to converge to a global optimum. Furthermore, and most importantly for Didea, we have the following result for the conditional log-likelihood (the full proof of which is in [15]).

Corollary 1.1. For convex $\log p(E) = \log \sum_{h \in H} p(h)\, p_{\theta_h}(E|h)$ such that $(p'_{\theta_h}(E|h))^2 - p''_{\theta_h}(E|h)\, p_{\theta_h}(E|h) = 0$, the log-posterior $\log p_\theta(h|E)$ is concave in $\theta$.

Thus, utilizing CVEs, conditional maximum likelihood estimation is also rendered concave.

4.2 CVEs in Didea

In [25], the virtual evidence emission function used to score peak intensities was $f_\lambda(s(i)) = 1 - \lambda e^{-\lambda} + \lambda e^{-\lambda(1 - s(i))}$. Under this function, Didea was shown to perform well on a variety of datasets. However, this function is non-convex and does not permit efficient parameter learning; although only a single model parameter, $\lambda$, was trained, learning required a grid search wherein each step consisted of a database search over a dataset and subsequent target-decoy analysis to assess each new parameter value. While this training scheme is already costly and impractical, it quickly becomes infeasible when looking to learn more than a single model parameter.
We use CVEs to render Didea's conditional log-likelihood concave given a large number of parameters. To efficiently learn a distinct observation weight $\theta_\tau$ for each spectral shift $\tau \in [-L, L]$, we thus utilize the emission function $f_{\theta_\tau}(s(i)) = e^{\theta_\tau s(i)}$. Denote the set of parameters per spectral shift as $\theta = \{\theta_{-L}, \ldots, \theta_L\}$. Due to the concavity of Equation 1 using $f_{\theta_\tau}$ under Corollary 1.1, given $n$ PSM training pairs $(s_1, x_1), \ldots, (s_n, x_n)$, the learned parameters $\theta^* = \mathrm{argmax}_\theta \sum_{i=1}^{n} \psi_\theta(s_i, x_i)$ are guaranteed to converge to a global optimum. 
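Under the CVE emission $f_{\theta_\tau}(s(i)) = e^{\theta_\tau s(i)}$, the per-PSM score reduces to a linear function of $\theta$ minus a log-sum-exp of linear functions, so plain gradient ascent on the conditional log-likelihood converges globally. A minimal sketch under hypothetical names, assuming the summed matched intensity $V_\tau$ at each shift has been precomputed per PSM:

```python
import math

def conditional_ll(theta, V):
    """psi_theta for one PSM with CVE emissions f_{theta_tau}(s(i)) = exp(theta_tau s(i)).
    V[j] is the total matched intensity at shift tau_j (V[len(V)//2] is shift 0), so the
    score is linear in theta minus a log-sum-exp of linear terms -- concave in theta."""
    zero = len(V) // 2
    log_bg = math.log(sum(math.exp(th * v) for th, v in zip(theta, V)) / len(V))
    return theta[zero] * V[zero] - log_bg

def gradient(theta, V):
    zero = len(V) // 2
    w = [math.exp(th * v) for th, v in zip(theta, V)]
    Z = sum(w)
    g = [-(wi / Z) * vi for wi, vi in zip(w, V)]  # -E[V] under the shift posterior
    g[zero] += V[zero]                            # foreground term
    return g

def learn(Vs, steps=200, lr=0.05):
    """Gradient ascent on sum_i psi_theta(s_i, x_i); by concavity (Corollary 1.1),
    this converges toward the global optimum rather than a local one."""
    theta = [0.0] * len(Vs[0])
    for _ in range(steps):
        for V in Vs:
            theta = [th + lr * gi for th, gi in zip(theta, gradient(theta, V))]
    return theta
```

The midpoint inequality $\psi_{(\theta_a + \theta_b)/2} \ge (\psi_{\theta_a} + \psi_{\theta_b})/2$ can be checked numerically for any pair of parameter vectors, which is exactly the concavity that Corollary 1.1 guarantees.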
Further analysis of Didea's scoring function under this new emission function may be found in [15], including the derivation of the new model's gradients (i.e., conditional Fisher scores).

4.2.1 Relating Didea's conditional log-likelihood to XCorr using CVEs

XCorr [8], the very first database search scoring function for peptide identification, remains one of the most widely used tools in the field today. Owing to its prominence, XCorr remains an active subject of analysis and continuous development [20, 23, 7, 6, 16, 21, 9, 14]. As previously noted, the scoring functions of XCorr and Didea share several similarities in form, where, in fact, the former served as the motivating example in [25] for both the design of the Didea model and its posterior-based scoring function. While cosmetic similarities have thus far been noted, the reparameterization of Didea's conditional log-likelihood using CVEs permits the derivation of an explicit relationship between the two.
Let $u$ be the theoretical spectrum of peptide $x$. As with Didea, let $L$ be the maximum spectra shift considered and, for shift $\tau$, denote a vector shift via subscript, such that $s_\tau$ is the vector of observed spectrum elements shifted by $\tau$ units. In order to compare $u$ and $s$, XCorr is thus computed as

$$\mathrm{XCorr}(s, x) = u^T s - \frac{1}{2L+1} \sum_{\tau = -L}^{L} u^T s_\tau.$$

Intuitively, the cross-correlation background term is meant to penalize overfitting of the theoretical spectrum. Under the newly parameterized Didea conditional log-likelihood described herein, we have the following theorem explicitly relating the XCorr and Didea scoring functions.

Theorem 2. Assume the PSM scoring function $\psi(s, x)$ is that of Didea (i.e., Equation 1), where the emission function $f_{\theta_\tau}(s(i))$ has uniform weights $\theta_i = \theta_j$ for all $i, j \in [-L, L]$. Then $\psi(s, x) \le O(\mathrm{XCorr}(s, x))$.

The full proof of Theorem 2 may be found in [15]. Thus, Didea's scoring function effectively serves to lower bound XCorr. This opens possible avenues for extending the learning results detailed herein to the widely used XCorr function. For instance, a natural extension is to use a variational Bayesian inference approach and learn XCorr parameters through iterative maximization of the Didea lower bound, made efficient by the concavity of the new Didea model derived in this work.

4.3 Faster Didea sum-product inference

We successively improved Didea's inference time when conducting a database search using the intensive charge-varying model (discussed in [15]). Firstly, we removed the need for a backward pass by keeping track of the foreground log-likelihood during the forward pass (which computes the background score, i.e., the probability of evidence in the model). Next, by exploiting the symmetry of the spectral shifts, we cut the effective cardinality of $\tau$ in half during inference. While this requires twice as much memory in practice, this is not close to being prohibitive on modern machines. Finally, a large portion of the speedup was achieved by offsetting the virtual evidence vector by $|\tau|$, pre/post buffering with zeros, and offsetting each computed b- and y-ion by $|\tau|$. Under this construction, the scores do not change, but, during inference, we are able to shift each computed b- and y-ion by $\pm\tau$ without requiring any bound checking. Hashing virtual evidence bin values by b-/y-ion value and $\tau$ was also pursued, but did not offer any runtime benefit over the aforementioned speedups (due to the cost of constructing the hash table per spectrum).

5 Results

In practice, assessing peptide identification accuracy is made difficult by the lack of ground-truth encountered in real-world data. 
Thus, it is most common to estimate the false discovery rate (FDR) [1] by searching a decoy database of peptides which are unlikely to occur in nature, typically generated by shuffling entries in the target database [18]. For a particular score threshold, $t$, the FDR is calculated as the proportion of decoys scoring better than $t$ to the number of targets scoring better than $t$. Once the target and decoy PSM scores are calculated, a curve displaying the FDR threshold versus the number of correctly identified targets at each given threshold may be calculated. In place of FDR along the x-axis, we use the q-value [18], defined to be the minimum FDR threshold at which a given score is deemed to be significant. As many applications require that a search algorithm perform well at low thresholds, we only plot $q \in [0, 0.1]$.
The benchmark datasets and search settings used to recently evaluate the DRIP Fisher kernel in [14] are adapted in this work. The charge-varying Didea model (which integrates over multiple charge states, further described in [15]) with concave emissions (described in Section 4.2) was used to score and rank database peptides. Concave Didea parameters were learned using gradient ascent and the high-quality PSMs used to generatively train the DRIP model in [13]. Didea's newly trained database-search scoring function is benchmarked against the Didea model from [25], trained using a costly grid search for a single parameter (denoted "Didea-0"), and four other state-of-the-art scoring algorithms: DRIP, MS-GF+, XCorr p-values, and XCorr.
DRIP searches were conducted using the DRIP Toolkit and the generatively trained parameters described in [12, 13]. MS-GF+, one of the most accurate search algorithms in widespread use, was run using version 9980, with PSMs ranked by E-value. XCorr and XCorr p-value scores were collected using Crux v2.1.17060. 
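The target-decoy FDR/q-value procedure described above can be sketched as follows. This is a hypothetical helper, not the estimation code used by any of the benchmarked tools; score ties and the `>=` threshold convention are handled naively.

```python
def qvalues(target_scores, decoy_scores):
    """Target-decoy q-values: FDR(t) = #{decoys >= t} / #{targets >= t}, and a
    target's q-value is the minimum FDR over all thresholds at which it is still
    accepted. Returns q-values for the targets in decreasing-score order."""
    pool = sorted([(s, True) for s in target_scores] +
                  [(s, False) for s in decoy_scores], key=lambda p: -p[0])
    fdrs, n_targets, n_decoys = [], 0, 0
    for score, is_target in pool:
        if is_target:
            n_targets += 1
            fdrs.append(n_decoys / n_targets)  # FDR at this target's threshold
        else:
            n_decoys += 1
    for i in range(len(fdrs) - 2, -1, -1):     # running minimum, bottom up
        fdrs[i] = min(fdrs[i], fdrs[i + 1])
    return fdrs
```

The running minimum in the final loop is what makes q-values monotone in the score, so plotting "q-value versus number of spectra identified" (as in Figure 3) yields a non-decreasing curve.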
All database searches were run using a ±3.0Th mass tolerance, XCorr flanking peaks not allowed in Crux searches, and all other search algorithm settings left at their defaults. Peptides were derived from the protein databases using trypsin cleavage rules without suppression of proline, and a single fixed carbamidomethyl modification was included.
The resulting database-search accuracy plots are displayed in Figure 3. The trained Didea model outperforms all competitors across all presented datasets; compared to the highly accurate scoring algorithms DRIP and MS-GF+, the trained Didea scoring function identifies 16% more spectra than DRIP and 17.4% more spectra than MS-GF+, at a strict FDR of 1% averaged over the presented datasets. This high-level performance is attributable to the expanded and efficient parameter learning framework, which greatly improves upon the limited parameter learning capabilities of the original Didea model, identifying 9.8% more spectra than Didea-0 at a strict FDR of 1% averaged over the presented datasets.

[Figure 3 panels: (b) Worm-1, (c) Worm-2, (d) Worm-3, (e) Worm-4, (f) Yeast-1, (g) Yeast-2, (h) Yeast-3, (i) Yeast-4]

Figure 3: Database search accuracy plots measured by q-value versus number of spectra identified for worm (C. elegans) and yeast (Saccharomyces cerevisiae) datasets. All methods are run with as equivalent settings as possible. The Didea charge-varying model (described in [15]) was used to score PSMs, with "Didea" denoting the model trained per charge state using the concave framework described in Section 4.2 and "Didea-0" denoting the model from [25] trained using a grid search. 
DRIP, another DBN-based scoring function, was run using the generatively learned parameters described in [13].

5.1 Conditional Fisher kernel for improved discriminative analysis

Facilitated by Didea's effective parameter learning framework, we look to leverage gradient-based PSM information to aid discriminative postprocessing analysis. We utilize the same set of features as the DRIP Fisher kernel [14]. However, in order to measure the relative utility of the gradients under study, we replace the DRIP log-likelihood gradients with Didea gradient information (derived in [15]). These features are used to train an SVM classifier, Percolator [17], which recalibrates PSM scores based on the learned decision boundary between input targets and decoys. Didea's resulting conditional Fisher kernel is benchmarked against the DRIP Fisher kernel and the previously benchmarked scoring algorithms using their respective standard Percolator feature sets.
DRIP kernel features were computed using the customized version of the DRIP Toolkit from [14]. MS-GF+ Percolator features were collected using msgf2pin, and XCorr/XCorr p-value features were collected using Crux. In the resulting postprocessing analysis, the trained Didea scoring function outperforms all competitors, identifying 12.3% more spectra than DRIP and 13.4% more spectra than MS-GF+ at a strict FDR of 1%, averaged over the presented datasets. The full panel of results is displayed in [15].
Compared to DRIP's log-likelihood gradient features, the conditional log-likelihood gradients of Didea contain much richer PSM information, allowing Percolator to better distinguish target from decoy PSMs and thus achieve much greater recalibration performance.

5.2 Optimized exact sum-product inference for improved Didea runtime

Implementing the speedups to exact Didea sum-product inference described in Section 4.3, we benchmark the optimized search algorithm using 1,000 randomly sampled spectra (with charges varying from 1+ to 3+) from the Worm-1 dataset, averaging database-search times (reported in wall-clock time) over 10 runs. The resulting runtimes are listed in Table 1. DRIP was run using the DRIP Toolkit, and XCorr p-values were collected using Crux v2.1.17060. All benchmarked search algorithms were run on the same machine with an Intel Xeon E5-2620 and 64GB RAM. The described optimizations result in a 64.2% runtime improvement and bring search time closer to that of less accurate, but faster, search algorithms.

            Didea-0    Didea Opt.    XCorr p-values    DRIP
runtime     19.1175    6.8535        2.2955            143.4712

Table 1: Database search runtimes per spectrum, in seconds, searching 1,000 worm spectra randomly sampled from the Worm-1 dataset. "Didea-0" is the implementation of Didea used in [25] and "Didea Opt" is the speed-optimized implementation described herein. All reported search algorithm runtimes were averaged over 10 runs.

6 Conclusions and future work

In this work, we've derived a widely applicable class of Bayesian network emission distributions, CVEs, which naturally generalize the convex log-sum-exp function and carry important theoretical properties for parameter learning. Using CVEs, we've substantially improved the parameter learning capabilities of the DBN scoring algorithm, Didea, by rendering its conditional log-likelihood concave with respect to a large set of learnable parameters. Unlike previous DBN parameter learning solutions, which only guarantee convergence to a local optimum, the new learning framework guarantees global convergence. Didea's newly trained database-search scoring function significantly outperforms all state-of-the-art scoring algorithms on the presented datasets. With efficient parameter learning in hand, we derived the gradients of Didea's conditional log-likelihood and used this gradient information in the feature space of a kernel-based discriminative postprocessor. The resulting conditional Fisher kernel once again outperforms the state-of-the-art on all presented datasets, including a highly accurate, recently proposed Fisher kernel. Furthermore, we successively optimized Didea's message passing schedule, leading to DBN analysis times two orders of magnitude faster than other leading DBN tools for MS/MS analysis. Thus, the presented results improve upon all aspects of state-of-the-art DBN analysis for MS/MS. Finally, using the new learning framework, we've proven that Didea's score proportionally lower bounds the widely used XCorr scoring function.
There are a number of exciting avenues for future work.
Considering the large amount of PSM information held in the gradient space of Didea's conditional log-likelihood, we plan to pursue kernel-based approaches to peptide identification using the Hessian of the scoring function. This is especially exciting given the high degree of recalibration accuracy provided by Percolator, a kernel-based postprocessor. Using a variational approach, we also plan to investigate parameter learning options for XCorr, given the Didea lower bound and the concavity of Didea's parameterized scoring function. Finally, in perhaps the most ambitious avenue for future work, we plan to further build upon Didea's parameter learning framework by learning distance matrices between the theoretical and observed spectra. Such matrices naturally generalize the class of CVEs derived herein.
Acknowledgments: This work was supported by the National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, through grant UL1 TR001860.

References

[1] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57:289–300, 1995.

[2] R. Craig and R. C. Beavis. TANDEM: matching proteins with tandem mass spectra. Bioinformatics, 20:1466–1467, 2004.

[3] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012.

[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39:1–22, 1977.

[5] Li Deng. Dynamic speech models: theory, algorithms, and applications. Synthesis Lectures on Speech and Audio Processing, 2(1):1–118, 2006.

[6] Benjamin J.
Diament and William Stafford Noble. Faster SEQUEST searching for peptide identification from tandem mass spectra. Journal of Proteome Research, 10(9):3871–3879, 2011.

[7] J. K. Eng, B. Fischer, J. Grossman, and M. J. MacCoss. A fast SEQUEST cross correlation algorithm. Journal of Proteome Research, 7(10):4598–4602, 2008.

[8] J. K. Eng, A. L. McCormack, and J. R. Yates, III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5:976–989, 1994.

[9] Jimmy K. Eng, Tahmina A. Jahan, and Michael R. Hoopmann. Comet: an open-source MS/MS sequence database search tool. Proteomics, 13(1):22–24, 2013.

[10] Lewis Y. Geer, Sanford P. Markey, Jeffrey A. Kowalak, Lukas Wagner, Ming Xu, Dawn M. Maynard, Xiaoyu Yang, Wenyao Shi, and Stephen H. Bryant. Open mass spectrometry search algorithm. Journal of Proteome Research, 3(5):958–964, 2004.

[11] John T. Halloran. Analyzing tandem mass spectra using the DRIP Toolkit: training, searching, and post-processing. In Data Mining for Systems Biology, pages 163–180. Springer, 2018.

[12] John T. Halloran, Jeff A. Bilmes, and William S. Noble. Learning peptide-spectrum alignment models for tandem mass spectrometry. In Uncertainty in Artificial Intelligence (UAI), Quebec City, Quebec, Canada, July 2014. AUAI.

[13] John T. Halloran, Jeff A. Bilmes, and William S. Noble. Dynamic Bayesian network for accurate detection of peptides from tandem mass spectra. Journal of Proteome Research, 15(8):2749–2759, 2016.

[14] John T. Halloran and David M. Rocke. Gradients of generative models for improved discriminative analysis of tandem mass spectra. In Advances in Neural Information Processing Systems, pages 5728–5737, 2017.

[15] John T. Halloran and David M. Rocke.
Learning Concave Conditional Likelihood Models for Improved Analysis of Tandem Mass Spectra: Supplementary Material. In Advances in Neural Information Processing Systems, 2018.

[16] J. Jeffry Howbert and William S. Noble. Computing exact p-values for a cross-correlation shotgun proteomics score function. Molecular & Cellular Proteomics, pages mcp–O113, 2014.

[17] L. Käll, J. Canterbury, J. Weston, W. S. Noble, and M. J. MacCoss. A semi-supervised machine learning technique for peptide identification from shotgun proteomics datasets. Nature Methods, 4:923–925, 2007.

[18] Uri Keich, Attila Kertesz-Farkas, and William Stafford Noble. Improved false discovery rate estimation procedure for shotgun proteomics. Journal of Proteome Research, 14(8):3148–3161, 2015.

[19] Sangtae Kim and Pavel A. Pevzner. MS-GF+ makes progress towards a universal database search tool for proteomics. Nature Communications, 5, 2014.

[20] A. A. Klammer, C. Y. Park, and W. S. Noble. Statistical calibration of the SEQUEST XCorr function. Journal of Proteome Research, 8(4):2106–2113, 2009. PMC2807930.

[21] Sean McIlwain, Kaipo Tamura, Attila Kertesz-Farkas, Charles E. Grant, Benjamin Diament, Barbara Frewen, J. Jeffry Howbert, Michael R. Hoopmann, Lukas Käll, Jimmy K. Eng, et al. Crux: rapid open source protein tandem mass spectrometry analysis. Journal of Proteome Research, 2014.

[22] Kevin P. Murphy and Mark A. Paskin. Linear-time inference in hierarchical HMMs. In Advances in Neural Information Processing Systems, pages 833–840, 2002.

[23] C. Y. Park, A. A. Klammer, L. Käll, M. P. MacCoss, and W. S. Noble. Rapid and accurate peptide identification from tandem mass spectra. Journal of Proteome Research, 7(7):3022–3027, 2008.

[24] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[25] Ajit P. Singh, John Halloran, Jeff A.
Bilmes, Katrin Kirchhoff, and William S. Noble. Spectrum identification using a dynamic Bayesian network model of tandem mass spectra. In Uncertainty in Artificial Intelligence (UAI), Catalina Island, USA, July 2012. AUAI.

[26] M. Spivak, J. Weston, L. Bottou, L. Käll, and W. S. Noble. Improvements to the Percolator algorithm for peptide identification from shotgun proteomics data sets. Journal of Proteome Research, 8(7):3737–3745, 2009. PMC2710313.

[27] M. Spivak, J. Weston, D. Tomazela, M. J. MacCoss, and W. S. Noble. Direct maximization of protein identifications from tandem mass spectra. Molecular and Cellular Proteomics, 11(2):M111.012161, 2012. PMC3277760.

[28] C. D. Wenger and J. J. Coon. A proteomics search algorithm specifically designed for high-resolution tandem mass spectra. Journal of Proteome Research, 2013.