{"title": "Hidden Markov Models in Molecular Biology: New Algorithms and Applications", "book": "Advances in Neural Information Processing Systems", "page_first": 747, "page_last": 754, "abstract": null, "full_text": "Hidden Markov Models in Molecular Biology: New Algorithms and Applications \n\nPierre Baldi* \n\nJet Propulsion Laboratory \n\nCalifornia Institute of Technology \n\nPasadena, CA 91109 \n\nYves Chauvin\u2020 \n\nNet-ID, Inc. \n\n8, Cathy Place \n\nMenlo Park, CA 94305 \n\nTim Hunkapiller \n\nDivision of Biology \n\nCalifornia Institute of Technology \n\nMarcella A. McClure \n\nDepartment of Evolutionary Biology \n\nUniversity of California, Irvine \n\nAbstract \n\nHidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families, including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification. \n\n* and Division of Biology, California Institute of Technology. \n\u2020 and Department of Psychology, Stanford University. \n\n1 INTRODUCTION \n\nHidden Markov Models (e.g., Rabiner, 1989) and the more general EM algorithm in statistics can be applied to the modeling and analysis of biological primary sequence information (Churchill (1989), Lawrence and Reilly (1990), Baldi et al. (1992), Cardon and Stormo (1992), Haussler et al. 
(1992)). Most notably, as in speech recognition applications, a family of evolutionarily related sequences can be viewed as consisting of different utterances of the same prototypical sequence resulting from common underlying HMM dynamics. A model trained from a family can then be used for a number of tasks, including multiple alignment and classification. The multiple alignment is particularly important since it reveals the highly conserved regions of the molecules with functional and structural significance even in the absence of any tertiary information. The multiple alignment is also an essential tool for proper phylogenetic tree reconstruction and other important tasks. Good algorithms based on dynamic programming exist for the alignment of two sequences. However, they scale exponentially with the number of sequences, and the general multiple alignment problem is known to be NP-complete. Here, we briefly present a new algorithm and its variations for learning in HMMs and the results of some of the applications of this approach to new protein families. \n\n2 HMMs FOR BIOLOGICAL PRIMARY SEQUENCES \n\nAn HMM is characterized by a set of states, an alphabet of symbols, a probability transition matrix $T = (t_{ij})$ and a probability emission matrix $(e_{ij})$. As in speech applications, we are going to consider left-right architectures: once a given state is left it can never be visited again. Common knowledge of evolutionary mechanisms suggests the choice of three types of states (in addition to the start and to the end state): the main states $m_1, \\ldots, m_N$, the delete states $d_1, \\ldots, d_{N+1}$ and the insert states $i_1, \\ldots, i_{N+1}$. $N$ is the length of the model, which is usually chosen equal to the average length of the sequences in the family and, if needed, can be adjusted in later stages. 
The details of a typical architecture are given in Figure 1. The alphabet has 4 letters in the case of DNA or RNA sequences, one symbol per nucleotide, and 20 letters in the case of proteins, one symbol per amino acid. Only the main and insert states emit letters, while the delete states are of course mute. The linear sequence of state transitions start \u2192 $m_1$ \u2192 $m_2$ \u2192 ... \u2192 $m_N$ \u2192 end is the backbone of the model and corresponds to the path associated with the prototypical sequence in the family under consideration. Insertions and deletions are defined with respect to this backbone. Insertions and deletions are treated symmetrically except for the loops on the insert states needed to account for multiple insertions. The adjustable parameters of the HMM provide a natural way of incorporating variable gap penalties. A number of other architectures are also possible. \n\nFigure 1: The basic left-right HMM architecture. S and E are the start and end states. \n\n3 LEARNING ALGORITHMS \n\nLearning from examples in HMMs is typically accomplished using the Baum-Welch algorithm. In the Baum-Welch algorithm, the expected numbers $n_{ij}$ (resp. $m_{ij}$) of $i$ \u2192 $j$ transitions (resp. emissions of letter $j$ from state $i$) induced by the data are calculated using the forward-backward procedure. The transition and emission probabilities are then reset to the observed frequencies by \n\n$t_{ij}^{+} = n_{ij}/n_i$ and $e_{ij}^{+} = m_{ij}/m_i$ (1) \n\nwhere $n_i = \\sum_j n_{ij}$ and $m_i = \\sum_j m_{ij}$. It is clear that this algorithm can lead to abrupt jumps in parameter space and that the procedure cannot be used for on-line learning (after each training example). 
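As an illustration of Equation 1, the sketch below row-normalizes a small table of expected transition counts in plain Python. The function name reestimate, the toy counts, and the uniform fallback for a state with no counts are assumptions made for this example and do not come from the original text.

```python
# Minimal sketch of the re-estimation step in Equation 1 (hypothetical helper,
# not code from the paper): each row of expected counts n_ij is replaced by
# the observed frequencies n_ij / n_i.

def reestimate(counts):
    '''Row-normalize expected counts into probabilities.'''
    new_rows = []
    for row in counts:
        total = sum(row)  # n_i = sum over j of n_ij
        if total == 0:
            # State never visited by the data: fall back to a uniform row.
            # This fallback is an assumption of the sketch only.
            new_rows.append([1.0 / len(row)] * len(row))
        else:
            new_rows.append([n / total for n in row])
    return new_rows


# Toy expected transition counts for two states (made-up numbers).
expected_counts = [[3.0, 1.0], [0.0, 2.0]]
new_t = reestimate(expected_counts)
print(new_t)
```

Whatever the previous parameter values were, each row is replaced outright by the observed frequencies, so the size of the step is not controlled, and a zero count yields a probability of exactly zero.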
This is even more so if, in order to save some computations, the Viterbi approximation is used to estimate likelihoods and transition and emission statistics by computing only the most likely paths, as opposed to the forward-backward procedure where all possible paths are examined. \n\nA new algorithm for HMM learning, which is smooth and can be used on-line or in batch mode, with or without the Viterbi approximation, can be defined as follows. First, we use a Boltzmann-Gibbs representation for the parameters. For each $t_{ij}$ (resp. $e_{ij}$) we define a new parameter $w_{ij}$ (resp. $v_{ij}$) by \n\n$t_{ij} = \\exp(w_{ij}) / \\sum_k \\exp(w_{ik})$ and $e_{ij} = \\exp(v_{ij}) / \\sum_k \\exp(v_{ik})$ (2) \n\nNormalisation constraints are naturally enforced by this representation throughout learning, with the added advantage that none of the parameters can reach the absorbing value 0. After computing on-line or in batch mode the statistics $n_{ij}$ and $m_{ij}$ using the forward-backward procedure (or the usual Viterbi approximation), the update equations are particularly simple and given by \n\n$\\Delta w_{ij} = \\eta (n_{ij}/n_i - t_{ij})$ and $\\Delta v_{ij} = \\eta (m_{ij}/m_i - e_{ij})$ (3) \n\nwhere $\\eta$ is the learning rate. In Baldi et al. (1992) a proof is given that this algorithm must converge to a maximum of the product of the likelihoods of the training sequences. In the case of an on-line Viterbi approximation, the optimal path associated with the current training sequence is first computed. The update equations are then given by \n\n$\\Delta w_{ij} = \\eta (t'_{ij} - t_{ij})$ and $\\Delta v_{ij} = \\eta (t''_{ij} - e_{ij})$ (4) \n\nHere, for a fixed state $i$, $t'_{ij}$ and $t''_{ij}$ are the target transition and emission values: $t'_{ij} = 1$ every time the transition $s_i$ \u2192 
$s_j$ is part of the Viterbi path of the corresponding training sequence and 0 otherwise, and similarly for the emission targets $t''_{ij}$. \n\nAfter training, the model derived can be used for a number of tasks. First, by computing for each sequence its most likely path through the model using the Viterbi algorithm, multiple sequences can be aligned to each other in time $O(KN^2)$, linear in the number $K$ of sequences. The model can also be used for classification and database searches. The likelihood of any sequence (randomly generated or taken from any database) can be calculated and compared to the likelihood of the sequences in the family being modeled. Additional applications are discussed in Baldi et al. (1992). \n\n4 EXPERIMENTS AND RESULTS \n\nThe previous approach has been applied to a number of protein families including globins, immunoglobulins, kinases, aspartic acid proteases and G-coupled receptor proteins. The first application and alignment of the globin family using HMMs (trained with the Viterbi approximation of the Baum-Welch algorithm, and a number of additional heuristics) was given by Haussler et al. (1992). Here, we briefly describe some of our results on the immunoglobulin and the kinase families\u00b9. \n\n4.1 IMMUNOGLOBULINS \n\nImmunoglobulins or antibodies are proteins produced by B cells that bind with specificity to foreign antigens in order to neutralize them or target their destruction by other effector cells (e.g., Hunkapiller & Hood, 1989). The set of sequences used in our experiments consists of immunoglobulin V region sequences from the Protein Identification Resources (PIR) database. It corresponds to 294 sequences, with minimum length 90, average length 117 and maximum length 254. 
The variation in length resulted from including any sequence with a V region: those that also contained signal or leader sequences, germline sequences that did not include the J segment, and some that contained the C region as well. Seventy sequences contained one or more special characters indicating an ambiguous amino acid determination and were removed. \n\nFor the immunoglobulin variable (V) regions, we have trained a model of length 117 using a random subset of 150 sequences. Figure 2 displays the alignment corresponding to the first 20 sequences in this random subset. Letters emitted from the main states are upper case and letters emitted from insertion states are lower case. Dashes represent deletions or accommodate insertions. As can be observed, the algorithm has been able to detect all the main regions of highly conserved residues. Most importantly, the cysteine residues towards the beginning and the end, responsible for the disulphide bonds which hold the chains together, are perfectly aligned and marked. The only exception is the fifth sequence from the bottom, which has a serine residue in its terminal portion. It is also important to remark that some \n\n\u00b9 Recently, Haussler et al. have also independently applied their approach to the kinase family (Haussler, private communication). 
\n\n\fHidden Markov  Models  in  Molecular  Biology:  New Algorithms  and Applications \n\n751 \n\nPHOI06 \nmklpvrllvlmfwipasssDvVMTQTPLSLpvSLGDQASISCRSSQSLVHSngnTYLNWYLQ--KAGQS(cid:173)\nB27563 \n----------------------LQQPGAELv-KPGASVKLSCKASGYTFTN---YWIHWVKQ--RPGRGL \n------------------------ESGGGLv-QPGGSMKLSCVASGFTFSN---YWMNWVRQ--SPEKGL \nMHMS76 \nmefglswiflvailkgvqcEvRLVESGGDLv-EPGGSLRVSCEVSGFIFSK---AWMNWVRQ--APGKGL \nD28035 \nD24672 \n---------------------------------------ISCKASGYTFTN---YGMNWVKQ--APGKGL \nPHOIOO \n-------------------LvQLQQSGPVLv-KPGTSMKISCKTSGYSFTG---YTMSWVRQ--SHGKSL \n-------------------EvMLVESGGGLa-KPGGSLKLSCTTSGFTFS1---HAMSWVRQ--TPEKRL \nB27888 \nPL0160 \n-------------------QvQLQQSGPGLv-KPSQTLSLTCA1SGDSVSSns-AAWNW1RQ--SPSRGL \n-------------------DvVMTQTPLSLpvSLGDQASISCRSSQSLVRSngnTYLHWYLQ--KPGQP-\nE28833 \nD30539 \n-------------------EvKLVESGGGLv-QSGGSLRLSCATSGFTFSD---FYMEWVRQ--PPGKSL \n-------------------QvHLQQSGAELv-KPGASVK1SCKASGYTFTS---YWMNWVKQ--RPGQGL \nC30560 \n-------------------EvKLLESGGGLv-QPGGSLKLSCAASGFDFSR---YWMSWVRQ--APGKGL \nAVMSX4 \nC30540 \n-------------------EvKLVESGGGLv-QPGGSLRLSCATSGFTFSD---FYMEWVRQ--PPGKRL \nPL0123 \n-------------------EvQLVESGGGLv-QPGGSLRLSCAASGFTFSS---YWMSWVRQ--APGKGL \n-------------------EvQLVESGGGLv-KPGGSLRLSCAASGFTFSN---AWMNWVRQ--APGKGL \nH36005 \n-------------------DvKLVESGGGLv-KPGGSLKLSCAASGFTFSS---Y1MSWVRQ--TPEKRL \nPH0097 \ngsimg---------------vQLQQSGPELv-KPGASVKISCKTSGYTFTE---YTMHWVKQ--SHGKSL \n137267 \n-------------------DvHLQESGPGLv-KPSQSLSLTCSVTGYS1TRg--YNWNW1RR--FPGNKL \nA25114 \nD2HUWA \n-------------------RIQLQESGPGLv-KPSETLSLTCIVSGGPIRRtg-YYWGWIRQ--PPGKGL \n-------------------EvKLVESGGGLv-QPGGSLRLSCATSGFTFSD---FYMEWVRQ--PPGKRL \nA30539 \n......\u2022............\u2022....\u2022............\u2022............\u2022 * ...................\u2022........ 
\n\nPHOI06 \nB27563 \nMHMS76 \nD28035 \nD24672 \nPHOIOO \nB27888 \nPL0160 \nE28833 \nD30539 \nC30560 \nAVMSX4 \nC30540 \nPL0123 \nH36005 \nPH0097 \n137267 \nA25114 \nD2HUWA \nA30539 \n\nPHOI06 \nB27563 \nMHMS76 \nD28035 \nD24672 \nPHOIOO \nB27888 \nPL0160 \nE28833 \nD30539 \nC30560 \nAVMSX4 \nC30540 \nPL0123 \nH36005 \nPH0097 \n137267 \nA25114 \nD2HUWA \nA30539 \n\n-p-KLLI-YKV---SNR-FSGVPDRFSGSG--SGTDFTLKI SRVEAEDLG IYFCSQ-------------(cid:173)\nE-W1GR1-DPNSGGTKY-NEKFKNKATLT1NKPSNTAYMQLSSLTSDDSAVYYCARGYDYSYY------(cid:173)\nE-WVAE1rLKSGYATHY-AESVKGRFT1SRDDSKSSVYLQMNNLRAEDTG1YYCTRPGV----------(cid:173)\nQ-WVGQ I kNKVDGGTIDYAAPVKGRF I ISRDDSKSTVYLQMNRLK1EDTAVYYCVGNYTGT--------(cid:173)\nK-WMGW1-NTYTGEPTY-ADDFKGRFAFSLETSASTAYLQ1NNLKNEDTATYFCARGSSYDYY------(cid:173)\nE-W1GL1-1PSNGGTNY-NQKFKDKASLTVDKSSSTAYMELLSLTSEDSAVYYCARPSYYGSRnyy---(cid:173)\nE-WVAA1-SSGGSYTFY-PDSVKGRFT1SRDNAKNTLYLQ1NSLRSEDTAIYYCAREEGLRLDdy----(cid:173)\nE-WLGRT-YYRSKWYNDYAVSVKSRITINPDTSKNQFSLQLNSVTPEDTAVYYCARELGDA--------(cid:173)\n-p-KLLI-YKV---SNR-VSGVPDRFSGSG--SGTDFTLK1SRVEAEDLGVYFCSQSTHV---------(cid:173)\nE-WlAASrNEANDYTTEYSASVKGRFIVSRDTSQSILYLQM1ALRAEDTAIYYCSRDYYGSSYw-----(cid:173)\nE-W1GEI-DPSNSYTNN-NQKFKNKATLTVDKSSNTAYMQLSSLTSEDSAVYYCARWGTGSSWg-----(cid:173)\nE-W1GE1-NPDSST1NY-TPSLKDKFIISRDNAKNTLYLQMSKVRSEDTALYYCARLHYYGY-------(cid:173)\nE-WlAASrNKAHDYTTEYSASVKGRFIVSRDTSQSI LYLQMNALRAEDTA1YYCARDADYGSSshw---(cid:173)\nE-WVAN1-KQDGSEKYY-VDSVKGRFT1SRDNAKNSLYLQMNSLRAEDTAVYYCAR-------------(cid:173)\nE-WVGRlkSKTDGGTTDYAAPVKGRFT1SRDDSKNTLYLQMNSLKTEDTAVYYCTTDRGGSSQ------(cid:173)\nE-WVAT1-SSGGRYTYY-SDSVKGRFT1SRDNAKNTLYLQMSSLRSEDTAMYYSTASGDS---------(cid:173)\nE-W1GGI-NPNNGGTSY-NQKFKGKATLTVDKSSSTAYMELRSLTSEDSAVYYCARRGLTTVVaksy--(cid:173)\nE-WMGY1-NYDGS-NNY-NPSLKNR1SVTRDTSKNQFFLKMNSVTTEDTATYYCARL1PFSDGyyedyy(cid:173)\nE-W1GGV-YYTGS-1YY-NPSLRGRVT1SVDTSRNQFSLNLRSMSAADTAMYYCARGNPPPYYdigtgsd \nE-W1AAS rNKANDYTTEYSASVKGRF 1VSRDTSQS I LYLQMNALRAEDTA1YYCARDYYGSSYvw -----\n\n* 
\n\n----------------tthvpptfgggtkleikr-\n-AMDYWGQGTSVTVSS-------------------\n--PDYWGQGTTLTVSS-------------------\n--VDYWGQGTLVTVSS-------------------\n-AMDYWGQGTSVTVSS-------------------\n-AMDYWGQGTSVTVSSak-----------------\n-AMDYWGQGTSVTVS--------------------\n--FD1WGQGTMVTVSS-------------------\n\n-YFDVWGAGTTVTVSS-------------------\n-WFAYWGQGTLVTVSA-------------------\n--AAYWGQGTLVTVSAe------------------\n-yFDVWGAGTTVTVSS-------------------\n\n--GDYWGQGTLVTVSS-------------------\n--FDYWGQGTTLTVSSak-----------------\n-yFDYWGQGTTLTVSS-------------------\n-AMDYWGQGT-------------------------\ndG1DVWGQGTTVHVSS-------------------\n-YFDVWGAGTTVTVSS-------------------\n\nFigure 2: Immunoglobulin alignment. \n\nof the sequences in the family have some sort of \"header\" (leader signal peptide) whereas the others do not. We did not remove the headers prior to training and used the sequences as they were given to us. The model was able to detect and accommodate these \"headers\" by treating them as initial inserts, as can be seen from the alignment of two of the sequences. \n\n4.2 KINASES \n\nEukaryotic protein kinases constitute a very large family of proteins that regulate the most basic of cellular processes through phosphorylation. They have been termed the \"transistors\" of the cell (Hunter, 1987). We have used the sequences available in the kinase database maintained at the Salk Institute. Our basic set consists of 224 sequences, with minimum length 156, average length 287, and maximum length 569. Only one sequence containing a special symbol (X) was discarded. In one experiment, we trained a model of length 287 using a random subset of 150 kinase sequences. 
Figure 3 displays the corresponding alignment for a subset of 12 phylogenetically representative sequences. These include serine/threonine, tyrosine and dual specificity kinases from mammals, birds, fungi, retroviruses and herpes viruses. The percentage of identical residues within the kinase data sets ranges from 8 to 30%, suggesting that only those residues involved in catalysis are conserved among these highly divergent sequences. All 12 characteristic catalytic domains or subdomains described in Hanks and Quinn (1991) are easily recognizable and marked. Additional highly conserved positions can also be observed, consistent with previously constructed multiple alignments. For instance, the initial hydrophobic consensus Gly-X-Gly-XX-Gly together with the Lys located 15 or 20 residues downstream are part of the ATP/GTP binding site. The carboxyl terminus is characterized by the presence of an invariant Arg residue. Conserved residues in proximity to the acceptor amino acid are found in the VIb (Asp), VII (Asp-Phe-Gly) and VIII (Ala-Pro-Glu) domains. In Figure 4, the entropy of the emission distribution of each main state is plotted: motifs are easily detectable and correspond to positions with very low entropy. \n\n5 DISCUSSION \n\nHMMs are emerging as a powerful, adaptive, and modular tool for computational biology. Here, they have been used, together with a new learning algorithm, to model families of proteins. In all cases, the models derived capture all the important statistical properties of the families. Additional results and potential applications, such as phylogenetic tree reconstruction, classification, and superfamily modeling, are discussed in Baldi et al. (1992). \n\nReferences \n\nBaldi, P., Chauvin, Y., Hunkapiller, T. and McClure, M. A. 
(1992) Adaptive Algorithms for Modeling and Analysis of Biological Primary Sequence Information. Technical Report. \n\nCardon, L. R. and Stormo, G. D. (1992) Expectation Maximization Algorithm for Identifying Protein-binding Sites with Variable Lengths from Unaligned DNA \n\nCD2a an~KR--LEKVGEGTYGVVYKALDLrpg--QGQRVVALK------KIRLESEDEGVPSTAIREISLLKEL-K-DDNIVRLYDIVH \nMLCK --FSMnsKEALGGGKFGAVCTCTEK-----STGLKLAAK---VI-KKQTPKDKE----MVMLEIEVMNQL-N-HRNLIQLYAAIE \nPSKH akYDI--KALIGRGSFSRVVRVEHR-----ATRQPYAIK---MIETKYREGRE-----VCESELRVLRRV-R-HANIIQLVEVFE \nCAPK dqFER--IKTLGTGSFGRVMLVKHM-----ETGNHYAMK---ILDKQKVVKLKQIE--HTLNEKRILQAV-N-FPFLVKLEFSFK \nWEE1 trFRN--VTLLGSGEFSEVFQVEDPv----EKTLKYAVK---KL-KVKFSGPKERN--RLLQEVSIQRALkG-HDHIVELMDSWE \nCSRC esLRL--EVKLGQGCFGEVWMGTWN------GTTRVAIK---TLKPGNMSPE------AFLQEAQVMKKL-R-HEKLVQLYAVVS \nEGFR teFKK--IKVLGSGAFGTVYKGLWIpege-KVKIPVAIK---ELREATSPKANK----EILDEAYVMASV-D-NPHVCRLLGICL \nPDGF dqLVL--GRTLGSGAFGQVVEATAHglshsQATMKVAVK---MLKSTARSSEKQ----ALMSELY--GDL--v-DYLHRNKHTFL \nVFES edLVL--GEQIGRGNFGEVFSGRLR-----ADNTLVAVK---SCRETLPPDIKA----KFLQEAKILKQY-S-HPNIVRLIGVCT \nRAF1 seVML--STRIGSGSFGTVYKGKWH--------GDVAVK---ILKVVDPTPEQFQ---AFRNEVAVLRKT-R-HVNILLFMGYMT \nCMOS eqVCL--LQRLGAGGFGSVYKATYR-------GVPVAIKQvNKCTKNRLASRR-----SFWAELNV-ARL-R-HDNIVRVVAAST \nHSVK mgFTI--HGALTPGSEGCVFDSSHP-----DYFQRVIVK------AGWYT--------STSHEARLLRRL-D-HPAILPLLDLHV \n\u00b7 ............... , ....\u2022.. * .\u2022................... \u2022 . .. 0  \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022  \n\u2022 .....\u2022..\u2022............. I  . 
......\u2022.\u2022....... . .... I I ...........\u2022.\u2022\u2022....... I I I .........\u2022... IV ..... \n\nCD2E \nMLCK \nPSKH \nCAPK \nWEEl \nCSRC \nEGFR \nPDGF \nVFES \nRAFl \nCMOS \nHSVK \n\nSDAHk---------LY-L-V-FEFLDL-DLKRYMEGIpkd--------------------------------------------(cid:173)\nTPHE----------IV-L-F-KEYIEGGELFERIVDE-----------------------------------------------(cid:173)\nTQER----------VY-M-V-MELATGGELFDRIIAK-----------------------------------------------(cid:173)\nDNSN----------LY-M-V-MEYVPGGEMFSHLRRI-----------------------------------------------(cid:173)\nHGGF----------LY-M-Q-VELCENGSLDRFLEEQgql--------------------------------------------(cid:173)\n-EEP----------IY-I-V-TEYMSKGSLLDFLKGE------------------------------------------------\n-TST----------VQ-L-I-TQLMPFGCLLDYVREH------------------------------------------------\n-QRHsnkhcppsaeLYs-n-a--LPVGFSLPSHLNLTgesdggymdmskdesidyvpmldmkgdikyadiespsymapydnyvps \nQKQP----------IY-I-V-MELVQGGDFLTFLRTE-----------------------------------------------(cid:173)\n-KDN----------LA-I-V-TQWCEGSSLYKHLHVQ------------------------------------------------\nRTPAgsnsl-----GT-I-I-MEFGGNVTLHQVIYGAaghpegdaqephcrtg--------------------------------\nVSGV----------TC-L-V-LPKYQA-DLYTYLSRR------------------------------------------------\n\n\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022 \u2022\u2022 \u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022 V \u2022\u2022\u2022\u2022 \u2022 \u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022 \n\n---------------QP-LGADIVKKFMMQ-LCKGIAYCHSHRILHRDLKPQNLL-INKDG---N-LKLGDFGLARAFGVPLRAY \nCD2a 
\n--------------DYH-LTEVDTMVFVRQ-ICDGILFMHKMRVLHLDLKPENILcVNTTG---H1VKIIDFGLARRYNPNEKL-\nMLCK \nPSKH \n---------------GS-FTERDATRVLQM-VLDGVRYLHALGITHRDLKPENLL-YYHPGtdsK-IIITDFGLASARKKGDDCL \nCAPK \n---------------GR-FSEPHARFYAAQ-IVLTFEYLHSLDLIYRDLKPENLL-IDQQG---Y-IQVTDFGFAKRVKGRT---\n---------------SR-LDEFRVWKILVE-VALGLQFIHHKNYVHLDLKPANVM-ITFEG---T-LKIGDFGMASVWPVPRG--\nWEEl \n--------------MGKyLRLPQLVDMAAQ-IASGMAYVERMNYVHRDLRAANIL-VGENL---V-CKVADFGLARLIEDNEYTA \nCSRC \nEGFR \n--------------KDN-IGSQYLLNWCVQ-IAKGMNYLEDRRLVHRDLAARNVL-VKTPQ---H-VKITDFGLAKLLGAEEKEY \napertyratlinds-PV-LSYTDLVGFSYQ-VANGMDFLASKNCVHRDLAARNVL-ICEGK---L-VKICDFGLARDIMRDSNYI \nPDGF \n--------------GAR-LRMKTLLQMVGD-AAAGMEYLESKCCIHRDLAARNCL-VTEKN---V-LKISDFGMSREAADGIYAA \nVFES \n--------------ETK-FQMFQLIDIARQ-TAQGMDYLKAKNIIHRDMKSNNIF-LHEGL---T-VKIGDFGLATVKSRWSGSQ \nRAFl \n---------------GQ-LSLGKCLKYSLD-VVNGLLFLHSQSIVHLDLKPANIL-ISEQD---V-CKISDFGCSEKLEDLLCFQ \nCMOS \nHSVK \n--------------LNP-LGRPQIAAVSRQ-LLSAVDYIHRQGIIHRDIKTENIF-INTPE---D-ICLGDFGAACFVQGSRSSP \n\u2022 ............................... . ....................... * \u2022\u2022\u2022\u2022 * .................. *. * \u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022 \n..\u2022\u2022........\u2022...\u2022.\u2022... .. ...... . VIa ................... . ... VIb ...........\u2022\u2022\u2022.... VII ....\u2022....... 
\n\nCD2a \n---THEIVTLWYRAPEVLLgGK---QYSTGVDTWSIGCIFAEMCNRKP---------------IFSGDSE-----IDQIFKIFRV \nMLCK \n---KVNFGTPEFLSPEVVN-YD---QISDKTDMWSLGVITYMLLSGLS---------------PFLGDDD-----TETLNNVLSG \nPSKH \nM--KTTCGTPEYIAPEVLV-RK---PYTNSVDMWALGVIAYILLSGTM---------------PFEDDNR-----TRLYRQILRG \nCAPK \n---WTLCGTPEYLAPEIIL-SK---GYNKAVDWWALGVLIYEMAAGYF---------------PFFADQP-----IQIYEKIVSG \nWEEl \n---MEREGDCEYIAPEVLA-NH---LYDKPADIFSLGITVFEAAANIV--------------LPDNGQSW-----Q----KLRSG \nCSRC \nR--QGAKFPIKWTAPEAAL-YG---RFTIKSDVWSFGILLTELTTKGR--------------VPYPGMVN-----REVLDQVERG \nEGFR \nH-AEGGKVPIKWMALESIL-HR---IYTHQSDVWSYGVTVWELMTFGS--------------KPYDGIPA-----SEISSILEKG \nPDGF \nS-KGSTYLPLKWMAPESIF-NS---LYTTLSDVWSFGILLKEIFTLGG--------------TPYPELPM----NDQFYNAIKRG \nVFES \nS-GGLRQVPVKWTAPEALN-YG---RYSSESDVWSFGILLKETFSLGA--------------SPYPNLSN-----QQTREFVEKG \nQ-VEQPTGSVLWMAPEVIR-MQdnnPFSFQSDVYSYGIVLYELMTGEL---------------PYS---R-----DQIIFMVGRG \nRAFl \nCMOS \nTpSYPLGGTYTHRAPELLK-GE---GVTPKADIYSFAITLWQMTTKQA---------------PYSGERQ-----HILYAVVAYD \nHSVK \nF-PYGIAGTIDTNAPEVLA-GD---PYTTTVDIWSAGLVIFETAVHNA-------------------------------------\n\u00b7 .......... . .......... . ** ...............\u2022.... * ..... . ................. I  \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 \u2022  \n\u00b7  ....\u2022\u2022......\u2022.\u2022..... VIII \u2022\u2022\u2022........\u2022.... IX ..........\u2022\u2022\u2022.\u2022\u2022..\u2022......... '  ............ X .\u2022...... 
\n\nCD2a ---LGTPNEAlwpdivylpdfkpsfpqwrrkdlsqvvpSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES--------\nMLCK nwyFDEETFEA----------------------------VSDEAKDFVSNLIVKEQGARMSAAQCLAHPWLNNL--------\nPSKH kysYSGEPWPS----------------------------VSNLAKDFIDRLLTVDPGARMTALQALRHPWVVSM--------\nCAPK ---KVR-FPSH----------------------------FSSDLKDLLRNLLQVDLTKRFGNLKDGVNDIKNHK--------\nWEE1 ---DLSDAPRLsstdngssltsssretpansii------GQGGLDRVVEWKLSPEPRNRPTIDQILATD--EVCWV------\nCSRC ---YRMPCPPE----------------------------CPESLHDLMCQCWRRDPEERPTFEYLQAFLEDYFT--------\nEGFR ---ERLPQPPI----------------------------CTIDVYKIMVKCWKIDADSRPKFRELIIEFSKMAR--------\nPDGF ---YRMAQPAH----------------------------ASDEIYEIMQKCKEEKFETRPPFSQLVLLLERLLGEGykkky-\nVFES ---GRLPCPEL----------------------------CPDAVFRLMEQCWAYEPGQRPSFSAIYQEL-------------\nRAF1 ---YASPDLsKlykn------------------------CPKAMKRLVADCVKKVKEERPLFPQILSSIELLQH--------\nCMOS ---LRPSLSAAvfedsl----------------------PGQRLGDVIQRCWRPSAAQRPSARLLLVDLTSLKA--------\nHSVK \n\u00b7 ..... ................................................ ............\u2022.............. ......... \n........\u2022 . .............. . .................. . .................\u2022... . XI ......\u2022\u2022.............. \n\nFigure 3: Kinase alignment of 12 representative sequences. \n\n[Figure 4 panels: Main State Entropy Values; Entropy Distribution.] \n\nFigure 4: Kinase emission entropy plot and distribution. \n\nFragments. Journal of Molecular Biology, 223, 159-170. \n\nChurchill, G. A. (1989) Stochastic Models for Heterogeneous DNA Sequences. Bulletin of Mathematical Biology, 51, 1, 79-94. \n\nHanks, S. K., Quinn, A. M. 
(1991) Protein Kinase Catalytic Domain Sequences Database: Identification of Conserved Features of Primary Structure and Classification of Family Members. Methods in Enzymology, 200, 38-62. \n\nHaussler, D., Krogh, A., Mian, S. and Sjolander, K. (1992) Protein Modeling using Hidden Markov Models. Computer and Information Sciences Technical Report (UCSC-CRL-92-93), University of California, Santa Cruz. \n\nHunkapiller, T. and Hood, L. (1989) Diversity of the Immunoglobulin Gene Superfamily. Advances in Immunology, 44, 1-63, Academic Press, Inc. \n\nHunter, T. (1987) A Thousand and One Protein Kinases. Cell, 50, 823-829. \n\nLawrence, C. E. and Reilly, A. A. (1990) An Expectation Maximization (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences. Proteins: Struct. Funct. Genet., 7, 41-51. \n\nRabiner, L. R. (1989) A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77, 2, 257-286.", "award": [], "sourceid": 629, "authors": [{"given_name": "Pierre", "family_name": "Baldi", "institution": null}, {"given_name": "Yves", "family_name": "Chauvin", "institution": null}, {"given_name": "Tim", "family_name": "Hunkapiller", "institution": null}, {"given_name": "Marcella", "family_name": "McClure", "institution": null}]}