{"title": "Bayesian Inference of Regular Grammar and Markov Source Models", "book": "Advances in Neural Information Processing Systems", "page_first": 388, "page_last": 395, "abstract": null, "full_text": "388 \n\nSmith and Miller \n\nBayesian Inference of Regular Grammar \n\nand Markov Source Models \n\nKurt R. Smith and Michael I. Miller \n\nBiomedical Computer Laboratory \n\nand \n\nElectronic Signals and Systems Research Laboratory \n\nWashington University, St. Louis, MO 63130 \n\nABSTRACT \n\nIn this paper we develop a Bayes criterion, which includes the Rissanen complexity, for inferring regular grammar models. We develop two methods for regular grammar Bayesian inference. The first method is based on treating the regular grammar as a 1-dimensional Markov source, and the second is based on the combinatoric characteristics of the regular grammar itself. We apply the resulting Bayes criteria to a particular example in order to show the efficiency of each method. \n\n1 MOTIVATION \n\nWe are interested in segmenting electron-microscope autoradiography (EMA) images by learning representational models for the textures found in the EMA image. In studying this problem, we have recognized that both structural and statistical features may be useful for characterizing textures. This has motivated us to study the source modeling problem for both structural sources and statistical sources. The statistical sources that we have examined are the class of one- and two-dimensional Markov sources (see [Smith, 1990] for a Bayesian treatment of Markov random field texture model inference), while the structural sources that we are primarily interested in here are the class of regular grammars, which are important due to the role that grammatical constraints may play in the development of structural features for texture representation. 
\n\n2 MARKOV SOURCE INFERENCE \n\nOur primary interest here is the development of a complete Bayesian framework for the process of inferring a regular grammar from a training sequence. However, we have shown previously that there exists a 1-D Markov source which generates the regular language defined via some regular grammar [Miller, 1988]. We can therefore develop a generalized Bayesian inference procedure over the class of 1-D Markov sources which enables us to learn the Markov source corresponding to the optimal regular grammar. We begin our analysis by developing the general structure for Bayesian source modeling. \n\n2.1 BAYESIAN APPROACH TO SOURCE MODELING \n\nWe state the Bayesian approach to model learning: Given a set of source models {θ_1, θ_2, ..., θ_M} and the observation x^n, choose the source model θ_i which most accurately represents the unknown source that generated x^n. This decision is made by calculating the Bayes risk over the possible models, which produces a general decision criterion for the model learning problem: \n\nmax_{θ_i ∈ {θ_1, ..., θ_M}}  [ log P(x^n | θ_i) + log P_i ]    (2.1) \n\nUnder the additional assumption that the a priori probabilities over the candidate models are equivalent, the decision criterion becomes \n\nmax_{θ_i ∈ {θ_1, ..., θ_M}}  log P(x^n | θ_i)    (2.2) \n\nwhich is the quantity that we will use in measuring the accuracy of a model's representation. \n\n2.2 STOCHASTIC COMPLEXITY AND MODEL LEARNING \n\nIt is well known that, when given finite data, Bayesian procedures of this kind which do not have any prior on the models suffer from the fundamental limitation that they will predict models of greater and greater complexity. This has led others to introduce priors into the Bayes hypothesis testing procedure based on the complexity of the model being tested [Rissanen, 1986]. 
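To make the selection rule (2.1) concrete, here is a minimal sketch for a toy problem: choosing between two i.i.d. Bernoulli source models for a binary string. The model names, parameter values, and data are illustrative assumptions, not part of the paper.

```python
import math

# Hypothetical example: two candidate Bernoulli sources for a binary string.
# Criterion (2.1): choose the model maximizing log P(x | theta_i) + log P_i.

def log_likelihood(x, p1):
    """Log-probability of binary string x under an i.i.d. Bernoulli(p1) model."""
    n1 = sum(x)
    n0 = len(x) - n1
    return n1 * math.log(p1) + n0 * math.log(1 - p1)

x = [1, 1, 0, 1, 1, 1, 0, 1]                  # observed string (assumed)
models = {"fair": 0.5, "biased": 0.75}        # candidate parameters (assumed)
priors = {"fair": 0.5, "biased": 0.5}         # equal priors reduce (2.1) to (2.2)

scores = {name: log_likelihood(x, p) + math.log(priors[name])
          for name, p in models.items()}
best = max(scores, key=scores.get)
print(best)   # the string has six 1s in eight symbols, so "biased" wins
```

With equal priors the prior term is a constant offset, which is exactly why (2.1) collapses to the pure likelihood comparison (2.2).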
In particular, for the Markov case the complexity is directly proportional to the number of transition probabilities of the particular model being tested, with the prior exponentially decreasing with the associated complexity. We now describe the inclusion of the complexity measure in greater detail. \n\nFollowing Rissanen, the basic idea is to uncover the model which assigns maximum probability to the observed data, while also being as simple as possible so as to require a small Kolmogorov description length. The complexity associated with a model having k real parameters and a likelihood with n independent samples is the now well-known (k/2) log n, which allows us to express the generalization of the original Bayes procedure (2.2) as the quantity \n\nmax_{θ_i ∈ {θ_1, ..., θ_M}}  [ log P(x^n | θ̂_i) − (k_{θ_i}/2) log n ]    (2.3) \n\nNote well that θ̂_i is the k_{θ_i}-dimensional parameter parameterizing model θ_i, which must be estimated from the observed data x^n. An alternative view of (2.3) is discovered by viewing the second term as the prior in the Bayes model (2.1), where the prior is defined as \n\nP_i = e^{ −(k_{θ_i}/2) log n }    (2.4) \n\n2.3 1-D MARKOV SOURCE MODELING \n\nConsider that x^n is a 1-D n-length string of symbols which is generated by an unknown finite-state Markov source. In examining (2.3), we recognize that for 1-D Markov sources log P(x^n | θ_i) may be written as log Π_j P_{θ_i}(S(x_j) | S(x_{j−1})), where S(x_j) is a state function which evaluates to a state in the Markov source state set S_{θ_i}. Using this notation, the Bayes hypothesis test for 1-D Markov sources may be expressed as: \n\nmax_{θ_i ∈ {θ_1, ..., θ_M}}  Σ_{j=1}^{n−1} log P_{θ_i}(S(x_j) | S(x_{j−1}))    (2.5) \n\nFor the general Markov source inference problem, we know only that the string x^n was generated by a 1-D Markov source, with the state set S_{θ_i} and the transition probabilities P_{θ_i}(S_k | S_l), k,l ∈ S_{θ_i}, unknown. They must therefore be included in the inference procedure. 
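As a sketch of how the penalty (2.3) combines with the Markov likelihood (2.5), the following illustrative Python estimates an order-d candidate's transition probabilities by counting, and penalizes the per-symbol log-likelihood taking the parameter count k to be the number of transition-matrix entries. The toy string and function names are assumptions for illustration only.

```python
import math
from collections import Counter

# Sketch, assuming a binary alphabet: states of an order-d candidate are the
# d most recent symbols; transition probabilities are estimated by counting,
# and the score is the per-symbol log-likelihood minus (k / 2n) * log n with
# k = |S|^2 transition-matrix entries.

def markov_criterion(x, d):
    n = len(x)
    trans = Counter()   # counts of (state, next symbol)
    ctx = Counter()     # counts of each state
    for j in range(d, n):
        prev = tuple(x[j - d:j])           # state S(x_{j-1})
        trans[(prev, x[j])] += 1
        ctx[prev] += 1
    log_lik = sum(c * math.log(c / ctx[prev])
                  for (prev, _), c in trans.items())
    num_states = 2 ** d                    # |S| for a binary alphabet
    return log_lik / n - (num_states ** 2 / (2 * n)) * math.log(n)

x = [0, 1, 1] * 80                         # toy periodic training string
best_d = max(range(1, 4), key=lambda d: markov_criterion(x, d))
print(best_d)
```

On this period-3 string the order-2 model is deterministic while order 1 is not, and order 3 pays a larger penalty for no likelihood gain, so the criterion selects order 2.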
\nTo include the complexity term for this case, we note that the number of parameters to be estimated for model θ_i is simply the number of entries in the state-transition matrix P_{θ_i}, i.e. k_{θ_i} = |S_{θ_i}|². Therefore for 1-D Markov sources, the generalized Bayes hypothesis test including complexity may be stated as \n\nmax_{θ_i ∈ {θ_1, ..., θ_M}}  [ (1/n) Σ_{j=1}^{n−1} log P̂_{θ_i}(S(x_j) | S(x_{j−1})) − (|S_{θ_i}|²/2n) log n ]    (2.6) \n\nwhere we have divided the entire quantity by n in order to express the criterion in terms of bits per symbol. Note that a candidate Markov source model θ_i is initially specified by its order and corresponding state set S_{θ_i}. \n\nThe procedure for inferring 1-D Markov source models can thus be stated as follows. Given a sequence x^n from some unknown source, consider candidate Markov source models by computing the state function S(x_j) (determined by the candidate model order) over the entire string x^n. Enumerating the state transitions which occur in x^n provides an estimate of the state-transition matrix P̂_{θ_i}, which is then used to compute (2.6). The inferred Markov source then becomes the one maximizing (2.6). \n\n3 REGULAR GRAMMAR INFERENCE \n\nAlthough the Bayes criterion developed for 1-D Markov sources (2.6) is a sufficient model learning criterion for the class of regular grammars, we will now show that by taking advantage of the a priori knowledge that the source is a regular grammar, the inference procedure can be made much more efficient. This a priori knowledge brings a special structure to the regular grammar inference problem in that not all allowable sets of Markov probabilities correspond to regular grammars. In fact, as shown in [Miller, 1988], 
corresponding to each regular grammar is a unique set of candidate probabilities, implying that the Bayesian solution which takes this into account will be far more efficient. We demonstrate that now. \n\n3.1 BAYESIAN CRITERION USING GRAMMAR COMBINATORICS \n\nOur approach is to use the combinatoric properties of the regular grammar in order to develop the optimal Bayes hypothesis test. We begin by defining the regular grammar. \n\nDefinition: A regular grammar G is a quadruple (V_N, V_T, S_s, R) where V_N, V_T are finite sets of non-terminal symbols (or states) and terminal symbols respectively, S_s is the sentence start state, and R is a finite set of production rules consisting of the transformation of a non-terminal symbol to either a terminal followed by a non-terminal, or a terminal alone, i.e. S_i → w_j S_k or S_i → w_j. \n\nIn the class of regular grammars that we consider, we define the depth of the language as the maximum number of terminal symbols which make up a non-terminal symbol. Corresponding to each regular grammar is an associated incidence matrix B with the i,k-th entry B_{i,k} equal to the number of times there is a production S_i → w_j S_k ∈ R for some terminal j and non-terminals i,k. Also associated with each grammar G_i is the set of all n-length strings produced by the grammar, denoted as the regular language X^n(G_i). \n\nNow we make the quite reasonable assumption that no string in the language X^n(G_i) is more or less probable a priori than any other string in that language. 
This indicates that all n-length strings that can be generated by G_i are equiprobable, with a probability dictated by the combinatorics of the language as \n\nP(x^n | G_i) = 1 / |X^n(G_i)|    (3.1) \n\nwhere |X^n(G_i)| denotes the number of n-length sequences in the language, which can be computed by considering the combinatorics of the language as follows: |X^n(G_i)| grows as λ_{G_i}^n, with λ_{G_i} corresponding to the largest eigenvalue of the state-transition (incidence) matrix B_{G_i}. This results from the combinatoric growth rate being determined by the sum of the entries in the n-th power of the state-transition matrix, B_{G_i}^n, which grows as the largest eigenvalue λ_{G_i} of B_{G_i} [Blahut, 1987]. We can now write (3.1) in these terms as \n\nP(x^n | G_i) = λ_{G_i}^{−n}    (3.2) \n\nwhich expresses the probability of the sequence x^n in terms of the combinatorics of G_i. \n\nWe now use this combinatoric interpretation of the probability to develop the Bayes decision criterion over two candidate grammars. Assume that there exists a finite space of sequences X, all of which may be generated by one of the two possible grammars {G_0, G_1}. Now by dividing this observation space X into two decision regions, X_0 (for G_0) and X_1 (for G_1), we can write the Bayes risk R in terms of the observation probabilities P(x^n | G_0), P(x^n | G_1): \n\nR = Σ_{x^n ∈ X_1} P(x^n | G_0) + Σ_{x^n ∈ X_0} P(x^n | G_1)    (3.3) \n\nThis implementation of the Bayes risk assumes that sequences from each grammar occur equiprobably a priori and that the cost of choosing the incorrect grammar is equal to 1. Now, incorporating the combinatoric counting probabilities (3.2), we can rewrite (3.3) as \n\nR = Σ_{x^n ∈ X_1} λ_{G_0}^{−n} + Σ_{x^n ∈ X_0} λ_{G_1}^{−n} \n\nwhich can be rewritten \n\nR = 1/2 + Σ_{x^n ∈ X_0} ( λ_{G_1}^{−n} − λ_{G_0}^{−n} )    (3.4) \n\nThe risk is therefore minimized by choosing G_0 if λ_{G_1}^{−n} < λ_{G_0}^{−n} and G_1 if λ_{G_1}^{−n} > λ_{G_0}^{−n}. 
\nThis establishes the likelihood ratio test for the grammar inference problem: \n\nλ_{G_1}^{−n} / λ_{G_0}^{−n}  ≷  1,  choosing G_1 if the ratio exceeds 1 and G_0 otherwise, \n\nwhich can alternatively be expressed in terms of the log as \n\nmax_{G_0, G_1}  [ −n log λ_{G_i} ]. \n\nRecognizing this as the maximum likelihood decision, this decision criterion is easily generalized to M hypotheses. Now, by ignoring any complexity component, the generalized Bayes test for a regular grammar can be stated as \n\nmax_{G_1, ..., G_M}  [ −n log λ̂_{G_i} ]    (3.5) \n\nwhere λ̂_{G_i} is the largest eigenvalue of the estimated incidence matrix B̂_{G_i} corresponding to grammar G_i, where B̂_{G_i} is estimated from x^n. \n\nThe complexity factor to be included in this Bayesian criterion differs from the complexity term in (2.3) due to the fact that the parameters to be estimated are now the entries in the B_{G_i} matrix, which are strictly binary. From a description length interpretation, then, these parameters can be fully described using 1 bit per entry in B_{G_i}. The complexity term is thus simply |S_{G_i}|², which now allows us to write the Bayes inference criterion for regular grammars as \n\nmax_{G_1, ..., G_M}  [ −log λ̂_{G_i} − |S_{G_i}|²/n ]    (3.6) \n\nin terms of bits per symbol. We can now state the algorithm for inferring grammars. \n\nRegular Grammar Inference Algorithm \n\n1. Initialize the grammar depth to d = 1. \n\n2. Compute |S_{G_i}| = |V_T|^d. \n\n3. Using the state function S_d(x_j) corresponding to the current depth, compute the state transitions at all sites x_j in the observed sequence x^n in order to estimate the incidence matrix B̂_{G_i} for the grammar currently being considered. \n\n4. Compute λ̂_{G_i} from B̂_{G_i} (recall that this is the largest eigenvalue of B̂_{G_i}). \n\n5. Using λ̂_{G_i} and |S_{G_i}|, compute (3.6); denote this as I_{G_i} = −log λ̂_{G_i} − |S_{G_i}|²/n. \n\n6. Increase the grammar depth, d = d + 1, and go to 2 (i.e. test another candidate grammar) until I_{G_i} ceases to increase. 
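The steps above can be sketched end-to-end. This illustrative Python enumerates the depth-d contexts observed in the string as states, marks each observed transition as a binary entry of the estimated incidence matrix, obtains its largest eigenvalue by power iteration (one simple method; any eigenvalue routine would do), and scores each depth by the bits-per-symbol criterion. The toy training string is an assumption.

```python
import math

# Hedged sketch of the Regular Grammar Inference Algorithm: for each candidate
# depth d, estimate the binary incidence matrix B_hat from the observed string,
# then score  I_G = -log2(lambda_hat) - |S_G|^2 / n  as in (3.6).

def largest_eigenvalue(B, iters=200):
    """Power iteration for the largest eigenvalue of a non-negative matrix."""
    m = len(B)
    v = [1.0] * m
    lam = 1.0
    for _ in range(iters):
        w = [sum(B[i][k] * v[k] for k in range(m)) for i in range(m)]
        lam = max(w) or 1.0
        v = [c / lam for c in w]
    return lam

def grammar_score(x, d):
    n = len(x)
    states = sorted({tuple(x[j - d:j]) for j in range(d, n + 1)})
    idx = {s: i for i, s in enumerate(states)}
    m = len(states)
    B = [[0] * m for _ in range(m)]
    for j in range(d, n):                  # mark observed state transitions
        B[idx[tuple(x[j - d:j])]][idx[tuple(x[j - d + 1:j + 1])]] = 1
    return -math.log2(largest_eigenvalue(B)) - m * m / n

x = [0, 0, 1, 1, 1, 0, 1] * 300            # toy run-length-patterned string
best_depth = max(range(1, 5), key=lambda d: grammar_score(x, d))
print(best_depth)
```

At depth 3 the estimated incidence matrix for this period-7 string becomes deterministic (eigenvalue 1), so deeper candidates gain nothing, and the shallowest maximizing depth, 3, is returned.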
\n\nThe regular grammar of minimum depth which maximizes IGj (Le. maximizes (3.6\u00bb is \nthen the optimal regular grammar source model for the given sequence x,. \n\n3.2 REGULAR GRAMMAR INFERENCE RESULTS \n\nTo compare the efficiency of the two Bayes criteria (2.6) and (3.6), we will consider a \nregular grammar inference experiment The regular grammar that we will attempt to \nlearn, which we refer to as the 4-0,ls regular grammar, is a run-length constrained binary \n\n\f394 \n\nSmith and Miller \n\ngrammar which disallows 4 consecutive occurrences of a 0 or 8 1. Referring to the \nregular grammar definition. we note that this regular grammar can be described by its \nincidence matrix \n\nB4.O,l \n\n000 I o 0 \n100 1 o 0 \n010 1 o 0 \no 0 1 010 \no 0 1 001 \no 0 1 000 \n\nwhere the states corresponding to row and column indices are \n\nNote that this regular grammar has a depth equal to 3 and thus the corresponding \nMarkov source has an order equal to 3. \n\nThe inference experiment may be described as follows. Given a training set of length 16 \nstrings from the 4-0,ls language, we apply the Bayes criteria (2.6) and (3.6) in an attempt \nto infer the regular grammar in each case. We compute the criteria for five candidate \nmodels of order/depth 1 through 5 (recall that this defmes the size of the state set for \nthe Markov source and the regular grammar, respectively). \n\nTreating the unknown regular grammar as a Markov source, we estimate the \ncorresponding state-transition matrix P and then compute the Bayes criterion according \nto (2.6) for each of the five candidate models. We compute the criterion as a function of \nthe number of training samples for rach candidate model and plot the result in Figure la. \nSimilarly. we estimate the incidence matrix B and compute the Bayes criterion according \nto (3.6) for each of the five regular grammar candidate models. and plot the results as a \nfunction of the number of training samples in Figure lb. 
\n\nWe compare the two Bayesian criteria by examining Figures 1a and 1b. Note that criterion (3.6) discovers the correct regular grammar (depth = 3) after only 50 training samples (Figure 1b), while the equivalent Markov source (order = 3) is found only after almost 500 training samples have been used in computing (2.6) (Figure 1a). This points out that a much more efficient inference procedure exists for regular grammars by taking advantage of the a priori grammar information (i.e. only the depth and the binary incidence matrix B̂ must be estimated), whereas for 1-D Markov sources, both the order and the real-valued state-transition matrix P̂ must be estimated. \n\n4 CONCLUSION \n\nIn conclusion, we stress the importance of casting the source modeling problem within a Bayesian framework which incorporates priors based on the model complexity and known model attributes. Using this approach, we have developed an efficient Bayesian \n\n[Figure 1 appears here: two panels plotting criterion value (roughly −1 to −0.8 bits per symbol) against the number of training samples, 5 to 50000 on a log scale, with a limit line marked; legend: grammar depth d / Markov order · = 1, * = 2, o = 3, • = 4, x = 5.] \n\nFigure 1: Results of computing the Bayes criterion measures (2.6) and (3.6) vs. 
the number of training samples: a) Markov source criterion (2.6); b) regular grammar combinatoric criterion (3.6). \n\nframework for inferring regular grammars. This type of Bayesian model is potentially quite useful for the texture analysis and image segmentation problem, where a consistent framework is desired for considering both structural and statistical features in the texture/image representation. \n\nAcknowledgements \n\nThis research was supported by the NSF via a Presidential Young Investigator Award ECE-8552518 and by the NIH via a DRR Grant RR-1380. \n\nReferences \n\nBlahut, R. E. (1987). Principles and Practice of Information Theory, Addison-Wesley Publishing Co., Reading, MA. \n\nMiller, M. I., Roysam, B., Smith, K. R., and Udding, J. T. (1988). \"Mapping Rule-Based Regular Grammars to Gibbs Distributions\", AMS-IMS-SIAM Joint Conference on Spatial Statistics and Imaging, American Mathematical Society. \n\nRissanen, J. (1986). \"Stochastic Complexity and Modeling\", Annals of Statistics, 14, no. 3, pp. 1080-1100. \n\nSmith, K. R., and Miller, M. I. (1990). \"A Bayesian Approach Incorporating Rissanen Complexity for Learning Markov Random Field Texture Models\", Proceedings of the Int. Conference on Acoustics, Speech, and Signal Processing, Albuquerque, NM. \n", "award": [], "sourceid": 231, "authors": [{"given_name": "Kurt", "family_name": "Smith", "institution": null}, {"given_name": "Michael", "family_name": "Miller", "institution": null}]}