{"title": "Diffusion of Credit in Markovian Models", "book": "Advances in Neural Information Processing Systems", "page_first": 553, "page_last": 560, "abstract": null, "full_text": "Diffusion of Credit in Markovian Models \n\nYoshua Bengio\u00b7 \n\nPaolo Frasconi \n\nDept. I.R.O., Universite de Montreal, \n\nDipartimento di Sistemi e Informatica \n\nMontreal, Qc, Canada H3C-3J7 \nbengioyCIRO.UMontreal.CA \n\nUniversita di Firenze, Italy \n\npaoloCmcculloch.ing.unifi.it \n\nAbstract \n\nThis paper studies the problem of diffusion in Markovian models, \nsuch as hidden Markov models (HMMs) and how it makes very \ndifficult the task of learning of long-term dependencies in sequences. \nUsing results from Markov chain theory, we show that the problem \nof diffusion is reduced if the transition probabilities approach 0 or 1. \nUnder this condition, standard HMMs have very limited modeling \ncapabilities, but input/output HMMs can still perform interesting \ncomputations. \n\n1 \n\nIntroduction \n\nThis paper presents an important new element in our research on the problem of \nlearning long-term dependencies in sequences. In our previous work [4J we found \ntheoretical reasons for the difficulty in training recurrent networks (or more gen(cid:173)\nerally parametric non-linear dynamical systems) to learn long-term dependencies. \nThe main result stated that either long-term storing or gradient propagation would \nbe harmed, depending on whether the norm of the Jacobian of the state to state \nfunction was greater or less than 1. In this paper we consider a special case in which \nthe norm of the Jacobian of the state to state function is constrained to be exactly \n1 because this matrix is a stochastic matrix. \nWe consider both homogeneous and non-homogeneous Markovian models. 
Let n be the number of states and A_t the transition matrices (constant in the homogeneous case): A_ij(u_t) = P(q_t = j | q_{t-1} = i, u_t; θ), where u_t is an external input (constant in the homogeneous case) and θ is a vector of parameters. In the homogeneous case (e.g., standard HMMs), such models can learn the distribution of output sequences by associating an output distribution to each state.

*also with AT&T Bell Labs, Holmdel, NJ 07733

In the non-homogeneous case, transition and output distributions are conditional on the input sequences, allowing one to model relationships between input and output sequences (e.g., to do sequence regression or classification, as with recurrent networks). We thus call this kind of non-homogeneous HMM an Input/Output HMM (IOHMM). In [3, 2] we proposed a connectionist implementation of IOHMMs. In both cases, training requires propagating forward probabilities and backward probabilities, taking products with the transition probability matrix or its transpose. This paper studies the conditions under which these products of matrices gradually converge to lower rank, thus harming storage and learning of long-term context. However, we find in this paper that IOHMMs can deal with this problem better than homogeneous HMMs.

2 Mathematical Preliminaries

2.1 Definitions

A matrix A is said to be non-negative, written A ≥ 0, if A_ij ≥ 0 for all i, j. Positive matrices are defined similarly. A non-negative square matrix A ∈ R^{n×n} is called row stochastic (or simply stochastic in this paper) if Σ_{j=1}^{n} A_ij = 1 for all i = 1...n. A non-negative matrix is said to be row [column] allowable if every row [column] sum is positive. An allowable matrix is both row and column allowable. A non-negative matrix can be associated to the directed transition graph G that constrains the Markov chain.
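As a concrete illustration of a non-homogeneous model, the input-conditional transition matrix A(u_t) and the forward propagation of the state distribution can be sketched numerically. This is only a minimal sketch: the softmax parameterization and the names `W`, `transition_matrix`, `forward` are assumptions for illustration, not the connectionist architecture proposed in [3, 2].

```python
import numpy as np

def transition_matrix(u_t, W):
    """A(u_t): row i is a softmax over the scores W[i] @ u_t, so each row
    is a probability distribution over next states (row-stochastic matrix).
    W has shape (n, n, d): n states, d-dimensional input u_t."""
    scores = W @ u_t                             # (n, n) matrix of scores
    scores -= scores.max(axis=1, keepdims=True)  # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def forward(p0, inputs, W):
    """Propagate the state distribution: p_t' = p_{t-1}' A(u_t)."""
    p = np.asarray(p0, dtype=float)
    for u in inputs:
        p = p @ transition_matrix(u, W)
    return p

rng = np.random.default_rng(0)
n, d = 4, 3
W = rng.normal(size=(n, n, d))
p0 = np.full(n, 1.0 / n)                 # uniform initial state distribution
inputs = rng.normal(size=(10, d))        # an input sequence u_1 .. u_10
p10 = forward(p0, inputs, W)
```

In the homogeneous case A(u_t) is constant and the same recursion reduces to repeated multiplication by a single fixed stochastic matrix.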
An incidence matrix corresponding to a given non-negative matrix A replaces all positive entries of A by 1. The incidence matrix of A is a connectivity matrix corresponding to the graph G (assumed to be connected here). Some algebraic properties of A are described in terms of the topology of G.

Definition 1 (Irreducible Matrix) A non-negative n × n matrix A is said to be irreducible if for every pair i, j of indices, there exists a positive integer m = m(i, j) such that (A^m)_ij > 0.

A matrix A is irreducible if and only if the associated graph is strongly connected (i.e., there exists a path between any pair of states i, j). If there exists k such that (A^k)_ii > 0, then the period d(i) of index i is the greatest common divisor (g.c.d.) of those k for which (A^k)_ii > 0. In an irreducible matrix all the indices have the same period d, which is called the period of the matrix. The period of a matrix is the g.c.d. of the lengths of all cycles in the associated transition graph.

Definition 2 (Primitive Matrix) A non-negative matrix A is said to be primitive if there exists a positive integer k such that A^k > 0.

An irreducible matrix is either periodic or primitive (i.e., of period 1). A primitive stochastic matrix is necessarily allowable.

2.2 The Perron-Frobenius Theorem

Theorem 1 (See [6], Theorem 1.1.) Suppose A is an n × n non-negative primitive matrix. Then there exists an eigenvalue r such that:

1. r is real and positive;
2. with r can be associated strictly positive left and right eigenvectors;
3. r > |λ| for any eigenvalue λ ≠ r;
4. the eigenvectors associated with r are unique to constant multiples;
5. if 0 ≤ B ≤ A and β is an eigenvalue of B, then |β| ≤ r; moreover, |β| = r implies B = A;
6. r is a simple root of the characteristic equation of A.
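Both the period and primitivity are easy to check numerically from the incidence matrix. A minimal sketch, assuming an irreducible input; the function names `period` and `is_primitive` are ours:

```python
import numpy as np
from math import gcd

def period(inc):
    """Period of an irreducible 0/1 incidence matrix: the g.c.d. of all k
    for which (A^k)_ii > 0 (the same value for every index i)."""
    n = inc.shape[0]
    d = 0
    P = np.eye(n, dtype=int)
    for k in range(1, n * n + 1):        # n^2 powers are enough here
        P = (P @ inc > 0).astype(int)    # boolean power inc^k
        if P[0, 0]:
            d = gcd(d, k)
    return d

def is_primitive(inc):
    """A is primitive iff A^k > 0 for some k; by Wielandt's bound it
    suffices to test k = n^2 - 2n + 2."""
    n = inc.shape[0]
    P = np.eye(n, dtype=int)
    for _ in range(n * n - 2 * n + 2):
        P = (P @ inc > 0).astype(int)
    return bool((P > 0).all())

# A pure 3-cycle has period 3; adding a self-loop makes it primitive,
# since the cycle lengths 1 and 3 have g.c.d. 1.
cycle = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
loopy = cycle.copy()
loopy[0, 0] = 1
```

This matches the graph view above: the period is the g.c.d. of the cycle lengths of the transition graph.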
\n\nA simple consequence of the theorem for stochastic matrices is the following: \nCorollary 1 Suppose A is a primitive stochastic matrix. Then its largest eigeh(cid:173)\n1 = \nvalue is 1 and there is only one corresponding right eigenvector, which is \n[1, 1 .. \u00b71]'. Furthermore, all other eigenvalues < 1. \nProof. A1 = 1 by definition of stochastic matrices. This eigenvector is unique \nand all other eigenvalues < 1 by the Perron-Frobenius Theorem. \nIf A is stochastic but periodic with period d, then A has d eigenvalues of module 1 \nwhich are the d complex roots of 1. \n\n3 Learning Long-Term Dependencies with HMMs \nIn this section we analyze the case of a primitive transition matrix as well as the \ngeneral case with a canonical re-ordering of the matrix indices. We discuss how \nergodicity coefficients can be used to measure the difficulty in learning long-term \ndependencies. Finally, we find that in order to avoid all diffusion, the transitions \nshould be deterministic (0 or 1 probability). \n\n3.1 Training Standard HMMs \nTheorem 2 (See [6], Theorem 4.2.) If A is a primitive stochastic matrix, then \nas t -+ 00, At -+ 1V' where v' is the unique stationary distribution of the Markov \nchain. The rate of approach is geometric. \nThus if A is primitive, then liIDt-+oo At converges to a matrix whose eigenvalues are \nall 0 except for ,\\ = 1 (with eigenvector 1), i.e. the rank of this product converges \nto 1, i.e. its rows are equal. A consequence oftheorem 2 is that it is very difficult to \ntrain ordinary hidden Markov models, with a primitive transition matrix, to model \nlong-term dependencies in observed sequences. The reason is that the distribution \nover the states at time t > to becomes gradually independent of the distribution over \nthe states at time to as t increases. It means that states at time to become equally \nresponsible for increasing the likelihood of an output at time t. 
This corresponds, in the backward phase of the EM algorithm for training HMMs, to a diffusion of credit over all the states. In practice we train HMMs with finite sequences. However, training will become more and more numerically ill-conditioned as one considers longer-term dependencies. Consider two events e_0 (occurring at t_0) and e_t (occurring at t), and suppose there are also "interesting" events occurring in between. Let us consider the overall influence of states at times τ < t upon the likelihood of the outputs at time t. Because of the phenomenon of diffusion of credit, and because gradients are added together, the influence of intervening events (especially those occurring shortly before t) will be much stronger than the influence of e_0. Furthermore, this problem gets geometrically worse as t increases. Clearly a positive matrix is primitive. Thus in order to learn long-term dependencies, we would like to have many zeros in the matrix of transition probabilities. Unfortunately, this generally supposes prior knowledge of an appropriate connectivity graph.

3.2 Coefficients of ergodicity

To study products of non-negative matrices and the loss of information about the initial state in Markov chains (particularly in the non-homogeneous case), we introduce the projective distance between vectors x and y:

    d(x', y') = max_{i,j} ln( (x_i y_j) / (x_j y_i) ).

Clearly, some contraction takes place when d(x'A, y'A) ≤ d(x', y').
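The projective distance and its contraction under multiplication by a positive matrix can be checked directly; the spread of log(x/y) equals the max over pairs (i, j) in the definition. A minimal sketch (the helper name `proj_dist` is ours):

```python
import numpy as np

def proj_dist(x, y):
    """Projective distance d(x', y') = max_{i,j} ln(x_i y_j / (x_j y_i)),
    i.e. the spread of log(x/y); zero iff x and y are proportional."""
    r = np.log(x) - np.log(y)
    return r.max() - r.min()

rng = np.random.default_rng(1)
A = rng.random((5, 5)) + 0.1      # strictly positive matrix
x = rng.random(5) + 0.1           # positive row vectors
y = rng.random(5) + 0.1

d0 = proj_dist(x, y)
d1 = proj_dist(x @ A, y @ A)      # distance after one multiplication by A
```

For a strictly positive A the contraction is strict (this is Birkhoff's coefficient being < 1, introduced next); for a merely non-negative column-allowable A, the distance is still non-increasing.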
\n\n\f556 \n\nYoshua Bengio, Paolo Frasconi \n\nDefinition 3 BirkhofJ's contraction coefficient TB(A), for a non-negative column(cid:173)\nallowable matrix A, is defined in terms of the projective distance: \n\nTB(A) = \n\nsup \n\nx ,y> Ojx;t>.y \n\nd(x' A, y' A) \n\nd(x', y') \n\nDobrushin's coefficient Tl(A), for a stochastic matrix A, is defined as follows: \n\nTl(A) = 2 s~p L laik - ajkl\u00b7 \n\n1 \n\nI,) \n\nk \n\nBoth are proper ergodicity coefficients: 0 ~ T(A) ~ 1 and T(A) = 0 if and only if A \nhas identical rows. Furthermore, T(AIA2) ~ T(Al)T(A2)(see [6]). \n\n3.3 Products of Stochastic Matrices \nLet A (1 ,t) = A 1A2 \u00b7\u00b7\u00b7 At- 1 At denote a forward product of stochastic matrices \nAI, A2, ... At. From the properties of TB and Tl, if T(At} < 1, t > 0 then \nlimt-l-oo T(A(l,t\u00bb) = 0, i.e. A(l,t) has rank 1 and identical rows. Weak ergodic(cid:173)\nity is then defined in terms of a proper ergodic coefficient T such as TB and Tl: \n\nDefinition 4 (Weak Ergodicity) The products of stochastic matrices A(p,r) are \nweakly ergodic if and only if for all to ~ 0 as t -+ 00, T(A(to,t\u00bb) -+ O. \n\nTheorem 3 (See [6], Lemma 3.3 and 3.4.) Let A(l,t) a forward product of \nnon-negative and allowable matrices, then the products A(l,t) are weakly ergodic \nif and only if the following conditions both hold: \n1. 3to S.t. A(to,t) > 0 Vt > to \n2. A (;~,t) -+ Wij (t) > 0 as t -+ 00, i. e. rows of A (to,t) tend to proportionality. \n\nA(to,t) \n\n-\n\n),k \n\nFor stochastic matrices, row-proportionality is equivalent to row-equality since rows \nsum to 1. limt-l-oo ACto,t) does not need to exist in order to have weak ergodicity. \n\n3.4 Canonical Decomposition and Periodic Graphs \n\nAny non-negative matrix A can be rewritten by relabeling its indices in the following \ncanonical decomposition [6], with diagonal blocks B i , Ci and Q: \n\nA= \n\n0 \nB2 \n..... . ..... . \n\nC'+ 1 \n\n0 \n\n( Bl \n\n0 \n. . . . . . . . . 
\n0 \nLl \n\n0 \nL2 \n\n\" \n\n0 \n0 \n\n... \n. \n. . . . . . . \n0 \n. .... . .. \n\n0 \n0 \n\n0 \n\nCr \n0 \nLr Q \n\n) \n\n(1 ) \n\nwhere Bi and Ci are irreducible, Bi are primitive and Ci are periodic. Define the \ncorresponding sets of states as SBi' Se\" Sq. Q might be reducible, but the groups \nof states in Sq leak into the B or C blocks, i.e., Sq represents the transient part of \nthe state space. This decomposition is illustrated in Figure 1a. For homogeneous \nand non-homogeneous Markov models (with constant incidence matrix At = Ao), \nbecause P(qt E Sqlqt-l E Sq) < 1, liIl1t-l-oo P(qt E Sqlqo E Sq) = O. Furthermore, \nbecause the Bi are primitive, we can apply Theorem 1, and starting from a state \nin SB\" all information about an initial state at to is gradually lost. \n\n\fDiffusion of Credit in Markovian Models \n\n557 \n\n(b) \n\nFigure 1: (a): Transition graph corresponding to the canonical decomposition. \n(b): Periodic graph 91 becomes primitive (period 1) 92 when adding loop with \nstates 4,5. \n\nA more difficult case is the one of (A(to ,t))jk with initial state j ESc, . Let d i be \nthe period of the ith periodic block Cj. It can be shown r6] that taking d products \nof periodic matrices with the same incidence matrix and period d yields a block(cid:173)\ndiagonal matrix whose d blocks are primitive. Thus C(to ,t) retains information \nabout the initial block in which qt was. However, for every such block of size \n> 1, information will be gradually lost about the exact identity of the state within \nthat block. This is best demonstrated through a simple example. Consider the \nincidence matrix represented by the graph 91 of Figure lb. It has period 3 and the \nonly non-deterministic transition is from state 1, which can yield into either one of \ntwo loops. When many stochastic matrices with this graph are multiplied together, \ninformation about the loop in which the initial state was is gradually lost (i.e. 
if the initial state was 2 or 3, this information is gradually lost). What is retained is the phase information, i.e., in which block ({0}, {1}, or {2,3}) of the cyclic chain the initial state was. This suggests that it will be easy to learn about the type of outputs associated to each block of a cyclic chain, but it will be hard to learn anything else. Suppose now that the sequences to be modeled are slightly more complicated, requiring an extra loop of period 4 instead of 3, as in Figure 1b. In that case A is primitive: all information about the initial state will be gradually lost.

3.5 Learning Long-Term Dependencies: a Discrete Problem?

We might wonder if, starting from a positive stochastic matrix, the learning algorithm could learn the topology, i.e., replace some transition probabilities by zeros. Let us consider the update rule for transition probabilities in the EM algorithm:

    A_ij ← A_ij (∂L/∂A_ij) / Σ_{j'} A_{ij'} (∂L/∂A_{ij'}).      (2)

Starting from A_ij > 0, we could obtain a new A_ij = 0 only if ∂L/∂A_ij = 0, i.e., on a local maximum of the likelihood L. Thus the EM training algorithm will not exactly obtain zero probabilities. Transition probabilities might, however, approach 0.

It is also interesting to ask under which conditions we are guaranteed that there will not be any diffusion (of influence in the forward phase, and of credit in the backward phase of training). It requires that some of the eigenvalues other than λ_1 = 1 have a norm that is also 1. This can be achieved with periodic matrices C (of period
\u00b7.\u00b7.:.'.\u00b7\u00b7 .. :\u00b7.\u00b7: .\u2022 . . . \n\nI I \n\nt=4 \n\n-10 \n\n-15 \n~ --20 \n-25 \n\n-30 \n\n/ \n\nFull connected \n\nLeft-to-right \n(triangular) \n\n5 \n\n10 \n\n15 \nT \n(a) \n\n20 \n\n25 \n\n30 \n\nt=3 \n\n(b) \n\nFigure 2: (a) Convergence of Dobrushin's coefficient (see Definition 3. (b) Evolution \nof products A(l,t) for fully connected graph. Matrix elements are visualized with \ngray levels. \n\nd), which have d eigenvalues that are the d roots of 1 on the complex unit circle. \nTo avoid any loss of information also requires that Cd = I be the identity, since \nany diagonal block of Cd with size more than 1 will yield to a loss of information \n(because of diffusion in primitive matrices) . This can be generalized to reducible \nmatrices whose canonical form is composed of periodic blocks Ci with ct = I. \nThe condition we are describing actually corresponds to a matrix with only 1 's and \nO's_ If At is fixed, it would mean that the Markov chain is also homogeneous. It ap(cid:173)\npears that many interesting computations can not be achieved with such constraints \n(i.e. only allowing one or more cycles of the same period and a purely deterministic \nand homogeneous Markov chain). Furthermore, if the parameters of the system \nare the transition probabilities themselves (as in ordinary HMMs), such solutions \ncorrespond to a subset of the corners of the 0-1 hypercube in parameter space. \nAway from those solutions, learning is mostly influenced by short term dependen(cid:173)\ncies, because of diffusion of credit. Furthermore, as seen in equation 2, algorithms \nlike EM will tend to stay near a corner once it is approached. This suggests that \ndiscrete optimization algorithms, rather continuous local algorithms, may be more \nappropriate to explore the (legal) corners of this hypercube. 
\n\n4 Experiments \n4.1 Diffusion: Numerical Simulations \nFirstly, we wanted to measure how (and if) different kinds of products of stochastic \nmatrices converged, for example to a matrix of equal rows. We ran 4 simulations, \neach with an 8 states non-homogeneous Markov chain but with different constraints \non the transition graph: 1) 9 fully connected; 2) 9 is a left-to-right model (i.e. A is \nupper triangular); 3) 9 is left-to-right but only one-state skips are allowed (i.e. A \nis upper bidiagonal); 4) At are periodic with period 4. Results shown in Figure 2 \nconfirm the convergence towards zero of the ergodicity coefficient 1 , at a rate that \ndepends on the graph topology. In Figure 2, we represent visually the convergence \nof fully connected matrices, in only 4 time steps, towards equal columns. \n\nlexcept for the experiments with periodic matrices, as expected \n\n\fDiffusion of Credit in Markovian Models \n\n559 \n\n100,----~-.. -\u00b7\u00b7\u00b7-~/~\u00b7\u00b7\u00b7~~~\\-.~-:-/~:--~-~-~--~~-~--yg-iV-en~ \n/ \\ \u2022 .1 ~_. \\ Randomly co\",,..;ted. \n\n80 \n\n\u2022 \n\nFully connected, \n40stales \n\n\\ \n\\. \n\n\\ 24\" ... , \n\\ \n\n. / \n\n\u2022 . / \n\n.... \\ \n\n__ -----'\\ \n\n\\\\ \n',\\ \n\n\\\\ \n, ' \n..,\\ \\ \n~ \nFully