{"title": "Unsupervised Spectral Learning of Finite State Transducers", "book": "Advances in Neural Information Processing Systems", "page_first": 800, "page_last": 808, "abstract": "Finite-State Transducers (FST) are a standard tool for modeling paired input-output sequences and are used in numerous applications, ranging from computational biology to natural language processing. Recently Balle et al. presented a spectral algorithm for learning FST from samples of aligned input-output sequences.  In this paper we address the more realistic, yet challenging setting where the alignments are unknown to the learning algorithm. We frame FST learning as finding a low rank Hankel matrix satisfying constraints derived from observable statistics. Under this formulation, we provide identifiability results for FST distributions. Then, following previous work on rank minimization, we propose a regularized convex relaxation of this objective which is based on minimizing a nuclear norm penalty subject to linear constraints and can be solved efficiently.", "full_text": "Unsupervised Spectral Learning of FSTs\n\nRapha\u00ebl Bailly\n\nXavier Carreras\n\nAriadna Quattoni\n\nUniversitat Polit\u00e8cnica de Catalunya\n\nBarcelona, 08034\n\nrbailly,carreras,aquattoni@lsi.upc.edu\n\nAbstract\n\nFinite-State Transducers (FST) are a standard tool for modeling paired input-output sequences and are used in numerous applications, ranging from computational biology to natural language processing. Recently Balle et al. [4] presented a spectral algorithm for learning FST from samples of aligned input-output sequences. In this paper we address the more realistic, yet challenging setting where the alignments are unknown to the learning algorithm. We frame FST learning as finding a low rank Hankel matrix satisfying constraints derived from observable statistics. Under this formulation, we provide identifiability results for FST distributions. 
Then, following previous work on rank minimization, we propose a regularized convex relaxation of this objective which is based on minimizing a nuclear norm penalty subject to linear constraints and can be solved efficiently.\n\n1 Introduction\n\nThis paper addresses the problem of learning probability distributions over pairs of input-output sequences, also known as the transduction problem. A pair of sequences is made of an input sequence, built from an input alphabet, and an output sequence, built from an output alphabet. Finite State Transducers (FST) are one of the main probabilistic tools used to model such distributions and have been used in numerous applications ranging from computational biology to natural language processing. A variety of algorithms for learning FST have been proposed in the literature; most of them are based on EM optimizations [9, 11] or grammatical inference techniques [8, 6].\n\nIn essence, an FST can be regarded as an HMM that generates bi-symbols of combined input-output symbols. The input and output symbols may be generated jointly or independently conditioned on the previous observations. A particular generation pattern constitutes what we call an alignment.\n\nGAATTCAG-\n| | || |\nGGA-TC-GA\n\nGAATTCAG-\n| || | |\nGGAT-C-GA\n\nGAATTC-AG\n| | || |\nGGA-TCGA-\n\nGAATTC-AG\n| || | |\nGGAT-CGA-\n\nTo be able to handle different alignments, a special empty symbol \u2217 is added to the input and output alphabets. With this enlarged set of bi-symbols, the model is able to generate an input symbol (resp. an output symbol) without an output symbol (resp. input symbol). These special bi-symbols will be represented by the pair ${x \\atop *}$ (resp. ${* \\atop y}$). As an example, the first alignment above will correspond to the two possible representations G\nA. 
Under this model the\nG\nprobability of observing a pair of un-aligned input-output sequences is obtained by integrating over\nall possible alignments.\nRecently, following a recent trend of work on spectral learning algorithms for \ufb01nite state machines\n[14, 2, 17, 18, 7, 16, 10, 5], Balle et al. [4] presented an algorithm for learning FST where the input\nto the algorithm are samples of aligned input-output sequences. As with most spectral methods the\ncore idea of this algorithm is to exploit low-rank decompositions of some Hankel matrix representing\n\nA\u2217\u2217\n\nA and G\n\nG\n\nA\u2217 G\n\nG\n\nA\u2217 G\n\nG\n\nA\u2217 A\n\nA\n\nT\u2217 T\n\nT\n\nT\u2217 T\n\nT\n\nA\nA\n\nG\n\n\u2217\n\nG\n\n\u2217\n\n\u2217\n\nC\nC\n\nC\nC\n\n1\n\n\fthe distribution of aligned sequences. To estimate this Hankel matrix it is assumed that the algorithm\ncan sample aligned sequences, i.e. it can directly observe sequences of enlarged bi-symbols.\nWhile the problem of learning FST from fully aligned sequences (what we sometimes refer to as\nsupervised learning) has been solved, the problem of deriving an unsupervised spectral method that\ncan be trained from samples of input-output sequences alone (i.e. where the alignment is hidden)\nremains open. This setting is signi\ufb01cantly more dif\ufb01cult due to the fact that we must deal with two\nsets of hidden variables: the states and the alignments. In this paper we address this unsupervised\nsetting and present a spectral algorithm that can approximate the distribution of paired sequences\ngenerated by an FST without having access to aligned sequences. 
To the best of our knowledge this is the first spectral algorithm for this problem.\n\nThe main challenge in the unsupervised setting is that, since the alignment information is not available, the Hankel matrices (as in [4]) can no longer be directly estimated from observable statistics. However, a key observation is that we can nevertheless compute observable statistics that constrain the coefficients of the Hankel matrix. This is because the probability of observing a pair of unaligned input-output sequences (i.e. an observable statistic) is computed by summing over all possible alignments, i.e. by summing entries of the hidden Hankel matrix. The main idea of our algorithm is to exploit these constraints and find a Hankel matrix (from which we can directly recover the model) which both agrees with the observed statistics and has a low-rank matrix factorization.\n\nIn brief, our main contribution is to show that an FST can be approximated by solving an optimization problem which is based on finding a low-rank matrix satisfying a set of constraints derived from observable statistics and Hankel structure. We provide sample complexity bounds and some identifiability results for this optimization that show that, theoretically, the rank and the parameters of an FST distribution can be identified. Following previous work on rank minimization, we propose a regularized convex relaxation of the proposed objective which is based on minimizing a nuclear norm penalty subject to linear constraints. The proposed relaxation balances a trade-off between model complexity (measured by the nuclear norm penalty) and fitting the observed statistics. Synthetic experiments show that the performance of our unsupervised algorithm approaches that of a supervised method trained from fully aligned sequences.\n\nThe paper is organized as follows. 
Section 2 gives preliminaries on FST and spectral learning methods, and establishes that an FST can be induced from a Hankel matrix of observable aligned statistics. Section 3 presents a generalized form of Hankel matrices for FST that allows observation constraints to be expressed efficiently. One cannot observe generalized Hankel matrices without assuming access to aligned samples. To solve this problem, Section 4 formulates finding the Hankel matrix of an FST from unaligned samples as a rank minimization problem. This section also presents the main theoretical results of the method, as well as a convex relaxation of the rank minimization problem. Section 5 presents results on synthetic data and Section 6 concludes.\n\n2 Preliminaries\n\n2.1 Finite-State Transducers\n\nDefinition 1. A Finite-State Transducer (FST) of rank $d$ is given by:\n\n\u2022 alphabets $\\Sigma_+ = \\{x_1, \\ldots, x_p\\}$ (input) and $\\Sigma_- = \\{y_1, \\ldots, y_q\\}$ (output)\n\u2022 vectors $\\alpha_1 \\in \\mathbb{R}^d$ and $\\alpha_\\infty \\in \\mathbb{R}^d$\n\u2022 for all $x \\in \\Sigma_+ \\cup \\{*\\}$ and $y \\in \\Sigma_- \\cup \\{*\\}$, a matrix $M^x_y \\in \\mathbb{R}^{d \\times d}$, with $M^*_* = 0$\n\nDefinition 2. Let $s$ be an input sequence, and let $t$ be an output sequence. An alignment of $(s, t)$ is given by a sequence of pairs ${x_1 \\atop y_1} \\ldots {x_n \\atop y_n}$ such that the sequence obtained from $x_1 \\ldots x_n$ (resp. $y_1 \\ldots y_n$) by removing the empty symbols $*$ equals $s$ (resp. $t$).\n\nDefinition 3. The set of alignments for a pair of sequences $(s, t)$ is denoted $[s, t]$.\n\nDefinition 4. Let $\\Sigma = (\\Sigma_+ \\cup \\{*\\}) \\times (\\Sigma_- \\cup \\{*\\})$. The set of aligned sequences is $\\Sigma^*$. The empty string is denoted $\\varepsilon$.\n\nfor the model T is given by:\n\nDefinition 5. Let T be an FST, and let w= x1\nrT(w)= \u03b1\n\u0016\nDefinition 6. Let(s, t) be an i/o (input/output) sequence. 
Then the value for(s, t) computed by an\nrT((s, t))= Q\n\nFST T is given by the sum of the values for all alignments:\n\nbe an aligned sequence. Then the value of w\n\n\u22c5 \u03b1\u221e\n\u0016Mxn\n\u2208[s,t] rT(x1\n\n1Mx1\ny1\n\n. . . xn\nyn\n\n. . . xn\nyn\n\n)\n\ny1\n\ny1\n\nyn\n\nx1\ny1 ...xn\nyn\n\nA more complete description of FST can be found in [15].\n\n1\n\n1\n\ntj\n\ntj\n\nt1\n\ntj\n\n\u0016\n\n\u0016M\u2217\n\n2.2 Computing with an FST\n\n[Id\u2212 M]\u22121\u03b1\u221e\n\n\u2217\u2212. The forward vector is de\ufb01ned by:\n+ Fi\u22121,j\u22121Msi\n\nalignments, which is generally exponential in the length of s and t. Standard techniques (e.g. the\nedit distance algorithm) can be applied in order to compute such a value in polynomial time.\n\nIn order to compute the value of a pair of sequences(s, t), one needs to sum over all possible\nProposition 1. Let T be an FST, s1\u2236n\u2208 \u03a3\n\u2217+, t1\u2236m\u2208 \u03a3\n1Ms1\u2217 \u0016Msi\u2217 , Fi,j= Fi\u22121,jMsi\u2217 + Fi,j\u22121M\u2217\n, Fi,0= \u03b1\nF0,j= \u03b1\n\u0016\n\u0016\n1M\u2217\nIt is then possible to compute rT((s, t))= Fn,m\u03b1\u221e in O(d2st). The sum of rT over all possible\n\u2217)=\u2211s\u2208\u03a3+,t\u2208\u03a3\u2212 rT((s, t)) can be computed with the formula\nvalues rT(\u03a3\n\u2217)= \u03b1\nrT(\u03a3\nwhere M=\u2211x\u2208\u03a3+\u222a{\u2217},y\u2208\u03a3\u2212\u222a{\u2217} Mx\nExample 1. Let us consider a particular subclass of FST: \u03a3\u2212= \u03a3+={0, 1}, where M0\n= M1\n=\n= 0.\nM0\u2217= M\u2217\n1~3 \u0003\n0 \u0003\nLet us draw all the paths for the i/o sequence(01, 001). The dashed red edges are discarded because\n\u2217\nand\u2217\nLet us consider the model A which satis\ufb01es the constraints above. One has rA(0\n) =\n)= 1~192 and, as those two aligned sequences are the only possible alignments for\n1~96, rA(\u2217\n(01, 001), one has rA((01, 001))= 1~64. 
It is possible to check that rA(\u03a3\n\u2217)= 1, thus the model\n\n0 \u0003 M1\u2217=\u0003 1~3\n=\u0003 0\n0 \u0003 M0\n\nof the constraints. Thus, there are only two different non-zero paths, corresponding to 0\n0\n\n1~6 \u0003 M\u2217\n1~4\n0 \u0003 M1\n\n\u03b11=\u0003 1\n\u03b1\u221e=\u0003 1~4\n\n=\u0003 1~6\n=\u0003 0\n1~2\n\nThe FST A satis\ufb01es the constraints.\n\n1 (in green)\n1\n\n1 (in blue).\n1\n\n\u2217\n\ny .\n\n0\u2217\n0\n\n0\u2217\n\n0\n0\n\n1\n1\n\n1\n1\n\n0\n0\n\n0\n\n0\n\n0\n\n1\n\n1\n\n0\n\n0\n\n0\n\n0\n\n0\n\n0\n\n0\n\n0\n\n0\n0\n\n0\n0\n\n0\n\n0\n\n1\n\n1\n1\n1\n1\n\n0\n\n0\n\n1\n\n0\n\ncomputes a probability distribution.\n\n2.3 Hankel Matrices\n\nLet us recall some basic de\ufb01nitions and properties.\n\n\u2217 a set of pre\ufb01xes, V \u2282 \u03a3\n\nLet \u03a3 be an alphabet, U\u2282 \u03a3\n\u2217 a set of suf\ufb01xes. U is said to be pre\ufb01x-closed\nif uv\u2208 U\u21d2 u\u2208 U. V is said to be suf\ufb01x-closed if uv\u2208 V \u21d2 v\u2208 V .\n\u2208 \u03a3}.\nu\u2208 U, x\nLet us denote U \u03a3 the set U\u222a{ux\n\u2208 \u03a3}. Let us denote \u03a3V the set V \u222a{x\nA Hankel matrix on U and V is a matrix with rows corresponding to elements u\u2208 U and columns\ncorresponding to v\u2208 V , which satis\ufb01es uv= u\n\u2032\u2192 H(u, v)= H(u\n\u2032).\n\u2032\n\u2032\nDe\ufb01nition 7. Let H a Hankel matrix for U \u03a3 and \u03a3V . One supposes that \u03b5\u2208 U and \u03b5\u2208 V . One\nthen de\ufb01nes the partial Hankel matrices de\ufb01ned for u\u2208 U and\u2208 V :\n, H1(v)= H(\u03b5, v)\n(u, v)= H(u, x\nH\u03b5(u, v)= H(u, v)\n, H\u221e(u)= H(u, \u03b5)\nyv)\n\nyvv\u2208 V, x\n\n, Hx\n\n, v\n\nv\n\ny\n\ny\n\ny\n\ny\n\n3\n\n\fy\n\n\u0016\n\n+\n\n+\n\n\u0016\n\n\u03b1\n1\n\nH\n\n\u03b5\n\ny\n\n1 H\n\u03b5\n\n= H\n\nWe will not give a proof of this result, as a more general result is seen further. 
The rank equality\ncomes from the fact that the WA de\ufb01ned above has the same rank as H\u03b5, and that the rank of a\n\nThe main result that we will be using is the following:\nProposition 2. Let H a Hankel matrix for U \u03a3 and \u03a3V . One supposes that U is pre\ufb01x-closed, V is\n\nsuf\ufb01x-closed, and that rank(H\u03b5)= rank(H). Then the WA de\ufb01ned by\n= Hx\n, \u03b1\u221e= H\u221e , Mx\ncomputes a mapping f such that\u2200u\u2208 U,\u2200v\u2208 V, f(uv)= H\u03b5(u, v).\nmapping f which satis\ufb01es f(uv)= H(u, v) is at least the rank of H. The following example shows\n\u03a3V with U={\u03b5, a3\nb3}, and V ={\u03b5, a3\nb3}.\n\u0004, H=\n\u0004 \u03b5\nH\u03b5=\n0 1~4\n\n\u00ef\u00ef\u00ef\u00ef\u00ef\u0017, H\n\u00ef\u00ef\u00ef\u00ef\u00ef\n1~4\n\u2032=\n1~4 1~4\n1~4 1~4 1~4\nOne has \u03b5\u2208 U and \u03b5\u2208 V , and also rank(H\u03b5)= rank(H)= 1, thus the computed WA is rank 1.\nb6)= 1~4. The complete Hankel\nSuch a WA cannot compute a mapping such that rT(\u0001)= 0 and rT(a6\n\nthat the pre\ufb01x and suf\ufb01x closeness are necessary conditions.\nExample 2. Let us consider the following Hankel over the set of pre\ufb01xes U \u03a3 and the set of suf\ufb01xes\n\n0 0 1~4 1~4\n0 0 1~4 1~4\n\na\n\u03b5\nb\n0 0\n0 0\n\n\u00ef\u00ef\u00ef\u00ef\u00ef\u00ef\n\n\u03b5\na\nb\na3\nb3\na4\nb4\n\n\u00ef\u00ef\u00ef\u00ef\u0017\n\na2\nb2\na3\nb3\na4\nb4\n\na4\nb4\n0\n0\n\na3\nb3\n0\n0\n\na2\nb2\n0\n\na3\nb3\n0\n\na3\nb3\n0\n\n\u03b5\na3\nb3\n\na4\nb4\n\n0\n\n0\n\nmatrix has at least rank 7.\n\n3\n\nInducing FST from Generalized Hankel Matrices\n\ny1\n\ny1\n\ny1 ...xn\nyn\n\n. . . xn\nyn\n\n. . . xn\nyn\n\n\u2208[s,t] rT(x1\n\n), where rT(x1\n\n) is computed from the Hankel.\n\nProposition 2 tells us that if we had access to certain sub-blocks of the Hankel matrix for aligned\nsequences we could recover the FST model. However, we do not have access to the hidden\n\nvations. 
One natural idea would be to search for a Hankel matrix that agrees with the obser-\nvations. To do so, we introduce observable constraints, which are linear constraints of the form\n\nFrom a matrix satisfying the hypothesis of Proposition 2 and the observation constraints, one can\nderive an FST computing a mapping which agrees on the observations.\n\nalignment information: we only have access to the statistics p(s, t), which we will call obser-\np(s, t)=\u2211x1\nGiven an i/o sequence(s, t), the size of[s, t] (hence the size of the Hankel matrix) is in general ex-\nOne will denote by P(A) the set of subsets of A.\n\u2032w\u2208 u, w\n\u2032={ww\n\u2032\u2208 u\n\u2032}.\n\u2217). The set uu\n\u2032\u2208 P(\u03a3\n\u2032 is de\ufb01ned by uu\ny1\u2236n will denote the set{x1\u2236n\n}, which contains a single\nDe\ufb01nition 8. Let u, u\n[s, t] will denote the set{x1\u2236nxn+1\u2236n+k\n xn+1\u2236n+k\n\u2208[s, t]}, which extends\nWe will denote sets of alignments as follows: x1\u2236n\n} with all ways of aligning(s, t). We will also use[s, t]x1\u2236n\n{x1\u2236n\naligned sequence; then x1\u2236n\ny1\u2236n\ny1\u2236nyn+1\u2236n+k\n[s, t]), where the aligned part is possibly empty.\n[s, t]x1\u2236n\ny1\u2236n\nDe\ufb01nition 9. A generalized pre\ufb01x (resp. generalized suf\ufb01x) is the empty set or a set of the form\n(resp. x1\u2236n\ny1\u2236n\n\nponential in the size of s and t. In order to overcome that problem when considering the observation\nconstraints, one will consider aggregations of rows and columns, corresponding to sets of pre\ufb01xes\nand suf\ufb01xes. Obviously, the de\ufb01nition of a Hankel matrix must be extended to this case.\n\ny1\u2236n analogously.\n\ny1\u2236n\nyn+1\u2236n+k\n\ny1\u2236n\n\n3.1 Generalized Hankel\n\nOne needs to extend the de\ufb01nition of a Hankel matrix to the generalized pre\ufb01xes and suf\ufb01xes.\nDe\ufb01nition 10. Let U be a set of generalized pre\ufb01xes, V be a set of generalized suf\ufb01xes. 
Let H be a\nmatrix indexed by U (in rows) and V (in columns). H is a generalized Hankel matrix if it satis\ufb01es:\n\nuv= \u00e4\n\u2032\nu\u2032\u2208\u00afu\u2032,v\u2032\u2208\u00afv\u2032 u\n\n\u2032\u21d2 Q\nu\u2208\u00afu,v\u2208\u00afv\n\nH(u, v)= Q\n\nu\u2032\u2208\u00afu\u2032,v\u2032\u2208\u00afv\u2032 H(u\n\u2032\n\n\u2032)\n\nv\n\nv\n\n\u2032\u2282 U,\u2200\u00afv, \u00afv\n\n\u2200\u00afu, \u00afu\n\u2032\u2282 V, \u00e4\nu\u2208\u00afu,v\u2208\u00afv\nwhere\u00e4 is the disjoint union.\n\n4\n\n\ft1\n\nt1\n\ntk\n\ntk\n\ny1\u2236n\n\nx1\u2236n\ny1\u2236n\n\ny1\u2236n\u22121\n\n\u2208 U\n\nDe\ufb01nition 12. Let V be a set of generalized suf\ufb01xes. V is suf\ufb01x-closed if\n\nDe\ufb01nition 13. Let U be a set of generalized pre\ufb01xes, V be a set of generalized suf\ufb01xes. The right-\n\nIn particular, if U and V are sets of regular pre\ufb01xes and suf\ufb01xes, this de\ufb01nition encompasses the\nregular de\ufb01nition for a Hankel matrix.\nDe\ufb01nition 11. Let U be a set of generalized pre\ufb01xes. U is pre\ufb01x-closed if\n\n[s, t]x1\u2236n\n\u2208 U\u21d2[s, t]x1\u2236n\u22121\n\u2208 U\n,[s1\u2236n\u22121, t1\u2236k\u22121]sn\n[s1\u2236n, t1\u2236k]\u2208 U\u21d2[s1\u2236n\u22121, t1\u2236k]sn\u2217 ,[s1\u2236n, t1\u2236k\u22121]\u2217\n[s, t]\u2208 V \u21d2 x2\u2236n\n[s, t]\u2208 V\n[s2\u2236n, t2\u2236k]\u2208 V\n[s1\u2236n, t1\u2236k]\u2208 V \u21d2 s1\u2217[s2\u2236n, t1\u2236k],\n[s1\u2236n, t2\u2236k], s1\ny2\u2236n\n\u2217\noperator completion of U is the set U \u03a3= U\u222a{ux\nu\u2208 U, x\n\u2208 \u03a3}. The left-operator completion of V\nis the set \u03a3V = V \u222a{x\nyvv\u2208 V, x\na Hankel matrix built from U \u03a3 and \u03a3V . One supposes that rank(H\u03b5)= rank(H), U is pre\ufb01x-\n= H\n= Hx\n\u0016\n+\ncomputes a mapping rA such that\u2200u\u2208 U,\u2200v\u2208 V, rA(uv)= H\u03b5(u, v)\npre\ufb01x (resp. suf\ufb01x) closure of input (resp. output) strings in S. 
Let U ={[s, t]}s\u2208prefS,in,t\u2208prefS,out\nand V ={[s, t]}s\u2208suffS,in,t\u2208suffS,out. The sets U and V contain all the observed pairs of unaligned\n\nA key result is the following, which is analogous to Proposition 2 for generalized Hankel matrices:\nProposition 3. Let U and V be two sets of generalized pre\ufb01xes and generalized suf\ufb01xes. Let H be\n\nLet S be a sample of unaligned sequences. Let prefS,in (resp. prefS,out, suffS,in, suffS,out) be the\n\nclosed and V is suf\ufb01x-closed. Then the model A de\ufb01ned by \u03b1\n1\n\n\u03b5 , \u03b1\u221e= H\u221e, Mx\n+\n\nProof. The proof can be found in the Appendix.\n\n\u2208 \u03a3}.\n\nsequences, and one can check that the sizes of U \u03a3 and \u03a3V are polynomial in the size of S.\n\n1 H\n\n\u0016\n\nH\n\ny\n\ny\n\ny\n\n\u03b5\n\ny\n\ny\n\nExample 3. Let us continue with the same model A as in Example 1. Let us now consider the\npre\ufb01xes \u0001, 0\n\n0 and the suf\ufb01xes \u0001, 1\n\n0\n\n1~96 \u0003\n0 \u0003\n\n0\n\n0\n\n0\n\n0\n\n0\n\n0\n\n0\n\n1. The Hankel matrices will be:\n\n=\u0003 1~24\n0 \u0003 H\u221e=\u0003 1~4\n0 \u0003 H\u03b5=\u0003 1~4\nH1=\u0003 1~4\n1~32 \u0003 H\u2217\n1~32\nH1\u2217=\u0003 1~12\n=\u0003\n\u0003 H1\n=\u0003 0\n1~192 \u0003 H0\n1~32\n\u2032 de\ufb01ned by:\n0 \u0003 M1\u2217=\u0003 1~3\n0 \u0003 \u03b1\u221e=\u0003 1~4\n1~6 \u0003\n\u03b11=\u0003 1\n=\u0003 1~6\n0 \u0003\n0 \u0003 M1\n=\u0003 0\n1~3 \u0003 M0\n=\u0003 0\n1~8\n\u2032 computes the same probability distribution than A.\n\n0\n\n0\n\n0\n\n0\n\n0\n\n1\n\n0\n\n0\n\n0\n\n0\n\n0\n\n1\n\n1\n\n0\n\nand one \ufb01nally gets the model A\n\nM\u2217\nOne can easily check that A\n\n0\n\n4 FST Learning as Non-convex Optimization\n\nProposition 3 shows that FST models can be recovered from generalized Hankel matrices. 
In this\nsection we show how FST learning can be framed as an optimization problem where one searches\nfor a low-rank generalized Hankel matrix that agrees with observation constraints derived from a\nsample. We assume here that p is a probability distribution over i/o sequences.\n\nWe will denote by z =(p([s, t]))s\u2208\u03a3\u2217+,t\u2208\u03a3\u2217\u2212 the vector built from observable probabilities, and by\nzS=(pS([s, t]))s\u2208\u03a3\u2217+,t\u2208\u03a3\u2217\u2212 the set of empirical observable probabilities, where pS is the frequency\n\ndeduced from an i.i.d. sample S.\n\n5\n\n\fminimize\n\nThe optimization task is the following:\n\nLet \u0014H be the vector describing the coef\ufb01cients of H. Let K be the matrix such that K\u0014H= 0 rep-\nresents the Hankel constraints (cf. de\ufb01nition 10). Let O be the matrix such that O\u0014H= z represents\nthe observation constraints (i.e.\u2211 H([s, t], \u0001)= p([s, t])).\nrank(H)\nsubject to \u0001O\u0014H\u2212 z\u00012\u2264 \u00b5\nK\u0014H= 0\n\u0001H\u00012\u2264 1.\n\nThe condition\u0001H\u00012\u2264 1 is necessary to bound the set of possible solutions. In particular, if H is\nthe Hankel matrix of a probability distribution, the condition is satis\ufb01ed: one has\u0001H\u00011\u2264 1 as each\ncolumn is a probability distribution, as well as\u0001H\u0001F \u2264 1 and thus\u0001H\u00012\u2264\u0001\u0001H\u00011\u0001H\u0001F \u2264 1. Let\nLet us denoteH\u00b5 the class of Hankel matrices solutions of (1) for a given \u00b5, where z represents the\np(uivj). Let us denoteHS\nrepresents pS(uivj), the observed frequencies in a sample S i.i.d. with respect to p.\nV such that any solution ofH0 leads to an FST\n\u03b5 , \u03b1\u221e= H\u221e, Mx\n+\n\nProposition 4. Let p be a distribution over i/o sequences computed by an FST. 
There exists U and\n\n\u00b5 the class of Hankel matrices solutions of (1) for a given \u00b5, where zS\n\nus remark that the set of matrices satisfying the constraints is a compact set.\n\n= Hx\n\n= H\n\n1 H\n\nH\n\u03b5\n\n(1)\n\n+\n\n\u0016\n\n\u0016\n\n\u03b1\n1\n\nH\n\ny\n\ny\n\nwhich computes p.\n\nProof. The proof can be found in the Appendix.\n\n4.1 Theoretical Properties\n\n\u00b52\n\nleads to a rank-d FST with probability\n\n\u00ef\u00172\n8 log(1~\u03b4)\n2 log(1~\u03b4)\n\u0001S\n\nS>\u00ef\u00ef 2+\u0001\n\u00b5 solution of (1) with \u00b5= 1+\u0001\n\nWe now present the main theoretical results related to the optimization problem (1). The \ufb01rst one\nconcerns the rank identi\ufb01cation, while the second one concerns the consistency of the method.\nProposition 5. Let p be a rank d distribution computed by an FST. There exists \u00b52 such that for any\n\n\u03b4> 0 and any i.i.d. sample S\nimplies that any H\u2208HS\nat least 1\u2212 \u03b4.\n\u03b4> 0 and any i.i.d. sample S\nimplies that for any H\u2208HS\nwhereA, AS\u221e is the maximum distance for model parameters, and \u03c3p is a non-zero parameter\n\nS>\u00ef\u00ef 1+\u0001\n\u00ef\u00172\n2 log(1~\u03b4)\n\u00b5 solution of (1) with \u00b5= 1+\u0001\n2 log(1~\u03b4)\n\u0001S\n)\nA, AS\u221e\u2264 O( d2\u0001\n\nProof. The proof can be found in the Appendix.\nProposition 6. Let p be a rank d distribution computed by an FST. There exists \u00b5\u0001 such that for any\n\nleading to a model As there exists\n\na model A computing p such that\n\n\u03c33\np\n\n\u00b5\u0001\n\ndepending on p.\n\nProof. The proof can be found in the Appendix.\n\n6\n\n\f0\n\n\u2217\n\n1\n1\n\n0\n\n0\n\n1\u22171\n\n1\n\n1~12\n\n) are\n\n\u2217\n1~36\n\n) and rM(0\n\u00ef\u00ef\u0017\n\nExample 4. This continues Examples 1 and 3. Let us \ufb01rst remark that, among all the values used to\nbuild the Hankel matrices in Example 3, some of them correspond to observable statistics, as there\n\nnot observable. 
Then the rank minimization objective is not suf\ufb01cient, as it allows any value for\nthose variables.\nLet us consider now larger sets of pre\ufb01xes and suf\ufb01xes: \u03b5,\n\nis only one possible alignment for them. Conversely, the exact values of rM(0\n0 and \u03b5, 1\u2217, 1\n\u00ef\u00ef\u0017 H1\u2217=\u00ef\u00ef\u00ef 1~12\nH\u0001=\u00ef\u00ef\u00ef 1~4\n=\u00ef\u00ef\u00ef\n\u00ef\u00ef\u0017 H1\n1~24\n1~32\n1~32\n\u2032 subject to the constraints O and K:\n\u00ef\u00ef\u0017=(hij) , O=\u0017\u0017\u0017\u0017\u0017\u0017\u0017\u0017\u0017\u0017\u0017 h11= 1~4\n, K=\u0017\u0017\u0017\u0017\u0017\u0017\u0017 h12= h41 h23= h81\n\u2032=\u00ef\u00ef\u00ef H\u03b5\nh12= 1~12\nh13= h71 h32= h61\nh13= 0\nh22= h51 h33= h91\nh21= 1~24\n)= rA((01, 001))= 1~64. One\nThe relation h63+ h92= 1~64 is due to fact that rA(0\nhas h22= h51 as they both represent p(\u2217\n\u2032 has a rank greater or equal than 2.\nh22= h51= 1~72, h52= 1~216, h63= 1~192, h92= 1~96\n\nh63+ h92= 1~64\n)+ rA(\u2217\n1\u2217). It is clear that H\n\n1. One then has\n0\n0\n0\n\n0\n0\n?\n\n0\n0\n\nThe only way to reach rank 2 under the constraints is\n\nWe want to minimize the rank of H\n\nH1\u2217\n\nH1\n1\n\n0\n0\n\n1\n1\n\n0\n\n1\n1\n\n0\n\n0\n\n\u22ee\n\u22ee\n\n\u2217\n\n?\n0\n\n?\n0\n\n?\n0\n\n0\n\n0\n0\n\n0, 0\n\n0\n0\n?\n\n1\n\nH\n\n0\n\nThus, the process of rank minimization under linear constraints leads to one single model, which is\nidentical to the original one. Of course, in the general case, the rank minimization objective may\nlead to several models.\n\n4.2 Convex Relaxation\n\nThe problem as it is stated in (1) is NP-hard to solve in the size of the Hankel matrix, hence impos-\nsible to deal with in practical cases. One can solve instead a convex relaxation of the problem (1),\nobtained by replacing the rank objective by the nuclear norm. 
The relaxed optimization statement is then the following:\n\n$$\\min_{H} \\; \\|H\\|_* \\quad \\text{subject to} \\quad \\|O\\vec{H} - z\\|_2 \\le \\mu, \\quad K\\vec{H} = 0, \\quad \\|H\\|_2 \\le 1 \\tag{2}$$\n\nThis type of relaxation has been used extensively in multiple settings [13]. The nuclear norm $\\|\\cdot\\|_*$ plays the same role as the $\\|\\cdot\\|_1$ norm in convex relaxations of the $\\|\\cdot\\|_0$ norm, used to reach sparsity under linear constraints.\n\n5 Experiments\n\nWe ran synthetic experiments using samples generated from random FST with input-output alphabets of size two. The main goal of our experiments was to compare our algorithm to a supervised spectral algorithm for FST that has access to alignments. In both methods, the Hankel matrix was defined for prefixes and suffixes up to length 1. Each run consists of generating a target FST, and generating N aligned samples according to the target distribution. These samples were directly used by the supervised algorithm. Then, we removed the alignment information from each sample (thus obtaining a pair of unaligned strings), and we used them to train an FST with our general learning algorithm, trying different values for a parameter C that trades off the nuclear norm term against the observation term. We ran this experiment for 8 target models of 5 states, for different sample sizes. We measure the L1 error of the learned models with respect to the target distribution on all unaligned pairs of strings whose sizes sum up to 6. We report results for geometric averages.\n\nIn addition, we ran two other methods. First, a factorized method that assumes that the two sequences are generated independently, and learns two separate weighted automata using a spectral method. Its performance is very poor, with L1 error rates at about 0.08, which confirms that our target models generate highly dependent sequence pairs. This baseline result also implies that the rest of the methods can learn the dependencies between paired strings. Second, we ran an Expectation Maximization algorithm (EM).\n\nFigure 1: Learning curves for different methods: SUP, supervised; UNS, unsupervised with different regularizers; EM, expectation maximization. The curves are averages of L1 error for random target models of 5 states.\n\nFigure 1 shows the performance of the learned models with respect to the number of samples. Clearly, our algorithm is able to estimate the target distribution and gets close to the performance of the supervised method, while making use of much simpler statistics. EM performed slightly better than the spectral methods, but nonetheless at the same levels of performance.\n\nOne can find other experimental results for the unsupervised spectral method in [1], where it is shown that, under certain circumstances, the unsupervised spectral method can perform better than supervised EM. Though the framework (unsupervised learning of PCFGs) is not the same, the method is similar and the optimization statement is identical.\n\n6 Conclusion\n\nIn this paper we presented a spectral algorithm for learning FST from unaligned sequences. This is the first paper to derive a spectral algorithm for the unsupervised FST learning setting. We prove that there is theoretical identifiability of the rank and the parameters of an FST distribution, using a rank minimization formulation. However, this problem is NP-hard, and it remains open whether there exists a polynomial method with identifiability results. Classically, rank minimization problems are solved with convex relaxations such as the nuclear norm minimization we have proposed. 
In experiments, we show that our method is comparable to a fully supervised spectral method, and close to the performance of EM.\n\nOur approach follows a similar idea to that of [3], since both works combine classic ideas from spectral learning with classic ideas from low-rank matrix completion. The basic idea is to frame learning of distributions over structured objects as a low-rank matrix factorization subject to linear constraints derived from observable statistics. This method applies to other grammatical inference domains, such as unsupervised spectral learning of PCFGs [1].\n\nAcknowledgments\n\nWe are grateful to Borja Balle and the anonymous reviewers for providing us with helpful comments. This work was supported by a Google Research Award, and by projects XLike (FP7-288342), ERA-Net CHISTERA VISEN, TACARDI (TIN2012-38523-C02-02), BASMATI (TIN2011-27479-C04-03), and SGR-LARCA (2009-SGR-1428). Xavier Carreras was supported by the Ram\u00f3n y Cajal program of the Spanish Government (RYC-2008-02223).\n\n[Figure 1 plot: L1 error (vertical axis, 0.0001 to 0.003) versus number of samples (horizontal axis, log scale, 10k to 1M); curves UNS 0.01, UNS 0.1, SUP, EM.]\n\nReferences\n\n[1] R. Bailly, X. Carreras, F. M. Luque, and A. Quattoni. Unsupervised spectral learning of WCFG as low-rank matrix completion. EMNLP, 2013.\n\n[2] R. Bailly, F. Denis, and L. Ralaivola. Grammatical inference as a principal component analysis problem. In Proc. ICML, 2009.\n\n[3] B. Balle and M. Mohri. Spectral learning of general weighted automata via constrained matrix completion. In Proc. of NIPS, 2012.\n\n[4] B. Balle, A. Quattoni, and X. Carreras. A spectral learning algorithm for finite state transducers. ECML\u2013PKDD, 2011.\n\n[5] Borja Balle, Ariadna Quattoni, and Xavier Carreras. Local loss optimization in operator models: A new insight into spectral learning. 
In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML \u201912, pages 1879\u20131886, New York, NY, USA, July 2012. Omnipress.\n\n[6] M. Bernard, J-C. Janodet, and M. Sebban. A discriminative model of stochastic edit distance in the form of a conditional transducer. Grammatical Inference: Algorithms and Applications, 4201, 2006.\n\n[7] B. Boots, S. Siddiqi, and G. Gordon. Closing the learning planning loop with predictive state representations. I. J. Robotic Research, 2011.\n\n[8] F. Casacuberta. Inference of finite-state transducers by using regular grammars and morphisms. Grammatical Inference: Algorithms and Applications, 1891, 2000.\n\n[9] A. Clark. Partially supervised learning of morphology with stochastic transducers. In Proc. of NLPRS, pages 341\u2013348, 2001.\n\n[10] Shay B. Cohen, Karl Stratos, Michael Collins, Dean P. Foster, and Lyle Ungar. Spectral learning of latent-variable PCFGs. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 223\u2013231, Jeju Island, Korea, July 2012. Association for Computational Linguistics.\n\n[11] J. Eisner. Parameter estimation for probabilistic finite-state transducers. In Proc. of ACL, pages 1\u20138, 2002.\n\n[12] G. Stewart and J.-G. Sun. Matrix perturbation theory. Academic Press, 1990.\n\n[13] Maryam Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, Electrical Engineering Dept., 2002.\n\n[14] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In Proc. of COLT, 2009.\n\n[15] Mehryar Mohri. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269\u2013311, 1997.\n\n[16] A.P. Parikh, L. Song, and E.P. Xing. A spectral algorithm for latent tree graphical models. ICML, 2011.\n\n[17] S.M. Siddiqi, B. Boots, and G.J. Gordon. Reduced-rank hidden Markov models. AISTATS, 2010.\n\n[18] L. Song, B. Boots, S. Siddiqi, G. Gordon, and A. Smola. Hilbert space embeddings of hidden Markov models. ICML, 2010.\n\n[19] L. Zwald and G. Blanchard. On the convergence of eigenspaces in kernel principal component analysis. NIPS, 2005.", "award": [], "sourceid": 466, "authors": [{"given_name": "Raphael", "family_name": "Bailly", "institution": "Universitat Polit\u00e8cnica de Catalunya"}, {"given_name": "Xavier", "family_name": "Carreras", "institution": "Universitat Polit\u00e8cnica de Catalunya"}, {"given_name": "Ariadna", "family_name": "Quattoni", "institution": "Universitat Polit\u00e8cnica de Catalunya"}]}