{"title": "Constrained Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 782, "page_last": 788, "abstract": null, "full_text": "Constrained Hidden Markov Models \n\nSam Roweis \n\nroweis@gatsby.ucl.ac.uk \n\nGatsby Unit, University College London \n\nAbstract \n\nBy thinking of each state in a hidden Markov model as corresponding to some \nspatial region of a fictitious topology space it is possible to naturally define neigh(cid:173)\nbouring states as those which are connected in that space. The transition matrix \ncan then be constrained to allow transitions only between neighbours; this means \nthat all valid state sequences correspond to connected paths in the topology space. \nI show how such constrained HMMs can learn to discover underlying structure \nin complex sequences of high dimensional data, and apply them to the problem \nof recovering mouth movements from acoustics in continuous speech. \n\n1 Latent variable models for structured sequence data \nStructured time-series are generated by systems whose underlying state variables change in \na continuous way but whose state to output mappings are highly nonlinear, many to one and \nnot smooth. Probabilistic unsupervised learning for such sequences requires models with \ntwo essential features: latent (hidden) variables and topology in those variables. Hidden \nMarkov models (HMMs) can be thought of as dynamic generalizations of discrete state \nstatic data models such as Gaussian mixtures, or as discrete state versions of linear dynam(cid:173)\nical systems (LDSs) (which are themselves dynamic generalizations of continuous latent \nvariable models such as factor analysis). While both HMMs and LDSs provide probabilistic \nlatent variable models for time-series, both have important limitations. 
Traditional HMMs have a very powerful model of the relationship between the underlying state and the associated observations because each state stores a private distribution over the output variables. This means that any change in the hidden state can cause arbitrarily complex changes in the output distribution. However, it is extremely difficult to capture reasonable dynamics on the discrete latent variable because in principle any state is reachable from any other state at any time step and the next state depends only on the current state. LDSs, on the other hand, have an extremely impoverished representation of the outputs as a function of the latent variables since this transformation is restricted to be global and linear. But it is somewhat easier to capture state dynamics since the state is a multidimensional vector of continuous variables on which a matrix \"flow\" is acting; this enforces some continuity of the latent variables across time. Constrained hidden Markov models address the modeling of state dynamics by building some topology into the hidden state representation. The essential idea is to constrain the transition parameters of a conventional HMM so that the discrete-valued hidden state evolves in a structured way.1 In particular, below I consider parameter restrictions which constrain the state to evolve as a discretized version of a continuous multivariate variable, i.e. so that it inscribes only connected paths in some space. This lends a physical interpretation to the discrete state trajectories in an HMM. \n\n1 A standard trick in traditional speech applications of HMMs is to use \"left-to-right\" transition matrices which are a special case of the type of constraints investigated in this paper. 
However, left-to-right (Bakis) HMMs force state trajectories that are inherently one-dimensional and uni-directional whereas here I also consider higher dimensional topology and free omni-directional motion. \n\n2 An illustrative game \n\nConsider playing the following game: divide a sheet of paper into several contiguous, non-overlapping regions which between them cover it entirely. In each region inscribe a symbol, allowing symbols to be repeated in different regions. Place a pencil on the sheet and move it around, reading out (in order) the symbols in the regions through which it passes. Add some noise to the observation process so that some fraction of the time incorrect symbols are reported in the list instead of the correct ones. The game is to reconstruct the configuration of regions on the sheet from only such an ordered list(s) of noisy symbols. Of course, the absolute scale, rotation and reflection of the sheet can never be recovered, but learning the essential topology may be possible.2 Figure 1 illustrates this setup. \n\n[Figure 1 panel titles: True Generative Map; iteration:030; logLikelihood:-1.9624] \n\nFigure 1: (left) True map which generates symbol sequences by random movement between connected cells. (centre) An example noisy output sequence with noisy symbols circled. (right) Learned map after training on 3 sequences (with 15% noise probability) each 200 symbols long. Each cell actually contains an entire distribution over all observed symbols, though in this case only the upper right cell has significant probability mass on more than one symbol (see figure 3 for display details). \n\nWithout noise or repeated symbols, the game is easy (non-probabilistic methods can solve it) but in their presence it is not. 
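The generative process of the game is exactly a hidden Markov process on grid cells, and can be sketched in a few lines of Python. This is an illustrative toy, not code from the paper: the function name, the 4-connected move set (plus staying put) and the uniform wrong-symbol noise model are assumptions of the sketch, and the paper's regions need not be square cells.

```python
import random

def play_game(symbol_grid, steps, noise=0.15, seed=0):
    """Generate a noisy symbol sequence by a random walk over grid cells.

    symbol_grid: list of rows of symbols (the 'sheet of paper'). At each
    step the pencil moves to a uniformly chosen neighbouring cell (or stays
    put) and the cell's symbol is reported, except that with probability
    `noise` a uniformly random incorrect symbol is emitted instead.
    """
    rng = random.Random(seed)
    rows, cols = len(symbol_grid), len(symbol_grid[0])
    alphabet = sorted({s for row in symbol_grid for s in row})
    r, c = rng.randrange(rows), rng.randrange(cols)
    sequence = []
    for _ in range(steps):
        # move to a random 4-connected neighbour that stays on the sheet
        moves = [(dr, dc) for dr, dc in [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
                 if 0 <= r + dr < rows and 0 <= c + dc < cols]
        dr, dc = rng.choice(moves)
        r, c = r + dr, c + dc
        true_symbol = symbol_grid[r][c]
        if rng.random() < noise:  # observation noise: report a wrong symbol
            sequence.append(rng.choice([s for s in alphabet if s != true_symbol]))
        else:
            sequence.append(true_symbol)
    return sequence
```

Sequences generated this way have exactly the structure the constrained HMM assumes: neighbour-only transitions and near-deterministic outputs corrupted by a small uniform noise floor.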
One way of mitigating the noise problem is to do statistical averaging. For example, one could attempt to use the average separation in time of each pair of symbols to define a dissimilarity between them. It then would be possible to use methods like multi-dimensional scaling or a sort of Kohonen mapping through time3 to explicitly construct a configuration of points obeying those distance relations. However, such methods still cannot deal with many-to-one state to output mappings (repeated numbers in the sheet) because by their nature they assign a unique spatial location to each symbol. \nPlaying this game is analogous to doing unsupervised learning on structured sequences. (The game can also be played with continuous outputs, although often high-dimensional data can be effectively clustered around a manageable number of prototypes; thus a vector time-series can be converted into a sequence of symbols.) Constrained HMMs incorporate latent variables with topology yet retain powerful nonlinear output mappings and can deal with the difficulties of noise and many-to-one mappings mentioned above; so they can \"win\" our game (see figs. 1 & 3). The key insight is that the game generates sequences exactly according to a hidden Markov process whose transition matrix allows only transitions between neighbouring cells and whose output distributions have most of their probability on a single symbol with a small amount on all other symbols to account for noise. \n\n2 The observed symbol sequence must be \"informative enough\" to reveal the map structure (this can be quantified using the idea of persistent excitation from control theory). \n\n3 Consider a network of units which compete to explain input data points. Each unit has a position in the output space as well as a position in a lower dimensional topology space. 
The winning unit has its position in output space updated towards the data point; but also the recent (in time) winners have their positions in topology space updated towards the topology space location of the current winner. Such a rule works well, and yields topological maps in which nearby units code for data that typically occur close together in time. However it cannot learn many-to-one maps in which more than one unit at different topology locations have the same (or very similar) outputs. \n\n3 Model definition: state topologies from cell packings \n\nDefining a constrained HMM involves identifying each state of the underlying (hidden) Markov chain with a spatial cell in a fictitious topology space. This requires selecting a dimensionality d for the topology space and choosing a packing (such as hexagonal or cubic) which fills the space. The number of cells in the packing is equal to the number of states M in the original Markov model. Cells are taken to be all of equal size and (since the scale of the topology space is completely arbitrary) of unit volume. Thus, the packing covers a volume M in topology space with a side length l of roughly l = M^(1/d). The dimensionality and packing together define a vector-valued function x(m), m = 1...M which gives the location of cell m in the packing. (For example, a cubic packing of d dimensional space defines x(m+1) to be [m, m/l, m/l^2, ..., m/l^(d-1)] mod l, using integer division.) State m in the Markov model is assigned to cell m in the packing, thus giving it a location x(m) in the topology space. Finally, we must choose a neighbourhood rule in the topology space which defines the neighbours of cell m; for example, all \"connected\" cells, all face neighbours, or all those within a certain radius. (For cubic packings, there are 3^d - 1 connected neighbours and 2d face neighbours in a d dimensional topology space.) 
The neighbourhood rule also defines the boundary conditions of the space - e.g. periodic boundary conditions would make cells on opposite extreme faces of the space neighbours with each other. \nThe transition matrix of the HMM is now preprogrammed to only allow transitions between neighbours. All other transition probabilities are set to zero, making the transition matrix very sparse. (I have set all permitted transitions to be equally likely.) Now, all valid state sequences in the underlying Markov model represent connected (\"city block\") paths through the topology space. Figure 2 illustrates this for a three-dimensional model. \n\nFigure 2: (left) Physical depiction of the topology space for a constrained HMM with d=3, l=4 and M=64 showing an example state trajectory. (right) Corresponding transition matrix structure for the 64-state HMM computed using face-centred cubic packing. The gaps in the inner bands are due to edge effects. \n\n4 State inference and learning \n\nThe constrained HMM has exactly the same inference procedures as a regular HMM: the forward-backward algorithm for computing state occupation probabilities and the Viterbi decoder for finding the single best state sequence. Once these discrete state inferences have been performed, they can be transformed using the state position function x(m) to yield probability distributions over the topology space (in the case of forward-backward) or paths through the topology space (in the case of Viterbi decoding). This transformation makes the outputs of state decodings in constrained HMMs comparable to the outputs of inference procedures for continuous state dynamical systems such as Kalman smoothing. 
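The construction of section 3, cell positions from a cubic packing plus a transition matrix that is nonzero only between connected cells, can be sketched as follows. This is illustrative code with assumed names, using 0-based state indices; allowing self-transitions (so each row has 3^d nonzeros rather than 3^d - 1) is an assumption of the sketch.

```python
def cell_position(m, d, l):
    """Location x(m) of cell m (0-based) in a cubic packing of side l:
    the base-l digits of m, one coordinate per dimension."""
    return tuple((m // l**i) % l for i in range(d))

def constrained_transitions(d, l):
    """Uniform transition matrix allowing moves only between 'connected'
    cells (every coordinate differs by at most 1); all permitted
    transitions are equally likely, all others are exactly zero."""
    M = l ** d
    pos = [cell_position(m, d, l) for m in range(M)]
    A = [[0.0] * M for _ in range(M)]
    for i in range(M):
        neighbours = [j for j in range(M)
                      if all(abs(a - b) <= 1 for a, b in zip(pos[i], pos[j]))]
        for j in neighbours:
            A[i][j] = 1.0 / len(neighbours)
    return A
```

For d=3, l=4 this reproduces the 64-state banded matrix structure of figure 2: interior cells connect to a full 3x3x3 block of cells while corner and face cells have fewer nonzeros (the edge effects visible as gaps in the bands).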
\nThe learning procedure for constrained HMMs is also almost identical to that for HMMs. In particular, the EM algorithm (Baum-Welch) is used to update model parameters. The crucial difference is that the transition probabilities which are precomputed by the topology and packing are never updated during learning. In fact, this makes learning much easier in some cases. Not only do the transition probabilities not have to be learned, but their structure constrains the hidden state sequences in such a way as to make the learning of the output parameters much more efficient when the underlying data really does come from a spatially structured generative model. Figure 3 shows an example of parameter learning for the game discussed above. Notice that in this case, each part of state space had only a single output (except for noise) so the final learned output distributions became essentially minimum entropy. But constrained HMMs can in principle model stochastic or multimodal output processes since each state stores an entire private distribution over outputs. \n\n[Figure 3 panel header: iteration:010; logLikelihood:-2.1451] \n\nFigure 3: Snapshots of model parameters during constrained HMM learning for the game described in section 2. At every iteration each cell in the map has a complete distribution over all of the observed symbols. Only the top three symbols of each cell's histogram are shown, with font size proportional to the square root of probability (to make ink roughly proportional). The map was trained on 3 noisy sequences each 200 symbols long generated from the map on the left of figure 1 using 15% noise probability. The final map after convergence (30 iterations) is shown on the right of figure 1. 
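The learning procedure with frozen transitions can be sketched as a standard scaled forward-backward E-step plus an M-step that re-estimates only the discrete output distributions; the transition matrix A comes from the topology and is never touched. This is illustrative code under assumed names, not the original implementation.

```python
def forward_backward(A, B, pi, obs):
    """Scaled forward-backward; returns per-time state posteriors gamma.

    A: M x M transition matrix (fixed by the topology, never re-estimated),
    B: M x K emission probabilities, pi: initial state distribution,
    obs: list of symbol indices in 0..K-1.
    """
    M, T = len(A), len(obs)
    alpha, scale = [], []
    prev = [pi[i] * B[i][obs[0]] for i in range(M)]
    for t in range(T):
        if t > 0:
            prev = [sum(alpha[t - 1][j] * A[j][i] for j in range(M)) * B[i][obs[t]]
                    for i in range(M)]
        s = sum(prev) or 1e-300
        scale.append(s)
        alpha.append([p / s for p in prev])
    beta = [[1.0] * M for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(M):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(M)) / scale[t + 1]
    gamma = [[alpha[t][i] * beta[t][i] for i in range(M)] for t in range(T)]
    return [[g / (sum(row) or 1e-300) for g in row] for row in gamma]

def update_emissions(gamma, obs, M, K, smooth=1e-6):
    """Baum-Welch M-step for the output distributions only; the transition
    probabilities stay frozen at their preprogrammed values."""
    B = [[smooth] * K for _ in range(M)]
    for t, o in enumerate(obs):
        for i in range(M):
            B[i][o] += gamma[t][i]
    return [[c / sum(row) for c in row] for row in B]
```

Iterating these two steps is the whole training loop: because only the emissions are free parameters, each EM iteration is cheaper and better constrained than full Baum-Welch.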
\n\n5 Recovery of mouth movements from speech audio \nI have applied the constrained HMM approach described above to the problem of recover(cid:173)\ning mouth movements from the acoustic waveform in human speech. Data containing si(cid:173)\nmultaneous audio and articulator movement information was obtained from the University \nof Wisconsin X-ray microbeam database [9]. Eight separate points (four on the tongue, one \non each lip and two on the jaw) located in the midsaggital plane of the speaker's head were \ntracked while subjects read various words, sentences, paragraphs and lists of numbers. The \nx and y coordinates (to within about \u00b1 Imm) of each point were sampled at 146Hz by an X(cid:173)\nray system which located gold beads attached to the feature points on the mouth, producing \na 16-dimensional vector every 6.9ms. The audio was sampled at 22kHz with roughly 14 \nbits of amplitude resolution but in the presence of machine noise. \nThese data are well suited to the constrained HMM architecture. They come from a system \nwhose state variables are known, because of physical constraints, to move in connected \npaths in a low degree-of-freedom space. In other words the (normally hidden) articulators \n(movable structures of the mouth), whose positions represent the underlying state of the \nspeech production system,4 move slowly and smoothly. The observed speech signal-the \nsystem's output--can be characterized by a sequence of short-time spectral feature vectors, \noften known as a spectrogram. In the experiments reported here, I have characterized the \naudio signal using 12 line spectral frequencies (LSFs) measured every 6.9ms (to coincide \nwith the articulatory sampling rate) over a 25ms window. These LSF vectors character(cid:173)\nize only the spectral shape of the speech waveform over a short time but not its energy. \nAverage energy (also over a 25ms window every 6.9ms) was measured as a separate one \ndimensional signal. 
Unlike the movements of the articulators, the audio spectrum/energy can exhibit quite abrupt changes, indicating that the mapping between articulator positions and spectral shape is not smooth. Furthermore, the mapping is many to one: different articulator configurations can produce very similar spectra (see below). \nThe unsupervised learning task, then, is to explain the complicated sequences of observed spectral features (LSFs) and energies as the outputs of a system with a low-dimensional state vector that changes slowly and smoothly. In other words, can we learn the parameters5 of a constrained HMM such that connected paths through the topology space (state space) generate the acoustic training data with high likelihood? Once this unsupervised learning task has been performed, we can (as I show below) relate the learned trajectories in the topology space to the true (measured) articulator movements. \n\n4 Articulator positions do not provide complete state information. For example, the excitation signal (voiced or unvoiced) is not captured by the bead locations. They do, however, provide much important information; other state information is easily accessible directly from acoustics. \n\n5 Model structure (dimensionality and number of states) is currently set using cross validation. \n\nWhile many models of the speech production process predict the many-to-one and non-smooth properties of the articulatory to acoustic mapping, it is useful to confirm these features by looking at real data. Figure 4 shows the experimentally observed distribution of articulator configurations used to produce similar sounds. It was computed as follows. All the acoustic and articulatory data for a single speaker are collected together. 
Starting with some sample called the key sample, I find the 1000 samples \"nearest\" to this key by two measures: articulatory distance, defined using the Mahalanobis norm between two position vectors under the global covariance of all positions for the appropriate speaker, and spectral shape distance, again defined using the Mahalanobis norm but now between two line spectral frequency vectors using the global LSF covariance of the speaker's audio data. In other words, I find the 1000 samples that \"look most like\" the key sample in mouth shape and that \"sound most like\" the key sample in spectral shape. I then plot the tongue bead positions of the key sample (as a thick cross), and the 1000 nearest samples by mouth shape (as a thick ellipse) and spectral shape (as dots). The points of primary interest are the dots; they show the distribution of tongue positions used to generate very similar sounds. (The thick ellipses are shown only as a control to ensure that many nearby points to the key sample do exist in the dataset.) Spread or multimodality in the dots indicates that many different articulatory configurations are used to generate the same sound. \n\n[Figure 4: two groups of four scatter panels of tongue bead positions (tongue body and tongue tip beads); axes in mm] \n\nFigure 4: Inverse mapping from acoustics to articulation is ill-posed in real speech production data. 
\nEach group of four articulator-space plots shows the 1000 samples which are \"nearest\" to one key sample (thick cross). The dots are the 1000 nearest samples using an acoustic measure based on line spectral frequencies. Spread or multimodality in the dots indicates that many different articulatory configurations are used to generate very similar sounds. Only the positions of the four tongue beads have been plotted. Two examples (with different key samples) are shown, one in the left group of four panels and another in the right group. The thick ellipses (shown as a control) are the two-standard deviation contour of the 1000 nearest samples using an articulatory position distance metric. \n\nWhy not do direct supervised learning from short-time spectral features (LSFs) to the articulator positions? The ill-posed nature of the inverse problem as shown in figure 4 makes this impossible. To illustrate this difficulty, I have attempted to recover the articulator positions from the acoustic feature vectors using Kalman smoothing on a LDS. In this case, since we have access to both the hidden states (articulator positions) and the system outputs (LSFs) we can compute the optimal parameters of the model directly. (In particular, the state transition matrix is obtained by regression from articulator positions and velocities at time t onto positions at time t + 1; the output matrix by regression from articulator positions and velocities onto LSF vectors; and the noise covariances from the residuals of these regressions.) Figure 5b shows the results of such smoothing; the recovery is quite poor. \nConstrained HMMs can be applied to this recovery problem, as previously reported [6]. (My earlier results used a small subset of the same database that was not continuous speech and did not provide the hard experimental verification (fig. 4) of the many-to-one problem.) 
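The nearest-sample computation behind figure 4 can be sketched as follows. This is an illustrative reimplementation (the function name is an assumption): find the k samples closest to a key sample under the Mahalanobis norm induced by the global covariance of the data, which is how both the articulatory and the spectral-shape distances above are defined.

```python
import numpy as np

def k_nearest_mahalanobis(X, key_index, k):
    """Indices of the k samples nearest to X[key_index] under the
    Mahalanobis norm induced by the global covariance of X (rows are
    samples: articulator position vectors or LSF vectors)."""
    cov = np.cov(X, rowvar=False)
    prec = np.linalg.pinv(np.atleast_2d(cov))  # pseudo-inverse for safety
    diffs = X - X[key_index]
    # squared Mahalanobis distance of every sample to the key sample
    d2 = np.einsum('ij,jk,ik->i', diffs, prec, diffs)
    order = np.argsort(d2)
    return order[1:k + 1]  # skip the key sample itself (distance zero)
```

Running this once on the articulatory vectors and once on the LSF vectors, with the same key sample, yields the two sets of 1000 neighbours plotted as the ellipse and the dots respectively.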
\n\n\fConstrained Hidden Markov Models \n\n787 \n\nFigure 5: (A) Recovered articulator movements using state inference on a constrained HMM. A \nfour-dimensional model with 4096 states was trained on data (all beads) from a single speaker but \nnot including the test utterance shown. Dots show the actual measured articulator movements for a \nsingle bead coordinate versus time; the thin lines are estimated movements from the corresponding \nacoustics. (B) Unsuccessful recovery of articulator movements using Kalman smoothing on a global \nLDS model. All the (speaker-dependent) parameters of the underlying linear dynamical system are \nknown; they have been set to their optimal values using the true movement information from the \ntraining data. Furthermore, for this example, the test utterance shown was included in the training \ndata used to estimate model parameters. (C) All 16 bead coordinates; all vertical axes are the same \nscale. Bead names are shown on the left. Horizontal movements are plotted in the left-hand column \nand vertical movements in the right-hand column, The separation between the two horizontal lines \nnear the centre of the right panel indicates the machine measurement error. \n\nRecovery of tongue tip vertical motion from acoustics \n\n2 345 \n\ntime [sec] \n\n6 \n\n7 \n\n8 \n\nKalman smoothing on optimal linear dynamical system \n\nI 20 \n\n\u00a7 \n'::l \n'[ \n'0 \n\n0 \n\n~-10 B \n\n-20L-~--~--~--~--~--~--~~ \n8 \n\n02345 \n\n6 \n\n7 \n\ntime [sec] \n\nThe basic idea is to train (unsupervised) on sequences of acoustic-spectral features and \nthen map the topology space state trajectories onto the measured articulatory movements. \nFigure 5 shows movement recovery using state inference in a four-dimensional model with \n4096 states (d=4,\u00a3=8,M =4096) trained on data (all beads) from a single speaker. (Naive \nunsupervised learning runs into severe local minima problems. 
To avoid these, in the simulations shown above, models were trained by slowly annealing two learning parameters6: a term epsilon^beta was used in place of the zeros in the sparse transition matrix, and gamma_t^beta was used in place of gamma_t = p(m_t|observations) during inference of state occupation probabilities. Inverse temperature beta was raised from 0 to 1.) To infer a continuous state trajectory from an utterance after learning, I first do Viterbi decoding on the acoustics to generate a discrete state sequence m_t and then interpolate smoothly between the positions x(m_t) of each state. \n\n6 An easier way (which I have used previously) to find good minima is to initialize the models using the articulatory data themselves. This does not provide as impressive \"structure discovery\" as annealing but still yields a system capable of inverting acoustics into articulatory movements on previously unseen test data. First, a constrained HMM is trained on just the articulatory movements; this works easily because of the natural geometric (physical) constraints. Next, I take the distribution of acoustic features (LSFs) over all times (in the training data) when Viterbi decoding places the model in a particular state and use those LSF distributions to initialize an equivalent acoustic constrained HMM. This new model is then retrained until convergence using Baum-Welch. \n\nAfter unsupervised learning, a single linear fit is performed between these continuous state trajectories and actual articulator movements on the training data. (The model cannot discover the units system or axes used to represent the articulatory data.) To recover articulator movements from a previously unseen test utterance, I infer a continuous state trajectory as above and then apply the single linear mapping (learned only once from the training data). 
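The post-learning readout described above (map Viterbi states to topology coordinates, interpolate, then fit one linear map to the measured articulator positions) can be sketched as follows. The names are assumptions, and the 3-tap moving average stands in for the paper's unspecified smooth interpolation between cell centres.

```python
import numpy as np

def topology_trajectory(states, positions):
    """Map a discrete Viterbi state sequence m_t to topology-space
    coordinates x(m_t), then smooth with a short moving average as an
    illustrative stand-in for smooth interpolation."""
    path = np.array([positions[m] for m in states], dtype=float)
    kernel = np.ones(3) / 3.0
    return np.column_stack([np.convolve(path[:, j], kernel, mode='same')
                            for j in range(path.shape[1])])

def fit_linear_readout(trajectory, articulators):
    """Single linear map (with bias) from topology coordinates to measured
    articulator positions, fit once by least squares on training data."""
    Z = np.hstack([trajectory, np.ones((len(trajectory), 1))])
    W, *_ = np.linalg.lstsq(Z, articulators, rcond=None)
    return W  # apply to test data as: np.hstack([traj, ones]) @ W
```

Because the model cannot discover the units or axes of the articulatory measurements, this one global linear fit is the only place the measured movements enter; it is estimated once on training data and then reused unchanged on unseen test utterances.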
\n6 Conclusions, extensions and other work \n\nBy enforcing a simple constraint on the transition parameters of a standard HMM, a link can be forged between discrete state dynamics and the motion of a real-valued state vector in a continuous space. For complex time-series generated by systems whose underlying latent variables do in fact change slowly and smoothly, such constrained HMMs provide a powerful unsupervised learning paradigm. They can model state to output mappings that are highly nonlinear, many to one and not smooth. Furthermore, they rely only on well understood learning and inference procedures that come with convergence guarantees. \nResults on synthetic and real data show that these models can successfully capture the low-dimensional structure present in complex vector time-series. In particular, I have shown that a speaker dependent constrained HMM can accurately recover articulator movements from continuous speech to within the measurement error of the data. This acoustic to articulatory inversion problem has a long history in speech processing (see e.g. [7] and references therein). Many previous approaches have attempted to exploit the smoothness of articulatory movements for inversion or modeling: Hogden et al. (e.g. [4]) provided early inspiration for my ideas, but do not address the many-to-one problem; Simon Blackburn [1] has investigated a forward mapping from articulation to acoustics but does not explicitly attempt inversion; early work at Waterloo [5] suggested similar constraints for improving speech recognition systems but did not look at real articulatory data; more recent work at Rutgers [2] developed a very similar system much further with good success. Perpinan [3] considers a related problem in sequence learning using EPG speech data as an example. 
\nWhile in this note I have described only \"diffusion\" type dynamics (transitions to all neighbours are equally likely) it is also possible to consider directed flows which give certain neighbours of a state lower (or zero) probability. The left-to-right HMMs mentioned earlier are an example of this for one-dimensional topologies. For higher dimensions, flows can be derived from discretization of matrix (linear) dynamics or from other physical/structural constraints. It is also possible to have many connected local flow regimes (either diffusive or directed) rather than one global regime as discussed above; this gives rise to mixtures of constrained HMMs which have block-structured rather than banded transition matrices. Smyth [8] has considered such models in the case of one-dimensional topologies and directed flows; I have applied these to learning character sequences from English text. Another application I have investigated is map learning from multiple sensor readings. An explorer (robot) navigates in an unknown environment and records at each time many local measurements such as altitude, pressure, temperature, humidity, etc. We wish to reconstruct from only these sequences of readings the topographic maps (in each sensor variable) of the area as well as the trajectory of the explorer. A final application is tracking (inferring movements) of articulated bodies using video measurements of feature positions. \n\nReferences \n[1] S. Blackburn & S. Young. ICSLP 1996, Philadelphia, v.2 pp.969-972 \n[2] S. Chennoukh et al. Eurospeech 1997, Rhodes, Greece, v.1 pp.429-432 \n[3] M. Carreira-Perpinan. NIPS'12, 2000. (This volume.) \n[4] D. Nix & J. Hogden. NIPS'11, 1999, pp.744-750 \n[5] G. Ramsay & L. Deng. J. Acoustical Society of America, 95(5), 1994, p.2873 \n[6] S. Roweis & A. Alwan. Eurospeech 1997, Rhodes, Greece, v.3 pp.1227-1230 \n[7] J. Schroeter & M. Sondhi. 
IEEE Trans. Speech & Audio Processing, 2(1, part 2), 1994, pp.133-150 \n[8] P. Smyth. NIPS'9, 1997, pp.648-654 \n[9] J. Westbury. X-ray microbeam speech production database user's handbook version 1.0. University of Wisconsin, Madison, June 1994. \n", "award": [], "sourceid": 1738, "authors": [{"given_name": "Sam", "family_name": "Roweis", "institution": null}]}