{"title": "Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1878, "page_last": 1889, "abstract": "We present a factorized hierarchical variational autoencoder, which learns disentangled and interpretable representations from sequential data without supervision. Specifically, we exploit the multi-scale nature of information in sequential data by formulating it explicitly within a factorized hierarchical graphical model that imposes sequence-dependent priors and sequence-independent priors to different sets of latent variables. The model is evaluated on two speech corpora to demonstrate, qualitatively, its ability to transform speakers or linguistic content by manipulating different sets of latent variables; and quantitatively, its ability to outperform an i-vector baseline for speaker verification and reduce the word error rate by as much as 35% in mismatched train/test scenarios for automatic speech recognition tasks.", "full_text": "Unsupervised Learning of Disentangled and\n\nInterpretable Representations from Sequential Data\n\nWei-Ning Hsu, Yu Zhang, and James Glass\n\nComputer Science and Arti\ufb01cial Intelligence Laboratory\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139, USA\n\n{wnhsu,yzhang87,glass}@csail.mit.edu\n\nAbstract\n\nWe present a factorized hierarchical variational autoencoder, which learns disen-\ntangled and interpretable representations from sequential data without supervision.\nSpeci\ufb01cally, we exploit the multi-scale nature of information in sequential data by\nformulating it explicitly within a factorized hierarchical graphical model that im-\nposes sequence-dependent priors and sequence-independent priors to different sets\nof latent variables. 
The model is evaluated on two speech corpora to demonstrate, qualitatively, its ability to transform speakers or linguistic content by manipulating different sets of latent variables; and quantitatively, its ability to outperform an i-vector baseline for speaker verification and reduce the word error rate by as much as 35% in mismatched train/test scenarios for automatic speech recognition tasks.\n\n1\n\nIntroduction\n\nUnsupervised learning is a powerful methodology that can leverage vast quantities of unannotated data in order to learn useful representations that can be incorporated into subsequent applications in either a supervised or unsupervised fashion. One of the principal approaches to unsupervised learning is probabilistic generative modeling. Recently, there has been significant interest in three classes of deep probabilistic generative models: 1) Variational Autoencoders (VAEs) [23, 34, 22], 2) Generative Adversarial Networks (GANs) [11], and 3) auto-regressive models [30, 39]; more recently, there have also been studies combining multiple classes of models [6, 27, 26]. While GANs bypass any inference of latent variables, and auto-regressive models abstain from using latent variables, VAEs jointly learn an inference model and a generative model, allowing them to infer latent variables from observed data.\nDespite successes with VAEs, understanding the underlying factors that latent variables associate with remains a major challenge. Some research focuses on the supervised or semi-supervised setting using VAEs [21, 17]. There is also research attempting to develop weakly supervised or unsupervised methods to learn disentangled representations, such as DC-IGN [25], InfoGAN [1], and β-VAE [13]. There is yet another line of research that analyzes the latent variables with labeled data after the model is trained [33, 15]. 
While there has been much research investigating static data, such as the\naforementioned ones, there is relatively little research on learning from sequential data [8, 3, 2, 9, 7,\n18, 36]. Moreover, to the best of our knowledge, there has not been any attempt to learn disentangled\nand interpretable representations without supervision from sequential data. The information encoded\nin sequential data, such as speech, video, and text, is naturally multi-scaled; in speech for example,\ninformation about the channel, speaker, and linguistic content is encoded in the statistics at the\nsession, utterance, and segment levels, respectively. By leveraging this source of constraint, we can\nlearn disentangled and interpretable factors in an unsupervised manner.\nIn this paper, we propose a novel factorized hierarchical variational autoencoder, which learns\ndisentangled and interpretable latent representations from sequential data without supervision by\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: FHVAE (\u03b1 = 0) decoding results of three combinations of latent segment variables z1 and\nlatent sequence variables z2 from two utterances in Aurora-4: a clean one (top-left) and a noisy one\n(bottom-left). FHVAEs learn to encode local attributes, such as linguistic content, into z1, and encode\nglobal attributes, such as noise level, into z2. Therefore, by replacing z2 of a noisy utterance with\nz2 of a clean utterance, an FHVAE decodes a denoised utterance (middle-right) that preserves the\nlinguistic content. Reconstruction results of the clean and noisy utterances are also shown on the\nright. Audio samples are available at https://youtu.be/naJZITvCfI4.\n\nexplicitly modeling the multi-scaled information with a factorized hierarchical graphical model. 
The\ninference model is designed such that the model can be optimized at the segment level, instead\nof at the sequence level, which may cause scalability issues when sequences become too long. A\nsequence-to-sequence neural network architecture is applied to better capture temporal relationships.\nWe evaluate the proposed model on two speech datasets. Qualitatively, the model demonstrates an\nability to factorize sequence-level and segment-level attributes into different sets of latent variables.\nQuantitatively, the model achieves 2.38% and 1.34% equal error rate on unsupervised and supervised\nspeaker veri\ufb01cation tasks respectively, which outperforms an i-vector baseline. On speech recognition\ntasks, it reduces the word error rate in mismatched train/test scenarios by up to 35%.\nThe rest of the paper is organized as follows. In Section 2, we introduce our proposed model, and\ndescribe the neural network architecture in Section 3. Experimental results are reported in Section 4.\nWe discuss related work in Section 5, and conclude our work as well as discuss future research plans\nin Section 6. We have released the code for the model described in this paper.1\n\n2 Factorized Hierarchical Variational Autoencoder\n\nGeneration of sequential data, such as speech, often involves multiple independent factors operating\nat different time scales. For instance, the speaker identity affects fundamental frequency (F0) and\nvolume at the sequence level, while phonetic content only affects spectral contour and durations of\nformants at the segmental level. 
This multi-scale behavior results in the fact that some attributes, such as F0 and volume, tend to have a smaller amount of variation within an utterance than between utterances, while other attributes, such as phonetic content, tend to have a similar amount of variation within and between utterances.\nWe refer to the first type of attributes as sequence-level attributes, and the other as segment-level attributes. In this work, we achieve disentanglement and interpretability by encoding the two types of attributes into latent sequence variables and latent segment variables respectively, where the former is regularized by a sequence-dependent prior and the latter by a sequence-independent prior.\nWe now formulate a generative process for speech and propose our Factorized Hierarchical Variational Autoencoder (FHVAE). Consider some dataset $D = \{X^{(i)}\}_{i=1}^{M}$ consisting of $M$ i.i.d. sequences, where $X^{(i)} = \{x^{(i,n)}\}_{n=1}^{N^{(i)}}$ is a sequence of $N^{(i)}$ observed variables. $N^{(i)}$ is referred to as the number of segments of the i-th sequence, and $x^{(i,n)}$ is referred to as the n-th segment of the i-th sequence. Note that here a “segment” refers to a variable of smaller temporal scale compared to the “sequence”, and is in fact a sub-sequence. We will drop the index i whenever it is clear that we are referring to terms associated with a single sequence. We assume that each sequence $X$ is generated from some random process involving the latent segment variables $Z_1 = \{z_1^{(n)}\}_{n=1}^{N}$, the latent sequence variables $Z_2 = \{z_2^{(n)}\}_{n=1}^{N}$, and $\mu_2$.\n\n1https://github.com/wnhsu/FactorizedHierarchicalVAE\n\n(a) Generative Model\n\n(b) Inference Model\n\nFigure 2: Graphical illustration of the proposed generative model and inference model. Grey nodes denote the observed variables, and white nodes are the hidden variables.\n\nThe following generation process, illustrated in Figure 2(a), is considered: (1) an s-vector $\mu_2$ is drawn from a prior distribution $p_\theta(\mu_2)$; (2) $N$ i.i.d. latent sequence variables $\{z_2^{(n)}\}_{n=1}^{N}$ and latent segment variables $\{z_1^{(n)}\}_{n=1}^{N}$ are drawn from a sequence-dependent prior distribution $p_\theta(z_2|\mu_2)$ and a sequence-independent prior distribution $p_\theta(z_1)$, respectively; (3) $N$ i.i.d. observed variables $\{x^{(n)}\}_{n=1}^{N}$ are drawn from a conditional distribution $p_\theta(x|z_1, z_2)$. The joint probability for a sequence is formulated in Eq. 1:\n\n$$p_\theta(X, Z_1, Z_2, \mu_2) = p_\theta(\mu_2) \prod_{n=1}^{N} p_\theta(x^{(n)}|z_1^{(n)}, z_2^{(n)})\, p_\theta(z_1^{(n)})\, p_\theta(z_2^{(n)}|\mu_2). \quad (1)$$\n\nSpecifically, we formulate each of the RHS terms as follows:\n\n$$p_\theta(z_1) = \mathcal{N}(z_1|0, \sigma_{z_1}^2 I), \quad p_\theta(z_2|\mu_2) = \mathcal{N}(z_2|\mu_2, \sigma_{z_2}^2 I), \quad p_\theta(\mu_2) = \mathcal{N}(\mu_2|0, \sigma_{\mu_2}^2 I),$$\n$$p_\theta(x|z_1, z_2) = \mathcal{N}(x|f_{\mu_x}(z_1, z_2), \mathrm{diag}(f_{\sigma_x^2}(z_1, z_2))),$$\n\nwhere the priors over the s-vector $\mu_2$ and the latent segment variables $z_1$ are centered isotropic multivariate Gaussian distributions. The prior over the latent sequence variable $z_2$ conditioned on $\mu_2$ is an isotropic multivariate Gaussian centered at $\mu_2$. The conditional distribution of the observed variable $x$ is a multivariate Gaussian with a diagonal covariance matrix, whose mean and diagonal variance are parameterized by neural networks $f_{\mu_x}(\cdot,\cdot)$ and $f_{\sigma_x^2}(\cdot,\cdot)$ with input $z_1$ and $z_2$. 
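The three-step generation process above can be illustrated with ancestral sampling. A minimal numpy sketch, in which the decoder networks `f_mu_x` and `f_sigma2_x` are stand-in linear/constant maps and all dimensions and prior variances are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and prior variances (assumptions, not the paper's values).
D_Z1, D_Z2, D_X, N_SEG = 4, 4, 8, 5
VAR_MU2, VAR_Z2, VAR_Z1 = 1.0, 0.25, 1.0

# Stand-ins for the decoder networks f_mu_x and f_sigma2_x (here: a fixed linear
# map and a constant variance; in the model these are neural networks).
W = rng.standard_normal((D_Z1 + D_Z2, D_X))

def f_mu_x(z1, z2):
    return np.concatenate([z1, z2]) @ W

def f_sigma2_x(z1, z2):
    return np.full(D_X, 0.1)

def generate_sequence(n_segments=N_SEG):
    """Ancestral sampling following the FHVAE generative process."""
    # (1) one s-vector per sequence: mu2 ~ N(0, var_mu2 * I)
    mu2 = rng.normal(0.0, np.sqrt(VAR_MU2), D_Z2)
    segments = []
    for _ in range(n_segments):
        # (2) z2 ~ N(mu2, var_z2 * I)   (sequence-dependent prior)
        z2 = rng.normal(mu2, np.sqrt(VAR_Z2))
        #     z1 ~ N(0, var_z1 * I)     (sequence-independent prior)
        z1 = rng.normal(0.0, np.sqrt(VAR_Z1), D_Z1)
        # (3) x ~ N(f_mu_x(z1, z2), diag(f_sigma2_x(z1, z2)))
        x = rng.normal(f_mu_x(z1, z2), np.sqrt(f_sigma2_x(z1, z2)))
        segments.append(x)
    return mu2, np.stack(segments)

mu2, X = generate_sequence()
```

Because all segments of one sequence share the same `mu2`, their `z2` samples cluster around it, which is exactly the mechanism that pushes sequence-level attributes into `z2`.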
We use θ to denote the set of parameters in the generative model.\nThis generative model is factorized in such a way that the latent sequence variables $z_2$ within a sequence are forced to be close to $\mu_2$, and hence to each other, in Euclidean distance, and are therefore encouraged to encode sequence-level attributes that may have larger variance across sequences but smaller variance within sequences. The constraint on the latent segment variables $z_1$ is imposed globally, and therefore encourages encoding of residual attributes whose variation is not distinguishable between and within sequences.\nIn the variational autoencoder framework, since exact posterior inference is intractable, an inference model $q_\phi(Z_1^{(i)}, Z_2^{(i)}, \mu_2^{(i)}|X^{(i)})$ that approximates the true posterior $p_\theta(Z_1^{(i)}, Z_2^{(i)}, \mu_2^{(i)}|X^{(i)})$ is introduced for variational inference [19]. We consider the following inference model, shown in Figure 2(b):\n\n$$q_\phi(Z_1^{(i)}, Z_2^{(i)}, \mu_2^{(i)}|X^{(i)}) = q_\phi(\mu_2^{(i)}) \prod_{n=1}^{N^{(i)}} q_\phi(z_1^{(i,n)}|x^{(i,n)}, z_2^{(i,n)})\, q_\phi(z_2^{(i,n)}|x^{(i,n)})$$\n$$q_\phi(\mu_2^{(i)}) = \mathcal{N}(\mu_2^{(i)}|g_{\mu_{\mu_2}}(i), \sigma_{\tilde{\mu}_2}^2 I), \quad q_\phi(z_2|x) = \mathcal{N}(z_2|g_{\mu_{z_2}}(x), \mathrm{diag}(g_{\sigma_{z_2}^2}(x)))$$\n$$q_\phi(z_1|x, z_2) = \mathcal{N}(z_1|g_{\mu_{z_1}}(x, z_2), \mathrm{diag}(g_{\sigma_{z_1}^2}(x, z_2))),$$\n\nwhere the posteriors over $\mu_2$, $z_1$, and $z_2$ are all multivariate diagonal Gaussian distributions. Note that the mean of the posterior distribution of $\mu_2$ is not directly inferred from $X$, but instead is regarded as part of the inference model parameters, with one for each utterance, which are optimized during training. Therefore, $g_{\mu_{\mu_2}}(\cdot)$ can be seen as a lookup table, and we use $\tilde{\mu}_2^{(i)} = g_{\mu_{\mu_2}}(i)$ to denote the posterior mean of $\mu_2$ for the i-th sequence; we fix the posterior covariance matrix of $\mu_2$ for all sequences. Similar to the generative model, $g_{\mu_{z_2}}(\cdot)$, $g_{\sigma_{z_2}^2}(\cdot)$, $g_{\mu_{z_1}}(\cdot,\cdot)$, and $g_{\sigma_{z_1}^2}(\cdot,\cdot)$ are also neural networks, whose parameters along with $g_{\mu_{\mu_2}}(\cdot)$ are denoted collectively by φ. The variational lower bound of this inference model on the marginal likelihood of a sequence $X$ is derived as follows:\n\n$$\mathcal{L}(\theta, \phi; X) = \sum_{n=1}^{N} \mathcal{L}(\theta, \phi; x^{(n)}|\tilde{\mu}_2) + \log p_\theta(\tilde{\mu}_2) + \mathrm{const}$$\n$$\mathcal{L}(\theta, \phi; x^{(n)}|\tilde{\mu}_2) = \mathbb{E}_{q_\phi(z_1^{(n)}, z_2^{(n)}|x^{(n)})}\big[\log p_\theta(x^{(n)}|z_1^{(n)}, z_2^{(n)})\big] - \mathbb{E}_{q_\phi(z_2^{(n)}|x^{(n)})}\big[D_{KL}(q_\phi(z_1^{(n)}|x^{(n)}, z_2^{(n)})\,||\,p_\theta(z_1^{(n)}))\big] - D_{KL}(q_\phi(z_2^{(n)}|x^{(n)})\,||\,p_\theta(z_2^{(n)}|\tilde{\mu}_2)).$$\n\nThe detailed derivation can be found in Appendix A. Because the approximated posterior of $\mu_2$ does not depend on the sequence $X$, the sequence variational lower bound $\mathcal{L}(\theta, \phi; X)$ can be decomposed into the sum over segments of the conditional segment variational lower bounds $\mathcal{L}(\theta, \phi; x^{(n)}|\tilde{\mu}_2)$, plus the log prior probability of $\tilde{\mu}_2$ and a constant. 
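Both KL terms in the segment bound are between diagonal Gaussians and therefore have a closed form. A minimal sketch (not from the paper's released code; the example means and variances below are arbitrary):

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ):
    0.5 * sum( log(var_p/var_q) + (var_q + (mu_q - mu_p)^2)/var_p - 1 )."""
    mu_q, var_q = np.asarray(mu_q, float), np.asarray(var_q, float)
    mu_p, var_p = np.asarray(mu_p, float), np.asarray(var_p, float)
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# KL term for z2: an inferred posterior N(mu, diag(var)) against the
# sequence-dependent prior N(mu2_tilde, var_z2 * I).
mu2_tilde = np.zeros(3)
kl_z2 = kl_diag_gaussians([0.5, 0.0, -0.5], [0.2, 0.2, 0.2],
                          mu2_tilde, np.full(3, 0.25))

# The KL is zero iff the two distributions coincide.
kl_same = kl_diag_gaussians([1.0, 2.0], [0.3, 0.4], [1.0, 2.0], [0.3, 0.4])
```

The expectation over $q_\phi(z_2|x)$ in the bound is handled in practice with reparameterized Monte Carlo samples, while the two KL terms above are evaluated analytically.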
Therefore, instead of sampling a batch at the sequence level to maximize the sequence variational lower bound, we can sample a batch at the segment level to maximize the segment variational lower bound:\n\n$$\mathcal{L}(\theta, \phi; x^{(n)}) = \mathcal{L}(\theta, \phi; x^{(n)}|\tilde{\mu}_2) + \frac{1}{N} \log p_\theta(\tilde{\mu}_2) + \mathrm{const}. \quad (2)$$\n\nThis approach provides better scalability when the sequences are extremely long, such that computing an entire sequence for a batched update is too computationally expensive.\nIn this paper we only introduce two scales of attributes; however, one can easily extend this model to more scales by introducing $\mu_k$ for $k = 2, 3, \cdots$,2 which constrains the prior distribution of latent variables at more scales, such as having a session-dependent or dataset-dependent prior.\n\n2.1 Discriminative Objective\n\nThe idea of having sequence-specific priors for each sequence is to encourage the model to encode the sequence-level attributes and the segment-level attributes into different sets of latent variables. However, when $\mu_2 = 0$ for all sequences, the prior probability of the s-vector is maximized, and the KL-divergence of the inferred posterior of $z_2$ is measured from the same conditional prior for all sequences. This would result in trivial s-vectors $\mu_2$, and therefore $z_1$ and $z_2$ would not be factorized to encode sequence and segment attributes respectively.\nTo encourage $z_2$ to encode sequence-level attributes, we use $z_2^{(i,n)}$, which is inferred from $x^{(i,n)}$, to infer the sequence index $i$ of $x^{(i,n)}$. We formulate the discriminative objective as:\n\n$$\log p(i|z_2^{(i,n)}) = \log p(z_2^{(i,n)}|i) - \log \sum_{j=1}^{M} p(z_2^{(i,n)}|j) \quad (p(i) \text{ is assumed uniform})$$\n$$:= \log p_\theta(z_2^{(i,n)}|\tilde{\mu}_2^{(i)}) - \log \Big( \sum_{j=1}^{M} p_\theta(z_2^{(i,n)}|\tilde{\mu}_2^{(j)}) \Big).$$\n\nCombining the discriminative objective using a weighting parameter α with the segment variational lower bound, the objective function to maximize then becomes:\n\n$$\mathcal{L}^{dis}(\theta, \phi; x^{(i,n)}) = \mathcal{L}(\theta, \phi; x^{(i,n)}) + \alpha \log p(i|z_2^{(i,n)}), \quad (3)$$\n\nwhich we refer to as the discriminative segment variational lower bound.\n\n2The index starts from 2 because we do not introduce the hierarchy to $z_1$.\n\n2.2 Inferring S-Vectors During Testing\n\nDuring testing, we may want to use the s-vector $\mu_2$ of an unseen sequence $\tilde{X} = \{\tilde{x}^{(n)}\}_{n=1}^{\tilde{N}}$ as the sequence-level attribute representation for tasks such as speaker verification. Since exact maximum a posteriori estimation of $\mu_2$ is intractable, we approximate the estimation using the conditional segment variational lower bound as follows:\n\n$$\mu_2^* = \operatorname*{argmax}_{\mu_2} \log p_\theta(\mu_2|\tilde{X}) = \operatorname*{argmax}_{\mu_2} \log p_\theta(\tilde{X}, \mu_2) = \operatorname*{argmax}_{\mu_2} \Big( \sum_{n=1}^{\tilde{N}} \log p_\theta(\tilde{x}^{(n)}|\mu_2) \Big) + \log p_\theta(\mu_2) \approx \operatorname*{argmax}_{\mu_2} \sum_{n=1}^{\tilde{N}} \mathcal{L}(\theta, \phi; \tilde{x}^{(n)}|\mu_2) + \log p_\theta(\mu_2). \quad (4)$$\n\nThe closed-form solution of $\mu_2^*$ can be derived by differentiating Eq. 4 w.r.t. 
$\mu_2$ (see Appendix B):\n\n$$\mu_2^* = \frac{\sum_{n=1}^{\tilde{N}} g_{\mu_{z_2}}(\tilde{x}^{(n)})}{\tilde{N} + \sigma_{z_2}^2/\sigma_{\mu_2}^2}. \quad (5)$$\n\n3 Sequence-to-Sequence Autoencoder Model Architecture\n\nIn this section, we introduce the detailed neural network architectures for our proposed FHVAE. Let a segment $x = x_{1:T}$ be a sub-sequence of $X$ that contains $T$ time steps, and let $x_t$ denote the t-th time step of $x$. We use recurrent network architectures for the encoders, which capture the temporal relationships among time steps and generate a summarizing fixed-dimension vector after consuming an entire sub-sequence. Likewise, we adopt a recurrent network architecture that generates a frame step by step, conditioned on the latent variables $z_1$ and $z_2$. The complete network can be seen as a stochastic sequence-to-sequence autoencoder that encodes $x_{1:T}$ stochastically into $z_1$ and $z_2$, and stochastically decodes from them back to $x_{1:T}$.\n\nFigure 3: Sequence-to-sequence factorized hierarchical variational autoencoder. Dashed lines indicate the sampling process using the reparameterization trick [23]. The encoders for $z_1$ and $z_2$ are pink and amber, respectively, while the decoder for $x$ is blue. 
Darker colors denote the recurrent neural networks, while lighter colors denote the fully-connected layers predicting the mean and log variance.\n\nFigure 3 shows our proposed Seq2Seq-FHVAE architecture.3 Here we show the detailed formulation:\n\n$$(h_{z_2,t}, c_{z_2,t}) = \mathrm{LSTM}(x_{t-1}, h_{z_2,t-1}, c_{z_2,t-1}; \phi_{LSTM,z_2})$$\n$$q_\phi(z_2|x_{1:T}) = \mathcal{N}(z_2|\,\mathrm{MLP}(h_{z_2,T}; \phi_{MLP_\mu,z_2}), \mathrm{diag}(\exp(\mathrm{MLP}(h_{z_2,T}; \phi_{MLP_{\sigma^2},z_2}))))$$\n$$(h_{z_1,t}, c_{z_1,t}) = \mathrm{LSTM}([x_{t-1}; z_2], h_{z_1,t-1}, c_{z_1,t-1}; \phi_{z_1})$$\n$$q_\phi(z_1|x_{1:T}, z_2) = \mathcal{N}(z_1|\,\mathrm{MLP}(h_{z_1,T}; \phi_{MLP_\mu,z_1}), \mathrm{diag}(\exp(\mathrm{MLP}(h_{z_1,T}; \phi_{MLP_{\sigma^2},z_1}))))$$\n$$(h_{x,t}, c_{x,t}) = \mathrm{LSTM}([z_1; z_2], h_{x,t-1}, c_{x,t-1}; \phi_x)$$\n$$p_\theta(x_t|z_1, z_2) = \mathcal{N}(x_t|\,\mathrm{MLP}(h_{x,t}; \phi_{MLP_\mu,x}), \mathrm{diag}(\exp(\mathrm{MLP}(h_{x,t}; \phi_{MLP_{\sigma^2},x})))),$$\n\nwhere LSTM refers to a long short-term memory recurrent neural network [14], MLP refers to a multi-layer perceptron, and $\phi_*$ are the related weight matrices. None of the neural network parameters are shared. We refer to this model as Seq2Seq-FHVAE. Log-likelihood and qualitative comparisons with alternative architectures can be found in Appendix D.\n\n3Best viewed in color.\n\n4 Experiments\n\nWe use speech, which inherently contains information at multiple scales, such as channel, speaker, and linguistic content, to test our model. Learning to disentangle the mixed information from the surface representation is essential for a wide variety of speech applications: for example, noise robust speech recognition [41, 38, 37, 16], speaker verification [5], and voice conversion [40, 29, 24].\nThe following two corpora are used for our experiments: (1) TIMIT [10], which contains broadband 16kHz recordings of phonetically-balanced read speech. 
A total of 6300 utterances (5.4 hours) are\npresented with 10 sentences from each of 630 speakers, of which approximately 70% are male and\n30% are female. (2) Aurora-4 [32], a broadband corpus designed for noisy speech recognition tasks\nbased on the Wall Street Journal corpus (WSJ0) [31]. Two microphone types, CLEAN/CHANNEL\nare included, and six noise types are arti\ufb01cially added to both microphone types, which results in\nfour conditions: CLEAN, CHANNEL, NOISY, and CHANNEL+NOISY. Two 14 hour training sets are\nused, where one is clean and the other is a mix of all four conditions. The same noise types and\nmicrophones are used to generate the development and test sets, which both consist of 330 utterances\nfrom all four conditions, resulting in 4,620 utterances in total for each set.\nAll speech is represented as a sequence of 80 dimensional Mel-scale \ufb01lter bank (FBank) features\nor 200 dimensional log-magnitude spectrum (only for audio reconstruction), computed every 10ms.\nMel-scale features are a popular auditory approximation for many speech applications [28]. We\nconsider a sample x to be a 200ms sub-sequence, which is on the order of the length of a syllable,\nand implies T = 20 for each x. For the Seq2Seq-FHVAE model, all the LSTM and MLP networks\nare one-layered, and Adam [20] is used for optimization. More details of the model architecture and\ntraining procedure can be found in Appendix C.\n\n4.1 Qualitative Evaluation of the Disentangled Latent Variables\n\nFigure 4: (left) Examples generated by varying different latent variables. (right) An illustration\nof harmonics and formants in \ufb01lter bank images. The green block \u2018A\u2019 contains four reconstructed\nexamples. The red block \u2018B\u2019 contains ten original sequences on the \ufb01rst row with the corresponding\nreconstructed examples on the second row. 
The entry on the i-th row and the j-th column in the blue\nblock \u2018C\u2019 is the reconstructed example using the latent segment variable z1 of the i-th row from block\n\u2018A\u2019 and the latent sequence variable z2 of the j-th column from block \u2018B\u2019.\n\nTo qualitatively study the factorization of information between the latent segment variable z1 and the\nlatent sequence variable z2, we generate examples x by varying each of them respectively. Figure 4\nshows 40 examples in block \u2018C\u2019 of all the combinations of the 4 latent segment variables extracted\nfrom block \u2018A\u2019 and the 10 latent sequence variables extracted from block \u2018B.\u2019 The top two examples\nfrom block \u2018A\u2019 and the \ufb01ve leftmost examples from block \u2018B\u2019 are from male speakers, while the rest\nare from female speakers, which show higher fundamental frequencies and harmonics.4\n\n4The harmonics corresponds to horizontal dark stripes in the \ufb01gure; the more widely these stripes are spaced\n\nvertically, the higher the fundamental frequency of the speaker is.\n\n6\n\n\fFigure 5: FHVAE (\u03b1 = 0) decoding results of three combinations of latent segment variables z1\nand latent sequence variables z2 from one male-speaker utterance (top-left) and one female-speaker\nutterance (bottom-left) in Aurora-4. By replacing z2 of a male-speaker utterance with z2 of a female-\nspeaker utterance, an FHVAE decodes a voice-converted utterance (middle-right) that preserves the\nlinguistic content. Audio samples are available at https://youtu.be/VMX3IZYWYdg.\n\nWe can observe that along each row in block \u2018C\u2019, the linguistic phonetic-level content, which manifests\nitself in the form of the spectral contour and temporal position of formants, as well as the relative\nposition between formants, is very similar between elements; the speaker identity however changes\n(e.g., harmonic structure). 
On the other hand, for each column we see that the speaker identity remains consistent despite the change of linguistic content. The factorization of the sequence-level attributes and the segment-level attributes by our proposed Seq2Seq-FHVAE is clearly evident. In addition, we also show examples of modifying an entire utterance in Figures 1 and 5, which achieve denoising by replacing the latent sequence variable of a noisy utterance with that of a clean utterance, and voice conversion by replacing the latent sequence variable of one speaker with that of another speaker. Details of the operations we applied to modify an entire utterance, as well as larger-sized examples for different α values, can be found in Appendix E. We also show extra latent space traversal experiments in Appendix H.\n\n4.2 Quantitative Evaluation of S-Vectors – Speaker Verification\n\nTo quantify the performance of our model on disentangling the utterance-level attributes from the segment-level attributes, we present experiments on a speaker verification task on the TIMIT corpus to evaluate how well the estimated $\mu_2$ encodes speaker-level information.5 As a sanity check, we modify Eq. 5 to estimate an alternative s-vector based on the latent segment variables $z_1$ as follows: $\mu_1 = \sum_{n=1}^{\tilde{N}} g_{\mu_{z_1}}(\tilde{x}^{(n)})/(\tilde{N} + \sigma_{z_1}^2)$. We use the i-vector method [5] as the baseline, which is the representation used in most state-of-the-art speaker verification systems. 
They are in a low\ndimensional subspace of the Gaussian mixture model (GMM) mean supervector space, where the\nGMM is the universal background model (UBM) that models the generative process of speech.\nI-vectors, \u00b51, and \u00b52 can all be extracted without supervision; when speaker labels are available\nduring training, techniques such as linear discriminative analysis (LDA) can be applied to further\nimprove the linear separability of the representation. For all experiments, we use the fast scoring\napproach in [4] that uses cosine similarity as the similarity metric and compute the equal error rate\n(EER). More details about the experimental settings can be found in Appendix F.\nWe compare different dimensions for both features as well as different \u03b1\u2019s in Eq.3 for training\nFHVAE models. The results in Table 1 show that the 16 dimensional s-vectors \u00b52 outperform i-vector\nbaselines in both unsupervised (Raw) and supervised (LDA) settings for all \u03b1\u2019s as shown in the fourth\ncolumn; the more discriminatively the FHVAE model is trained (i.e., with larger \u03b1), the better speaker\n\n5TIMIT is not a standard corpus for speaker veri\ufb01cation, but it is a good corpus to show the utterance-level\nattribute we learned via this task, because the main attribute that is consistent within an utterance is speaker\nidentity, while in Aurora-4 both speaker identity and the background noise are consistent within an utterance.\n\n7\n\n\fveri\ufb01cation results it achieves. Moreover, with the appropriately chosen dimension, a 32 dimensional\n\u00b52 reaches an even lower EER at 1.34%. 
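The cosine-scoring EER reported above can be computed by sweeping a decision threshold over trial scores until the false-accept and false-reject rates meet. A minimal sketch with toy trials (the threshold sweep below is a generic illustration, not the paper's exact scoring protocol):

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two speaker representations."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def equal_error_rate(scores, labels):
    """EER of a verification system: the operating point where the
    false-accept rate equals the false-reject rate
    (labels: 1 = target/same speaker, 0 = impostor)."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, int)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(scores):
        far = np.mean(scores[labels == 0] >= t)  # impostors accepted
        frr = np.mean(scores[labels == 1] < t)   # targets rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)

# Toy trials: perfectly separated scores give an EER of 0.
scores = [0.9, 0.8, 0.85, 0.1, 0.2, 0.15]
labels = [1, 1, 1, 0, 0, 0]
eer = equal_error_rate(scores, labels)  # -> 0.0
```

In practice each trial score would be `cosine_score(mu2_enroll, mu2_test)` between the estimated s-vectors of an enrollment and a test utterance.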
On the other hand, the negative results of using $\mu_1$ also validate the success in disentangling utterance-level and segment-level attributes.\n\nTable 1: Comparison of speaker verification equal error rate (EER) on the TIMIT test set\n\nFeatures | Dimension | α | Raw | LDA (12 dim) | LDA (24 dim)\ni-vector | 48 | - | 10.12% | 6.25% | 5.95%\ni-vector | 100 | - | 9.52% | 6.10% | 5.50%\ni-vector | 200 | - | 9.82% | 6.54% | 6.10%\nμ2 | 16 | 0 | 5.06% | 4.02% | -\nμ2 | 16 | 10^-1 | 4.91% | 4.61% | -\nμ2 | 16 | 10^0 | 3.86% | 3.87% | -\nμ2 | 16 | 10^1 | 2.38% | 2.08% | -\nμ2 | 32 | 10^1 | 2.38% | 2.08% | 1.34%\nμ1 | 16 | 10^0 | 22.77% | 15.62% | -\nμ1 | 16 | 10^1 | 27.68% | 22.17% | -\nμ1 | 32 | 10^1 | 22.47% | 16.82% | 17.26%\n\n4.3 Quantitative Evaluation of the Latent Segment Variables – Domain Invariant ASR\n\nSpeaker adaptation and robust speech recognition in automatic speech recognition (ASR) can often be seen as domain adaptation problems, where available labeled data is limited and hence the data distributions during training and testing are mismatched. One approach to reduce the severity of this issue is to extract speaker/channel-invariant features for these tasks.\nAs demonstrated in Section 4.2, the s-vector contains information about domains. Here we evaluate whether the latent segment variables contain domain-invariant linguistic information through an ASR task: (1) train our proposed Seq2Seq-FHVAE using FBank features on a set that covers different domains; (2) train an LSTM acoustic model [12, 35, 42] on a set that only covers partial domains, using the mean and log variance of the latent segment variable $z_1$ extracted from the trained Seq2Seq-FHVAE; (3) test the ASR system on all domains. As a baseline, we also train the same ASR models using the FBank features alone. Detailed configurations are in Appendix G.\nFor TIMIT we assume that male and female speakers constitute different domains, and show the results in Table 2. 
The first row of results shows the ASR model trained on all domains (speakers) using FBank features, which serves as the upper bound. When trained on only male speakers, the phone error rate (PER) on female speakers increases by 16.1% for FBank features; however, for $z_1$, despite a slight degradation on male speakers, the PER on the unseen domain (female speakers) improves by 6.6% compared to FBank features.\n\nTable 2: TIMIT test phone error rate of acoustic models trained on different features and sets\n\nASR Train Set | FHVAE Train Set and Configuration | Features | Male | Female | All\nTrain All | - | FBank | 20.1% | 16.7% | 19.1%\nTrain Male | - | FBank | 21.0% | 32.8% | 25.2%\nTrain Male | Train All, α = 10 | z1 | 22.0% | 26.2% | 23.5%\n\nOn Aurora-4, four domains are considered: clean, noisy, channel, and noisy+channel (NC for short). We train the FHVAE on the development set for two reasons: (1) the FHVAE can be considered a general feature extractor, which can be trained on an arbitrary collection of data that does not necessarily include the data for subsequent applications; (2) the dev set of Aurora-4 contains the domain label for each utterance, so it is possible to control which domains have been observed by the FHVAE. Table 3 shows the word error rate (WER) results on Aurora-4, from which we can observe that the FBank representation suffers from severe domain mismatch problems; specifically, the WER increases by 53.3% when noise is present in mismatched microphone recordings (NC). In contrast, when the FHVAE is trained on data from all domains, using the latent segment variables as features reduces the WER by 16% to 35% compared to the baseline on mismatched domains, with less than 2% WER degradation on the matched domain. 
In addition, β-VAEs [13] are trained on the same data as the FHVAE to serve as baseline feature extractors, from which we extract the latent variables $z$ as the ASR features; the results are shown in the third to sixth rows. The β-VAE features outperform FBank in all mismatched domains, but are inferior to the latent segment variable $z_1$ from the FHVAE in those domains. These results demonstrate the importance of learning not only disentangled, but also interpretable representations, which can be achieved by our proposed FHVAE models. As a sanity check, we replace $z_1$ with $z_2$, the latent sequence variable, and train an ASR model, which, as expected, results in terrible WER performance, as shown in the eighth row.\nFinally, we train another FHVAE on all domains excluding the combined NC domain, and show the results in the last row of Table 3. It can be observed that the latent segment variable still outperforms the baseline feature, with 30% lower WER on the noise and channel combined data, even though the FHVAE has only seen the noise and channel variations independently.\n\nTable 3: Aurora-4 test word error rate of acoustic models trained on different features and sets\n\nASR Train Set | {FH-,β-}VAE Train Set and Configuration | Features | Clean | Noisy | Channel | NC | All\nTrain All | - | FBank | 3.60% | 8.24% | 7.06% | 18.49% | 11.80%\nTrain Clean | - | FBank | 3.47% | 50.97% | 36.99% | 71.80% | 55.51%\nTrain Clean | Dev, β = 1 | z (β-VAE) | 4.95% | 23.54% | 31.12% | 46.21% | 32.47%\nTrain Clean | Dev, β = 2 | z (β-VAE) | 3.57% | 27.24% | 30.56% | 48.17% | 34.75%\nTrain Clean | Dev, β = 4 | z (β-VAE) | 3.89% | 24.40% | 29.80% | 47.87% | 33.38%\nTrain Clean | Dev, β = 8 | z (β-VAE) | 5.32% | 34.84% | 36.13% | 58.02% | 42.76%\nTrain Clean | Dev, α = 10 | z1 (FHVAE) | 5.01% | 16.42% | 20.29% | 36.33% | 24.41%\nTrain Clean | Dev, α = 10 | z2 (FHVAE) | 41.08% | 68.73% | 61.89% | 86.36% | 72.53%\nTrain Clean | Dev\NC, α = 10 | z1 (FHVAE) | 5.25% | 16.52% | 19.30% | 40.59% | 26.23%\n\n5 Related Work\n\nA number of prior publications have extended VAEs 
to model structured data by altering the underlying graphical model to dynamic Bayesian networks, such as SRNN [3] and VRNN [9], or to hierarchical models, such as the neural statistician [7] and SVAE [18]. These models have shown success in quantitatively increasing the log-likelihood, or in qualitatively generating reasonable structured data by sampling. However, it remains unclear whether independent attributes are disentangled in the latent space. Moreover, the latent variables learned by these models are not interpretable without manual inspection or the use of labeled data. In contrast, our work presents a VAE framework that addresses both problems by explicitly modeling the differences in the rates of temporal variation of attributes that operate at different scales.

Our work is also related to β-VAE [13] with respect to unsupervised learning of disentangled representations with VAEs. The boosted KL-divergence penalty imposed during β-VAE training encourages disentanglement of independent attributes, but does not provide interpretability without supervision. We demonstrate in our domain-invariant ASR experiments that learning interpretable representations is important for such applications, and can be achieved with our FHVAE model. In addition, the idea of boosting the KL-divergence regularization is complementary to our model, and could potentially be integrated for better disentanglement.

6 Conclusions and Future Work

We introduce the factorized hierarchical variational autoencoder, which learns disentangled and interpretable representations for sequence-level and segment-level attributes without any supervision. We verify the disentangling ability both qualitatively and quantitatively on two speech corpora.
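As a concrete aside on the boosted KL-divergence penalty discussed in the related work: with a diagonal-Gaussian posterior and a standard-normal prior, the KL term that β-VAE scales by β > 1 has the closed form ½ Σ(σ² + μ² − log σ² − 1). The sketch below is our own numerical illustration of that objective, not code from either paper, and the variable names are ours:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    mu, logvar = np.asarray(mu, float), np.asarray(logvar, float)
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - logvar - 1.0)

def beta_vae_objective(recon_nll, mu, logvar, beta=1.0):
    """Negative ELBO with the KL term weighted by beta; beta > 1 boosts the
    penalty that pressures the posterior toward a factorized prior."""
    return recon_nll + beta * kl_to_standard_normal(mu, logvar)

# A posterior that exactly matches the prior contributes zero KL penalty:
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # → 0.0
```

Raising β multiplies only the second term, which is why large β (e.g. β = 8 in Table 3) trades reconstruction quality for stronger pressure toward independent latent dimensions.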
For future work, we plan to (1) extend the model to more levels of hierarchy, (2) investigate adversarial training for disentanglement, and (3) apply the model to other types of sequential data, such as text and videos.

References

[1] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[2] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.

[3] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pages 2980–2988, 2015.

[4] Najim Dehak, Reda Dehak, Patrick Kenny, Niko Brümmer, Pierre Ouellet, and Pierre Dumouchel. Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification. In Interspeech, volume 9, pages 1559–1562, 2009.

[5] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798, 2011.

[6] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

[7] Harrison Edwards and Amos Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.

[8] Otto Fabius and Joost R van Amersfoort. Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581, 2014.

[9] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther.
Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pages 2199–2207, 2016.

[10] John S Garofolo, Lori F Lamel, William M Fisher, Jonathon G Fiscus, and David S Pallett. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N, 93, 1993.

[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[12] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 273–278. IEEE, 2013.

[13] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. 2016.

[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[15] Wei-Ning Hsu, Yu Zhang, and James Glass. Learning latent representations for speech generation and transformation. In Interspeech, pages 1273–1277, 2017.

[16] Wei-Ning Hsu, Yu Zhang, and James Glass. Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation. In Automatic Speech Recognition and Understanding (ASRU), 2017 IEEE Workshop on. IEEE, 2017.

[17] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. Controllable text generation. arXiv preprint arXiv:1703.00955, 2017.

[18] Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference.
In Advances in Neural Information Processing Systems, pages 2946–2954, 2016.

[19] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

[20] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[21] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

[22] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. 2016.

[23] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[24] Tomi Kinnunen, Lauri Juvela, Paavo Alku, and Junichi Yamagishi. Non-parallel voice conversion using i-vector PLDA: Towards unifying speaker verification and transformation. In ICASSP, 2017.

[25] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.

[26] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.

[27] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

[28] Nelson Morgan, Hervé Bourlard, and Hynek Hermansky. Automatic speech recognition: An auditory perspective. In Speech Processing in the Auditory System, pages 309–338.
Springer, 2004.

[29] Toru Nakashika, Tetsuya Takiguchi, and Yasuhiro Minami. Non-parallel training in voice conversion using an adaptive restricted Boltzmann machine. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(11):2032–2045, November 2016.

[30] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[31] Douglas B Paul and Janet M Baker. The design for the Wall Street Journal-based CSR corpus. In Proceedings of the Workshop on Speech and Natural Language, pages 357–362. Association for Computational Linguistics, 1992.

[32] David Pearce. Aurora working group: DSR front end LVCSR evaluation AU/384/02. PhD thesis, Mississippi State University, 2002.

[33] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[34] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[35] Hasim Sak, Andrew W Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Interspeech, pages 338–342, 2014.

[36] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[37] Dmitriy Serdyuk, Kartik Audhkhasi, Philemon Brakel, Bhuvana Ramabhadran, Samuel Thomas, and Yoshua Bengio. Invariant representations for noisy speech recognition. CoRR, abs/1612.01928, 2016.

[38] Yusuke Shinohara.
Adversarial multi-task learning of deep neural networks for robust speech recognition. In Interspeech, pages 2369–2372, 2016.

[39] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.

[40] Zhizheng Wu, Eng Siong Chng, and Haizhou Li. Conditional restricted Boltzmann machine for voice conversion. In ChinaSIP, 2013.

[41] Dong Yu, Michael Seltzer, Jinyu Li, Jui-Ting Huang, and Frank Seide. Feature learning in deep neural networks – studies on speech recognition tasks. arXiv preprint arXiv:1301.3605, 2013.

[42] Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James Glass. Highway long short-term memory RNNs for distant speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5755–5759. IEEE, 2016.