{"title": "Can Unconditional Language Models Recover Arbitrary Sentences?", "book": "Advances in Neural Information Processing Systems", "page_first": 15258, "page_last": 15268, "abstract": "Neural network-based generative language models like ELMo and BERT can work effectively as general purpose sentence encoders in text classification without further fine-tuning. Is it possible to adapt them in a similar way for use as general-purpose decoders? For this to be possible, it would need to be the case that for any target sentence of interest, there is some continuous representation that can be passed to the language model to cause it to reproduce that sentence. We set aside the difficult problem of designing an encoder that can produce such representations and, instead, ask directly whether such representations exist at all. To do this, we introduce a pair of effective, complementary methods for feeding representations into pretrained unconditional language models and a corresponding set of methods to map sentences into and out of this representation space, the reparametrized sentence space. We then investigate the conditions under which a language model can be made to generate a sentence through the identification of a point in such a space and find that it is possible to recover arbitrary sentences nearly perfectly with language models and representations of moderate size.", "full_text": "Can Unconditional Language Models Recover\n\nArbitrary Sentences?\n\nNishant Subramani\nNew York University\nnishant@nyu.edu\n\nSamuel R. Bowman\nNew York University\n\nKyunghyun Cho\n\nNew York University\nFacebook AI Research\n\nCIFAR Azrieli Global Scholar\n\nAbstract\n\nNeural network-based generative language models like ELMo and BERT can work\neffectively as general purpose sentence encoders in text classification without\nfurther fine-tuning. Is it possible to adapt them in a similar way for use as general-\npurpose decoders? 
For this to be possible, it would need to be the case that for\nany target sentence of interest, there is some continuous representation that can be\npassed to the language model to cause it to reproduce that sentence. We set aside\nthe dif\ufb01cult problem of designing an encoder that can produce such representations\nand, instead, ask directly whether such representations exist at all. To do this, we\nintroduce a pair of effective, complementary methods for feeding representations\ninto pretrained unconditional language models and a corresponding set of methods\nto map sentences into and out of this representation space, the reparametrized\nsentence space. We then investigate the conditions under which a language model\ncan be made to generate a sentence through the identi\ufb01cation of a point in such\na space and \ufb01nd that it is possible to recover arbitrary sentences nearly perfectly\nwith language models and representations of moderate size without modifying any\nmodel parameters.\n\n1\n\nIntroduction\n\nWe have recently seen great successes in using pretrained language models as encoders for a range\nof dif\ufb01cult natural language processing tasks (Dai and Le, 2015; Peters et al., 2017, 2018; Radford\net al., 2018; Ruder and Howard, 2018; Devlin et al., 2018; Dong et al., 2019; Yang et al., 2019), often\nwith little or no \ufb01ne-tuning: Language models learn useful representations that allow them to serve\nas general-purpose encoders. A hypothetical general-purpose decoder would offer similar bene\ufb01ts:\nmaking it possible to both train models for text generation tasks with little annotated data and share\nparameters extensively across applications in environments where memory is limited. 
Then, is it\npossible to use a pretrained language model as a general-purpose decoder in a similar fashion?\nFor this to be possible, we would need both a way of feeding some form of continuous sentence\nrepresentation into a trained language model and a task-speci\ufb01c encoder that could convert some\ntask input into a sentence representation that would cause the language model to produce the\ndesired sentence. We are not aware of any work that has successfully produced an encoder that can\ninteroperate in this way with a pretrained language model, and in this paper, we ask whether it is\npossible at all: Are typical, trained neural network language models capable of recovering arbitrary\nsentences through conditioning of this kind?\nWe start by de\ufb01ning the sentence space of a recurrent language model and show how this model maps\na given sentence to a trajectory in this space. We reparametrize this sentence space into a new space,\nthe reparametrized sentence space, by mapping each trajectory in the original space to a point in the\nnew space. To accomplish the reparametrization, we introduce two complementary methods to add\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fadditional bias terms to the previous hidden and cell state at each time step in the trained and frozen\nlanguage model, and optimize those bias terms to maximize the likelihood of the sentence.\nRecoverability inevitably depends on model size and quality of the underlying language model, so\nwe vary both along with different dimensions for the reparametrized sentence space. 
We \ufb01nd that the\nchoice of optimizer (nonlinear conjugate gradient over stochastic gradient descent) and initialization\nare quite sensitive, so it is unlikely that a simple encoder setup would work out of the box.\nOur experiments reveal that we can achieve full recoverability with a reparametrized sentence space\nwith dimension equal to the dimension of the recurrent hidden state of the model, at least for large\nenough models: For nearly all sentences, there exists a single vector that can recover the sentence\nperfectly. We show that this trend holds even with sentences that come from a different domain\nthan the ones used to train the \ufb01xed language model. We also observe that the smallest dimension\nable to achieve the greatest recoverability is approximately equal to the dimension of the recurrent\nhidden state of the model. Furthermore, we observe that recoverability decreases as sentence length\nincreases and that models \ufb01nd it increasingly dif\ufb01cult to generate words later in a sentence. In other\nwords, models rarely generate any correct words after generating an incorrect word when decoding\na given sentence. Lastly, experiments on recovering random sequences of words show that our\nreparametrized sentence space does not simply memorize the sequence, but also utilizes the language\nmodel. These observations indicate that unconditional language models can indeed be conditioned to\nrecover arbitrary sentences almost perfectly and may have a future as universal decoders.\n\n2 The Sentence Space of a Recurrent Language Model\n\nIn this section, we \ufb01rst cover the background on recurrent language models. We then characterize\nits sentence space and show how we can reparametrize it for easier analysis. 
In this reparametrized\nsentence space, we define the recoverability of a sentence.\n\n2.1 Recurrent Language Models\n\nModel Description We train a 2-layer recurrent language model over sentences autoregressively:\n\np(x_1, \\ldots, x_T) = \\prod_{t=1}^{T} p(x_t \\mid x_1, \\ldots, x_{t-1})    (1)\n\nA neural network models each conditional distribution (right side) by taking as input all the previous\ntokens (x_1, ..., x_{t-1}) and producing as output the distribution over all possible next tokens.\nAt every time-step, we update the internal hidden state h_{t-2}, which summarizes (x_1, ..., x_{t-2}),\nwith a new token x_{t-1}, resulting in h_{t-1}. This resulting hidden state, h_{t-1}, is used to compute\np(x_t | x_1, ..., x_{t-1}):\n\nh_{t-1} = f_\\theta(h_{t-2}, x_{t-1}),    (2)\np(x_t = i \\mid x_1, \\ldots, x_{t-1}) = g^i_\\theta(h_{t-1}),    (3)\n\nwhere f_\u03b8 : R^d \u00d7 V \u2192 R^d is a recurrent transition function often implemented as an LSTM recurrent\nnetwork (as in Hochreiter and Schmidhuber, 1997; Mikolov et al., 2010). The readout function g is\ngenerally a softmax layer with dedicated parameters for each possible word. The incoming hidden\nstate h_0 \u2208 R^d at the start of generation is generally an arbitrary constant vector. We use zeroes. For an\nLSTM language model with l layers of d LSTM units, its model dimension d\u2217 = 2dl because LSTMs\nhave two hidden state vectors (conventionally h and c) both of dimension d.\nTraining We train the full model using stochastic gradient descent with negative log likelihood loss.\nInference Once learning completes, a language model can be straightforwardly used in two ways:\nscoring and generation. To score, we compute the log-probability of a newly observed sentence\naccording to Eq. (1). To generate, we use ancestral sampling by sampling tokens (x_1, ..., x_T)\nsequentially, conditioning on all previous tokens at each step via Eq. 
(1).\nIn addition, we can \ufb01nd the approximate most likely sequence using beam search (Graves, 2012).\nThis procedure is generally used with language model variants like sequence-to-sequence models\n(Sutskever et al., 2014) that condition on additional context. We use this procedure in backward\nestimation to recover the sentence corresponding to a given point in the reparametrized space.\n\n2\n\n\fFigure 1: We add an additional bias, Wzz (left, when dim(z) \u2264 d\u2217) or Z = [z1 . . . zK] (right, when\ndim(z) > d\u2217), to the previous hidden and cell state at every time step. Only the z vector or Z matrix\nis trained during forward estimation: The main LSTM parameters are frozen and Wz is set randomly.\nIn the right hand case, we use soft attention to allow the model to use different slices of Z each step.\n\n2.2 De\ufb01ning the Sentence Space\n\nThe recurrent transition function f\u03b8 in Eq. (2) de\ufb01nes a dynamical system driven by the observations\nof tokens (x1, . . . , xT ) \u2208 X in a sentence. In this dynamical system, all trajectories start at the\n(cid:62) and evolve according to incoming tokens (xt\u2019s) over time. Any trajectory\norigin h0 = [0, . . . , 0]\n(h0, . . . , hT ) is entirely embedded in a d-dimensional space, where d is equal to the dimension of the\nhidden state and H \u2208 Rd, i.e., ht \u2208 H. In other words, the language model embeds a sentence of\nlength T as a T + 1-step trajectory in a d-dimensional space H, which we refer to as the sentence\nspace of a language model.\nReparametrizing the Sentence Space We want to recover sentences from semantic representations\nthat do not encode sentence length symbolically. Given that and since a single replacement of an\nintermediate token can drastically change the remaining trajectory in the sentence space, we want\na \ufb02at-vector representation. 
In order to address this, we propose to (approximately) reparametrize\nthe sentence space into a flat-vector space Z \u2208 R^{d'} to characterize the sentence space of a language\nmodel. Under the proposed reparameterization, a trajectory of hidden states in the sentence space H\nmaps to a vector of dimension d' in the reparametrized sentence space Z. To accomplish this, we add\nbias terms to the previous hidden and cell state at each time step in the model and optimize them to\nmaximize the log probability of the sentence as shown in Figure 1. We add this bias in two ways:\n(1) if d' \u2264 d\u2217, we use a random projection matrix to project our vector z \u2208 R^{d'} up to d\u2217 and (2) if\nd' > d\u2217, we use soft-attention with the previous hidden state to adaptively project our vector z \u2208 R^{d'}\ndown to d\u2217 (Bahdanau et al., 2015).\nOur reparametrization must approximately allow us to go back (forward estimation) and forth (backward\nestimation) between a sequence of tokens, (x_1, ..., x_T), and a point z in this reparametrized\nspace Z via the language model. We need back-and-forth reparametrization to measure recoverability.\nOnce this back-and-forth property is established, we can inspect a set of points in Z instead of trajectories\nin H. A vector z \u2208 Z resembles the output of an encoder acting as context for a conditional\ngeneration task. This makes analysis in Z resemble analyses of context on sequence models and thus\nhelps us better understand the unconditional language model that we are trying to condition with z.\nWe expect that our reparametrization will allow us to approximately go back and forth between a\nsequence and its corresponding point z \u2208 Z because we expect z to contain all of the information\nof the sequence. 
Since we add z at every time step, the information preserved in z will not\ndegrade as quickly while the sequence is processed as it could if we added z only to the initial hidden\nand cell states. While there are other similar ways to integrate z, we choose to modify the recurrent\nconnection.\nUsing the Sentence Space In this paper, we describe the reparametrized sentence space Z of a\nlanguage model as a set of d'-dimensional vectors that correspond to a set D' of sentences that were\nnot used in training the underlying language model. This use of unseen sentences helps us understand\nthe sentence space of a language model in terms of generalization rather than memorization, providing\ninsight into the potential of using a pretrained language model as a fixed decoder/generator. Using\nour reparametrized sentence space framework, evaluation techniques designed for investigating word\nvectors become applicable. One such technique that becomes available is interpolation\nbetween different sentences in our reparameterized sentence space (Table 1 in Choi et al., 2017;\nBowman et al., 2016), but we do not explore this here.\n\nForward Estimation X \u2192 Z The goal of forward estimation is to find a point z \u2208 Z that represents\na sentence (x_1, ..., x_T) \u2208 X via the trained language model (i.e., fixed \u03b8). When the dimension of z\nis smaller than the model dimension d\u2217, we use a random projection matrix to project it up to d\u2217 and\nwhen the dimension of z is greater than the model dimension, we use soft attention to project it down\nto d\u2217. We modify the recurrent dynamics f_\u03b8 in Eq. 
(2) to be:\n\nh_{t-1} = f_\\theta(h_{t-2} + z', x_{t-1})    (4)\n\nz' = \\begin{cases} W_z z, & \\text{if } \\dim(z) \\le d^* \\\\ \\mathrm{softmax}(h_{t-2}^\\top Z) Z^\\top, & \\text{if } \\dim(z) > d^* \\end{cases}    (5)\n\nwhere Z \u2208 R^{d\u00d7k} is just the unflattened matrix of z consisting of k = dim(z)/d vectors of dimension\nd. We initialize the hidden state by h_0 = z'. W_z \u2208 R^{d\u00d7d'} is a random matrix with L2-normalized\nrows, following Li et al. (2018), and is an identity matrix when d = d': W_z = [w_z^1; ...; w_z^{d'}], where\nw_z^l = \u03b5^l / ||\u03b5^l||_2 and \u03b5^l \u223c N(0, 1^2) \u2208 R^d. We then estimate z by maximizing the log-probability of\nthe given sentence under this modified model, while fixing the original parameters \u03b8:\n\n\\hat{z} = \\arg\\max_{z \\in Z} \\sum_{t=1}^{T} \\log p(x_t \\mid x_{<t}, z)    (6)\n\nWe represent the entire sentence (x_1, ..., x_T) in a single z. To solve this optimization problem,\nwe can use any off-the-shelf gradient-based optimization algorithm, such as gradient descent or\nnonlinear conjugate gradient. This objective function is highly non-convex, potentially leading to\nmultiple approximately optimal z's. As a result, to estimate z in forward estimation, we use nonlinear\nconjugate gradient (Wright and Nocedal, 1999) implemented in SciPy (Jones et al., 2014) with a\nlimit of 10,000 iterations, although almost all runs converge much more quickly. Our experiments,\nhowever, reveal that many of these z's lead to similar performance in recovering the original sentence.\nBackward Estimation Z \u2192 X Backward estimation, an instance of sequence decoding, aims at\nrecovering the original sentence (x_1, ..., x_T) given a point z in the reparametrized sentence space Z,\nwhich we refer to as recovery. We use the same objective function as in Eq. (6), but we optimize over\n(x_1, ..., x_T) rather than over z. 
Unlike forward estimation, backward estimation is a combinatorial\noptimization problem and cannot be solved easily with a recurrent language model (Cho, 2016; Chen\net al., 2018). To circumvent this, we use beam search, which is a standard approach in conditional\nlanguage modeling applications such as machine translation. Our backward estimation procedure\ndoes not assume a true length when decoding with beam search: we stop when an end-of-sentence token\nis generated or 100 tokens are reached.\n\n2.3 Analyzing the Sentence Space through Recoverability\n\nUnder this formulation, we can investigate various properties of the sentence space of the underlying\nmodel. As a first step toward understanding the sentence space of a language model, we propose three\nround-trip recoverability metrics and describe how we use them to characterize the sentence space.\nRecoverability Recoverability measures how much information about the original sentence x =\n(x_1, ..., x_T) \u2208 X is preserved in the reparameterized sentence space Z. We measure this by\nreconstructing the original sentence x. First, we forward-estimate the sentence vector z \u2208 Z from\nx \u2208 X by Eq. (6). Then, we reconstruct the sentence \\hat{x} from the estimated z via backward estimation.\nTo evaluate the quality of reconstruction, we compare the original and reconstructed sentences, x and\n\\hat{x}, using the following three metrics:\n\n1. Exact Match (EM): \\sum_{t=1}^{T} I(x_t = \\hat{x}_t) / T\n2. BLEU (Papineni et al., 2002)\n3. Prefix Match (PM): \\arg\\max_t EM(x_{\\le t} = \\hat{x}_{\\le t}) / T\n\nExact match gives information about the possibility of perfect recoverability. BLEU provides us with\na smoother approximation to this, in which the hypothesis gets some reward for n-gram overlap, even\nif slightly inexact. Since BLEU is 0 for sentences with fewer than 4 tokens, we smooth these by only\nconsidering n-grams up to the sentence length if sentence length is less than 4. 
Prefix match measures\nthe longest consecutive sequence of tokens that are perfectly recovered from the beginning of the\nsentence and we divide this by the sentence length. We use prefix match because early experiments\nshow a very strong left-to-right falloff in quality of generation. In other words, candidate generations\nare better for shorter sentences and once an incorrect token is generated, future tokens are extremely\nunlikely to be correct. We compute each metric for each sentence x \u2208 D' by averaging over multiple\noptimization runs. We show exact match (EM) in the equations, but we do the same for BLEU and\nPrefix Match. To counter the effect of non-convex optimization in Eq. (6), these runs vary by the\ninitialization of z and the random projection matrix W_z in Eq. (4). That is,\n\nEM(x, \\theta) = \\mathbb{E}_{z_0 \\in Z} [ \\mathbb{E}_{W_z \\in R^{d \\times d'}} [ EM(x, \\hat{x}) ] ]\n\nEffective Dimension by Recoverability These recoverability measures allow us to investigate the\nunderlying properties of the proposed sentence space of a language model. If all sentences can be\nprojected into a d-dimensional sentence space Z and recovered perfectly, the effective dimension\nof Z must be no greater than d. In this paper, when analyzing the effective dimension of a sentence\nspace of a language model, we focus on the effective dimension given a target recoverability \u03c4:\n\n\\hat{d}'(\\theta, \\tau) = \\min \\{ d' \\mid EM(D', \\theta) > \\tau \\}    (7)\n\nwhere EM(D', \\theta) = \\frac{1}{|D'|} \\sum_{x \\in D'} EM(x, \\theta). In other words, given a trained model (\u03b8), we find the\nsmallest effective dimension d' (the dimension of Z) that satisfies the target recoverability (\u03c4). Using\nthis, we can answer questions like what is the minimum dimension d' needed to achieve recoverability\n\u03c4 under the model \u03b8. Using this, the unconstrained effective dimension, i.e. the smallest dimension\nthat satisfies the best possible recoverability, is:\n\n\\hat{d}'(\\theta) = \\arg\\min_{d' \\in \\{1, \\ldots, d\\}} \\max \\frac{1}{|D'|} \\sum_{x \\in D'} EM(x, \\theta)    (8)\n\nWe approximate the effective dimension by inspecting various values of d' on a logarithmic scale:\nd' = 128, 256, 512, . . . , 32768. Since our forward estimation process uses non-convex optimization\nand our backward estimation process uses beam search, our effective dimension estimates are\nupper-bound approximations.\n\n3 Experimental Setup\n\nCorpus We use the fifth edition of the English Gigaword (Graff et al., 2003) news corpus. Our\nprimary model is trained on 50M sentences from this corpus, and analysis experiments additionally\ninclude a weaker model trained on a subset of only 10M. Our training sentences are drawn from\narticles published before November 2010. We use a development set with 879k sentences from the\narticles published in November 2010 and a test set of 878k sentences from the articles published\nin December 2010. We lowercase the entire corpus, segment each article into sentences using\nNLTK (Bird and Loper, 2004), and tokenize each sentence using the Moses tokenizer (Koehn et al.,\n2007). We further segment the tokens using byte-pair encoding (BPE; following Sennrich et al.,\n2016) with 20,000 merges to obtain a vocabulary of 20,234 subword tokens. To evaluate out-of-domain\nsentence recoverability, we use a random sample of 50 sentences from the IWSLT16 English\nto German translation dataset (validation portion) processed in the same way and using the same\nvocabulary.\nRecurrent Language Models The proposed framework is agnostic to the underlying architecture\nof a language model. We choose a 2-layer language model with LSTM units (Graves, 2013). 
We\nconstruct a small, medium, and large language model consisting of 256, 512, and 1024 LSTM\nunits respectively in each layer. The input and output embedding matrices of 256, 512, and 1024-dimensional\nvectors respectively are shared (Press and Wolf, 2017). We use dropout (Srivastava et al.,\n2014) between the two recurrent layers and before the final linear layer with a drop rate of 0.1, 0.25,\nand 0.3 respectively. We use stochastic gradient descent with Adam with a learning rate of 10^{-4} on\n100-sentence minibatches (Kingma and Ba, 2014), where sentences have a maximum length of 100.\nWe measure perplexity on the development set every 10k minibatches, halve the learning rate\nwhenever perplexity increases, and clip the norm of the gradient to 1 (Pascanu et al., 2013). For each training\nset (10M and 50M), we train for only one epoch. Because of the large size of the training sets, these\nmodels nonetheless achieve a good fit to the underlying distribution (Table 1).\n\nTable 1: Language modeling perplexities on English Gigaword for the models under study.\n\nModel    d      |Train| = 10M           |Train| = 50M\n                Dev Ppl.   Test Ppl.    Dev Ppl.   Test Ppl.\nSMALL    256    122.9      125.2        77.2       79.2\nMEDIUM   512    89.6       91.3         62.1       63.5\nLARGE    1024   65.9       67.7         47.4       48.9\n\nReparametrized Sentence Spaces We use a set D' of 100 randomly selected sentences from the\ndevelopment set in our analysis. We set z to have 128, 256, 512, 1024, 2048, 4096, 8192, 16384\nand 32768 dimensions for each language model and measure its recoverability. For each sentence\nwe have ten random initializations. When the dimension d' of the reparametrized sentence space is\nsmaller than the model dimension, we construct ten random projection matrices that are sampled\nonce and fixed throughout the optimization procedure. 
We perform beam search with beam width 5.\n\n4 Results and Analysis\n\nRecoverability Results In Figure 2, we present the recoverability results of our experiments relative\nto sentence length using the three language models trained on 50M sentences. We observe that\nthe recoverability increases as d' increases until d' = d\u2217. After this point, recoverability plateaus.\nFor a single model, recoverability under the three metrics is strongly positively correlated. We also observe\nthat recoverability is nearly perfect for the large model when d' = 4096, achieving EM \u2265 99, and\nvery high for the medium model when d' \u2265 2048, achieving EM \u2265 84.\nWe find that recoverability increases for a specific d' as the language model is trained, although we\ncannot present the result due to space constraints. The figure corresponding to Figure 2 for the 10M\nsetting and tables for both of the settings detailing overall performance are provided in the appendix.\nAll these estimates have high confidence (small standard deviations).\nEffective Dimension of the Sentence Space From Figure 2, the large model\u2019s unconstrained effective\ndimension is d\u2217 = 4096 with a slight degradation in recoverability when increasing d' beyond d\u2217.\nFor the medium model, we notice that its unconstrained effective dimension is also d\u2217 = 2048 with\nno real recoverability improvements when increasing d' beyond d\u2217. For the small model, however, its\nunconstrained effective dimension is 8192, which is much greater than d\u2217 = 1024.\nWhen d' = 4096, we can recover any sentence nearly perfectly, and for long sentences, the large\nmodel with d' \u2265 4096 achieves recoverability estimates \u03c4 \u2265 0.8. For other model sizes and other\ndimensions of the reparametrized space, we fail to perfectly recover some sentences. 
To ascertain\nwhich sentences we fail to recover, we look at the shapes of each curve. We observe that the vast\nmajority of these curves never increase, indicating recoverability and sentence length have a strong\nnegative correlation. Most curves decrease to 0 as sentence length exceeds 30 indicating that longer\nsentences are more dif\ufb01cult to recover. Earlier observations in using neural sequence-to-sequence\nmodels for machine translation concluded exactly this (Cho et al., 2014; Koehn and Knowles, 2017).\nThis suggests that a \ufb01xed-length representation lacks the capacity to represent a complex sentence\nand could sacri\ufb01ce important information in order to encode others. The degradation in recoverability\nalso implies that the unconstrained effective dimension of the sentence space could be strongly related\nto the length of the sentence and may not be related to the model dimension d\u2217. The fact that the\nsmaller model has an unconstrained effective dimension much larger than d\u2217 supports this claim.\nImpact of Beam Width & Optimization Strategy To analyze the impact of various beam widths,\nwe experimented with beam widths of 5, 10, and 20 in decoding. We \ufb01nd that results are consistent\nacross these beam widths. As a result, all experimental results in this paper other than this one use a\nbeam width of 5. 
We provide a representative partial table of sentence recoverability varying just\nbeam width during decoding in Table 2.\nTo understand the importance of the choice of optimizer, we experimented with using Adam with\na learning rate of 10^{-4} with default settings on our best performing settings for each model size.\nWe find that using Adam results in recovery estimates that do not exceed 1.0 BLEU in any of the three\nsettings, hinting at the highly non-convex nature of the optimization problem.\n\n[Figure 2 panels: columns Small Model (256d), Medium Model (512d), Large Model (1024d); rows Exact Match, BLEU, Prefix Match; x-axis Sentence Length.]\n\nFigure 2: Plots of the three recoverability metrics with respect to varying sentence lengths for each\nof our three model sizes for the 50M sentence setting. Within each plot, the curves correspond to the\nvarying dimensions of z including error regions of \u00b1\u03c3. Regardless of metric, recoverability improves\nas the size and quality of the language model and dimension of the reparametrized sentence space\nincrease. The corresponding plot for the 10M sentence setting is in the appendix.\nSources of Randomness There are two points of stochasticity in the proposed framework: the\nnon-convexity of the optimization procedure in forward estimation (Eq. 6) and the sampling of a\nrandom projection matrix W_z. However, based on the small standard deviations in Figure 2, these\nhave minimal impact on recoverability. Also, the observation of high confidence (low-variance)\nupper-bound estimates for recoverability supports the usability of our recoverability metrics for\ninvestigating a language model\u2019s sentence space.\nOut-of-Domain Recoverability To study how well our pretrained language models can recover\nsentences out-of-domain, we evaluate recoverability on our IWSLT data. 
IWSLT is comprised of\nTED talk transcripts, a style very different from the news corpora our language models were trained\non. The left and center graphs in Figure 3 show that recovery performance measured in BLEU is\nnearly perfect even for out-of-domain sentences for both the medium and large models when d' \u2265 d\u2217,\nfollowing trends from the experiments on English Gigaword from Figure 2.\n\nTable 2: Recoverability (BLEU) varying beam width on English Gigaword\n\n|Z| Width=5 Width=10 Width=20\nModel\n40.5\n512\nSMALL; 50M\n8192\n79.6\nSMALL; 50M\nMEDIUM; 50M\n512\n42.3\n89.8\nMEDIUM; 50M 16384\n53.8\n512\nLARGE; 50M\nLARGE; 50M\n4096\n99.5\n\n40.0\n81.1\n41.1\n92.4\n54.8\n99.8\n\nBLEU\n\n40.3\n79.8\n41.1\n91.9\n54.1\n99.8\n\n[Figure 3 panels: Medium (512d) - IWSLT, Large (1024d) - IWSLT, Large (1024d) - Random; y-axis BLEU.]\n\nFigure 3: Recoverability (BLEU) on IWSLT for medium (left) and large models (center) and on the\nrandom data for the large model (right).\n\nMore than just Memorization Near-perfect performance on out-of-domain sentences indicates that\nthis methodology could either be learning important properties of language by leveraging the language\nmodel, which helps generalization, or just be memorizing any arbitrary sequence without using the\nlanguage model at all. To investigate this, we randomly sample 50 sentences of varying lengths where\neach token is sampled randomly with equal probability with replacement from the vocabulary. The\nright graph in Figure 3 shows BLEU recovery for the large model. 
Many of the shorter sequences\ncan be recovered well, but for sequences longer than 25 subword units, recoverability drops quickly.\nThis experiment shows that memorization cannot fully explain results on Gigaword or IWSLT16.\nTowards a General-Purpose Decoder In this formulation, our vector z' can be considered as\ntrainable context used to condition our unconditioned language models to generate arbitrary sentences.\nSince we find that well-trained language models of reasonable size have an unconstrained effective\ndimension with high recoverability that is approximately their model dimension on both in-domain and\nout-of-domain sequences, unconditional language models are able to utilize our context z' effectively.\nFurther experiments confirm that our context vectors do not simply memorize arbitrary sequences, but\nleverage the language model to generate well-formed sequences. As a result, such a model could be\nused as a task-independent decoder given an encoder with the ability to generate an optimal context\nvector z'.\nWe observe that recoverability is not perfect for either the small or the medium model, falling off\ndramatically for longer sentences, indicating that the minimum model size for high recoverability is\nfairly large. Since the sentence length distribution is a Zipf distribution (heavily right-tailed), if we\ncan increase the recoverability degradation cutoff point, the number of sentences we fail to recover\nperfectly would decrease exponentially. However, since we find that larger and better-trained models\ncan exhibit near perfect recoverability for both in-domain and out-of-domain sequences and can more\neasily utilize our conditioning strategy, we think that this may only be a concern for lower capacity\nmodels. Our methodology could use a regularization mechanism to smooth the implicit sentence\nspace. 
This may improve recoverability and reduce the unconstrained effective dimension, thereby\nincreasing the applicability of an unconditional language model as a general-purpose decoder.\n\n5 Related Work\n\nLatent Variable Recurrent Language Models The way we describe the sentence space of a language\nmodel can be thought of as performing inference over an implicit latent variable z using a fixed\ndecoder \u03b8. This resembles prior work on sparse coding (Olshausen and Field, 1997) and generative\nlatent optimization (Bojanowski et al., 2018). Under this lens, it also relates to work on training\nlatent variable language models, such as models based on variational autoencoders by Bowman et al.\n(2016) and sequence generative adversarial networks by Yu et al. (2017). The goal of identifying the\nsmallest dimension of the sentence space for a specific target recoverability resembles work looking\nat continuous bag-of-words representations by Mu et al. (2017). Our approach differs from these\napproaches in that we focus entirely on analyzing a fixed model that was trained unconditionally. Our\nformulation of the sentence space is also more general, and potentially applies to all of these models.\nPretrained Recurrent Language Models Pretrained or separately trained language models have\nlargely been used in two contexts: as a feature extractor for downstream tasks and as a scoring\nfunction for a task-specific decoder (Gulcehre et al., 2015; Li et al., 2016; Sriram et al., 2018). None\nof the above analyzes how a pretrained model represents sentences or investigates the potential of\nusing a language model as a decoder. The work by Zoph et al. 
(2016) transfers a pretrained language\nmodel, as a part of a neural machine translation system, to another language pair and fine-tunes it.\nThe positive result here is specific to machine translation as a downstream task, unlike the proposed\nframework, which is general and independent of the downstream task. Recently, there has been more\nwork in pretraining the decoder using BERT (Devlin et al., 2018) for neural machine translation and\nabstractive summarization (Edunov et al., 2019; Lample and Conneau, 2019; Song et al., 2019).\n\n6 Conclusion\n\nTo answer whether unconditional language models can be conditioned to generate held-out sentences,\nwe introduce the concept of the reparametrized sentence space for a frozen, pretrained language\nmodel, in which each sentence is represented as a point vector, which is added as a bias and optimized\nto reproduce that sentence during decoding. We design optimization-based forward estimation and\nbeam-search-based backward estimation procedures, allowing us to map a sentence to and from the\nreparametrized space. We then introduce and use recoverability metrics that allow us to measure the\neffective dimension of the reparametrized space and to discover the degree to which sentences can be\nrecovered from fixed-sized representations by the model without further training.\nWe observe that we can indeed condition our unconditional language models to generate held-out\nsentences both in-domain and out-of-domain: our large model achieves near-perfect recoverability on both\nin-domain and out-of-domain sequences with d\u2032 = 8192 across all metrics. Furthermore, we find that\nrecoverability increases with the dimension of the reparametrized space until it reaches the model\ndimension, after which it plateaus for well-trained, sufficiently large (d \u2265 512) models.\nThese experiments reveal two properties of the sentence space of a language model. 
First, recoverability\nimproves with the size and quality of the language model and is nearly perfect when the dimension\nof the reparametrized space equals that of the model. Second, recoverability is negatively correlated\nwith sentence length, i.e., recovery is more difficult for longer sentences. Our recoverability-based\napproach for analyzing the sentence space gives conservative estimates (upper bounds) of the\neffective dimension of the space and lower bounds for the associated recoverabilities.\nWe see three avenues for further work. Measuring the relationship between regularization (encouraging\nthe reparametrized sentence space to be of a certain form) and non-linearity would be valuable.\nIn addition, although our framework is downstream task- and network architecture-independent, we\nwant to compare recoverability and downstream task performance and analyze the sentence space of\ndifferent architectures of language models. We also want to utilize this framework to convert encoder\nrepresentations for use in a data- and memory-efficient conditional generation model.\n\nAcknowledgments\n\nThis work was supported by Samsung Electronics (Improving Deep Learning using Latent Structure).\nWe gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan V GPU\nused at NYU for this research.\n\nReferences\n\nDzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.\n\nSteven Bird and Edward Loper. 2004. NLTK: the natural language toolkit. In ACL.\n\nPiotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. 2018. Optimizing the latent space of generative networks. In ICML.\n\nSamuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. CoNLL 2016.\n\nYun Chen, Victor OK Li, Kyunghyun Cho, and Samuel Bowman. 2018. 
A stable and effective learning strategy for trainable greedy decoding. In EMNLP.\n\nKyunghyun Cho. 2016. Noisy parallel approximate decoding for conditional recurrent language model. arXiv preprint arXiv:1605.03835.\n\nKyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder\u2013decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.\n\nHeeyoul Choi, Kyunghyun Cho, and Yoshua Bengio. 2017. Context-dependent word representation for neural machine translation. Computer Speech & Language.\n\nAndrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In NIPS.\n\nJacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR.\n\nLi Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.\n\nSergey Edunov, Alexei Baevski, and Michael Auli. 2019. Pre-trained language model representations for language generation. arXiv preprint arXiv:1903.09722.\n\nDavid Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia.\n\nAlex Graves. 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.\n\nAlex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.\n\nCaglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.\n\nSepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long short-term memory. 
Neural computation.\n\nEric Jones, Travis Oliphant, Pearu Peterson, et al. 2014. SciPy: Open source scientific tools for Python, 2014. http://www.scipy.org.\n\nDiederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.\n\nPhilipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In ACL.\n\nPhilipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. ACL 2017.\n\nGuillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.\n\nChunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. 2018. Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838.\n\nJiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In NAACL.\n\nTom\u00e1\u0161 Mikolov, Martin Karafi\u00e1t, Luk\u00e1\u0161 Burget, Jan \u010cernock\u00fd, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.\n\nJiaqi Mu, Suma Bhat, and Pramod Viswanath. 2017. Representing sentences as low-rank subspaces. In ACL.\n\nBruno A Olshausen and David J Field. 1997. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision research.\n\nKishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL.\n\nRazvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In ICML.\n\nMatthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. 
Semi-supervised sequence tagging with bidirectional language models. In ACL.\n\nMatthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke S. Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT.\n\nOfir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In ACL.\n\nAlec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Unpublished ms. available through a link at https://blog.openai.com/language-unsupervised/.\n\nSebastian Ruder and Jeremy Howard. 2018. Universal language model fine-tuning for text classification. In ACL.\n\nRico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.\n\nKaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. Mass: Masked sequence to sequence pre-training for language generation. In ICML.\n\nAnuroop Sriram, Heewoo Jun, Sanjeev Satheesh, and Adam Coates. 2018. Cold fusion: Training seq2seq models together with language models. In Interspeech.\n\nNitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. JMLR.\n\nIlya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.\n\nStephen Wright and Jorge Nocedal. 1999. Numerical optimization. Springer Science.\n\nZhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.\n\nLantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI.\n\nBarret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. 
Transfer learning for low-resource neural machine translation. In EMNLP.\n", "award": [], "sourceid": 8753, "authors": [{"given_name": "Nishant", "family_name": "Subramani", "institution": "AI Foundation"}, {"given_name": "Samuel", "family_name": "Bowman", "institution": "New York University"}, {"given_name": "Kyunghyun", "family_name": "Cho", "institution": "New York University"}]}