{"title": "GLoMo: Unsupervised Learning of Transferable Relational Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 8950, "page_last": 8961, "abstract": "Modern deep transfer learning approaches have mainly focused on learning generic feature vectors from one task that are transferable to other tasks, such as word embeddings in language and pretrained convolutional features in vision. However, these approaches usually transfer unary features and largely ignore more structured graphical representations. This work explores the possibility of learning generic latent relational graphs that capture dependencies between pairs of data units (e.g., words or pixels) from large-scale unlabeled data and transferring the graphs to downstream tasks. Our proposed transfer learning framework improves performance on various tasks including question answering, natural language inference, sentiment analysis, and image classification. We also show that the learned graphs are generic enough to be transferred to different embeddings on which the graphs have not been trained (including GloVe embeddings, ELMo embeddings, and task-specific RNN hidden units), or embedding-free units such as image pixels.", "full_text": "GLoMo: Unsupervised Learning of\n\nTransferable Relational Graphs\n\nZhilin Yang\u22171, Jake (Junbo) Zhao\u221723, Bhuwan Dhingra1\n\nKaiming He3, William W. Cohen1, Ruslan Salakhutdinov1, Yann LeCun23\n\n\u2217Equal contribution\n\n1Carnegie Mellon University, 2New York University, 3Facebook AI Research\n\n{zhiliny,bdhingra,wcohen,rsalakhu}@cs.cmu.edu\n{jakezhao,yann}@cs.nyu.com, kaiminghe@fb.com\n\nAbstract\n\nModern deep transfer learning approaches have mainly focused on learning generic\nfeature vectors from one task that are transferable to other tasks, such as word\nembeddings in language and pretrained convolutional features in vision. 
However,\nthese approaches usually transfer unary features and largely ignore more structured\ngraphical representations. This work explores the possibility of learning generic\nlatent relational graphs that capture dependencies between pairs of data units (e.g.,\nwords or pixels) from large-scale unlabeled data and transferring the graphs to\ndownstream tasks. Our proposed transfer learning framework improves perfor-\nmance on various tasks including question answering, natural language inference,\nsentiment analysis, and image classi\ufb01cation. We also show that the learned graphs\nare generic enough to be transferred to different embeddings on which the graphs\nhave not been trained (including GloVe embeddings, ELMo embeddings, and\ntask-speci\ufb01c RNN hidden units), or embedding-free units such as image pixels.\n\n1\n\nIntroduction\n\nRecent advances in deep learning have largely relied on building blocks such as convolutional\nnetworks (CNNs) [19] and recurrent networks (RNNs) [14] augmented with attention mechanisms\n[1]. While possessing high representational capacity, these architectures primarily operate on grid-like\nor sequential structures due to their built-in \u201cinnate priors\u201d. As a result, CNNs and RNNs largely rely\non high expressiveness to model complex structural phenomena, compensating the fact that they do\nnot explicitly leverage structural, graphical representations.\nThis paradigm has led to a standardized norm in transfer learning and pretraining\u2014\ufb01tting an ex-\npressive function on a large dataset with or without supervision, and then applying the function to\ndownstream task data for feature extraction. Notable examples include pretrained ImageNet features\n[13] and pretrained word embeddings [24, 29].\nIn contrast, a variety of real-world data exhibit much richer relational graph structures than the simple\ngrid-like or sequential structures. This is also emphasized by a parallel work [3]. 
For example in\nthe language domain, linguists use parse trees to represent syntactic dependency between words;\ninformation retrieval systems exploit knowledge graphs to re\ufb02ect entity relations; and coreference\nresolution is devised to connect different expressions of the same entity. As such, these exempli\ufb01ed\nstructures are universally present in almost any natural language data regardless of the target tasks,\nwhich suggests the possibility of transfer across tasks. These observations also generalize to other\ndomains such as vision, where modeling the relations between pixels is proven useful [28, 50, 44].\nOne obstacle remaining, however, is that many of the universal structures are essentially human-\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Traditional transfer learning versus our new transfer learning framework. Instead of transferring\nfeatures, we transfer the graphs output by a network. The graphs are multiplied by task-speci\ufb01c features (e.g.\nembeddings or hidden states) to produce structure-aware features.\ncurated and expensive to acquire on a large scale, while automatically-induced structures are mostly\nlimited to one task [17, 42, 44].\nIn this paper, we attempt to address two challenges: 1) to break away from the standardized norm of\nfeature-based deep transfer learning1, and 2) to learn versatile structures in the data with a data-driven\napproach. In particular, we are interested in learning transferable latent relational graphs, where\nthe nodes of a latent graph are the input units, e.g., all the words in a sentence. The task of latent\nrelational graph learning is to learn an af\ufb01nity matrix where the weights (possibly zero) capture the\ndependencies between any pair of input units.\nTo achieve the above goals, we propose a novel framework of unsupervised latent graph learning\ncalled GLoMo (Graphs from LOw-level unit MOdeling). 
Speci\ufb01cally, we train a neural network\nfrom large-scale unlabeled data to output latent graphs, and transfer the network to extract graph\nstructures on downstream tasks to augment their training. This approach allows us to separate the\nfeatures that represent the semantic meaning of each unit and the graphs that re\ufb02ect how the units\nmay interact. Ideally, the graphs capture task-independent structures underlying the data, and thus\nbecome applicable to different sets of features. Figure 1 highlights the difference between traditional\nfeature-based transfer learning and our new framework.\nExperimental results show that GLoMo improves performance on various language tasks including\nquestion answering, natural language inference, and sentiment analysis. We also demonstrate\nthat the learned graphs are generic enough to work with various sets of features on which the\ngraphs have not been trained, including GloVe embeddings [29], ELMo embeddings [30], and task-\nspeci\ufb01c RNN states. We also identify key factors of learning successful generic graphs: decoupling\ngraphs and features, hierarchical graph representations, sparsity, unit-level objectives, and sequence\nprediction. To demonstrate the generality of our framework, we further show improved results on\nimage classi\ufb01cation by applying GLoMo to model the relational dependencies between the pixels.\n\n2 Unsupervised Relational Graph Learning\n\nWe propose a framework for unsupervised latent graph learning. Given a one-dimensional input\nx = (x1,\u00b7\u00b7\u00b7 , xT ), where each xt denotes an input unit at position t and T is the length of the\nsequence, the goal of latent graph learning is to learn a (T \u00d7 T ) af\ufb01nity matrix G such that each entry\nGij captures the dependency between the unit xi and the unit xj. The af\ufb01nity matrix is asymmetric,\nrepresenting a directed weighted graph. 
In particular, in this work we consider the case where each column of the affinity matrix sums to one, for computational convenience. In the following text, with a little abuse of notation, we use G to denote a set of affinity matrices. We use the terms "affinity matrices" and "graphs" interchangeably.

During the unsupervised learning phase, our framework trains two networks, a graph predictor network g and a feature predictor network f. Given the input x, our graph predictor g produces a set of graphs G = g(x). The graphs G are represented as a 3-d tensor in R^{L×T×T}, where L is the number of layers that produce graphs.¹ For each layer l, the last two dimensions G^l define a (T × T) affinity matrix that captures the dependencies between any pair of input units. The feature predictor network f then takes the graphs G and the original input x to perform a predictive task.

During the transfer phase, given an input x′ from a downstream task, we use the graph predictor g to extract graphs G = g(x′). The extracted graphs G are then fed as the input to the downstream task network to augment training. Specifically, we multiply G with task-specific features such as input embeddings and hidden states (see Figure 1). The network f is discarded during the transfer phase.

¹Throughout the paper, we use "feature" to refer to unary feature representations, and use "graph" to refer to structural, graphical representations.

Next, we will introduce the network architectures and objective functions for unsupervised learning, followed by the transfer procedure. 
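Concretely, the two phases can be sketched in a few lines of numpy. This is only a shape-level illustration under stated assumptions: the affinity matrices here are random non-negative scores normalized column-wise, standing in as a placeholder for the key/query CNN parameterization given in Section 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, L = 6, 8, 2  # sequence length, feature dimension, number of graph layers

# Stand-in for g(x): L affinity matrices; each column normalized to sum to one.
scores = rng.random(size=(L, T, T))             # non-negative placeholder scores
G = scores / scores.sum(axis=1, keepdims=True)  # (L, T, T), columns are distributions

# Transfer phase: multiply a graph with task-specific features H
# (e.g. GloVe or ELMo embeddings) to obtain structure-aware features.
H = rng.normal(size=(T, d))
HM = G[0].T @ H  # position t aggregates feature j with weight G[0][j, t]

assert np.allclose(G.sum(axis=1), 1.0)  # each column sums to one
assert HM.shape == (T, d)
```

The downstream model then consumes HM alongside H; the feature predictor f plays no role at transfer time.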
An overview of our framework is illustrated in Figure 2.

2.1 Unsupervised Learning

Graph Predictor  The graph predictor g is instantiated as two multi-layer CNNs, a key CNN and a query CNN. Given the input x, the key CNN outputs a sequence of convolutional features (k_1, ..., k_T) and the query CNN similarly outputs (q_1, ..., q_T). At layer l, based on these convolutional features, we compute the graphs as

$$G^l_{ij} = \frac{\left(\mathrm{ReLU}(k_i^{l\top} q_j^l + b)\right)^2}{\sum_{i'} \left(\mathrm{ReLU}(k_{i'}^{l\top} q_j^l + b)\right)^2} \qquad (1)$$

where k_i^l = W_k^l k_i and q_j^l = W_q^l q_j. The matrices W_k^l and W_q^l are model parameters at layer l, and the bias b is a scalar parameter. This resembles computing the attention weights [1] from position j to position i, except that the exponential activation in the softmax function is replaced with a squared ReLU operation: we use ReLUs to enforce sparsity and the square operations to stabilize training. Moreover, we employ convolutional networks to let the graphs G be aware of the local order of the input and context, up to the size of each unit's receptive field.

Feature Predictor  Now we introduce the feature predictor f. At each layer l, the input to the feature predictor f is a sequence of features F^{l-1} = (f_1^{l-1}, ..., f_T^{l-1}) and an affinity matrix G^l extracted by the graph predictor g. The zero-th layer features F^0 are initialized to be the embeddings of x. The
The\naf\ufb01nity matrix Gl is then combined with the current features to compute the next-layer features at\neach position t,\n\n,\u00b7\u00b7\u00b7 , f l\u22121\n\n1\n\nt\n\nf l\nt = v(\n\njtf l\u22121\nGl\n\nj\n\n, f l\u22121\n\nt\n\n)\n\n(2)\n\n(cid:88)\n\nj\n\nwhere v is a compositional function such as a GRU cell [8] or a linear layer with residual connections.\nIn other words, the feature at each position is computed as a weighted sum of other features, where\nthe weights are determined by the graph Gl, followed by transformation function v.\nObjective Function At the top layer, we obtain the features FL. At each position t, we use the\nfeature f L\nto initialize the hidden states of an RNN decoder, and employ the decoder to predict\nt\nthe units following xt. Speci\ufb01cally, the RNN decoder maximizes the conditional log probability\nlog P (xt+1,\u00b7\u00b7\u00b7 , xt+D|xt, f l\nt ) using an auto-regressive factorization as in standard language modeling\n[47] (also see Figure 2). Here D is a hyper-parameter called the context length. The overall objective\nis written as the sum of the objectives at all positions t,\n\n(cid:88)\n\nmax\n\nlog P (xt+1,\u00b7\u00b7\u00b7 , xt+D|xt, f L\nt )\n\n(3)\n\nBecause our objective is context prediction, we mask the convolutional \ufb01lters and the graph G (see\nEq. 1) in the network g to prevent the network from accessing the future, following [34].\n\nt\n\n2.1.1 Desiderata\n\nThere are several key desiderata of the above unsupervised learning framework, which also highlight\nthe essential differences between our framework and previous work on self-attention and predictive\nunsupervised learning:\nDecoupling graphs and features Unlike self-attention [42] that fuses the computation of graphs\nand features into one network, we employ separate networks g and f for learning graphs and features\n\n3\n\n\fFigure 2: Overview of our approach GLoMo. 
During the unsupervised learning phase, the feature predictor\nand the graph predictor are jointly trained to perform context prediction. During the transfer phase, the graph\npredictor is frozen and used to extract graphs for the downstream tasks. An RNN decoder is applied to all\npositions in the feature predictor, but we only show the one at position \u201cA\u201d for simplicity. \u201cSelect one\u201d means\nthe graphs can be transferred to any layer in the downstream task model. \u201cFF\u201d refers to feed-forward networks.\nThe graphs output by the graph predictor are used as the weights in the \u201cweighted sum\u201d operation (see Eq. 2).\n\nrespectively. The features represent the semantic meaning of each unit while the graph re\ufb02ects how\nthe units may interact. This increases the transferability of the graphs G because (1) the graph\npredictor g is freed from encoding task-speci\ufb01c non-structural information, and (2) the decoupled\nsetting is closer to our transfer setting, where the graphs and features are also separated.\nSparsity Instead of using Softmax for attention [1], we employ a squared ReLU activation in Eq. (1)\nto enforce sparse connections in the graphs. In fact, most of the linguistically meaningful structures\nare sparse, such as parse trees and coreference links. We believe sparse structures reduce noise and\nare more transferable.\nHierarchical graph representations We learn multiple layers of graphs, which allows us to model\nhierarchical structures in the data.\nUnit-level objectives In Eq. (3), we impose a context prediction objective on each unit xt. An\nalternative is to employ a sequence-level objective such as predicting the next sentence [18] or\ntranslating the input into another language [42]. However, since the weighted sum operation in Eq.\n(2) is permutation invariant, the features in each layer can be randomly shuf\ufb02ed without affecting the\nobjective, which we observed in our preliminary experiments. 
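This failure mode is easy to verify numerically. In the toy check below (a hypothetical random column-normalized matrix stands in for a learned graph), permuting the positions of the features and the graph in lockstep merely permutes the outputs of the weighted sum, so any order-insensitive pooled objective is unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 5, 4
F = rng.normal(size=(T, d))        # layer features
G = rng.random(size=(T, T))
G /= G.sum(axis=0, keepdims=True)  # columns sum to one

def weighted_sum(G, F):
    # Eq. (2)'s mixing step: output at t is sum_j G[j, t] * F[j]
    return G.T @ F

perm = rng.permutation(T)
out = weighted_sum(G, F)
out_shuffled = weighted_sum(G[np.ix_(perm, perm)], F[perm])

assert np.allclose(out[perm], out_shuffled)                      # outputs permuted in lockstep
assert np.allclose(out.mean(axis=0), out_shuffled.mean(axis=0))  # pooled value identical
```

Anchoring the prediction at each position t, as in the unit-level objective of Eq. (3), breaks this symmetry.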
As a result, the induced graph bears no relation to the structures underlying the input x when a sequence-level objective is employed.

Sequence prediction  As opposed to predicting just the immediate next unit [28, 30], we predict the context of length up to D. This gives stronger training signals to the unsupervised learner.

Later in the experimental section, we will demonstrate that all these factors contribute to successful training of our framework.

2.2 Latent Graph Transfer

In this section, we discuss how to transfer the graph predictor g to downstream tasks.

Suppose for a downstream task the model is a deep multi-layer network. Specifically, each layer is denoted as a function h that takes in features H = (h_1, ..., h_T) and possibly additional inputs, and outputs features (h′_1, ..., h′_T). The function h can be instantiated as any neural network component, such as CNNs, RNNs, attention, and feed-forward networks. This setting is general and subsumes the majority of modern neural architectures.

Given an input example x′ from the downstream task, we apply the graph predictor to obtain the graphs G = g(x′). Let Λ^l = ∏_{i=1}^{l} G^i ∈ R^{T×T} denote the product of all affinity matrices from the first layer to the l-th layer. This can be viewed as propagating the connections among multiple layers of graphs, which allows us to model hierarchical structures. We then take a mixture of all the graphs in {G^l}_{l=1}^{L} ∪ {Λ^l}_{l=1}^{L},

$$M = \sum_{l=1}^{L} m_G^l G^l + \sum_{l=1}^{L} m_\Lambda^l \Lambda^l, \quad \text{s.t.} \quad \sum_{l=1}^{L} (m_G^l + m_\Lambda^l) = 1$$

The mixture weights m_G^l and m_Λ^l can be instantiated as Softmax-normalized parameters [30] or can be conditioned on the features H. To transfer the mixed latent graph, we again adopt the weighted sum operation as in Eq. (2). Specifically, we use the weighted sum HM (see Figures 1 and 2), in addition to H, as the input to the function h. This can be viewed as performing attention with weights given by the mixed latent graph M. This setup of latent graph transfer is general and easy to plug in, as the graphs can be applied to any layer in the network architecture, with either learned or pretrained features H, at variable length.

Table 1: Main results on natural language datasets. Self-attention modules are included in all baseline models. All baseline methods are feature-based transfer learning methods, including ELMo and GloVe. Our methods combine graph-based transfer with feature-based transfer. Our graphs operate on various sets of features, including GloVe embeddings, ELMo embeddings, and RNN states. "mism." refers to the "mismatched" setting.

| Transfer method | SQuAD GloVe EM | SQuAD GloVe F1 | SQuAD ELMo EM | SQuAD ELMo F1 | IMDB GloVe Acc. | MNLI GloVe matched | MNLI GloVe mism. |
| transfer feature only (baseline) | 69.33 | 78.73 | 74.75 | 82.95 | 88.51 | 77.40 | 77.14 |
| GLoMo on embeddings | 70.84 | 79.90 | 76.00 | 84.13 | 89.16 | 78.32 | 78.00 |
| GLoMo on RNN states | 71.30 | 80.24 | 76.20 | 83.99 | - | - | - |

2.3 Extensions and Implementation

So far we have introduced a general framework of unsupervised latent graph learning. 
This framework\ncan be extended and implemented in various ways.\nIn our implementation, at position t, in addition to predicting the forward context (xt+1,\u00b7\u00b7\u00b7 , xt+D),\nwe also use a separate network to predict the backward context (xt\u2212D,\u00b7\u00b7\u00b7 , xt\u22121), similar to [30].\nThis allows the graphs G to capture both forward and backward dependencies, as graphs learned\nfrom one direction are masked on future context. Accordingly, during transfer, we mix the graphs\nfrom two directions separately.\nIn the transfer phase, there are different ways of effectively fusing H and HM. In practice, we feed\nthe concatenation of H and a gated output, W1[H; HM](cid:12) \u03c3(W2[H; HM]), to the function h. Here\nW1 and W2 are parameter matrices, \u03c3 denotes the sigmoid function, and (cid:12) denotes element-wise\nmultiplication. We also adopt the multi-head attention [42] to produce multiple graphs per layer. We\nuse a mixture of the graphs from different heads for transfer.\nIt is also possible to extend our framework to 2-d or 3-d data such as images and videos. The\nadaptations needed are to adopt high-dimensional attention [44, 28], and to predict a high-dimensional\ncontext (e.g., predicting a grid of future pixels). As an example, in our experiments, we use these\nadaptations on the task of image classi\ufb01cation.\n\n3 Experiments\n\n3.1 Natural Language Tasks and Setting\n\nQuestion Answering The stanford question answering dataset [31](SQuAD) was recently proposed\nto advance machine reading comprehension. The dataset consists of more than 100,000+ question-\nanswer pairs from 500+ Wikipedia articles. Each question is associated with a corresponding reading\npassage in which the answer to the question can be deduced.\nNatural Language Inference We chose to use the latest Multi-Genre NLI corpus (MNLI) [46].\nThis dataset has collected 433k sentence pairs annotated with textual entailment information. 
It uses the same modeling protocol as the SNLI dataset [4] but covers 10 different genres of both spoken and formal written text. The evaluation in this dataset can be set up to be in-domain (Matched) or cross-domain (Mismatched). We did not include the SNLI data in our training set.

Table 2: Ablation study.

| Method | SQuAD GloVe EM | SQuAD GloVe F1 | SQuAD ELMo EM | SQuAD ELMo F1 | IMDB GloVe Acc. | MNLI GloVe matched | MNLI GloVe mism. |
| GLoMo | 70.84 | 79.90 | 76.00 | 84.13 | 89.16 | 78.32 | 78.00 |
| - decouple | 70.45 | 79.56 | 75.89 | 83.79 | - | - | - |
| - sparse | 70.13 | 79.34 | 75.61 | 83.89 | 88.96 | 78.07 | 77.75 |
| - hierarchical | 69.92 | 79.23 | 75.70 | 83.72 | 88.71 | 77.87 | 77.85 |
| - unit-level | 69.23 | 78.66 | 74.84 | 83.37 | 88.49 | 77.58 | 78.05 |
| - sequence | 69.92 | 79.29 | 75.50 | 83.70 | 88.96 | 78.11 | 77.76 |
| uniform graph | 69.48 | 78.82 | 75.14 | 83.28 | 88.57 | 77.26 | 77.50 |

Sentiment Analysis  We use the movie review dataset collected in [22], with 25,000 training and 25,000 testing samples crawled from IMDB.

Transfer Setting  We preprocessed the Wikipedia dump and obtained a corpus of over 700 million tokens after cleaning HTML tags and removing short paragraphs. We trained the networks g and f on this corpus as discussed in Section 2.1. We used randomly initialized embeddings to train both g and f, while the graphs are tested on other embeddings during transfer. We transfer the graph predictor g to a downstream task to extract graphs, which are then used for supervised training, as introduced in Section 2.2. We experimented with applying the transferred graphs to various sets of features, including GloVe embeddings, ELMo embeddings, and the first RNN layer's output.

3.2 Main results

On SQuAD, we follow the open-sourced implementation [9] except that we dropped weight averaging to rule out ensembling effects. This model employs a self-attention layer following the bi-attention layer, along with multiple layers of RNNs. 
On MNLI, we adopt the open-sourced implementation [5]. Additionally, we add a self-attention layer after the bi-inference component to further model context dependency. For IMDB, our baseline utilizes a feedforward network architecture composed of RNNs, linear layers, and self-attention. Note that the state-of-the-art (SOTA) models on these datasets are [49, 21, 25] respectively. However, these SOTA results often rely on data augmentation [49], semi-supervised learning [25], additional training data (SNLI) [20], or specialized architectures [20]. In this work, we focus on competitive baselines with the general architectures that the SOTA models are based on, so as to test graph transfer performance and exclude independent influence factors.

The main results are reported in Table 1. There are three important messages. First, we have purposely incorporated the self-attention module into all of our baseline models; indeed, having self-attention in the architecture could potentially induce a supervisedly-trained graph, and one may argue that such a graph could replace its unsupervised counterpart. However, as shown in Table 1, augmenting training with unsupervisedly-learned graphs further improves performance. Second, as we adopt pretrained embeddings in all the models, the baselines establish the performance of feature-based transfer. Our results in Table 1 indicate that when combined with feature-based transfer, our graph transfer methods are able to yield further improvement. Third, the learned graphs are generic enough to work with various sets of features, including GloVe embeddings, ELMo embeddings, and RNN output.

3.3 Ablation Study

In addition to comparing graph-based transfer against feature-based transfer, we further conducted a series of ablation studies. Here we mainly target the following components in our framework: decoupling feature and graph networks, sparsity, hierarchical (i.e. 
multiple layers of) graphs, unit-level objectives, and sequence prediction. Respectively, we experimented with coupling the two networks, removing the ReLU activations, using only a single layer of graphs, using a sentence-level Skip-thought objective [18], and reducing the context length to one [30]. As shown in Table 2, all these factors contribute to the better performance of our method, which justifies our desiderata discussed in Section 2.1.1. Additionally, we did a sanity check by replacing the trained graphs with uniformly sampled affinity matrices (similar to [15]) during the transfer phase. This result shows that the learned graphs have played a valuable role for transfer.

Figure 3: Visualization of the graphs on the MNLI dataset. The graph predictor has not been trained on MNLI. The words on the y-axis "attend" to the words on the x-axis; i.e., each row sums to 1. (a) Related to coreference resolution. (b) Attending to objects for modeling long-term dependency. (c) Attending to negative words and predicates. (d) Attending to nouns, verbs, and adjectives for topic modeling.

3.4 Visualization and Analysis

We visualize the latent graphs on the MNLI dataset in Figure 3. We remove irrelevant rows in the affinity matrices to highlight the key patterns. The graph in Figure 3a resembles coreference resolution as "he" is attending to "Gary Bauer". In Figure 3b, the words attend to objects such as "Green Grotto", which allows modeling long-term dependency when a clause exists. In Figure 3c, the words following "not" attend to "not" so that they are aware of the negation; similarly, the predicate "splashed" is attended by the following object and adverbial. Figure 3d possibly demonstrates a way of topic modeling by attending to informative words in the sentence. 
Overall, though seemingly different from human-curated structures such as parse trees, these latent graphs display linguistic meaning to some extent. Also note that the graph predictor has not been trained on MNLI, which suggests the transferability of the latent graphs.

3.5 Vision Task

Image Classification  We also extend the scope of our approach from natural language to the vision domain. Drawing from the natural language graph predictor g(·) leads the unsupervised training phase in the vision domain to a PixelCNN-like setup [27], but with a sequence prediction window of size 3x3 (essentially only predicting the bottom-right quarter under the mask). We leverage the entire ImageNet [11] dataset and have the images resized to 32x32 [27]. In the transfer phase, we chose CIFAR-10 classification as our target task. Similar to the language experiments, we augment H by HM, and obtain the final input through a gating layer. This result is then fed into a ResNet [13] to perform regular supervised training. We experiment with two architectures, ResNet-18 and ResNet-34. As shown in Table 3, GLoMo improves performance over the baselines, which demonstrates that GLoMo as a general framework also generalizes to images.

Figure 4: Visualization. Left: a shark image as the input. Middle: weights of the edges connected with the central pixel, organized into 24 heads (3 layers with 8 heads each). Right: weights of the edges connected with the bottom-right pixel. Note the use of masking.

Table 3: CIFAR-10 classification results. We adopt a 42,000/8,000 train/validation split; once the best model is selected according to the validation error, we directly forward it to the test set without doing any validation-set place-back retraining. We only used horizontal flipping for data augmentation. The results are averaged from 5 rounds of experiments.

| Method / Base-model | ResNet-18 | ResNet-34 |
| baseline | 90.93±0.33 | 91.42±0.17 |
| GLoMo | 91.55±0.23 | 91.70±0.09 |
| ablation: uniform graph | 91.07±0.24 | - |

In the meantime, we display the attention weights obtained from the graph predictor in Figure 4. We can see that g has established connections from key-point pixels while exhibiting some variation across different attention heads. Similar visualization has been reported by [50] lately, in which a vanilla transformer model is exploited for generative adversarial training. Putting these results together, we want to encourage future research to further explore relational long-term dependency in image modeling.

4 Related Work

There is an overwhelming amount of evidence on the success of transferring pre-trained representations across tasks in deep learning. Notable examples in the language domain include transferring word vectors [29, 24, 30] and sentence representations [18, 10]. Similarly, in the image domain it is standard practice to use features learned in a supervised manner on the ImageNet [33, 37] dataset for other downstream prediction tasks [32]. Our approach is complementary to these approaches – instead of transferring features we transfer graphs of dependency patterns between the inputs – and can be combined with these existing transfer learning methods.

Specialized neural network architectures have been developed for different domains which respect high-level intuitions about the dependencies among the data in those domains. Examples include CNNs for images [19], RNNs for text [14], and Graph Neural Networks for graphs [35]. In the language domain, more involved structures have also been exploited to inform neural network architectures, such as phrase and dependency structures [41, 39], sentiment compositionality [38], and coreference [12, 16]. 
[17] combines graph neural networks with VAEs to discover latent graph structures in particle interaction systems. There has also been interest lately in Neural Architecture Search [51, 2, 26, 20], where a class of neural networks is searched over to find the optimal one for a particular task.

Recently, the self-attention module [42] has been proposed which, in principle, is capable of learning arbitrary structures in the data since it models pairwise interactions between the inputs. Originally used for machine translation, it has also been successfully applied to sentence understanding [36], image generation [28], summarization [20], and relation extraction [43]. Non-local neural networks [44] for images also share a similar idea. Our work is related to these methods, but our goal is to learn a universal structure using an unsupervised objective and then transfer it for use with various supervised tasks. Technically, our approach also differs from previous work as discussed in Section 2.1.1, including separating graphs and features. LISA [40] explored a related idea of using existing linguistic structures, such as dependency trees, to guide the attention learning process of a self-attention network.

Another line of work has explored latent tree learning for jointly parsing sentences based on a downstream semantic objective [48, 23, 6]. Inspired by linguistic theories of constituent phrase structure [7], these works restrict their latent parses to be binary trees. While these models show improved performance on the downstream semantic tasks, Williams et al. [45] showed that the intermediate parses bear little resemblance to any known syntactic or semantic theories from the literature. This suggests that the optimal structure for computational linguistics might be different from those that have been proposed in formal syntactic theories. 
In this paper we explore the use of unsupervised learning objectives for discovering such structures.

5 Conclusions

We present a novel transfer learning scheme based on latent relational graph learning, which is orthogonal to, but can be combined with, the traditional feature transfer learning framework. Through a variety of experiments in language and vision, this framework is demonstrated to be capable of improving performance and learning generic graphs applicable to various types of features. In the future, we hope to extend the framework to more diverse setups such as knowledge-based inference, video modeling, and hierarchical reinforcement learning, where rich graph-like structures abound.

Acknowledgement

This work was supported in part by the Office of Naval Research, DARPA award D17AP00001, Apple, the Google focused award, and the Nvidia NVAIL award. ZY is supported by the Nvidia PhD Fellowship. The authors would also like to thank Sam Bowman for useful discussions.

References

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[2] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.

[3] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

[4] Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.

[5] Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for natural language inference.
In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668, 2017.

[6] Jihun Choi, Kang Min Yoo, and Sang-goo Lee. Unsupervised learning of task-specific tree structures with tree-LSTMs. AAAI, 2018.

[7] Noam Chomsky. Aspects of the Theory of Syntax, volume 11. MIT Press, 2014.

[8] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[9] Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723, 2017.

[10] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.

[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.

[12] Bhuwan Dhingra, Qiao Jin, Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Neural models for reasoning over multiple mentions using coreference. NAACL, 2018.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[15] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.

[16] Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi, and Noah A Smith.
Dynamic entity representations in neural language models. arXiv preprint arXiv:1708.00781, 2017.

[17] Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. Neural relational inference for interacting systems. arXiv preprint arXiv:1802.04687, 2018.

[18] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302, 2015.

[19] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995, 1995.

[20] Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198, 2018.

[21] Xiaodong Liu, Kevin Duh, and Jianfeng Gao. Stochastic answer networks for natural language inference. arXiv preprint arXiv:1804.07888, 2018.

[22] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

[23] Jean Maillard, Stephen Clark, and Dani Yogatama. Jointly learning sentence embeddings and syntax with unsupervised tree-LSTMs. arXiv preprint arXiv:1705.09189, 2017.

[24] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[25] Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification.
arXiv preprint arXiv:1605.07725, 2016.

[26] Renato Negrinho and Geoff Gordon. DeepArchitect: Automatically designing and training deep architectures. arXiv preprint arXiv:1704.08792, 2017.

[27] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[28] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, and Alexander Ku. Image transformer. arXiv preprint arXiv:1802.05751, 2018.

[29] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[30] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

[31] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

[32] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Computer Vision and Pattern Recognition Workshops (CVPRW), pages 512–519. IEEE, 2014.

[33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[34] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications.
arXiv preprint arXiv:1701.05517, 2017.

[35] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

[36] Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. arXiv preprint arXiv:1709.04696, 2017.

[37] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[38] Richard Socher, Cliff C Lin, Chris Manning, and Andrew Y Ng. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 129–136, 2011.

[39] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.

[40] Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. Linguistically-informed self-attention for semantic role labeling. arXiv preprint arXiv:1804.08199, 2018.

[41] Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.

[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.

[43] Patrick Verga, Emma Strubell, Ofer Shai, and Andrew McCallum. Attending to all mention pairs for full abstract biological relation extraction.
arXiv preprint arXiv:1710.08312, 2017.

[44] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. arXiv preprint arXiv:1711.07971, 2017.

[45] Adina Williams, Andrew Drozdov, and Samuel R Bowman. Do latent tree learning models identify meaningful structure in sentences? Transactions of the Association for Computational Linguistics, 6:253–267, 2018.

[46] Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426, 2017.

[47] Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953, 2017.

[48] Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. Learning to compose words into sentences with reinforcement learning. ICLR, 2016.

[49] Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541, 2018.

[50] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.

[51] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning.
arXiv preprint arXiv:1611.01578, 2016.