{"title": "Deep Recursive Neural Networks for Compositionality in Language", "book": "Advances in Neural Information Processing Systems", "page_first": 2096, "page_last": 2104, "abstract": "Recursive neural networks comprise a class of architecture that can operate on structured input. They have been previously successfully applied to model compositionality in natural language using parse-tree-based structural representations. Even though these architectures are deep in structure, they lack the capacity for hierarchical representation that exists in conventional deep feed-forward networks as well as in recently investigated deep recurrent neural networks. In this work we introduce a new architecture --- a deep recursive neural network (deep RNN) --- constructed by stacking multiple recursive layers. We evaluate the proposed model on the task of fine-grained sentiment classification. Our results show that deep RNNs outperform associated shallow counterparts that employ the same number of parameters. Furthermore, our approach outperforms previous baselines on the sentiment analysis task, including a multiplicative RNN variant as well as the recently introduced paragraph vectors, achieving new state-of-the-art results. We provide exploratory analyses of the effect of multiple layers and show that they capture different aspects of compositionality in language.", "full_text": "Deep Recursive Neural Networks\nfor Compositionality in Language\n\nOzan \u02d9Irsoy\n\nCornell University\nIthaca, NY 14853\n\nClaire Cardie\n\nCornell University\nIthaca, NY 14853\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\noirsoy@cs.cornell.edu\n\ncardie@cs.cornell.edu\n\nAbstract\n\nRecursive neural networks comprise a class of architecture that can operate on\nstructured input. 
They have been previously successfully applied to model compositionality\nin natural language using parse-tree-based structural representations.\nEven though these architectures are deep in structure, they lack the capacity for\nhierarchical representation that exists in conventional deep feed-forward networks\nas well as in recently investigated deep recurrent neural networks. In this work we\nintroduce a new architecture \u2014 a deep recursive neural network (deep RNN) \u2014\nconstructed by stacking multiple recursive layers. We evaluate the proposed model\non the task of fine-grained sentiment classification. Our results show that deep\nRNNs outperform associated shallow counterparts that employ the same number\nof parameters. Furthermore, our approach outperforms previous baselines on the\nsentiment analysis task, including a multiplicative RNN variant as well as the\nrecently introduced paragraph vectors, achieving new state-of-the-art results. We\nprovide exploratory analyses of the effect of multiple layers and show that they\ncapture different aspects of compositionality in language.\n\n1 Introduction\n\nDeep connectionist architectures involve many layers of nonlinear information processing [1]. This\nallows them to incorporate meaning representations such that each succeeding layer potentially has\na more abstract meaning. Recent advancements in efficiently training deep neural networks enabled\ntheir application to many problems, including those in natural language processing (NLP). A key\nadvance for application to NLP tasks was the invention of word embeddings that represent a single\nword as a dense, low-dimensional vector in a meaning space [2], and from which numerous problems\nhave benefited [3, 4].\nRecursive neural networks comprise a class of architecture that operates on structured inputs, and\nin particular, on directed acyclic graphs. 
A recursive neural network can be seen as a generalization\nof the recurrent neural network [5], which has a specific type of skewed tree structure (see Figure 1).\nThey have been applied to parsing [6], sentence-level sentiment analysis [7, 8], and paraphrase\ndetection [9]. Given the structural representation of a sentence, e.g. a parse tree, they recursively\ngenerate parent representations in a bottom-up fashion, by combining tokens to produce\nrepresentations for phrases, eventually producing the whole sentence. The sentence-level representation (or,\nalternatively, its phrases) can then be used to make a final classification for a given input sentence\n\u2014 e.g. whether it conveys a positive or a negative sentiment.\n\nFigure 1: Operation of a recursive net (a), untied recursive net (b) and a recurrent net (c) on an\nexample sentence. Black, orange and red dots represent input, hidden and output layers, respectively.\nDirected edges having the same color-style combination denote shared connections.\n\nSimilar to how recurrent neural networks are deep in time, recursive neural networks are deep in\nstructure, because of the repeated application of recursive connections. Recently, the notions of\ndepth in time \u2014 the result of recurrent connections, and depth in space \u2014 the result of stacking\nmultiple layers on top of one another, are distinguished for recurrent neural networks. In order to\ncombine these concepts, deep recurrent networks were proposed [10, 11, 12]. They are constructed\nby stacking multiple recurrent layers on top of each other, which allows this extra notion of depth\nto be incorporated into temporal processing. Empirical investigations showed that this results in a\nnatural hierarchy for how the information is processed [12]. 
Inspired by these recent developments,\nwe make a similar distinction between depth in structure and depth in space, and to combine these\nconcepts, propose the deep recursive neural network, which is constructed by stacking multiple\nrecursive layers.\nThe architecture we study in this work is essentially a deep feedforward neural network with an\nadditional structural processing within each layer (see Figure 2). During forward propagation, in-\nformation travels through the structure within each layer (because of the recursive nature of the\nnetwork, weights regarding structural processing are shared). In addition, every node in the struc-\nture (i.e. in the parse tree) feeds its own hidden state to its counterpart in the next layer. This can\nbe seen as a combination of feedforward and recursive nets. In a shallow recursive neural network,\na single layer is responsible for learning a representation of composition that is both useful and\nsuf\ufb01cient for the \ufb01nal decision. In a deep recursive neural network, a layer can learn some parts\nof the composition to apply, and pass this intermediate representation to the next layer for further\nprocessing for the remaining parts of the overall composition.\nTo evaluate the performance of the architecture and make exploratory analyses, we apply deep re-\ncursive neural networks to the task of \ufb01ne-grained sentiment detection on the recently published\nStanford Sentiment Treebank (SST) [8]. SST includes a supervised sentiment label for every node\nin the binary parse tree, not just at the root (sentence) level. 
This is especially important for deep\nlearning, since it allows a richer supervised error signal to be backpropagated across the network,\npotentially alleviating vanishing gradients associated with deep neural networks [13].\nWe show that our deep recursive neural networks outperform shallow recursive nets of the same size\nin the fine-grained sentiment prediction task on the Stanford Sentiment Treebank. Furthermore, our\nmodels outperform multiplicative recursive neural network variants, achieving new state-of-the-art\nperformance on the task. We conduct qualitative experiments that suggest that each layer handles\na different aspect of compositionality, and representations at each layer capture different notions of\nsimilarity.\n\n2 Methodology\n\n2.1 Recursive Neural Networks\n\nRecursive neural networks (RNNs; e.g. [6]) comprise an architecture in which the same set of\nweights is recursively applied within a structural setting: given a positional directed acyclic graph,\nit visits the nodes in topological order, and recursively applies transformations to generate further\nrepresentations from previously computed representations of children. In fact, a recurrent neural\nnetwork is simply a recursive neural network with a particular structure (see Figure 1c). Even though\nRNNs can be applied to any positional directed acyclic graph, we limit our attention to RNNs over\npositional binary trees, as in [6].\nGiven a binary tree structure with leaves having the initial representations, e.g. 
a parse tree with\nword vector representations at the leaves, a recursive neural network computes the representations\nat each internal node \u03b7 as follows (see also Figure 1a):\n\nx_\u03b7 = f(W_L x_{l(\u03b7)} + W_R x_{r(\u03b7)} + b)    (1)\n\nwhere l(\u03b7) and r(\u03b7) are the left and right children of \u03b7, W_L and W_R are the weight matrices that\nconnect the left and right children to the parent, and b is a bias vector. Given that W_L and W_R\nare square matrices, and not distinguishing whether l(\u03b7) and r(\u03b7) are leaf or internal nodes, this\ndefinition has an interesting interpretation: initial representations at the leaves and intermediate\nrepresentations at the nonterminals lie in the same space. In the parse tree example, a recursive\nneural network combines the representations of two subphrases to generate a representation for the\nlarger phrase, in the same meaning space [6]. We then have a task-specific output layer above the\nrepresentation layer:\n\ny_\u03b7 = g(U x_\u03b7 + c)    (2)\n\nwhere U is the output weight matrix and c is the bias vector to the output layer. In a supervised task,\ny_\u03b7 is simply the prediction (class label or response value) for the node \u03b7, and supervision occurs\nat this layer. As an example, for the task of sentiment classification, y_\u03b7 is the predicted sentiment\nlabel of the phrase given by the subtree rooted at \u03b7. Thus, during supervised learning, initial external\nerrors are incurred on y, and backpropagated from the root, toward leaves [14].\n\n2.2 Untying Leaves and Internals\n\nEven though the aforementioned definition, which treats the leaf nodes and internal nodes the same,\nhas some attractive properties (such as mapping individual words and larger phrases into the same\nmeaning space), in this work we use an untied variant that distinguishes between a leaf and an\ninternal node. 
We do this by a simple parametrization of the weights W with respect to whether the\nincoming edge emanates from a leaf or an internal node (see Figure 1b in contrast to 1a, color of the\nedges emanating from leaves and internal nodes are different):\n\nh_\u03b7 = f(W_L^{l(\u03b7)} h_{l(\u03b7)} + W_R^{r(\u03b7)} h_{r(\u03b7)} + b)    (3)\n\nwhere h_\u03b7 = x_\u03b7 \u2208 X if \u03b7 is a leaf and h_\u03b7 \u2208 H otherwise, and W_\u00b7^{\u03b7} = W_\u00b7^{xh} if \u03b7 is a leaf and\nW_\u00b7^{\u03b7} = W_\u00b7^{hh} otherwise. X and H are vector spaces of words and phrases, respectively. The weights\nW_\u00b7^{xh} act as a transformation from word space to phrase space, and W_\u00b7^{hh} as a transformation from\nphrase space to itself.\nWith this untying, a recursive network becomes a generalization of the Elman type recurrent neural\nnetwork with h being analogous to the hidden layer of the recurrent network (memory) and x being\nanalogous to the input layer (see Figure 1c). Benefits of this untying are twofold: (1) Now the\nweight matrices W_\u00b7^{xh} and W_\u00b7^{hh} are of size |h| \u00d7 |x| and |h| \u00d7 |h|, which means that we can use\nlarge pretrained word vectors and a small number of hidden units without a quadratic dependence on\nthe word vector dimensionality |x|. Therefore, small but powerful models can be trained by using\npretrained word vectors with a large dimensionality. (2) Since words and phrases are represented\nin different spaces, we can use rectifier activation units for f, which have previously been shown to\nyield good results when training deep neural networks [15]. Word vectors are dense and generally\nhave positive and negative entries whereas rectifier activation causes the resulting intermediate\nvectors to be sparse and nonnegative. 
Thus, when leaves and internals are represented in the same space,\na discrepancy arises, and the same weight matrix is applied to both leaves and internal nodes and\nis expected to handle both sparse and dense cases, which might be difficult. Therefore separating\nleaves and internal nodes allows the use of rectifiers in a more natural manner.\n\n2.3 Deep Recursive Neural Networks\n\nFigure 2: Operation of a 3-layer deep recursive neural network. Red and black points denote\noutput and input vectors, respectively; other colors denote intermediate memory representations.\nConnections denoted by the same color-style combination are shared (i.e. share the same set of\nweights).\n\nRecursive neural networks are deep in structure: with the recursive application of the nonlinear\ninformation processing they become as deep as the depth of the tree (or in general, DAG). However,\nthis notion of depth is unlikely to involve a hierarchical interpretation of the data. By applying\nthe same computation recursively to compute the contribution of children to their parents, and the\nsame computation to produce an output response, we are, in fact, representing every internal node\n(phrase) in the same space [6, 8]. However, in the more conventional stacked deep learners (e.g. 
deep\nfeedforward nets), an important benefit of depth is the hierarchy among hidden representations:\nevery hidden layer conceptually lies in a different representation space and potentially is a more\nabstract representation of the input than the previous layer [1].\nTo address these observations, we propose the deep recursive neural network, which is constructed\nby stacking multiple layers of individual recursive nets:\n\nh_\u03b7^{(i)} = f(W_L^{(i)} h_{l(\u03b7)}^{(i)} + W_R^{(i)} h_{r(\u03b7)}^{(i)} + V^{(i)} h_\u03b7^{(i\u22121)} + b^{(i)})    (4)\n\nwhere i indexes the multiple stacked layers, W_L^{(i)}, W_R^{(i)}, and b^{(i)} are defined as before within each\nlayer i, and V^{(i)} is the weight matrix that connects the (i\u22121)th hidden layer to the ith hidden layer.\nNote that the untying that we described in Section 2.2 is only necessary for the first layer, since we\ncan map both x \u2208 X and h^{(1)} \u2208 H^{(1)} in the first layer to h^{(2)} \u2208 H^{(2)} in the second layer using\nseparate V^{(2)} for leaves and internals (V^{xh(2)} and V^{hh(2)}). Therefore every node is represented in the\nsame space at layers above the first, regardless of their \u201cleafness\u201d. Figure 2 provides a visualization\nof weights that are untied or shared.\nFor prediction, we connect the output layer to only the final hidden layer:\n\ny_\u03b7 = g(U h_\u03b7^{(\u2113)} + c)    (5)\n\nwhere \u2113 is the total number of layers. Intuitively, connecting the output layer to only the last hidden\nlayer forces the network to represent enough high level information at the final layer to support the\nsupervised decision. 
Connecting the output layer to all hidden layers is another option; however, in\nthat case multiple hidden layers can have synergistic effects on the output and make it more difficult\nto qualitatively analyze each layer.\nLearning a deep RNN can be conceptualized as interleaved applications of the conventional\nbackpropagation across multiple layers, and backpropagation through structure within a single layer.\nDuring backpropagation a node \u03b7 receives error terms from both its parent (through structure), and\nfrom its counterpart in the higher layer (through space). Then it further backpropagates that error\nsignal to both of its children, as well as to its counterpart in the lower layer.\n\n3 Experiments\n\n3.1 Setting\n\nData. For experimental evaluation of our models, we use the recently published Stanford\nSentiment Treebank (SST) [8], which includes labels for 215,154 phrases in the parse trees of 11,855\nsentences, with an average sentence length of 19.1 tokens. Real-valued sentiment labels are\nconverted to an integer ordinal label in {0, ..., 4} by simple thresholding. Therefore the supervised\ntask is posed as a 5-class classification problem. We use the single training-validation-test set\npartitioning provided by the authors.\n\nBaselines. In addition to experimenting among deep RNNs of varying width and depth, we\ncompare our models to previous work on the same data. 
We use baselines from [8]: a naive Bayes classifier\nthat operates on bigram counts (BINB); a shallow RNN (RNN) [6, 7] that, in contrast to our shallow RNNs,\nlearns the word vectors from the supervised data and uses tanh units; a matrix-vector\nRNN in which every word is assigned a matrix-vector pair instead of a vector, and composition is\ndefined with matrix-vector multiplications (MV-RNN) [16]; and the multiplicative recursive net (or\nthe recursive neural tensor network) in which the composition is defined as a bilinear tensor\nproduct (RNTN) [8]. Additionally, we use a method that is capable of generating representations for\nlarger pieces of text (PARAGRAPH VECTORS) [17], and the dynamic convolutional neural network\n(DCNN) [18]. We use the previously published results for comparison using the same\ntraining-development-test partitioning of the data.\n\nActivation Units. For the output layer, we employ the standard softmax activation:\ng(x) = e^{x_i} / \u2211_j e^{x_j}. For the hidden layers we use the rectifier linear activation: f(x) = max{0, x}.\nExperimentally, rectifier activation gives better performance, faster convergence, and sparse\nrepresentations. Previous work with rectifier units reported good results when training deep neural\nnetworks, with no pre-training step [15].\n\nWord Vectors. In all of our experiments, we keep the word vectors fixed and do not fine-tune them, for\nsimplicity of our models. We use the publicly available 300 dimensional word vectors by [19],\ntrained on part of the Google News dataset (\u223c100B words).\n\nRegularizer. For regularization of the networks, we use the recently proposed dropout technique,\nin which we randomly set entries of hidden representations to 0, with a probability called the dropout\nrate [20]. Dropout rate is tuned over the development set out of {0, 0.1, 0.3, 0.5}. 
Dropout prevents\nlearned features from co-adapting, and it has been reported to yield good results when training deep\nneural networks [21, 22]. Note that dropped units are shared: for a single sentence and a layer, we\ndrop the same units of the hidden layer at each node.\nSince we are using a non-saturating activation function, intermediate representations are not\nbounded from above, hence, they can explode even with a strong regularization over the\nconnections, which is confirmed by preliminary experiments. Therefore, for stability reasons, we use a\nsmall fixed additional L2 penalty (10^{\u22125}) over both the connection weights and the unit activations,\nwhich resolves the explosion problem.\n\nNetwork Training. We use stochastic gradient descent with a fixed learning rate (0.01). We use a\ndiagonal variant of AdaGrad for parameter updates [23]. AdaGrad yields a smooth and fast\nconvergence. Furthermore, it can be seen as a natural tuning of individual learning rates per each parameter.\nThis is beneficial for our case since different layers have gradients at different scales because of the\nscale of non-saturating activations at each layer (grows bigger at higher layers). We update weights\nafter minibatches of 20 sentences. We run 200 epochs for training. Recursive weights within a layer\n(W^{hh}) are initialized as 0.5I + \u03b5 where I is the identity matrix and \u03b5 is a small uniformly random\nnoise. This means that initially, the representation of each node is approximately the mean of its\ntwo children. All other weights are initialized as \u03b5. We experiment with networks of various sizes,\nhowever we have the same number of hidden units across multiple layers of a single RNN. When\nwe increase the depth, we keep the overall number of parameters constant, therefore deeper\nnetworks become narrower. 
We do not employ a pre-training step; deep architectures are trained with\nthe supervised error signal, even when the output layer is connected to only the final hidden layer.\nAdditionally, we employ early stopping: out of all iterations, the model with the best development\nset performance is picked as the final model to be evaluated.\n\n\u2113   |h|   Fine-grained   Binary\n1   50    46.1           85.3\n2   45    48.0           85.5\n3   40    43.1           83.5\n1   340   48.1           86.4\n2   242   48.3           86.4\n3   200   49.5           86.7\n4   174   49.8           86.6\n5   157   49.0           85.5\n(a) Results for RNNs. \u2113 and |h| denote the depth and width of the networks, respectively.\n\nMethod              Fine-grained   Binary\nBigram NB           41.9           83.1\nRNN                 43.2           82.4\nMV-RNN              44.4           82.9\nRNTN                45.7           85.4\nDCNN                48.5           86.8\nParagraph Vectors   48.7           87.8\nDRNN (4, 174)       49.8           86.6\n(b) Results for previous work and our best model (DRNN).\n\nTable 1: Accuracies for 5-class predictions over SST, at the sentence level.\n\n3.2 Results\n\nQuantitative Evaluation. We evaluate on both fine-grained sentiment score prediction (5-class\nclassification) and binary (positive-negative) classification. For binary classification, we do not train\na separate network; instead, we use the network trained for fine-grained prediction, and then decode the\n5-dimensional posterior probability vector into a binary decision, which also effectively discards the\nneutral cases from the test set. This approach solves a harder problem. Therefore there might be\nroom for improvement on binary results by separately training a binary classifier.\nExperimental results of our models and previous work are given in Table 1. Table 1a shows our\nmodels with varying depth and width (while keeping the overall number of parameters constant\nwithin each group). \u2113 denotes the depth and |h| denotes the width of the networks (i.e. 
number of\nhidden units in a single hidden layer).\nWe observe that shallow RNNs get an improvement just by using pretrained word vectors, rectifiers,\nand dropout, compared to previous work (48.1 vs. 43.2 for the fine-grained task, see our shallow\nRNN with |h| = 340 in Table 1a and the RNN from [8] in Table 1b). This supports\nuntying leaves and internal nodes in the RNN as described in Section 2.2 and using pre-trained word\nvectors.\nResults on RNNs of various depths and sizes show that deep RNNs outperform single layer RNNs\nwith approximately the same number of parameters, which quantitatively validates the benefits of\ndeep networks over shallow ones (see Table 1a). We see a consistent improvement as we use deeper\nand narrower networks until a certain depth. The 2-layer RNN for the smaller networks and\n4-layer RNN for the larger networks give the best performance with respect to the fine-grained score.\nIncreasing the depth further starts to degrade performance. An explanation for this might be the decrease\nin width dominating the gains from an increased depth.\nFurthermore, our best deep RNN outperforms the baselines from [8] on both the fine-grained and binary\nprediction tasks, and outperforms DCNN and Paragraph Vectors on the fine-grained score (though not on the\nbinary score), achieving a new state-of-the-art on the fine-grained task (see Table 1b).\nWe attribute an important contribution of the improvement to dropout. In a preliminary experiment\nwith simple L2 regularization, a 3-layer RNN with 200 hidden units each achieved a fine-grained\nscore of 46.06 (not shown here), compared to our current score of 49.5 with the dropout regularizer.\n\nInput Perturbation. In order to assess the scale at which different layers operate, we investigate\nthe response of all layers to a perturbation in the input. 
A way of perturbing the input might be an\naddition of some noise, however with a large amount of noise, it is possible that the resulting noisy\ninput vector is outside of the manifold of meaningful word vectors. Therefore, instead, we simply\npick a word from the sentence that carries positive sentiment, and alter it to a set of words that have\nsentiment values shifting towards the negative direction.\n\nFigure 3: An example sentence with its parse tree (left) and the response measure of every layer\n(right) in a three-layered deep recursive net. We change the word \u201cbest\u201d in the input to one of the\nwords \u201ccoolest\u201d, \u201cgood\u201d, \u201caverage\u201d, \u201cbad\u201d, \u201cworst\u201d (denoted by blue, light blue, black, orange\nand red, respectively) and measure the change of hidden layer representations in one-norm for every\nnode in the path.\n\nQuery phrase: charming results\nRank | Layer 1 | Layer 2 | Layer 3\n1 | charming , | interesting results | charming chemistry\n2 | charming and | riveting performances | perfect ingredients\n3 | appealingly manic and energetic | gripping performances | brilliantly played\n4 | refreshingly adult take on adultery | joyous documentary | perfect medium\n5 | unpretentious , sociologically pointed | an amazing slapstick instrument | engaging film\n\nQuery phrase: not great\nRank | Layer 1 | Layer 2 | Layer 3\n1 | as great | nothing good | not very informative\n2 | a great | not compelling | not really funny\n3 | is great | only good | not quite satisfying\n4 | Is n\u2019t it great | too great | thrashy fun\n5 | be great | completely numbing experience | fake fun\n\nTable 2: Example shortest phrases and their nearest neighbors across three layers.\n\nIn Figure 3, we give an example sentence, \u201cRoger Dodger is one of the best variations on this\ntheme\u201d with its parse tree. 
We change the word \u201cbest\u201d into the set of words \u201ccoolest\u201d, \u201cgood\u201d,\n\u201caverage\u201d, \u201cbad\u201d, \u201cworst\u201d, and measure the response of this change along the path that connects\nthe leaf to the root (labeled from 1 to 8). Note that all other nodes have the same representations,\nsince a node is completely determined by its subtree. For each node, the response is measured as\nthe change of its hidden representation in one-norm, for each of the three layers in the network, with\nrespect to the hidden representations using the original word (\u201cbest\u201d).\nIn the first layer (bottom) we observe a shared trend change as we go up in the tree. Note that\n\u201cgood\u201d and \u201cbad\u201d are almost on top of each other, which suggests that there is not necessarily\nenough information captured in the first layer yet to make the correct sentiment decision. In the\nsecond layer (middle) an interesting phenomenon occurs: Paths with \u201ccoolest\u201d and \u201cgood\u201d start\nclose together, as well as \u201cworst\u201d and \u201cbad\u201d. However, as we move up in the tree, paths with\n\u201cworst\u201d and \u201ccoolest\u201d come closer together as well as the paths with \u201cgood\u201d and \u201cbad\u201d. This\nsuggests that the second layer remembers the intensity of the sentiment, rather than direction. The\nthird layer (top) is the most consistent one as we traverse up the tree, and correct sentiment\ndecisions persist across the path.\n\nNearest Neighbor Phrases. In order to evaluate the different notions of similarity in the meaning\nspace captured by multiple layers, we look at nearest neighbors of short phrases. For a three layer\ndeep recursive neural network we compute hidden representations for all phrases in our data. 
Then,\nfor a given phrase, we \ufb01nd its nearest neighbor phrases across each layer, with the one-norm distance\nmeasure. Two examples are given in Table 2.\nFor the \ufb01rst layer, we observe that similarity is dominated by one of the words that is composed, i.e.\n\u201ccharming\u201d for the phrase \u201ccharming results\u201d (and \u201cappealing\u201d, \u201crefreshing\u201d for some neighbors),\nand \u201cgreat\u201d for the phrase \u201cnot great\u201d. This effect is so strong that it even discards the negation for\nthe second case, \u201cas great\u201d and \u201cis great\u201d are considered similar to \u201cnot great\u201d.\nIn the second layer, we observe a more diverse set of phrases semantically. On the other hand, this\nlayer seems to be taking syntactic similarity more into account: in the \ufb01rst example, the nearest\nneighbors of \u201ccharming results\u201d are comprised of adjective-noun combinations that also exhibit\nsome similarity in meaning (e.g. \u201cinteresting results\u201d, \u201criveting performances\u201d). The account is\nsimilar for \u201cnot great\u201d: its nearest neighbors are adverb-adjective combinations in which the ad-\njectives exhibit some semantic overlap (e.g. \u201cgood\u201d, \u201ccompelling\u201d). Sentiment is still not properly\ncaptured in this layer, however, as seen with the neighbor \u201ctoo great\u201d for the phrase \u201cnot great\u201d.\nIn the third and \ufb01nal layer, we see a higher level of semantic similarity, in the sense that phrases\nare mostly related to one another in terms of sentiment. Note that since this is a supervised task\non sentiment detection, it is suf\ufb01cient for the network to capture only the sentiment (and how it is\ncomposed in context) in the last layer. 
Therefore, it should be expected to observe an even more\ndiverse set of neighbors with only a sentiment connection.\n\n4 Conclusion\n\nIn this work we propose the deep recursive neural network, which is constructed by stacking multiple\nrecursive layers on top of each other. We apply this architecture to the task of \ufb01ne-grained sentiment\nclassi\ufb01cation using binary parse trees as the structure. We empirically evaluated our models against\nshallow recursive nets. Additionally, we compared with previous work on the task, including a\nmultiplicative RNN and the more recent Paragraph Vectors method. Our experiments show that deep\nmodels outperform their shallow counterparts of the same size. Furthermore, deep RNN outperforms\nthe baselines, achieving state-of-the-art performance on the task.\nWe further investigate our models qualitatively by performing input perturbation, and examining\nnearest neighboring phrases of given examples. These results suggest that adding depth to a recursive\nnet is different from adding width. Each layer captures a different aspect of compositionality. Phrase\nrepresentations focus on different aspects of meaning at each layer, as seen by nearest neighbor\nphrase examples.\nSince our task was supervised, learned representations seemed to be focused on sentiment, as in\nprevious work. An important future direction might be an application of the deep RNN to a broader,\nmore general task, even an unsupervised one (e.g. as in [9]). This might provide better insights on the\noperation of different layers and their contribution, with a more general notion of composition. 
The\neffect of fine-tuning word vectors on the performance of deep RNN is also open to investigation.\n\nAcknowledgments\n\nThis work was supported in part by NSF grant IIS-1314778 and DARPA DEFT FA8750-13-2-0015.\nThe views and conclusions contained herein are those of the authors and should not be interpreted as\nnecessarily representing the official policies or endorsements, either expressed or implied, of NSF,\nDARPA or the U.S. Government.\n\nReferences\n\n[1] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning,\n2(1):1\u2013127, 2009.\n\n[2] Yoshua Bengio, R\u00e9jean Ducharme, Pascal Vincent, Christian Jauvin, Jaz K, Thomas Hofmann, Tomaso\nPoggio, and John Shawe-taylor. A neural probabilistic language model. In Advances in Neural\nInformation Processing Systems, 2001.\n\n[3] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep\nneural networks with multitask learning. In Proceedings of the 25th international conference on Machine\nlearning, pages 160\u2013167. ACM, 2008.\n\n[4] Ronan Collobert, Jason Weston, L\u00e9on Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa.\nNatural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493\u20132537, November\n2011.\n\n[5] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179\u2013211, 1990.\n\n[6] Richard Socher, Cliff C Lin, Andrew Ng, and Chris Manning. Parsing natural scenes and natural language\nwith recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning\n(ICML-11), pages 129\u2013136, 2011.\n\n[7] Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christopher D Manning.\nSemi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the\nConference on Empirical Methods in Natural Language Processing, pages 151\u2013161. 
Association for Computational\nLinguistics, 2011.\n\n[8] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng,\nand Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank.\nIn Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP \u201913,\n2013.\n\n[9] Richard Socher, Eric H Huang, Jeffrey Pennington, Christopher D Manning, and Andrew Ng. Dynamic pooling\nand unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information\nProcessing Systems, pages 801\u2013809, 2011.\n\n[10] J\u00fcrgen Schmidhuber. Learning complex, extended sequences using the principle of history compression.\nNeural Computation, 4(2):234\u2013242, 1992.\n\n[11] Salah El Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. In\nAdvances in Neural Information Processing Systems, pages 493\u2013499, 1995.\n\n[12] Michiel Hermans and Benjamin Schrauwen. Training and analysing deep recurrent neural networks. In\nAdvances in Neural Information Processing Systems, pages 190\u2013198, 2013.\n\n[13] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient\ndescent is difficult. Neural Networks, IEEE Transactions on, 5(2):157\u2013166, 1994.\n\n[14] Christoph Goller and Andreas Kuchler. Learning task-dependent distributed representations by backpropagation\nthrough structure. In Neural Networks, 1996., IEEE International Conference on, volume 1,\npages 347\u2013352. IEEE, 1996.\n\n[15] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier networks. In Proceedings of the\n14th International Conference on Artificial Intelligence and Statistics. JMLR W&CP Volume, volume 15,\npages 315\u2013323, 2011.\n\n[16] Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. Semantic compositionality\nthrough recursive matrix-vector spaces. 
In Proceedings of the 2012 Joint Conference on Empirical Methods\nin Natural Language Processing and Computational Natural Language Learning, pages 1201\u20131211.\nAssociation for Computational Linguistics, 2012.\n\n[17] Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. arXiv preprint\narXiv:1405.4053, 2014.\n\n[18] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling\nsentences. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics,\nJune 2014.\n\n[19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations\nof words and phrases and their compositionality. In Advances in Neural Information Processing Systems,\npages 3111\u20133119, 2013.\n\n[20] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov.\nImproving neural networks by preventing co-adaptation of feature detectors. arXiv preprint\narXiv:1207.0580, 2012.\n\n[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional\nneural networks. In NIPS, volume 1, page 4, 2012.\n\n[22] George E Dahl, Tara N Sainath, and Geoffrey E Hinton. Improving deep neural networks for LVCSR using\nrectified linear units and dropout. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE\nInternational Conference on, pages 8609\u20138613. IEEE, 2013.\n\n[23] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and\nstochastic optimization. The Journal of Machine Learning Research, 12:2121\u20132159, 2011.\n", "award": [], "sourceid": 1123, "authors": [{"given_name": "Ozan", "family_name": "Irsoy", "institution": "Cornell University"}, {"given_name": "Claire", "family_name": "Cardie", "institution": "Cornell University"}]}