{"title": "Global Belief Recursive Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2888, "page_last": 2896, "abstract": "Recursive Neural Networks have recently obtained state of the art performance on several natural language processing tasks. However, because of their feedforward architecture they cannot correctly predict phrase or word labels that are determined by context. This is a problem in tasks such as aspect-specific sentiment classification which tries to, for instance, predict that the word Android is positive in the sentence Android beats iOS. We introduce global belief recursive neural networks (GB-RNNs) which are based on the idea of extending purely feedforward neural networks to include one feedbackward step during inference. This allows phrase level predictions and representations to give feedback to words. We show the effectiveness of this model on the task of contextual sentiment analysis. We also show that dropout can improve RNN training and that a combination of unsupervised and supervised word vector representations performs better than either alone. The feedbackward step improves F1 performance by 3% over the standard RNN on this task, obtains state-of-the-art performance on the SemEval 2013 challenge and can accurately predict the sentiment of specific entities.", "full_text": "Global Belief Recursive Neural Networks\n\nRomain Paulus, Richard Socher\n\nMetaMind\n\nPalo Alto, CA\n\n{romain,richard}@metamind.io\n\nChristopher D. Manning\n\nStanford University\n\n353 Serra Mall\n\nStanford, CA 94305\n\nmanning@stanford.edu\n\nAbstract\n\nRecursive Neural Networks have recently obtained state of the art performance on\nseveral natural language processing tasks. However, because of their feedforward\narchitecture they cannot correctly predict phrase or word labels that are deter-\nmined by context. 
This is a problem in tasks such as aspect-speci\ufb01c sentiment\nclassi\ufb01cation which tries to, for instance, predict that the word Android is positive\nin the sentence Android beats iOS. We introduce global belief recursive neural\nnetworks (GB-RNNs) which are based on the idea of extending purely feedfor-\nward neural networks to include one feedbackward step during inference. This\nallows phrase level predictions and representations to give feedback to words. We\nshow the effectiveness of this model on the task of contextual sentiment analy-\nsis. We also show that dropout can improve RNN training and that a combination\nof unsupervised and supervised word vector representations performs better than\neither alone. The feedbackward step improves F1 performance by 3% over the\nstandard RNN on this task, obtains state-of-the-art performance on the SemEval\n2013 challenge and can accurately predict the sentiment of speci\ufb01c entities.\n\n1\n\nIntroduction\n\nModels of natural language need the ability to compose the meaning of words and phrases in order\nto understand complex utterances such as facts, multi-word entities, sentences or stories. There has\nrecently been a lot of work extending single word semantic vector spaces [27, 11, 16] to compo-\nsitional models of bigrams [17, 29] or phrases of arbitrary length [25, 28, 24, 10]. Work in this\narea so far has focused on computing the meaning of longer phrases in purely feedforward types\nof architectures in which the meaning of the shorter constituents that are being composed is not\naltered. However, a full treatment of semantic interpretation cannot be achieved without taking into\nconsideration that the meaning of words and phrases can also change once the sentence context is\nobserved. Take for instance the sentence in Fig. 1: Android beats iOS. All current recursive deep\nlearning sentiment models [26] would attempt to classify the word Android or iOS, both of which\nare simply neutral. 
The sentiment of the overall sentence is undefined; it depends on which of the\nentities the user of the sentiment analysis cares about. Generally, for many analyses of social media\ntext, users are indeed most interested in the sentiment directed towards a specific entity or phrase.\nIn order to solve the contextual classification problem in general and aspect-specific sentiment\nclassification in particular, we introduce global belief recursive neural networks (GB-RNN). These models\ngeneralize purely feedforward recursive neural networks (RNNs) by including a feedbackward step\nat inference time. The backward computation uses the representations from both steps in its\nrecursion and allows all phrases to update their prediction based on the global context of the sentence.\nUnlike recurrent neural networks or window-based methods [5], the important context can be many\nwords away from the phrase that is to be labeled. This will allow models to correctly classify that in\nthe sentence of Fig. 1, Android is described with positive sentiment and iOS is not. Neither could be\ndetermined from the respective phrases in isolation.\n\nFigure 1: Illustration of the problem of sentiment classification that uses only the phrase to be labeled\nand ignores the context. 
The word Android is neutral in isolation but becomes positive in context.\n\nIn order to validate the GB-RNN\u2019s ability to contextually disambiguate sentiment on real text, we\nuse the Twitter dataset and annotations from SemEval Challenge 2013 Task 2.\u00b9 The GB-RNN\noutperforms both the standard RNN and all other baselines, as well as the winner of the Sentiment\ncompetition of SemEval 2013, showing that it can successfully make use of surrounding context.\n\n2 Related Work\n\nNeural word vectors. One common way to represent words is to use distributional word vectors\n[27] learned via dimensionality reduction of large co-occurrence matrices over documents (as in\nlatent semantic analysis [14]), local context windows [16, 13] or combinations of both [11]. Words\nwith similar meanings are close to each other in the vector space. Since unsupervised word\nvectors computed from local context windows do not always encode task-specific information, such\nas sentiment, word vectors can also be fine-tuned to such specific tasks [5, 24]. We introduce a\nhybrid approach where some dimensions are obtained from an unsupervised model and others are\nlearned for the supervised task. We show that this performs better than both the purely supervised\nand unsupervised semantic word vectors.\nRecursive Neural Networks. The idea of recursive neural networks (RNNs) for natural language\nprocessing (NLP) is to train a deep learning model that can be applied to inputs of any length.\nUnlike computer vision tasks, where it is easy to resize an image to a fixed number of pixels,\nnatural sentences do not have a fixed size input. 
However, phrases and sentences have a grammatical\nstructure that can be parsed as a binary tree [22].\nFollowing this tree structure, we can assign a \ufb01xed-length vector to each word at the leaves of\nthe tree, and combine word and phrase pairs recursively to create intermediate node vectors of the\nsame length, eventually having one \ufb01nal vector representing the whole sentence [19, 25]. Multiple\nrecursive combination functions have been explored, from linear transformation matrices to tensor\nproducts [26]. In this work, we use the simple single matrix RNN to combine node vectors at each\nrecursive step.\nBidirectional-recurrent and bidirectional-recursive neural networks. Recurrent neural networks\nare a special case of recursive neural networks that operate on chains and not trees. Unlike recursive\nneural networks, they don\u2019t require a tree structure and are usually applied to time series. In a re-\ncurrent neural network, every node is combined with a summarized representation of the past nodes\n[8], and then the resulting combination will be forwarded to the next node. Bidirectional recur-\nrent neural network architectures have also been explored [21] and usually compute representations\nindependently from both ends of a time series.\nBidirectional recursive models [12, 15], developed in parallel with ours, extend the de\ufb01nition of the\nrecursive neural network by adding a backward propagation step, where information also \ufb02ows from\nthe tree root back to the leaves. We compare our model to theirs theoretically in the model section,\nand empirically in the experiments.\n[20] unfold the same autoencoder multiple times which gives it more representational power with\nthe same number of parameters. Our model is different in that it takes into consideration more\ninformation at each step and can eventually make better local predictions by using global context.\nSentiment analysis. 
Sentiment analysis has been the subject of research for some time [4, 2, 3, 6,\n1, 23]. Most approaches in sentiment analysis use \u201cbag of words\u201d representations that do not take\nthe phrase structure into account but learn from word-level features. We explore our model\u2019s ability\nto determine contextual sentiment on Twitter, a social media platform.\n\n\u00b9 http://www.cs.york.ac.uk/semeval-2013/task2/\n\nFigure 2: Propagation steps of the GB-RNN. Step 1 describes the standard RNN feedforward\nprocess, showing that the vector representation of Android is independent of the rest of the document.\nStep 2 computes additional vectors at each node (in red), using information from the higher level\nnodes in the tree (in blue), allowing Android and iOS to have different representations given the\ncontext.\n\n3 Global Belief Recursive Neural Networks\n\nIn this section, we introduce a new model to compute context-dependent compositional vector\nrepresentations of variable length phrases. These vectors are trained to be useful as features to classify\neach phrase and word. Fig. 2 shows an example phrase computation that we will describe in detail\nbelow. This section begins by motivating compositionality and context-dependence, followed by a\ndefinition of standard recursive neural networks. Next, we introduce our novel global belief model\nand hybrid unsupervised-supervised word vectors.\n\n3.1 Context-Dependence as Motivation for Global Belief\n\nA common simplifying assumption when mapping sentences into a feature vector is that word order\ndoes not matter (\u201cbag of words\u201d). However, this will prevent any detailed understanding of language\nas exemplified in Fig. 
1, where the overall sentiment of the phrase Android beats iOS is unclear.\nInstead, we need an understanding of each phrase which leads us to deep recursive models.\nThe first step for mapping a sentence into a vector space is to parse it into a binary tree structure\nthat captures the grammatical relationships between words. Such an input-dependent binary tree then\ndetermines the architecture of a recursive neural network which will compute the hidden vectors in a\nbottom-up fashion starting with the word vectors. The resulting phrase vectors are given as features\nto a classifier. This standard RDL architecture works well for classifying the inherent or\ncontext-independent label of a phrase. For instance, it can correctly classify that a not so beautiful day is\nnegative in sentiment. However, not all phrases have an inherent sentiment as shown in Fig. 1.\nThe GB-RNN addresses this issue by propagating information from the root node back to the\nleaf nodes as described below. There are other ways context can be incorporated such as with\nbi-directional recurrent neural networks or with window-based methods. Both of these methods,\nhowever, cannot incorporate information from words further away from the phrase to be labeled.\n\n3.2 Standard Recursive Neural Networks\n\nWe first describe a simple recursive neural network that can be used for context-independent\nphrase-level classification. It can also be seen as the first step of a GB-RNN.\nAssume, for now, that each word vector a \u2208 R^n is obtained by sampling each element from a\nuniform distribution: a_i \u223c U(\u22120.001, 0.001). All these vectors are columns of a large embedding\nmatrix L \u2208 R^{n\u00d7|V|}, where |V| is the size of the vocabulary. All word vectors are learned together\nwith the model.\n\nFor the example word vector sequence (abc) of Fig. 
2, the RNN equations become:\n\n$$p_1 = f\\left(W \\begin{bmatrix} b \\\\ c \\end{bmatrix}\\right), \\qquad p_2 = f\\left(W \\begin{bmatrix} a \\\\ p_1 \\end{bmatrix}\\right), \\tag{1}$$\n\nwhere W \u2208 R^{n\u00d72n} is the matrix governing the composition and f the non-linear activation\nfunction. Each node vector is then given as input to a softmax classifier for a classification task such as\nsentiment analysis.\n\n3.3 GB-RNN: Global Belief Recursive Neural Networks\n\nOur goal is to include contextual information in the recursive node vector representations. One\nsimple solution would be to just include the k context words to the left and right of each pair as in\n[25]. However, this will only work if the necessary context is at most k words away. Furthermore,\nin order to capture more complex linguistic phenomena it may be necessary to allow for multiple\nwords to compose the contextual shift in meaning. Instead, we will use the feedforward nodes from\na standard RNN architecture and simply move back down the tree. This can also be interpreted as\nunfolding the tree and moving up its branches.\nHence, we keep the same Eq. 1 for computing the forward node vectors, but we introduce new\nfeedbackward vectors, denoted with a down arrow \u2193, at every level of the parse tree. Unlike the\nfeedforward vectors, which were computed with a bottom-up recursive function, feedbackward\nvectors are computed with a top-down recursive function. The backwards pass starts at the root node\nand propagates all the way down to the single word vectors. At the root node, in our example the\nnode p_2, we have:\n\n$$p_2^{\\downarrow} = f(V p_2), \\tag{2}$$\n\nwhere V \u2208 R^{n_d\u00d7n} so that all \u2193-node vectors are n_d-dimensional. Starting from p_2^\u2193, we recursively\nget \u2193-node vectors for every node as we go down the tree:\n\n$$\\begin{bmatrix} a^{\\downarrow} \\\\ p_1^{\\downarrow} \\end{bmatrix} = f\\left(W^{\\downarrow} \\begin{bmatrix} p_2 \\\\ p_2^{\\downarrow} \\end{bmatrix}\\right), \\qquad \\begin{bmatrix} b^{\\downarrow} \\\\ c^{\\downarrow} \\end{bmatrix} = f\\left(W^{\\downarrow} \\begin{bmatrix} p_1 \\\\ p_1^{\\downarrow} \\end{bmatrix}\\right), \\tag{3}$$\n\nwhere all \u2193-vectors are n_d-dimensional and hence W^\u2193 \u2208 R^{(2n_d)\u00d7(n+n_d)} is a new de-composition\nmatrix. Figure 2 step 2 illustrates this top-down recursive computation on our example. Once we\nhave both feedforward and feedbackward vectors for a given node, we concatenate them and employ\nthe standard softmax classifier to make the final prediction. For instance, the classification for word\na becomes:\n\n$$y_a = \\mathrm{softmax}\\left(W_c \\begin{bmatrix} a \\\\ a^{\\downarrow} \\end{bmatrix}\\right),$$\n\nwhere we fold the bias into the C-class classifier weights\nW_c \u2208 R^{C\u00d7(n+1)}.\nAt the root node, the equation for x^\u2193_root could be replaced by simply copying x^\u2193_root = x_root. But\nthere are two advantages of introducing a transform matrix V. First, it helps clearly differentiate\nfeatures computed during the forward step and the backward step in multiplication with W^\u2193.\nSecond, it allows the use of a different dimension for the x^\u2193 vectors, which reduces the number of\nparameters in the W^\u2193 and W_c matrices, and adds more flexibility to the model. It also performs\nbetter empirically.\n\n3.4 Hybrid Word Vector Representations\n\nThere are two ways to initialize the word vectors that are given as inputs to the RNN models. 
The\nsimplest one is to initialize them to small random numbers as mentioned above and backpropagate\nerror signals into them in order to have them capture the necessary information for the task at hand.\nThis has the advantage of not requiring any other pre-training method and the vectors are sure to\ncapture domain knowledge. However, the vectors are more likely to overfit and less likely to\ngeneralize well to words that have not been in the (usually smaller) labeled training set. Another approach\nis to use unsupervised methods that learn semantic word vectors such as [13]. One then has the\noption to backpropagate task-specific errors into these vectors or keep them at their initialization.\nBackpropagating into them still has the potential disadvantage of hurting generalization apart from\nslowing down training since it increases the number of parameters by a large amount (there are\nusually 100,000 \u00d7 50 many parameters in the embedding matrix L). Without propagating information,\nhowever, one has to hope that the unsupervised method really captures all the necessary semantic\ninformation, which is often not the case for sentiment (which suffers from the antonym problem).\n\nFigure 3: Hybrid unsupervised-supervised vector representations for the most frequent 50 words\nof the dataset. For each horizontal vector, the first 100 dimensions are trained on unlabeled Twitter\nmessages, and the last dimensions are trained on labeled contextual sentiment examples.\n\nIn this paper we propose to combine both ideas by representing each word as a concatenation of\nunsupervised vectors that are kept at their initialization during training and a small additional\nvector into which we propagate the task-specific error signal. 
This vector representation applies only\nto the feedforward word vectors and should not be confused with the combination of the feedforward\nand feedbackward node vectors in the softmax.\nFigure 3 shows the resulting word vectors trained on unlabeled documents on one part (the first\n100 dimensions), and trained on labeled examples on the other part (the remaining dimensions).\n\n3.5 Training\n\nThe GB-RNN is trained by using backpropagation through structure [9]. We train the parameters by\noptimizing the regularized cross-entropy error for labeled node vectors with mini-batched AdaGrad\n[7]. Since we don\u2019t have labels for every node of the training trees, we decided that unlabeled\nnodes do not add an additional error during training. For all models, we use a development set to\ncross-validate over regularization of the different weights, word vector size, mini-batch size, dropout\nprobability and activation function (rectified linear or logistic function).\nWe also applied the dropout technique to improve training with high-dimensional word vectors.\nNode vector units are randomly set to zero with a probability of 0.5 at each training step. Our\nexperiments show that applying dropout in this way helps differentiate word vector units and\nhidden units, and leads to better performance. The high-dimensional hybrid word vectors that we\nintroduced previously have obtained a higher accuracy than other word vectors with the use of\ndropout.\n\n3.6 Comparison to Other Models\n\nThe idea of unfolding neural networks is commonly used in autoencoders as well as in a recursive\nsetting [23]; in that setting the unfolding is only used during training and not at inference time to\nupdate the beliefs about the inputs.\nIrsoy and Cardie [12] introduced a bidirectional RNN similar to ours. It employs the same standard\nfeedforward RNN, but a different computation for the backward \u2193 vectors. 
In practice, their model is\ndefined by the same forward equations as ours. However, Eq. 3, which computes the backward\nvectors, is instead:\n\n$$\\begin{bmatrix} b^{\\downarrow} \\\\ c^{\\downarrow} \\end{bmatrix} = f\\left(\\begin{bmatrix} V b + W^{\\downarrow}_{lb} p_1^{\\downarrow} \\\\ V c + W^{\\downarrow}_{rb} p_1^{\\downarrow} \\end{bmatrix}\\right), \\tag{4}$$\n\nwhere W^\u2193_lb and W^\u2193_rb are two matrices with dimensions n_d \u00d7 n_d. For a better comparison with our\nmodel we re-write Eq. 3 and make explicit the 4 blocks of W^\u2193. Let\n\n$$W^{\\downarrow} = \\begin{bmatrix} W^{\\downarrow}_{lf} & W^{\\downarrow}_{lb} \\\\ W^{\\downarrow}_{rf} & W^{\\downarrow}_{rb} \\end{bmatrix}, \\quad \\text{then} \\quad \\begin{bmatrix} b^{\\downarrow} \\\\ c^{\\downarrow} \\end{bmatrix} = f\\left(\\begin{bmatrix} W^{\\downarrow}_{lf} p_1 + W^{\\downarrow}_{lb} p_1^{\\downarrow} \\\\ W^{\\downarrow}_{rf} p_1 + W^{\\downarrow}_{rb} p_1^{\\downarrow} \\end{bmatrix}\\right), \\tag{5}$$\n\nwhere the dimensions of W^\u2193_lf and W^\u2193_rf are n_d \u00d7 n, and the dimensions of W^\u2193_lb and W^\u2193_rb are n_d \u00d7 n_d.\nA closer comparison between Eqs. 4 and 5 reveals that both use a left and right forward\ntransformation W^\u2193_lf p_1 and W^\u2193_rf p_1, but the other parts of the sums differ. In the bidirectional-RNN, the\ntransformation of any children is defined by the forward parent and independent of its position (left\nor right node). In contrast, our GB-RNN makes use of both the forward and backward parent node.\nThe intuition behind our choice is that using both nodes helps to push the model to disentangle\nthe children from their backward parent vector. We also note that our model does not use the\nforward node vector for computing the backward node vector, but we find this not necessary since the\nsoftmax function already combines the two vectors.\nOur model also has n \u00b7 n_d more parameters to compute the feedbackward vectors than the\nbidirectional-RNN. The W^\u2193 matrix of our model has 2n_d^2 + 2n \u00b7 n_d parameters, while the other\nmodel has a total of 2n_d^2 + n \u00b7 n_d parameters with the W^\u2193_lf, W^\u2193_rf and V matrices. We show in the\nnext section that GB-RNN outperforms the bidirectional RNN in our experiments.\n\nCorrect: FUSION\u2019s 5th General Meeting is tonight at 7 in ICS 213! Come out and carve pumpkins mid-quarter with us!\nCorrect: I would rather eat my left foot then to be taking the SATs tomorrow\nCorrect: Special THANKS to EVERYONE for coming out to Taboo Tuesday With DST tonight! It was FUN&educational!!! :) @XiEtaDST\nCorrect: Tough loss for @statebaseball today. Good luck on Monday with selection Sunday\nCorrect: I got the job at Claytons!(: I start Monday doing Sheetrock(: #MoneyMakin\nCorrect: St Pattys is no big deal for me, no fucks are given, but Cinco De Mayo on the other hand .. thats my 2nd bday .\nIncorrect: \u201c@Hannah Sunder: The Walking Dead is just a great tv show\u201d its bad ass just started to watch the 2nd season to catch up with the 3rd\n\nFigure 4: Examples of predictions made by the GB-RNN for Twitter documents. In this example,\nred phrases are negative and blue phrases are positive. On the last example, the model predicted\nincorrectly \u201cbad ass\u201d as negative.\n\n4 Experiments\n\nWe present a qualitative and quantitative analysis of the GB-RNN on a contextual sentiment\nclassification task. The main dataset is provided by the SemEval 2013, Task 2 competition [18]. 
We\noutperform the winners of the 2013 challenge, as well as several baselines and model ablations.\n\n4.1 Evaluation Dataset\n\nThe SemEval competition dataset is composed of tweets labeled for 3 different sentiment classes:\npositive, neutral and negative. The tweets in this dataset were split into a train (7862 labeled phrases),\ndevelopment (7862) and development-test (7862) set. The final test set is composed of 10681\nexamples. Fig. 4 shows example GB-RNN predictions on phrases marked for classification in this dataset.\nThe development dataset consists only of tweets whereas the final evaluation dataset also included\nshort text messages (SMS in the tables below).\nTweets were parsed using the Stanford Parser [22] which includes tokenizing of negations (e.g.,\ndon\u2019t becomes two tokens do and n\u2019t). We constrained the parser to keep each phrase labeled by the\ndataset inside its own subtree, so that each labeled example is represented by a single node and can\nbe classified easily.\n\nClassifier | Feature Sets | Twitter 2013 (F1) | SMS 2013 (F1)\nSVM | stemming, word cluster, SentiWordNet score, negation | 85.19 | 88.37\nSVM | POS, lexicon, negations, emoticons, elongated words, scores, syntactic dependency, PMI | 87.38 | 85.79\nSVM | punctuation, word n-grams, emoticons, character n-grams, elongated words, upper case, stopwords, phrase length, negation, phrase position, large sentiment lexicons, microblogging features | 88.93 | 88.00\nGB-RNN | parser, unsupervised word vectors (ensemble) | 89.41 | 88.40\n\nTable 1: Comparison to the best SemEval 2013 Task 2 systems, their feature sets and F1 results on\neach dataset for predicting sentiment of phrases in context. The GB-RNN obtains state of the art\nperformance on both datasets.\n\nModel | Twitter 2013 | SMS 2013\nBigram Naive Bayes | 80.45 | 78.53\nLogistic Regression | 80.91 | 80.37\nSVM | 81.87 | 81.91\nRNN | 82.11 | 84.07\nBidirectional-RNN (Irsoy and Cardie) | 85.77 | 84.77\nGB-RNN (best single model) | 86.80 | 87.15\n\nTable 2: Comparison with baselines: F1 scores on the SemEval 2013 test datasets.\n\n4.2 Comparison with Competition Systems\n\nThe first comparison is with several highly tuned systems from the SemEval 2013, Task 2\ncompetition. The competition was scored by an average of positive and negative class F1 scores. Table 1\nlists results for several methods, together with the resources and features used by each method. Most\nsystems used a considerable amount of hand-crafted features. In contrast, the GB-RNN only needs\na parser for the tree structure, unsupervised word vectors and training data. Since the competition\nallowed for external data we outline below the additional training data we use. Our best model is an\nensemble of the top 5 GB-RNN models trained independently. Their predictions were then averaged\nto produce the final output.\n\n4.3 Comparison with Baselines\n\nNext we compare our single best model to several baselines and model ablations. We used the same\nhybrid word vectors with dropout training for the RNN, the bidirectional RNN and the GB-RNN.\nThe best models were selected by cross-validating on the dev set for several hyper-parameters (word\nvectors dimension, hidden node vector dimension, number of training epochs, regularization\nparameters, activation function, training batch size and dropout probability) and we kept the models with\nthe highest cross-validation accuracy. Table 2 shows these results. 
The most important comparison\nis against the purely feedforward RNN which does not take backward sentence context into account.\nThis model performs over 5% worse than the GB-RNN.\nFor the logistic regression and Bigram Naive Bayes classification, each labeled phrase was taken\nas a separate example, removing the surrounding context. Another set of baselines used a context\nwindow for classification as well as the entire tweet as input to the classifier.\nOptimal performance for the single best GB-RNN was achieved by using vector sizes of 130\ndimensions (100 pre-trained, fixed word vectors and 30 trained on sentiment data), a mini-batch size of\n30, dropout with p = 0.5 and sigmoid non-linearity. In Table 3, we show that the concatenation of\nfixed, unsupervised vectors with additional randomly initialized, supervised vectors performs better\nthan either method alone.\n\n4.4 Model Analysis: Additional Training Data\n\nBecause the competition allowed the usage of arbitrary resources, we included as training data\nlabeled unigrams and bigrams extracted from the NRC-Canada system\u2019s sentiment lexicon. Adding\nthese additional training examples increased accuracy by 2%. 
Although this lexicon helps reduce\nthe number of unknown tokens, it does not do a good job for training recursive composition\nfunctions, because each example is short.\n\nWord vectors | Dimension | Twitter 2013 | SMS 2013\nsupervised word vectors | 15 | 85.15 | 85.66\nsemantic word vectors | 100 | 85.67 | 84.70\nhybrid word vectors | 100 + 34 | 86.80 | 87.15\n\nTable 3: F1 score comparison of word vectors on the SemEval 2013 Task 2 test dataset.\n\nFigure 5: Change in sentiment predictions in the tweet chelski want this so bad that it makes me even\nhappier thinking we may beat them twice in 4 days at SB between the RNN (left) and the GB-RNN\n(right). In particular, we can see the change for the phrase want this so bad where it is correctly\npredicted as positive with context.\n\nWe also included our own dataset composed of 176,311 noisily labeled tweets (using heuristics such\nas smiley faces) as well as the movie reviews dataset from [26]. In both datasets the labels only\ndenote the context-independent sentiment of a phrase or full sentence. Hence, we trained the final\nmodel in two steps: train only the standard RNN, then train the full GB-RNN model on the smaller\ncontext-specific competition data. 
Training the GB-RNN jointly in this fashion gave a 1% accuracy\nimprovement.\n5 Conclusion\nWe introduced global belief recursive neural networks, applied to the task of contextual sentiment\nanalysis. The idea of propagating beliefs through neural networks is a powerful and important piece\nfor interpreting natural language. The applicability of this idea is more general than RNNs and can\nbe helpful for a variety of NLP tasks such as word-sense disambiguation.\n\nAcknowledgments\n\nWe thank the anonymous reviewers for their valuable comments. We gratefully acknowledge the\nsupport of the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filter-\ning of Text (DEFT) Program under AFRL contract no. FA8750-13-2-0040.\n\nReferences\n[1] B.R. Routledge B. O\u2019Connor, R. Balasubramanyan and N.A. Smith. From tweets to polls:\n\nLinking text sentiment to public opinion time series. AAAI Conference, 2010.\n\n[2] L. Barbosa and J. Feng. Robust sentiment detection on twitter from biased and noisy data.\n\n23rd International Conference on Computational Linguistics: Posters, pages 36\u201344, 2010.\n\n[3] A. Bifet and E. Frank. Sentiment knowledge discovery in twitter streaming data. Proceedings\n\nof the 13th international conference on Discovery science, 2010.\n\n[4] K. Sobel B.J. Jansen, M. Zhang and A. Chowdury. Twitter power: Tweets as electronic word\n\nof mouth. Journal of the American Society for Information Science and Technology, 2009.\n\n8\n\n\f[5] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural Lan-\n\nguage Processing (Almost) from Scratch. JMLR, 12:2493\u20132537, 2011.\n\n[6] O. Tsur D. Davidov and A. Rappoport. Enhanced sentiment learning using twitter hashtags\n\nand smileys. Association for Computational Linguistics, 2010.\n\n[7] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and\n\nstochastic optimization. JMLR, 12, July 2011.\n\n[8] J. L. Elman. 
Distributed representations, simple recurrent networks, and grammatical structure.\n\nMachine Learning, 7(2-3):195\u2013225, 1991.\n\n[9] C. Goller and A. K\u00a8uchler. Learning task-dependent distributed representations by backpropa-\n\ngation through structure. In International Conference on Neural Networks, 1996.\n\n[10] E. Grefenstette, G. Dinu, Y.-Z. Zhang, M. Sadrzadeh, and M. Baroni. Multi-step regression\n\nlearning for compositional distributional semantics. In IWCS, 2013.\n\n[11] E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. Improving Word Representations via\n\nGlobal Context and Multiple Word Prototypes. In ACL, 2012.\n\n[12] O. Irsoy and C. Cardie. Bidirectional recursive neural networks for token-level labeling with\n\nstructure. NIPS Deep Learning Workshop, 2013.\n\n[13] R. Socher J. Pennington and C. D. Manning. Glove: Global vectors for word representation.\n\nEMNLP, 2014.\n\n[14] T. K. Landauer and S. T. Dumais. A solution to Plato\u2019s problem: the Latent Semantic Anal-\nysis theory of acquisition, induction and representation of knowledge. Psychological Review,\n104(2):211\u2013240, 1997.\n\n[15] P. Le and W. Zuidema. The inside-outside recursive neural network model for dependency\n\nparsing. EMNLP, 2014.\n\n[16] T. Mikolov, W. Yih, and G. Zweig. Linguistic regularities in continuous spaceword represen-\n\ntations. In HLT-NAACL, 2013.\n\n[17] J. Mitchell and M. Lapata. Composition in distributional models of semantics. Cognitive\n\nScience, 34(8):1388\u20131429, 2010.\n\n[18] Z. Kozareva P. Nakov. Semeval-2013 task 2: Sentiment analysis in twitter. Proceedings of the\n\nSeventh International Workshop on Semantic Evaluation (SemEval 2013), 2013.\n\n[19] J. B. Pollack. Recursive distributed representations. Arti\ufb01cial Intelligence, 46, 1990.\n[20] J.T. Rolfe and Y. LeCun. Discriminative recurrent sparse auto-encoders. arXiv:1301.3775v4,\n\n2013.\n\n[21] M. Schuster and K.K. Paliwal. Bidirectional recurrent neural networks. 
Signal Processing,\n\nIEEE Transactions, 1997.\n\n[22] R. Socher, J. Bauer, C. D. Manning, and A. Y. Ng. Parsing With Compositional Vector Gram-\n\nmars. In ACL, 2013.\n\n[23] R. Socher, E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning. Dynamic Pooling and\n\nUnfolding Recursive Autoencoders for Paraphrase Detection. In NIPS, 2011.\n\n[24] R. Socher, B. Huval, C. D. Manning, and A. Y. Ng. Semantic Compositionality Through\n\nRecursive Matrix-Vector Spaces. In EMNLP, 2012.\n\n[25] R. Socher, C. D. Manning, and A. Y. Ng. Learning continuous phrase representations and syn-\ntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning\nand Unsupervised Feature Learning Workshop, 2010.\n\n[26] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng, and C. Potts. Recursive deep\n\nmodels for semantic compositionality over a sentiment treebank. In EMNLP, 2013.\n\n[27] P. D. Turney and P. Pantel. From frequency to meaning: Vector space models of semantics.\n\nJournal of Arti\ufb01cial Intelligence Research, 37:141\u2013188, 2010.\n\n[28] A. Yessenalina and C. Cardie. Compositional matrix-space models for sentiment analysis. In\n\nEMNLP, 2011.\n\n[29] F.M. Zanzotto, I. Korkontzelos, F. Fallucchi, and S. Manandhar. Estimating linear models for\n\ncompositional distributional semantics. In COLING, 2010.\n\n9\n\n\f", "award": [], "sourceid": 1496, "authors": [{"given_name": "Romain", "family_name": "Paulus", "institution": "ISEP"}, {"given_name": "Richard", "family_name": "Socher", "institution": "Stanford University"}, {"given_name": "Christopher", "family_name": "Manning", "institution": "Stanford University"}]}