{"title": "Lexical and Hierarchical Topic Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1106, "page_last": 1114, "abstract": "Inspired by a two-level theory that unifies agenda setting and ideological framing, we propose supervised hierarchical latent Dirichlet allocation (SHLDA) which jointly captures documents' multi-level topic structure and their polar response variables. Our model extends the nested Chinese restaurant process to discover a tree-structured topic hierarchy and uses both per-topic hierarchical and per-word lexical regression parameters to model the response variables. Experiments in a political domain and on sentiment analysis tasks show that SHLDA improves predictive accuracy while adding a new dimension of insight into how topics under discussion are framed.", "full_text": "Lexical and Hierarchical Topic Regression\n\nViet-An Nguyen\nComputer Science\n\nUniversity of Maryland\n\nCollege Park, MD\n\nvietan@cs.umd.edu\n\nJordan Boyd-Graber\niSchool & UMIACS\n\nUniversity of Maryland\n\nCollege Park, MD\n\njbg@umiacs.umd.edu\n\nPhilip Resnik\n\nLinguistics & UMIACS\nUniversity of Maryland\n\nCollege Park, MD\nresnik@umd.edu\n\nAbstract\n\nInspired by a two-level theory from political science that uni\ufb01es agenda setting\nand ideological framing, we propose supervised hierarchical latent Dirichlet alloca-\ntion (SHLDA), which jointly captures documents\u2019 multi-level topic structure and\ntheir polar response variables. 
Our model extends the nested Chinese restaurant\nprocesses to discover tree-structured topic hierarchies and uses both per-topic hier-\narchical and per-word lexical regression parameters to model response variables.\nSHLDA improves prediction on political af\ufb01liation and sentiment tasks in addition\nto providing insight into how topics under discussion are framed.\n\n1\n\nIntroduction: Agenda Setting and Framing in Hierarchical Models\n\nHow do liberal-leaning bloggers talk about immigration in the US? What do conservative politicians\nhave to say about education? How do Fox News and MSNBC differ in their language about the gun\ndebate? Such questions concern not only what, but how things are talked about.\nIn political communication, the question of \u201cwhat\u201d falls under the heading of agenda setting theory,\nwhich concerns the issues introduced into political discourse (e.g., by the mass media) and their\nin\ufb02uence over public priorities [1]. The question of \u201chow\u201d concerns framing: the way the presentation\nof an issue re\ufb02ects or encourages a particular perspective or interpretation [2]. For example, the rise\nof the \u201cinnocence frame\u201d in the death penalty debate, emphasizing the irreversible consequence of\nmistaken convictions, has led to a sharp decline in the use of capital punishment in the US [3].\nIn its concern with the subjects or issues under discussion in political discourse, agenda setting\nmaps neatly to topic modeling [4] as a means of discovering and characterizing those issues [5].\nInterestingly, one line of communication theory seeks to unify agenda setting and framing by viewing\nframes as a second-level kind of agenda [1]:\njust as agenda setting is about which objects of\ndiscussion are salient, framing is about the salience of attributes of those objects. 
The key is that\nwhat communications theorists consider an attribute in a discussion can itself be an object, as well.\nFor example, \u201cmistaken convictions\u201d is one attribute of the death penalty discussion, but it can also\nbe viewed as an object of discussion in its own right.\nThis two-level view leads naturally to the idea of using a hierarchical topic model to formalize\nboth agendas and frames within a uniform setting. In this paper, we introduce a new model to do\nexactly that. The model is predictive: it represents the idea of alternative or competing perspectives\nvia a continuous-valued response variable. Although inspired by the study of political discourse,\nassociating texts with \u201cperspectives\u201d is more general and has been studied in sentiment analysis,\ndiscovery of regional variation, and value-sensitive design. We show experimentally that the model\u2019s\nhierarchical structure improves prediction of perspective in both a political domain and on sentiment\nanalysis tasks, and we argue that the topic hierarchies exposed by the model are indeed capturing\nstructure in line with the theory that motivated the work.\n\n1\n\n\f1. For each node k \u2208 [1,\u221e) in the tree\n\n(a) Draw topic \u03c6k \u223c Dir(\u03b2k)\n(b) Draw regression parameter \u03b7k \u223c N (\u00b5, \u03c3)\n\n2. For each word v \u2208 [1, V ], draw \u03c4v \u223c Laplace(0, \u03c9)\n3. For each document d \u2208 [1, D]\n\n(a) Draw level distribution \u03b8d \u223c GEM(m, \u03c0)\n(b) Draw table distribution \u03c8d \u223c GEM(\u03b1)\n(c) For each table t \u2208 [1,\u221e), draw a path cd,t \u223c nCRP(\u03b3)\n(d) For each sentence s \u2208 [1, Sd], draw a table indicator\n\ntd,s \u223c Mult(\u03c8d)\ni. For each token n \u2208 [1, Nd,s]\nA. Draw level zd,s,n \u223c Mult(\u03b8d)\nB. 
Draw word w_{d,s,n} ∼ Mult(φ_{c_{d,t_{d,s}}, z_{d,s,n}})
(e) Draw response y_d ∼ N(η^T \bar{z}_d + τ^T \bar{w}_d, ρ), where:
i. \bar{z}_{d,k} = \frac{1}{N_{d,\cdot}} \sum_{s=1}^{S_d} \sum_{n=1}^{N_{d,s}} I[k_{d,s,n} = k]
ii. \bar{w}_{d,v} = \frac{1}{N_{d,\cdot}} \sum_{s=1}^{S_d} \sum_{n=1}^{N_{d,s}} I[w_{d,s,n} = v]

Figure 1: SHLDA's generative process and plate diagram. Words w are explained by the topic hierarchy φ, and response variables y are explained by the per-topic regression coefficients η and global lexical coefficients τ.

2 SHLDA: Combining Supervision and Hierarchical Topic Structure

Jointly capturing supervision and hierarchical topic structure falls under a class of models called supervised hierarchical latent Dirichlet allocation. These models take as input a set of D documents, each associated with a response variable y_d, and output a hierarchy of topics informed by y_d. Zhang et al. [6] introduce the SHLDA family, focusing on a categorical response. In contrast, our model (which we call SHLDA for brevity) uses continuous responses. At its core, SHLDA's document generative process resembles a combination of hierarchical latent Dirichlet allocation [7, HLDA] and the hierarchical Dirichlet process [8, HDP]. HLDA uses the nested Chinese restaurant process (nCRP(γ)), combined with an appropriate base distribution, to induce an unbounded tree-structured hierarchy of topics: general topics at the top, specific at the bottom. A document is generated by traversing this tree, at each level creating a new child (hence a new path) with probability proportional to γ or otherwise respecting the "rich-get-richer" property of a CRP. A drawback of HLDA, however, is that each document is restricted to a single path in the tree. Recent work relaxes this restriction through different priors: nested HDP [9], nested Chinese franchises [10] or recursive CRPs [11].
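To make the generative story concrete, here is a minimal simulation of Figure 1's process (our own illustrative sketch, not the authors' code; it truncates the infinite tree to a fixed set of candidate paths and the GEM distribution to L levels, and all helper names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def gem_weights(m, pi, trunc):
    """Truncated GEM(m, pi) stick-breaking weights (the per-document level distribution)."""
    sticks = rng.beta(m * pi, (1.0 - m) * pi, size=trunc)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
    weights = sticks * remaining
    return weights / weights.sum()  # renormalize after truncation

def generate_document(paths, phi, eta, tau, L=3, alpha=1.0, rho=0.5,
                      m=0.5, pi=100.0, n_sents=4, n_toks=6):
    """Sample one document (empirical node/word distributions and response y)."""
    theta = gem_weights(m, pi, L)                # theta_d ~ GEM(m, pi)
    tables = []                                  # each table holds a path ("combo")
    zbar = np.zeros(len(eta))                    # empirical distribution over nodes
    wbar = np.zeros(len(tau))                    # empirical distribution over words
    for _ in range(n_sents):
        # seat the sentence: existing table prop. to its size, new table prop. to alpha
        weights = np.array([t["n"] for t in tables] + [alpha], dtype=float)
        t = rng.choice(len(weights), p=weights / weights.sum())
        if t == len(tables):                     # a new table draws a path (combo)
            tables.append({"path": paths[rng.integers(len(paths))], "n": 0})
        tables[t]["n"] += 1
        for _ in range(n_toks):
            level = rng.choice(L, p=theta)       # z_{d,s,n} ~ Mult(theta_d)
            node = tables[t]["path"][level]      # node at that level on the table's path
            word = rng.choice(len(tau), p=phi[node])  # w_{d,s,n} ~ Mult(phi_node)
            zbar[node] += 1.0
            wbar[word] += 1.0
    total = n_sents * n_toks
    zbar /= total
    wbar /= total
    y = rng.normal(eta @ zbar + tau @ wbar, np.sqrt(rho))  # response y_d
    return zbar, wbar, float(y)
```

The sketch makes the two regressor types visible: the response depends on the document's node proportions (through η) and its word frequencies (through τ).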
In this paper, we address this problem by allowing documents\nto have multiple paths through the tree by leveraging information at the sentence level using the two-\nlevel structure used in HDP. More speci\ufb01cally, in the HDP\u2019s Chinese restaurant franchise metaphor,\ncustomers (i.e., tokens) are grouped by sitting at tables and each table takes a dish (i.e., topic) from\na \ufb02at global menu. In our SHLDA, dishes are organized in a tree-structured global menu by using\nthe nCRP as prior. Each path in the tree is a collection of L dishes (one for each level) and is called\na combo. SHLDA groups sentences of a document by assigning them to tables and associates each\ntable with a combo, and thus, models each document as a distribution over combos.1\nIn SHLDA\u2019s metaphor, customers come in a restaurant and sit at a table in groups, where each group\nis a sentence. A sentence wd,s enters restaurant d and selects a table t (and its associated combo)\nwith probability proportional to the number of sentences Sd,t at that table; or, it sits at a new table\nwith probability proportional to \u03b1. After choosing the table (indexed by td,s), if the table is new, the\ngroup will select a combo of dishes (i.e., a path, indexed by cd,t) from the tree menu. Once a combo\nis in place, each token in the sentence chooses a \u201clevel\u201d (indexed by zd,s,n) in the combo, which\nspeci\ufb01es the topic (\u03c6kd,s,n \u2261 \u03c6cd,td,s ,zd,s,n) producing the associated observation (Figure 2).\nSHLDA also draws on supervised LDA [12, SLDA] associating each document d with an observable\ncontinuous response variable yd that represents the author\u2019s perspective toward a topic, e.g., positive\nvs. negative sentiment, conservative vs. liberal ideology, etc. 
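The nCRP prior underlying the tree-structured menu can be sketched as follows (an illustrative implementation under our own naming, using a dict-based tree; at each level an existing child is chosen with probability proportional to its customer count, and a new child with probability proportional to γ):

```python
import random

def ncrp_draw_path(root, gamma, depth):
    """Draw one path of length `depth` from a nested CRP, updating counts in place."""
    node = root
    node["count"] = node.get("count", 0) + 1
    path = [node]
    for _ in range(depth - 1):
        children = node.setdefault("children", [])
        # existing child weighted by its count; new child weighted by gamma
        # (the "rich get richer" property of a CRP)
        weights = [child["count"] for child in children] + [gamma]
        r = random.uniform(0.0, sum(weights))
        pick, acc = len(children), 0.0
        for i, w in enumerate(weights):
            acc += w
            if r <= acc:
                pick = i
                break
        if pick == len(children):
            children.append({"count": 0})
        node = children[pick]
        node["count"] += 1
        path.append(node)
    return path
```

Repeated draws concentrate on popular paths while occasionally branching, which is how the tree menu grows general topics near the root and more specific ones at the leaves.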
This lets us infer a multi-level topic structure informed by how topics are "framed" with respect to positions along the y_d continuum.

1We emphasize that, unlike in HDP where each table is assigned to a single dish, each table in our metaphor is associated with a combo, a collection of L dishes. We also use combo and path interchangeably.

S_d: # sentences in document d
S_{d,t}: # groups (i.e., sentences) sitting at table t in restaurant d
N_{d,s}: # tokens in w_{d,s}
N_{d,·,l}: # tokens in w_d assigned to level l
N_{d,·,>l}: # tokens in w_d assigned to levels > l
N_{d,·,≥l}: ≡ N_{d,·,l} + N_{d,·,>l}
M_{c,l}: # tables at level l on path c
C_{c,l,v}: # tokens of word type v assigned to level l on path c
C_{d,x,l,v}: # tokens of word type v in v_{d,x} assigned to level l
φ_k: topic at node k
η_k: regression parameter at node k
τ_v: regression parameter of word type v
c_{d,t}: path assignment for table t in restaurant d
t_{d,s}: table assignment for group w_{d,s}
z_{d,s,n}: level assignment for w_{d,s,n}
k_{d,s,n}: node assignment for w_{d,s,n} (i.e., the node at level z_{d,s,n} on path c_{d,t_{d,s}})
L: height of the tree
C^+: set of all possible paths (including new ones) in the tree

Figure 2: SHLDA's restaurant franchise metaphor.
Table 1: Notation used in this paper.

Unlike SLDA, we model the response variables using a normal linear regression that contains both per-topic hierarchical and per-word lexical regression parameters.
The hierarchical regression parameters\nare just like topics\u2019 regression parameters in SLDA: each topic k (here, a tree node) has a parameter\n\u03b7k, and the model uses the empirical distribution over the nodes that generated a document as the\nregressors. However, the hierarchy in SHLDA makes it possible to discover relationships between\ntopics and the response variable that SLDA\u2019s simple latent space obscures. Consider, for example,\na topic model trained on Congressional debates. Vanilla LDA would likely discover a healthcare\ncategory. SLDA [12] could discover a pro-Obamacare topic and an anti-Obamacare topic. SHLDA\ncould do that and capture the fact that there are alternative perspectives, i.e., that the healthcare issue\nis being discussed from two ideological perspectives, along with characterizing how the higher level\ntopic is discussed by those on both sides of that ideological debate.\nSometimes, of course, words are strongly associated with extremes on the response variable continuum\nregardless of underlying topic structure. Therefore, in addition to hierarchical regression parameters,\nwe include global lexical regression parameters to model the interaction between speci\ufb01c words\nand response variables. We denote the regression parameter associated with a word type v in the\nvocabulary as \u03c4v, and use the normalized frequency of v in the documents to be its regressor.\nIncluding both hierarchical and lexical parameters is important. For detecting ideology in the US,\n\u201cliberty\u201d is an effective indicator of conservative speakers regardless of context; however, \u201ccost\u201d\nis a conservative-leaning indicator in discussions about environmental policy but liberal-leaning\nin debates about foreign policy. For sentiment, \u201cwonderful\u201d is globally a positive word; however,\n\u201cunexpected\u201d is a positive descriptor of books but a negative one of a car\u2019s steering. 
SHLDA captures\nthese properties in a single model.\n\n3 Posterior Inference and Optimization\nGiven documents with observed words w = {wd,s,n} and response variables y = {yd}, the inference\ntask is to \ufb01nd the posterior distribution over: the tree structure including topic \u03c6k and regression\nparameter \u03b7k for each node k, combo assignment cd,t for each table t in document d, table assignment\ntd,s for each sentence s in a document d, and level assignment zd,s,n for each token wd,s,n. We\napproximate SHLDA\u2019s posterior using stochastic EM, which alternates between a Gibbs sampling\nE-step and an optimization M-step. More speci\ufb01cally, in the E-step, we integrate out \u03c8, \u03b8 and \u03c6 to\nconstruct a Markov chain over (t, c, z) and alternate sampling each of them from their conditional\ndistributions. In the M-step, we optimize the regression parameters \u03b7 and \u03c4 using L-BFGS [13].\nBefore describing each step in detail, let us de\ufb01ne the following probabilities. 
For more thorough derivations, please see the supplement.

• First, define v_{d,x} as a set of tokens (e.g., a token, a sentence or a set of sentences) in document d. The conditional density of v_{d,x} being assigned to path c given all other assignments is

f_c^{-d,x}(v_{d,x}) = \prod_{l=1}^{L} \frac{\Gamma(C_{c,l,\cdot}^{-d,x} + V\beta_l)}{\Gamma(C_{c,l,\cdot}^{-d,x} + C_{d,x,l,\cdot} + V\beta_l)} \prod_{v=1}^{V} \frac{\Gamma(C_{c,l,v}^{-d,x} + C_{d,x,l,v} + \beta_l)}{\Gamma(C_{c,l,v}^{-d,x} + \beta_l)}    (1)

where the superscript -d,x denotes the same count excluding assignments of v_{d,x}; marginal counts are represented by \cdot's. For a new path c^{new}, if the node does not exist, C_{c^{new},l,v}^{-d,x} = 0 for all word types v.

• Second, define the conditional density of the response variable y_d of document d, given v_{d,x} being assigned to path c and all other assignments, as

g_c^{-d,x}(y_d) = \mathcal{N}\!\left(\frac{1}{N_{d,\cdot}}\left(\sum_{l=1}^{L} \eta_{c,l}\, C_{d,x,l,\cdot} + \sum_{w_{d,s,n} \in \{w_d \setminus v_{d,x}\}} \eta_{c_{d,t_{d,s}}, z_{d,s,n}} + \sum_{s=1}^{S_d}\sum_{n=1}^{N_{d,s}} \tau_{w_{d,s,n}}\right), \rho\right)    (2)

where N_{d,\cdot} is the total number of tokens in document d. For a new node at level l on a new path c^{new}, we integrate over all possible values of \eta_{c^{new},l}.

Sampling t: For each group w_{d,s} we need to sample a table t_{d,s}. The conditional distribution of a table t given w_{d,s} and other assignments is proportional to the number of sentences sitting at t times the probability of w_{d,s} and y_d being observed under this assignment:

P(t_{d,s} = t \mid \text{rest}) \propto P(t_{d,s} = t \mid t_d^{-s}) \cdot P(w_{d,s}, y_d \mid t_{d,s} = t, w^{-d,s}, t^{-d,s}, z, c, \eta)
\propto \begin{cases} S_{d,t}^{-d,s} \cdot f_{c_{d,t}}^{-d,s}(w_{d,s}) \cdot g_{c_{d,t}}^{-d,s}(y_d), & \text{for an existing table } t; \\ \alpha \cdot \sum_{c \in C^+} P(c_{d,t^{new}} = c \mid c^{-d,s}) \cdot f_c^{-d,s}(w_{d,s}) \cdot g_c^{-d,s}(y_d), & \text{for a new table } t^{new}. \end{cases}    (3)

For a new table t^{new}, we need to sum over all possible paths C^+ of the tree, including new ones. For example, the set C^+ for the tree shown in Figure 2 consists of four existing paths (ending at one of the four leaf nodes) and three possible new paths (a new leaf off of one of the three internal nodes). The prior probability of path c is:

P(c_{d,t^{new}} = c \mid c^{-d,s}) \propto \begin{cases} \prod_{l=2}^{L} \frac{M_{c,l}^{-d,s}}{M_{c,l-1}^{-d,s} + \gamma_{l-1}}, & \text{for an existing path } c; \\ \left(\prod_{l=2}^{l^*} \frac{M_{c^{new},l}^{-d,s}}{M_{c^{new},l-1}^{-d,s} + \gamma_{l-1}}\right) \frac{\gamma_{l^*}}{M_{c^{new},l^*}^{-d,s} + \gamma_{l^*}}, & \text{for a new path } c^{new}, \text{ which consists of an existing path from the root to a node at level } l^* \text{ and a new node.} \end{cases}    (4)

Sampling z: After assigning a sentence w_{d,s} to a table, we assign each token w_{d,s,n} to a level to choose a dish from the combo. The probability of assigning w_{d,s,n} to level l is

P(z_{d,s,n} = l \mid \text{rest}) \propto P(z_{d,s,n} = l \mid z_d^{-s,n}) \, P(w_{d,s,n}, y_d \mid z_{d,s,n} = l, w^{-d,s,n}, z^{-d,s,n}, t, c, \eta)    (5)

The first factor captures the probability that a customer in restaurant d is assigned to level l, conditioned on the level assignments of all other customers in restaurant d, and is equal to

P(z_{d,s,n} = l \mid z_d^{-s,n}) = \frac{m\pi + N_{d,\cdot,l}^{-d,s,n}}{\pi + N_{d,\cdot,\ge l}^{-d,s,n}} \prod_{j=1}^{l-1} \frac{(1-m)\pi + N_{d,\cdot,>j}^{-d,s,n}}{\pi + N_{d,\cdot,\ge j}^{-d,s,n}}

The second factor is the probability of observing w_{d,s,n} and y_d, given that w_{d,s,n} is assigned to level l: P(w_{d,s,n}, y_d \mid z_{d,s,n} = l, w^{-d,s,n}, z^{-d,s,n}, t, c, \eta) = f_{c_{d,t_{d,s}}}^{-d,s,n}(w_{d,s,n}) \cdot g_{c_{d,t_{d,s}}}^{-d,s,n}(y_d).

Sampling c: After assigning customers to tables and levels, we also sample path assignments for all tables. This is important since it can change the assignments of all customers sitting at a table, which leads to a well-mixed Markov chain and faster convergence. The probability of assigning table t in restaurant d to a path c is

P(c_{d,t} = c \mid \text{rest}) \propto P(c_{d,t} = c \mid c^{-d,t}) \cdot P(w_{d,t}, y_d \mid c_{d,t} = c, w^{-d,t}, c^{-d,t}, t, z, \eta)    (6)

where we slightly abuse notation by using w_{d,t} \equiv \cup_{\{s \mid t_{d,s} = t\}} w_{d,s} to denote the set of customers in all the groups sitting at table t in restaurant d. The first factor is the prior probability of a path given all tables' path assignments c^{-d,t}, excluding table t in restaurant d, and is given in Equation 4. The second factor in Equation 6 is the probability of observing w_{d,t} and y_d given the new path assignments, P(w_{d,t}, y_d \mid c_{d,t} = c, w^{-d,t}, c^{-d,t}, t, z, \eta) = f_c^{-d,t}(w_{d,t}) \cdot g_c^{-d,t}(y_d).

Optimizing η and τ: We optimize the regression parameters η and τ via the likelihood

L(\eta, \tau) = -\frac{1}{2\rho} \sum_{d=1}^{D} (y_d - \eta^T \bar{z}_d - \tau^T \bar{w}_d)^2 - \frac{1}{2\sigma} \sum_{k=1}^{K^+} (\eta_k - \mu)^2 - \frac{1}{\omega} \sum_{v=1}^{V} |\tau_v|,    (7)

where K^+ is the number of nodes in the tree.2 This maximization is performed using L-BFGS [13].

4 Data: Congress, Products, Films

We conduct our experiments using three datasets: Congressional floor debates, Amazon product reviews, and movie reviews. For all datasets, we remove stopwords, add bigrams to the vocabulary, and filter the vocabulary using tf-idf.3
• U.S. Congressional floor debates: We downloaded debates of the 109th US Congress from GovTrack4 and preprocessed them as in Thomas et al. [14]. To remove uninterestingly non-polarized debates, we ignore bills with less than 20% "Yea" votes or less than 20% "Nay" votes.
Each\ndocument d is a turn (a continuous utterance by a single speaker, i.e. speech segment [14]), and\nits response variable yd is the \ufb01rst dimension of the speaker\u2019s DW-NOMINATE score [15], which\ncaptures the traditional left-right political distinction.5 After processing, our corpus contains 5,201\nturns in the House, 3,060 turns in the Senate, and 5,000 words in the vocabulary.6\n\u2022 Amazon product reviews: From a set of Amazon reviews of manufactured products such as\ncomputers, MP3 players, GPS devices, etc. [16], we focused on the 50 most frequently reviewed\nproducts. After \ufb01ltering, this corpus contains 37,191 reviews with a vocabulary of 5,000 words.\nWe use the rating associated with each review as the response variable yd.7\n\u2022 Movie reviews: Our third corpus is a set of 5,006 reviews of movies [17], again using review\nratings as the response variable yd, although in this corpus the ratings are normalized to the range\nfrom 0 to 1. After preprocessing, the vocabulary contains 5,000 words.\n\n5 Evaluating Prediction\n\nSHLDA\u2019s response variable predictions provide a formally rigorous way to assess whether it is an\nimprovement over prior methods. We evaluate effectiveness in predicting values of the response\nvariables for unseen documents in the three datasets. For comparison we consider these baselines:\n\u2022 Multiple linear regression (MLR) models the response variable as a linear function of multiple\nfeatures (or regressors). Here, we consider two types of features: topic-based features and lexically-\nbased features. Topic-based MLR, denoted by MLR-LDA, uses the topic distributions learned by\nvanilla LDA as features [12], while lexically-based MLR, denoted by MLR-VOC, uses the frequencies\nof words in the vocabulary as features. 
MLR-LDA-VOC uses both feature sets.
• Support vector regression (SVM) is a discriminative method [18] that uses LDA topic distributions (SVM-LDA), word frequencies (SVM-VOC), and both (SVM-LDA-VOC) as features.8
• Supervised topic model (SLDA): we implemented SLDA using Gibbs sampling. The version of SLDA we use differs slightly from the original SLDA described in [12], in that we place a Gaussian prior N(0, 1) over the regression parameters to perform L2-norm regularization.9

For parametric models (LDA and SLDA), which require the number of topics K to be specified beforehand, we use K ∈ {10, 30, 50}. We use symmetric Dirichlet priors in both LDA and SLDA, initialize

2The superscript + denotes that this number is unbounded and varies during the sampling process.
3To find bigrams, we begin with bigram candidates that occur at least 10 times in the corpus and use Pearson's χ2-test to filter out those with χ2-value less than 5, which corresponds to a significance level of 0.025.
We\nthen treat selected bigrams as single word types and add them to the vocabulary.\n\n4 http://www.govtrack.us/data/us/109/\n5Scores were downloaded from http://voteview.com/dwnomin_joint_house_and_senate.htm\n6Data will be available after blind review.\n7The ratings can range from 1 to 5, but skew positive.\n8 http://svmlight.joachims.org/\n9This performs better than unregularized SLDA in our experiments.\n\n5\n\n\fModels\n\nSVM-LDA10\nSVM-LDA30\nSVM-LDA50\nSVM-VOC\n\nSVM-LDA-VOC\nMLR-LDA10\nMLR-LDA30\nMLR-LDA50\nMLR-VOC\n\nMLR-LDA-VOC\n\nSLDA10\nSLDA30\nSLDA50\nSHLDA\n\nAmazon\nReviews\n\nMovie\nReviews\n\nHouse-Senate\nPCC \u2191\n0.173\n0.172\n0.169\n0.336\n0.256\n\nFloor Debates\nSenate-House\nPCC \u2191\nMSE \u2193\n0.08\n0.861\n0.155\n0.840\n0.215\n0.832\n1.549\n0.131\n0.246\n0.784\n\nMSE \u2193\n1.247\n1.183\n1.135\n1.467\n1.101\n\nPCC \u2191\n0.157\n0.277\n0.245\n0.373\n0.371\n\nMSE \u2193\n1.241\n1.091\n1.130\n0.972\n0.965\n\nPCC \u2191\n0.327\n0.365\n0.395\n0.584\n0.585\n\nMSE \u2193\n0.970\n0.938\n0.906\n0.681\n0.678\n\n0.163\n0.160\n0.150\n0.322\n0.319\n\n0.154\n0.174\n0.254\n\n0.735\n0.737\n0.741\n0.889\n0.873\n\n0.729\n0.793\n0.897\n\n0.068\n0.162\n0.248\n0.191\n0.194\n\n0.090\n0.128\n0.245\n\n1.151\n1.125\n1.081\n1.124\n1.120\n\n1.145\n1.188\n1.184\n\n0.143\n0.258\n0.234\n0.408\n0.410\n\n0.270\n0.357\n0.241\n\n0.356\n\n0.753\n\n0.303\n\n1.076\n\n0.413\n\n1.034\n1.065\n1.114\n0.869\n0.860\n\n1.113\n1.146\n1.939\n\n0.891\n\n0.328\n0.367\n0.389\n0.568\n0.581\n\n0.383\n0.433\n0.503\n\n0.957\n0.936\n0.914\n0.721\n0.702\n\n0.953\n0.852\n0.772\n\n0.597\n\n0.673\n\nTable 2: Regression results for Pearson\u2019s correlation coef\ufb01cient (PCC, higher is better (\u2191)) and mean squared\nerror (MSE, lower is better (\u2193)). Results on Amazon product reviews and movie reviews are averaged over 5\nfolds. Subscripts denote the number of topics for parametric models. 
For SVM-LDA-VOC and MLR-LDA-VOC, only the best results across K ∈ {10, 30, 50} are reported. Best results are in bold.

the Dirichlet hyperparameters to 0.5, and use slice sampling [19] for updating hyperparameters. For SLDA, the variance of the regression is set to 0.5. For SHLDA, we use trees with a maximum depth of three. We slice sample m, π, β and γ, and fix µ = 0, σ = 0.5, ω = 0.5 and ρ = 0.5. We found that the following set of initial hyperparameters works reasonably well for all the datasets in our experiments: m = 0.5, π = 100, β = (1.0, 0.5, 0.25), γ = (1, 1), α = 1. We also set the regression parameter of the root node to zero, which speeds inference (since the root is associated with every document) and because it is reasonable to assume that it would not change the response variable.
To compare the performance of different methods, we compute Pearson's correlation coefficient (PCC) and mean squared error (MSE) between the true and predicted values of the response variables and average over 5 folds. For the Congressional debate corpus, following Yu et al. [20], we use documents in the House to train and test on documents in the Senate, and vice versa.

Results and analysis: Table 2 shows the performance of all models on our three datasets. Methods that use only topic-based features, such as SVM-LDA and MLR-LDA, do poorly. Methods based only on lexical features, like SVM-VOC and MLR-VOC, significantly outperform the topic-only methods on the two review datasets, but are comparable or worse on congressional debates. This suggests that reviews have more highly discriminative words than political speeches (Table 3).
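The two evaluation metrics are straightforward to compute; a minimal sketch (ours, not the paper's evaluation code):

```python
import numpy as np

def pcc(y_true, y_pred):
    """Pearson's correlation coefficient between true and predicted responses."""
    yt = np.asarray(y_true, dtype=float) - np.mean(y_true)
    yp = np.asarray(y_pred, dtype=float) - np.mean(y_pred)
    return float(yt @ yp / np.sqrt((yt @ yt) * (yp @ yp)))

def mse(y_true, y_pred):
    """Mean squared error between true and predicted responses."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(diff @ diff / diff.size)
```

Note that the two metrics are complementary: a uniformly shifted prediction has perfect PCC but nonzero MSE, which is one reason to report both.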
Combining topic-based and lexically-based features improves performance, which supports our choice of incorporating both per-topic and per-word regression parameters in SHLDA.
In all cases, SHLDA achieves strong performance. For the two cases where SHLDA was second best in MSE (Amazon reviews and House-Senate), it outperforms the other methods in PCC. Doing well in PCC on these two datasets is important, since achieving low MSE is relatively easy due to the response variables' bimodal distribution in the floor debates and positively skewed distribution in Amazon reviews. For the floor debate dataset, the results of the House-Senate experiment are generally better than those of the Senate-House experiment, which is consistent with previous results [20] and is explained by the greater number of debates in the House.

6 Qualitative Analysis: Agendas and Framing/Perspective

Although a formal coherence evaluation [21] remains a goal for future work, a qualitative look at the topic hierarchy uncovered by the model suggests that it is indeed capturing agenda/framing structure as discussed in Section 1. Figure 3 shows a portion of the topic hierarchy induced from the Congressional debate corpus. Nodes A and E illustrate agendas, the issues introduced into political discourse, associated with a particular ideology: Node A focuses on the hardships of the poorer victims of Hurricane Katrina and is associated with Democrats, while Node E discusses a proposed constitutional amendment to ban flag burning and is associated with Republicans. Nodes C and D, children of a neutral "tax" topic, reveal how the parties frame taxes as gains in terms of new social services (Democrats) and as losses for job creators (Republicans).

Figure 3: Topics discovered from Congressional floor debates.
Many first-level topics are bipartisan (purple), while lower-level topics are associated with specific ideologies (Democrats blue, Republicans red). For example, the "tax" topic (B) is bipartisan, but its Democratic-leaning child (D) focuses on social goals supported by taxes ("children", "education", "health care"), while its Republican-leaning child (C) focuses on business implications ("death tax", "jobs", "businesses"). The color and the number beneath each topic show the learned regression parameter η associated with that topic.

Figure 4 shows the topic structure discovered by SHLDA in the review corpus. Nodes at higher levels are relatively neutral, with relatively small regression parameters.10 These nodes have general topics with no specific polarity. However, the bottom level clearly illustrates polarized positive/negative perspectives. For example, Node A concerns washbasins for infants, and has two polarized child nodes: reviewers take a positive perspective when their children enjoy the product (Node B: "loves", "splash", "play") but have negative reactions when it leaks (Node C: "leak(s/ed/ing)").

Figure 4: Topics discovered from Amazon reviews. Higher topics are general, while lower topics are more specific. The polarity of the review is encoded in the color: red (negative) to blue (positive).
Many of the first-level topics have no specific polarity and are associated with a broad class of products, such as “routers” (Node D). However, the lowest topics in the hierarchy are often polarized; one child topic of “router” focuses on upgradable firmware such as “tomato” and “ddwrt” (Node E, positive), while another focuses on poor “tech support” and “customer service” (Node F, negative). The number below each topic is the regression parameter learned with that topic.

In addition to the per-topic regression parameters, SHLDA also associates each word with a lexical regression parameter τ. Table 3 shows the ten words with the highest and lowest τ. The results are unsurprising, although the lexical regression for the Congressional debates is less clear-cut than for the other datasets. As we saw in Section 5, for similar datasets, SHLDA’s context-specific regression is more useful when global lexical weights do not readily differentiate documents.

10 All of the nodes at the second level have slightly negative values for the regression parameters, mainly due to the very skewed distribution of the review ratings in Amazon.

Table 3: Top words based on the global lexical regression coefficient, τ. For the floor debates, positive τ’s are Republican-leaning while negative τ’s are Democrat-leaning.

Floor Debates
  Positive: bringing, private property, illegally, tax relief, regulation, mandates, constitutional, committee report, illegal alien
  Negative: bush administration, strong opposition, ranking, republicans, republican leadership, secret, discriminate, majority, undermine

Amazon Reviews
  Positive: highly recommend, pleased, love, loves, perfect, easy, excellent, amazing, glad, happy
  Negative: waste, returned, return, stopped, leak, junk, useless, returning, refund, terrible

Movie Reviews
  Positive: hilarious, fast, schindler, excellent, motion pictures, academy award, perfect, journey, fortunately, ability
  Negative: bad, unfortunately, supposed, waste, mess, worst, acceptable, awful, suppose, boring

7 Related Work

SHLDA joins a family of LDA extensions that introduce hierarchical topics, supervision, or both. Owing to limited space, we focus here on related work that combines the two. Petinot et al.
[22] propose hierarchical Labeled LDA (hLLDA), which leverages an observed document ontology to learn topics in a tree structure; however, hLLDA assumes that the underlying tree structure is known a priori. SSHLDA [23] generalizes hLLDA by allowing the document hierarchy labels to be partially observed, with unobserved labels and the topic tree structure then inferred from the data. Boyd-Graber and Resnik [24] used hierarchical distributions within topics to learn topics across languages. In addition to these “upstream” models [25], Perotte et al. [26] propose a “downstream” model called HSLDA, which jointly models documents’ hierarchy of labels and topics. HSLDA’s topic structure is flat, however, and the response variable is a hierarchy of labels associated with each document, unlike SHLDA’s continuous response variable. Finally, another related body of work includes models that jointly capture topics and other facets such as ideologies/perspectives [27, 28] and sentiments/opinions [29], albeit with discrete rather than continuously valued responses.

Computational modeling of sentiment polarity is a voluminous field [30], and many computational political science models describe agendas [5] and ideology [31]. Looking at framing or bias at the sentence level, Greene and Resnik [32] investigate the role of syntactic structure in framing, Yano et al. [33] look at lexical indications of sentence-level bias, and Recasens et al. [34] develop linguistically informed sentence-level features for identifying bias-inducing words.

8 Conclusion

We have introduced SHLDA, a model that associates a continuously valued response variable with hierarchical topics to capture both the issues under discussion and alternative perspectives on those issues.
The two-level structure improves predictive performance over existing models on multiple datasets, while also adding potentially insightful hierarchical structure to the topic analysis. Based on a preliminary qualitative analysis, the topic hierarchy exposed by the model plausibly captures the idea of agenda setting, which concerns the issues that get discussed, and framing, which concerns authors’ perspectives on those issues. We plan to analyze the topic structure produced by SHLDA with political science collaborators and, more generally, to study how SHLDA and related models can help discover useful insights in political discourse.

Acknowledgments

This research was supported in part by NSF under grants #1211153 (Resnik) and #1018625 (Boyd-Graber and Resnik). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

[1] McCombs, M. The agenda-setting role of the mass media in the shaping of public opinion. North, 2009(05-12):21, 2002.
[2] McCombs, M., S. Ghanem. The convergence of agenda setting and framing. In Framing Public Life. 2001.
[3] Baumgartner, F. R., S. L. De Boef, A. E. Boydstun. The Decline of the Death Penalty and the Discovery of Innocence. Cambridge University Press, 2008.
[4] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 3, 2003.
[5] Grimmer, J. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis, 18(1):1–35, 2010.
[6] Zhang, J. Explore objects and categories in unexplored environments based on multimodal data. Ph.D. thesis, University of Hamburg, 2012.
[7] Blei, D. M., T. L. Griffiths, M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM, 57(2), 2010.
[8] Teh, Y. W., M. I. Jordan, M. J. Beal, et al. Hierarchical Dirichlet processes. JASA, 101(476), 2006.
[9] Paisley, J. W., C. Wang, D. M. Blei, et al. Nested hierarchical Dirichlet processes. arXiv:1210.6738, 2012.
[10] Ahmed, A., L. Hong, A. Smola. The nested Chinese restaurant franchise process: User tracking and document modeling. In ICML. 2013.
[11] Kim, J. H., D. Kim, S. Kim, et al. Modeling topic hierarchies with the recursive Chinese restaurant process. In CIKM, pages 783–792. 2012.
[12] Blei, D. M., J. D. McAuliffe. Supervised topic models. In NIPS. 2007.
[13] Liu, D., J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Prog., 1989.
[14] Thomas, M., B. Pang, L. Lee. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In EMNLP. 2006.
[15] Lewis, J. B., K. T. Poole. Measuring bias and uncertainty in ideal point estimates via the parametric bootstrap. Political Analysis, 12(2), 2004.
[16] Jindal, N., B. Liu. Opinion spam and analysis. In WSDM. 2008.
[17] Pang, B., L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL. 2005.
[18] Joachims, T. Making large-scale SVM learning practical. In Advances in Kernel Methods - SVM. 1999.
[19] Neal, R. M. Slice sampling. Annals of Statistics, 31:705–767, 2003.
[20] Yu, B., D. Diermeier, S. Kaufmann. Classifying party affiliation from political speech. JITP, 2008.
[21] Chang, J., J. Boyd-Graber, C. Wang, et al. Reading tea leaves: How humans interpret topic models. In NIPS. 2009.
[22] Petinot, Y., K. McKeown, K. Thadani. A hierarchical model of web summaries. In HLT. 2011.
[23] Mao, X., Z. Ming, T.-S. Chua, et al. SSHLDA: A semi-supervised hierarchical topic model. In EMNLP. 2012.
[24] Boyd-Graber, J., P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP. 2010.
[25] Mimno, D. M., A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI. 2008.
[26] Perotte, A. J., F. Wood, N. Elhadad, et al. Hierarchically supervised latent Dirichlet allocation. In NIPS. 2011.
[27] Ahmed, A., E. P. Xing. Staying informed: Supervised and semi-supervised multi-view topical analysis of ideological perspective. In EMNLP. 2010.
[28] Eisenstein, J., A. Ahmed, E. P. Xing. Sparse additive generative models of text. In ICML. 2011.
[29] Jo, Y., A. H. Oh. Aspect and sentiment unification model for online review analysis. In WSDM. 2011.
[30] Pang, B., L. Lee. Opinion Mining and Sentiment Analysis. Now Publishers Inc, 2008.
[31] Monroe, B. L., M. P. Colaresi, K. M. Quinn. Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403, 2008.
[32] Greene, S., P. Resnik. More than words: Syntactic packaging and implicit sentiment. In NAACL. 2009.
[33] Yano, T., P. Resnik, N. A. Smith. Shedding (a thousand points of) light on biased language. In NAACL-HLT Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. 2010.
[34] Recasens, M., C. Danescu-Niculescu-Mizil, D. Jurafsky. Linguistic models for analyzing and detecting biased language. In ACL. 2013.