{"title": "Learning a Concept Hierarchy from Multi-labeled Documents", "book": "Advances in Neural Information Processing Systems", "page_first": 3671, "page_last": 3679, "abstract": "While topic models can discover patterns of word usage in large corpora, it is difficult to meld this unsupervised structure with noisy, human-provided labels, especially when the label space is large. In this paper, we present a model-Label to Hierarchy (L2H)-that can induce a hierarchy of user-generated labels and the topics associated with those labels from a set of multi-labeled documents. The model is robust enough to account for missing labels from untrained, disparate annotators and provide an interpretable summary of an otherwise unwieldy label set. We show empirically the effectiveness of L2H in predicting held-out words and labels for unseen documents.", "full_text": "Learning a Concept Hierarchy from\n\nMulti-labeled Documents\n\nViet-An Nguyen1\u2217, Jordan Boyd-Graber2, Philip Resnik1,3,4, Jonathan Chang5\n\n2Computer Science\n\n5Facebook\n\n1Computer Science, 3Linguistics, 4UMIACS\n\nUniv. of Maryland, College Park, MD\n\nvietan@cs.umd.edu\n\nJordan.Boyd.Graber\n\njonchang@fb.com\n\nUniv. of Colorado, Boulder, CO\n\nMenlo Park, CA\n\nresnik@umd.edu\n\n@colorado.edu\n\nAbstract\n\nWhile topic models can discover patterns of word usage in large corpora, it is\ndif\ufb01cult to meld this unsupervised structure with noisy, human-provided labels,\nespecially when the label space is large. In this paper, we present a model\u2014Label\nto Hierarchy (L2H)\u2014that can induce a hierarchy of user-generated labels and the\ntopics associated with those labels from a set of multi-labeled documents. The\nmodel is robust enough to account for missing labels from untrained, disparate\nannotators and provide an interpretable summary of an otherwise unwieldy label\nset. 
We show empirically the effectiveness of L2H in predicting held-out words and labels for unseen documents.\n\n1 Understanding Large Text Corpora through Label Annotations\n\nProbabilistic topic models [4] discover the thematic structure of documents from news, blogs, and web pages. Typical unsupervised topic models such as latent Dirichlet allocation [7, LDA] uncover topics from unannotated documents. In many settings, however, documents are also associated with additional data, which provide a foundation for joint models of text with continuous response variables [6, 48, 27], categorical labels [37, 18, 46, 26], or link structure [9].\nThis paper focuses on additional information in the form of multi-labeled data, where each document is tagged with a set of labels. These data are ubiquitous: web pages are tagged with multiple directories,1 books are labeled with different categories, and political speeches are annotated with multiple issues.2 Previous topic models on multi-labeled data focus on a small set of relatively independent labels [25, 36, 46]. Unfortunately, in many real-world examples, the number of labels (from hundreds to thousands) is incompatible with the independence assumptions of these models.\nIn this paper, we capture the dependence among the labels using a learned tree-structured hierarchy. Our proposed model, L2H (Label to Hierarchy), learns from label co-occurrence and word usage to discover a hierarchy of topics associated with user-generated labels. We show empirically that L2H improves over relevant baselines in predicting words or missing labels in two prediction tasks. L2H is designed to explicitly capture the relationships among labels to discover a highly interpretable hierarchy from multi-labeled data. 
This interpretable hierarchy helps improve prediction performance and also provides an effective way to search, browse, and understand multi-labeled data [17, 10, 8, 12].\n\n*Part of this work was done while the first author interned at Facebook.\n1Open Directory Project (http://www.dmoz.org/)\n2Policy Agenda Codebook (http://policyagendas.org/)\n\n2 L2H: Capturing Label Dependencies using a Tree-structured Hierarchy\n\nDiscovering a topical hierarchy from text has been the focus of much topic modeling research. One popular approach is to learn an unsupervised hierarchy of topics. For example, hLDA [5] learns an unbounded tree-structured hierarchy of topics from unannotated documents. One drawback of hLDA is that documents are associated with only a single path in the topic tree. Recent work relaxing this restriction includes TSSB [1], nHDP [30], nCRF [2], and SHLDA [27]. Going beyond tree structures, PAM [20] captures the topic hierarchy using a pre-defined DAG, inspiring more flexible extensions [19, 24]. However, since only unannotated text is used to infer the hierarchical topics, an additional topic-labeling step is usually required to make them interpretable. This difficulty motivates work leveraging existing taxonomies such as HSLDA [31] and hLLDA [32].\nA second active area of research is constructing a taxonomy from multi-labeled data. For example, Heymann and Garcia-Molina [17] extract a tag hierarchy using the tag network centrality; similar work has been applied to protein hierarchies [42]. Hierarchies of concepts have come from seeded ontologies [39], crowdsourcing [29], and user-specified relations [33]. 
More sophisticated approaches build domain-specific keyword taxonomies by adapting Bayesian Rose Trees [21]. These approaches, however, concentrate on the tags, ignoring the content the tags describe.\nIn this paper, we combine ideas from these two lines of research and introduce L2H, a hierarchical topic model that discovers a tree-structured hierarchy of concepts from a collection of multi-labeled documents. L2H takes as input a set of D documents {w_d}, each tagged with a set of labels l_d. The label set L contains K unique, unstructured labels, and the word vocabulary size is V. To learn an interpretable taxonomy, L2H associates each label (a user-generated word/phrase) with a topic (a multinomial distribution over the vocabulary) to form a concept, and infers a tree-structured hierarchy to capture the relationships among concepts. Figure 1 shows the plate diagram for L2H, together with its generative process.\n\n1. Create label graph G and draw a spanning tree T from G (Section 2.1)\n2. For each node k ∈ [1, K] in T\n(a) If k is the root, draw background topic φ_k ∼ Dir(βu)\n(b) Otherwise, draw topic φ_k ∼ Dir(βφ_{σ(k)}), where σ(k) is node k's parent\n3. For each document d ∈ [1, D] having labels l_d\n(a) Define L^0_d and L^1_d using T and l_d (Section 2.2)\n(b) Draw θ^0_d ∼ Dir(L^0_d × α) and θ^1_d ∼ Dir(L^1_d × α)\n(c) Draw a stochastic switching variable π_d ∼ Beta(γ_0, γ_1)\n(d) For each token n ∈ [1, N_d]\ni. Draw set indicator x_{d,n} ∼ Bern(π_d)\nii. Draw topic indicator z_{d,n} ∼ Mult(θ^{x_{d,n}}_d)\niii. Draw word w_{d,n} ∼ Mult(φ_{z_{d,n}})\n\nFigure 1: Generative process and the plate diagram notation of L2H.\n\n2.1 Generating a labeled topic hierarchy\nWe assume an underlying directed graph G = (E, V) in which each node is a concept consisting of (1) a label, the observable user-generated input, and (2) a topic, a latent multinomial distribution over words.3 The prior weight of a directed edge from node i to node k is the fraction of documents tagged with label k which are also tagged with label i: t_{i,k} = D_{i,k}/D_k. We also assume an additional Background node. Edges to the Background node have prior weight zero, and edges from the Background node to node i have prior weight t_{root,i} = D_i / max_k D_k. Here, D_i is the number of documents tagged with label i, and D_{i,k} is the number of documents tagged with both labels i and k.\n3In this paper, we use node when emphasizing the structure discovered by the model. Each node corresponds to a concept which consists of a label and a topic.\nThe tree T is a spanning tree generated from G. The probability of a tree given the graph G is thus the product of all its edge prior weights: p(T | G) = \prod_{e ∈ E} t_e. To capture the intuition that child nodes in the hierarchy specialize the concepts of their parents, we model the topic φ_k at each node k using a Dirichlet distribution whose mean is centered at the topic of node k's parent σ(k), i.e., φ_k ∼ Dir(βφ_{σ(k)}). The topic at the root node is drawn from a symmetric Dirichlet φ_{root} ∼ Dir(βu), where u is a uniform distribution over the vocabulary [1, 2]. 
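To make the label-graph construction in Section 2.1 concrete, here is a minimal sketch of the edge prior weights t_{i,k} = D_{i,k}/D_k and the Background-root weights t_{root,i} = D_i/max_k D_k (a sketch only, with illustrative names; this is not the authors' implementation):

```python
from itertools import combinations

def edge_prior_weights(doc_labels, labels):
    """Compute t[(i, k)], the prior weight of the directed edge i -> k:
    the fraction of documents tagged with label k that are also tagged
    with label i, plus edges out of a synthetic Background root node."""
    D = {l: 0 for l in labels}   # D[i]: number of documents tagged with i
    D2 = {}                      # D2[(i, k)]: documents tagged with both i and k
    for ls in doc_labels:
        for l in ls:
            D[l] += 1
        for i, k in combinations(sorted(ls), 2):
            D2[(i, k)] = D2.get((i, k), 0) + 1
            D2[(k, i)] = D2.get((k, i), 0) + 1
    d_max = max(D.values())
    t = {}
    for k in labels:
        for i in labels:
            if i != k:
                # fraction of docs tagged with k that are also tagged with i
                t[(i, k)] = D2.get((i, k), 0) / D[k] if D[k] else 0.0
        t[("ROOT", k)] = D[k] / d_max   # edge from the Background node
    return t
```

A maximum spanning tree over these weights (e.g., via Chu-Liu/Edmonds' algorithm, as the paper does for initialization) then gives the maximum a priori probability tree.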
This is similar to the idea of "backoff" in language models, where more specific contexts inherit the ideas expressed in more general contexts; i.e., if we talk about "pedagogy" in education, there is a high likelihood we will also talk about it in university education [22, 41].\n\n2.2 Generating documents\n\nAs in LDA, each word in a document is generated by one of the latent topics. L2H, however, also uses the labels and topic hierarchy to restrict the topics a document uses. The document's label set l_d identifies which nodes are more likely to be used. Restricting tokens of a document in this way, to be generated only from a subset of the topics depending on the document's labels, creates specific, focused, labeled topics [36, Labeled LDA].\nUnfortunately, l_d is unlikely to be an exhaustive enumeration: particularly when the label set is large, users often forget or overlook relevant labels. We therefore depend on the learned topology of the hierarchy to fill in the gaps of what users forget by expanding l_d into a broader set, L^1_d, which is the union of nodes on the paths from the root node to any of the document's label nodes. We call this the document's candidate set. The candidate set also induces a complementary set L^0_d ≡ L \ L^1_d (illustrated in Figure 2). Previous approaches such as LPAM [3] and Tree labeled LDA [40] also leverage the label hierarchy to expand the original label set. However, these previous models require that the label hierarchy is given rather than inferred, as in our L2H.\n\nFigure 2: Illustration of the candidate label set: given a document d having labels l_d = {2, 4} (double-circled nodes), the candidate label set of d consists of the nodes on all the paths from the root node to node 2 and node 4: L^1_d = {0, 1, 2, 4} and L^0_d = {3, 5, 6}. This allows an imperfect label set to induce topics that the document should be associated with even if they were not explicitly enumerated.\n\nL2H replaces Labeled LDA's absolute restriction to specific topics with a soft preference. To achieve this, each document d has a switching variable π_d drawn from Beta(γ_0, γ_1), which effectively decides how likely tokens in d are to be generated from L^1_d versus L^0_d. Token n in document d is generated by first flipping the biased coin π_d to choose the set indicator x_{d,n}. Given x_{d,n}, the label z_{d,n} is drawn from the corresponding label distribution θ^{x_{d,n}}_d, and the token is generated from the corresponding topic: w_{d,n} ∼ Mult(φ_{z_{d,n}}).\n\n3 Posterior Inference\n\nGiven a set of documents with observed words {w_d} and labels {l_d}, inference finds the posterior distribution over the latent variables. We use a Markov chain Monte Carlo (MCMC) algorithm to perform posterior inference, in which each iteration after initialization consists of the following steps: (1) sample a set indicator x_{d,n} and topic assignment z_{d,n} for each token, (2) sample a word distribution φ_k for each node k in the tree, and (3) update the structure of the label tree.\n\nInitialization: With the large number of labels, the space of hierarchical structures that MCMC needs to explore is huge. Initializing the tree-structured hierarchy is crucial to help the sampler focus on more important regions of the search space and to help it converge. We initialize the hierarchy with the maximum a priori probability tree by running Chu-Liu/Edmonds' algorithm to find the maximum spanning tree on the graph G, starting at the background node.\n\nSampling assignments x_{d,n} and z_{d,n}: For each token, we need to sample whether it was generated from the label set or not, x_{d,n}. We choose label set i with probability \frac{C^{-d,n}_{d,i} + γ_i}{C^{-d,n}_{d,·} + γ_0 + γ_1}, and we sample a node k in the chosen set i with probability \frac{N^{-d,n}_{d,k} + α}{C^{-d,n}_{d,i} + α|L^i_d|} · φ_{k,w_{d,n}}. 
Here, C_{d,i} is the number of times tokens in document d are assigned to label set i; N_{d,k} is the number of times tokens in document d are assigned to node k. Marginal counts are denoted by ·, and -d,n denotes the counts excluding the assignment of token w_{d,n}.\nAfter we have the label set, we can sample the topic assignment. This is more efficient than sampling jointly, as most tokens are in the label set, and there are a limited number of topics in the label set. The probability of assigning node k to z_{d,n} is\n\np(x_{d,n} = i, z_{d,n} = k | x^{-d,n}, z^{-d,n}, φ, L^i_d) ∝ \frac{C^{-d,n}_{d,i} + γ_i}{C^{-d,n}_{d,·} + γ_0 + γ_1} · \frac{N^{-d,n}_{d,k} + α}{C^{-d,n}_{d,i} + α|L^i_d|} · φ_{k,w_{d,n}} (1)\n\nSampling topics φ: As discussed in Section 2.1, topics on each path in the hierarchy form a cascaded Dirichlet-multinomial chain where the multinomial φ_k at node k is drawn from a Dirichlet distribution with the mean vector being the topic φ_{σ(k)} at the parent node σ(k). Given assignments of tokens to nodes, we need to determine the conditional probability of a word given the token. This can be done efficiently in two steps: bottom-up smoothing and top-down sampling [2].\n\n• Bottom-up smoothing: This step estimates the counts \tilde{M}_{k,v} of node k propagated from its children. This can be approximated efficiently by using the minimal/maximal path assumption [11, 44]. For the minimal path assumption, each child node k' of k propagates a value of 1 to \tilde{M}_{k,v} if M_{k',v} > 0. For the maximal path assumption, each child node k' of k propagates the full count M_{k',v} to \tilde{M}_{k,v}.\n• Top-down sampling: After estimating \tilde{M}_{k,v} for each node from leaf to root, we sample the word distributions top-down using each node's actual counts m_k, its children's propagated counts \tilde{m}_k, and its parent's word distribution φ_{σ(k)}, as φ_k ∼ Dir(m_k + \tilde{m}_k + βφ_{σ(k)}).\n\nUpdating tree structure T: We update the tree structure by looping through each non-root node, proposing a new parent node, and either accepting or rejecting the proposed parent using the Metropolis-Hastings algorithm. More specifically, given a non-root node k with current parent i, we propose a new parent node j by sampling from all incoming neighbors of k in graph G, with probability proportional to the corresponding edge weights. If the proposed parent node j is a descendant of k, we reject the proposal to avoid a cycle. 
If it is not a descendant, we accept the proposed move with probability min(1, \frac{Q(i ≺ k)}{Q(j ≺ k)} · \frac{P(j ≺ k)}{P(i ≺ k)}), where Q and P denote the proposal distribution and the model's joint distribution respectively, and i ≺ k denotes the case where i is the parent of k. Since we sample the proposed parent using the edge weights, the proposal probability ratio is\n\n\frac{Q(i ≺ k)}{Q(j ≺ k)} = \frac{t_{i,k}}{t_{j,k}} (2)\n\nThe joint probability of L2H's observed and latent variables is:\n\nP = \prod_{e ∈ E} p(e | G) \prod_{d=1}^{D} p(x_d | γ) p(z_d | x_d, l_d, α) p(w_d | z_d, φ) \prod_{l=1}^{K} p(φ_l | φ_{σ(l)}, β) \, p(φ_{root} | β) (3)\n\nWhen node k changes its parent from node i to j, the candidate set L^1_d changes for any document d that is tagged with any label in the subtree rooted at k. Let △k denote the subtree rooted at k and D_{△k} = {d | ∃l ∈ △k ∧ l ∈ l_d} the set of documents whose candidate set might change when k's parent changes. Canceling unchanged quantities, the ratio of the joint probabilities is:\n\n\frac{P(j ≺ k)}{P(i ≺ k)} = \frac{t_{j,k}}{t_{i,k}} \prod_{d ∈ D_{△k}} \frac{p(z_d | j ≺ k)}{p(z_d | i ≺ k)} · \frac{p(x_d | j ≺ k)}{p(x_d | i ≺ k)} · \frac{p(w_d | j ≺ k)}{p(w_d | i ≺ k)} \prod_{l=1}^{K} \frac{p(φ_l | j ≺ k)}{p(φ_l | i ≺ k)} (4)\n\nWe now expand each factor in Equation 4. 
The probability of node assignments z_d for document d is computed by integrating out the document-topic multinomials θ^0_d and θ^1_d (for the candidate set and its inverse):\n\np(z_d | x_d, L^0_d, L^1_d; α) = \prod_{x ∈ {0,1}} \frac{Γ(α|L^x_d|)}{Γ(C_{d,x} + α|L^x_d|)} \prod_{l ∈ L^x_d} \frac{Γ(N_{d,l} + α)}{Γ(α)} (5)\n\nSimilarly, we compute the probability of x_d for each document d, integrating out π_d:\n\np(x_d | γ) = \frac{Γ(γ_0 + γ_1)}{Γ(C_{d,·} + γ_0 + γ_1)} \prod_{x ∈ {0,1}} \frac{Γ(C_{d,x} + γ_x)}{Γ(γ_x)} (6)\n\nSince we explicitly sample the topic φ_l at each node l, we would need to re-sample all topics for the case that j is the parent of k to compute the ratio \prod_{l=1}^{K} \frac{p(φ_l | j ≺ k)}{p(φ_l | i ≺ k)}. Given the sampled φ, the word likelihood is p(w_d | z_d, φ) = \prod_{n=1}^{N_d} φ_{z_{d,n}, w_{d,n}}.\nHowever, re-sampling the topics for the whole hierarchy for every node proposal is inefficient. To avoid that, we keep all φ's fixed and approximate the ratio as:\n\n\prod_{d ∈ D_{△k}} \frac{p(w_d | j ≺ k)}{p(w_d | i ≺ k)} \prod_{l=1}^{K} \frac{p(φ_l | j ≺ k)}{p(φ_l | i ≺ k)} ≈ \frac{\int_{φ_k} p(m_k + \tilde{m}_k | φ_k) \, p(φ_k | φ_j) \, dφ_k}{\int_{φ_k} p(m_k + \tilde{m}_k | φ_k) \, p(φ_k | φ_i) \, dφ_k} (7)\n\nwhere m_k is the vector of word counts at node k and \tilde{m}_k is the vector of word counts propagated from the children of k. Since φ is fixed and the node assignments z are unchanged, the word likelihoods cancel out except for tokens assigned at k or any of its children. The integration in Equation 7 is\n\n\int_{φ_k} p(m_k + \tilde{m}_k | φ_k) \, p(φ_k | φ_j) \, dφ_k = \frac{Γ(β)}{Γ(M_{k,·} + \tilde{M}_{k,·} + β)} \prod_{v=1}^{V} \frac{Γ(M_{k,v} + \tilde{M}_{k,v} + βφ_{j,v})}{Γ(βφ_{j,v})} (8)\n\nUsing Equations 2 and 4, we can compute the Metropolis-Hastings acceptance probability.\n\n4 Experiments: Analyzing Political Agendas in U.S. Congresses\n\nIn our experiments, we focus on studying political attention in the legislative process, of interest to both computer scientists [13, 14] and political scientists [15, 34]. GovTrack provides the text of bills from the US Congress, each of which is assigned multiple political issues by the Congressional Research Service. Examples of Congressional issues include Education, Higher Education, Health, Medicare, etc. To evaluate the effectiveness of L2H, we evaluate on two computational tasks: document modeling (measuring perplexity on a held-out set of documents) and multi-label classification. We also discuss qualitative results based on the label hierarchy learned by our model.\n\nData: We use the text and labels from GovTrack for the 109th through 112th Congresses (2005–2012). For both quantitative tasks, we perform 5-fold cross-validation. For each fold, we perform standard pre-processing steps on the training set including tokenization, removing stopwords, stemming, adding bigrams, and filtering using TF-IDF to obtain a vocabulary of 10,000 words (final statistics in Figure 3).4 After building the vocabulary from training documents, we discard all out-of-vocabulary words in the test documents. We ignore labels associated with fewer than 100 bills.\n\n4.1 Document modeling\n\nIn the first quantitative experiment, we focus on the task of predicting the words in held-out test documents, given their labels. 
This is measured by perplexity, a widely used evaluation metric [7, 45]. To compute perplexity, we follow the "estimating θ" method described in Wallach et al. [45, Sec. 5.1] and split each test document d into w^{TE1}_d and w^{TE2}_d. During training, we estimate all topics' distributions over the vocabulary, \hat{φ}. During test, we first run Gibbs sampling using the learned topics on w^{TE1}_d to estimate the topic proportions \hat{θ}^{TE}_d for each test document d. Then, we compute the perplexity on the held-out words w^{TE2}_d as exp{-\sum_d log p(w^{TE2}_d | l_d, \hat{θ}^{TE}_d, \hat{φ}) / N^{TE2}}, where N^{TE2} is the total number of tokens in {w^{TE2}_d}.\n\n4We find bigram candidates that occur at least ten times in the training set and use a χ2 test to filter out those having a χ2 value less than 5.0. We then treat selected bigrams as single word types in the vocabulary.\n\nSetup: We compare our proposed model L2H with the following methods:\n\n• LDA [7]: an unsupervised topic model with a flat topic structure. In our experiments, we set the number of topics of LDA equal to the number of labels in each dataset.\n• L-LDA [36]: associates each topic with a label; a document is generated using only the topics associated with the document's labels.\n• L2F (Label to Flat structure): a simplified version of L2H with a fixed, flat topic structure. The major difference between L2F and L-LDA is that L2F allows tokens to be drawn from topics that are not in the document's label set via the use of the switching variable (Section 2.2). Improvements of L2H over L2F show the importance of the hierarchical structure.\n\nFor all models, the number of topics is the number of labels in the dataset. We run for 1,000 iterations on the training data with a burn-in period of 500 iterations. 
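As a concrete reference for the perplexity metric described above, here is a minimal sketch (illustrative names only, not the authors' code; it assumes the estimated topic proportions and topic-word distributions are given as nested lists):

```python
import math

def heldout_perplexity(test_docs, theta, phi):
    """Held-out perplexity: exp(-sum_d log p(w_d^TE2) / N^TE2).

    test_docs: per-document lists of held-out token ids (the w^TE2 halves),
    theta: per-document topic proportions estimated on the w^TE1 halves,
    phi: topic-word distributions learned on the training set."""
    total_ll, n_tokens = 0.0, 0
    for d, tokens in enumerate(test_docs):
        for w in tokens:
            # marginalize over the topic assignment of this token
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            total_ll += math.log(p)
            n_tokens += 1
    return math.exp(-total_ll / n_tokens)
```

Lower perplexity indicates that the model assigns higher probability to the held-out tokens.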
After the burn-in period, we store ten sets of estimated parameters, one after every fifty iterations. During test time, we run ten chains using these ten learned models on the test data and compute the perplexity after 100 iterations. The perplexity of each fold is the average value over the ten chains [28].\n\nFigure 3: Dataset statistics.\n\nFigure 4: Perplexity on held-out documents, averaged over 5 folds.\n\nResults: Figure 4 shows the perplexity of the four models averaged over five folds on the four datasets. LDA outperforms the label-aware models since it can freely optimize the likelihood without additional constraints. L-LDA and L2F are comparable. However, L2H significantly outperforms both L-LDA and L2F. Thus, when incorporating labels into a model, learning an additional topic hierarchy improves the predictive power and generalizability of L-LDA.\n\n4.2 Multi-label Classification\n\nMulti-label classification is the task of predicting a set of labels for a test document given its text [43, 23, 47]. The prediction is from a set of K pre-defined labels, and each document can be tagged with any of the 2^K possible subsets. In this experiment, we use M3L, an efficient max-margin multi-label classifier [16], to study how features extracted from our L2H improve classification.\nWe use F1 as the evaluation metric. The F1 score is first computed for each document d as F1(d) = 2 · P(d) · R(d) / (P(d) + R(d)), where P(d) and R(d) are the precision and recall for document d. After F1(d) is computed for all documents, the overall performance can be summarized by micro-averaging and macro-averaging to obtain Micro-F1 and Macro-F1 respectively. In macro-averaging, F1 is first computed for each document using its own confusion matrix and then averaged. 
In micro-averaging, on the other hand, a single confusion matrix is computed over all documents, and the F1 score is computed from this single confusion matrix [38].\n\nSetup: We use the following sets of features:\n\n• TF: Each document is represented by a vector of the term frequencies of all word types in the vocabulary.\n• TF-IDF: Each document is represented by a vector ψ^{TFIDF}_d of the TF-IDF scores of all word types.\n• L-LDA&TF-IDF: Ramage et al. [35] combine L-LDA features and TF-IDF features to improve performance on recommendation tasks. Likewise, we extract a K-dimensional vector \hat{θ}^{L-LDA}_d and combine it with the TF-IDF vector ψ^{TFIDF}_d to form the feature vector of L-LDA&TF-IDF.5\n• L2H&TF-IDF: Similarly, we combine TF-IDF with the features \hat{θ}^{L2H}_d = {\hat{θ}^0_d, \hat{θ}^1_d} extracted using L2H (same MCMC setup as L-LDA).\n\n5We run L-LDA on the training set for 1,000 iterations and store ten models after 500 burn-in iterations. For each model, we sample assignments for all tokens using 100 iterations and average over chains to estimate \hat{θ}^{L-LDA}_d.\n\nOne complication for L2H is the candidate label set L^1_d, which is not observed during test time. Thus, during test time, we estimate L^1_d using TF-IDF. Let D_l be the set of documents tagged with label l. For each l, we compute a TF-IDF vector φ^{TFIDF}_l = avg_{d ∈ D_l} ψ^{TFIDF}_d. Then for each document d, we generate the k nearest labels using cosine similarity, and add them to the candidate label set L^1_d of d. Finally, we expand this initial set by adding all labels on the paths from the root of the learned hierarchy to any of the k nearest labels (Figure 2). 
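The path-union expansion used here is the same operation that builds the candidate set in Section 2.2 (illustrated in Figure 2). A minimal sketch, assuming the learned tree is available as a parent map (names are illustrative):

```python
def candidate_set(labels, parent):
    """Expand a document's (possibly incomplete) label set into the
    candidate set L1_d: the union of all nodes on the paths from the
    root to each of the document's labels.

    parent: maps each node to its parent in the learned tree
            (the root maps to None)."""
    candidates = set()
    for label in labels:
        node = label
        while node is not None:   # walk up to the root
            candidates.add(node)
            node = parent[node]
    return candidates
```

For example, with a tree in which node 2 lies below node 1 under the root 0 and node 4 hangs directly off the root, labels {2, 4} expand to {0, 1, 2, 4}, matching the Figure 2 example.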
We explored different values of k ∈ {3, 5, 7, 9}, with similar results; the results in this section are reported with k = 5.\n\nFigure 5: Multi-label classification results. The results are averaged over 5 folds.\n\nResults: Figure 5 shows the classification results. For both Macro-F1 and Micro-F1, TF-IDF, L-LDA&TF-IDF, and L2H&TF-IDF significantly outperform TF. Also, L-LDA&TF-IDF performs better than TF-IDF, which is consistent with Ramage et al. (2010) [35].\nL2H&TF-IDF performs better than L-LDA&TF-IDF, which in turn performs better than TF-IDF. This shows that features extracted from L2H are more predictive than those extracted from L-LDA, and both improve classification. The improvements of L2H&TF-IDF and L-LDA&TF-IDF over TF-IDF are clearer for Macro-F1 than for Micro-F1. Thus, features from both topic models help improve prediction, regardless of the frequencies of their tagged labels.\n\n4.3 Learned label hierarchy: A taxonomy of Congressional issues\n\nFigure 6: A subtree in the hierarchy learned by L2H. The subtree root International Affairs is a child node of the Background root node.\n\nTo qualitatively analyze the hierarchy learned by our model, Figure 6 shows a subtree whose root is about International Affairs, obtained by running L2H on bills in the 112th U.S. Congress. The learned topic at International Affairs shows the focus of the 112th Congress on the Arab Spring, a revolutionary wave of demonstrations and protests in Arab countries like Libya, Bahrain, etc. 
The concept is then split into two distinctive aspects of international affairs: Military and Diplomacy.\n\n[Figure 6 here displays the learned subtree, with concept nodes such as International affairs, Terrorism, Int'l organizations & cooperation, Foreign aid and international relief, International law and treaties, Religion, Europe, Middle East, Latin America, Asia, Military operations and strategy, Sanctions, Human rights, Department of Defense, Military personnel and dependents, Armed forces and national security, and Department of Homeland Security, each shown with its top topic words.]\n\nWe are working with domain experts to formally evaluate the learned concept hierarchy. A political scientist (personal communication) comments:\n\nThe international affairs topic does an excellent job of capturing the key distinction between military/defense and diplomacy/aid. Even more impressive is that it then also captures the major policy areas within each of these issues: the distinction between traditional military issues and terrorism-related issues, and the distinction between thematic policy (e.g., human rights) and geographic/regional policy.\n\n5 Conclusion\n\nWe have presented L2H, a model that discovers not just the interaction between overt labels and the latent topics used in a corpus, but also how they fit together in a hierarchy. 
Hierarchies are a natural way to organize information, and combining labels with a hierarchy provides a mechanism for integrating user knowledge and data-driven summaries in a single, consistent structure. Our experiments show that L2H yields interpretable label/topic structures, that it can substantially improve model perplexity compared to baseline approaches, and that it improves performance on a multi-label prediction task.\n\nAcknowledgments\n\nWe thank Kristina Miler, Ke Zhai, Leo Claudino, and He He for helpful discussions, and thank the anonymous reviewers for insightful comments. This research was supported in part by NSF under grants #1211153 (Resnik) and #1018625 (Boyd-Graber and Resnik). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.\n\nReferences\n\n[1] Adams, R., Ghahramani, Z., and Jordan, M. (2010). Tree-structured stick breaking for hierarchical data. In NIPS.\n[2] Ahmed, A., Hong, L., and Smola, A. (2013). The nested Chinese restaurant franchise process: User tracking and document modeling. In ICML.\n[3] Bakalov, A., McCallum, A., Wallach, H., and Mimno, D. (2012). Topic models for taxonomies. In JCDL.\n[4] Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4):77–84.\n[5] Blei, D. M., Griffiths, T. L., Jordan, M. I., and Tenenbaum, J. B. (2003a). Hierarchical topic models and the nested Chinese restaurant process. In NIPS.\n[6] Blei, D. M. and McAuliffe, J. D. (2007). Supervised topic models. In NIPS.\n[7] Blei, D. M., Ng, A., and Jordan, M. (2003b). Latent Dirichlet allocation. JMLR, 3.\n[8] Bragg, J., Mausam, and Weld, D. S. (2013). Crowdsourcing multi-label classification for taxonomy creation. In HCOMP.\n[9] Chang, J. and Blei, D. M. (2010). Hierarchical relational models for document networks. 
The Annals of\n\nApplied Statistics, 4(1):124\u2013150.\n\n[10] Chilton, L. B., Little, G., Edge, D., Weld, D. S., and Landay, J. A. (2013). Cascade: Crowdsourcing\n\ntaxonomy creation. In CHI.\n\n[11] Cowans, P. J. (2006). Probabilistic Document Modelling. PhD thesis, University of Cambridge.\n[12] Deng, J., Russakovsky, O., Krause, J., Bernstein, M. S., Berg, A., and Fei-Fei, L. (2014). Scalable\n\nmulti-label annotation. In CHI.\n\n[13] Gerrish, S. and Blei, D. M. (2011). Predicting legislative roll calls from text. In ICML.\n[14] Gerrish, S. and Blei, D. M. (2012). How they vote: Issue-adjusted models of legislative behavior. In NIPS.\n[15] Grimmer, J. (2010). A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed\n\nAgendas in Senate Press Releases. Political Analysis, 18(1):1\u201335.\n\n[16] Hariharan, B., Vishwanathan, S. V., and Varma, M. (2012). Ef\ufb01cient max-margin multi-label classi\ufb01cation\n\nwith applications to zero-shot learning. Mach. Learn., 88(1-2):127\u2013155.\n\n8\n\n\f[17] Heymann, P. and Garcia-Molina, H. (2006). Collaborative creation of communal hierarchical taxonomies\n\nin social tagging systems. Technical Report 2006-10, Stanford InfoLab.\n\n[18] Lacoste-Julien, S., Sha, F., and Jordan, M. I. (2008). DiscLDA: Discriminative learning for dimensionality\n\nreduction and classi\ufb01cation. In NIPS, pages 897\u2013904.\n\n[19] Li, W., Blei, D. M., and McCallum, A. (2007). Nonparametric Bayes Pachinko allocation. In UAI.\n[20] Li, W. and McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations.\n\nIn ICML.\n\n[21] Liu, X., Song, Y., Liu, S., and Wang, H. (2012). Automatic taxonomy construction from keywords. In\n\nKDD.\n\n[22] Mackay, D. J. C. and Peto, L. C. B. (1995). A hierarchical Dirichlet language model. Natural Language\n\nEngineering, 1(3):289\u2013308.\n\n[23] Madjarov, G., Kocev, D., Gjorgjevikj, D., and Deroski, S. (2012). 
An extensive experimental comparison\n\nof methods for multi-label learning. Pattern Recogn., 45(9):3084\u20133104.\n\n[24] Mimno, D., Li, W., and McCallum, A. (2007). Mixtures of hierarchical topics with Pachinko allocation. In\n\nICML.\n\n[25] Mimno, D. M. and McCallum, A. (2008). Topic models conditioned on arbitrary features with Dirichlet-\n\nmultinomial regression. In UAI.\n\n[26] Nguyen, V.-A., Boyd-Graber, J., and Resnik, P. (2012). SITS: A hierarchical nonparametric model using\n\nspeaker identity for topic segmentation in multiparty conversations. In ACL.\n\n[27] Nguyen, V.-A., Boyd-Graber, J., and Resnik, P. (2013). Lexical and hierarchical topic regression. In NIPS.\n[28] Nguyen, V.-A., Boyd-Graber, J., and Resnik, P. (2014). Sometimes average is best: The importance of\n\naveraging for prediction using MCMC inference in topic modeling. In EMNLP.\n\n[29] Nikolova, S. S., Boyd-Graber, J., and Fellbaum, C. (2011). Collecting Semantic Similarity Ratings to\n\nConnect Concepts in Assistive Communication Tools. Studies in Computational Intelligence. Springer.\n\n[30] Paisley, J. W., Wang, C., Blei, D. M., and Jordan, M. I. (2012). Nested hierarchical Dirichlet processes.\n\nCoRR, abs/1210.6738.\n\n[31] Perotte, A. J., Wood, F., Elhadad, N., and Bartlett, N. (2011). Hierarchically supervised latent Dirichlet\n\nallocation. In NIPS.\n\n[32] Petinot, Y., McKeown, K., and Thadani, K. (2011). A hierarchical model of web summaries. In HLT.\n[33] Plangprasopchok, A. and Lerman, K. (2009). Constructing folksonomies from user-speci\ufb01ed relations on\n\nFlickr. In WWW.\n\n[34] Quinn, K. M., Monroe, B. L., Colaresi, M., Crespin, M. H., and Radev, D. R. (2010). How to analyze\npolitical attention with minimal assumptions and costs. American Journal of Political Science, 54(1):209\u2013228.\n[35] Ramage, D., Dumais, S. T., and Liebling, D. J. (2010). Characterizing microblogs with topic models. In\n\nICWSM.\n\n[36] Ramage, D., Hall, D., Nallapati, R., and Manning, C. 
(2009). Labeled LDA: A supervised topic model for\n\ncredit attribution in multi-labeled corpora. In EMNLP.\n\n[37] Rosen-Zvi, M., Grif\ufb01ths, T. L., Steyvers, M., and Smyth, P. (2004). The author-topic model for authors\n\nand documents. In UAI.\n\n[38] Rubin, T. N., Chambers, A., Smyth, P., and Steyvers, M. (2012). Statistical topic models for multi-label\n\ndocument classi\ufb01cation. Mach. Learn., 88(1-2):157\u2013208.\n\n[39] Schmitz, P. (2006). Inducing ontology from Flickr tags. In WWW 2006.\n[40] Slutsky, A., Hu, X., and An, Y. (2013). Tree labeled LDA: A hierarchical model for web summaries. In\n\nIEEE International Conference on Big Data, pages 134\u2013140.\n\n[41] Teh, Y. W. (2006). A hierarchical Bayesian language model based on Pitman-Yor processes. In ACL.\n[42] Tibely, G., Pollner, P., Vicsek, T., and Palla, G. (2013). Extracting tag hierarchies. PLoS ONE, 8(12):e84133.\n[43] Tsoumakas, G., Katakis, I., and Vlahavas, I. P. (2010). Mining multi-label data. In Data Mining and\n\nKnowledge Discovery Handbook.\n\n[44] Wallach, H. M. (2008). Structured Topic Models for Language. PhD thesis, University of Cambridge.\n[45] Wallach, H. M., Murray, I., Salakhutdinov, R., and Mimno, D. (2009). Evaluation methods for topic\n\nmodels. In ICML.\n\n[46] Wang, C., Blei, D., and Fei-Fei, L. (2009). Simultaneous image classi\ufb01cation and annotation. In CVPR.\n[47] Zhang, M.-L. and Zhou, Z.-H. (2014). A review on multi-label learning algorithms. IEEE TKDE, 26(8).\n[48] Zhu, J., Ahmed, A., and Xing, E. P. (2009). MedLDA: maximum margin supervised topic models for\n\nregression and classi\ufb01cation. 
In ICML.\n\n9\n\n\f", "award": [], "sourceid": 1928, "authors": [{"given_name": "Viet-An", "family_name": "Nguyen", "institution": "University of Maryland"}, {"given_name": "Jordan", "family_name": "Ying", "institution": "University of Maryland"}, {"given_name": "Philip", "family_name": "Resnik", "institution": "University of Maryland"}, {"given_name": "Jonathan", "family_name": "Chang", "institution": "Facebook"}]}