{"title": "Hierarchical Optimal Transport for Document Representation", "book": "Advances in Neural Information Processing Systems", "page_first": 1601, "page_last": 1611, "abstract": "The ability to measure similarity between documents enables intelligent summarization and analysis of large corpora. Past distances between documents suffer from either an inability to incorporate semantic similarities between words or from scalability issues. As an alternative, we introduce hierarchical optimal transport as a meta-distance between documents, where documents are modeled as distributions over topics, which themselves are modeled as distributions over words. We then solve an optimal transport problem on the smaller topic space to compute a similarity score. We give conditions on the topics under which this construction defines a distance, and we relate it to the word mover's distance. \nWe evaluate our technique for k-NN classification and show better interpretability and scalability with comparable performance to current methods at a fraction of the cost.", "full_text": "Hierarchical Optimal Transport\nfor Document Representation\n\nMikhail Yurochkin1,3\n\nmikhail.yurochkin@ibm.com\n\nSebastian Claici2,3\nsclaici@mit.edu\n\nEdward Chien2,3\n\nedchien@mit.edu\n\nFarzaneh Mirzazadeh1,3\nfarzaneh@ibm.com\n\nJustin Solomon2,3\n\njsolomon@mit.edu\n\nIBM Research,1 MIT CSAIL,2 MIT-IBM Watson AI Lab3\n\nAbstract\n\nThe ability to measure similarity between documents enables intelligent summa-\nrization and analysis of large corpora. Past distances between documents suffer\nfrom either an inability to incorporate semantic similarities between words or from\nscalability issues. As an alternative, we introduce hierarchical optimal transport\nas a meta-distance between documents, where documents are modeled as distribu-\ntions over topics, which themselves are modeled as distributions over words. 
We then solve an optimal transport problem on the smaller topic space to compute a similarity score. We give conditions on the topics under which this construction defines a distance, and we relate it to the word mover's distance. We evaluate our technique for k-NN classification and show better interpretability and scalability with comparable performance to current methods at a fraction of the cost.¹

1 Introduction

Topic models like latent Dirichlet allocation (LDA) (Blei et al., 2003) are major workhorses for summarizing document collections. Typically, a topic model represents topics as distributions over the vocabulary (i.e., unique words in the corpus); documents are then modeled as distributions over topics. In this approach, words are vertices of a simplex whose dimension equals the vocabulary size and for which the distance between any pair of words is the same. More recently, word embeddings map words into high-dimensional space such that co-occurring words tend to be closer to each other than unrelated words (Mikolov et al., 2013; Pennington et al., 2014). Kusner et al. (2015) combine the geometry of word embedding space with optimal transport to propose the word mover's distance (WMD), a powerful document distance metric limited mostly by computational complexity.

As an alternative to WMD, in this paper we combine hierarchical latent structures from topic models with geometry from word embeddings. We propose hierarchical optimal topic transport (HOTT) document distances, which combine language information from word embeddings with corpus-specific, semantically-meaningful topic distributions from latent Dirichlet allocation (LDA) (Blei et al., 2003). This document distance is more efficient and more interpretable than WMD.

We give conditions under which HOTT gives a metric and show how it relates to WMD.
We test against existing metrics on k-NN classification and show that it outperforms others on average. It performs especially well on corpora with longer documents and is robust to the number of topics and word embedding quality. Additionally, we consider two applications requiring pairwise distances. The first is visualization of the metric with t-SNE (van der Maaten & Hinton, 2008). The second is link prediction from a citation network, cast as pairwise classification using HOTT features.

¹ Code: https://github.com/IBM/HOTT

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Contributions. We introduce hierarchical optimal transport to measure dissimilarities between distributions with common structure. We apply our method to document classification, where topics from a topic modeler represent the shared structure. Our approach
• is computationally efficient, since HOTT distances involve transport with small numbers of sites;
• uses corpus-specific topic and document distributions, providing higher-level interpretability;
• has comparable performance to WMD and other baselines for k-NN classification; and
• is practical in applications where all pairwise document distances are needed.

2 Related work

Document representation and similarity assessment are key applications in learning. Many methods are based on the bag-of-words (BOW), which represents documents as vectors in R^|V|, where |V| is the vocabulary size; each coordinate equals the number of times a word appears. Other weightings include term frequency inverse document frequency (TF-IDF) (Luhn, 1957; Spärck Jones, 1972) and latent semantic indexing (LSI) (Deerwester et al., 1990).
Latent Dirichlet allocation (LDA) (Blei et al., 2003) is a hierarchical Bayesian model where documents are represented as admixtures of latent topics, and the admixture weights provide low-dimensional representations. These representations, equipped with the l2 metric, comprise early examples of document dissimilarity scores.

Recent document distances employ more sophisticated methods. WMD incorporates word embeddings to account for word similarities (Kusner et al., 2015) (see §3). Huang et al. (2016) extend WMD to the supervised setting, modifying embeddings so that documents in the same class are close and documents from different classes are far. Due to computational complexity, these approaches are impractical for large corpora or documents with many unique words.

Wu & Li (2017) attempt to address the complexity of WMD via a topic mover's distance (TMD). While their k-NN classification results are comparable to WMD, they use significantly more topics, generated with a Poisson infinite relational model. This reduces semantic content and interpretability, with less significant computational speedup. They also do not leverage language information from word embeddings or otherwise. Xu et al. (2018) jointly learn topics and word embeddings, limiting the vocabulary, for complexity reasons, to under a hundred words, which is not suited for natural language processing.

Wu et al. (2018) approximate WMD using a random feature kernel. In their method, the WMD from corpus documents to a selection of random short documents facilitates approximation of pairwise WMD. The resulting word mover's embedding (WME) has similar performance with significant speedups. Their method, however, requires parameter tuning in selecting the random document set and lacks topic-level interpretability. Additionally, they do not show full-metric applications.
Lastly, Wan (2007), whose work predates (Kusner et al., 2015), applies transport to blocks of text.

3 Background

Discrete optimal transport. Optimal transport (OT) is a rich theory; we only need a small part and refer the reader to (Villani, 2009; Santambrogio, 2015) for mathematical foundations and to (Peyré & Cuturi, 2018; Solomon, 2018) for applications. Here, we focus on discrete-to-discrete OT.

Let x = {x1, ..., xn} and y = {y1, ..., ym} be two sets of points (sites) in a metric space. Let Δ_n ⊂ R^n denote the probability simplex on n elements, and let p ∈ Δ_n and q ∈ Δ_m be distributions over x and y. Then, the 1-Wasserstein distance between p and q is

    W1(p, q) = min_{Γ ∈ R_+^{n×m}}  Σ_{i,j} C_{i,j} Γ_{i,j}
               subject to  Σ_j Γ_{i,j} = p_i  and  Σ_i Γ_{i,j} = q_j,        (1)

where the cost matrix C has entries C_{i,j} = d(x_i, y_j), where d(·,·) denotes the distance. The constraints allow Γ to be interpreted as a transport plan or matching between p and q. The linear program (1) can be solved using the Hungarian algorithm (Kuhn, 1955), with complexity O(l^3 log l) where l = max(n, m). While entropic regularization can accelerate OT in learning environments (Cuturi, 2013), it is most successful when the support of the distributions is large, as it has complexity O(l^2/ε^2). In our case, the number of topics in each document is small, and the linear program is typically faster if we need an accurate solution (i.e., if ε is small).

Word mover's distance. Given an embedding of a vocabulary as V ⊂ R^n, the Euclidean metric puts a geometry on the words in V. A corpus D = {d1, d2, ..., d_|D|} can be represented using distributions over V via a normalized BOW.
In particular, d^i ∈ Δ_{l_i}, where l_i is the number of unique words in a document d^i, and d^i_j = c^i_j / |d^i|, where c^i_j is the count of word v_j in d^i and |d^i| is the number of words in d^i. The WMD between documents d1 and d2 is then WMD(d1, d2) = W1(d1, d2).

The complexity of computing WMD depends heavily on l = max(l1, l2); for longer documents, l may be a significant fraction of |V|. To evaluate the full metric on a corpus, the complexity is O(|D|^2 l^3 log l), since WMD must be computed pairwise. Kusner et al. (2015) test WMD for k-NN classification. To circumvent complexity issues, they introduce a pruning procedure using a relaxed word mover's distance (RWMD) to lower-bound WMD. On the larger 20NEWS dataset, they additionally remove infrequent words by using only the top 500 words to generate a representation.

4 Hierarchical optimal transport

Assume a topic model produces corpus-specific topics T = {t1, t2, ..., t_|T|} ⊂ Δ_{|V|}, which are distributions over words, as well as document distributions d̄^i ∈ Δ_{|T|} over topics. WMD defines a metric WMD(t_i, t_j) between topics; we consider discrete transport over T as a metric space.

We define the hierarchical topic transport distance (HOTT) between documents d1 and d2 as

    HOTT(d1, d2) = W1( Σ_{k=1}^{|T|} d̄^1_k δ_{t_k},  Σ_{k=1}^{|T|} d̄^2_k δ_{t_k} ),

where each Dirac delta δ_{t_k} is a probability distribution supported on the corresponding topic t_k and where the ground metric is WMD between topics as distributions over words. The resulting transport problem leverages topic correspondences provided by WMD in the base metric. This explains the hierarchical nature of our proposed distance.

Our construction uses transport twice: WMD provides topic distances, which are subsequently the costs in the HOTT problem.
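As a concrete illustration of the two transport levels, the following sketch solves the linear program (1) with a generic LP solver and nests it: topic-to-topic costs come from transport over words, and HOTT is then transport over topics. The word positions, topics, and topic proportions are toy stand-ins, not the paper's pipeline (which uses GloVe embeddings, LDA topics, and Gurobi):

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein1(p, q, C):
    """Solve the discrete transport LP (1): minimize <C, Gamma> subject to
    row sums of Gamma equal to p and column sums equal to q."""
    n, m = len(p), len(q)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j Gamma[i, j] = p[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # sum_i Gamma[i, j] = q[j]
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun

# Toy data: 3 topics over a 4-word vocabulary embedded on the real line.
word_pos = np.array([0.0, 1.0, 2.0, 3.0])               # stand-in word embeddings
word_cost = np.abs(word_pos[:, None] - word_pos[None, :])
topics = np.array([[0.7, 0.3, 0.0, 0.0],
                   [0.0, 0.5, 0.5, 0.0],
                   [0.0, 0.0, 0.2, 0.8]])

# Level 1: WMD between topics, i.e., transport over words.
topic_cost = np.array([[wasserstein1(ti, tj, word_cost) for tj in topics]
                       for ti in topics])

# Level 2: HOTT, i.e., transport over topics with the WMD costs.
dbar1 = np.array([0.8, 0.2, 0.0])                       # topic proportions of d1
dbar2 = np.array([0.0, 0.3, 0.7])                       # topic proportions of d2
hott = wasserstein1(dbar1, dbar2, topic_cost)
```

Precomputing `topic_cost` once per corpus is what makes each document pair cheap: every subsequent HOTT evaluation is a |T|-by-|T| transport problem regardless of document length.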
This hierarchical structure greatly reduces runtime, since |T| ≪ l; the costs for HOTT can be precomputed once per corpus. The expense of evaluating pairwise distances is drastically lower, since pairwise distances between topics may be precomputed and stored. Even as document length and corpus size increase, the transport problem for HOTT remains the same size. Hence, full metric computations are feasible on larger datasets with longer documents.

When computing WMD(t_i, t_j), we reduce computational time by truncating topics to a small number of words carrying the majority of the topic mass and re-normalizing. This procedure is motivated by interpretability considerations and the estimation variance of tail probabilities. On the interpretability side, LDA topics are often displayed using a few dozen top words, providing a human-understandable tag. Semantic coherence, a popular topic modeling evaluation metric, is also based on heavily-weighted words and has been demonstrated to align with human evaluation of topic models (Newman et al., 2010). Moreover, any topic modeling inference procedure, e.g., Gibbs sampling (Griffiths & Steyvers, 2004), has estimation variance that may dominate tail probabilities, making them unreliable. Hence, we truncate to the top 20 words when computing WMD between topics. We empirically verify that truncation to any small number of words performs equally well in §5.3.

In topic models, documents are assumed to be represented by a small subset of topics of size κ_i ≪ |T| (e.g., in Figure 1, books are mostly described by three topics), but in practice document topic proportions tend to be dense, with little mass outside the dominant topics. Williamson et al. (2010) propose an LDA extension enforcing sparsity of the topic proportions, at the cost of slower inference.
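The topic truncation step described above can be sketched as follows; the cutoff k and the example distribution are illustrative (the paper's experiments use k = 20 over the full vocabulary):

```python
import numpy as np

def truncate_topic(topic, k=20):
    """Keep a topic's k most heavily-weighted words and renormalize."""
    topic = np.asarray(topic, dtype=float)
    keep = np.argsort(topic)[-k:]          # indices of the k largest weights
    truncated = np.zeros_like(topic)
    truncated[keep] = topic[keep]
    return truncated / truncated.sum()

# Example: a peaked topic over a 6-word vocabulary, truncated to its top 3 words.
topic = np.array([0.40, 0.30, 0.15, 0.08, 0.05, 0.02])
top3 = truncate_topic(topic, k=3)          # tail mass is dropped, then renormalized
```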
When computing HOTT, we simply truncate LDA topic proportions at 1/(|T| + 1), the value below LDA's uniform topic proportion prior, and re-normalize. This reduces the complexity of our approach without performance loss, as we show empirically in §5.2 and §5.3.

Metric properties of HOTT. If each document can be uniquely represented as a linear combination of topics d^i = Σ_{k=1}^{|T|} d̄^i_k t_k, and each topic is unique, then HOTT is a metric on document space. We present a brief proof in the supplementary material.

Figure 1: Topic transport interpretability. We show two books from GUTENBERG and their heaviest-weighted topics (bolded topic names are manually assigned). The first involves steamship warfare, while the second involves biology. Left and right column percentages indicate the weights of the topics in the corresponding texts. Percentages labeling the arrows indicate the transported mass between the corresponding topics, which match semantically-similar topics.

Topic-level interpretability. The additional level of abstraction promotes higher-level interpretability at the level of topics, as opposed to the dense word-level correspondences from WMD. We provide an example in Figure 1. This diagram illustrates two books from the GUTENBERG dataset and the semantically meaningful transport between their three most heavily-weighted topics. Remaining topics and less prominent transport terms account for the remainder of the transport plan not illustrated.

Relation to WMD. First we note that if |T| = |V| and topics consist of single words covering the vocabulary, then HOTT becomes WMD. In well-behaved topic models, this is expected as |T| → |V|. Allowing |T| to vary produces different levels of granularity for our topics as well as a trade-off between computational speed and topic specificity. When |T| ≪ |V|, we argue that WMD is upper
When |T| (cid:28) |V |, we argue that WMD is upper\nbounded by HOTT and two terms that represent topic modeling loss. By the triangle inequality,\n\n\uf8f6\uf8f8+W1\n\n\uf8eb\uf8ed|T|(cid:88)\n\n|T|(cid:88)\n\n\uf8f6\uf8f8+W1\n\n\uf8eb\uf8ed|T|(cid:88)\n\n\u00afdi\nktk,\n\n\u00afdj\nktk\n\n\u00afdj\nktk, dj\n\n(2)\n\nk=1\n\nk=1\n\nk=1\n\n\uf8f6\uf8f8.\n\n\uf8eb\uf8eddi,\n\nWMD(di, dj) \u2264 W1\n\n|T|(cid:88)\nLDA inference minimizes KL(di(cid:107)(cid:80)|T|\n\nk=1\n\n\u00afdi\nktk\n\nk=1\n\n(cid:113) 1\nktk) over topic proportions \u00afdi for a given document di;\n\u00afdi\nhence, we look to relate Kullback\u2013Leibler divergence to W1. In \ufb01nite-diameter metric spaces,\n2KL(\u00b5(cid:107)\u03bd), which follows from inequalities relating Wasserstein distances\nW1(\u00b5, \u03bd) \u2264 diam(X)\n\uf8eb\uf8ed |T|(cid:88)\nto KL divergence (Otto & Villani, 2000). The middle term satis\ufb01es the following inequality:\n\n|T|(cid:88)\n\n|T|(cid:88)\n\n\u00afdi\nk\u03b4tk ,\n\n\u00afdi\nktk,\n\nW1\n\n(3)\n\n\u00afdj\nk\u03b4tk\n\n\u00afdj\nktk\n\nk=1\n\nk=1\n\nk=1\n\nwhere on the right we have HOTT (d1, d2). The optimal topic transport on the right implies an\nequal-cost transport of the corresponding linear combinations of topic distributions on the left. The\ninequality follows since W1 gives the optimal transport cost. Combining into a single inequality,\n\nWMD(di, dj) \u2264 HOTT (di, dj)+diam(X)\n\n\u00afdj\nktk\n\nKL\n\n\u00afdi\nktk\n\n\uf8f6\uf8f8 \u2264 W1\n(cid:118)(cid:117)(cid:117)(cid:117)(cid:116) 1\n\uf8ee\uf8ef\uf8f0\n\nKL\n\n2\n\nk=1\n\n\uf8eb\uf8ed |T|(cid:88)\n(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) |T|(cid:88)\n\n\uf8eb\uf8eddj\n\nk=1\n\n\uf8f6\uf8f8 ,\n(cid:118)(cid:117)(cid:117)(cid:117)(cid:116) 1\n\n2\n\n\uf8f6\uf8f8 +\n\n\uf8eb\uf8eddi\n\n(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) |T|(cid:88)\n\nk=1\n\n\uf8f9\uf8fa\uf8fb .\n\uf8f6\uf8f8\n\nWMD involves a large transport problem and Kusner et al. 
(2015) propose relaxed WMD (RWMD), a relaxation via a lower bound (see also Atasu & Mittelholzer (2019) for a GPU-accelerated variant). We next show that RWMD is not always a good lower bound on WMD.

RWMD–Hausdorff bound. Consider the optimization in (1) for calculating WMD(d1, d2), and remove the marginal constraint on d2. The resulting optimal Γ is no longer a transport plan, but rather moves mass on words in d1 to their nearest words in d2, considering only the support of d2 and not its density values. Removing the marginal constraint on d1 produces symmetric behavior; RWMD(d1, d2) is defined to be the larger cost of these two relaxed problems.

Suppose that word v_j is shared by d1 and d2. Then, the mass on v_j in d1 and d2 in each relaxed problem will not move and contributes zero cost. In the worst case, if d1 and d2 contain the same words, i.e., supp(d1) = supp(d2), then RWMD(d1, d2) = 0. More generally, the closer the supports of two documents (over V), the looser RWMD may be as a lower bound.

Figure 2 illustrates two examples. In the 2D example, 1−ε and ε denote the masses in the teal and maroon documents. The 1D example uses histograms to represent masses in the two documents. In both, RWMD is nearly zero, as mass does not have far to move, while WMD will be larger thanks to the constraints.

To make this precise, we provide the following tight upper bound: RWMD(d1, d2) ≤ d_H(supp(d1), supp(d2)), the Hausdorff distance between the supports of d1 and d2.
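A small numpy sketch of this failure mode, assuming both documents are normalized BOW vectors over a shared vocabulary with a word-to-word cost matrix C (the 1D embeddings and weights are made up for illustration):

```python
import numpy as np

def rwmd(p, q, C):
    """Relaxed WMD: drop one marginal constraint of (1) at a time, so each
    unit of mass travels to the nearest word in the other document's support;
    return the larger of the two relaxed costs."""
    cost_p = np.sum(p * C[:, q > 0].min(axis=1))   # keep the p-marginal
    cost_q = np.sum(q * C[p > 0, :].min(axis=0))   # keep the q-marginal
    return max(cost_p, cost_q)

# Toy 1D word embeddings; both documents use all three words.
word_pos = np.array([0.0, 1.0, 2.0])
C = np.abs(word_pos[:, None] - word_pos[None, :])
d1 = np.array([0.90, 0.05, 0.05])
d2 = np.array([0.05, 0.05, 0.90])
print(rwmd(d1, d2, C))   # 0.0: identical supports, yet the true WMD is large
```

With disjoint supports the relaxation is tight (e.g., point masses at 0.0 and 2.0 give RWMD = WMD = 2.0), which is why RWMD works well for short, non-overlapping documents but degrades on long ones.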
Let X = supp(d1) and Y = supp(d2), and let RWMD1(d1, d2) and RWMD2(d1, d2) denote the relaxed optimal values when the marginal constraints on d1 and d2 are kept, respectively:

    d_H(X, Y) = max( sup_{x∈X} inf_{y∈Y} d(x, y),  sup_{y∈Y} inf_{x∈X} d(x, y) )
              ≥ max( RWMD1(d1, d2), RWMD2(d1, d2) ) = RWMD(d1, d2).

The inequality follows since the left argument of the max is the furthest that mass must travel in the solution to RWMD1, while the right is the furthest that mass must travel in the solution to RWMD2. It is tight if the documents have singleton support, and whenever d1 and d2 are supported on parallel affine subspaces and are translates in a normal direction. A 2D example is in Figure 2.

Figure 2: RWMD as a poor approximation to WMD.

The preceding discussion suggests that RWMD is not an appropriate way to speed up WMD for long documents with overlapping support, a scenario where WMD's computational complexity is especially prohibitive. The GUTENBERG dataset showcases this failure, as its documents frequently have common words. Our proposed HOTT does not suffer from this failure mode, while being significantly faster and as accurate as WMD. We verify this in the subsequent experimental studies. In the supplementary materials we present a brief empirical analysis relating HOTT and RWMD to WMD in terms of Mantel correlation and a Frobenius norm.

5 Experiments

We present timings for metric computation and consider applications where distance between documents plays a crucial role: k-NN classification, low-dimensional visualization, and link prediction.

5.1 Computational timings

HOTT implementation. During training, we fit LDA with 70 topics using a Gibbs sampler (Griffiths & Steyvers, 2004).
Topics are truncated to the 20 most heavily-weighted words and renormalized. The pairwise distances between topics WMD(t_i, t_j) are precomputed with words embedded in R^300 using GloVe (Pennington et al., 2014). To evaluate HOTT at testing time, a few iterations of the Gibbs sampler are run to obtain topic proportions d̄^i of a new document d^i. When computing HOTT between a pair of documents, we truncate topic proportions at 1/(|T| + 1) and renormalize. Every instance of the OT linear program is solved using Gurobi (Gurobi Optimization, 2018).

We note that LDA inference may be carried out using other approaches, e.g., stochastic/streaming variational inference (Hoffman et al., 2013; Broderick et al., 2013) or geometric algorithms (Yurochkin & Nguyen, 2016; Yurochkin et al., 2019). We chose the MCMC variant (Griffiths & Steyvers, 2004) for its strong theoretical guarantees, simplicity, and wide adoption in the topic modeling literature.

Figure 3: k-NN classification performance across datasets.

Table 1: Dataset statistics and document pairs per second; higher is better. HOTT has higher throughput and excels on long documents with large portions of the vocabulary (as in GUTENBERG). The last five columns report document pairs per second.

DATASET     |D|     |V|     IOU     AVG(l)   AVG(κ)   CLASSES  |  RWMD   WMD    WMD-T20  HOFTT   HOTT
BBCSPORT    737     3657    0.066   116.5    11.7     5        |  1494   526    1545     2016    2548
TWITTER     3108    1205    0.029   9.7      6.3      3        |  2664   2536   2194     1384    1552
OHSUMED     9152    8261    0.046   59.4     11.0     10       |  454    377    473      829     908
CLASSIC     7093    5813    0.017   38.5     8.7      4        |  816    689    720      980     1053
REUTERS8    7674    5495    0.06    35.7     8.7      8        |  834    685    672      918     989
AMAZON      8000    16753   0.019   44.3     9.0      4        |  289    259    253      927     966
20NEWS      13277   9251    0.011   69.3     10.5     20       |  338    260    384      652     699
GUTENBERG   3037    15000   0.25    4367     13.3     142      |  2      0.3    359      1503    1720

Topic computations.
The preprocessing steps of our method (computing LDA topics and the topic-to-topic pairwise distance matrix) are dwarfed by the cost of computing the full document-to-document pairwise distance matrix. The complexity of base metric computation in our implementation is O(|T|^2), since |supp(t_i)| = 20 for all topics, leading to a relatively small OT instance.

HOTT computations. All distance computations were implemented in Python 3.7 and run on an Intel i7-6700K at 4 GHz with 32 GB of RAM. Timings for pairwise distance computations are in Table 1 (right). HOTT outperforms RWMD and WMD in terms of speed, as it solves a significantly smaller linear program. On the left side of Table 1 we summarize relevant dataset statistics: |D| is the number of documents; |V| is the vocabulary size; intersection over union (IOU) characterizes the average overlap in words between pairs of documents; AVG(l) is the average number of unique words per document; and AVG(κ) is the average number of major topics (i.e., after truncation) per document.

5.2 k-NN classification

We follow the setup of Kusner et al. (2015) to evaluate the performance of HOTT on k-NN classification.

Datasets. We consider 8 document classification datasets: BBC sports news articles (BBCSPORT) labeled by sport; tweets labeled by sentiment (TWITTER) (Sanders, 2011); Amazon reviews labeled by category (AMAZON); Reuters news articles labeled by topic (REUTERS) (we use the 8-class version and train-test split of Cachopo et al. (2007)); medical abstracts labeled by cardiovascular disease types (OHSUMED) (using 10 classes and the train-test split as in Kusner et al.
(2015)); sentences from scientific articles labeled by publisher (CLASSIC); newsgroup posts labeled by category (20NEWS), with "by-date" train-test split and removing headers, footers, and quotes;² and Project Gutenberg full-length books from 142 authors (GUTENBERG), using the author names as classes and an 80/20 train-test split in the order of document appearance. For GUTENBERG, we reduced the vocabulary to the most common 15000 words. For 20NEWS, we removed words appearing in ≤ 5 documents.

² https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html

Baselines. We focus on evaluating HOTT and a variation without topic proportion truncation (HOFTT: hierarchical optimal full topic transport) as alternatives to RWMD in a variety of metric-dependent tasks. As demonstrated by the authors, RWMD has nearly identical performance to WMD while being more computationally feasible. Additionally, we analyze a naïve approach to speeding up WMD, where we truncate documents to their top 20 unique words (WMD-T20), making the complexity comparable to HOTT (yet 20 > AVG(κ) on all datasets).
For k-NN classification, we also consider baselines that represent documents in vector form and use Euclidean distances: normalized bag-of-words (nBOW) (Frakes & Baeza-Yates, 1992); latent semantic indexing (LSI) (Deerwester et al., 1990); latent Dirichlet allocation (LDA) (Blei et al., 2003) trained with a Gibbs sampler (Griffiths & Steyvers, 2004); and term frequency inverse document frequency (TF-IDF) (Spärck Jones, 1972). We omit comparison to embedding via weighted BOW averaging (i.e., Word Centroid Distance), as it was shown to be inferior to RWMD by Kusner et al. (2015), and instead consider smooth inverse frequency (SIF), a recent document embedding method by Arora et al. (2016). We also compare to bag-of-words, where neighbors are identified using cosine similarity (Cosine). We use the same pre-trained GloVe embeddings for HOTT, RWMD, SIF, and truncated WMD, and set the same number of topics |T| = 70 for HOTT, LDA, and LSI; we provide experiments testing parameter sensitivity.

Results. We evaluate each method on k-NN classification (Fig. 3). There is no uniformly best method, but HOTT performs best on average (Fig. 4). We highlight the performance on the GUTENBERG dataset compared to RWMD. We anticipate poor performance of RWMD on GUTENBERG, since books contain more words, which can make RWMD degenerate (see §4 and Fig. 2). Also note the strong performance of TF-IDF on OHSUMED and 20NEWS, which differs from the results of Kusner et al. (2015). We believe this is due to a different normalization scheme. We used TfidfTransformer from scikit-learn (Pedregosa et al., 2011) with default settings.
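Since all distance-based methods enter k-NN classification only through pairwise distances, the evaluation loop can be sketched independently of the metric; the distance matrix and labels below are toy stand-ins:

```python
import numpy as np

def knn_predict(D_test_train, train_labels, k=3):
    """Majority vote among the k nearest training documents, given a
    precomputed (num_test x num_train) distance matrix."""
    preds = []
    for row in D_test_train:
        nearest = np.argsort(row)[:k]                       # k smallest distances
        labels, counts = np.unique(train_labels[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Toy distances: 2 test documents against 4 training documents.
D = np.array([[0.1, 0.2, 0.9, 0.8],
              [0.9, 0.8, 0.1, 0.3]])
y_train = np.array([0, 0, 1, 1])
y_pred = knn_predict(D, y_train, k=3)   # each test doc takes its neighbors' majority label
```

Any of the metrics compared here (HOTT, RWMD, WMD-T20, or a Euclidean baseline) can fill in `D`, which is what makes throughput on pairwise distances (Table 1) the limiting factor.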
We conclude that HOTT is most powerful, both computationally (Table 1, right) and as a distance metric for k-NN classification (Figures 3 and 4), on larger corpora of longer documents, whereas on shorter documents RWMD and HOTT perform similarly.

Figure 4: Aggregated k-NN classification performance normalized by nBOW.

Another interesting observation is the effect of truncation: HOTT performs as well as HOFTT, meaning that truncating the topic proportions of LDA does not prevent us from obtaining high-quality document distances in less computational time, whereas truncating unique words for WMD degrades its performance. This observation emphasizes the challenge of speeding up WMD: WMD cannot be made computationally efficient via truncation without degrading its performance. WMD-T20 is slower than HOTT (Table 1) and performs noticeably worse (Figure 4). Truncating WMD further would make its performance even worse, while truncating less would quickly lead to impractical run-times.

In the supplement, we complement our results by considering the 2-Wasserstein distance and stemming, a popular text pre-processing procedure for topic models to reduce vocabulary size. HOTT continues to produce the best performance on average. We restate that in all main-text experiments we used 1-Wasserstein (i.e., eq. (1)) and did not stem, following the experimental setup of Kusner et al. (2015).

5.3 Sensitivity analysis of HOTT

We analyze sensitivity of HOTT with respect to its components: word embeddings, number of LDA topics, and topic truncation level.

Sensitivity to word embeddings. We train word2vec (Mikolov et al., 2013) 200-dimensional embeddings on REUTERS and compare relevant methods with our default embedding (i.e., GloVe) and the newly-trained word2vec embeddings. According to Mikolov et al. (2013), word embedding quality largely depends on data quantity rather than quality; hence we expect the performance to degrade. In Fig.
5a, the performance of RWMD and truncated WMD drops as expected, but HOTT and HOFTT remain stable; this behavior is likely due to the embedding-independent topic structure taken into consideration.

[Figure 4 (average error w.r.t. nBOW, all datasets): nBOW 1.00, LSI 0.82, SIF 0.79, LDA 0.61, Cosine 0.65, RWMD 0.59, TF-IDF 0.66, HOFTT 0.52, HOTT 0.52, WMD-T20 0.64.]

Figure 5: Sensitivity analysis: embedding, topic number, and topic truncation. (a) Embedding sensitivity on REUTERS; (b) topic number sensitivity on CLASSIC; (c) topic truncation sensitivity on REUTERS.

Number of LDA topics. In our experiments, we set |T| = 70. As |T| increases, LDA resembles the nBOW representation; correspondingly, HOTT approaches WMD. The difference, however, is that nBOW is a weaker baseline, while WMD is a powerful document distance. Using the CLASSIC dataset, in Fig. 5b we demonstrate that LDA (and LSI) may degrade with too many topics, while HOTT and HOFTT are robust to topic overparameterization. In this example, the better performance of HOTT over HOFTT is likely due to the relatively short documents of the CLASSIC dataset. While we have shown that HOTT is not sensitive to the choice of the number of topics, it is also possible to eliminate this parameter by using LDA inference algorithms that learn the number of topics (Yurochkin et al., 2017) or by adopting Bayesian nonparametric topic models and corresponding inference schemes (Teh et al., 2006; Wang et al., 2011; Bryant & Sudderth, 2012).

Topic truncation. Fig. 5c demonstrates k-NN classification performance on the REUTERS dataset with varying topic truncation: top 10, 20 (HOTT and HOFTT), 50, and 100 words, and no truncation (HOTT full and HOFTT full); LDA performance is given for reference. Varying the truncation level
Varying the truncation level\ndoes not affect the results signi\ufb01cantly, however no truncation results in unstable performance.\n\n5.4\n\nt-SNE metric visualization\n\nVisualizing metrics as point clouds provides useful qual-\nitative information for human users. Unlike k-NN clas-\nsi\ufb01cation, most methods for this task require long-range\ndistances and a full metric. Here, we use t-SNE (van der\nMaaten & Hinton, 2008) to visualize HOTT and RWMD\non the CLASSIC dataset in Fig. 6. HOTT appears to more\naccurately separate the labeled points (color-coded). The\nsupplementary material gives additional t-SNE results.\n\n5.5 Supervised link prediction\n\nFigure 6: t-SNE on CLASSIC\n\nWe next evaluate HOTT in a different prediction task: supervised link prediction on graphs de\ufb01ned\non text domains, here citation networks. The speci\ufb01c task we address is the Kaggle challenge of Link\nPrediction TU.3 In this challenge, a citation network is given as an undirected graph, where nodes\nare research papers and (undirected) edges represent citations. From this graph, edges have been\nremoved at random. The task is to reconstruct the full network. The dataset contains 27770 papers\n(nodes). The training and testing sets consist of 615512 and 32648 node pairs (edges) respectively.\nFor each paper, the available data only includes publication year, title, authors, and abstract.\nTo study the effectiveness of a distance-based model with HOTT for link prediction, we train a\nlinear SVM classi\ufb01er over the feature set \u03a6, which includes the distance between the two abstracts\n\u03c6dist computed via one of {HOFT, HOTT, RWMD, WMD-T20}. For completeness, we also\nexamine excluding the distance totally. We incrementally grow the feature sets \u03a6 as: \u03a60 = {\u03c6dist},\n\u03a61 = {\u03c6dist} \u222a {\u03c61}, \u03a6n = {\u03c6dist} \u222a {\u03c61, . . . 
, \u03c6n} where \u03c61 is the number of common words\n\n3www.kaggle.com/c/link-prediction-tu\n\n8\n\nGloVeword2vec on R8Word embedding method0246810Test error %5.67.74.69.94.74.94.84.5RWMDWMD-T20HOFTTHOTT20406080100Number of topics4681012Test error %HOTTHOFTTLDALSI20406080100Number of topics510152025Test error %HOTTHOTT fullHOFTT 50HOTT 50HOFTT 10HOFTTHOTT 10LDAHOTT 100HOFTT 100HOFTT fullHOTTHOTTHOTTHOTTCACMMEDCRANCISIRWMDRWMDRWMDRWMDCACMMEDCRANCISI\fTable 2: Link prediction: using distance (rows) for node-pair representations (cols).\n\nDistance\n\nF1 Score\n\n\u03a60\nHOFTT 73.22\nHOTT 73.19\nRWMD 71.60\n67.22\n\n\u03a61\n76.27\n76.03\n74.90\n63.38\nNone \u2014 61.13\n\nWMD-T20\n\n\u03a62\n76.62\n76.24\n75.20\n65.20\n64.27\n\n\u03a63\n78.85\n78.64\n77.16\n70.38\n67.72\n\n\u03a64\n83.37\n83.25\n82.92\n81.84\n81.68\n\nin the titles, \u03c62 the number of common authors, and \u03c63 and \u03c64 the signed and absolute difference\nbetween the publication years.\nTable 2 presents the results; evaluation is based on the F1-Score. Consistently, HOFTT and HOTT are\nmore effective than RWMD and WMD-T20 in all tests, and not using any of the distances consistently\ndegrades the performance.\n\n6 Conclusion\n\nWe have proposed a hierarchical method for comparing natural language documents that leverages\noptimal transport, topic modeling, and word embeddings. Speci\ufb01cally, word embeddings provide\nglobal semantic language information, while LDA topic models provide corpus-speci\ufb01c topics and\ntopic distributions. Empirically these combine to give superior performance on various metric-based\ntasks. We hypothesize that modeling documents by their representative topics is better for highlighting\ndifferences despite the loss in resolution. 
HOTT appears to capture differences in the same way a person asked to compare two documents would: by breaking down each document into easy-to-understand concepts and then comparing the concepts.
There are many avenues for future work. From a theoretical perspective, our use of a nested Wasserstein metric suggests further analysis of this hierarchical transport space. Insight gained in this direction may reveal the learning capacity of our method and inspire faster or more accurate algorithms. From a computational perspective, our approach currently combines word embeddings, topic models, and OT, but these are all trained separately. End-to-end training that efficiently optimizes these three components jointly would likely improve performance and facilitate analysis of our algorithm as a unified approach to document comparison.
Finally, from an empirical perspective, the performance improvements we observe stem directly from a reduction in the size of the transport problem. Investigation of larger corpora with longer documents, as well as applications requiring the full set of pairwise distances, is now feasible. We can also consider applications to modeling images or 3D data.

Acknowledgements. J. Solomon acknowledges the generous support of Army Research Office grant W911NF1710068, of Air Force Office of Scientific Research award FA9550-19-1-031, of National Science Foundation grant IIS-1838071, of an Amazon Research Award, of the MIT-IBM Watson AI Laboratory, of the Toyota-CSAIL Joint Research Center, of the QCRI–CSAIL Computer Science Research Program, and of a gift from Adobe Systems. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of these organizations.

References

Arora, S., Liang, Y., and Ma, T. A simple but tough-to-beat baseline for sentence embeddings. 2016.

Atasu, K.
and Mittelholzer, T. Linear-complexity data-parallel earth mover's distance approximations. In International Conference on Machine Learning, pp. 364–373, 2019.

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, March 2003.

Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. I. Streaming variational Bayes. In Advances in Neural Information Processing Systems, pp. 1727–1735, 2013.

Bryant, M. and Sudderth, E. B. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, pp. 2699–2707, 2012.

Cachopo, A. M. d. J. C. et al. Improving methods for single-label text categorization. Instituto Superior Técnico, Portugal, 2007.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292–2300, 2013.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391, 1990.

Frakes, W. B. and Baeza-Yates, R. Information Retrieval: Data Structures & Algorithms, volume 331. Prentice Hall, Englewood Cliffs, NJ, 1992.

Griffiths, T. L. and Steyvers, M. Finding scientific topics. PNAS, 101(suppl. 1):5228–5235, 2004.

Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2018. URL http://www.gurobi.com.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, May 2013.

Huang, G., Guo, C., Kusner, M. J., Sun, Y., Sha, F., and Weinberger, K. Q. Supervised word mover's distance. In Advances in Neural Information Processing Systems, pp. 4862–4870, 2016.

Kuhn, H. W.
The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 2(1-2):83–97, 1955.

Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. From word embeddings to document distances. In International Conference on Machine Learning, pp. 957–966, 2015.

Luhn, H. P. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1(4):309–317, 1957.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.

Newman, D., Lau, J. H., Grieser, K., and Baldwin, T. Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 100–108. Association for Computational Linguistics, 2010.

Otto, F. and Villani, C. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361–400, 2000.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.

Peyré, G. and Cuturi, M. Computational Optimal Transport. Submitted, 2018.

Sanders, N. J. Sanders-Twitter sentiment corpus. Sanders Analytics LLC, 2011.

Santambrogio, F. Optimal Transport for Applied Mathematicians, volume 87 of Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, 2015.
ISBN 978-3-319-20827-5. doi: 10.1007/978-3-319-20828-2.

Solomon, J. Optimal Transport on Discrete Domains. AMS Short Course on Discrete Differential Geometry, 2018.

Spärck Jones, K. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.

Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.

van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

Villani, C. Optimal Transport: Old and New. Number 338 in Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009. ISBN 978-3-540-71049-3.

Wan, X. A novel document similarity measure based on earth mover's distance. Information Sciences, 177(18):3718–3730, 2007. doi: 10.1016/j.ins.2007.02.045.

Wang, C., Paisley, J., and Blei, D. Online variational inference for the hierarchical Dirichlet process. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 752–760, 2011.

Williamson, S., Wang, C., Heller, K. A., and Blei, D. M. The IBP compound Dirichlet process and its application to focused topic modeling. In Proceedings of the 27th International Conference on Machine Learning, pp. 1151–1158, 2010.

Wu, L., Yen, I. E., Xu, K., Xu, F., Balakrishnan, A., Chen, P.-Y., Ravikumar, P., and Witbrock, M. J. Word mover's embedding: From word2vec to document embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4524–4534, 2018.

Wu, X. and Li, H. Topic mover's distance based document classification. In Communication Technology (ICCT), 2017 IEEE 17th International Conference on, pp. 1998–2002.
IEEE, 2017.

Xu, H., Wang, W., Liu, W., and Carin, L. Distilled Wasserstein learning for word embedding and topic modeling. In Advances in Neural Information Processing Systems, pp. 1716–1725, 2018.

Yurochkin, M. and Nguyen, X. Geometric Dirichlet Means Algorithm for topic inference. In Advances in Neural Information Processing Systems, pp. 2505–2513, 2016.

Yurochkin, M., Guha, A., and Nguyen, X. Conic Scan-and-Cover algorithms for nonparametric topic modeling. In Advances in Neural Information Processing Systems, pp. 3881–3890, 2017.

Yurochkin, M., Guha, A., Sun, Y., and Nguyen, X. Dirichlet simplex nest and geometric inference. In International Conference on Machine Learning, pp. 7262–7271, 2019.