{"title": "Conic Scan-and-Cover algorithms for nonparametric topic modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 3878, "page_last": 3887, "abstract": "We propose new algorithms for topic modeling when the number of topics is unknown. Our approach relies on an analysis of the concentration of mass and angular geometry of the topic simplex, a convex polytope constructed by taking the convex hull of vertices representing the latent topics. Our algorithms are shown in practice to have accuracy comparable to that of a Gibbs sampler in terms of topic estimation, even though the sampler requires the number of topics to be given. Moreover, they are one of the fastest among several state of the art parametric techniques. Statistical consistency of our estimator is established under some conditions.", "full_text": "Conic Scan-and-Cover algorithms for nonparametric topic modeling\n\nMikhail Yurochkin\nDepartment of Statistics\nUniversity of Michigan\nmoonfolk@umich.edu\n\nAritra Guha\nDepartment of Statistics\nUniversity of Michigan\naritra@umich.edu\n\nXuanLong Nguyen\nDepartment of Statistics\nUniversity of Michigan\nxuanlong@umich.edu\n\nAbstract\n\nWe propose new algorithms for topic modeling when the number of topics is unknown. Our approach relies on an analysis of the concentration of mass and angular geometry of the topic simplex, a convex polytope constructed by taking the convex hull of vertices representing the latent topics. Our algorithms are shown in practice to have accuracy comparable to that of a Gibbs sampler in terms of topic estimation, even though the sampler requires the number of topics to be given. 
Moreover, they are one\nof the fastest among several state of the art parametric techniques.1 Statistical\nconsistency of our estimator is established under some conditions.\n\n1\n\nIntroduction\n\nA well-known challenge associated with topic modeling inference can be succinctly summed up\nby the statement that sampling based approaches may be accurate but computationally very slow,\ne.g., Pritchard et al. (2000); Grif\ufb01ths & Steyvers (2004), while the variational inference approaches\nare faster but their estimates may be inaccurate, e.g., Blei et al. (2003); Hoffman et al. (2013). For\nnonparametric topic inference, i.e., when the number of topics is a priori unknown, the problem\nbecomes more acute. The Hierarchical Dirichlet Process model (Teh et al., 2006) is an elegant\nBayesian nonparametric approach which allows for the number of topics to grow with data size, but\nits sampling based inference is much more inef\ufb01cient compared to the parametric counterpart. As\npointed out by Yurochkin & Nguyen (2016), the root of the inef\ufb01ciency can be traced to the need for\napproximating the posterior distributions of the latent variables representing the topic labels \u2014 these\nare not geometrically intrinsic as any permutation of the labels yields the same likelihood.\nA promising approach in addressing the aforementioned challenges is to take a convex geometric\nperspective, where topic learning and inference may be formulated as a convex geometric problem: the\nobserved documents correspond to points randomly drawn from a topic polytope, a convex set whose\nvertices represent the topics to be inferred. This perspective has been adopted to establish posterior\ncontraction behavior of the topic polytope in both theory and practice (Nguyen, 2015; Tang et al.,\n2014). 
A method for topic estimation that exploits convex geometry, the Geometric Dirichlet Means\n(GDM) algorithm, was proposed by Yurochkin & Nguyen (2016), which demonstrates attractive\nbehaviors both in terms of running time and estimation accuracy. In this paper we shall continue to\namplify this viewpoint to address nonparametric topic modeling, a setting in which the number of\ntopics is unknown, as is the distribution inside the topic polytope (in some situations).\nWe will propose algorithms for topic estimation by explicitly accounting for the concentration of\nmass and angular geometry of the topic polytope, typically a simplex in topic modeling applications.\nThe geometric intuition is fairly clear: each vertex of the topic simplex can be identi\ufb01ed by a ray\nemanating from its center (to be de\ufb01ned formally), while the concentration of mass can be quanti\ufb01ed\n\n1Code is available at https://github.com/moonfolk/Geometric-Topic-Modeling.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\ffor the cones hinging on the apex positioned at the center. Such cones can be rotated around the\ncenter to scan for high density regions inside the topic simplex \u2014 under mild conditions such cones\ncan be constructed ef\ufb01ciently to recover both the number of vertices and their estimates.\nWe also mention another fruitful approach, which casts topic estimation as a matrix factorization\nproblem (Deerwester et al., 1990; Xu et al., 2003; Anandkumar et al., 2012; Arora et al., 2012). A\nnotable recent algorithm coming from the matrix factorization perspective is RecoverKL (Arora et al.,\n2012), which solves non-negative matrix factorization (NMF) ef\ufb01ciently under assumptions on the\nexistence of so-called anchor words. 
RecoverKL remains a parametric technique \u2014 we will extend it to a nonparametric setting and show that the anchor word assumption appears to limit the number of topics one can ef\ufb01ciently learn.\nOur paper is organized as follows. In Section 2 we discuss recent developments in geometric topic modeling and introduce our approach; Sections 3 and 4 deliver the contributions outlined above; Section 5 demonstrates experimental performance; we conclude with a discussion in Section 6.\n\n2 Geometric topic modeling\n\nBackground and related work In this section we present the convex geometry of the Latent Dirichlet Allocation (LDA) model of Blei et al. (2003), along with related theoretical and algorithmic results that motivate our work. Let V be the vocabulary size and \u2206V\u22121 be the corresponding vocabulary probability simplex. Sample K topics (i.e., distributions on words) \u03b2k \u223c DirV(\u03b7), k = 1, . . . , K, where \u03b7 \u2208 RV+. Next, sample M document-word probabilities pm residing in the topic simplex B := Conv(\u03b21, . . . , \u03b2K) (cf. Nguyen (2015)), by \ufb01rst generating their barycentric coordinates (i.e., topic proportions) \u03b8m \u223c DirK(\u03b1) and then setting pm := \u2211k \u03b2k\u03b8mk for m = 1, . . . , M, where \u03b1 \u2208 RK+. Finally, word counts of the m-th document can be sampled as wm \u223c Mult(pm, Nm), where Nm \u2208 N is the number of words in document m. The above model is equivalent to the LDA when the assignments of individual words to topic labels are marginalized out.\nNguyen (2015) established posterior contraction rates of the topic simplex, provided that \u03b1k \u2264 1 \u2200k and either the number of topics K is known or the topics are suf\ufb01ciently separated in terms of the Euclidean distance. 
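The geometric generative process just described can be sketched in a few lines of NumPy; the dimensions and Dirichlet hyperparameters below are arbitrary illustrative choices, not values prescribed by the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, M, Nm = 1000, 5, 200, 500   # vocabulary size, topics, documents, words per document
eta, alpha = 0.1, 0.5             # symmetric Dirichlet hyperparameters (illustrative)

beta = rng.dirichlet(np.full(V, eta), size=K)      # K topics: rows on the vocabulary simplex
theta = rng.dirichlet(np.full(K, alpha), size=M)   # barycentric coordinates theta_m ~ Dir_K(alpha)
p = theta @ beta                                   # p_m = sum_k theta_mk * beta_k, inside Conv(beta_1..beta_K)
w = np.vstack([rng.multinomial(Nm, pm / pm.sum()) for pm in p])  # word counts w_m ~ Mult(p_m, Nm)
w_bar = w / Nm                                     # normalized documents: noisy points near the simplex
```

Each row of `w_bar` is a normalized document w̄m = wm/Nm, i.e., a noisy estimate of a point of the topic simplex; these are exactly the observations the geometric estimators work with.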
Yurochkin & Nguyen (2016) devised an estimate for B, taken to be a \ufb01xed unknown quantity, by formulating a geometric objective function, which is minimized when topic simplex B is close to the normalized documents \u00afwm := wm/Nm. They showed that the estimation of topic proportions \u03b8m given B simply reduces to taking barycentric coordinates of the projection of \u00afwm onto B. To estimate B given K, they proposed the Geometric Dirichlet Means (GDM) algorithm, which operates by performing a k-means clustering on the normalized documents, followed by a geometric correction for the cluster centroids. The resulting algorithm is remarkably fast and accurate, supporting the potential of the geometric approach. The GDM is not applicable when K is unknown, but it provides the motivation on which our approach is built.\n\nThe Conic Scan-and-Cover approach To enable the inference of B when K is not known, we need to investigate the concentration of mass inside the topic simplex. It suf\ufb01ces to focus on two types of geometric objects: cones and spheres, which provide the basis for a complete coverage of the simplex. To gain intuition for our procedure, which we call the Conic Scan-and-Cover (CoSAC) approach, imagine someone standing at a center point of a triangular dark room trying to \ufb01gure out all of its corners with a portable \ufb02ashlight, which can produce a cone of light. A room corner can be identi\ufb01ed with the direction of the farthest visible data objects. Once a corner is found, one can turn the \ufb02ashlight to another direction to scan for the next ones. See Fig. 1a, where red denotes the scanned area. To make sure that all corners are detected, the cones of light have to be open to an appropriate range of angles so that enough data objects can be captured and removed from the room. 
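As a toy illustration of this scan-and-remove intuition (this is only the intuition, not the full CoSAC procedure formalized later; the thresholds omega = 0.6 and R = 0.5 are arbitrary here):

```python
import numpy as np

def conic_scan(points, omega=0.6, R=0.5):
    """Toy conic scan on centered points: repeatedly take the farthest
    remaining point as a corner direction, discard everything inside its
    cone, and stop once every remaining point lies in the ball of radius R."""
    active = points.copy()
    directions = []
    while len(active) and np.linalg.norm(active, axis=1).max() > R:
        norms = np.linalg.norm(active, axis=1)
        v = active[norms.argmax()] / norms.max()       # farthest point -> corner direction
        directions.append(v)
        cos = (active @ v) / np.maximum(norms, 1e-12)  # cosine to the chosen direction
        active = active[1.0 - cos >= omega]            # keep only points outside the cone
    return directions

# three "corners" at 120-degree spacing plus low-norm interior points
angles = np.pi / 2 + np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
corners = np.c_[np.cos(angles), np.sin(angles)]
points = np.vstack([corners, 0.2 * corners])
print(len(conic_scan(points)))  # prints 3: one cone per corner
```

Each pass of the loop is one turn of the flashlight: the farthest visible point fixes the cone's axis, and everything illuminated by that cone leaves the room.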
To make sure no false corners are declared, we also need a suitable stopping criterion, relying only on data points that lie beyond a certain spherical radius, see Fig. 1b. Hence, we need to be able to gauge the concentration of mass for suitable cones and spherical balls in \u2206V\u22121. This is the subject of the next section.\n\n3 Geometric estimation of the topic simplex\n\nWe start by representing B in terms of its convex and angular geometry. First, B is centered at a point denoted by Cp. The centered probability simplex is denoted by \u2206V\u22121_0 := {x \u2208 RV | x + Cp \u2208 \u2206V\u22121}.\n\nFigure 1: Complete coverage of topic simplex by cones and a spherical ball for K = 3, V = 3. (a) An incomplete coverage using 3 cones (containing red points). (b) Complete coverage using 3 cones (red) and a ball (yellow). (c) Cap \u039bc(v1) and cone S\u03c9(v1).\n\nThen, write bk := \u03b2k \u2212 Cp \u2208 \u2206V\u22121_0 for k = 1, . . . , K and \u02dcpm := pm \u2212 Cp \u2208 \u2206V\u22121_0 for m = 1, . . . , M. Note that re-centering leaves the corresponding barycentric coordinates \u03b8m \u2208 \u2206K\u22121 unchanged. Moreover, the extreme points of the centered topic simplex \u02dcB := Conv{b1, . . . , bK} can now be represented by their directions vk \u2208 RV and corresponding radii Rk \u2208 R+ such that bk = Rkvk for any k = 1, . . . , K.\n\n3.1 Coverage of the topic simplex\n\nThe \ufb01rst step toward formulating a CoSAC approach is to show how \u02dcB can be covered with exactly K cones and one spherical ball positioned at Cp. A cone is de\ufb01ned as the set S\u03c9(v) := {p \u2208 \u2206V\u22121_0 | dcos(v, p) < \u03c9}, where we employ the angular distance (a.k.a. 
cosine distance) dcos(v, p) := 1 \u2212 cos(v, p), where cos(v, p) is the cosine of the angle \u2220(v, p) formed by vectors v and p.\n\nThe Conical coverage It is possible to choose \u03c9 so that the topic simplex can be covered with exactly K cones, that is, \u222a_{k=1}^{K} S\u03c9(vk) \u2287 \u02dcB. Moreover, each cone contains exactly one vertex. Suppose that Cp is the incenter of the topic simplex \u02dcB, with r being the inradius. The incenter and inradius correspond to the maximum volume sphere contained in \u02dcB. Let ai,k denote the distance between the i-th and k-th vertex of \u02dcB, with amin \u2264 ai,k \u2264 amax for all i, k, and Rmax, Rmin such that Rmin \u2264 Rk := ||bk||2 \u2264 Rmax \u2200 k = 1, . . . , K. Then we can establish the following.\nProposition 1. For simplex \u02dcB and \u03c9 \u2208 (\u03c91, \u03c92), where \u03c91 = 1 \u2212 r/Rmax and \u03c92 = max{a2min/(2R2max), max_{i,k=1,...,K}(1 \u2212 cos(bi, bk))}, the cone S\u03c9(v) around any vertex direction v of \u02dcB contains exactly one vertex. Moreover, complete coverage holds: \u222a_{k=1}^{K} S\u03c9(vk) \u2287 \u02dcB.\nWe say there is an angular separation if cos(bi, bk) \u2264 0 for any i, k = 1, . . . , K (i.e., the angles for all pairs are at least \u03c0/2); then \u03c9 \u2208 (1 \u2212 r/Rmax, 1] \u2260 \u2205. Thus, under angular separation, the range of \u03c9 that allows for full coverage is nonempty independently of K. Our result is in agreement with that of Nguyen (2015), whose result suggested that topic simplex B can be consistently estimated without knowing K, provided there is a minimum edge length amin > 0. The notion of angular separation leads naturally to the Conic Scan-and-Cover algorithm. Before getting there, we show a series of results allowing us to further extend the range of admissible \u03c9.\nThe inclusion of a spherical ball centered at Cp allows us to expand substantially the range of \u03c9 for which conical coverage continues to hold. 
In particular, we can reduce the lower bound on \u03c9 in Proposition 1, since we only need to cover the regions near the vertices of \u02dcB with cones, using the following proposition. Fig. 1b provides an illustration.\nProposition 2. Let B(Cp, R) = {\u02dcp \u2208 RV | ||\u02dcp \u2212 Cp||2 \u2264 R}, R > 0; let \u03c91, \u03c92 be given in Prop. 1, and\n\u03c93 := 1 \u2212 min{min_{i,k} (Rk sin2(bi, bk)/R + cos(bi, bk) \u221a(1 \u2212 R2k sin2(bi, bk)/R2)), 1}, (1)\nthen we have \u222a_{k=1}^{K} S\u03c9(vk) \u222a B(Cp, R) \u2287 \u02dcB whenever \u03c9 \u2208 (min{\u03c91, \u03c93}, \u03c92).\nNotice that as R \u2192 Rmax, the value of \u03c93 \u2192 0. Hence if R \u2264 Rmin \u2248 Rmax, the admissible range for \u03c9 in Prop. 2 results in a substantial strengthening over Prop. 1. It is worth noting that the above two geometric propositions do not require any distributional properties inside the simplex.\n\nCoverage leftovers In practice complete coverage may fail if \u03c9 and R are chosen outside of the corresponding ranges suggested by the previous two propositions. In that case, it is useful to note that the leftover regions will have a very low mass. Next we quantify the mass inside a cone that does contain a vertex, which allows us to reject a cone that has low mass and therefore does not contain a vertex.\nProposition 3. 
The cone S\u03c9(v1) whose axis is a topic direction v1 has mass\nP(S\u03c9(v1)) > P(\u039bc(b1)) = (\u222b_{1\u2212c}^{1} \u03b81^{\u03b11\u22121}(1 \u2212 \u03b81)^{\u2211_{i\u22601}\u03b1i\u22121} d\u03b81) / (\u222b_{0}^{1} \u03b81^{\u03b11\u22121}(1 \u2212 \u03b81)^{\u2211_{i\u22601}\u03b1i\u22121} d\u03b81) = (c^{\u2211_{i\u22601}\u03b1i}(1 \u2212 c)^{\u03b11} \u0393(\u2211_{i=1}^{K}\u03b1i)) / ((\u2211_{i\u22601}\u03b1i) \u0393(\u03b11) \u0393(\u2211_{i\u22601}\u03b1i)) \u00d7 [1 + c(\u2211_{i=1}^{K}\u03b1i)/(\u2211_{i\u22601}\u03b1i + 1) + c2(\u2211_{i=1}^{K}\u03b1i)(\u2211_{i=1}^{K}\u03b1i + 1)/((\u2211_{i\u22601}\u03b1i + 1)(\u2211_{i\u22601}\u03b1i + 2)) + \u00b7\u00b7\u00b7], (2)\nwhere \u039bc(b1) is the simplicial cap of S\u03c9(v1), which is composed of vertex b1 and a base parallel to the corresponding base of \u02dcB and cutting adjacent edges of \u02dcB in the ratio c : (1 \u2212 c).\nSee Fig. 1c for an illustration of the simplicial cap described in the proposition. Given the lower bound for the mass around a cone containing a vertex, we have arrived at the following guarantee.\nProposition 4. For \u03bb \u2208 (0, 1), let c\u03bb be such that \u03bb = min_k P(\u039bc\u03bb(bk)) and let \u03c9\u03bb be such that\nc\u03bb = 2(1 \u2212 \u221a(1 \u2212 r2/R2max))(sin(d) cot(arccos(1 \u2212 \u03c9\u03bb)) + cos(d))^{\u22121}, (3)\nwhere angle d \u2264 min_{i,k} \u2220(bk, bk \u2212 bi). Then, as long as\n\u03c9 \u2208 (\u03c9\u03bb, max{a2min/(2R2max), max_{i,k=1,...,K}(1 \u2212 cos(bi, bk))}), (4)\nthe bound P(S\u03c9(vk)) \u2265 \u03bb holds for all k = 1, . . . , K.\n\n3.2 CoSAC: Conic Scan-and-Cover algorithm\n\nHaving laid out the geometric foundations, we are ready to present the Conic Scan-and-Cover (CoSAC) algorithm, which is a scanning procedure for detecting the presence of simplicial vertices based on data drawn randomly from the simplex. The idea is simple: iteratively pick the farthest point from the center estimate \u02c6Cp := (1/M)\u2211m pm, say v, then construct a cone S\u03c9(v) for some suitably chosen \u03c9, and remove all the data residing in this cone. Repeat until there is no data point left. Speci\ufb01cally, let A = {1, . . . , M} be the index set of the initially unseen data, then set v := argmax_{\u02dcpm:m\u2208A} ||\u02dcpm||2 and update A := A \\ S\u03c9(v). The parameter \u03c9 needs to be suf\ufb01ciently large to ensure that the farthest point is a good estimate of a true vertex, and that the scan will be completed in exactly K iterations; \u03c9 needs to be not too large, so that S\u03c9(v) does not contain more than one vertex. The existence of such \u03c9 is guaranteed by Prop. 1. In particular, for an equilateral \u02dcB, the condition of Prop. 1 is satis\ufb01ed as long as \u03c9 \u2208 (1 \u2212 1/\u221a(K\u22121), 1 + 1/(K\u22121)).\nIn our setting, K is unknown. A smaller \u03c9 would be a more robust choice, and accordingly the set A will likely remain non-empty after K iterations. See the illustration of Fig. 1a, where the blue regions correspond to A after K = 3 iterations of the scan. As a result, we proceed by adopting a stopping criterion based on Prop. 2: the procedure is stopped as soon as \u2200m \u2208 A: ||\u02dcpm||2 < R, which allows us to complete the scan in K iterations (as in Fig. 1b for K = 3).\nThe CoSAC algorithm is formally presented by Algorithm 1. Its running is illustrated in Fig. 
2, where we show iterations 1, 26, 29, 30 of the algorithm by plotting norms of the centered documents in the active set A and cone S\u03c9(v) against cosine distance to the chosen direction of a topic. Iteration 30 (right) satis\ufb01es the stopping criterion and therefore CoSAC recovered the correct K = 30. Note that this type of visual representation can be useful in practice to verify choices of \u03c9 and R. The following theorem establishes the consistency of the CoSAC procedure.\nTheorem 1. Suppose {\u03b21, . . . , \u03b2K} are the true topics, incenter Cp is given, \u03b8m \u223c DirK(\u03b1) and pm := \u2211k \u03b2k\u03b8mk for m = 1, . . . , M, with \u03b1 \u2208 RK+. Let \u02c6K be the estimated number of topics and { \u02c6\u03b21, . . . , \u02c6\u03b2 \u02c6K} be the output of Algorithm 1 trained with \u03c9 and R as in Prop. 2. Then \u2200\u03b5 > 0,\nP({min_{j\u2208{1,..., \u02c6K}} ||\u03b2i \u2212 \u02c6\u03b2j|| > \u03b5, for any i \u2208 {1, . . . , \u02c6K}} \u222a {K \u2260 \u02c6K}) \u2192 0 as M \u2192 \u221e.\n\nRemark We found the choice \u03c9 = 0.6 and R set to the median of {||\u02dcp1||2, . . . , ||\u02dcpM||2} to be robust in practice and agreeing with our theoretical results. From Prop. 3 it follows that choosing R as the median length is equivalent to choosing \u03c9 resulting in an edge cut ratio c such that 1 \u2212 (K/(K\u22121))(c/(1\u2212c))^{1\u22121/K} \u2265 1/2, then c \u2264 ((K\u22121)/(2K))^{K/(K\u22121)}, which, for any equilateral topic simplex B, is satis\ufb01ed by setting \u03c9 \u2208 (0.3, 1), provided that K \u2264 2000 based on Eq. (3).\n\n4 Document Conic Scan-and-Cover algorithm\n\nIn the topic modeling problem, pm for m = 1, . . . , M are not given. Instead, under the bag-of-words assumption, we are given the frequencies of words in documents w1, . . . , wM which provide a point estimate \u00afwm := wm/Nm for the pm. Clearly, if the number of documents M \u2192 \u221e and the length of documents Nm \u2192 \u221e \u2200m, we can use Algorithm 1 with the plug-in estimates \u00afwm in place of pm, since \u00afwm \u2192 pm. Moreover, Cp will be estimated by \u02c6Cp := (1/M)\u2211 \u00afwm. In practice, M and Nm are \ufb01nite, and some Nm may take relatively small values. Taking the topic direction to be the farthest point in the topic simplex, i.e., v = argmax_{\u02dcwm:m\u2208A} || \u02dcwm||2, where \u02dcwm := \u00afwm \u2212 \u02c6Cp \u2208 \u2206V\u22121_0, may no longer yield a robust estimate, because the variance of this topic direction estimator can be quite high (in the Supplement we show that it is upper bounded by (1 \u2212 1/V )/Nm).\nTo obtain improved estimates, we propose a technique that we call \u201cmean-shifting\u201d. Instead of taking the farthest point in the simplex, this technique is designed to shift the estimate of a topic to a high density region, where true topics are likely to be found. Precisely, given a (current) cone S\u03c9(v), we re-position the cone by updating v := argmin_v \u2211_{m\u2208S\u03c9(v)} || \u02dcwm||2(1 \u2212 cos( \u02dcwm, v)). In other words, we re-position the cone by centering it around the mean direction of the cone weighted by the norms of the data points inside, which is simply given by v \u221d \u2211_{m\u2208S\u03c9(v)} \u02dcwm / card(S\u03c9(v)). This results in reduced variance of the topic direction estimate, due to the averaging over data residing in the cone.\nThe mean-shifting technique may be slightly modi\ufb01ed and taken as a local update for a subsequent optimization which cycles through the entire set of documents and iteratively updates the cones. 
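A minimal sketch of this mean-shifting update (assuming the centered documents w̃m are the rows of a NumPy array `w_tilde`; variable names and the iteration cap are ours):

```python
import numpy as np

def mean_shift_direction(w_tilde, v, omega=0.6, n_iter=50):
    """Re-position a cone: replace the current direction v by the mean of
    the centered documents inside S_omega(v), iterating until the cone
    membership (and hence the direction) stabilizes."""
    v = v / np.linalg.norm(v)
    for _ in range(n_iter):
        norms = np.linalg.norm(w_tilde, axis=1)
        cos = (w_tilde @ v) / np.maximum(norms, 1e-12)
        in_cone = (1.0 - cos) < omega          # members of the cone S_omega(v)
        if not in_cone.any():
            break
        v_new = w_tilde[in_cone].mean(axis=0)  # v proportional to sum of members / card
        v_new /= np.linalg.norm(v_new)
        if np.allclose(v_new, v):
            return v_new
        v = v_new
    return v
```

Starting from the farthest-point direction, the update averages all documents inside the cone, so the variance of the direction estimate shrinks with the cone's cardinality, which is the point of the technique.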
The optimization is with respect to the following weighted spherical k-means objective:\nmin_{||vk||2=1, k=1,...,K} \u2211_{k=1}^{K} \u2211_{m\u2208Sk(vk)} || \u02dcwm||2(1 \u2212 cos(vk, \u02dcwm)), (5)\nwhere cones Sk(vk) = {m | dcos(vk, \u02dcpm) < dcos(vl, \u02dcpm) \u2200l \u2260 k} yield a disjoint data partition \u2294_{k=1}^{K} Sk(vk) = {1, . . . , M} (this is different from S\u03c9(vk)). The rationale of the spherical k-means optimization is to use the full data for estimation of topic directions, hence further reducing the variance due to short documents. The connection between objective function (5) and topic simplex estimation is given in the Supplement. Finally, obtain topic norms Rk along the directions vk using maximum projection: Rk := max_{m:m\u2208Sk(vk)} \u27e8vk, \u02dcwm\u27e9. Our entire procedure is summarized in Algorithm 2.\n\nRemark In Step 9 of the algorithm, a cone S\u03c9(v) with a very low cardinality, i.e., card(S\u03c9(v)) < \u03bbM for some small constant \u03bb, is discarded because this is likely an outlier region that does not actually contain a true vertex. The choice of \u03bb is governed by the results of Prop. 4. For small \u03b1k = 1/K \u2200k, \u03bb \u2264 P(\u039bc) \u2248 c^{(K\u22121)/K} K2/((K\u22121)(1\u2212c)), and for an equilateral \u02dcB we can choose d such that cos(d) = \u221a((K+1)/(2K)). Plugging these values into Eq. (3) leads to c = 2(1 \u2212 \u221a(1 \u2212 1/K2))(\u221a((K\u22121)/(2K))(1\u2212\u03c9)/\u221a(1\u2212(1\u2212\u03c9)2) + \u221a((K+1)/(2K)))^{\u22121}. Now, plugging in \u03c9 = 0.6 we obtain \u03bb \u2264 K^{\u22121} for large K. Our approximations were based on large K to get a sense of \u03bb; we now make a conservative choice \u03bb = 0.001, so that K^{\u22121} > \u03bb \u2200K < 1000. As a result, a topic is rejected if the corresponding cone contains less than 0.1% of the data.\n\nFinding anchor words using Conic Scan-and-Cover Another approach to reduce the noise is to consider the problem from a different viewpoint, where Algorithm 1 will prove itself useful. RecoverKL by Arora et al. (2012) can identify topics with diminishing errors (in number of documents M), provided that topics contain anchor words. The problem of \ufb01nding anchor words geometrically reduces to identifying rows of the word-to-word co-occurrence matrix that form a simplex containing other rows of the same matrix (cf. Arora et al. (2012) for details). An advantage of this approach is that noise in the word-to-word co-occurrence matrix goes to zero as M \u2192 \u221e no matter the document lengths, hence we can use Algorithm 1 with \"documents\" being rows of the word-to-word co-occurrence matrix to learn anchor words nonparametrically and then run RecoverKL to obtain topic estimates. We will call this procedure cscRecoverKL.\n\nAlgorithm 1 Conic Scan-and-Cover (CoSAC)\nInput: document generating distributions p1, . . . , pM, angle threshold \u03c9, norm threshold R\nOutput: topics \u03b21, . . . , \u03b2k\n1: \u02c6Cp = (1/M)\u2211m pm {\ufb01nd center}\n2: \u02dcpm := pm \u2212 \u02c6Cp for m = 1, . . . , M {center the data}\n3: A1 = {1, . . . , M} {initialize active set}; k = 1 {initialize topic count}\n4: while \u2203m \u2208 Ak : ||\u02dcpm||2 > R do\n5: vk = argmax_{\u02dcpm:m\u2208Ak} ||\u02dcpm||2 {\ufb01nd topic}\n6: S\u03c9(vk) = {m : dcos(\u02dcpm, vk) < \u03c9} {\ufb01nd cone of near documents}\n7: Ak+1 = Ak \\ S\u03c9(vk) {update active set}; \u03b2k = vk + \u02c6Cp, k = k + 1 {compute topic}\n8: end while\n\nFigure 2: Iterations 1, 26, 29, 30 of Algorithm 1. Red are the documents in the cone S\u03c9(vk); blue are the documents in the active set Ak+1 for next iteration. 
Yellow are documents with ||\u02dcpm||2 < R.\n\n5 Experimental results\n\n5.1 Simulation experiments\n\nIn the simulation studies we shall compare CoSAC (Algorithm 2) and cscRecoverKL based on Algorithm 1, both of which don\u2019t have access to the true K, versus popular parametric topic modeling approaches (trained with the true K): Stochastic Variational Inference (SVI), Collapsed Gibbs sampler, RecoverKL and GDM (more details in the Supplement). The comparisons are done on the basis of minimum-matching Euclidean distance, which quanti\ufb01es distance between topic simplices (Tang et al., 2014), and running times (a perplexity score comparison is given in the Supplement). Lastly we will demonstrate the ability of CoSAC to recover the correct number of topics for a varying K.\n\nAlgorithm 2 CoSAC for documents\nInput: normalized documents \u00afw1, . . . , \u00afwM, angle threshold \u03c9, norm threshold R, outlier threshold \u03bb\nOutput: topics \u03b21, . . . , \u03b2k\n1: \u02c6Cp = (1/M)\u2211m \u00afwm {\ufb01nd center}\n2: \u02dcwm := \u00afwm \u2212 \u02c6Cp for m = 1, . . . , M {center the data}\n3: A1 = {1, . . . , M} {initialize active set}; k = 1 {initialize topic count}\n4: while \u2203m \u2208 Ak : || \u02dcwm||2 > R do\n5: vk = argmax_{\u02dcwm:m\u2208Ak} || \u02dcwm||2 {initialize direction}\n6: while vk not converged do {mean-shifting}\n7: S\u03c9(vk) = {m : dcos( \u02dcwm, vk) < \u03c9} {\ufb01nd cone of near documents}\n8: vk = \u2211_{m\u2208S\u03c9(vk)} \u02dcwm / card(S\u03c9(vk)) {update direction}\nend while\n9: Ak+1 = Ak \\ S\u03c9(vk) {update active set}; if card(S\u03c9(vk)) > \u03bbM then k = k + 1 {record topic direction}\n10: end while\n11: v1, . . . , vk = weighted spherical k-means(v1, . . . , vk, \u02dcw1, . . . , \u02dcwM)\n12: for l in {1, . . . , k} do\n13: Rl := max_{m:m\u2208Sl(vl)} \u27e8vl, \u02dcwm\u27e9 {\ufb01nd topic length along direction vl}\n14: \u03b2l = Rl vl + \u02c6Cp {compute topic}\n15: end for\n\nFigure 3: Minimum matching Euclidean distance for (a) varying corpora size, (b) varying length of documents; (c) Running times for varying corpora size; (d) Estimation of number of topics.\n\nFigure 4: Gibbs sampler convergence analysis for (a) Minimum matching Euclidean distance for corpora sizes 1000 and 5000; (b) Perplexity for corpora sizes 1000 and 5000; (c) Perplexity for NYTimes data.\n\nEstimation of the LDA topics First we evaluate the ability of CoSAC and cscRecoverKL to estimate topics \u03b21, . . . , \u03b2K, \ufb01xing K = 15. Fig. 3(a) shows performance for the case of fewer (M \u2208 [100, 10000]) but longer (Nm = 500) documents (e.g. scienti\ufb01c articles, novels, legal documents). CoSAC demonstrates performance comparable in accuracy to the Gibbs sampler and GDM.\nNext we consider larger corpora (M = 30000) of shorter (Nm \u2208 [25, 300]) documents (e.g. news articles, social media posts). Fig. 3(b) shows that this scenario is harder and CoSAC matches the performance of the Gibbs sampler for Nm \u2265 75. 
Indeed, across both experiments CoSAC only made mistakes in terms of K for the case of Nm = 25, when it was underestimating on average by 4 topics, and for Nm = 50, when it was off by around 1, which explains the earlier observation. Experiments with varying V and \u03b1 are given in the Supplement.\nIt is worth noting that cscRecoverKL appears to be strictly better than its predecessor. This suggests that our procedure for selection of anchor words is more accurate in addition to being nonparametric.\n\nRunning time A notable advantage of the CoSAC algorithm is its speed. In Fig. 3(c) we see that Gibbs, SVI, GDM and CoSAC all have linear complexity growth in M, but the slopes are very different: approximately I\u00b7Nm for SVI and Gibbs (where I is the number of iterations, which has to be large enough for convergence), the number of k-means iterations to converge for GDM, and of order K for the CoSAC procedure, making it the fastest algorithm of all under consideration.\nNext we compare CoSAC to the per-iteration quality of the Gibbs sampler trained with 500 iterations for M = 1000 and M = 5000. Fig. 
4(b) shows that the Gibbs sampler, when the true K is given, can achieve a good perplexity score as fast as CoSAC and outperforms it as training continues, although Fig. 4(a) suggests that a much longer training time is needed for the Gibbs sampler to achieve good topic estimates and small estimation variance.\n\nEstimating the number of topics Model selection in the LDA context is quite a challenging task and, to the best of our knowledge, there is no \"go-to\" procedure. One of the possible approaches is based on re\ufb01tting the LDA with multiple choices of K and using the Bayes factor for model selection (Grif\ufb01ths & Steyvers, 2004). Another option is to adopt the Hierarchical Dirichlet Process (HDP) model, but we should understand that it is not a procedure to estimate K of the LDA model, but rather a particular prior on the number of topics that assumes K grows with the data. A more recent suggestion is to slightly modify the LDA and use Bayes moment matching (Hsu & Poupart, 2016), but, as can be seen from Figure 2 of their paper, the estimation variance is high and the method is not very accurate (we tried it with true K = 15; it took over 1 hour to \ufb01t and found 35 topics). Next we compare Bayes factor model selection versus CoSAC and cscRecoverKL for K \u2208 [5, 50]. Fig. 3(d) shows that CoSAC consistently recovers the exact number of topics over this wide range.\nWe also observe that cscRecoverKL does not estimate K well (it underestimates) in the higher range. This is expected because cscRecoverKL \ufb01nds the number of anchor words, not topics, and the former decreases as the latter increases. Attempting to \ufb01t RecoverKL with more topics than there are anchor words might lead to deteriorating performance, and our modi\ufb01cation can address this limitation of the RecoverKL method.\n\n5.2 Real data analysis\n\nIn this section we demonstrate the CoSAC algorithm for topic modeling on one of the standard bag-of-words datasets \u2014 NYTimes news articles. 
After preprocessing we obtained M ≈ 130,000 documents over V = 5320 words. The Bayes factor for LDA selected the smallest model among K ∈ [80, 195], while CoSAC selected 159 topics. We attribute the disagreement between the two procedures to misspecification of the LDA model on real data, which affects the Bayes factor, while CoSAC relies largely on the geometry of the topic simplex.
The results are summarized in Table 1: CoSAC found 159 topics in less than 20 min, while cscRecoverKL estimated the number of anchor words in the data to be 27, leading to fewer topics. Fig. 4(c) compares the CoSAC perplexity score to the per-iteration test perplexity of the LDA (1000 iterations) and HDP (100 iterations) Gibbs samplers. Text files with the top 20 words of all topics are included in the Supplementary material. We note that the CoSAC procedure recovered meaningful topics, contextually similar to those of LDA and HDP (e.g., elections, terrorist attacks, the Enron scandal), and also recovered more specific topics about Mike Tyson, boxing and the case of Timothy McVeigh, which were present among the HDP topics but not the LDA ones. We conclude that CoSAC is a practical procedure for topic modeling on large-scale corpora, able to find meaningful topics in a short amount of time.

6 Discussion

We have analyzed the problem of estimating the topic simplex without assuming the number of vertices (i.e., topics) to be known. We showed that it is possible to cover the topic simplex using two types of geometric shapes, cones and a sphere, leading to a class of Conic Scan-and-Cover algorithms.
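The cone-covering idea can be illustrated concretely. The numpy sketch below greedily takes the direction of the point farthest from the data center and discards everything inside the resulting cone; the cosine threshold and stopping rule are simplified stand-ins, not the full CoSAC algorithm with its geometric corrections:

```python
import numpy as np

def conic_scan(points, cos_thresh=0.6):
    """Greedy cone covering: repeatedly take the direction of the point
    farthest from the center and discard every point inside that cone.
    Simplified sketch; threshold and stopping rule are illustrative."""
    center = points.mean(axis=0)
    pts = points - center                    # work in the centered space
    estimates = []
    while len(pts) > 0:
        norms = np.linalg.norm(pts, axis=1)
        far = np.argmax(norms)               # farthest remaining point
        axis = pts[far] / norms[far]         # cone axis
        estimates.append(center + pts[far])  # farthest point as vertex estimate
        # cosine of the angle between each point and the cone axis
        cosines = pts @ axis / np.maximum(norms, 1e-12)
        pts = pts[cosines < cos_thresh]      # keep only points the cone misses
    return np.array(estimates)

# Three tight clusters near the vertices of a simplex in R^3.
rng = np.random.default_rng(0)
vertices = np.eye(3)
data = np.vstack([v + 0.01 * rng.normal(size=(50, 3)) for v in vertices])
topics = conic_scan(data)
print(len(topics))  # one cone per well-separated cluster
```

On well-separated clusters each cone captures exactly one cluster, so the number of cones found serves as an estimate of the number of vertices, i.e., topics.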
We then proposed several geometric correction techniques to account for noisy data. Our procedure is accurate in recovering the true number of topics while remaining practical due to its computational speed. We think that the angular geometric approach might allow for fast and elegant solutions to other clustering problems, although as of now it does not immediately offer a unifying problem-solving framework like MCMC or variational inference. An interesting direction in a geometric framework is building models based on geometric quantities such as distances and angles.

Table 1: Modeling topics of NYTimes articles

Method         K          Perplexity    Coherence     Time
cscRecoverKL   27         2603          -238          37 min
HDP Gibbs      221 ± 5    1477 ± 1.6    -442 ± 1.7    35 hours
LDA Gibbs      80         1520 ± 1.5    -300 ± 0.7    5.3 hours
CoSAC          159        1568          -322          19 min

Acknowledgments

This research is supported in part by grants NSF CAREER DMS-1351362, NSF CNS-1409303, a research gift from Adobe Research and a Margaret and Herman Sokol Faculty Award.

References

Anandkumar, A., Foster, D. P., Hsu, D., Kakade, S. M., and Liu, Y. A spectral algorithm for Latent Dirichlet Allocation. NIPS, 2012.

Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., and Zhu, M. A practical algorithm for topic modeling with provable guarantees. arXiv preprint arXiv:1212.4777, 2012.

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet Allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391, September 1990.

Griffiths, T. L. and Steyvers, M. Finding scientific topics. PNAS, 101(suppl. 1):5228–5235, 2004.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. J.
Mach. Learn. Res., 14(1):1303–1347, May 2013.

Hsu, W.-S. and Poupart, P. Online Bayesian moment matching for topic modeling with unknown number of topics. In Advances in Neural Information Processing Systems, pp. 4529–4537, 2016.

Nguyen, X. Posterior contraction of the population polytope in finite admixture models. Bernoulli, 21(1):618–646, 2015.

Pritchard, J. K., Stephens, M., and Donnelly, P. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959, 2000.

Tang, J., Meng, Z., Nguyen, X., Mei, Q., and Zhang, M. Understanding the limiting factors of topic modeling via posterior contraction analysis. In Proceedings of the 31st International Conference on Machine Learning, pp. 190–198. ACM, 2014.

Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.

Xu, W., Liu, X., and Gong, Y. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '03, pp. 267–273. ACM, 2003.

Yurochkin, M. and Nguyen, X. Geometric Dirichlet means algorithm for topic inference. In Advances in Neural Information Processing Systems, pp. 2505–2513, 2016.