{"title": "Hierarchical Topic Modeling for Analysis of Time-Evolving Personal Choices", "book": "Advances in Neural Information Processing Systems", "page_first": 1395, "page_last": 1403, "abstract": "The nested Chinese restaurant process is extended to design a nonparametric  topic-model tree for representation of human choices. Each tree branch corresponds  to a type of person, and each node (topic) has a corresponding probability  vector over items that may be selected. The observed data are assumed to have  associated temporal covariates (corresponding to the time at which choices are  made), and we wish to impose that with increasing time it is more probable that  topics deeper in the tree are utilized. This structure is imposed by developing  a new \u201cchange point\" stick-breaking model that is coupled with a Poisson and  product-of-gammas construction. To share topics across the tree nodes, topic distributions  are drawn from a Dirichlet process. As a demonstration of this concept,  we analyze real data on course selections of undergraduate students at Duke University,  with the goal of uncovering and concisely representing structure in the  curriculum and in the characteristics of the student body.", "full_text": "Hierarchical Topic Modeling for Analysis of\n\nTime-Evolving Personal Choices\n\nXianXing Zhang\nDuke University\n\nxianxing.zhang@duke.edu\n\nDavid B. Dunson\nDuke University\n\ndunson@stat.duke.edu\n\nLawrence Carin\nDuke University\n\nlcarin@ee.duke.edu\n\nAbstract\n\nThe nested Chinese restaurant process is extended to design a nonparametric\ntopic-model tree for representation of human choices. Each tree path corresponds\nto a type of person, and each node (topic) has a corresponding probability vector\nover items that may be selected. 
The observed data are assumed to have associ-\nated temporal covariates (corresponding to the time at which choices are made),\nand we wish to impose that with increasing time it is more probable that topics\ndeeper in the tree are utilized. This structure is imposed by developing a new\n\u201cchange point\" stick-breaking model that is coupled with a Poisson and product-\nof-gammas construction. To share topics across the tree nodes, topic distributions\nare drawn from a Dirichlet process. As a demonstration of this concept, we ana-\nlyze real data on course selections of undergraduate students at Duke University,\nwith the goal of uncovering and concisely representing structure in the curriculum\nand in the characteristics of the student body.\n\n1\n\nIntroduction\n\nAs time progresses, the choices humans make often change. For example, the types of products\none purchases, as well as the types of people one befriends, often change or evolve with time.\nHowever, the choices one makes later in life are typically statistically related to choices made earlier.\nSuch behavior is of interest when considering marketing to particular groups of people, at different\nstages of their lives, and it is also relevant for analysis of time-evolving social networks. In this\npaper we seek to develop a hierarchical tree structure for representation of this phenomena, with\neach tree path characteristic of a type of person, and a tree node de\ufb01ned by a distribution over\nchoices (characterizing a type of person at a \u201cstage of life\u201d). As time proceeds, each person moves\nalong layers of a tree branch, making choices based on the node at a given layer, thereby yielding\na hierarchical representation of behavior with time. 
Note that as one moves deeper in the tree, the\nnumber of nodes at a given tree layer increases as a result of sequential branching; this appears to be\nwell matched to the modeling of choices made by individuals, who often become more distinctive\nand specialized with increasing time. The number of paths (types of people), nodes (stages of\ndevelopment) and the statistics of the time dependence are to be inferred nonparametrically based\non observed data, which are typically characterized by a very sparse binary matrix (most individuals\nonly select a tiny fraction of the choices that are available to them).\nTo demonstrate this concept using real data, we consider selections of classes made by undergraduate\nstudents at Duke University, with the goal of uncovering structure in the students and classes, as\ninferred by time-evolving student choices. For each student, the data presented to the model are a\nset of indices of selected classes (but not class names or subject matter), as well as the academic\nyear in which each class was selected (e.g., sophomore year). While the student majors and class\nnames are not used by the model, they are known, and this information provides \u201ctruth\u201d with which\nmodel-inferred structure may be assessed. This study therefore also provides the opportunity to\nexamine the quality of the inferred hierarchical-tree structure in models of the type considered in\n[1, 4, 5, 8, 7, 12, 13, 6, 21, 15, 22] (such structure is difficult to validate with documents, for which\nthere is no \u201ctruth\u201d). We seek to impose that as time progresses it is more probable that an individual\u2019s\nchoices are based on nodes deeper in the tree, so that as one moves from the tree root to the leaves,\nwe observe the evolution of choices as people age. 
Such temporal tree-structure could be meaningful\nby itself, e.g., in our particular case it will allow university administrators and faculty to examine\nif objectives in curriculum design are manifested in the actual usage/choices of students. Further,\nthe results of such an analysis may be of interest to applicants at a given school (e.g., high school\nstudents), as the inferred structure concisely describes both the student body and the curriculum.\nAlso the uncovered structure may be used to aid downstream applications [17, 2, 11].\nThe basic form of the nonparametric tree developed here is based on the nested Chinese restaurant\nprocess (nCRP) topic model [4, 5]. However, to achieve the goals of the unique problem considered,\nwe make the following new modeling contributions. We develop a new \u201cchange-point\u201d\nstick-breaking process (cpSBP), which is a stick-breaking process that induces probabilities that\nstochastically increase to an unknown changepoint and then decrease. This construction is conceptually\nrelated to the \u201cumbrella\u201d placed on dose response curves [9]. In the proposed model each\nindividual has a unique cpSBP that evolves with time such that choices at later times are encouraged\nto be associated with deeper nodes in the tree. Time is a covariate, and within the change-point\nmodel a new product-of-gammas construction is developed, and coupled to the Poisson distribution.\nAnother novel aspect of the proposed model concerns drawing the node-dependent topics from a\nDirichlet process, sharing topics across the tree. This is motivated by the idea that different types\nof people (paths) may be characterized by similar choices at different nodes in the respective paths\n(e.g., person Type A may make certain types of choices early in life, while person Type B may make\nsimilar choices later in life). 
Such sharing of topics allows the inference of relationships between\nchoices different people make over time.\n\n2 Model Formulation\n\n2.1 Nested Chinese Restaurant Process\n\nThe nested Chinese restaurant process (nCRP) [4, 5] is a generative probabilistic model that de\ufb01nes\na prior distribution over a tree-structured hierarchy with in\ufb01nitely many paths. In an nCRP model of\npersonal choice each individual picks a tree path by walking from the root node down the tree, from\nnode to node. Speci\ufb01cally, when situated at a particular parent node, the child node ci individual\ni chooses is modeled as a random variable that can be either an existing node or a new node: (i)\nthe probability that ci is an existing child node k is proportional to the number of persons who\nalready chose node k from the same parent, (ii) and a new node can be created and chosen with\nprobability proportional to \u03b30 > 0, which is the nCRP concentration parameter. This process is\nde\ufb01ned recursively such that each individual is allocated to one speci\ufb01c path of the tree hierarchy,\nthrough a sequence of probabilistic parent-child steps.\nThe tree hierarchy implied by the nCRP provides a natural framework to capture the structure of\npersonal choices, where each node is characterized by a distribution on the items that may be se-\nlected (e.g., each node is a \u201cchoice topic\"). Similar constructions have been considered for document\nanalysis [5, 21, 4], in which the model captures the structure of word usage in documents. How-\never, there are unique aspects of the time-evolving personal-choice problem, particularly the goal\nmotivated above that one should select topics deeper in the tree as time evolves, to capture the\nspecialized characteristics of people as they age. 
Hierarchical topic models proposed previously\n[4, 7] have employed a stick-breaking process (SBP) to guide selection of the tree depth at which\na node/topic is selected, with an unbounded number of path layers, but these models do not provide\na means of imposing the above temporal dynamics (which were not relevant for the document\nproblems considered there).\n\n2.2 Change Point Stick Breaking Process\n\nFigure 1: Illustrative comparison of the stick lengths between the change point stick breaking process\n(cpSBP) and the stick breaking process (SBP) with different values of \u03b1; typical draws from cpSBP and\nSBP are depicted. a\u03c9 and b\u03c9 are both set to 1, the change point is set to p = 30 and the truncation\nlevel of both stick breaking constructions is set to 60.\n\nIn the proposed model, instead of constraining the SBP construction to start at the root node, we\nmodel the starting-point depth of the SBP as a random variable and infer it from data, while still\nmaintaining a valid distribution over each layer of any path. To do this we replace the single SBP\nover path layers with two statistically dependent SBPs: one begins from layer p + 1 and moves down\nthe tree away from the root, and the other begins from layer p and moves upward to the root; the\nlatter SBP is truncated when it hits the root, while the former is in principle of infinite length. The\ntree depth p which relates these two SBPs is modeled as a random variable, drawn from a Poisson\ndistribution, and is denoted the change point. In this way we encourage the principal stick weight to\nbe placed heavily around the change point p, instead of restricting it to the top layers as in [4, 7]. 
To\nmodel the time dependence, and encourage use of greater tree depths with increasing time, we seek\na formulation that imposes that the Poisson parameter grows (statistically) with increasing time.\nThe temporal information is represented as covariate t(i, n), denoting the time at which the nth\nselection/choice is made by individual i; in many applications t(i, n) \u2208 {1, . . . , T}, and for the\nstudent class-selection problem T = 4, corresponding to the freshman through senior years; below\nwe drop the indices (i, n) on the time, for notational simplicity. When individual i makes selections\nat time t, she employs a corresponding change point pi,t. To integrate the temporal covariate into the\nmodel, we develop a product-of-gammas and Poisson conjugate pair to model pi,t which encourages\npi,t associated with larger t to locate deeper in the tree. Specifically, consider\n\n\\[ \\gamma_{i,l} \\sim \\mathrm{Ga}(\\gamma_{i,l} \\mid a_{i,l}, b_{i,l}), \\qquad \\lambda_{i,t} = \\prod_{l=1}^{t} \\gamma_{i,l}, \\qquad p_{i,t} \\sim \\mathrm{Poi}(p_{i,t} \\mid \\lambda_{i,t}) \\tag{1} \\]\n\nThe product-of-gammas construction in (1) is inspired by the multiplicative-gamma process (MGP)\ndeveloped in [3] for sparse factor analysis. Although each draw of \u03b3i,l from a gamma distribution is\nnot guaranteed to be greater than one, and thus \u03bbi,t will not increase with probability one, in practice\nwe find \u03b3i,l is often inferred to be greater than one when (ai,l \u2212 1)/bi,l > 1. However, an MGP\nbased on a left-truncated gamma distribution can be readily derived.\nGiven the change point pi,t = p, the cpSBP constructs the stick-weight vector $\\theta^p_{i,t}$ over layers of\npath bi by dividing it into two parts, $\\hat{\\theta}^p_{i,t}$ and $\\tilde{\\theta}^p_{i,t}$, modeling them separately as two SBPs, where\n$\\hat{\\theta}^p_{i,t} = \\{\\hat{\\theta}^p_{i,t}(p), \\hat{\\theta}^p_{i,t}(p-1), \\ldots, \\hat{\\theta}^p_{i,t}(1)\\}$ and $\\tilde{\\theta}^p_{i,t} = \\{\\tilde{\\theta}^p_{i,t}(p+1), \\tilde{\\theta}^p_{i,t}(p+2), \\ldots, \\tilde{\\theta}^p_{i,t}(\\infty)\\}$. For\nnotational simplicity, we denote Vh = Vi,t(h) when constructing $\\theta^p_{i,t}$, yielding\n\n\\[ \\hat{\\theta}^p_{i,t}(u) = V_u \\prod_{h=u+1}^{p} (1 - V_h), \\qquad \\tilde{\\theta}^p_{i,t}(d) = V_d \\prod_{h=p+1}^{d-1} (1 - V_h), \\qquad V_h \\sim \\mathrm{beta}(V_h \\mid 1, \\alpha) \\tag{2} \\]\n\nNote that the above SBP contains two constructions: When d > p the stick weight $\\tilde{\\theta}^p_{i,t}(d)$ is constructed\nas a classic SBP but with the stick-breaking construction starting at layer p + 1. On the\nother hand when u \u2264 p the stick weight $\\hat{\\theta}^p_{i,t}(u)$ is constructed \u201cbackwards\u201d from p to the root node,\nwhich is a truncated stick breaking process with truncation level set to p. A thorough discussion\nof the truncated stick breaking process is found in [10]. We further use a beta distributed latent\nvariable \u03c9i,t to combine the two stick breaking processes while ensuring that the elements of\n$\\theta^p_{i,t} = \\{\\hat{\\theta}^p_{i,t}, \\tilde{\\theta}^p_{i,t}\\}$ sum to one. Thus we have the following distribution over layers of a given path\nfrom which the layer allocation variables {li,n : t(i, n) = t} for a selection at time t by individual i\nare sampled:\n\n\\[ l_{i,n} \\sim \\omega_{i,t} \\sum_{l=1}^{p} \\hat{\\theta}^p_{i,t}(l)\\, \\delta_l + (1 - \\omega_{i,t}) \\sum_{l=p+1}^{\\infty} \\tilde{\\theta}^p_{i,t}(l)\\, \\delta_l, \\qquad \\omega_{i,t} \\sim \\mathrm{Beta}(\\omega_{i,t} \\mid a_\\omega, b_\\omega) \\tag{3} \\]\n\n[Figure 1 plot: stick length versus index (1 to 60) for cpSBP and SBP, each with \u03b1 = 1, 5, 10.]\n\nNote that the change point stick breaking process (cpSBP) can be treated as a generalization of the\nstick breaking process for the Dirichlet process, since when pi,t = 0 the cpSBP corresponds to the\nSBP. From the simulation study in Figure 1, we observe that the change point, which is modeled\nthrough the temporal covariate t as in (1), corresponds to the layer with large stick weight and\nthus at which topic draws are most probable. 
Also note that one may alternatively suggest simply\nusing pi,t directly as the layer from which a topic is drawn, without the subsequent use of a cpSBP.\nWe examined this in the course of the experiments, and it did not work well, likely as a result of\nthe in\ufb02exibility of the single-parameter Poisson (with its equal mean and variance). The cpSBP\nprovided the additional necessary modeling \ufb02exibility.\n\n2.3 Sharing topics among different nodes\n\nOne problem with the nCRP-based topic model, implied by the tree structure, is that all descendent\nsub-topics from parent node pa1 are distinct from the descendants of parent pa2, if pa1 (cid:54)= pa2.\nSome of these distinct sets of children from different parents may be redundant, and this redundancy\ncan be removed if a child can have more than one parent [7, 13, 6]. In addition to the above problem,\nin our application there are other potential problems brought by the cpSBP. Since we encourage the\nlater choice selections to be drawn from topics deeper in the tree, redundant topics at multiple layers\nmay be manifested if two types of people tend to make similar choices at different time points (e.g.,\nat different stages of life). Thus it is likely that similar (redundant) topics may be learned on different\nlayers of the tree, and the inability of the original nCRP construction to share these topics misses\nanother opportunity to share statistical strength.\nIn [7, 13, 6] the authors addressed related challenges by replacing the tree structure with a directed\nacyclic graph (DAG), demonstrating success for document modeling. However, those solutions\ndon\u2019t have the \ufb02exibility of sharing topics on nodes among different layers. Here we propose a new\nand simpler approach, so that the nCRP-based tree hierarchy is retained, while different nodes in the\nwhole tree may share the same topic, resolving the two problems discussed above. 
To achieve this\nwe draw a set of \u201cglobal\u201d topics $\\{\\hat{\\phi}_k\\}$, and a stick-breaking process is employed to allocate one of\nthese global topics as \u03c6j, representing the jth node in the tree (this corresponds to drawing the {\u03c6j}\nfrom a Dirichlet process [16], with a Dirichlet distribution base). The SBP defined over the global\ntopics is represented as follows:\n\n\\[ \\pi_k = \\delta_k \\prod_{i=1}^{k-1} (1 - \\delta_i), \\qquad \\delta_i \\sim \\mathrm{Beta}(\\delta_i \\mid 1, \\eta), \\qquad \\hat{\\phi}_k \\sim \\mathrm{Dir}(\\hat{\\phi}_k \\mid \\beta), \\qquad \\phi_j \\sim \\sum_{k=1}^{\\infty} \\pi_k \\delta_{\\hat{\\phi}_k} \\tag{4} \\]\n\nWithin the generative process, let zi,n denote the assignment of the nth choice of individual i to\nglobal topic $\\hat{\\phi}_{z_{i,n}}$; then the corresponding item chosen is drawn from $\\mathrm{Mult}(1, \\hat{\\phi}_{z_{i,n}})$.\n\n3 Model Inference\n\nIn the proposed model, we sample the per-individual tree path indicator bi, the layer allocation of\nchoice topics in those paths li,n, the change point pi,t for each time interval, the parameters associated\nwith the cpSBP construction \u03b3i,t, \u03c9i,t, $\\theta^p_{i,t}$, the stick breaking weight \u03c0 over the global topics\n$\\hat{\\phi}_k$, and the global topic-assignment indicator zi,n. Similar to [4], the per-node topic parameters \u03c6n\nare marginalized out. We provide update equations cycling through $\\{l_{i,n}, p_{i,t}, \\gamma_{i,t}, \\omega_{i,t}, \\theta^p_{i,t}\\}$ that\nare unique for this model. The update equations for bi and {\u03c0, zi,n} are similar to the ones in\n[4] and [18], respectively, which we do not reproduce here for brevity.\n\nSampling for change point pi,t Due to the non-conjugacy between the Poisson and multinomial\ndistributions, the exact form of its posterior distribution is difficult to compute. Additionally, in order\nto sample pi,t, we require imputation of an infinite-dimensional process. The implementation of\nthe sampling algorithm either relies on finite approximations [10] which lead to straightforward update\nequations, or requires an additional Metropolis-Hastings (M-H) step which allows us to obtain\nsamples from the exact posterior distribution of pi,t with no approximation, e.g., the retrospective\nsampler [14] proposed for Dirichlet process hierarchical models. In this section we first introduce\nthe finite approximation based sampler, and the retrospective sampling scheme based method will\nbe described in the supplemental material.\nDenote P as the truncated maximum value of the change point, then given the samples of all other\nlatent variables, pi,t can be sampled from the following equation:\n\n\\[ q(p_{i,t} = p \\mid \\theta^p_{i,t}, \\lambda_{i,t}, \\omega_{i,t}, l_{i,t}) \\propto p(p_{i,t} = p \\mid \\lambda_{i,t}, P)\\, p(l_{i,t} \\mid \\theta^p_{i,t}, \\omega_{i,t}), \\qquad 0 \\le p \\le P \\tag{5} \\]\n\nwhere li,t = {li,n : t(i, n) = t} are all layer allocations of choices made by individual i at time\nt. $p(p_{i,t} = p \\mid \\lambda_{i,t}, P) = \\frac{\\lambda_{i,t}^{p} e^{-\\lambda_{i,t}}}{p!\\, C_P}$ is the Poisson density function truncated with p \u2264 P, and\n$C_P = \\sum_{p=1}^{P} \\frac{\\lambda_{i,t}^{p} e^{-\\lambda_{i,t}}}{p!}$. $p(l_{i,t} \\mid \\theta^p_{i,t}, \\omega_{i,t}) = \\mathrm{Mult}(l_{i,t} \\mid \\{\\omega_{i,t} \\hat{\\theta}^p_{i,t}, (1 - \\omega_{i,t}) \\tilde{\\theta}^p_{i,t}\\})$ is the multinomial\ndensity function over the layer allocations li,t.\n\nSampling choice layer allocation li,n Given all the other variables, now we sample the layer\nallocation li,n for the nth choice made by individual i. Denote ci,n as the nth choice made by\nindividual i, and $M_{z_{i,n},c_{i,n}} = \\#[z_{-(i,n)} = z_{i,n}, c_{-(i,n)} = c_{i,n}] + \\beta$ as the smoothed count of seeing\nchoice ci,n allocated to global topic zi,n, excluding the current choice. Parameter li,n can be sampled\nfrom the following equation:\n\n\\[ p(l_{i,n} = l \\mid p_{i,t} = p, z, \\omega_{i,t}, \\theta^p_{i,t}, c) \\propto \\begin{cases} \\omega_{i,t}\\, \\hat{\\theta}^p_{i,t}(l)\\, M_{z_{i,n},c_{i,n}}, & 0 < l \\le p \\\\ (1 - \\omega_{i,t})\\, \\tilde{\\theta}^p_{i,t}(l)\\, M_{z_{i,n},c_{i,n}}, & p < l \\le P \\end{cases} \\]\n\nSampling for product-of-gammas construction \u03b3i,t From (1) note that the temporal dependent\nintensity parameter \u03bbi,t can be reconstructed from the gamma distributed variables \u03b3i,t, which again\ncan be sampled directly from its posterior distribution given all other variables, due to the conjugacy\nof product-gamma variable and Poisson construction. Denoting $\\tau^{(t)}_{i,l} = \\prod_{j=1, j \\neq t}^{l} \\gamma_{i,j}$, we have:\n\n\\[ p(\\gamma_{i,t} \\mid \\{p_{i,t}\\}_{t=1}^{T}, a_{i,t}, b_{i,t}) = \\mathrm{Ga}\\Big(\\gamma_{i,t} \\,\\Big|\\, a_{i,t} + \\sum_{l=t}^{T} p_{i,l},\\; b_{i,t} + \\sum_{l=t}^{T} \\tau^{(t)}_{i,l}\\Big) \\]\n\nSampling for cpSBP parameters $\\{\\omega_{i,t}, \\theta^p_{i,t}\\}$ Given the change points pi,t and choice layer allocation\nli,t = {li,n : t(i, n) = t}, the cpSBP parameters $\\theta^p_{i,t} = \\{\\hat{\\theta}^p_{i,t}, \\tilde{\\theta}^p_{i,t}\\}$ can be reconstructed\nbased on samples of Vh as defined in (2). Specifically, we have\n\n\\[ p(V_h \\mid p_{i,t} = p, l_{i,t}) = \\begin{cases} \\mathrm{Beta}\\Big(V_h \\,\\Big|\\, a_h + N_{h,t},\\; b_h + \\sum_{l=1}^{h-1} N_{l,t}\\Big) & \\text{if } h \\le p \\\\ \\mathrm{Beta}\\Big(V_h \\,\\Big|\\, a_h + N_{h,t},\\; b_h + \\sum_{l=h+1}^{\\max l_{i,t}} N_{l,t}\\Big) & \\text{if } h > p \\end{cases} \\]\n\nwhere Nl,t = #[li,n = l, t(i, n) = t] records the number of times a choice made by individual\ni in time interval t is allocated to path layer l. Given the samples of other variables, \u03c9i,t is\nsampled from its full conditional posterior distribution:\n\n\\[ p(\\omega_{i,t} \\mid p_{i,t} = p, \\{l_{i,n} : t(i,n) = t\\}) = \\mathrm{Beta}\\Big(\\omega_{i,t} \\,\\Big|\\, 1 + \\sum_{l=1}^{p} N_{l,t},\\; 1 + \\sum_{l=p+1}^{\\max l_{i,t}} N_{l,t}\\Big) \\]\n\nSampling the Hyperparameters Concerning hyperparameters \u03b1, \u03b7, \u03b30, \u03b2, related to the stick\nbreaking process and hierarchical topic model construction, we sample them within the inference\nprocess by placing prior distributions over them, similar to methods in [4]. One may also consider\nother alternatives for learning the hyperparameters within topic models [19]. For the hyperparameters\nai,l, bi,l in the product-of-gammas construction, we sample them as proposed in [3]. Finally, we\nfix a\u03c9 = 1 and sample b\u03c9 by placing a gamma prior distribution on it. All these steps are done by an\nM-H step between iterations of the Gibbs sampler.\n\n4 Analysis of student course selection\n\n4.1 Data description and computations\n\nWe demonstrate the proposed model on real data by considering selections of classes made by undergraduate\nstudents at Duke University, for students in graduating classes 2009 to 2013; the data\nconsist of class selections of all students from Fall 2005 to Spring 2011. For computational reasons\nthe cpSBP and SBP employed over the tree-path depth are truncated to a maximum of 10 layers (beyond\nthis depth the number of topics employed by the data was minimal), while the number of children\nof each parent node is allowed to be unbounded. 
Within the sampler, we ran the model based on\nthe class selection records of students from class of 2009 and 2010 (total of 3382 students and 2756\nunique classes), and collected 200 samples after burn-in, taking every fifth sample to approximate\nthe posterior distribution over the latent tree structure as well as the topic on each node of the tree.\nWe analyze the quality of the learned models using the remaining data (classes of 2011-2013), characterized\nby 5171 students and 2972 unique classes. Each topic is a probabilistic vector defined over\n3015 classes offered across all years. Within the MCMC inference procedure we trained our model\nas follows: first, we fixed the change point pi,t = t and then ran the sampler for 100 iterations, then\nburned in the inference for 5000 iterations with pi,t updated before drawing 5000 samples from the\nfull posterior.\n\n4.2 Quantitative assessment\n\nModel | # Topics | # Nodes | Predictive LL (11) | Predictive LL (11-13)\nnCRP | 492\u00b111 | 492\u00b111 | -293226.8399 | -471736.8876\ncpSBP-nCRP | 973\u00b137 | 973\u00b137 | -290271.3576 | -469912.1120\nDP-nCRP | 318\u00b126 | 521\u00b141 | -292311.3971 | -471951.3452\nFull model | 367\u00b132 | 961\u00b144 | -288511.4298 | -468331.2990\n\nTable 1: Predictive log-likelihood comparison on two datasets, given the mean of number of topics\nand nodes learned with rounded standard deviation. nCRP is the model proposed in [4]. Compared\nto nCRP, the cpSBP-nCRP replaced SBP with the proposed cpSBP, while in DP-nCRP the draw of\ntopic for each node is from Dirichlet process (DP) instead of Dirichlet distribution and retained the\nSBP construction in nCRP. The full model used both cpSBP and DP. Results shown for class of\n2011, and classes 2011-2013.\n\nFigure 2: Histograms of class layer allocations according to their time covariates. 
Left: Stick breaking\nprocess, Right: Change point stick breaking process\n\nIn this section we examine the model\u2019s ability to explain unseen data. For comparison consistency\nwe computed the predictive log-likelihood based on the samples collected in the same way as [4]\n(alternative means of evaluating topic models are discussed in [20]). We test the model using two\ndifferent compositions of the data. The first is based on class selection history of students from class\nof 2011 (1696 students), where all 4 years of records are available. The second is based on class\nselection history of students from class of 2011 to 2013 (3475 students), where for the later two\nclass years only partial course selection information is available, e.g., for students from class of 2013\nonly class selection choices made in freshman year are available. Additionally, we compare the\ndifferent models with respect to the learned number of topics and the learned number of tree nodes.\nThis comparison is an indicator of the level of \u201cparsimony\u201d of the proposed model, introduced by\nreplacing independent draws of topics from a Dirichlet distribution by draws from a Dirichlet process\n(with Dirichlet distribution base), as explained in Section 2.3. Since the number of tree nodes grows\nexponentially with the number of tree layers, from a practical viewpoint sharing topics among the\nnodes saves memory used to store the topic vectors, whose dimension is typically large (here the\nnumber of classes: 3015). In addition to the above insight, as the experimental results indicate,\nsharing topics among different nodes can enhance the sharing of statistical strength, which leads to\nbetter predictive performance. 
The results are summarized in Table 1.\n\n[Figure 2 plots: histograms of layer allocation counts (scale x 10^4) over layers 1 to 10, by academic year (Freshman, Sophomore, Junior, Senior).]\n\nFigure 3: Change of the proportion of important majors along the layers of two paths which share the\nnodes up to the second layer. These two paths correspond to the full versions (all 7 layers) of the top\ntwo paths in Figure 4. BME: Biomedical Engineering, POLI: Political Science, ECON: Economics,\nCS: Computer Science, BIO: Biology, PPS: Public Policy Science, ME: Mechanical Engineering,\nOTHER: other 73 majors.\n\nWe hypothesize that the enhanced performance of the proposed model to explain the unseen data is\nalso due to its improved ability to capture the latent predictive statistical structure, e.g., to capture\nthe latent temporal dynamics within the data by the change point stick breaking process (cpSBP).\nTo demonstrate this point, in Figure 2 we compare how cpSBP and SBP guided the class layer\nallocations which have associated time covariates (e.g., the academic year of each student). From\nFigure 2 we observe that under cpSBP, as the students\u2019 academic career advances, they are more\nlikely to choose classes from topics deeper in the tree, while such a pattern is less obvious in the\nSBP case. Further, cpSBP encouraged the data to utilize more layers in the tree than SBP.\n\n4.3 Analyzing the learned tree\n\nWith incorporation of time covariates, we examine if the uncovered hierarchical structure is consistent\nwith the actual curriculum of students from their freshman to senior year. We consider two\nanalyses here.\nThe first is a visualization of a subtree learned from the class-selection history, based on students of\nthe class of 2009, as shown in Figure 4; shown are the most-probable classes in each topic, as well as\na histogram on the covariates (1 to 4, for freshman through senior) of the students who employed the\ntopic. 
For example, the topics on the top two layers correspond to the most popular classes selected\nby mechanical engineering and computer science students, respectively, while topics located to the\nright correspond to more advanced classes; the left-most (root) topic corresponds to classes\nrequired for all students (e.g., academic writing). The tree structured hierarchy captured the general\ntrend of class selection within/across different majors.\nIn Figure 4 we also highlight a topic in red, shared by two nodes. This topic corresponds to a set of\ngeneral introductory classes which are popular (high attendance) for two types of students: (i) young\nstudents who take these classes early for preparation of future advanced studies, and (ii) students\nwho need to fill elective requirements later in their academic career (\u201cideally\u201d of an easy/elementary\nnature, to not \u201cdistract\u201d from required classes from the major). It is therefore deemed interesting\nthat these same classes seem to be preferred by young and old students, for apparently very different\nreasons. Note that the sharing of topics between nodes of different layers is a unique aspect of this\nmodel, not possible in [7, 13, 6].\nIn the second analysis we examine how the majors of students are distributed in the learned tree;\nthe ideal case would be that each tree path corresponds to an academic major, and the nodes shared\nby paths manifest sharing of topics between different but related majors. In Figure 3 we show the\nchange of proportions of different majors among different layers of the top two paths in Figure 4\n(this is a zoom-in of a much larger tree). For a clear illustration, we show the seven most popular\nmajors for these paths as a function of time (out of a total of 80 majors), and the remaining 73 majors\nare grouped together. 
We observe that the students with mechanical engineering (ME) majors share the\nnode on the second layer with students with a computer science (CS) major, and the layers deeper in\nthe tree begin to be exclusive to students with CS and ME majors, respectively. This phenomenon\ncorresponds to the process of a student determining her major by choosing courses as she walks down\na tree path. This also matches the fact that in this university, students declare their major during the\nsophomore year.\n\n[Figure 3 plots: two panels, one per path, showing proportion (0 to 1) versus layer (1 to 7) for majors BME, POLI, ECON, CS, BIO, PPS, ME and OTHER.]\n\nFigure 4: A subtree of topics learned from courses chosen by undergraduate students of class 2009;\nthe whole tree has 372 nodes and 252 topics, and a maximum of 7 layers. Each node shows two\naggregated statistics at that node: the eight most common classes of the topic on that node and a\nhistogram of the academic year the topic was selected by students (1-4, for freshman-senior). The\ncolumns in each histogram correspond to freshman to senior year from left to right. The two\nhighlighted red nodes share the same topic. These results correspond to one (maximum-likelihood)\ncollection sample.\n\n5 Discussion\n\nWe have extended hierarchical topic models to an important problem that has received limited attention\nto date: the evolution of personal choices over time. The proposed approach builds upon the\nnCRP [4], but introduces novel modeling components to address the problem of interest. Specifically,\nwe develop a change-point stick-breaking process, coupled with a product of gammas and\nPoisson construction, that encourages individuals to be represented by nodes deeper in the tree as\ntime evolves. The Dirichlet process has also been used to design the node-dependent topics, sharing\nstrength and inferring relationships between choices of different people over time. 
The framework\nhas been successfully demonstrated with a real-world data set: selection of courses over many years,\nfor students at Duke University.\nAlthough we worked only on one specific real-world data set, there are many other examples for\nwhich such a model may be of interest, especially when the data correspond to a sparse set of choices\nover time. For example, it could be useful for companies attempting to understand the purchases\n(choices) of customers, as a function of time (e.g., the clothing choices of people as they advance\nfrom teen years to adulthood). This may be of interest in marketing and targeted advertisement.\n\nAcknowledgements\nWe would like to thank the anonymous reviewers for their insightful comments. The research reported\nhere was supported by AFOSR, ARO, DARPA, DOE, NGA and ONR.\n\n[Figure 4 graphic: node labels listing the most common course titles for each topic.]\n\nReferences\n[1] R. P. Adams, Z. Ghahramani, and M. I. Jordan. Tree-structured stick breaking for hierarchical data. In Neural Information Processing Systems (NIPS), 2010.\n\n[2] E. Bart, I. Porteous, P. Perona, and M. Welling. Unsupervised learning of visual taxonomies. In CVPR, 2008.\n\n[3] A. Bhattacharya and D. B. Dunson. Sparse Bayesian infinite factor models. Biometrika, 2011.\n\n[4] D. M. Blei, T. L. Griffiths, and M. I. Jordan. 
The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2), 2010.\n\n[5] D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Neural Information Processing Systems (NIPS), 2004.\n\n[6] A. Chambers, P. Smyth, and M. Steyvers. Learning concept graphs from text with stick-breaking priors. In Advances in Neural Information Processing Systems (NIPS), 2010.\n\n[7] H. Chen, D. B. Dunson, and L. Carin. Topic modeling with nonparametric Markov tree. In Proc. Int. Conf. Machine Learning (ICML), 2011.\n\n[8] T. L. Griffiths, M. Steyvers, and J. B. Tenenbaum. Topics in semantic representation. Psychological Review, 114(2):211\u2013244, 2007.\n\n[9] C. Hans and D. B. Dunson. Bayesian inferences on umbrella orderings. Biometrics, 61:1018\u20131026, 2005.\n\n[10] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161\u2013173, 2001.\n\n[11] L. Li, C. Wang, Y. Lim, D. Blei, and L. Fei-Fei. Building and using a semantivisual image hierarchy. In CVPR, 2010.\n\n[12] W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proc. Int. Conf. Machine Learning (ICML), 2006.\n\n[13] D. Mimno, W. Li, and A. McCallum. Mixtures of hierarchical topics with Pachinko allocation. In Proc. Int. Conf. Machine Learning (ICML), 2007.\n\n[14] O. Papaspiliopoulos and G. O. Roberts. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika, 95(1):169\u2013186, 2008.\n\n[15] R. Salakhutdinov, J. Tenenbaum, and A. Torralba. One-shot Learning with a Hierarchical Nonparametric Bayesian Model. MIT Technical Report, 2011.\n\n[16] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639\u2013650, 1994.\n\n[17] J. Sivic, B. C. Russell, A. Zisserman, W. T. Freeman, and A. A. Efros. Unsupervised discovery of visual object class hierarchies. In CVPR, 2008.\n\n[18] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566\u20131581, 2006.\n\n[19] H. M. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. In Neural Information Processing Systems (NIPS), 2009.\n\n[20] H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In Proc. Int. Conf. Machine Learning (ICML), 2009.\n\n[21] C. Wang and D. M. Blei. Variational inference for the nested Chinese restaurant process. In Neural Information Processing Systems (NIPS), 2009.\n\n[22] X. Zhang, D. B. Dunson, and L. Carin. Tree-structured infinite sparse factor model. In Proc. Int. Conf. Machine Learning (ICML), 2011.\n", "award": [], "sourceid": 814, "authors": [{"given_name": "Xianxing", "family_name": "Zhang", "institution": null}, {"given_name": "Lawrence", "family_name": "Carin", "institution": null}, {"given_name": "David", "family_name": "Dunson", "institution": null}]}