Dynamic Rank Factor Model for Text Streams

Shaobo Han*, Lin Du*, Esther Salazar and Lawrence Carin
{shaobo.han, lin.du, esther.salazar, lcarin}@duke.edu
Duke University, Durham, NC 27708

Advances in Neural Information Processing Systems, pages 2663-2671

Abstract

We propose a semi-parametric and dynamic rank factor model for topic modeling, capable of (i) discovering topic prevalence over time, and (ii) learning contemporary multi-scale dependence structures, providing topic and word correlations as a byproduct. The high-dimensional and time-evolving ordinal/rank observations (such as word counts), after an arbitrary monotone transformation, are well accommodated through an underlying dynamic sparse factor model. The framework naturally admits heavy-tailed innovations, capable of inferring abrupt temporal jumps in the importance of topics.
Posterior inference is performed through straightforward Gibbs sampling, based on the forward-filtering backward-sampling algorithm. Moreover, an efficient data subsampling scheme is leveraged to speed up inference on massive datasets. The modeling framework is illustrated on two real datasets: the US State of the Union Address and the JSTOR collection from Science.

1 Introduction

Multivariate longitudinal ordinal/count data arise in many areas, including economics, opinion polls, text mining, and social science research. Due to the lack of discrete multivariate distributions supporting a rich enough correlation structure, one popular choice for modeling correlated categorical data employs the multivariate normal mixture of independent exponential-family distributions, after appropriate transformations. Examples include the logistic-normal model for compositional data [1], the Poisson log-normal model for correlated count data [2], and the ordered probit model for multivariate ordinal data [3]. Moreover, a dynamic Bayesian extension of the generalized linear model [4] may be considered, for capturing the temporal dependencies of non-Gaussian data (such as ordinal data). In this general framework, the observations are assumed to follow an exponential-family distribution, with natural parameter related to a conditionally Gaussian dynamic model [5] via a nonlinear transformation. However, these model specifications may still be too restrictive in practice, for the following reasons: (i) Observations are usually discrete, non-negative and with a massive number of zero values and, unfortunately, far from any standard parametric distribution (e.g., multinomial, Poisson, negative binomial, and even their zero-inflated variants). (ii) The number of contemporaneous series can be large, bringing difficulties in sharing/learning statistical strength and in performing efficient computations.
(iii) The linear state evolution is not truly manifested after a nonlinear transformation, where positive shocks (such as outliers and jumps) are magnified and negative shocks are suppressed; hence, handling temporal jumps (up and down) is a challenge for the above models.

We present a flexible semi-parametric Bayesian model, termed the dynamic rank factor model (DRFM), that does not suffer these drawbacks. We first reduce the effect of model misspecification by modeling the sampling distribution non-parametrically. To do so, we fit the observed data only after some implicit monotone transformation, learned automatically via the extended rank likelihood [6]. Second, instead of treating panels of time series as independent collections of variables, we analyze them jointly, with the high-dimensional cross-sectional dependencies estimated via a latent factor model. Finally, by avoiding nonlinear transformations, both smooth transitions and sudden changes ("jumps") are better preserved in the state-space model, using heavy-tailed innovations.

The proposed model offers an alternative to both dynamic and correlated topic models [7, 8, 9], with the additional modeling facility of word dependencies, and improved ability to handle jumps. It also provides a semi-parametric Bayesian treatment of a dynamic sparse factor model. Further, our proposed framework is applicable in the analysis of multiple ordinal time series, where the innovations follow either stationary Gaussian or heavy-tailed distributions.

* Contributed equally.

2 Dynamic Rank Factor Model

We perform analysis of multivariate ordinal time series.
In the most general sense, such ordinal variables indicate a ranking of responses in the sample space, rather than a cardinal measure [10]. Examples include real continuous variables, discrete ordered variables with or without numerical scales or, more specially, counts, which can be viewed as discrete variables with integer numeric scales. Our goal is twofold: (i) discover the common trends that govern variations in the observations, and (ii) extract interpretable patterns from the cross-sectional dependencies.

Dependencies among multivariate non-normal variables may be induced through normally distributed latent variables. Suppose we have P ordinal-valued time series y_{p,t}, p = 1, ..., P, t = 1, ..., T. The general framework contains three components:

y_{p,t} ~ g(z_{p,t}),   z_{p,t} ~ p(θ_t),   θ_t ~ q(θ_{t−1}),   (1)

where g(·) is the sampling distribution, or marginal likelihood for the observations, the latent variable z_{p,t} is modeled by p(·) (assumed to be Gaussian) with underlying system parameters θ_t, and q(·) is the system equation representing Markovian dynamics for the time-evolving parameter θ_t.

In order to gain more model flexibility and robustness against misspecification, we propose a semi-parametric Bayesian dynamic factor model for multiple ordinal time series analysis. The model is based on the extended rank likelihood [6], allowing the transformation from the latent conditionally Gaussian dynamic model to the multivariate observations to be treated non-parametrically.

Extended rank likelihood (ERL): There exist many approaches for dealing with ordinal data; however, they all have some restrictions. For continuous variables, the underlying normality assumption could be easily violated without a carefully chosen deterministic transformation.
For discrete ordinal variables, an ordered probit model, with cut points, becomes computationally expensive if the number of categories is large. For count variables, a multinomial model requires finite support on the integer values. Poisson and negative binomial models lack flexibility from a practical viewpoint, and often lead to non-conjugacy when employing log-normal priors.

Being aware of these issues, a natural candidate for consideration is the ERL [6]. With appropriate monotone transformations learned automatically from the data, it offers a unified framework for handling both continuous [11] and discrete ordinal variables. The ERL depends only on the ranks of the observations (zero values in the observations are further restricted to have negative latent variables):

z_{p,t} ∈ D(Y) ≡ {z_{p,t} ∈ R : y_{p,t} < y_{p′,t′} ⇒ z_{p,t} < z_{p′,t′}, and z_{p,t} ≤ 0 if y_{p,t} = 0}.   (2)

In particular, this offers a distribution-free approach, with relaxed assumptions compared to parametric models such as the Poisson log-normal [12]. It also avoids the burden of computing nuisance parameters (cut points) in the ordered probit model. The ERL has been utilized in Bayesian Gaussian copula modeling, to characterize the dependence of mixed data [6]. In [13] a low-rank decomposition of the covariance matrix is further employed, and efficient posterior sampling is developed in [14].
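The rank-only nature of the ERL in (2) can be checked numerically: any strictly monotone transformation of the counts leaves the ranks, and hence the constraint set D(Y), unchanged. A minimal sketch (toy data and the `ranks` helper are ours):

```python
import numpy as np

# Toy word-count matrix (P words x T times). The ERL uses only the ranks
# of these values, plus the constraint z <= 0 wherever y == 0.
y = np.array([[0, 2, 5],
              [1, 0, 7],
              [3, 3, 0]])

def ranks(a):
    # Ranks of the flattened entries; a stable sort breaks ties by index,
    # so tied observations (which impose no ERL constraint) rank consistently.
    order = np.argsort(a, axis=None, kind="stable")
    r = np.empty_like(order)
    r[order] = np.arange(order.size)
    return r.reshape(a.shape)

# Strictly monotone transformations preserve the ranks, so the likelihood
# contribution D(Y) in Eq. (2) is identical.
assert np.array_equal(ranks(y), ranks(np.log1p(y)))
assert np.array_equal(ranks(y), ranks(y.astype(float) ** 3))
```

This is why no parametric marginal (Poisson, negative binomial, etc.) needs to be specified: only the ordering of the observations enters the model.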
The proposed work herein can be viewed as a dynamic extension of that framework.

2.1 Latent sparse dynamic factor model

In the forthcoming text, G(α, β) denotes a gamma distribution with shape parameter α and rate parameter β, TN_(l,u)(µ, σ²) denotes a univariate truncated normal distribution restricted to the interval (l, u), and N₊(0, σ²) is the half-normal distribution with non-negative support.

Assume z_t ~ N(0, Ω_t), where Ω_t is usually a high-dimensional (P × P) covariance matrix. To reduce the number of parameters, we assume a low-rank factor model decomposition of the covariance matrix, Ω_t = Λ V_t Λᵀ + R, such that

z_t = Λ s_t + ε_t,   ε_t ~ N(0, R),   R = I_P.   (3)

Common trends (importance of topics) are captured by a low-dimensional factor score parameter s_t. We assume autoregressive dynamics s_{k,t} ← AR(1 | (ρ_k, δ_{k,t})) with heavy-tailed innovations,

s_{k,t} = ρ_k s_{k,t−1} + δ_{k,t},   δ_{k,t} ~ TPBN(e, f, ν),   0 < ρ_k < 1,   (4)

where δ_{k,t} follows the three-parameter beta mixture of normals TPBN(e, f, ν) distribution [15]. Parameter e controls the peak around zero, f controls the heaviness of the tails, and ν controls the global sparsity, with a half-Cauchy prior ν^{1/2} ~ C₊(0, h) [16]. This prior encourages smooth transitions in general, while jumps are captured by the heavy tails. The conjugate hierarchy may be equivalently represented as

δ_{k,t} ~ N(0, τ_{k,t}),   τ_{k,t} ~ G(e, η_{k,t}),   η_{k,t} ~ G(f, ν),   ν ~ G(1/2, ζ),   ζ ~ G(1/2, h²).

Truncated normal priors are employed on ρ_k, ρ_k ~ TN_(0,1)(µ₀, σ₀²), and we assume s_{0,k} ~ N(0, σ_s²). Note that the extended rank likelihood is scale-free; therefore, we do not need to include a redundant intercept parameter in (3).
For the same reason, we set R = I_P.

Model identifiability issues: Although the covariance matrix Ω_t is not identifiable [10], the related correlation matrix C_t, with C_{[i,j],t} = Ω_{[i,j],t}/√(Ω_{[i,i],t} Ω_{[j,j],t}) (i, j = 1, ..., P), may be identified, using the parameter expansion technique [3, 13]. Further, the rank K in the low-rank decomposition of Ω_t is also not unique. For the purpose of brevity, we do not explore this uncertainty here, but the tools developed in the Bayesian factor analysis literature [17, 18, 19] can be easily adopted.

Identifiability is a key concern for factor analysis. Conventionally, for fixed K, a full-rank, lower-triangular structure in Λ ensures identifiability [20]. Unfortunately, this assumption depends on the ordering of variables. As a solution, we add nonnegative and sparseness constraints on the factor loadings, to alleviate the inherent ambiguity, while also improving interpretability. Also, we add a Procrustes post-processing step [21] on the posterior samples, to reduce this indeterminacy. The nonnegative and (near) sparseness constraints are imposed by the following hierarchy:

λ_{p,k} ~ N₊(0, l_{p,k}),   l_{p,k} ~ G(a, u_{p,k}),   u_{p,k} ~ G(b, φ_k),   φ_k^{1/2} ~ C₊(0, d).   (5)

Integrating out l_{p,k} and u_{p,k}, we obtain a half-TPBN prior, λ_{p,k} ~ TPBN₊(a, b, φ_k). The column-wise shrinkage parameters φ_k enable factors to be of different sparsity levels [22]. We set hyperparameters a = b = e = f = 0.5, d = P, h = 1, σ_s² = 1. For weakly informative priors, we set α = β = 0.01, µ₀ = 0.5, σ₀² = 10.

2.2 Extension to handle multiple documents

At each time point t we may have a corpus of documents {y_t^{n_t}}_{n_t=1}^{N_t}, where y_t^{n_t} is a P-dimensional observation vector and N_t denotes the number of documents at time t.
The model presented in Section 2.1 is readily extended to handle this situation. Specifically, at each time point t, for each document n_t, the ERL representation for word count p, denoted y_{p,t}^{n_t}, is

y_{p,t}^{n_t} = g(z_{p,t}^{n_t}),   p = 1, ..., P,   t = 1, ..., T,   n_t = 1, ..., N_t.

We assume a latent factor model for z_t^{n_t}, such that

z_t^{n_t} = Λ b_t^{n_t} + ε_t^{n_t},   b_t^{n_t} ~ N(s_t, Γ),   Γ = diag(γ),   γ_k^{−1} ~ G(α, β),   ε_t^{n_t} ~ N(0, I_P),

where Λ ∈ R₊^{P×K} is the topic-word loading matrix, representing the K topics as columns of Λ, y_t^{n_t} ∈ R^P, and P is the vocabulary size. The factor score vector b_t^{n_t} ∈ R^K is the topic usage for each document y_t^{n_t}, corresponding to locations in a low-dimensional space R^K. The other parts of the model remain unchanged. The latent trajectory s_{1:T} represents the common trends for the K topics. Moreover, through the forward filtering backward sampling (FFBS) algorithm [23, 24], we also obtain time-evolving topic correlation matrices Φ_t ∈ R^{K×K} and word dependence matrices C_t ∈ R^{P×P}, offering a multi-scale graph representation, a useful tool for document visualization.

2.3 Comparison with admixture topic models

Many topic models are unified in the admixture framework [25],

P_Admix(y_n | w, Φ) = P_Base(y_n | φ_n = Σ_{k=1}^K w_{k,n} φ_k),   (6)

where y_n is the P-dimensional observation vector of word counts in the n-th document, and P denotes the vocabulary size.
Traditionally, y_n is generated from an admixture of base distributions, w_n is the admixture weight (topic proportion for document n), and φ_k is the canonical parameter (word distribution for topic k), which denotes the location of the k-th topic on the (P−1)-dimensional simplex. For example, latent Dirichlet allocation (LDA) [26] assumes the base distribution to be multinomial, with φ_k ~ Dir(α₀), w_n ~ Dir(β₀). The correlated topic model (CTM) [8] modifies the topic distribution, with w_n ~ LogisticNormal(µ, Σ). The dynamic topic model (DTM) [7] analyzes document collections in a known chronological order. In order to incorporate the state-space model, both the topic proportion and the word distribution are changed to logistic normal, with isotropic covariance matrices: w_t ~ LogisticNormal(w_{t−1}, σ² I_K) and φ_{k,t} ~ LogisticNormal(φ_{k,t−1}, v I_P), respectively. To overcome the drawbacks of the multinomial base, spherical topic models [27] assume the von Mises-Fisher (vMF) distribution as the base distribution, with φ_k ~ vMF(µ, ξ) lying on a unit (P−1)-dimensional sphere. Recently, in [25] the base and word distributions are both replaced with Poisson Markov random fields (MRFs), which characterize word dependencies.

We present here a semi-parametric factor model formulation,

P(y_n | s, Λ) ≜ P(z_n ∈ D(Y) | λ_n = Σ_{k=1}^K s_{k,n} λ_k),   (7)

with y_n defined as above, λ_k ∈ R₊^P a vector of nonnegative weights, indicating the P vocabulary usages in each individual topic k, and s_n ∈ R^K the topic usage. Note that the extended rank likelihood does not depend on any assumptions about the data marginal distribution, making it appropriate for a broad class of ordinal-valued observations, e.g., term frequency-inverse document frequency (tf-idf) or rankings, beyond word counts.
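As a toy illustration of this factor formulation, consider generating ordinal data from the latent Gaussian factor model under one arbitrary monotone transformation (the dimensions and the choice g(z) = ⌊exp z⌋ below are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
P, K, N = 6, 2, 100
Lam = np.abs(rng.normal(size=(P, K)))   # nonnegative topic-word loadings
s = rng.normal(size=(K, N))             # topic usage, positive or negative
z = Lam @ s + rng.normal(size=(P, N))   # latent Gaussian factor model
y = np.floor(np.exp(z)).astype(int)     # one arbitrary monotone g

# Counts are zero exactly where the latent variable is negative,
# matching the zero constraint of the ERL.
assert np.array_equal(y == 0, z < 0)
```

The model would be fit from y alone: only the ranks of y (and its zeros) are used, so g never needs to be known or estimated.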
However, the proposed model here is not an admixture model, as the topic usage is allowed to be either positive or negative.

The DRFM framework has some appealing advantages: (i) it is more natural and convenient to incorporate sparsity, rank selection, and a state-space model; (ii) it provides topic correlations and word dependencies as a byproduct; and (iii) computationally, the model is tractable and often leads to locally conjugate posterior inference. DRFM also has limitations. Since the marginal distributions are of unspecified types, objective criteria (e.g., perplexity) are not directly computable. This makes quantitative comparisons to other parametric baselines developed in the literature very difficult.

3 Conjugate Posterior Inference

Let Θ = {Λ, S, L, U, φ, ω, ρ, τ, η, ν, ζ} denote the set of parameters in the basic model, and let Z be the augmented data (from the ERL). We use Gibbs sampling to approximate the joint posterior distribution p(Z, Θ | Z ∈ R(Y)). The algorithm alternates between sampling p(Z | Θ, Z ∈ R(Y)) and p(Θ | Z, Z ∈ R(Y)) (which reduces to p(Θ | Z)). The derivation of the Gibbs sampler is straightforward, and for brevity here we only highlight the sampling steps for Z, and the forward filtering backward sampling (FFBS) steps for the trajectory s_{1:T}. The Supplementary Material contains further details of the inference.

• Sampling z_{p,t}: p(z_{p,t} | Θ, Z ∈ R(Y), Z_{−p,−t}) ~ TN_(z_{p,t}^{lo}, z_{p,t}^{up})(Σ_{k=1}^K λ_{p,k} s_{k,t}, 1), where z_{p,t}^{lo} = max{z_{p′,t′} : y_{p′,t′} < y_{p,t}} and z_{p,t}^{up} = min{z_{p′,t′} : y_{p′,t′} > y_{p,t}}.

This conditional sampling scheme is widely used in [6, 10, 13].
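The conditional update above can be sketched in a few lines (a minimal sketch; the function name, array layout, and use of `scipy.stats.truncnorm` are ours):

```python
import numpy as np
from scipy.stats import truncnorm

def sample_z(y, Z, Lam, s, p, t, rng):
    """One Gibbs update of z[p, t] under the extended rank likelihood.

    The truncation interval comes from the latent values of observations
    ranked just below and just above y[p, t]; ties impose no constraint.
    y, Z: (P, T) arrays; Lam: (P, K) loadings; s: (K, T) factor scores.
    """
    mean = Lam[p] @ s[:, t]                 # sum_k lambda_{p,k} s_{k,t}
    below = Z[y < y[p, t]]
    above = Z[y > y[p, t]]
    lo = below.max() if below.size else -np.inf
    hi = above.min() if above.size else np.inf
    if y[p, t] == 0:                        # zeros force z <= 0
        hi = min(hi, 0.0)
    a, b = lo - mean, hi - mean             # standardized truncation bounds
    return truncnorm.rvs(a, b, loc=mean, scale=1.0, random_state=rng)
```

By construction the draw always falls strictly inside (z_{p,t}^{lo}, z_{p,t}^{up}), so the rank constraints of D(Y) are preserved after every update.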
In [14] a novel Hamiltonian Monte Carlo (HMC) approach has been developed recently, for a Gaussian copula extended rank likelihood model, where ranking is only within each row of Z. This method simultaneously samples a column vector z_i conditioned on the other columns Z_{−i}, with higher computation but better mixing.

• Sampling s_t: we have the state model s_t | s_{t−1} ~ N(A s_{t−1}, Q_t), and the observation model z_t | s_t ~ N(Λ s_t, R),¹ where A = diag(ρ), Q_t = diag(τ_t), R = I_P. For t = 1, ..., T:

1. Forward filtering: beginning at t = 0 with s_0 ~ N(0, σ_s² I_K), for all t = 1, ..., T, we find the online posteriors at t, p(s_t | z_{1:t}) = N(m_t, V_t), where m_t = V_t{Λᵀ R⁻¹ z_t + H_t⁻¹ A m_{t−1}}, V_t = [H_t⁻¹ + Λᵀ R⁻¹ Λ]⁻¹, and H_t = Q_t + A V_{t−1} Aᵀ.

2. Backward sampling: starting at t = T from N(m_T, V_T), the backward smoothing density, i.e., the conditional distribution of s_{t−1} given s_t, is p(s_{t−1} | s_t, z_{1:(t−1)}) = N(µ̃_{t−1}, Σ̃_{t−1}), where µ̃_{t−1} = Σ̃_{t−1}{Aᵀ Q_t⁻¹ s_t + V_{t−1}⁻¹ m_{t−1}}, Σ̃_{t−1} = (V_{t−1}⁻¹ + Aᵀ Q_t⁻¹ A)⁻¹.

There exist different variants of FFBS schemes (see [28] for a detailed comparison); the method we choose here enjoys fast decay in autocorrelation and reduced computation time.

¹ For brevity, we omit the dependence on Θ in the notation.
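The two recursions above can be sketched directly in numpy (a minimal sketch with diagonal A and Q_t and R = I_P as in the text; the initialization and variable names are ours, and no numerical safeguards are included):

```python
import numpy as np

def ffbs(z, Lam, rho, tau, sigma_s2=1.0, rng=None):
    """Forward-filtering backward-sampling for the trajectory s_{1:T}.

    z: (P, T) latent observations; Lam: (P, K) loadings;
    rho: (K,) AR coefficients; tau: (K, T) innovation variances.
    Returns one joint draw of s_{1:T} plus the filtered moments.
    """
    rng = rng or np.random.default_rng()
    P, T = z.shape
    K = Lam.shape[1]
    A = np.diag(rho)
    LtL = Lam.T @ Lam                      # Lambda^T R^{-1} Lambda, R = I_P
    m = np.zeros((K, T))
    V = np.zeros((K, K, T))
    m_prev, V_prev = np.zeros(K), sigma_s2 * np.eye(K)
    for t in range(T):                     # forward filtering
        H = np.diag(tau[:, t]) + A @ V_prev @ A.T
        V[:, :, t] = np.linalg.inv(np.linalg.inv(H) + LtL)
        m[:, t] = V[:, :, t] @ (Lam.T @ z[:, t]
                                + np.linalg.solve(H, A @ m_prev))
        m_prev, V_prev = m[:, t], V[:, :, t]
    s = np.zeros((K, T))                   # backward sampling
    s[:, T - 1] = rng.multivariate_normal(m[:, T - 1], V[:, :, T - 1])
    for t in range(T - 1, 0, -1):
        Qinv = np.diag(1.0 / tau[:, t])
        Sig = np.linalg.inv(np.linalg.inv(V[:, :, t - 1]) + A.T @ Qinv @ A)
        mu = Sig @ (A.T @ Qinv @ s[:, t]
                    + np.linalg.solve(V[:, :, t - 1], m[:, t - 1]))
        s[:, t - 1] = rng.multivariate_normal(mu, Sig)
    return s, m, V
```

Each call returns one joint posterior sample of s_{1:T} given z_{1:T} and the current values of the remaining parameters, which is exactly the block update used inside the Gibbs sweep.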
(with topic correlation matrices \u03a61:T , \u03a6[r,s],t = V[r,s],t/(cid:112)V[r,r],tV[s,s],t, r, s =\nwith \u2126t = \u039b(cid:101)V t\u039bT + I P . Essentially, this can be viewed as a dynamic Gaussian copula model,\nyp,t = g((cid:101)zp,t),(cid:101)zt \u223c N (0, Ct), where g(\u00b7) is a non-decreasing function of a univariate marginal\n\nWe perform inference on the K \u00d7 K time-evolving topic dependences in s1:T , using the posterior\n1, . . . , K), and further obtain the P \u00d7 P time-evolving word dependencies capsuled in {\u21261:T}\n\nlikelihood and Ct (t = 1, . . . , T ) is the correlation matrix capturing the multivariate dependence.\nWe obtain a posterior distribution for C1:T as a byproduct, without having to estimate the nuisance\nparameters in marginal likelihoods g(\u00b7). This decoupling strategy resembles the idea of copula\nmodels.\n\nt (cid:101)V tQ\u22121\n\n3.2 Accelerated MCMC via document subsampling\nFor large-scale datasets, recent approaches ef\ufb01ciently reduce the computational load of Monte Carlo\nMarkov chain (MCMC) by data subsampling [29, 30]. We borrow this idea of subsampling docu-\nments when considering a large corpora (e.g., in our experiments, we consider analysis of articles\nin the magazine Science, composed of 139379 articles from years 1880 to 2002, and a vocabulary\nsize 5855). In our model, the augmented data znt\n(nt = 1, . . . , Nt) for each document is relatively\nt\nexpensive to sample. One simple method is random document sampling without replacement. How-\never, by treating all likelihood contributions symmetrically, this method leads to a highly inef\ufb01cient\nMCMC chain with poor mixing [29].\nAlternatively, we adopt the probability proportional-to-size (PSS) sampling scheme in [30], i.e.,\nsampling the documents with inclusion probability proportional to the likelihood contributions. 
For each MCMC iteration, the sub-sampling procedure for documents at time t is designed as follows:

• Step 1: Given a small subset V_t ⊂ {1, ..., N_t} of chosen documents, only sample {z_t^d} for all d ∈ V_t and compute the augmented log-likelihood contributions (with b_t integrated out), ℓ_{V_t}(z_t^d) = log N(z_t^d; Λ s_t, R̃), where R̃ = Λ Γ Λᵀ + I_P. Note that only a K-dimensional matrix inversion is required, by using the Woodbury matrix inversion formula R̃⁻¹ = I_P − Λ(Γ⁻¹ + Λᵀ Λ)⁻¹ Λᵀ.

• Step 2: Similar to [30], we use a Gaussian process [31] to predict the log-likelihood for the remaining documents, ℓ_{V_t^c}(z_t^d) = K(V_t^c, V_t) K(V_t, V_t)⁻¹ ℓ_{V_t}(z_t^d), where K is an N_t × N_t squared-exponential kernel, which encodes the similarity of documents: K(y_t^i, y_t^j) = σ_f² exp(−‖y_t^i − y_t^j‖²/(2s²)), i, j = 1, ..., N_t, with σ_f² = 1, s = 1.

• Step 3: Calculate the inclusion probabilities w_d ∝ exp[ℓ(z_t^d)], d = 1, ..., N_t, w̃_d = w_d / Σ_{d′} w_{d′}.

• Step 4: Sample the next subset V_t of pre-specified size |V_t| with inclusion probabilities w̃_d, and store it for use in the next MCMC iteration.

In practice, this adaptive design allows MCMC to run more efficiently on a full dataset of large scale, often mitigating the need for a parallel MCMC implementation. Future work could also consider nonparametric function estimation subject to monotonicity constraints, e.g., the Gaussian process projections recently developed in [32].

4 Experiments

Different from DTM [7], the proposed model has the jumps directly at the level of the factor scores (no exponentiation or normalization needed), and therefore it proved more effective in uncovering jumps in factor scores over time.
Demonstrations of this phenomenon in a synthetic experiment are detailed in the Supplementary Material. In the following, we present exploratory data analysis on two real examples, demonstrating the ability of the proposed model to infer temporal jumps in topic importance, and to infer correlations across topics and words.

4.1 Case Study I: State of the Union dataset

The State of the Union dataset contains the transcripts of T = 225 US State of the Union addresses, from 1790 to 2014. We take each transcript as a document, i.e., we have one document per year. After removing stop words, and removing terms that occur fewer than 3 times in one document and fewer than 10 times overall, we have P = 7518 unique words. The observation y_{p,t} corresponds to the frequency of word p in the State of the Union transcript from year t.

We apply the proposed DRFM and learn K = 25 topics. To better understand the temporal dynamics per topic, six topics are selected and the posterior means of their latent trajectories s_{k,1:T} are shown in Figure 1 (along with the top 12 most probable words associated with each of the topics). A complete table with all 25 learned topics and top 12 words is provided in the Supplementary Material. The learned trajectory associated with each topic indicates different temporal patterns across the topics. Clearly, we can identify jumps associated with some key historical events. For instance, for Topic 10, we observe a positive jump in 1846 associated with the Mexican-American War. Topic 13 is related to the Spanish-American War of 1898, with a positive jump in that year. In Topic 24, we observe a positive jump in 1914, when the Panama Canal was officially opened (the words Panama and canal are included). In Topic 18, the positive jumps observed from 1997 to 1999 seem to be associated with the creation of the State Children's Health Insurance Program in 1997.
We note that the words for this topic are explicitly related to this issue. Topic 25 appears to be related to banking; the significant spike around 1836 appears to correspond to the Second Bank of the United States, which was allowed to go out of existence, ending national banking that year. In 1863 Congress passed the National Banking Act, which ended the "free-banking" period of 1836-1863; note the spike around 1863 in Topic 25.

Topic #10: Mexico, Government, Texas, United, War, Mexican, Army, Territory, Country, Peace, Policy, Lands
Topic #13: Government, United, Islands, Commission, Island, Cuba, Spain, Act, General, Military, International, Officers
Topic #24: United, Treaty, Isthmus, Public, Panama, Law, Territory, America, Canal, Service, Banks, Colombia
Topic #17: Jobs, Country, Tax, American, Economy, Deficit, Americans, Energy, Businesses, Health, Plan, Care
Topic #18: Children, America, Americans, Care, Tonight, Support, Century, Health, Working, Challenge, Security, Families
Topic #25: Government, Public, Banks, Bank, Currency, Money, United, Federal, American, National, Duty, Institutions

Figure 1: (State of the Union dataset) Above: Time evolution from 1790 to 2014 for six selected topics. The plotted values represent the posterior means. Below: Top 12 most probable words associated with the above topics.

Our modeling framework is able to capture dynamic patterns of topic and word correlations. To illustrate this, we select three years (associated with meaningful historical events) and analyze their corresponding topic and word correlations. Figure 2 (first row) shows graphs of the topic correlation matrices, in which the nodes represent topics and the edges indicate positive (green) and negative (red) correlations (we show correlations with absolute value larger than 0.01). We notice that Topics 11 and 22 are positively correlated in those years.
Some of the most probable words associated with each of them are: increase, united, law and legislation (for Topic 11), and war, Mexico, peace, army, enemy and military (for Topic 22). We are also interested in understanding the time-varying correlation between words. To do so, and for the same years as before, in Figure 2 (second row) we plot the dendrogram associated with the learned correlation matrix for words. In the plots, different colors indicate highly correlated word clusters, defined by cutting the branches off the dendrogram. These figures reveal different sets of highly correlated words for different years. By inspecting all the word correlations, we notice that the set of words {government, federal, public, power, authority, general, country} is highly correlated across the whole period.

Figure 2: (State of the Union dataset) First row: Inferred correlations between topics for specific years associated with meaningful historical events (1846, Mexican-American War; 1929, Economic Depression; 2003, Iraq War). Green edges indicate positive correlations and red edges indicate negative correlations. Second row: Learned dendrogram based upon the correlation matrix between the top 10 words associated with each topic (we display 80 unique words in total).

4.2 Case Study II: Analysis of Science dataset

We analyze a collection of scientific documents from the JSTOR Science journal [7]. This dataset contains 139379 documents from 1880 to 2002 (T = 123), with approximately 1100 documents per year. After removing terms that occurred fewer than 25 times, the total vocabulary size is P = 5855. We learn K = 50 topics from the inferred posterior distribution; for brevity and simplicity, we only show 20 of them.
We handle about 2700 documents per iteration (a subsampling rate of 2%). Table 1 shows the 20 selected topics and the top 10 most probable words associated with each of them. By inspection, we notice that these topics are related to specific fields of science. For instance, Topic 2 is more related to "scientific research", Topic 10 to "natural resources", and Topic 15 to "genetics". Figure 3 shows the time-varying trend for some specific words, ẑ_{p,1:T}, which reveals the importance of those words across time. Finally, Figure 4 shows the correlation between the selected 20 topics. For instance, in 1950 and 2000, Topic 9 (related to mouse, cells, human, transgenic) and Topic 17 (related to virus, rna, tumor, infection) are highly correlated.

Figure 3: (Science dataset) The inferred latent trend for variable ẑ_{p,1:T} associated with selected words (DNA, RNA, Gene; Cancer, Patients, Nuclear; Astronomy, Psychology, Brain).

Figure 4: (Science dataset) Inferred correlations between topics for some specific years (1900, 1950, 2000).
Green edges indicate positive correlations and red edges indicate negative correlations.

Table 1: Selected 20 topics associated with the analysis of the Science dataset and top 10 most probable words.

Topic#1: cells, cell, normal, two, growth, development, tissue, body, egg, blood
Topic#2: research, national, government, support, federal, new, program, scientific, basic
Topic#3: field, magnetic, solar, energy, spin, state, electron, quantum, temperature, current
Topic#4: animals, brain, neurons, activity, response, rats, control, fig, effects, days
Topic#5: energy, oil, percent, production, fuel, total, growth, states, electricity, coal
Topic#6: university, professor, college, president, department, research, institute, director, society, school
Topic#7: science, scientific, new, scientists, human, men, sciences, knowledge, meeting, work
Topic#8: work, research, scientific, laboratory, made, university, results, science, survey, department
Topic#9: mice, mouse, type, wild, fig, cells, human, transgenic, animals, mutant
Topic#10: water, surface, temperature, soil, pressure, sea, plants, solution, plant, air
Topic#11: system, nuclear, new, systems, power, cost, computer, fuel, coal, plant
Topic#12: energy, theory, temperature, radiation, atoms, surface, atomic, mass, atom, time
Topic#13: association, science, meeting, university, american, society, section, president, committee, secretary
Topic#14: protein, proteins, cell, membrane, amino, sequence, binding, acid, residues, sequences
Topic#15: human, genome, sequence, chromosome, gene, genes, map, data, sequences, genetic
Topic#16: professor, university, society, department, college, president, director, american, appointed, medical
Topic#17: virus, rna, viruses, particles, tumor, mice, disease, viral, human, infection
Topic#18: energy, electron, state, protein, fig, two, structure, reaction, laser, surface
Topic#19: stars, mass, star, temperature, solar, gas, data, density, high, galaxies
Topic#20: rna, fig, mrna, site, sequence, splicing, synthesis, trna, rnas

5 Discussion

We have proposed a DRFM framework that can be applied to a broad class of problems, such as: (i) dynamic topic modeling for the analysis of time-stamped document collections; (ii) joint analysis of multiple time series with ordinal-valued observations; and (iii) multivariate ordinal dynamic factor analysis or dynamic copula analysis for mixed types of data. The proposed model is a semi-parametric methodology, which offers modeling flexibility and reduces the effect of model misspecification. However, as the marginal likelihood is distribution-free, we cannot calculate the model evidence or other evaluation metrics based on it (e.g., held-out likelihood). As a consequence, we lack objective evaluation criteria that would allow us to perform formal model comparisons. In the proposed setting, we are able to perform either retrospective analysis or multi-step-ahead forecasting (using the recursive equations derived in the FFBS algorithm). Finally, our inference framework is easily adaptable to sequential Monte Carlo (SMC) methods [33], allowing online learning.

Acknowledgments

The research reported here was funded in part by ARO, DARPA, DOE, NGA and ONR.
The authors are grateful to Jonas Wallin, Lund University, Sweden, for providing an efficient package for simulation of the GIG distribution.

References

[1] J. Aitchison. The statistical analysis of compositional data. J. Roy. Stat. Soc. Ser. B, 44(2):139–177, 1982.
[2] S. Chib and R. Winkelmann. Markov chain Monte Carlo analysis of correlated count data. Journal of Business & Economic Statistics, 19(4), 2001.
[3] E. Lawrence, D. Bingham, C. Liu, and V. N. Nair. Bayesian inference for multivariate ordinal data using parameter expansion. Technometrics, 50(2), 2008.
[4] M. West, P. J. Harrison, and H. S. Migon. Dynamic generalized linear models and Bayesian forecasting. J. Am. Statist. Assoc., 80(389):73–83, 1985.
[5] C. Cargnoni, P. Müller, and M. West. Bayesian forecasting of multinomial time series through conditionally Gaussian dynamic models. J. Am. Statist. Assoc., 92(438):640–647, 1997.
[6] P. D. Hoff. Extending the rank likelihood for semiparametric copula estimation. Ann. Appl. Statist., 1(1):265–283, 2007.
[7] D. M. Blei and J. D. Lafferty. Dynamic topic models. In Int. Conf. Machine Learning, 2006.
[8] D. M. Blei and J. D. Lafferty. Correlated topic models. In Adv. Neural Inform. Processing Systems, 2006.
[9] A. Ahmed and E. P. Xing. Timeline: A dynamic hierarchical Dirichlet process model for recovering birth/death and evolution of topics in text stream. 2010.
[10] P. D. Hoff. A first course in Bayesian statistical methods. Springer, 2009.
[11] A. N. Pettitt. Inference for the linear model using a likelihood based on ranks. J. Roy. Stat. Soc. Ser. B, 44(2):234–243, 1982.
[12] J. Aitchison and C. H. Ho. The multivariate Poisson-log normal distribution.
Biometrika, 76(4):643–653, 1989.
[13] J. S. Murray, D. B. Dunson, L. Carin, and J. E. Lucas. Bayesian Gaussian copula factor models for mixed data. J. Am. Statist. Assoc., 108(502):656–665, 2013.
[14] A. Kalaitzis and R. Silva. Flexible sampling of discrete data correlations without the marginal distributions. In Adv. Neural Inform. Processing Systems, 2013.
[15] A. Armagan, M. Clyde, and D. B. Dunson. Generalized Beta mixtures of Gaussians. In Adv. Neural Inform. Processing Systems, 2011.
[16] N. G. Polson and J. G. Scott. On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7(4):887–902, 2012.
[17] H. F. Lopes and M. West. Bayesian model assessment in factor analysis. Statistica Sinica, 14(1):41–68, 2004.
[18] J. Ghosh and D. B. Dunson. Default prior distributions and efficient posterior computation in Bayesian factor analysis. Journal of Computational and Graphical Statistics, 18(2):306–320, 2009.
[19] A. Bhattacharya and D. B. Dunson. Sparse Bayesian infinite factor models. Biometrika, 98(2):291–306, 2011.
[20] J. Geweke and G. Zhou. Measuring the pricing error of the arbitrage pricing theory. Review of Financial Studies, 9(2):557–587, 1996.
[21] A. Christian, B. Jens, and P. Markus. Bayesian analysis of dynamic factor models: an ex-post approach towards the rotation problem. Kiel Working Papers 1902, Kiel Institute for the World Economy, 2014.
[22] C. Gao and B. E. Engelhardt. A sparse factor analysis model for high dimensional latent spaces. In NIPS Workshop on Analysis Operator Learning vs. Dictionary Learning: Fraternal Twins in Sparse Modeling, 2012.
[23] C. K. Carter and R. Kohn. On Gibbs sampling for state space models. Biometrika, 81:541–553, 1994.
[24] S. Frühwirth-Schnatter. Data augmentation and dynamic linear models. Journal of Time Series Analysis, 15:183–202, 1994.
[25] D. Inouye, P.
Ravikumar, and I. Dhillon. Admixture of Poisson MRFs: A topic model with word dependencies. In Int. Conf. Machine Learning, 2014.
[26] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Machine Learn. Res., 3:993–1022, 2003.
[27] J. Reisinger, A. Waters, B. Silverthorn, and R. J. Mooney. Spherical topic models. In Int. Conf. Machine Learning, 2010.
[28] E. A. Reis, E. Salazar, and D. Gamerman. Comparison of sampling schemes for dynamic linear models. International Statistical Review, 74(2):203–214, 2006.
[29] A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: cutting the Metropolis-Hastings budget. In Int. Conf. Machine Learning, pages 181–189, 2014.
[30] M. Quiroz, M. Villani, and R. Kohn. Speeding up MCMC by efficient data subsampling. arXiv:1404.4178, 2014.
[31] C. E. Rasmussen. Gaussian processes in machine learning. Springer, 2004.
[32] L. Lin and D. B. Dunson. Bayesian monotone regression using Gaussian process projection. Biometrika, 101(2):303–317, 2014.
[33] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo methods in practice. Springer, 2001.