{"title": "Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles", "book": "Advances in Neural Information Processing Systems", "page_first": 9, "page_last": 16, "abstract": "", "full_text": "Simplicial Mixtures of Markov Chains:\n\nDistributed Modelling of Dynamic User Pro\ufb01les\n\nMark Girolami\n\nDepartment of Computing Science\n\nUniversity of Glasgow\n\nGlasgow, UK\n\nAta Kab\u00b4an\n\nSchool of Computer Science\nUniversity of Birmingham\n\nBirmingham, UK\n\ngirolami@dcs.gla.ac.uk\n\na.kaban@cs.bham.ac.uk\n\nAbstract\n\nTo provide a compact generative representation of the sequential activ-\nity of a number of individuals within a group there is a tradeoff between\nthe de\ufb01nition of individual speci\ufb01c and global models. This paper pro-\nposes a linear-time distributed model for \ufb01nite state symbolic sequences\nrepresenting traces of individual user activity by making the assump-\ntion that heterogeneous user behavior may be \u2018explained\u2019 by a relatively\nsmall number of common structurally simple behavioral patterns which\nmay interleave randomly in a user-speci\ufb01c proportion. The results of an\nempirical study on three different sources of user traces indicates that\nthis modelling approach provides an ef\ufb01cient representation scheme, re-\n\ufb02ected by improved prediction performance as well as providing low-\ncomplexity and intuitively interpretable representations.\n\n1 Introduction\n\nThe now commonplace ability to accurately and inexpensively log the activity of individ-\nuals in a digital environment makes available a variety of traces of user activity and with\nit the necessity to develop ef\ufb01cient representations, or pro\ufb01les, of individuals. Most of-\nten, such recordings take the form of streams of discrete symbols ordered in time. 
The modelling of time-dependent sequences of discrete symbols employing n-th order Markov chains has been extensively studied in a number of domains. The representation provided by such models is global in the sense that one generating process is assumed to underlie all observed sequences. To capture the possibly heterogeneous nature of the observed sequences, a model with a number of differing generating processes needs to be considered. Indeed, the notion of a heterogeneous population, characterized for example by occupational mobility and consumer brand preferences, has been captured in the Mover-Stayer model [3]. This model is a discrete-time stochastic process that is a two-component mixture of first-order Markov chains, one of which is degenerate and possesses an identity transition matrix characterizing the stayers in the population. The original notion of a two-component mixture of Markov chains has recently been extended to the general form of a mixture model of Markov chains in [2]. 
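For concreteness, the Mover-Stayer likelihood just described can be sketched in a few lines; everything here (function names, the toy chain, the weights) is illustrative rather than taken from [3]:

```python
import numpy as np

def seq_prob(seq, T):
    """Probability of a symbol sequence under one first-order chain
    (the initial-state term is omitted for brevity)."""
    p = 1.0
    for a, b in zip(seq[:-1], seq[1:]):
        p *= T[a, b]
    return p

def mover_stayer_prob(seq, w_stay, T_mover):
    """Two-component Mover-Stayer mixture: a degenerate 'stayer' chain
    with an identity transition matrix plus a general 'mover' chain.
    All names here are illustrative, not from the paper."""
    stay = seq_prob(seq, np.eye(T_mover.shape[0]))  # 1 iff no state change
    move = seq_prob(seq, T_mover)
    return w_stay * stay + (1.0 - w_stay) * move

T_mover = np.array([[0.5, 0.5],
                    [0.5, 0.5]])
p_const = mover_stayer_prob([0, 0, 0], 0.4, T_mover)  # stayer-compatible
p_moved = mover_stayer_prob([0, 1], 0.4, T_mover)     # movers only
```

Note how the degenerate component contributes probability only to constant sequences, which is exactly what makes the stayers identifiable.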
Whilst the main motivation there was the visualization of the class structure inherent in the browsing patterns of visitors to a commercial website, with each class of users characterized by its global behavior, such mixture models are not appropriate for identifying the shared behavioral patterns which underlie multiple relationships between users and groups of users, and which may yield a more realistic model of the population.\n\nThe purpose of this paper is to develop a dynamic user model for individuals within a group that explicitly captures the assumption that a common set of behavioral patterns exists; these patterns can be estimated from all observed users, along with each user's specific proportion of participation in them, and together they form the basis of individual profiles within a group. This is also a computationally attractive model, as simple structural characteristics may be assumed at the generative level, while allowing them to interleave randomly can account for more complex individual behavior. The resulting model is thus a distributed dynamic model which benefits from recent technical developments in distributed, parts-based modelling of static vectorial data [7, 9, 5, 1, 8], with various applications including image decomposition, document modelling, information retrieval and collaborative filtering. Consistent generative semantics similar to the recently introduced latent Dirichlet allocation (LDA) [1] will be adopted, and by analogy with [8] the resulting model will be referred to as a simplicial mixture.\n\n2 Simplicial Mixtures of Markov Chains\n\nAssume that a sequence of L symbols s_L, s_{L-1}, \u00b7\u00b7\u00b7, s_0, denoted by s, can be drawn from a dictionary S by a process k, which has initial state probability P_1(k) and |S|^{m+1} state transition probabilities denoted by T(s_m, \u00b7\u00b7\u00b7, s_1 \u2192 s_0 | k). 
The number of times that the symbol s_0 follows from the state defined by the m-tuple of symbols s_m, \u00b7\u00b7\u00b7, s_1 within the n-th sequence is denoted as r_{s_m,\u00b7\u00b7\u00b7,s_1\u2192s_0}, and so the probability of the sequence of symbols under the k-th m-th order Markov process is\n\nP(s|k) = P_1(k) \u220f_{s_m=1}^{|S|} \u00b7\u00b7\u00b7 \u220f_{s_0=1}^{|S|} T(s_m, \u00b7\u00b7\u00b7, s_1 \u2192 s_0 | k)^{r_{s_m,\u00b7\u00b7\u00b7,s_1\u2192s_0}}.\n\nTo introduce a more compact notation we represent the elements of the state transition matrix for the k-th Markov process by T_{m\u00b7\u00b7\u00b70,k} and the counts r_{s_m,\u00b7\u00b7\u00b7,s_1\u2192s_0} within the n-th observed sequence as r^n_{m\u00b7\u00b7\u00b70}. In addition, we employ Start and Stop states in each symbol sequence s_n and incorporate the initial state distribution of the Start state as the transition probabilities from this state within the state transition matrix T_k. We denote the set of all state transition matrices {T_1, \u00b7\u00b7\u00b7, T_k, \u00b7\u00b7\u00b7, T_K} as T. Suppose that we are given a set of symbolic trajectories {s_n}_{n=1:N} over a common finite state space, each having length L_n. As opposed, and somewhat complementary, to cluster models for trajectories, which try to model inter-sequence heterogeneities, our intuition is that sequences over a common finite state space, provided they are sufficiently long and possibly non-stationary, could have several randomly interleaved generator processes, some of which might be common to several sequences. 
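For the first-order case (m = 1) the per-chain sequence probability above reduces to a product of transition probabilities raised to their counts; a minimal sketch, with names of our own choosing:

```python
import numpy as np

def sequence_log_likelihood(counts, T_k):
    """Log-probability of one sequence under the k-th chain (first-order
    case, m = 1): sum of counts times log transition probabilities. The
    initial-state term is absorbed into a Start state, as in the text;
    names here are illustrative only."""
    mask = counts > 0
    return float(np.sum(counts[mask] * np.log(T_k[mask])))

T_k = np.array([[0.9, 0.1],
                [0.5, 0.5]])
r_n = np.array([[3, 1],
                [1, 0]])  # counts from the sequence 0,0,0,1,0,0
ll = sequence_log_likelihood(r_n, T_k)
```

Masking the zero counts keeps the computation linear in the number of non-zero transition counts, which is what gives the overall method its linear-time character.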
To account for this idea, we will adopt a similar modelling strategy to LDA.\n\nThe complete generative semantics of LDA allows us to describe the process of sequence generation where the mixing components \u03bb = [\u03bb_1, \u00b7\u00b7\u00b7, \u03bb_k, \u00b7\u00b7\u00b7, \u03bb_K] are K-dimensional Dirichlet random variables and so are drawn from the (K \u2212 1)-dimensional simplex defined by the Dirichlet distribution D(\u03bb|\u03b1) with parameters \u03b1. These are then combined with the individual state-transition probabilities T_k, which are model parameters to be estimated, and yield the symbol transition probabilities T_{m\u00b7\u00b7\u00b70} = \u2211_{k=1}^{K} T_{m\u00b7\u00b7\u00b70,k} \u03bb_k. The overall probability for a sequence s_n under such a mixture, which we shall now refer to as a simplicial mixture [8], denoted as P(s_n|T, \u03b1), is equal to\n\nP(s_n|T, \u03b1) = \u222b_\u0394 P(s_n|T, \u03bb) D(\u03bb|\u03b1) d\u03bb = \u222b_\u0394 d\u03bb D(\u03bb|\u03b1) \u220f_{s_m=1}^{|S|} \u00b7\u00b7\u00b7 \u220f_{s_0=1}^{|S|} { \u2211_{k=1}^{K} T_{m\u00b7\u00b7\u00b70,k} \u03bb_k }^{r^n_{m\u00b7\u00b7\u00b70}}   (1)\n\nwhere \u0394 denotes the simplex. Each sequence will have its own expectation under the Dirichlet mixing coefficients, and so the ability of such a representation to model intra-sequence heterogeneity emerges naturally.\n\nThe following subsections briefly present the details of the identification of this model, which also highlights the close relationship between two existing related models, specifically probabilistic latent semantic analysis (PLSA) [5] and LDA [1], as being instances of the same theoretical model and differing only in the estimation procedure adopted [4].\n\n2.1 Parameter Estimation and Inference\n\nExact inference within the LDA framework is not possible [1]; however, the likelihood can be lower-bounded by introducing a sequence-specific parameterised variational posterior Q_n(\u03bb) whose 
parameters will depend on n:\n\nlog P(s_n|T, \u03b1) \u2265 E_{Q_n(\u03bb)}[ log { P(s_n|T, \u03bb) D(\u03bb|\u03b1) / Q_n(\u03bb) } ]   (2)\n\nwhere E_{Q_n(\u03bb)} denotes expectation with respect to Q_n(\u03bb). The bound can be defined using the Maximum a Posteriori (MAP) estimator, such that Q_n(\u03bb) = \u03b4(\u03bb \u2212 \u03bb^{MAP}_n), in which case (2) is equal to log P(s_n|T, \u03bb^{MAP}_n) + log D(\u03bb^{MAP}_n|\u03b1) + H_\u03b4, where H_\u03b4 denotes the entropy of the delta function around \u03bb^{MAP}_n (which can be discarded in this setting as it does not depend on the model parameters, although it amounts to minus infinity). Forming a Lagrangian from the above to enforce the constraint that \u03bb^{MAP} is a sample point from a Dirichlet variable, then taking derivatives with respect to the \u03bb^{MAP}_{kn}, a convergent series of updates \u03bb^t_{kn} is obtained, where the superscript denotes the t-th iteration. As in [7], for each observed sequence in the sample a MAP value for the variable \u03bb is iteratively estimated by the following multiplicative updates\n\n\u02dc\u03bb_{kn} = (\u03b1_k \u2212 1) + \u03bb^t_{kn} \u2211_{s_m=1}^{|S|} \u00b7\u00b7\u00b7 \u2211_{s_0=1}^{|S|} r^n_{m\u00b7\u00b7\u00b70} T_{m\u00b7\u00b7\u00b70,k} / \u2211_{l=1}^{K} T_{m\u00b7\u00b7\u00b70,l} \u03bb^t_{ln} ;   \u03bb^{t+1}_{kn} = \u02dc\u03bb_{kn} / ( L_n + \u2211_k (\u03b1_k \u2212 1) )   (3)\n\nwhere L_n = \u2211_{s_m\u00b7\u00b7\u00b7s_0} r^n_{m\u00b7\u00b7\u00b70} is the length of the sequence s_n. 
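The multiplicative MAP update (3) can be sketched for the first-order case as follows; the array layout and names are our own assumptions, not the authors' code:

```python
import numpy as np

def map_lambda_update(lam, counts, T, alpha, n_iter=200):
    """Multiplicative MAP updates for the mixing vector of one sequence
    (first-order instance of Eq. (3)). `lam` is lambda^t, `counts[i, j]`
    the transition counts r^n, `T[k, i, j]` the K basis transition
    matrices. Illustrative sketch only."""
    L_n = counts.sum()
    for _ in range(n_iter):
        mix = np.einsum('kij,k->ij', T, lam)          # sum_l T_l * lambda_l
        ratio = np.where(counts > 0,
                         counts / np.maximum(mix, 1e-300), 0.0)
        lam_tilde = (alpha - 1.0) + lam * np.einsum('kij,ij->k', T, ratio)
        lam = lam_tilde / (L_n + np.sum(alpha - 1.0))
    return lam

# Toy problem: component 0 explains self-transitions of symbol 0.
T = np.array([[[0.9, 0.1], [0.9, 0.1]],
              [[0.1, 0.9], [0.1, 0.9]]])
counts = np.array([[5, 0], [0, 0]])   # a sequence that keeps repeating 0
alpha = np.array([1.5, 1.5])
lam = map_lambda_update(np.array([0.5, 0.5]), counts, T, alpha)
```

Because the normalizer is L_n + \u2211_k(\u03b1_k \u2212 1), each iterate sums to one by construction, so no separate projection onto the simplex is needed.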
Once the MAP values \u03bb^{MAP}_n for each s_n are obtained, a similar multiplicative iteration for the transition probabilities can be obtained:\n\n\u02dcT_{m\u00b7\u00b7\u00b70,k} = T^t_{m\u00b7\u00b7\u00b70,k} \u2211_{n=1}^{N} r^n_{m\u00b7\u00b7\u00b70} \u03bb^{MAP}_{kn} / \u2211_{l=1}^{K} T^t_{m\u00b7\u00b7\u00b70,l} \u03bb^{MAP}_{ln} ;   T^{t+1}_{m\u00b7\u00b7\u00b70,k} = \u02dcT_{m\u00b7\u00b7\u00b70,k} / \u2211_{s'_0=1}^{|S|} \u02dcT_{m\u00b7\u00b7\u00b70',k}   (4)\n\nThe final parameter is that of the prior Dirichlet distribution; maximum likelihood estimation yields the estimated distribution parameters \u03b1 given the \u03bb^{MAP}_n [6, 1]. Note that both (3) and (4) require an elementwise matrix multiplication and division, so these iterations will scale linearly with the number of non-zero state-transition counts. It is interesting to note that the MAP estimator under a uniform Dirichlet distribution exactly recovers the aspect mixture model of [5] as a special case of the MAP estimated LDA model.\n\n2.1.1 Variational Parameter Estimation and Inference\n\nWhile being optimal in analyzing an existing data set, MAP estimators are notoriously prone to overfitting, especially where there is a paucity of available data [10], and so the variational Bayes (VB) approach detailed in [1] can be adopted by considering Q_n(\u03bb) = D(\u03bb|\u03b3_n), where \u03b3_n is a sequence-specific variational free parameter vector. 
The above (2) can be further lower-bounded by noting that\n\nlog P(s_n|T, \u03bb) \u2265 \u2211_{s_m=1}^{|S|} \u00b7\u00b7\u00b7 \u2211_{s_0=1}^{|S|} \u2211_{k=1}^{K} r^n_{m\u00b7\u00b7\u00b70} Q_{m\u00b7\u00b7\u00b70,n}(k) log { \u03bb_k T_{m\u00b7\u00b7\u00b70,k} / Q_{m\u00b7\u00b7\u00b70,n}(k) }   (5)\n\nwhere \u2211_k Q_{m\u00b7\u00b7\u00b70,n}(k) = 1, Q_{m\u00b7\u00b7\u00b70,n}(k) \u2265 0 are additional variational parameters. Alternatively, Q_{m\u00b7\u00b7\u00b70,n}(.) can also be understood as a variational distribution on a discrete hidden variable with K possible outcomes that selects which transition matrix is active at each time step of the generative process.\n\nReplacing (5) in (2), expanding and evaluating E_{D(\u03bb|\u03b3_n)}[log \u03bb_k] = \u03c8(\u03b3_{kn}) \u2212 \u03c8(\u2211_{k'} \u03b3_{k'n}), where \u03c8 denotes the digamma function, then solving for Q_{m\u00b7\u00b7\u00b70,n}(k) and \u03b3_{kn} and finally combining yields the following multiplicative iterative update for the sequence-specific variational free parameter \u03b3_n:\n\n\u03b3^{t+1}_{kn} = \u03b1_k + exp{\u03c8(\u03b3^t_{kn})} \u2211_{s_m=1}^{|S|} \u00b7\u00b7\u00b7 \u2211_{s_0=1}^{|S|} r^n_{m\u00b7\u00b7\u00b70} T_{m\u00b7\u00b7\u00b70,k} / \u2211_{k'=1}^{K} T_{m\u00b7\u00b7\u00b70,k'} exp{\u03c8(\u03b3^t_{k'n})}   (6)\n\nSolving for the transition probabilities and combining with the fixed point solutions for each Q_{m\u00b7\u00b7\u00b70,n}(k) yields the following:\n\n\u02dcT_{m\u00b7\u00b7\u00b70,k} = T^t_{m\u00b7\u00b7\u00b70,k} \u2211_{n=1}^{N} r^n_{m\u00b7\u00b7\u00b70} exp{\u03c8(\u03b3^t_{kn})} / \u2211_{k'=1}^{K} T^t_{m\u00b7\u00b7\u00b70,k'} exp{\u03c8(\u03b3^t_{k'n})} ;   T^{t+1}_{m\u00b7\u00b7\u00b70,k} = \u02dcT_{m\u00b7\u00b7\u00b70,k} / \u2211_{s'_0=1}^{|S|} \u02dcT_{m\u00b7\u00b7\u00b70',k}   (7)\n\nAs before, the parameters of the prior Dirichlet distribution \u03b1 given the variational parameters \u03b3_n are estimated using standard methods [6, 1].\n\n2.2 Prediction with Simplicial Mixtures\n\nThe predictive probability of observing symbol s_next given a sequence of L_n symbols s_n = {s_{L_n}, \u00b7\u00b7\u00b7, s_1} is given as P(s_next|s_n) = E_{P(\u03bb|s_n)}{P(s_next|s_m \u00b7\u00b7\u00b7 s_1, \u03bb)} \u2248 \u2211_{k=1}^{K} T(s_next|s_m \u00b7\u00b7\u00b7 s_1, k) E_{Q_n(\u03bb)}{\u03bb_k}. It should be noted that while m-th order Markov chains form the basis of the representation, the resulting simplicial mixture is not m-th order Markov with any global transition model. Rather, it approximates the individual m-th order models while keeping the generative parameter set compact. The m-th order information of each individual's past behaviour is embodied in the individual-specific latent variable estimate. In a mixture model, on the other hand, one component is responsible for sequence generation, so within a cluster the representation is still global m-th order. Employing the MAP approximation for the Dirichlet distribution, E_{Q_n(\u03bb)}{\u03bb_k} = E_{\u03b4(\u03bb\u2212\u03bb^{MAP}_n)}{\u03bb_k} = \u03bb^{MAP}_{kn}, where \u03bb^{MAP}_{kn} is the k-th dimension of \u03bb^{MAP}_n. 
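The variational update (6) admits a sketch analogous to the MAP one; here \u03c8 is approximated by a central difference of log-\u0393 purely to keep the sketch self-contained (in practice a library digamma such as scipy.special.digamma would be used), and all names are illustrative:

```python
import numpy as np
from math import lgamma

def digamma(x, h=1e-5):
    # Numerical psi(x) via a central difference of log-Gamma;
    # adequate for an illustration only.
    return (lgamma(x + h) - lgamma(x - h)) / (2 * h)

def vb_gamma_update(gamma, counts, T, alpha, n_iter=200):
    """First-order instance of the variational update of Eq. (6);
    `gamma` is the sequence-specific Dirichlet parameter vector."""
    for _ in range(n_iter):
        e = np.array([np.exp(digamma(g)) for g in gamma])
        mix = np.einsum('kij,k->ij', T, e)
        ratio = np.where(counts > 0,
                         counts / np.maximum(mix, 1e-300), 0.0)
        gamma = alpha + e * np.einsum('kij,ij->k', T, ratio)
    return gamma

# Same toy problem as before: component 0 explains the observed counts.
T = np.array([[[0.9, 0.1], [0.9, 0.1]],
              [[0.1, 0.9], [0.1, 0.9]]])
counts = np.array([[5, 0], [0, 0]])
alpha = np.array([1.5, 1.5])
gamma = vb_gamma_update(np.array([1.0, 1.0]), counts, T, alpha)
```

A useful sanity check on the iteration is that the gamma components always sum to \u2211\u03b1 + L_n, mirroring the normalizer of the MAP update.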
Employing the variational Dirichlet approximation, E_{Q_n(\u03bb)}{\u03bb_k} = E_{D(\u03bb|\u03b3_n)}{\u03bb_k} = \u03b3_{kn} / \u2211_{l=1}^{K} \u03b3_{ln}; therefore, given a new sequence s_new, the symbol s_next which is most likely to be predicted from the model as a suggested continuation of the sequence is the maximum argument of P(s_next|s_n).\n\n3 Distributed Modelling of Dynamic Profiles\n\n3.1 Datasets\n\n3.1.1 Telephone Usage Modelling\n\nThe ability to model the usage of a telephone service is of importance at a number of levels, e.g. to obtain a predictive model of customer-specific activity and service usage for the purposes of service provision planning, resource management of switching capacity, and identification of fraudulent usage of services. A representative description can be based on the distribution of the destination numbers dialled and connected by the customer, in which case a multinomial distribution over the dialling codes can be employed. One method of encoding the destination numbers dialled by a customer is to capture the geographic location of the destination, or the mobile service provider if not a land-based call. This is useful in determining the potential demand placed on telecommunication switches which route traffic from various geographical regions on the service provider's network. Two weeks of transactions from a UK telecommunications operator were logged during weekdays, amounting to 36,492,082 and 45,350,654 transactions in each week respectively. All transactions made by commercial customers in the Glasgow region of the UK were considered in this study. 
This amounts to 1,172,578 transactions from 12,202 high-usage customers in the first week considered and 1,753,304 transactions made in the following week. The mapping from dialling number to geographic region or mobile operator was encoded with 87 symbols, amounting to a possible 7,569 symbol transitions. Each customer's activity is defined by a sequence of symbols defining the sequence of calls made over each period considered, and these are employed to encode activity in a customer-specific generative representation.\n\n3.1.2 Web Page Browsing\n\nThe second data set used in this study is a selected subset of the msnbc.com user navigation collection employed in [2]. Sequences of users who visited at least 9 of the overall 17 page categories (frontpage, news, tech, local, opinion, on-air, misc, weather, msn-news, health, living, business, msn-sports, sports, summary, bbs, travel) have been retained; this selection criterion is motivated by the observation that there would be little scope in trying to model interleaved dynamic behavior in observables which are too short to reveal any intra-sequence heterogeneity. The resulting data set, referred to as WEB, totals 119,667 page requests corresponding to 1,480 web browsing sessions.\n\nFigure 1: Left: percentage of incorrect predictions against the number of model factors; right: predictive perplexity of each model against model order for the PHONE dataset. Solid straight line: global first-order MC, dash: MAP estimated simplicial mixture, solid line: VB estimated simplicial mixture, dash-dot: mixture model.\n\n3.2 Results\n\nIn each experiment the objective assessment of model performance is evaluated by the predictive perplexity, exp{\u2212(1/N_test) \u2211_{m=1}^{N_test} log P(s_next|s_m)}. In addition, the predictive accuracy of all models is measured under a 0-1 loss. 
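The predictive rule of Section 2.2 and the perplexity criterion above can be sketched together; the toy transition matrices and the fitted parameter below are invented for illustration:

```python
import numpy as np

def predict_next(state, T, e_lambda):
    """Predictive distribution over the next symbol for one user:
    sum_k T_k[state, :] * E[lambda_k] (first-order case; names ours)."""
    return np.einsum('kj,k->j', T[:, state, :], e_lambda)

def predictive_perplexity(log_probs):
    """exp{-(1/N_test) * sum_m log P(s_next | s_m)} over held-out
    predictions."""
    log_probs = np.asarray(log_probs, dtype=float)
    return float(np.exp(-log_probs.mean()))

T = np.array([[[0.9, 0.1], [0.9, 0.1]],
              [[0.1, 0.9], [0.1, 0.9]]])
gamma_n = np.array([6.0, 2.0])                   # a fitted VB parameter, say
p = predict_next(0, T, gamma_n / gamma_n.sum())  # E[lambda] = gamma / sum
pp = predictive_perplexity(np.log([0.25, 0.25, 0.25, 0.25]))
```

Perplexity of a uniform predictor over four symbols is exactly 4, which makes this an easy self-check: the better the model, the further below the dictionary size the perplexity falls.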
Given a number of previously unobserved truncated sequences, the number of times the model correctly predicts the symbol which follows in the sequence is then counted. In all mixture models naive random initialization of the parameters was employed, and parameter estimation was halted when the in-sample likelihood did not improve by more than 0.001%; no annealing or early stopping was utilized, and fifteen randomly initialized parameter estimation runs for each model were performed. The number of mixture components for the models ranged from 2 up to 200.\n\nFigure 2: Left: distribution of entropy rates for the transition matrices of a 20-factor mixture and simplicial mixture models (VB). Right: the expected value of the Dirichlet variable under the variational approximation for one customer, indicating the levels of participation in factor-specific behaviors.\n\nOn the PHONE data set the parameters of a global first-order Markov chain (bigram), mixtures of Markov chains [2], and simplicial mixtures of Markov chains (using both the MAP and VB estimation procedures) are estimated using the first week of customer transactions, and the predictive capabilities of the models are assessed on the transactions from the following week. The results are summarized in Figure 1; from the predictive perplexity measures it is clear that the simplicial representation provides a statistically (tested at the 5% level using a Wilcoxon rank-sum test) and practically significant reduction in perplexity over the global and mixture models. This is also reflected in the levels of prediction error under each model; however, the mixture models tend to perform slightly worse than the global model. 
As expected, the MAP estimated simplicial model performs slightly worse than that obtained using VB [1]. This also provides an additional insight as to why LDA models improve upon PLSA, as they are in fact both the same model using different approximations to the likelihood; refer to [10] for an illustrative discussion of the weaknesses of MAP estimators. As a comparison to different structural models, hidden Markov models with a range of hidden states were also tested on this data set; the best results obtained were for a ten-state model, which achieved a predictive perplexity score of (mean \u00b1 standard deviation) 11.119 \u00b1 0.624 and a fraction prediction error of 0.674 \u00b1 0.959, considerably poorer than that obtained by the models considered here.\n\nIn addition to the predictive capability of a simplicial representation of a customer's activity, the cost of encoding such a representation can be assessed by measuring the entropy rate of each of the constituent transition matrices which act as a basis in the representation of the individual-specific generative process. The left-hand plot of Figure 2 shows the distribution of the entropy rates for the transition probabilities in twenty-factor simplicial and mixture models; the results are obtained from fifty randomly initialized estimation procedures. The entropy rates for the simplicial mixture are significantly lower than those of a mixture model, indicating that the basis of each representation describes a number of simpler processes.\n\nThe final experiment considers the WEB data set. The results of ten-fold cross-validated predictive perplexities again show a statistically significant improvement obtained with the VB-estimated simplicial mixture (again tested using the Wilcoxon rank-sum test at the 5% level). The results are summarized in Figure 3. 
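The entropy rate used here to compare the learned bases can be sketched as follows; the stationary distribution is obtained by simple power iteration, a minimal sketch rather than the paper's own procedure:

```python
import numpy as np

def entropy_rate_bits(T, n_iter=500):
    """Entropy rate (bits per symbol) of a first-order chain: the
    stationary-distribution-weighted average of the per-row entropies.
    Power iteration from the uniform distribution is an illustrative
    shortcut that suffices for these toy cases."""
    pi = np.full(T.shape[0], 1.0 / T.shape[0])
    for _ in range(n_iter):
        pi = pi @ T
    # Treat 0 * log2(0) as 0 by substituting 1 inside the log.
    safe_T = np.where(T > 0.0, T, 1.0)
    row_entropy = -np.sum(T * np.log2(safe_T), axis=1)
    return float(pi @ row_entropy)

h_uniform = entropy_rate_bits(np.array([[0.5, 0.5], [0.5, 0.5]]))  # 1 bit
h_cycle = entropy_rate_bits(np.array([[0.0, 1.0], [1.0, 0.0]]))    # 0 bits
```

A deterministic cycle costs nothing to encode while a maximally random two-state chain costs one bit per symbol, which is the sense in which lower entropy rates indicate structurally simpler basis processes.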
Five of the estimated transition factors of a twenty-factor model are shown in Figure 4, demonstrating once more that the proposed model creates a low-entropy and easily interpretable dynamic factorial representation. The numbers on the axes of these charts correspond to the 17 page categories enumerated earlier, and the average strength of each of these factors amongst the full set of twenty factors, computed as (1/N) \u2211_{n=1}^{N} E_{D(\u03bb|\u03b3_n)}{\u03bb_k}, is also given above each chart. We can see that one behavioral feature manifested is a keen interest in visiting pages about \u2018news\u2019, with a quite dynamic transition model (left-hand chart), which characterizes around 12% of the behavioral patterns of the entire user population under consideration, while static state-repetition (second chart) or an almost exclusive interest in viewing the homepage (last chart) also represent relatively strong common characteristics of browsing behavior. The distribution of the entropy rates of the full set of these twenty basis transitions, in comparison to those obtained from the mixture model, is given in the right-hand plot of Figure 3. Clearly, the coding efficiency of a simplicial mixture representation is significantly (statistically tested) superior. Note also that these basis transitions embody correlated transitions (transitions which appear in similar dynamical contexts and so have similar functionality), as can be seen from the multiplicative nature of the equations used for identifying the model. 
It is not surprising, then, that state repetitions or transitions which express focused interest in one of the topic categories appear together on distinct factors. We can also see a joint interest in msn-news and msn-sports being present together in the fourth chart of Figure 4; indeed, as the prefix of these page categories also indicates, these are related page categories.\n\nFigure 3: Left: the predictive perplexity for the WEB data (straight line: global first-order Markov chain, dash-dot: mixture of Markov chains, dotted line: simplicial mixture estimated by MAP, solid line: simplicial mixture estimated by VB). Right: the distribution of entropy rates.\n\n4 Conclusions\n\nThis paper has presented a linear-time method to model finite-state sequences of discrete symbols which may arise from user or customer activity traces. The main feature of the proposed approach has been the assumption that heterogeneous user behavior may be \u2018explained\u2019 by the interleaved action of some structurally simple common generator processes.\n\nFigure 4: State transition matrices of selected factors from a 20-factor run on WEB.\n\nAn empirical study has been conducted on two real-world collections of user activity which has demonstrated this to be an efficient representation, revealed by objective measures of prediction performance, low entropy rates, and interpretable representations of the user profiles provided.\n\nAcknowledgements\n\nMark Girolami is part of the DETECTOR project funded by the Department of Trade and Industry (DTI) Management of Information (LINK) Programme and the Engineering & Physical Sciences Research Council (EPSRC) grant GR/R55184.\n\nReferences\n\n[1] 
D. M. Blei, A. Y. Ng & M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research, 3(5):993\u20131022, 2003.\n\n[2] I. Cadez, D. Heckerman, C. Meek, P. Smyth & S. White, Model-based clustering and visualisation of navigation patterns on a web site, Data Mining and Knowledge Discovery, in press.\n\n[3] H. Frydman, Maximum likelihood estimation in the mover-stayer model, Journal of the American Statistical Association, 79:632\u2013638, 1984.\n\n[4] M. Girolami & A. Kab\u00e1n, On an equivalence between PLSI and LDA, Proc. 26th Annual International ACM SIGIR Conference, 2003, pp. 433\u2013434.\n\n[5] T. Hofmann, Unsupervised learning by probabilistic latent semantic analysis, Machine Learning, 42:177\u2013196, 2001.\n\n[6] G. Ronning, Maximum likelihood estimation of Dirichlet distributions, Journal of Statistical Computation and Simulation, 32(4):215\u2013221, 1989.\n\n[7] D. D. Lee & H. S. Seung, Algorithms for non-negative matrix factorization, Advances in Neural Information Processing Systems 13, eds. T. K. Leen, T. G. Dietterich & V. Tresp, 556\u2013562, MIT Press, 2001.\n\n[8] T. Minka & J. Lafferty, Expectation-propagation for the generative aspect model, Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, 2002.\n\n[9] D. A. Ross & R. S. Zemel, Multiple-cause vector quantization, Advances in Neural Information Processing Systems 15, 2003.\n\n[10] H. Lappalainen & J. W. Miskin, Ensemble learning, in M. Girolami, editor, Advances in Independent Component Analysis, 75\u201392, Springer-Verlag, 2000.\n", "award": [], "sourceid": 2519, "authors": [{"given_name": "Mark", "family_name": "Girolami", "institution": null}, {"given_name": "Ata", "family_name": "Kab\u00e1n", "institution": null}]}