{"title": "Time-Sensitive Recommendation From Recurrent User Activities", "book": "Advances in Neural Information Processing Systems", "page_first": 3492, "page_last": 3500, "abstract": "By making personalized suggestions, a recommender system is playing a crucial role in improving the engagement of users in modern web-services. However, most recommendation algorithms do not explicitly take into account the temporal behavior and the recurrent activities of users. Two central but less explored questions are how to recommend the most desirable item \\emph{at the right moment}, and how to predict \\emph{the next returning time} of a user to a service. To address these questions, we propose a novel framework which connects self-exciting point processes and low-rank models to capture the recurrent temporal patterns in a large collection of user-item consumption pairs. We show that the parameters of the model can be estimated via a convex optimization, and furthermore, we develop an efficient algorithm that maintains $O(1 / \\epsilon)$ convergence rate, scales up to problems with millions of user-item pairs and thousands of millions of temporal events. Compared to other state-of-the-arts in both synthetic and real datasets, our model achieves superb predictive performance in the two time-sensitive recommendation questions. Finally, we point out that our formulation can incorporate other extra context information of users, such as profile, textual and spatial features.", "full_text": "Time-Sensitive Recommendation From\n\nRecurrent User Activities\n\nNan Du(cid:5), Yichen Wang(cid:5), Niao He\u2217, Le Song(cid:5)\n\n(cid:5)College of Computing, Georgia Tech\n\n\u2217H. 
Milton Stewart School of Industrial & System Engineering, Georgia Tech\n\ndunan@gatech.edu, yichen.wang@gatech.edu, nhe6@gatech.edu\n\nlsong@cc.gatech.edu\n\nAbstract\n\nBy making personalized suggestions, a recommender system is playing a crucial\nrole in improving the engagement of users in modern web-services. However,\nmost recommendation algorithms do not explicitly take into account the temporal\nbehavior and the recurrent activities of users. Two central but less explored\nquestions are how to recommend the most desirable item at the right moment, and\nhow to predict the next returning time of a user to a service. To address these\nquestions, we propose a novel framework which connects self-exciting point processes\nand low-rank models to capture the recurrent temporal patterns in a large\ncollection of user-item consumption pairs. We show that the parameters of the\nmodel can be estimated via a convex optimization, and furthermore, we develop\nan efficient algorithm that maintains O(1/\u03b5) convergence rate, scales up to problems\nwith millions of user-item pairs and hundreds of millions of temporal events.\nCompared to other state-of-the-art methods on both synthetic and real datasets, our model\nachieves superb predictive performance in the two time-sensitive recommendation\ntasks. Finally, we point out that our formulation can incorporate other extra\ncontext information of users, such as profile, textual and spatial features.\n\n1 Introduction\n\nDelivering personalized user experiences is believed to play a crucial role in the long-term engagement\nof users to modern web-services [26]. For example, making recommendations on proper items\nat the right moment can make personal assistant services on mainstream mobile platforms more competitive\nand usable, since people tend to have different activities depending on the temporal/spatial\ncontexts such as morning vs. evening, weekdays vs. weekend (see for example Figure 1(a)). 
Unfortunately, most existing recommendation techniques are mainly optimized for predicting users\u2019 one-time\npreference (often denoted by integer ratings) on items, while users\u2019 continuously time-varying\npreferences remain largely underexplored.\nBesides, traditional user feedback signals (e.g. user-item ratings, click-through-rates, etc.) have been\nincreasingly argued to be ineffective at representing the real engagement of users due to the sparseness and\nnoisiness of the data [26]. The temporal patterns at which users return to the services (items) thus become\na more relevant metric to evaluate their satisfaction [12]. Furthermore, successful prediction\nof the returning time not only allows a service to keep track of the evolving user preferences, but also\nhelps a service provider to improve its marketing strategies. For most web companies, if we can\npredict when users will come back next, we could make ads bidding more economic, allowing marketers\nto bid on time slots. After all, marketers need not blindly bid on all time slots indiscriminately.\nIn the context of modern electronic health record data, patients may have several diseases that have\ncomplicated dependencies on each other, as shown at the bottom of Figure 1(a). The occurrence of one\ndisease could trigger the progression of another. Predicting the returning time of a certain disease\ncan effectively help doctors to take proactive steps to reduce the potential risks. 
However, since\nmost models in the literature are particularly optimized for predicting ratings [16, 23, 15, 3, 25, 13, 21],\nexploring the recurrent temporal dynamics of users\u2019 returning behaviors over time becomes more\nimperative and meaningful than ever before.\n\n(a) Predictions from recurrent events.\n\n(b) User-item-event model.\n\nFigure 1: Time-sensitive recommendation.\n(a) In the top figure, one wants to predict the most\ndesirable activity at a given time t for a user; in the bottom figure, one wants to predict the returning\ntime to a particular disease of a patient. (b) The sequence of events induced from each user-item\npair (u, i) is modeled as a temporal point process along time.\n\nAlthough the aforementioned applications come from different domains, we seek to capture them\nin a unified framework by addressing the following two related questions: (1) how to recommend\nthe most relevant item at the right moment, and (2) how to accurately predict the next returning-time\nof users to existing services. More specifically, we propose a novel convex formulation of\nthe problems by establishing an underexplored connection between self-exciting point processes\nand low-rank models. We also develop a new optimization algorithm to solve the low rank point\nprocess estimation problem efficiently. Our algorithm blends proximal gradient and conditional\ngradient methods, and achieves the optimal O(1/t) convergence rate. As further demonstrated by\nour numerical experiments, the algorithm scales up to millions of user-item pairs and hundreds of\nmillions of temporal events, and achieves superb predictive performance on the two time-sensitive\nproblems on both synthetic and real datasets. Furthermore, our model can be readily generalized to\nincorporate other contextual information by making the intensity function explicitly depend on the\nadditional spatial, textual, categorical, and user profile information.\nRelated Work. The very recent work of Kapoor et al. [12, 11] is most related to our approach. They\nattempt to predict the returning time for a music streaming service based on survival analysis [1] and\na hidden semi-Markov model. Although these methods explicitly consider the temporal dynamics of\nuser-item pairs, a major limitation is that the models cannot generalize to recommend any new item\nin future time, which is a crucial difference compared to our approach. 
Moreover, survival analysis\nis often suitable for modeling a single terminal event [1], such as infection and death, by assuming\nthat the inter-event times are independent. However, in many cases this assumption might not hold.\n\n2 Background on Temporal Point Processes\n\nThis section introduces necessary concepts from the theory of temporal point processes [4, 5, 6].\nA temporal point process is a random process of which the realization is a sequence of events $\\{t_i\\}$\nwith $t_i \\in \\mathbb{R}^+$ and $i \\in \\mathbb{Z}^+$ abstracted as points on the time line. Let the history $\\mathcal{T}$ be the list of event\ntimes $\\{t_1, t_2, \\ldots, t_n\\}$ up to but not including the current time t. An important way to characterize\ntemporal point processes is via the conditional intensity function, which is the stochastic model\nfor the next event time given all previous events. Within a small window [t, t + dt), $\\lambda(t)dt = \\mathbb{P}\\{\\text{event in } [t, t + dt) \\mid \\mathcal{T}\\}$\nis the probability for the occurrence of a new event given the history $\\mathcal{T}$.\nThe functional form of the intensity $\\lambda(t)$ is designed to capture the phenomena of interest [1]. For\ninstance, a homogeneous Poisson process has a constant intensity over time, i.e., $\\lambda(t) = \\lambda_0 \\geq 0$,\nwhich is independent of the history $\\mathcal{T}$. The inter-event gap thus conforms to the exponential\ndistribution with the mean being $1/\\lambda_0$. Alternatively, for an inhomogeneous Poisson process, its\nintensity function is also assumed to be independent of the history $\\mathcal{T}$ but can be a simple function\nof time, i.e., $\\lambda(t) = g(t) \\geq 0$. Given a sequence of events $\\mathcal{T} = \\{t_1, \\ldots, t_n\\}$, for any $t > t_n$,\nwe characterize the conditional probability that no event happens during $[t_n, t)$ and the conditional\ndensity $f(t|\\mathcal{T})$ that an event occurs at time t as $S(t|\\mathcal{T}) = \\exp\\left(-\\int_{t_n}^{t} \\lambda(\\tau)\\, d\\tau\\right)$\nand $f(t|\\mathcal{T}) = \\lambda(t)\\, S(t|\\mathcal{T})$ [1]. 
Then given a sequence of events $\\mathcal{T} = \\{t_1, \\ldots, t_n\\}$ observed up to time T, we express its likelihood by\n\n$$\\ell(\\{t_1, \\ldots, t_n\\}) = \\prod_{t_i \\in \\mathcal{T}} \\lambda(t_i) \\cdot \\exp\\left(-\\int_0^T \\lambda(\\tau)\\, d\\tau\\right).$$ (1)\n\n3 Low Rank Hawkes Processes\nIn this section, we present our model in terms of low-rank self-exciting Hawkes processes, discuss its\npossible extensions and provide solutions to our proposed time-sensitive recommendation problems.\n3.1 Modeling Recurrent User Activities with Hawkes Processes\nFigure 1(b) highlights the basic setting of our model. For each observed user-item pair (u, i), we\nmodel the occurrences of user u\u2019s past consumption events on item i as a self-exciting Hawkes\nprocess [10] with the intensity:\n\n$$\\lambda(t) = \\gamma_0 + \\alpha \\sum_{t_i \\in \\mathcal{T}} \\gamma(t, t_i),$$ (2)\n\nwhere $\\gamma(t, t_i) \\geq 0$ is the triggering kernel capturing temporal dependencies, $\\alpha \\geq 0$ scales the\nmagnitude of the influence of each past event, $\\gamma_0 \\geq 0$ is a baseline intensity, and the summation of\nthe kernel terms is history dependent and thus a stochastic process by itself.\nWe have a twofold rationale behind this modeling choice. First, the baseline intensity $\\gamma_0$ captures\nusers\u2019 inherent and long-term preferences to items, regardless of the history. Second, the triggering\nkernel $\\gamma(t, t_i)$ quantifies how the influence from each past event evolves over time, which makes\nthe intensity function depend on the history $\\mathcal{T}$. Thus, a Hawkes process is essentially a conditional\nPoisson process [14] in the sense that conditioned on the history $\\mathcal{T}$, the Hawkes process is a Poisson\nprocess formed by the superposition of a background homogeneous Poisson process with the intensity\n$\\gamma_0$ and a set of inhomogeneous Poisson processes with the intensity $\\gamma(t, t_i)$. 
However, because\nthe events in the past can affect the occurrence of the events in the future, the Hawkes process in general\nis more expressive than a Poisson process, which makes it particularly useful for modeling repeated\nactivities by keeping a balance between the long and the short term aspects of users\u2019 preferences.\n3.2 Transferring Knowledge with Low Rank Models\nSo far, we have shown how to model a sequence of events from a single user-item pair. Since we cannot\nobserve the events from all user-item pairs, the next step is to transfer the learned knowledge to\nunobserved pairs. Given m users and n items, we represent the intensity function between user u\nand item i as $\\lambda^{u,i}(t) = \\lambda_0^{u,i} + \\alpha^{u,i} \\sum_{t_j^{u,i} \\in \\mathcal{T}^{u,i}} \\gamma(t, t_j^{u,i})$, where $\\lambda_0^{u,i}$ and $\\alpha^{u,i}$ are the (u, i)-th entries\nof the m-by-n non-negative base intensity matrix $\\Lambda_0$ and the self-exciting matrix A, respectively.\nHowever, the two matrices of coefficients $\\Lambda_0$ and A contain too many parameters. Since it is often\nbelieved that users\u2019 behaviors and items\u2019 attributes can be categorized into a limited number of\nprototypical types, we assume that $\\Lambda_0$ and A have low-rank structures. That is, the nuclear norms\nof these parameter matrices are small: $\\|\\Lambda_0\\|_* \\leq \\lambda'$, $\\|A\\|_* \\leq \\beta'$. Some researchers also explicitly\nassume that the two matrices factorize into products of low rank factors. Here we assume the above\nnuclear norm constraints in order to obtain convex parameter estimation procedures later.\n3.3 Triggering Kernel Parametrization and Extensions\nBecause it is only required that the triggering kernel should be nonnegative and bounded, the feature\n$\\psi^{u,i}$ in (3) often has analytic forms when $\\gamma(t, t_j^{u,i})$ belongs to many flexible parametric families,\nsuch as the Weibull and Log-logistic distributions [1]. For the simplest case, $\\gamma(t, t_j^{u,i})$ takes the\nexponential form $\\gamma(t, t_j^{u,i}) = \\exp(-(t - t_j^{u,i})/\\sigma)$. 
Alternatively, we can make the intensity function\n$\\lambda^{u,i}(t)$ depend on other additional context information associated with each event. For instance, we\ncan make the base intensity $\\Lambda_0$ depend on user-profiles and item-contents [9, 7]. We might also\nextend $\\Lambda_0$ and A into tensors to incorporate the location information. Furthermore, we can even\nlearn the triggering kernel directly using nonparametric methods [8, 30]. Without loss of generality,\nwe stick with the exponential form in later sections.\n3.4 Time-Sensitive Recommendation\nOnce we have learned $\\Lambda_0$ and A, we are ready to solve our proposed problems as follows:\n\n(a) Item recommendation. At any given time t, for each user-item pair (u, i), because the intensity\nfunction $\\lambda^{u,i}(t)$ indicates the tendency that user u will consume item i at time t, for each user u,\nwe recommend the proper items by the following procedure:\n\n1. Calculate $\\lambda^{u,i}(t)$ for each item i.\n2. Sort the items by the descending order of $\\lambda^{u,i}(t)$.\n3. Return the top-k items.\n\n(b) Returning-time prediction. For each user-item pair (u, i), the intensity function $\\lambda^{u,i}(t)$ dominates\nthe point patterns along time. Given the history $\\mathcal{T}^{u,i} = \\{t_1, t_2, \\ldots, t_n\\}$, we calculate the\ndensity of the next event time by $f(t|\\mathcal{T}^{u,i}) = \\lambda^{u,i}(t) \\exp\\left(-\\int_{t_n}^{t} \\lambda^{u,i}(\\tau)\\, d\\tau\\right)$, so we can use the\nexpectation to predict the next event. Unfortunately, this expectation often does not have analytic\nforms due to the complexity of $\\lambda^{u,i}(t)$ for Hawkes process, so we approximate the returning-time\nas follows:\n\n1. Draw samples $\\{t_{n+1}^1, \\ldots, t_{n+1}^m\\} \\sim f(t|\\mathcal{T}^{u,i})$ by Ogata\u2019s thinning algorithm [19].\n2. 
Estimate the returning-time by the sample average $\\frac{1}{m} \\sum_{i=1}^{m} t_{n+1}^i$.\n\n4 Parameter Estimation\nHaving presented our model, in this section, we develop a new algorithm which blends proximal\ngradient and conditional gradient methods to learn the model efficiently.\n4.1 Convex Formulation\nLet $\\mathcal{T}^{u,i}$ be the set of events induced between u and i. We express the log-likelihood of observing\neach sequence $\\mathcal{T}^{u,i}$ based on Equation 1 as:\n\n$$\\ell(\\mathcal{T}^{u,i}|\\Lambda_0, A) = \\sum_{t_j^{u,i} \\in \\mathcal{T}^{u,i}} \\log(w_{u,i}^\\top \\phi_j^{u,i}) - w_{u,i}^\\top \\psi^{u,i},$$ (3)\n\nwhere $w_{u,i} = (\\Lambda_0(u, i), A(u, i))^\\top$, $\\phi_j^{u,i} = (1, \\sum_{t_k^{u,i} < t_j^{u,i}} \\gamma(t_j^{u,i}, t_k^{u,i}))^\\top$ and\n$\\psi^{u,i} = (T, \\sum_{t_j^{u,i} \\in \\mathcal{T}^{u,i}} \\int_{t_j^{u,i}}^{T} \\gamma(t, t_j^{u,i})\\, dt)^\\top$. When $\\gamma(t, t_j^{u,i})$ is the exponential kernel, $\\psi^{u,i}$ can be expressed\nas $\\psi^{u,i} = (T, \\sum_{t_j^{u,i} \\in \\mathcal{T}^{u,i}} \\sigma(1 - \\exp(-(T - t_j^{u,i})/\\sigma)))^\\top$. Then, the log-likelihood of\nobserving all event sequences $\\mathcal{O} = \\{\\mathcal{T}^{u,i}\\}_{u,i}$ is simply a summation of each individual term by\n$\\ell(\\mathcal{O}) = \\sum_{\\mathcal{T}^{u,i} \\in \\mathcal{O}} \\ell(\\mathcal{T}^{u,i})$. Finally, we can have the following convex formulation:\n\n$$\\mathrm{OPT} = \\min_{\\Lambda_0, A} -\\frac{1}{|\\mathcal{O}|} \\sum_{\\mathcal{T}^{u,i} \\in \\mathcal{O}} \\ell(\\mathcal{T}^{u,i}|\\Lambda_0, A) + \\lambda \\|\\Lambda_0\\|_* + \\beta \\|A\\|_* \\quad \\text{subject to } \\Lambda_0, A \\geq 0,$$ (4)\n\nwhere the matrix nuclear norm $\\|\\cdot\\|_*$, which is a summation of all singular values, is commonly used\nas a convex surrogate for the matrix rank function [24]. One off-the-shelf solution to (4) is proposed\nin [29] based on ADMM. However, the algorithm in [29] requires, at each iteration, a full SVD for\ncomputing the proximal operator, which is often prohibitive with large matrices. 
Alternatively, we\nmight turn to more efficient conditional gradient algorithms [28], which require instead the much\ncheaper linear minimization oracles. However, the non-negativity constraints in our problem prevent\nthe linear minimization from having a simple analytical solution.\n\n4.2 Alternative Formulation\nThe difficulty of directly solving the original formulation (4) is caused by the fact that the nonnegative\nconstraints are entangled with the non-smooth nuclear norm penalty. To address this challenge, we\napproximate (4) using a simple penalty method. Specifically, given $\\rho > 0$, we arrive at the next\nformulation (5) by introducing two auxiliary variables $Z_1$ and $Z_2$ with some penalty function, such\nas the squared Frobenius norm.\n\n$$\\widehat{\\mathrm{OPT}} = \\min_{\\Lambda_0, A, Z_1, Z_2} -\\frac{1}{|\\mathcal{O}|} \\sum_{\\mathcal{T}^{u,i} \\in \\mathcal{O}} \\ell(\\mathcal{T}^{u,i}|\\Lambda_0, A) + \\lambda \\|Z_1\\|_* + \\beta \\|Z_2\\|_* + \\rho \\|\\Lambda_0 - Z_1\\|_F^2 + \\rho \\|A - Z_2\\|_F^2 \\quad \\text{subject to } \\Lambda_0, A \\geq 0.$$ (5)\n\nAlgorithm 1: Learning Hawkes-Recommender\nInput: $\\mathcal{O} = \\{\\mathcal{T}^{u,i}\\}$, $\\rho > 0$\nOutput: $Y_1 = [\\Lambda_0; A]$\nChoose to initialize $X_1^0$ and $X_2^0 = X_1^0$;\nSet $Y^0 = X^0$;\nfor k = 1, 2, . . . 
do\n  $\\delta^k = \\frac{2}{k+1}$;\n  $U^{k-1} = (1 - \\delta^k) Y^{k-1} + \\delta^k X^{k-1}$;\n  $X_1^k = \\mathrm{Prox}_{U^{k-1}}(\\eta^k \\nabla_1 f(U^{k-1}))$;\n  $X_2^k = \\mathrm{LMO}_\\psi(\\nabla_2 f(U^{k-1}))$;\n  $Y^k = (1 - \\delta^k) Y^{k-1} + \\delta^k X^k$;\nend\n\nAlgorithm 2: $\\mathrm{Prox}_{U^{k-1}}(\\eta^k \\nabla_1 f(U^{k-1}))$\n  $X_1^k = (U^{k-1} - \\eta^k \\nabla_1 f(U^{k-1}))_+$;\n\nAlgorithm 3: $\\mathrm{LMO}_\\psi(\\nabla_2 f(U^{k-1}))$\n  $(u_1, v_1), (u_2, v_2)$ top singular vector pairs of $-\\nabla_2 f(U^{k-1})[Z_1]$ and $-\\nabla_2 f(U^{k-1})[Z_2]$;\n  $X_2^k[Z_1] = u_1 v_1^\\top$, $X_2^k[Z_2] = u_2 v_2^\\top$;\n  Find $\\alpha_1^k$ and $\\alpha_2^k$ by solving (6);\n  $X_2^k[Z_1] = \\alpha_1^k X_2^k[Z_1]$;\n  $X_2^k[Z_2] = \\alpha_2^k X_2^k[Z_2]$;\n\nWe show in Theorem 1 that when $\\rho$ is properly chosen, these two formulations lead to the same\noptimum. See appendix for the complete proof. More importantly, the new formulation (5) allows us\nto handle the non-negativity constraints and nuclear norm regularization terms separately.\nTheorem 1. With the condition $\\rho \\geq \\rho^*$, the optimal value $\\widehat{\\mathrm{OPT}}$ of the problem (5) coincides with the\noptimal value OPT in the problem (4) of interest, where $\\rho^*$ is a problem dependent threshold,\n\n$$\\rho^* = \\max\\left\\{ \\frac{\\lambda (\\|\\Lambda_0^*\\|_* - \\|Z_1^*\\|_*) + \\beta (\\|A^*\\|_* - \\|Z_2^*\\|_*)}{\\|\\Lambda_0^* - Z_1^*\\|_F^2 + \\|A^* - Z_2^*\\|_F^2} \\right\\}.$$\n\n4.3 Efficient Optimization: Proximal Method Meets Conditional Gradient\n\nNow, we are ready to present Algorithm 1 for solving (5) efficiently. Denote $X_1 = [\\Lambda_0; A]$, $X_2 = [Z_1; Z_2]$ and $X = [X_1; X_2]$. 
We use the bracket $[\\cdot]$ notation $X_1[\\Lambda_0]$, $X_1[A]$, $X_2[Z_1]$, $X_2[Z_2]$\nto represent the respective part for simplicity. Let $f(X) := f(\\Lambda_0, A, Z_1, Z_2) = -\\frac{1}{|\\mathcal{O}|} \\sum_{\\mathcal{T}^{u,i} \\in \\mathcal{O}} \\ell(\\mathcal{T}^{u,i}|\\Lambda_0, A) + \\rho \\|\\Lambda_0 - Z_1\\|_F^2 + \\rho \\|A - Z_2\\|_F^2$.\nThe course of our action is straightforward: at each iteration, we apply cheap projection gradient for\nblock $X_1$ and cheap linear minimization for block $X_2$ and maintain three interdependent sequences\n$\\{U^k\\}_{k \\geq 1}$, $\\{Y^k\\}_{k \\geq 1}$ and $\\{X^k\\}_{k \\geq 1}$ based on the accelerated scheme in [17, 18]. To be more\nspecific, the algorithm consists of two main subroutines:\nProximal Gradient. When updating $X_1$, we compute directly the associated proximal operator,\nwhich in our case reduces to the simple projection $X_1^k = (U^{k-1} - \\eta^k \\nabla_1 f(U^{k-1}))_+$, where $(\\cdot)_+$\nsimply sets the negative coordinates to zero.\nConditional Gradient. When updating $X_2$, instead of computing the proximal operator, we call\nthe linear minimization oracle ($\\mathrm{LMO}_\\psi$): $X_2^k[Z_1] = \\operatorname{argmin}\\{\\langle p^k[Z_1], Z_1 \\rangle + \\psi(Z_1)\\}$ where $p^k = \\nabla_2 f(U^{k-1})$\nis the partial derivative with respect to $X_2$ and $\\psi(Z_1) = \\lambda \\|Z_1\\|_*$. We do similar\nupdates for $X_2^k[Z_2]$. The overall performance clearly depends on the efficiency of this LMO, which\ncan be solved efficiently in our case as illustrated in Algorithm 3. 
Following [27], the linear minimization\nfor our situation requires only: (i) computing $X_2^k[Z_1] = \\operatorname{argmin}_{\\|Z_1\\|_* \\leq 1} \\langle p^k[Z_1], Z_1 \\rangle$,\nwhere the minimizer is readily given by $X_2^k[Z_1] = u_1 v_1^\\top$, and $u_1, v_1$ are the top singular vectors of\n$-p^k[Z_1]$; and (ii) conducting a line-search that produces a scaling factor $\\alpha_1^k = \\operatorname{argmin}_{\\alpha_1 \\geq 0} h(\\alpha_1)$,\n\n$$h(\\alpha_1) := \\rho \\|Y_1^{k-1}[\\Lambda_0] - (1 - \\delta^k) Y_2^{k-1}[Z_1] - \\delta^k (\\alpha_1 X_2^k[Z_1])\\|_F^2 + \\lambda \\delta^k \\alpha_1 + C,$$ (6)\n\nwhere $C = \\lambda (1 - \\delta^k) \\|Y_2^{k-1}[Z_1]\\|_*$. The quadratic problem (6) admits a closed-form solution and\nthus can be computed efficiently. We repeat the same process for updating $\\alpha_2^k$ accordingly.\n\n4.4 Convergence Analysis\n\nDenote $F(X) = f(X) + \\psi(X_2)$ as the objective in formulation (5), where $X = [X_1; X_2]$. We establish\nthe following convergence results for Algorithm 1 described above when solving formulation (5).\nPlease refer to Appendix for complete proof.\n\nTheorem 2. Let $\\{Y^k\\}$ be the sequence generated by Algorithm 1 by setting $\\delta^k = 2/(k + 1)$, and\n$\\eta^k = (\\delta^k)^{-1}/L$. Then for $k \\geq 1$, we have\n\n$$F(Y^k) - \\widehat{\\mathrm{OPT}} \\leq \\frac{4 L D_1}{k(k + 1)} + \\frac{2 L D_2}{k + 1},$$ (7)\n\nwhere L corresponds to the Lipschitz constant of $\\nabla f(X)$ and $D_1$ and $D_2$ are some problem dependent\nconstants.\n\nRemark. Let $g(\\Lambda_0, A)$ denote the objective in formulation (4), which is the original problem of our\ninterest. By invoking Theorem 1, we further have $g(Y^k[\\Lambda_0], Y^k[A]) - \\mathrm{OPT} \\leq \\frac{4 L D_1}{k(k+1)} + \\frac{2 L D_2}{k+1}$.\nThe analysis builds upon the recursions from proximal gradient and conditional gradient methods.\nAs a result, the overall convergence rate comes from two parts, as reflected in (7). 
Interestingly, one\ncan easily see that for both the proximal and the conditional gradient parts, we achieve the respective\noptimal convergence rates. When there is no nuclear norm regularization term, the results recover\nthe well-known optimal O(1/t^2) rate achieved by the proximal gradient method for smooth convex\noptimization. When there is no nonnegative constraint, the results recover the well-known O(1/t)\nrate attained by the conditional gradient method for smooth convex minimization. When both the nuclear\nnorm and non-negativity are present, the proposed algorithm, to our knowledge, is the first of its\nkind to achieve the best of both worlds, which could be of independent interest.\n5 Experiments\nWe evaluate our algorithm by comparing with state-of-the-art competitors on both synthetic and real\ndatasets. For each user, we randomly pick 20-percent of all the items she has consumed and hold out\nthe entire sequence of events. Besides, for each sequence of the other 80-percent items, we further\nsplit it into a pair of training/testing subsequences. For each testing event, we evaluate the predictive\naccuracy on two tasks:\n(a) Item Recommendation: suppose the testing event belongs to the user-item pair (u, i). Ideally\nitem i should rank top at the testing moment. We record its predicted rank among all items.\nA smaller value indicates better performance.\n\n(b) Returning-Time Prediction: we predict the returning-time from the learned intensity function\nand compute the absolute error with respect to the true time.\n\nWe repeat these two evaluations on all testing events. 
Because the predictive tasks on those entirely\nheld-out sequences are much more challenging, we report the total mean absolute error (MAE) and\nthat speci\ufb01c to the set of entirely heldout sequences, separately.\n\n5.1 Competitors\nPoisson process is a relaxation of our model by assuming each user-item pair (u, i) has only a\nconstant base intensity \u039b0(u, i), regardless of the history. For task (a), it gives static ranks regardless\nof the time. For task (b), it produces an estimate of the average inter-event gaps. In many cases, the\nPoisson process is a hard baseline in that the most popular items often have large base intensity, and\nrecommending popular items is often a strong heuristic.\nSTiC [11] \ufb01ts a semi-hidden Markov model to each observed user-item pair. Since it can only make\nrecommendations speci\ufb01c to the few observed items visited before, instead of the large number of\nnew items, we only evaluate its performance on the returning time prediction task. For the set of\nentirely held-out sequences, we use the average predicted inter-event time from each observed item\nas the \ufb01nal prediction.\nSVD is the classic matrix factorization model. The implicit user feedback is converted into an ex-\nplicit rating using the frequency of item consumptions [2]. Since it is not designed for predicting the\nreturning time, we report its performance on the time-sensitive recommendation task as a reference.\nTensor factorization generalizes matrix factorization to include time. We compare with the state-\nof-art method [3] which considers poisson regression as the loss function to \ufb01t the number of events\nin each discretized time slot and shows better performance compared to other alternatives with the\nsquared loss [25, 13, 22, 21]. We report the performance by (1) using the parameters \ufb01tted only in\nthe last interval, and (2) using the average parameters over all time intervals. 
We denote these two\nvariants with varying number of intervals as Tensor-#-Last and Tensor-#-Avg.\n\n6\n\n\f(a) Convergence by iterations\n\n(b) Convergence by #user-item\n\n(c) Convergence by #events\n\n(d) Scalability\n\n(e) Item recommendation\n\n(f) Returning-time prediction\n\nFigure 2: Estimation error (a) by #iterations, (b) by #entries (1,000 events per entry), and (c) by\n#events per entry (10,000 entries); (d) scalability by #entries (1,000 events per entry, 500 iterations);\n(e) MAE of the predicted ranking; and (f) MAE of the predicted returning time.\n5.2 Results\nSynthetic data. We generate two 1,024-by-1,204 user-item matrices \u039b0 and A with rank \ufb01ve as the\nground-truth. For each user-item pair, we simulate 1,000 events by Ogata\u2019s thinning algorithm [19]\nwith an exponential triggering kernel and get 100 million events in total. The bandwidth for the\ntriggering kernel is \ufb01xed to one. By theorem 1, it is inef\ufb01cient to directly estimate the exact value of\nthe threshold value for \u03c1. Instead, we tune \u03c1, \u03bb and \u03b2 to give the best performance.\nHow does our algorithm converge ? Figure 2(a) shows that it only requires a few hundred iterations\nto descend to a decent error for both \u039b0 and A, indicating algorithm 1 converges very fast. Since\nthe true parameters are low-rank, Figure 2(b-c) verify that it only requires a modest number of ob-\nserved entries, each of which induces a small number of events (1,000) to achieve a good estimation\nperformance. Figure 2(d) further illustrates that algorithm 1 scales linearly as the training set grows.\nWhat is the predictive performance ? Figure 2(e-f) con\ufb01rm that algorithm 1 achieves the best pre-\ndictive performance compared to other baselines. In Figure 2(e), all temporal methods outperform\nthe static SVD since this classic baseline does not consider the underlying temporal dynamics of the\nobserved sequences. 
In contrast, although the Poisson regression also produces static rankings of\nthe items, it is equivalent to recommending the most popular items over time. This simple heuristic\ncan still give competitive performance. In Figure 2(f), since the occurrence of a new event depends\non the whole past history instead of the last one, the performance of STiC deteriorates vastly. The\nother tensor methods predict the returning time with the information from different time intervals.\nHowever, because our method automatically adapts different contributions of each past event to the\nprediction of the next event, it can achieve the best prediction performance overall.\nReal data. We also evaluate the proposed method on real datasets. last.fm consists of the music\nstreaming logs between 1,000 users and 3,000 artists. There are around 20,000 observed user-artist\npairs with more than one million events in total. tmall.com contains around 100K shopping events\nbetween 26,376 users and 2,563 stores. The unit time for both dataset is hour. MIMIC II medical\ndataset is a collection of de-identi\ufb01ed clinical visit records of Intensive Care Unit patients for seven\nyears. We \ufb01ltered out 650 patients and 204 diseases. Each event records the time when a patient was\ndiagnosed with a speci\ufb01c disease. The time unit is week. All model parameters \u03c1, \u03bb, \u03b2, the kernel\nbandwidth and the latent rank of other baselines are tuned to give the best performance.\nDoes the history help ? Because the true temporal dynamics governing the event patterns are unob-\nserved, we \ufb01rst investigate whether our model assumption is reasonable. 
Our Hawkes model considers\nthe self-exciting effects from past user activities, while the survival analysis applied in [11]\nassumes i.i.d. inter-event gaps which might conform to an exponential (Poisson process) or Rayleigh\ndistribution. According to the time-change theorem [6], given a sequence T = {t1, . . . , tn} and a\nparticular point process with intensity \u03bb(t), the set of samples $\\left\\{\\int_{t_{i-1}}^{t_i} \\lambda(t)\\, dt\\right\\}_{i=1}^{n}$ should conform\nto a unit-rate exponential distribution if T is truly sampled from the process. Therefore, we compare\nthe theoretical quantiles from the exponential distribution with the fittings of different models to a\nreal sequence of (listening/shopping/visiting) events. The closer the slope goes to one, the better a\nmodel matches the event patterns. Figure 3 clearly shows that our Hawkes model can better explain\nthe observed data compared to the other survival analysis models.\n\nFigure 3: The quantile plots of different fitted processes, the MAE of predicted rankings and\nreturning-time on the last.fm (top), tmall.com (middle) and the MIMIC II (bottom), respectively.\nColumns: Quantile-plot, Item recommendation, Returning-time prediction.\n\nWhat is the predictive performance ? Finally, we evaluate the prediction accuracy in the 2nd and\n3rd column of Figure 3. 
Since holding out an entire testing sequence is more challenging, the\nperformance on the Heldout group is slightly lower than that on the average Total group. However,\nacross all cases, since the proposed model better captures the temporal dynamics of the\nobserved sequences of events, it achieves better performance on both tasks.\n6 Conclusions\nWe propose a novel convex formulation and an efficient learning algorithm to recommend relevant\nservices at any given moment, and to predict the next returning-time of users to existing services.\nEmpirical evaluations on large synthetic and real data demonstrate its superior scalability and predictive\nperformance. Moreover, our optimization algorithm can be used for solving general nonnegative\nmatrix rank minimization problems with other convex losses under mild assumptions, which may be\nof independent interest.\nAcknowledgments\nThe research was supported in part by NSF IIS-1116886, NSF/NIH BIGDATA 1R01GM108341, and\nNSF CAREER IIS-1350983.\n\nReferences\n[1] O. Aalen, O. Borgan, and H. Gjessing. Survival and event history analysis: a process point of view. Springer, 2008.\n[2] L. Baltrunas and X. Amatriain. Towards time-dependant recommendation based on implicit feedback, 2009.\n[3] E. C. Chi and T. G. Kolda. On tensors, sparsity, and nonnegative factorizations, 2012.\n[4] D. Cox and V. Isham. Point processes, volume 12. Chapman & Hall/CRC, 1980.\n[5] D. Cox and P. Lewis. Multivariate point processes. Selected Statistical Papers of Sir David Cox: Volume 1, Design of Investigations, Statistical Methods and Applications, 1:159, 2006.\n[6] D. Daley and D. Vere-Jones. An introduction to the theory of point processes: volume II: general theory and structure, volume 2. Springer, 2007.\n[7] N. Du, M. Farajtabar, A. Ahmed, A. J. Smola, and L. Song. Dirichlet-Hawkes processes with applications to clustering continuous-time document streams. In KDD\u201915, 2015.\n[8] N. Du, L. Song, A. Smola, and M. Yuan. Learning networks of heterogeneous influence. In Advances in Neural Information Processing Systems 25, pages 2789\u20132797, 2012.\n[9] N. Du, L. Song, H. Woo, and H. Zha. Uncover topic-sensitive information diffusion networks. In Artificial Intelligence and Statistics (AISTATS), 2013.\n[10] A. G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83\u201390, 1971.\n[11] K. Kapoor, K. Subbian, J. Srivastava, and P. Schrater. Just in time recommendations: Modeling the dynamics of boredom in activity streams. In WSDM, pages 233\u2013242, 2015.\n[12] K. Kapoor, M. Sun, J. Srivastava, and T. Ye. A hazard based approach to user return time prediction. 
In KDD\u201914, pages 1719\u20131728, 2014.\n[13] A. Karatzoglou, X. Amatriain, L. Baltrunas, and N. Oliver. Multiverse recommendation: N-dimensional tensor factorization for context-aware collaborative filtering. In Proceedings of the 4th ACM Conference on Recommender Systems (RecSys), 2010.\n[14] J. Kingman. On doubly stochastic Poisson processes. Mathematical Proceedings of the Cambridge Philosophical Society, pages 923\u2013930, 1964.\n[15] N. Koenigstein, G. Dror, and Y. Koren. Yahoo! music recommendations: Modeling music ratings with temporal dynamics and item taxonomy. In Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys \u201911, pages 165\u2013172, 2011.\n[16] Y. Koren. Collaborative filtering with temporal dynamics. In Knowledge Discovery and Data Mining (KDD), pages 447\u2013456, 2009.\n[17] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 2012.\n[18] G. Lan. The complexity of large-scale convex programming under a linear optimization oracle. arXiv preprint arXiv:1309.5550v2, 2014.\n[19] Y. Ogata. On Lewis\u2019 simulation method for point processes. Information Theory, IEEE Transactions on, 27(1):23\u201331, 1981.\n[20] H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In ICML, 2013.\n[21] P. Bhargava, T. Phan, J. Zhou, and J. Lee. Who, what, when, and where: Multi-dimensional collaborative recommendations using tensor factorization on sparse user-generated data. In WWW, 2015.\n[22] Y. Wang, R. Chen, J. Ghosh, J. Denny, A. Kho, Y. Chen, B. Malin, and J. Sun. Rubik: Knowledge guided tensor factorization and completion for health data analytics. In KDD, 2015.\n[23] S. Rendle. Time-Variant Factorization Models. In Context-Aware Ranking with Factorization Models, volume 330 of Studies in Computational Intelligence, chapter 9, pages 137\u2013153. 2011.\n[24] S. Sastry. 
Some NP-complete problems in linear algebra. Honors Projects, 1990.\n[25] L. Xiong, X. Chen, T.-K. Huang, J. G. Schneider, and J. G. Carbonell. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In SDM, pages 211\u2013222. SIAM, 2010.\n[26] X. Yi, L. Hong, E. Zhong, N. N. Liu, and S. Rajan. Beyond clicks: Dwell time for personalization. In Proceedings of the 8th ACM Conference on Recommender Systems, RecSys \u201914, pages 113\u2013120, 2014.\n[27] A. W. Yu, W. Ma, Y. Yu, J. G. Carbonell, and S. Sra. Efficient structured matrix rank minimization. In NIPS, 2014.\n[28] Z. Harchaoui, A. Juditsky, and A. Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 2013.\n[29] K. Zhou, H. Zha, and L. Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In Artificial Intelligence and Statistics (AISTATS), 2013.\n[30] K. Zhou, H. Zha, and L. Song. Learning triggering kernels for multi-dimensional Hawkes processes. In International Conference on Machine Learning (ICML), 2013.", "award": [], "sourceid": 1937, "authors": [{"given_name": "Nan", "family_name": "Du", "institution": "Georgia Tech"}, {"given_name": "Yichen", "family_name": "Wang", "institution": "Georgia Tech"}, {"given_name": "Niao", "family_name": "He", "institution": "Georgia Institute of Technology"}, {"given_name": "Jimeng", "family_name": "Sun", "institution": "Gatech"}, {"given_name": "Le", "family_name": "Song", "institution": "Georgia Institute of Technology"}]}