{"title": "Scalable Demand-Aware Recommendation", "book": "Advances in Neural Information Processing Systems", "page_first": 2412, "page_last": 2421, "abstract": "Recommendation for e-commerce with a mix of durable and nondurable goods has characteristics that distinguish it from the well-studied media recommendation problem. The demand for items is a combined effect of form utility and time utility, i.e., a product must both be intrinsically appealing to a consumer and the time must be right for purchase. In particular for durable goods, time utility is a function of inter-purchase duration within product category because consumers are unlikely to purchase two items in the same category in close temporal succession. Moreover, purchase data, in contrast to rating data, is implicit with non-purchases not necessarily indicating dislike. Together, these issues give rise to the positive-unlabeled demand-aware recommendation problem that we pose via joint low-rank tensor completion and product category inter-purchase duration vector estimation. We further relax this problem and propose a highly scalable alternating minimization approach with which we can solve problems with millions of users and millions of items in a single thread. We also show superior prediction accuracies on multiple real-world datasets.", "full_text": "Scalable Demand-Aware Recommendation\n\nJinfeng Yi1\u2217, Cho-Jui Hsieh2, Kush R. Varshney1, Lijun Zhang3, Yao Li2\n1IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA\n\n2University of California, Davis, CA, USA\n\n3National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China\n\njinfengyi.ustc@gmail.com, chohsieh@ucdavis.edu, krvarshn@us.ibm.com,\n\nzhanglj@lamda.nju.edu.cn, yaoli@ucdavis.edu\n\nAbstract\n\nRecommendation for e-commerce with a mix of durable and nondurable goods\nhas characteristics that distinguish it from the well-studied media recommendation\nproblem. 
The demand for items is a combined effect of form utility and time utility,\ni.e., a product must both be intrinsically appealing to a consumer and the time must\nbe right for purchase. In particular for durable goods, time utility is a function of\ninter-purchase duration within product category because consumers are unlikely to\npurchase two items in the same category in close temporal succession. Moreover,\npurchase data, in contrast to rating data, is implicit with non-purchases not neces-\nsarily indicating dislike. Together, these issues give rise to the positive-unlabeled\ndemand-aware recommendation problem that we pose via joint low-rank tensor\ncompletion and product category inter-purchase duration vector estimation. We\nfurther relax this problem and propose a highly scalable alternating minimization\napproach with which we can solve problems with millions of users and millions of\nitems in a single thread. We also show superior prediction accuracies on multiple\nreal-world datasets.\n\n1\n\nIntroduction\n\nE-commerce recommender systems aim to present items with high utility to the consumers [18].\nUtility may be decomposed into form utility: the item is desired as it is manifested, and time utility:\nthe item is desired at the given point in time [28]; recommender systems should take both types of\nutility into account. Economists de\ufb01ne items to be either durable goods or nondurable goods based\non how long they are intended to last before being replaced [27]. A key characteristic of durable\ngoods is the long duration of time between successive purchases within item categories whereas this\nduration for nondurable goods is much shorter, or even negligible. 
Thus, durable and nondurable\ngoods have differing time utility characteristics which lead to differing demand characteristics.\nAlthough we have witnessed great success of collaborative \ufb01ltering in media recommendation, we\nshould be careful when expanding its application to general e-commerce recommendation involving\nboth durable and nondurable goods due to the following reasons:\n\n1. Since media such as movies and music are nondurable goods, most users are quite receptive\nto buying or renting them in rapid succession. However, users only purchase durable goods\nwhen the time is right. For instance, most users will not buy televisions the day after they have\nalready bought one. Therefore, recommending an item for which a user has no immediate\ndemand can hurt user experience and waste an opportunity to drive sales.\n\n\u2217Now at Tencent AI Lab, Bellevue, WA, USA\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\f2. A key assumption made by matrix factorization- and completion-based collaborative \ufb01ltering\nalgorithms is that the underlying rating matrix is of low-rank since only a few factors typically\ncontribute to an individual\u2019s form utility [5]. However, a user\u2019s demand is not only driven by\nform utility, but is the combined effect of both form utility and time utility. Hence, even if the\nunderlying form utility matrix is of low-rank, the overall purchase intention matrix is likely to\nbe of high-rank,2 and thus cannot be directly recovered by existing approaches.\n\nAn additional challenge faced by many real-world recommender systems is the one-sided sampling\nof implicit feedback [15, 23]. 
Unlike the Net\ufb02ix-like setting that provides both positive and negative\nfeedback (high and low ratings), no negative feedback is available in many e-commerce systems.\nFor example, a user might not purchase an item because she does not derive utility from it, or just\nbecause she was simply unaware of it or plans to buy it in the future. In this sense, the labeled training\ndata only draws from the positive class, and the unlabeled data is a mixture of positive and negative\nsamples, a problem usually referred to as positive-unlabeled (PU) learning [13]. To address these\nissues, we study the problem of demand-aware recommendation. Given purchase triplets (user, item,\ntime) and item categories, the objective is to make recommendations based on users\u2019 overall predicted\ncombination of form utility and time utility.\nWe denote purchases by the sparse binary tensor P. To model implicit feedback, we assume that\nP is obtained by thresholding an underlying real-valued utility tensor to a binary tensor Y and then\nrevealing a subset of Y\u2019s positive entries. The key to demand-aware recommendation is de\ufb01ning\nan appropriate utility measure for all (user, item, time) triplets. To this end, we quantify purchase\nintention as a combined effect of form utility and time utility. Speci\ufb01cally, we model a user\u2019s time\nutility for an item by comparing the time t since her most recent purchase within the item\u2019s category\nand the item category\u2019s underlying inter-purchase duration d; the larger the value of d \u2212 t, the less\nlikely she needs this item. In contrast, d \u2264 t may indicate that the item needs to be replaced, and\nshe may be open to related recommendations. Therefore, the function h = max(0, d \u2212 t) may be\nemployed to measure the time utility factor for a (user, item) pair. Then the purchase intention for a\n(user, item, time) triplet is given by x \u2212 h, where x denotes the user\u2019s form utility. 
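As a small illustration of this demand model (a sketch with our own function and argument names, not the paper's code):

```python
def purchase_intention(x, d, t):
    """Purchase intention = form utility minus the time-utility penalty.

    x: form utility (how intrinsically appealing the item is to the user)
    d: inter-purchase duration of the item's category
    t: time since the user's last purchase in that category
    """
    h = max(0, d - t)  # penalty is positive only while the old item still serves
    return x - h

# A durable good (d = 365 slots) right after a purchase vs. long after it:
print(purchase_intention(0.9, 365, 10))   # deeply negative: no demand yet
print(purchase_intention(0.9, 365, 400))  # 0.9: form utility alone decides
```

The same appealing item thus scores very differently depending on when it is recommended, which is exactly the effect the durable/nondurable distinction captures.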
This observation\nallows us to cast demand-aware recommendation as the problem of learning users\u2019 form utility tensor\nX and items\u2019 inter-purchase durations vector d given the binary tensor P.\nAlthough the learning problem can be naturally formulated as a tensor nuclear norm minimization\nproblem, the high computational cost signi\ufb01cantly limits its application to large-scale recommendation\nproblems. To address this limitation, we \ufb01rst relax the problem to a matrix optimization problem with\na label-dependent loss. We note that the problem after relaxation is still non-trivial to solve since it\nis a highly non-smooth problem with nested hinge losses. More severely, the optimization problem\ninvolves mnl entries, where m, n, and l are the number of users, items, and time slots, respectively.\nThus a naive optimization algorithm will take at least O(mnl) time, and is intractable for large-\nscale recommendation problems. To overcome this limitation, we develop an ef\ufb01cient alternating\nminimization algorithm and show that its time complexity is only approximately proportional to the\nnumber of nonzero elements in the purchase records tensor P. 
Since P is usually very sparse, our\nalgorithm is extremely ef\ufb01cient and can solve problems with millions of users and items.\nCompared to existing recommender systems, our work has the following contributions and advantages:\n(i) to the best of our knowledge, this is the \ufb01rst work that makes demand-aware recommendation by\nconsidering inter-purchase durations for durable and nondurable goods; (ii) the proposed algorithm is\nable to simultaneously infer items\u2019 inter-purchase durations and users\u2019 real-time purchase intentions,\nwhich can help e-retailers make more informed decisions on inventory planning and marketing\nstrategy; (iii) by effectively exploiting sparsity, the proposed algorithm is extremely ef\ufb01cient and able\nto handle large-scale recommendation problems.\n\n2 Related Work\n\nOur contributions herein relate to three different areas of prior work: consumer modeling from a\nmicroeconomics and marketing perspective [6], time-aware recommender systems [4, 29, 8, 19], and\nPU learning [20, 9, 13, 14, 23, 2]. The extensive consumer modeling literature is concerned with\ndescriptive and analytical models of choice rather than prediction or recommendation, but nonetheless\n\n2A detailed illustration can be found in the supplementary material\n\n2\n\n\fforms the basis for our modeling approach. A variety of time-aware recommender systems have\nbeen proposed to exploit time information, but none of them explicitly consider the notion of time\nutility derived from inter-purchase durations in item categories. Much of the PU learning literature is\nfocused on the binary classi\ufb01cation problem, e.g. [20, 9], whereas we are in the collaborative \ufb01ltering\nsetting. 
For the papers that do examine collaborative \ufb01ltering with PU learning or learning with\nimplicit feedback [14, 23, 2, 32], they mainly focus on media recommendation and overlook users\u2019\ndemands, thus are not suitable for durable goods recommendation.\nTemporal aspects of the recommendation problem have been examined in a few ways: as part of\nthe cold-start problem [3], to capture dynamics in interests or ratings over time [17], and as part of\nthe context in context-aware recommenders [1]. However, the problem we address in this paper is\ndifferent from all of those aspects, and in fact could be combined with the other aspects in future\nsolutions. To the best of our knowledge, there is no existing work that tries to take inter-purchase\ndurations into account to better time recommendations as we do herein.\n\n3 Positive-Unlabeled Demand-Aware Recommendation\n\nThroughout the paper, we use boldface Euler script letters, boldface capital letters, and boldface\nlower-case letters to denote tensors (e.g., A), matrices (e.g., A) and vectors (e.g., a), respectively.\nScalars such as entries of tensors, matrices, and vectors are denoted by lowercase letters, e.g., a. In\nparticular, the (i, j, k) entry of a third-order tensor A is denoted by aijk.\nGiven a set of m users, n items, and l time slots, we construct a third-order binary tensor P \u2208\n{0, 1}m\u00d7n\u00d7l to represent the purchase history. Speci\ufb01cally, entry pijk = 1 indicates that user i has\npurchased item j in time slot k. We denote (cid:107)P(cid:107)0 as the number of nonzero entries in tensor P.\nSince P is usually very sparse, we have (cid:107)P(cid:107)0 (cid:28) mnl. Also, we assume that the n items belong to r\nitem categories, with items in each category sharing similar inter-purchase durations.3 We use an\nn-dimensional vector c \u2208 {1, 2, . . . , r}n to represent the category membership of each item. 
Given\nP and c, we further generate a tensor T \u2208 Rm\u00d7r\u00d7l where ticj k denotes the number of time slots\nbetween user i\u2019s most recent purchase within item category cj until time k. If user i has not purchased\nwithin item category cj until time k, ticj k is set to +\u221e.\n\n3.1\n\nInferring Purchase Intentions from Users\u2019 Purchase Histories\n\nIn this work, we formulate users\u2019 utility as a combined effect of form utility and time utility. To this\nend, we use an underlying third-order tensor X \u2208 Rm\u00d7n\u00d7l to quantify form utility. In addition, we\nemploy a non-negative vector d \u2208 Rr\n+ to measure the underlying inter-purchase duration times of the\nr item categories. It is understood that the inter-purchase durations for durable good categories are\nlarge, while for nondurable good categories are small, or even zero. In this study, we focus on items\u2019\ninherent properties and assume that the inter-purchase durations are user-independent. The problem\nof learning personalized durations will be studied in our future work.\nAs discussed above, the demand is mediated by the time elapsed since the last purchase of an item\nin the same category. Let dcj be the inter-purchase duration time of item j\u2019s category cj, and let\nticj k be the time gap of user i\u2019s most recent purchase within item category cj until time k. Then\nif dcj > ticj k, a previously purchased item in category cj continues to be useful, and thus user i\u2019s\nutility from item j is weak. Intuitively, the greater the value dcj \u2212 ticj k, the weaker the utility. On the\nother hand, dcj < ticj k indicates that the item is nearing the end of its lifetime and the user may be\nopen to recommendations in category cj. We use a hinge loss max(0, dcj \u2212 ticj k) to model such time\nutility. The overall utility can be obtained by comparing form utility and time utility. 
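The tensor T introduced at the start of this section can be assembled from the raw purchase triplets in one pass per (user, category) pair. A minimal sketch (our own helper names, 0-based indices, `math.inf` standing in for +∞; we count the purchase slot itself as gap 0):

```python
import math

def build_time_gap_tensor(purchases, categories, m, r, l):
    """t[i][kappa][k] = number of slots since user i's most recent purchase in
    category kappa, as of slot k (math.inf if no purchase has happened yet)."""
    # Group purchase slots by (user, category).
    slots_by_user_cat = {}
    for (i, j, k) in purchases:
        kappa = categories[j]
        slots_by_user_cat.setdefault((i, kappa), []).append(k)

    t = [[[math.inf] * l for _ in range(r)] for _ in range(m)]
    for (i, kappa), slots in slots_by_user_cat.items():
        slots.sort()
        for k in range(l):
            past = [s for s in slots if s <= k]
            if past:
                t[i][kappa][k] = k - max(past)  # gap to most recent purchase
    return t

# user 0 buys item 0 (category 0) in slot 2:
t = build_time_gap_tensor([(0, 0, 2)], categories=[0], m=1, r=1, l=5)
print(t[0][0])  # [inf, inf, 0, 1, 2]
```

A production version would of course exploit sparsity rather than materialize all m x r x l entries.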
In more detail, we model a binary utility indicator tensor Y ∈ {0, 1}^{m×n×l} as being generated by the following thresholding process:

yijk = 1[xijk − max(0, dcj − ticjk) > τ],   (1)

where 1(·) : R → {0, 1} is the indicator function, and τ > 0 is a predefined threshold.

3To meet this requirement, the granularity of categories should be properly selected. For instance, the category “Smart TV” is a better choice than the category “Electrical Equipment”, since the latter category covers a broad range of goods with different durations.

Note that the positive entries of Y denote high purchase intentions, while the positive entries of P denote actual purchases. Generally speaking, a purchase only happens when the utility is high, but a high utility does not necessarily lead to a purchase. This observation allows us to link the binary tensors P and Y: P is generated by a one-sided sampling process that only reveals a subset of Y’s positive entries.
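The generative story, thresholding as in Eq. (1) followed by one-sided sampling, can be simulated as follows (a toy sketch with made-up sizes and parameters, not the experimental code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): m users, n items, l time slots, r categories.
m, n, l, r = 4, 6, 5, 2
cat = rng.integers(0, r, size=n)          # category c_j of each item
x = rng.uniform(0, 2, size=(m, n, l))     # form utility tensor X
d = np.array([3.0, 0.0])                  # durations: one durable, one nondurable
t = rng.integers(0, 10, size=(m, r, l))   # time since last purchase per category
tau = 0.5                                 # threshold in Eq. (1)

# Eq. (1): intention is high iff form utility minus the time-utility
# penalty exceeds tau.
penalty = np.maximum(0.0, d[cat][None, :, None] - t[:, cat, :])
y = (x - penalty > tau).astype(int)

# One-sided sampling: P reveals only a random subset of Y's positive entries.
mask = rng.random(y.shape) < 0.3
p = y * mask

assert np.all(p <= y)  # purchases occur only where intention is high
```

Every positive entry of `p` is positive in `y`, but not conversely, which is the positive-unlabeled structure the loss below is designed for.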
Given this observation, we follow [13] and include a label-dependent loss [26] trading off the relative cost of positive and unlabeled samples:

L(X, P) = η Σ_{ijk: pijk=1} max[1 − (xijk − max(0, dcj − ticjk)), 0]^2 + (1 − η) Σ_{ijk: pijk=0} l(xijk, 0),

where l(x, c) = (x − c)^2 denotes the squared loss.
In addition, the form utility tensor X should be of low-rank to capture temporal dynamics of users’ interests, which are generally believed to be dictated by a small number of latent factors [22]. By combining asymmetric sampling and the low-rank property together, we jointly recover the tensor X and the inter-purchase duration vector d by solving the following tensor nuclear norm minimization (TNNM) problem:

min_{X ∈ R^{m×n×l}, d ∈ R^r_+}  η Σ_{ijk: pijk=1} max[1 − (xijk − max(0, dcj − ticjk)), 0]^2 + (1 − η) Σ_{ijk: pijk=0} xijk^2 + λ ||X||_*,   (2)

where ||X||_* denotes the tensor nuclear norm, a convex combination of nuclear norms of X’s unfolded matrices [21]. Given the learned X̂ and d̂, the underlying binary tensor Y can be recovered by (1).
We note that although the TNNM problem (2) can be solved by optimization techniques such as block coordinate descent [21] and ADMM [10], these methods suffer from high computational cost since they must be run iteratively with multiple SVDs at each iteration. An alternative way to solve the problem is tensor factorization [16]. However, this also involves iterative singular vector estimation and is thus not scalable enough.
As a typical example, recovering a rank-20 tensor of size 500 × 500 × 500 takes the state-of-the-art tensor factorization algorithm TenALS4 more than 20,000 seconds on an Intel Xeon 2.40 GHz processor with 32 GB main memory.

3.2 A Scalable Relaxation

In this subsection, we discuss how to significantly improve the scalability of the proposed demand-aware recommendation model. To this end, we assume that an individual’s form utility does not change over time, an assumption widely used in many collaborative filtering methods [25, 32]. Under this assumption, the tensor X is a repeated copy of its frontal slice x::1, i.e.,

X = x::1 ◦ e,   (3)

where e is an l-dimensional all-one vector and the symbol ◦ represents the outer product operation. In this way, we can relax the problem of learning a third-order tensor X to the problem of learning its frontal slice, which is a second-order tensor (matrix). For notational simplicity, we use a matrix X to denote the frontal slice x::1, and use xij to denote the entry (i, j) of the matrix X. Since X is a low-rank tensor, its frontal slice X should be of low-rank as well. Hence, the minimization problem (2) simplifies to:

min_{X ∈ R^{m×n}, d ∈ R^r_+}  f(X, d) := η Σ_{ijk: pijk=1} max[1 − (xij − max(0, dcj − ticjk)), 0]^2 + (1 − η) Σ_{ijk: pijk=0} xij^2 + λ ||X||_*,   (4)

where ||X||_* stands for the matrix nuclear norm, the convex surrogate of the matrix rank function.
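For concreteness, the relaxed objective f(X, d) of (4) can be evaluated by summing the hinge terms over the observed purchases only and handling the many zero entries in closed form, which is precisely the sparsity the optimization in Section 4 exploits. A sketch with our own (hypothetical) helper and argument names:

```python
import numpy as np

def relaxed_objective(X, d, purchases, cat, t, l, eta=0.5, lam=0.1):
    """Evaluate f(X, d) from problem (4). Illustrative only, not the paper's code.

    purchases: list of (i, j, k) triplets with p_ijk = 1
    cat[j]: category of item j; t[i, kappa, k]: time since user i's last
    purchase in category kappa; l: number of time slots.
    """
    pos, pos_sq = 0.0, 0.0
    for (i, j, k) in purchases:
        gap = max(0.0, d[cat[j]] - t[i, cat[j], k])   # time-utility penalty
        pos += max(1.0 - (X[i, j] - gap), 0.0) ** 2   # hinge on positives
        pos_sq += X[i, j] ** 2
    # Sum of x_ij^2 over the zero entries, in closed form: each (i, j)
    # appears in l slots, minus its positive occurrences.
    neg = l * float(np.sum(X ** 2)) - pos_sq
    nuclear = float(np.sum(np.linalg.svd(X, compute_uv=False)))
    return eta * pos + (1.0 - eta) * neg + lam * nuclear
```

Note the loop touches only ||P||_0 entries, while the term over the mnl − ||P||_0 zeros costs a single dense sum over the m × n matrix.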
By relaxing the optimization problem (2) to the problem (4), we recover a matrix instead of a tensor to infer users’ purchase intentions.

4http://web.engr.illinois.edu/~swoh/software/optspace/code.html

4 Optimization

Although the learning problem has been relaxed, optimizing (4) is still very challenging for two main reasons: (i) the objective is highly non-smooth with nested hinge losses, and (ii) it contains mnl terms, so a naive optimization algorithm will take at least O(mnl) time.
To address these challenges, we adopt an alternating minimization scheme that iteratively fixes one of d and X and minimizes with respect to the other. Specifically, we propose an extremely efficient optimization algorithm by effectively exploiting the sparse structure of the tensor P and the low-rank structure of the matrix X. We show that (i) problem (4) can be solved within O(||P||_0 (k + log(||P||_0)) + (n + m)k^2) time, where k is the rank of X, and (ii) the algorithm converges to the critical points of f(X, d). In the following, we provide a sketch of the algorithm. The detailed description can be found in the supplementary material.

4.1 Update d

When X is fixed, the optimization problem with respect to d can be written as:

min_d  g(d) := Σ_{ijk: pijk=1} max[1 − (xij − max(0, dcj − ticjk)), 0]^2 = Σ_{ijk: pijk=1} gijk(dcj).   (5)

Problem (5) is non-trivial to solve since it involves nested hinge losses. Fortunately, by carefully analyzing the value of each term gijk(dcj), we can show that

gijk(dcj) = max(1 − xij, 0)^2,                 if dcj ≤ ticjk + max(xij − 1, 0),
gijk(dcj) = (1 − (xij − dcj + ticjk))^2,       if dcj > ticjk + max(xij − 1, 0).

For notational simplicity, we let sijk = ticjk + max(xij − 1, 0) for all triplets (i, j, k) satisfying pijk = 1.
Now we can focus on each category κ: for each κ, we collect the set Q = {(i, j, k) | pijk = 1 and cj = κ} and calculate the corresponding sijk values. We then sort them so that s_{i1 j1 k1} ≤ ··· ≤ s_{i|Q| j|Q| k|Q|}. On each interval [s_{iq jq kq}, s_{iq+1 jq+1 kq+1}], the objective is quadratic and can therefore be minimized in closed form. Therefore, by scanning the solution regions from left to right according to the sorted s values, and maintaining some intermediate computed variables, we are able to find the optimal solution, as summarized by the following lemma:

Lemma 1. The subproblem (5) is convex with respect to d and can be solved exactly in O(||P||_0 log(||P||_0)) time, where ||P||_0 is the number of nonzero elements in tensor P.

Therefore, we can efficiently update d since P is a very sparse tensor with only a small number of nonzero elements.

4.2 Update X

By defining

aijk = 1 + max(0, dcj − ticjk),   if pijk = 1,
aijk = 0,                         otherwise,

the subproblem with respect to X can be written as

min_{X ∈ R^{m×n}}  h(X) + λ ||X||_*,  where  h(X) := η Σ_{ijk: pijk=1} max(aijk − xij, 0)^2 + (1 − η) Σ_{ijk: pijk=0} xij^2.   (6)

Since there are O(mnl) terms in the objective function, a naive implementation will take at least O(mnl) time, which is computationally infeasible when the data is large. To address this issue, we use proximal gradient descent to solve the problem.
At each iteration, X is updated by

X ← S_λ(X − α∇h(X)),   (7)

where S_λ(·) is the soft-thresholding operator for singular values.5

5If X has the singular value decomposition X = UΣV^T, then S_λ(X) = U(Σ − λI)_+ V^T, where a_+ = max(0, a).

Table 1: CPU time for solving problem (4) with different numbers of purchase records

m (# users)   n (# items)   l (# time slots)   ||P||_0        k    CPU time (seconds)
1,000,000     1,000,000     1,000              11,112,400     10   595
1,000,000     1,000,000     1,000              43,106,100     10   1,791
1,000,000     1,000,000     1,000              166,478,000    10   6,496

In order to efficiently compute the top singular vectors of X − α∇h(X), we rewrite it as

X − α∇h(X) = [1 − 2α(1 − η)l] X + α (2(1 − η) Σ_{ijk: pijk=1} xij + 2η Σ_{ijk: pijk=1} max(aijk − xij, 0)) =: fa(X) + fb(X),

where fa(X) collects the dense low-rank part and fb(X) the sparse part supported on the purchases. Since fa(X) is of low-rank and fb(X) is sparse, multiplying (X − α∇h(X)) with a skinny m by k matrix can be computed in O(nk^2 + mk^2 + ||P||_0 k) time. As shown in [12], each iteration of proximal gradient descent for nuclear norm minimization only requires a fixed number of iterations of randomized SVD (or, equivalently, power iterations) using the warm start strategy, and thus we have the following lemma.

Lemma 2. A proximal gradient descent algorithm can be applied to solve problem (6) within O(nk^2 T + mk^2 T + ||P||_0 kT) time, where T is the number of iterations.

We note that the algorithm is guaranteed to converge to the true solution.
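A direct (dense) rendering of the proximal step (7) with the soft-thresholding operator of footnote 5 looks as follows; this sketch deliberately ignores the low-rank-plus-sparse splitting and the randomized SVD, which are what make the actual algorithm scale:

```python
import numpy as np

def soft_threshold_svd(Z, lam):
    """S_lam(Z) = U (Sigma - lam I)_+ V^T: shrink singular values by lam."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def prox_step(X, grad_h, alpha, lam):
    """One proximal gradient update X <- S_lam(X - alpha * grad_h)."""
    return soft_threshold_svd(X - alpha * grad_h, lam)

# Shrinking a diagonal matrix shows the operator's effect directly:
Z = np.diag([3.0, 1.0, 0.1])
print(np.round(soft_threshold_svd(Z, 0.5), 6))  # diag(2.5, 0.5, 0.0)
```

Singular values below λ are zeroed, so each step pushes the iterate toward a low-rank solution.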
This is because when we apply a fixed number of iterations to update X via problem (7), it is equivalent to an “inexact gradient descent update” where each gradient is approximately computed and the approximation error is upper bounded by a constant between zero and one. Intuitively speaking, when the gradient converges to 0, the error will also converge to 0 at an even faster rate. See [12] for the detailed explanations.

4.3 Overall Algorithm

Combining the two subproblems together, the time complexity of each iteration of the proposed algorithm is:

O(||P||_0 log(||P||_0) + nk^2 T + mk^2 T + ||P||_0 kT).

Remark: Since each user should make at least one purchase and each item should be purchased at least once to be included in P, n and m are smaller than ||P||_0. Also, since k and T are usually very small, the time complexity of solving problem (4) is dominated by the term ||P||_0, which is a significant improvement over the naive approach with at least O(mnl) complexity.
Since our problem has only two blocks d, X and each subproblem is convex, our optimization algorithm is guaranteed to converge to a stationary point [11]. Indeed, it converges very fast in practice. As a concrete example, our experiment shows that it takes only 9 iterations to optimize a problem with 1 million users, 1 million items, and more than 166 million purchase records.

5 Experiments

5.1 Experiment with Synthesized Data

We first conduct experiments with simulated data to verify that the proposed demand-aware recommendation algorithm is computationally efficient and robust to noise. To this end, we first construct a low-rank matrix X = WH^T, where W ∈ R^{m×10} and H ∈ R^{n×10} are random Gaussian matrices with entries drawn from N(1, 0.5), and then normalize X to the range of [0, 1].
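The synthetic form utility matrix just described can be generated as follows (a sketch at toy sizes; we read N(1, 0.5) as mean 1 and standard deviation 0.5, though it may equally denote the variance):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the paper goes up to m = n = 40,000 with rank 10.
m, n, rank = 200, 150, 10

W = rng.normal(loc=1.0, scale=0.5, size=(m, rank))
H = rng.normal(loc=1.0, scale=0.5, size=(n, rank))

X = W @ H.T                              # rank-10 form utility matrix
X = (X - X.min()) / (X.max() - X.min())  # normalize entries to [0, 1]

# Subtracting a constant raises the rank by at most one, so X stays low-rank.
print(X.shape, X.min(), X.max())  # (200, 150) 0.0 1.0
```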
We randomly assign all the n items to r categories, with their inter-purchase durations d equaling [10, 20, . . . , 10r]. We then construct the high purchase intention set Ω = {(i, j, k) | ticjk ≥ dcj and xij ≥ 0.5}, and sample a subset of its entries as the observed purchase records. We let n = m and vary them in the range {10,000, 20,000, 30,000, 40,000}. We also vary r in the range {10, 20, . . . , 100}. Given the learned durations d*, we use ||d − d*||_2 / ||d||_2 to measure the prediction errors.

Figure 1: Prediction errors ||d − d*||_2 / ||d||_2 as a function of (a) number of users/items, (b) number of categories, and (c) noise levels on synthetic datasets

Accuracy. Figures 1(a) and 1(b) clearly show that the proposed algorithm can perfectly recover the underlying inter-purchase durations with varied numbers of users, items, and categories. To further evaluate the robustness of the proposed algorithm, we randomly flip some entries in tensor P from 0 to 1 to simulate the rare cases of purchasing two items in the same category in close temporal succession. Figure 1(c) shows that when the ratio of noisy entries is not large, the predicted durations d̂ are close enough to the true durations, thus verifying the robustness of the proposed algorithm.
Scalability. To verify the scalability of the proposed algorithm, we fix the numbers of users and items to be 1 million and the number of time slots to be 1,000, and vary the number of purchase records (i.e., ||P||_0). Table 1 summarizes the CPU time of solving problem (4) on an Intel Xeon 2.40 GHz server with 32 GB main memory.
We observe that the proposed algorithm is extremely efficient; e.g., even with 1 million users, 1 million items, and more than 166 million purchase records, the running time of the proposed algorithm is less than 2 hours.

5.2 Experiment with Real-World Data

In the real-world experiments, we evaluate the proposed demand-aware recommendation algorithm by comparing it with six state-of-the-art recommendation methods: (a) M3F, maximum-margin matrix factorization [24], (b) PMF, probabilistic matrix factorization [25], (c) WR-MF, weighted regularized matrix factorization [14], (d) CP-APR, Candecomp-Parafac alternating Poisson regression [7], (e) Rubik, a knowledge-guided tensor factorization and completion method [30], and (f) BPTF, Bayesian probabilistic tensor factorization [31]. Among them, M3F and PMF are widely used static collaborative filtering algorithms. We include these two algorithms as baselines to assess whether traditional collaborative filtering algorithms are suitable for general e-commerce recommendation involving both durable and nondurable goods. Since they require explicit ratings as inputs, we follow [2] to generate numerical ratings based on the frequencies of (user, item) consumption pairs. WR-MF is essentially the positive-unlabeled version of PMF and has been shown to be very effective in modeling implicit feedback data. The other three baselines, i.e., CP-APR, Rubik, and BPTF, are tensor-based methods that can consider time utility when making recommendations. We refer to the proposed recommendation algorithm as Demand-Aware Recommender for One-Sided Sampling, or DAROSS for short.
Our testbeds are two real-world datasets, Tmall6 and Amazon Review7. Since some of the baseline algorithms are not scalable enough, we first conduct experiments on their subsets and then on the full set of Amazon Review.
In order to generate the subsets, we randomly sample 80 item categories from the Tmall dataset and select the users who have purchased at least 3 items within these categories, leading to the purchase records of 377 users and 572 items. For the Amazon Review dataset, we randomly select 300 users who have provided reviews in at least 5 item categories on Amazon.com. This leads to a total of 5,111 items belonging to 11 categories. Time information for both datasets is provided in days, and we have 177 and 749 time slots for the Tmall and Amazon Review subsets, respectively. The full Amazon Review dataset is significantly larger than its subset. After removing duplicate items, it contains more than 72 million product reviews from 19.8 million users and 7.7 million items that belong to 24 item categories. The collected reviews span a long range of time: from May 1996 to July 2014, which leads to 6,639 time slots in total. Compared with its subset, the full set is a much more challenging dataset, both due to its much larger size and its much higher sparsity; i.e., many reviewers only provided a few reviews, and many items were only reviewed a small number of times.
For each user, we randomly sample 90% of her purchase records as the training data, and use the remaining 10% as the test data.

6http://ijcai-15.org/index.php/repeat-buyers-prediction-competition
7http://jmcauley.ucsd.edu/data/amazon/

Figure 2: Prediction performance on real-world datasets Tmall and Amazon Review subsets: (a) category prediction, (b) purchase time prediction

Table 2: Estimated inter-review durations for Amazon Review subset

Category              d
Instant Video         0
Apps for Android      0
Automotive            326
Baby                  0
Beauty                0
Digital Music         0
Grocery ... Food      38
Musical Instruments   158
Office Products       94
Patio ... Garden      271
Pet Supplies          40
For each purchase record (u, i, t) in the test set, we evaluate all\nthe algorithms on two tasks: (i) category prediction, and (ii) purchase time prediction. In the \ufb01rst\ntask, we record the highest ranking of items that are within item i\u2019s category among all items at\ntime t. Since a purchase record (u, i, t) may suggest that in time slot t, user u needed an item\nthat share similar functionalities with item i, category prediction essentially checks whether the\nrecommendation algorithms recognize this need. In the second task, we record the number of slots\nbetween the true purchase time t and its nearest predicted purchase time within item i\u2019s category.\nIdeally, good recommendations should have both small category rankings and small time errors. Thus\nwe adopt the average top percentages, i.e., (average category ranking) / n \u00d7 100% and (average\ntime error) / l \u00d7 100%, as the evaluation metrics of category and purchase time prediction tasks,\nrespectively. The algorithms M3F, PMF, and WR-MF are excluded from the purchase time prediction\ntask since they are static models that do not consider time information.\nFigure 2 displays the predictive performance of the seven recommendation algorithms on Tmall and\nAmazon Review subsets. As expected, M3F and PMF fail to deliver strong performance since they\nneither take into account users\u2019 demands, nor consider the positive-unlabeled nature of the data. This\nis veri\ufb01ed by the performance of WR-MF: it signi\ufb01cantly outperforms M3F and PMF by considering\nthe PU issue and obtains the second-best item prediction accuracy on both datasets (while being\nunable to provide a purchase time prediction). By taking into account both issues, our proposed\nalgorithm DAROSS yields the best performance for both datasets and both tasks. Table 2 reports the\ninter-review durations of Amazon Review subset estimated by our algorithm. 
Although they may not perfectly reflect the true inter-purchase durations, the estimated durations clearly distinguish between durable good categories, e.g., automotive and musical instruments, and non-durable good categories, e.g., instant video, apps, and food. Indeed, the learned inter-purchase durations can also play an important role in applications more advanced than recommender systems, such as inventory management, operations management, and sales/marketing mechanisms. We do not report the estimated durations for Tmall here since the item categories are anonymized in that dataset.
Finally, we conduct experiments on the full Amazon Review dataset. In this study, we replace category prediction with a stricter evaluation metric, item prediction [8], which records the predicted ranking of item i among all items at time t for each purchase record (u, i, t) in the test set. Since most of our baseline algorithms fail to handle such a large dataset, we only obtain the predictive performance of three algorithms: DAROSS, WR-MF, and PMF. Note that for such a large dataset, prediction time instead of training time becomes the bottleneck: to evaluate average item rankings, we would need to compute the scores of all 7.7 million items, which is computationally inefficient. Therefore, we only sample a subset of items for each user and estimate the rankings of her purchased items. Using this evaluation method, the average item ranking percentages for DAROSS, WR-MF, and PMF are 16.7%, 27.3%, and 38.4%, respectively. In addition to its superior performance, our algorithm takes only 10 iterations and 1 hour to converge to a good solution. Since WR-MF and PMF are both static models, our algorithm is the only approach evaluated here that considers time utility while being scalable enough to handle the full Amazon Review dataset.
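The sampled-ranking evaluation used for the full dataset can be sketched as follows. This is our illustration under stated assumptions: the scoring function `score(user, item, t)` and the sample size are hypothetical, and the paper does not specify how the sampled count is rescaled:

```python
import random

def estimated_rank(score, user, target_item, t, all_items,
                   n_samples=1000, seed=0):
    """Estimate the target item's rank among all items by scoring
    only a random sample: count sampled items that beat the target,
    then rescale that fraction to the full item set."""
    rng = random.Random(seed)
    sample = rng.sample(list(all_items), min(n_samples, len(all_items)))
    target_score = score(user, target_item, t)
    n_higher = sum(score(user, i, t) > target_score for i in sample)
    # fraction of sampled items scoring higher, rescaled to all items
    return 1 + round(n_higher / len(sample) * (len(all_items) - 1))
```

This reduces the per-record cost from scoring all 7.7 million items to scoring a fixed-size sample, at the price of an unbiased but noisy rank estimate.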
Note that this dataset has more users, items, and time slots but fewer purchase records than our largest synthesized dataset, and its running time is lower than that of the latter. This clearly verifies that the time complexity of our algorithm is dominated by the number of purchase records rather than the tensor size. Interestingly, we found that some inter-review durations estimated from the full Amazon Review dataset are much smaller than the durations estimated from its subset. This is because the durations may be underestimated when many users review items within the same durable goods category in close temporal succession. On the other hand, this result verifies the effectiveness of the PU formulation: even if the durations are underestimated, our algorithm still outperforms the competitors by a considerable margin. As a final note, we point out that Tmall and Amazon Review may not take full advantage of the proposed algorithm, since (i) their categories are relatively coarse and may contain multiple sub-categories with different durations, and (ii) the time stamps of Amazon Review reflect the review time instead of the purchase time, and inter-review durations can differ from inter-purchase durations. By choosing a purchase history dataset with a more appropriate category granularity, we may obtain more accurate duration estimates and better recommendation performance.

6 Conclusion

In this paper, we examine the problem of demand-aware recommendation in settings where inter-purchase duration within item categories affects users' purchase intention in combination with intrinsic properties of the items themselves. We formulate it as a tensor nuclear norm minimization problem that seeks to jointly learn the form utility tensor and a vector of inter-purchase durations, and propose a scalable optimization algorithm with a tractable time complexity.
Our empirical studies show that the proposed approach yields perfect recovery of duration vectors in noiseless settings, is robust to noise, and scales as predicted by our theoretical analysis. On two real-world datasets, Tmall and Amazon Review, we show that our algorithm outperforms six state-of-the-art recommendation algorithms on the tasks of category, item, and purchase time prediction.

Acknowledgements

Cho-Jui Hsieh and Yao Li acknowledge the support of NSF IIS-1719097, TACC, and Nvidia.

References
[1] Gediminas Adomavicius and Alexander Tuzhilin. Context-aware recommender systems. In Recommender Systems Handbook, pages 217-253. Springer, New York, NY, 2011.
[2] Linas Baltrunas and Xavier Amatriain. Towards time-dependant recommendation based on implicit feedback. In Workshop on Context-Aware Recommender Systems, 2009.
[3] Jesús Bobadilla, Fernando Ortega, Antonio Hernando, and Jesús Bernal. A collaborative filtering approach to mitigate the new user cold start problem. Knowl.-Based Syst., 26:225-238, February 2012.
[4] Pedro G. Campos, Fernando Díez, and Iván Cantador. Time-aware recommender systems: A comprehensive survey and analysis of existing evaluation protocols. User Model. User-Adapt. Interact., 24(1-2):67-119, 2014.
[5] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772, 2009.
[6] Christopher Chatfield and Gerald J. Goodhardt. A consumer purchasing model with Erlang inter-purchase times. Journal of the American Statistical Association, 68(344):828-835, 1973.
[7] Eric C. Chi and Tamara G. Kolda. On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications, 33(4):1272-1299, 2012.
[8] Nan Du, Yichen Wang, Niao He, Jimeng Sun, and Le Song. Time-sensitive recommendation from recurrent user activities.
In NIPS, pages 3474-3482, 2015.
[9] Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In NIPS, pages 703-711, 2014.
[10] Silvia Gandy, Benjamin Recht, and Isao Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2):025010, 2011.
[11] Luigi Grippo and Marco Sciandrone. On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. Operations Research Letters, 26:127-136, 2000.
[12] Cho-Jui Hsieh and Peter A. Olsen. Nuclear norm minimization via active subspace selection. In ICML, 2014.
[13] Cho-Jui Hsieh, Nagarajan Natarajan, and Inderjit S. Dhillon. PU learning for matrix completion. In ICML, pages 2445-2453, 2015.
[14] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In ICDM, pages 263-272, 2008.
[15] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In ICDM, pages 263-272, 2008.
[16] Prateek Jain and Sewoong Oh. Provable tensor factorization with missing data. In NIPS, pages 1431-1439, 2014.
[17] Yehuda Koren. Collaborative filtering with temporal dynamics. Commun. ACM, 53(4):89-97, April 2010.
[18] Dokyun Lee and Kartik Hosanagar. Impact of recommender systems on sales volume and diversity. In Proc. Int. Conf. Inf. Syst., Auckland, New Zealand, December 2014.
[19] Bin Li, Xingquan Zhu, Ruijiang Li, Chengqi Zhang, Xiangyang Xue, and Xindong Wu. Cross-domain collaborative filtering over time. In IJCAI, pages 2293-2298, 2011.
[20] Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, and Philip S. Yu. Building text classifiers using positive and unlabeled examples. In ICML, pages 179-188, 2003.
[21] Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye. Tensor completion for estimating missing values in visual data.
IEEE Trans. Pattern Anal. Mach. Intell., 35(1):208-220, 2013.
[22] Atsuhiro Narita, Kohei Hayashi, Ryota Tomioka, and Hisashi Kashima. Tensor factorization using auxiliary information. In ECML/PKDD, pages 501-516, 2011.
[23] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI, pages 452-461, 2009.
[24] Jason D. M. Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, pages 713-719, 2005.
[25] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML, pages 880-887, 2008.
[26] Clayton Scott. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958-992, 2012.
[27] Robert L. Sexton. Exploring Economics. Cengage Learning, Boston, MA, 2013.
[28] Robert L. Steiner. The prejudice against marketing. J. Marketing, 40(3):2-9, July 1976.
[29] John Z. Sun, Dhruv Parthasarathy, and Kush R. Varshney. Collaborative Kalman filtering for dynamic matrix factorization. IEEE Trans. Signal Process., 62(14):3499-3509, July 2014.
[30] Yichen Wang, Robert Chen, Joydeep Ghosh, Joshua C. Denny, Abel N. Kho, You Chen, Bradley A. Malin, and Jimeng Sun. Rubik: Knowledge guided tensor factorization and completion for health data analytics. In SIGKDD, pages 1265-1274, 2015.
[31] Liang Xiong, Xi Chen, Tzu-Kuo Huang, Jeff G. Schneider, and Jaime G. Carbonell. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In SDM, pages 211-222, 2010.
[32] Jinfeng Yi, Rong Jin, Shaili Jain, and Anil K. Jain. Inferring users' preferences from crowdsourced pairwise comparisons: A matrix completion approach.
In First AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2013.