{"title": "Modeling Dynamic Missingness of Implicit Feedback for Recommendation", "book": "Advances in Neural Information Processing Systems", "page_first": 6669, "page_last": 6678, "abstract": "Implicit feedback is widely used in collaborative filtering methods for recommendation. It is well known that implicit feedback contains a large number of values that are \\emph{missing not at random} (MNAR); and the missing data is a mixture of negative and unknown feedback, making it difficult to learn user's negative preferences. \nRecent studies modeled \\emph{exposure}, a latent missingness variable which indicates whether an item is missing to a user, to give each missing entry a confidence of being negative feedback.\nHowever, these studies use static models and ignore the information in temporal dependencies among items, which seems to be a essential underlying factor to subsequent missingness. To model and exploit the dynamics of missingness, we propose a latent variable named ``\\emph{user intent}'' to govern the temporal changes of item missingness, and a hidden Markov model to represent such a process. The resulting framework captures the dynamic item missingness and incorporate it into matrix factorization (MF) for recommendation. We also explore two types of constraints to achieve a more compact and interpretable representation of \\emph{user intents}. 
Experiments on real-world datasets demonstrate the superiority of our method against state-of-the-art recommender systems.", "full_text": "Modeling Dynamic Missingness of Implicit Feedback for Recommendation

Menghan Wang, College of Computer Science, Zhejiang University, wangmengh@zju.edu.cn
Xiaolin Zheng*, College of Computer Science, Zhejiang University, xlzheng@zju.edu.cn
Mingming Gong, Department of Biomedical Informatics, University of Pittsburgh, mig73@pitt.edu
Kun Zhang, Department of Philosophy, Carnegie Mellon University, kunz1@cmu.edu

Abstract

Implicit feedback is widely used in collaborative filtering methods for recommendation. It is well known that implicit feedback contains a large number of values that are missing not at random (MNAR); and the missing data is a mixture of negative and unknown feedback, making it difficult to learn users' negative preferences. Recent studies modeled exposure, a latent missingness variable which indicates whether an item is exposed to a user, to give each missing entry a confidence of being negative feedback. However, these studies use static models and ignore the information in temporal dependencies among items, which seems to be an essential underlying factor of subsequent missingness. To model and exploit the dynamics of missingness, we propose a latent variable named "user intent" to govern the temporal changes of item missingness, and a hidden Markov model to represent such a process. The resulting framework captures the dynamic item missingness and incorporates it into matrix factorization (MF) for recommendation. We also explore two types of constraints to achieve a more compact and interpretable representation of user intents. 
Experiments on real-world datasets demonstrate the superiority of our method against state-of-the-art recommender systems.

1 Introduction

Collaborative filtering methods based on implicit feedback (e.g., purchase records and browsing history) are widely used in recommender systems. Compared to explicit feedback (e.g., 1-5 star ratings), implicit feedback is more abundant and accessible in real-world applications. However, the missing data of implicit feedback also brings two challenges. First, the data is missing not at random (MNAR). Only positive feedback is collected in implicit feedback and all negative feedback is missing, leading to a severely biased dataset. Second, the missing data is a mixture of negative and unknown feedback; a missing entry may indicate that the user either dislikes or does not know the item, which makes it hard to learn users' negative preferences. Several previous works [Hu et al., 2008, Marlin and Zemel, 2009] provided evidence that both ignoring missing data and treating all missing data as negative feedback will lead to biased recommendations.

*Corresponding author

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

A possible solution is to model the MNAR mechanism and treat the missing data properly. Several researchers have proposed various methods to address this issue. Popular methods [Hu et al., 2008, Pan et al., 2008] are based on the uniformity assumption that assigns a uniform weight to degrade the importance of the missing data, assuming that each missing entry is equally likely to be negative feedback. This is a strong assumption and limits models' flexibility for real applications. Recently, researchers [Liang et al., 2016, Wang et al., 2018a] treated missing entries differently with the so-called "exposure" variables and achieved improved results. 
An exposure variable indicates whether or not an item is exposed to a user. They make predictions in two steps: they first model exposure variables for each user to get the candidate items that are not missing, and then recommend top-ranked items in the set of candidate items based on user preferences.

However, these exposure-based missingness mechanisms are static, and the temporal dependencies among items, which can greatly influence subsequent missingness, are not utilized. Consider the following example. If a user has just bought a mobile phone, it is more likely that he/she will buy a suitable phone case next, so the missingness probabilities of candidate phone cases will be lower than if the user had not bought the phone. Moreover, the effect of item dependencies on missingness is asymmetric: the purchase of a phone case indicates that the user probably already owns a mobile phone, so the missingness probabilities of phones should be high during his/her next purchase. Thus the key to modeling the dynamic missingness is how to utilize the temporal information of implicit feedback to capture the asymmetric item dependencies. Instead of finding explicit item dependencies, we assume that the missingness of items for a user at one time is generated by a latent variable called "user intent", and that the dynamics of missingness are driven by a Markov process of user intents. In other words, user intents capture item relations implicitly and generate time-sensitive exposure variables.

In particular, in this paper, we use a hidden Markov model (HMM) to represent the dynamic missingness of implicit feedback, and the estimated missingness of items is incorporated into a probabilistic matrix factorization (MF) model for recommendation. 
To the best of our knowledge, the proposed framework, namely "H4MF", as a strategy of "leveraging HMM and MF to model the dynamic Missingness for recommendation," is the first work to address the dynamic missingness of implicit feedback in the recommendation area. The HMM and MF are seamlessly incorporated in H4MF, making the framework interpretable and extensible. Further, we propose a principled computational algorithm, showing promising results on real-world datasets.

2 Related Work

Missing data presents a common challenge for empirical sciences. Most prior studies on recommender systems assumed data is missing at random (MAR); however, Marlin and Zemel [2009] demonstrated that data in real recommender systems is not MAR and recommendation algorithms based on the MAR assumption may lead to biased results. Several studies have modeled different missingness mechanisms to address the MNAR problem. For explicit feedback, a widely accepted mechanism [Marlin and Zemel, 2009, Ling et al., 2012, Hernández-Lobato et al., 2014] is that missingness is related to the potential ratings (e.g., 1-5 star ratings): data for items with high ratings are less likely to be missing compared to items with low ratings. For implicit feedback, some causal-process-based methods [Liang et al., 2016, Wang et al., 2018a] first computed exposures for each user and then used them to guide rating prediction, which has shown promising results. Different from these studies, we address the MNAR problem with a dynamic missingness assumption.

Another related line of work is sequential recommendation, where researchers utilize temporal data for next-item recommendation. Existing sequential recommender systems mainly capture dynamic user preferences. A popular idea is to utilize Markov chains [He and McAuley, 2016] to model the sequential information. Rendle et al. 
[2010] proposed a factorized personalized Markov chain (FPMC) model that combines both a common Markov chain and a matrix factorization model. Sahoo et al. [2012] chose a hidden Markov model to capture the dynamics of user preferences for personalized recommendation. However, they did not consider the MNAR problem and the missing data is not well utilized. Some other researchers also used deep learning techniques (e.g., LSTM [Wu et al., 2017] and GRU [Chung et al., 2015]) for sequential recommendation; however, they are limited in interpretability. In this paper we assume user preferences are static and focus on modeling the dynamic missingness for the MNAR problem. Moreover, it is rather straightforward to extend our framework to capture dynamic user preferences with existing studies on online learning [Mairal et al., 2010].

Figure 1: Graphical model of the proposed model.

3 H4MF Framework

In this section, we first introduce the problem formulation and our proposed framework. Then we describe the parameter inference and the prediction formula in detail.

Problem Formulation. Suppose we have N users and M items. For each user i, a T-length rating history in chronological order is given as Y_i = {y^1_i, y^2_i, ..., y^T_i}, where y^t_i denotes the item that user i rated at time t (note that a rating denotes implicit feedback in this paper). The goal of recommender systems is to predict which item the user will rate next, more specifically, y^{T+1}_i.

Before describing our model, we first introduce the representation of y^t_i and the definition of missingness variables, which can help to understand the proposed dynamic missingness mechanism. We represent y^t_i as an M × 1 rating vector. As one user can only rate one of the M items at one time, there is a "1" in one position of y^t_i and "0" elsewhere. 
Thus the missing data of implicit feedback refers to the "0" entries, which contain negative and unknown feedback. For each y^t_{ij} in the dataset, we use a Bernoulli missingness variable α^t_{ij} (same as the exposure variable in [Liang et al., 2016]) to indicate the missingness: α^t_{ij} = 1 means item j is exposed to user i at time t, and α^t_{ij} = 0 means the user does not see the item. The missingness variables have a reasonable interpretation: users first have to see the items, then they have the possibility to rate them. Thus α^t_{ij} can be utilized to extract negative feedback from the missing data: if user i has seen item j (α^t_{ij} = 1) but the rating y^t_{ij} is 0, this rating is more likely to be negative feedback rather than unknown feedback, which can be further utilized to learn user preference. Note that α^t_{ij} may be different for different t, and our model aims to capture its dynamics.

Model Description. We assume that user intent and user preference work together for recommendation: user intent determines the missingness of items, and user preference determines recommendations from the non-missing items. In this paper we propose a framework named "H4MF" that combines HMM and MF to model the dynamic Missingness for recommendation. As shown in Figure 1, H4MF has two components: the User Intent Component and the User Preference Component. In the User Intent Component we use a first-order hidden Markov model to capture the missingness mechanism. α^t is an M × 1 missingness vector of items at time t generated by a latent state variable S^t (named "user intent"), and the probability of S^t depends only on the last state S^{t-1}. The user intent is a single categorical random variable that can take one of D discrete values, S^t ∈ {1, ..., D}. We assume that user intents are shared by all users, so the generated α^t_j represents α^t_{ij} for all possible users. 
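To make the representation concrete, here is a small NumPy sketch (our own illustration, not from the paper; the dimensions, seed, and Beta parameters are made up) of a one-hot rating vector y^t, a Bernoulli missingness vector α^t drawn from intent-specific exposure probabilities, and the candidate negative items it reveals:

```python
import numpy as np

rng = np.random.default_rng(0)
M, D = 5, 3          # items, user intents (toy sizes)

# One-hot rating vector y_t: the user rated item 2 at time t.
y_t = np.zeros(M)
y_t[2] = 1.0

# mu[s, j] = p(alpha_j = 1 | S = s): per-intent exposure probabilities,
# drawn here from a Beta prior as in the model.
mu = rng.beta(1.0, 2.0, size=(D, M))

# Given the current intent, sample the Bernoulli missingness vector.
s_t = 1
alpha_t = rng.binomial(1, mu[s_t])

# An exposed item (alpha = 1) with a "0" rating is a candidate negative.
candidate_negatives = np.where((alpha_t == 1) & (y_t == 0))[0]
```

The last line is exactly the "seen but not rated" reading of the missingness variables described above.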
The state transitions follow a categorical distribution, and the conditional observation distribution is defined as:

p(y^t_i | S^t, P) = ∏_{j=1}^{M} Σ_{α^t_{ij} ∈ {0,1}} p(y^t_{ij} | α^t_{ij}, P) p(α^t_{ij} | S^t)    (1)

In the User Preference Component, we adopt a classical but effective matrix factorization model [Mnih and Salakhutdinov, 2008]: the user preference P ∈ R^{N×M} is decomposed as a product of two submatrices U ∈ R^{K×N} and V ∈ R^{K×M}, which represent user-specific and item-specific latent feature factors respectively. More specifically, we use P_ij = U_i^T V_j to show the preference of user i toward item j. 
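As a concrete reading of Eq. (1), the sketch below (our own illustration, with a hand-rolled Gaussian density and toy inputs) evaluates log p(y^t_i | S^t, P) by marginalizing each item's Bernoulli exposure variable:

```python
import numpy as np

def gaussian_pdf(x, mean, precision):
    return np.sqrt(precision / (2 * np.pi)) * np.exp(-0.5 * precision * (x - mean) ** 2)

def emission_logprob(y_t, mu_s, p_i, lam_y):
    """log p(y_t | S_t, P): per item, marginalize the exposure a in {0, 1}
    as in Eq. (1)."""
    # a = 1: the item is exposed, so the rating is Gaussian around P_ij.
    exposed = mu_s * gaussian_pdf(y_t, p_i, lam_y)
    # a = 0: the item is missing, so the rating is deterministically 0.
    missing = (1.0 - mu_s) * (y_t == 0)
    return float(np.log(exposed + missing).sum())
```

With a one-hot y_t, a preference row that matches the rated item yields a higher emission log-probability than one that contradicts it, which is what the HMM exploits when inferring intents.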
The conditional distribution over the observed ratings Y_i (the likelihood term) for user i and the prior distributions are given by:

p(Y_i | α^T_i, P) = ∏_{t=1}^{T} ∏_{j=1}^{M} [ α^t_{ij} N(y^t_{ij} | P_ij, λ_y^{-1}) + (1 - α^t_{ij}) I[y^t_{ij} = 0] ],
p(α^t_{ij} | S^t) = Bernoulli(I_s(μ^t_j)),  μ^t_j ∼ Beta(a^t, b^t),
p(U | λ_u) = ∏_{i=1}^{N} N(U_i | 0, λ_u^{-1} I_K),  p(V | λ_v) = ∏_{j=1}^{M} N(V_j | 0, λ_v^{-1} I_K),    (2)

where N(x | μ, λ^{-1}) denotes the Gaussian distribution with mean μ and precision λ, and I[y^t_{ij} = 0] is the indicator function that evaluates to 1 when y^t_{ij} = 0 is true, and 0 otherwise. I_s(μ^t_j) indicates that μ^t_j is S^t-specific. I_K stands for the identity matrix of dimension K. p(Y_i | α^T_i, P) can be interpreted as follows: when α^t_{ij} = 0, the rating is missing so y^t_{ij} is definitely 0; when α^t_{ij} = 1, the rating is not missing so y^t_{ij} is either 0 or 1, depending on the user preference P_ij. In this paper we present our method and its inference for the case of one user's sequential records, but it is straightforward to apply them to the multiple-user case. Note that users have variable-length rating records, so T is not a fixed number across users.

Next we explain the underlying design of H4MF. We choose an HMM for user intent because an HMM can well utilize the temporal data to mine the asymmetric item dependencies, and the latent states (user intents) can be shared by all users, which simplifies the structure of the missingness mechanism. We choose MF for user preference because MF can model a low-dimensional representation for both users and items, which has been proved effective in recommender systems. Meanwhile, H4MF is more explainable and reasonable with this modular structure. 
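A toy generative sketch of the model in Eqs. (1)-(2) (our own illustration; all dimensions, priors, and the greedy rating choice are made-up simplifications): sample static MF preferences, then roll the intent chain forward and emit one rated item per step from the exposed set.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, D, T = 6, 2, 3, 5

# Static user preference P_ij = U_i^T V_j with Gaussian priors (Eq. 2).
U = rng.normal(0.0, 1.0, size=(K, 1))
V = rng.normal(0.0, 1.0, size=(K, M))
P = (U.T @ V)[0]                       # preference row for the single user

# HMM over user intents: transitions and per-intent exposure probabilities.
A = rng.dirichlet(np.ones(D), size=D)  # A[s] = p(S_{t+1} | S_t = s)
mu = rng.beta(1.0, 2.0, size=(D, M))   # mu[s, j] = p(alpha_j = 1 | S = s)

s = int(rng.integers(D))
history = []
for t in range(T):
    alpha = rng.binomial(1, mu[s])           # item missingness at time t
    scores = np.where(alpha == 1, P, -1e9)   # only exposed items can be rated
    history.append(int(np.argmax(scores)))   # greedily rate the best exposed item
    s = int(rng.choice(D, p=A[s]))           # intent follows the Markov chain
```

The greedy `argmax` (and its fallback to item 0 in the rare case nothing is exposed) stands in for the stochastic rating choice only to keep the sketch short; the point is the factorization into a preference part and an intent-driven exposure part.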
Most existing sequential recommendation algorithms [Xiang et al., 2010, Shi et al., 2014] only used "dynamic preference" to account for temporal user behaviors; they assumed that time-varying user preference is the only explanation for the noisy user behaviors. In this case the learned user preference will fluctuate rapidly and be difficult to explain.

Although a dynamic preference model might seem to make H4MF more realistic, we assume user preference to be static for two main reasons. 1) User preference evolves steadily and is rather stable compared to user intent. Moore et al. [2013] visualized dynamic user preference via trajectories. Their results show that user preferences change steadily and slowly over long periods (at the month level), especially for older users. In contrast, user intent changes with every user-item interaction in H4MF. So it is reasonable to choose static preferences in H4MF. 2) Simplicity of inference is a concern. As our goal is to explore the effects of dynamic missingness on recommender systems, using MF also makes the comparison to baselines fair.

Parameter Inference. We choose expectation-maximization (EM) to find the maximum a posteriori (MAP) estimates of the parameters of H4MF. In the E-step, we compute the expected log posterior of the observed data and the user intents, which is:

log p(α^T_i, P | S^T, Y_i) ∝ log p(Y_i | α^T_i, P) + log p(α^T_i | S^T) + log p(S^T) + log p(P)    (3)

The term log p(P) is computed as log p(U | λ_u) + log p(V | λ_v), and log p(α^T_i | S^T) is computed as log I_s(μ^t_j) + log p(I_s(μ^t_j) | a^t, b^t); we add a prior to regularize the μ^t_j. As the exact expectation of the HMM is computationally intractable, we use Gibbs sampling to infer the posterior probabilities of S^t. For a given rating sequence {Y^t_i} by user i, 
S^t is sampled from

p(S^t | S^{t-1}, S^{t+1}, Y_i, α^t_i, P) ∝ p(S^t | S^{t-1}) p(S^{t+1} | S^t) p(y^t_i | α^t_i, P),    (4)

where p(S^t | S^{t-1}) and p(S^{t+1} | S^t) can be obtained from the state transition matrix of the HMM, and the expectation of the log likelihood of one rating record y^t_i is given by:

log p(y^t_i | α^t_i, P) = Σ_{j=1}^{M} log( y^t_{ij} μ^t_j N(1 | U_i^T V_j, λ_y^{-1}) / Z_ij + (1 - y^t_{ij}) (1 - μ^t_j + μ^t_j N(0 | U_i^T V_j, λ_y^{-1}) / Z_ij) ),    (5)

with Z_ij = N(0 | U_i^T V_j, λ_y^{-1}) + N(1 | U_i^T V_j, λ_y^{-1}).

In the M-step, we maximize the log posterior with respect to μ, U, V, and {S^t}. We use gradient ascent to update μ, and compute the optimal U and V by setting their derivatives to zero. The details are included in Appendix 1.1. Note that we update U and V and fix the hyperparameters λ_u, λ_v, and λ_y. This strategy follows the original PMF [Mnih and Salakhutdinov, 2008] for simplification. For the user intents {S^t}, we use the Baum-Welch algorithm [Ghahramani and Jordan, 1996] to update the transition matrix and the initial state probability distribution of the HMM; as a strict EM-type algorithm it is guaranteed to converge to at least a local maximum.

Making Prediction. In the recommendation phase we are interested in the prediction of y^{T+1}_i for user i given his/her previous rating records. 
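The single-site Gibbs update of Eq. (4) above can be sketched as follows (our own illustration; the emission log-likelihoods `emis_ll[t, s] = log p(y_t | S_t = s)` are assumed precomputed, e.g., via Eq. (5)):

```python
import numpy as np

def gibbs_resample_intent(t, states, A, emis_ll, rng):
    """Resample S_t from p(S_t | S_{t-1}, S_{t+1}, y_t), Eq. (4):
    proportional to p(S_t|S_{t-1}) * p(S_{t+1}|S_t) * p(y_t|S_t)."""
    log_p = emis_ll[t].copy()                 # log p(y_t | S_t = s), for each s
    if t > 0:
        log_p += np.log(A[states[t - 1]])     # transition into S_t
    if t < len(states) - 1:
        log_p += np.log(A[:, states[t + 1]])  # transition out of S_t
    p = np.exp(log_p - log_p.max())           # subtract max for numerical stability
    return int(rng.choice(len(p), p=p / p.sum()))
```

Sweeping this update over t = 1, ..., T (and over users) yields posterior samples of the intent sequence used in the E-step.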
We make predictions by integrating out the uncertainty from the missingness variable α^{T+1}_j:

E_y[y^{T+1}_{ij} | P_ij] = E_α[ E_y[y^{T+1}_{ij} | α^{T+1}_j, P_ij] ]
                        = Σ_{α^{T+1}_j ∈ {0,1}} p(α^{T+1}_j) E_y[y^{T+1}_{ij} | α^{T+1}_j, U_i, V_j]
                        = μ^{T+1}_j · U_i^T V_j    (6)

where μ^{T+1}_j is determined by the next user intent S^{T+1}, which can be predicted with the forward algorithm of the HMM.

4 Further Constraints on Items

Currently all missingness variables {α^t_j} share symmetric Beta priors. One potential drawback is that the learned user intents may be redundant, and items under the same user intent tend to have similar missing probabilities. In this section we define two kinds of constraints, namely the inner constraint and the outer constraint, to specialize the Beta priors of the missingness variables of different items under different user intents. The intuitions are simple but reasonable: items have relations under the same user intent and their exposure variables are related. We use the inner constraint to denote the influences from other items under the same user intent on one item's missingness. Meanwhile, the missingness of one item under different user intents should follow some patterns to reduce the redundancy; we use the outer constraint to denote the influences from the same item under different user intents on one item's missingness. 
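Returning to the prediction step: Eq. (6) above scores each item by multiplying the preference U_i^T V_j with the exposure probability under the predicted next intent. A sketch (ours, not the paper's code; it runs the HMM forward recursion in log space and takes the expectation of μ over p(S^{T+1}), which is one way to read the paper's description):

```python
import numpy as np

def predict_scores(obs_ll, A, pi, mu, p_i):
    """Eq. (6): E[y_{T+1,j}] = mu_j^{T+1} * U_i^T V_j, with mu^{T+1}
    obtained from the forward algorithm plus one transition step."""
    # Forward recursion over the user's T observations (log space).
    log_alpha = np.log(pi) + obs_ll[0]
    for t in range(1, obs_ll.shape[0]):
        log_alpha = obs_ll[t] + np.logaddexp.reduce(
            log_alpha[:, None] + np.log(A), axis=0)
    # Posterior over S_T, then one step of the chain gives p(S_{T+1}).
    post_T = np.exp(log_alpha - np.logaddexp.reduce(log_alpha))
    mu_next = (post_T @ A) @ mu          # expected exposure probability per item
    return mu_next * p_i                 # exposure-weighted preference scores
```

With a uniform chain and identical μ rows this reduces to μ times the preference row, as expected from Eq. (6).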
We adopt a simple implementation: we update the Beta priors in every M-step as follows:

a^d_new ← a^d_ini + σ^d_j λ_Inner + ω^d_j λ_Outer,  b^d_new ← b^d_ini + λ_Inner + λ_Outer,  d ∈ {1, ..., D}    (7)

where a^d_ini and b^d_ini are the initial Beta priors, λ_Inner and λ_Outer are the scale parameters, σ^d_j = (#records of item j under user intent d) / (#total records under user intent d) indicates the occurrence probability of item j with respect to other items under user intent d, and ω^d_j = (#records of item j under user intent d) / (#total records of item j) indicates the occurrence probability of item j being "triggered" by user intent d. Then a^d and b^d are not global constants during the EM procedure and play a constraint role: items with similar occurrence probabilities under the same user intent will have similar Beta priors. Compared with putting constraints directly on μ^d_j, this strategy avoids sophisticated inference, and later experiments demonstrate its effectiveness. In the experiments we denote this constrained version as H4MFc.

5 Experimental Results

In this section we describe the datasets and experimental settings, evaluate the performance results, and analyze the user intents and the item constraints.

5.1 Datasets and Settings

We evaluate the performance of our method on three real-world datasets: 1) the MovieLens-100K dataset (~100 thousand ratings from 943 users on 1,682 movies), collected during the seven-month period from September 19th, 1997 through April 22nd, 1998; 2) the MovieLens-1M dataset (~1 million ratings from 6,040 users on 3,706 movies), collected from April 25th, 2000 through February 28th, 2003; and 3) the LastFM dataset (~100 thousand ratings from 1,892 users on 17,632 artists), whose time period is from August 1st, 2005 through May 1st, 2011. 
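The prior update in Eq. (7) above only needs the intent-item count matrix; a minimal sketch (ours; `counts[d, j]` is assumed to hold the number of records of item j assigned to intent d):

```python
import numpy as np

def update_beta_priors(counts, a_ini, b_ini, lam_inner, lam_outer):
    """Constrained Beta-prior update of Eq. (7)."""
    # sigma[d, j]: share of item j among all records under intent d (inner).
    sigma = counts / counts.sum(axis=1, keepdims=True)
    # omega[d, j]: share of intent d among all records of item j (outer).
    omega = counts / counts.sum(axis=0, keepdims=True)
    a_new = a_ini + sigma * lam_inner + omega * lam_outer
    b_new = b_ini + lam_inner + lam_outer   # same shift for every item and intent
    return a_new, b_new
```

Because only the a parameter is item- and intent-specific, items that co-occur heavily under an intent get a larger prior exposure probability there, which is the intended constraint effect.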
We transform the two MovieLens datasets into implicit data by setting ratings that are ≥ 3 to "1" and the others to "0". We then choose four prevalent methods for comparison: (1) PMF [Mnih and Salakhutdinov, 2008], a classical matrix factorization approach that is widely applied as a benchmark; (2) WMF [Hu et al., 2008], a standard matrix factorization model for implicit data, which uses a simple heuristic where all unobserved user-item interactions are equally down-weighted against the observed interactions; (3) FPMC [Rendle et al., 2010], a sequential recommendation algorithm based on personalized transition graphs over underlying Markov chains, which uses a variant of Bayesian Personalized Ranking (BPR) [Rendle et al., 2009] for optimization; and (4) ExpoMF [Liang et al., 2016], a probabilistic approach that incorporates user exposure to items into collaborative filtering. The baselines are chosen for the following reasons: PMF and FPMC can be seen as sub-models of H4MF, while they overlook the missing data problem; WMF treats the missing data as a MAR problem; and ExpoMF takes a static approach to the MNAR problem. The main goal of the experiments is to show that how we treat the missing data makes a difference.

We adopt Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) to measure the item ranking accuracy of different algorithms. HR measures whether the ground-truth item is present on the ranked list, while NDCG measures the ranking quality by considering the positions of hits. We follow the definitions of HR and NDCG in [He et al., 2015]. In our study we always report the averaged HR and NDCG across users. We split each dataset for experiments with the following strategy: we first sort the historical ratings of each user by time order; then the last records of users are used as test data, the second-to-last records are used as validation data, and the remaining records are used for training. 
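For reference, the leave-one-out versions of these metrics for a single held-out item per user can be computed as below (our sketch, following the common single-ground-truth form of He et al. [2015]):

```python
import numpy as np

def hr_ndcg_at_k(ranked_items, target, k):
    """HR@K and NDCG@K when each user has one held-out ground-truth item."""
    topk = list(ranked_items)[:k]
    if target not in topk:
        return 0.0, 0.0
    rank = topk.index(target)              # 0-based position of the hit
    return 1.0, 1.0 / np.log2(rank + 2.0)  # DCG of one relevant item; IDCG = 1
```

Averaging these per-user values across all users gives the reported HR@K and NDCG@K.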
We search for the optimal parameters to maximize the performance on validation data and evaluate the model on test data. For the parameters of baseline models, we refer to their original papers and follow their tuning strategies.

5.2 Analysis of Prediction Performance

We report the performance of our methods and baseline models with optimal parameters. For PMF, we set K = 10. For WMF, we set K = 10 and α = 0.4. For ExpoMF, we set λ_θ = 0.01, λ_β = 0.01, λ_y = 0.01, and K = 30. For our models, we set λ_θ = 0.1, λ_β = 0.1, λ_y = 0.1, K = 30, a^d_ini = 1, and b^d_ini = 2. For the item constraints, λ_Outer is set to 1, and λ_Inner is set to 10, 1, and 0.1 for MovieLens-100K, MovieLens-1M, and LastFM, respectively. We show the performance of our methods and the baseline models in Table 1. As shown in the results, H4MFc achieves higher item ranking accuracy than the other compared algorithms due to its capability of better capturing the missingness of implicit feedback. Note that PMF, WMF, ExpoMF, and H4MF model user preference similarly: they all use a basic matrix factorization method, and the main difference is the way they model the missing data. PMF performs poorly because the datasets are sparse and all the missing entries are treated as negative feedback, so the positive feedback is overwhelmed by negative feedback, leading to biased user preference learning. FPMC performs poorly for the same reason. Besides, it was originally proposed for next-basket recommendation; here we set the basket size to 1 as we do not have basket information, which also limits the effectiveness of FPMC. WMF is better than PMF as it treats the missing data with a globally fixed low confidence. ExpoMF models an exposure variable α_ui for every user-item pair, so it can capture more information from the missing data compared to WMF and PMF. 
H4MF is better than ExpoMF because it considers the dynamic missingness of items. Note that the experimental results of WMF, ExpoMF, and H4MF are very close; WMF even beats ExpoMF and H4MF on LastFM. This is because modeling missingness for each missing entry adds model complexity and is prone to overfitting. On the other hand, the superiority of H4MFc over H4MF proves the effectiveness of the user intent constraints.

5.3 Analysis of User Intents

This section analyzes user intents in three aspects: recommendation overlaps, sensitivity of the user intent number, and interpretation of user intents.

Recommendation Overlaps. In H4MF, we use user preference and user intent for recommendation. For a particular user with fixed preference, we sample different user intents and see how different the recommendation lists are. We use the term "recommendation overlap" to denote the ratio of common items in the Top-N recommendation lists generated by two different user intents. 
A large recommendation overlap indicates that the two user intents have similar missingness mechanisms. We choose N = 10 and show the average of recommendation overlaps across users in Table 2. We can see that the recommendation overlaps of H4MFc are much smaller than those of H4MF, proving that the item constraints can reduce the redundancy of user intents. Meanwhile, the recommendation overlaps decrease both in H4MF and in H4MFc when D increases. 

Dataset          Metrics    PMF     FPMC    WMF     ExpoMF  H4MF    H4MFc
MovieLens-100K   HR@10      0.0031  0.0021  0.1251  0.1230  0.1317  0.1569
                 HR@50      0.0296  0.0212  0.3968  0.3478  0.3990  0.4347
                 NDCG@10    0.0011  0.0007  0.0501  0.0616  0.0583  0.0779
                 NDCG@50    0.0066  0.0046  0.1203  0.1101  0.1205  0.1367
MovieLens-1M     HR@10      0.0021  0.0034  0.0791  0.0801  0.0805  0.0877
                 HR@50      0.0093  0.0129  0.2696  0.2808  0.2704  0.3049
                 NDCG@10    0.0008  0.0087  0.0372  0.0331  0.0408  0.0435
                 NDCG@50    0.0022  0.0549  0.0800  0.0675  0.0811  0.0897
LastFM           HR@10      0.0012  0.0021  0.0835  0.0736  0.0799  0.0945
                 HR@50      0.0037  0.0360  0.2144  0.1824  0.1980  0.2298
                 NDCG@10    0.0004  0.0008  0.0432  0.0352  0.0423  0.0495
                 NDCG@50    0.0009  0.0074  0.0713  0.0575  0.0639  0.0789

Table 1: Performance of different models on three datasets.

Figure 2: Performances of proposed models with different numbers of user intents (D).
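The overlap measure used above is simply the fraction of shared items between two Top-N lists; for example (our sketch):

```python
def recommendation_overlap(list_a, list_b):
    """Ratio of common items between two equal-length Top-N lists."""
    if len(list_a) != len(list_b):
        raise ValueError("expected two Top-N lists of the same length")
    return len(set(list_a) & set(list_b)) / len(list_a)
```

This value is computed per user for each pair of sampled intents and then averaged across users.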
This result conforms to our expectations because our methods can capture more aspects of user intents with a larger D.

Table 2: Recommendation overlaps of different user intents (U1, U2, and U3 indicate the indices of user intents) on the three datasets, for H4MF and H4MFc with D = 2 and D = 3. The cases of D = 4 and D = 5 are included in Appendix 1.2.

Sensitivity of User Intent Number. The number of user intents D is vital to the performance of H4MF. We varied D to train H4MF and present the prediction results in Figure 2. We can see that H4MFc performs consistently better than H4MF on all three datasets. The optimal D is 2 on all three datasets; when D increases beyond 2, the performance decreases monotonically. Note that in the last paragraph we found that the recommendation overlaps decrease when D increases, but this does not guarantee better recommendation performance because a large D also adds model complexity.

Interpretation of User Intents. User intents can be utilized to interpret user behaviors and provide explainable recommendations. Table 3 shows a recommendation example of one user in MovieLens-100K under two different user intents. From the results we can see that the genres of the recommended movies under user intent 1 are mainly "Crime" and "Action", while the genres under user intent 2 are mainly "Comedy", "Romance", and "Drama" (note that the genre information is not used in model training). Thus we can infer that the user mainly has two tastes in movies. 
As H4MF can predict the user's next user intent, we will know which genres the user wants to see next and can provide more precise and interpretable recommendations.

User Intent 1:
1. Pulp Fiction (Crime, Drama)
2. Fargo (Crime, Drama, Thriller)
3. Star Wars (Action, Adventure, Sci-Fi, War)
4. The Full Monty (Comedy)
5. Contact (Drama, Sci-Fi)
6. The English Patient (Drama, Romance, War)
7. Four Weddings and a Funeral (Comedy, Romance)
8. The Fugitive (Action, Thriller)
9. The Princess Bride (Action, Adventure, Romance)
10. Raiders of the Lost Ark (Action, Adventure)

User Intent 2:
1. Little City (Comedy, Romance)
2. The Whole Wide World (Drama)
3. Maya Lin: A Strong Clear Vision (Documentary)
4. Savage Nights (Drama)
5. Beat the Devil (Comedy, Drama)
6. Ill Gotten Gains (Drama)
7. Withnail and I (Comedy)
8. The Inkwell (Comedy, Drama)
9. Fast, Cheap & Out of Control (Documentary)
10. Carrington (Drama, Romance)

Table 3: Top 10 recommendations for one user on MovieLens-100K under two user intents.

5.4 Effectiveness of Item Constraints

To evaluate the effectiveness of the item constraints, we tune λ_Inner and λ_Outer to observe how they influence the HR@50 of H4MFc. We fix the other parameters as described in Section 5.2 and show the results in Figure 3. The optimal parameters are λ_Inner = 10, λ_Outer = 1 for MovieLens-100K; λ_Inner = 1, λ_Outer = 1 for MovieLens-1M; and λ_Inner = 0.1, λ_Outer = 1 for LastFM. The optimal λ_Outer is around 1 for all three datasets; when it increases, the HR@50 decreases dramatically. Meanwhile, the optimal λ_Inner varies across datasets and the performance is less sensitive to changes of λ_Inner. 
One main reason is that the total number of item records under the user intents is huge when D is small, so the ratio measure σ^d_j is very small for all items and differs little across items, which limits the effectiveness of λInner. The black dashed lines show the performance of H4MF (λInner = 0 and λOuter = 0). We can conclude that H4MFc achieves improvements with proper constraints, which supports the effectiveness of the two item constraints.

(a) Movielens-100k

(b) Movielens-1M

(c) LastFM

Figure 3: Effectiveness of λInner and λOuter in H4MFc.

5.5 Discussions

In this section we first discuss the extensibility and efficiency of H4MF, and then discuss the utilization of item relations in recommendation.

Extensibility. User intent and user preference can be seen as a factorization of user behavior, which makes H4MF more modular and extensible: we can extend one component without touching the other. Moreover, both HMM and MF are well-studied techniques, and their variants can bring insights into H4MF. For example, we can use local low-rank MF [Lee et al., 2013] or mixture-rank matrix approximation [Li et al., 2017] to learn user preferences by exploiting the underlying group structure of users and items. We can also use a hidden semi-Markov model [Yu, 2010] to model the durations of user intents, since users often purchase several items to fulfill one intent.

Efficiency. A potential limitation of H4MF is its time complexity.
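A minimal pure-Python sketch of the HMM forward pass (toy two-state parameters; all values hypothetical) shows where this cost arises: every time step sums over D predecessor states for each of D current states.

```python
def hmm_forward(pi, A, B):
    """Forward pass of an HMM: returns p(x_1, ..., x_T).

    pi: length-D initial state distribution
    A:  D x D transition matrix, A[i][j] = p(z_t = j | z_{t-1} = i)
    B:  T x D emission likelihoods, B[t][j] = p(x_t | z_t = j)
    Each step does a D x D summation, so the total cost is O(T * D^2).
    """
    D, T = len(pi), len(B)
    alpha = [pi[j] * B[0][j] for j in range(D)]
    for t in range(1, T):
        alpha = [sum(alpha[i] * A[i][j] for i in range(D)) * B[t][j]
                 for j in range(D)]
    return sum(alpha)

# toy 2-state example
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
print(round(hmm_forward(pi, A, B), 4))  # prints 0.1393
```

The quadratic dependence on the number of states D and the linear dependence on the sequence length are exactly the factors discussed below.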
The inference of the HMM is a bottleneck: its theoretical complexity is O(T̂D²) for each iteration of the EM method, where T̂ is the length of the training data and D is the number of states. The experimental runtime results in Appendix 1.3 also show that the runtime increases dramatically as D and T̂ grow. In real-world applications customers' data are collected cumulatively, so T̂ will become very large. One possible extension is to devise an online version of H4MF; several existing studies on the online learning of HMM and MF [Mongillo and Deneve, 2008, Mairal et al., 2010] could be utilized to make H4MF more scalable.

Item relations in recommendation. Most recommendation algorithms focus on mining and utilizing item-similarity information. However, item similarity may lead to meaningless recommendations (e.g., the phone and phone-case example in the introduction). The key to addressing this issue is to find asymmetric relations between items. Several researchers [McAuley et al., 2015, Wang et al., 2018b] proposed methods to discriminate substitutes and complements from similar products, but their methods are supervised, and the ground-truth labels are directly extracted from user log files, which may contain biases and noise. A more principled approach is to apply causal discovery techniques to find the directed relations among items.
However, current causal discovery techniques (e.g., modified PC [Spirtes et al., 2000] and GES [Chickering and Meek, 2002]) may not work well on recommendation data, which is extremely sparse and MNAR. In our model, instead, the asymmetric relations of items are revealed from the temporal data through the dynamic missingness mechanism. In this regard, H4MF can be seen as a step from similarity-based toward causality-based recommendation.

6 Conclusion

In this paper we aimed to model and leverage the dynamic missingness of items to improve recommendation. We proposed a framework that seamlessly combines HMM and MF to model the dynamic missing mechanism of implicit feedback. To make the user intents less redundant, we introduced two types of constraints on the missingness variables. Empirical results on three datasets show that our method not only outperforms alternatives but also provides interpretable recommendations. Further analysis demonstrates the effectiveness of user intent and its constraints. Future work includes extending H4MF with recent advanced variants of HMM and MF.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (No. U1509221), the National Key Technology R&D Program (2015BAH07F01), and the Zhejiang Province key R&D program (No. 2017C03044). This material is partially based upon work supported by the United States Air Force under Contract No. FA8650-17-C-7715, by the National Science Foundation under EAGER Grant No. IIS-1829681, and by the National Institutes of Health under Contract No. NIH-1R01EB022858-01, FAIN R01EB022858, NIH-1R01LM012087, NIH-5U54HG008540-02, and FAIN-U54HG008540. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the United States Air Force, the National Institutes of Health, or the National Science Foundation.
We appreciate the comments from the anonymous reviewers, which helped to improve the paper.

References

David Maxwell Chickering and Christopher Meek. Finding optimal Bayesian networks. In UAI, pages 94–102, 2002.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In ICML, pages 2067–2075, 2015.

Zoubin Ghahramani and Michael I. Jordan. Factorial hidden Markov models. In NIPS, pages 472–478, 1996.

Ruining He and Julian McAuley. Fusing similarity models with Markov chains for sparse sequential recommendation. In ICDM, pages 191–200. IEEE, 2016.

Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. TriRank: Review-aware explainable recommendation by modeling aspects. In CIKM, pages 1661–1670, 2015.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani. Probabilistic matrix factorization with non-random missing data. In ICML, pages 1512–1520, 2014.

Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In ICDM, pages 263–272. IEEE, 2008.

Joonseok Lee, Seungyeon Kim, Guy Lebanon, and Yoram Singer. Local low-rank matrix approximation. In ICML, pages 82–90, 2013.

Dongsheng Li, Chao Chen, Wei Liu, Tun Lu, Ning Gu, and Stephen Chu. Mixture-rank matrix approximation for collaborative filtering. In NIPS, pages 477–485, 2017.

Dawen Liang, Laurent Charlin, James McInerney, and David M. Blei. Modeling user exposure in recommendation. In WWW, pages 951–961, 2016.

Guang Ling, Haiqin Yang, Michael R. Lyu, and Irwin King. Response aware model-based collaborative filtering. In UAI, pages 501–510, 2012.

Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. JMLR, 11(Jan):19–60, 2010.

Benjamin M. Marlin and Richard S. Zemel. Collaborative prediction and ranking with non-random missing data. In RecSys, pages 5–12, 2009.

Julian J. McAuley, Rahul Pandey, and Jure Leskovec. Inferring networks of substitutable and complementary products. In SIGKDD, pages 785–794, 2015.

Andriy Mnih and Ruslan R. Salakhutdinov. Probabilistic matrix factorization. In NIPS, pages 1257–1264, 2008.

Gianluigi Mongillo and Sophie Deneve. Online learning with hidden Markov models. Neural Computation, 20(7):1706–1716, 2008.

Joshua L. Moore, Shuo Chen, Douglas Turnbull, and Thorsten Joachims. Taste over time: The temporal dynamics of user preferences. In ISMIR, pages 401–406, 2013.

Rong Pan, Yunhong Zhou, Bin Cao, Nathan N. Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. One-class collaborative filtering. In ICDM, pages 502–511. IEEE, 2008.

Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI, pages 452–461. AUAI Press, 2009.

Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. Factorizing personalized Markov chains for next-basket recommendation. In WWW, pages 811–820. ACM, 2010.

Nachiketa Sahoo, Tridas Mukhopadhyay, et al. A hidden Markov model for collaborative filtering. Management Information Systems Quarterly, 36(4):1329–1356, 2012.

Yue Shi, Martha Larson, and Alan Hanjalic. Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges. CSUR, 47(1):3, 2014.

Peter Spirtes, Clark N. Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2000.

Menghan Wang, Xiaolin Zheng, Yang Yang, and Kun Zhang. Collaborative filtering with social exposure: A modular approach to social recommendation. In AAAI, pages 2516–2523, 2018a.

Zihan Wang, Ziheng Jiang, Zhaochun Ren, Jiliang Tang, and Dawei Yin. A path-constrained framework for discriminating substitutable and complementary products in e-commerce. In WSDM, pages 619–627, 2018b.

Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J. Smola, and How Jing. Recurrent recommender networks. In WSDM, pages 495–503. ACM, 2017.

Liang Xiang, Quan Yuan, Shiwan Zhao, Li Chen, Xiatian Zhang, Qing Yang, and Jimeng Sun. Temporal recommendation on graphs via long- and short-term preference fusion. In SIGKDD, pages 723–732. ACM, 2010.

Shun-Zheng Yu. Hidden semi-Markov models. Artificial Intelligence, 174(2):215–243, 2010.