{"title": "Learning Label Trees for Probabilistic Modelling of Implicit Feedback", "book": "Advances in Neural Information Processing Systems", "page_first": 2816, "page_last": 2824, "abstract": "User preferences for items can be inferred from either explicit feedback, such as item ratings, or implicit feedback, such as rental histories. Research in collaborative filtering has concentrated on explicit feedback, resulting in the development of accurate and scalable models. However, since explicit feedback is often difficult to collect it is important to develop effective models that take advantage of the more widely available implicit feedback. We introduce a probabilistic approach to collaborative filtering with implicit feedback based on modelling the user's item selection process. In the interests of scalability, we restrict our attention to tree-structured distributions over items and develop a principled and efficient algorithm for learning item trees from data. We also identify a problem with a widely used protocol for evaluating implicit feedback models and propose a way of addressing it using a small quantity of explicit feedback data.", "full_text": "Learning Label Trees for Probabilistic Modelling of\n\nImplicit Feedback\n\nAndriy Mnih\n\nYee Whye Teh\n\namnih@gatsby.ucl.ac.uk\n\nywteh@gatsby.ucl.ac.uk\n\nGatsby Computational Neuroscience Unit\n\nGatsby Computational Neuroscience Unit\n\nUniversity College London\n\nUniversity College London\n\nAbstract\n\nUser preferences for items can be inferred from either explicit feedback, such as\nitem ratings, or implicit feedback, such as rental histories. Research in collabora-\ntive \ufb01ltering has concentrated on explicit feedback, resulting in the development\nof accurate and scalable models. However, since explicit feedback is often dif\ufb01-\ncult to collect it is important to develop effective models that take advantage of the\nmore widely available implicit feedback. 
We introduce a probabilistic approach to\ncollaborative \ufb01ltering with implicit feedback based on modelling the user\u2019s item\nselection process. In the interests of scalability, we restrict our attention to tree-\nstructured distributions over items and develop a principled and ef\ufb01cient algorithm\nfor learning item trees from data. We also identify a problem with a widely used\nprotocol for evaluating implicit feedback models and propose a way of addressing\nit using a small quantity of explicit feedback data.\n\n1\n\nIntroduction\n\nThe rapidly growing number of products available online makes it increasingly dif\ufb01cult for users to\nchoose the ones worth their attention. Recommender systems assist users in making these choices by\nranking the products based on inferred user preferences. Collaborative \ufb01ltering [6] has become the\napproach of choice for building recommender systems due to its ability to infer complex preference\npatterns from large collections of user preference data. Most collaborative \ufb01ltering research deals\nwith inferring preferences from explicit feedback, for example ratings given to items. As a result,\nseveral effective methods have been developed for this version of the problem. Matrix factorization\nbased models [13, 5, 12] have emerged as the most popular of these due to their simplicity and supe-\nrior predictive performance. Such models are also highly scalable because their training algorithms\ntake advantage of the sparsity of the rating matrix, resulting in training times that are linear in the\nnumber of observed ratings.\nHowever, since explicit feedback is often dif\ufb01cult to collect it is essential to develop effective models\nthat take advantage of the more abundant implicit feedback, such as logs of user purchases, rentals,\nor clicks. 
The dif\ufb01culty of modelling implicit feedback comes from the fact that it contains only\npositive examples, since users explicitly express their interest, by selecting items, but not their dis-\ninterest. Note that not selecting a particular item is not necessarily an expression of disinterest, as it\nmight also be due to the obscurity of the item, lack of time, or other reasons.\nJust like their explicit feedback counterparts, the most successful implicit feedback collaborative\n\ufb01ltering (IFCF) methods are based on matrix factorization [4, 10, 9]. However, instead of a highly\nsparse rating matrix, they approximate a dense binary matrix, where each entry indicates whether or\nnot a particular user selected a particular item. We will collectively refer to such methods as Binary\nMatrix Factorization (BMF). Since such approaches treat unobserved user/item pairs as fake negative\nexamples which can dominate the much less numerous positive examples, the contribution to the\n\n1\n\n\fobjective function from the zero entries is typically downweighted. The matrix being approximated\nis no longer sparse, so models of this type are typically trained using batch alternating least squares.\nAs a result, the training time is cubic in the number of latent factors, which makes these models less\nscalable than their explicit feedback counterparts.\nRecently [11] introduced a new method, called Bayesian Personalized Ranking (BPR), for modelling\nimplicit feedback that is based on more realistic assumptions than BMF. Instead of assuming that\nusers like the selected items and dislike the unselected ones, it assumes that users merely prefer the\nformer to the latter. The model is presented with selected/unselected item pairs and is trained to rank\nthe selected items above the unselected ones. 
Since the number of such pairs is typically very large,\nthe unselected items are sampled at random.\nIn this paper we develop a new method that explicitly models the user item selection process using\na probabilistic model that, unlike the existing approaches, can generate new item lists. Like BPR\nit assumes that selected items are more interesting than the unselected ones. Unlike BPR, however,\nit represents the appeal of items to a user using a probability distribution, producing a complete\nordering of items by probability value. In order to scale to large numbers of items ef\ufb01ciently, we\nrestrict our attention to tree-structured distributions. Since the accuracy of the resulting models\ndepends heavily on the choice of the tree structure, we develop an algorithm for learning trees from\ndata that takes into account the structure of the model the tree will be used with.\nWe then turn our attention to the task of evaluating implicit feedback models and point out a problem\nwith a widely used evaluation protocol, which stems from the assumption that all items not selected\nby a user are irrelevant. Our proposed solution involves using a small quantity of explicit feedback\nto reliably identify the irrelevant items.\n\n2 Modelling item selection\n\nWe propose a new approach to collaborative \ufb01ltering with implicit feedback based on modelling\nthe item selection process performed by each user. The identities of the items selected by a user\nare modelled as independent samples from a user-speci\ufb01c distribution over all available items. The\nprobability of an item under this distribution re\ufb02ects the user\u2019s interest in it. Training our model\namounts to performing multinomial density estimation for each user from the observed user/item\npairs without explicitly considering the unobserved pairs.\nTo make the modelling task more manageable we make two simplifying assumptions. 
First, we assume that user preferences do not change with time and model all items chosen by a user as independent samples from a fixed user-specific distribution. Second, to keep the model as simple as possible, we assume that items are sampled with replacement. We believe that sampling with replacement is a reasonable approximation to sampling without replacement in this case because the space of items is large while the number of items selected by a user is relatively small. These simplifications allow us to model the identities of the items selected by a user as IID samples.\n\nWe now outline a simple implementation of the proposed idea which, though impractical for large datasets, will serve as a basis for developing a more scalable model. As is typical for matrix factorization methods in collaborative filtering, we represent users and items with real-valued vectors of latent factors. The factor vectors for user u and item i will be denoted by U_u and V_i respectively. Intuitively, U_u captures the preferences of user u, while V_i encodes the properties of item i. Both user and item factor vectors are unobserved and so have to be learned from the observed user/item pairs. The dot product between U_u and V_i quantifies the preference of user u for item i. We define the probability of user u choosing item i as\n\nP(i|u) = exp(U_u^T V_i + c_i) / sum_k exp(U_u^T V_k + c_k),    (1)\n\nwhere c_i is the bias parameter that captures the overall popularity of item i and index k ranges over all items in the inventory. The model can be trained using stochastic gradient ascent [2] on the log-likelihood by iterating through the user/item pairs in the training set, updating U_u, V_i, and c_i based on the gradient of log P(i|u). 
The main weakness of the model is that its training time is\nlinear in the inventory size because computing the gradient of the log-probability of a single item\nrequires explicitly considering all available items. Though linear time complexity might not seem\n\n2\n\n\fprohibitive, it severely limits the applicability of the model since collaborative \ufb01ltering tasks with\ntens or even hundreds of thousands of items are now common.\n\n3 Hierarchical item selection model\n\nThe linear time complexity of the gradient computation is a consequence of normalization over the\nentire inventory in Eq. 1, which is required because the space of items is unstructured. We can speed\nup normalization, and thus learning, exponentially by assuming that the space of items has a known\ntree structure. We start by supposing that we are given a K-ary tree with items at the leaves and\nexactly one item per leaf. For simplicity, we will assume that each item is located at exactly one\nleaf. Such a tree is uniquely determined by specifying for each item the path from the root to the leaf\ncontaining the item. Any such path can be represented by the sequence of nodes n = n0, n1, ..., nL\nit visits, where n0 is always the root node.\nBy making the choice of the next node stochastic, we can induce a distribution over the leaf nodes in\nthe tree and thus over items. To allow each user to have a different distribution over items we make\nthe probability of choosing each child a function of the user\u2019s factor vector. The probability will\nalso depend on the child node\u2019s factor vector and bias the same way the probability of choosing an\nitem in Eq. 1 depends on the item\u2019s factor vector and bias. Let C(n) be the set of children of node\nn. 
Then for user u, the probability of moving from node n_j to node n on a root-to-leaf tree traversal is given by\n\nP(n|n_j, u) = exp(U_u^T Q_n + b_n) / sum_{m in C(n_j)} exp(U_u^T Q_m + b_m),    (2)\n\nif n is a child of n_j and 0 otherwise. Here Q_n and b_n are the factor vector and the bias of node n. The probability of selecting item i is then given by the product of the probabilities of the decisions that lead from the root to the leaf containing i:\n\nP(i|u) = prod_{j=1}^{L_i} P(n^i_j | n^i_{j-1}, u).    (3)\n\nWe will call the model defined by Eq. 3 the Collaborative Item Selection (CIS) model. Given a tree over items, the CIS model can be trained using stochastic gradient ascent in log-likelihood, updating parameters after each user/item pair.\n\nWhile the model can use any tree over items, the choice of the tree affects the model's efficiency and ability to generalize. Since computing the probability of a single item takes time linear in the item's depth in the tree, we want to avoid trees that are too unbalanced. To produce a model that generalizes well we also want to avoid trees with difficult classification problems at the internal nodes [1], which correspond to hard-to-predict item paths.\n\nOne way to produce a tree that results in relatively easy classification problems is to assign similar items to the same class, which is the approach of [7] and [14]. However, the similarity metrics used by these methods are not model-based in the sense that they are not derived from the classifiers that will be used at the tree nodes. In Section 5 we will develop a scalable model-based algorithm for learning trees with item paths that are easy to predict using Eq.
2.\n\n4 Related work\n\nThe use of tree-structured label spaces to reduce the normalization cost has originated in statistical\nlanguage modelling, where it was used to accelerate neural and maximum-entropy language models\n[3, 8]. The task of learning trees for ef\ufb01cient probabilistic multiclass classi\ufb01cation has received sur-\nprisingly little attention. The two algorithms most closely related to the one proposed in this paper\nare [1] and [7]. [1] proposed a fully online algorithm for multinomial density estimation that con-\nstructs a binary label tree by inserting the previously unseen labels whenever they are encountered.\nThe location for a new label is found proceeding from the root to a leaf making the left child/right\nchild decisions based on their probability under the model and a tree balancing penalty. This is the\nonly tree-learning algorithm we are aware of that takes into account the probabilistic model the tree\nis used with. Unfortunately, this approach is very optimistic because it decides on the location for a\nnew label in the tree based on a single training case and never revisits that decision.\n\n3\n\n\fThe algorithm in [7] was developed for learning trees over words for use in probabilistic language\nmodels. It constructs such trees by performing top-down hierarchical clustering of words, which are\nrepresented by real-valued vectors. The word representations are learned through bootstrapping by\ntraining a language model based on a random tree. This algorithm, unlike the one we propose in\nSection 5, does not take into consideration the model the tree is constructed for.\nMost work on tree-based multiclass classi\ufb01cation deals with non-probabilistic models and does not\napply to the problem we are concerned with in this paper. Of these approaches our algorithm is\nmost similar to the one in [14], which looks for a tree structure that avoids requiring to discriminate\nbetween easily confused items as much as possible. 
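To make the tree-structured parameterization of Section 3 concrete, here is a minimal sketch (a hand-built toy tree and hypothetical node layout, not the authors' code) of computing P(i|u) as a product of per-node softmax decisions along the root-to-leaf path, as in Eqs. 2 and 3:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 5

# A tiny hand-built binary tree over 4 items (leaves hold item ids 0..3).
# children[n] lists the child node ids of internal node n.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
leaf_item = {3: 0, 4: 1, 5: 2, 6: 3}
item_leaf = {i: n for n, i in leaf_item.items()}
parent = {ch: n for n, cs in children.items() for ch in cs}

Q = 0.01 * rng.standard_normal((7, dim))  # node factor vectors Q_n
b = np.zeros(7)                           # node biases b_n

def step_prob(u_vec, nj, n):
    """Eq. 2: P(n | n_j, u), a softmax over the children of n_j."""
    sibs = children[nj]
    scores = np.array([u_vec @ Q[m] + b[m] for m in sibs])
    scores -= scores.max()
    p = np.exp(scores) / np.exp(scores).sum()
    return p[sibs.index(n)]

def item_prob(u_vec, i):
    """Eq. 3: product of decision probabilities on the root-to-leaf path."""
    n, prob = item_leaf[i], 1.0
    while n in parent:
        prob *= step_prob(u_vec, parent[n], n)
        n = parent[n]
    return prob
```

Because each per-node softmax sums to one over the siblings, the leaf probabilities form a distribution over items, and evaluating one item costs time linear only in its depth.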
The main weakness of that approach is the need\nfor training a \ufb02at classi\ufb01er to produce the confusion matrix needed by the algorithm. As a result, it\nis unlikely to scale to large datasets containing tens of thousands of classes.\n\n5 Model-based learning of item trees\n5.1 Overview\nIn this section we develop a scalable algorithm for learning trees that takes into account the para-\nmetric form of the model the tree will be used with. At the highest level our approach can be seen as\ntop-down model-based hierarchical clustering of items. We chose top-down clustering over bottom-\nup clustering because it is the more scalable option. Since \ufb01nding the best tree is intractable, we take\na greedy approach that constructs the tree one level at a time, learning the lth node of each item path\nbefore \ufb01xing it and advancing to the (l + 1)st node. Because our approach is model-based, it learns\nmodel parameters, i.e. node biases and factor vectors, jointly with the item paths. As a result, at ev-\nery point during its execution it speci\ufb01es a complete probabilistic model of the data, which becomes\nmore expressive with each additional tree level. This makes it possible to monitor the progress of\nthe algorithm by evaluating the predictions made after learning each level.\nFor simplicity, our tree-learning algorithm assumes that user factor vectors are known and \ufb01xed.\nSince these vectors are actually unknown, we learn them by \ufb01rst training a CIS model based on\na random balanced tree. We then extract the user vectors learned by the model and use them to\nlearn a better tree from the data. Finally, we train a CIS model based on the learned tree, updating\nall the parameters, including the user vectors. This three-stage approach is similar to the one used\nin [7] to learn trees over words. 
However, because our tree-learning algorithm is model-based, we already have a complete probabilistic model at its termination, so we only need to finetune its parameters instead of learning them from scratch. Finetuning is necessary because the parameters learned while building the tree are based on the fixed user factor vectors from the random-tree-based model. Though it is possible to continue alternating between optimizing over the tree structure and over user vectors, we found the resulting gains too small to be worth the computational cost.\n\n5.2 Learning a level of a tree\n\nWe now describe how to learn a level of the tree. Suppose we have learned the first l-1 nodes of each item path and would like to learn the lth node. Let U_i be the set of users who rated item i in the training set. The contribution made by item i to the log-likelihood is then given by\n\nL_i = log prod_{u in U_i} P(i|u) = sum_{u in U_i} log prod_j P(n^i_j | n^i_{j-1}, u) = sum_{u in U_i} sum_j log P(n^i_j | n^i_{j-1}, u).    (4)\n\nThe log-likelihood contribution due to a single observation can be expressed as\n\nsum_{j=1}^{L_i} log P(n^i_j | n^i_{j-1}, u) = sum_{j=1}^{l-1} log P(n^i_j | n^i_{j-1}, u) + log P(n^i_l | n^i_{l-1}, u) + sum_{j=l+1}^{L_i} log P(n^i_j | n^i_{j-1}, u).    (5)\n\nThe first term on the RHS depends only on the parameters and path nodes that have already been learned, so it can be left out of the objective function. The third term is the log-probability of item i under the subtree rooted at node n^i_l, which depends on the structure and parameters of that subtree, which we have not learned yet. 
To emphasize the fact that this term is based on a user-dependent distribution over items under node n^i_l, we will denote it by log P(i|n^i_l, u).\n\nThe overall objective function for learning level l is obtained by adding up the contributions of all items, leaving out the terms that do not depend on the quantities to be learned:\n\nL_l = sum_i sum_{u in U_i} log P(n^i_l | n^i_{l-1}, u) + sum_i sum_{u in U_i} log P(i | n^i_l, u).    (6)\n\nThe most direct approach to learning the paths would be to alternate between updating the lth node in the paths and the corresponding factor vectors and biases. Since jointly optimizing over the lth node in all item paths is infeasible, we have to resort to incremental updates, maximizing L_l over the lth node in one item path at a time. Unfortunately, even this operation is intractable because evaluating each value of n^i_l requires knowing the optimal contribution from the still-to-be-learned levels of the tree, which is the second term in Eq. 6. In other words, to find the optimal n^i_l we need to compute\n\nn^i_l = argmax_{n in C(n^i_{l-1})} ( sum_{u in U_i} log P(n | n^i_{l-1}, u) + F(n, n^i_{l-1}) ),    (7)\n\nwhere we left out the terms that do not depend on n^i_l. The optimal contribution F(n^i_l, n^i_{l-1}) from the future levels is defined as\n\nF(n^i_l, n^i_{l-1}) = max_Theta sum_{k in I(n^i_{l-1})} sum_{u in U_k} log P(k | n^k_l, u),    (8)\n\nwhere I(n^i_{l-1}) is the set of items that are assigned to node n^i_{l-1}, and Theta is the set of node factor vectors, biases, and tree structures that parameterize the set of distributions {P(k | n^k_l, u) | k in I(n^i_{l-1})}.\n\n5.3 Approximating the future\n
The value of F(n^i_l, n^i_{l-1}) quantifies the difficulty of discriminating between the items assigned to node n^i_{l-1} using the best tree structure and parameter setting possible, given that item i is assigned to the child n^i_l of that node. Since F(n^i_l, n^i_{l-1}) in Eq. 8 rules out degenerate solutions where all items below a node are assigned to the same child of it, leaving F(n^i_l, n^i_{l-1}) out to make the optimization problem easier is not an option.\n\nWe address the intractability of Eq. 7 while avoiding the degenerate solutions by approximating the user-dependent distributions P(k|n^k_l, u) by simpler distributions that make it much easier to evaluate F(n^i_l, n^i_{l-1}) for each candidate value for n^i_l. Since computing F(n^i_l, n^i_{l-1}) requires maximizing over the free parameters of P(k|n^k_l, u), choosing a parameterization of P(k|n^k_l, u) that makes this maximization easy can greatly speed up this computation. We approximate the tree-structured user-dependent P(k|n^k_l, u) with a flat user-independent distribution P(k|n^k_l). The main advantage of this parameterization is that the optimal P(k|n^k_l) can be computed by counting the number of times each item assigned to node n^k_l occurs in the training data and normalizing. In other words, when P(k|n^k_l) is used in Eq. 8, the maximum is achieved at\n\nP(i|n^k_l) = N_i / sum_{m in I(n^k_l)} N_m  if i in I(n^k_l), and 0 otherwise,    (9)\n\nwhere N_i is the number of times item i occurs in the training set. 
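The count-based optimum of Eq. 9 is just normalized occurrence counts; a small sketch (the item names and counts below are invented for illustration):

```python
from collections import Counter

# Toy selection counts N_i and a hypothetical assignment of items to the
# two children of some node: I(child_a) and I(child_b).
N = Counter({"item1": 5, "item2": 3, "item3": 2})
I_child = {"a": ["item1", "item2"], "b": ["item3"]}

def flat_probs(child):
    """Optimal user-independent P(i | n) from Eq. 9: normalized counts
    over the items assigned to node n, zero elsewhere."""
    items = I_child[child]
    total = sum(N[i] for i in items)
    return {i: N[i] / total for i in items}

print(flat_probs("a"))  # {'item1': 0.625, 'item2': 0.375}
```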
The corresponding value for F(n^i_l, n^i_{l-1}) is given by\n\nF(n^i_l, n^i_{l-1}) = sum_{k in I(n^i_{l-1})} N_k log ( N_k / sum_{m in I(n^k_l)} N_m ).    (10)\n\nTo show that F(n^i_l, n^i_{l-1}) can be computed in constant time, we start by observing that the sum over items under node n^i_{l-1} can be written in terms of sums over items under each of its children:\n\nF(n^i_l, n^i_{l-1}) = sum_{c in C(n^i_{l-1})} sum_{k in I(c)} N_k log ( N_k / sum_{m in I(c)} N_m ) = sum_{c in C(n^i_{l-1})} sum_{k in I(c)} N_k log N_k - sum_{c in C(n^i_{l-1})} Z_c log Z_c,    (11)\n\nwith Z_c = sum_{k in I(c)} N_k. Since adding a constant to F(n^i_l, n^i_{l-1}) has no effect on the solution of Eq. 7 and the first term in the equation does not depend on n^i_l, we can drop it to get\n\nF~(n^i_l, n^i_{l-1}) = - sum_{c in C(n^i_{l-1})} Z_c log Z_c.    (12)\n\nTo compute F~(n^i_l, n^i_{l-1}) efficiently, we store the Z_c's and the old F~(n^i_l, n^i_{l-1}) value, updating them whenever an item is moved to a different node. Such updates can be performed in constant time.\n\nWe now show that the first term in Eq. 7, corresponding to the contribution of the lth path node for item i, can be computed efficiently. Plugging in the definition of P(n|n_j, u) from Eq. 2 we get\n\nsum_{u in U_i} log P(n | n^i_{l-1}, u) = sum_{u in U_i} (U_u^T Q_n + b_n) + C = ( sum_{u in U_i} U_u )^T Q_n + |U_i| b_n + C,    (13)\n\nwhere C is a term that does not depend on n and so does not have to be considered when maximizing over n. 
Since we assume that the user factor vectors are known and fixed, we precompute R_i = sum_{u in U_i} U_u for each item i, which can be seen as creating a surrogate representation for item i.\n\nFinally, plugging Eq. 13 into Eq. 7 gives us the following update for item nodes:\n\nn^i_l = argmax_{n in C(n^i_{l-1})} ( R_i^T Q_n + |U_i| b_n + F~(n, n^i_{l-1}) ).    (14)\n\n6 Evaluating models of implicit feedback\n\nEstablishing sensible evaluation protocols for machine learning problems is important because they effectively define what 'better' performance means and implicitly guide the development of future methods. Given that the problem of implicit feedback collaborative filtering is relatively new, it is not surprising that the typical evaluation protocol was adopted from information retrieval. However, we believe that this protocol is much less appropriate in collaborative filtering than it is in its field of origin.\n\nImplicit feedback models are typically evaluated using information retrieval metrics, such as Mean Average Precision (MAP), that require knowing which items are relevant and which are irrelevant to each user. It is typical to assume that the items the user selected are relevant and all others are not [10]. However, this approach is problematic because it fails to distinguish between the items the user really has no interest in (i.e. the truly irrelevant ones) and the relevant items the user simply did not rate. And while the irrelevant items do tend to be far more numerous than the unobserved relevant ones, the effect of the latter can still be strong enough to affect model comparison, as we demonstrate in the next section. To address this issue, we propose using some explicit feedback information to identify a small number of truly irrelevant items for each user and using them in place of items of unknown relevance in the evaluation. 
Thus the models will be evaluated on their ability to rank the truly relevant items above the truly irrelevant ones, which we believe is the ultimate task of collaborative filtering. Though this approach does require access to explicit feedback, only a small quantity of it is necessary, and it is used only for evaluation.\n\nFor probabilistic models P(i|u), the most natural performance metrics are log-probability of the held-out data D and the closely-related perplexity (PPL), the standard metric for language models:\n\nPPL = exp( -(1/|D|) sum_{(u,i) in D} log P(i|u) ).    (15)\n\nThe model that assigns the correct item probability 1 has the perplexity of 1, while the model that assigns all N items the same probability (1/N) has the perplexity of N. Unlike the ranking metrics above, perplexity is computed based on the selected/relevant items alone and does not require assuming that the unselected items are irrelevant.[1]\n\n7 Experimental results\n\nFirst we investigated the impact of using tree-structured distributions over items by comparing the performance of tree-based CIS models to that of a flat model defined by Eq. 1. We used MovieLens 1M, which is a fairly small dataset, for the comparison in order to be able to train the flat model within reasonable time. The dataset contains 1M ratings on a scale from 1 to 5 given by 6040 users to 3952 movies. To simulate the implicit feedback setting, where the presence of a user/item pair indicates an expression of interest, we kept only the user/item pairs associated with ratings 4 and above (and discarded the rating values) and split the resulting 575K pairs into a 475K-pair training set, and a validation and test sets of 50K pairs each. 
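The perplexity metric of Eq. 15 used throughout these experiments can be computed in a few lines (a minimal sketch; the model is assumed to expose log P(i|u) values for the held-out pairs):

```python
import math

def perplexity(log_probs):
    """PPL from Eq. 15: exp of the negative mean held-out log-probability.
    `log_probs` holds log P(i|u) for each held-out (u, i) pair."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Sanity check: a uniform model over N items has perplexity N,
# as noted in the text (N = 3952 movies in MovieLens 1M).
N = 3952
uniform = [math.log(1.0 / N)] * 10
assert abs(perplexity(uniform) - N) < 1e-6
```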
[1] The implicit assumption here is that the selected items are more relevant than the unselected ones.\n\nTable 1: Test set scores in percent on the MovieLens 10M dataset obtained by treating items with low ratings as irrelevant. Higher scores indicate better performance for all metrics except for perplexity.\n\nModel           | PPL | MAP   | P@1   | P@5   | P@10  | R@1   | R@5   | R@10\nCIS (Random)    | 921 | 70.68 | 74.65 | 58.02 | 49.91 | 20.66 | 60.02 | 77.31\nCIS (LearnedRI) | 822 | 72.50 | 76.64 | 59.29 | 50.64 | 21.51 | 61.24 | 78.22\nCIS (LearnedCI) | 820 | 72.61 | 76.68 | 59.37 | 50.69 | 21.54 | 61.31 | 78.27\nBPR             | 865 | 72.75 | 75.75 | 59.15 | 50.63 | 21.50 | 61.43 | 78.39\nBMF             | --  | 70.80 | 75.66 | 58.03 | 49.77 | 20.94 | 60.04 | 77.21\n\nTable 2: Test set scores in percent on the MovieLens 10M dataset obtained by treating all unobserved items as irrelevant.\n\nModel | MAP   | P@1   | P@5   | P@10  | R@1  | R@5   | R@10\nBPR   | 12.73 | 14.27 | 11.56 | 9.89  | 3.06 | 11.55 | 18.86\nBMF   | 16.13 | 22.10 | 16.25 | 12.94 | 4.66 | 15.64 | 23.55\n\nWe trained three models with 5-dimensional factor vectors: a flat model, a CIS model with a random balanced binary tree, and a CIS model with a learned binary tree (as in Section 5). The flat model took 12 hours to train and had the test set perplexity of 920. Training the random tree model took half an hour, resulting in the perplexity of 975. The training process for the learned-tree model, which included training a random-tree model, learning a tree from the resulting user factor vectors, and finetuning all the parameters, took 1 hour. The resulting model performed very well, achieving the test set perplexity of 912. 
These\nresults suggest that even when the number of items is relatively small our tree-based approach to\nitem selection modelling can yield an order-of-magnitude reduction in training times relative to the\n\ufb02at model without hurting the predictive accuracy.\nWe then used the larger MovieLens 10M dataset (0-5 rating scale, 69878 users, 10677 movies)\nto compare the proposed approach to the existing IFCF methods. As on MovieLens 1M, we kept\nonly the user/item pairs with ratings 4 and above, producing a 4M-pair training set, a 500K-pair\nvalidation set, and a 500K-pair test set. We compared the models based on their perplexity and\nranking performance as measured by the standard information retrieval metrics: Mean Average\nPrecision (MAP), Precision@k, and Recall@k. We used the evaluation approach described in the\nprevious section, which involved having the models rank only the items with known relevance status.\nWe used the rating values to determine relevance, considering items rated below 3 as irrelevant and\nitems rated 4 and above as relevant.\nWe compared our hierarchical item selection model to two state-of-the-art models for implicit feed-\nback: the Bayesian Personalized Ranking model (BPR) and the Binary Matrix Factorization model\n(BMF). All models used 25-dimensional factor vectors, as we found that higher-dimensional factor\nvectors resulted in only marginal improvements. We included three CIS models based on differ-\nent binary trees (K = 2) to highlight the effect of tree construction methods. The methods are as\nfollows: \u201cRandom\u201d generates random balanced trees; \u201cLearnedRI\u201d is the method from Section 5\nwith randomly initialized item-node assignments; \u201cLearnedCI\u201d is the same method with item-node\nassignments initialized by clustering surrogate item representations Ri from Section 5.3. 
Training a flat item selection model on this dataset was infeasible, as a single pass through the data took six hours, compared to a mere two minutes for CIS (LearnedCI).\n\nBetter performance corresponds to lower values of perplexity and higher values of the other metrics. Table 1 shows the test scores for the resulting models. In terms of perplexity, CIS (Learned) is the top performer, with BPR coming in second and CIS (Random) a distant third. Since BMF does not produce a distribution over items, its performance cannot be naturally measured in terms of PPL. On the ranking metrics, CIS (Learned) and BPR emerge as the best-performing methods, achieving very similar scores. BPR has a slight edge over CIS on MAP, while CIS performs better on Precision@1. BMF and CIS (Random) are the weakest performers, with considerably worse scores than BPR and CIS (Learned) on all metrics. Comparing the results of CIS (Learned) and CIS (Random) shows that the choice of the tree used has a strong effect on the performance of CIS models and that using trees learned with the proposed algorithm makes CIS competitive with the best collaborative filtering models. The similar results achieved by CIS (LearnedRI) and CIS (LearnedCI) suggest that the performance of the resulting model is not particularly sensitive to the initialization scheme of the tree-learning algorithm.\n\nTo understand the behaviour of our tree-learning algorithm better we examined the trees produced by it. The learned trees looked sensible, with neighbouring leaves typically containing movies from the same sub-genre and appealing to the same audience. We then determined how discriminative the decisions were at each level of the tree by replacing the user-dependent distributions under all nodes at a particular depth by the best user-independent approximations (frequencies of items under the node). 
Comparing the perplexity of a model using the tree truncated at level l and at level l + 1 allowed us to determine how much level l + 1 contributed to the model. In the CIS (Random) model, the first few and the last few levels had little effect on perplexity, and the medium-depth levels accounted for most of the perplexity reduction. In contrast, in the CIS (LearnedRI) model, the effect of a level on perplexity decreased with level depth, with the first few levels reducing perplexity the most, which is a consequence of the greedy nature of the tree-learning algorithm.

To highlight the importance of excluding items of unknown relevance when evaluating implicit feedback models, we recomputed the performance metrics treating all items not rated by a user as irrelevant. As the scores in Table 2 show, this seemingly minor modification of the evaluation protocol makes BMF appear to outperform BPR by a large margin, which, as Table 1 indicates, is not actually the case. In retrospect, these changes in relative performance are not particularly surprising, since the training algorithm for BMF treats unobserved items as negative examples, which perfectly matches the assumption the evaluation is based on, namely that unobserved items are irrelevant. This is a clear example of a flawed evaluation protocol favouring an unrealistic modelling assumption.

8 Discussion

We proposed a model that, in addition to being competitive with the best implicit feedback models in terms of predictive accuracy, also provides calibrated item selection probabilities for each user, which quantify the user's interest in the items. These probabilities allow comparing the degree of interest in an item across users, making it possible to maximize the total user satisfaction when item availability is limited.
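In a tree-structured item selection model of this kind, the probability of a user selecting a given item is the product of the branch decisions along the item's root-to-leaf path. The sketch below is a minimal illustration, assuming a binary tree with a logistic decision at each internal node parameterized by the inner product of a user factor vector and a node vector; the exact parameterization used by CIS is given in the earlier sections and is not reproduced here, and `item_probability` is our own name.

```python
import math

def item_probability(user_vec, path):
    """Selection probability of one item for one user.

    path: list of (node_vec, direction) pairs along the root-to-leaf
          path, where direction is +1 if the path takes the branch
          modelled by sigmoid(u . q) at that node and -1 otherwise.
    The logistic branch model here is illustrative, not the paper's
    exact parameterization.
    """
    p = 1.0
    for node_vec, direction in path:
        dot = sum(u * q for u, q in zip(user_vec, node_vec))
        # sigmoid(direction * dot): the two directions at a node
        # always sum to one, so the leaf probabilities form a
        # normalized distribution over items.
        p *= 1.0 / (1.0 + math.exp(-direction * dot))
    return p
```

With a balanced binary tree over D items each path has length about log2 D, so computing a selection probability costs O(log D) per item rather than the O(D) normalization a flat model needs, which is consistent with the six-hours-versus-two-minutes training gap reported above.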
More generally, the probabilities provided by the model can be used in combination with utility functions for making sophisticated decisions.

Although we introduced our tree-learning algorithm in the context of collaborative filtering, it is applicable to several other problems. One such problem is statistical language modelling, where the task is to predict the distribution of the next word in a sentence given its context, consisting of several preceding words. While there already exists an algorithm for learning the structure of tree-based language models [7], it constructs trees by clustering word representations, without taking into account the form of the model that will use these trees. In contrast, our algorithm optimizes the tree structure and model parameters jointly, which can lead to superior model performance.

The proposed algorithm can also be used to learn trees over labels for multinomial regression models. When the number of labels is large, using a label space with a sensible tree structure can lead to much faster training and improved generalization. Our algorithm can be applied in this setting by noticing the correspondence between items and labels, and between user factor vectors and input vectors. However, unlike in collaborative filtering, where user factor vectors have to be learned, in this case input vectors are observed, which eliminates the need to train a model based on a random tree before applying the tree-learning algorithm.

We believe that evaluation protocols for implicit feedback models deserve more attention than they have received. In this paper we observed that one widely used protocol can produce misleading results due to an unrealistic assumption it makes about item relevance.
We proposed using a small quantity of explicit feedback data to directly estimate item relevance in order to avoid having to make that assumption.

Acknowledgments

We thank Biljana Petreska and Lloyd Elliot for their helpful comments and the Gatsby Charitable Foundation for generous funding. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 270327.

References

[1] Alina Beygelzimer, John Langford, Yuri Lifshits, Gregory B. Sorkin, and Alexander L. Strehl. Conditional probability tree estimation analysis and algorithms. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

[2] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010), pages 177–187, 2010.

[3] J. Goodman. Classes for fast maximum entropy training. In Proceedings of ICASSP '01, volume 1, pages 561–564, 2001.

[4] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 263–272, 2008.

[5] Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426–434, 2008.

[6] Benjamin Marlin. Collaborative filtering: A machine learning perspective. Master's thesis, University of Toronto, 2004.

[7] Andriy Mnih and Geoffrey Hinton. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, volume 21, 2009.

[8] Frederic Morin and Yoshua Bengio.
Hierarchical probabilistic neural network language model. In AISTATS'05, pages 246–252, 2005.

[9] Rong Pan and Martin Scholz. Mind the gaps: weighting the unknown in large-scale one-class collaborative filtering. In KDD, pages 667–676, 2009.

[10] Rong Pan, Yunhong Zhou, Bin Cao, Nathan Nan Liu, Rajan M. Lukose, Martin Scholz, and Qiang Yang. One-class collaborative filtering. In ICDM, pages 502–511, 2008.

[11] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI '09, pages 452–461, 2009.

[12] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, volume 20, 2008.

[13] Nathan Srebro, Jason D. M. Rennie, and Tommi Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, 2004.

[14] Jason Weston, Samy Bengio, and David Grangier. Label embedding trees for large multi-class tasks. In Advances in Neural Information Processing Systems (NIPS), 2010.