{"title": "Collaborative Recurrent Autoencoder: Recommend while Learning to Fill in the Blanks", "book": "Advances in Neural Information Processing Systems", "page_first": 415, "page_last": 423, "abstract": "Hybrid methods that utilize both content and rating information are commonly used in many recommender systems. However, most of them use either handcrafted features or the bag-of-words representation as a surrogate for the content information but they are neither effective nor natural enough. To address this problem, we develop a collaborative recurrent autoencoder (CRAE) which is a denoising recurrent autoencoder (DRAE) that models the generation of content sequences in the collaborative filtering (CF) setting. The model generalizes recent advances in recurrent deep learning from i.i.d. input to non-i.i.d. (CF-based) input and provides a new denoising scheme along with a novel learnable pooling scheme for the recurrent autoencoder. To do this, we first develop a hierarchical Bayesian model for the DRAE and then generalize it to the CF setting. The synergy between denoising and CF enables CRAE to make accurate recommendations while learning to fill in the blanks in sequences. Experiments on real-world datasets from different domains (CiteULike and Netflix) show that, by jointly modeling the order-aware generation of sequences for the content information and performing CF for the ratings, CRAE is able to significantly outperform the state of the art on both the recommendation task based on ratings and the sequence generation task based on content information.", "full_text": "Collaborative Recurrent Autoencoder:\n\nRecommend while Learning to Fill in the Blanks\n\nHao Wang, Xingjian Shi, Dit-Yan Yeung\n\nHong Kong University of Science and Technology\n\n{hwangaz,xshiab,dyyeung}@cse.ust.hk\n\nAbstract\n\nHybrid methods that utilize both content and rating information are commonly\nused in many recommender systems. 
However, most of them use either handcrafted\nfeatures or the bag-of-words representation as a surrogate for the content infor-\nmation but they are neither effective nor natural enough. To address this problem,\nwe develop a collaborative recurrent autoencoder (CRAE) which is a denoising\nrecurrent autoencoder (DRAE) that models the generation of content sequences in\nthe collaborative \ufb01ltering (CF) setting. The model generalizes recent advances in\nrecurrent deep learning from i.i.d. input to non-i.i.d. (CF-based) input and provides\na new denoising scheme along with a novel learnable pooling scheme for the recur-\nrent autoencoder. To do this, we \ufb01rst develop a hierarchical Bayesian model for the\nDRAE and then generalize it to the CF setting. The synergy between denoising\nand CF enables CRAE to make accurate recommendations while learning to \ufb01ll\nin the blanks in sequences. Experiments on real-world datasets from different\ndomains (CiteULike and Net\ufb02ix) show that, by jointly modeling the order-aware\ngeneration of sequences for the content information and performing CF for the\nratings, CRAE is able to signi\ufb01cantly outperform the state of the art on both the\nrecommendation task based on ratings and the sequence generation task based on\ncontent information.\n\n1\n\nIntroduction\n\nWith the high prevalence and abundance of Internet services, recommender systems are becoming\nincreasingly important to attract users because they can help users make effective use of the informa-\ntion available. Companies like Net\ufb02ix have been using recommender systems extensively to target\nusers and promote products. Existing methods for recommender systems can be roughly categorized\ninto three classes [13]: content-based methods that use the user pro\ufb01les or product descriptions only,\ncollaborative \ufb01ltering (CF) based methods that use the ratings only, and hybrid methods that make\nuse of both. 
Hybrid methods using both types of information can get the best of both worlds and, as a\nresult, usually outperform content-based and CF-based methods.\nAmong the hybrid methods, collaborative topic regression (CTR) [20] was proposed to integrate a\ntopic model and probabilistic matrix factorization (PMF) [15]. CTR is an appealing method in that it\nproduces both promising and interpretable results. However, CTR uses a bag-of-words representation\nand ignores the order of words and the local context around each word, which can provide valuable\ninformation when learning article representation and word embeddings. Deep learning models like\nconvolutional neural networks (CNN) which use layers of sliding windows (kernels) have the potential\nof capturing the order and local context of words. However, the kernel size in a CNN is \ufb01xed during\ntraining. To achieve good enough performance, sometimes an ensemble of multiple CNNs with\ndifferent kernel sizes has to be used. A more natural and adaptive way of modeling text sequences\nwould be to use gated recurrent neural network (RNN) models [8, 3, 18]. A gated RNN takes in one\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fword (or multiple words) at a time and lets the learned gates decide whether to incorporate or to\nforget the word. Intuitively, if we can generalize gated RNNs to the CF setting (non-i.i.d.) to jointly\nmodel the generation of sequences and the relationship between items and users (rating matrices), the\nrecommendation performance could be signi\ufb01cantly boosted.\nNevertheless, very few attempts have been made to develop feedforward deep learning models for CF,\nlet alone recurrent ones. This is due partially to the fact that deep learning models, like many machine\nlearning models, assume i.i.d. inputs. [16, 6, 7] use restricted Boltzmann machines and RNN instead\nof the conventional matrix factorization (MF) formulation to perform CF. 
Although these methods\ninvolve both deep learning and CF, they actually belong to CF-based methods because they do not\nincorporate the content information like CTR, which is crucial for accurate recommendation. [14]\nuses low-rank MF in the last weight layer of a deep network to reduce the number of parameters, but\nit is for classi\ufb01cation instead of recommendation tasks. There have also been nice explorations on\nmusic recommendation [10, 25] in which a CNN or deep belief network (DBN) is directly used for\ncontent-based recommendation. However, the models are deterministic and less robust since the noise\nis not explicitly modeled. Besides, the CNN is directly linked to the ratings making the performance\nsuffer greatly when the ratings are sparse, as will be shown later in our experiments. Very recently,\ncollaborative deep learning (CDL) [23] is proposed as a probabilistic model for joint learning of\na probabilistic stacked denoising autoencoder (SDAE) [19] and collaborative \ufb01ltering. However,\nCDL is a feedforward model that uses bag-of-words as input and it does not model the order-aware\ngeneration of sequences. Consequently, the model would have inferior recommendation performance\nand is not capable of generating sequences at all, which will be shown in our experiments. Besides\norder-awareness, another drawback of CDL is its lack of robustness (see Section 3.1 and 3.5 for\ndetails). To address these problems, we propose a hierarchical Bayesian generative model called\ncollaborative recurrent autoencoder (CRAE) to jointly model the order-aware generation of sequences\n(in the content information) and the rating information in a CF setting. Our main contributions are:\n\u2022 By exploiting recurrent deep learning collaboratively, CRAE is able to sophisticatedly model\nthe generation of items (sequences) while extracting the implicit relationship between items\n(and users). 
We design a novel pooling scheme for pooling variable-length sequences into fixed-length vectors and also propose a new denoising scheme to effectively avoid overfitting. Besides recommendation, CRAE can also be used to generate sequences on the fly.
• To the best of our knowledge, CRAE is the first model that bridges the gap between RNN and CF, especially with respect to hybrid methods for recommender systems. Besides, the Bayesian nature also enables CRAE to seamlessly incorporate other auxiliary information to further boost the performance.
• Extensive experiments on real-world datasets from different domains show that CRAE can substantially improve on the state of the art.

2 Problem Statement and Notation

Similar to [20], the recommendation task considered in this paper takes implicit feedback [9] as the training and test data. There are J items (e.g., articles or movies) in the dataset. For item j, there is a corresponding sequence consisting of Tj words, where the vector e(j)_t specifies the t-th word using the 1-of-S representation, i.e., a vector of length S with the value 1 in only one element corresponding to the word and 0 in all other elements. Here S is the vocabulary size of the dataset. We define an I-by-J binary rating matrix R = [Rij]_{I×J}, where I denotes the number of users. For example, in the CiteULike dataset, Rij = 1 if user i has article j in his or her personal library and Rij = 0 otherwise. Given some of the ratings in R and the corresponding sequences of words e(j)_t (e.g., titles of articles or plots of movies), the problem is to predict the other ratings in R.
In the following sections, e′(j)_t denotes the noise-corrupted version of e(j)_t and (h(j)_t; s(j)_t) refers to the concatenation of the two K_W-dimensional column vectors. All input weights (like Y_e and Y^i_e) and recurrent weights (like W_e and W^i_e) are of dimensionality K_W-by-K_W. The output state h(j)_t, gate units (e.g., h^{o(j)}_t), and cell state s(j)_t are of dimensionality K_W. K is the dimensionality of the final representation γ_j, middle-layer units θ_j, and latent vectors v_j and u_i. I_K or I_{K_W} denotes a K-by-K or K_W-by-K_W identity matrix. For convenience we use W+ to denote the collection of all weights and biases. Similarly, h+_t is used to denote the collection of h_t, h^i_t, h^f_t, and h^o_t.

Figure 1: On the left is the graphical model for an example CRAE where Tj = 2 for all j. To prevent clutter, the hyperparameters for beta-pooling, all weights, biases, and links between h_t and γ are omitted. On the right is the graphical model for the degenerated CRAE. An example recurrent autoencoder with Tj = 3 is shown. '⟨?⟩' is the ⟨wildcard⟩ and '$' marks the end of a sentence. E′ and E are used in place of [e′(j)_t]_{t=1}^{Tj} and [e(j)_t]_{t=1}^{Tj} respectively.

3 Collaborative Recurrent Autoencoder

In this section we will first propose a generalization of the RNN called robust recurrent networks (RRN), followed by the introduction of two key concepts, wildcard denoising and beta-pooling, in our model. After that, the generative process of CRAE is provided to show how to generalize the RRN as a hierarchical Bayesian model from an i.i.d. setting to a CF (non-i.i.d.) setting.

3.1 Robust Recurrent Networks

One problem with RNN models like long short-term memory networks (LSTM) is that the computation is deterministic without taking the noise into account, which means it is not robust especially with insufficient training data. To address this robustness problem, we propose RRN as a type of noisy gated RNN. In RRN, the gates and other latent variables are designed to incorporate noise, making the model more robust.
Note that unlike [4, 5], the noise in RRN is directly propagated back and forth in the network, without the need for using separate neural networks to approximate the distributions of the latent variables. This is much more efficient and easier to implement. Here we provide the generative process of RRN. Using t = 1 . . . Tj to index the words in the sequence, we have (we drop the index j for items for notational simplicity):

x_{t−1} ∼ N(W_w e_{t−1}, λ_s^{−1} I_{K_W}),  a_{t−1} ∼ N(Y x_{t−1} + W h_{t−1} + b, λ_s^{−1} I_{K_W}),   (1)
s_t ∼ N(σ(h^f_{t−1}) ⊙ s_{t−1} + σ(h^i_{t−1}) ⊙ σ(a_{t−1}), λ_s^{−1} I_{K_W}),   (2)

where x_t is the word embedding of the t-th word, W_w is a K_W-by-S word embedding matrix, e_t is the 1-of-S representation mentioned above, ⊙ stands for the element-wise product operation between two vectors, σ(·) denotes the sigmoid function, s_t is the cell state of the t-th word, and b, Y, and W denote the biases, input weights, and recurrent weights respectively. The forget gate units h^f_t and the input gate units h^i_t in Equation (2) are drawn from Gaussian distributions depending on their corresponding weights and biases Y^f, W^f, Y^i, W^i, b^f, and b^i:

h^f_t ∼ N(Y^f x_t + W^f h_t + b^f, λ_s^{−1} I_{K_W}),  h^i_t ∼ N(Y^i x_t + W^i h_t + b^i, λ_s^{−1} I_{K_W}).

The output h_t depends on the output gate h^o_t, which has its own weights and biases Y^o, W^o, and b^o:

h^o_t ∼ N(Y^o x_t + W^o h_t + b^o, λ_s^{−1} I_{K_W}),  h_t ∼ N(tanh(s_t) ⊙ σ(h^o_{t−1}), λ_s^{−1} I_{K_W}).   (3)

In the RRN, information of the processed sequence is contained in the cell states s_t and the output states h_t, both of which are column vectors of length K_W. Note that RRN can be seen as a generalized and Bayesian version of LSTM [1].
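As a concrete illustration of how noise enters every gate and state, the following numpy sketch samples one RRN step. It is a simplification rather than the paper's exact computation: all gates here are conditioned on the previous word embedding and output state, and the dimensions, parameter names, and noise precision in the usage are toy assumptions.

```python
import numpy as np

def rrn_step(x_prev, h_prev, s_prev, params, lam_s, rng):
    """One noisy RRN step: every pre-activation receives isotropic
    Gaussian noise with precision lam_s before a nonlinearity is applied."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
    std = lam_s ** -0.5

    def noise(shape):
        return rng.normal(0.0, std, shape)

    # Candidate input and the three gates (all noisy, cf. Eqs. (1)-(3)).
    a = params["Y"] @ x_prev + params["W"] @ h_prev + params["b"] + noise(h_prev.shape)
    h_f = params["Yf"] @ x_prev + params["Wf"] @ h_prev + params["bf"] + noise(h_prev.shape)
    h_i = params["Yi"] @ x_prev + params["Wi"] @ h_prev + params["bi"] + noise(h_prev.shape)
    h_o = params["Yo"] @ x_prev + params["Wo"] @ h_prev + params["bo"] + noise(h_prev.shape)

    # Cell state: forget-gated old state plus input-gated candidate.
    s = sigma(h_f) * s_prev + sigma(h_i) * sigma(a) + noise(s_prev.shape)
    # Output state: squashed cell state modulated by the output gate.
    h = np.tanh(s) * sigma(h_o) + noise(s_prev.shape)
    return h, s
```

A larger lam_s makes the step closer to a deterministic LSTM cell, matching the Dirac-delta limit discussed later for the full model.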
Similar to [18, 3], two RRNs can be concatenated to form an encoder-decoder architecture.

3.2 Wildcard Denoising

Since the input and output are identical here, unlike [18, 3] where the input is from the source language and the output is from the target language, this naive RRN autoencoder can suffer from serious overfitting, even after taking noise into account and reversing sequence order (we find that reversing sequence order in the decoder [18] does not improve the recommendation performance). One natural way of handling it is to borrow ideas from the denoising autoencoder [19] by randomly dropping some of the words in the encoder. Unfortunately, directly dropping words may mislead the learning of transitions between words. For example, if we drop the word 'is' in the sentence 'this is a good idea', the encoder will wrongly learn the subsequence 'this a', which never appears in a grammatically correct sentence. Here we propose another denoising scheme, called wildcard denoising, where a special word '⟨wildcard⟩' is added to the vocabulary and we randomly select some of the words and replace them with '⟨wildcard⟩'. This way, the encoder RRN will take 'this ⟨wildcard⟩ a good idea' as input and successfully avoid learning wrong subsequences. We call this denoising recurrent autoencoder (DRAE). Note that the word '⟨wildcard⟩' also has a corresponding word embedding. Intuitively, this wildcard denoising RRN autoencoder learns to fill in the blanks in sentences automatically.
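The corruption step itself is simple; a minimal sketch follows (the token string and function name are our own choices, not the paper's):

```python
import random

WILDCARD = "<wildcard>"  # special token added to the vocabulary

def wildcard_corrupt(words, noise_rate, rng=random):
    """Replace each word with the wildcard token with probability
    noise_rate, keeping sentence length and word order intact
    (unlike word dropping, which would create false subsequences)."""
    return [WILDCARD if rng.random() < noise_rate else w for w in words]
```

For example, corrupting "this is a good idea" may yield "this <wildcard> a good idea" as encoder input, while the decoder is still trained to reproduce the original sentence.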
We find this denoising scheme much better than the naive one. For example, on the CiteULike dataset wildcard denoising provides a relative accuracy boost of about 20%.

3.3 Beta-Pooling

The RRN autoencoder produces a representation vector for each input word. In order to facilitate the factorization of the rating matrix, we need to pool the sequence of vectors into one single vector of fixed length 2K_W before it is further encoded into a K-dimensional vector. A natural way is to use a weighted average of the vectors. Unfortunately, different sequences may need weight vectors of different sizes. For example, pooling a sequence of 8 vectors needs a weight vector with 8 entries, while pooling a sequence of 50 vectors needs one with 50 entries. In other words, we need a weight vector of variable length for our pooling scheme. To tackle this problem, we propose to use a beta distribution. If six vectors are to be pooled into one single vector (using a weighted average), we can use the area w_p in the range ((p−1)/6, p/6) of the x-axis of the probability density function (PDF) of the beta distribution Beta(a, b) as the pooling weight. The resulting pooling weight vector becomes y = (w_1, . . . , w_6)^T. Since the total area is always 1 and the x-axis is bounded, the beta distribution is perfect for this type of variable-length pooling (hence the name beta-pooling). If we set the hyperparameters a = b = 1, it is equivalent to average pooling. If a is set large enough and b > a, the PDF will peak slightly to the left of x = 0.5, which means that the last time step of the encoder RRN is directly used as the pooling result.
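As a quick sanity check of this weighting scheme, the sketch below computes the pooling weights as increments of the Beta CDF by midpoint integration of the PDF (the grid resolution is an implementation choice of ours, and for the very large a, b used later one would work in log space instead):

```python
import math

def beta_pooling_weights(n, a, b, sub=2000):
    """Pooling weights w_p = F(p/n; a, b) - F((p-1)/n; a, b), where F is
    the Beta(a, b) CDF; the increments are approximated by midpoint
    integration of the Beta PDF on a grid of n * sub points."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    grid = n * sub
    dx = 1.0 / grid
    # Probability mass of each grid cell (midpoint rule).
    mass = [norm * ((i + 0.5) * dx) ** (a - 1)
            * (1.0 - (i + 0.5) * dx) ** (b - 1) * dx
            for i in range(grid)]
    # Accumulate the mass falling into each of the n pooling buckets.
    return [sum(mass[p * sub:(p + 1) * sub]) for p in range(n)]
```

With a = b = 1 the weights reduce to uniform averaging, as noted above, and symmetric choices like a = b = 2 put more weight near the middle of the sequence.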
With only two parameters, beta-pooling is able to pool vectors flexibly enough without the risk of overfitting the data.

3.4 CRAE as a Hierarchical Bayesian Model

Following the notation in Section 2 and using the DRAE in Section 3.2 as a component, we then provide the generative process of the CRAE (note that t indexes words or time steps, j indexes sentences or documents, and Tj is the number of words in document j):

Encoding (t = 1, 2, . . . , Tj): Generate x′(j)_{t−1}, a(j)_{t−1}, and s(j)_t according to Equations (1)-(2).

Compression and decompression (t = Tj + 1):

θ_j ∼ N(W_1 (h^{(j)}_{Tj}; s^{(j)}_{Tj}) + b_1, λ_s^{−1} I_K),  (h^{(j)}_{Tj+1}; s^{(j)}_{Tj+1}) ∼ N(W_2 tanh(θ_j) + b_2, λ_s^{−1} I_{2K_W}).   (4)

Decoding (t = Tj + 2, Tj + 3, . . . , 2Tj + 1): Generate a(j)_{t−1}, s(j)_t, and h(j)_t according to Equations (1)-(3), after which generate:

e^{(j)}_{t−Tj−2} ∼ Mult(softmax(W_g h^{(j)}_t + b_g)).

Beta-pooling and recommendation:

γ_j ∼ N(tanh(W_1 f_{a,b}({(h^{(j)}_t; s^{(j)}_t)}_t) + b_1), λ_s^{−1} I_K),   (5)
v_j ∼ N(γ_j, λ_v^{−1} I_K),  u_i ∼ N(0, λ_u^{−1} I_K),  R_ij ∼ N(u_i^T v_j, C_ij^{−1}).

Note that each column of the weights and biases in W+ is drawn from N(0, λ_w^{−1} I_{K_W}) or N(0, λ_w^{−1} I_K). In the generative process above, the input gate h^{i(j)}_{t−1} and the forget gate h^{f(j)}_{t−1} can be drawn as described in Section 3.1. e′(j)_t denotes the corrupted word (with the embedding x′(j)_t) and e(j)_t denotes the original word (with the embedding x(j)_t). λ_w, λ_u, λ_s, and λ_v are hyperparameters and C_ij is a confidence parameter (C_ij = α if R_ij = 1 and C_ij = β otherwise). Note that if λ_s goes to infinity, the Gaussian distribution (e.g., in Equation (4)) will become a Dirac delta distribution centered at the mean. The compression and decompression act like a bottleneck between two Bayesian RRNs. The purpose is to reduce overfitting, provide necessary nonlinear transformation, and perform dimensionality reduction to obtain a more compact final representation γ_j for CF. The graphical model for an example CRAE where Tj = 2 for all j is shown in Figure 1 (left). f_{a,b}({(h^{(j)}_t; s^{(j)}_t)}_t) in Equation (5) is the result of beta-pooling with hyperparameters a and b. If we denote the cumulative distribution function of the beta distribution as F(x; a, b), and let φ^{(j)}_t = (h^{(j)}_t; s^{(j)}_t) for t = 1, . . . , Tj and φ^{(j)}_t = (h^{(j)}_{t+1}; s^{(j)}_{t+1}) for t = Tj + 1, . . . , 2Tj, then we have f_{a,b}({(h^{(j)}_t; s^{(j)}_t)}_t) = Σ_{t=1}^{2Tj} (F(t/(2Tj); a, b) − F((t−1)/(2Tj); a, b)) φ_t. Please see Section 3 of the supplementary materials for details (including hyperparameter learning) of beta-pooling. From the generative process, we can see that both CRAE and CDL are Bayesian deep learning (BDL) models (as described in [24]) with a perception component (DRAE in CRAE) and a task-specific component.
Besides, with CDL [23] and CTR [20] as our primary baselines, it would be fairer to use maximum a posteriori (MAP) estimates, which is what CDL and CTR do.

End-to-end joint learning: Maximization of the posterior probability is equivalent to maximizing the joint log-likelihood of {u_i}, {v_j}, W+, {θ_j}, {γ_j}, {e^{(j)}_t}, {e′^{(j)}_t}, {h^{+(j)}_t}, {s^{(j)}_t}, and R given λ_u, λ_v, λ_w, and λ_s:

L = log p(DRAE|λ_s, λ_w) − (λ_u/2) Σ_i ‖u_i‖² − (λ_v/2) Σ_j ‖v_j − γ_j‖² − Σ_{i,j} (C_ij/2) (R_ij − u_i^T v_j)² − (λ_s/2) Σ_j ‖tanh(W_1 f_{a,b}({(h^{(j)}_t; s^{(j)}_t)}_t) + b_1) − γ_j‖²,

where log p(DRAE|λ_s, λ_w) corresponds to the prior and likelihood terms for the DRAE (including the encoding, compression, decompression, and decoding in Section 3.4) involving W+, {θ_j}, {e^{(j)}_t}, {e′^{(j)}_t}, {h^{+(j)}_t}, and {s^{(j)}_t}. For simplicity and computational efficiency, we can fix the hyperparameters of beta-pooling so that Beta(a, b) peaks slightly to the left of x = 0.5 (e.g., a = 9.8 × 10^7, b = 1 × 10^8), which leads to γ_j = tanh(θ_j) (a treatment for the more general case with learnable a or b is provided in the supplementary materials). Further, if λ_s approaches infinity, the terms with λ_s in log p(DRAE|λ_s, λ_w) will vanish and γ_j will become tanh(W_1 (h^{(j)}_{Tj}; s^{(j)}_{Tj}) + b_1). Figure 1 (right) shows the graphical model of a degenerated CRAE when λ_s approaches positive infinity and b > a (with very large a and b). Learning this degenerated version of CRAE is equivalent to jointly training a wildcard denoising RRN and an encoding RRN coupled with the rating matrix.
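To make the shape of this objective concrete, here is a small numpy sketch of the CF terms of the joint log-likelihood, written as a loss to minimize; the DRAE term log p(DRAE|λ_s, λ_w) is deliberately omitted, and all variable names and toy hyperparameters are our own:

```python
import numpy as np

def cf_loss(R, U, V, gamma, lam_u, lam_v, alpha, beta):
    """CF portion of the joint log-likelihood L with the sign flipped:
    confidence-weighted rating reconstruction plus the u_i regularizer
    and the coupling of v_j to the network output gamma_j."""
    C = np.where(R == 1, alpha, beta)   # confidence C_ij
    pred = U @ V.T                       # predicted ratings u_i^T v_j
    rating_term = 0.5 * np.sum(C * (R - pred) ** 2)
    u_term = 0.5 * lam_u * np.sum(U ** 2)
    v_term = 0.5 * lam_v * np.sum((V - gamma) ** 2)
    return rating_term + u_term + v_term
```

Minimizing this quantity over U and V while gamma is produced by the DRAE is exactly the coupling that lets ratings and content regularize each other.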
If λ_v ≪ 1, CRAE will further degenerate to a two-step model where the representation θ_j learned by the DRAE is directly used for CF. On the contrary, if λ_v ≫ 1, the decoder RRN essentially vanishes. Both extreme cases can greatly degrade the predictive performance, as shown in the experiments.

Robust nonlinearity on distributions: Different from [23, 22], nonlinear transformation is performed after adding the noise with precision λ_s (e.g., a^{(j)}_t in Equation (1)). In this case, the input of the nonlinear transformation is a distribution rather than a deterministic value, making the nonlinearity more robust than in [23, 22] and leading to more efficient and direct learning algorithms than CDL. Consider a univariate Gaussian distribution N(x|µ, λ_s^{−1}) and the sigmoid function σ(x) = 1/(1 + exp(−x)); for the expectation we have (see Section 6 of the supplementary materials for details):

E(x) = ∫ N(x|µ, λ_s^{−1}) σ(x) dx = σ(κ(λ_s) µ),   (6)

where κ(λ_s) = (1 + ξ² λ_s^{−1})^{−1/2} with ξ² = π/8. Equation (6) holds because the convolution of a sigmoid function with a Gaussian distribution can be approximated by another sigmoid function. Similarly, we can approximate σ(x)² with σ(ρ_1(x + ρ_0)), where ρ_1 = 4 − 2√2 and ρ_0 = −log(√2 + 1). Hence the variance

D(x) ≈ ∫ N(x|µ, λ_s^{−1}) σ(ρ_1(x + ρ_0)) dx − E(x)² = σ(ρ_1(µ + ρ_0)/(1 + ξ² ρ_1² λ_s^{−1})^{1/2}) − E(x)² ≈ λ_s^{−1},   (7)

where we use λ_s^{−1} to approximate D(x) for computational efficiency. Using Equations (6) and (7), the Gaussian distribution in Equation (2) can be computed as:

N(σ(h̄^f_{t−1}) ⊙ s̄_{t−1} + σ(h̄^i_{t−1}) ⊙ σ(ā_{t−1}), λ_s^{−1} I_{K_W}) ≈ N(σ(κ(λ_s) h̄^f_{t−1}) ⊙ s̄_{t−1} + σ(κ(λ_s) h̄^i_{t−1}) ⊙ σ(κ(λ_s) ā_{t−1}), λ_s^{−1} I_{K_W}),   (8)

where the superscript (j) is dropped and we use overlines (e.g., ā_{t−1} = Y_e x̄_{t−1} + W_e h̄_{t−1} + b_e) to denote the mean of the distribution from which a hidden variable is drawn. By applying Equation (8) recursively, we can compute s̄_t for any t. A similar approximation is used for tanh(x) in Equation (3) since tanh(x) = 2σ(2x) − 1. This way the feedforward computation of DRAE is seamlessly chained together, leading to more efficient learning algorithms than the layer-wise algorithms in [23, 22] (see Section 6 of the supplementary materials for more details).

Learning parameters: To learn u_i and v_j, block coordinate ascent can be used. Given the current W+, we can compute γ_j = tanh(W_1 f_{a,b}({(h^{(j)}_t; s^{(j)}_t)}_t) + b_1) and get the following update rules:

u_i ← (V C_i V^T + λ_u I_K)^{−1} V C_i R_i,
v_j ← (U C_j U^T + λ_v I_K)^{−1} (U C_j R_j + λ_v γ_j),

where U = (u_i)_{i=1}^I, V = (v_j)_{j=1}^J, C_i = diag(C_{i1}, . . . , C_{iJ}) is a diagonal matrix, R_i = (R_{i1}, . . . , R_{iJ})^T is a column vector containing all the ratings of user i, and C_j and R_j are defined analogously over the users for item j. Given U and V, W+ can be learned using the back-propagation algorithm according to Equations (6)-(8) and the generative process in Section 3.4. Alternating the updates of U, V, and W+ gives a local optimum of L.
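The block coordinate updates above can be sketched in numpy as follows (a naive per-user/per-item loop; by our convention the rows of U and V are u_i^T and v_j^T, and gamma holds the fixed network outputs γ_j):

```python
import numpy as np

def update_users_items(R, U, V, gamma, lam_u, lam_v, alpha, beta):
    """One sweep of the closed-form block coordinate updates for u_i and
    v_j, holding the DRAE output gamma fixed. C_ij = alpha where R_ij = 1
    and beta otherwise."""
    I, J = R.shape
    K = U.shape[1]
    C = np.where(R == 1, alpha, beta)
    for i in range(I):
        Ci = np.diag(C[i])                       # J-by-J confidence matrix
        A = V.T @ Ci @ V + lam_u * np.eye(K)
        U[i] = np.linalg.solve(A, V.T @ Ci @ R[i])
    for j in range(J):
        Cj = np.diag(C[:, j])                    # I-by-I confidence matrix
        A = U.T @ Cj @ U + lam_v * np.eye(K)
        V[j] = np.linalg.solve(A, U.T @ Cj @ R[:, j] + lam_v * gamma[j])
    return U, V
```

Note how the v_j update pulls each item vector toward gamma[j]: in the limit of very large lam_v the update returns the network output itself, mirroring the degenerate cases discussed above.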
After U and V are learned, we can predict the ratings as Rij = uT\n\ni vj.\n\n4 Experiments\n\nIn this section, we report some experiments on real-world datasets from different domains to evaluate\nthe capabilities of recommendation and automatic generation of missing sequences.\n\n4.1 Datasets\n\nWe use two datasets from different real-world domains. CiteULike is from [20] with 5,551 users and\n16,980 items (articles with text). Net\ufb02ix consists of 407,261 users, 9,228 movies, and 15,348,808\nratings after removing users with less than 3 positive ratings (following [23], ratings larger than 3 are\nregarded as positive ratings). Please see Section 7 of the supplementary materials for details.\n\n4.2 Evaluation Schemes\n\nRecommendation: For the recommendation task, similar to [21, 23], P items associated with each\nuser are randomly selected to form the training set and the rest is used as the test set. We evaluate the\nmodels when the ratings are in different degrees of density (P \u2208 {1, 2, . . . , 5}). For each value of P ,\nwe repeat the evaluation \ufb01ve times with different training sets and report the average performance.\nFollowing [20, 21], we use recall as the performance measure since the ratings are in the form of\nimplicit feedback [9, 12]. Speci\ufb01cally, a zero entry may be due to the fact that the user is not interested\nin the item, or that the user is not aware of its existence. Thus precision is not a suitable performance\nmeasure. We sort the predicted ratings of the candidate items and recommend the top M items for\nthe target user. The recall@M for each user is then de\ufb01ned as:\n\nrecall@M =\n\n# items that the user likes among the top M\n\n# items that the user likes\n\n.\n\nThe average recall over all users is reported.\n\n6\n\n\fFigure 2: Performance comparison of CRAE, CDL, CTR, DeepMusic, CMF, and SVDFeature based\non recall@M for datasets CiteULike and Net\ufb02ix. 
P is varied from 1 to 5 in the first two figures.
We also use another evaluation metric, mean average precision (mAP), in the experiments. Exactly the same as [10], the cutoff point is set at 500 for each user.
Sequence generation on the fly: For the sequence generation task, we set P = 5. In terms of content information (e.g., movie plots), we randomly select 80% of the items to include their content in the training set. The trained models are then used to predict (generate) the content sequences for the other 20% of the items. The BLEU score [11] is used to evaluate the quality of generation. To compute the BLEU score in CiteULike we use the titles as training sentences (sequences). Both the titles and sentences in the abstracts of the articles (items) are used as reference sentences. For Netflix, the first sentences of the plots are used as training sentences. The movie names and sentences in the plots are used as reference sentences. A higher BLEU score indicates higher quality of sequence generation. Since CDL, CTR, and PMF cannot generate sequences directly, a nearest-neighborhood-based approach is used with the resulting v_j. Note that this task is extremely difficult because the sequences of the test set are unknown during both the training and testing phases. For this reason, this task is impossible for existing machine translation models like [18, 3].

4.3 Baselines and Experimental Settings

The models for comparison are listed as follows:
• CMF: Collective Matrix Factorization [17] is a model incorporating different sources of information by simultaneously factorizing multiple matrices.
• SVDFeature: SVDFeature [2] is a model for feature-based collaborative filtering. In this paper we use the bag-of-words as raw features to feed into SVDFeature.
• DeepMusic: DeepMusic [10] is a feedforward model for music recommendation mentioned in Section 1. We use the best performing variant as our baseline.
• CTR: Collaborative Topic Regression [20] is a model performing topic modeling and collaborative filtering simultaneously, as mentioned in the previous section.
• CDL: Collaborative Deep Learning (CDL) [23] is proposed as a probabilistic feedforward model for joint learning of a probabilistic SDAE [19] and CF.
• CRAE: Collaborative Recurrent Autoencoder is our proposed recurrent model. It jointly performs collaborative filtering and learns the generation of content (sequences).

In the experiments, we use 5-fold cross validation to find the optimal hyperparameters for CRAE and the baselines. For CRAE, we set α = 1, β = 0.01, K = 50, and K_W = 100. The wildcard denoising rate is set to 0.4. See Section 5.1 of the supplementary materials for details.

4.4 Quantitative Comparison

Recommendation: The first two plots of Figure 2 show the recall@M for the two datasets when P is varied from 1 to 5. As we can see, CTR outperforms the other baselines except for CDL. Note that, as previously mentioned, in both datasets DeepMusic suffers badly from overfitting when the rating matrix is extremely sparse (P = 1) and achieves comparable performance with CTR when the rating matrix is dense (P = 5).
CDL as the strongest baseline consistently outperforms other baselines.\nBy jointly learning the order-aware generation of content (sequences) and performing collaborative\n\ufb01ltering, CRAE is able to outperform all the baselines by a margin of 0.7% \u223c 1.9% (a relative boost\nof 2.0% \u223c 16.7%) in CiteULike and 3.5% \u223c 6.0% (a relative boost of 5.7% \u223c 22.5%) in Net\ufb02ix.\nNote that since the standard deviation is minimal (3.38 \u00d7 10\u22125 \u223c 2.56 \u00d7 10\u22123), it is not included in\nthe \ufb01gures and tables to avoid clutter.\nThe last two plots of Figure 2 show the recall@M for CiteULike and Net\ufb02ix when M varies from 50\nto 300 and P = 1. As shown in the plots, the performance of DeepMusic, CMF, and SVDFeature is\n\n7\n\n123450.10.150.20.250.30.35PRecall CRAECDLCTRDeepMusicCMFSVDFeature123450.150.20.250.30.350.40.450.50.550.60.65PRecall CRAECDLCTRDeepMusicCMFSVDFeature501001502002503000.020.030.040.050.060.070.080.090.10.110.12MRecall CRAECDLCTRDeepMusicCMFSVDFeature501001502002503000.050.10.150.20.250.3MRecall CRAECDLCTRDeepMusicCMFSVDFeature\f(a)\n\n(b)\n\n(c)\n\n(d)\n\n(e)\n\nFigure 3: The shape of the beta distribution for different a and b (corresponding to Table 1).\n\n(f)\n\n(g)\n\n(h)\n\nTable 1: Recall@300 for beta-pooling with different hyperparameters\n\na\nb\n\nRecall\n\n31112\n40000\n12.17\n\n1\n1\n\n1\n10\n\n311\n400\n12.54\n10.72\nTable 2: mAP for two datasets\n\n0.4\n0.4\n11.08\n\n10.48\n\n11.62\n\n10\n1\n\nCiteULike\nNet\ufb02ix\n\nCRAE\n0.0123\n0.0301\n\nDeepMusic\n\nCDL\n0.0091\n0.0275\n\nCMF\n0.0061\n0.0144\nTable 3: BLEU score for two datasets\n\nCTR\n0.0071\n0.0211\n\n0.0058\n0.0156\n\n400\n311\n12.71\n\n40000\n31112\n12.22\n\nSVDFeature\n\n0.0056\n0.0173\n\nCiteULike\nNet\ufb02ix\n\nCRAE\n46.60\n48.69\n\nCDL\n21.14\n6.90\n\nCTR\n31.47\n17.17\n\nPMF\n17.85\n11.74\n\nsimilar in this setting. 
Again CRAE is able to outperform the baselines by a large margin, and the margin widens as M increases.
As shown in Figure 3 and Table 1, we also investigate the effect of a and b in beta-pooling and find that in DRAE: (1) temporal average pooling performs poorly (a = b = 1); (2) most information concentrates near the bottleneck; (3) the right of the bottleneck contains more information than the left. Please see Section 4 of the supplementary materials for more details.
As another evaluation metric, Table 2 compares the different models based on mAP. As we can see, compared with CDL, CRAE provides a relative boost of 35% and 10% for CiteULike and Netflix, respectively. Besides this quantitative comparison, a qualitative comparison of CRAE and CDL is provided in Section 2 of the supplementary materials. In terms of time cost, CDL needs 200 epochs (40s/epoch) while CRAE needs about 80 epochs (150s/epoch) to reach optimal performance.
Sequence generation on the fly: To evaluate the ability to generate sequences, we compute the BLEU score of the sequences (titles for CiteULike and plots for Netflix) generated by different models. As mentioned in Section 4.2, this task is impossible for existing machine translation models like [18, 3] due to the lack of source sequences. As we can see in Table 3, CRAE achieves a BLEU score of 46.60 for CiteULike and 48.69 for Netflix, much higher than CDL, CTR, and PMF. By incorporating the content information when learning user and item latent vectors, CTR is able to outperform the other baselines, and CRAE further boosts the BLEU score by jointly modeling the generation of sequences and ratings.
Note that although CDL is able to outperform the other baselines in the recommendation task, it performs poorly when generating sequences on the fly, which demonstrates the importance of modeling each sequence recurrently as a whole rather than as separate words.

5 Conclusions and Future Work

We develop a collaborative recurrent autoencoder which can model the generation of item sequences while extracting the implicit relationship between items (and users). We design a new pooling scheme for pooling variable-length sequences and propose a wildcard denoising scheme to effectively avoid overfitting. To the best of our knowledge, CRAE is the first model to bridge the gap between RNN and CF. Extensive experiments show that CRAE can significantly outperform the state-of-the-art methods on both the recommendation and sequence generation tasks.
With its Bayesian nature, CRAE can easily be generalized to seamlessly incorporate auxiliary information (e.g., the citation network for CiteULike and the co-director network for Netflix) for a further accuracy boost. Moreover, multiple Bayesian recurrent layers may be stacked together to increase its representation power. Besides making recommendations and guessing sequences on the fly, the wildcard denoising recurrent autoencoder also has the potential to solve other challenging problems such as recovering blurred words in ancient documents.

References

[1] Y. Bengio, I. J. Goodfellow, and A. Courville. Deep learning. Book in preparation for MIT Press, 2015.
[2] T. Chen, W. Zhang, Q. Lu, K. Chen, Z. Zheng, and Y. Yu. SVDFeature: a toolkit for feature-based collaborative filtering. JMLR, 13:3619–3622, 2012.
[3] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio.
Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[4] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In NIPS, 2015.
[5] O. Fabius and J. R. van Amersfoort. Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581, 2014.
[6] K. Georgiev and P. Nakov. A non-iid framework for collaborative filtering with restricted Boltzmann machines. In ICML, 2013.
[7] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.
[8] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[9] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In ICDM, 2008.
[10] A. V. D. Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In NIPS, 2013.
[11] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
[12] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI, 2009.
[13] F. Ricci, L. Rokach, and B. Shapira. Introduction to Recommender Systems Handbook. Springer, 2011.
[14] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In ICASSP, 2013.
[15] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS, 2007.
[16] R. Salakhutdinov, A. Mnih, and G. E. Hinton. Restricted Boltzmann machines for collaborative filtering. In ICML, 2007.
[17] A. P. Singh and G. J. Gordon. Relational learning via collective matrix factorization. In KDD, 2008.
[18] I. Sutskever, O. Vinyals, and Q.
V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[19] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11:3371–3408, 2010.
[20] C. Wang and D. M. Blei. Collaborative topic modeling for recommending scientific articles. In KDD, 2011.
[21] H. Wang, B. Chen, and W.-J. Li. Collaborative topic regression with social regularization for tag recommendation. In IJCAI, 2013.
[22] H. Wang, X. Shi, and D. Yeung. Relational stacked denoising autoencoder for tag recommendation. In AAAI, 2015.
[23] H. Wang, N. Wang, and D. Yeung. Collaborative deep learning for recommender systems. In KDD, 2015.
[24] H. Wang and D. Yeung. Towards Bayesian deep learning: A framework and some existing methods. TKDE, 2016, to appear.
[25] X. Wang and Y. Wang. Improving content-based and hybrid music recommendation using deep learning. In ACM MM, 2014.