{"title": "Collaborative Gaussian Processes for Preference Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2096, "page_last": 2104, "abstract": "We present a new model based on Gaussian processes (GPs) for learning pairwise preferences expressed by multiple users. Inference is simplified by using a \\emph{preference kernel} for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms.", "full_text": "Collaborative Gaussian Processes for Preference Learning

Neil Houlsby∗
Department of Engineering
University of Cambridge

José Miguel Hernández-Lobato∗
Department of Engineering
University of Cambridge

Ferenc Huszár
Department of Engineering
University of Cambridge

Zoubin Ghahramani
Department of Engineering
University of Cambridge

Abstract

We present a new model based on Gaussian processes (GPs) for learning pairwise preferences expressed by multiple users. Inference is simplified by using a preference kernel for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. 
Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms.

1 Introduction

Preference learning is concerned with making inference from data consisting of pairs of items and corresponding binary labels indicating user preferences. This data arises in many contexts, including medical assistive technologies [1], graphical design [3] and recommendation systems [5]. A popular modeling approach assumes the existence of a utility function f such that f(x) gives the utility of an item with feature vector x; f(xi) > f(xj) indicates that item i is preferred to item j. Bayesian methods can be used to learn f, for example, by modeling f independently for each user as a draw from a Gaussian process (GP) prior [4]. However, when data from many users is available, such methods do not leverage similarities in preferences across users. Current multi-user approaches either require that features are available for each user and assume that users with similar features have similar preferences [2], or perform single-user learning, ignoring user features, but tie information across users with a hierarchical prior [1]. These methods are not flexible: each can address only one of two possible scenarios, namely a) user features are available and useful for prediction, or b) user features are unavailable or uninformative. Additionally, they involve solving at least U GP problems, where U is the total number of users. This cost is prohibitive even for modest U. Our approach, by contrast, can address both a) and b) by combining informative user features with collaborative information. 
Furthermore, we perform scalable inference which can handle problems with large U.

Our new multi-user model is based on dimensionality reduction ideas from the field of collaborative filtering [19, 16]. Unsupervised learning of similarities in users' behavior is exploited without requiring access to user-specific feature vectors. However, if these are available it may be desirable to incorporate them for predictions; our model can use these user-specific features as well. The proposed method is based on a connection between preference learning and GP binary classification. We show that both problems are equivalent when a covariance function called the preference kernel is used. This specific kernel simplifies the inference process, allowing us to implement more complex models such as the proposed multi-user approach. Finally, in real scenarios, querying users for preferences may be costly and intrusive, so it is desirable to learn preferences using the least data possible. With this objective, we present BALD (Bayesian active learning by disagreement), an efficient active learning strategy for binary classification problems with GP priors.

∗Both authors contributed equally.

2 Pairwise preference learning as a special case of binary classification

The problem of pairwise preference learning can be recast as a special case of binary classification. Let us consider two items i and j with corresponding feature vectors xi, xj ∈ X. In the pairwise preference learning problem, we are given pairs of feature vectors xi and xj and corresponding class labels y ∈ {−1, 1} such that y = 1 if the user prefers item i to item j and y = −1 otherwise. 
The task of interest is then to predict the class label for a new pair of feature vectors not seen before. This problem can be addressed by introducing a latent preference function f : X → R such that f(xi) > f(xj) whenever the user prefers item i to item j and f(xi) < f(xj) otherwise [4]. When the evaluations of f are contaminated with Gaussian noise with zero mean and (without loss of generality) variance 1/2, we obtain the following likelihood for f given xi, xj and y:

P(y|xi, xj, f) = Φ[(f(xi) − f(xj)) y] ,   (1)

where Φ is the standard Gaussian cumulative distribution function. The preference learning problem can be solved by combining a GP prior on f with the likelihood function in (1) [4]. The posterior for f can then be used to make predictions on the user preferences for new pairs of items.

Note that the likelihood (1) depends only on the difference between f(xi) and f(xj). Let g : X² → R be the latent function g(xi, xj) = f(xi) − f(xj). We can recast the inference problem in terms of g and ignore f. When the evaluation of g is contaminated with standard Gaussian noise, the likelihood for g given xi, xj and y is

P(y|xi, xj, g) = Φ[g(xi, xj) y] .   (2)

Since g is obtained from f through a linear operation, the GP prior on f induces a GP prior on g. The covariance function kpref of the GP prior on g can be computed from the covariance function k of the GP on f as

kpref((xi, xj), (xk, xl)) = k(xi, xk) + k(xj, xl) − k(xi, xl) − k(xj, xk) .

The derivations can be found in Section 1 of the supplementary material. We call kpref the preference kernel. The same kernel function can be derived from a large margin classification viewpoint [6]. 
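As a small worked illustration of this construction, the sketch below builds kpref from a base item kernel. The RBF base kernel and all function names here are our own illustrative choices, not code from the paper:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    # Base item kernel k(x, x') = exp(-||x - x'||^2 / (2 * lengthscale^2)).
    sq_dist = np.sum((a - b) ** 2)
    return np.exp(-0.5 * sq_dist / lengthscale ** 2)

def preference_kernel(pair1, pair2, k=rbf_kernel):
    # k_pref((xi, xj), (xk, xl)) = k(xi, xk) + k(xj, xl) - k(xi, xl) - k(xj, xk):
    # the covariance induced on g(xi, xj) = f(xi) - f(xj) by a GP prior on f.
    (xi, xj), (xk, xl) = pair1, pair2
    return k(xi, xk) + k(xj, xl) - k(xi, xl) - k(xj, xk)
```

One useful sanity check on this form: swapping the items within either pair flips the sign of the covariance, mirroring the antisymmetry g(xi, xj) = −g(xj, xi).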
However, to our knowledge, the preference kernel has not been used previously for GP-based models.

The combination of (2) with a GP prior based on the preference kernel allows us to transform the pairwise preference learning problem into binary classification with GPs. This means that state-of-the-art methods for GP binary classification, such as expectation propagation [14], can be applied directly to preference learning. Furthermore, the simplified likelihood (2) allows us to implement complex methods such as the multi-user approach which is described in the following section.

3 Multi-user preference learning

Consider I items with feature vectors xi ∈ X for i = 1, ..., I. The single-user learning approach assumes an independent latent function for the u-th user, gu : X² → R. Our approach to the multi-user problem is to assume common structure in the user latent functions. In particular, we assume a set of D shared latent functions, hd : X² → R for d = 1, ..., D, such that the user latent functions are generated by a linear combination of these functions, namely

gu(xj, xk) = Σ_{d=1}^D wu,d hd(xj, xk) ,   (3)

where wu,d ∈ R is the weight given to function hd for user u. We place a GP prior over the shared latent functions h1, ..., hD using the preference kernel described in the previous section. This model allows the preferences of the different users to share some common structure represented by the latent functions h1, ..., hD. This approach is similar to dimensionality reduction methods that are commonly used for addressing collaborative filtering problems [19, 16].

We may extend this model further to the case in which, for each user u, there is a feature vector uu ∈ U containing information that might be useful for prediction. We denote by U the set of all the users' feature vectors, that is, U = {u1, ..., uU}. 
The user features are now incorporated by placing a separate GP prior over the users' weights. In particular, we replace the scalars wu,d in (3) with functions w′d : U → R. These weight functions describe the contribution of shared latent function hd to the user latent function gu as a function of the user feature vector uu.

In the multi-user setting we are given a list L = {p1, ..., pP} with all the pairs of items evaluated by the users, where P ≤ I(I − 1)/2 (the maximum number of pairs). The data consists of L, the set of feature vectors for the users U (if available), the item features X = {x1, ..., xI}, and U sets of preference judgements, one for each user, D = {{zu,i, yu,i}_{i=1}^{Mu}}_{u=1}^{U}, where zu,i indexes the i-th pair evaluated by user u, yu,i = 1 if this user prefers the first item in the pair to the second and yu,i = −1 otherwise. Mu is the number of preference judgements made by the u-th user.

3.1 Probabilistic description

To address the task of predicting preferences on unseen item pairs we cast the model into a probabilistic framework. Let G be a U × P 'user-function' matrix, where each row corresponds to a particular user's latent function, that is, the entry in the u-th row and i-th column is gu,i = gu(xα(i), xβ(i)), and α(i) and β(i) denote respectively the first and second item in the i-th pair from L. Let H be a D × P 'shared-function' matrix, where each row represents a shared latent function, that is, the entry in the d-th row and i-th column is hd,i = hd(xα(i), xβ(i)). Finally, we introduce the U × D weight matrix W such that each row contains a user's weights, that is, the entry in the u-th row and d-th column of this matrix is w′d(uu). Note that G = WH represents equation (3) in matrix form. 
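The matrix form G = WH can be illustrated with a small toy sketch. The sizes and the random draws below are arbitrary illustrative assumptions; in the actual model the rows of H and the columns of W would be GP-distributed:

```python
import numpy as np

rng = np.random.default_rng(0)
U, P, D = 8, 10, 3  # users, observed item pairs, shared latent functions (toy sizes)

H = rng.standard_normal((D, P))  # shared latent functions evaluated at the P pairs
W = rng.standard_normal((U, D))  # per-user mixing weights

G = W @ H  # user-function matrix: row u stacks g_u at every observed pair
```

Because G is the product of a U × D and a D × P matrix, its rank is at most D: the model can express at most D independent "preference patterns", which is precisely what ties the users together.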
Let T be the U × P target matrix given by T = sign[G + E], where E is a U × P noise matrix whose entries are sampled i.i.d. from a standard Gaussian distribution and the function "sign" retains only the sign of the elements in a matrix. The observations yu,i in D are mapped to the corresponding entries of T using tu,zu,i = yu,i. Let T(D) and G(D) represent the elements of T and G corresponding only to the available observations yu,i in D. Then, the likelihood for G(D) given T(D) and the conditional distribution for G(D) given H and W are

P(T(D)|G(D)) = Π_{u=1}^U Π_{i=1}^{Mu} Φ[tu,zu,i gu,zu,i]   and   P(G(D)|W, H) = Π_{u=1}^U Π_{i=1}^{Mu} δ[gu,zu,i − wu h·,zu,i] ,

respectively, where wu is the u-th row in W, h·,i is the i-th column in H and δ represents a point probability mass at zero. We now select the priors for W and H. We assume that each function w′1, ..., w′D is sampled a priori from a GP with zero mean and a specific covariance function. Let Kusers be the U × U covariance matrix for the entries in each column of matrix W. Then

P(W|U) = Π_{d=1}^D N(w·,d|0, Kusers) ,   (4)

where w·,d is the d-th column in W. If user features are unavailable, Kusers becomes the identity matrix. Finally, we assume that each shared latent function h1, ..., hD is sampled a priori from a GP with zero mean and covariance function given by a preference kernel. Let Kitems be the P × P preference covariance matrix for the item pairs in L. The prior for H is then

P(H|X, L) = Π_{j=1}^D N(hj|0, Kitems) ,   (5)

where hj is the j-th row in H. 
The resulting posterior for W, H and G(D) is

P(W, H, G(D)|T(D), X, L) = P(T(D)|G(D)) P(G(D)|W, H) P(W|U) P(H|X, L) / P(T(D)|X, L) .   (6)

Given a new item pair pP+1, we can compute the predictive distribution for the preference of the u-th user (1 ≤ u ≤ U) on this pair by integrating out the parameters H, W and G(D) as follows:

P(tu,P+1|T(D), X, L, pP+1) = ∫ P(tu,P+1|gu,P+1) P(gu,P+1|wu, h·,P+1) P(h·,P+1|H, X, L, pP+1) P(H, W, G(D)|T(D), X, L) dH dW dG(D) ,   (7)

where P(tu,P+1|gu,P+1) = Φ[tu,P+1 gu,P+1], P(gu,P+1|wu, h·,P+1) = δ[gu,P+1 − wu h·,P+1] and

P(h·,P+1|H, X, L, pP+1) = Π_{d=1}^D N(hd,P+1 | k⋆^T Kitems^{-1} hd, k∗ − k⋆^T Kitems^{-1} k⋆) ,   (8)

where k∗ is the prior variance of hd(xα(P+1), xβ(P+1)) and k⋆ is a P-dimensional vector that contains the prior covariances between hd(xα(P+1), xβ(P+1)) and hd(xα(1), xβ(1)), ..., hd(xα(P), xβ(P)).

Figure 1: Toy example with 1D input. Circles and crosses denote labelled data. The plot shows the mean and variance of the GP predictive distribution. Maximum Entropy Sampling (MES) samples from the region of highest marginal uncertainty, ignoring the second term in (10). BALD samples from the region of greatest uncertainty in the latent function.

Computing (6) or (8) is infeasible and approximations must be used. For this, we use a combination of expectation propagation (EP) [14] and variational Bayes (VB) [7]. Empirical studies show that EP obtains state-of-the-art performance in the related problem of GP binary classification [15].

We want to learn user preferences with the proposed model from the least amount of data possible. Therefore we desire to query users actively about their preferences on the most informative pairs of items [3]. 
Next, we describe a novel method to implement this strategy. This method exploits the preference kernel and so may be trivially generalized to GP binary classification problems also.

4 Bayesian active learning by disagreement

The goal of active learning is to choose item pairs such that we learn the preference functions for the users using minimal data. Information theoretic approaches to active learning are popular because they do not require prior knowledge of loss functions or test domains. The central goal is to identify the new data point that maximizes the expected reduction in posterior entropy. For preference learning (see Section 2), this implies choosing the new item features xi and xj that maximize

H[P(g|D)] − E_{P(y|xi,xj,D)} [H[P(g|y, xi, xj, D)]] ,   (9)

where D are the user preferences observed so far and H[p(x)] = −∫ p(x) log p(x) dx represents the Shannon entropy. This framework, originally proposed in [10], is difficult to apply directly to models based on GPs. In these models, entropies can be poorly defined or their computation can be intractable. In practice, current approaches make approximations for the computation of the posterior entropy [12, 9]. However, a second difficulty arises: if n new data points are available for selection, with |{−1, 1}| = 2 possible values for y, then O(2n) potentially expensive posterior updates are required to find the maximizer of (9), one for every available feature vector and possible class value. This is often too expensive in practice.

A solution consists in noting that (9) is equivalent to the conditional mutual information between y and g. Using this we can rearrange this equation to compute entropies in y space:

H[P(y|xi, xj, D)] − E_{P(g|D)} [H[P(y|xi, xj, g)]] .   (10)

This overcomes the previous challenges. Entropies are now evaluated in output space, which has low dimension. 
Furthermore, g is now conditioned only upon D, so only O(1) updates of the posterior distribution are required. We only need to recompute the posterior once per data point selected, not for every possible data point under consideration. Expression (10) also provides us with an intuition about the objective; we seek the xi and xj for which a) the model is marginally uncertain about y (high H[P(y|xi, xj, D)]) and b) conditioned on a particular value of g the model is confident about y (low E_{P(g|D)} [H[P(y|xi, xj, g)]]). This can be interpreted as seeking the pair xi and xj for which the latent functions g, under the posterior, 'disagree' with each other the most about the outcome, that is, the preference judgement. Therefore, we refer to this objective as Bayesian Active Learning by Disagreement (BALD). This method is independent of the approach used for inference, something which does not hold for the techniques described in [12, 8, 9]. In the following section we show how (10) can be applied to binary classification with GPs, and hence via the preference kernel also to any preference learning problem.

4.1 BALD in binary classification with GPs

Most approximate inference methods for the problem of binary classification with GPs produce a Gaussian approximation to the posterior distribution of f, the latent function of interest. In the binary GP classifier, the entropy of y given the corresponding value of f can be expressed in terms of the binary entropy function, h[p] = −p log p − (1 − p) log(1 − p). In particular, H[p(y|x, f)] = h[Φ(f(x))]. When a Gaussian is used to approximate the posterior of f, we have that for each x, fx = f(x) will follow a Gaussian distribution with mean μx and variance σx². The first term in (10), that is, H[p(y|x, D)], can be handled analytically in this case:

H[p(y|x, D)] ≈ h[∫ Φ(fx) N(fx|μx, σx²) dfx] = h[Φ(μx (σx² + 1)^{-1/2})] ,

where ≈ represents here the Gaussian approximation to the posterior of fx. The second term in (10), that is, E_{p(f|D)} [H[p(y|x, f)]], can be approximated as

E_{p(f|D)} [H[p(y|x, f)]] ≈ C (σx² + C²)^{-1/2} exp(−μx² (2(σx² + C²))^{-1}) ,

where C = √(π log 2 / 2). This result is obtained by using the Gaussian approximation to the posterior of fx and then approximating h[Φ(fx)] by the squared exponential curve exp(−fx²/(π log 2)) (details can be found in Section 3 of the supplementary material).

To summarize, the BALD algorithm for active binary GP classification / preference learning first applies any approximate inference method to obtain the posterior mean μx and variance σx² of f at each point of interest x. Then, it selects the feature vector x that maximizes the objective

h[Φ(μx (σx² + 1)^{-1/2})] − C (σx² + C²)^{-1/2} exp(−μx² (2(σx² + C²))^{-1}) .   (11)

BALD assigns a high value to the feature vector x when the model is both uncertain about the label (μx close to 0) and there is high uncertainty about fx (σx² is large). The second term prevents BALD from sampling in regions where the model knows that the label is uncertain. Figure 1 illustrates the differences between BALD and Maximum Entropy Sampling [17] (details in the supplementary material, Section 5). 
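A minimal sketch of the resulting acquisition score, objective (11), measuring entropies in bits so that the squared-exponential approximation to h[Φ(·)] equals 1 at zero. The function names are our own illustrative choices:

```python
import math

def binary_entropy(p):
    # h[p] = -p log2 p - (1 - p) log2 (1 - p), clipped for numerical safety.
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def probit(z):
    # Standard Gaussian CDF, written via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bald_score(mu, var):
    # Objective (11): marginal entropy of y minus the (approximate) expected
    # entropy of y given the latent function, for a posterior marginal N(mu, var).
    C2 = math.pi * math.log(2.0) / 2.0  # C^2, with C = sqrt(pi * log(2) / 2)
    marginal = binary_entropy(probit(mu / math.sqrt(var + 1.0)))
    expected_conditional = math.sqrt(C2 / (var + C2)) * math.exp(-mu * mu / (2.0 * (var + C2)))
    return marginal - expected_conditional
```

The score is largest when mu is near zero while var is large, and collapses to roughly zero as var → 0, matching the intuition given above: a point whose label is uncertain only because of observation noise carries no information about f.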
MES considers only marginal uncertainty (the first term in (11)), and hence seeks data in an uninformative region of the plot. By contrast, BALD samples data from the region of greatest uncertainty in the latent function.

5 Expectation propagation and variational Bayes

Approximate inference in our model is implemented using a combination of expectation propagation (EP) [13] and variational Bayes (VB) [7]. Here, we briefly describe the method; full details are in Section 4 of the supplementary material. We approximate the posterior (6) by the parametric distribution

Q(W, H, G(D)) = [Π_{u=1}^U Π_{d=1}^D N(wu,d|m^w_{u,d}, v^w_{u,d})] [Π_{d=1}^D Π_{i=1}^P N(hd,i|m^h_{d,i}, v^h_{d,i})] [Π_{u=1}^U Π_{j=1}^{Mu} N(gu,zu,j|m^g_{u,j}, v^g_{u,j})] ,   (12)

where m^w_{u,d}, v^w_{u,d}, m^h_{d,i}, v^h_{d,i}, m^g_{u,j} and v^g_{u,j} are free parameters to be determined by EP and the superscripts w, h and g indicate the random variables described by these parameters. The joint distribution P(G(D), W, H, T(D), X, L) can be factorized into four factors f1, ..., f4, namely, P(G(D), W, H, T(D), X, L) = Π_{a=1}^4 fa(G(D), W, H), where f1(G(D), W, H) = P(T(D)|G(D)), f2(G(D), W, H) = P(G(D)|W, H), f3(G(D), W, H) = P(W|U) and f4(G(D), W, H) = P(H|X, L). EP approximates these exact factors by approximate factors f̂1(W, H, G(D)), ..., f̂4(W, H, G(D)) that have the same functional form as Q, namely

f̂a(G(D), W, H) = [Π_{u=1}^U Π_{d=1}^D N(wu,d|m̂^{a,w}_{u,d}, v̂^{a,w}_{u,d})] [Π_{d=1}^D Π_{i=1}^P N(hd,i|m̂^{a,h}_{d,i}, v̂^{a,h}_{d,i})] [Π_{u=1}^U Π_{j=1}^{Mu} N(gu,zu,j|m̂^{a,g}_{u,j}, v̂^{a,g}_{u,j})] ŝa ,   (13)

where a = 1, ..., 4 and the m̂^{a,w}_{u,d}, v̂^{a,w}_{u,d}, m̂^{a,h}_{d,i}, v̂^{a,h}_{d,i}, m̂^{a,g}_{u,j}, v̂^{a,g}_{u,j} and ŝa are free parameters. Note that Q is the normalized product of f̂1, ..., f̂4. 
The first step of EP is to initialize f̂1, ..., f̂4 and Q to be uniform. After that, EP iteratively refines f̂1, ..., f̂4 by minimizing the Kullback-Leibler (KL) divergence between the product of Q\a and fa and the product of Q\a and f̂a, where Q\a is the ratio between Q and f̂a. However, this does not perform well for refining f̂2; details on this problem can be found in Section 4 of the supplementary material and in [19]. For this factor we follow a VB approach. Instead of minimizing KL(Q\2 f2 ‖ Q\2 f̂2) with respect to the parameters of f̂2, we refine this approximate factor so that the reversed version of the KL divergence is minimized, that is, we minimize KL(Q\2 f̂2 ‖ Q\2 f2). EP iteratively refines all the approximate factors until convergence. This method also approximates the predictive distribution (7). For this, we replace the exact posterior in (7) with Q. Finally, EP can also approximate the normalization constant in (6) (the model evidence) as the integral of the product of all the approximate factors f̂1, ..., f̂4.

5.1 A sparse approximation to speed up computation

The cost of GPs is cubic in the number of function evaluations. In our case, refining f̂3 has cost O(DU³), where U is the number of users, and D the number of shared latent functions. The cost of refining f̂4 is O(DP³), where P is the number of observed item pairs. These costs can be reduced by approximating Kusers and Kitems in (4) and (5). We use the FITC approximation [18]. 
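For probit likelihood factors such as f1, the moments that EP matches at each refinement are available in closed form. The sketch below implements the standard Gaussian/probit moment-matching identities for a single factor Φ[t·g] against a cavity N(g|m, v), as in GP binary classification; it is an illustrative fragment with hypothetical names, not the paper's full EP update:

```python
import math

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def moment_match_probit(t, m, v):
    # Mean and variance of the tilted distribution proportional to
    # Phi(t * g) * N(g | m, v): the quantity EP matches when refining
    # a likelihood factor of the form Phi[t g], with label t in {-1, +1}.
    z = t * m / math.sqrt(1.0 + v)
    ratio = std_normal_pdf(z) / std_normal_cdf(z)
    new_m = m + t * v * ratio / math.sqrt(1.0 + v)
    new_v = v - v * v * ratio * (z + ratio) / (1.0 + v)
    return new_m, new_v
```

Observing t = +1 pulls the mean toward positive values of g and always shrinks the variance, as expected for a log-concave likelihood.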
Under this approximation, an n × n covariance matrix K generated by the evaluation of a covariance function at n locations is approximated by K′ = Q + diag(K − Q), where Q = Knn0 Kn0n0^{-1} Knn0^T, Kn0n0 is the n0 × n0 matrix generated by the evaluation of the covariance function at all possible combinations of only n0 < n locations or pseudo-inputs, and Knn0 is the n × n0 matrix with the covariances between all possible combinations of original locations and pseudo-inputs. These approximations allow us to refine f̂3 and f̂4 in O(DU0²U) and O(DP0²P) operations, where U0 and P0 are the numbers of pseudo-inputs for the users and for the item pairs, respectively. A detailed description of the EP updates based on the FITC approximation is given in Section 4.4 of the supplementary material.

6 Experiments and Discussion

The performance of our collaborative preference model with the BALD active learning strategy is evaluated in a series of experiments with simulated and real-world data. The analyzed datasets include a) synthetic data generated from the probabilistic model assumed by the proposed multi-user method (Synthetic), b) a collection of user preferences on different movies (MovieLens), c) the number of votes obtained by different political parties in the 2010 UK general election (Election), d) preferences of users about different types of sushi (Sushi), and finally, e) information regarding the concentration of heavy metals in the Swiss Jura region (Jura). Section 6 in the supplementary material contains a detailed description of these datasets.

6.1 Comparison with other multi-user methods

Alternative models. 
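A small numpy sketch of the FITC construction described above; the RBF kernel and the choice of pseudo-inputs are illustrative assumptions:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    # Kernel matrix between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def fitc_approx(X, X0, ell=1.0, jitter=1e-8):
    # FITC: K' = Q + diag(K - Q), with Q = K_{n n0} K_{n0 n0}^{-1} K_{n n0}^T,
    # where X0 holds the n0 < n pseudo-inputs.
    K = rbf(X, X, ell)
    Knm = rbf(X, X0, ell)
    Kmm = rbf(X0, X0, ell) + jitter * np.eye(len(X0))
    Q = Knm @ np.linalg.solve(Kmm, Knm.T)
    return Q + np.diag(np.diag(K - Q))
```

The diagonal correction makes K′ match K exactly on its diagonal while keeping a low-rank-plus-diagonal structure, which is what reduces the cost of the downstream linear algebra from cubic in n to O(n0² n).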
Two versions of the proposed collaborative preference (CP) model are used. The first version (CPU) takes into account the available user features, as described in Section 3. The second version (CP) ignores these features by replacing Kusers in (4) with the identity matrix. The first multi-user method we compare to is the approach of Birlutiu et al. (BI) [1]. This method does not use user features, and captures similarities between users with a hierarchical GP model. In particular, a common GP prior is assumed for the preference function of each user; using this prior the model learns the full GP posterior for each user. The second multi-user method is the technique of Bonilla et al. (BO) [2]. In this model there exists one high-dimensional function which depends on both the features of the two items to be compared and on the features of the user who makes the comparison. Relationships between users' behaviors are captured only via the user features. We implement BO and BI using the preference kernel and EP for approximate inference¹. The computational costs of BO and BI are rather high; BO has cubic complexity in the total number of observations, i.e. O((Σ_{u=1}^U Mu)³), while our model (CPU) has a significantly lower cost of O(D(U³ + P³)) (before further speed-up from FITC). BI does not include user features, but learns U GPs, so has complexity O(UP³); the equivalent version of our model (CP) has cost O(NP + DP³), which is lower because D ≪ U. More details about BI and BO are given in Sections 7 and 8 of the supplementary material. Finally, we consider a single-user approach (SU) which fits a different GP classifier independently to the data of each user.

¹Although this is not the same as the original implementations (sampling-based for BI, Laplace approximation for BO), the preference kernel and EP are likely to augment the performance of these algorithms, and provide the fairest comparison of the underlying models.

Table 1: Average test error with 100 users.

Dataset     CPU    CP     BI     BO     SU
Synthetic   0.162  0.180  0.175  0.157  0.226
Sushi       0.171  0.163  0.160  0.266  0.187
MovieLens   0.182  0.166  0.168  0.302  0.217
Election    0.199  0.123  0.077  0.401  0.300
Jura        0.159  0.153  0.153  0.254  0.181

Table 2: Training times (s) with 100 users.

Dataset     CPU     CP      BI      BO       SU
Synthetic   9.498   7.793   22.524  311.574  0.927
Sushi       5.694   4.307   20.028  215.136  0.817
MovieLens   5.313   4.013   19.366  69.048   0.604
Election    13.134  12.408  20.880  120.011  0.888
Jura        3.762   2.404   15.234  88.502   0.628

Table 3: Test error for each method and active learning strategy with at most 1000 users.

Dataset     CPU-B  CPU-E  CPU-R  CP-B   CP-E   CP-R   SU-B   SU-E   SU-R
Synthetic   0.135  0.135  0.139  0.153  0.160  0.173  0.249  0.259  0.268
Sushi       0.153  0.148  0.178  0.144  0.151  0.176  0.179  0.197  0.212
MovieLens   0.176  0.170  0.199  0.163  0.170  0.195  0.225  0.235  0.248
Election    0.158  0.224  0.202  0.097  0.093  0.151  0.332  0.346  0.338
Jura        0.141  0.143  0.168  0.138  0.138  0.169  0.176  0.166  0.197

Experimental procedure. Due to the high computational cost of BI and BO, to compare to these methods we must subsample the datasets, keeping only 100 users. The available data were split randomly into training and test sets of item pairs, where the training sets contain 20 pairs per user in Sushi, MovieLens and Election, 15 pairs in Jura and 30 in Synthetic. This was repeated 25 times to obtain statistically meaningful results. In CPU and CP, we selected the number of latent functions D to be 20 (see Table 6.1). 
In general, the proposed models, CPU and CP, are robust to over-fitting, and over-estimation of D does not harm predictive performance. Note that the Synthetic dataset is generated using D = 5 and CPU and CP still obtain very good results using D = 20. This automatic pruning of unnecessary degrees of freedom seems to be common in methods based on variational Bayes [11]. We selected the kernel lengthscales to be equal to the median distance between feature vectors. This leads to good empirical performance for most methods. An exception is BO, where the kernel hyperparameters are tuned to some held-out data using automatic relevance determination. In our model, we can also estimate the kernel lengthscales by maximizing the EP approximation of the model evidence, as illustrated in Section 9 of the supplementary material. This alternative approach can be used when it is necessary to fine-tune the lengthscale parameters to the data. In CPU we use U0 = 25 pseudo-inputs for approximating Kusers. These pseudo-inputs are selected randomly from the set of available data points. Similarly, in CP and CPU, we use P0 = 25 pseudo-inputs for approximating Kitems, except in the Jura and Election datasets (which contain fewer items), where we use P0 = 15. The results obtained are not sensitive to the number of pseudo-inputs used, as long as the number is not excessively low.

Results. Average test errors are shown in Table 1. Those highlighted in bold are statistically different from those not highlighted (calculated using a paired t-test). Overall, CP and CPU outperform SU and BO, and break even with BI; the final result is notable because, while BI learns the full mean and covariance structure across all users, ours uses only a few latent dimensions, which provides the key to scaling to many more users. 
CP outperforms CPU in all cases except in the Synthetic dataset. In the real-world datasets, users with similar features do not seem to have similar preferences, and so correlating the behavior of users with similar features is detrimental. In this case, the unsupervised learning of similarities in user preferences is more useful for prediction than the user features. This also explains the poor overall results obtained by BO. Finally, running times in seconds are presented in Table 2. The entries for BO do not include the time spent by this method to tune the kernel hyperparameters. CP and CPU are faster than BO and BI. The FITC approximation imposes a large multiplicative constant in the cost of CP and CPU, so for larger datasets the gains are much larger.

6.2 Active learning on large datasets

Figure 2 (panels: Synthetic, Sushi, MovieLens, Election, Jura): Average test error for CPU, CP and SU, using the strategies BALD (-B), entropy (-E) and random (-R) for active learning. For clarity, the curves for CPU are included only in the Synthetic and Election datasets. The complete plots can be found in Section 10 of the supplementary material.

Here we evaluate the performance of BALD; in particular, we compare CPU, CP, and SU using BALD (-B), Maximum Entropy Sampling (-E) and random sampling (-R). We now use all the available users from each dataset, with a maximum of 1000 users. For each user the available preference data are split randomly into training, pool and test sets, with 5, 35 and 5 data points respectively in Synthetic, Sushi and MovieLens, 3, 22 and 3 data points in Election and 3, 15 and 3 data points in Jura. Each method is fitted using the training sets and its performance is then evaluated on the corresponding test sets. After this, the most informative data point is identified in each of the pool sets. 
These data points are moved into the corresponding training sets and the process repeats until 10 of these active additions to the training sets have been completed. The entire process, including the dataset splitting, is repeated 25 times. Figure 2 shows the learning curve for each method. For clarity, the curve for CPU is included only for the Synthetic and Election datasets; in the other datasets CPU is marginally outperformed by CP (see supplementary material, Section 10). Average errors after 10 queries from the pool set of each user are summarized in Table 3. For each model (CPU, CP and SU), the results of the best active learning strategy are highlighted in bold. The results of the best model/active learning strategy combination are underlined. Highlighted results are statistically significant with respect to non-highlighted results according to a paired t test. BALD always outperforms random sampling and usually outperforms or obtains equivalent performance to MES. In particular, BALD significantly outperforms MES in 9 cases, while MES is better than BALD in only 2 cases.

7 Conclusions

We have proposed a multi-user model that combines collaborative filtering methods with GP binary preference modeling. We have shown that the task of learning user preferences can be recast as a particular case of binary classification with GPs when a covariance function called the preference kernel is used. We have also presented BALD, a novel active learning strategy for binary classification models with GPs.
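For a binary output, the BALD objective is the mutual information between the label and the latent function, i.e. the entropy of the averaged predictive probability minus the average entropy of the sampled probabilities. A minimal Monte Carlo sketch of this score (our own illustration with hypothetical names, assuming samples of p(y = 1 | f) at a candidate query are available; it is not the paper's implementation, which uses EP) is:

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable, elementwise."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log1p(-p))

def bald_score(prob_samples):
    """BALD score H[E[p]] - E[H[p]] from Monte Carlo samples of
    p(y = 1 | f) at a candidate query; larger means more informative."""
    p = np.asarray(prob_samples, dtype=float)
    return binary_entropy(p.mean()) - binary_entropy(p).mean()
```

The query chosen from each pool would then be the candidate maximizing this score: a point where the posterior samples disagree (probabilities near 0 and 1) scores higher than one where they agree, even if both have the same mean probability.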
The proposed multi-user model with BALD performs favorably on simulated and real-world data against single-user methods and existing approaches for multi-user preference learning, whilst having significantly lower computational times than competing multi-user methods.

Acknowledgements

NH is a recipient of the Google Europe Fellowship in Statistical Machine Learning, and this research is supported in part by this Google Fellowship. JMH is supported by Infosys Labs, Infosys Limited.

References

[1] A. Birlutiu, P. Groot, and T. Heskes. Multi-task preference learning with an application to hearing aid personalization. Neurocomputing, 73(7-9):1177-1185, 2010.

[2] E. V. Bonilla, S. Guo, and S. Sanner. Gaussian process preference elicitation. In Advances in Neural Information Processing Systems 23, pages 262-270, 2010.

[3] E. Brochu, N. de Freitas, and A. Ghosh. Active preference learning with discrete choice data. In Advances in Neural Information Processing Systems 20, pages 409-416, 2007.

[4] W. Chu and Z. Ghahramani. Preference learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, pages 137-144, 2005.

[5] M. De Gemmis, L. Iaquinta, P. Lops, C. Musto, F. Narducci, and G. Semeraro. Preference learning in recommender systems. In ECML/PKDD-09 Workshop on Preference Learning, 2009.

[6] J. Fürnkranz and E. Hüllermeier. Preference Learning. Springer-Verlag New York Inc., 2010.

[7] Z. Ghahramani and M. J. Beal.
Advanced Mean Field Methods: Theory and Practice, chapter Graphical models and variational methods, pages 161-177. 2001.

[8] B. Krishnapuram, D. Williams, Y. Xue, A. Hartemink, L. Carin, and M. Figueiredo. On semi-supervised classification. In Advances in Neural Information Processing Systems 17, pages 721-728, 2004.

[9] N. D. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: the informative vector machine. In Advances in Neural Information Processing Systems 15, pages 609-616, 2002.

[10] D. V. Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 27(4):986-1005, 1956.

[11] D. J. C. MacKay. Local minima, symmetry-breaking, and model pruning in variational free energy minimization. Available at http://www.inference.phy.cam.ac.uk/mackay/minima.pdf, 2001.

[12] D. J. C. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590-604, 1992.

[13] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages 352-359, 2002.

[14] T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, 2001.

[15] H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. The Journal of Machine Learning Research, 9:2035-2078, 2008.

[16] T. Raiko, A. Ilin, and J. Karhunen. Principal component analysis for large scale problems with lots of missing values. In Joost Kok, Jacek Koronacki, Ramon Mantaras, Stan Matwin, Dunja Mladenic, and Andrzej Skowron, editors, Machine Learning: ECML 2007, volume 4701 of Lecture Notes in Computer Science, pages 691-698. Springer Berlin / Heidelberg, 2007.

[17] P. Sebastiani and H. P. Wynn. Maximum entropy sampling and optimal Bayesian experimental design.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1):145-157, 2000.

[18] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, 2005.

[19] D. H. Stern, R. Herbrich, and T. Graepel. Matchbox: large scale online Bayesian recommendations. In Proceedings of the 18th International Conference on World Wide Web, pages 111-120, 2009.