{"title": "Context Selection for Embedding Models", "book": "Advances in Neural Information Processing Systems", "page_first": 4816, "page_last": 4825, "abstract": "Word embeddings are an effective tool to analyze language. They have been recently extended to model other types of data beyond text, such as items in recommendation systems. Embedding models consider the probability of a target observation (a word or an item) conditioned on the elements in the context (other words or items). In this paper, we show that conditioning on all the elements in the context is not optimal. Instead, we model the probability of the target conditioned on a learned subset of the elements in the context. We use amortized variational inference to automatically choose this subset. Compared to standard embedding models, this method improves predictions and the quality of the embeddings.", "full_text": "Context Selection for Embedding Models\n\nLi-Ping Liu\u2217\nTufts University\n\nSusan Athey\n\nStanford University\n\nFrancisco J. R. Ruiz\nColumbia University\n\nUniversity of Cambridge\n\nDavid M. Blei\n\nColumbia University\n\nAbstract\n\nWord embeddings are an effective tool to analyze language. They have been\nrecently extended to model other types of data beyond text, such as items in\nrecommendation systems. Embedding models consider the probability of a target\nobservation (a word or an item) conditioned on the elements in the context (other\nwords or items). In this paper, we show that conditioning on all the elements in the\ncontext is not optimal. Instead, we model the probability of the target conditioned\non a learned subset of the elements in the context. We use amortized variational\ninference to automatically choose this subset. Compared to standard embedding\nmodels, this method improves predictions and the quality of the embeddings.\n\n1\n\nIntroduction\n\nWord embeddings are a powerful model to capture latent semantic structure of language. 
They\ncan capture the co-occurrence patterns of words (Bengio et al., 2006; Mikolov et al., 2013a,b,c;\nPennington et al., 2014; Mnih and Kavukcuoglu, 2013; Levy and Goldberg, 2014; Vilnis and\nMcCallum, 2015; Arora et al., 2016), which allows for reasoning about word usage and meaning\n(Harris, 1954; Firth, 1957; Rumelhart et al., 1986). The ideas of word embeddings have been extended\nto other types of high-dimensional data beyond text, such as items in a supermarket or movies in\na recommendation system (Liang et al., 2016; Barkan and Koenigstein, 2016), with the goal of\ncapturing the co-occurrence patterns of objects. Here, we focus on exponential family embeddings\n(EFE) (Rudolph et al., 2016), a method that encompasses many existing methods for embeddings and\nopens the door to bringing expressive probabilistic modeling (Bishop, 2006; Murphy, 2012) to the\nproblem of learning distributed representations.\nIn embedding models, the object of interest is the conditional probability of a target given its context.\nFor instance, in text, the target corresponds to a word in a given position and the context are the words\nin a window around it. For an embedding model of items in a supermarket, the target corresponds to\nan item in a basket and the context are the other items purchased in the same shopping trip.\nIn this paper, we show that conditioning on all elements of the context is not optimal. Intuitively,\nthis is because not all objects (words or items) necessarily interact with each other, though they may\nappear together as target/context pairs. For instance, in shopping data, the probability of purchasing\nchocolates should be independent of whether bathroom tissue is in the context, even if the latter is\nactually purchased in the same shopping trip.\nWith this in mind, we build a generalization of the EFE model (Rudolph et al., 2016) that relaxes\nthe assumption that the target depends on all elements in the context. 
Rather, our model considers that the target depends only on a subset of the elements in the context. We refer to our approach as context selection for exponential family embeddings (CS-EFE). Specifically, we introduce a binary hidden vector to indicate which elements the target depends on. By inferring the indicator vector, the embedding model is able to use more related context elements to fit the conditional distribution, and the resulting learned vectors capture more about the underlying item relations.

*Li-Ping Liu's contribution was made when he was a postdoctoral researcher at Columbia University.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

The introduction of the indicator comes at the price of solving this inference problem. Most embedding tasks have a large number of target/context pairs and require a fast solution to the inference problem. To avoid solving the inference problem separately for all target/context pairs, we use amortized variational inference (Dayan et al., 1995; Gershman and Goodman, 2014; Korattikara et al., 2015; Kingma and Welling, 2014; Rezende et al., 2014; Mnih and Gregor, 2014). We design a shared neural network structure to perform inference for all pairs. One difficulty here is that the varied sizes of the contexts require varied input and output sizes for the shared structure. We overcome this problem with a binning technique, which we detail in Section 2.3.

Our contributions are as follows. First, we develop a model that allows conditioning on a subset of the elements in the context in an EFE model. Second, we develop an efficient inference algorithm for the CS-EFE model, based on amortized variational inference, which can automatically infer the subset of elements in the context that are most relevant to predict the target.
Third, we run a comprehensive experimental study on three datasets, namely, MovieLens for movie recommendations, eBird-PA for bird watching events, and grocery data for shopping behavior. We found that CS-EFE consistently outperforms EFE in terms of held-out predictive performance on the three datasets. For MovieLens, we also show that the embedding representations of the CS-EFE model have higher quality.

2 The Model

Our context selection procedure builds on models based on embeddings. We adopt the formalism of exponential family embeddings (EFE) (Rudolph et al., 2016), which extend the ideas of word embeddings to other types of data such as count or continuous-valued data. We briefly review the EFE model in Section 2.1. We then describe our model in Section 2.2, and we put forward an efficient inference procedure in Section 2.3.

2.1 Exponential Family Embeddings

In exponential family embeddings (EFE), we have a collection of J objects, such as words (in text applications) or movies (in a recommendation problem). Our goal is to learn a vector representation of these objects based on their co-occurrence patterns.

Let us consider a dataset represented as a (typically sparse) N × J matrix X, where rows are datapoints and columns are objects. For example, in text applications each row corresponds to a location in the text, and it is a one-hot vector that represents the word appearing in that location. In movie data, each entry xnj indicates the rating of movie j for user n.

The EFE model learns the vector representation of objects based on the conditional probability of each observation, conditioned on the observations in its context. The context cnj = [(n1, j1), (n2, j2), . . .] gives the indices of the observations that appear in the conditional probability distribution of xnj. The definition of the context varies across applications.
In text, it corresponds to the set of words in a fixed-size window centered at location n. In movie recommendation, cnj corresponds to the set of movies rated by user n, excluding j.

In EFE, we represent each object j with two vectors: an embedding vector ρj and a context vector αj. These two vectors interact in the conditional probability distributions of each observation xnj as follows. Given the context cnj and the corresponding observations x_{cnj} indexed by cnj, the distribution for xnj is in the exponential family,

  p(x_{nj} \mid x_{c_{nj}}; \alpha, \rho) = \mathrm{ExpFam}\big( t(x_{nj}), \eta_j(x_{c_{nj}}; \alpha, \rho) \big),   (1)

where t(xnj) is the sufficient statistic of the exponential family distribution, and ηj(x_{cnj}; α, ρ) is its natural parameter. The natural parameter is set to

  \eta_j(x_{c_{nj}}; \alpha, \rho) = g\Big( \rho_j^{(0)} + \tfrac{1}{|c_{nj}|}\, \rho_j^\top \sum_{k=1}^{|c_{nj}|} x_{n_k j_k}\, \alpha_{j_k} \Big),   (2)

where |cnj| is the number of elements in the context, and g(·) is the link function (which depends on the application and plays the same role as in generalized linear models). We consider a slightly different form for ηj(x_{cnj}; α, ρ) than in the original EFE paper by including the intercept terms ρj^(0). We also average the elements in the context. These choices generally improve the model performance.

The vectors αj and ρj (and the intercepts) are found by maximizing the pseudo-likelihood, i.e., the product of the conditional probabilities in Eq. 1 for each observation xnj.

2.2 Context Selection for Exponential Family Embeddings

The base EFE model assumes that all objects in the context cnj play a role in the distribution of xnj through Eq. 2. This is often an unrealistic assumption. The probability of purchasing chocolates should not depend on the context vector of bathroom tissue, even when the latter is actually in the context. Put formally, there are domains where the elements in the context interact only selectively in the probability of xnj.

We now develop our context selection for exponential family embeddings (CS-EFE) model, which selects a subset of the elements in the context for the embedding model, so that the natural parameter only depends on objects that are truly related to the target object. For each pair (n, j), we introduce a hidden binary vector bnj ∈ {0, 1}^{|cnj|} that indicates which elements in the context cnj should be considered in the distribution for xnj. Thus, we set the natural parameter as

  \eta_j(x_{c_{nj}}, b_{nj}; \alpha, \rho) = g\Big( \rho_j^{(0)} + \tfrac{1}{B_{nj}}\, \rho_j^\top \sum_{k=1}^{|c_{nj}|} b_{njk}\, x_{n_k j_k}\, \alpha_{j_k} \Big),   (3)

where B_{nj} = \sum_k b_{njk} is the number of non-zero elements of bnj.

The prior distribution. We assign a prior to bnj, such that Bnj ≥ 1 and

  p(b_{nj}; \pi_{nj}) \propto \prod_k (\pi_{njk})^{b_{njk}} (1 - \pi_{njk})^{1 - b_{njk}}.   (4)

The constraint Bnj ≥ 1 states that at least one element in the context needs to be selected. For values of bnj satisfying the constraint, their probabilities are proportional to those of independent Bernoulli variables, with hyperparameters πnjk. If πnjk is small for all k (near 0), then the distribution approaches a categorical distribution. If a few πnjk values are large (near 1), then the constraint Bnj ≥ 1 becomes less relevant and the distribution approaches a product of Bernoulli distributions.

The scale of the probabilities πnj has an impact on the number of elements to be selected as the context. We let

  \pi_{njk} \equiv \pi_{nj} = \pi\, \min(1, \beta / |c_{nj}|),   (5)

where π ∈ (0, 1) is a global parameter to be learned, and β is a hyperparameter. The value of β controls the average number of elements to be selected.
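As a concrete illustration of Eq. 3, the selected-context natural parameter can be computed in a few lines of numpy. This is our own sketch, not the paper's released implementation; all names are ours, and an identity link g is assumed by default.

```python
import numpy as np

def natural_parameter(rho0_j, rho_j, alpha, ctx_items, x_ctx, b, link=lambda x: x):
    """Eq. 3 (sketch): g(rho_j^(0) + (1/B_nj) rho_j^T sum_k b_njk x_{n_k j_k} alpha_{j_k}).

    rho0_j   : scalar intercept rho_j^(0)
    rho_j    : (K,) embedding vector of the target object j
    alpha    : (J, K) context vectors of all objects
    ctx_items: (C,) object indices j_1..j_C of the context c_nj
    x_ctx    : (C,) observed values of the context elements
    b        : (C,) binary selection vector b_nj
    """
    B = b.sum()                                    # number of selected elements
    assert B >= 1, "the prior enforces B_nj >= 1"
    # select and scale the context vectors, then average over the selected ones
    ctx = (b * x_ctx)[:, None] * alpha[ctx_items]  # (C, K)
    return link(rho0_j + rho_j @ ctx.sum(axis=0) / B)
```

With b set to all ones, this reduces to the averaged EFE parameter of Eq. 2, which is how CS-EFE contains EFE as a special case.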
If β tends to infinity and we hold π fixed to 1, then we recover the basic EFE model.

The objective function. We form the objective function L as the (regularized) pseudo log-likelihood. After marginalizing out the variables bnj, it is

  \mathcal{L} = \mathcal{L}_{\mathrm{reg}} + \sum_{n,j} \log \sum_{b_{nj}} p(x_{nj} \mid x_{c_{nj}}, b_{nj}; \alpha, \rho)\, p(b_{nj}; \pi_{nj}),   (6)

where L_reg is the regularization term. Following Rudolph et al. (2016), we use ℓ2-regularization over the embedding and context vectors.

It is computationally difficult to marginalize out the context selection variables bnj, particularly when the cardinality of the context cnj is large. We address this issue in the next section.

2.3 Inference

We now show how to maximize the objective function in Eq. 6. We propose an algorithm based on amortized variational inference, which shares a global inference network for all local variables bnj. Here, we describe the inference method in detail.

Variational inference. In variational inference, we introduce a variational distribution q(bnj; νnj), parameterized by νnj ∈ R^{|cnj|}, and we maximize a lower bound L̃ of the objective in Eq. 6, L ≥ L̃, with

  \tilde{\mathcal{L}} = \mathcal{L}_{\mathrm{reg}} + \sum_{n,j} \mathbb{E}_{q(b_{nj}; \nu_{nj})} \big[ \log p(x_{nj} \mid x_{c_{nj}}, b_{nj}; \alpha, \rho) + \log p(b_{nj}; \pi_{nj}) - \log q(b_{nj}; \nu_{nj}) \big].   (7)

Maximizing this bound with respect to the variational parameters νnj corresponds to minimizing the Kullback-Leibler divergence from the posterior of bnj to the variational distribution q(bnj; νnj) (Jordan et al., 1999; Wainwright and Jordan, 2008). Variational inference was also used for EFE by Bamler and Mandt (2017).

The properties of this maximization problem make this approach hard in our case.
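One summand of the bound in Eq. 7 can be estimated by Monte Carlo under a mean-field Bernoulli q. The sketch below is our own illustration, not the paper's code: the likelihood and prior enter as caller-supplied functions, and for simplicity the Bnj ≥ 1 constraint of the prior is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_term(log_lik, log_prior, q_probs, n_samples=100):
    """Monte Carlo estimate of one summand of Eq. 7 (sketch).

    log_lik(b)  : log p(x_nj | x_cnj, b; alpha, rho) for a binary vector b
    log_prior(b): log p(b; pi_nj)
    q_probs     : (C,) Bernoulli means of the mean-field variational q
    """
    total = 0.0
    for _ in range(n_samples):
        b = (rng.random(q_probs.shape) < q_probs).astype(float)  # sample b ~ q
        log_q = np.sum(b * np.log(q_probs) + (1 - b) * np.log1p(-q_probs))
        total += log_lik(b) + log_prior(b) - log_q
    return total / n_samples
```

If the prior coincides with q and the likelihood term is constant, the integrand vanishes sample by sample, which is a convenient sanity check for an implementation.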
First, there is no closed-form solution, even if we use a mean-field variational distribution. Second, the large size of the dataset requires fast online training of the model. Generally, we cannot fit each q(bnj; νnj) individually by solving a set of optimization problems, nor even store νnj for later use.

To address the former problem, we use black-box variational inference (Ranganath et al., 2014), which approximates the expectations via Monte Carlo to obtain noisy gradients of the variational lower bound. To tackle the latter, we use amortized inference (Gershman and Goodman, 2014; Dayan et al., 1995), which has the advantage that we do not need to store or optimize local variables.

Amortization. Amortized inference avoids the optimization of the parameter νnj for each local variational distribution q(bnj; νnj); instead, it fits a shared structure to calculate each local parameter νnj. Specifically, we consider a function f(·) that inputs the target observation xnj, the context elements x_{cnj} and indices cnj, and the model parameters, and outputs a variational distribution for bnj. Let anj = [xnj, cnj, x_{cnj}, α, ρ, πnj] be the set of inputs of f(·), and let νnj ∈ R^{|cnj|} be its output, such that νnj = f(anj) is a vector containing the logits of the variational distribution,

  q(b_{njk} = 1; \nu_{njk}) = \mathrm{sigmoid}(\nu_{njk}), \quad \text{with } \nu_{njk} = [f(a_{nj})]_k.   (8)

Similarly to previous work (Korattikara et al., 2015; Kingma and Welling, 2014; Rezende et al., 2014; Mnih and Gregor, 2014), we let f(·) be a neural network, parameterized by W. The key in amortized inference is to design the network and learn its parameters W.

Network design. Typical neural networks transform fixed-length inputs into fixed-length outputs. However, in our case, we face variable size inputs and outputs.
First, the output of the function f(·) for q(bnj; νnj) has length equal to the context size |cnj|, which varies across target/context pairs. Second, the length of the local variables anj also varies, because the length of x_{cnj} depends on the number of elements in the context. We propose a network design that addresses these challenges.

To overcome the difficulty of the varying output sizes, we split the computation of each component νnjk of νnj into |cnj| separate tasks. Each task computes the logit νnjk using a shared function f(·), νnjk = f(anjk). The input anjk contains information about anj and depends on the index k.

We now need to specify how we form the input anjk. A naïve approach would be to represent the indices of the context items and their corresponding counts as a sparse vector, but this would require a network with a very large input size. Moreover, most of the weights of this large network would not be used (nor trained) in the computation of νnjk, since only a small subset of them would be assigned a non-zero input.

Instead, in this work we use a two-step process to build an input vector anjk that has fixed length regardless of the context size |cnj|. In Step 1, we transform the original input anj = [xnj, cnj, x_{cnj}, α, ρ, πnj] into a vector of reduced dimensionality that preserves the relevant information (we define "relevant" below). In Step 2, we transform the vector of reduced dimensionality into a fixed-length vector.

For Step 1, we first need to determine which information is relevant.
For that, we inspect the posterior for bnj,

  p(b_{nj} \mid x_{nj}, x_{c_{nj}}; \alpha, \rho, \pi_{nj}) \propto p(x_{nj} \mid x_{c_{nj}}, b_{nj}; \alpha, \rho)\, p(b_{nj}; \pi_{nj}) = p(x_{nj} \mid s_{nj}, b_{nj})\, p(b_{nj}; \pi_{nj}).   (9)

We note that the dependence on x_{cnj}, α, and ρ comes through the scores snj, a vector of length |cnj| that contains for each element the inner product of the corresponding embedding and context vector, scaled by the context observation,

  s_{njk} = x_{n_k j_k}\, \rho_j^\top \alpha_{j_k}.   (10)

Therefore, the scores snj are sufficient: f(·) does not need the raw embedding vectors as input, but rather the scores snj ∈ R^{|cnj|}. We have thus reduced the dimensionality of the input.

For Step 2, we need to transform the scores snj ∈ R^{|cnj|} into a fixed-length vector that the neural network f(·) can take as input. We represent this vector and the full neural network structure in Figure 1.

[Figure 1: Representation of the amortized inference network that outputs the variational parameter for the context selection variable bnjk. The input has fixed size regardless of the context size, and it is formed by the score snjk (Eq. 10), the prior parameter πnj, the target observation xnj, and a histogram of the scores snjk′ (for k′ ≠ k).]

The transformation is carried out differently for each value of k. For the network that outputs the variational parameter νnjk, we let the k-th score snjk be directly one of the inputs. The reason is that the k-th score snjk is more related to νnjk, because the network that outputs νnjk ultimately indicates the probability that bnjk takes value 1, i.e., νnjk indicates whether to include the k-th element as part of the context in the computation of the natural parameter in Eq. 3. All other scores (snjk′ for k′ ≠ k) have the same relation to νnjk, and their permutations give the same posterior.
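The score vector of Eq. 10 is cheap to compute; a minimal sketch with our own (hypothetical) names:

```python
import numpy as np

def context_scores(rho_j, alpha, ctx_items, x_ctx):
    """Scores s_nj of Eq. 10 (sketch): each context element's observation times
    the inner product of the target embedding rho_j and that element's context
    vector alpha_{j_k}."""
    return x_ctx * (alpha[ctx_items] @ rho_j)
```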
We bin these scores (snjk′, for k′ ≠ k) into L bins, therefore obtaining a fixed-length vector. Instead of using bins with hard boundaries, we use Gaussian-shaped kernels. We denote by ωℓ and σℓ the mean and width of each Gaussian kernel, and we denote by h^{(k)}_{nj} ∈ R^L the binned variables, such that

  h^{(k)}_{nj\ell} = \sum_{\substack{k' = 1 \\ k' \neq k}}^{|c_{nj}|} \exp\Big( -\frac{(s_{njk'} - \omega_\ell)^2}{\sigma_\ell^2} \Big).   (11)

Finally, for νnjk = f(anjk) we form a neural network that takes as input the score snjk, the binned variables h^{(k)}_{nj}, which summarize the information of the scores (snjk′ : k′ ≠ k), as well as the target observation xnj and the prior probability πnj. That is, anjk = [snjk, h^{(k)}_{nj}, xnj, πnj].

Variational updates. We denote by W the parameters of the network (all weights and biases). To perform inference, we need to iteratively update W, together with α, ρ, and π, to maximize Eq. 7, where νnj is the output of the network f(·). We follow a variational expectation maximization (EM) algorithm. In the M step, we take a gradient step with respect to the model parameters (α, ρ, and π). In the E step, we take a gradient step with respect to the network parameters (W). We obtain the (noisy) gradient with respect to W using the score function method as in black-box variational inference (Paisley et al., 2012; Mnih and Gregor, 2014; Ranganath et al., 2015), which allows rewriting the gradient of Eq. 7 as an expectation with respect to the variational distribution,

  \nabla \tilde{\mathcal{L}} = \sum_{n,j} \mathbb{E}_{q(b_{nj}; W)} \Big[ \big( \log p(x_{nj} \mid x_{c_{nj}}, b_{nj}) + \log p(b_{nj}; \pi_{nj}) - \log q(b_{nj}; W) \big) \nabla \log q(b_{nj}; W) \Big].

Then, we can estimate the gradient via Monte Carlo by drawing samples from q(bnj; W).

3 Empirical Study

We study the performance of context selection on three different application domains: movie recommendations, ornithology, and market basket analysis. On these domains, we show that context selection improves predictions. For the movie data, we also show that the learned embeddings are more interpretable; and for the market basket analysis, we provide a motivating example of the variational probabilities inferred by the network.

Data. MovieLens: We consider the MovieLens-100K dataset (Harper and Konstan, 2015), which contains ratings of movies on a scale from 1 to 5. We only keep those ratings with value 3 or more (and we subtract 2 from all ratings, so that the counts are between 0 and 3). We remove users who rated fewer than 20 movies and movies that were rated fewer than 50 times, yielding a dataset with 943 users and 811 movies. The average number of non-zeros per user is 82.2. We set aside 9% of the data for validation and 10% for test.

eBird-PA: The eBird data (Munson et al., 2015; Sullivan et al., 2009) contains information about a set of bird observation events. Each datum corresponds to a checklist of counts of 213 bird species reported from each event. The values of the counts range from zero to hundreds. Some extraordinarily large counts are treated as outliers and set to the mean of positive counts of that species. Bird observations in the subset eBird-PA are from a rectangular area that mostly overlaps Pennsylvania and the period from day 180 to day 210 of years from 2002 to 2014.
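The outlier handling just described for eBird-PA can be sketched as follows. The paper does not specify its exact cutoff rule, so a per-species z-score threshold is assumed here purely for illustration; only the replacement rule (substituting the mean of the species' positive counts) is from the text.

```python
import numpy as np

def clip_outliers(X, z=3.0):
    """Replace extraordinarily large counts (assumed rule: beyond z standard
    deviations above the mean of the positive counts) by the mean of that
    species' positive counts, as described for eBird-PA (sketch)."""
    X = X.astype(float).copy()
    for j in range(X.shape[1]):          # one column per species
        col = X[:, j]
        pos = col[col > 0]
        if pos.size == 0:
            continue
        mu, sd = pos.mean(), pos.std()
        col[col > mu + z * sd] = mu      # clip outliers to the positive mean
    return X
```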
There are 22,363 checklists in the data and 213 unique species. The average number of non-zeros per checklist is 18.3. We split the data into train (67%), test (26%), and validation (7%) sets.

Market-Basket: This dataset contains purchase records of more than 3,000 customers at an anonymous supermarket. We aggregate the purchases of one month at the category level, i.e., we combine all individual UPC (Universal Product Code) items into item categories. This yields 45,615 purchases and 364 unique items. The average basket size is 12.5 items. We split the data into training (86%), test (5%), and validation (9%) sets.

Models. We compare the base exponential family embeddings (EFE) model (Rudolph et al., 2016) with our context selection procedure. We implement the amortized inference network described in Section 2.3,² for different values of the prior hyperparameter β (Eq. 5) (see below).

For the movie data, in which the ratings range from 0 to 3, we use a binomial conditional distribution (Eq. 1) with 3 trials, and we use an identity link function for the natural parameter ηj (Eq. 2), which is the logit of the binomial probability. For the eBird-PA and Market-Basket data, which contain counts, we consider a Poisson conditional distribution and use the link function³ g(·) = log softplus(·) for the natural parameter, which is the Poisson log-rate. The context set corresponds to the set of other movies rated by the same user in MovieLens; the set of other birds in the same checklist on eBird-PA; and the rest of the items in the same market basket.

Experimental setup.
We explore different values for the dimensionality K of the embedding vectors. In our tables of results, we report the values that performed best on the validation set (there was no qualitative difference in the relative performance between the methods for the non-reported results). We use negative sampling (Rudolph et al., 2016) with a ratio of 1/10 of positive (non-zero) versus negative samples. We use stochastic gradient descent to maximize the objective function, adaptively setting the stepsize with Adam (Kingma and Ba, 2015), and we use the validation log-likelihood to assess convergence. We consider unit-variance ℓ2-regularization, and the weight of the regularization term is fixed to 1.0.

In the context selection for exponential family embeddings (CS-EFE) model, we set the number of hidden units to 30 and 15 for each of the hidden layers, and we consider 40 bins to form the histogram. (We have also explored other settings of the network, obtaining very similar results.) We believe that the network layers can adapt to different settings of the bins as long as they pick up essential information of the scores.
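The Gaussian-kernel binning of Eq. 11 that feeds this histogram can be sketched in a few lines (our own illustrative code; the bin centers and widths are free parameters of the sketch):

```python
import numpy as np

def soft_histogram(scores, k, centers, widths):
    """Fixed-length summary h^(k)_nj of Eq. 11 (sketch): a leave-one-out soft
    histogram of the scores, using Gaussian-shaped kernels instead of bins
    with hard boundaries."""
    others = np.delete(scores, k)  # all scores s_njk' with k' != k
    # sum of Gaussian kernel responses per bin; shape (L,)
    return np.exp(-((others[:, None] - centers[None, :]) ** 2) / widths ** 2).sum(axis=0)
```

The output length equals the number of bins regardless of the context size, which is exactly what lets a fixed-size network consume variable-size contexts.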
In this work, we place these 40 bins equally spaced by a distance of 0.2 and set the width to 0.1.

²The code is in the github repo: https://github.com/blei-lab/context-selection-embedding
³The softplus function is defined as softplus(x) = log(1 + exp(x)).

Table 1: Test log-likelihood for the three considered datasets. Our CS-EFE model consistently outperforms the baseline for different values of the prior hyperparameter β. The numbers in brackets indicate the standard errors.

(a) MovieLens-100K.

        Baseline: EFE             CS-EFE (this paper)
  K     (Rudolph et al., 2016)    β = 20           β = 50           β = 100          β = ∞
  10    -1.06 (0.01)              -1.00 (0.01)     -1.03 (0.01)     -1.03 (0.01)     -1.03 (0.01)
  50    -1.06 (0.01)              -0.97 (0.01)     -0.99 (0.01)     -1.00 (0.01)     -1.01 (0.01)

(b) eBird-PA.

        Baseline: EFE             CS-EFE (this paper)
  K     (Rudolph et al., 2016)    β = 2            β = 5            β = 10           β = ∞
  50    -1.74 (0.01)              -1.34 (0.01)     -1.33 (0.00)     -1.51 (0.01)     -1.34 (0.01)
  100   -1.74 (0.01)              -1.34 (0.00)     -1.33 (0.00)     -1.31 (0.00)     -1.31 (0.01)

(c) Market-Basket.

        Baseline: EFE             CS-EFE (this paper)
  K     (Rudolph et al., 2016)    β = 2            β = 5            β = 10           β = ∞
  50    -0.632 (0.003)            -0.626 (0.003)   -0.623 (0.003)   -0.625 (0.003)   -0.628 (0.003)
  100   -0.633 (0.003)            -0.630 (0.003)   -0.623 (0.003)   -0.626 (0.003)   -0.628 (0.003)

In our experiments, we vary the hyperparameter β in Eq. 5 to check how the expected context size (see Section 2.2) impacts the results. For the MovieLens dataset, we choose β ∈ {20, 50, 100, ∞}, while for the other two datasets we choose β ∈ {2, 5, 10, ∞}.

Results: Predictive performance. We compare the methods in terms of predictive pseudo log-likelihood on the test set.
We calculate the marginal log-likelihood in the same way as Rezende et al. (2014). We report the average test log-likelihood on the three datasets in Table 1. The numbers are the average predictive log-likelihood per item, together with the standard errors in brackets. We compare the predictions of our models (in each setting) with the baseline EFE method using a paired t-test, obtaining that all our results are better than the baseline at a significance level p = 0.05. In the table we only bold the best performance across different settings of β.

The results show that our method outperforms the baseline on all three datasets. The improvement over the baseline is most significant on the eBird-PA dataset. We can also see that the prior parameter β has some impact on the model's performance.

Evaluation: Embedding quality. We also study how context selection affects the quality of the embedding vectors of the items. In the MovieLens dataset, each movie has up to 3 genre labels. We calculate movie similarities by their genre labels and check whether the similarities derived from the embedding vectors are consistent with the genre similarities.

In more detail, let gj ∈ {0, 1}^G be a binary vector containing the genre labels for each movie j, where G = 19 is the number of genres. We define the similarity between two genre vectors, gj and gj′, as the number of common genres normalized by the larger number of genres,

  \mathrm{sim}(g_j, g_{j'}) = \frac{g_j^\top g_{j'}}{\max(\mathbf{1}^\top g_j, \mathbf{1}^\top g_{j'})},   (12)

where 1 is a vector of ones. In an analogous manner, we define the similarity of two embedding vectors as their cosine similarity.

We now compute the similarities of each movie to all other movies, according to both definitions of similarity (based on genres and based on embeddings). For each query movie, we provide two correlation metrics between both lists.
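The two similarity measures used in this comparison can be computed directly; a short sketch (our own helper names, not from the paper's code):

```python
import numpy as np

def genre_sim(g1, g2):
    """Genre similarity of Eq. 12: shared genres over the larger genre count."""
    return (g1 @ g2) / max(g1.sum(), g2.sum())

def cosine_sim(r1, r2):
    """Embedding similarity: cosine similarity between two embedding vectors."""
    return (r1 @ r2) / (np.linalg.norm(r1) * np.linalg.norm(r2))
```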
The first metric is simply Spearman's correlation between the two ranked lists. For the second metric, we rank the movies based on the embedding similarity only, and we calculate the average genre similarity of the top 5 movies. Finally, we average both metrics across all possible query movies, and we report the results in Table 2.

Table 2: Correlation between the embedding vectors and the movie genres. The embedding vectors found with our CS-EFE model exhibit higher correlation with movie genres.

               Baseline: EFE             CS-EFE (this paper)
  Metric       (Rudolph et al., 2016)    β = 20    β = 50    β = 100   β = ∞
  Spearman's   0.066                     0.108     0.090     0.082     0.076
  mean-sim@5   0.272                     0.328     0.317     0.299     0.289

From this result, we can see that the similarity of the embedding vectors obtained by our model is more consistent with the genre similarity. (We have also computed the top-1 and top-10 similarities, which support the same conclusion.) The result suggests that a small number of context items is actually better for learning the relations between movies.

Table 3: Approximate posterior probabilities of the CS-EFE model for a basket with eight items broken down into two unrelated clusters. The left column represents a basket of eight items of two types, and then we take one item of each type as target in the other two columns. For a Mexican food target, the posterior probabilities of the items in the Mexican type are larger compared to the probabilities in the pet type, and vice-versa.

                          Target: Taco shells    Target: Cat food dry
  Taco shells             −                      0.219
  Hispanic salsa          0.309                  0.185
  Tortilla                0.287                  0.151
  Hispanic canned food    0.315                  0.221
  Cat food dry            0.220                  −
  Cat food wet            0.206                  0.297
  Cat litter              0.225                  0.347
  Pet supplies            0.173                  0.312

Evaluation: Posterior checking.
To get more insight into the variational posterior distribution that our model provides, we form a heterogeneous market basket that contains two types of items: Mexican food, and pet-related products. In particular, we form a basket with four items of each of those types, and we compute the variational distribution (i.e., the output of the neural network) for two different target items from the basket. Intuitively, the Mexican food items should have higher probabilities when the target item is also of the same type, and similarly for the pet products.

We fit the CS-EFE model with β = 2 on the Market-Basket data. We report the approximate posterior probabilities in Table 3, for two query items (one from each type). As expected, the probabilities for the items of the same type as the target are higher, indicating that their contribution to the context will be higher.

4 Conclusion

The standard exponential family embeddings (EFE) model finds vector representations by fitting the conditional distributions of objects conditioned on their contexts. In this work, we show that choosing a subset of the elements in the context can improve performance when the objects in the subset are truly related to the object to be modeled. As a consequence, the embedding vectors can reflect co-occurrence relations with higher fidelity compared with the base embedding model.

We formulate the context selection problem as a Bayesian inference problem by using a hidden binary vector to indicate which objects to select from each context set. This leads to a difficult inference problem due to the (large) scale of the problems we face. We develop a fast inference algorithm by leveraging amortization and stochastic gradients. The varying length of the binary context selection vectors poses further challenges in our amortized inference algorithm, which we address using a binning technique.
We fit our model on three datasets from different application domains, showing its superiority over the EFE model.
There are still many directions to explore to further improve the performance of the proposed context selection for exponential family embeddings (CS-EFE). First, we can apply the context selection technique to text data. Though the neighboring words of each target word are more likely to be the "correct" context, we can still combine the context selection technique with the order in which words appear in the context, hopefully leading to better word representations. Second, we can explore variational inference schemes that do not rely on the mean-field assumption, improving the inference network to capture more complex variational distributions.

Acknowledgments

This work is supported by NSF IIS-1247664, ONR N00014-11-1-0651, DARPA PPAML FA8750-14-2-0009, DARPA SIMPLEX N66001-15-C-4032, the Alfred P. Sloan Foundation, and the John Simon Guggenheim Foundation. Francisco J. R. Ruiz is supported by the EU H2020 programme (Marie Skłodowska-Curie grant agreement 706760). We also acknowledge the support of NVIDIA Corporation with the donation of two GPUs used for this research.

References

Arora, S., Li, Y., Liang, Y., and Ma, T. (2016). RAND-WALK: A latent variable model approach to word embeddings. Transactions of the Association for Computational Linguistics, 4.

Bamler, R. and Mandt, S. (2017). Dynamic word embeddings. In International Conference on Machine Learning.

Barkan, O. and Koenigstein, N. (2016). Item2Vec: Neural item embedding for collaborative filtering. In IEEE International Workshop on Machine Learning for Signal Processing.

Bengio, Y., Schwenk, H., Senécal, J.-S., Morin, F., and Gauvain, J.-L. (2006). Neural probabilistic language models. In Innovations in Machine Learning. Springer.

Bishop, C. M. (2006).
Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5):889–904.

Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis (special volume of the Philological Society), volume 1952–1959.

Gershman, S. J. and Goodman, N. D. (2014). Amortized inference in probabilistic reasoning. In Proceedings of the Thirty-Sixth Annual Conference of the Cognitive Science Society.

Harper, F. M. and Konstan, J. A. (2015). The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19.

Harris, Z. S. (1954). Distributional structure. Word, 10(2–3):146–162.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.

Kingma, D. P. and Ba, J. L. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.

Korattikara, A., Rathod, V., Murphy, K. P., and Welling, M. (2015). Bayesian dark knowledge. In Advances in Neural Information Processing Systems.

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems.

Liang, D., Altosaar, J., Charlin, L., and Blei, D. M. (2016). Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence. In ACM Conference on Recommender Systems.

Mikolov, T., Chen, K., Corrado, G. S., and Dean, J. (2013a). Efficient estimation of word representations in vector space.
In International Conference on Learning Representations.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

Mikolov, T., Yih, W.-T., and Zweig, G. (2013c). Linguistic regularities in continuous space word representations. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In International Conference on Machine Learning.

Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems.

Munson, M. A., Webb, K., Sheldon, D., Fink, D., Hochachka, W. M., Iliff, M., Riedewald, M., Sorokina, D., Sullivan, B., Wood, C., and Kelling, S. (2015). The eBird reference dataset.

Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

Paisley, J. W., Blei, D. M., and Jordan, M. I. (2012). Variational Bayesian inference with stochastic search. In International Conference on Machine Learning.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing.

Ranganath, R., Gerrish, S., and Blei, D. M. (2014). Black box variational inference. In Artificial Intelligence and Statistics.

Ranganath, R., Tang, L., Charlin, L., and Blei, D. M. (2015). Deep exponential families. In Artificial Intelligence and Statistics.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.

Rudolph, M., Ruiz, F. J. R., Mandt, S., and Blei, D. M. (2016).
Exponential family embeddings. In Advances in Neural Information Processing Systems.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(9):533–536.

Sullivan, B., Wood, C., Iliff, M. J., Bonney, R. E., Fink, D., and Kelling, S. (2009). eBird: A citizen-based bird observation network in the biological sciences. Biological Conservation, 142:2282–2292.

Vilnis, L. and McCallum, A. (2015). Word representations via Gaussian embedding. In International Conference on Learning Representations.

Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.