{"title": "Exponential Family Embeddings", "book": "Advances in Neural Information Processing Systems", "page_first": 478, "page_last": 486, "abstract": "Word embeddings are a powerful approach to capturing semantic similarity among terms in a vocabulary. In this paper, we develop exponential family embeddings, which extend the idea of word embeddings to other types of high-dimensional data. As examples, we studied several types of data: neural data with real-valued observations, count data from a market basket analysis, and ratings data from a movie recommendation system. The main idea is that each observation is modeled conditioned on a set of latent embeddings and other observations, called the context, where the way the context is defined depends on the problem. In language the context is the surrounding words; in neuroscience the context is close-by neurons; in market basket data the context is other items in the shopping cart. Each instance of an embedding defines the context, the exponential family of conditional distributions, and how the embedding vectors are shared across data. We infer the embeddings with stochastic gradient descent, with an algorithm that connects closely to generalized linear models. On all three of our applications\u2014neural activity of zebrafish, users\u2019 shopping behavior, and movie ratings\u2014we found that exponential family embedding models are more effective than other dimension reduction methods. They better reconstruct held-out data and find interesting qualitative structure.", "full_text": "Exponential Family Embeddings

Maja Rudolph
Columbia University

Francisco J. R. Ruiz
Univ. of Cambridge
Columbia University

Stephan Mandt
Columbia University

David M. Blei
Columbia University

Abstract

Word embeddings are a powerful approach for capturing semantic similarity among terms in a vocabulary.
In this paper, we develop exponential family embeddings, a class of methods that extends the idea of word embeddings to other types of high-dimensional data. As examples, we studied neural data with real-valued observations, count data from a market basket analysis, and ratings data from a movie recommendation system. The main idea is to model each observation conditioned on a set of other observations. This set is called the context, and the way the context is defined is a modeling choice that depends on the problem. In language the context is the surrounding words; in neuroscience the context is close-by neurons; in market basket data the context is other items in the shopping cart. Each type of embedding model defines the context, the exponential family of conditional distributions, and how the latent embedding vectors are shared across data. We infer the embeddings with a scalable algorithm based on stochastic gradient descent. On all three applications—neural activity of zebrafish, users’ shopping behavior, and movie ratings—we found exponential family embedding models to be more effective than other types of dimension reduction. They better reconstruct held-out data and find interesting qualitative structure.

1 Introduction

Word embeddings are a powerful approach for analyzing language (Bengio et al., 2006; Mikolov et al., 2013a,b; Pennington et al., 2014). A word embedding method discovers distributed representations of words; these representations capture the semantic similarity between the words and reflect a variety of other linguistic regularities (Rumelhart et al., 1986; Bengio et al., 2006; Mikolov et al., 2013c).
Fitted word embeddings can help us understand the structure of language and are useful for downstream tasks based on text.

There are many variants, adaptations, and extensions of word embeddings (Mikolov et al., 2013a,b; Mnih and Kavukcuoglu, 2013; Levy and Goldberg, 2014; Pennington et al., 2014; Vilnis and McCallum, 2015), but each reflects the same main ideas. Each term in a vocabulary is associated with two latent vectors, an embedding and a context vector. These two types of vectors govern conditional probabilities that relate each word to its surrounding context. Specifically, the conditional probability of a word combines its embedding and the context vectors of its surrounding words. (Different methods combine them differently.) Given a corpus, we fit the embeddings by maximizing the conditional probabilities of the observed text.

In this paper we develop the exponential family embedding (ef-emb), a class of models that generalizes the spirit of word embeddings to other types of high-dimensional data. Our motivation is that other types of data can benefit from the same assumptions that underlie word embeddings, namely that a data point is governed by the other data in its context. In language, this is the foundational idea that words with similar meanings will appear in similar contexts (Harris, 1954). We use the tools of exponential families (Brown, 1986) and generalized linear models (glms) (McCullagh and Nelder, 1989) to adapt this idea beyond language.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

As one example beyond language, we will study computational neuroscience. Neuroscientists measure sequential neural activity across many neurons in the brain. Their goal is to discover patterns in these data with the hope of better understanding the dynamics and connections among neurons.
In this example, a context can be defined as the neural activities of other nearby neurons, or as neural activity in the past. Thus, it is plausible that the activity of each neuron depends on its context. We will use this idea to fit latent embeddings of neurons, representations of neurons that uncover hidden features which help suggest their roles in the brain.

Another example we study involves shoppers at the grocery store. Economists collect shopping data (called “market basket data”) and are interested in building models of purchase behavior for downstream econometric analysis, e.g., to predict demand and market changes. To build such models, they seek features of items that are predictive of when they are purchased and in what quantity. Similar to language, purchasing an item depends on its context, i.e., the other items in the shopping cart. In market basket data, Poisson embeddings can capture important econometric concepts, such as items that tend not to occur together but occur in the same contexts (substitutes) and items that co-occur, but never one without the other (complements).

We define an ef-emb, such as one for neuroscience or shopping data, with three ingredients. (1) We define the context, which specifies which other data points each observation depends on. (2) We define the conditional exponential family. This involves setting the appropriate distribution, such as a Gaussian for real-valued data or a Poisson for count data, and the way to combine embeddings and context vectors to form its natural parameter. (3) We define the embedding structure, how embeddings and context vectors are shared across the conditional distributions of each observation. These three ingredients enable a variety of embedding models.

We describe ef-emb models and develop efficient algorithms for fitting them.
We show how existing methods, such as continuous bag of words (cbow) (Mikolov et al., 2013a) and negative sampling (Mikolov et al., 2013b), can each be viewed as an ef-emb. We study our methods on three different types of data—neuroscience data, shopping data, and movie ratings data. Mirroring the success of word embeddings, ef-emb models outperform traditional dimension reduction, such as exponential family principal component analysis (pca) (Collins et al., 2001) and Poisson factorization (Gopalan et al., 2015), and find interpretable features of the data.

Related work. ef-emb models generalize cbow (Mikolov et al., 2013a) in the same way that exponential family pca (Collins et al., 2001) generalizes pca, glms (McCullagh and Nelder, 1989) generalize regression, and deep exponential families (Ranganath et al., 2015) generalize sigmoid belief networks (Neal, 1990). A linear ef-emb (which we define precisely below) relates to context-window-based embedding methods such as cbow or the vector log-bilinear language model (vlbl) (Mikolov et al., 2013a; Mnih and Kavukcuoglu, 2013), which model a word given its context. The more general ef-emb relates to embeddings with a nonlinear component, such as the skip-gram (Mikolov et al., 2013a) or the inverse vector log-bilinear language model (ivlbl) (Mnih and Kavukcuoglu, 2013). (These methods might appear linear but, when viewed as a conditional probabilistic model, the normalizing constant of each word induces a nonlinearity.)

Researchers have developed different approximations of the word embedding objective to scale the procedure.
These include noise contrastive estimation (Gutmann and Hyvärinen, 2010; Mnih and Teh, 2012), hierarchical softmax (Mikolov et al., 2013b), and negative sampling (Mikolov et al., 2013a). We explain in Section 2.2 and Supplement A how negative sampling corresponds to biased stochastic gradients of an ef-emb objective.

2 Exponential Family Embeddings

We consider a matrix x = x_{1:I} of I observations, where each x_i is a D-vector. As one example, in language x_i is an indicator vector for the word at position i and D is the size of the vocabulary. As another example, in neural data x_i is the neural activity measured at index pair i = (n, t), where n indexes a neuron and t indexes a time point; each measurement is a scalar (D = 1).

The goal of an exponential family embedding (ef-emb) is to derive useful features of the data. There are three ingredients: a context function, a conditional exponential family, and an embedding structure. These ingredients work together to form the objective. First, the ef-emb models each data point conditional on its context; the context function determines which other data points are at play. Second, the conditional distribution is an appropriate exponential family, e.g., a Gaussian for real-valued data. Its parameter is a function of the embeddings of both the data point and its context. Finally, the embedding structure determines which embeddings are used when the ith point appears, either as data or in the context of another point. The objective is the sum of the log probabilities of each data point given its context. We describe each ingredient, followed by the ef-emb objective. Examples are in Section 2.1.

Context. Each data point i has a context c_i, which is a set of indices of other data points. The ef-emb models the conditional distribution of x_i given the data points in its context.

The context is a modeling choice; different applications will require different types of context.
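As a small illustration of the context as a modeling choice (our own sketch, not code from the paper; the function names are hypothetical), two of the contexts discussed here might be computed as:

```python
def language_context(i, num_words, window=2):
    """Context of word i: indices of the surrounding words in a window (illustrative)."""
    return [j for j in range(max(0, i - window), min(num_words, i + window + 1))
            if j != i]

def basket_context(i, basket_indices):
    """Context of item i: the other items purchased on the same trip (illustrative)."""
    return [j for j in basket_indices if j != i]

print(language_context(3, num_words=7))   # -> [1, 2, 4, 5]
print(basket_context(2, [0, 2, 5, 6]))    # -> [0, 5, 6]
```

Either function returns a set of indices; the model itself never changes, only which data points feed into each conditional.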
In language, the data point is a word and the context is the set of words in a window around it. In neural data, the data point is the activity of a neuron at a time point and the context is the activity of its surrounding neurons at the same time point. (It can also include neurons at future times or in the past.) In shopping data, the data point is a purchase and the context is the other items in the cart.

Conditional exponential family. An ef-emb models each data point x_i conditional on its context x_{c_i}. The distribution is an appropriate exponential family,

    x_i | x_{c_i} \sim \text{ExpFam}(\eta_i(x_{c_i}), t(x_i)),    (1)

where \eta_i(x_{c_i}) is the natural parameter and t(x_i) is the sufficient statistic. In language modeling, this family is usually a categorical distribution. Below, we will study Gaussian and Poisson.

We parameterize the conditional with two types of vectors, embeddings and context vectors. The embedding of the ith data point helps govern its distribution; we denote it \rho[i] \in R^{K \times D}. The context vector of the ith data point helps govern the distribution of data for which i appears in their context; we denote it \alpha[i] \in R^{K \times D}.

How to define the natural parameter as a function of these vectors is a modeling choice. It captures how the context interacts with an embedding to determine the conditional distribution of a data point. Here we focus on the linear embedding, where the natural parameter is a function of a linear combination of the latent vectors,

    \eta_i(x_{c_i}) = f_i\Big( \rho[i]^\top \sum_{j \in c_i} \alpha[j] \, x_j \Big).    (2)

Following the nomenclature of generalized linear models (glms), we call f_i(\cdot) the link function. We will see several examples of link functions in Section 2.1.

This is the setting of many existing word embedding models, though not all.
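The linear construction of Equation (2) is straightforward to compute. Here is a minimal numpy sketch (our own illustration, not the paper's code), for the scalar-observation case D = 1:

```python
import numpy as np

def natural_parameter(i, x, context, rho, alpha, link=lambda z: z):
    """Linear ef-emb natural parameter, Eq. (2):
    eta_i = f(rho[i]^T sum_{j in context} alpha[j] * x_j).

    rho, alpha: arrays of shape (num_points, K) holding the embeddings and
    context vectors; with D = 1 each observation x[j] is a scalar."""
    s = sum(alpha[j] * x[j] for j in context)   # weighted sum of context vectors
    return link(rho[i] @ s)                     # inner product, then link function

# Tiny demo with random data and the identity link (Gaussian case).
rng = np.random.default_rng(0)
x = rng.poisson(1.0, size=10).astype(float)
rho = rng.normal(size=(10, 4))
alpha = rng.normal(size=(10, 4))
eta = natural_parameter(0, x, context=[1, 2, 3], rho=rho, alpha=alpha)
```

Swapping `link` (identity, log, and so on) is all that changes between the Gaussian and Poisson variants described in Section 2.1.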
Other models, such as the skip-gram, determine the probability through a “reverse” distribution of context words given the data point. These non-linear embeddings are still instances of an ef-emb.

Embedding structure. The goal of an ef-emb is to find embeddings and context vectors that describe features of the data. The embedding structure determines how an ef-emb shares these vectors across the data. It is through sharing the vectors that we learn an embedding for the object of primary interest, such as a vocabulary term, a neuron, or a supermarket product. In language the same parameters \rho[i] = \rho and \alpha[i] = \alpha are shared across all positions i. In neural data, observations share parameters when they describe the same neuron. Recall that the index connects to both a neuron and a time point, i = (n, t). We share parameters with \rho[i] = \rho_n and \alpha[i] = \alpha_n to find embeddings and context vectors that describe the neurons. Other variants might tie the embedding and context vectors to find a single set of latent variables, \rho[i] = \alpha[i].

The objective function. The ef-emb objective sums the log conditional probabilities of each data point, adding regularizers for the embeddings and context vectors.¹ We use log probability functions as regularizers, e.g., a Gaussian probability leads to \ell_2 regularization. We also use regularizers to constrain the embeddings, e.g., to be non-negative. Thus, the objective is

    L(\rho, \alpha) = \sum_{i=1}^{I} \big( \eta_i^\top t(x_i) - a(\eta_i) \big) + \log p(\rho) + \log p(\alpha).    (3)

¹One might be tempted to see this as a probabilistic model that is conditionally specified. However, in general it does not have a consistent joint distribution (Arnold et al., 2001).

We maximize this objective with respect to the embeddings and context vectors.
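To make the objective concrete, here is an illustrative sketch of Equation (3) for the Gaussian case (identity link, unit variance), where the log normalizer reduces to a squared-error term and Gaussian priors give \ell_2 regularization. This is our own sketch under those assumptions, not the authors' implementation:

```python
import numpy as np

def gaussian_efemb_objective(x, contexts, rho, alpha, lam=1.0):
    """Sketch of Eq. (3) for a Gaussian ef-emb with identity link and unit
    variance. contexts[i] is the (nonempty) index set c_i for data point i;
    rho, alpha have shape (num_points, K). Illustrative only."""
    ll = 0.0
    for i, c in enumerate(contexts):
        mean = rho[i] @ sum(alpha[j] * x[j] for j in c)   # Eq. (2), identity link
        ll += -0.5 * (x[i] - mean) ** 2                   # Gaussian log-lik, up to a constant
    reg = -0.5 * lam * (np.sum(rho ** 2) + np.sum(alpha ** 2))  # log p(rho) + log p(alpha)
    return ll + reg
```

Maximizing this quantity over `rho` and `alpha` (e.g., by gradient ascent) is the fitting problem the next section addresses with stochastic gradients.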
In Section 2.2 we explain how to fit it with stochastic gradients.

Equation (3) can be seen as a likelihood function for a bank of glms (McCullagh and Nelder, 1989). Each data point is modeled as a response conditional on its “covariates,” which combine the context vectors and context, e.g., as in Equation (2); the coefficient for each response is the embedding itself. We use properties of exponential families and results around glms to derive efficient algorithms for ef-emb models.

2.1 Examples

We highlight the versatility of ef-emb models with three example models and their variations. We develop the Gaussian embedding (g-emb) for analyzing real observations from a neuroscience application; we also introduce a nonnegative version, the nonnegative Gaussian embedding (ng-emb). We develop two Poisson embedding models, Poisson embedding (p-emb) and additive Poisson embedding (ap-emb), for analyzing count data; these have different link functions. We present a categorical embedding model that corresponds to the continuous bag of words (cbow) word embedding (Mikolov et al., 2013a). Finally, we present a Bernoulli embedding (b-emb) for binary data. In Section 2.2 we explain how negative sampling (Mikolov et al., 2013b) corresponds to biased stochastic gradients of the b-emb objective. For convenience, these acronyms are in Table 1.

ef-emb   exponential family embedding
g-emb    Gaussian embedding
ng-emb   nonnegative Gaussian embedding
p-emb    Poisson embedding
ap-emb   additive Poisson embedding
b-emb    Bernoulli embedding

Table 1: Acronyms used for exponential family embeddings.

Example 1: Neural data and Gaussian observations. Consider the (calcium) expression of a large population of zebrafish neurons (Ahrens et al., 2013). The data are processed to extract the locations of the N neurons and the neural activity x_i = x_{(n,t)} across location n and time t.
The goal is to model the similarity between neurons in terms of their behavior, to embed each neuron in a latent space such that neurons with similar behavior are close to each other.

We consider two neurons similar if they behave similarly in the context of the activity pattern of their surrounding neurons. Thus we define the context for data index i = (n, t) to be the indices of the activity of nearby neurons at the same time. We find the K-nearest neighbors (knn) of each neuron (using a Ball-tree algorithm) according to their spatial distance in the brain. We use this set to construct the context c_i = c_{(n,t)} = {(m, t) | m \in knn(n)}. This context varies with each neuron, but is constant over time.

With the context defined, each data point x_i is modeled with a conditional Gaussian. The conditional mean is the inner product from Equation (2), where the context is the simultaneous activity of the nearest neurons and the link function is the identity. The conditionals of two observations share parameters if they correspond to the same neuron. The embedding structure is thus \rho[i] = \rho_n and \alpha[i] = \alpha_n for all i = (n, t). Similar to word embeddings, each neuron has two distinct latent vectors: the neuron embedding \rho_n \in R^K and the context vector \alpha_n \in R^K.

These ingredients, along with a regularizer, combine to form a neural embedding objective. g-emb uses \ell_2 regularization (i.e., a Gaussian prior); ng-emb constrains the vectors to be nonnegative (\ell_2 regularization on the logarithm, i.e., a log-normal prior).

Example 2: Shopping data and Poisson observations. We also study data about people shopping. The data contains the individual purchases of anonymous users in chain grocery and drug stores. There are N different items and T trips to the stores among all households. The data is a sparse N \times T matrix of purchase counts.
The entry x_i = x_{(n,t)} indicates the number of units of item n that was purchased on trip t. Our goal is to learn a latent representation for each product that captures the similarity between them.

We consider items to be similar if they tend to be purchased with similar groups of other items. The context for observation x_i is thus the other items in the shopping basket on the same trip. For the purchase count at index i = (n, t), the context is c_i = {j = (m, t) | m \neq n}.

We use conditional Poisson distributions to model the count data. The sufficient statistic of the Poisson is t(x_i) = x_i, and its natural parameter is the logarithm of the rate (i.e., the mean). We set the natural parameter as in Equation (2), with the link function defined below. The embedding structure is the same as in g-emb, producing embeddings for the items.

We explore two choices for the link function. p-emb uses an identity link function. Since the conditional mean is the exponentiated natural parameter, this implies that the context items contribute multiplicatively to the mean. (We use \ell_2 regularization on the embeddings.) Alternatively, we can constrain the parameters to be nonnegative and set the link function f(\cdot) = \log(\cdot). This is ap-emb, a model with an additive mean parameterization. (We use \ell_2 regularization in log-space.) ap-emb only captures positive correlations between items.

Example 3: Text modeling and categorical observations. ef-embs are inspired by word embeddings, such as cbow (Mikolov et al., 2013a). cbow is a special case of an ef-emb; it is equivalent to a multivariate ef-emb with categorical conditionals. In the notation here, each x_i is an indicator vector of the ith word. Its dimension is the vocabulary size.
The context of the ith word is the other words in a window around it (of size w), c_i = {j \neq i | i - w \leq j \leq i + w}.

The distribution of x_i is categorical, conditioned on the surrounding words x_{c_i}; this is a softmax regression. It has natural parameter as in Equation (2) with an identity link function. The embedding structure imposes that parameters are shared across all observed words. The embeddings are shared globally (\rho[i] = \rho, \alpha[i] = \alpha \in R^{N \times K}). The word and context embedding of the nth word is the nth row of \rho and \alpha respectively. cbow does not use any regularizer.

Example 4: Text modeling and binary observations. One way to simplify the cbow objective is with a model of each entry of the indicator vectors. The data are binary and indexed by i = (n, v), where n is the position in the text and v indexes the vocabulary; the variable x_{n,v} is the indicator that word n is equal to term v. (This model relaxes the constraint that for any n only one x_{n,v} will be on.) With this notation, the context is c_i = {(j, v') | \forall v', j \neq n, n - w \leq j \leq n + w}; the embedding structure is \rho[i] = \rho[(n, v)] = \rho_v and \alpha[i] = \alpha[(n, v)] = \alpha_v.

We can consider different conditional distributions in this setting. As one example, set the conditional distribution to be a Bernoulli with an identity link; we call this the b-emb model for text. In Section 2.2 we show that biased stochastic gradients of the b-emb objective recover negative sampling (Mikolov et al., 2013b). As another example, set the conditional distribution to Poisson with link f(\cdot) = \log(\cdot).
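The Bernoulli variant can be sketched in a few lines. Below is our own illustration (hypothetical helper names, not the paper's code): each entry's natural parameter is its log-odds, and a stochastic objective keeps all non-zero entries while subsampling the zeros, foreshadowing the negative-sampling connection of Section 2.2:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bemb_term(x_iv, eta):
    """Bernoulli log-likelihood of one binary entry given its natural
    parameter eta (the log-odds)."""
    p = sigmoid(eta)
    return x_iv * np.log(p) + (1.0 - x_iv) * np.log(1.0 - p)

def subsampled_objective(nonzero_etas, zero_etas, num_neg=5, rng=None):
    """Stochastic b-emb objective: keep every term with x = 1 and subsample
    the terms with x = 0. (Illustrative; an unbiased estimate would rescale
    the subsampled zero terms, which this sketch deliberately omits.)"""
    if rng is None:
        rng = np.random.default_rng()
    obj = sum(bemb_term(1.0, eta) for eta in nonzero_etas)
    sampled = rng.choice(len(zero_etas), size=min(num_neg, len(zero_etas)),
                         replace=False)
    obj += sum(bemb_term(0.0, zero_etas[k]) for k in sampled)
    return obj
```

Each `eta` here would come from Equation (2) with the shared vectors \rho_v and \alpha_v above.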
The corresponding embedding model relates closely to Poisson approximations of distributed multinomial regression (Taddy et al., 2015).

2.2 Inference and Connection to Negative Sampling

We fit the embeddings \rho[i] and context vectors \alpha[i] by maximizing the objective function in Equation (3). We use stochastic gradient descent (sgd) with Adagrad (Duchi et al., 2011). We can derive the analytic gradient of the objective function using properties of the exponential family (see the Supplement for details). The gradients linearly combine the data in summations we can approximate using subsampled minibatches of data. This reduces the computational cost.

When the data is sparse, we can split the gradient into the summation of two terms: one term corresponding to all data entries i for which x_i \neq 0, and one term corresponding to those data entries x_i = 0. We compute the first term of the gradient exactly—when the data is sparse there are not many summations to make—and we estimate the second term by subsampling the zero entries. Compared to computing the full gradient, this reduces the complexity when most of the entries x_i are zero. But it retains the strong information about the gradient that comes from the non-zero entries.

This relates to negative sampling, which is used to approximate the skip-gram objective (Mikolov et al., 2013b). Negative sampling re-defines the skip-gram objective to distinguish target (observed) words from randomly drawn words, using logistic regression. The gradient of the stochastic objective is identical to a noisy but biased estimate of the gradient for a b-emb model. To obtain the equivalence, preserve the terms for the non-zero data and subsample terms for the zero data.
While an unbiased stochastic gradient would rescale the subsampled terms, negative sampling does not. Thus, negative sampling corresponds to a biased estimate, which down-weights the contribution of the zeros. See the Supplement for the mathematical details.

                 single neuron held out              25% of neurons held out
Model            K = 10          K = 100             K = 10          K = 100
fa               0.290 ± 0.003   0.275 ± 0.003       0.290 ± 0.003   0.276 ± 0.003
g-emb (c=10)     0.239 ± 0.006   0.239 ± 0.005       0.246 ± 0.004   0.245 ± 0.003
g-emb (c=50)     0.227 ± 0.002   0.222 ± 0.002       0.235 ± 0.003   0.232 ± 0.003
ng-emb (c=10)    0.263 ± 0.004   0.261 ± 0.004       0.250 ± 0.004   0.261 ± 0.004

Table 2: Analysis of neural data: mean squared error and standard errors of neural activity (on the test set) for different models. Both ef-emb models significantly outperform fa; g-emb is more accurate than ng-emb.

3 Empirical Study

We study exponential family embedding (ef-emb) models on real-valued and count-valued data, and in different application domains—computational neuroscience, shopping behavior, and movie ratings. We present quantitative comparisons to other dimension reduction methods and illustrate how we can glean qualitative insights from the fitted embeddings.

3.1 Real Valued Data: Neural Data Analysis

Data. We analyze the neural activity of a larval zebrafish, recorded at single cell resolution for 3000 time frames (Ahrens et al., 2013). Through genetic modification, individual neurons express a calcium indicator when they fire. The resulting calcium imaging data is preprocessed by a nonnegative matrix factorization to identify neurons, their locations, and the fluorescence activity x_t^* \in R^N of the individual neurons over time (Friedrich et al., 2015).
Using this method, our data contains 10,000 neurons (out of a total of 200,000).

We fit all models on the lagged data x_t = x_t^* - x_{t-1}^* to filter out correlations based on calcium decay and preprocessing.² The calcium levels can be measured with great spatial resolution but the temporal resolution is poor; the neuronal firing rate is much higher than the sampling rate. Hence we ignore all “temporal structure” in the data and model the simultaneous activity of the neurons. We use the Gaussian embedding (g-emb) and nonnegative Gaussian embedding (ng-emb) from Section 2.1 to model the lagged activity of the neurons conditional on the lags of surrounding neurons. We study context sizes c \in {10, 50} and latent dimension K \in {10, 100}.

Models. We compare ef-emb to probabilistic factor analysis (fa), fitting K-dimensional factors for each neuron and K-dimensional factor loadings for each time frame. In fa, each entry of the data matrix is Gaussian distributed, with mean equal to the inner product of the corresponding factor and factor loading.

Evaluation. We train each model on a random sample of 90% of the lagged time frames and hold out 5% each for validation and testing. With the test set, we use two types of evaluation. (1) Leave one out: For each neuron x_i in the test set, we use the measurements of the other neurons to form predictions. For fa this means the other neurons are used to recover the factor loadings; for ef-emb this means the other neurons are used to construct the context. (2) Leave 25% out: We randomly split the neurons into 4 folds. Each neuron is predicted using the three sets of neurons that are out of its fold. (This is a more difficult task.) Note in ef-emb, the missing data might change the size of the context of some neurons. See Table 5 in Supplement C for the choice of hyperparameters.

Results. Table 2 reports both types of evaluation.
The ef-emb models significantly outperform fa in terms of mean squared error on the test set. g-emb obtains the best results with 100 components and a context size of 50. Figure 1 illustrates how to use the learned embeddings to hypothesize connections between nearby neurons.

²We also analyzed unlagged data but all methods resulted in better reconstruction on the lagged data.

Figure 1: Top view of the zebrafish brain, with blue circles at the location of the individual neurons. We zoom on 3 neurons and their 50 nearest neighbors (small blue dots), visualizing the “synaptic weights” learned by a g-emb model (K = 100). The edge color encodes the inner product of the neural embedding vector and the context vectors, \rho_n^\top \alpha_m for each neighbor m. Positive values are green, negative values are red, and the transparency is proportional to the magnitude. With these weights we can hypothesize how nearby neurons interact.

               (a) Market basket analysis            (b) Movie ratings
Model          K = 20           K = 100             K = 20           K = 100
p-emb          −7.497 ± 0.007   −7.199 ± 0.008      −5.691 ± 0.006   −5.726 ± 0.005
p-emb (dw)     −7.110 ± 0.007   −6.950 ± 0.007      −5.790 ± 0.003   −5.798 ± 0.003
ap-emb         −7.868 ± 0.005   −8.414 ± 0.003      −5.964 ± 0.003   −6.118 ± 0.002
hpf            −7.740 ± 0.008   −7.626 ± 0.007      −5.787 ± 0.006   −5.859 ± 0.006
Poisson pca    −8.314 ± 0.009   −11.01 ± 0.01       −5.908 ± 0.006   −7.50 ± 0.01

Table 3: Comparison of predictive log-likelihood between p-emb, ap-emb, hierarchical Poisson factorization (hpf) (Gopalan et al., 2015), and Poisson principal component analysis (pca) (Collins et al., 2001) on held out data. The p-emb model outperforms the matrix factorization models in both applications.
For the shopping data, downweighting the zeros improves the performance of p-emb.

3.2 Count Data: Market Basket Analysis and Movie Ratings

We study the Poisson models Poisson embedding (p-emb) and additive Poisson embedding (ap-emb) on two applications: shopping and movies.

Market basket data. We analyze the IRI dataset³ (Bronnenberg et al., 2008), which contains the purchases of anonymous households in chain grocery and drug stores. It contains 137,632 trips in 2012. We remove items that appear fewer than 10 times, leaving a dataset with 7,903 items. The context for each purchase is the other purchases from the same trip.

MovieLens data. We also analyze the MovieLens-100K dataset (Harper and Konstan, 2015), which contains movie ratings on a scale from 1 to 5. We keep only positive ratings, defined to be ratings of 3 or more (we subtract 2 from all ratings and set the negative ones to 0). The context of each rating is the other movies rated by the same user. After removing users who rated fewer than 20 movies and movies that were rated fewer than 50 times, the dataset contains 777 users and 516 movies; the sparsity is about 5%.

Models. We fit the p-emb and the ap-emb models using number of components K \in {20, 100}. For each K we select the Adagrad constant based on best predictive performance on the validation set. (The parameters we used are in Table 5.) In these datasets, the distribution of the context size is heavy tailed. To handle larger context sizes we pick a link function for the ef-emb model which rescales the sum over the context in Equation (2) by the context size (the number of terms in the sum). We also fit a p-emb model that artificially downweights the contribution of the zeros in the objective function by a factor of 0.1, as done by Hu et al. (2008) for matrix factorization. We denote it as “p-emb (dw).”

³We thank IRI for making the data available.
All estimates and analysis in this paper, based on data provided by IRI, are by the authors and not by IRI.

Maruchan chicken ramen     Yoplait strawberry yogurt            Mountain Dew soda          Dean Foods 1% milk
M. creamy chicken ramen    Yoplait apricot mango yogurt         Mtn. Dew orange soda       Dean Foods 2% milk
M. oriental flavor ramen   Yoplait strawberry orange smoothie   Mtn. Dew lemon lime soda   Dean Foods whole milk
M. roast chicken ramen     Yoplait strawberry banana yogurt     Pepsi classic soda         Dean Foods chocolate milk

Table 4: Top 3 similar items to example query items (bold face). The p-emb model successfully captures similarities.

We compare the predictive performance with hpf (Gopalan et al., 2015) and Poisson pca (Collins et al., 2001). Both hpf and Poisson pca factorize the data into K-dimensional positive vectors of user preferences and K-dimensional positive vectors of item attributes. ap-emb and hpf parameterize the mean additively; p-emb and Poisson pca parameterize it multiplicatively. For the ef-emb models and Poisson pca, we use stochastic optimization with ℓ2 regularization. For hpf, we use variational inference. See Table 5 in Supplement C for details.

Evaluation. For the market basket data we hold out 5% of the trips to form the test set, also removing trips with fewer than two different purchased items. In the MovieLens data we hold out 20% of the ratings and set aside an additional 5% of the non-zero entries from the test set for validation. We report prediction performance based on the normalized log-likelihood on the test set.
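The Models paragraph above combines three ingredients: a Poisson conditional whose mean is parameterized multiplicatively, a link function that rescales the sum over the context by the context size, and a 0.1 downweighting of zero counts for "p-emb (dw)". A minimal numpy sketch of one term of such an objective (function and variable names are hypothetical illustrations, not the authors' implementation):

```python
import numpy as np

def pemb_term(x_i, rho_i, alpha_context, zero_weight=0.1):
    """One term of a (hypothetical) p-emb objective for item i.

    x_i: observed count for item i
    rho_i: embedding vector of item i, shape (K,)
    alpha_context: context vectors of co-occurring items, shape (C, K)
    """
    # Link function: rescale the sum over the context by the context size,
    # i.e., average the context vectors before taking the inner product.
    eta = rho_i @ alpha_context.mean(axis=0)
    lam = np.exp(eta)                    # Poisson mean (multiplicative)
    log_lik = x_i * np.log(lam) - lam    # Poisson log-likelihood, up to log(x_i!)
    # Downweight the contribution of zero counts, as in "p-emb (dw)".
    weight = zero_weight if x_i == 0 else 1.0
    return weight * log_lik
```

The full objective would sum such terms over all observations and add the ℓ2 regularization on the embedding and context vectors.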
For p-emb and ap-emb, we compute the likelihood as the Poisson mean of each nonnegative count (be it a purchase quantity or a movie rating) divided by the sum of the Poisson means for all items, given the context. To evaluate hpf and Poisson pca at a given test observation, we recover the factor loadings using the other test entries we condition on, and we use the factor loadings to form the prediction.

Predictive performance. Table 3 summarizes the test log-likelihood of the four models, together with the standard errors across entries in the test set. In both applications the p-emb model outperforms hpf and Poisson pca. On the shopping data, p-emb with K = 100 provides the best predictions; on MovieLens, p-emb with K = 20 is best. For p-emb on the shopping data, downweighting the contribution of the zeros gives more accurate estimates.

Item similarity in the shopping data. Embedding models can also capture qualitative aspects of the data. Table 4 shows four example products and their three most similar items, where similarity is calculated as the cosine distance between embedding vectors. (These vectors are from p-emb with downweighted zeros and K = 100.) For example, the most similar items to a soda are other sodas; the most similar items to a yogurt are (mostly) other yogurts.

The p-emb model can also identify complementary and substitutable products. To see this, we compute the inner products of the embedding and the context vectors for all item pairs. A high value of the inner product indicates that the probability of purchasing one item increases if the second item is in the shopping basket (i.e., they are complements).
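Both qualitative analyses reduce to simple linear algebra on the fitted vectors: cosine similarity between embedding vectors for item similarity, and embedding-context inner products for complements versus substitutes. A minimal sketch under those assumptions (all names hypothetical):

```python
import numpy as np

def top_similar(query, rho, names, k=3):
    """Rank items by cosine similarity of their embedding vectors to a query item."""
    unit = rho / np.linalg.norm(rho, axis=1, keepdims=True)  # normalize rows
    sims = unit @ unit[names.index(query)]                   # cosine similarities
    order = np.argsort(-sims)                                # descending
    return [names[j] for j in order if names[j] != query][:k]

def complement_score(i, j, rho, alpha):
    """Inner product of item i's embedding and item j's context vector.
    High values suggest complements; negative values suggest substitutes
    (or items that are simply rarely purchased together)."""
    return rho[i] @ alpha[j]
```

A symmetric variant of `complement_score` could average the (i, j) and (j, i) inner products, since the embedding and context vectors of a pair need not agree.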
A low value indicates the opposite effect: the items might be substitutes for each other.

We find that items that tend to be purchased together have a high value of the inner product (e.g., potato chips and beer, potato chips and frozen pizza, or two different types of soda), while items that are substitutes have a negative value (e.g., two different brands of pasta sauce, similar snacks, or soups from different brands). Other items with a negative value of the inner product are not substitutes, but they are rarely purchased together (e.g., toast crunch and laundry detergent, milk and a toothbrush). Supplement D gives examples of substitutes and complements.

Topics in the movie embeddings. The embeddings from the MovieLens data identify thematically similar movies. For each latent dimension k, we sort the context vectors by the magnitude of the kth component. This yields a ranking of movies for each component. In Supplement E we show two example rankings. (These are from a p-emb model with K = 50.) The first one contains children's movies; the second contains science-fiction/action movies.

Acknowledgments

This work is supported by the EU H2020 programme (Marie Skłodowska-Curie grant agreement 706760), NSF IIS-1247664, ONR N00014-11-1-0651, DARPA FA8750-14-2-0009, DARPA N66001-15-C-4032, Adobe, the John Templeton Foundation, and the Sloan Foundation.

References

Ahrens, M. B., Orger, M. B., Robson, D. N., Li, J. M., and Keller, P. J. (2013). Whole-brain functional imaging at cellular resolution using light-sheet microscopy. Nature Methods, 10(5):413–420.

Arnold, B. C., Castillo, E., Sarabia, J. M., et al. (2001). Conditionally specified distributions: an introduction (with comments and a rejoinder by the authors). Statistical Science, 16(3):249–274.

Bengio, Y., Schwenk, H., Senécal, J.-S., Morin, F., and Gauvain, J.-L. (2006). Neural probabilistic language models.
In Innovations in Machine Learning, pages 137–186. Springer.

Bronnenberg, B. J., Kruger, M. W., and Mela, C. F. (2008). Database paper: The IRI marketing data set. Marketing Science, 27(4):745–748.

Brown, L. D. (1986). Fundamentals of statistical exponential families with applications in statistical decision theory. Lecture Notes-Monograph Series, 9:i–279.

Collins, M., Dasgupta, S., and Schapire, R. E. (2001). A generalization of principal components analysis to the exponential family. In Neural Information Processing Systems, pages 617–624.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.

Friedrich, J., Soudry, D., Paninski, L., Mu, Y., Freeman, J., and Ahrens, M. (2015). Fast constrained non-negative matrix factorization for whole-brain calcium imaging data. In NIPS Workshop on Neural Systems.

Gopalan, P., Hofman, J., and Blei, D. M. (2015). Scalable recommendation with hierarchical Poisson factorization. In Uncertainty in Artificial Intelligence.

Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Artificial Intelligence and Statistics.

Harper, F. M. and Konstan, J. A. (2015). The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19.

Harris, Z. S. (1954). Distributional structure. Word, 10(2-3):146–162.

Hu, Y., Koren, Y., and Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. In IEEE International Conference on Data Mining.

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Neural Information Processing Systems, pages 2177–2185.

McCullagh, P. and Nelder, J. A. (1989). Generalized linear models, volume 37. CRC Press.

Mikolov, T., Chen, K., Corrado, G., and Dean, J.
(2013a). Efficient estimation of word representations in vector space. ICLR Workshop Proceedings. arXiv:1301.3781.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems, pages 3111–3119.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013c). Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751.

Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation. In Neural Information Processing Systems, pages 2265–2273.

Mnih, A. and Teh, Y. W. (2012). A fast and simple algorithm for training neural probabilistic language models. In International Conference on Machine Learning, pages 1751–1758.

Neal, R. M. (1990). Learning stochastic feedforward networks. Department of Computer Science, University of Toronto.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Conference on Empirical Methods on Natural Language Processing, volume 14, pages 1532–1543.

Ranganath, R., Tang, L., Charlin, L., and Blei, D. M. (2015). Deep exponential families. In Artificial Intelligence and Statistics.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323:533–536.

Taddy, M. et al. (2015). Distributed multinomial regression. The Annals of Applied Statistics, 9(3):1394–1414.

Vilnis, L. and McCallum, A. (2015). Word representations via Gaussian embedding.
In International Conference on Learning Representations.