{"title": "Bayesian Predictive Profiles With Applications to Retail Transaction Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1353, "page_last": 1360, "abstract": "", "full_text": "Bayesian Predictive Pro\ufb02les with\n\nApplications to Retail Transaction Data\n\nIgor V. Cadez\n\nPadhraic Smyth\n\nInformation and Computer Science\n\nInformation and Computer Science\n\nUniversity of California\n\nUniversity of California\n\nIrvine, CA 92697-3425, U.S.A.\n\nIrvine, CA 92697-3425, U.S.A.\n\nicadez@ics.uci.edu\n\nsmyth@ics.uci.edu\n\nAbstract\n\nMassive transaction data sets are recorded in a routine manner in\ntelecommunications, retail commerce, and Web site management.\nIn this paper we address the problem of inferring predictive in-\ndividual pro\ufb02les from such historical transaction data. We de-\nscribe a generative mixture model for count data and use an an\napproximate Bayesian estimation framework that e\ufb01ectively com-\nbines an individual\u2019s speci\ufb02c history with more general population\npatterns. We use a large real-world retail transaction data set to\nillustrate how these pro\ufb02les consistently outperform non-mixture\nand non-Bayesian techniques in predicting customer behavior in\nout-of-sample data.\n\n1\n\nIntroduction\n\nTransaction data sets consist of records of pairs of individuals and events, e.g., items\npurchased (market basket data), telephone calls made (call records), or Web pages\nvisited (from Web logs). Of signi\ufb02cant practical interest in many applications is\nthe ability to derive individual-speci\ufb02c (or personalized) models for each individ-\nual from the historical transaction data, e.g., for exploratory analysis, adaptive\npersonalization, and forecasting.\n\nIn this paper we propose a generative model based on mixture models and Bayesian\nestimation for learning predictive pro\ufb02les. 
The mixture model is used to address the heterogeneity problem: different individuals purchase combinations of products on different visits to the store. The Bayesian estimation framework is used to address the fact that we have different amounts of data for different individuals. For an individual with very few transactions (e.g., only one) we can "shrink" our predictive profile for that individual towards a general population profile. On the other hand, for an individual with many transactions, their predictive model can be more individualized. Our goal is an accurate and computationally efficient modeling framework that smoothly adapts a profile to each individual based on both their own historical data as well as general population patterns. Due to space limitations only selected results are presented here; for a complete description of the methodology and experiments see Cadez et al. (2001).

The idea of using mixture models as a flexible approach for modeling discrete and categorical data has been known for many years, e.g., in the social sciences for latent class analysis (Lazarsfeld and Henry, 1968). Traditionally these methods were only applied to relatively small low-dimensional data sets. More recently there has been a resurgence of interest in mixtures of multinomials and mixtures of conditionally independent Bernoulli models for modeling high-dimensional document-term data in text analysis (e.g., McCallum, 1999; Hofmann, 1999). The work of Heckerman et al. (2000) on probabilistic model-based collaborative filtering is also similar in spirit to the approach described in this paper, except that we focus on explicitly extracting individual-level profiles rather than global models (i.e., we have explicit models for each individual in our framework).
Our work can be viewed as an extension of this broad family of probabilistic modeling ideas to the specific case of transaction data, where we deal directly with the problem of making inferences about specific individuals and handling multiple transactions per individual. Other approaches have also been proposed in the data mining literature for clustering and exploratory analysis of transaction data, but typically in a non-probabilistic framework (e.g., Agrawal, Imielinski, and Swami, 1993; Strehl and Ghosh, 2000; Lawrence et al., 2001). The lack of a clear probabilistic semantics (e.g., for association rule techniques) can make it difficult for these models to fully leverage the data for individual-level forecasting.

2 Mixture-Basis Models for Profiles

We have an observed data set D = {D_1, ..., D_N}, where D_i is the observed data on the ith customer, 1 ≤ i ≤ N. Each individual data set D_i consists of one or more transactions for that customer, i.e., D_i = {y_i1, ..., y_ij, ..., y_ini}, where y_ij is the jth transaction for customer i and n_i is the total number of transactions observed for customer i.

The jth transaction for individual i, y_ij, consists of a description of the set of products (or a "market basket") that was purchased at a specific time by customer i (and y_i will be used to denote an arbitrary transaction from individual i). For the purposes of the experiments described in this paper, each individual transaction y_ij is represented as a vector of C counts y_ij = (n_ij1, ..., n_ijc, ..., n_ijC), where n_ijc indicates how many items of type c are in transaction y_ij, 1 ≤ c ≤ C.

We define a predictive profile as a probabilistic model p(y_i), i.e., a probability distribution on the items that individual i will purchase during a store-visit.
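To make this count-vector representation concrete, the following short Python sketch generates a basket as a vector of C counts by drawing items independently from a fixed set of category probabilities. The categories and probabilities here are invented for illustration and are not taken from the paper's data.

```python
import random
from collections import Counter

def sample_basket(theta, n_items, rng):
    """Draw a basket of n_items items, each chosen independently from the
    category probabilities theta, and return it as a count vector
    y = (n_1, ..., n_C)."""
    categories = list(range(len(theta)))
    items = rng.choices(categories, weights=theta, k=n_items)
    counts = Counter(items)
    return [counts.get(c, 0) for c in categories]

rng = random.Random(0)
# hypothetical probabilities over C = 4 item categories
theta_example = [0.5, 0.3, 0.15, 0.05]
basket = sample_basket(theta_example, 10, rng)
# the count vector always has one entry per category and sums to the basket size
assert len(basket) == 4 and sum(basket) == 10
```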
We propose a simple generative mixture model for an individual's purchasing behavior, namely that a randomly selected transaction y_i from individual i is generated by one of K components in a K-component mixture model. The kth mixture component, 1 ≤ k ≤ K, is a specific model for generating the counts, and we can think of each of the K models as "basis functions" describing prototype transactions. For example, in a clothing store, one might have a mixture component that acts as a prototype for suit-buying behavior, where the expected counts for items such as suits, ties, shirts, etc., given this component, would be relatively higher than for the other items.

There are several modeling choices for the component transaction models for generating item counts. In this paper we choose a particularly simple memoryless multinomial model that operates as follows. Conditioned on n_ij (the total number of items in the basket), each of the individual items is selected in a memoryless fashion by n_ij draws from a multinomial distribution P_k = (θ_k1, ..., θ_kC) on the C possible items, where Σ_c θ_kc = 1.

[Figure 1: An example of 6 "basis" mixture components fit to retail transaction data. Six panels (Components 1-6) plot probability against department (1-50).]

Figure 1 shows an example of K = 6 such basis mixture components that have been learned from a large retail transaction data set (more details on learning
will be discussed below). Each panel shows a different set of component probabilities P_k, each modeling a different type of transaction. The components show a striking bimodal pattern in that the multinomial models appear to involve departments that are either above or below department 25, but there is very little probability mass that crosses over. In fact the models are capturing the fact that departments numbered lower than 25 correspond to men's clothing and those above 25 correspond to women's clothing, and that baskets tend to be "tuned" to one set or the other.

2.1 Individual-Specific Weights

We further assume that for each individual i there exists a set of K weights, and in the general case these weights are individual-specific, denoted by α_i = (α_i1, ..., α_iK), where Σ_k α_ik = 1. Weight α_ik represents the probability that when individual i enters the store their transactions will be generated by component k. Or, in other words, the α_ik's govern individual i's propensity to engage in "shopping behavior" k (again, there are numerous possible generalizations, such as making the α_ik's have dependence over time, that we will not discuss here). The α_ik's are in effect the profile coefficients for individual i, relative to the K component models.

This idea of individual-specific weights (or profiles) is a key component of our proposed approach. The mixture component models P_k are fixed and shared across all individuals, providing a mechanism for borrowing of strength across individual data. The individual weights are in principle allowed to freely vary for each individual within a K-dimensional simplex. In effect the K weights can be thought of as basis coefficients that represent the location of individual i within the space spanned by the K basis functions (the component P_k multinomials).
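The predictive profile implied by these weights, p(y_ij) = Σ_k α_ik Π_c θ_kc^{n_ijc}, can be sketched in a few lines of Python; the computation is done in log-space with a log-sum-exp step for numerical stability. The toy weights and component multinomials below are invented for illustration.

```python
import math

def log_prob_transaction(counts, alpha_i, thetas):
    """log p(y_ij) = log sum_k alpha_ik * prod_c theta_kc ** n_ijc,
    computed stably via log-sum-exp over the K components."""
    log_terms = []
    for a_k, theta_k in zip(alpha_i, thetas):
        if a_k == 0.0:
            continue  # component with zero weight contributes nothing
        ll = math.log(a_k) + sum(n * math.log(t)
                                 for n, t in zip(counts, theta_k) if n)
        log_terms.append(ll)
    m = max(log_terms)
    return m + math.log(sum(math.exp(v - m) for v in log_terms))

# two hypothetical component multinomials over C = 3 item categories
thetas = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
alpha = [0.6, 0.4]  # individual-specific weights alpha_i
lp = log_prob_transaction([2, 1, 0], alpha, thetas)
assert lp < 0.0  # a valid log-probability
```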
This approach is quite similar in spirit to the recent probabilistic PCA work of Hofmann (1999) on mixture models for text documents, where he proposes a general mixture model framework that represents documents as existing within a K-dimensional simplex of multinomial component models.

The model for each individual is an individual-specific mixture model, where the weights are specific to individual i:

p(y_ij) = Σ_{k=1}^{K} α_ik p(y_ij | k) = Σ_{k=1}^{K} α_ik Π_{c=1}^{C} θ_kc^{n_ijc},

where θ_kc is the probability that the cth item is purchased given component k, and n_ijc is the number of items of category c purchased by individual i during transaction ij.

[Figure 2: Histograms indicating which products a particular individual purchased, from both the training data and the test data. Two panels (training purchases, test purchases) plot number of items against department (1-50).]

[Figure 3: Inferred "effective" profiles from global weights, smoothed histograms, and individual-specific weights for the individual whose data was shown in Figure 2. Three panels plot probability against department (1-50).]

As an example of the application of these ideas, in Figure 2 the training data and test data
for a particular individual are displayed. Note that there is some predictability from training to test data, although the test data contains (for example) a purchase in department 14 (which was not seen in the training data). Figure 3 plots the effective profiles¹ for this particular individual as estimated by three different schemes in our modeling approach: (1) global weights that result in everyone being assigned the same "generic" profile, i.e., α_ik = α_k, (2) a maximum a posteriori (MAP) technique that smooths each individual's training histogram with a population-based histogram, and (3) individual weights estimated in a Bayesian fashion that are "tuned" to the individual's specific behavior. (More details on each of these methods are provided later in the paper; a complete description can be found in Cadez et al. (2001).)

¹We call these "effective profiles" since the predictive model under the mixture assumption is not a multinomial that can be plotted as a bar chart; however, we can approximate it, and we are plotting one such approximation here.

One can see in Figure 3 that the global weight profile reflects broad population-based purchasing patterns and is not representative of this individual. The smoothed histogram is somewhat better, but the smoothing parameter has "blurred" the individual's focus on departments below 25. The individual-weight profile appears to be a better representation of this individual's behavior, and indeed it does provide the best predictive score (of the 3 methods) on the test data in Figure 2. Note that the individual-weights profile in Figure 3 "borrows strength" from the purchases of other similar customers, i.e., it allows for small but non-zero probabilities of the individual making purchases in departments (such as 6 through 9) where he or she has not purchased in the past. This particular individual's weights, the α_ik's, are (0.00, 0.47, 0.38, 0.00, 0.00, 0.15), corresponding to the six component models shown in Figure 1. The most weight is placed on components 2, 3 and 6, which agrees with our intuition given the individual's training data.

2.2 Learning the Model Parameters

The unknown parameters in our model consist of both the parameters of the K multinomials, θ_kc, 1 ≤ k ≤ K, 1 ≤ c ≤ C, and the vectors of individual-specific profile weights α_i, 1 ≤ i ≤ N. We investigate two different approaches to learning individual-specific weights:

- Mixture-Based Maximum Likelihood (ML) Weights: We treat the weights α_i and component parameters θ as unknown and use expectation-maximization (EM) to learn both simultaneously. Of course we expect this model to overfit given the number of parameters being estimated, but we include it nonetheless as a baseline.

- Mixture-Based Empirical Bayes (EB) Weights: We first use EM to learn a mixture of K transaction models (ignoring individuals). We then use a second EM algorithm in weight-space to estimate individual-specific weights α_i for each individual. The second EM phase uses a fixed empirically-determined prior (a Dirichlet) for the weights. In effect, we are learning how best to represent each individual within the K-dimensional simplex of basis components. The empirical prior uses the marginal weights (α's) from the first run for the mean of the Dirichlet, and an equivalent sample size of n = 10 transactions is used in the results reported in the paper. In effect, this can be viewed as an approximation to either a fully Bayesian hierarchical estimation or an empirical Bayes approach (see Cadez et al. (2001) for more detailed discussion). We did not pursue the fully Bayesian or empirical Bayes approaches for computational reasons, since the necessary integrals cannot be evaluated in closed form for this model and numerical methods (such as Markov chain Monte Carlo) would be impractical given the data sizes involved.

For comparison we also evaluate two other profiling approaches: (1) Global Mixture Weights: instead of individualized weights we set each individual's weight vector to the marginal weights (α_ik = α_k), and (2) Individualized MAP Weights: a non-mixture approach where we use an empirically-determined Dirichlet prior directly on the multinomials, and where the equivalent sample size of this prior was "tuned" on the test set to give optimal performance. This provides an (optimistic) baseline of using multinomial profiles directly, without use of any mixture models.

[Figure 4: Plot of the negative log probability scores per item (predictive entropy, bits/token) on out-of-sample transactions, for various weight models as a function of the number of mixture components K. Curves: individualized MAP weights; mixtures with individualized ML weights; mixtures with global mixture weights; mixtures with individualized EB weights.]

3 Experimental Results

To evaluate our approach we used a real-world transaction data set. The data consists of transactions collected at a chain of retail stores over a two-year period. We analyze the transactions here at the store department level (50 categories of items).
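The second-stage EM in weight-space described above can be sketched as follows. Holding the component multinomials fixed, responsibilities are computed in the E-step and the weights re-estimated in the M-step; folding the Dirichlet prior (mean equal to the marginal weights, equivalent sample size n0 = 10) in as pseudo-counts is one plausible reading of the paper's scheme, not necessarily its exact update, and the toy numbers are invented.

```python
import math

def eb_weights(transactions, thetas, prior_mean, n0=10.0, iters=50):
    """Estimate individual-specific mixture weights alpha_i by EM in
    weight-space, with the component multinomials thetas held fixed.
    The Dirichlet prior enters the M-step as n0 * prior_mean pseudo-counts
    (a smoothed, MAP-style update -- an assumption of this sketch)."""
    K = len(thetas)
    alpha = list(prior_mean)
    for _ in range(iters):
        resp_totals = [0.0] * K
        for counts in transactions:
            # E-step: responsibility of each component for this basket
            logs = [math.log(alpha[k]) +
                    sum(n * math.log(thetas[k][c])
                        for c, n in enumerate(counts) if n)
                    for k in range(K)]
            m = max(logs)
            w = [math.exp(v - m) for v in logs]
            s = sum(w)
            for k in range(K):
                resp_totals[k] += w[k] / s
        # M-step: responsibilities plus prior pseudo-counts, renormalized
        denom = len(transactions) + n0
        alpha = [(resp_totals[k] + n0 * prior_mean[k]) / denom
                 for k in range(K)]
    return alpha

thetas = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
prior = [0.5, 0.5]  # hypothetical marginal weights from the first EM run
# a customer whose two baskets look like component 1
alpha = eb_weights([[3, 1, 0], [4, 0, 1]], thetas, prior)
assert abs(sum(alpha) - 1.0) < 1e-9
assert alpha[0] > alpha[1]  # shrunk toward the prior but tuned to the data
```

With only two transactions the estimate stays close to the prior mean, illustrating the shrinkage behavior; a long history would pull the weights much further toward the individual's own data.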
We separate the data into two time periods (all transactions are time-stamped), with approximately 70% of the data being in the first time period (the training data) and the remainder in the test period. We train our mixture and weight models on the first period and evaluate our models in terms of their ability to predict transactions that occur in the subsequent out-of-sample test period.

The training data contains data on 4339 individuals, 58,866 transactions, and 164,000 items purchased. The test data consists of 4040 individuals, 25,292 transactions, and 69,103 items purchased. Not all individuals in the test data set appear in the training data set (and vice-versa): individuals in the test data set with no training data are assigned a global population model for scoring purposes.

To evaluate the predictive power of each model, we calculate the log-probability ("logp scores") of the transactions as predicted by each model. Higher logp scores mean that the model assigned higher probability to events that actually occurred. Note that the mean negative logp score over a set of transactions, divided by the total number of items, can be interpreted as a predictive entropy term in bits.
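This bits-per-item score can be sketched as follows under the mixture profile model; the parameters are toy values for illustration, not the paper's fitted models.

```python
import math

def predictive_entropy(test_transactions, alpha, thetas):
    """Mean negative log2-probability per item over a set of baskets --
    the bits-per-token predictive entropy score used to compare models."""
    total_logp, total_items = 0.0, 0
    for counts in test_transactions:
        # mixture log-likelihood of this basket, via log-sum-exp
        logs = [math.log(a) + sum(n * math.log(t)
                                  for n, t in zip(counts, th) if n)
                for a, th in zip(alpha, thetas) if a > 0]
        m = max(logs)
        total_logp += m + math.log(sum(math.exp(v - m) for v in logs))
        total_items += sum(counts)
    # negate and convert from nats to bits per item
    return -total_logp / (total_items * math.log(2))

thetas = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
score = predictive_entropy([[2, 1, 0]], [0.6, 0.4], thetas)
assert score > 0.0  # lower is better; zero would mean no uncertainty
```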
The lower this entropy term, the less uncertainty in our predictions (bounded below by zero, corresponding to zero uncertainty).

Figure 4 compares the out-of-sample predictive entropy scores as a function of the number of mixture components K for the mixture-based ML weights, the mixture-based global weights (where all individuals are assigned the same marginal mixture weights), the mixture-based empirical Bayes weights, and the non-mixture MAP histogram method (as a baseline). The mixture-based approaches generally outperform the non-mixture MAP histogram approach. The ML-based mixture weights start to overfit after about 6 mixture components (as expected). The global mixture weights and individualized mixture weights improve up to about K = 50 components and then show some evidence of overfitting.

[Figure 5: Scatter plots of the log probability scores for each individual on out-of-sample transactions, plotting log probability scores for individual weights versus log probability scores for the global weights model. Left: all data; right: close-up.]
The mixture-based individual weights method is systematically the best predictor, providing a 15% decrease in predictive entropy compared to the MAP histogram method, and a roughly 3% decrease compared to non-individualized global mixture weights.

Figure 5 shows a more detailed comparison of the difference between individual mixtures and the global profiles, on a subset of individuals. We can see that the global profiles are systematically worse than the individual weights model (i.e., most points are above the bisecting line). For individuals with the lowest likelihood (lower left of the left plot) the individual weight model is consistently better: typically the individuals with lower total likelihood are those with more transactions and items.

In Cadez et al. (2001) we report more detailed results on both this data set and a second retail data set involving 15 million items and 300,000 individuals. On both data sets the individual-level models were found to be consistently more accurate out-of-sample compared to both non-mixture and non-Bayesian approaches. We also found (empirically) that the time taken for EM to converge is roughly linear as a function of both the number of components and the number of transactions (plots are omitted due to lack of space), allowing, for example, fitting of models with 100 mixture components to approximately 2 million baskets in a few hours.

4 Conclusions

In this paper we investigated the use of mixture models and approximate Bayesian estimation for automatically inferring individual-level profiles from transaction data records. On a real-world retail data set the proposed framework consistently outperformed alternative approaches in terms of accuracy of predictions on future unseen customer behavior.

Acknowledgements

The research described in this paper was supported in part by NSF award IRI-9703120.
The work of Igor Cadez was supported by a Microsoft Graduate Research Fellowship.

References

Agrawal, R., Imielinski, T., and Swami, A. (1993) Mining association rules between sets of items in large databases, Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD'93), New York: ACM Press, pp. 207-216.

Cadez, I. V., Smyth, P., Ip, E., and Mannila, H. (2001) Predictive profiles for transaction data using finite mixture models, Technical Report UCI-ICS-01-67, Information and Computer Science, University of California, Irvine (available online at www.datalab.uci.edu).

Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R., and Kadie, C. (2000) Dependency networks for inference, collaborative filtering, and data visualization, Journal of Machine Learning Research, 1, pp. 49-75.

Hofmann, T. (1999) Probabilistic latent semantic indexing, Proceedings of the ACM SIGIR Conference 1999, New York: ACM Press, pp. 50-57.

Lawrence, R. D., Almasi, G. S., Kotlyar, V., Viveros, M. S., and Duri, S. S. (2001) Personalization of supermarket product recommendations, Data Mining and Knowledge Discovery, 5 (1/2).

Lazarsfeld, P. F. and Henry, N. W. (1968) Latent Structure Analysis, New York: Houghton Mifflin.

McCallum, A. (1999) Multi-label text classification with a mixture model trained by EM, in AAAI'99 Workshop on Text Learning.

Strehl, A. and Ghosh, J. (2000) Value-based customer grouping from large retail data-sets, Proc. SPIE Conf. on Data Mining and Knowledge Discovery, SPIE Proc. Vol. 4057, Orlando, pp. 33-42.
", "award": [], "sourceid": 2120, "authors": [{"given_name": "Igor", "family_name": "Cadez", "institution": null}, {"given_name": "Padhraic", "family_name": "Smyth", "institution": null}]}