{"title": "A Machine Learning Approach to Conjoint Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 257, "page_last": 264, "abstract": null, "full_text": "A Machine Learning Approach to Conjoint Analysis\n\nOlivier Chapelle, Zaid Harchaoui\nMax Planck Institute for Biological Cybernetics\nSpemannstr. 38 - 72076 Tübingen - Germany\n{olivier.chapelle,zaid.harchaoui}@tuebingen.mpg.de\n\n\nAbstract\n\nChoice-based conjoint analysis builds models of consumer preferences over products from answers gathered in questionnaires. Our main goal is to bring tools from the machine learning community to solve this problem more efficiently. To this end, we propose two algorithms to quickly and accurately estimate consumer preferences.\n\n\n1     Introduction\n\nConjoint analysis (also called trade-off analysis) is one of the most popular marketing research techniques used to determine which features a new product should have, by conjointly measuring consumers' trade-offs between discretized^1 attributes. In this paper, we focus on the choice-based conjoint analysis (CBC) framework [11] since it is both widely used and realistic: at each question in the survey, the consumer is asked to choose one product from several.\n\n^1 E.g., if the discretized attribute is weight, the levels would be light/heavy.\n\nThe preferences of a consumer are modeled via a utility function representing how much a consumer likes a given product. The utility $u(x)$ of a product $x$ is assumed to be the sum of the partial utilities (or partworths) for each attribute, i.e. linear: $u(x) = w \\cdot x$. However, instead of observing pairs $(x_l, y_l)$, the training samples are of the form $(\\{x_k^1, \\dots, x_k^p\\}, y_k)$, indicating that among the $p$ products $\\{x_k^1, \\dots, x_k^p\\}$, the $y_k$-th was preferred. Without noise, this is expressed mathematically by $u(x_k^{y_k}) \\geq u(x_k^b), \\quad \\forall b \\neq y_k$.\n\nLet us set up the general framework of a regular conjoint analysis survey. We have a population of $n$ consumers available for the survey. The survey consists of a questionnaire of $q$ questions for each consumer, each asking to choose one product from a basket of $p$. Each product profile is described through $a$ attributes with $l_1, \\dots, l_a$ levels each, via a vector of length $m = \\sum_{s=1}^{a} l_s$, with 1 at the positions of the levels taken by each attribute and 0 elsewhere.\n\nMarketing researchers are interested in estimating individual partworths in order to perform, for instance, a segmentation of the population afterwards. But traditional conjoint estimation techniques are not reliable for this task since the number of parameters $m$ to be estimated is usually larger than the number of answers $q$ available for each consumer. They estimate instead the partworths on the whole population (aggregated partworths). Here we aim to investigate this issue, for which machine learning can provide efficient tools. We also address adaptive questionnaire design with active learning heuristics.\n\n\n2       Hierarchical Bayes Analysis\n\nThe main idea of HB^2 is to estimate the individual utility functions under the constraint that their variance should not be too small. 
By doing so, the estimation problem is not ill-posed and the lack of information for a consumer can be compensated by the other ones.\n\n\n2.1      Probabilistic model\n\nIn this section, we follow [11] for the description of the HB model and its implementation. This method aims at estimating the individual linear utility functions $u_i(x) = w_i \\cdot x$, for $1 \\leq i \\leq n$. The probabilistic model is the following:\n\n         1. The individual partworths $w_i$ are drawn from a Gaussian distribution with mean $\\mu$ (representing the aggregated partworths) and covariance $\\Sigma$ (encoding the population's heterogeneity).\n\n         2. The covariance matrix $\\Sigma$ has an inverse Wishart prior, and $\\mu$ has an (improper) flat prior.\n\n         3. Given a set of products $(x_1, \\dots, x_p)$, the probability that consumer $i$ chooses the product $x$ is given by\n\n             $$P(x \\mid w_i) = \\frac{\\exp(w_i \\cdot x)}{\\sum_{b=1}^{p} \\exp(w_i \\cdot x_b)}. \\qquad (1)$$\n\n\n2.2      Model estimation\n\nWe now describe the standard way of estimating $\\mu$, $w \\equiv (w_1, \\dots, w_n)$ and $\\Sigma$ based on Gibbs sampling and then propose a much faster algorithm that approximates the maximum a posteriori (MAP) solution.\n\n\nGibbs sampling          As far as we know, all implementations of HB rely on a variant of Gibbs sampling [11]. During one iteration, each of the three sets of variables ($\\mu$, $w$ and $\\Sigma$) is drawn in turn from its posterior distribution, the two others being fixed. Sampling for $\\mu$ and $\\Sigma$ is straightforward, whereas sampling from $P(w \\mid \\mu, \\Sigma, Y) \\propto P(Y \\mid w) \\cdot P(w \\mid \\mu, \\Sigma)$ is achieved with the Metropolis-Hastings algorithm.\n\nWhen convergence is reached, the sampling goes on and finally outputs the empirical expectation of $\\mu$, $w$ and $\\Sigma$. 
Although the results of this sampling-based implementation of HB^3 are impressive, practitioners complain about its computational burden.\n\n\nApproximate MAP solution              So far, HB implementations make predictions by evaluating (1) at the empirical mean of the samples, in contrast with the standard Bayesian approach, which would average the right-hand side of (1) over the samples of $w$ drawn from the posterior. In order to alleviate the computational issues associated with Gibbs sampling, we suggest considering the maximum of the posterior distribution (maximum a posteriori, MAP) rather than its mean.\n\n^2 The technical papers of Sawtooth Software [11], the world-leading company for conjoint analysis software, provide very useful and extensive references.\n^3 We will call it HB-Sampled or HB-S in the rest of the paper.\n\nTo find the $\\mu$, $w$ and $\\Sigma$ which maximize $P(\\mu, w, \\Sigma \\mid Y)$, let us use Bayes' rule,\n\n$$P(\\mu, w, \\Sigma \\mid Y) \\propto P(Y \\mid \\mu, w, \\Sigma) \\cdot P(w \\mid \\mu, \\Sigma) \\cdot P(\\mu \\mid \\Sigma) \\cdot P(\\Sigma) \\propto P(Y \\mid w) \\cdot P(w \\mid \\mu, \\Sigma) \\cdot P(\\Sigma). \\qquad (2)$$\n\nMaximizing (2) with respect to $\\Sigma$ yields $\\Sigma_{MAP} = \\frac{I + C_w}{n + d}$, with $C_w$ being the \"covariance\" matrix of the $w_i$ centered at $\\mu$: $C_w = \\sum_i (w_i - \\mu)(w_i - \\mu)^\\top$. Putting back this value in (2), we get\n\n$$-\\log P(\\mu, w, \\Sigma_{MAP} \\mid Y) = -\\log P(Y \\mid w) + \\log |I + C_w(\\mu)| + C, \\qquad (3)$$\n\nwhere $C$ is an irrelevant constant. Using the model (1), the first term in the right-hand side of (3) is convex in $w$, but the second term is not. For this reason, we propose to replace $\\log |I + C_w|$ by $\\mathrm{trace}(C_w) = \\sum_i \\|w_i - \\mu\\|^2$ (this would be a valid approximation if $\\mathrm{trace}(C_w) \\ll 1$). With this new prior on $w$, the right-hand side of (3) becomes\n\n$$W(\\mu, w) = \\sum_{i=1}^{n} \\left[ -\\log P(Y_i \\mid w_i) + \\|w_i - \\mu\\|^2 \\right]. \\qquad (4)$$\n\nAs in equation (3), this objective function is minimized with respect to $\\mu$ when $\\mu$ is equal to the empirical mean of the $w_i$. We thus suggest the following iterative scheme to minimize the convex functional (4):\n\n         1. For a given $\\mu$, minimize (4) with respect to each of the $w_i$ independently.\n\n         2. For a given $w$, set $\\mu$ to the empirical mean^4 of the $w_i$.\n\nThanks to the convexity, this optimization problem can be solved very efficiently. A Newton approach has been implemented in step 1, as well as in step 2 to speed up the global convergence to a fixed point $\\mu$. Only a couple of steps are necessary in both cases to reach convergence.\n\n\nRemark         The approximation from equation (3) to (4) might be too crude. After all, it boils down to setting $\\Sigma$ to the identity matrix. One might instead consider $\\Sigma$ as a hyperparameter and optimize it by maximizing the marginalized likelihood [14].\n\n\n3       Conjoint Analysis with Support Vector Machines\n\nSimilarly to what has recently been proposed in [3], we now investigate the use of Support Vector Machines (SVM) [1, 12] to solve the conjoint estimation problem.\n\n\n3.1      Soft margin formulation of conjoint estimation\n\nLet us recall the learning problem. At the $k$-th question, the consumer chooses the $y_k$-th product from the basket $\\{x_k^1, \\dots, x_k^p\\}$: $w \\cdot x_k^{y_k} \\geq w \\cdot x_k^b, \\quad \\forall b \\neq y_k$. Our goal is to estimate the individual partworths $w$, with the individual utility function now being $u(x) = w \\cdot x$. With a reordering of the products, we can actually suppose that $y_k = 1$. 
Then the above inequalities can be rewritten as a set of $p - 1$ constraints:\n\n$$w \\cdot (x_k^1 - x_k^b) \\geq 0, \\quad 2 \\leq b \\leq p. \\qquad (5)$$\n\nEq. (5) shows that the conjoint estimation problem can be cast as a classification problem in the space of product-profile differences. From this point of view, it seems quite natural to use state-of-the-art classifiers such as SVMs for this purpose.\n\n^4 This is consistent with the L2-loss measuring deviations of the $w_i$ from $\\mu$.\n\nMore specifically, we propose to train an L2-soft margin classifier (see also [3] for a similar approach) with only positive examples and with a hyperplane passing through the origin (no bias), modelling the noise in the answers with slack variables $\\xi_{kb}$:\n\n$$\\text{Minimize } w^2 + C \\sum_{k=1}^{q} \\sum_{b=2}^{p} \\xi_{kb}^2 \\quad \\text{subject to } w \\cdot (x_k^1 - x_k^b) \\geq 1 - \\xi_{kb}.$$\n\n\n3.2        Estimation of individual utilities\n\nIt was proposed in [3] to train one SVM per consumer to get $w_i$ and to compute the individual partworths by regularizing with the aggregated partworths $\\bar{w} = \\frac{1}{n} \\sum_{i=1}^{n} w_i$: $w_i^* = \\frac{w_i + \\bar{w}}{2}$.\n\nInstead, to estimate the individual utility partworths $w_i$, we suggest the following optimization problem (the set $Q_i$ contains the indices $k$ such that consumer $i$ was asked to choose between products $x_k^1, \\dots, x_k^p$):\n\n$$\\text{Minimize } w_i^2 + \\frac{C}{q_i} \\sum_{k \\in Q_i} \\sum_{b=2}^{p} \\xi_{kb}^2 + \\frac{\\tilde{C}}{\\sum_{j \\neq i} q_j} \\sum_{k \\notin Q_i} \\sum_{b=2}^{p} \\xi_{kb}^2 \\quad \\text{subject to } w_i \\cdot (x_k^1 - x_k^b) \\geq 1 - \\xi_{kb}, \\quad \\forall k, \\forall b \\geq 2.$$\n\nHere the ratio $C/\\tilde{C}$ determines the trade-off between the individual scale and the aggregated one.^5 For $C/\\tilde{C} = 1$, the population is modeled as if it were homogeneous, i.e. all partworths $w_i$ are equal. For $C/\\tilde{C} \\gg 1$, the individual partworths are computed independently, without taking into account the aggregated partworths.\n\n\n4         Related work\n\nOrdinal regression                   Very recently, [2] explored the so-called ordinal regression task for ranking and derived two techniques for hyperparameter learning and model selection in a hierarchical Bayesian framework, Laplace approximation and Expectation Propagation respectively. Ordinal regression is similar to, yet distinct from, conjoint estimation since training data are supposed to be rankings or ratings, in contrast with conjoint estimation where training data are choice-based. 
See [4] for a more extensive bibliography.\n\n\nLarge margin classifiers                       Casting the preference problem in a classification framework, leading to learning by convex optimization, has long been known in the psychometrics community. [5] pioneered the use of large margin classifiers for ranking tasks. [3] introduced the kernel methods machinery for conjoint analysis on the individual scale. Very recently, [10] proposed an alternate method for dealing with heterogeneity in conjoint analysis, which boils down to an optimization very similar to our HB-MAP approximation objective function, but with large margin regularization and with minimum deviation from the aggregated partworths.\n\n\nCollaborative filtering                      Collaborative filtering exploits similarity between ratings across a population. The goal is to predict a person's rating on new products given the person's past ratings on similar products and the ratings of other people on all the products. Again, collaborative filtering is designed for overlapping training samples for each consumer, and usually rating/ranking training data, whereas conjoint estimation usually deals with different questionnaires for each consumer and choice-based training data.\n\n^5 $C \\geq \\tilde{C}$. In this way, directions for which the $x_j$, $j \\in Q_i$ contain information are estimated accurately, whereas the other directions are estimated thanks to the answers of the other consumers.\n\n\n5       Experiments\n\nArtificial experiments           We tested our algorithms on the same benchmarking artificial experimental setup used in [3, 16]. The simulated product profiles consist of 4 attributes, each of them being discretized through 4 levels. A random design was used for the questionnaire. 
For each question, the consumer was asked to choose one product from a basket of 4. A population of 100 consumers was simulated, each of them having to answer 4 questions. Finally, the results presented below are averaged over 5 trials.\n\nThe 100 true consumer partworths were generated from a Gaussian distribution with mean $(-\\beta, -\\beta/3, \\beta/3, \\beta)$ (for each attribute) and with a diagonal covariance matrix $\\sigma^2 I$. Each answer is a choice from the basket of products, sampled from the discrete logit-type distribution (1). Hence when $\\beta$ (called the magnitude^6) is large, the consumer will choose with high probability the product with the highest utility, whereas when $\\beta$ is small, the answers will be less reliable. The ratio $\\sigma^2/\\beta$ controls the heterogeneity^7 of the population.\n\nFinally, as in [3], the performances are computed using the mean of the L2 distances between the true and estimated individual partworths (also called RMSE). Beforehand, the partworths are translated such that the mean on each attribute is 0 and normalized to 1.\n\n\nReal experiments           We tested our algorithms on disguised industrial datasets kindly provided by Sawtooth Software Inc., the world-leading company in conjoint analysis software.\n\n11 one-choice-based^8 conjoint survey datasets^9 were used for the real experiments below. The number of attributes ranged from 3 to 6 (hence the total number of levels from 13 to 28), the size of the baskets from which to pick one product at each question ranged from 2 to 5, and the number of questions ranged from 6 to 15. The number of respondents ranged roughly from 50 to 1200. Since we did not address here the issue of no-choice options in question answering, we removed^10 questions where customers refused to choose a product from the basket and chose the no-choice option as an answer^11.\n\nFinally, as in [16], the performances are computed using the hit rate, i.e. the misprediction rate of the preferred product.\n\n^6 As in [3], we tested High Magnitude ($\\beta = 3$) and Low Magnitude ($\\beta = 0.5$).\n^7 It was set to either $\\sigma^2 = 3\\beta$ or $\\sigma^2 = 0.5\\beta$, respectively the High and Low Heterogeneity cases.\n^8 We limited ourselves to datasets in which respondents were asked to choose 1 product among a basket at each question.\n^9 See [4] for more details on the numerical features of the datasets.\n^10 One could use EM-based methods to deal with such missing training choice data.\n^11 When this procedure left an unreasonably small number of questions for hold-out evaluation of our algorithms, we simply removed the corresponding individuals.\n\n\n5.1      Analysis of HB-MAP\n\nWe compare in this section our implementation of the HB method described in Section 2, which we call HB-MAP, to HB-S, the standard HB implementation.\n\nThe average training time for HB-S was 19 minutes (with 12000 iterations as suggested in [11]), whereas our implementation based on the approximation of the MAP solution took on average only 1.8 seconds. So our primary goal, i.e. to alleviate the sampling phase complexity, was achieved since we got a speed-up factor of the order of 1000.\n\nThe accuracy does not seem to be significantly weakened by this new implementation. Indeed, as shown in both Table 1 and Table 2, the performances achieved by HB-MAP were surprisingly often as good as HB-S's, and sometimes even a bit better. This might be explained by the fact that assuming that the covariance matrix is quasi-diagonal is a reasonable approximation, and that the mode of the posterior distribution is actually roughly close to the mean, for the real datasets considered. 
Additionally, HB-S would likely have required many more iterations to converge and to behave systematically more accurately than HB-MAP, as one would normally have expected.\n\n\n5.2       Analysis of SVMs\n\nWe now turn to the SVM approach presented in Section 3.2, which we call Im.SV^12. We did not use a non-linear kernel in our experiments. Hence it was possible to minimize the objective of Section 3.2 directly in the primal, instead of using the dual formulation as is usually done. This turned out to be faster since the number of constraints was, for our problem, larger than the number of variables. The resulting mean training time was 4.7 seconds. The so-called chapspan, a span estimate of the leave-one-out prediction error [17], was used to select a suitable value of $C$^13, since it gave a quasi-convex estimation on the regularization path.\n\nThe performances of Im.SV in Table 1, compared to the HB methods and logistic regression [3], are very satisfactory in the case of artificial experiments. In real experiments, Im.SV gives overall quite satisfactory results, but sometimes disappointing ones in Table 2. One reason might be that the hyperparameters $(C, \\tilde{C})$ were optimized once for the whole population. 
This may also be due to the lack of robustness^14 of Im.SV to heterogeneity in the number of training samples for each consumer.\n\n\nTable 1: Average RMSE between estimated and true individual partworths\n\nMag   Het   HB-S   HB-MAP   Logistic   Im.SV\nL     L     0.90   0.83     0.84       0.86\nL     H     0.95   0.91     1.16       0.90\nH     L     0.44   0.40     0.43       0.41\nH     H     0.72   0.68     0.82       0.67\n\n\nTable 2: Hit rate performances on real datasets.\n\nDataset    Im.SV   HB-MAP   HB-S\nDat^1_2    0.16    0.16     0.17\nDat^2_2    0.15    0.13     0.15\nDat^1_3    0.37    0.24     0.25\nDat^2_3    0.34    0.33     0.33\nDat^3_3    0.35    0.28     0.24\nDat^4_3    0.35    0.31     0.28\nDat^1_4    0.33    0.36     0.35\nDat^2_4    0.33    0.36     0.28\nDat^3_4    0.45    0.40     0.25\nDat^1_5    0.52    0.45     0.48\nDat^2_5    0.58    0.47     0.51\n\n\nLegend of Tables 1 and 2                     The first two columns indicate the Magnitude and the Heterogeneity (High or Low). 
p in Dat^m_p is the number of products from which respondents are asked to choose one at each question.\n\n^12 So called since individual choice data are Immersed in the rest of the population choice data, via the optimization objective.\n^13 We observed that the value of the constant $\\tilde{C}$ was irrelevant, and that only the ratio $C/\\tilde{C}$ mattered.\n^14 Indeed, the no-choice data cleaning step might have led to a strong imbalance to which Im.SV is perhaps much more sensitive than HB-MAP or HB-S.\n\n\n6      Active learning\n\nMotivation        Traditional experimental designs are built by minimizing the variance of an estimator (e.g. orthogonal designs [6]). However, they are sub-optimal because they do not take into account the previous answers of the consumer. Therefore adaptive conjoint analysis was proposed [11, 16] for adaptively designing questionnaires.\n\nThe adaptive design concept is often called active learning in machine learning, as the algorithm can actively select questions whose responses are likely to be informative. In the SVM context, a common and intuitive strategy is to select, as the next point to be labeled, the one nearest to the decision boundary (see for instance [15]).\n\n\nExperiments         We implemented this heuristic for conjoint analysis by selecting for each question a set of products whose estimated utilities are as close as possible^15. 
To compare the different designs, we used the same artificial simulations as in Section 5, but with 16 questions per consumer in order to fairly compare to the orthogonal design.\n\n\nTable 3: Comparison of the RMSE achieved by different designs.\n\nMag   Het   Random   Orthogonal   Adaptive\nL     L     0.66     0.61         0.66\nL     H     0.62     0.56         0.56\nH     L     0.31     0.29         0.24\nH     H     0.49     0.45         0.34\n\n\nResults in Table 3 show that active learning produced an adaptive design which seems efficient, especially in the case of high magnitude, i.e. when the answers are not noisy^16.\n\n\n7      Discussion\n\nWe may need to capture correlations between attributes to model interaction effects among them. The polynomial kernel $K(u, v) = (u \\cdot v + 1)^d$ seems particularly relevant for such a task. Kernelization of HB methods can be done in the framework presented in [7]. For large margin methods, [10, 3] give a way to use the kernel trick in the space of product-profile differences. Prior knowledge of product-profile structure [3] may also be incorporated in the estimation process by using virtual examples [12].\n\nThe approach of [9] would allow us to improve our approximate MAP solution by learning a variational approximation of a non-isotropic diagonal covariance matrix.\n\nA fully Bayesian HB setting, i.e. with a maximum likelihood type II^17 (ML II) step, in contrast to sampling from the posterior, is known in the statistics community as Bayesian multinomial logistic regression. 
[18] use the Laplace approximation to compute the integration over hyperparameters for multi-class classification, while [8] develop a variational approximation of the posterior distribution.\n\nNew insights on learning Gaussian process regression in an HB framework have just been given in [13], where a method combining an EM algorithm and a generalized Nyström approximation of the covariance matrix is proposed, which could be incorporated in the HB-MAP approximation above.\n\n^15 Since the bottom-line goal of conjoint analysis is not really to estimate the partworths but to design the \"optimal\" product, adaptive design can also be helpful by focusing on products which have a high estimated utility.\n^16 Indeed, noisy answers are neither informative nor reliable for selecting the next question.\n^17 Also known as evidence maximization or hyperparameter learning.\n\n\n8    Conclusion\n\nChoice-based conjoint analysis seems to be a very promising application field for machine learning techniques. Further research includes fully Bayesian HB methods, extensions to non-linear models, as well as more elaborate and realistic active learning schemes.\n\n\nAcknowledgments\n\nThe authors are very grateful to J. Quiñonero-Candela and C. Rasmussen for fruitful discussions, and O. Toubia for providing us with his HB implementation. Many thanks to Sawtooth Software Inc. for providing us with real conjoint analysis datasets.\n\n\nReferences\n\n [1] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proc. 5th Annu. Workshop on Comput. Learning Theory, 1992.\n\n [2] W. Chu and Z. Ghahramani. Gaussian processes for ordinal regression. Technical report, University College London, 2004.\n\n [3] T. Evgeniou, C. Boussios, and G. Zacharia. Generalized robust conjoint estimation. Marketing Science, 25, 2005.\n\n [4] Z. Harchaoui. Statistical learning approaches to conjoint estimation. 
Technical report, Max Planck Institute for Biological Cybernetics, to appear.\n\n [5] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers. MIT Press, 2000.\n\n [6] J. Huber and K. Zwerina. The importance of utility balance in efficient choice designs. Journal of Marketing Research, 33, 1996.\n\n [7] T. S. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Artificial Intelligence and Statistics, 1999.\n\n [8] T. S. Jaakkola and M. I. Jordan. Bayesian logistic regression: a variational approach. Statistics and Computing, 10:25-37, 2000.\n\n [9] T. Jebara. Convex invariance learning. In Artificial Intelligence and Statistics, 2003.\n\n[10] C. A. Micchelli and M. Pontil. Kernels for multitask learning. In Advances in Neural Information Processing Systems 17, 2005.\n\n[11] Sawtooth Software. Research paper series. Available at www.sawtoothsoftware.com/techpap.shtml#hbrel.\n\n[12] B. Schölkopf and A. Smola. Learning with kernels. MIT Press, 2002.\n\n[13] A. Schwaighofer, V. Tresp, and K. Yu. Hierarchical bayesian modelling with gaussian processes. In Advances in Neural Information Processing Systems 17, 2005.\n\n[14] M. Tipping. Bayesian inference: Principles and practice. In Advanced Lectures on Machine Learning. Springer, 2004.\n\n[15] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2, 2001.\n\n[16] O. Toubia, J. R. Hauser, and D. I. Simester. Polyhedral methods for adaptive choice-based conjoint analysis. Journal of Marketing Research, 41(1):116-131, 2004.\n\n[17] V. Vapnik and O. Chapelle. Bounds on error expectation for support vector machines. Neural Computation, 12(9), 2000.\n\n[18] C. K. I. Williams and D. Barber. 
Bayesian classification with gaussian processes. IEEE Trans. Pattern Anal. Mach. Intell., 20, 1998.\n", "award": [], "sourceid": 2725, "authors": [{"given_name": "Olivier", "family_name": "Chapelle", "institution": null}, {"given_name": "Za\u00efd", "family_name": "Harchaoui", "institution": null}]}