{"title": "Pairwise Choice Markov Chains", "book": "Advances in Neural Information Processing Systems", "page_first": 3198, "page_last": 3206, "abstract": "As datasets capturing human choices grow in richness and scale, particularly in online domains, there is an increasing need for choice models flexible enough to handle data that violate traditional choice-theoretic axioms such as regularity, stochastic transitivity, or Luce's choice axiom. In this work we introduce the Pairwise Choice Markov Chain (PCMC) model of discrete choice, an inferentially tractable model that does not assume these traditional axioms while still satisfying the foundational axiom of uniform expansion, which can be viewed as a weaker version of Luce's axiom. We show that the PCMC model significantly outperforms the Multinomial Logit (MNL) model in prediction tasks on two empirical data sets known to exhibit violations of Luce's axiom. Our analysis also synthesizes several recent observations connecting the Multinomial Logit model and Markov chains; the PCMC model retains the Multinomial Logit model as a special case.", "full_text": "Pairwise Choice Markov Chains\n\nStephen Ragain\n\nStanford University\nStanford, CA 94305\n\nJohan Ugander\n\nStanford University\nStanford, CA 94305\n\nManagement Science & Engineering\n\nManagement Science & Engineering\n\nsragain@stanford.edu\n\njugander@stanford.edu\n\nAbstract\n\nAs datasets capturing human choices grow in richness and scale\u2014particularly in\nonline domains\u2014there is an increasing need for choice models that escape tra-\nditional choice-theoretic axioms such as regularity, stochastic transitivity, and\nLuce\u2019s choice axiom.\nIn this work we introduce the Pairwise Choice Markov\nChain (PCMC) model of discrete choice, an inferentially tractable model that does\nnot assume any of the above axioms while still satisfying the foundational axiom\nof uniform expansion, a considerably weaker assumption than Luce\u2019s choice ax-\niom. 
We show that the PCMC model signi\ufb01cantly outperforms both the Multino-\nmial Logit (MNL) model and a mixed MNL (MMNL) model in prediction tasks\non both synthetic and empirical datasets known to exhibit violations of Luce\u2019s\naxiom. Our analysis also synthesizes several recent observations connecting the\nMultinomial Logit model and Markov chains; the PCMC model retains the Multi-\nnomial Logit model as a special case.\n\n1\n\nIntroduction\n\nDiscrete choice models describe and predict decisions between distinct alternatives. Traditional ap-\nplications include consumer purchasing decisions, choices of schooling or employment, and com-\nmuter choices for modes of transportation among available options. Early models of probabilistic\ndiscrete choice, including the well known Thurstone Case V model [27] and Bradley-Terry-Luce\n(BTL) model [7], were developed and re\ufb01ned under diverse strict assumptions about human de-\ncision making. As complex individual choices become increasingly mediated by engineered and\nlearned platforms\u2014from online shopping to web browser clicking to interactions with recommen-\ndation systems\u2014there is a pressing need for \ufb02exible models capable of describing and predicting\nnuanced choice behavior.\nLuce\u2019s choice axiom, popularly known as the independence of irrelevant alternatives (IIA), is ar-\nguably the most storied assumption in choice theory [18]. The axiom consists of two statements, ap-\nplied to each subset of alternatives S within a broader universe U. Let paS = Pr(a chosen from S)\nfor any S \u2286 U, and in a slight abuse of notation let pab = Pr(a chosen from {a, b}) when there are\nonly two elements. 
Luce's axiom is then that: (i) if pab = 0, then paS = 0 for all S containing a and b; and (ii) the probability of choosing a from U conditioned on the choice lying in S is equal to paS.
The BTL model, which defines pab = γa/(γa + γb) for latent "quality" parameters γi > 0, satisfies the axiom, while Thurstone's Case V model does not [1]. Soon after its introduction, the BTL model was generalized from pairwise choices to choices from larger sets [4]. The resulting Multinomial Logit (MNL) model again employs quality parameters γi ≥ 0 for each i ∈ U and defines piS, the probability of choosing i from S ⊆ U, proportional to γi for all i ∈ S. Any model that satisfies Luce's choice axiom is equivalent to some MNL model [19].
One consequence of Luce's choice axiom is strict stochastic transitivity between alternatives: if pab ≥ 0.5 and pbc ≥ 0.5, then pac ≥ max(pab, pbc). A possibly undesirable consequence of strict stochastic transitivity is the necessity of a total order across all elements. But note that strict stochastic transitivity does not imply the choice axiom; Thurstone's model exhibits strict stochastic transitivity.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Many choice theorists and empiricists, including Luce, have noted that the choice axiom and stochastic transitivity are strong assumptions that do not hold for empirical choice data [9, 12, 13, 26, 28]. A range of discrete choice models striving to escape the confines of the choice axiom have emerged over the years. The most popular of these models have been Elimination by Aspects [29], mixed MNL (MMNL) [6], and nested MNL [22]. Inference is practically difficult for all three of these models [15, 23]. 
Additionally, Elimination by Aspects and the MMNL model also both exhibit the\nrigid property of regularity, de\ufb01ned below.\nA broad, important class of models in the study of discrete choice is the class of random utility\nmodels (RUMs) [4, 20]. A RUM af\ufb01liates with each i \u2208 U a random variable Xi and de\ufb01nes for each\nsubset S \u2286 U the probability Pr(i chosen from S) = Pr(Xi \u2265 Xj,\u2200j \u2208 S). An independent RUM\nhas independent Xi. RUMs assume neither choice axiom nor stochastic transitivity. Thurstone\u2019s\nCase V model and the BTL model are both independent RUMs; the Elimination by Aspects and\nMMNL models are both RUMs. A major result by McFadden and Train establishes that for any\nRUM there exists a MMNL model that can approximate the choice probabilities of that RUM to\nwithin an arbitrary error [23], a strong result about the generality of MMNL models. The nested\nMNL model, meanwhile, is not a RUM.\nAlthough RUMs need not exhibit stochastic transitivity, they still exhibit the weaker property of\nregularity: for any choice sets A, B where A \u2286 B, pxA \u2265 pxB. Regularity may at \ufb01rst seem\nintuitively pleasing, but it prevents models from expressing framing effects [12] and other empirical\nobservations from modern behavior economics [28]. This rigidity motivates us to contribute a new\nmodel of discrete choice that escapes historically common assumptions while still furnishing enough\nstructure to be inferentially tractable.\nThe present work.\nIn this work we introduce a conceptually simple and inferentially tractable\nmodel of discrete choice that we call the PCMC model. The parameters of the PCMC model are\nthe off-diagonal entries of a rate matrix Q indexed by U. The PCMC model af\ufb01liates each subset\nS of the alternatives with a continuous time Markov chain (CTMC) on S with transition rate matrix\nQS, whose off-diagonal entries are entries of Q indexed by pairs of items in S. 
The model de\ufb01nes\npiS, the selection probability of alternative i \u2208 S, as the probability mass of alternative i \u2208 S of the\nstationary distribution of the CTMC on S.\nThe transition rates of these CTMCs can be interpreted as measures of preferences between pairs\nof alternatives. Special cases of the model use pairwise choice probabilities as transition rates,\nand as a result the PCMC model extends arbitrary models of pairwise choice to models of set-\nwise choice. Indeed, we show that when the matrix Q is parameterized with the pairwise selection\nprobabilities of a BTL pairwise choice model, the PCMC model reduces to an MNL model. Recent\nparameterizations of non-transitive pairwise probabilities such as the Blade-Chest model [8] can be\nusefully employed to reduce the number of free parameters of the PCMC model.\nOur PCMC model can be thought of as building upon the observation underlying the recently in-\ntroduced Iterative Luce Spectral Ranking (I-LSR) procedure for ef\ufb01ciently \ufb01nding the maximum\nlikelihood estimate for parameters of MNL models [21]. The analysis of I-LSR is precisely analyz-\ning a PCMC model in the special case where the matrix Q has been parameterized by BTL. In that\ncase the stationary distribution of the chain is found to satisfy the stationary conditions of the MNL\nlikelihood function, establishing a strong connection between MNL models and Markov chains. The\nPCMC model generalizes that connection.\nOther recent connections between the MNL model and Markov chains include the work on Rank-\nCentrality [24], which employs a discrete time Markov chain for inference in the place of I-LSR\u2019s\ncontinuous time chain, in the special case where all data are pairwise comparisons.\nSeparate recent work has contributed a different discrete time Markov chain model of \u201cchoice sub-\nstitution\u201d capable of approximating any RUM [3], a related problem but one with a strong focus on\nordered preferences. 
Lastly, recent work by Kumar et al. explores conditions under which a probability distribution over discrete items can be expressed as the stationary distribution of a discrete time Markov chain with "score" functions similar to the "quality" parameters in an MNL model [17].

The PCMC model is not a RUM, and in general does not exhibit stochastic transitivity, regularity, or the choice axiom. We find that the PCMC model does, however, obey the lesser known but fundamental axiom of uniform expansion, a weakened version of Luce's choice axiom proposed by Yellott that implies the choice axiom for independent RUMs [30]. In this work we define a convenient structural property termed contractibility, of which uniform expansion is a special case, and we show that the PCMC model exhibits contractibility. Of the models mentioned above, only Elimination by Aspects exhibits uniform expansion without being an independent RUM. Elimination by Aspects obeys regularity, which the PCMC model does not; as such, the PCMC model is uniquely positioned in the literature of axiomatic discrete choice, minimally satisfying uniform expansion without the other aforementioned axioms.
After presenting the model and its properties, we investigate choice predictions from our model on two empirical choice datasets as well as diverse synthetic datasets. The empirical choice datasets concern transportation choices made on commuting and shopping trips in San Francisco. Inference on synthetic data shows that PCMC is competitive with MNL when Luce's choice axiom holds, while PCMC outperforms MNL when the axiom does not hold. 
More significantly, for both of the empirical datasets we find that a learned PCMC model predicts empirical choices significantly better than a learned MNL model.

2 The PCMC model

Figure 1: Markov chains on choice sets {a, b}, {a, c}, and {b, c}, where line thicknesses denote transition rates. The chain on the choice set {a, b, c} is assembled using the same rates.

A Pairwise Choice Markov Chain (PCMC) model defines the selection probability piS, the probability of choosing i from S ⊆ U, as the probability mass on alternative i ∈ S of the stationary distribution of a continuous time Markov chain (CTMC) on the set of alternatives S. The model's parameters are the off-diagonal entries qij of a rate matrix Q indexed by pairs of elements in U. See Figure 1 for a diagram. We impose the constraint qij + qji ≥ 1 for all pairs (i, j), which ensures irreducibility of the chain for all S.
Given a query set S ⊆ U, we construct QS by restricting the rows and columns of Q to elements in S and setting qii = −Σ_{j∈S\i} qij for each i ∈ S. Let πS = {πS(i)}_{i∈S} be the stationary distribution of the corresponding CTMC on S, and let πS(A) = Σ_{x∈A} πS(x). We define the choice probability piS := πS(i), and now show that the PCMC model is well defined.
Proposition 1. The choice probabilities piS are well defined for all i ∈ S and all S ⊆ U of a finite U.
Proof. We need only show that there is a single closed communicating class. Because S is finite, there must be at least one closed communicating class. Suppose the chain had more than one closed communicating class and that i ∈ S and j ∈ S were in different closed communicating classes. 
But qij + qji ≥ 1, so at least one of qij and qji is strictly positive, and the chain can switch communicating classes through the transition with strictly positive rate, a contradiction.

While the support of πS is the single closed communicating class, S may have transient states corresponding to alternatives with selection probability 0. Note that the irreducibility argument needs only that qij + qji be positive, not necessarily at least 1 as imposed in the model definition. One could simply constrain qij + qji ≥ ε for some positive ε. However, multiplying all entries of Q by some c > 0 does not affect the stationary distribution of the corresponding CTMC, so multiplication by 1/ε gives a Q with the same selection probabilities.
In the subsections that follow, we develop key properties of the model. We begin by showing how assigning Q according to a Bradley-Terry-Luce (BTL) pairwise model results in the PCMC model being equivalent to BTL's canonical extension, the Multinomial Logit (MNL) set-wise model. We then construct a Q for which the PCMC model is neither regular nor a RUM.

2.1 Multinomial Logit from Bradley-Terry-Luce

We now observe that the Multinomial Logit (MNL) model, also called the Plackett-Luce model, is precisely a PCMC model with a matrix Q consisting of pairwise BTL probabilities. Recall that the BTL model assumes the existence of latent "quality" parameters γi > 0 for i ∈ U with pij = γi/(γi + γj) for all i, j ∈ U, and that the MNL generalization defines piS ∝ γi for all i ∈ S, for each S ⊆ U.

Proposition 2. Let γ be the parameters of a BTL model on U. For qji = γi/(γi + γj), the PCMC probabilities piS are consistent with an MNL model on S with parameters γ.

Proof. 
We aim to show that πS = γ/‖γ‖1 is a stationary distribution of the PCMC chain, i.e. that πS^T QS = 0. We have:

(πS^T QS)i = (1/‖γ‖1) ( Σ_{j≠i} γj qji − γi Σ_{j≠i} qij ) = (γi/‖γ‖1) ( Σ_{j≠i} γj/(γi + γj) − Σ_{j≠i} γj/(γi + γj) ) = 0, for all i.

Thus πS is always the stationary distribution of the chain, and we know by Proposition 1 that it is unique. It follows that piS ∝ γi for all i ∈ S, as desired.

Other parameterizations of Q, which can be used for parameter reduction or to extend arbitrary models for pairwise choice, are explored in Section 1 of the Supplementary Material.

2.2 A counterexample to regularity

The regularity property stipulates that for any S′ ⊂ S, the probability of selecting a from S′ is at least the probability of selecting a from S. All RUMs exhibit regularity because S′ ⊆ S implies Pr(Xi = max_{j∈S′} Xj) ≥ Pr(Xi = max_{j∈S} Xj). We now construct a simple PCMC model which does not exhibit regularity, and is thus not a RUM.
Consider U = {r, p, s}, corresponding to a rock-paper-scissors-like stochastic game where each pairwise matchup has the same win probability α > 1/2. Constructing a PCMC model where the transition rate from i to j is α if j beats i yields the rate matrix

Q = [ −1      1 − α    α
      α       −1       1 − α
      1 − α   α        −1 ].

We see that for pairs of objects, the PCMC model returns the same probabilities as the pairwise game, i.e. pij = α when i beats j, since pij = qji when qij + qji = 1. 
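The pairwise claim is easy to verify numerically. Below is a minimal sketch (not the authors' code; the choice α = 0.8 and the use of least squares to solve the stationary equations are illustrative):

```python
import numpy as np

def choice_probs(Q, S):
    """PCMC selection probabilities for choice set S (list of indices into Q).

    Restrict Q to S, set each diagonal entry to minus the off-diagonal row
    sum, and solve pi^T Q_S = 0 together with sum(pi) = 1.
    """
    QS = Q[np.ix_(S, S)].astype(float)
    np.fill_diagonal(QS, 0.0)
    np.fill_diagonal(QS, -QS.sum(axis=1))
    A = np.vstack([QS.T, np.ones(len(S))])  # stationarity plus normalization
    b = np.zeros(len(S) + 1)
    b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

alpha = 0.8
# Rate matrix of the rock-paper-scissors-like game (order r, p, s):
# the transition rate from i to j is alpha when j beats i.
Q = np.array([[-1.0, 1 - alpha, alpha],
              [alpha, -1.0, 1 - alpha],
              [1 - alpha, alpha, -1.0]])

# On each pair, the winner's selection probability is alpha, as claimed.
print(choice_probs(Q, [0, 1]))  # r vs p -> approximately [0.8, 0.2]
print(choice_probs(Q, [1, 2]))  # p vs s -> approximately [0.8, 0.2]
```

The same routine applied to the full set {r, p, s} mirrors the construction of Proposition 1 and can be reused for any query set S.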
Regardless of how the probability α is chosen, however, it is always the case that prU = ppU = psU = 1/3. It follows that regularity does not hold for α > 2/3.
We view the PCMC model's lack of regularity as a positive trait in the sense that empirical choice phenomena such as framing effects and asymmetric dominance violate regularity [12], and the PCMC model is rare in its ability to model such choices. Deriving necessary and sufficient conditions on Q for a PCMC model to be a RUM, analogous to known characterization theorems for RUMs [10] and known sufficient conditions for nested MNL models to be RUMs [5], is an interesting open challenge.

3 Properties

While we have already demonstrated that the PCMC model avoids several restrictive properties that are often inconsistent with empirical choice data, we demonstrate in this section that the PCMC model still exhibits deep structure in the form of contractibility, which implies uniform expansion. Inspired by a thought experiment that was posed as an early challenge to the choice axiom, we define the property of contractibility to handle notions of similarity between elements. We demonstrate that the PCMC model exhibits contractibility, which gracefully handles this thought experiment.

3.1 Uniform expansion

Yellott [30] introduced uniform expansion as a weaker condition than Luce's choice axiom, but one that implies the choice axiom in the context of any independent RUM. Yellott posed the axiom of invariance to uniform expansion in the context of "copies" of elements which are "identical." In the context of our model, such copies would have identical transition rates to alternatives:
Definition 1 (Copies). 
For i, j in S ⊆ U, we say that i and j are copies if for all k ∈ S − i − j, qik = qjk and qij = qji.

Yellott's introduction to uniform expansion asks the reader to consider an offer of a choice of beverage from k identical cups of coffee, k identical cups of tea, and k identical glasses of milk. Yellott contends that the probability the reader chooses a type of beverage (e.g. coffee) in this scenario should be the same as if they were only shown one cup of each beverage type, regardless of k ≥ 1.
Definition 2 (Uniform Expansion). Consider a choice between n elements in a set S1 = {i11, . . . , in1}, and another choice from a set Sk containing k copies of each of the n elements: Sk = {i11, . . . , i1k, i21, . . . , i2k, . . . , in1, . . . , ink}. The axiom of uniform expansion states that for each m = 1, . . . , n and all k ≥ 1:

pim1S1 = Σ_{j=1}^{k} pimjSk.

We will show that the PCMC model always exhibits a more general property of contractibility, of which uniform expansion is a special case; it thus always exhibits uniform expansion.
Yellott showed that for any independent RUM with |U| ≥ 3, the double-exponential distribution family is the only family of independent distributions that exhibits uniform expansion for all k ≥ 1, and that Thurstone's model, based on the Gaussian distribution family, in particular does not exhibit uniform expansion.
While uniform expansion seems natural in many discrete choice contexts, it should be regarded with some skepticism in applications that model competitions. Sports matches or races are often modeled using RUMs, where the winner of a competition can be modeled as the competitor with the best draw from their random variable. If a competitor has a performance distribution with a heavy upper tail (so that their wins come from occasional "good days"), uniform expansion would not hold. 
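Returning to Yellott's beverage scenario, invariance under uniform expansion can be checked numerically for a PCMC model. The sketch below uses invented rates among three beverages; the symmetric within-copy rate of 0.5 is an arbitrary choice consistent with the definition of copies:

```python
import numpy as np

def stationary(Q):
    """Stationary distribution: solve pi^T Q = 0 with sum(pi) = 1."""
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def rates_to_Q(R):
    """Fill the diagonal of an off-diagonal rate matrix with minus row sums."""
    Q = R.astype(float).copy()
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

# Hypothetical rates among coffee, tea, milk (q_ij + q_ji = 1 throughout).
R = np.array([[0.0, 0.6, 0.8],
              [0.4, 0.0, 0.3],
              [0.2, 0.7, 0.0]])
n, k = 3, 4

# Expanded universe with k copies per beverage: each copy keeps its item's
# rates toward other items and interacts symmetrically (rate 0.5) with its
# fellow copies, matching the definition of copies above.
Rk = np.zeros((n * k, n * k))
for a in range(n * k):
    for b in range(n * k):
        if a != b:
            Rk[a, b] = 0.5 if a // k == b // k else R[a // k, b // k]

pi_one = stationary(rates_to_Q(R))
pi_type = stationary(rates_to_Q(Rk)).reshape(n, k).sum(axis=1)
print(pi_one)   # choice probabilities with one cup of each
print(pi_type)  # per-type mass with k copies; should match pi_one
```

The per-type masses should coincide with the single-copy probabilities for any k ≥ 1 and any symmetric within-copy rate, which is exactly the invariance the axiom demands.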
This observation relates to recent work on team performance and selection [14], where non-invariance under uniform expansion plays a key role.

3.2 Contractibility

In a book review of Luce's early work on the choice axiom, Debreu [9] considers a hypothetical choice between three musical recordings: one of Beethoven's eighth symphony conducted by X, another of Beethoven's eighth symphony conducted by Y, and one of a Debussy quartet conducted by Z. We will call these options B1, B2, and D respectively. When compared to D, Debreu argues that B1 and B2 are indistinguishable in the sense that pDB1 = pDB2. However, someone may prefer B1 over B2 in the sense that pB1B2 > 0.5. This is impossible under a BTL model, in which pDB1 = pDB2 implies that γB1 = γB2 and in turn pB1B2 = 0.5.
To address contexts in which elements compare identically to alternatives but not to each other (e.g. B1 and B2), we introduce contractible partitions that group these similar alternatives into sets. We then show that when a PCMC model contains a contractible partition, the relative probabilities of selecting from one of these partitions are independent of how comparisons are made between alternatives in the same set. Our contractible partition definition can be viewed as akin to (but distinct from) nests in nested MNL models [22].
Definition 3 (Contractible Partition). A partition of U into non-empty sets A1, . . . , Ak is a contractible partition if qaiaj = λij for all ai ∈ Ai and aj ∈ Aj, for some Λ = {λij}, i, j ∈ {1, . . . , k}.
Proposition 3. For a given Λ, let A1, . . . , Ak be a contractible partition for two PCMC models on U represented by Q, Q′ with stationary distributions π, π′. 
Then for any Ai:

Σ_{j∈Ai} pjU = Σ_{j∈Ai} p′jU,    (1)

or equivalently, π(Ai) = π′(Ai).

Proof. Suppose Q has contractible partition A1, . . . , Ak with respect to Λ. If we decompose the balance equations (i.e. each row of π^T Q = 0), for x ∈ A1 WLOG we obtain:

π(x) ( Σ_{y∈A1\x} qxy + Σ_{i=2}^{k} Σ_{ai∈Ai} qxai ) = Σ_{y∈A1\x} π(y) qyx + Σ_{i=2}^{k} Σ_{ai∈Ai} π(ai) qaix.    (2)

Noting that for ai ∈ Ai and aj ∈ Aj, qaiaj = λij, (2) can be rewritten:

π(x) ( Σ_{y∈A1\x} qxy + Σ_{i=2}^{k} |Ai| λ1i ) = Σ_{y∈A1\x} π(y) qyx + Σ_{i=2}^{k} π(Ai) λi1.

Summing over x ∈ A1 then gives

Σ_{x∈A1} π(x) Σ_{y∈A1\x} qxy + π(A1) Σ_{i=2}^{k} |Ai| λ1i = Σ_{x∈A1} Σ_{y∈A1\x} π(y) qyx + |A1| Σ_{i=2}^{k} π(Ai) λi1.

The leftmost term of each side is equal, so we have

π(A1) = |A1| Σ_{i=2}^{k} π(Ai) λi1 / Σ_{i=2}^{k} |Ai| λ1i,    (3)

which makes π(A1) the solution to the global balance equations of a different continuous time Markov chain with states {A1, . . . , Ak}, transition rate q̃AiAj = |Aj| λij between states Ai and Aj, and q̃AiAi = −Σ_{j≠i} q̃AiAj. Now qaiaj + qajai ≥ 1 implies λij + λji ≥ 1. Combining this observation with |Ai| > 0 shows (as with the proof of Proposition 1) that this chain is irreducible and thus that {π(Ai)}_{i=1}^{k} are well-defined. Furthermore, because Q̃ is determined entirely by Λ and |A1|, . . . , |Ak|, we have that Q̃ = Q̃′, and thus that π(Ai) = π′(Ai) for all i regardless of how Q and Q′ may differ, completing the proof.

The intuition is that we can "contract" each Ai to a single "type" because the probability of choosing an element of Ai is independent of the pairwise probabilities between elements within the sets. The above proposition and the contractibility of a PCMC model on all uniformly expanded sets imply that all PCMC models exhibit uniform expansion.
Proposition 4. Any PCMC model exhibits uniform expansion.

Proof. We translate the problem of uniform expansion into the language of contractibility. Let U1 be the universe of unique items i11, i21, . . . , in1, and let Uk be a universe containing k copies of each item in U1, where imj denotes the jth copy of the mth item in U1. Thus Uk = ∪_{m=1}^{n} ∪_{j=1}^{k} imj.
Let Q be the transition rate matrix of the CTMC on U1. We construct a contractible partition of Uk into n sets, each containing the k copies of some item in U1: Am = ∪_{j=1}^{k} imj. By the definition of copies, {Am}_{m=1}^{n} is a contractible partition of Uk with Λ = Q. Noting that |Am| = k for all m in Equation (3) above results in {π(Am)}_{m=1}^{n} being the solution to π^T Λ = π^T Q = 0, since the common factor k scales out. Thus pimU1 = π(Am) = Σ_{j=1}^{k} pimjUk for each m, showing that the model exhibits uniform expansion.

We end this section by noting that every PCMC model has a trivial contractible partition into singletons. Detection and exploitation of Q's non-trivial contractible partitions (or appropriately defined "nearly contractible partitions") are interesting open research directions.

4 Inference and prediction

Our ultimate goal in formulating this model is to make predictions: using past choices from diverse subsets S ⊆ U to predict future choices. 
In this section we first give the log-likelihood function log L(Q; C) of the rate matrix Q given a collection of choice data of the form C = {(ik, Sk)}_{k=1}^{n}, where ik ∈ Sk was the item chosen from Sk. We then investigate the ability of a learned PCMC model to make choice predictions on empirical data, benchmarked against learned MNL and MMNL models, and interpret the inferred model parameters Q̂. Let CiS(C) = |{(ik, Sk) ∈ C : ik = i, Sk = S}| denote the number of times in the data that i was chosen out of set S for each S ⊆ U, and let CS(C) = |{(ik, Sk) ∈ C : Sk = S}| be the number of times that S was the choice set for each S ⊆ U.

4.1 Maximum likelihood

For each S ⊆ U and i ∈ S, recall that piS(Q) is the probability that i is selected from set S as a function of the rate matrix Q. After dropping all additive constants, the log-likelihood of Q given the data C (derived from the probability mass function of the multinomial distribution) is:

log L(Q; C) = Σ_{S⊆U} Σ_{i∈S} CiS(C) log(piS(Q)).

Recall that for the PCMC model, piS(Q) = πS(i), where πS is the stationary distribution for a CTMC with rate matrix QS, i.e. πS^T QS = 0 and Σ_{i∈S} πS(i) = 1. There is no general closed form expression for piS(Q). The implicit definition also makes it difficult to derive gradients for log L with respect to the parameters qij. We employ SLSQP [25] to maximize log L(Q; C), which is non-concave in general. For more information on the optimization techniques used in this section, see the Supplementary Materials.

4.2 Empirical data results

We evaluate our inference procedure on two empirical choice datasets, SFwork and SFshop, collected from a survey of transportation preferences around the San Francisco Bay Area [16]. 
The SFshop dataset contains 3,157 observations, each consisting of a choice set of transportation alternatives available to individuals traveling to and returning from a shopping center, as well as a choice from that choice set. The SFwork dataset, meanwhile, contains 5,029 observations consisting of commuting options and the choice made on a given commute. Basic statistics describing the choice set sizes and the number of times each pair of alternatives appears in the same choice set appear in the Supplementary Materials¹.
We train our model on observations Ttrain ⊂ C and evaluate on a test set Ttest ⊂ C via

Error(Ttrain; Ttest) = (1/|Ttest|) Σ_{(i,S)∈Ttest} Σ_{j∈S} |pjS(Q̂(Ttrain)) − p̃jS(Ttest)|,    (4)

where Q̂(Ttrain) is the estimate for Q obtained from the observations in Ttrain and p̃iS(Ttest) = CiS(Ttest)/CS(Ttest) is the empirical probability that i was selected from S among observations in Ttest. Note that Error(Ttrain; Ttest) is the expected ℓ1-norm of the difference between the empirical distribution and the inferred distribution on a choice set drawn uniformly at random from the observations in Ttest. We applied small amounts of additive smoothing to each dataset.
We compare our PCMC model against both an MNL model trained using Iterative Luce Spectral Ranking (I-LSR) [21] and a more flexible MMNL model. We used a discrete mixture of k MNL models (with O(kn) parameters), choosing k so that the MMNL model had strictly more parameters than the PCMC model on each dataset. For details on how the MMNL model was trained, see the Supplementary Materials.
Figure 2 shows Error(Ttrain; Ttest) on the SFwork data as the learning procedure is applied to increasing amounts of data. The results are averaged over 1,000 different permutations of the data with a 75/25 train/test split employed for each permutation. 
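To make the inference procedure of Section 4.1 concrete, here is a minimal sketch of likelihood maximization with SLSQP on a toy synthetic dataset. The counts, the starting point, and the use of scipy's SLSQP routine are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

n = 3
pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
idx = {p: a for a, p in enumerate(pairs)}

def q_matrix(theta):
    """Assemble a rate matrix from the vector of off-diagonal entries."""
    Q = np.zeros((n, n))
    for t, (i, j) in zip(theta, pairs):
        Q[i, j] = t
    return Q

def choice_probs(Q, S):
    """Stationary distribution of the CTMC restricted to choice set S."""
    QS = Q[np.ix_(S, S)].copy()
    np.fill_diagonal(QS, -QS.sum(axis=1))  # diagonal entries start at zero
    A = np.vstack([QS.T, np.ones(len(S))])
    b = np.zeros(len(S) + 1)
    b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Invented counts C_iS: how often each item of S was the one chosen.
data = {(0, 1): [30, 20], (0, 2): [25, 25], (1, 2): [10, 40],
        (0, 1, 2): [20, 15, 25]}

def neg_log_lik(theta):
    Q = q_matrix(theta)
    nll = 0.0
    for S, counts in data.items():
        p = np.clip(choice_probs(Q, list(S)), 1e-10, 1.0)
        nll -= float(np.dot(counts, np.log(p)))
    return nll

# One inequality constraint q_ij + q_ji >= 1 per unordered pair.
cons = [{"type": "ineq",
         "fun": lambda th, a=idx[(i, j)], b=idx[(j, i)]: th[a] + th[b] - 1.0}
        for (i, j) in pairs if i < j]

res = minimize(neg_log_lik, x0=np.ones(len(pairs)), method="SLSQP",
               bounds=[(0.0, None)] * len(pairs), constraints=cons)
Q_hat = q_matrix(res.x)
print(choice_probs(Q_hat, [0, 1]))  # should move toward the empirical 0.6/0.4
```

Since gradients of the implicitly defined piS(Q) are unavailable here, SLSQP falls back on finite differences; this is adequate for a toy problem, though the paper notes that streamlining this optimization is an open question.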
We show the error on the testing data as we train with increasing proportions of the training data. A similar figure for the SFshop data appears in the Supplementary Materials.

Figure 2: Prediction error on a 25% holdout of the SFwork data for the PCMC, MNL, and MMNL models. PCMC sees improvements of 35.9% and 24.5% in prediction error over MNL and MMNL, respectively, when training on 75% of the data.

We see that our model is better equipped to learn from and make predictions in both datasets, and when using all of the training data, we observe an error reduction of 36.2% and 46.5% compared to MNL and 24.4% and 31.7% compared to MMNL on SFwork and SFshop respectively.
Figure 2 also gives two different heat maps of Q̂ for the SFwork data, showing the relative rates q̂ij/q̂ji between pairs of items as well as how the total rate q̂ij + q̂ji between pairs compares to total rates between other pairs. The index ordering of each matrix follows the estimated selection probabilities of the PCMC model on the full set of the alternatives for that dataset. The ordered options for SFwork are: (1) driving alone, (2) sharing a ride with one other person, (3) walking, (4) public transit, (5) biking, and (6) carpooling with at least two others. Numerical values for the entries of Q̂ for both datasets appear in the Supplementary Materials.
The inferred pairwise selection probabilities are p̂ij = q̂ji/(q̂ji + q̂ij). Constructing a tournament graph on the alternatives where (i, j) ∈ E if p̂ij ≥ 0.5, cyclic triplets are then length-3 cycles in the tournament. A bound due to Harary and Moser [11] establishes that the maximum number of cyclic triples in a tournament graph on n nodes is 8 when n = 6 and 20 when n = 8.

¹Data and code available at: https://github.com/sragain/pcmc-nips
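Counting cyclic triplets from an inferred pairwise matrix is straightforward. The sketch below uses a hypothetical 3 × 3 matrix, not the fitted SFwork values:

```python
from itertools import combinations

import numpy as np

def cyclic_triples(P):
    """Count length-3 cycles in the tournament with edge i -> j when P[i, j] >= 0.5.

    P[i, j] is the inferred probability that i is chosen over j.
    """
    beats = lambda a, c: P[a, c] >= 0.5
    count = 0
    for i, j, k in combinations(range(P.shape[0]), 3):
        # a cyclic triple is either i > j > k > i or i > k > j > i
        if (beats(i, j) and beats(j, k) and beats(k, i)) or \
           (beats(j, i) and beats(k, j) and beats(i, k)):
            count += 1
    return count

# Hypothetical pairwise probabilities with one cycle: 0 beats 1 beats 2 beats 0.
P = np.array([[0.5, 0.7, 0.4],
              [0.3, 0.5, 0.6],
              [0.6, 0.4, 0.5]])
print(cyclic_triples(P))  # -> 1
```

A transitive matrix would yield zero cycles, so this count is a quick diagnostic for intransitivity in a fitted Q̂.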
According to our learned model, the choices exhibit 2 of a maximum of 8 cyclic triplets in the SFwork data and 6 of a maximum of 20 cyclic triplets in the SFshop data.

Additional evaluations of predictive performance across a range of synthetic datasets appear in the Supplementary Materials. The majority of datasets in the discrete choice literature focus on pairwise comparisons or ranked lists, where lists inherently assume transitivity and the independence of irrelevant alternatives. The SFwork and SFshop datasets are rare examples of public datasets that genuinely study choices from sets larger than pairs.

5 Conclusion

We introduce the Pairwise Choice Markov Chain (PCMC) model of discrete choice, which defines selection probabilities according to the stationary distributions of continuous-time Markov chains on the alternatives. The model parameters are the transition rates between pairs of alternatives.

In general the PCMC model is not a random utility model (RUM), and it maintains broad flexibility by eschewing the implications of Luce's choice axiom, stochastic transitivity, and regularity. Despite this flexibility, we demonstrate that the PCMC model exhibits desirable structure by fulfilling uniform expansion, a property previously found only in the Multinomial Logit (MNL) model and the intractable Elimination by Aspects model.

We also introduce the notion of contractibility, a property motivated by thought experiments instrumental in moving choice theory beyond the choice axiom, and of which Yellott's axiom of uniform expansion is a special case. Our work demonstrates that the PCMC model exhibits contractibility, which implies uniform expansion.
We also showed that the PCMC model offers straightforward inference through maximum likelihood estimation, and that a learned PCMC model predicts empirical choice data with significantly higher fidelity than both MNL and MMNL models.

The flexibility and tractability of the PCMC model opens up many compelling research directions. First, what necessary and sufficient conditions on the matrix Q guarantee that a PCMC model is a RUM [10]? The efficacy of the PCMC model suggests exploring other effective parameterizations for Q, including developing inferential methods which exploit contractibility. There are also open computational questions, such as streamlining the likelihood maximization using gradients of the implicit function definitions. Very recently, learning results for nested MNL models have shown favorable query complexity under an oracle model [2], and a comparison of our PCMC model with these approaches to learning nested MNL models is important future work.

Acknowledgements. This work was supported in part by a David Morgenthaler II Faculty Fellowship and a Dantzig–Lieberman Operations Research Fellowship.

References
[1] E. Adams and S. Messick. An axiomatic formulation and generalization of successive intervals scaling. Psychometrika, 23(4):355–368, 1958.
[2] A. R. Benson, R. Kumar, and A. Tomkins. On the relevance of irrelevant alternatives. In WWW, 2016.
[3] J. Blanchet, G. Gallego, and V. Goyal. A Markov chain approximation to choice modeling. In EC, 2013.
[4] H. D. Block and J. Marschak. Random orderings and stochastic theories of responses. Contributions to Probability and Statistics, 2:97–132, 1960.
[5] A. Börsch-Supan. On the compatibility of nested logit models with utility maximization. Journal of Econometrics, 43(3):373–388, 1990.
[6] J. H. Boyd and R. E. Mellman.
The effect of fuel economy standards on the U.S. automotive market: an hedonic demand analysis. Transportation Research Part A: General, 14(5-6):367–378, 1980.
[7] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: The method of paired comparisons. Biometrika, 39(3-4):324–345, 1952.
[8] S. Chen and T. Joachims. Modeling intransitivity in matchup and comparison data. In WSDM, 2016.
[9] G. Debreu. Review of Individual Choice Behavior: A Theoretical Analysis. American Economic Review, 1960.
[10] J.-C. Falmagne. A representation theorem for finite random scale systems. J. Math. Psych., 18(1):52–72, 1978.
[11] F. Harary and L. Moser. The theory of round robin tournaments. The American Mathematical Monthly, 73(3):231–246, 1966.
[12] J. Huber, J. W. Payne, and C. Puto. Adding asymmetrically dominated alternatives: Violations of regularity and the similarity hypothesis. Journal of Consumer Research, pages 90–98, 1982.
[13] S. Ieong, N. Mishra, and O. Sheffet. Predicting preference flips in commerce search. In ICML, 2012.
[14] J. Kleinberg and M. Raghu. Team performance with test scores. In EC, pages 511–528, 2015.
[15] R. Kohli and K. Jedidi. Error theory for elimination by aspects. Operations Research, 63(3):512–526, 2015.
[16] F. S. Koppelman and C. Bhat. A self instructing course in mode choice modeling: multinomial and nested logit models. US Department of Transportation, Federal Transit Administration, 31, 2006.
[17] R. Kumar, A. Tomkins, S. Vassilvitskii, and E. Vee. Inverting a steady-state. In WSDM, pages 359–368, 2015.
[18] R. D. Luce. Individual Choice Behavior: A Theoretical Analysis. Wiley, 1959.
[19] R. D. Luce. The choice axiom after twenty years. J. Math. Psych., 15(3):215–233, 1977.
[20] C. F. Manski.
The structure of random utility models. Theory and Decision, 8(3):229–254, 1977.
[21] L. Maystre and M. Grossglauser. Fast and accurate inference of Plackett–Luce models. In NIPS, 2015.
[22] D. McFadden. Econometric models for probabilistic choice among products. Journal of Business, pages S13–S29, 1980.
[23] D. McFadden, K. Train, et al. Mixed MNL models for discrete response. Journal of Applied Econometrics, 15(5):447–470, 2000.
[24] S. Negahban, S. Oh, and D. Shah. Rank Centrality: Ranking from pair-wise comparisons. arXiv preprint arXiv:1209.1688v4, 2015.
[25] J. Nocedal and S. J. Wright. Numerical Optimization. 2006.
[26] I. Simonson and A. Tversky. Choice in context: Tradeoff contrast and extremeness aversion. Journal of Marketing Research, 29(3):281, 1992.
[27] L. L. Thurstone. A law of comparative judgment. Psychological Review, 34(4):273, 1927.
[28] J. S. Trueblood, S. D. Brown, A. Heathcote, and J. R. Busemeyer. Not just for consumers: Context effects are fundamental to decision making. Psychological Science, 24(6):901–908, 2013.
[29] A. Tversky. Elimination by aspects: A theory of choice. Psychological Review, 79(4):281, 1972.
[30] J. I. Yellott. The relationship between Luce's choice axiom, Thurstone's theory of comparative judgment, and the double exponential distribution. J. Math. Psych., 15(2):109–144, 1977.