{"title": "An MDP-Based Approach to Online Mechanism Design", "book": "Advances in Neural Information Processing Systems", "page_first": 791, "page_last": 798, "abstract": "", "full_text": "An MDP-Based Approach to Online\n\nMechanism Design\n\nDavid C. Parkes\n\nDivision of Engineering and Applied Sciences\n\nHarvard University\n\nparkes@eecs.harvard.edu\n\nSatinder Singh\n\nComputer Science and Engineering\n\nUniversity of Michigan\n\nbaveja@umich.edu\n\nAbstract\n\nOnline mechanism design (MD) considers the problem of provid-\ning incentives to implement desired system-wide outcomes in sys-\ntems with self-interested agents that arrive and depart dynami-\ncally. Agents can choose to misrepresent their arrival and depar-\nture times, in addition to information about their value for di(cid:11)erent\noutcomes. We consider the problem of maximizing the total long-\nterm value of the system despite the self-interest of agents. The\nonline MD problem induces a Markov Decision Process (MDP),\nwhich when solved can be used to implement optimal policies in a\ntruth-revealing Bayesian-Nash equilibrium.\n\n1\n\nIntroduction\n\nMechanism design (MD) is a sub(cid:12)eld of economics that seeks to implement par-\nticular outcomes in systems of rational agents [1]. Classically, MD considers static\nworlds in which a one-time decision is made and all agents are assumed to be pa-\ntient enough to wait for the decision. By contrast, we consider dynamic worlds in\nwhich agents may arrive and depart over time and in which a sequence of decisions\nmust be made without the bene(cid:12)t of hindsight about the values of agents yet to\narrive. The MD problem for dynamic systems is termed online mechanism design\n[2]. Online MD supposes the existence of a center, that can receive messages from\nagents and enforce a particular outcome and collect payments.\n\nSequential decision tasks introduce new subtleties into the MD problem. 
First, decisions now have expected value rather than certain value, because of uncertainty about the future. Second, new temporal strategies are available to an agent, such as waiting to report its presence in order to improve its utility within the mechanism. Online mechanisms must bring truthful and immediate revelation of an agent's value for sequences of decisions into equilibrium.\n\nWithout the problem of private information and incentives, the sequential decision problem in online MD could be formulated and solved as a Markov Decision Process (MDP). In fact, we show that an optimal policy and MDP-value function can be used to define an online mechanism in which truthful and immediate revelation of an agent's valuation for different sequences of decisions is a Bayes-Nash equilibrium.\n\nOur approach is very general, applying to any MDP in which the goal is to maximize the total expected sequential value across all agents. To illustrate the flexibility of this model, consider the following illustrative applications:\n\nreusable goods. A renewable resource is available in each time period. Agents arrive and submit a bid for a particular quantity of resource for each of a contiguous sequence of periods, and before some deadline.\n\nmulti-unit auction. A finite number of identical goods are for sale. Agents submit bids for a quantity of goods with a deadline, by which time a winner-determination decision must be made for that agent.\n\nmultiagent coordination. A central controller determines and enforces the actions that will be performed by a dynamically changing team of agents. Agents are only able to perform actions while present in the system.\n\nOur main contribution is to identify this connection between online MD and MDPs, and to define a new family of dynamic mechanisms that we term the online VCG mechanism. 
We also clearly identify the role of the ability to stall a decision, as it relates to the value of an agent, in providing for Bayes-Nash truthful mechanisms.\n\n1.1 Related Work\n\nThe problem of online MD is due to Friedman and Parkes [2], who focused on strategyproof online mechanisms in which immediate and truthful revelation of an agent's valuation function is a dominant-strategy equilibrium. The authors define the mechanism that we term the delayed VCG mechanism, identify problems for which the mechanism is strategyproof, and provide the seeds of our work on Bayes-Nash truthful mechanisms. Work on online auctions [3] is also related, in that it considers a system with dynamic agent arrivals and departures. However, the online auction work considers a much simpler setting (see also [4]), for instance the allocation of a fixed number of identical goods, and places less emphasis on temporal strategies or allocative efficiency. Awerbuch et al. [5] provide a general method to construct online auctions from online optimization algorithms. In contrast to our methods, their methods consider the special case of single-minded bidders with a value v_i for a particular set of resources r_i, and are only temporally strategyproof in the special case of online algorithms with a non-decreasing acceptance threshold.\n\n2 Preliminaries\n\nIn this section, we introduce a general discrete-time and finite-action formulation for a multiagent sequential decision problem. Putting incentives to one side for now, we also define and solve an MDP formalization of the problem. We consider a finite-horizon problem¹ with a set T of discrete time points and a sequence of decisions k = {k_1, ..., k_T}, where k_t ∈ K_t and K_t is the set of feasible decisions in period t. Agent i ∈ I arrives at time a_i ∈ T, departs at time d_i ∈ T, and has value v_i(k) ≥ 0 for the sequence of decisions k. 
By assumption, an agent has no value for decisions outside of the interval [a_i, d_i]. Agents also face payments, which we allow in general to be collected after an agent's departure. Collectively, the information θ_i = (a_i, d_i, v_i) defines the type of agent i, with θ_i ∈ Θ. Agent types are sampled i.i.d. from a probability distribution f(θ), assumed known to the agents and to the central mechanism. We allow multiple agents to arrive and depart at the same time. Agent i, with type θ_i, receives utility u_i(k, p, θ_i) = v_i(k, θ_i) − p for decisions k and payment p. Agents are modeled as expected-utility maximizers. We adopt as our goal that of maximizing the expected total sequential value across all agents.\n\n¹The model can be trivially extended to consider infinite horizons if all agents share the same discount factor, but will require some care for more general settings.\n\nIf we were to simply ignore incentive issues, the expected-value-maximizing decision problem induces an MDP. The state² of the MDP at time t is the history vector h_t = (θ_1, ..., θ_t; k_1, ..., k_{t−1}), which includes the reported types up to and including period t and the decisions made up to and including period t−1. The set of all possible states at time t is denoted H_t. The set of all possible states across all time is H = ∪_{t=1}^{T+1} H_t. The set of decisions available in state h_t is K_t(h_t). Given a decision k_t ∈ K_t(h_t) in state h_t, there is some probability distribution Prob(h_{t+1} | h_t, k_t) over possible next states h_{t+1}, determined by the random new agent arrivals, agent departures, and the impact of decision k_t. This makes explicit the dynamics that were left implicit in the type distribution f(θ), and includes additional information about the domain.\n\nThe objective is to make decisions to maximize the expected total value across all agents. 
We define a payoff function for the induced MDP as follows: there is a payoff R_i(h_t, k_t) = v_i(k_{≤t}, θ_i) − v_i(k_{≤t−1}, θ_i) that becomes available from agent i upon taking action k_t in state h_t. With this, we have Σ_{t=1}^{τ} R_i(h_t, k_t) = v_i(k_{≤τ}, θ_i) for all periods τ. The summed value Σ_i R_i(h_t, k_t) is the payoff obtained from all agents at time t, and is denoted R(h_t, k_t). By assumption, the reward to an agent in this basic online MD problem depends only on decisions, and not on state. The transition probabilities and the reward function defined above, together with the feasible decision space, constitute the induced MDP M_f.\n\nGiven a policy π = {π_1, π_2, ..., π_T}, where π_t : H_t → K_t, an MDP defines an MDP-value function V^π as follows: V^π(h_t) is the expected value of the summed payoff obtained from state h_t onwards under policy π, i.e., V^π(h_t) = E_π{R(h_t, π(h_t)) + R(h_{t+1}, π(h_{t+1})) + ... + R(h_T, π(h_T))}. An optimal policy π* is one that maximizes the MDP-value of every state³ in H. The optimal MDP-value function V* can be computed via the following value iteration algorithm: for t = T−1, T−2, ..., 1,\n\n∀h ∈ H_t:  V*(h) = max_{k ∈ K_t(h)} [ R(h, k) + Σ_{h′ ∈ H_{t+1}} Prob(h′ | h, k) V*(h′) ]\n\nwhere V*(h ∈ H_T) = max_{k ∈ K_T(h)} R(h, k). This algorithm works backwards in time from the horizon and has time complexity polynomial in the size of the MDP and the time horizon T.\n\nGiven the optimal MDP-value function, the optimal policy is derived as follows: for t < T,\n\nπ*(h ∈ H_t) = arg max_{k ∈ K_t(h)} [ R(h, k) + Σ_{h′ ∈ H_{t+1}} Prob(h′ | h, k) V*(h′) ]\n\nand π*(h ∈ H_T) = arg max_{k ∈ K_T(h)} R(h, k). 
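The backward value iteration above can be sketched in a few lines of code. This is a minimal illustrative sketch, not code from the paper: the interfaces `states`, `decisions(t, h)`, `transitions(t, h, k)`, and `reward(t, h, k)` are assumed representations of H_t, K_t(h), Prob(· | h, k), and R(h, k).

```python
# Hedged sketch of the finite-horizon value iteration described above.
# Assumed interfaces (not from the paper):
#   states[t]        -- iterable of states h in H_t, for t = 1..T
#   decisions(t, h)  -- feasible decisions K_t(h)
#   transitions(t, h, k) -- list of (next_state, probability) pairs
#   reward(t, h, k)  -- summed payoff R(h, k) across agents

def value_iteration(T, states, decisions, transitions, reward):
    """Compute V*(h) for every state, working backwards from the horizon."""
    V = {}
    # Terminal period: V*(h in H_T) = max_k R(h, k).
    for h in states[T]:
        V[(T, h)] = max(reward(T, h, k) for k in decisions(T, h))
    # Backward induction for t = T-1, ..., 1.
    for t in range(T - 1, 0, -1):
        for h in states[t]:
            V[(t, h)] = max(
                reward(t, h, k)
                + sum(p * V[(t + 1, h2)] for h2, p in transitions(t, h, k))
                for k in decisions(t, h)
            )
    return V
```

The optimal policy π* would be recovered the same way, replacing the outer `max` with an `argmax` over decisions.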
Note that we have chosen not to subscript the optimal policy and MDP-value by time, because time is implicit in the length of the state.\n\n²Using histories as state in the induced MDP will make the state space very large. Often, there will be some function g for which g(h) is a sufficient statistic for all possible states h. We ignore this possibility here.\n\n³It is known that a deterministic optimal policy always exists in MDPs [6].\n\nLet R_{<t′}(h_t) denote the total payoff obtained prior to time t′ for a state h_t with t ≥ t′. The following property of MDPs is useful.\n\nLemma 1 (MDP value-consistency) For any time t < T, and for any policy π, E_{h_{t+1},...,h_T | h_t, π}{R_{<t′}(h_{t′}) + V^π(h_{t′})} = R_{<t}(h_t) + V^π(h_t) for all t′ ≥ t, where the expectation is taken with respect to a (correct) MDP model M_f, given information up to and including period t and policy π.\n\nWe will need to allow for incorrect models M_f, because agents may misreport their true types θ as untruthful types θ̂. Let h_t(θ̂, π) denote the state at time t produced by following policy π on agents with reported types θ̂. The payoff R(h_t, k_t) will always denote the payoff with respect to the reported valuations of agents; in particular, R_{<t′}(θ̂, π) denotes the total payoff prior to period t′ obtained by applying policy π to reported types θ̂.\n\nExample. (WiFi at Starbucks) [2] There is a finite set of WiFi (802.11b) channels to allocate to customers that arrive and leave a coffee house. A decision defines an allocation of a channel to a customer for some period of time. There is a known distribution on agent valuations and a known arrival and departure process. 
Each customer has her own value function, for example "I value any 10-minute connection in the next 30 minutes at $0.50." The decision space might include the ability to delay making a decision for a new customer before finally making a definite allocation decision. At this point the MDP reward would be the total value to the agent for this allocation into the future.\n\nThe following domain properties are required to formally state the economic properties of our online VCG mechanism. First, we need value-monotonicity, which will be sufficient to provide for voluntary participation in our mechanism. Let θ_i ∈ h_t denote that agent i with type θ_i arrived in some period t′ ≤ t in history h_t.\n\nDefinition 1 (value-monotonicity) MDP M_f satisfies value-monotonicity if, for all states h_t, the optimal MDP-value function satisfies V*(h_t(θ̂ ∪ θ_i, π*)) − V*(h_t(θ̂, π*)) ≥ 0, for agent i with type θ_i that arrives in period t.\n\nValue-monotonicity requires that the arrival of each additional agent has a positive effect on the expected total value from that state forward. In WiFi at Starbucks, this is satisfied because an agent with a low value can simply be ignored by the mechanism. 
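To make the agent type θ_i = (a_i, d_i, v_i) from Section 2 concrete, the customer valuation in the WiFi example ("I value any 10-minute connection in the next 30 minutes at $0.50") could be encoded roughly as follows. This is a hypothetical sketch, not from the paper; the representation of a decision sequence as the set of minutes allocated to this customer is an assumption.

```python
# Illustrative sketch (not from the paper) of a type θ_i = (a_i, d_i, v_i)
# for the WiFi example. Time is in minutes; a decision sequence is modeled
# as the frozen set of minutes in which this customer holds a channel.
from dataclasses import dataclass
from typing import Callable, FrozenSet

@dataclass(frozen=True)
class AgentType:
    arrival: int                               # a_i
    departure: int                             # d_i
    value: Callable[[FrozenSet[int]], float]   # v_i(k)

def ten_minute_connection(arrival: int) -> Callable[[FrozenSet[int]], float]:
    """'I value any 10-minute connection in the next 30 minutes at $0.50.'"""
    def v(allocated: FrozenSet[int]) -> float:
        window = set(range(arrival, arrival + 30))
        # Worth $0.50 if 10 consecutive allocated minutes fall in the window.
        for start in range(arrival, arrival + 21):
            if all(m in allocated and m in window
                   for m in range(start, start + 10)):
                return 0.50
        return 0.0
    return v
```

Note that v_i here depends only on the decisions in [a_i, d_i], matching the paper's assumption that an agent has no value for decisions outside its arrival-departure interval.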
It may fail in other problems, for instance in a physical domain where a newly arrived robot blocks the progress of other robots.\n\nSecond, we need no-positive-externalities, which will be sufficient for our mechanisms to run without payment deficits to the center.\n\nDefinition 2 (no-positive-externalities) MDP M_f satisfies no-positive-externalities if, for all states h_t, the optimal MDP-value function satisfies V*(h_t(θ̂ ∪ θ_i, π*)) − v_i(π*(h_t(θ̂ ∪ θ_i, π*)), θ_i) ≤ V*(h_t(θ̂, π*)), for agent i with type θ_i that arrives in period t.\n\nNo-positive-externalities requires that the arrival of each additional agent can only make the other agents worse off in expectation. This holds in WiFi at Starbucks, because a new agent can take resources from other agents; it does not hold in general, for instance when agents are both providers and consumers of resources, or when multiple agents are needed to make progress.\n\n3 The Delayed VCG Mechanism\n\nIn this section, we define the delayed VCG mechanism, which was introduced in Friedman and Parkes [2]. The mechanism implements a sequence of decisions based on agent reports but delays final payments until the final period T. We prove that the delayed VCG mechanism brings truth-revelation into a Bayes-Nash equilibrium in combination with an optimal MDP policy.\n\nThe delayed VCG mechanism is a direct-revelation online mechanism (DRM). The strategy space restricts an agent to making a single claim about its type. Formally, an online direct-revelation mechanism M = (Θ, π, p) defines a feasible type space Θ, along with a decision policy π = (π_1, ..., π_T), with π_t : H_t → K_t, and a payment rule p = (p_1, ..., p_T), with p_t : H_t → 
ℝ^N, such that p_{t,i}(h_t) denotes the payment to agent i in period t given state h_t.\n\nDefinition 3 (delayed VCG mechanism) Given history h ∈ H, mechanism M_Dvcg = (Θ, π, p^Dvcg) implements decisions k_t = π(h_t) and computes payment\n\np_i^Dvcg(θ̂, π) = R^i_{≤T}(θ̂, π) − [ R_{≤T}(θ̂, π) − R_{≤T}(θ̂_{−i}, π) ]   (1)\n\nto agent i at the end of the final period, where R_{≤T}(θ̂_{−i}, π) denotes the total reported payoff for the optimal policy in the system without agent i.\n\nAn agent's payment is discounted from its reported value for the outcome by a term equal to the total (reported) marginal value generated by its presence. Consider agent i, with type θ_i; let θ_{<i} denote the types of agents that arrive before agent i, and let θ_{>i} denote a random variable (distributed according to f(θ)) for the agents that arrive after agent i.\n\nDefinition 4 (Bayesian-Nash incentive-compatible) Mechanism M_Dvcg is Bayesian-Nash incentive-compatible if and only if the policy π and payments satisfy\n\nE_{θ_{>i}}{v_i(π(θ_{<i}, θ_i, θ_{>i}), θ_i) − p_i^Dvcg(θ_{<i}, θ_i, θ_{>i}, π)} ≥ E_{θ_{>i}}{v_i(π(θ_{<i}, θ̂_i, θ_{>i}), θ_i) − p_i^Dvcg(θ_{<i}, θ̂_i, θ_{>i}, π)}   (BNIC)\n\nfor all types θ_{<i}, all types θ_i, and all θ̂_i ≠ θ_i.\n\nBayes-Nash IC states that truth-revelation is utility-maximizing in expectation, given common knowledge about the distribution f(θ) on agent valuations and arrivals, and when other agents are truthful. 
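The payment rule in Equation 1 can be sketched directly. This is a hedged sketch under assumed helper functions, not an implementation from the paper: `total_payoff(reports, policy)` stands for the realized total reported payoff R_{≤T}(θ̂, π) obtained by running the policy on the reported types, and `agent_payoff(i, reports, policy)` for agent i's share R^i_{≤T}.

```python
# Hedged sketch of the delayed VCG payment (Equation 1). The helpers are
# hypothetical stand-ins:
#   total_payoff(reports, policy)    -- total reported payoff R_{<=T}
#   agent_payoff(i, reports, policy) -- agent i's reported payoff R^i_{<=T}
# `reports` maps agent ids to reported types.

def delayed_vcg_payment(i, reports, policy, total_payoff, agent_payoff):
    """p_i = R^i_{<=T}(th, pi) - [R_{<=T}(th, pi) - R_{<=T}(th_{-i}, pi)]."""
    reports_without_i = {j: th for j, th in reports.items() if j != i}
    # Total (reported) marginal value generated by agent i's presence.
    marginal_value = (total_payoff(reports, policy)
                      - total_payoff(reports_without_i, policy))
    return agent_payoff(i, reports, policy) - marginal_value
```

If agent i imposes no externality (the other agents' payoff is unchanged by its presence), its marginal value equals its own payoff and the payment is zero; the payment grows with the value its presence displaces.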
Moreover, it implies immediate revelation, because the type includes information about an agent's arrival period.\n\nTheorem 1 A delayed VCG mechanism, (Θ, π*, p^Dvcg), based on an optimal policy π* for a correct MDP model defined for a decision space that includes stalling, is Bayes-Nash incentive-compatible.\n\nProof. Assume without loss of generality that the other agents are reporting truthfully. Consider some agent i, with type θ_i, and suppose agents θ_{<i} have already arrived. Now, the expected utility to agent i when it reports type θ̂_i, substituting for the payment term p_i^Dvcg, is E_{θ_{>i}}{v_i(π*(θ_{<i}, θ̂_i, θ_{>i}), θ_i) + Σ_{j≠i} R^j_{≤T}(θ_{<i}, θ̂_i, θ_{>i}, π*) − R_{≤T}(θ_{<i}, θ_{>i}, π*)}. We can ignore the final term because it does not depend on the choice of θ̂_i at all. Let τ denote the arrival period a_i of agent i, with state h_τ including agent types θ_{<i}, decisions up to and including period τ−1, and the reported type of agent i if it makes a report in period a_i. Ignoring R_{<τ}(h_τ), which is the total payoff already received by agents j ≠ i in periods up to and including τ−1, the remaining terms are equal to the expected value of the summed payoff obtained from state h_τ onwards under policy π*, E_{π*}{v_i(π*(h_τ), θ_i) + Σ_{j≠i} v_j(π*(h_τ), θ̂_j) + v_i(π*(h_{τ+1}), θ_i) + Σ_{j≠i} v_j(π*(h_{τ+1}), θ̂_j) + ... + v_i(π*(h_T), θ_i) + Σ_{j≠i} v_j(π*(h_T), θ̂_j)}, defined with respect to the true type of agent i and the reported types of agents j ≠ i. 
This is the MDP-value for policy π* in state h_τ, E_{π*}{R(h_τ, π*(h_τ)) + R(h_{τ+1}, π*(h_{τ+1})) + ... + R(h_T, π*(h_T))}, because agents j ≠ i are assumed to report their true types in equilibrium. We have a contradiction with the optimality of policy π*: if there is some type θ̂_i ≠ θ_i that agent i can report to improve the MDP-value of policy π*, given types θ_{<i}, then we can construct a new policy π′ that is better than policy π*; policy π′ is identical to π* in all states except h_τ, where it implements the decision defined by π* in the state with type θ_i replaced by type θ̂_i. The new policy π′ lies in the space of feasible policies because the decision space includes stalling and can mimic the effect of any manipulation in which agent i reports a later arrival time. □\n\nThe effect of the first term in the discount in Equation 1 is to align the agent's incentives with the system-wide objective of maximizing the total value across agents. We do not have a stronger equilibrium concept than Bayes-Nash because, if other agents are not truthful, the mechanism's model will be incorrect and its policy suboptimal. This leaves space for useful manipulation. The following corollary captures the requirement that the MDP's decision space must allow for stalling, i.e., it must include the option to delay making a decision that will determine the value of agent i until some period after the agent's arrival. 
Say an agent has patience if d_i > a_i.\n\nCorollary 2 A delayed VCG mechanism cannot be Bayes-Nash incentive-compatible if agents have any patience and the expected value of its policy can be improved by stalling a decision.\n\nIf the policy can be improved through stalling, then an agent can improve its expected utility by delaying its reported arrival to correct for this and make the policy stall. First, the delayed VCG mechanism is ex ante efficient, because it implements the policy that maximizes the expected total sequential value across all agents. Second, it is interim individual-rational as long as the MDP satisfies the value-monotonicity property: the expected utility to agent i in equilibrium is E_{θ_{>i}}{R_{≤T}(θ_{<i}, θ_i, θ_{>i}, π*) − R_{≤T}(θ_{<i}, θ_{>i}, π*)}, which is non-negative exactly when value-monotonicity holds. Third, the mechanism is ex ante budget-balanced as long as the MDP satisfies the no-positive-externalities property: the expected payment by agent i, with type θ_i, to the mechanism is E_{θ_{>i}}{R_{≤T}(θ_{<i}, θ_{>i}, π*) − (R_{≤T}(θ_{<i}, θ_i, θ_{>i}, π*) − R^i_{≤T}(θ_{<i}, θ_i, θ_{>i}, π*))}, which is non-negative exactly when the no-positive-externalities condition holds.\n\n4 The Online VCG Mechanism\n\nWe now introduce the online VCG mechanism, in which payments are determined as soon as all decisions are made that affect an agent's value. 
Not only is this a better fit with the practical needs of online mechanisms, but the online VCG mechanism also enables better computational properties than the delayed mechanism.\n\nLet V^π(h_t(θ̂_{−i}, π)) denote the MDP-value of policy π in the system without agent i, given reports θ̂_{−i} from other agents, and evaluated in some period t.\n\nDefinition 5 (online VCG mechanism) Given history h ∈ H, mechanism M_vcg = (Θ, π, p^vcg) implements decisions k_t = π(h_t) and computes payment\n\np_i^vcg(θ̂, π) = R^i_{≤m_i}(θ̂, π) − [ V^π(h_{â_i}(θ̂, π)) − V^π(h_{â_i}(θ̂_{−i}, π)) ]   (2)\n\nto agent i in its commitment period m_i, with zero payments in all other periods.\n\nNote that the payment is computed in the commitment period for an agent, which is some period before the agent's departure at which its value is fully determined. In WiFi at Starbucks, this can be the period in which the mechanism commits to a particular allocation for an agent.\n\nAgent i's payment in the online VCG mechanism is equal to its reported value from the sequence of decisions made by the policy, discounted by the expected marginal value that agent i will contribute to the system (as determined by the MDP-value function for the policy in its arrival period). The discount is defined as the expected forward-looking effect the agent will have on the value of the system. Establishing incentive-compatibility requires some care because the payment now depends on the stated arrival time of an agent. 
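Equation 2 can be sketched analogously to the delayed payment, with the realized marginal payoff replaced by the expected marginal value at arrival. All helpers here are hypothetical stand-ins, not interfaces from the paper: `mdp_value(state)` plays the role of V^π, `arrival_state(reports, policy)` returns the state h_{â_i} in agent i's reported arrival period, and `agent_payoff(i, reports, policy)` returns R^i_{≤m_i}, agent i's reported payoff through its commitment period.

```python
# Hedged sketch of the online VCG payment (Equation 2), with hypothetical
# helpers:
#   agent_payoff(i, reports, policy)  -- R^i_{<=m_i}, reported payoff to
#                                        agent i through its commitment period
#   mdp_value(state)                  -- MDP-value V^pi of a state
#   arrival_state(reports, policy)    -- state in agent i's arrival period

def online_vcg_payment(i, reports, policy,
                       agent_payoff, mdp_value, arrival_state):
    """p_i = R^i_{<=m_i} - [V(h_{a_i}(th)) - V(h_{a_i}(th_{-i}))]."""
    reports_without_i = {j: th for j, th in reports.items() if j != i}
    # Expected forward-looking effect of agent i, evaluated at its arrival.
    expected_marginal = (mdp_value(arrival_state(reports, policy))
                         - mdp_value(arrival_state(reports_without_i, policy)))
    return agent_payoff(i, reports, policy) - expected_marginal
```

The design choice mirrors the text: only the MDP-value of the no-agent-i arrival state is needed, rather than a full counterfactual simulation of the policy without agent i as in the delayed mechanism.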
We must show that there is no systematic dependence that an agent can use to its advantage.\n\nTheorem 3 An online VCG mechanism, (Θ, π*, p^vcg), based on an optimal policy π* for a correct MDP model defined for a decision space that includes stalling, is Bayes-Nash incentive-compatible.\n\nProof. We establish this result by demonstrating that the expected value of the payment by agent i in the online VCG mechanism is the same as in the delayed VCG mechanism, when other agents report their true types and for any reported type of agent i. This proves incentive-compatibility, because the policy in the online VCG mechanism is exactly that of the delayed VCG mechanism (and so an agent's value from decisions is the same), and with identical expected payments the equilibrium follows from the truthful equilibrium of the delayed mechanism. The first term in the payment (see Equation 2) is R^i_{≤m_i}(θ̂_i, θ_{−i}, π*) and has the same value as the first term, R^i_{≤T}(θ̂_i, θ_{−i}, π*), in the payment in the delayed mechanism (see Equation 1). Now, consider the discount term in Equation 2, and rewrite it as:\n\nV*(h_{â_i}(θ̂_i, θ_{−i}, π*)) + R_{<â_i}(θ_{−i}, π*) − V*(h_{â_i}(θ_{−i}, π*)) − R_{<â_i}(θ_{−i}, π*)   (3)\n\nThe expected value of the left-hand pair of terms in Equation 3 is equal to the expected value of V*(h_{â_i}(θ̂_i, θ_{−i}, π*)) + R_{<â_i}(θ̂_i, θ_{−i}, π*), because agent i's announced type has no effect on the reward before its arrival. 
Applying Lemma 1, the expected value of these terms is constant and equal to the expected value of V*(h_{t′}(θ̂_i, θ_{−i}, π*)) + R_{<t′}(θ̂_i, θ_{−i}, π*) for all t′ ≥ a_i (with the expectation taken with respect to the history h_{a_i} available to agent i in its true arrival period). Moreover, taking t′ to be the final period T, this is also equal to the expected value of R_{≤T}(θ̂_i, θ_{−i}, π*), which is the expected value of the first term of the discount in the payment in the delayed VCG mechanism. Similarly, the (negated) expected value of the right-hand pair of terms in Equation 3 is constant, and equals the expected value of V*(h_{t′}(θ_{−i}, π*)) + R_{<t′}(θ_{−i}, π*) for all t′ ≥ a_i. Again, taking t′ to be the final period T, this is also equal to the expected value of R_{≤T}(θ_{−i}, π*), which is the expected value of the second term of the discount in the payment in the delayed VCG mechanism. □\n\nWe have demonstrated that although an agent can systematically reduce the expected value of each of the first and second terms in the discount in its payment (Equation 2) by delaying its arrival, these effects exactly cancel each other out.\n\nNote that it also remains important for incentive-compatibility of the online VCG mechanism that the policy allows stalling.\n\nThe online VCG mechanism shares the properties of allocative efficiency and budget-balance with the delayed VCG mechanism (under the same conditions). The online VCG mechanism is ex post individual-rational, so that an agent's expected utility is always non-negative: a slightly stronger condition than for the delayed VCG mechanism. 
The expected utility to agent i is V*(h_{a_i}) − V*(h_{a_i} ∖ i), which is non-negative because of the value-monotonicity property.\n\nThe online VCG mechanism also suggests the possibility of new computational speed-ups. The payment to an agent only requires computing the optimal MDP-value without the agent in the state in which it arrives, while the delayed VCG payment requires computing the sequence of decisions that the optimal policy would have made in the counterfactual world without the presence of each agent.\n\n5 Discussion\n\nWe described a direct-revelation mechanism for a general sequential decision-making setting with uncertainty. In the Bayes-Nash equilibrium each agent truthfully reveals its private type information, immediately upon arrival. The mechanism induces an MDP, and implements the sequence of decisions that maximizes the expected total value across all agents. There are two important directions in which to take this preliminary work. First, we must deal with the fact that for most real applications the MDP that must be solved to compute the decision and payment policies will be too big to solve exactly. We will explore methods for solving large-scale MDPs approximately, and consider the consequences for incentive-compatibility. Second, we must deal with the fact that the mechanism will often have at best an incomplete and inaccurate knowledge of the distributions on agent types. We will explore the interaction between models of learning and incentives, and consider the problem of adaptive online mechanisms.\n\nAcknowledgments\n\nThis work is supported in part by NSF grant IIS-0238147.\n\nReferences\n\n[1] Matthew O. Jackson. Mechanism theory. In The Encyclopedia of Life Support Systems. EOLSS Publishers, 2000.\n\n[2] Eric Friedman and David C. Parkes. Pricing WiFi at Starbucks: Issues in online mechanism design. Short paper, in Fourth ACM Conf. 
on Electronic Commerce (EC'03), pages 240–241, 2003.\n\n[3] Ron Lavi and Noam Nisan. Competitive analysis of incentive compatible on-line auctions. In Proc. 2nd ACM Conf. on Electronic Commerce (EC'00), 2000.\n\n[4] Avrim Blum, Vijay Kumar, Atri Rudra, and Felix Wu. Online learning in online auctions. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms, 2003.\n\n[5] Baruch Awerbuch, Yossi Azar, and Adam Meyerson. Reducing truth-telling online mechanisms to online optimization. In Proc. ACM Symposium on Theory of Computing (STOC'03), 2003.\n\n[6] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York, 1994.\n", "award": [], "sourceid": 2432, "authors": [{"given_name": "David", "family_name": "Parkes", "institution": null}, {"given_name": "Satinder", "family_name": "Singh", "institution": null}]}