{"title": "Approximately Efficient Online Mechanism Design", "book": "Advances in Neural Information Processing Systems", "page_first": 1049, "page_last": 1056, "abstract": null, "full_text": " Approximately Efficient Online Mechanism\n Design\n\n\n\n David C. Parkes Satinder Singh Dimah Yanovsky\n DEAS, Maxwell-Dworkin Comp. Science and Engin. Harvard College\n Harvard University University of Michigan yanovsky@fas.harvard.edu\n parkes@eecs.harvard.edu baveja@umich.edu\n\n\n\n Abstract\n\n\n Online mechanism design (OMD) addresses the problem of sequential\n decision making in a stochastic environment with multiple self-interested\n agents. The goal in OMD is to make value-maximizing decisions despite\n this self-interest. In previous work we presented a Markov decision pro-\n cess (MDP)-based approach to OMD in large-scale problem domains.\n In practice the underlying MDP needed to solve OMD is too large and\n hence the mechanism must consider approximations. This raises the pos-\n sibility that agents may be able to exploit the approximation for selfish\n gain. We adopt sparse-sampling-based MDP algorithms to implement -\n efficient policies, and retain truth-revelation as an approximate Bayesian-\n Nash equilibrium. Our approach is empirically illustrated in the context\n of the dynamic allocation of WiFi connectivity to users in a coffeehouse.\n\n\n1 Introduction\n\nMechanism design (MD) is concerned with the problem of providing incentives to im-\nplement desired system-wide outcomes in systems with multiple self-interested agents.\nAgents are assumed to have private information, for example about their utility for differ-\nent outcomes and about their ability to implement different outcomes, and act to maximize\ntheir own utility. The MD approach to achieving multiagent coordination supposes the ex-\nistence of a center that can receive messages from agents and implement an outcome and\ncollect payments from agents. 
The goal of MD is to implement an outcome with desired system-wide properties in a game-theoretic equilibrium.

Classic mechanism design considers static systems in which all agents are present and a one-time decision is made about an outcome. Auctions, used in the context of resource-allocation problems, are a standard example. Online mechanism design [1] departs from this and allows agents to arrive and depart dynamically, requiring decisions to be made with uncertainty about the future. Thus, an online mechanism makes a sequence of decisions without the benefit of hindsight about the valuations of the agents yet to arrive. Without the issue of incentives, the online MD problem is a classic sequential decision problem.

In prior work [6], we showed that Markov decision processes (MDPs) can be used to define an online Vickrey-Clarke-Groves (VCG) mechanism [2] that makes truth-revelation by the agents (called incentive-compatibility) a Bayesian-Nash equilibrium [5] and implements a policy that maximizes the expected summed value of all agents. This online VCG model differs from the line of work in online auctions, introduced by Lavi and Nisan [4], in that it assumes that the center has a model, and it handles a general decision space and any decision horizon. Computing the payments and allocations in the online VCG mechanism involves solving the MDP that defines the underlying centralized (ignoring self-interest) decision making problem. For large systems, the MDPs that need to be solved exactly become large and thus computationally infeasible.

In this paper we consider the case where the underlying centralized MDPs are indeed too large and thus must be solved approximately, as will be the case in most real applications. Of course, there are several choices of methods for solving MDPs approximately. We show that the sparse-sampling algorithm due to Kearns et al. 
[3] is particularly well suited to online MD because it produces the needed local approximations to the optimal value and action efficiently. More challenging is that, regardless of our choice of method, the agents in the system can exploit their knowledge of the mechanism's approximation algorithm to try and "cheat" the mechanism to further their own selfish interests. Our main contribution is to demonstrate that our new approximate online VCG mechanism has truth-revelation by the agents as an ε-Bayesian-Nash equilibrium (BNE). This approximate equilibrium supposes that each agent is indifferent to within an expected utility of ε, and will play a truthful strategy in best-response to truthful strategies of other agents if no other strategy can improve its utility by more than ε. For any ε > 0, our online mechanism has a run-time that is independent of the number of states in the underlying MDP, provides an ε-BNE, and implements a policy with expected value within ε of the optimal policy's value.

Our approach is empirically illustrated in the context of the dynamic allocation of WiFi connectivity to users in a coffeehouse. We demonstrate the speed-up introduced with sparse-sampling (compared with policy calculation via value-iteration), as well as the economic value of adopting an MDP-based approach over a simple fixed-price approach.

2 Preliminaries

Here we formalize the multiagent sequential decision problem that defines the online mechanism design (OMD) problem. The approach is centralized. Each agent is asked to report its private information (for instance about its value and its capabilities) to a central planner or mechanism upon arrival. The mechanism implements a policy based on its view of the state of the world (as reported by agents). The policy defines actions in each state, and the assumption is that all agents acquiesce to the decisions of the planner. 
The mechanism also collects payments from agents, which can themselves depend on the reports of agents.

Online Mechanism Design. We consider a finite-horizon problem with a set T of time points and a sequence of decisions k = {k1, . . . , kT}, where kt ∈ Kt and Kt is the set of feasible decisions in period t. Agent i ∈ I arrives at time ai ∈ T, departs at time li ∈ T, and has value vi(k) ≥ 0 for a sequence of decisions k. By assumption, an agent has no value for decisions outside of interval [ai, li]. Agents also face payments, which can be collected after an agent's departure. Collectively, information θi = (ai, li, vi) defines the type of agent i, with θi ∈ Θ. Agent types are sampled i.i.d. from a probability distribution f(θ), assumed known to the agents and to the central mechanism. Multiple agents can arrive and depart at the same time. Agent i, with type θi, receives utility ui(k, p; θi) = vi(k; θi) - p, for decisions k and payment p. Agents are modeled as expected-utility maximizers.

Definition 1 (Online Mechanism Design) The OMD problem is to implement the sequence of decisions that maximizes the expected summed value across all agents in equilibrium, given self-interested agents with private information about valuations.

In economic terms, an optimal (value-maximizing) policy is the allocatively-efficient, or simply the efficient, policy. Note that nothing about the OMD model requires centralized execution of the joint plan. Rather, the agents themselves can have capabilities to perform actions and be asked to perform particular actions by the mechanism. The agents can also have private information about the actions that they are able to perform.

Using MDPs to Solve Online Mechanism Design. 
In the MDP-based approach to solving the OMD problem the sequential decision problem is formalized as an MDP, with the state at any time encapsulating both the current agent population and constraints on current decisions as reflected by decisions made previously. The reward function in the MDP is simply defined to correspond with the total reported value of all agents for all sequences of decisions.

Given types θi ∼ f(θ) we define an MDP, Mf, as follows. Define the state of the MDP at time t as the history-vector ht = (θ1, . . . , θt; k1, . . . , kt-1), to include the reported types up to and including period t and the decisions made up to and including period t - 1.1 The set of all possible states at time t is denoted Ht. The set of all possible states across all time is H = ∪_{t=1..T+1} Ht. The set of decisions available in state ht is Kt(ht). Given a decision kt ∈ Kt(ht) in state ht, there is some probability distribution Prob(ht+1|ht, kt) over possible next states ht+1. In the setting of OMD, this probability distribution is determined by the uncertainty on new agent arrivals (as represented within f(θ)), together with departures and the impact of decision kt on state.

The payoff function for the induced MDP is defined to reflect the goal of maximizing the total expected reward across all agents. In particular, payoff Ri(ht, kt) = vi(k^t; θi) - vi(k^{t-1}; θi), where k^t = (k1, . . . , kt), becomes available from agent i upon taking action kt in state ht. With this, we have ∑_{t=1..t'} Ri(ht, kt) = vi(k^{t'}; θi) for all periods t', to provide the required correspondence with agent valuations. Let R(ht, kt) = ∑_i Ri(ht, kt) denote the payoff obtained from all agents at time t. Given a (nonstationary) policy π = {π1, π2, . . . , πT}, where πt : Ht → Kt, an MDP defines an MDP-value function V^π as follows: V^π(ht) is the expected value of the summed payoff obtained from state ht onwards under policy π, i.e., V^π(ht) = E{R(ht, π(ht)) + R(ht+1, π(ht+1)) + · · · + R(hT, π(hT))}. 
An optimal policy π* is one that maximizes the MDP-value of every state in H.

The optimal MDP-value function V* can be computed by value-iteration, and is defined so that V*(h) = max_{k∈Kt(h)} [R(h, k) + ∑_{h'∈Ht+1} Prob(h'|h, k)V*(h')] for t = T-1, T-2, . . . , 1 and all h ∈ Ht, with V*(h ∈ HT) = max_{k∈KT(h)} R(h, k). Given the optimal MDP-value function, the optimal policy is derived as follows: for t < T, policy π*(h ∈ Ht) chooses action arg max_{k∈Kt(h)} [R(h, k) + ∑_{h'∈Ht+1} Prob(h'|h, k)V*(h')], and π*(h ∈ HT) = arg max_{k∈KT(h)} R(h, k). Let θ̂^{t'} denote reported types up to and including period t'. Let R_i^{t'}(θ̂^{t'}; π) denote the total reported reward to agent i up to and including period t'. The commitment period for agent i is defined as the first period, mi, for which, for all t' ≥ mi, R_i^{t'}(θ̂^{mi}; π) = R_i^{t'}((θ̂^{mi}, θ^{>mi}); π), for any types θ^{>mi} still to arrive. This is the earliest period in which agent i's total value is known with certainty.

Let h_{t'}(θ̂^{t'}; π) denote the state in period t' given reports θ̂^{t'}. Let θ̂^{t'}_{-i} = θ̂^{t'} \ θ̂i.

Definition 2 (Online VCG mechanism) Given history h ∈ H, mechanism Mvcg = (Θ; π, pvcg) implements policy π and collects payment,

    p^vcg_i(θ̂; π) = R_i^{mi}(θ̂^{mi}; π) - (V^π(h_{âi}(θ̂^{âi}; π)) - V^π(h_{âi}(θ̂^{âi}_{-i}; π)))    (1)

from agent i in some period t ≥ mi.

1Using histories as state will make the state space very large. Often, there will be some function g for which g(h) is a sufficient statistic for all possible states h. We ignore this possibility here.

Agent i's payment is equal to its reported value discounted by the expected marginal value that it will contribute to the system, as determined by the MDP-value function for the policy in its arrival period. 
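Read this way, the payment in Equation 1 reduces to a one-line computation once the relevant MDP-values are available. A minimal sketch (the function and argument names are ours, purely for illustration):

```python
def online_vcg_payment(reported_reward, value_with_agent, value_without_agent):
    """Online VCG payment (Equation 1): the agent's total reported reward
    minus its expected marginal contribution, i.e. the difference between
    the MDP-value of its arrival state with and without its report."""
    marginal_contribution = value_with_agent - value_without_agent
    return reported_reward - marginal_contribution
```

For example, an agent reporting a total reward of 10 whose report raises the arrival-state MDP-value from 24 to 30 pays 10 - (30 - 24) = 4.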
The incentive-compatibility of the Online VCG mechanism requires that the MDP satisfies stalling, which requires that the expected value from the optimal policy in every state in which an agent arrives is at least the expected value from following the optimal action in that state as though the agent had never arrived and then returning to the optimal policy. Clearly, property Kt(ht) ⊇ Kt(ht \ i) in any period t in which i has just arrived is sufficient for stalling. Stalling is satisfied whenever an agent's arrival does not force a change in action on a policy.

Theorem 1 (Parkes & Singh [6]) An online VCG mechanism, Mvcg = (Θ; π*, pvcg), based on an optimal policy π* for a correct MDP model that satisfies stalling is Bayesian-Nash incentive compatible and implements the optimal MDP policy.

3 Solving Very Large MDPs Approximately

From Equation 1, it can be seen that making outcome and payment decisions at any point in time in an online VCG mechanism does not require a global value function or a global policy. Unlike most methods for approximately solving MDPs, which compute global approximations, the sparse-sampling methods of Kearns et al. [3] compute approximate values and actions for a single state at a time. Furthermore, sparse-sampling methods provide approximation guarantees that will be important to establish equilibrium properties -- they can compute an ε-approximation to the optimal value and action in a given state in time independent of the size of the state space (though polynomial in 1/ε and exponential in the time horizon). Thus, sparse-sampling methods are particularly suited to approximating online VCG and we adopt them here.

The sparse-sampling algorithm uses the MDP model Mf as a generative model, i.e., as a black box from which a sample of the next-state and reward distributions for any given state-action pair can be obtained. 
Given a state s and an approximation parameter ε, it computes an ε-accurate estimate of the optimal value for s as follows. We make the parameterization on ε explicit by writing sparse-sampling(ε). The algorithm builds out a depth-T sampled tree in which each node is a state and each node's children are obtained by sampling each action in that state m times (where m is chosen to guarantee an ε-approximation to the optimal value of s), and each edge is labeled with the sample reward for that transition. The algorithm computes estimates of the optimal value for nodes in the tree working backwards from the leaves as follows. The leaf-nodes have zero value. The value of a node is the maximum over the values for all actions in that node. The value of an action in a node is the average, over the m outgoing edges for that action, of the sample reward on an edge plus the value of the child it leads to. The estimated optimal value of state s is the value of the root node of the tree. The estimated optimal action in state s is the action that leads to the largest value for the root node in the tree.

Lemma 1 (Adapted from Kearns, Mansour & Ng [3]) The sparse-sampling(ε) algorithm, given access to a generative model for any n-action MDP M, takes as input any state s ∈ S and any ε > 0, outputs an action, and satisfies the following two conditions:

(Running Time) The running time of the algorithm is O((nC)^T), where C = f(n, 1/ε, Rmax, T) and f is a polynomial function of the approximation parameter 1/ε, the number of actions n, the largest expected reward in a state Rmax, and the horizon T. 
In particular, the running time has no dependence on the number of states.

(Near-Optimality) The value function of the stochastic policy implemented by the sparse-sampling(ε) algorithm, denoted V^ss, satisfies |V*(s) - V^ss(s)| ≤ ε simultaneously for all states s ∈ S.

It is straightforward to derive the following corollary from the proof of Lemma 1 in [3].

Corollary 1 The value function computed by the sparse-sampling(ε) algorithm, denoted V̂^ss, is near-optimal in expectation, i.e., |V*(s) - E{V̂^ss(s)}| ≤ ε simultaneously for all states s ∈ S, where the expectation is over the randomness introduced by the sparse-sampling(ε) algorithm.

4 Approximately Efficient Online Mechanism Design

In this section, we define an approximate online VCG mechanism and consider the effect on incentives of substituting the sparse-sampling(ε) algorithm into the online VCG mechanism. We model agents as indifferent between decisions that differ by at most a utility of ε > 0, and consider an approximate Bayesian-Nash equilibrium. Let vi(θ; π) denote the final value to agent i after reports θ given policy π.

Definition 3 (approximate BNE) Mechanism Mvcg = (Θ, π, pvcg) is ε-Bayesian-Nash incentive compatible if

    E_{θ|θ^{t'}} {vi(θ; π) - p^vcg_i(θ; π)} + ε ≥ E_{θ|θ^{t'}} {vi((θ_{-i}, θ̂i); π) - p^vcg_i((θ_{-i}, θ̂i); π)}    (2)

where agent i with type θi arrives in period t', and with the expectation taken over future types given current reports θ^{t'}.

In particular, when truth-telling is an ε-BNE we say that the mechanism is ε-BNE incentive compatible, and no agent can improve its expected utility by more than ε > 0, for any type, as long as other agents are bidding truthfully. 
Equivalently, one can interpret an ε-BNE as an exact equilibrium for agents that face a computational cost of at least ε to compute the exact BNE.

Definition 4 (approximate mechanism) A sparse-sampling(ε) based approximate online VCG mechanism, Mvcg(ε) = (Θ; π̃, p̃vcg), uses the sparse-sampling(ε) algorithm to implement stochastic policy π̃ and collects payment

    p̃^vcg_i(θ̂; π̃) = R_i^{mi}(θ̂^{mi}; π̃) - (V̂^ss(h_{âi}(θ̂^{âi}; π̃)) - V̂^ss(h_{âi}(θ̂^{âi}_{-i}; π̃)))

from agent i in some period t ≥ mi, for commitment period mi.

Our proof of incentive-compatibility first demonstrates that an approximate delayed VCG mechanism [1, 6] is ε-BNE incentive compatible. With this, we demonstrate that the expected value of the payments in the approximate online VCG mechanism is within 3ε of the payments in the approximate delayed VCG mechanism. The delayed VCG mechanism makes the same decisions as the online VCG mechanism, except that payments are delayed until the final period and computed as:

    p^Dvcg_i(θ̂; π) = R_i^T(θ̂; π) - (R^T(θ̂; π) - R^T(θ̂_{-i}; π))    (3)

where the discount is computed ex post, once the effect of an agent on the system value is known. In an approximate delayed-VCG mechanism, the role of the sparse-sampling algorithm is to implement an approximate policy, as well as counterfactual policies for the worlds θ̂_{-i} without each agent i in turn. The total reported reward to agents j ≠ i over this counterfactual series of states is used to define the payment to agent i.

Lemma 2 Truthful bidding is an ε-Bayesian-Nash equilibrium in the sparse-sampling(ε) based approximate delayed-VCG mechanism.

Proof: Let π̃ denote the approximate policy computed by the sparse-sampling algorithm. Assume agents j ≠ i are truthful. 
Now, if agent i bids truthfully its expected utility is

    E_{θ|θ^{ai}} {vi(θ; π̃) + ∑_{j≠i} R_j^T(θ; π̃) - ∑_{j≠i} R_j^T(θ_{-i}; π̃)}    (4)

where the expectation is over both the types yet to be reported and the randomness introduced by the sparse-sampling(ε) algorithm. For any ε > 0, the sparse-sampling(ε) based approximate online VCG mechanism has ε-efficiency in a 4ε-BNE.

5 Empirical Evaluation of Approximate Online VCG

The WiFi Problem. The WiFi problem considers a fixed number of channels C with a random number of agents (max A) that can arrive per period. The time horizon is T = 50. Agents demand a single channel and arrive with per-unit value, distributed i.i.d. over V = {v1, . . . , vk}, and duration in the system, distributed i.i.d. over D = {d1, . . . , dl}. The value model requires that any allocation to agent i must be for contiguous periods, and be made while the agent is present (i.e., during periods [t, ai + di], for arrival ai and duration di). An agent's value for an allocation of duration x is vi·x, where vi is its per-unit value. Let dmax denote the maximal possible allocated duration. 
We define the following MDP components:

State space: We use the following compact, sufficient statistic of history: a resource schedule, a (weakly non-decreasing) vector of length dmax that counts the number of channels available in the current period and the next dmax - 1 periods given previous actions (C channels are available after this); an agent vector of size (k × l) that provides a count of the number of agents present but not allocated for each possible per-unit value and each possible duration (the duration is automatically decremented when an agent remains in the system for a period without receiving an allocation); and the time remaining until horizon T.

Action space: The policy can postpone an agent allocation, or allocate an agent to a channel for the remaining duration of the agent's time in the system if a channel is available and the remaining duration is not greater than dmax.

Payoff function: The reward at a time step is the summed value obtained from all agents for which an allocation is made in this time step. This is the total value such an agent will receive before it departs.

Transition probabilities: The change in the resource schedule, and in the agent vector as it relates to agents currently present, is deterministic. The random new additions to the agent vector at each step are unaffected by the actions and also independent of time.

Mechanisms. In the exact online VCG mechanism we compute an optimal policy, and optimal MDP values, offline using finite-horizon value iteration [7]. In the sparse-sampling(ε) approach, we define a sampling tree depth L (perhaps < T) and sample each state m times. This limited sampling depth places a lower-bound on the best possible approximation accuracy from the sparse-sampling algorithm. 
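The depth-L, width-m tree construction can be sketched as a recursion over a generative model. This is a sketch only, with names of our choosing: `generative_model(state, action)` is assumed to return a sampled `(next_state, reward)` pair, and `actions(state)` the feasible actions in a state:

```python
def sparse_sample_value(state, depth, m, actions, generative_model):
    """Estimate the optimal value of `state` with a depth-limited sampled
    lookahead tree: each action is sampled m times, an action's value is
    the average of (sample reward + estimated value of the sampled next
    state), and a node's value is the max over its actions. Leaves at
    depth 0 are assigned zero value."""
    if depth == 0:
        return 0.0
    best = float('-inf')
    for a in actions(state):
        total = 0.0
        for _ in range(m):
            next_state, reward = generative_model(state, a)
            total += reward + sparse_sample_value(next_state, depth - 1,
                                                  m, actions, generative_model)
        best = max(best, total / m)
    return best
```

The estimated optimal action at the root is the argmax over actions of the same inner computation; the running time is O((n·m)^L) and, as in Lemma 1, independent of the number of states.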
We also employ agent pruning, with the agent vector in the state representation pruned to remove dominated agents: consider an agent type with duration d and value v, and remove all but C - N agents of this type, where N is the number of agents that either have remaining duration ≤ d and value > v, or duration < d and value ≥ v. In computing payments we use factoring, and only determine VCG payments once for each type of agent to arrive. We compare performance with a simple fixed-price allocation scheme that, given a particular problem, computes off-line a fixed number of periods and a fixed price (agents are queued and offered the price at random as resources become available) that yields the maximum expected total value.

Results. In the default model, we set C = 5, A = 5, define the set of values V = {1, 2, 3}, define the set of durations D = {1, 2, 6}, with lookahead L = 4 and sampling width m = 6. All results are averaged over at least 10 instances, and experiments were performed on a 3GHz P4 with 512 MB RAM. Value and revenue is normalized by the total value demanded by all agents, i.e., the value with infinite capacity.2 Looking first at economic properties, Figure 1(A) shows the effect of varying the number of agents from 2 to 12, comparing the value and revenue between the approximate online VCG mechanism and the fixed price mechanism. Notice that the MDP method dominates the price-based scheme for value, with a notable performance improvement over fixed price when demand is neither very low (no contention) nor very high (lots of competition). Revenue is also generally better from the MDP-based mechanism than in the fixed price scheme. Fig. 1(B) shows the similar effect of varying the number of channels from 3 to 10.

Turning now to computational properties, Figure 1(C) illustrates the effectiveness of sparse-sampling, and also agent pruning, sampled over 100 instances. 
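The agent-pruning rule used here can be sketched as follows, treating each present-but-unallocated agent as a (remaining duration, per-unit value) pair; the function and variable names are our own, and the dominance test follows our reading of the rule above:

```python
def prune_agents(agents, num_channels):
    """Dominance-based pruning sketch: drop an agent (d, v) when at least
    `num_channels` other agents dominate it, i.e. have remaining duration
    no longer and value strictly higher, or strictly shorter duration and
    value at least as high -- such agents would always be served first."""
    kept = []
    for (d, v) in agents:
        dominators = sum(1 for (d2, v2) in agents
                         if (d2 <= d and v2 > v) or (d2 < d and v2 >= v))
        if dominators < num_channels:
            kept.append((d, v))
    return kept
```

For instance, with one channel and agents {(1, 3), (2, 3), (6, 1)}, only (1, 3) survives: it dominates (2, 3) (shorter duration, equal value), and both dominate (6, 1).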
The model is very small: A = 2, C = 2, D = {1, 2, 3}, V = {1, 2, 3} and L = 4, to allow a comparison with the compute time for an optimal policy. The sparse-sampling method is already running in less than 1% of the time for optimal value-iteration (right-hand axis), with an accuracy as high as 96% of the optimal. Pruning provides an incremental speed-up, and actually improves accuracy, presumably by making better use of each sample. Figure 1(D) shows that pruning is extremely useful computationally (in comparison with plain sparse-sampling), for the default model parameters and as the number of agents is increased from 2 to 12. Pruning is effective, removing around 50% of agents (summed across all states in the lookahead tree) at 5 agents.

Figure 1: (A) Value and Revenue vs. Number of Agents. (B) Value and Revenue vs. Number of Channels. (C) Effect of Sampling Width. (D) Pruning speed-up.

2This explains why the value appears to drop as we scale up the number of agents -- the total available value is increasing but supply remains fixed.

Acknowledgments. David Parkes was funded by NSF grant IIS-0238147. Satinder Singh was funded by NSF grant CCF 0432027 and by a grant from DARPA's IPTO program.

References

[1] Eric Friedman and David C. Parkes. Pricing WiFi at Starbucks: Issues in online mechanism design. In Fourth ACM Conf. 
on Electronic Commerce (EC'03), pages 240-241, 2003.

[2] Matthew O. Jackson. Mechanism theory. In The Encyclopedia of Life Support Systems. EOLSS Publishers, 2000.

[3] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In Proc. 16th Int. Joint Conf. on Artificial Intelligence, pages 1324-1331, 1999. To appear in Special Issue of Machine Learning.

[4] Ron Lavi and Noam Nisan. Competitive analysis of incentive compatible on-line auctions. In Proc. 2nd ACM Conf. on Electronic Commerce (EC-00), 2000.

[5] Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory. MIT Press, 1994.

[6] David C. Parkes and Satinder Singh. An MDP-based approach to Online Mechanism Design. In Proc. 17th Annual Conf. on Neural Information Processing Systems (NIPS'03), 2003.

[7] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York, 1994.
", "award": [], "sourceid": 2633, "authors": [{"given_name": "David", "family_name": "Parkes", "institution": null}, {"given_name": "Dimah", "family_name": "Yanovsky", "institution": null}, {"given_name": "Satinder", "family_name": "Singh", "institution": null}]}