Approximately Efficient Online Mechanism Design
Advances in Neural Information Processing Systems 17 (NIPS 2004)
David C. Parkes, Dimah Yanovsky, Satinder Singh
Abstract

Online mechanism design (OMD) addresses the problem of sequential decision making in a stochastic environment with multiple self-interested agents. The goal in OMD is to make value-maximizing decisions despite this self-interest. In previous work we presented a Markov decision process (MDP)-based approach to OMD in large-scale problem domains. In practice the underlying MDP needed to solve OMD is too large and hence the mechanism must consider approximations. This raises the possibility that agents may be able to exploit the approximation for selfish gain. We adopt sparse-sampling-based MDP algorithms to implement ε-efficient policies, and retain truth-revelation as an approximate Bayesian-Nash equilibrium. Our approach is empirically illustrated in the context of the dynamic allocation of WiFi connectivity to users in a coffeehouse.
1 Introduction
Mechanism design (MD) is concerned with the problem of providing incentives to implement desired system-wide outcomes in systems with multiple self-interested agents. Agents are assumed to have private information, for example about their utility for different outcomes and about their ability to implement different outcomes, and act to maximize their own utility. The MD approach to achieving multiagent coordination supposes the existence of a center that can receive messages from agents and implement an outcome and collect payments from agents. The goal of MD is to implement an outcome with desired system-wide properties in a game-theoretic equilibrium.
Classic mechanism design considers static systems in which all agents are present and a one-time decision is made about an outcome. Auctions, used in the context of resource-allocation problems, are a standard example. Online mechanism design [1] departs from this and allows agents to arrive and depart dynamically, requiring decisions to be made with uncertainty about the future. Thus, an online mechanism makes a sequence of decisions without the benefit of hindsight about the valuations of the agents yet to arrive. Without the issue of incentives, the online MD problem is a classic sequential decision problem.
In prior work [6], we showed that Markov decision processes (MDPs) can be used to define an online Vickrey-Clarke-Groves (VCG) mechanism [2] that makes truth-revelation by the agents (called incentive-compatibility) a Bayesian-Nash equilibrium [5] and implements a policy that maximizes the expected summed value of all agents. This online VCG model
differs from the line of work in online auctions, introduced by Lavi and Nisan [4], in that it assumes that the center has a model, and it handles a general decision space and any decision horizon. Computing the payments and allocations in the online VCG mechanism involves solving the MDP that defines the underlying centralized (ignoring self-interest) decision making problem. For large systems, the MDPs that need to be solved exactly become large and thus computationally infeasible.
In this paper we consider the case where the underlying centralized MDPs are indeed too large and thus must be solved approximately, as will be the case in most real applications. Of course, there are several choices of methods for solving MDPs approximately. We show that the sparse-sampling algorithm due to Kearns et al. [3] is particularly well suited to online MD because it produces the needed local approximations to the optimal value and action efficiently. More challenging is that, regardless of our choice, the agents in the system can exploit their knowledge of the mechanism's approximation algorithm to try and "cheat" the mechanism to further their own selfish interests. Our main contribution is to demonstrate that our new approximate online VCG mechanism has truth-revelation by the agents as an ε-Bayesian-Nash equilibrium (BNE). This approximate equilibrium supposes that each agent is indifferent to within an expected utility of ε, and will play a truthful strategy in best-response to truthful strategies of other agents if no other strategy can improve its utility by more than ε. For any ε, our online mechanism has a run-time that is independent of the number of states in the underlying MDP, provides an ε-BNE, and implements a policy with expected value within ε of the optimal policy's value.
Our approach is empirically illustrated in the context of the dynamic allocation of WiFi connectivity to users in a coffeehouse. We demonstrate the speed-up introduced with sparse-sampling (compared with policy calculation via value-iteration), as well as the economic value of adopting an MDP-based approach over a simple fixed-price approach.
2 Preliminaries
Here we formalize the multiagent sequential decision problem that defines the online mechanism design (OMD) problem. The approach is centralized. Each agent is asked to report its private information (for instance about its value and its capabilities) to a central planner or mechanism upon arrival. The mechanism implements a policy based on its view of the state of the world (as reported by agents). The policy defines actions in each state, and the assumption is that all agents acquiesce to the decisions of the planner. The mechanism also collects payments from agents, which can themselves depend on the reports of agents.
Online Mechanism Design. We consider a finite-horizon problem with a set T of time points and a sequence of decisions k = {k^1, . . . , k^T}, where k^t ∈ K^t and K^t is the set of feasible decisions in period t. Agent i ∈ I arrives at time a_i ∈ T, departs at time l_i ∈ T, and has value v_i(k) ≥ 0 for a sequence of decisions k. By assumption, an agent has no value for decisions outside of the interval [a_i, l_i]. Agents also face payments, which can be collected after an agent's departure. Collectively, the information θ_i = (a_i, l_i, v_i) defines the type of agent i, with θ_i ∈ Θ. Agent types are sampled i.i.d. from a probability distribution f(θ), assumed known to the agents and to the central mechanism. Multiple agents can arrive and depart at the same time. Agent i, with type θ_i, receives utility u_i(k, p; θ_i) = v_i(k; θ_i) − p, for decisions k and payment p. Agents are modeled as expected-utility maximizers.
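As a concrete illustration of the type model (a minimal sketch only, not from the paper; the names AgentType and quasi_linear_utility and the toy WiFi valuation are ours), an agent's type θ_i = (a_i, l_i, v_i) and quasi-linear utility u_i(k, p; θ_i) = v_i(k; θ_i) − p could be represented as:

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

Decision = Any  # a single-period decision k^t

@dataclass(frozen=True)
class AgentType:
    """Type theta_i = (a_i, l_i, v_i): arrival period, departure period, valuation."""
    arrival: int                                   # a_i
    departure: int                                 # l_i
    value: Callable[[Sequence[Decision]], float]   # v_i(k) >= 0, depends only on decisions in [a_i, l_i]

def quasi_linear_utility(theta: AgentType, decisions: Sequence[Decision], payment: float) -> float:
    """u_i(k, p; theta_i) = v_i(k; theta_i) - p."""
    return theta.value(decisions) - payment

# Toy example: a WiFi user who gets value 1 if allocated in any period of [2, 5].
user = AgentType(arrival=2, departure=5,
                 value=lambda k: 1.0 if any(d == "alloc_user" for d in k[2:6]) else 0.0)
print(quasi_linear_utility(user, ["idle", "idle", "alloc_user", "idle", "idle", "idle"], payment=0.4))  # 0.6
```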
Definition 1 (Online Mechanism Design) The OMD problem is to implement the sequence of decisions that maximizes the expected summed value across all agents in equilibrium, given self-interested agents with private information about valuations.
In economic terms, an optimal (value-maximizing) policy is the allocatively-efficient, or simply the efficient policy. Note that nothing about the OMD models requires centralized
execution of the joint plan. Rather, the agents themselves can have capabilities to perform actions and be asked to perform particular actions by the mechanism. The agents can also have private information about the actions that they are able to perform.
Using MDPs to Solve Online Mechanism Design. In the MDP-based approach to solving the OMD problem, the sequential decision problem is formalized as an MDP with the state at any time encapsulating both the current agent population and constraints on current decisions as reflected by decisions made previously. The reward function in the MDP is simply defined to correspond with the total reported value of all agents for all sequences of decisions.
Given types θ_i ∼ f(θ) we define an MDP, M_f, as follows. Define the state of the MDP at time t as the history vector h^t = (θ^1, . . . , θ^t; k^1, . . . , k^{t−1}), which includes the reported types up to and including period t and the decisions made up to and including period t − 1.¹ The set of all possible states at time t is denoted H^t. The set of all possible states across all time is H = ∪_{t=1}^{T+1} H^t. The set of decisions available in state h^t is K^t(h^t). Given a decision k^t ∈ K^t(h^t) in state h^t, there is some probability distribution Prob(h^{t+1} | h^t, k^t) over possible next states h^{t+1}. In the setting of OMD, this probability distribution is determined by the uncertainty over new agent arrivals (as represented within f(θ)), together with departures and the impact of decision k^t on the state.
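To give a feel for the history-state construction, the following sketch (our own simplification; the arrival model and state fields are hypothetical, not the paper's WiFi domain) carries h^t as the tuple of reports and past decisions, and samples h^{t+1} from a generative model that draws fresh arrivals from a stand-in for f(θ):

```python
import random
from dataclasses import dataclass
from typing import Any, Tuple

@dataclass(frozen=True)
class HistoryState:
    """h^t = (theta^1..theta^t ; k^1..k^{t-1}): reports so far and decisions already taken."""
    t: int
    reports: Tuple[Any, ...]    # reported types up to and including period t
    decisions: Tuple[Any, ...]  # decisions up to and including period t-1

def sample_arrivals(t: int, arrival_prob: float = 0.3) -> Tuple[Any, ...]:
    """Hypothetical stand-in for drawing new types from f(theta) for period t."""
    if random.random() < arrival_prob:
        return (("agent", t, t + random.randint(0, 3)),)   # (tag, a_i, l_i)
    return ()

def transition(h: HistoryState, k_t: Any) -> HistoryState:
    """Sample h^{t+1} ~ Prob(. | h^t, k^t): append decision k^t and any newly arrived reports."""
    return HistoryState(t=h.t + 1,
                        reports=h.reports + sample_arrivals(h.t + 1),
                        decisions=h.decisions + (k_t,))

h1 = HistoryState(t=1, reports=sample_arrivals(1), decisions=())
print(transition(h1, "idle"))
```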
The payoff function for the induced MDP is defined to reflect the goal of maximizing the total expected reward across all agents. In particular, payoff R_i(h^t, k^t) = v_i(k^t; θ_i) − v_i(k^{t−1}; θ_i) becomes available from agent i upon taking action k^t in state h^t. With this, we have Σ_{t=1}^{t'} R_i(h^t, k^t) = v_i(k^{t'}; θ_i) for all periods t', to provide the required correspondence with agent valuations. Let R(h^t, k^t) = Σ_i R_i(h^t, k^t) denote the payoff obtained from all agents at time t. Given a (nonstationary) policy π = {π_1, π_2, . . . , π_T}, where π_t : H^t → K^t, an MDP defines an MDP-value function V^π as follows: V^π(h^t) is the expected value of the summed payoff obtained from state h^t onwards under policy π, i.e., V^π(h^t) = E{R(h^t, π(h^t)) + R(h^{t+1}, π(h^{t+1})) + · · · + R(h^T, π(h^T))}. An optimal policy π* is one that maximizes the MDP-value of every state in H.
The optimal MDP-value function V* can be computed by value iteration, and is defined so that V*(h ∈ H^t) = max_{k ∈ K^t(h)} [R(h, k) + Σ_{h' ∈ H^{t+1}} Prob(h' | h, k) V*(h')] for t = T − 1, T − 2, . . . , 1 and all h ∈ H^t, with V*(h ∈ H^T) = max_{k ∈ K^T(h)} R(h, k). Given the optimal MDP-value function, the optimal policy π* is derived as follows: for t < T, policy π*(h ∈ H^t) chooses action arg max_{k ∈ K^t(h)} [R(h, k) + Σ_{h' ∈ H^{t+1}} Prob(h' | h, k) V*(h')], and π*(h ∈ H^T) = arg max_{k ∈ K^T(h)} R(h, k).

Let θ̂^{t'} denote the reported types up to and including period t'. Let R_i^{t'}(θ̂^{t'}; π) denote the total reported reward to agent i up to and including period t'. The commitment period for agent i is defined as the first period, m_i, for which, for all t ≥ m_i, R_i^{m_i}(θ̂^{m_i}; π) = R_i^{t}(θ̂^{m_i} ∪ θ̂^{>m_i}; π) for any types θ̂^{>m_i} still to arrive. This is the earliest period in which agent i's total value is known with certainty.
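Returning to the value-iteration recurrence above: when the MDP is small enough to enumerate explicitly, it can be computed directly by backward induction. The following toy sketch (made-up states, rewards and transitions, not the coffeehouse domain) computes V* and reads off π* exactly as in the recurrence:

```python
from typing import Dict

# Hypothetical explicit finite-horizon MDP given as dictionaries:
# states per period, feasible decisions K^t(h), rewards R(h, k), transitions Prob(h' | h, k).
T = 3
states = {1: ["h1"], 2: ["h2a", "h2b"], 3: ["h3a", "h3b"]}
actions = {"h1": ["x", "y"], "h2a": ["x"], "h2b": ["x", "y"], "h3a": ["x"], "h3b": ["x"]}
reward = {("h1", "x"): 0.0, ("h1", "y"): 1.0,
          ("h2a", "x"): 2.0, ("h2b", "x"): 0.0, ("h2b", "y"): 1.0,
          ("h3a", "x"): 1.0, ("h3b", "x"): 0.0}
prob = {("h1", "x"): {"h2a": 0.5, "h2b": 0.5}, ("h1", "y"): {"h2b": 1.0},
        ("h2a", "x"): {"h3a": 1.0}, ("h2b", "x"): {"h3a": 0.5, "h3b": 0.5},
        ("h2b", "y"): {"h3b": 1.0}}

V: Dict[str, float] = {}
policy: Dict[str, str] = {}

# Terminal period: V*(h in H^T) = max_k R(h, k).
for h in states[T]:
    policy[h], V[h] = max(((k, reward[h, k]) for k in actions[h]), key=lambda kv: kv[1])

# Backward induction for t = T-1, ..., 1:
# V*(h) = max_k [ R(h, k) + sum_{h'} Prob(h' | h, k) V*(h') ].
for t in range(T - 1, 0, -1):
    for h in states[t]:
        q = {k: reward[h, k] + sum(p * V[h2] for h2, p in prob[h, k].items())
             for k in actions[h]}
        policy[h] = max(q, key=q.get)
        V[h] = q[policy[h]]

print(V["h1"], policy["h1"])  # optimal value and first action from the initial state
```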
Let h^t(θ̂^t; π) denote the state in period t given reports θ̂^t. Let θ̂^t_{−i} = θ̂^t \ θ̂_i.
Definition 2 (Online VCG mechanism) Given history h ∈ H, mechanism M_vcg = (Θ; π, p^vcg) implements policy π and collects payment,
p^vcg_i(θ̂; π) = R_i^{m_i}(θ̂^{m_i}; π) − [V^π(h^{â_i}(θ̂^{â_i}; π)) − V^π(h^{â_i}(θ̂^{â_i}_{−i}; π))]     (1)
from agent i in some period t ≥ m_i.
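In code, the payment of Equation 1 needs only two value estimates at agent i's arrival state plus agent i's reported reward up to its commitment period. The helper names below (mdp_value, arrival_state, total_reported_reward) are hypothetical placeholders for quantities the mechanism would maintain; this is a sketch of the arithmetic, not of the full mechanism:

```python
def online_vcg_payment(i, reports, mdp_value, arrival_state, total_reported_reward):
    """Sketch of Equation 1 (hypothetical helper signatures).

    reports                  : reported types theta_hat received so far (each with an .agent_id)
    mdp_value(h)             : V^pi(h), the MDP-value of state h under policy pi
    arrival_state(reports)   : the state h^{a_i} reached in agent i's arrival period given `reports`
    total_reported_reward(i) : R_i^{m_i}, agent i's reported reward up to its commitment period
    """
    reports_without_i = [r for r in reports if r.agent_id != i]

    # Expected marginal value contributed by agent i, judged at its arrival period.
    marginal_value = mdp_value(arrival_state(reports)) - mdp_value(arrival_state(reports_without_i))

    # p_i = R_i^{m_i}(theta_hat^{m_i}) - [V^pi(h^{a_i}(theta_hat^{a_i})) - V^pi(h^{a_i}(theta_hat^{a_i}_{-i}))]
    return total_reported_reward(i) - marginal_value
```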
¹Using histories as state will make the state space very large. Often, there will be some function g for which g(h) is a sufficient statistic for all possible states h. We ignore this possibility here.
Agent i's payment is equal to its reported value discounted by the expected marginal value that it will contribute to the system, as determined by the MDP-value function for the policy in its arrival period. The incentive-compatibility of the online VCG mechanism requires that the MDP satisfies stalling, which requires that the expected value from the optimal policy in every state in which an agent arrives is at least the expected value from following the optimal action in that state as though the agent had never arrived and then returning to the optimal policy. Clearly, the property K^t(h^t) ⊇ K^t(h^t \ i) in any period t in which agent i has just arrived is sufficient for stalling. Stalling is satisfied whenever an agent's arrival does not force a change in action on a policy.
Theorem 1 (Parkes & Singh [6]) An online VCG mechanism, M_vcg = (Θ; π*, p^vcg), based on an optimal policy π* for a correct MDP model that satisfies stalling is Bayesian-Nash incentive compatible and implements the optimal MDP policy.
3 Solving Very Large MDPs Approximately
From Equation 1, it can be seen that making outcome and payment decisions at any point in time in an online VCG mechanism does not require a global value function or a global policy. Unlike most methods for approximately solving MDPs, which compute global approximations, the sparse-sampling methods of Kearns et al. [3] compute approximate values and actions for a single state at a time. Furthermore, sparse-sampling methods provide approximation guarantees that will be important to establish equilibrium properties -- they can compute an ε-approximation to the optimal value and action in a given state in time independent of the size of the state space (though polynomial in 1/ε and exponential in the time horizon). Thus, sparse-sampling methods are particularly suited to approximating online VCG and we adopt them here.
The sparse-sampling algorithm uses the MDP model M_f as a generative model, i.e., as a black box from which a sample of the next-state and reward distributions for any given state-action pair can be obtained. Given a state s and an approximation parameter ε, it computes an ε-accurate estimate of the optimal value for s as follows. We make the parameterization on ε explicit by writing sparse-sampling(ε). The algorithm builds out a depth-T sampled tree in which each node is a state and each node's children are obtained by sampling each action in that state m times (where m is chosen to guarantee an ε-approximation to the optimal value of s), and each edge is labeled with the sampled reward for that transition. The algorithm computes estimates of the optimal value for nodes in the tree working backwards from the leaves as follows. The leaf nodes have zero value. The value of a node is the maximum over the values for all actions in that node. The value of an action in a node is the summed value of the m rewards on the m outgoing edges for that action plus the summed value of the m children of that node, divided by m. The estimated optimal value of state s is the value of the root node of the tree. The estimated optimal action in state s is the action that leads to the largest value for the root node in the tree.
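A recursive sketch of the tree construction just described follows. The generative-model interface and the toy usage at the bottom are our own (hypothetical) stand-ins, and in practice m would be chosen from ε as in [3]:

```python
import random
from typing import Any, Callable, List, Tuple

State, Action = Any, Any

def sparse_sampling_value(
    generate: Callable[[State, Action], Tuple[State, float]],  # generative model: sample (next state, reward)
    actions: Callable[[State], List[Action]],                  # feasible decisions in a state
    state: State,
    horizon: int,   # remaining periods
    m: int,         # samples per (state, action); set from epsilon as in [3]
) -> Tuple[float, Any]:
    """Return (estimated optimal value, estimated optimal action) for `state`.

    Sketch of the sparse-sampling(epsilon) recursion: build a depth-`horizon`
    sampled tree with m children per action and back up averaged values.
    """
    if horizon == 0:
        return 0.0, None  # leaf nodes have zero value
    best_value, best_action = float("-inf"), None
    for a in actions(state):
        total = 0.0
        for _ in range(m):
            next_state, r = generate(state, a)
            child_value, _ = sparse_sampling_value(generate, actions, next_state, horizon - 1, m)
            total += r + child_value
        q = total / m   # average of sampled reward plus child value
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action

# Toy usage: a hypothetical one-server queue where serving a busy state yields reward 1.
def toy_generate(state, action):
    demand = random.random() < 0.6
    reward = 1.0 if (action == "serve" and state == "busy") else 0.0
    return ("busy" if demand else "idle"), reward

print(sparse_sampling_value(toy_generate, lambda s: ["serve", "wait"], "idle", horizon=3, m=4))
```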
Lemma 1 (Adapted from Kearns, Mansour & Ng [3]) The sparse-sampling(ε) algorithm, given access to a generative model for any n-action MDP M, takes as input any state s ∈ S and any ε > 0, outputs an action, and satisfies the following two conditions:
(Running Time) The running time of the algorithm is O((nC)^T), where C = f(n, 1/ε, R_max, T) and f is a polynomial function of the approximation parameter 1/ε, the number of actions n, the largest expected reward in a state R_max, and the horizon T. In particular, the running time has no dependence on the number of states.
(Near-Optimality) The value function of the stochastic policy implemented by the sparse-sampling(ε) algorithm, denoted V^ss, satisfies |V*(s) − V^ss(s)| ≤ ε simultaneously for all states s ∈ S.
It is straightforward to derive the following corollary from the proof of Lemma 1 in [3].
Corollary 1 The value function computed by the sparse-sampling(ε) algorithm, denoted V̂^ss, is near-optimal in expectation, i.e., |V*(s) − E{V̂^ss(s)}| ≤ ε simultaneously for all states s ∈ S, where the expectation is over the randomness introduced by the sparse-sampling(ε) algorithm.
4 Approximately Efficient Online Mechanism Design
In this section, we define an approximate online VCG mechanism and consider the effect on incentives of substituting the sparse-sampling(ε) algorithm into the online VCG mechanism. We model agents as indifferent between decisions that differ by at most a utility of ε > 0, and consider an approximate Bayesian-Nash equilibrium. Let v_i(θ; π) denote the final value to agent i after reports θ, given policy π.
Definition 3 (approximate BNE) Mechanism M_vcg = (Θ; π, p^vcg) is ε-Bayesian-Nash incentive compatible if
E_{θ|θ^{t'}} { v_i(θ; π) − p^vcg_i(θ; π) } + ε ≥ E_{θ|θ^{t'}} { v_i((θ_{−i}, θ̂_i); π) − p^vcg_i((θ_{−i}, θ̂_i); π) }     (2)

where agent i with type θ_i arrives in period t', and the expectation is taken over future types given current reports θ^{t'}.
In particular, when truth-telling is an ε-BNE we say that the mechanism is ε-BNE incentive compatible, and no agent can improve its expected utility by more than ε > 0, for any type, as long as other agents are bidding truthfully. Equivalently, one can interpret an ε-BNE as an exact equilibrium for agents that face a computational cost of at least ε to compute the exact BNE.
Definition 4 (approximate mechanism) A sparse-sampling(ε) based approximate online VCG mechanism, M_vcg(ε) = (Θ; π̃, p̃^vcg), uses the sparse-sampling(ε) algorithm to implement stochastic policy π̃ and collects payment
p̃^vcg_i(θ̂; π̃) = R_i^{m_i}(θ̂^{m_i}; π̃) − [V̂^ss(h^{â_i}(θ̂^{â_i}; π̃)) − V̂^ss(h^{â_i}(θ̂^{â_i}_{−i}; π̃))]
from agent i in some period t ≥ m_i, for commitment period m_i.
Our proof of incentive-compatibility first demonstrates that an approximate delayed VCG mechanism [1, 6] is ε-BNE incentive compatible. With this, we demonstrate that the expected value of the payments in the approximate online VCG mechanism is within 3ε of the payments in the approximate delayed VCG mechanism. The delayed VCG mechanism makes the same decisions as the online VCG mechanism, except that payments are delayed until the final period and computed as:
p^{Dvcg}_i(θ̂; π) = R_i^T(θ̂; π) − [R^T(θ̂; π) − R^T(θ̂_{−i}; π)]     (3)
where the discount is computed ex post, once the effect of an agent on the system value is known. In an approximate delayed VCG mechanism, the role of the sparse-sampling algorithm is to implement an approximate policy, as well as counterfactual policies for the worlds θ̂_{−i} without each agent i in turn. The total reported reward to agents j ≠ i over this counterfactual series of states is used to define the payment to agent i.
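For the delayed mechanism, the payment in Equation 3 is just ex-post arithmetic over the realized rewards and the rewards recorded in the simulated counterfactual run without agent i. A minimal sketch follows; the per-agent reward dictionaries are hypothetical inputs that the mechanism would have recorded:

```python
def delayed_vcg_payment(i, realized_rewards, counterfactual_rewards):
    """Sketch of Equation 3, computed ex post at the horizon T (hypothetical inputs).

    realized_rewards[j]       : R_j^T, total reported reward to agent j under the actual policy run
    counterfactual_rewards[j] : R_j^T in the simulated world without agent i
    """
    r_i = realized_rewards[i]
    total_with_i = sum(realized_rewards.values())
    total_without_i = sum(v for j, v in counterfactual_rewards.items() if j != i)
    # p_i = R_i^T(theta_hat) - [R^T(theta_hat) - R^T(theta_hat_{-i})]
    return r_i - (total_with_i - total_without_i)

# Toy usage: agent 2 pays 0.7, the reward the others lose because of its presence.
print(delayed_vcg_payment(2, {1: 1.0, 2: 2.0, 3: 0.5}, {1: 1.4, 3: 0.8}))
```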
Lemma 2 Truthful bidding is an ε-Bayesian-Nash equilibrium in the sparse-sampling(ε) based approximate delayed VCG mechanism.
Proof: Let π̃ denote the approximate policy computed by the sparse-sampling algorithm. Assume agents j ≠ i are truthful. Now, if agent i bids truthfully its expected utility is
E_{θ|θ^{a_i}} { v_i(θ; π̃) + Σ_{j≠i} R_j^T(θ; π̃) − Σ_{j≠i} R_j^T(θ_{−i}; π̃) }     (4)