{"title": "Sample Complexity of Policy Search with Known Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 97, "page_last": 104, "abstract": null, "full_text": "Sample complexity of policy search with known dynamics\n\nPeter L. Bartlett Division of Computer Science and Department of Statistics University of California, Berkeley Berkeley, CA 94720-1776 bartlett@cs.berkeley.edu\n\nAmbuj Tewari Division of Computer Science University of California, Berkeley Berkeley, CA 94720-1776 ambuj@cs.berkeley.edu\n\nAbstract\nWe consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation-based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.\n\n1 Introduction\nA Markov Decision Process (MDP) models a situation in which an agent interacts (by performing actions and receiving rewards) with an environment whose dynamics are Markovian, i.e. the future is independent of the past given the current state of the environment. Except for toy problems with a few states, computing an optimal policy for an MDP is usually out of the question. Some relaxation is needed if our aim is to develop tractable methods for achieving near-optimal performance. 
One possibility is to avoid considering all possible policies by restricting oneself to a smaller class Π of policies. Given a simulator for the environment, we try to pick the best policy from Π. The hope is that if the policy class is appropriately chosen, the best policy in Π will not be too much worse than the true optimal policy. Use of simulators introduces an additional issue: how is one to be sure that the performance of policies in the class on a few simulations is indicative of their true performance? This is reminiscent of the situation in statistical learning. There the aim is to learn a concept and one restricts attention to a hypothesis class which may or may not contain the \"true\" concept. The sample complexity question then is: how many labeled examples are needed in order to be confident that error rates on the training set are close to the true error rates of the hypotheses in our class? The answer turns out to depend on the \"complexity\" of the hypothesis class as measured by combinatorial quantities associated with the class, such as the VC dimension, the pseudodimension and the fat-shattering dimension. Some progress [6,7] has already been made to obtain uniform bounds on the difference between value functions and their empirical estimates, where the value function of a policy is the expected long-term reward starting from a certain state and following the policy thereafter. We continue this line of work by further investigating what properties of the policy class Π determine the rate of uniform convergence of value function estimates. The key difference between the usual statistical learning setting and ours is that we not only have to consider the complexity of the class Π but also of the classes derived from Π by composing the functions in Π with themselves and with the state evolution process implied by the simulator. Ng and Jordan [7] used a finite pseudodimension condition along with Lipschitz continuity to derive uniform bounds. The Lipschitz condition was used to control the covering numbers of the iterated function classes. We provide a uniform convergence result (Theorem 1) under the assumption that policies are parameterized by a finite number of parameters and that the computations involved in computing the policy, the single-step simulation function and the reward function all require a bounded number of arithmetic operations on real numbers. The number of samples required grows linearly with the dimension of the parameter space but is independent of the dimension of the state space. Ng and Jordan's assumptions and ours are both stronger than just assuming finiteness of some combinatorial dimension. We show that this is unavoidable by constructing two examples where the fat-shattering dimension and the pseudodimension, respectively, are bounded, yet no simulation-based method succeeds in estimating the true values of policies well. This happens because iteratively composing a function class with itself can quickly destroy finiteness of combinatorial dimensions. Additional assumptions are therefore needed to ensure that these iterates continue to have bounded combinatorial dimensions. Although we restrict ourselves to MDPs for ease of exposition, the analysis in this paper carries over easily to the case of partially observable MDPs (POMDPs), provided the simulator also simulates the conditional distribution of observations given state using a bounded amount of computation. The plan of the rest of the paper is as follows. We set up notation and terminology in Section 2. In the same section, we describe the model of computation over the reals that we use. Section 3 proves Theorem 1, which gives a sample complexity bound for achieving a desired level of performance within the policy class. In Section 4, we give two examples of policy classes whose combinatorial dimensions are bounded. 
Nevertheless, we can prove strong minimax lower bounds implying that no method of choosing a policy based on empirical estimates can do well for these examples.\n\n2 Preliminaries\nWe define an MDP M as a tuple (S, D, A, P(·|s, a), r, γ) where S is the state space, D the initial state distribution, A the action space, P(s'|s, a) gives the probability of moving to state s' upon taking action a in state s, r is a function mapping states to distributions over rewards (which are assumed to lie in a bounded interval [0, R]), and γ ∈ (0, 1) is a factor that discounts future rewards. In this paper, we assume that the state space S and the action space A are finite dimensional Euclidean spaces of dimensionality dS and dA respectively. A (randomized) policy π is a mapping from S to distributions over A. Each policy induces a natural Markov chain on the state space of the MDP, namely the one obtained by starting in a state s0 sampled from D and, for t ≥ 0, sampling st+1 according to P(·|st, at) with at drawn from π(st). Let rt(π) be the expected reward at time step t in this Markov chain, i.e. rt(π) = E[ρt] where ρt is drawn from the distribution r(st). Note that the expectation is over the randomness in the choice of the initial state, the state transitions, and the randomized policy and reward outcomes. Define the value VM(π) of the policy π by\nVM(π) = Σ_{t=0}^∞ γ^t rt(π) .\nWe omit the subscript M in the value function if the MDP in question is unambiguously identified. For a class Π of policies, define\nopt(M, Π) = sup_{π∈Π} VM(π) .\nThe regret of a policy π relative to an MDP M and a policy class Π is defined as\nRegM,Π(π) = opt(M, Π) − VM(π) .\nWe use a degree bounded version of the Blum-Shub-Smale [3] model of computation over the reals. At each time step, we can perform one of the four arithmetic operations +, −, ×, / or can branch based on a comparison (say <). While Blum et al. allow an arbitrary fixed rational map to be computed in one time step, we further require that the degree of any of the polynomials appearing at computation nodes be at most 1.\n\nDefinition 1. Let k, l, m, τ be positive integers, f a function from R^k to probability distributions over R^l and μ a probability distribution over R^m. The function f is (μ, τ)-computable if there exists a degree bounded finite dimensional machine M over R with input space R^{k+m} and output space R^l such that the following hold. 1. For every x ∈ R^k and ω ∈ R^m, the machine halts with halting time TM(x, ω) ≤ τ. 2. For every x ∈ R^k, if ω ∈ R^m is distributed according to μ, the input-output map ΦM(x, ω) is distributed as f(x).\nInformally, the definition states that given access to an oracle which generates samples from μ, we can generate samples from f(x) by doing a bounded amount of computation. For precise definitions of the input-output map and halting time, we refer the reader to [3, Chap. 2]. In Section 3, we assume that the policy class Π is parameterized by a finite dimensional parameter θ ∈ R^d. In this setting π(s; θ), P(·|s, a) and r(s) are distributions over R^{dA}, R^{dS} and [0, R] respectively. The following assumption states that all these maps are computable within τ time steps in our model of computation.\nAssumption A. There exists a probability distribution μ over R^m and a positive integer τ such that π(s; θ), P(·|s, a) and r(s) are (μ, τ)-computable. Let Mπ, MP and Mr respectively be the machines that compute them.\nThis assumption will be satisfied if we have three \"programs\" that make a call to a random number generator for distribution μ, do a fixed number of floating-point operations and simulate the policies in our class, the state-transition dynamics and the rewards respectively. The following two examples illustrate this for the state-transition dynamics. 
Linear Dynamical System with Additive Noise.1 Suppose P and Q are dS × dS and dS × dA matrices and the system dynamics is given by\nst+1 = P st + Q at + νt , (1)\nwhere the νt are i.i.d. from some distribution μ. Since computing (1) takes 2(dS² + dS dA + dS) operations, P(·|s, a) is (μ, τ)-computable for τ = O(dS(dS + dA)).\nDiscrete States and Actions. Suppose S = {1, 2, ..., nS} and A = {1, 2, ..., nA}. For some fixed s, a, the distribution P(·|s, a) is described by nS numbers p_{s,a} = (p1, ..., p_{nS}), Σ_i pi = 1. Let Pk = Σ_{i=1}^k pi. For ω ∈ (0, 1], set f(ω) = min{k : Pk ≥ ω}. Thus, if ω has uniform distribution on (0, 1], then f(ω) = k with probability pk. Since the Pk's are non-decreasing, f(ω) can be computed in log nS steps using binary search. But this was for a fixed s, a pair. Finding which p_{s,a} to use further takes log(nS nA) steps using binary search. So if μ denotes the uniform distribution on (0, 1] then P(·|s, a) is (μ, τ)-computable for τ = O(log nS + log nA).\nFor a small ε, let H be the ε-horizon time, i.e. ignoring rewards beyond time H does not affect the value of any policy by more than ε. To obtain sample rewards, given initial state s0 and policy π = π(·; θ), we first compute the trajectory s0, ..., sH sampled from the Markov chain induced by π. This requires H \"calls\" each to Mπ and MP. A further H + 1 calls to Mr are then required to generate the rewards ρ0 through ρH. These calls require a total of 3H + 1 samples from μ. The empirical estimates are computed as follows. Suppose, for 1 ≤ i ≤ n, (s0^(i), ωi) are i.i.d. samples generated from the joint distribution D × μ^{3H+1}. Define the empirical estimate of the value of the policy by\nV̂M^H(θ) = (1/n) Σ_{i=1}^n Σ_{t=0}^H γ^t ρt(s0^(i), θ, ωi) .\nWe omit the subscript M in V̂ when it is clear from the context. 
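The discrete-state sampler and the empirical value estimate above are easy to realize in code. The following is a minimal sketch (our own illustrative code, not part of the paper; all names are hypothetical): `make_sampler` implements the inverse-CDF binary search f(ω) = min{k : Pk ≥ ω}, and `empirical_value` forms the Monte Carlo estimate of the H-step discounted value from simulated trajectories.

```python
import bisect
import random

# Illustrative sketch (our own code, not from the paper): inverse-CDF sampling
# for a discrete-state simulator, and the Monte Carlo estimate of the H-step
# discounted value described in Section 2. All names here are hypothetical.

def make_sampler(probs):
    """Given p = (p_1, ..., p_nS), return f(w) = min{k : P_k >= w} (0-indexed)."""
    cdf, total = [], 0.0
    for p in probs:
        total += p
        cdf.append(total)
    # bisect_left returns the smallest k with cdf[k] >= w, in O(log nS) steps.
    return lambda w: bisect.bisect_left(cdf, w)

def empirical_value(policy, transition, reward, init_states, gamma, H, rng):
    """V-hat_H = (1/n) * sum_i sum_{t=0}^{H} gamma^t * rho_t(s0_i)."""
    total = 0.0
    for s0 in init_states:
        s, rewards = s0, []
        for _ in range(H + 1):
            rewards.append(reward(s))
            a = policy(s)
            # transition(s, a) is a sampler; rng.random() plays the role of
            # the draw from mu (uniform; w = 0 has probability zero).
            s = transition(s, a)(rng.random())
        # Horner-style evaluation of the discounted sum: H multiplications
        # and H additions.
        v = 0.0
        for rho in reversed(rewards):
            v = rho + gamma * v
        total += v
    return total / len(init_states)

# Example: a deterministic 3-state chain that jumps to state 1 and stays there.
det = make_sampler([0.0, 1.0, 0.0])   # next state is 1 for any draw w > 0
rng = random.Random(0)
v_hat = empirical_value(policy=lambda s: 0, transition=lambda s, a: det,
                        reward=lambda s: float(s), init_states=[0] * 4,
                        gamma=0.5, H=2, rng=rng)
assert abs(v_hat - 0.75) < 1e-9       # rho = (0, 1, 1): 0 + 0.5 + 0.25
```

With a parameterized policy in place of the fixed one above, maximizing `empirical_value` over the parameter is exactly the kind of approximate maximization of V̂ analyzed in the next section.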
Define an ε̄-approximate maximizer of V̂ to be a policy π(·; θ̂) such that\nV̂H(θ̂) ≥ sup_θ V̂H(θ) − ε̄ .\n\n1 In this case, the realizable dynamics (mapping from state to next state for a given policy class) is not uniformly Lipschitz if policies allow unbounded actions. So previously known bounds [7] are not applicable even in this simple setting.\n\nFinally, we mention the definitions of three standard combinatorial dimensions. Let X be some space and consider classes G and F of {−1, +1}-valued and real-valued functions on X, respectively. Fix a finite set X = {x1, ..., xn} ⊆ X. We say that G shatters X if for all bit vectors b ∈ {0, 1}^n there exists g ∈ G such that for all i, bi = 0 ⇒ g(xi) = −1, bi = 1 ⇒ g(xi) = +1. We say that F shatters X if there exists r ∈ R^n such that, for all bit vectors b ∈ {0, 1}^n, there exists f ∈ F such that for all i, bi = 0 ⇒ f(xi) < ri, bi = 1 ⇒ f(xi) ≥ ri. We say that F α-shatters X if there exists r ∈ R^n such that, for all bit vectors b ∈ {0, 1}^n, there exists f ∈ F such that for all i, bi = 0 ⇒ f(xi) ≤ ri − α, bi = 1 ⇒ f(xi) ≥ ri + α. We then have the following definitions,\nVCdim(G) = max{|X| : G shatters X} ,\nPdim(F) = max{|X| : F shatters X} ,\nfatF(α) = max{|X| : F α-shatters X} .\n\n3 Regret Bound for Parametric Policy Classes Computable in Bounded Time\nTheorem 1. Fix an MDP M, a policy class Π = {s ↦ π(s; θ) : θ ∈ R^d}, and an ε > 0. Suppose Assumption A holds. Then\nn > O( (R² / ((1 − γ)² ε²)) · d τ H log( eR / ((1 − γ)ε) ) )\nensures that E RegM,Π(θn) ≤ 3ε + ε̄, where θn is an ε̄-approximate maximizer of V̂H and H = log_{1/γ}(2R/(ε(1 − γ))) is the ε/2-horizon time.\nProof. The proof consists of three steps: (1) Assumption A is used to get bounds on pseudodimension; (2) The pseudodimension bound is used to prove uniform convergence of empirical estimates to true value functions; (3) Uniform convergence and the definition of ε̄-approximate maximizer gives the bound on expected regret.\nStep 1. 
Given initial state s0, parameter θ and random numbers ω1 through ω_{3H+1}, we first compute the trajectory as follows. Recall that ΦM refers to the input-output map of a machine M.\nst = ΦMP(st−1, ΦMπ(θ, st−1, ω_{2t−1}), ω_{2t}), 1 ≤ t ≤ H . (2)\nThe rewards are then computed by\nρt = ΦMr(st, ω_{2H+t+1}), 0 ≤ t ≤ H . (3)\nThe H-step discounted reward sum is computed as\nΣ_{t=0}^H γ^t ρt = ρ0 + γ(ρ1 + γ(ρ2 + ... γ(ρ_{H−1} + γ ρH) ...)) . (4)\nDefine the function class R = {(s0, ω) ↦ Σ_{t=0}^H γ^t ρt(s0, θ, ω) : θ ∈ R^d}, where we have explicitly shown the dependence of ρt on s0, θ and ω. Let us count the number of arithmetic operations needed to compute a function in this class. Using Assumption A, we see that steps (2) and (3) require no more than 2τH and τ(H + 1) operations respectively. Step (4) requires H multiplications and H additions. This gives a total of 2τH + τ(H + 1) + 2H ≤ 6τH operations. Goldberg and Jerrum [4] showed that the VC dimension of a function class can be bounded in terms of an upper bound on the number of arithmetic operations it takes to compute the functions in the class. Since the pseudodimension of R can be written as\nPdim(R) = VCdim{(s0, ω, c) ↦ sign(f(s0, ω) − c) : f ∈ R, c ∈ R} ,\nwe get the following bound by [2, Thm. 8.4],\nPdim(R) ≤ 4d(6τH + 3) . (5)\nStep 2. Let V^H(θ) = Σ_{t=0}^H γ^t rt(θ). For the choice of H stated in the theorem, we have for all θ, |V^H(θ) − V(θ)| ≤ ε/2. Therefore,\nP^n(∃θ : |V̂H(θ) − V(θ)| > ε) ≤ P^n(∃θ : |V̂H(θ) − V^H(θ)| > ε/2) . (6)\nFunctions in R are positive and bounded above by R̄ = R/(1 − γ). There are well-known bounds for deviations of empirical estimates from true expectations for bounded function classes in terms of the pseudodimension of the class (see, for example, Theorems 3 and 5 in [5]; also see Pollard's book [8]). Using a weak form of these results, we get\nP^n(∃θ : |V̂H(θ) − V^H(θ)| > ε) ≤ 8 (2eR̄/ε)^{2 Pdim(R)} e^{−ε²n/(64R̄²)} .\nIn order to ensure that P^n(∃θ : |V̂H(θ) − V^H(θ)| > ε/2) < δ, we need\n8 (4eR̄/ε)^{2 Pdim(R)} e^{−ε²n/(256R̄²)} < δ .\nUsing the bound (5) on Pdim(R), we get that\nP^n( sup_θ |V̂H(θ) − V(θ)| > ε ) < δ (7)\nprovided\nn > (256R̄²/ε²) ( log(8/δ) + 8d(6τH + 3) log(4eR̄/ε) ) .\nStep 3. We now show that (7) implies E RegM,Π(θn) ≤ Rδ/(1 − γ) + (2ε + ε̄). The theorem then immediately follows by setting δ = (1 − γ)ε/R.\nSuppose that for all θ, |V̂H(θ) − V(θ)| ≤ ε. This implies that for all θ, V(θ) ≤ V̂H(θ) + ε. Since θn is an ε̄-approximate maximizer of V̂H, we have for all θ, V̂H(θ) ≤ V̂H(θn) + ε̄. Thus, for all θ, V(θ) ≤ V̂H(θn) + ε + ε̄. Taking the supremum over θ and using the fact that V̂H(θn) ≤ V(θn) + ε, we get sup_θ V(θ) ≤ V(θn) + 2ε + ε̄, which is equivalent to RegM,Π(θn) ≤ 2ε + ε̄. Thus, if (7) holds then we have\nP^n( RegM,Π(θn) > 2ε + ε̄ ) < δ .\nDenoting the event {RegM,Π(θn) > 2ε + ε̄} by E, we have\nE RegM,Π(θn) = E[RegM,Π(θn) 1E] + E[RegM,Π(θn) 1(¬E)] ≤ Rδ/(1 − γ) + (2ε + ε̄) ,\nwhere we used the fact that regret is bounded above by R/(1 − γ).\n\n4 Two Policy Classes Having Bounded Combinatorial Dimensions\nWe will describe two policy classes for which we can prove that there are strong limitations on the performance of any method (of choosing a policy out of a policy class) that has access only to empirically observed rewards. Somewhat surprisingly, one can show this for policy classes which are \"simple\" in the sense that standard combinatorial dimensions of these classes are bounded. This shows that sufficient conditions for the success of simulation based policy search (such as the assumptions in [7] and in our Theorem 1) necessarily have to be stronger than boundedness of standard combinatorial dimensions. The first example is a policy class F1 for which fatF1(α) < ∞ for all α > 0. The second example is a class F2 for which Pdim(F2) = 1. Since finiteness of pseudodimension is a stronger condition, the second example makes our point more forcefully than the first one. 
However, the first example is considerably less contrived than the second one.\nExample 1. Let MD = (S, D, A, P(·|s, a), r, γ) be an MDP where S = [−1, +1], D = some distribution on [−1, +1], A = [−2, +2],\nP(s'|s, a) = 1 if s' = max(−1, min(s + a, 1)), 0 otherwise ,\nr = deterministic reward that maps s to s, and γ = some fixed discount factor in (0, 1). For a function f : [−1, +1] → [−1, +1], let πf denote the (deterministic) policy which takes action f(s) − s in state s. Given a class F of functions, we define an associated policy class ΠF = {πf : f ∈ F}. We now describe a specific function class F1. Fix ε1 > 0. Let T be an arbitrary finite subset of (ε1, 1). Let Λ(x) = (1 − |x|)+ be the \"triangular spike\" function. Define\nfT(x) = −1 for −1 ≤ x < 0 ; fT(0) = 0 ; fT(x) = −ε1 (1 − max_{y∈T} Λ((x − y)/ε1)) for 0 < x ≤ 1 . (8)\nDefine F1 = {fT : T ⊆ (ε1, 1), |T| < ∞}. By construction, functions in F1 have bounded total variation and so fatF1(α) is O(1/α) (see, for example, [2, Chap. 11]). Moreover, fT(x) satisfies the Lipschitz condition everywhere (with constant L = 1) except at 0. This is striking in the sense that the loss of the Lipschitz property at a single point allows us to prove the following lower bound.\n[Figure 1: Plot of the function fT with T = {0.2, 0.3, 0.6, 0.8}. Note that, for x > 0, fT(x) is 0 iff x ∈ T. Also, fT(x) satisfies the Lipschitz condition (with constant 1) everywhere except at 0.]\nTheorem 2. Let gn range over functions from S^n to F1. Let D range over probability distributions on S. Then,\ninf_{gn} sup_D E_{(s1,...,sn)∼D^n} RegMD,ΠF1(π_{gn(s1,...,sn)}) ≥ γ²/(1 − γ) − 2ε1 .\nThis says that for any method that maps random initial states s1, ..., sn to a policy in ΠF1, there is an initial state distribution such that the expected regret of the selected policy is at least γ²/(1 − γ) − 2ε1. 
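The mechanism behind this lower bound can be checked numerically. The following is our own illustrative sketch (not part of the paper; `f_T` and ε1 = 0.05 are assumptions for the demonstration): although each fT is 1-Lipschitz away from 0 and the class F1 has finite fat-shattering dimension at every scale, the two-fold composition fT(fT(x)) restricted to (0, 1) is the indicator of the arbitrary finite set T, a class that shatters sets of every size.

```python
# Hypothetical illustration (not part of the paper): composing f_T with itself
# once collapses the "simple" class F1 into the class of finite-set indicators
# 1(x in T) on (0, 1).

EPS1 = 0.05  # the parameter epsilon_1 fixed in Example 1

def spike(x):
    """Triangular spike (1 - |x|)_+."""
    return max(0.0, 1.0 - abs(x))

def f_T(x, T):
    if x < 0:
        return -1.0
    if x == 0:
        return 0.0
    # 1-Lipschitz on (0, 1]; equals 0 exactly at the points of T.
    return -EPS1 * (1.0 - max(spike((x - y) / EPS1) for y in T))

T = {0.2, 0.3, 0.6, 0.8}
for x in [0.1, 0.2, 0.25, 0.6, 0.9]:
    indicator = f_T(f_T(x, T), T) + 1.0   # equals 1(x in T) on (0, 1)
    assert indicator == (1.0 if x in T else 0.0)
```

Since the two-step rewards already encode membership in an arbitrary finite set, no finite sample of initial states can pin down T on the unseen part of (0, 1), which is what drives the lower bound of Theorem 2.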
This is in sharp contrast to Theorem 1, where we could reduce the expected regret down to any positive number by using sufficiently many samples, given the ability to maximize the empirical estimates V̂. Let us see how maximization of empirical estimates behaves in this case. Since fatF1(α) < ∞ for all α > 0, the law of large numbers holds uniformly [1, Thm. 2.5] over the class F1. The transitions, policies and rewards here are all deterministic. The reward function is just the identity. This means that the 1-step reward function family is just F1. So the estimates of 1-step rewards are still uniformly concentrated around their expected values. Since the contribution of rewards from time step 2 onwards can be no more than γ² + γ³ + ... = γ²/(1 − γ), we can claim that the expected regret of the V̂ maximizer θ̂n behaves like\nE RegMD,ΠF1(θ̂n) ≤ γ²/(1 − γ) + en\nwhere en → 0. Thus the bound in Theorem 2 above is essentially tight. Before we prove Theorem 2, we need the following lemma, whose proof is given in the appendix accompanying the paper.\nLemma 1. Fix an interval (a, b) and let T be the set of all its finite subsets. Let gn range over functions from (a, b)^n to T. Let D range over probability distributions on (a, b). Then,\ninf_{gn} sup_D sup_{T∈T} [ E_{X∼D} 1(X ∈ T) − E_{(X1,...,Xn)∼D^n} E_{X∼D} 1(X ∈ gn(X1,...,Xn)) ] ≥ 1 .\nProof of Theorem 2. We will prove the inequality when D ranges over distributions on (0, 1), which, obviously, implies the theorem. Since, for all f ∈ F1 and n > 2, f^n = f^2, we have\nopt(MD, ΠF1) − E_{(s1,...,sn)∼D^n} VMD(π_{gn(s1,...,sn)})\n= sup_{f∈F1} E_{s∼D}[ s + γ f(s) + (γ²/(1 − γ)) f²(s) ] − E_{(s1,...,sn)∼D^n} E_{s∼D}[ s + γ gn(s1,...,sn)(s) + (γ²/(1 − γ)) gn(s1,...,sn)²(s) ] .\nFor all f1, f2 ∈ F1, |E f1 − E f2| ≤ E|f1 − f2| ≤ ε1. Therefore, we can get rid of the first terms in both sub-expressions above (the E_{s∼D}[s] terms cancel exactly, and the γ f terms vary by at most γε1 over F1) without changing the value by more than 2ε1.\n≥ sup_{f∈F1} (γ²/(1 − γ)) E_{s∼D}[ f²(s) ] − E_{(s1,...,sn)∼D^n} (γ²/(1 − γ)) E_{s∼D}[ gn(s1,...,sn)²(s) ] − 2ε1\n= (γ²/(1 − γ)) ( sup_{f∈F1} E_{s∼D}[ f²(s) + 1 ] − E_{(s1,...,sn)∼D^n} E_{s∼D}[ gn(s1,...,sn)²(s) + 1 ] ) − 2ε1 .\nFrom (8), we know that fT²(x) + 1 restricted to x ∈ (0, 1) is the same as 1(x ∈ T). Therefore, restricting D to probability measures on (0, 1) and applying Lemma 1, we get\ninf_{gn} sup_D [ opt(MD, ΠF1) − E_{(s1,...,sn)∼D^n} VMD(π_{gn(s1,...,sn)}) ] ≥ γ²/(1 − γ) − 2ε1 .\nTo finish the proof, we note that γ < 1 and, by definition,\nRegMD,ΠF1(π_{gn(s1,...,sn)}) = opt(MD, ΠF1) − VMD(π_{gn(s1,...,sn)}) .\nExample 2. We use the MDP of the previous example with a different policy class, which we now describe. For real numbers x, y ∈ (0, 1) with binary expansions (choose the terminating representation for rationals) 0.b1 b2 b3 ... and 0.c1 c2 c3 ..., define\nmix(x, y) = 0.b1 c1 b2 c2 ... , stretch(x) = 0.b1 0 b2 0 b3 ... , even(x) = 0.b2 b4 b6 ... , odd(x) = 0.b1 b3 b5 ...\nSome obvious identities are mix(x, y) = stretch(x) + stretch(y)/2, odd(mix(x, y)) = x and even(mix(x, y)) = y. Now fix ε2 > 0. Since finite subsets of (0, 1) and irrationals in (0, ε2) have the same cardinality, there exists a bijection h which maps every finite subset T of (0, 1) to some irrational h(T) ∈ (0, ε2). For a finite subset T of (0, 1), define\nfT(x) = 0 for x = −1 ; fT(x) = 1(odd(−x) ∈ h⁻¹(even(−x))) for −1 < x < 0 ; fT(0) = 0 ; fT(x) = −mix(x, h(T)) for 0 < x < 1 ; fT(1) = 1 ,\nand let F2 = {fT : T a finite subset of (0, 1)}. Three facts are worth noting. First, for all f ∈ F2 and n > 2, f^n = f^2. Second, for all f1, f2 ∈ F2 and x ∈ [−1, +1], |f1(x) − f2(x)| ≤ ε2/2. This is because fT1 and fT2 can differ only for x ∈ (0, 1). For such an x, |fT1(x) − fT2(x)| = |mix(x, h(T1)) − mix(x, h(T2))| = |stretch(h(T1)) − stretch(h(T2))|/2 ≤ ε2/2. Third, the restriction of fT² (the two-fold composition of fT) to (0, 1) is 1(x ∈ T).\nAcknowledgments\nWe acknowledge the support of DARPA under grants HR0011-04-1-0014 and FA8750-05-2-0249.\nReferences\n[1] Alon, N., Ben-David, S., Cesa-Bianchi, N. & Haussler, D. (1997) Scale-sensitive Dimensions, Uniform Convergence, and Learnability. Journal of the ACM 44(4):615-631. 
[2] Anthony, M. & Bartlett, P.L. (1999) Neural Network Learning: Theoretical Foundations. Cambridge University Press. [3] Blum, L., Cucker, F., Shub, M. & Smale, S. (1998) Complexity and Real Computation. Springer-Verlag. [4] Goldberg, P.W. & Jerrum, M.R. (1995) Bounding the Vapnik-Chervonenkis Dimension of Concept Classes Parameterized by Real Numbers. Machine Learning 18(2-3):131-148. [5] Haussler, D. (1992) Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications. Information and Computation 100:78-150. [6] Jain, R. & Varaiya, P. (2006) Simulation-based Uniform Value Function Estimates of Discounted and Average-reward MDPs. SIAM Journal on Control and Optimization, to appear. [7] Ng, A.Y. & Jordan, M.I. (2000) PEGASUS: A Policy Search Method for MDPs and POMDPs. In Proceedings of the 16th Annual Conference on Uncertainty in Artificial Intelligence, pp. 405-415. Morgan Kaufmann Publishers. [8] Pollard, D. (1990) Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, Volume 2.\n", "award": [], "sourceid": 2990, "authors": [{"given_name": "Peter", "family_name": "Bartlett", "institution": null}, {"given_name": "Ambuj", "family_name": "Tewari", "institution": null}]}