{"title": "Online Linear Regression and Its Application to Model-Based Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1417, "page_last": 1424, "abstract": "We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernalized) linearly parameterized dynamics. This result builds on Kearns and Singh's work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.", "full_text": "Online Linear Regression and Its Application to\n\nModel-Based Reinforcement Learning\n\nAlexander L. Strehl\u2217\n\nYahoo! Research\nNew York, NY\n\nstrehl@yahoo-inc.com\n\nMichael L. Littman\n\nDepartment of Computer Science\n\nRutgers University\nPiscataway, NJ USA\n\nmlittman@cs.rutgers.edu\n\nAbstract\n\nWe provide a provably ef\ufb01cient algorithm for learning Markov Decision Processes\n(MDPs) with continuous state and action spaces in the online setting. Speci\ufb01cally,\nwe take a model-based approach and show that a special type of online linear\nregression allows us to learn MDPs with (possibly kernalized) linearly parame-\nterized dynamics. This result builds on Kearns and Singh\u2019s work that provides a\nprovably ef\ufb01cient algorithm for \ufb01nite state MDPs. Our approach is not restricted\nto the linear setting, and is applicable to other classes of continuous MDPs.\n\nIntroduction\n\nCurrent reinforcement-learning (RL) techniques hold great promise for creating a general type of\narti\ufb01cial intelligence (AI), speci\ufb01cally autonomous (software) agents that learn dif\ufb01cult tasks with\nlimited feedback (Sutton & Barto, 1998). 
Applied RL has been very successful, producing world-class computer backgammon players (Tesauro, 1994) and model helicopter flyers (Ng et al., 2003). Many applications of RL, including the two above, utilize supervised-learning techniques for the purpose of generalization. Such techniques enable an agent to act intelligently in new situations by learning from past experience in different but similar situations.\n\nProvably efficient RL for finite state and action spaces was accomplished by Kearns and Singh (2002) and contributed greatly to our understanding of the relationship between exploration and sequential decision making. The achievement of the current paper is to provide an efficient RL algorithm that learns in Markov Decision Processes (MDPs) with continuous state and action spaces. We prove that it learns linearly-parameterized MDPs, a model introduced by Abbeel and Ng (2005), with sample (or experience) complexity that grows only polynomially with the number of state-space dimensions.\n\nOur new RL algorithm utilizes a special linear regressor, based on least-squares regression, whose analysis may be of interest to the online learning and statistics communities. Although our primary result is for linearly-parameterized MDPs, our technique is applicable to other classes of continuous MDPs, and our framework is developed specifically with such future applications in mind. The linear dynamics case should be viewed as only an interesting example of our approach, which makes substantial progress toward the goal of understanding the relationship between exploration and generalization in RL.\n\nAn outline of the paper follows. In Section 1, we discuss online linear regression and pose a new online learning framework that requires an algorithm to not only provide predictions for new data points but also provide formal guarantees about its predictions. 
We also develop a specific algorithm and prove that it solves the problem. In Section 2, using the algorithm and result from the first section, we develop a provably efficient RL algorithm. Finally, we conclude with future work.\n\n\u2217Some of the work presented here was conducted while the author was at Rutgers University.\n\n1 Online Linear Regression\n\nLinear Regression (LR) is a well-known and tremendously powerful technique for prediction of the value of a variable (called the response or output) given the value of another variable (called the explanatory or input). Suppose we are given some data consisting of input-output pairs: (x1, y1), (x2, y2), . . . , (xm, ym), where xi \u2208 Rn and yi \u2208 R for i = 1, . . . , m. Further, suppose that the data satisfies a linear relationship, that is, yi \u2248 \u03b8T xi \u2200i \u2208 {1, . . . , m}, where \u03b8 \u2208 Rn is an n-dimensional parameter vector. When a new input x arrives, we would like to make a prediction of the corresponding output by estimating \u03b8 from our data. A standard approach is to approximate \u03b8 with the least-squares estimator \u02c6\u03b8 defined by \u02c6\u03b8 = (X T X)\u22121X T y, where X \u2208 Rm\u00d7n is a matrix whose ith row is the transpose of the ith input xi, and y \u2208 Rm is a vector whose ith component is the ith output yi.\nAlthough there are many analyses of the linear regression problem, none is quite right for an application to model-based reinforcement learning (MBRL). In particular, in MBRL, we cannot assume that X is fixed ahead of time, and we require not just an estimate of \u03b8 but also knowledge of whether this estimate is sufficiently accurate. A robust learning agent must not only infer an approximate model of its environment but also maintain an idea about the accuracy of the parameters of this model. 
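For concreteness, the least-squares estimator \u02c6\u03b8 = (X T X)\u22121X T y described above can be sketched in a few lines of Python; the data, dimensions, and noise level below are invented for the example, and a numerically stable solver is used instead of forming (X T X)\u22121 explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 200                        # input dimension and sample count (arbitrary)
theta = np.array([0.5, -0.3, 0.2])   # "unknown" parameter vector, ||theta|| <= 1
X = rng.uniform(-0.5, 0.5, size=(m, n))          # rows are the inputs x_i
y = X @ theta + 0.01 * rng.standard_normal(m)    # y_i = theta^T x_i + noise

# Least-squares estimate of theta; np.linalg.lstsq is a stable alternative
# to computing (X^T X)^{-1} X^T y directly.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With this much data and little noise, theta_hat recovers theta to within a few hundredths per coordinate.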
Without such meta-knowledge, it would be difficult to determine when to explore (or when to trust the model) and how to explore (to improve the model). We coined the term KWIK (\u201cknow what it knows\u201d) for algorithms that have this special property. With this idea in mind, we present the following online learning problem related to linear regression. Let ||v|| denote the Euclidean norm of a vector v and let Var[X] denote the variance of a random variable X.\n\nDefinition 1 (KWIK Linear Regression Problem or KLRP) On every timestep t = 1, 2, . . . an input vector xt \u2208 Rn satisfying ||xt|| \u2264 1 and an output number yt \u2208 [\u22121, 1] are provided. The input xt may be chosen in any way that depends on the previous inputs and outputs (x1, y1), . . . , (xt\u22121, yt\u22121). The output yt is chosen probabilistically from a distribution that depends only on xt and satisfies E[yt] = \u03b8T xt and Var[yt] \u2264 \u03c32, where \u03b8 \u2208 Rn is an unknown parameter vector satisfying ||\u03b8|| \u2264 1 and \u03c3 \u2208 R is a known constant. After observing xt and before observing yt, the learning algorithm must produce an output \u02c6yt \u2208 [\u22121, 1] \u222a {\u2205} (a prediction of E[yt|xt]). Furthermore, it should be able to provide an output \u02c6y(x) for any input vector x \u2208 {0, 1}n.\nA key aspect of our problem that distinguishes it from other online learning models is that the algorithm is allowed to output a special value \u2205 rather than make a valid prediction (an output other than \u2205). An output of \u2205 signifies that the algorithm is unsure of what to predict and therefore declines to make a prediction. The algorithm would like to minimize the number of times it predicts \u2205, and, furthermore, when it does make a valid prediction, the prediction must be accurate, with high probability. 
Next, we formalize the above intuition and define the properties of a \u201csolution\u201d to KLRP.\n\nDefinition 2 We define an admissible algorithm for the KWIK Linear Regression Problem to be one that takes two inputs 0 \u2264 \u03b5 \u2264 1 and 0 \u2264 \u03b4 < 1 and, with probability at least 1 \u2212 \u03b4, satisfies the following conditions:\n\n1. Whenever the algorithm predicts \u02c6yt(x) \u2208 [\u22121, 1], we have that |\u02c6yt(x) \u2212 \u03b8T x| \u2264 \u03b5.\n2. The number of timesteps t for which \u02c6yt(xt) = \u2205 is bounded by some function \u03b6(\u03b5, \u03b4, n), polynomial in n, 1/\u03b5 and 1/\u03b4, called the sample complexity of the algorithm.\n\n1.1 Solution\n\nFirst, we present an algorithm and then a proof that it solves KLRP. Let X denote an m \u00d7 n matrix whose rows we interpret as transposed input vectors. We let X(i) denote the transpose of the ith row of X. Since X T X is symmetric, we can write it as\n\nX T X = U \u039bU T (Singular Value Decomposition), (1)\n\nwhere U = [v1, . . . , vn] \u2208 Rn\u00d7n, with v1, . . . , vn being a set of orthonormal eigenvectors of X T X. Let the corresponding eigenvalues be \u03bb1 \u2265 \u03bb2 \u2265 \u00b7\u00b7\u00b7 \u2265 \u03bbk \u2265 1 > \u03bbk+1 \u2265 \u00b7\u00b7\u00b7 \u2265 \u03bbn \u2265 0. Note that \u039b = diag(\u03bb1, . . . , \u03bbn) is diagonal but not necessarily invertible. Now, define \u00afU = [v1, . . . , vk] \u2208 Rn\u00d7k and \u00af\u039b = diag(\u03bb1, . . . , \u03bbk) \u2208 Rk\u00d7k. For a fixed input xt (a new input provided to the algorithm at time t), define\n\n\u00afq := X \u00afU \u00af\u039b\u22121 \u00afU T xt \u2208 Rm, (2)\n\u00afv := [0, . . . , 0, v_{k+1}^T xt, . . . , v_n^T xt]^T \u2208 Rn. (3)\n\nAlgorithm 1 KWIK Linear Regression\n0: Inputs: \u03b11, \u03b12\n1: Initialize X = [ ] and y = [ ].\n2: for t = 1, 2, 3, \u00b7\u00b7\u00b7 do\n3: Let xt denote the input at time t.\n4: Compute \u00afq and \u00afv using Equations 2 and 3.\n5: if ||\u00afq|| \u2264 \u03b11 and ||\u00afv|| \u2264 \u03b12 then\n6: Choose \u02c6\u03b8 \u2208 Rn that minimizes \u2211_i [y(i) \u2212 \u00af\u03b8T X(i)]^2 over \u00af\u03b8 subject to ||\u00af\u03b8|| \u2264 1, where X(i) is the transpose of the ith row of X and y(i) is the ith component of y.\n7: Output valid prediction x_t^T \u02c6\u03b8.\n8: else\n9: Output \u2205.\n10: Receive output yt.\n11: Append x_t^T as a new row to the matrix X.\n12: Append yt as a new element to the vector y.\n13: end if\n14: end for\n\nOur algorithm for solving the KWIK Linear Regression Problem uses these quantities and is provided in pseudocode by Algorithm 1. Our first main result of the paper is the following theorem.\n\nTheorem 1 With appropriate parameter settings, Algorithm 1 is an admissible algorithm for the KWIK Linear Regression Problem with a sample complexity bound of \u02dcO(n^3/\u03b5^4).\n\nAlthough the analysis of Algorithm 1 is somewhat complicated, the algorithm itself has a simple interpretation. Given a new input xt, the algorithm considers making a prediction of the output yt using the norm-constrained least-squares estimator (specifically, \u02c6\u03b8 defined in line 6 of Algorithm 1). The norms of the vectors \u00afq and \u00afv provide a quantitative measure of uncertainty about this estimate. When both norms are small, the estimate is trusted and a valid prediction is made. When either norm is large, the estimate is not trusted and the algorithm produces an output of \u2205.\nOne may wonder why \u00afq and \u00afv provide a measure of uncertainty for the least-squares estimate. Consider the case when all eigenvalues of X T X are greater than 1. 
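To make the thresholding concrete, here is a small Python sketch of Algorithm 1 (our own illustrative rendering, not the paper's implementation): the norm-constrained least squares of line 6 is approximated by solving only in the well-conditioned eigendirections, and the thresholds \u03b11, \u03b12 are hand-picked rather than set as in the analysis.

```python
import numpy as np

class KWIKLinearRegression:
    """Illustrative sketch of Algorithm 1. The norm-constrained least-squares
    step is approximated by a truncated-eigendecomposition solve, and alpha1,
    alpha2 are hand-picked rather than chosen as in the analysis."""

    def __init__(self, n, alpha1=1.0, alpha2=0.05):
        self.n = n
        self.alpha1, self.alpha2 = alpha1, alpha2
        self.X, self.y = [], []   # training data gathered on "don't know" steps

    def predict(self, x):
        X = np.array(self.X).reshape(-1, self.n)
        lam, U = np.linalg.eigh(X.T @ X)   # eigendecomposition of X^T X
        big = lam >= 1.0                   # directions with eigenvalue >= 1
        # qbar = X Ubar diag(1/lambda) Ubar^T x; vbar collects the remaining
        # eigendirections (Equations 2 and 3).
        qbar = X @ (U[:, big] * (1.0 / lam[big])) @ (U[:, big].T @ x)
        vbar = U[:, ~big].T @ x
        if np.linalg.norm(qbar) <= self.alpha1 and np.linalg.norm(vbar) <= self.alpha2:
            # Least squares restricted to the well-conditioned subspace,
            # a stand-in for the constrained minimization in line 6.
            theta_hat = U[:, big] @ ((U[:, big].T @ (X.T @ np.array(self.y))) / lam[big])
            return float(theta_hat @ x)
        return None                        # the "don't know" output (the paper's null symbol)

    def update(self, x, y):
        # Algorithm 1 appends (x_t, y_t) only on timesteps where it declined.
        self.X.append(np.asarray(x, dtype=float))
        self.y.append(float(y))

# Usage on synthetic data (target vector and noise level are invented).
rng = np.random.default_rng(1)
theta = np.array([0.3, -0.4])            # hypothetical target, ||theta|| <= 1
learner = KWIKLinearRegression(n=2)
valid, max_err = 0, 0.0
for t in range(300):
    x = rng.uniform(-1.0, 1.0, 2)
    x /= max(1.0, np.linalg.norm(x))     # enforce ||x_t|| <= 1
    y = theta @ x + 0.01 * rng.standard_normal()
    pred = learner.predict(x)
    if pred is None:
        learner.update(x, y)             # learn only when we said "don't know"
    else:
        valid += 1
        max_err = max(max_err, abs(pred - theta @ x))
```

After a short burst of "don't know" answers while X T X is ill-conditioned, every eigenvalue exceeds 1 and the learner commits to (accurate) predictions, mirroring the intended behavior of the thresholds.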
In this case, note that x = X T X(X T X)\u22121x = X T \u00afq. Thus, x can be written as a linear combination of the rows of X (the previously experienced input vectors), with coefficients given by \u00afq. As shown by Auer (2002), this particular linear combination minimizes ||q|| among all q satisfying x = X T q. Intuitively, if the norm of \u00afq is small, then there are many previous training samples (actually, combinations of inputs) \u201csimilar\u201d to x, and hence our least-squares estimate is likely to be accurate for x. For the case of ill-conditioned X T X (when X T X has eigenvalues close to 0), X(X T X)\u22121x may be undefined or have a large norm. In this case, we must consider the directions corresponding to small eigenvalues separately, and this consideration is dealt with by \u00afv.\n\n1.2 Analysis\n\nWe provide a sketch of the analysis of Algorithm 1. Please see our technical report for full details. The analysis hinges on two key lemmas that we now present.\n\nIn the following lemma, we analyze the behavior of the squared error of predictions based on an incorrect estimator \u02c6\u03b8 \u2260 \u03b8 versus the squared error of using the true parameter vector \u03b8. Specifically, we show that the squared error of the former is very likely to be larger than that of the latter when the predictions based on \u02c6\u03b8 (of the form \u02c6\u03b8T x for input x) are highly inaccurate. The proof uses Hoeffding\u2019s bound and is omitted.\n\nLemma 1 Let \u03b8 \u2208 Rn and \u02c6\u03b8 \u2208 Rn be two fixed parameter vectors satisfying ||\u03b8|| \u2264 1 and ||\u02c6\u03b8|| \u2264 1. Suppose that (x1, y1), . . . , (xm, ym) is any sequence of samples satisfying xi \u2208 Rn, yi \u2208 R, ||xi|| \u2264 1, yi \u2208 [\u22121, 1], E[yi|xi] = \u03b8T xi, and Var[yi|xi] \u2264 \u03c32. 
For any 0 < \u03b40 < 1 and fixed positive constant z, if\n\n\u2211_{i=1}^{m} [(\u03b8 \u2212 \u02c6\u03b8)T xi]^2 \u2265 2\u221a(8m ln(2/\u03b40)) + z, (4)\n\nthen\n\n\u2211_{i=1}^{m} (yi \u2212 \u02c6\u03b8T xi)^2 > \u2211_{i=1}^{m} (yi \u2212 \u03b8T xi)^2 + z (5)\n\nwith probability at least 1 \u2212 2\u03b40.\nThe following lemma, whose proof is fairly straightforward and therefore omitted, relates the error of an estimate \u02c6\u03b8T x for a fixed input x based on an inaccurate estimator \u02c6\u03b8 to the quantities ||\u00afq||, ||\u00afv||, and \u2206E(\u02c6\u03b8) := \u221a(\u2211_{i=1}^{m} [(\u03b8 \u2212 \u02c6\u03b8)T X(i)]^2). Recall that when ||\u00afq|| and ||\u00afv|| are both small, our algorithm becomes confident of the least-squares estimate. In precisely this case, the lemma shows that |(\u03b8 \u2212 \u02c6\u03b8)T x| is bounded by a quantity proportional to \u2206E(\u02c6\u03b8).\nLemma 2 Let \u03b8 \u2208 Rn and \u02c6\u03b8 \u2208 Rn be two fixed parameter vectors satisfying ||\u03b8|| \u2264 1 and ||\u02c6\u03b8|| \u2264 1. Suppose that (x1, y1), . . . , (xm, ym) is any sequence of samples satisfying xi \u2208 Rn, yi \u2208 R, ||xi|| \u2264 1, yi \u2208 [\u22121, 1]. Let x \u2208 Rn be any vector. Let \u00afq and \u00afv be defined as above. Let \u2206E(\u02c6\u03b8) denote the error term \u221a(\u2211_{i=1}^{m} [(\u03b8 \u2212 \u02c6\u03b8)T xi]^2). We have that\n\n|(\u03b8 \u2212 \u02c6\u03b8)T x| \u2264 ||\u00afq||\u2206E(\u02c6\u03b8) + 2||\u00afv||. (6)\n\nProof sketch: (of Theorem 1)\n\nThe proof has three steps. The first is to bound the sample complexity of the algorithm (the number of times the algorithm makes a prediction of \u2205) in terms of the input parameters \u03b11 and \u03b12. The second is to choose the parameters \u03b11 and \u03b12. 
The third is to show that, with high probability, every valid prediction made by the algorithm is accurate.\nStep 1 We derive an upper bound \u00afm on the number of timesteps for which either ||\u00afq|| > \u03b11 holds or ||\u00afv|| > \u03b12 holds. Observing that the algorithm trains on only those samples experienced during precisely these timesteps and applying Lemma 13 from the paper by Auer (2002), we have that\n\n\u00afm = O( n ln(n/\u03b11)/\u03b11^2 + n/\u03b12^2 ). (7)\n\nStep 2 We choose \u03b11 = 1/(C \u00b7 Q ln Q), where C is a constant and Q = n\u221a(ln(1/(\u03b5\u03b4))) ln(n)/\u03b5^2, and \u03b12 = \u03b5/4.\nStep 3 Consider some fixed timestep t during the execution of Algorithm 1 such that the algorithm makes a valid prediction (not \u2205). Let \u02c6\u03b8 denote the solution of the norm-constrained least-squares minimization (line 6 in the pseudocode). By definition, since \u2205 was not predicted, we have that ||\u00afq|| \u2264 \u03b11 and ||\u00afv|| \u2264 \u03b12. We would like to show that |\u02c6\u03b8T x \u2212 \u03b8T x| \u2264 \u03b5 so that Condition 1 of Definition 2 is satisfied. Suppose not, namely that |(\u02c6\u03b8 \u2212 \u03b8)T x| > \u03b5. Using Lemma 2, we can lower bound the quantity \u2206E(\u02c6\u03b8)^2 = \u2211_{i=1}^{m} [(\u03b8 \u2212 \u02c6\u03b8)T X(i)]^2, where m denotes the number of rows of the matrix X (equivalently, the number of samples used by the algorithm for training, which we upper-bounded by \u00afm), and X(i) denotes the transpose of the ith row of X. Finally, we would like to apply Lemma 1 to prove that, with high probability, the squared error of \u02c6\u03b8 will be larger than the squared error of predictions based on the true parameter vector \u03b8, which contradicts the fact that \u02c6\u03b8 was chosen to minimize the term \u2211_{i=1}^{m} (yi \u2212 \u02c6\u03b8T X(i))^2. 
One problem with this approach is that Lemma 1 applies to a fixed \u02c6\u03b8, whereas the least-squares computation of Algorithm 1 may choose any \u02c6\u03b8 in the infinite set {\u02c6\u03b8 \u2208 Rn : ||\u02c6\u03b8|| \u2264 1}. Therefore, we use a uniform discretization to form a finite cover of [\u22121, 1]n and apply the lemma to the member of the cover closest to \u02c6\u03b8. To guarantee that the total failure probability of the algorithm is at most \u03b4, we apply the union bound over all (finitely many) applications of Lemma 1. \u25a1\n\n1.3 Notes\n\nIn our formulation of KLRP, we assumed an upper bound of 1 on the two-norm of the inputs xi, the outputs yi, and the true parameter vector \u03b8. By appropriate scaling of the inputs and/or outputs, we could instead allow a larger (but still finite) bound.\n\nOur analysis of Algorithm 1 showed that it is possible to solve KLRP with polynomial sample complexity (where the sample complexity is defined as the number of timesteps t on which the algorithm outputs \u2205 for the current input xt), with high probability. We note that the algorithm also has polynomial computational complexity per timestep, given the tractability of solving norm-constrained least-squares problems (see Chapter 12 of the book by Golub and Van Loan (1996)).\n\n1.4 Related Work\n\nWork on linear regression is abundant in the statistics community (Seber & Lee, 2003). The use of the quantities \u00afv and \u00afq to quantify the level of certainty of the linear estimator was introduced by Auer (2002). Our analysis differs from that of Auer (2002) because we do not assume that the input vectors xi are fixed ahead of time, but rather allow them to be chosen in an adversarial manner. This property is especially important for the application of regression techniques to the full RL problem, rather than the Associative RL problem considered by Auer (2002). 
Our analysis has a similar \ufb02avor\nto some, but not all, parts of the analysis by Abbeel and Ng (2005). However, a crucial difference\nof our framework and analysis is the use of output \u2205 to signify uncertainty in the current estimate,\nwhich allows for ef\ufb01cient exploration in the application to RL as described in the next section.\n\n2 Application to Reinforcement Learning\n\nThe general reinforcement-learning (RL) problem is how to enable an agent (computer program,\nrobot, etc.) to maximize an external reward signal by acting in an unknown environment. To ensure\na well-de\ufb01ned problem, we make assumptions about the types of possible worlds. To make the\nproblem tractable, we settle for near-optimal (rather than optimal) behavior on all but a polynomial\nnumber of timesteps, as well as a small allowable failure probability. This type of performance\nmetric was introduced by Kakade (2003), in the vein of recent RL analyses (Kearns & Singh, 2002;\nBrafman & Tennenholtz, 2002).\n\nIn this section, we formalize a speci\ufb01c RL problem where the environment is mathematically mod-\neled by a continuous MDP taken from a rich class of MDPs. We present an algorithm and prove\nthat it learns ef\ufb01ciently within this class. The algorithm is \u201cmodel-based\u201d in the sense that it con-\nstructs an explicit MDP that it uses to reason about future actions in the true, but unknown, MDP\nenvironment. The algorithm uses, as a subroutine, any admissible algorithm for the KWIK Linear\nRegression Problem introduced in Section 1. Although our main result is for a speci\ufb01c class of con-\ntinuous MDPs, albeit an interesting and previously studied one, our technique is more general and\nshould be applicable to many other classes of MDPs as described in the conclusion.\n\n2.1 Problem Formulation\n\nThe model we use is slightly modi\ufb01ed from the model described by Abbeel and Ng (2005). 
The main difference is that we consider discounted rather than undiscounted MDPs, and we don\u2019t require the agent to have a \u201creset\u201d action that takes it to a specified start state (or distribution). Let PS denote the set of all (measurable) probability distributions over the set S. The environment is described by a discounted MDP M = \u27e8S, A, T, R, \u03b3\u27e9, where S = RnS is the state space, A = RnA is the action space, T : S \u00d7 A \u2192 PS is the unknown transition dynamics, \u03b3 \u2208 [0, 1) is the discount factor, and R : S \u00d7 A \u2192 R is the known reward function.1 (1 All of our results can easily be extended to the case of an unknown reward function with a suitable linearity assumption.) For each timestep t, let xt \u2208 S denote the current state and ut \u2208 A the current action. The transition dynamics T satisfy\n\nxt+1 = M \u03c6(xt, ut) + wt, (8)\n\nwhere xt+1 \u2208 S, \u03c6(\u00b7,\u00b7) : RnS+nA \u2192 Rn is a (basis or kernel) function satisfying ||\u03c6(\u00b7,\u00b7)|| \u2264 1, and M is an nS \u00d7 n matrix. We assume that the 2-norm of each row of M is bounded by 1.2 Each component of the noise term wt \u2208 RnS is chosen i.i.d. from a normal distribution with mean 0 and variance \u03c32 for a known constant \u03c3. If an MDP satisfies the above conditions, we say that it is linearly parameterized, because the next state xt+1 is a linear function of the vector \u03c6(xt, ut) (which describes the current state and action) plus a noise term.\nWe assume that the learner (also called the agent) receives nS, nA, n, R, \u03c6(\u00b7,\u00b7), \u03c3, and \u03b3 as input, with T initially being unknown. The learning problem is defined as follows. The agent always occupies a single state s of the MDP M. The agent is given s and chooses an action a. It then receives an immediate reward r \u223c R(s, a) and is transported to a next state s\u2032 \u223c T (s, a). 
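To illustrate the dynamics in Equation 8, the following toy simulation draws transitions from a linearly parameterized model; the basis function \u03c6, the matrix M, and \u03c3 below are invented for the example (the model only requires ||\u03c6|| \u2264 1 and row norms of M at most 1).

```python
import numpy as np

n_S, n_A = 2, 1          # state and action dimensions (arbitrary for the example)
sigma = 0.01             # known noise standard deviation

def phi(x, u):
    """A toy basis function of the state-action pair, scaled so ||phi|| <= 1."""
    z = np.concatenate([x, u])
    return z / max(1.0, np.linalg.norm(z))

# A hypothetical dynamics matrix M (n_S x n, here n = n_S + n_A = 3);
# each row has 2-norm at most 1, as the model requires.
M = np.array([[0.4, 0.1, 0.2],
              [0.0, 0.5, 0.3]])

rng = np.random.default_rng(0)
x = np.zeros(n_S)
traj = [x]
for t in range(50):
    u = np.array([np.sin(0.1 * t)])                       # arbitrary action sequence
    x = M @ phi(x, u) + sigma * rng.standard_normal(n_S)  # Equation 8
    traj.append(x)
```

Because ||\u03c6|| \u2264 1 and the rows of M have norm at most 1, the state stays bounded regardless of the action sequence.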
This procedure then repeats forever. The first state occupied by the agent may be chosen arbitrarily.\n\nA policy is any strategy for choosing actions. We assume (unless noted otherwise) that all rewards lie in the interval [0, 1]. For any policy \u03c0, let V^\u03c0_M(s) (Q^\u03c0_M(s, a)) denote the discounted, infinite-horizon value (action-value) function for \u03c0 in M (which may be omitted from the notation) from state s. Specifically, let st and rt be the tth encountered state and received reward, respectively, resulting from execution of policy \u03c0 in some MDP M from state s0. Then, V^\u03c0_M(s) = E[\u2211_{j=0}^{\u221e} \u03b3^j rj | s0 = s]. The optimal policy is denoted \u03c0\u2217 and has value functions V\u2217_M(s) and Q\u2217_M(s, a). Note that a policy cannot have a value greater than vmax := 1/(1 \u2212 \u03b3) by the assumption of a maximum reward of 1.\n2.2 Algorithm\n\nFirst, we discuss how to use an admissible learning algorithm for KLRP to construct an MDP model. We proceed by specifying the transition model for each of the (infinitely many) state-action pairs. Given a fixed state-action pair (s, a), we need to estimate the next-state distribution of the MDP from past experience, which consists of input state-action pairs (transformed by the nonlinear function \u03c6) and output next states. For each state component i \u2208 {1, . . . 
, nS}, we have a separate learning problem that can be solved by any instance Ai of an admissible KLRP algorithm.3 If each instance makes a valid prediction (not \u2205), then we simply construct an approximate next-state distribution whose ith component is normally distributed with variance \u03c32 and whose mean is given by the prediction of Ai (this procedure is equivalent to constructing an approximate transition matrix \u02c6M whose ith row is equal to the transpose of the approximate parameter vector \u02c6\u03b8 learned by Ai).\nIf any instance of our KLRP algorithm predicts \u2205 for state-action pair (s, a), then we cannot estimate the next-state distribution. Instead, we make s highly rewarding in the MDP model to encourage exploration, as done in the R-MAX algorithm (Brafman & Tennenholtz, 2002). Following the terminology introduced by Kearns and Singh (2002), we call such a state (state-action) an \u201cunknown\u201d state (state-action), and we ensure that the value function of our model assigns vmax (the maximum possible) to state s. The standard way to satisfy this condition for finite MDPs is to make the transition function for action a from state s a self-loop with reward 1 (yielding a value of vmax = 1/(1 \u2212 \u03b3) for state s). We can effect exactly the same result in a continuous MDP by adding a component to each state vector s and to each vector \u03c6(s, a) for every state-action pair (s, a). If (s, a) is \u201cunknown\u201d, we set the value of the additional components (of \u03c6(s, a) and s) to 1; otherwise, we set them to 0. We add an additional row and column to M that preserves this extra component (during the transformation from \u03c6(s, a) to the next state s\u2032) and otherwise doesn\u2019t change the next-state distribution. Finally, we give a reward of 1 to any unknown state, leaving rewards for the known states unchanged. 
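The augmentation just described can be sketched in a few lines; the function names and toy dimensions below are ours, not the paper's.

```python
import numpy as np

def augment_matrix(M_hat):
    """Extend an n_S x n model matrix with the extra row and column that
    pass the "unknown" flag component through unchanged."""
    n_S, n = M_hat.shape
    M_aug = np.zeros((n_S + 1, n + 1))
    M_aug[:n_S, :n] = M_hat
    M_aug[n_S, n] = 1.0     # the added row/column preserve the flag
    return M_aug

def augment_features(phi_sa, unknown):
    """Append the flag component: 1 if (s, a) is "unknown", else 0."""
    return np.append(phi_sa, 1.0 if unknown else 0.0)

# Toy check with n_S = 1, n = 2: for an "unknown" (s, a) the successor
# state's flag component stays 1, emulating the rewarding self-loop.
M_aug = augment_matrix(np.array([[0.5, 0.2]]))
nxt = M_aug @ augment_features(np.array([0.3, 0.1]), unknown=True)
```

For a known pair the appended component is 0, so the extra row and column leave the original next-state distribution untouched.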
Pseudocode for the resulting KWIK-RMAX algorithm is provided in Algorithm 2.\n\nTheorem 2 For any \u03b5 and \u03b4, the KWIK-RMAX algorithm executes an \u03b5-optimal policy on at most a polynomial (in n, nS, 1/\u03b5, 1/\u03b4, and 1/(1 \u2212 \u03b3)) number of steps, with probability at least 1 \u2212 \u03b4.\n2 The algorithm can be modified to deal with bounds (on the norms of the rows of M) that are larger than one.\n3 One minor technical detail is that our KLRP setting requires bounded outputs (see Definition 1) while our application to MBRL requires dealing with normal, and hence unbounded, outputs. This is easily dealt with by ignoring any extremely large (or small) outputs and showing that the resulting norm of the truncated normal distribution learned by each instance Ai is very close to the norm of the untruncated distribution.\n\nAlgorithm 2 KWIK-RMAX Algorithm\n0: Inputs: nS, nA, n, R, \u03c6(\u00b7,\u00b7), \u03c3, \u03b3, \u03b5, \u03b4, and admissible learning algorithm ModelLearn.\n1: for all state components i \u2208 {1, . . . , nS} do\n2: Initialize a new instantiation of ModelLearn, denoted Ai, with inputs C\u03b5(1 \u2212 \u03b3)^2/(2\u221an) and \u03b4/nS for inputs \u03b5 and \u03b4, respectively, in Definition 2, where C is some constant determined by the analysis.\n3: end for\n4: Initialize an MDP Model with state space S, action space A, reward function R, discount factor \u03b3, and transition function specified by Ai for i \u2208 {1, . . . , nS} as described above.\n5: for t = 1, 2, 3, \u00b7\u00b7\u00b7 do\n6: Let s denote the state at time t.\n7: Choose action a := \u02c6\u03c0\u2217(s), where \u02c6\u03c0\u2217 is the optimal policy of the MDP Model.\n8: Let s\u2032 be the next state after executing action a.\n9: for all state components i \u2208 {1, . . . , nS} do\n10: Present input-output pair (\u03c6(s, a), s\u2032(i)) to Ai.\n11: end for\n12: Update MDP Model.\n13: end for\n\n2.3 Analysis\n\nProof sketch: (of Theorem 2) It can be shown that, with high probability, the policy \u02c6\u03c0\u2217 is either an \u03b5-optimal policy (V^{\u02c6\u03c0\u2217}(s) \u2265 V\u2217(s) \u2212 \u03b5) or it is very likely to lead to an unknown state. However, the number of times the latter event can occur is bounded by the maximum number of times the instances Ai can predict \u2205, which is polynomial in the relevant parameters. \u25a1\n\n2.4 The Planning Assumption\n\nWe have shown that the KWIK-RMAX algorithm acts near-optimally on all but a small (polynomial) number of timesteps, with high probability. Unfortunately, to do so, the algorithm must solve its internal MDP model completely and exactly. It is easy to extend the analysis to allow an \u03b5-approximate solution. However, it is not clear whether even this approximate computation can be done efficiently. In any case, discretization of the state space can be used, which yields computational complexity that is exponential in the number of (state and action) dimensions of the problem, similar to the work of Chow and Tsitsiklis (1991). Alternatively, sparse sampling can be used, whose complexity has no dependence on the size of the state space but depends exponentially on the time horizon (\u2248 1/(1 \u2212 \u03b3)) (Kearns et al., 1999). Practically, there are many promising techniques that make use of value-function approximation for fast and efficient solution (planning) of MDPs (Sutton & Barto, 1998). Nevertheless, it remains future work to fully analyze the complexity of planning.\n\n2.5 Related Work\n\nThe general exploration problem in continuous state spaces was considered by Kakade et al. (2003), and at a high level our approach to exploration is similar in spirit. 
However, a direct application of Kakade et al.\u2019s (2003) algorithm to linearly-parameterized MDPs results in an algorithm whose sample complexity scales exponentially, rather than polynomially, with the state-space dimension. This is because the analysis incurs a factor equal to the size of a \u201ccover\u201d of the metric space. Reinforcement learning in continuous MDPs with linear dynamics was studied by Fiechter (1997). However, an exact linear relationship between the current state and next state is required for that analysis to go through, while we allow the current state to be transformed (for instance, by adding non-linear state features) through the non-linear function \u03c6. Furthermore, Fiechter\u2019s algorithm relied on the existence of a \u201creset\u201d action and a specific form of reward function. These assumptions admit a solution that follows a fixed policy and doesn\u2019t depend on the actual history of the agent or the underlying MDP. The model that we consider, linearly parameterized MDPs, is taken directly from the work by Abbeel and Ng (2005), where it was justified in part by an application to robotic helicopter flight. In that work, a provably efficient algorithm was developed in the apprenticeship RL setting. In this setting, the algorithm is given limited access (a polynomial number of calls) to a fixed policy (called the teacher\u2019s policy). With high probability, a policy is learned that is nearly as good as the teacher\u2019s policy. Although this framework is interesting and perhaps more useful for certain applications (such as helicopter flying), it requires a priori expert knowledge (to construct the teacher) and avoids the problem of exploration altogether. 
In addition, Abbeel and Ng\u2019s (2005) algorithm relies heavily on a reset assumption, while ours does not.\n\nConclusion\n\nWe have provided a provably efficient RL algorithm that learns a very rich and important class of MDPs with continuous state and action spaces. Yet, many real-world MDPs do not satisfy the linearity assumption, a concern we now address. Our RL algorithm utilized a specific online linear regression algorithm. We have identified certain interesting and general properties (see Definition 2) of this particular algorithm that support online exploration. These properties are meaningful without the linearity assumption and should be useful for the development of new algorithms under different modeling assumptions. The broader goal of this paper is to work towards developing a general technique for applying regression algorithms (as black boxes) to model-based reinforcement-learning algorithms in a robust and formally justified way. We believe the approach used with linear regression can be repeated for other important classes, but we leave the details as interesting future work.\n\nAcknowledgements\n\nWe thank NSF and DARPA IPTO for support.\n\nReferences\n\nAbbeel, P., & Ng, A. Y. (2005). Exploration and apprenticeship learning in reinforcement learning. ICML \u201905: Proceedings of the 22nd international conference on Machine learning (pp. 1\u20138). New York, NY, USA: ACM Press.\n\nAuer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3, 397\u2013422.\n\nBrafman, R. I., & Tennenholtz, M. (2002). R-MAX\u2014a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3, 213\u2013231.\n\nChow, C.-S., & Tsitsiklis, J. N. (1991). An optimal one-way multigrid algorithm for discrete time stochastic control. IEEE Transactions on Automatic Control, 36, 898\u2013914.\n\nFiechter, C.-N. 
(1997). PAC adaptive control of linear systems. Tenth Annual Conference on Computational Learning Theory (COLT) (pp. 72\u201380).\n\nGolub, G. H., & Van Loan, C. F. (1996). Matrix computations. Baltimore, Maryland: The Johns Hopkins University Press. 3rd edition.\n\nKakade, S. M. (2003). On the sample complexity of reinforcement learning. Doctoral dissertation, Gatsby Computational Neuroscience Unit, University College London.\n\nKakade, S. M., Kearns, M. J., & Langford, J. C. (2003). Exploration in metric state spaces. Proceedings of the 20th International Conference on Machine Learning (ICML-03).\n\nKearns, M., Mansour, Y., & Ng, A. Y. (1999). A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99) (pp. 1324\u20131331).\n\nKearns, M. J., & Singh, S. P. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49, 209\u2013232.\n\nNg, A. Y., Kim, H. J., Jordan, M. I., & Sastry, S. (2003). Autonomous helicopter flight via reinforcement learning. Advances in Neural Information Processing Systems 16 (NIPS-03).\n\nSeber, G. A. F., & Lee, A. J. (2003). Linear regression analysis. Wiley-Interscience.\n\nSutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. The MIT Press.\n\nTesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6, 215\u2013219.\n", "award": [], "sourceid": 631, "authors": [{"given_name": "Alexander", "family_name": "Strehl", "institution": null}, {"given_name": "Michael", "family_name": "Littman", "institution": null}]}