{"title": "Robust Data-Driven Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 827, "page_last": 835, "abstract": "In stochastic optimal control the distribution of the exogenous noise is typically unknown and must be inferred from limited data before dynamic programming (DP)-based solution schemes can be applied. If the conditional expectations in the DP recursions are estimated via kernel regression, however, the historical sample paths enter the solution procedure directly as they determine the evaluation points of the cost-to-go functions. The resulting data-driven DP scheme is asymptotically consistent and admits efficient computational solution when combined with parametric value function approximations. If training data is sparse, however, the estimated cost-to-go functions display a high variability and an optimistic bias, while the corresponding control policies perform poorly in out-of-sample tests. To mitigate these small sample effects, we propose a robust data-driven DP scheme, which replaces the expectations in the DP recursions with worst-case expectations over a set of distributions close to the best estimate. We show that the arising min-max problems in the DP recursions reduce to tractable conic programs. We also demonstrate that this robust algorithm dominates state-of-the-art benchmark algorithms in out-of-sample tests across several application domains.", "full_text": "Robust Data-Driven Dynamic Programming\n\nGrani A. Hanasusanto\nImperial College London\nLondon SW7 2AZ, UK\n\ng.hanasusanto11@imperial.ac.uk\n\nDaniel Kuhn\n\n\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne\n\nCH-1015 Lausanne, Switzerland\n\ndaniel.kuhn@epfl.ch\n\nAbstract\n\nIn stochastic optimal control the distribution of the exogenous noise is typically\nunknown and must be inferred from limited data before dynamic programming\n(DP)-based solution schemes can be applied. 
If the conditional expectations in the DP recursions are estimated via kernel regression, however, the historical sample paths enter the solution procedure directly as they determine the evaluation points of the cost-to-go functions. The resulting data-driven DP scheme is asymptotically consistent and admits an efficient computational solution when combined with parametric value function approximations. If training data is sparse, however, the estimated cost-to-go functions display a high variability and an optimistic bias, while the corresponding control policies perform poorly in out-of-sample tests. To mitigate these small sample effects, we propose a robust data-driven DP scheme, which replaces the expectations in the DP recursions with worst-case expectations over a set of distributions close to the best estimate. We show that the arising min-max problems in the DP recursions reduce to tractable conic programs. We also demonstrate that the proposed robust DP algorithm dominates various non-robust schemes in out-of-sample tests across several application domains.

1 Introduction

We consider a stochastic optimal control problem in discrete time with continuous state and action spaces. At any time t the state of the underlying system has two components. The endogenous state st ∈ R^d1 captures all decision-dependent information, while the exogenous state ξt ∈ R^d2 captures the external random disturbances. Conditional on (st, ξt) the decision maker chooses a control action ut ∈ Ut ⊆ R^m and incurs a cost ct(st, ξt, ut). From time t to t + 1 the system then migrates to a new state (st+1, ξt+1). Without much loss of generality we assume that the endogenous state obeys the recursion st+1 = gt(st, ut, ξt+1), while the evolution of the exogenous state can be modeled by a Markov process. 
Note that even if the exogenous state process has finite memory, it can be reformulated as an equivalent Markov process on a higher-dimensional space. Thus, the Markov assumption is unrestrictive for most practical purposes. By Bellman's principle of optimality, a decision maker aiming to minimize the expected cumulative costs solves the dynamic program

    Vt(st, ξt) = min_{ut ∈ Ut}  ct(st, ξt, ut) + E[Vt+1(st+1, ξt+1) | ξt]
                 s. t.  st+1 = gt(st, ut, ξt+1)                                  (1)

backwards for t = T, . . . , 1 with VT+1 ≡ 0; see e.g. [1]. The cost-to-go function Vt(st, ξt) quantifies the minimum expected future cost achievable from state (st, ξt) at time t.

Stochastic optimal control has numerous applications in engineering and science, e.g. in supply chain management, power systems scheduling, behavioral neuroscience, asset allocation, emergency service provisioning, etc. [1, 2]. There is often a natural distinction between endogenous and exogenous states. For example, in inventory control the inventory level can naturally be interpreted as the endogenous state, while the uncertain demand represents the exogenous state.

In spite of their exceptional modeling power, dynamic programming problems of the above type suffer from two major shortcomings that limit their practical applicability. First, the backward induction step (1) is computationally burdensome: it requires evaluating the cost-to-go function Vt for a continuum of states (st, ξt), computing multivariate conditional expectations, and optimizing over a continuum of control actions ut [2]. Secondly, even if the dynamic programming recursions (1) could be computed efficiently, there is often substantial uncertainty about the conditional distribution of ξt+1 given ξt. 
Indeed, the distribution of the exogenous states is typically unknown and must be inferred from historical observations. If training data is sparse—as is often the case in practice—it is impossible to estimate this distribution reliably. Thus, we lack essential information to evaluate (1) in the first place.

In this paper, we assume that only a set of N sample trajectories of the exogenous state is given, and we use kernel regression in conjunction with parametric value function approximations to estimate the conditional expectation in (1). Thus, we approximate the conditional distribution of ξt+1 given ξt by a discrete distribution whose discretization points are given by the historical samples, while the corresponding conditional probabilities are expressed in terms of a normalized Nadaraya-Watson (NW) kernel function. This data-driven dynamic programming (DDP) approach is conceptually appealing and avoids an artificial separation of estimation and optimization steps. Instead, the historical samples are used directly in the dynamic programming recursions. It is also asymptotically consistent in the sense that the true conditional expectation is recovered as N grows [3]. Moreover, DDP computes the value functions only on the N sample trajectories of the exogenous state, thereby mitigating one of the intractabilities of classical dynamic programming.

Although conceptually and computationally appealing, DDP-based policies exhibit a poor performance in out-of-sample tests if the training data is sparse. In this case the estimate of the conditional expectation in (1) is highly noisy (but largely unbiased). The estimate of the corresponding cost-to-go value inherits this variability. However, it also displays a downward bias caused by the minimization over ut. This phenomenon is reminiscent of overfitting effects in statistics. 
As estimation errors in the cost-to-go functions are propagated through the dynamic programming recursions, the bias grows over time and thus induces poor control decisions in the early time periods.

The detrimental overfitting effects observed in DDP originate from ignoring distributional uncertainty: DDP takes the estimated discrete conditional distribution of ξt+1 at face value and ignores the possibility of estimation errors. In this paper we propose a robust data-driven dynamic programming (RDDP) approach that replaces the expectation in (1) by a worst-case expectation over a set of distributions close to the nominal estimate with respect to the χ2-distance. We will demonstrate that this regularization reduces both the variability and the bias in the approximate cost-to-go functions and that RDDP dominates ordinary DDP as well as other popular benchmark algorithms in out-of-sample tests. Leveraging recent results in robust optimization [4] and value function approximation [5] we will also show that the nested min-max problems arising in RDDP typically reduce to conic optimization problems that admit an efficient solution with interior point algorithms.

Robust value iteration methods have recently been studied in robust Markov decision process (MDP) theory [6, 7, 8, 9]. However, these algorithms are not fundamentally data-driven as their primitives are uncertainty sets for the transition kernels instead of historical observations. Moreover, they assume finite state and action spaces. Data-driven approaches to dynamic decision making are routinely studied in approximate dynamic programming and reinforcement learning [10, 11, 12], but these methods are not robust (in a worst-case sense) with respect to distributional uncertainty and could therefore be susceptible to overfitting effects. 
The robust value iterations in RDDP are facilitated by combining convex parametric function approximation methods (to model the dependence on the endogenous state) with nonparametric kernel regression techniques (for the dependence on the exogenous state). This is in contrast to most existing methods, which either rely exclusively on parametric function approximations [10, 11, 13] or nonparametric ones [12, 14, 15, 16]. Due to the convexity in the endogenous state, RDDP further benefits from mathematical programming techniques to optimize over high-dimensional continuous action spaces without requiring any form of discretization.

Notation. We use lower-case bold face letters to denote vectors and upper-case bold face letters to denote matrices. We define 1 ∈ R^n as the vector with all elements equal to 1, while Δ = {p ∈ R^n_+ : 1^⊤ p = 1} denotes the probability simplex in R^n. The dimensions of 1 and Δ will usually be clear from the context. The space of symmetric matrices of dimension n is denoted by S^n. For any two matrices X, Y ∈ S^n, the relation X ⪰ Y implies that X − Y is positive semidefinite.

2 Data-driven dynamic programming

Assume from now on that the distribution of the exogenous states is unknown and that we are only given N observation histories {ξ^i_t}^T_{t=1} for i = 1, . . . , N. This assumption is typically well justified in practice. In this setting, the conditional expectation in (1) cannot be evaluated exactly. 
However, it can be estimated, for instance, via Nadaraya-Watson (NW) kernel regression [17, 18]:

    E[Vt+1(st+1, ξt+1) | ξt] ≈ Σ^N_{i=1} qti(ξt) Vt+1(s^i_{t+1}, ξ^i_{t+1}).            (2)

The conditional probabilities in (2) are set to

    qti(ξt) = K_H(ξt − ξ^i_t) / Σ^N_{k=1} K_H(ξt − ξ^k_t),                              (3)

where the kernel function K_H(ξ) = |H|^{−1/2} K(|H|^{−1/2} ξ) is defined in terms of a symmetric multivariate density K and a positive definite bandwidth matrix H. For a large bandwidth, the conditional probabilities qti(ξt) converge to 1/N, in which case (2) reduces to the (unconditional) sample average. Conversely, an extremely small bandwidth causes most of the probability mass to be assigned to the sample point closest to ξt. In the following we set the bandwidth matrix H to its best estimate assuming that the historical observations {ξ^i_t}^N_{i=1} follow a Gaussian distribution; see [19].

Substituting (2) into (1) results in the data-driven dynamic programming (DDP) formulation

    V^d_t(st, ξt) = min_{ut ∈ Ut}  ct(st, ξt, ut) + Σ^N_{i=1} qti(ξt) V^d_{t+1}(s^i_{t+1}, ξ^i_{t+1})
                    s. t.  s^i_{t+1} = gt(st, ut, ξ^i_{t+1}) ∀i                         (4)

with terminal condition V^d_{T+1} ≡ 0. The idea to use kernel-based approximations to estimate the expected future costs is appealing due to its simplicity. Such approximations have been studied, for example, in the context of stochastic optimization with state observation [20]. However, to the best of our knowledge they have not yet been used in a fully dynamic setting—maybe for the reasons to be outlined in § 3. On the positive side, DDP with NW kernel regression is asymptotically consistent for large N under a suitable scaling of the bandwidth matrix and under a mild boundedness assumption on V^d_{t+1} [3]. 
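For concreteness, the NW weights (3) can be computed in a few lines. The sketch below instantiates the generic density K as a Gaussian (a common choice; the paper leaves K unspecified), and the function name `nw_weights` and toy data are ours:

```python
import numpy as np

def nw_weights(xi_t, samples, H):
    """Nadaraya-Watson weights q_ti(xi_t) as in (3) for a Gaussian kernel.

    xi_t    : query point of the exogenous state, shape (d2,)
    samples : historical observations xi^i_t, shape (N, d2)
    H       : positive definite bandwidth matrix, shape (d2, d2)
    """
    Hinv = np.linalg.inv(H)
    # K_H(x) = |H|^(-1/2) K(|H|^(-1/2) x); the |H|^(-1/2) factors cancel
    # after normalization, so only the quadratic form in the exponent matters.
    diffs = samples - xi_t                                  # (N, d2)
    expo = -0.5 * np.einsum('ni,ij,nj->n', diffs, Hinv, diffs)
    w = np.exp(expo - expo.max())                           # numerically stabilized
    return w / w.sum()

# toy usage: 5 historical observations of a 2-dimensional exogenous state
rng = np.random.default_rng(0)
samples = rng.normal(size=(5, 2))
q = nw_weights(np.zeros(2), samples, H=0.5 * np.eye(2))
```

As the bandwidth shrinks, the weights concentrate on the sample closest to the query point, matching the limiting behavior described above.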
Moreover, DDP evaluates the cost-to-go function of the next period only at the sample points and thus requires no a-priori discretization of the exogenous state space, which mitigates one of the intractabilities of classical dynamic programming.

3 Robust data-driven dynamic programming

If the training data is sparse, the NW estimate (2) of the conditional expectation in (4) typically exhibits a small bias and a high variability. Indeed, the variance of the estimator scales as O(1/N) [21]. The DDP value function V^d_t inherits this variability. However, it also displays a significant optimistic bias. The following stylized example illustrates this phenomenon.

Example 3.1 Assume that d1 = 1, d2 = m = 5, ct(st, ξt, ut) = 0, gt(st, ut, ξt+1) = ξ^⊤_{t+1} ut, Ut = {u ∈ R^m : 1^⊤ u = 1} and Vt+1(st+1, ξt+1) = (1/10) s²_{t+1} − st+1. In order to facilitate a controlled experiment, we also assume that (ξt, ξt+1) follows a multivariate Gaussian distribution, where each component has unit mean and variance. The correlation between ξt,k and ξt+1,k is set to 30%. All other correlations are zero. Our aim is to solve (1) and to estimate Vt(st, ξt) at ξt = 1. By permutation symmetry, the optimal decision under full distributional knowledge is u*_t = (1/5) 1. An analytical calculation then yields the true cost-to-go value Vt(st, 1) = −0.88. In the following we completely ignore our distributional knowledge. Instead, we assume that only N independent samples (ξ^i_t, ξ^i_{t+1}) are given, i = 1, . . . , N. To showcase the high variability of NW estimation, we fix the decision u*_t and use (2) to estimate its expected cost conditional on ξt = 1. Figure 1 (left) shows that this estimator is unbiased but fluctuates within ±5% around its median even for N = 500. 
Next, we use (4) to estimate V^d_t(st, 1), that is, the expected cost of the best decision obtained without distributional information. Figure 1 (middle) shows that this cost estimator is even more noisy than the one for a fixed decision, exhibits a significant downward bias and converges slowly as N grows.

Figure 1: Estimated costs of true optimal and data-driven decisions. Note the different scales. All reported values represent averages over 200 independent simulation runs.

The downward bias in V^d_t as an estimator for the true value function Vt is the consequence of an overfitting effect, which can be explained as follows. Setting Vt+1 ≡ V^d_{t+1}, we find

    Vt(st, ξt) = min_{ut ∈ Ut}  ct(st, ξt, ut) + E[V^d_{t+1}(gt(st, ut, ξt+1), ξt+1) | ξt]
               ≈ min_{ut ∈ Ut}  ct(st, ξt, ut) + E[ Σ^N_{i=1} qti(ξt) V^d_{t+1}(gt(st, ut, ξ^i_{t+1}), ξ^i_{t+1}) | ξt ]
               ≥ E[ min_{ut ∈ Ut}  ct(st, ξt, ut) + Σ^N_{i=1} qti(ξt) V^d_{t+1}(gt(st, ut, ξ^i_{t+1}), ξ^i_{t+1}) | ξt ].

The relation in the second line uses our observation that the NW estimator of the expected cost associated with any fixed decision ut is approximately unbiased. Here, the expectation is with respect to the (independent and identically distributed) sample trajectories used in the NW estimator. The last line follows from the conditional Jensen inequality. Note that the expression inside the conditional expectation coincides with V^d_t(st, ξt). This argument suggests that V^d_t(st, ξt) must indeed underestimate Vt(st, ξt) on average. 
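This Jensen-type bias is easy to reproduce numerically. The toy experiment below (our own construction, deliberately simpler than Example 3.1) considers m actions whose true expected cost is identical, estimates each cost from N samples, and then minimizes over the estimates as in (4):

```python
import numpy as np

# Toy illustration of the optimistic bias: m actions with true expected
# cost 0 each, so min_u E[cost(u)] = 0. Minimizing the sample-average
# estimates is downward biased by Jensen's inequality.
rng = np.random.default_rng(1)
m, N, runs = 5, 20, 2000
plug_in = np.empty(runs)
for r in range(runs):
    # sample-average cost estimate for each of the m actions
    cost_estimates = rng.normal(0.0, 1.0, size=(N, m)).mean(axis=0)
    plug_in[r] = cost_estimates.min()   # estimated cost of the "best" action
print(plug_in.mean())                   # noticeably below the true optimum 0
```

Each individual estimate is unbiased, yet the minimum over them is systematically optimistic, mirroring the downward bias of V^d_t.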
We emphasize that all systematic estimation errors of this type accumulate as they are propagated through the dynamic programming recursions.

To mitigate the detrimental overfitting effects, we propose a regularization that reduces the decision maker's overconfidence in the weights qt(ξt) = [qt1(ξt) . . . qtN(ξt)]^⊤. Thus, we allow the conditional probabilities used in (4) to deviate from their nominal values qt(ξt) up to a certain degree. This is achieved by considering uncertainty sets Δ(q) that contain all weight vectors sufficiently close to some nominal weight vector q ∈ Δ with respect to the χ2-distance for histograms.

    Δ(q) = {p ∈ Δ : Σ^N_{i=1} (pi − qi)²/pi ≤ γ}                                 (5)

The χ2-distance belongs to the class of φ-divergences [22], which also includes the Kullback-Leibler distance. Our motivation for using uncertainty sets of the type (5) is threefold. First, Δ(q) is determined by a single size parameter γ, which can easily be calibrated, e.g., via cross-validation. Secondly, the χ2-distance guarantees that any distribution p ∈ Δ(q) assigns nonzero probability to all scenarios that have nonzero probability under the nominal distribution q. Finally, the structure of Δ(q) implied by the χ2-distance has distinct computational benefits that become evident in § 4.

Allowing the conditional probabilities in (4) to range over the uncertainty set Δ(qt(ξt)) results in the robust data-driven dynamic programming (RDDP) formulation

    V^r_t(st, ξt) = min_{ut ∈ Ut}  ct(st, ξt, ut) + max_{p ∈ Δ(qt(ξt))} Σ^N_{i=1} pi V^r_{t+1}(s^i_{t+1}, ξ^i_{t+1})
                    s. t.  s^i_{t+1} = gt(st, ut, ξ^i_{t+1}) ∀i                  (6)

with terminal condition V^r_{T+1} ≡ 0. Thus, each RDDP recursion involves the solution of a robust optimization problem [4], which can be viewed as a game against 'nature' (or a malicious adversary): for every action ut chosen by the decision maker, nature selects the corresponding worst-case weight vector from within Δ(qt(ξt)). By anticipating nature's moves, the decision maker is forced to select more conservative decisions that are less susceptible to amplifying estimation errors in the nominal weights qt(ξt). The level of robustness of the RDDP scheme can be steered by selecting the parameter γ. We suggest choosing γ large enough that the envelope of all conditional CDFs of ξt+1 implied by the weight vectors in Δ(qt(ξt)) covers the true conditional CDF with high confidence (Figure 2). The following example illustrates the potential benefits of the RDDP scheme.

Example 3.2 Consider again Example 3.1. Assuming that only the samples {(ξ^i_t, ξ^i_{t+1})}^N_{i=1} are known, we can compute a worst-case optimal decision using (6). Fixing this decision, we can then use (2) to estimate its expected cost conditional on ξt = 1. Note that this cost is generically different from V^r_t(st, 1). Figure 1 (right) shows that the resulting cost estimator is less noisy and—perhaps surprisingly—unbiased. 
Thus, it clearly dominates V^d_t(st, 1) as an estimator for the true cost-to-go value Vt(st, 1) (which is not accessible in reality as it relies on full distributional information).

Robust optimization models with uncertainty sets of the type (5) have previously been studied in [23, 24]. However, these static models are fundamentally different in scope from our RDDP formulation. RDDP seeks the worst-case probabilities of N historical samples of the exogenous state, using the NW weights as nominal probabilities. In contrast, the static models in [23, 24] rely on a partition of the uncertainty space into N bins. Worst-case probabilities are then assigned to the bins, whose nominal probabilities are given by the empirical frequencies. This latter approach does not seem to extend easily to our dynamic setting as it would be unclear where in each bin one should evaluate the cost-to-go functions.

Instead of immunizing the DDP scheme against estimation errors in the conditional probabilities (as advocated here), one could envisage other regularizations to mitigate the overfitting phenomena. For instance, one could construct an uncertainty set for (ξ^i_{t+1})^N_{i=1} and seek control actions that are optimal in view of the worst-case sample points within this set. However, this approach would lead to a harder robust optimization problem, where the search space of the inner maximization has dimension O(N d2) (as opposed to O(N) for RDDP). Moreover, this approach would only be tractable if V^r_{t+1} displayed a very regular (e.g., linear or quadratic) dependence on ξt+1. RDDP imposes no such restrictions on the cost-to-go function; see § 4.

Figure 2: Envelope of all conditional CDFs implied by weight vectors in Δ(qt(ξt)).

4 Computational solution procedure

In this section we demonstrate that RDDP is computationally tractable under a convexity assumption and if we approximate the dependence of the cost-to-go functions on the endogenous state through a piecewise linear or quadratic approximation architecture. This result immediately extends to the DDP scheme of § 2 as the uncertainty set (5) collapses to a singleton for γ = 0.

Assumption 4.1 For all t = 1, . . . , T, the cost function ct is convex quadratic in (st, ut), the transition function gt is affine in (st, ut), and the feasible set Ut is second-order conic representable.

Under Assumption 4.1, V^r_t(st, ξt) can be evaluated by solving a convex optimization problem.

Theorem 4.1 Suppose that Assumption 4.1 holds and that the cost-to-go function V^r_{t+1} is convex in the endogenous state. Then, (6) reduces to the following convex minimization problem.

    V^r_t(st, ξt) = min  ct(st, ξt, ut) + λγ − μ − 2 qt(ξt)^⊤ y + 2λ qt(ξt)^⊤ 1
                    s. t.  ut ∈ Ut,  μ ∈ R,  λ ∈ R+,  z, y ∈ R^N
                           V^r_{t+1}(gt(st, ut, ξ^i_{t+1}), ξ^i_{t+1}) ≤ zi  ∀i
                           zi + μ ≤ λ,  √(4y²_i + (zi + μ)²) ≤ 2λ − zi − μ  ∀i   (7)

Corollary 4.1 If Assumption 4.1 holds, then RDDP preserves convexity in the endogenous state. Thus, V^r_t(st, ξt) is convex in st whenever V^r_{t+1}(st+1, ξt+1) is convex in st+1.

Note that problem (7) becomes a tractable second-order cone program if V^r_{t+1} is convex piecewise linear or convex quadratic in st+1. 
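To make the inner maximization in (6) concrete, the sketch below computes the worst-case expectation over the χ2-ball (5) with a generic nonlinear solver from `scipy` (the paper instead dualizes this inner problem into the conic form (7); the function name and toy data are ours):

```python
import numpy as np
from scipy.optimize import minimize

def worst_case_expectation(v, q, gamma):
    """max_{p in Delta(q)} p^T v with the chi^2-ball (5).

    v     : cost-to-go values at the N sample points
    q     : nominal NW weights (a point in the simplex)
    gamma : radius of the chi^2-ball
    Solved here with SLSQP for illustration only; a conic reformulation
    as in (7) is the efficient route.
    """
    n = len(v)
    cons = [
        {'type': 'eq',   'fun': lambda p: p.sum() - 1.0},           # p in simplex
        {'type': 'ineq', 'fun': lambda p: gamma - np.sum((p - q) ** 2 / p)},
    ]
    res = minimize(lambda p: -(v @ p), q, method='SLSQP',
                   bounds=[(1e-9, 1.0)] * n, constraints=cons)
    return -res.fun, res.x

v = np.array([1.0, 2.0, 4.0])    # cost-to-go values at three samples
q = np.ones(3) / 3               # nominal weights
wc, p = worst_case_expectation(v, q, gamma=0.1)
```

By construction the worst-case value dominates the nominal estimate q^⊤ v, and it grows with γ, which is exactly the conservatism that counteracts the optimistic bias of DDP.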
Such problems can be solved efficiently with interior point algorithms.

Algorithm 1: Robust data-driven dynamic programming

Inputs: Sample trajectories {s^k_t}^T_{t=1} for k = 1, . . . , K; observation histories {ξ^i_t}^{T+1}_{t=1} for i = 1, . . . , N.
Initialization: Let V̂^r_{T+1}(·, ξ^i_{T+1}) be the zero function for all i = 1, . . . , N.
for all t = T, . . . , 1 do
    for all i = 1, . . . , N do
        for all k = 1, . . . , K do
            Let V̂^r_{t,k,i} be the optimal value of problem (7) with input V̂^r_{t+1}(·, ξ^j_{t+1}) ∀j.
        end for
        Construct V̂^r_t(·, ξ^i_t) from the interpolation points {(s^k_t, V̂^r_{t,k,i})}^K_{k=1} as in (8a) or (8b).
    end for
end for
Outputs: Approximate cost-to-go functions V̂^r_t(·, ξ^i_t) for i = 1, . . . , N and t = 1, . . . , T.

We now describe an algorithm that computes all cost-to-go functions {V^r_t}^T_{t=1} approximately. Initially, we collect historical observation trajectories of the exogenous state {ξ^i_t}^T_{t=1}, i = 1, . . . , N, and generate sample trajectories of the endogenous state {s^k_t}^T_{t=1}, k = 1, . . . , K, by simulating the evolution of st under a prescribed control policy along randomly selected exogenous state trajectories. Best results are achieved if the sample-generating policy is near-optimal. If no near-optimal policy is known, an initial naive policy can be improved sequentially in a greedy fashion. The core of the algorithm computes approximate value functions V̂^r_t, which are piecewise linear or quadratic in st, by backward induction on t. Iteration t takes V̂^r_{t+1} as an input and computes the optimal value V̂^r_{t,k,i} of the second-order cone program (7) for each sample state (s^k_t, ξ^i_t). For any fixed i we then construct the function V̂^r_t(·, ξ^i_t) from the interpolation points {(s^k_t, V̂^r_{t,k,i})}^K_{k=1}. If the endogenous state is univariate (d1 = 1), the following piecewise linear approximation is used.

    V̂^r_t(st, ξ^i_t) = max_k  [(s^k_t − st)/(s^k_t − s^{k−1}_t)] V̂^r_{t,k−1,i} + [(st − s^{k−1}_t)/(s^k_t − s^{k−1}_t)] V̂^r_{t,k,i}      (8a)

In the multivariate case (d1 > 1), we aim to find the convex quadratic function V̂^r_t(st, ξ^i_t) = s^⊤_t Mi st + 2 m^⊤_i st + m0i that best explains the given interpolation points in a least-squares sense. This quadratic function can be computed efficiently by solving the following semidefinite program.

    min  Σ^K_{k=1} [ (s^k_t)^⊤ Mi s^k_t + 2 m^⊤_i s^k_t + m0i − V̂^r_{t,k,i} ]²
    s. t.  Mi ∈ S^{d1},  Mi ⪰ 0,  mi ∈ R^{d1},  m0i ∈ R                          (8b)

Quadratic approximation architectures of the above type first emerged in approximate dynamic programming [5]. Once the function V̂^r_t(·, ξ^i_t) is computed for all i = 1, . . . , N, the algorithm proceeds to iteration t − 1. A summary of the overall procedure is provided in Algorithm 1.

Remark 4.1 The RDDP algorithm remains valid if the feasible set Ut depends on the state (st, ξt) or if the control action ut includes components that are of the 'here-and-now' type (i.e., they are chosen before ξt+1 is observed) as well as others that are of the 'wait-and-see' type (i.e., they are chosen after ξt+1 has been revealed). In this setting, problem (7) becomes a two-stage stochastic program [25] but remains efficiently solvable as a second-order cone program.

5 Experimental results

We evaluate the RDDP algorithm of § 4 in the context of an index tracking and a wind energy commitment application. 
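The univariate architecture (8a), which the index tracking experiment below relies on, is just the pointwise maximum of the lines through consecutive interpolation points; a minimal numpy sketch (function and data names are ours):

```python
import numpy as np

def pwl_value(s, s_knots, v_knots):
    """Evaluate the piecewise linear architecture (8a): the pointwise
    maximum over the lines through consecutive interpolation points
    (s^{k-1}, v_{k-1}) and (s^k, v_k). s_knots must be sorted."""
    s_knots = np.asarray(s_knots, dtype=float)
    v_knots = np.asarray(v_knots, dtype=float)
    slopes = np.diff(v_knots) / np.diff(s_knots)
    # extend each chord linearly and take the maximum
    lines = v_knots[:-1] + slopes * (s - s_knots[:-1])
    return lines.max()

# toy convex data: for convex interpolation values the max of chord
# extensions reproduces the piecewise linear interpolant on the knots
s_knots = np.array([0.0, 1.0, 2.0, 3.0])
v_knots = s_knots ** 2
val = pwl_value(1.5, s_knots, v_knots)   # chord between s=1 and s=2
```

Taking the maximum of affine pieces keeps the approximation convex in st, which is what Corollary 4.1 requires for the next backward induction step.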
All semidefinite programs are solved with SeDuMi [26] by using the Yalmip [27] interface, while all linear and second-order cone programs are solved with CPLEX.

5.1 Index tracking

The objective of index tracking is to match the performance of a stock index as closely as possible with a portfolio of other financial instruments. In our experiment, we aim to track the S&P 500 index with a combination of the NASDAQ Composite, Russell 2000, S&P MidCap 400, and AMEX Major Market indices. We set the planning horizon to T = 20 trading days (1 month).

Table 1: Out-of-sample statistics of sum of squared tracking errors in ‰.

    Statistic    LSPI     DDP      RDDP
    Mean         5.692    4.697    1.285
    Std. dev.    11.699   15.067   2.235
    90th prct.   14.597   9.048    2.851
    Worst case   126.712  157.201  18.832

Figure 3: Cumulative distribution function of sum of squared tracking errors.

Let st ∈ R+ be the value of the current tracking portfolio relative to the value of S&P 500 on day t, while ξt ∈ R^5_+ denotes the vector of the total index returns (price relatives) from day t − 1 to day t. The first component of ξt represents the return of S&P 500. The objective of index tracking is to maintain st close to 1 in a least-squares sense throughout the planning horizon, which gives rise to the following dynamic program with terminal condition VT+1 ≡ 0.

    Vt(st, ξt) = min  (1 − st)² + E[Vt+1(st+1, ξt+1) | ξt]
                 s. t.  u ∈ R^5_+,  u1 = 0,  1^⊤ u = st,  st+1 = ξ^⊤_{t+1} u / ξt+1,1        (9)

Here, ui/st can be interpreted as the portion of the tracking portfolio that is invested in index i on day t. Our computational experiment is based on historical returns of the indices over 5440 days from 26-Aug-1991 to 8-Mar-2013 (272 trading months). 
We solve the index tracking problem using the DDP and RDDP algorithms (i.e., the algorithm of § 4 with γ = 0 and γ = 10, respectively) as well as least-squares policy iteration (LSPI) [10]. As the endogenous state is univariate, DDP and RDDP employ the piecewise linear approximation architecture (8a). LSPI solves an infinite-horizon variant of problem (9) with discount factor λ = 0.9, polynomial basis features of degree 3 and a discrete action space comprising 1,000 points sampled uniformly from the true continuous action space. We train the algorithms on the first 80 and test on the remaining 192 trading months. Table 1 reports several out-of-sample statistics of the sum of squared tracking errors. We find that RDDP outperforms DDP and LSPI by a factor of 4–5 in terms of the mean, the standard deviation and the 90th percentile of the error distribution, and it outperforms both by an order of magnitude in terms of the worst-case (maximum) error. Figure 3 further shows that the error distribution generated by RDDP stochastically dominates those generated by DDP and LSPI.

[Figure 3: Out-of-sample distribution of the sum of squared tracking errors (in ‰) under LSPI, DDP and RDDP.]

5.2 Wind energy commitment

Next, we apply RDDP to the wind energy commitment problem proposed in [28, 29]. On every day t, a wind energy producer chooses the energy commitment levels x_t ∈ R^24_+ for the next 24 hours. The day-ahead prices π_t ∈ R^24_+ per unit of energy committed are known at the beginning of the day. However, the hourly amounts of wind energy ω_{t+1} ∈ R^24_+ generated over the day are uncertain. If the actual production falls short of the commitment levels, there is a penalty of twice the respective day-ahead price for each unit of unsatisfied demand. The wind energy producer also operates three storage devices indexed by l ∈ {1, 2, 3}, each of which can have a different capacity s̄^l, hourly leakage ρ^l, charging efficiency ρ^l_c and discharging efficiency ρ^l_d. We denote by s^l_{t+1} ∈ R^24_+ the hourly filling levels of storage l over the next 24 hours. The wind producer's objective is to maximize the expected profit over a short-term planning horizon of T = 7 days. The endogenous state is given by the storage levels at the end of day t, s_t = {s^l_{t,24}}_{l=1}^3 ∈ R^3_+, while the exogenous state comprises the day-ahead prices π_t ∈ R^24_+ and the wind energy production levels ω_t ∈ R^24_+ of day t − 1, which are revealed to the producer on day t. Thus, we set ξ_t = (π_t, ω_t). The best bidding and storage strategy can be found by solving the dynamic program

V_t(s_t, ξ_t) = max  π_t^⊤ x_t − 2 π_t^⊤ E[e^u_{t+1} | ξ_t] + E[V_{t+1}(s_{t+1}, ξ_{t+1}) | ξ_t]                (10)
         s.t.  x_t ∈ R^24_+;  e^c_{t+1}, e^w_{t+1}, e^u_{t+1} ∈ R^24_+;  e^{+,l}_{t+1}, e^{−,l}_{t+1}, s^l_{t+1} ∈ R^24_+  ∀l
               ω_{t+1,h} = e^c_{t+1,h} + e^{+,1}_{t+1,h} + e^{+,2}_{t+1,h} + e^{+,3}_{t+1,h} + e^w_{t+1,h}   ∀h
               x_{t,h} = e^c_{t+1,h} + e^{−,1}_{t+1,h} + e^{−,2}_{t+1,h} + e^{−,3}_{t+1,h} + e^u_{t+1,h}   ∀h
               s^l_{t+1,h} = ρ^l s^l_{t+1,h−1} + ρ^l_c e^{+,l}_{t+1,h} − (1/ρ^l_d) e^{−,l}_{t+1,h}   ∀h, l
               s^l_{t+1,h} ≤ s̄^l   ∀h, l

with terminal condition V_{T+1} ≡ 0. Here, we adopt the convention that s^l_{t+1,0} = s^l_{t,24} for all l. Besides the usual here-and-now decisions x_t, the decision vector u_t now also includes wait-and-see decisions that are chosen after ξ_{t+1} has been revealed (see Remark 4.1): e^c represents the amount of wind energy used to meet the commitment, e^{+,l} the amount of wind energy fed into storage l, e^{−,l} the amount of energy from storage l used to meet the commitment, e^w the amount of wind energy that is wasted, and e^u the unmet energy commitment.

Our computational experiment is based on day-ahead prices for the PJM market and wind speed data for North Carolina (33.9375N, 77.9375W) and Ohio (41.8125N, 81.5625W) from 2002 to 2011 (520 weeks). As ξ_t is a 48-dimensional vector with high correlations between its components, we perform principal component analysis to obtain a 6-dimensional subspace that explains more than 90% of the variability of the historical observations. The conditional probabilities q_t(ξ_t) are subsequently estimated using the projected data points. The parameters for the storage devices are taken from [30].

Table 2: Out-of-sample statistics of profit (in $100,000).

Site  Statistic    Persistence      DDP     RDDP
NC    Mean               4.698    4.039    7.549
      Std. dev.          6.338    3.964    5.133
      10th prct.        -1.463    0.524    1.809
      Worst case       -22.666  -11.221    0.481
OH    Mean               4.104    2.746    5.510
      Std. dev.          5.548    3.428    4.500
      10th prct.         0.118    0.154    1.395
      Worst case       -21.317  -12.065    0.280

[Figure 4: Out-of-sample profit distribution for the North Carolina site (Persistence, DDP, RDDP).]
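To make the recourse structure of the one-day problem concrete, the following is a minimal Python sketch for a single storage device. It replaces the optimal wait-and-see decisions of the dynamic program with a simple greedy hourly dispatch rule (wind first, then storage; surplus is charged, the remainder wasted), so it is an illustration of the decision variables e^c, e^{+,l}, e^{−,l}, e^w, e^u rather than the method of the paper; the function `greedy_dispatch` and all parameter values are hypothetical.

```python
import numpy as np

def greedy_dispatch(x, omega, pi, s0, s_cap, rho, rho_c, rho_d):
    """Greedy hourly dispatch for one storage device (illustrative only).

    Meets the commitment x from wind first, then from storage; surplus
    wind is charged up to capacity and the rest is wasted.  The unmet
    commitment e_u is penalised at twice the day-ahead price, matching
    the profit term  pi'x - 2 pi' e_u  of the one-day problem.
    Storage follows  s_h = rho*s_{h-1} + rho_c*e_plus - (1/rho_d)*e_minus.
    """
    s = s0
    profit = float(pi @ x)                       # revenue from the commitment
    for h in range(len(x)):
        s *= rho                                 # hourly leakage
        e_c = min(omega[h], x[h])                # wind used for the commitment
        shortfall = x[h] - e_c
        e_minus = min(shortfall, s * rho_d)      # energy discharged from storage
        s -= e_minus / rho_d
        e_u = shortfall - e_minus                # unmet commitment
        surplus = omega[h] - e_c
        e_plus = min(surplus, (s_cap - s) / rho_c)  # wind charged into storage
        s += rho_c * e_plus                      # surplus - e_plus is wasted
        profit -= 2.0 * pi[h] * e_u              # penalty: twice day-ahead price
    return profit, s

# Persistence heuristic from the experiments: pledge yesterday's wind.
rng = np.random.default_rng(0)
omega_prev = rng.uniform(0.0, 1.0, 24)           # yesterday's production
omega = rng.uniform(0.0, 1.0, 24)                # today's (uncertain) production
pi = rng.uniform(20.0, 60.0, 24)                 # day-ahead prices
x = omega_prev                                   # persistence: x_t = omega_t
profit, s_end = greedy_dispatch(x, omega, pi, s0=0.5, s_cap=1.0,
                                rho=0.99, rho_c=0.9, rho_d=0.9)
```

In the paper the recourse decisions are instead optimized jointly as a linear program inside the backward induction, but the greedy rule already exposes how the penalty term and the storage balance constraints interact hour by hour.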
We solve the wind energy commitment problem using the DDP and RDDP algorithms (i.e., the algorithm of § 4 with γ = 0 and γ = 1, respectively) as well as a persistence heuristic that naively pledges the wind generation of the previous day by setting x_t = ω_t. Persistence was proposed as a useful baseline in [28]. Note that problem (10) is beyond the scope of traditional reinforcement learning algorithms due to the high dimensionality of the action spaces and the seasonalities in the wind and price data. We train DDP and RDDP on the first 260 weeks and test the resulting commitment strategies as well as the persistence heuristic on the last 260 weeks of the data set. Table 2 reports the test statistics of the different algorithms. We find that the persistence heuristic wins in terms of the standard deviation, while RDDP wins in all other categories. However, the higher standard deviation of RDDP can be explained by a heavier upper tail (which is indeed desirable). Moreover, the profit distribution generated by RDDP stochastically dominates those generated by DDP and the persistence heuristic; see Figure 4. Another major benefit of RDDP is that it cuts off any losses (negative profits), whereas the other algorithms bear a significant risk of incurring a loss.

Concluding remarks The proposed RDDP algorithm combines ideas from robust optimization, reinforcement learning and approximate dynamic programming. We remark that the NK convex optimization problems arising in each backward induction step are independent of each other and thus lend themselves to parallel implementation. We also emphasize that Assumption 4.1 could be relaxed to allow c_t and g_t to display a general nonlinear dependence on s_t. This would invalidate Corollary 4.1 but not Theorem 4.1.
If one is willing to accept a potentially larger mismatch between the true nonconvex cost-to-go function and its convex approximation architecture, then Algorithm 1 can even be applied to specific motor control, vehicle control or other nonlinear control problems.

Acknowledgments: This research was supported by EPSRC under grant EP/I014640/1.

References

[1] D.P. Bertsekas. Dynamic Programming and Optimal Control, Vol. II. Athena Scientific, 3rd edition, 2007.
[2] W. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley-Blackwell, 2007.
[3] L. Devroye. The uniform convergence of the Nadaraya-Watson regression function estimate. Canadian Journal of Statistics, 6(2):179–191, 1978.
[4] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.
[5] A. Keshavarz and S. Boyd. Quadratic approximate dynamic programming for input-affine systems. International Journal of Robust and Nonlinear Control, 2012. Forthcoming.
[6] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
[7] G. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
[8] S. Mannor, O. Mebel, and H. Xu. Lightning does not strike twice: Robust MDPs with coupled uncertainty. In Proceedings of the 29th International Conference on Machine Learning, pages 385–392, 2012.
[9] W. Wiesemann, D. Kuhn, and B. Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.
[10] M.G. Lagoudakis and R. Parr. Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107–1149, 2003.
[11] D.P. Bertsekas. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3):310–335, 2011.
[12] C.E. Rasmussen and M. Kuss. Gaussian processes in reinforcement learning. In Advances in Neural Information Processing Systems, pages 751–759, 2004.
[13] L. Buşoniu, A. Lazaric, M. Ghavamzadeh, R. Munos, R. Babuška, and B. De Schutter. Least-squares methods for policy iteration. In Reinforcement Learning, pages 75–109. Springer, 2012.
[14] X. Xu, T. Xie, D. Hu, and X. Lu. Kernel least-squares temporal difference learning. International Journal of Information Technology, 11(9):54–63, 2005.
[15] Y. Engel, S. Mannor, and R. Meir. Reinforcement learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, pages 201–208, 2005.
[16] G. Taylor and R. Parr. Kernelized value function approximation for reinforcement learning. In Proceedings of the 26th International Conference on Machine Learning, pages 1017–1024, 2009.
[17] E.A. Nadaraya. On estimating regression. Theory of Probability & its Applications, 9(1):141–142, 1964.
[18] G.S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 26(4):359–372, 1964.
[19] B. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1986.
[20] L. Hannah, W. Powell, and D. Blei. Nonparametric density estimation for stochastic optimization with an observable state variable. In Advances in Neural Information Processing Systems, pages 820–828, 2010.
[21] A. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
[22] L. Pardo. Statistical Inference Based on Divergence Measures, volume 185 of Statistics: A Series of Textbooks and Monographs. Chapman and Hall/CRC, 2005.
[23] Z. Wang, P.W. Glynn, and Y. Ye. Likelihood robust optimization for data-driven newsvendor problems. Technical report, Stanford University, 2009.
[24] A. Ben-Tal, D. den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
[25] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2009.
[26] J.F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11-12:625–654, 1999.
[27] J. Löfberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In Proceedings of the CACSD Conference, 2004.
[28] L. Hannah and D. Dunson. Approximate dynamic programming for storage problems. In Proceedings of the 28th International Conference on Machine Learning, pages 337–344, 2011.
[29] J.H. Kim and W.B. Powell. Optimal energy commitments with storage and intermittent supply. Operations Research, 59(6):1347–1360, 2011.
[30] M. Kraning, Y. Wang, E. Akuiyibo, and S. Boyd. Operation and configuration of a storage portfolio via convex optimization. In Proceedings of the IFAC World Congress, pages 10487–10492, 2011.