{"title": "Optimizing Energy Production Using Policy Search and Predictive State Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 2969, "page_last": 2977, "abstract": "We consider the challenging practical problem of optimizing the power production of a complex of hydroelectric power plants, which involves control over three continuous action variables, uncertainty in the amount of water inflows and a variety of constraints that need to be satisfied. We propose a policy-search-based approach coupled with predictive modelling to address this problem. This approach has some key advantages compared to other alternatives, such as dynamic programming: the policy representation and search algorithm can conveniently incorporate domain knowledge; the resulting policies are easy to interpret, and the algorithm is naturally parallelizable. Our algorithm obtains a policy which outperforms the solution found by dynamic programming both quantitatively and qualitatively.", "full_text": "Optimizing Energy Production Using Policy Search\n\nand Predictive State Representations\n\nYuri Grinberg\n\nDoina Precup\n\nSchool of Computer Science, McGill University\n\nMontreal, QC, Canada\n\n{ygrinb,dprecup}@cs.mcgill.ca\n\nMichel Gendreau\u2217\n\n\u00b4Ecole Polytechnique de Montr\u00b4eal\n\nMontreal, QC, Canada\n\nmichel.gendreau@cirrelt.ca\n\nAbstract\n\nWe consider the challenging practical problem of optimizing the power produc-\ntion of a complex of hydroelectric power plants, which involves control over three\ncontinuous action variables, uncertainty in the amount of water in\ufb02ows and a va-\nriety of constraints that need to be satis\ufb01ed. We propose a policy-search-based\napproach coupled with predictive modelling to address this problem. 
This approach\nhas some key advantages compared to other alternatives, such as dynamic\nprogramming: the policy representation and search algorithm can conveniently\nincorporate domain knowledge; the resulting policies are easy to interpret, and\nthe algorithm is naturally parallelizable. Our algorithm obtains a policy which\noutperforms the solution found by dynamic programming both quantitatively and\nqualitatively.\n\n1 Introduction\n\nThe efficient harnessing of renewable energy has become paramount in an era characterized by\ndecreasing natural resources and increasing pollution. While some efforts are aimed towards the\ndevelopment of new technologies for energy production, it is equally important to maximize the\nefficiency of existing sustainable energy production methods [5], such as hydroelectric power plants.\nIn this paper, we consider an instance of this problem, specifically the optimization of one of a\ncomplex of hydroelectric power plants operated by Hydro-Qu\u00e9bec, the largest hydroelectricity producer\nin Canada [17].\nThe problem of optimizing hydroelectric power plants, also known as the reservoir management\nproblem, has been extensively studied for several decades and a variety of computational methods\nhave been applied to solve it (see e.g. [3, 4] for a literature review). The most common approach is\nbased on dynamic programming (DP) [13]. However, one of the major obstacles of this approach lies\nin the difficulty of incorporating different forms of domain knowledge, which are key to obtaining\nsolutions that are practically relevant. For example, the optimization is subject to constraints on\nwater levels which might span several time-steps, making them difficult to integrate into typical\nDP-based algorithms. Moreover, human decision makers in charge of the power plants are reluctant to\nrely on black-box closed loop policies that are hard to understand. 
This has led to continued use in\nthe industry of deterministic optimization methods that provide long-term open loop policies; such\npolicies are then further adjusted by experts [2]. Finally, despite the different measures taken to\nrelieve the curse of dimensionality in DP-style approaches, it remains a big concern for large scale\nproblems.\nIn this paper, we develop and evaluate a variation of simulation-based optimization [16], a special\ncase of policy search [6], which combines some aspects of stochastic gradient descent and block\ncoordinate descent [14]. We compare our solution to a DP-based solution developed by\nHydro-Qu\u00e9bec based on historical inflow data, and show both quantitative and qualitative improvement.\nWe demonstrate how domain knowledge can be naturally incorporated into an easy-to-interpret\npolicy representation, as well as used to guide the policy search algorithm. We use a type of predictive\nstate representation [9, 10] to learn a model for the water inflows. The policy representation\nfurther leverages the future inflow predictions obtained from this model. The approach is very easy\nto parallelize, and therefore easily scalable to larger problems, due to the availability of low-cost\ncomputing resources. Although much effort in this paper goes to analyzing and solving one\nspecific problem, the proposed approach is general and could be applied to any sequential optimization\nproblem involving constraints. At the end of the paper, we summarize the utility of this approach\nfrom a domain-independent perspective.\n\n\u2217NSERC/Hydro-Qu\u00e9bec Industrial Research Chair on the Stochastic Optimization of Electricity\nGeneration, CIRRELT and D\u00e9partement de Math\u00e9matiques et de G\u00e9nie Industriel, \u00c9cole Polytechnique de Montr\u00e9al.\n\nThe paper is organized as follows. Sec. 
2 provides information about the hydroelectric power plant\ncomplex (needed to implement the simulator used in the policy search procedure) and describes the\ngenerative model used by Hydro-Qu\u00b4ebec to generate in\ufb02ow data with similar statistical properties\nas in\ufb02ows observed historically. Sec. 3 describes the learning algorithm that produces a predic-\ntive model for the in\ufb02ows, based on recent advances in predictive state representations. In Sec. 4\nwe present the policy representation and the search algorithm. Sec. 5 presents a quantitative and\nqualitative analysis of the results, and Sec. 6 concludes the paper.\n\n2 Problem speci\ufb01cation\n\nWe consider a hydroelectric power plant system consisting of four sites, R1, . . . ,R4 operating on the\nsame course of water. Although each site has a group of turbines, we treat this group as a single\nlarge turbine whose speed is to be controlled. R4 is the topmost site, and water turbined at reservoir\nRi \ufb02ows to Ri\u22121 (where it gets added to any other naturally incoming \ufb02ows). The topmost three\nsites (R2,R3,R4) have their own reservoirs, in which water accumulates before being pushed through\na number of turbines which generate the electricity. However, some amount of water might not be\nuseful for producing electricity because it is spilled (e.g., to prevent reservoir over\ufb02ow). 
Typically,\npolicies that manage to reduce spillage produce more power.\nThe amount of water in each reservoir changes as a function of the water turbined/spilled from the\nupstream site, the water in\ufb02ow coming from the ground, and the amount of water turbined/spilled at\nthe current site, as follows:\n\nV4(t + 1) = V4(t) + I4(t) \u2212 X4(t) \u2212 Y4(t),\nVi(t + 1) = Vi(t) + Xi+1(t) + Yi+1(t) + Ii(t) \u2212 Xi(t) \u2212 Yi(t), i = 2, 3\n\nwhere Vi(t) is the volume of water at reservoir Ri at time t, Xi(t) is the amount of water turbined\nat Ri at time t, Yi(t) is the amount of water spilled at site Ri at time t, and Ii(t) is water in\ufb02ow to\nsite Ri at time t. Since R1 does not have a reservoir, all the incoming water is used to operate the\nturbine, and the extra water is spilled. At the other sites, the water spillage mechanism is used only\nas a means to prevent reservoir over\ufb02ow.\nThe control problem that needs to be solved is to determine the amount of water to turbine during\neach period t, in order to maximize power production, while also satisfying constraints on the water\nlevel. We are interested in a problem considered of intermediate temporal resolution, in which\na control action at each of the 3 topmost sites is chosen weekly, after observing the state of the\nreservoirs and the in\ufb02ows of the previous week.\n\nPower production model\nThe amount of power produced is a function of the current water level (headwater) at the reservoir\nand the total speed of the turbines (m3/s). It is not a linear function, but it is well approximated by\na piece-wise linear function for a \ufb01xed value of the headwater (see Fig. A.1 in the supplementary\nmaterial) . 
The following equation is used to obtain the power production curve for other values of\nthe headwater [18]:\n\nP(x, h) = (h / h_ref)^{1.5} \u00b7 P_ref([h / h_ref]^{\u22120.5} \u00b7 x),    (1)\n\nwhere x is the flow, h is the current headwater level, h_ref is the reference headwater, and P_ref is\nthe production curve of the reference headwater. Note that Eq. 1 implies that the maximum total\nspeed of the turbines also changes as the headwater changes; specifically, [h / h_ref]^{\u22120.5} \u00b7 x should not\nexceed the maximum total speed of the turbines, given in the appendix figures. For completeness,\nFigure A.2 (supplementary material) can be used to convert the amount of water in the reservoir to\nthe headwater value.\n\nConstraints\n\nSeveral constraints, which are ecological in nature, must be satisfied while operating the plant.\n\n1. Minimum turbine speed at R1 (MINFLOW(w), w \u2208 {1, ..., 52}):\nThis sufficient flow needs to be maintained to allow for easy passage for the fish living in\nthe river.\n\n2. Stable turbine speed throughout weeks 43-45 (fluctuations of up to BUFFER = 35 m^3/s\nbetween weeks are acceptable). Nearly constant water flow at this time of the year ensures\nthat the area is favorable for fish spawning.\n\n3. The amount of water in reservoir R2 should not go below MINVOL = 1360 hm^3.\nDue to the depth of the reservoir, the top and bottom water temperatures differ. Turbining\nwarmer water (at the top of the reservoir) is preferable for the fish, but this constraint is less\nimportant than the previous two.\n\nWater inflow process\n\nThe operation of the hydroelectric power plant is almost entirely dependent on the inflows at each\nsite. Historical data suggests that it is safe to assume that the inflows at different sites in the same\nperiod t are just scaled values of each other. 
However, there is relatively little data available to\noptimize the problem through simulation: there are only 54 years of inflow data, which translates\ninto 2808 values (one value per week; see Fig. 1). Hydro-Qu\u00e9bec uses this data to learn a generative\nmodel for inflows. It is a periodic autoregressive model of first order, PAR(1), whose structure is\nwell aligned with the hydrological description of the inflows [1]. The model generates data using\nthe following equation:\n\nx(t + 1) = \u03b1_{t mod N} \u00b7 x(t) + \u03be(t),\n\nwhere \u03be(t) \u223c N(0, \u03bd_{t mod N}) i.i.d., x(0) = \u03be(0), and N = 52 in our setting.\nAs the weekly historical data is not necessarily normally distributed, transformations are used to\nnormalize the data before learning the parameters of the PAR(1) model. The transformations used\nhere are either logarithmic, ln(X + a), where a is a parameter, or gamma, based on the Wilson-Hilferty\ntransformation [15]. Hence, to generate synthetic data, the inverses of these transformations are\napplied to the output produced by the PAR(1) process (see footnote 1).\n\nFigure 1: Historical inflow data.\n\nFootnote 1: The parameters of the PAR(1) process, as well as the transformations and their parameters (in the\nlogarithmic case), are estimated using the SAMS software [11].\n\n3 Predictive modeling of the inflows\n\nIt is intuitively clear that predicting future inflows well could lead to better control policies. In this\nsection, we describe the model that lets us compute the predictions of future inflows, which are used\nas an input to policies. We use a recently developed time series modelling framework based on\npredictive state representations (PSRs) [9, 10], called mixed-observable PSRs (MO-PSR) [8]. Although\none could estimate future inflows based on knowledge that the generative process is PAR(1), our\nobjective is to use a general modelling tool that does not rely on this assumption, for two reasons. 
First,\ndecoupling the generative model from the predictive model allows us to replace the current\ngenerative model with more complex alternatives later on, with little effort. Moreover, more complex\nmodels do not necessarily have a clear way to estimate a sufficient statistic from a given history (see\ne.g. temporal disaggregation models [12]). Second, we want to test the ability of predictive state\nrepresentations, which are a fairly recent approach, to produce a model that is useful in a real-world\ncontrol problem. We now describe the models and learning algorithms used.\n\n3.1 Predictive state representations\n\n(Linear) PSRs were introduced as a means to represent a partially observable environment without\nexplicitly modelling latent states, with the goal of developing efficient learning algorithms [9, 10]. A\npredictive representation is only required to keep some form of sufficient statistic of the past, which\nis used to predict the probability of future sequences of observations generated by the underlying\nstochastic process.\nLet O be a discrete observation space. With probability P(o1, ..., ok), the process outputs a sequence\nof observations o1, ..., ok \u2208 O. Then, for some n \u2208 N, the set of parameters\n\n{m_* \u2208 R^n, {M_o \u2208 R^{n\u00d7n}}_{o\u2208O}, p_0 \u2208 R^n}\n\ndefines an n-dimensional linear PSR that represents this process if the following holds:\n\n\u2200k \u2208 N, o_i \u2208 O : P(o1, ..., ok) = m_*^\u22a4 M_{ok} \u00b7\u00b7\u00b7 M_{o1} p_0,\n\nwhere p_0 is the initial state of the PSR [7]. Let p(h) be the PSR state corresponding to a history h.\nThen, for any o \u2208 O, it is possible to track a sufficient statistic of the history, which can be used to\nmake any future predictions, using the equation:\n\np(ho) \u225c M_o p(h) / (m_*^\u22a4 M_o p(h)).\n\nBecause PSRs are very general, learning can be difficult without exploiting some structure of the\nproblem domain. 
In our problem, knowing the week of the year gives signi\ufb01cant information to the\npredictive model, but the model does not need to learn the dynamics of this variable. This turns\nout to be a special case of the so-called mixed observable PSR model [8], in which an observation\nvariable can be used to decompose the problem into several, typically much smaller, problems.\n\n3.2 Mixed-observable PSR for in\ufb02ow process\nWe de\ufb01ne the discrete observation space O by\ndiscretizing the space of in\ufb02ows into 20 bins,\nthen follow [8] to estimate a MO-PSR represen-\ntation from 3 \u00d7 105 trajectories obtained from\nthe generative model. This procedure is a gen-\neralization of the spectral learning algorithm\ndeveloped for PSRs [7], which is a consistent\nestimator.\nSpeci\ufb01cally, let the set of all observed tuples of\nsequences of length 3 be denoted by H and T\nsimultaneously. We then split the set H into 52\nsubsets, each corresponding to a different week\nof the year, and obtain a collection {Hw}w\u2208W,\nwhere W = {1, ..., 52}. 
Then, we estimate a\ncollection of the following vectors and matrices\nfrom data:\n\n\u2022 {P_{Hw}}_{w\u2208W} - a set of |Hw|-dimensional vectors with entries equal to\nP(h \u2208 Hw | h occurred right before week w),\n\u2022 {P_{T,Hw}}_{w\u2208W} - a set of |T| \u00d7 |Hw|-dimensional matrices with entries equal to\nP(h, t | h \u2208 Hw, t \u2208 T, h occurred right before week w),\n\u2022 {P_{T,o,Hw}}_{w\u2208W,o\u2208O} - a set of |T| \u00d7 |Hw|-dimensional matrices with entries equal to\nP(h, o, t | h \u2208 Hw, o \u2208 O, t \u2208 T, h occurred right before week w).\n\nFinally, we perform Singular Value Decomposition (SVD) on the estimated matrices {P_{T,Hw}}_{w\u2208W}\nand use their corresponding low rank matrices of left singular vectors {U_w}_{w\u2208W} to compute the\nMO-PSR parameters as follows:\n\u2022 \u2200o \u2208 O, w \u2208 W : B_o^w = U_{w\u22121}^\u22a4 P_{T,o,Hw} (U_w^\u22a4 P_{T,Hw})^\u2020,\n\u2022 \u2200w \u2208 W : b_0^w = U_w^\u22a4 P_{T,Hw} 1,\n\u2022 \u2200w \u2208 W : b_*^w = (P_{T,Hw}^\u22a4 U_w)^\u2020 P_{Hw},\n\nwhere w \u2212 1 is the week before w, and \u2020 denotes the Moore\u2013Penrose pseudoinverse. The above\nparameters can be used to estimate the probability of any sequence of future observations, given starting\nweek w, as:\n\nP(o1, ..., ot) = (b_*^{w+t})^\u22a4 B_{ot}^{w+t\u22121} \u00b7\u00b7\u00b7 B_{o1}^w b_0^w,\n\nwhere w + i represents the i-th week after w.\nFigure 2 shows the prediction accuracy of the learnt MO-PSR model at different horizons, compared\nto two baselines: the weekly average, and the true PAR(1) model that knows the hidden state (oracle\npredictor).\n\nFigure 2: Prediction accuracy of the mean predictor (blue), MO-PSR predictor (black), and the\npredictions calculated from a true model (red).\n\n4 Policy search\n\nThe objective is to maximize the expected return, E(R), over each year, given by the amount of\npower produced that year minus the penalty for constraint violations. 
Specifically,\n\nR = \u2211_{w=1}^{52} [ P(w) \u2212 \u2211_{i=1}^{3} \u03b1_i C_i(w) ],\n\nwhere P(w) is the amount of power produced during week w, and C_i(w) is the penalty for violating\nthe i-th constraint, defined as:\n\nC_1(w) = min{MINFLOW(w) \u2212 R1flow(w), 0}^2,\nC_2(w) = min{|R1flow(w) \u2212 meanR1flow| \u2212 BUFFER, 0}^2 if w \u2208 {43, 44, 45}, and 0 otherwise,\nC_3(w) = min{MINVOL \u2212 R2vol(w), 0}^{3/2},\n\nwhere R1flow(w) is the water flow (turbined + spilled) at R1 during week w, R2vol(w) is the water\nvolume at R2 at the end of week w, and meanR1flow is the average water flow at site R1 during\nweeks 43-45. There are three variables to control: the speeds of the turbines at R2, R3 and R4. As discussed,\nthe speed of the turbine at site R1 is entirely controlled by the amount of incoming water.\nThe approach we take belongs to a general class of policy search methods [6]. This technique is\nbased on the ability to simulate policies, and the algorithm will typically output the policy that has\nachieved the highest reward during the simulation.\nThe policy for each turbine takes the parametric form of a truncated linear combination of features:\n\nmax[ min( \u2211_{j=1}^{k} x_j \u00b7 \u03b8_j, MAXSPEED_{Ri} ), 0 ],\n\nwhere MAXSPEED_{Ri} is the maximum speed of the turbine at Ri, x_j are the features and \u03b8_j are\nthe parameters. For each site, the features include the current amount of water in the reservoir, the\ntotal amount of water in downstream reservoirs, and a constant. 
For the policy that uses the predictive\nmodel we include one more feature per site: the expected amount of inflow for the following week.\nHence, there are 8 and 11 features for the policies without/with predictions respectively (as there are\nno downstream reservoirs for R2).\nUsing this policy representation results in reasonable performance, but a closer look at constraint 2\nduring simulation reveals that the reservoirs should not be too full; otherwise, there is a high chance\nof spillage, which prevents setting a stable flow during the three consecutive weeks critical for\nfish spawning. To address this concern, we use a different set of parameters during weeks 41-43, to\nensure that the desired state of the reservoirs is reached before the constrained period sets in. Note\nthat the policy search framework allows us to make such an adjustment very easily.\nFinally, we also use the structure of the policy to comply as much as possible with constraint 2,\nby setting the speed of the turbine at site R2 during weeks 44-45 to be equal to the previous water\nflow at site R1. For the policy that uses the predictive model, we further refine this by subtracting\nthe expected predicted amount of inflow at site R1. This brings the number of parameters used for\nthe policies to 16 and 22 respectively. As the policies are simply (truncated) linear combinations of\nfeatures, they are easy to inspect and interpret.\nOur algorithm is based on a random local search around the current solution, by perturbing different\nblocks of parameters while keeping others fixed, as in block coordinate descent [14]. Each time a\nsignificantly better solution than the current one is found, line search is performed in the direction\nof improvement. The pseudo-code is shown in Alg. 1. 
The algorithm itself, like the policy representation,\nexploits problem structure by also searching the parameters of a single turbine as part of the\noverall search procedure.\n\nAlgorithm 1 Policy search algorithm\nParameters:\n  N - maximum number of iterations\n  \u03b8 = {\u03b8_R2, \u03b8_R3, \u03b8_R4} = {\u03b8_1, ..., \u03b8_m} \u2208 R^m - initial parameter vector\n  n - number of parallel policy evaluations\n  Threshold - significance threshold\n  \u03b3 - sampling variance\nOutput: \u03b8\n\nrepeat\n  Stage 1:    \u25b7 searching over entire parameter space\n    \u03b8 = SEARCHWITHINBLOCK(\u03b8, all indexes)\n  Stage 2:    \u25b7 searching over parameters of each turbine separately\n    for all reservoirs Rj do\n      \u03b8 = SEARCHWITHINBLOCK(\u03b8, parameter indexes of turbine Rj)\n  Stage 3:    \u25b7 searching over each parameter separately\n    for j \u2190 1, m do\n      \u03b8 = SEARCHWITHINBLOCK(\u03b8, index j)\nuntil no improvement at any stage\n\nprocedure SEARCHWITHINBLOCK(\u03b8, I)    \u25b7 I, I^c - an index set and its complement\n  repeat\n    Obtain n samples {\u0394_i \u223c N(0, \u03b3I)}_{i\u2208{1,...,n}}\n    Evaluate policies defined by parameters {{\u03b8_{I^c}, \u03b8_I + \u0394_i}}_{i\u2208{1,...,n}} (in parallel)\n    if \u00ca(R_{{\u03b8_{I^c}, \u03b8_I + \u0394_i}}) > \u00ca(R_\u03b8) + Threshold then\n      Find \u03b1* = arg max_\u03b1 \u00ca(R_{{\u03b8_{I^c}, \u03b8_I + \u03b1\u0394_i}}) using a line search\n      \u03b8 \u2190 {\u03b8_{I^c}, \u03b8_I + \u03b1*\u0394_i}\n  until no improvement for N consecutive iterations\n  return \u03b8\n\nThe estimate of the expected reward of a policy is calculated by running the simulator on a single\n2000-year-long trajectory obtained from the generative model described in Sec. 2. 
Since the algorithm\ndepends on the initialization of the parameter vector, we sample the initial parameter vector\nuniformly at random and repeat the search 50 times. The best solution is reported.\n\nFigure 3: Qualitative comparison between DP and PS with pred solutions evaluated on the historical data.\nLeft - DP, right - PS with pred. Plots (a)-(b) show the amount of water turbined at site R4; plots (c)-(d) show\nthe water flow at site R1; plots (e)-(f) show the change in the volume of reservoir R2. Dashed horizontal lines\nin plots (c)-(f) represent the constraints, dotted vertical lines in plots (c)-(d) mark weeks 43-45.\n\nSolution      | Mean-prod | R1 v.% | R1 43-45 v.% | R1 43-45 v. mean | R2 v.%\nDP            | 8,251 GW  | 0%     | 22%          | 11               | 0%\nPS no pred    | 8,286 GW  | 0%     | 28%          | 2.6              | 1.8%\nPS with pred  | 8,290 GW  | 0%     | 3.7%         | 0.5              | 1.8%\n\nTable 1: Comparison between solutions found by dynamic programming (DP), policy search without\npredictive model (PS no pred) and policy search using the predictive model (PS with pred). Mean-prod represents the\naverage annual electricity production; R1 v.% is the percentage of years in which constraint 1 is violated; R2\nv.% is the percentage of years in which constraint 3 is violated; R1 43-45 v.% is the percentage of years in which\nconstraint 2 is violated; R1 43-45 v. mean represents the average amount by which constraint 2 is violated.\n\n5 Experimental results\n\nWe compare the solutions obtained using the proposed policy search with (PS with pred) and\nwithout (PS no pred) a predictive model to a solution based on dynamic programming (DP), developed by\nHydro-Qu\u00e9bec. The state space of DP is defined by: week, water volume at each reservoir, and\nprevious total inflow. All the continuous variables are discretized, and the transition matrix is calculated\nbased on the PAR(1) generative model of the inflow process presented earlier. 
The discretization was\noptimized to obtain best results. During the evaluation, the solution provided by DP is adjusted to\navoid obviously wrong decisions, like unnecessary water spilling. All solutions are evaluated on the\noriginal historical data. The constraints in DP are handled in the same way as in both PS solutions,\nwith penalties for violations taking the same form as shown previously. The only exception is\nconstraint 2, which requires keeping the flow roughly equal throughout several time steps. Since it\nis not possible to incorporate this constraint into DP as is, it is handled by enforcing a turbine flow\nbetween 265 m^3/s (the minimum required by constraint 1) and 290 m^3/s.\nTable 1 shows the quantitative comparison between the solutions obtained by the three methods. PS\nsolutions are able to produce more power, with the best value improving by nearly half of a percent,\na sizeable improvement in the field of energy production. All solutions ensure that constraint 1\nis satisfied (column R1 v.%), but constraint 2 is more difficult. Although PS no pred violates this\nconstraint slightly more often than DP (column R1 43-45 v.%), the amount by which the constraint\nis violated is significantly smaller (column R1 43-45 v. mean). As expected, PS with pred performs\nmuch better, because it explicitly incorporates inflow predictions. Finally, although both PS\nsolutions violate constraint 3 during one out of 54 years (see Fig. 3(f)), such occasional violations are\nacceptable as long as they help satisfy other constraints. Overall, it is clear that PS with pred is a\nnoticeable improvement over DP based on the quantitative comparison alone.\nPractitioners are also often interested in assessing the applicability of the simulated solution by other\ncriteria that are not always captured in the problem formulation. Fig. 3 provides different plots that\nallow such a comparison between the DP and PS with pred solutions. 
Plots (a)-(b) show that the\nsolution provided by PS with pred offers a signi\ufb01cantly smoother policy compared to the DP solution\n(see also Fig. A.3 in supplementary material). This smoothness is due to the policy parametrization,\nwhile the DP roughness is the result of the discretization of the input/output spaces. Unless there\nare signi\ufb01cant changes in the amount of in\ufb02ows within consecutive weeks, major \ufb02uctuations in\nturbine speeds are undesirable, and their presence cannot be easily explained to the operator. The\nonly \ufb02uctuations in the solution of PS with pred that are not the result of large in\ufb02ows are cases in\nwhich the reservoir is empty (see e.g. rapid drops around 10-th week at plot (b)), or a signi\ufb01cant\nincrease in turbine speed around weeks 41-45 due to the change in policy parameters. This also\naffects the smoothness of the change in the water volume trajectory, which can be observed at plots\n(e)-(f) for reservoir R2 for example. The period of weeks 43-45 is a reasonable exception due to the\nchange in policy parameters that require turbining at faster speeds to satisfy constraint 2.\n6 Discussion\nWe considered the problem of optimizing energy production of a hydroelectric power plant com-\nplex under several constraints. The proposed approach is based on a problem-adapted policy search\nwhose features include predictions obtained from a predictive state representation model. The re-\nsulting solution is superior to a well-established alternative, both quantitatively and qualitatively.\nIt is important to point out that the proposed approach is not, in fact, speci\ufb01c to this problem or\nthis domain alone. Often, real-world sequential decision problems have several decision variables,\na variety of constraints of different priorities, uncertainty, etc. Incorporating all available domain\nknowledge into the optimization framework is often the key to obtaining acceptable solutions. 
This\nis where the policy search approach is very useful, because it is typically easy to incorporate many\ntypes of domain knowledge naturally within this framework. First, the policy space can rely on\nfeatures that are deemed useful for the problem, have an interpretable structure and adhere to the\nconstraints of the problem. Second, policy search can explore the most likely directions of\nimprovement first, as considered by experts. Third, the policy can be evaluated directly based on its\nperformance (regardless of the complexity of the reward function). Fourth, it is usually easy to\nimplement the policy search and parallelize parts of the policy search procedure. Finally, the use of\nPSRs allows us to produce good features for the policy by providing reliable predictions of future\nsystem behavior. For future work, the main objective is to evaluate the proposed approach on other\nrealistic complex problems, in particular in domains where solutions obtained from other advanced\ntechniques are not practically relevant.\n\nAcknowledgments\n\nWe thank Gr\u00e9gory Emiel and Laura Fagherazzi of Hydro-Qu\u00e9bec for many helpful discussions and for\nproviding access to the simulator and their DP results, and Kamran Nagiyev for porting an initial version of the\nsimulator to Java. This research was supported by the NSERC/Hydro-Qu\u00e9bec Industrial Research Chair on the\nStochastic Optimization of Electricity Generation, and by the NSERC Discovery Program.\n\nReferences\n\n[1] Salas, J. D. (1980). Applied modeling of hydrologic time series. Water Resources Publication.\n\n[2] Carpentier, P. L., Gendreau, M., Bastin, F. (2013). Long-term management of a hydroelectric\nmultireservoir system under uncertainty using the progressive hedging algorithm. Water\nResources Research, 49(5), 2812-2827.\n\n[3] Rani, D., Moreira, M.M. (2010). Simulation-optimization modeling: a survey and potential\napplication in reservoir systems operation. 
Water resources management, 24(6), 1107-1138.\n\n[4] Labadie, J.W. (2004). Optimal operation of multireservoir systems: State-of-the-art review.\nJournal of Water Resources Planning and Management, 130(2), 93-111.\n\n[5] Ba\u00f1os, R., Manzano-Agugliaro, F., Montoya, F. G., Gil, C., Alcayde, A., G\u00f3mez, J. (2011).\nOptimization methods applied to renewable and sustainable energy: A review. Renewable and\nSustainable Energy Reviews, 15(4), 1753-1766.\n\n[6] Deisenroth, M.P., Neumann, G., Peters, J. (2013). A Survey on Policy Search for Robotics.\nFoundations and Trends in Robotics, 21, pp.388-403.\n\n[7] Boots, B., Siddiqi, S., Gordon, G. (2010). Closing the learning-planning loop with predictive\nstate representations. In Proc. of Robotics: Science and Systems VI.\n\n[8] Ong, S., Grinberg, Y., Pineau, J. (2013). Mixed Observability Predictive State Representations.\nIn Proc. of 27th AAAI Conference on Artificial Intelligence.\n\n[9] Littman, M., Sutton, R., Singh, S. (2002). Predictive representations of state. Advances in\nNeural Information Processing Systems (NIPS).\n\n[10] Singh, S., James, M., Rudary, M. (2004). Predictive state representations: A new theory for\nmodeling dynamical systems. In Proc. of 20th Conference on Uncertainty in Artificial Intelligence.\n\n[11] Sveinsson, O.G.B., Salas, J.D., Lane, W.L., Frevert, D.K. (2007). Stochastic Analysis Modeling\nand Simulation (SAMS-2007). URL: http://www.sams.colostate.edu.\n\n[12] Marco, J.B., Harboe, R., Salas, J.D. (Eds.) (1993). Stochastic hydrology and its use in water\nresources systems simulation and optimization, 237. Springer.\n\n[13] Bellman, R. (1954). Dynamic Programming. Princeton University Press.\n\n[14] Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable\nminimization. Journal of optimization theory and applications, 109(3), 475-494.\n\n[15] Loucks, D.P., Stedinger, J.R., Haith, D.A. (1981). 
Water Resources Systems Planning and Analysis.\nPrentice-Hall, Englewood Cliffs, N.J.\n\n[16] Gosavi, A. (2003). Simulation-based optimization: parametric optimization techniques and\nreinforcement learning, 25. Springer.\n\n[17] Fortin, P. (2008). Canadian clean: Clean, renewable hydropower leads electricity generation\nin Canada. IEEE Power Energy Mag., July/August, 41-46.\n\n[18] Breton, M., Hachem, S., Hammadia, A. (2002). A decomposition approach for the solution of\nthe unit loading problem in hydroplants. Automatica, 38(3), 477-485.\n", "award": [], "sourceid": 1554, "authors": [{"given_name": "Yuri", "family_name": "Grinberg", "institution": "McGill University"}, {"given_name": "Doina", "family_name": "Precup", "institution": "McGill University"}, {"given_name": "Michel", "family_name": "Gendreau", "institution": "CIRRELT"}]}