{"title": "Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting", "book": "Advances in Neural Information Processing Systems", "page_first": 136, "page_last": 144, "abstract": "Accurate short-term wind forecasts (STWFs), with time horizons from 0.5 to 6 hours, are essential for efficient integration of wind power into the electrical power grid. Physical models based on numerical weather predictions are currently not competitive, and research on machine learning approaches is ongoing. Two major challenges confronting these efforts are missing observations and weather-regime-induced dependency shifts among wind variables at geographically distributed sites. In this paper we introduce approaches that address both of these challenges. We describe a new regime-aware approach to STWF that uses auto-regressive hidden Markov models (AR-HMMs), a subclass of conditional linear Gaussian (CLG) models. Although AR-HMMs are a natural representation for weather regimes, as with CLG models in general, exact inference is NP-hard when observations are missing (Lerner and Parr, 2001). Because of this high cost, we introduce a simple approximate inference method for AR-HMMs, which we believe has applications to other sequential and temporal problem domains that involve continuous variables. 
In an empirical evaluation on publicly available wind data from two geographically distinct regions, our approach makes significantly more accurate predictions than baseline models, and uncovers meteorologically relevant regimes.", "full_text": "Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting\n\nChris Barber\nEE and Computer Science\nUniversity of Wisconsin-Milwaukee, USA\n\nJoseph Bockhorst\nEE and Computer Science\nUniversity of Wisconsin-Milwaukee, USA\n\nPaul Roebber\nAtmospheric Science\nUniversity of Wisconsin-Milwaukee, USA\n\nAbstract\n\nAccurate short-term wind forecasts (STWFs), with time horizons from 0.5 to 6 hours, are essential for efficient integration of wind power into the electrical power grid. Physical models based on numerical weather predictions are currently not competitive, and research on machine learning approaches is ongoing. Two major challenges confronting these efforts are missing observations and weather-regime-induced dependency shifts among wind variables. In this paper we introduce approaches that address both of these challenges. We describe a new regime-aware approach to STWF that uses auto-regressive hidden Markov models (AR-HMM), a subclass of conditional linear Gaussian (CLG) models. 
Although AR-HMMs are a natural representation for weather regimes, as with CLG models in general, exact inference is NP-hard when observations are missing (Lerner and Parr, 2001). We introduce a simple approximate inference method for AR-HMMs, which we believe has applications in other problem domains. In an empirical evaluation on publicly available wind data from two geographically distinct regions, our approach makes significantly more accurate predictions than baseline models, and uncovers meteorologically relevant regimes.\n\n1 Introduction\n\nAccurate wind speed and direction forecasts are essential for efficient integration of wind energy into electrical transmission systems. The importance of wind forecasts for the wind energy industry stems from three facts: 1) for reliability and safety, the aggregate power produced and consumed throughout a power system must be nearly in balance at all times; 2) because it depends strongly on wind speed and direction, the power output of a wind farm is highly variable; and 3) efficient and cost-effective energy storage mechanisms do not exist. A recent estimate placed the value of a perfect forecast at $3 billion annually (Piwko and Jordan, 2010) for the United States power system of 2030 envisioned by the Department of Energy (Lindenberg, 2008). 
Because information on the 30 minute to six-hour time horizon is actionable for many control decisions, and the current state of the art is considered inadequate, there has been a recent surge of interest in improving forecasts in this range.\n\nThe short-term wind forecasting (STWF) problem presents numerous challenges to the modeler. Data produced by the current observation network are sparse relative to the temporal and spatial scale of the weather events that drive short-term changes in wind features; observations are frequently missing or corrupted; quality training sets with multiple years of turbine-height (~80 m) wind observations at numerous sites are typically not available; transfer of learned models (Caruana, 1997) across wind farms is difficult; and, because of the dynamic nature of weather, the spatial and temporal dependencies of wind features within a geographical region are not fixed.\n\nNumerical weather prediction (NWP) methods are the primary means for producing the large-scale weather forecasts used throughout the world, but are not competitive for STWF. In fact, NWP-based wind speed predictions are less accurate than \u201cpersistence\u201d forecasts (Giebel, 2003), a surprisingly robust method for time horizons less than a few hours. Approaches to STWF include ARMA models (Marti et al., 2004), support vector machines (Zheng and Kusiak, 2009), and other data mining methods (Kusiak et al., 2009), but with the exception of two methods (Gneiting et al., 2006; Pinson and Madsen, 2008) these do not consider dependency dynamics. Gneiting et al. (2006) \u201chard code\u201d their regimes based on wind direction, while Pinson and Madsen (2008) learn regimes for a single forecasting site with complete data. 
At the time of writing we are unaware of any previous STWF work that simultaneously learns regimes, incorporates multiple observation sites, and accepts missing observations.\n\nWe propose a novel approach to STWF that automatically reasons about learned weather regimes across multiple sites while naturally handling missing observations. Our approach is based on switching conditional linear Gaussian (CLG) models, variously known as switching vector autoregressive models or autoregressive hidden Markov models. For an overview of CLG models, see Koller and Friedman (2009, Chap. 14). Since exact inference in CLG models with incomplete data is NP-hard (Lerner and Parr, 2001), we pursue approximate methods. We introduce a novel and simple approximate inference approach that exploits the tendency of regimes to persist for several hours. Predictions by our learned models are significantly better than baseline persistence predictions in experiments on National Climatic Data Center (NCDC) data from two regions of the United States: the Pacific Northwest and southern Wisconsin. Inspection of the learned models shows that our approach learns meteorologically interesting regimes.\n\nSwitching CLG models have been applied in other domains where missing observations are an issue, such as meteorology (Tang, 2004; Paroli et al., 2005), epidemiology (Toscani et al., 2010) and econometrics (Perez-Quiros et al., 2010). Some approaches avoid the issue of missing data by throwing out affected timesteps, or by imputing values through a variety of domain-specific techniques. Alternatively, Markov chain Monte Carlo parameter estimation techniques have been applied. 
Our approach may be an attractive alternative in these domains, offering a solution that requires neither deletion nor imputation.\n\n2 Methods\n\nWe consider the setting in which wind observations from a set of M stations arrive at regular intervals (hourly in our experiments). Let Ut and Vt be M-by-1 vectors of random variables for the u and v components of the wind at all sites at time t. We use Wt = [1 Ut' Vt']' to refer to both Ut and Vt (footnote 1), and we denote settings to random variables with lowercase letters, for example wt.\n\nOur approach to STWF is based on auto-regressive HMMs, where at each time t we have a single discrete random variable Rt that represents the active regime, and a continuous-valued vector random variable Wt that represents measured wind speeds. As the local probabilities are linear Gaussian (LG), we denote the model in which the regime variables Rt have cardinality C by AR-LG(C). Thus AR-LG(1) is a traditional AR model. Figure 1 shows example graphical models.\n\nThe local conditional distributions, Pr(Rt+1|Rt) and Pr(Wt|Wt\u22121, Rt), are shared across time. We represent Pr(Rt+1|Rt) by the C-by-C transition matrix T, where T(r, s) > 0 is the probability of transitioning from regime r to regime s. Since weather regimes tend to persist for multiple hours, the self-transition probabilities T(r, r) are typically the largest. The local distributions for the continuous variables are linear Gaussian, Pr(wt|wt\u22121, Rt = r) = N(B(r)wt\u22121, Q(r)), where B(r) is the 2M-by-2M regression matrix for regime r, row j of B(r) is the regression vector for the jth component of wt, Q(r) is the regime's covariance matrix, and N(\u00b5, \u03a3) is the multivariate Gaussian (MVG) with mean \u00b5 and covariance \u03a3. The joint probability of a setting to all variables\n\nFootnote 1: We include the additional dimension with constant value 1.0 here to indicate that we include a constant term in all our models. 
But for notational simplicity in what follows we include this term only implicitly and describe our methods as if Wt were comprised of observations only.\n\nFigure 1: Graphical structures of wind speed models. Darkly shaded nodes are observed, lightly shaded nodes are partially observed, and unshaded nodes are unobserved. (a) Auto-regressive linear Gaussian (AR-LG(1)) for a data set with L time steps. (b) Auto-regressive HMM (AR-LG(C), C > 1). Exact inference in (b) with missing observations is NP-hard (Lerner and Parr, 2001). (c) Truncated AR-LG(C) HOMO approximation of (b) for predictions of wind speeds at target time t + h made at t with K = 2 and horizon h = 2. Our approximation assumes the regime does not change in the window t \u2212 K to t + h. (d) Truncated (non-conditional) AR-LG(1) analogous to (c). (e) Detailed structure of (d) for 3 sites, showing within-time-slice conditional independencies and assorted missing observations.\n\nfor L time steps is\n\nPr(r1, w1, ..., rL, wL) = Pr(r1) Pr(w1|r1) \u220f_{t=2}^{L} Pr(rt|rt\u22121) Pr(wt|rt, wt\u22121) = \u03bd(r1) N(w1; \u00b51(r1), Q1(r1)) \u220f_{t=2}^{L} T(rt\u22121, rt) N(wt; B(rt)wt\u22121, Q(rt)),\n\nwhere \u03bd is the initial regime distribution, the observations at t = 1 for regime r are Gaussian with mean \u00b51(r) and covariance Q1(r), and N() with three arguments denotes the MVG density. We set \u03bd to the stationary state distribution, given by the eigenvector of T' associated with eigenvalue 1. We train model parameters with standard EM methods for conditional linear Gaussian (CLG) models (Murphy, 1998), except that the E-step uses approximate inference.\n\n2.1 Approximate Inference Methods\n\nConsider a length-L time series and let W = (W1, ..., WL) refer to the continuous variables. 
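The model just defined lends itself to a short numerical sketch. The following is illustrative only (the function names and toy parameters are ours, not the paper's): it computes the stationary initial distribution \u03bd as the eigenvector of T' with eigenvalue 1, and evaluates the joint log-probability given above.

```python
import numpy as np

def stationary_distribution(T):
    """Initial regime distribution nu: eigenvector of T' with eigenvalue 1."""
    vals, vecs = np.linalg.eig(T.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def log_mvg(x, mean, cov):
    """Log-density of a multivariate Gaussian N(x; mean, cov)."""
    d = len(x)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))

def joint_log_prob(r, w, T, B, Q, mu1, Q1):
    """log Pr(r1, w1, ..., rL, wL) for an AR-LG(C) model:
    log nu(r1) + log N(w1; mu1(r1), Q1(r1))
    + sum over t of [log T(r_{t-1}, r_t) + log N(w_t; B(r_t) w_{t-1}, Q(r_t))]."""
    nu = stationary_distribution(T)
    lp = np.log(nu[r[0]]) + log_mvg(w[0], mu1[r[0]], Q1[r[0]])
    for t in range(1, len(r)):
        lp += np.log(T[r[t - 1], r[t]]) + log_mvg(w[t], B[r[t]] @ w[t - 1], Q[r[t]])
    return lp
```

For example, for a 2-regime transition matrix T = [[0.9, 0.1], [0.2, 0.8]] the stationary distribution is (2/3, 1/3).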
We denote the sequence of partial observations as \u02d9w1:L = (\u02d9w1, \u02d9w2, ..., \u02d9wL), where the \u201cdot\u201d notation \u02d9wt denotes a potentially incomplete vector with missing data. Our inference tasks are to calculate Pr(Wt\u22121, Wt, Rt|\u02d9w1:L) and Pr(Rt\u22121, Rt|\u02d9w1:L) during training, to compute the expected sufficient statistics needed for estimation of CLG parameters using EM (Murphy, 1998), and to compute Pr(Wt+H|\u02d9w1:t) for horizon-H forecasting at time t.\n\nIn AR-LG(1) models with no discrete variables (Figure 1a), the chain structure permits efficient exact inference using message-passing techniques (Weiss and Freeman, 2001). For general AR-LG(C) models, however, the posterior distributions over unobserved continuous variables are mixtures of exponentially many Gaussians and exact inference is NP-hard (Lerner and Parr, 2001). Specifically, Pr(Wt+H|\u02d9w1:t) has C^(d+H) component Gaussians, one for each setting of Rt\u2212d+1, ..., Rt+H, where d is the number of contiguous time steps in the suffix of \u02d9w1:t with at least one missing observation. The training posteriors Pr(Wt\u22121, Wt|\u02d9w1:L) have C^(dl+2+dr) components, where dl and dr are the numbers of consecutive time steps to the left of t \u2212 1 and to the right of t with at least one missing observation. Because of the nature of data collection, most wind data sets with multiple sites will have a number of missing observations. Indeed, our Wisconsin (21 sites) and Pacific Northwest (24 sites) data sets have only 5.6% and 6.4% hours of complete data, respectively. Missing observations are by no means unique to wind. 
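To put the C^(d+H) growth in perspective, a quick back-of-the-envelope count (the particular values of C, d, and H are illustrative, not taken from the experiments):

```python
# One Gaussian component per joint setting of the regime variables that
# cannot be marginalized away: C^(d+H) when forecasting, C^(dl+2+dr) in training.
C = 5        # number of regimes (illustrative)
d = 10       # contiguous trailing hours with at least one missing site
H = 6        # forecast horizon in hours
exact_components = C ** (d + H)
print(exact_components)   # 152587890625 -- far too many to enumerate
pruned_components = C     # a HOMO-style approximation keeps one per regime
```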
Lauritzen's approach (Lauritzen and Jensen, 2001) for exact inference in conditional linear Gaussian models offers no relief, as the clique sizes in the strongly rooted junction trees are exponentially large for AR-LG(C) models.\n\nSince exact inference is intractable we investigate approximate methods. We first make a simplification that involves focusing only on observations temporally close to the inference variables: we ignore observations more than K time steps from t. For example, we approximate Pr(Wt\u22121, Wt|\u02d9w1:L) by a truncated model Pr(Wt\u22121, Wt|\u02d9wt\u2212K:t+K). While inference in the truncated model will be less costly than in the full model, it is still O(C^(2K+1)) in the worst case, which is prohibitive for moderate K on large datasets.\n\nOur approaches are based on the general concept of pruning (Koller and Friedman, 2009, Chap. 14), where all but n mixture components are discarded in order to approximate a posterior distribution with a prohibitive number of components. Let P(V) refer to a desired posterior distribution under the truncated model given evidence, which we assume has at least one missing observation at each time step. P(V) is a mixture of Gaussians with an exponential number of components N, which we write P(V) = \u2211_{j=1}^{N} \u03c9j pj(V). Each mixing proportion \u03c9j is associated with a regime state sequence and pj(V) is the posterior Gaussian for that sequence. We approximate P(V) by the distribution Q(V) = \u2211_{j=1}^{n} \u03c0j qj(V) with a much smaller number of components n, in which each component of Q is equal to one component of P. Without loss of generality we re-order the components of P so that the selected components comprise the first n, and thus qj = pj for j \u2264 n. As pointed out previously (Lerner and Parr, 2001), this approach is appropriate in many real-world settings in which a large fraction of the probability mass of P(V) is contained in a small number of components. This is the case for us, as weather regimes tend to persist for a number of hours, and thus regime sequences with frequent regime switching are highly unlikely.\n\n2.1.1 Approach 1: PRIOR\n\nWe consider three approaches to choosing the components in Q. Our first approach is the method of Lerner and Parr (2001), which chooses the components associated with the n a-priori most likely state sequences; since our discrete variables form a simple chain, we can find these efficiently using the Best Max-Marginal First (BMMF) method (Yanover and Weiss, 2003). The mixing proportions are set so that \u03c0i \u221d \u03c9i. Although not theoretically justified in Lerner and Parr (2001), we show here that this choice in fact minimizes an upper bound on the a-priori or evidence-free KL divergence from Q to P, D(Q||P), among all Q made from n components of P. To see this we first extend Q to have N components, where \u03c0j = 0 for j > n, and apply an upper bound on the KL divergence between mixtures of Gaussians (Singer and Warmuth, 1998; Do, 2003): D(Q||P) \u2264 D(\u03c0||\u03c9) + \u2211_{j=1}^{N} \u03c0j D(qj||pj). Since we constrain Q to have components from P, the second term drops out, and D(Q||P) \u2264 D(\u03c0||\u03c9) = \u2211_{j=1}^{n} (\u03c9j/Z)(log(\u03c9j/Z) \u2212 log(\u03c9j)), where we use that \u03c0j \u221d \u03c9j with proportionality constant Z = \u2211_{j=1}^{n} \u03c9j, the sum of the chosen mixing probabilities. This leaves D(Q||P) \u2264 \u2212log(Z), which is clearly minimized by choosing the n components of P with largest \u03c9j. We call this approach PRIOR(n).\n\n2.1.2 Approach 2: HOMO\n\nOur second method for setting Q(V) is a simple but often effective approach that assumes no regime changes in the truncated model. This approach, which we call HOMO, has n = C components, one for each homogeneous regime sequence. 
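A toy sketch may help make the two selection rules concrete (brute-force enumeration stands in for BMMF here, and all names are ours): PRIOR(n) keeps the n sequences with the largest prior weight and renormalizes, giving the KL bound \u2212log(Z), while HOMO would keep the C homogeneous sequences.

```python
import itertools
import math
import numpy as np

def prior_weights(T, nu, length):
    """Prior probability of every regime sequence of the given length
    (brute-force enumeration; BMMF avoids this in practice)."""
    seqs = list(itertools.product(range(T.shape[0]), repeat=length))
    w = np.array([nu[s[0]] * math.prod(T[a, b] for a, b in zip(s, s[1:]))
                  for s in seqs])
    return seqs, w

def prune_prior(seqs, w, n):
    """PRIOR(n): keep the n a-priori most likely sequences, with pi_j
    proportional to omega_j; the resulting KL bound is -log(Z)."""
    idx = np.argsort(w)[::-1][:n]
    Z = w[idx].sum()
    return [seqs[i] for i in idx], w[idx] / Z, -math.log(Z)

T = np.array([[0.9, 0.1], [0.2, 0.8]])   # strong self-transitions
nu = np.array([2 / 3, 1 / 3])
seqs, w = prior_weights(T, nu, length=4)
kept, pi, bound = prune_prior(seqs, w, n=2)
# With persistent regimes the two homogeneous sequences dominate a-priori,
# so PRIOR(2) and HOMO happen to select the same components here.
print(kept)   # [(0, 0, 0, 0), (1, 1, 1, 1)]
```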
If the self-transition probabilities T(r, r) are largest, then the a-priori most likely regime sequence is homogeneous, and thus is also chosen by PRIOR(n). The other components of PRIOR(n) may be only small variations on this homogeneous regime sequence; the components selected by HOMO, by contrast, are very different from one another. This diversity may be advantageous for prediction.\n\n2.1.3 Approach 3: POST\n\nOur final method depends on the evidence. We would like to select the components for the top n regime sequences with maximum posterior likelihood; however, this too is NP-hard (Lerner and Parr, 2001). We instead use a fast approximation in which the posterior potential of settings to regime variables is set by local evidence. We define \u03c4t(r) = Pr(\u02d9wt|\u02d9wt\u22121, Rt = r) to be the potential for Rt = r, and then run BMMF on the model where Pr(rt\u2212K, ..., rt+K) \u221d \u03c4t\u2212K(rt\u2212K) \u220f_{t'=t\u2212K+1}^{t+K} \u03c4t'(rt') T(rt'\u22121, rt'). Note that each \u03c4t(r) is the density value of a single Gaussian, and can be computed quickly from model parameters. We call this approach POST(n).\n\nTable 1: Missing data summary. The \u201cCount\u201d row lists the number of hours in our WI data set (21 sites total) in which the number of sites with missing values was exactly equal to the value \u201c# Sites Missing\u201d.\n\n# Sites Missing | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7+\nCount | 1978 | 2583 | 6463 | 8165 | 8849 | 4769 | 1229 | 1028\nFrequency | 5.6% | 7.4% | 18.4% | 23.3% | 25.2% | 13.6% | 3.5% | 2.9%\n\n3 Experimental evaluation\n\nWe compare the forecasts of our models to forecasts of persistence models, which represent the current state of the art for STWF. We design our experiments to answer the following questions. 1. Are STWFs of our single-regime models more accurate than persistence forecasts? 2. 
Are STWFs of our models that consider regimes more accurate than those of the single-regime models? 3. Do differences between learned regimes make sense meteorologically? Additionally, we wish to comparatively evaluate the effectiveness of our approximate inference algorithms.\n\n3.1 Data set\n\nWe conduct our evaluation in two meteorologically distinct regions of the United States: Wisconsin (WI), and the Pacific Northwest (PNW) states of Washington and Oregon. The National Climatic Data Center (NCDC) maintains a publicly accessible database of hourly historical climatic surface data, from which we obtained 4 years of data from a number of sites in both regions. The WI observations span January 1, 2006 through December 31, 2009, and the PNW observations span February 4, 2006 through February 3, 2010. We have data from 21 and 24 sites in WI and PNW, respectively. This data is available at http://ganymede.cs.uwm.edu/nips2010/.\n\nWe collect wind direction and wind speed at each site, as measured at 10 meters above ground level. Since our primary motivation is wind power forecasting, we would prefer wind speed measurements taken at turbine height (approximately 50-100 meters above ground level). Publicly available turbine-height observations, however, are scarce, so we use the 10 m data as a compromise and for proof of concept.\n\nRaw data from the NCDC is approximately hourly, but readings often appear off the hour in an unpredictable fashion. We use the simple rule of selecting the nearest data point within +/- 10 minutes of the hourly transition, and discard all readings outside this margin. Additionally, NCDC appends various quality flags to each reading, and we discarded any data that did not pass all quality checks. These discarded points, as well as varying site-specific instrumentation practices, introduce missing observations. 
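The alignment rule just described can be sketched as follows (the data layout and names are hypothetical, not the paper's actual preprocessing code):

```python
from datetime import datetime, timedelta

def align_to_hour(readings, hour, margin=timedelta(minutes=10)):
    """Keep the reading nearest the top of the hour, within +/- 10 minutes;
    return None (a missing observation) when no reading qualifies."""
    best, best_gap = None, margin
    for ts, speed in readings:
        gap = abs(ts - hour)
        if gap <= best_gap:
            best, best_gap = (ts, speed), gap
    return best

hour = datetime(2006, 1, 1, 12, 0)
readings = [(datetime(2006, 1, 1, 11, 53), 4.1),   # 7 minutes early: kept
            (datetime(2006, 1, 1, 12, 20), 5.0)]   # 20 minutes late: discarded
print(align_to_hour(readings, hour))
```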
Table 1 shows a summary of missing data in the WI data set. Missing data did not arise from a few misbehaving stations.\n\n3.2 Experimental methodology\n\nData was assembled into four cross-validation folds, each contiguous and exactly 1 year in length. For each fold we use the three training years to learn AR-LG(C) models with C = 1, 2, ..., 5. Note that AR-LG(1) is the standard (non-conditional) auto-regressive model. With each learned model we forecast wind speeds at all sites and all test-year hours at six horizons (1-6 hours). Thus, for each geography (WI and PNW) we have 20 learned models (4 folds and C = 1, 2, ..., 5) and 120 prediction sequences (horizons of 1-6 hours for each of the 20 learned models). Note that this entails \u201ccasting out\u201d or unrolling a learned model to reach longer horizons, which as we see below can impact performance. For the persistence model, we only make a horizon-h forecast for target time t + h if the time-t observation at that site is available. For point predictions we predict the expected value of the posterior wind distribution at the prediction time.\n\nFigure 2: Mean RMSE over all sites and folds for the single-regime (AR-LG(1)) and persistence models in WI (left) and PNW (right). Errorbars extend one standard deviation above and below the mean.\n\nFigure 3: Site-by-site average RMSE values for AR-LG(1) and persistence models in WI for 1, 3 and 5 hour horizons. Errorbars show standard deviations calculated across folds (years).\n\nWe chose for these experiments to use the HOMO approximation method, with a truncated model corresponding to 3 hours (K = 1). This approximation method is simplest and suits our domain, where we expect distinct regimes to generally persist over a period of a few hours.\n\nWe use two performance measures to evaluate prediction sequences: test-set log-likelihood (LL) and root mean squared error (RMSE). 
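As a sketch of the point-prediction measure, the following computes RMSE for a persistence baseline on a toy series with a missing hour (the function names and data are ours, not the paper's):

```python
import numpy as np

def persistence_forecast(obs, h):
    """Horizon-h persistence: predict w_{t+h} = w_t; NaN marks a missing forecast."""
    pred = np.full_like(obs, np.nan)
    pred[h:] = obs[:-h]
    return pred

def rmse(pred, truth):
    """RMSE over only the hours where both forecast and observation exist."""
    mask = ~np.isnan(pred) & ~np.isnan(truth)
    return float(np.sqrt(np.mean((pred[mask] - truth[mask]) ** 2)))

speeds = np.array([3.0, 4.0, np.nan, 5.0, 6.0, 5.5])   # toy wind speeds (m/s)
print(rmse(persistence_forecast(speeds, h=1), speeds))  # about 0.866
```

Masking the hours with no forecast mirrors the comparison rule used in the experiments: scores are computed only over hours for which a persistence prediction exists.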
The RMSE measure provides an evaluation of point predictions, while the LL provides an evaluation of probabilistic predictions. For a given geographical region we denote the RMSE of the horizon-h prediction sequence made by the AR-LG(C) model for site s and year y by e(h, s, y, C). Similarly, we denote the RMSE of a persistence prediction sequence by ep(h, s, y). We denote collections and aggregates with MATLAB-style notation. For example, e(1, :, :, 2) is the collection of RMSE values for 1-hour predictions of the 2-regime model across all sites and years, and mean[e(1, :, :, 2)] and std[e(1, :, :, 2)] are the collection's mean and standard deviation.\n\nWe calculate LL values of the AR-LG(C) models relative to LL values of a persistent Gaussian model wt+h = wt + \u03b5h. Here, \u03b5h is the horizon-h zero-mean Gaussian noise vector with variance estimated from the training set.\n\nIn order to make meaningful comparisons between AR-LG(C) and persistence models, we calculate performance measures for all horizon-h prediction sequences from only those hours for which a corresponding horizon-h persistence prediction is available.\n\n3.3 Results\n\nTo compare our approximation methods, we evaluate the three approximate inference procedures, HOMO, PRIOR(2) and POST(2), using simulated data from a situation with ten sites arrayed linearly (e.g., east-to-west) and two regimes. The parameters of regime 1 were set for an east-to-west\n\nFigure 4: Performance of multi- versus single-regime models at a 2 hour prediction horizon, in each of 4 folds. 
The performance measure is test-set relative log-likelihood across all sites, for WI (left) and PNW (right), with the number of regimes, C, on the x-axis. Note that at C = 1 the value is zero, since this is a comparison of AR-LG(1) against itself.\n\nmoving weather regime, and the parameters of regime 2 were set for a west-to-east weather regime. Self-transition probabilities were set to 0.8. Observations were generated using these models and 20% were hidden, which is consistent with the missing rate in our data sets. We then made 2-hour-ahead forecasts at all time points using window sizes K from 2 to 15 hours. The mean absolute error of PRIOR(2) was highest for all K (1.05), HOMO had the lowest overall error (0.95), and there were surprisingly no obvious trends due to K. The good performance of HOMO supports our hypothesis that the performance of PRIOR(2) suffers from lack of diversity; however, we expected POST(2) to perform better relative to HOMO, yet it had an overall error of 0.965.\n\nIn all further experiments we assess the effectiveness of the AR-LG(C) models using the real wind data described above.\n\nTo answer the first question above, we compute mean[e(h, :, :, 1)] and mean[ep(h, :, :)] for the RMSE collections of the AR-LG(1) and persistence models for both geographical locations and all horizons h = 1, 2, ..., 6. Figure 2 plots the mean RMSE of these collections. The errorbars extend 1 standard deviation unit above and below the mean. Not surprisingly, error increases with horizon length. In both WI and PNW the AR-LG(1) model has significantly lower RMSE than persistence for 1 and 2 hour time horizons. For longer horizons the results vary by geography. In PNW the gap between AR-LG(1) and persistence grows with h, while in WI the AR-LG(1) performance begins to degrade relative to persistence starting with h = 3. 
At 3 and 4 hour horizons we see an increase in the variance of AR-LG(1), but still a lower mean RMSE than persistence. For h = 5 and h = 6 the persistence model has lower mean RMSE than AR-LG(1).\n\nTo gain insight into the decreasing performance at longer horizons in WI, we plot in Figure 3 the mean and standard deviation of RMSE values for the site-specific collections e(h, s, :, 1) and ep(h, s, :) at all WI sites for 1, 3 and 5 hour horizons. Each collection here contains four RMSE values, one per fold. For h = 1 our AR-LG(1) model beats persistence at all sites, usually by multiple standard deviation units. This is a significant result, because persistence forecasts have been shown to be difficult to improve upon for very short horizons. At h = 3 problems begin to appear. Although at most sites AR-LG(1) has improved further upon persistence accuracy, two sites display high variance and one (second from left) has high variance and very high RMSE near 3 m/s. At h = 5 high variance is widespread and the RMSE of the ill-behaving sites at h = 3 has grown. This suggests that large erroneous predictions at a small number of sites spread throughout the system as it evolves forward in time.\n\nNext, we consider the performance of multiple-regime models. For these models we focus on the LL measure. Figure 4 plots total LL values of AR-LG(2), AR-LG(3), AR-LG(4) and AR-LG(5) relative to AR-LG(1) for individual years. In both WI and PNW there is a large jump from 1 to 2 regimes. While in WI there is no obvious trend from 2 to 5 regimes, in PNW there is a clear increase in performance as the number of regimes increases.\n\nFigure 5: Meteorological properties of learned regimes of AR-LG(5) models in WI. 
(a) Mean wind vectors (u, v) at each of 21 sites in WI, in each of 5 regimes (regime indicated by shape). (b) Mean regime posteriors with respect to test-set hour of day (CST), showing diurnal trends. (c) Mean regime posteriors with respect to test-set month.\n\nIncreases in forecast skill and test-set log-likelihood indicate that the regimes in the multi-regime models are capturing important generalizable patterns in short-term wind dynamics, whose features ought to arise from the underlying meteorology. Indeed, the model parameters exhibit strong clustering patterns which can be tied to known regional meteorological phenomena. Figure 5 shows an analysis of a five-regime model trained on the WI dataset. Figure 5(a) plots learned wind vectors in the first time-slice for the five regimes. Figures 5(b) and (c) analyze posterior regime likelihoods with respect to diurnal (daily) and seasonal time frames. We note strong clusterings in (a) and significant diurnal and seasonal trends.\n\n4 Conclusion\n\nWe have described a model for short-term wind forecasting (STWF), an important task in the wind power industry. Our model is set apart from previous STWF approaches in three important ways. First, forecasts are informed by off-site evidence through a representation of the dynamical evolution of winds in the region. Second, our models can learn and reason about meteorological regimes unique to the local climate. Finally, our model is tolerant of missing data, which is present in most sources of wind data. These points are shown empirically through an improvement in forecasting error versus the state of the art, and through observation of meteorological properties of the learned regimes.\n\nWe presented novel approximate inference procedures that enable AR-HMMs to be used gracefully in situations with missing data. We hope these approaches can be applied to other problem domains suited to AR-HMMs.\n\nReferences\n\nCaruana, R. (1997). Multitask learning. 
Machine Learning, 28:41\u201375.\n\nDo, M. (2003). Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models. IEEE Signal Processing Letters, 10(4):115\u2013118.\n\nGiebel, G. (2003). The state-of-the-art in short-term prediction of wind power. Deliverable Report D1.1, Project Anemos. Available online at http://anemos.cma.fr.\n\nGneiting, T., Larson, K., Westrick, K., Genton, M. G., and Aldrich, E. (2006). Calibrated probabilistic forecasting at the Stateline wind energy center. Journal of the American Statistical Association, 101(475):968\u2013979.\n\nKoller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.\n\nKusiak, A., Zheng, H., and Song, Z. (2009). Short-term prediction of wind farm power: A data mining approach. IEEE Transactions on Energy Conversion, 24(1):125\u2013136.\n\nLauritzen, S. L. and Jensen, F. (2001). Stable local computation with conditional Gaussian distributions. Statistics and Computing, 11:191\u2013203.\n\nLerner, U. and Parr, R. (2001). Inference in hybrid networks: Theoretical limits and practical algorithms. In Breese, J. and Koller, D., editors, Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI-01), pages 310\u2013318, San Francisco, CA. Morgan Kaufmann Publishers.\n\nLindenberg, S. (2008). 20% wind energy by 2030: Increasing wind energy's contribution to U.S. electricity supply. 
US Department of Energy Report.\n\nMarti, I., San Isidro, M., Cabez\u00f3n, D., Loureiro, Y., Villanueva, J., Cantero, E., and Perez, I. (2004). Wind power prediction in complex terrain: from the synoptic scale to the local scale. In EAWE Conference, The Science of Making Torque from Wind, Delft, The Netherlands.\n\nMurphy, K. (1998). Fitting a conditional linear Gaussian distribution. http://www.cs.ubc.ca/~murphyk/Papers/learncg.pdf.\n\nParoli, R., Pistollato, S., Rosa, M., and Spezia, L. (2005). Non-homogeneous Markov mixture of periodic autoregressions for the analysis of air pollution in the lagoon of Venice. In Applied Stochastic Models and Data Analysis (ASMDA-2005), pages 1124\u20131132.\n\nPerez-Quiros, G., Camacho, M., and Poncela, P. (2010). Green shoots? Where, when and how? Working Papers 2010-04, FEDEA.\n\nPinson, P. and Madsen, H. (2008). Probabilistic forecasting of wind power at the minute time-scale with Markov-switching autoregressive models. Imagine.\n\nPiwko, D. and Jordan, G. (2010). The economic value of day-ahead wind forecasts for power grid operations. 2010 UWIG Workshop on Wind Forecasting.\n\nSinger, Y. and Warmuth, M. K. (1998). Batch and on-line parameter estimation of Gaussian mixtures based on the joint entropy. In Kearns, M. J., Solla, S. A., and Cohn, D. A., editors, NIPS, pages 578\u2013584. The MIT Press.\n\nTang, X. (2004). Autoregressive hidden Markov model with application in an El Ni\u00f1o study. Master's thesis, University of Saskatchewan, Saskatoon, Saskatchewan, Canada.\n\nToscani, D., Archetti, F., Quarenghi, L., Bargna, F., and Messina, E. (2010). A DSS for assessing the impact of environmental quality on emergency hospital admissions. In Health Care Management (WHCM), 2010 IEEE Workshop on, pages 1\u20136.\n\nWeiss, Y. and Freeman, W. T. (2001). Correctness of belief propagation in Gaussian graphical models of arbitrary topology. 
Neural Computation, 13(10):2173\u20132200.\n\nYanover, C. and Weiss, Y. (2003). Finding the M most probable configurations using loopy belief propagation. In Thrun, S., Saul, L. K., and Sch\u00f6lkopf, B., editors, NIPS. MIT Press.\n\nZheng, H. and Kusiak, A. (2009). Prediction of wind farm power ramp rates: A data-mining approach. Journal of Solar Energy Engineering.", "award": [], "sourceid": 1284, "authors": [{"given_name": "Chris", "family_name": "Barber", "institution": null}, {"given_name": "Joseph", "family_name": "Bockhorst", "institution": null}, {"given_name": "Paul", "family_name": "Roebber", "institution": null}]}