{"title": "Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting", "book": "Advances in Neural Information Processing Systems", "page_first": 5243, "page_last": 5253, "abstract": "Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situation. In this paper, we propose to tackle such forecasting problem with Transformer. Although impressed by its performance in our preliminary study, we found its two major weaknesses: (1) locality-agnostics: the point-wise dot-product self-attention in canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: space complexity of canonical Transformer grows quadratically with sequence length L, making directly modeling long time series infeasible. In order to solve these two issues, we first propose convolutional self-attention by producing queries and keys with causal convolution so that local context can be better incorporated into attention mechanism. Then, we propose LogSparse Transformer with only O(L(log L)^2) memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under constrained memory budget. 
Our experiments on both synthetic data and real-world datasets show that it compares favorably to the state-of-the-art.", "full_text": "Enhancing the Locality and Breaking the Memory\n\nBottleneck of Transformer on Time Series Forecasting\n\nShiyang Li\n\nshiyangli@ucsb.edu\n\nXiaoyong Jin\n\nx_jin@ucsb.edu\n\nYao Xuan\n\nyxuan@ucsb.edu\n\nXiyou Zhou\n\nxiyou@ucsb.edu\n\nWenhu Chen\n\nwenhuchen@ucsb.edu\n\nYu-Xiang Wang\n\nyuxiangw@cs.ucsb.edu\n\nXifeng Yan\n\nxyan@cs.ucsb.edu\n\nUniversity of California, Santa Barbara\n\nAbstract\n\nTime series forecasting is an important problem across many domains, including\npredictions of solar plant energy output, electricity consumption, and traffic jam\nsituation. In this paper, we propose to tackle such forecasting problem with\nTransformer [1]. Although impressed by its performance in our preliminary study,\nwe found its two major weaknesses: (1) locality-agnostics: the point-wise\ndot-product self-attention in canonical Transformer architecture is insensitive to local\ncontext, which can make the model prone to anomalies in time series; (2) memory\nbottleneck: space complexity of canonical Transformer grows quadratically with\nsequence length L, making directly modeling long time series infeasible. In\norder to solve these two issues, we first propose convolutional self-attention by\nproducing queries and keys with causal convolution so that local context can\nbe better incorporated into attention mechanism. Then, we propose LogSparse\nTransformer with only O(L(log L)^2) memory cost, improving forecasting accuracy\nfor time series with fine granularity and strong long-term dependencies under\nconstrained memory budget. Our experiments on both synthetic data and\nreal-world datasets show that it compares favorably to the state-of-the-art.\n\n1 Introduction\n\nTime series forecasting plays an important role in daily life to help people manage resources and make\ndecisions. 
For example, in the retail industry, probabilistic forecasting of product demand and supply\nbased on historical data can help people do inventory planning to maximize the profit. Although\nstill widely used, traditional time series forecasting models, such as State Space Models (SSMs) [2]\nand Autoregressive (AR) models, are designed to fit each time series independently. Besides, they\nalso require practitioners\u2019 expertise in manually selecting trend, seasonality and other components.\nTo sum up, these two major weaknesses have greatly hindered their applications in the modern\nlarge-scale time series forecasting tasks.\nTo tackle the aforementioned challenges, deep neural networks [3, 4, 5, 6] have been proposed as\nan alternative solution, where Recurrent Neural Network (RNN) [7, 8, 9] has been employed to\nmodel time series in an autoregressive fashion. However, RNNs are notoriously difficult to train [10]\nbecause of the vanishing and exploding gradient problems. Despite the emergence of various variants,\nincluding LSTM [11] and GRU [12], the issues still remain unresolved. As an example, [13] shows\nthat language models using LSTM have an effective context size of about 200 tokens on average\nbut are only able to sharply distinguish 50 tokens nearby, indicating that even LSTM struggles to\ncapture long-term dependencies. On the other hand, real-world forecasting applications often have\nboth long- and short-term repeating patterns [7]. For example, the hourly occupancy rate of a freeway\nin traffic data has both daily and hourly patterns. In such cases, how to model long-term dependencies\nbecomes the critical step in achieving promising performance.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nRecently, Transformer [1, 14] has been proposed as a brand new architecture which leverages attention\nmechanism to process a sequence of data. 
Unlike the RNN-based methods, Transformer allows the\nmodel to access any part of the history regardless of distance, making it potentially more suitable\nfor grasping the recurring patterns with long-term dependencies. However, canonical dot-product\nself-attention matches queries against keys without regard to local context, which may make the model\nprone to anomalies and bring underlying optimization issues. More importantly, space complexity of\ncanonical Transformer grows quadratically with the input length L, which causes a memory bottleneck\nwhen directly modeling long time series with fine granularity. We specifically delve into these two\nissues and investigate the applications of Transformer to time series forecasting. Our contributions\nare threefold:\n\u2022 We successfully apply Transformer architecture to time series forecasting and perform extensive\nexperiments on both synthetic and real datasets to validate Transformer\u2019s potential value in better\nhandling long-term dependencies than RNN-based models.\n\n\u2022 We propose convolutional self-attention by employing causal convolutions to produce queries and\nkeys in the self-attention layer. Query-key matching aware of local context, e.g. shapes, can help\nthe model achieve lower training loss and further improve its forecasting accuracy.\n\n\u2022 We propose LogSparse Transformer, with only O(L(log L)^2) space complexity to break the\nmemory bottleneck, not only making fine-grained long time series modeling feasible but also\nproducing comparable or even better results with much less memory usage, compared to canonical\nTransformer.\n\n2 Related Work\n\nDue to the wide applications of forecasting, various methods have been proposed to solve the problem.\nOne of the most prominent models is ARIMA [15]. Its statistical properties as well as the\nwell-known Box-Jenkins methodology [16] in the model selection procedure make it the first choice for\npractitioners. 
However, its linear assumption and limited scalability make it unsuitable for large-scale\nforecasting tasks. Further, information across similar time series cannot be shared since each time\nseries is fitted individually. In contrast, [17] models related time series data as a matrix and deals with\nforecasting as a matrix factorization problem. [18] proposes hierarchical Bayesian methods to learn\nacross multiple related count time series from the perspective of a graph model.\nDeep neural networks have been proposed to capture shared information across related time series\nfor accurate forecasting. [3] fuses traditional AR models with RNNs by modeling a probabilistic\ndistribution in an encoder-decoder fashion. Instead, [19] uses an RNN as an encoder and Multi-layer\nPerceptrons (MLPs) as a decoder to solve the so-called error accumulation issue and conduct\nmulti-ahead forecasting in parallel. [6] uses a global RNN to directly output the parameters of a linear\nSSM at each step for each time series, aiming to approximate nonlinear dynamics with locally linear\nsegments. In contrast, [9] deals with noise using a local Gaussian process for each time series while\nusing a global RNN to model the shared patterns. [20] tries to combine the advantages of AR models\nand SSMs, and maintains a complex latent process to conduct multi-step forecasting in parallel.\nThe well-known self-attention based Transformer [1] has recently been proposed for sequence\nmodeling and has achieved great success. Several recent works apply it to translation, speech, music\nand image generation [1, 21, 22, 23]. However, scaling attention to extremely long sequences is\ncomputationally prohibitive since the space complexity of self-attention grows quadratically with\nsequence length [21]. 
This becomes a serious issue in forecasting time series with fine granularity\nand strong long-term dependencies.\n\n3 Background\n\nProblem definition Suppose we have a collection of N related univariate time series {z_{i,1:t_0}}_{i=1}^N,\nwhere z_{i,1:t_0} \u225c [z_{i,1}, z_{i,2}, \u00b7\u00b7\u00b7, z_{i,t_0}] and z_{i,t} \u2208 \u211d denotes the value of time series i at time t (footnote 1). We\nare going to predict the next \u03c4 time steps for all time series, i.e. {z_{i,t_0+1:t_0+\u03c4}}_{i=1}^N. Besides, let\n{x_{i,1:t_0+\u03c4}}_{i=1}^N be a set of associated time-based covariate vectors with dimension d that are assumed\nto be known over the entire time period, e.g. day-of-the-week and hour-of-the-day. We aim to model\nthe following conditional distribution\n\np(z_{i,t_0+1:t_0+\u03c4} | z_{i,1:t_0}, x_{i,1:t_0+\u03c4}; \u03a6) = \u220f_{t=t_0+1}^{t_0+\u03c4} p(z_{i,t} | z_{i,1:t\u22121}, x_{i,1:t}; \u03a6).\n\nWe reduce the problem to learning a one-step-ahead prediction model p(z_t | z_{1:t\u22121}, x_{1:t}; \u03a6) (footnote 2), where\n\u03a6 denotes the learnable parameters shared by all time series in the collection. To fully utilize both\nthe observations and covariates, we concatenate them to obtain an augmented matrix as follows:\n\ny_t \u225c [z_{t\u22121} \u25e6 x_t] \u2208 \u211d^{d+1},\n\nY_t = [y_1, \u00b7\u00b7\u00b7, y_t]^T \u2208 \u211d^{t\u00d7(d+1)},\n\nwhere [\u00b7 \u25e6 \u00b7] represents concatenation. An appropriate model z_t \u223c f(Y_t) is then explored to predict\nthe distribution of z_t given Y_t.\n\nTransformer We instantiate f with Transformer (footnote 3) by taking advantage of the multi-head\nself-attention mechanism, since self-attention enables Transformer to capture both long- and short-term\ndependencies, and different attention heads learn to focus on different aspects of temporal patterns.\nThese advantages make Transformer a good candidate for time series forecasting. 
We briefly introduce\nits architecture here and refer readers to [1] for more details.\nIn the self-attention layer, a multi-head self-attention sublayer simultaneously transforms Y (footnote 4) into H\ndistinct query matrices Q_h = Y W_h^Q, key matrices K_h = Y W_h^K, and value matrices V_h = Y W_h^V\nrespectively, with h = 1, \u00b7\u00b7\u00b7, H. Here W_h^Q, W_h^K \u2208 \u211d^{(d+1)\u00d7d_k} and W_h^V \u2208 \u211d^{(d+1)\u00d7d_v} are learnable\nparameters. After these linear projections, the scaled dot-product attention computes a sequence of\nvector outputs:\n\nO_h = Attention(Q_h, K_h, V_h) = softmax((Q_h K_h^T / \u221ad_k) \u00b7 M) V_h.\n\nNote that a mask matrix M is applied to filter out rightward attention by setting all upper triangular\nelements to \u2212\u221e, in order to avoid future information leakage. Afterwards, O_1, O_2, \u00b7\u00b7\u00b7, O_H are\nconcatenated and linearly projected again. Upon the attention output, a position-wise feedforward\nsublayer with two layers of fully-connected network and a ReLU activation in the middle is stacked.\n\n4 Methodology\n\n4.1 Enhancing the locality of Transformer\n\nPatterns in time series may evolve with time significantly due to various events, e.g. holidays and\nextreme weather, so whether an observed point is an anomaly, change point or part of the patterns\nis highly dependent on its surrounding context. However, in the self-attention layers of canonical\nTransformer, the similarities between queries and keys are computed based on their point-wise values\nwithout fully leveraging local context like shape, as shown in Figure 1(a) and (b). Query-key matching\nagnostic of local context may confuse the self-attention module in terms of whether the observed\nvalue is an anomaly, change point or part of patterns, and bring underlying optimization issues.\nWe propose convolutional self-attention to ease the issue. The architectural view of proposed\nconvolutional self-attention is illustrated in Figure 1(c) and (d). 
\n\nFootnote 1: Here time index t is relative, i.e. the same t in different time series may represent different actual time points.\nFootnote 2: Since the model is applicable to all time series, we omit the subscript i for simplicity and clarity.\nFootnote 3: By referring to Transformer, we only consider the autoregressive Transformer-decoder in the following.\nFootnote 4: At each time step the same model is applied, so we simplify the formulation with some abuse of notation.\n\nFigure 1: The comparison between canonical and our convolutional self-attention layers. \u201cConv, 1\u201d\nand \u201cConv, k\u201d mean convolution of kernel size {1, k} with stride 1, respectively. Canonical\nself-attention as used in Transformer is shown in (b); it may wrongly match point-wise inputs as shown\nin (a). Convolutional self-attention is shown in (d), which uses convolutional layers of kernel size k\nwith stride 1 to transform inputs (with proper paddings) into queries/keys. Such locality awareness\ncan correctly match the most relevant features based on shape matching in (c).\n\nFigure 2: Learned attention patterns from a 10-layer canonical Transformer trained on traffic-f\ndataset with full attention. The green dashed line indicates the start time of forecasting and the\ngray dashed line on its left side is the conditional history. Blue, cyan and red lines correspond to\nattention patterns in layer 2, 6 and 10, respectively, for a head when predicting the value at the time\ncorresponding to the green dashed line. a) Layer 2 tends to learn shared patterns in every day. b)\nLayer 6 focuses more on weekend patterns. c) Layer 10 further squeezes most of its attention on only\nseveral cells in weekends, causing most of the others to receive little attention.\n\nRather than using convolution of kernel size 1 with stride 1 (matrix multiplication), we employ causal\nconvolution of kernel size k with stride 1 to transform inputs (with proper paddings) into queries and keys. 
Note that causal\nconvolutions ensure that the current position never has access to future information. By employing\ncausal convolution, generated queries and keys can be more aware of local context and hence compute\ntheir similarities by their local context information, e.g. local shapes, instead of point-wise values,\nwhich can be helpful for accurate forecasting. Note that when k = 1, the convolutional self-attention\ndegenerates to canonical self-attention, so it can be seen as a generalization.\n\n4.2 Breaking the memory bottleneck of Transformer\n\nTo motivate our approach, we first perform a qualitative assessment of the learned attention patterns\nwith a canonical Transformer on traffic-f dataset. The traffic-f dataset contains occupancy\nrates of 963 car lanes of San Francisco bay area recorded every 20 minutes [6]. We trained a 10-layer\ncanonical Transformer on traffic-f dataset with full attention and visualized the learned attention\npatterns. One example is shown in Figure 2. Layer 2 clearly exhibited global patterns; however, layers\n6 and 10 exhibited only pattern-dependent sparsity, suggesting that some form of sparsity could be\nintroduced without significantly affecting performance. More importantly, for a sequence with length\nL, computing attention scores between every pair of cells will cause O(L^2) memory usage, making\nmodeling long time series with fine granularity and strong long-term dependencies prohibitive.\nWe propose LogSparse Transformer, which only needs to calculate O(log L) dot products for each\ncell in each layer. Further, we only need to stack up to O(log L) layers and the model will be able to\naccess every cell\u2019s information. Hence, the total cost of memory usage is only O(L(log L)^2). 
We\ndefine I_l^k as the set of indices of the cells that cell l can attend to during the computation from the kth\nlayer to the (k + 1)th layer. In the standard self-attention of Transformer, I_l^k = {j : j \u2264 l}, allowing\nevery cell to attend to all its past cells and itself as shown in Figure 3(a). However, such an algorithm\nsuffers from the quadratic space complexity growth along with the input length. To alleviate such an\nissue, we propose to select a subset of the indices I_l^k \u2282 {j : j \u2264 l} so that |I_l^k| does not grow too\nfast along with l. An effective way of choosing indices is |I_l^k| \u221d log L.\nNotice that cell l is a weighted combination of cells indexed by I_l^k in the kth self-attention layer and can\npass the information of cells indexed by I_l^k to its successors in the next layer. Let S_l^k be the set which\ncontains indices of all the cells whose information has passed to cell l up to the kth layer. To ensure that\nevery cell receives the information from all its previous cells and itself, the number of stacked layers\nk\u0303_l should satisfy that S_l^{k\u0303_l} = {j : j \u2264 l} for l = 1, \u00b7\u00b7\u00b7, L. That is, \u2200l and j \u2264 l, there is a directed\npath P_{jl} = (j, p_1, p_2, \u00b7\u00b7\u00b7, l) with k\u0303_l edges, where j \u2208 I_{p_1}^1, p_1 \u2208 I_{p_2}^2, \u00b7\u00b7\u00b7, p_{k\u0303_l\u22121} \u2208 I_l^{k\u0303_l}.\nWe propose LogSparse self-attention by allowing each cell only to attend to its previous cells\nwith an exponential step size and itself. That is, \u2200k and l, I_l^k = {l \u2212 2^{\u230alog_2 l\u230b}, l \u2212 2^{\u230alog_2 l\u230b\u22121},\nl \u2212 2^{\u230alog_2 l\u230b\u22122}, ..., l \u2212 2^0, l}, where \u230a\u00b7\u230b denotes the floor operation, as shown in Figure 3(b) (footnote 5).\nTheorem 1. \u2200l and j \u2264 l, there is at least one path from cell j to cell l if we stack \u230alog_2 l\u230b + 1 layers.\nMoreover, for j < l, the number of feasible unique paths from cell j to cell l increases at a rate of\nO(\u230alog_2(l \u2212 j)\u230b!).\nThe proof, deferred to Appendix A.1, uses a constructive argument.\nTheorem 1 implies that despite an exponential decrease in the memory usage (from O(L^2) to\nO(L log_2 L)) in each layer, the information could still flow from any cell to any other cell provided\nthat we go slightly \u201cdeeper\u201d, i.e. take the number of layers to be \u230alog_2 L\u230b + 1. Note that this implies\nan overall memory usage of O(L(log_2 L)^2) and addresses the notorious scalability bottleneck of\nTransformer under GPU memory constraint [1]. Moreover, as two cells become further apart, the\nnumber of paths increases at a super-exponential rate in log_2(l \u2212 j), which indicates a rich\ninformation flow for modeling delicate long-term dependencies.\n\nLocal Attention We can allow each cell to densely attend to cells in its left window of size\nO(log_2 L) so that more local information, e.g. trend, can be leveraged for current step forecasting.\nBeyond the neighbor cells, we can resume our LogSparse attention strategy as shown in Figure 3(c).\n\nRestart Attention Further, one can divide the whole input with length L into subsequences and set\neach subsequence length L_sub \u221d L. For each of them, we apply the LogSparse attention strategy.\nOne example is shown in Figure 3(d).\nEmploying local attention and restart attention won\u2019t change the complexity of our sparse attention\nstrategy but will create more paths and decrease the required number of edges in the path. Note that\none can combine local attention and restart attention together.\n\nFigure 3: Illustration of different attention mechanisms between adjacent layers in Transformer.\n(a) Full self-attention. (b) LogSparse self-attention. (c) Local attention + LogSparse self-attention.\n(d) Restart attention + LogSparse self-attention.\n\nFootnote 5: Applying other bases is trivial so we don\u2019t discuss other bases here for simplicity and clarity.\n\n5 Experiments\n\n5.1 Synthetic datasets\n\nTo demonstrate Transformer\u2019s capability to capture long-term dependencies, we conduct experiments\non synthetic data. Specifically, we generate piece-wise sinusoidal signals\n\nf(x) =\n  A_1 sin(\u03c0x/6) + 72 + N_x,   x \u2208 [0, 12),\n  A_2 sin(\u03c0x/6) + 72 + N_x,   x \u2208 [12, 24),\n  A_3 sin(\u03c0x/6) + 72 + N_x,   x \u2208 [24, t_0),\n  A_4 sin(\u03c0x/12) + 72 + N_x,  x \u2208 [t_0, t_0 + 24),\n\nwhere x is an integer, A_1, A_2, A_3 are randomly generated by uniform distribution on [0, 60], A_4 =\nmax(A_1, A_2) and N_x \u223c N(0, 1). Following the forecasting setting in Section 3, we aim to predict\nthe last 24 steps given the previous t_0 data points. Intuitively, larger t_0 makes forecasting more\ndifficult since the model is required to understand and remember the relation between A_1 and A_2\nto make correct predictions after t_0 \u2212 24 steps of irrelevant signals. Hence, we create 8 different\ndatasets by varying the value of t_0 within {24, 48, 72, 96, 120, 144, 168, 192}. For each dataset, we\ngenerate 4.5K, 0.5K and 1K time series instances for training, validation and test set, respectively.\nAn example time series with t_0 = 96 is shown in Figure 4(a).\nIn this experiment, we use a 3-layer canonical Transformer with standard self-attention. For\ncomparison, we employ DeepAR [3], an autoregressive model based on a 3-layer LSTM, as our baseline.\nBesides, to examine if larger capacity could improve performance of DeepAR, we also gradually\nincrease its hidden size h as {20, 40, 80, 140, 200}. 
Following [3, 6], we evaluate both methods using\n\u03c1-quantile loss R_\u03c1 with \u03c1 \u2208 (0, 1),\n\nR_\u03c1(x, x\u0302) = 2 \u2211_{i,t} D_\u03c1(x_t^{(i)}, x\u0302_t^{(i)}) / \u2211_{i,t} |x_t^{(i)}|,   D_\u03c1(x, x\u0302) = (\u03c1 \u2212 I_{x \u2264 x\u0302})(x \u2212 x\u0302),\n\nwhere x\u0302 is the empirical \u03c1-quantile of the predictive distribution and I_{x \u2264 x\u0302} is an indicator function.\nFigure 4(b) presents the performance of DeepAR and Transformer on the synthetic datasets.\nWhen t_0 = 24, both of them perform very well. But, as t_0 increases, especially when t_0 \u2265 96,\nthe performance of DeepAR drops significantly while Transformer keeps its accuracy, suggesting\nthat Transformer can capture fairly long-term dependencies when LSTM fails to do so.\n\nFigure 4: (a) An example time series with t_0 = 96. Black line is the conditional history while red\ndashed line is the target. (b) Performance comparison between DeepAR and canonical Transformer\nalong with the growth of t_0. The larger t_0 is, the longer dependencies the models need to capture\nfor accurate forecasting.\n\n5.2 Real-world datasets\n\nWe further evaluate our model on several real-world datasets. The electricity-f (fine)\ndataset consists of electricity consumption of 370 customers recorded every 15 minutes and\nthe electricity-c (coarse) dataset is the aggregated electricity-f by every 4 points,\nproducing hourly electricity consumption. Similarly, the traffic-f (fine) dataset contains\noccupancy rates of 963 freeway car lanes in San Francisco recorded every 20 minutes and the traffic-c\n(coarse) contains hourly occupancy rates by averaging every 3 points in traffic-f. The\nsolar dataset (footnote 6) contains the solar power production records from January to August in 2006,\nwhich is sampled every hour from 137 PV plants in Alabama. The wind (footnote 7) dataset contains daily\nestimates of 28 countries\u2019 energy potential from 1986 to 2015 as a percentage of a power plant\u2019s\nmaximum output. The M4-Hourly contains 414 hourly time series from M4 competition [24].\n\nFootnote 6: https://www.nrel.gov/grid/solar-power-data.html\nFootnote 7: https://www.kaggle.com/sohier/30-years-of-european-wind-generation\n\nTable 1: Results summary (R_{0.5}/R_{0.9}-loss) of all methods. e-c and t-c represent electricity-c\nand traffic-c, respectively. In the 1st and 3rd rows, we perform rolling-day prediction of 7 days\nwhile in the 2nd and 4th rows, we directly forecast 7 days ahead. TRMF outputs point predictions, so\nwe only report R_{0.5}. \u22c4 denotes results from [6].\n\nDataset | ARIMA | ETS | TRMF | DeepAR | DeepState | Ours\ne-c_{1d} | 0.154/0.102 | 0.101/0.077 | 0.084/- | 0.075\u22c4/0.040\u22c4 | 0.083\u22c4/0.056\u22c4 | 0.059/0.034\ne-c_{7d} | 0.283\u22c4/0.109\u22c4 | 0.121\u22c4/0.101\u22c4 | 0.087/- | 0.082/0.053 | 0.085\u22c4/0.052\u22c4 | 0.070/0.044\nt-c_{1d} | 0.223/0.137 | 0.236/0.148 | 0.186/- | 0.161\u22c4/0.099\u22c4 | 0.167\u22c4/0.113\u22c4 | 0.122/0.081\nt-c_{7d} | 0.492\u22c4/0.280\u22c4 | 0.509\u22c4/0.529\u22c4 | 0.202/- | 0.179/0.105 | 0.168\u22c4/0.114\u22c4 | 0.139/0.094\n\nFigure 5: Training curve comparison (with proper smoothing) among kernel size k \u2208 {1, 3, 9} in\ntraffic-c (left) and electricity-c (right) dataset. Being aware of larger local context size, the\nmodel can achieve lower training error and converge faster.\n\nLong-term and short-term forecasting We first show the effectiveness of canonical Transformer\nequipped with convolutional self-attention in long-term and short-term forecasting in\nelectricity-c and traffic-c dataset. These two datasets exhibit both hourly and daily\nseasonal patterns. However, traffic-c demonstrates much greater difference between the patterns of\nweekdays and weekends compared to electricity-c. Hence, accurate forecasting in traffic-c\ndataset requires the model to capture both long- and short-term dependencies very well. 
As baselines,\nwe use classical forecasting methods auto.arima, ets implemented in R\u2019s forecast package and\nthe recent matrix factorization method TRMF [17], an RNN-based autoregressive model DeepAR and an\nRNN-based state space model DeepState [6]. For short-term forecasting, we evaluate rolling-day\nforecasts for seven days (i.e., prediction horizon is one day and forecasts start time is shifted by one\nday after evaluating the prediction for the current day [6]). For long-term forecasting, we directly\nforecast 7 days ahead. As shown in Table 1, our models with convolutional self-attention get better\nresults in both long-term and short-term forecasting, especially in traffic-c dataset compared to\nstrong baselines, partly due to the long-term dependency modeling ability of Transformer as shown\nin our synthetic data.\n\nConvolutional self-attention In this experiment, we conduct an ablation study of our proposed\nconvolutional self-attention. We explore different kernel sizes k \u2208 {1, 2, 3, 6, 9} on the full attention model\nand fix all other settings. We still use rolling-day prediction for seven days on electricity-c and\ntraffic-c datasets. The results of different kernel sizes on both datasets are shown in Table 2. On\nelectricity-c dataset, models with kernel size k \u2208 {2, 3, 6, 9} obtain slightly better results in\nterms of R_{0.5} than canonical Transformer but overall these results are comparable and all of them\nperform very well. We argue it is because electricity-c dataset is less challenging and covariate\nvectors have already provided models with rich information for accurate forecasting. Hence, being\naware of larger local context may not help a lot in such cases. However, on the much more challenging\ntraffic-c dataset, the model with larger kernel size k can make more accurate forecasts than\nmodels with smaller ones, with as large as 9% relative improvement. 
These consistent gains can be\nthe results of more accurate query-key matching by being aware of more local context. Further, to\nverify if incorporating more local context into query-key matching can ease the training, we plot the\ntraining loss of kernel size k \u2208 {1, 3, 9} in electricity-c and traffic-c datasets. We found that\nTransformer with convolutional self-attention also converged faster and to lower training errors, as\nshown in Figure 5, proving that being aware of local context can ease the training process.\n\nTable 2: Average R_{0.5}/R_{0.9}-loss of different kernel sizes for rolling-day prediction of 7 days.\n\nDataset | k = 1 | k = 2 | k = 3 | k = 6 | k = 9\nelectricity-c_{1d} | 0.060/0.030 | 0.058/0.030 | 0.057/0.031 | 0.057/0.031 | 0.059/0.034\ntraffic-c_{1d} | 0.134/0.089 | 0.124/0.085 | 0.123/0.083 | 0.123/0.083 | 0.122/0.081\n\nSparse attention Further, we compare our proposed LogSparse Transformer to the full attention\ncounterpart on fine-grained datasets, electricity-f and traffic-f. Note that time series in\nthese two datasets have much longer periods and are noisier compared to electricity-c and\ntraffic-c. We first compare them under the same memory budget. For electricity-f dataset,\nwe choose L_{e1} = 768 with subsequence length L_{e1}/8 and local attention length log_2(L_{e1}/8) in each\nsubsequence for our sparse attention model and L_{e2} = 293 in the full attention counterpart. For\ntraffic-f dataset, we select L_{t1} = 576 with subsequence length L_{t1}/8 and local attention length\nlog_2(L_{t1}/8) in each subsequence for our sparse attention model, and L_{t2} = 254 in the full attention\ncounterpart. The calculation of memory usage and other details can be found in Appendix A.4. We\nconduct experiments on aforementioned sparse and full attention models with/without convolutional\nself-attention on both datasets. By following such settings, we summarize our results in Table 3\n(Upper part). 
Whether or not they are equipped with convolutional self-attention, our sparse attention models\nachieve comparable results on electricity-f but much better results on traffic-f compared\nto their full attention counterparts. Such performance gain on traffic-f could be the result of the\ndataset\u2019s stronger long-term dependencies and our sparse model\u2019s better capability of capturing these\ndependencies, which, under the same memory budget, the full attention model cannot match. In\naddition, both sparse and full attention models benefit from convolutional self-attention on challenging\ntraffic-f, proving its effectiveness.\nTo explore how well our sparse attention model performs compared to full attention model with\nthe same input length, we set L_{e2} = L_{e1} = 768 and L_{t2} = L_{t1} = 576 on electricity-f and\ntraffic-f, respectively. The results of their comparisons are summarized in Table 3 (Lower part).\nAs one expects, full attention Transformers can outperform our sparse attention counterparts in most\ncases, whether or not they are equipped with convolutional self-attention. However, on traffic-f\ndataset with strong long-term dependencies, our sparse Transformer with convolutional self-attention\ncan get better results than the canonical one and, more interestingly, even slightly outperform its full\nattention counterpart in terms of R_{0.5}, meaning that our sparse model with convolutional self-attention\ncan capture long-term dependencies fairly well. In addition, full attention models under length\nconstraint consistently obtain gains from convolutional self-attention on both electricity-f and\ntraffic-f datasets, showing its effectiveness again.\n\nTable 3: Average R_{0.5}/R_{0.9}-loss comparisons between sparse attention and full attention models\nwith/without convolutional self-attention by rolling-day prediction of 7 days. 
\u201cFull\u201d means models\nare trained with full attention while \u201cSparse\u201d means they are trained with our sparse attention strategy.\n\u201c+ Conv\u201d means models are equipped with convolutional self-attention with kernel size k = 6.\n\nConstraint | Dataset | Full | Sparse | Full + Conv | Sparse + Conv\nMemory | electricity-f_{1d} | 0.083/0.051 | 0.084/0.047 | 0.078/0.048 | 0.079/0.049\nMemory | traffic-f_{1d} | 0.161/0.109 | 0.150/0.098 | 0.149/0.102 | 0.138/0.092\nLength | electricity-f_{1d} | 0.082/0.047 | 0.084/0.047 | 0.074/0.042 | 0.079/0.049\nLength | traffic-f_{1d} | 0.147/0.096 | 0.150/0.098 | 0.139/0.090 | 0.138/0.092\n\nFurther Exploration In our last experiment, we evaluate how our methods perform on datasets\nwith various granularities compared to our baselines. All datasets except M4-Hourly are evaluated\nby rolling window 7 times since the test set of M4-Hourly has been provided. The results are shown\nin Table 4. These results further show that our method achieves the best performance overall.\n\nTable 4: R_{0.5}/R_{0.9}-loss of datasets with various granularities. The subscript of each dataset presents\nthe forecasting horizon (days). TRMF is not applicable for M4-Hourly_{2d} and we leave it blank. For\nother datasets, TRMF outputs point predictions, so we only report R_{0.5}. \u22c4 denotes results from [10].\n\nMethod | electricity-f_{1d} | traffic-f_{1d} | solar_{1d} | M4-Hourly_{2d} | wind_{30d}\nTRMF | 0.094/- | 0.213/- | 0.241/- | -/- | 0.311/-\nDeepAR | 0.082/0.063 | 0.230/0.150 | 0.222/0.093 | 0.090\u22c4/0.030\u22c4 | 0.286/0.116\nOurs | 0.074/0.042 | 0.139/0.090 | 0.210/0.082 | 0.067/0.025 | 0.284/0.108\n\n6 Conclusion\n\nIn this paper, we propose to apply Transformer in time series forecasting. Our experiments on\nboth synthetic data and real datasets suggest that Transformer can capture long-term dependencies\nwhile LSTM may suffer. 
We also show, on real-world datasets, that the proposed convolutional\nself-attention further improves Transformer\u2019s performance and achieves state-of-the-art results in different\nsettings in comparison with recent RNN-based methods, a matrix factorization method, and\nclassic statistical approaches. In addition, with the same memory budget, our sparse attention models\ncan achieve better results on data with long-term dependencies. Exploring better sparsity strategies in\nself-attention and extending our method to better fit small datasets are our future research directions.\n\nReferences\n\n[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz\nKaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing\nsystems, pages 5998\u20136008, 2017.\n\n[2] James Durbin and Siem Jan Koopman. Time series analysis by state space methods. Oxford university\npress, 2012.\n\n[3] Valentin Flunkert, David Salinas, and Jan Gasthaus. Deepar: Probabilistic forecasting with autoregressive\nrecurrent networks. arXiv preprint arXiv:1704.04110, 2017.\n\n[4] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.\n\n[5] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In\nAdvances in neural information processing systems, pages 3104\u20133112, 2014.\n\n[6] Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim\nJanuschowski. Deep state space models for time series forecasting. In Advances in Neural Information\nProcessing Systems, pages 7785\u20137794, 2018.\n\n[7] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term temporal\npatterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research &\nDevelopment in Information Retrieval, pages 95\u2013104. 
ACM, 2018.\n\n[8] Rose Yu, Stephan Zheng, Anima Anandkumar, and Yisong Yue. Long-term forecasting using tensor-train\nrnns. arXiv preprint arXiv:1711.00073, 2017.\n\n[9] Danielle C Maddix, Yuyang Wang, and Alex Smola. Deep factors with gaussian processes for forecasting.\narXiv preprint arXiv:1812.00098, 2018.\n\n[10] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural\nnetworks. In International conference on machine learning, pages 1310\u20131318, 2013.\n\n[11] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u20131780,\n1997.\n\n[12] Kyunghyun Cho, Bart Van Merri\u00ebnboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of\nneural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.\n\n[13] Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. Sharp nearby, fuzzy far away: How neural\nlanguage models use context. arXiv preprint arXiv:1805.04623, 2018.\n\n[14] Ankur P Parikh, Oscar T\u00e4ckstr\u00f6m, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model\nfor natural language inference. arXiv preprint arXiv:1606.01933, 2016.\n\n[15] George EP Box and Gwilym M Jenkins. Some recent advances in forecasting and control. Journal of the\nRoyal Statistical Society. Series C (Applied Statistics), 17(2):91\u2013109, 1968.\n\n[16] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series analysis:\nforecasting and control. John Wiley & Sons, 2015.\n\n[17] Hsiang-Fu Yu, Nikhil Rao, and Inderjit S Dhillon. Temporal regularized matrix factorization for\nhigh-dimensional time series prediction. In Advances in neural information processing systems, pages 847\u2013855,\n2016.\n\n[18] Nicolas Chapados. Effective bayesian modeling of groups of related count time series. 
arXiv preprint arXiv:1405.3738, 2014.\n\n[19] Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy, and Dhruv Madeka. A multi-horizon quantile\nrecurrent forecaster. arXiv preprint arXiv:1711.11053, 2017.\n\n[20] Xiaoyong Jin, Shiyang Li, Yunkai Zhang, and Xifeng Yan. Multi-step deep autoregressive forecasting\nwith latent states. URL http://roseyu.com/time-series-workshop/submissions/2019/timeseries-ICML19_paper_19.pdf, 2019.\n\n[21] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Curtis Hawthorne, Andrew M\nDai, Matthew D Hoffman, and Douglas Eck. An improved relative self-attention mechanism for transformer\nwith application to music generation. arXiv preprint arXiv:1809.04281, 2018.\n\n[22] Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, and Sanjeev Khudanpur. A time-restricted\nself-attention layer for asr. In 2018 IEEE International Conference on Acoustics, Speech and Signal\nProcessing (ICASSP), pages 5874\u20135878. IEEE, 2018.\n\n[23] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, \u0141ukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin\nTran. Image transformer. arXiv preprint arXiv:1802.05751, 2018.\n\n[24] Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. The m4 competition: Results,\nfindings, conclusion and way forward. International Journal of Forecasting, 34(4):802\u2013808, 2018.\n\n[25] Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when\ncan deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of\nAutomation and Computing, 14(5):503\u2013519, 2017.\n\n[26] Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern\nrecognition unaffected by shift in position. Biological cybernetics, 36(4):193\u2013202, 1980.\n\n[27] Hrushikesh N Mhaskar and Tomaso Poggio. Deep vs. shallow networks: An approximation theory\nperspective. 
Analysis and Applications, 14(06):829\u2013848, 2016.\n\n[28] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\narXiv:1412.6980, 2014.\n\n[29] A\u00e4ron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal\nKalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio.\nSSW, 125, 2016.\n\n[30] Anastasia Borovykh, Sander Bohte, and Cornelis W Oosterlee. Conditional time series forecasting with\nconvolutional neural networks. arXiv preprint arXiv:1703.04691, 2017.\n\n[31] Scott Gray, Alec Radford, and Diederik P. Kingma. Gpu kernels for block-sparse weights. arXiv preprint\narXiv:1711.09224, 2017.\n\n[32] Tim Cooijmans, Nicolas Ballas, C\u00e9sar Laurent, \u00c7a\u011flar G\u00fcl\u00e7ehre, and Aaron Courville. Recurrent batch\nnormalization. arXiv preprint arXiv:1603.09025, 2016.\n\n[33] Rose Yu, Yaguang Li, Cyrus Shahabi, Ugur Demiryurek, and Yan Liu. Deep learning: A generic approach\nfor extreme condition traffic forecasting. In Proceedings of the 2017 SIAM International Conference on\nData Mining, pages 777\u2013785. SIAM, 2017.\n\n[34] Guoqiang Zhang, B Eddy Patuwo, and Michael Y Hu. Forecasting with artificial neural networks: The\nstate of the art. International journal of forecasting, 14(1):35\u201362, 1998.\n\n[35] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse\ntransformers. arXiv preprint arXiv:1904.10509, 2019.\n\n[36] Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam\nShazeer. Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198, 2018.\n\n[37] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding\nby generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language_understanding_paper.pdf, 2018.\n\n[38] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional\ntransformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.\n\n[39] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse\ntransformers. arXiv preprint arXiv:1904.10509, 2019.\n\n[40] Nikolay Laptev, Jason Yosinski, Li Erran Li, and Slawek Smyl. Time-series extreme event forecasting with\nneural networks at uber. In International Conference on Machine Learning, number 34, pages 1\u20135, 2017.\n", "award": [], "sourceid": 2835, "authors": [{"given_name": "Shiyang", "family_name": "Li", "institution": "UCSB"}, {"given_name": "Xiaoyong", "family_name": "Jin", "institution": "UCSB"}, {"given_name": "Yao", "family_name": "Xuan", "institution": "University of California, Santa Barbara"}, {"given_name": "Xiyou", "family_name": "Zhou", "institution": "UC Santa Barbara"}, {"given_name": "Wenhu", "family_name": "Chen", "institution": "University of California, Santa Barbara"}, {"given_name": "Yu-Xiang", "family_name": "Wang", "institution": "UC Santa Barbara"}, {"given_name": "Xifeng", "family_name": "Yan", "institution": "UCSB"}]}