{"title": "DTWNet: a Dynamic Time Warping Network", "book": "Advances in Neural Information Processing Systems", "page_first": 11640, "page_last": 11650, "abstract": "Dynamic Time Warping (DTW) is widely used as a similarity measure in various domains. Due to its invariance against warping in the time axis, DTW provides more meaningful discrepancy measurements between two signals than other distance measures. In this paper, we propose a novel component in an artificial neural network. In contrast to the previous successful usage of DTW as a loss function, the proposed framework leverages DTW to obtain a better feature extraction. For the first time, the DTW loss is theoretically analyzed, and a stochastic backpropagation scheme is proposed to improve the accuracy and efficiency of the DTW learning. We also demonstrate that the proposed framework can be used as a data analysis tool to perform data decomposition.", "full_text": "DTWNet: a Dynamic Time Warping Network\n\nXingyu Cai\nUniversity of Connecticut\n\nTingyang Xu\nTencent AI Lab\n\nJinfeng Yi\nJD.com AI Lab\n\nJunzhou Huang\nTencent AI Lab\n\nSanguthevar Rajasekaran\nUniversity of Connecticut\n\nAbstract\n\nDynamic Time Warping (DTW) is widely used as a similarity measure in various domains. Due to its invariance against warping in the time axis, DTW provides more meaningful discrepancy measurements between two signals than other distance measures. In this paper, we propose a novel component in an artificial neural network. In contrast to the previous successful usage of DTW as a loss function, the proposed framework leverages DTW to obtain a better feature extraction. 
For the first time, the DTW loss is theoretically analyzed, and a stochastic backpropagation scheme is proposed to improve the accuracy and efficiency of the DTW learning. We also demonstrate that the proposed framework can be used as a data analysis tool to perform data decomposition.\n\n1 Introduction\n\nIn many data mining and machine learning problems, a proper metric of similarity or distance can play a significant role in model performance. The Minkowski distance, defined as dist(x, y) = (\sum_{k=1}^{d} |x_k - y_k|^p)^{1/p} for input x, y \in R^d, is one of the most popular metrics. In particular, when p = 1 it is called the Manhattan distance; when p = 2, it is the Euclidean distance. Another popular measure, known as the Mahalanobis distance, can be viewed as a distorted Euclidean distance. It is defined as dist(x, y) = ((x - y)^T \Sigma^{-1} (x - y))^{1/2}, where \Sigma \in R^{d x d} is the covariance matrix. With geometry in mind, these distance (or similarity) measures are straightforward and easy to represent.\nHowever, in the domain of sequence data analysis, both the Minkowski and Mahalanobis distances fail to reveal the true similarity between two targets. Dynamic Time Warping (DTW) [1] has been proposed as an attractive alternative. The most significant advantage of DTW is its invariance against signal warping (shifting and scaling in the time axis, or the Doppler effect). Therefore, DTW has become one of the most preferred measures in pattern matching tasks. For instance, two different sampling frequencies could generate two pieces of signal, one of which is just a compressed version of the other. In this case, the point-wise Euclidean distance would be large and deviate from the truth. On the contrary, DTW captures such scaling nicely and outputs a very small distance between them. 
DTW not only outputs the distance value, but also reveals how two sequences are aligned against each other. Sometimes the alignment itself is the more interesting output. Furthermore, DTW can be leveraged as a feature extraction tool, which makes it far more useful than a similarity measure alone. For example, predefined patterns can be identified in the data via the DTW computation; these patterns can then be used to classify temporal data into categories, e.g., [8]. Some interesting applications can be found in, e.g., [6, 14].\nThe standard algorithm for computing Dynamic Time Warping involves a Dynamic Programming (DP) process. With the help of O(n^2) space, a cost matrix C is built sequentially, where\n\nC_{i,j} = ||x_i - y_j|| + min{C_{i-1,j}, C_{i,j-1}, C_{i-1,j-1}}   (1)\n\nHere ||x_i - y_j|| denotes the norm of (x_i - y_j), e.g., the p-norm with p = 1, 2 or \infty. After performing the DP, we can trace back and identify the warping path from the cost matrix. This is illustrated in Figures 1a and 1b, where two sequences of different lengths are aligned.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Illustration of DTW Computation, Dynamic Programming and Warping Path. (a) DTW aligns x and y; (b) DTW path of x and y; (c) The path is fixed after DP.\n\nThere are speedup techniques to reduce DTW's time complexity, e.g., [15], which are beyond the scope of this paper. In general, a standard DP requires O(n^2) time.\nAlthough DTW is already one of the most important similarity measures and feature extraction tools in temporal data mining, it has not contributed much to the recent deep learning field. As we know, a powerful feature extractor is the key to the success of an artificial neural network (ANN). The best example could be the CNNs that utilize convolutional kernels to capture local and global features [10]. 
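The DP recurrence of Equation 1 and the traceback of the warping path can be sketched directly; the following is a minimal O(nm) reference implementation using the 1-norm as the element distance (a sketch for illustration, not the authors' optimized implementation):

```python
import numpy as np

def dtw(x, y):
    """Standard DTW via the recurrence
    C[i,j] = |x_i - y_j| + min(C[i-1,j], C[i,j-1], C[i-1,j-1])."""
    n, m = len(x), len(y)
    C = np.full((n + 1, m + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])  # 1-norm element distance
            C[i, j] = cost + min(C[i - 1, j], C[i, j - 1], C[i - 1, j - 1])
    # Trace back the warping path from C[n, m] to C[1, 1]
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)], key=lambda t: C[t])
    return C[n, m], path[::-1]

# A time-stretched copy of a signal stays close to it under DTW
d, path = dtw([0, 1, 2, 1, 0], [0, 1, 1, 2, 2, 1, 0])
```

Here the stretched copy yields a zero DTW distance even though the point-wise Euclidean distance is not even defined for the two different lengths; the returned path lists the aligned index pairs, which is what the backpropagation scheme of Section 3 consumes.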
Unlike convolution, DTW has a non-linear transformation property (warping), providing a summary of the target that is robust against Doppler effects. This makes DTW a good candidate as a feature extractor in general ANNs. With this motivation, we propose DTWNet, a neural network with learnable DTW kernels.\nKey Contributions: We apply learnable DTW kernels in neural networks to represent Doppler invariance in the data. To learn the DTW kernel, a stochastic backpropagation method based on the warping path is proposed to compute the gradient of a DP process. A convergence analysis of our backpropagation method is offered. To the best of the authors' knowledge, this is the first time the DTW loss function has been theoretically analyzed. A differentiable streaming DTW learning scheme is also proposed to overcome the problem of missing local features, caused by the global alignment of the standard DTW. An empirical study shows the effectiveness of the proposed backpropagation and the success of capturing features using DTW kernels. We also demonstrate a data decomposition application.\n\n2 Related Work\n\n2.1 Introduction of Dynamic Time Warping\n\nDynamic Time Warping is a very popular tool in temporal data mining. For instance, DTW is invariant to Doppler effects and is thus very useful in acoustic data analysis [14]. As another example, for biological signals such as ECG or EEG, DTW can be used to characterize potential diseases [24]. DTW is also a powerful feature extractor in conjunction with predefined patterns (features) in the time series classification problem [8]. Using the Hamming distance, the DTW alignment in this setting is called the Edit distance and is also well studied [7].\nDue to the Dynamic Programming involved in the DTW computation, the complexity of DTW can be high. More critically, DP is a sequential process, which makes DTW hard to parallelize. To speed up the computation, some well-known lower-bound based techniques [9, 12, 20] have been proposed. 
There are also attempts at parallelization of DP [25] and at GPU acceleration [22].\nTwo-dimensional DTW has also drawn research interest. In [11], the authors showed that DTW can be extended to the 2-D case for image matching. Note that this is different from another technique called multi-variate DTW [23, 13, 21], sometimes also referred to as multi-dimensional DTW. In multi-variate DTW, the input is a set of 1-D sequences, e.g., of dimension k x n where n is the sequence length. However, in 2-D or k-D DTW, the input is no longer a stack of 1-D sequences but images (n^2) or higher dimensional volumes (n^k). As a result, the cost of computing 2-D DTW can be as high as O(n^6), making it inapplicable to large datasets.\n\n2.2 SPRING Algorithm, the Streaming Version of DTW\n\nTo process streaming data under the DTW measure, [18] proposed a modified version of the DTW computation called SPRING. The original DTW aims to find the best alignment between two input sequences, and the alignment runs from the beginning of both sequences to the end. On the contrary, the streaming version tries to identify all the subsequences of a given sequence that are close to a given pattern under the DTW measure. The naive approach computes DTW between all possible subsequences and the pattern. Let the input sequence and the pattern be of lengths n and l, respectively. The naive method takes (nl + (n - 1)l + ...) = O(n^2 l) time. However, SPRING only takes O(nl) time, which is consistent with the standard DTW.\nSPRING modifies the original DTW computation with two key ideas. First, it prepends one wild-card to the pattern. 
When matching the pattern with the input, since the wild-card can represent any value, the start of the pattern can match any position in the input sequence at no cost. The second modification is that SPRING makes use of an auxiliary matrix to store the source of each entry in the original dynamic programming matrix. This source matrix keeps a record of each candidate path, so we can trace back from the end. Interested readers may refer to [18] for more details.\n\n2.3 DTW as a Loss Function\n\nRecently, in order to apply the DTW distance to optimization problems, the differentiability of DTW has been discussed in the literature. As we know, computing DTW is a sequential process in general. During the filling of the DP matrix, each step takes a min operation over the neighbors. Since the min operator is not continuous, the gradient or subgradient is not well defined. The first attempt to use a soft-min function to replace min is reported in [19]. In their paper, the authors provide the gradient of soft-min DTW, and perform shapelet learning to boost the performance of time series classification on a limited set of test datasets. Using the same soft-min idea, the authors of [4] empirically show that applying DTW as a loss function leads to better performance than the conventional Euclidean distance loss in a number of applications. Another very recent paper [2] also uses a continuous relaxation of the min operator in DTW to solve video alignment and segmentation problems.\n\n3 Proposed DTW Layer and its Backpropagation\n\nIn this paper, we propose to use DTW layers in a deep neural network. A DTW layer consists of multiple DTW kernels that extract meaningful features from the input. Each DTW kernel generates a single channel by performing the DTW computation between the kernel and the input sequences. For regular DTW, one distance value will be generated for each kernel. 
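The two SPRING modifications described in § 2.2 can be sketched as follows. This is a simplified single-best-match version, assuming 1-norm element distances: the zero-cost row plays the role of the prepended wild-card, and a `start` array plays the role of the source matrix (function and variable names are ours, not from [18]):

```python
def spring_best_match(y, x):
    """Return (distance, start, end) of the subsequence of stream y
    with the smallest DTW distance to pattern x, in O(n*l) time."""
    n, l = len(y), len(x)
    INF = float("inf")
    d = [0.0] + [INF] * l   # DP column; row 0 is the wild-card row
    s = [0] * (l + 1)       # start index carried along each candidate path
    best = (INF, 0, 0)
    for j in range(n):
        prev_d, prev_s = d, s
        d = [0.0] + [INF] * l        # wild-card matches y[j] at no cost
        s = [j + 1] + [0] * l        # a fresh match would begin at j + 1
        for i in range(1, l + 1):
            bd, bs = min((prev_d[i], prev_s[i]),          # x[i-1] repeats
                         (d[i - 1], s[i - 1]),            # y[j] repeats
                         (prev_d[i - 1], prev_s[i - 1]))  # diagonal step
            d[i] = abs(y[j] - x[i - 1]) + bd
            s[i] = bs
        if d[l] < best[0]:
            best = (d[l], s[l], j)   # distance, start, end (inclusive)
    return best

dist, start, end = spring_best_match([5, 1, 2, 3, 5, 5], [1, 2, 3])
```

Ties in the `min` are broken toward the earlier start index via tuple comparison; the full SPRING algorithm additionally reports every non-overlapping candidate path, which this sketch omits.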
For the streaming DTW, multiple values are output (details are given in § 5). If a sliding window is used, the DTW kernel generates a sequence of distances, just as a convolutional kernel does. After the DTW layer, linear layers can be appended to obtain classification or regression results. A complete example of DTWNet on a classification task is illustrated in Algorithm 1.\n\nAlgorithm 1 DTWNet training for a classification task. Network parameters are: number of DTW kernels N_kernel; kernels x_i \in R^l; linear layers with weights w.\nINPUT: Dataset Y = {(y_i, z_i) | y_i \in R^n, z_i \in Z = [1, N_class]}. The DTWNet dataflow is denoted as G_{x,w}: R^n -> Z.\nOUTPUT: The trained DTWNet G_{x,w}\n1: Init w; for i = 1 to N_kernel: randomly init x_i; set the total number of iterations T and the stopping condition \epsilon\n2: for t = 0 to T do\n3:   Sample a mini-batch (y, z) \in Y. Compute the DTWNet output: \hat{z} <- G_{x,w}(y)\n4:   Record the warping path P and obtain the determined form f_t(x, y), as in Equation 2\n5:   Let L_t <- L_CrossEntropy(\hat{z}, z). Compute \nabla_w L_t through regular BP.\n6:   For i = 1 to N_kernel: compute \nabla_{x_i} L_t <- \nabla_{x_i} f_t(x_i, y) \partial L_t / \partial f_t based on P, as in Equation 3\n7:   SGD update: let w <- w - \alpha \nabla_w L_t and for i = 1 to N_kernel do x_i <- x_i - \beta \nabla_{x_i} L_t\n8:   If \Delta L = |L_t - L_{t-1}| < \epsilon: return G_{x,w}\n\nGradient Calculation and Backpropagation\n\nTo achieve learning of the DTW kernels, we propose a novel gradient calculation and backpropagation (BP) approach. One simple but important observation is that after performing DP and obtaining the warping path, the path itself is settled for the current iteration. 
If the input sequence and the kernel are of lengths n and l, respectively, the length of the warping path cannot be larger than O(n + l). This means that the final DTW distance can be represented using O(n + l) terms, each of the form ||y_i - x_j|| where i, j \in S, and S is the set containing the indices of elements along the warping path. For example, if we use the 2-norm, the final squared DTW distance takes the following form:\n\ndtw^2(x, y) = f_t(x, y) = ||y_0 - x_0||_2^2 + ||y_1 - x_0||_2^2 + ||y_2 - x_1||_2^2 + ...   (2)\n\nThis is illustrated in Figure 1c, where the solid bold lines and the highlighted nodes represent the warping path after Dynamic Programming. Since the warping path is determined, other entries in the cost matrix no longer affect the DTW distance, so the differentiation can be done only along the path. Once the DTW distance is in its determined form, e.g., Equation 2, taking the derivative with respect to either x or y becomes trivial, e.g.,\n\n\nabla_x dtw^2(x, y) = \nabla_x f_t(x, y) = [2(2x_0 - y_0 - y_1), 2(x_1 - y_2), ...]^T   (3)\n\nSince the min operator does not have a gradient, directly applying auto-diff will result in a very high variance. Soft-min can somewhat mitigate this problem; however, as shown above, since the final DTW distance depends only on the elements along the warping path, differentiation over all the entries in the cost matrix is redundant. In addition, attention needs to be paid to the temperature hyperparameter in the soft-min approach, which controls the trade-off between accuracy and numerical stability.\nIn contrast, taking the derivative of the determined form along the warping path avoids this redundant computation. As the warping path length cannot exceed O(n + l), the differentiation only takes O(n + l) time instead of the O(nl) of the soft-min approaches. 
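Once the path is fixed, the gradient of Equation 3 is just an accumulation of per-pair terms along the path; a minimal sketch, assuming squared 2-norm element distances and a path given as (kernel index, input index) pairs:

```python
import numpy as np

def path_gradient(x, y, path):
    """Gradient of the squared-2-norm DTW distance w.r.t. the kernel x
    along a fixed warping path: each aligned pair (i, j) contributes
    2 * (x[i] - y[j]) to coordinate i (cf. Equations 2-3)."""
    g = np.zeros_like(x, dtype=float)
    for i, j in path:               # i indexes the kernel x, j the input y
        g[i] += 2.0 * (x[i] - y[j])
    return g

x = np.array([0.0, 1.0])
y = np.array([1.0, 1.0, 3.0])
warp = [(0, 0), (0, 1), (1, 2)]     # x_0 aligned to y_0 and y_1; x_1 to y_2
g = path_gradient(x, y, warp)
```

With x_0 aligned to two input elements, its entry sums two terms, 2(x_0 - y_0) + 2(x_0 - y_1), matching the 2(2x_0 - y_0 - y_1) coordinate of Equation 3; the loop runs in O(n + l), not O(nl).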
Note that there is still a variance arising from the difference in the DP's warping paths from iteration to iteration, so the BP can be viewed as a stochastic process.\nTime Complexity: The computation of the DTW loss requires building a Dynamic Programming matrix. The standard DP needs O(nl) time. There are speed-up/approximation techniques for DP, such as the banded constraint (limiting the warping path within a band), which are beyond the scope of this paper. The gradient is evaluated in O(n + l) time as shown above. Although the DP part is not parallelizable in general, parallelization can still be achieved across the independent evaluations of different kernels.\n\n4 DTW Loss and Convergence\n\nTo simplify the analysis, we consider one input sequence y \in R^n. The goal is to obtain a target kernel x \in R^l that has the best alignment with y, i.e., min_x dtw^2(x, y). Without loss of generality, we assume l <= n. The kernel x is randomly initialized and we perform learning through standard gradient descent. Define the DTW distance function as d = H_y(x), where d \in R is the DTW distance evaluated by performing the Dynamic Programming operator, i.e., d = DP(x, y).\nDefinition 1. Since DP provides a deterministic warping path for an arbitrary x, we define the space of all functions of x representing all possible warping paths as\n\nF_y = {f_y(x) | f_y(x) = \sum_{i,j} I_{ij} ||x_i - y_j||_2^2}\ns.t. i \in [0, l-1]; j \in [0, n-1]; I_{ij} \in {0, 1}; n <= |I| <= n + l; i, j satisfy temporal order constraints.\n\nHere the cardinality of I is within the range of n and n + l, because the warping path length can only be between n and n + l. The temporal order constraints make sure that the combination of i, j must be valid. 
For example, if x_i is aligned with y_j, then x_{i+1} cannot be aligned with y_{j-1}; otherwise the alignment would violate the DTW definition.\nWith Definition 1, when we perform Dynamic Programming at an arbitrary point x to evaluate H_y(x), we know that it must be equal to some function sampled from the functional space F_y, i.e., H_y(x)|_{x=\hat{x}} = f_y^{(u)}(x)|_{x=\hat{x}}, f_y^{(u)} \in F_y. So we can approximate H_y(x) as a collection of functions in F_y, where each x corresponds to its own sample function. In the proposed backpropagation step we compute the gradient of f_y^{(u)}(x) and perform the gradient descent using this gradient. The first question is whether \nabla_x f_y^{(u)}(x)|_{x=\hat{x}} = \nabla_x H_y(x)|_{x=\hat{x}}.\n\nFigure 2: Loss function d = H_y(x) and analysis. (a): H_y(x) approximated by quadratic f_y(x); (b): by linear f_y(x); the curves on the wall are projections of H_y(x) for better illustration. (c): Illustration of transitions from u to v; here f_y^{(v)}'s stationary point (where \nabla_{x_k} f_y^{(v)} = 0) is outside of v; (d): both u and v have bowl shapes.\n\nWe notice the fact that H_y(x) is not smooth in the space of x. More specifically, there exist positions x such that\n\nH_y(x) = { f_y^{(u)}(x)|_{x=x^+} ; f_y^{(v)}(x)|_{x=x^-} },  u != v;  f_y^{(u)}, f_y^{(v)} \in F_y   (4)\n\nwhere x^+ and x^- represent infinitesimal amounts of perturbation applied to x in opposite directions. However, note that the cardinality of F_y is finite. In fact, in the Dynamic Programming matrix, for any position, the warping path can only evolve in at most three directions, due to the temporal order constraints. In boundary positions, the warping path can evolve along only one direction. So we have:\nLemma 1. 
The number of warping paths satisfies |F_y| < 3^{n+l}, where (n + l) is the largest possible path length.\n\nThis means that the space of x is divided into regions such that H_y(x) is perfectly approximated by f_y^{(u)}(x) in the particular region u. In other words, the loss function H_y(x) is a piece-wise (or region-wise) quadratic function of x if we compute the DTW loss as a summation of squared 2-norms, e.g., dtw^2(x, y) = ||x_0 - y_0||_2^2 + ||x_1 - y_0||_2^2 + .... Similarly, if we use the absolute value as the element distance for the functions in the set F_y, then we obtain a piece-wise linear function as H_y(x). This is shown in Figures 2a and 2b. We perform Monte-Carlo simulations to generate the points and compute their corresponding DTW loss. The length of x is 6, but we only vary the middle two elements after a random initialization and hence can generate the 3-D plots. The length of y is 10. The elements in both x and y are randomly initialized within [0, 1]. Figure 2a verifies that H_y(x) is piece-wise quadratic using 2-norms, whereas Figure 2b corresponds to the piece-wise linear function.\n\nEscaping Local Minima\n\nSome recent theoretical work provides proofs of global convergence for non-convex neural network loss functions, e.g., see [5]. In this paper, we offer a different perspective for the analysis by exploiting the fact that the global loss function is piece-wise quadratic or linear, obtained by a DP process, and that the number of regions is bounded by O(3^{n+l}) (Lemma 1). Without loss of generality, we only consider H_Y(x) being piece-wise quadratic. Treating the regions as a collection of discrete states V, where |V| < 3^{n+l}, we first analyze the behavior of escaping u and jumping to its neighbor v, for u, v \in V, using standard gradient descent. Without loss of generality, we only look at coordinate k (x_k is of interest). Assume that after DP, a fraction y_{p:p+q} is aligned with x_k. 
Taking out the terms related to x_k, we can write the local quadratic function in u, and its partial derivative with respect to x_k, as\n\nf_y^{(u)} = \sum_{j=p}^{p+q} (y_j - x_k)^2 + \sum_{i,j \in U} I_{ij}(x_i - y_j)^2  and  \nabla_{x_k} f_y^{(u)} = \sum_{j=p}^{p+q} 2(x_k - y_j)   (5)\n\nwhere U = {i, j | i != k, j \notin [p, p+q]}, I_{ij} \in {0, 1}, which is obtained through DP, and i, j satisfy the temporal order. Setting \nabla_{x_k} f_y^{(u)} = 0 we get the stationary point at x_k^{(u)*} = (1/(q+1)) \sum_{j=p}^{p+q} y_j.\nWithout loss of generality, consider the immediate neighbor f_y^{(v)}, the same as f_y^{(u)} except only for the alignment of y_{p+q+1}, i.e.,\n\nf_y^{(v)} = \sum_{j=p}^{p+q+1} (y_j - x_k)^2 + \sum_{i,j \in V} I_{ij}(x_i - y_j)^2   (6)\n\nwhere V = {i, j | i != k, j \notin [p, p+q+1]}. The corresponding stationary point is at x_k^{(v)*} = (\sum_{j=p}^{p+q+1} y_j)/(q+2). 
Similarly, for the other immediate neighbor w that aligns only y_p, ..., y_{p+q-1} with x_k, the stationary point is at x_k^{(w)*} = (\sum_{j=p}^{p+q-1} y_j)/q. We have\n\nx_k^{(u)*} = (\sum_{j=p}^{p+q} y_j)/(q+1),  x_k^{(v)*} = (\sum_{j=p}^{p+q+1} y_j)/(q+2),  x_k^{(w)*} = (\sum_{j=p}^{p+q-1} y_j)/q   (7)\n\nWithout loss of generality, assume that the three neighbor regions w, u, v run from left to right, i.e., x^1_k \in w, x^2_k \in u, x^3_k \in v, for x^1_k < x^2_k < x^3_k. The three regions, their corresponding local quadratic functions f_y^{(w)}, f_y^{(u)}, f_y^{(v)}, and their local minima (or stationary points) x_k^{(w)*}, x_k^{(u)*}, x_k^{(v)*}, are illustrated in Figures 2c and 2d. Note that we are interested in the transition u -> v when u's local minimum is not at the boundary (u has a bowl shape and we want to jump out).\nThere are 3 possibilities for the destination (region v). The first one is illustrated in Figure 2c, where x_k^{(v)*} is not inside region v, but somewhere to its left. In this case, it is easy to see that the global minimum will not be in v, since some part of u is lower (u has the bowl shape due to its local minimum). If we jump to v, the gradient in v would point back to u, which is not the case of interest.\nIn the second case, both u and v have bowl shapes. As shown in Figure 2d, the distance between the bottoms of the two bowls is d_k^{(u,v)} = x_k^{(v)*} - x_k^{(u)*}. The boundary must be somewhere in between x_k^{(u)*} and x_k^{(v)*}. Since we need to travel from u to v, the starting point x_k = \tilde{x} \in u must be to the left of x_k^{(u)*} (as shown in the red double-arrow region in Figure 2d). Otherwise the gradient at \tilde{x} will point to region w instead of v. To ensure that one step crosses the boundary and arrives at v, it needs to travel a distance of at most (x_k^{(v)*} - \tilde{x}), because the boundary between u and v can never reach x_k^{(v)*}.\nFor the third case, v does not have a bowl shape, but x_k^{(v)*} is to the right of v. We can still travel (x_k^{(v)*} - \tilde{x}) to jump beyond v. 
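The stationary points above are simply means of the aligned windows of y; a quick numerical check of the x_k-dependent part of Equation 5 (the window values below are an arbitrary toy example):

```python
import numpy as np

# Fixed alignment: suppose y_p ... y_{p+q} are the values aligned with x_k.
y_window = np.array([2.0, 4.0, 9.0])      # here q + 1 = 3 aligned values

def local_loss(xk):
    # the x_k-dependent part of f_y^{(u)} in Equation 5
    return float(np.sum((y_window - xk) ** 2))

# Stationary point x_k^{(u)*} = (1 / (q + 1)) * sum_j y_j, i.e. the mean
x_star = float(y_window.mean())
```

Perturbing `x_star` in either direction increases `local_loss`, confirming it is the bottom of the local bowl; growing or shrinking the window by one element shifts the mean, which is exactly the transition between the stationary points of Equation 7.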
Similar to case 1, the right neighbor of v (denoted v^+) would have a lower minimum if v^+ has a bowl shape. Even if v^+ does not have a bowl shape, the combined region [v, v^+] can be viewed as either a quasi-bowl or an extended v, so jumping here is still valid.\nNext, we need to consider the relationship between the feasible starting point \tilde{x} and f_y^{(w)}'s stationary point x_k^{(w)*}. If the stationary point x_k^{(w)*} is within region w, then since \tilde{x} \in u, we know that \tilde{x} > x_k^{(w)*}. However, there could be cases in which w does not hold f_y^{(w)}'s stationary point. If x_k^{(w)*} is to the left of region w, then the inequality \tilde{x} > x_k^{(w)*} becomes looser, but still valid. Another case is that x_k^{(w)*} is to the right side of w. This means w is monotonically decreasing, so we can combine [w, u] into a whole quasi-bowl region u' and let w' be the left neighbor of the combined u'. The above analysis on w', u' and v still holds, and we want to jump out of u' to v. Hence we arrive at the following theorem.\nTheorem 1. Assume that the starting point at coordinate k, i.e., x_k = \tilde{x}, is in some region u where f_y^{(u)} is defined in Equation 5. Let x and y have lengths l and n, respectively, and assume that l < n. To ensure escaping from u to its immediate right-side neighbor region, the expected step size E[\eta] needs to satisfy E[\eta] > l/(2n).\nThe proof can be found in the supplementary A. Beyond the single-sequence case, we consider a dataset Y = {y_i | y_i \in R^n, i = 1, ..., m}. The DTW loss and its full gradient have the summation form, i.e., H_Y(x) = \sum_{i=1}^{m} H_{y_i}(x) and \nabla_x H_Y(x) = \sum_{i=1}^{m} \nabla_x H_{y_i}(x). The updating of x is done via stochastic gradient descent (SGD) over mini-batches, i.e., x <- x - \eta (m/b) \sum_{i \in batch} \nabla_x H_{y_i}(x), where b < m is the mini-batch size and \eta is the step size. 
Though the stochastic gradient is an unbiased estimator, i.e., E[(m/b) \sum_{i \in batch} \nabla_x H_{y_i}(x)] = \nabla_x H_Y(x), its variance offers the capability to jump out of local minima.\n\nFigure 3: Illustration of the effect of the streaming DTW's regularizer: from left to right, \alpha = 0, 1 x 10^{-4} and 0.1, respectively. (a) Data samples; (b) No reg; (c) Weak reg; (d) Strong reg.\n\n5 Streaming DTW Learning\n\nThe typical length of a DTW kernel is much shorter than the input data. Aligning the short kernel with a long input sequence could lead to misleading results. For example, consider an ECG data sequence which consists of several periods of heartbeat pulses, where we would like the kernel to learn the heartbeat pulse pattern. Applying an end-to-end DTW, however, the kernel will align to the entire sequence rather than a single pulse period. If the kernel is very short, it does not even have enough resolution and thus finally outputs a useless abstract.\nTo address this problem, we bring in the SPRING [18] algorithm to output the patterns aligning subsequences of the original input:\n\nx^* = arg min_{i,\Delta,x} dtw^2(x, y_{i:i+\Delta})   (8)\n\nwhere y_{i:i+\Delta} denotes the subsequence of y that starts at position i and ends at i + \Delta, and x is the pattern (the DTW kernel) we would like to learn. Note that i and \Delta are parameters to be optimized. In fact, SPRING not only finds the best match among all subsequences, but also reports a number of candidate warping paths that have small DTW distances. As a result, we propose two schemes that exploit this property. In the first scheme, we pre-specify a constant k (e.g. 
3 or 5) and let SPRING provide the top k best warping paths (the k non-overlapping subsequences with the smallest DTW distances to the pattern x). In the second scheme, rather than specifying the number of paths, we set a value \epsilon such that all the paths whose distances are smaller than (1 + \epsilon)d^* are reported, where d^* is the best warping path's DTW distance. After obtaining multiple warping paths, we can either average them or randomly sample one as our DTW computation result. In our experiments, we choose \epsilon = 0.1 and randomly sample one path for simplicity.\n\nRegularizer in Streaming DTW\n\nSince SPRING encourages the kernel x to learn some repeated pattern in the input sequence, there is no constraint on such patterns' shapes, which can cause problematic learning results. As a matter of fact, some common shapes that do not carry much useful information occur in almost any input data. For example, an up-sweep or down-sweep always exists; even Gaussian noise is a combination of such sweeps. A kernel without any regularization would easily capture such useless patterns and fall into the corresponding local minima. To solve this issue, we propose a simple solution that adds a regularizer on the shape of the pattern. Assuming x is of length l, we change the objective to\n\nmin_{i,\Delta,x} (1 - \alpha) dtw^2(x, y_{i:i+\Delta}) + \alpha ||x_0 - x_{l-1}||   (9)\n\nwhere \alpha is the hyperparameter that controls the regularizer. This essentially forces the pattern to be a \"complete\" one, in the sense that the beginning and the end of the pattern should be close. It is a general assumption that we want to capture such \"complete\" signal patterns, rather than parts of them. As shown in Figure 3a, the input sequences contain either upper or lower half circles as the target to be learned. Without regularization, Figure 3b shows that the kernel only learns a part of that signal. 
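The regularized objective of Equation 9 is cheap to evaluate once the streaming DTW distance is known; a minimal sketch, where `dtw_sq` stands in for the SPRING-computed dtw^2 term and all names are ours:

```python
import numpy as np

def regularized_objective(dtw_sq, x, alpha):
    """Equation 9: a convex combination of the streaming DTW distance
    and a 'completeness' penalty |x_first - x_last| that pushes the
    learned pattern to start and end at similar values."""
    return (1.0 - alpha) * dtw_sq + alpha * abs(float(x[0]) - float(x[-1]))

x = np.array([0.0, 0.8, 1.0, 0.7, 0.1])  # an almost-closed half-circle-like kernel
dtw_sq = 2.0                             # assumed to come from the SPRING computation
loss_no_reg = regularized_objective(dtw_sq, x, 0.0)
loss_reg = regularized_objective(dtw_sq, x, 0.1)
```

With `alpha = 0` the objective reduces to the plain streaming DTW loss; increasing `alpha` trades matching quality for closedness of the pattern, which is exactly the knob varied across Figures 3b-3d.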
Figure 3c corresponds to a weak regularizer, where the kernel tries to escape from the tempting local minima (these up-sweeps are so widespread in the input that they lead to small SPRING DTW distances). A full shape is well learned with a proper \alpha, as shown in Figure 3d. Other shape regularizers could also be used if they encode prior knowledge from human experts.\n\nFigure 4: Performance comparison on synthetic data sequences (400 iterations). (a) Data samples; (b) Learned kernels; (c) Test acc; (d) Test loss.\n\n6 Experiments and Applications\n\nIn this experimental section, we compare the proposed scheme with existing approaches. We refer to the end-to-end DTW kernel as Full DTW, and the streaming version as SPRING DTW. We implement our approach in PyTorch [16].\n\n6.1 Comparison with a Convolution Kernel\n\nIn this very simple classification task, two types of synthetic data sequences are generated. Category 1 only consists of half-square signal patterns. Category 2 only has upper-triangle signal patterns. Each data sequence is planted with two such signals, at random locations and with random pattern lengths. The patterns do not overlap in each sequence. Gaussian noise is also injected into the sequences. Figure 4a provides some sample sequences from both categories.\nThe length of the input sequences is 100 points, where the planted pattern length varies from 10 to 30. There are a total of 100 sequences in the training set, 50 in each category. Another 100 sequences form the testing set, 50 of each type as well. We added Gaussian noise with \sigma = 0.1. 
For comparison, we tested one full DTW kernel, one SPRING DTW kernel, and one convolution kernel. The kernel lengths are set to 10, and \alpha = 0.1 for SPRING DTW. We append 3 linear layers to generate the prediction.\nIn Figure 4b, we show the learned DTW kernels after convergence. As expected, the full DTW kernel tries to capture the whole sequence. Since the whole sequence consists of two planted patterns, the full DTW kernel also has two peaks. On the contrary, SPRING DTW only matches a partial signal, thus resulting in a sweep shape. Figures 4c and 4d show the test accuracy and test loss over 400 iterations. Since both full DTW and SPRING DTW achieve 100% accuracy, and their curves are almost identical, we only show the curve of the full DTW. Surprisingly, the network with the convolution kernel fails to achieve 100% accuracy after convergence on this simple task. The \"MLP\" curve represents a network consisting of only 3 linear layers, and performs the worst among all the candidates, as expected.\nNote that we can easily extend the method to multi-variate time series data (MDTW [21]) without any significant modifications. Details can be found in the supplementary B.\n\n6.2 Evaluation of Gradient Calculation\n\nTo evaluate the effectiveness and accuracy of the proposed BP scheme, we follow the experimental setup in [4] and perform barycenter computations. The UCR repository [3] is used in this experiment. We evaluate our method against SoftDTW [4], DBA [17] and SSG [4]. We report the average of 5 runs for each experiment. A random initialization is used for all the methods. Due to space limits, we only provide a summary in this section; details can be found in supplementary C (Tables 2, 3).\nThe barycenter experiment aims to find the barycenter of the given input sequences. 
We use the entire training set to train the model to obtain the barycenter $b_i$ for each category, and then calculate the DTW loss as:

$$L_{dtw} = \frac{1}{N_{class}} \sum_{i=0}^{N_{class}} \frac{1}{N_i} \sum_{j=0}^{N_i} dtw(s_{i,j}, b_i) \qquad (10)$$

where $N_{class}$ is the number of categories, $N_i$ is the number of sequences in class $i$, and $s_{i,j}$ is sequence $j$ in class $i$. The DTW distance is computed using the ℓ2 norm. Clearly, the smaller the loss, the better the performance.

Table 1: Barycenter Experiment Summary

                     Training Set                        Testing Set
Alg         SoftDTW    SSG     DBA    Ours     SoftDTW    SSG     DBA    Ours
Win            23        4      21     37         21       11      22     31
Avg-rank      2.14     3.39    2.27    2.2       2.31     3.12    2.36   2.21
Avg-loss     26.19    27.75   26.42  24.79      33.84    33.08   33.62  31.99

We also evaluate on the testing set by using $s_{i,j}$ from the testing set. Note that we first run SoftDTW with 4 different hyperparameter settings, γ = 1, 0.1, 0.01 and 0.001, as in [4]. On the training set γ = 0.1 outperforms the others, while on the testing set γ = 0.001 gives the best results; we select γ accordingly.

The experimental results are summarized in Table 1. "Win" denotes the number of times the smallest loss was achieved among all 85 datasets. We also report the average rank and the average loss (the sum of all losses divided by the number of datasets). From the results we can clearly see that our proposed approach achieves the best performance among these methods.
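The loss in Equation 10 can be evaluated with a small helper. This is our own sketch (the names `barycenter_loss` and `dtw_dist` are hypothetical, and the squared-difference local cost stands in for the ℓ2 cost), not the paper's implementation:

```python
import numpy as np

def dtw_dist(x, y):
    """Minimal DP DTW distance with a squared-difference local cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def barycenter_loss(classes, barycenters):
    """Eq. (10): average over classes of the mean DTW distance between
    each sequence s_{i,j} in class i and that class's barycenter b_i.

    `classes` is a list of lists of sequences; `barycenters` is the list
    of per-class barycenters, in the same class order.
    """
    per_class = [np.mean([dtw_dist(s, b) for s in seqs])
                 for seqs, b in zip(classes, barycenters)]
    return float(np.mean(per_class))
```

A barycenter that warps onto every sequence in its class with zero point-wise error yields a loss of zero; the methods compared in Table 1 differ in how well they minimize this quantity.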
The details of this experiment can be found in supplementary C.

6.3 Application of DTW Decomposition

In this subsection, we propose an application of DTWNet as a time-series data decomposition tool. Without loss of generality, we design 5 DTW layers, each with one DTW kernel $x_i$. The key idea is to forward the residual of layer $i$ to the next layer. Note that the DTW computation $dtw(y, x_i)$ generates a warping path as in Equation 2, from which we obtain the residual by subtracting the corresponding aligned $x_{i,j}$ from $y_j$, where $j$ indexes the elements.

(a) Samples of Haptics dataset  (b) Kernel 0  (c) Kernel 1  (d) Kernel 2  (e) Kernel 3  (f) Kernel 4

Figure 5: Illustration of DTW Decomposition

Figure 5 illustrates the effect of the decomposition. Kernels 0 through 4 correspond to the first layer (input side) through the last layer (output side). The training goal is to minimize the residual of the network's output, and we randomly initialize the kernels before training. We use the Haptics dataset from the UCR repository to demonstrate the decomposition.

After a certain number of epochs, we can clearly see that the kernels from different layers form different shapes. Kernel 0, from the first layer, has a large curve that describes the overall shape of the data; this can be seen as the low-frequency part of the signal. In contrast, kernel 4 has zig-zag shapes that describe the high-frequency parts. Generally, kernels in deeper layers tend to learn "higher-frequency" parts. This can serve as a useful decomposition tool for a given dataset. Moreover, the shapes of the kernels are readily interpretable by humans.

7 Conclusions and Future Work

In this paper, we have applied the DTW kernel as a feature extractor and proposed the DTWNet framework.
To achieve backpropagation, after evaluating the DTW distance via dynamic programming, we compute the gradient along the determined warping path. A theoretical study of DTW as a loss function is provided: we identify the DTW loss as region-wise quadratic or linear, and describe the conditions on the step size of the proposed method required to jump out of local minima. In the experiments, we show that the DTW kernel can outperform standard convolutional kernels in certain tasks. We have also evaluated the effectiveness of the proposed gradient computation and backpropagation, and presented an application to data decomposition.

References

[1] Donald J Berndt and James Clifford. Using dynamic time warping to find patterns in time series. In KDD Workshop, volume 10, pages 359–370. Seattle, WA, 1994.

[2] Chien-Yi Chang, De-An Huang, Yanan Sui, Li Fei-Fei, and Juan Carlos Niebles. D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. arXiv preprint arXiv:1901.02598, 2019.

[3] Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo Batista. The UCR time series classification archive, July 2015. www.cs.ucr.edu/~eamonn/time_series_data/.

[4] Marco Cuturi and Mathieu Blondel. Soft-DTW: a differentiable loss function for time-series. arXiv preprint arXiv:1703.01541, 2017.

[5] Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.

[6] Sergio Giraldo, Ariadna Ortega, Alfonso Perez, Rafael Ramirez, George Waddell, and Aaron Williamon.
Automatic assessment of violin performance using dynamic time warping classification. In 2018 26th Signal Processing and Communications Applications Conference (SIU), pages 1–3. IEEE, 2018.

[7] Omer Gold and Micha Sharir. Dynamic time warping and geometric edit distance: Breaking the quadratic barrier. ACM Transactions on Algorithms (TALG), 14(4):50, 2018.

[8] Rohit J Kate. Using dynamic time warping distances as features for improved time series classification. Data Mining and Knowledge Discovery, 30(2), 2016.

[9] Eamonn Keogh and Chotirat Ann Ratanamahatana. Exact indexing of dynamic time warping. Knowledge and Information Systems, 7(3):358–386, 2005.

[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[11] Hansheng Lei and Venu Govindaraju. Direct image matching by dynamic warping. In Computer Vision and Pattern Recognition Workshop (CVPRW'04), pages 76–76. IEEE, 2004.

[12] Daniel Lemire. Faster retrieval with a two-pass dynamic-time-warping lower bound. Pattern Recognition, 42(9):2169–2180, 2009.

[13] Jiangyuan Mei, Meizhu Liu, Yuan-Fang Wang, and Huijun Gao. Learning a Mahalanobis distance-based dynamic time warping measure for multivariate time series classification. IEEE Transactions on Cybernetics, 46(6):1363–1374, 2016.

[14] Lindasalwa Muda, Mumtaj Begam, and Irraivan Elamvazuthi. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint arXiv:1003.4083, 2010.

[15] Abdullah Mueen, Nikan Chavoshi, Noor Abu-El-Rub, Hossein Hamooni, Amanda Minnich, and Jonathan MacCarthy. Speeding up dynamic time warping distance for sparse time series data.
Knowledge and Information Systems, 54(1):237–263, 2018.

[16] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[17] François Petitjean and Pierre Gançarski. Summarizing a set of time series by averaging: From Steiner sequence to compact multiple alignment. Theoretical Computer Science, 414(1):76–91, 2012.

[18] Yasushi Sakurai, Christos Faloutsos, and Masashi Yamamuro. Stream monitoring under the time warping distance. In IEEE 23rd International Conference on Data Engineering (ICDE 2007), pages 1046–1055. IEEE, 2007.

[19] Mit Shah, Josif Grabocka, Nicolas Schilling, Martin Wistuba, and Lars Schmidt-Thieme. Learning DTW-shapelets for time-series classification. In Proceedings of the 3rd IKDD Conference on Data Science, page 3. ACM, 2016.

[20] Yilin Shen, Yanping Chen, Eamonn Keogh, and Hongxia Jin. Accelerating time series searching with large uniform scaling. In Proceedings of the 2018 SIAM International Conference on Data Mining, pages 234–242. SIAM, 2018.

[21] Mohammad Shokoohi-Yekta, Bing Hu, Hongxia Jin, Jun Wang, and Eamonn Keogh. Generalizing DTW to the multi-dimensional case requires an adaptive approach. Data Mining and Knowledge Discovery, 31(1):1–31, 2017.

[22] Peter Steffen, Robert Giegerich, and Mathieu Giraud. GPU parallelization of algebraic dynamic programming. In International Conference on Parallel Processing and Applied Mathematics, pages 290–299. Springer, 2009.

[23] Gineke A ten Holt, Marcel JT Reinders, and EA Hendriks. Multi-dimensional dynamic time warping for gesture recognition. In Thirteenth Annual Conference of the Advanced School for Computing and Imaging, volume 300, page 1, 2007.

[24] R Varatharajan, Gunasekaran Manogaran, MK Priyan, and Revathi Sundarasekar.
Wearable sensor devices for early detection of Alzheimer disease using dynamic time warping algorithm. Cluster Computing, pages 1–10, 2017.

[25] Fei-Yue Wang, Jie Zhang, Qinglai Wei, Xinhu Zheng, and Li Li. PDP: parallel dynamic programming. IEEE/CAA Journal of Automatica Sinica, 4(1):1–5, 2017.