{"title": "Diffeomorphic Temporal Alignment Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 6574, "page_last": 6585, "abstract": "Time-series analysis is confounded by nonlinear time warping of the data. Traditional methods for joint alignment do not generalize: after aligning a given signal ensemble, they lack a mechanism, that does not require solving a new optimization problem, to align previously-unseen signals. In the multi-class case, they must also first classify the test data before aligning it. Here we propose the Diffeomorphic Temporal alignment Net (DTAN), a learning-based method for time-series joint alignment. Via flexible temporal transformer layers, DTAN learns and applies an input-dependent nonlinear time warping to its input signal. Once learned, DTAN easily aligns previously-unseen signals by its inexpensive forward pass. In a single-class case, the method is unsupervised: the ground-truth alignments are unknown. In the multi-class case, it is semi-supervised in the sense that class labels (but not the ground-truth alignments) are used during learning; in test time, however, the class labels are unknown. As we show, DTAN not only outperforms existing joint-alignment methods in aligning training data but also generalizes well to test data. Our code is available at https://github.com/BGU-CS-VIL/dtan.", "full_text": "Diffeomorphic Temporal Alignment Nets\n\nRon Shapira Weber\nBen-Gurion University\n\nronsha@post.bgu.ac.il\n\nMatan Eyal\n\nBen-Gurion University\n\nmataney@post.bgu.ac.il\n\nNicki Skafte Detlefsen\n\nTechnical University of Denmark\n\nnsde@dtu.dk\n\nOren Freifeld\n\nBen-Gurion University\n\norenfr@cs.bgu.ac.il\n\nOren Shriki\n\nBen-Gurion University\nshrikio@bgu.ac.il\n\nAbstract\n\nTime-series analysis is confounded by nonlinear time warping of the data. 
Tradi-\ntional methods for joint alignment do not generalize: after aligning a given signal\nensemble, they lack a mechanism, that does not require solving a new optimization\nproblem, to align previously-unseen signals. In the multi-class case, they must also\n\ufb01rst classify the test data before aligning it. Here we propose the Diffeomorphic\nTemporal Alignment Net (DTAN), a learning-based method for time-series joint\nalignment. Via \ufb02exible temporal transformer layers, DTAN learns and applies an\ninput-dependent nonlinear time warping to its input signal. Once learned, DTAN\neasily aligns previously-unseen signals by its inexpensive forward pass. In a single-\nclass case, the method is unsupervised: the ground-truth alignments are unknown.\nIn the multi-class case, it is semi-supervised in the sense that class labels (but not\nthe ground-truth alignments) are used during learning; in test time, however, the\nclass labels are unknown. As we show, DTAN not only outperforms existing joint-\nalignment methods in aligning training data but also generalizes well to test data.\nOur code is available at https://github.com/BGU-CS-VIL/dtan.\n\n1\n\nIntroduction\n\nTime-series data often presents a signi\ufb01cant amount of misalignment, also known as nonlinear time\nwarping. To \ufb01x ideas, consider ECG recordings from healthy patients during rest. Suppose that\nthe signals were partitioned correctly such that each segment corresponds to a heartbeat and that\nthese segments were resampled to have equal length (e.g., see Figure 1). Each resampled segment is\nthen viewed as a distinct signal. The sample mean of these usually-misaligned signals (even when\nrestricting to single-patient recordings) would not look like the iconic ECG sinus rhythm; rather, it\nwould smear the correct peaks and valleys and/or contain super\ufb02uous ones. 
This is unfortunate as the sample mean, a cornerstone of Descriptive Statistics, has numerous applications in data analysis (e.g., providing a succinct data summary). Moreover, even if one succeeds somehow in aligning a currently-available recording batch, upon the arrival of new data batches, the latter will also need to be aligned; i.e., one would like to generalize the inferred alignment from the original batch to the new data without having to solve a new optimization problem. This is especially the case if the new dataset is much larger than the original one; e.g., imagine a hospital solving the problem once and then generalizing its solution, essentially at no cost, to align all the data collected in the following year. Finally, these issues become even more critical for multi-class data (e.g., healthy/sick patients), where only in the original batch do we know which signal belongs to which class; i.e., seemingly, the new data will have to be explicitly classified before its within-class alignment.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Learning to generalize time-series joint alignment from (a) train to (b) test signals on the ECGFiveDays dataset [8]. Top row: 10 random misaligned signals from each set and their respective average signal (shaded areas correspond to standard deviations). Bottom: the signals after the estimated alignment. DTAN aligns, in an input-dependent manner, a new test signal in a single forward pass.

Let (U_i)_{i=1}^N be a set of N time-series observations. The nonlinear misalignment can be written as

    (U_i)_{i=1}^N = (V_i ∘ W_i)_{i=1}^N ,    (1)

where U_i is the i-th misaligned signal, V_i is the i-th latent aligned signal, "∘" stands for function composition, and W_i is a latent warp of the domain of V_i. For technical reasons, the misalignment is usually viewed in terms of T_i ≜ W_i^{-1}, the inverse warp of W_i, implicitly suggesting W_i is invertible. It is typically assumed that (T_i)_{i=1}^N belong to T, some nominal family of warps parameterized by θ:

    (V_i)_{i=1}^N = (U_i ∘ T^{θ_i})_{i=1}^N ,    T_i = T^{θ_i} ∈ T    ∀i ∈ {1, . . . , N} .    (2)

The nuisance warps, (T^{θ_i})_{i=1}^N, create a fictitious variability in the range of the signals, confounding their statistical analysis. Thus, the joint-alignment problem, defined below, together with the ability to use its solution for generalization, is of great interest to the machine-learning community as well as to other fields.

Definition 1 (the joint-alignment problem) Given (U_i)_{i=1}^N, infer the latent (T^{θ_i})_{i=1}^N ⊂ T.

We argue that this problem should be seen as a learning one, mostly due to the need for generalization. Particularly, we propose a novel deep-learning (DL) approach for the joint alignment of time-series data. More specifically, inspired by computer-vision and/or pattern-theoretic solutions for misaligned images (e.g., congealing [38, 31, 26, 25, 10, 11], efficient diffeomorphisms [19, 20, 56, 57], and spatial transformer nets [28, 32, 49]), we introduce the Diffeomorphic Temporal Alignment Net (DTAN), which learns and applies an input-dependent diffeomorphic time warping to its input signal to minimize a joint-alignment loss and a regularization term. In the single-class case, this yields an unsupervised method for joint-alignment learning. For multi-class problems, we propose a semi-supervised method which results in a single net (for all classes) that learns how to perform, within each class, joint alignment without knowing, at test time, the class labels. 
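To make the generative model of Eqs. (1)–(2) concrete, here is a minimal numpy sketch; it is illustrative only, and the quadratic warp family below is a hypothetical stand-in for the latent W_i, not the CPAB family DTAN actually uses:

```python
import numpy as np

def compose(v, w_of_t):
    # U = V ∘ W: evaluate the latent signal V at the warped time points W(t).
    k = len(v)
    t = np.linspace(0.0, 1.0, k)
    return np.interp(w_of_t, t, v)

rng = np.random.default_rng(0)
k = 128
t = np.linspace(0.0, 1.0, k)
latent = np.exp(-((t - 0.5) ** 2) / 0.005)  # a latent aligned signal V with one peak

# Each observation U_i gets its own random monotone warp W_i of the domain.
signals = []
for _ in range(20):
    a = rng.uniform(-0.3, 0.3)
    w_of_t = t + a * t * (1.0 - t)  # smooth, strictly monotone, fixes the endpoints
    signals.append(compose(latent, w_of_t))
signals = np.asarray(signals)

# The misalignment smears the sample mean: its peak is lower than the latent peak.
smeared_peak = signals.mean(axis=0).max()
```

Since each W_i shifts the peak differently, averaging the U_i's flattens it; this is exactly the distortion that joint alignment is meant to undo.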
We demonstrate the utility of the proposed framework on both synthetic and real datasets with applications to time-series joint alignment, averaging and classification, and compare it with DTW Barycenter Averaging (DBA) [44] and SoftDTW [12]. On training data, DTAN outperforms both. More importantly, it generalizes to test data (and in fact excels in it); this is an ability not possessed by those methods.

Our key contributions are as follows. 1) DTAN, a new DL framework for learning joint alignment of time-series data; 2) a recurrent version of DTAN (which is also the first recurrent diffeomorphic transformer net); 3) a new and fast tool for averaging misaligned single-class time-series data; 4) the proposed learning-based method generalizes to previously-unseen data; i.e., unlike existing methods for time-series joint alignment, DTAN can align new test signals and the test-time computations are remarkably fast.

[Figure 1 plots omitted: panels "Misaligned signals", "Misaligned average signal", "DTAN aligned signals", and "DTAN average signal" for both the train and test sets.]

Figure 2: Left: An illustration of a CPAB warp (relative to the identity transformation) with its corresponding CPA velocity field (above). Right: DTAN joint alignment demonstrated on two classes of the Trace dataset [8]. During test, the class labels are unknown.

2 Related Work

Dynamic Time Warping (DTW). A popular approach for aligning a time-series pair is DTW [47, 48] which, by solving Bellman's recursion via dynamic programming, finds an optimal monotonic alignment between two signals. 
DTW does not scale well to the joint-alignment problem: extending the pairwise DTW recursion to N signals of length K requires O(K^N) operations [52], which is intractable for either a large N or a large K. Moreover, averaging under the DTW distance is a nontrivial task, as it involves solving the joint-alignment problem. While several authors proposed smart solutions for the averaging problem [50, 22, 44, 43, 13, 12], none of them offered a generalization mechanism (one that does not require solving a new optimization problem each time) for aligning new signals.

Congealing, Joint Alignment, and Atlas-based Methods. A congealing algorithm solves iteratively for the joint alignment (of a set of signals such as images, time series, etc.) by gradually aligning one signal towards the rest [31]. Typical alignment criteria used in congealing are entropy minimization [38, 31, 26, 37] or least squares [10, 11]. Also related is the Continuous Profile Model [33], a generative model in which each observed time series is a non-uniformly subsampled version of a single latent trace. While not directly related to our work, note that many medical-imaging works focus on building an atlas, including with diffeomorphisms (e.g., [29]), via the (pairwise- or joint-) alignment of multiple images. Since none of the methods above generalizes, in order to align N_test new signals to the average signal of the previously-aligned N_train signals (or to an atlas), one must solve N_test pairwise-alignment problems. Alternatively, to jointly align N_test new signals, one must solve a new joint-alignment problem. In both cases, such solutions scale poorly with N_test. In the multi-class case, it is even worse since the new signals must be classified, and classification errors increase alignment errors. 
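For reference, the pairwise DTW recursion mentioned above fits in a few lines; a minimal O(K²) illustration (not the tslearn implementation used later in the experiments):

```python
import numpy as np

def dtw(x, y):
    # Bellman's recursion for pairwise DTW with squared local costs.
    # Runtime is O(len(x) * len(y)); this quadratic cost per pair is what
    # makes DTW-based joint alignment of many long signals expensive.
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

t = np.linspace(0.0, 2.0 * np.pi, 100)
a, b = np.sin(t), np.sin(t + 0.5)  # two phase-shifted copies of the same signal
dtw_cost = dtw(a, b)
euclidean_cost = float(np.sum((a - b) ** 2))
```

Because DTW may warp time monotonically, the phase-shifted pair is much closer under DTW than under the Euclidean distance.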
Note that in [25] the authors propose a two-step process: the first learns deep Convolutional Neural Network (CNN) features, unrelated to alignment, and the second uses congealing to align these features (without learning how to align the features of new data). In parallel to our work, and independently of it, Dalca et al. [14] propose a learning-based method for building deformable conditional templates based on diffeomorphisms. While their model offers generalization, they focus on neuroimaging and not time-series joint alignment.

Spatial/Temporal Transformer Nets and Diffeomorphisms in DL. In computer vision, the Spatial Transformer Net (STN) [28] was introduced to allow for invariances to spatial warps. While there are works on the pairwise alignment of time-series hidden states [50, 6, 21, 2], Temporal Transformer Nets (TTN), the time-series analog of STNs, were so far limited to affine transformations [41] and phase- and frequency-offset recovery [42]. It was also proposed to use a TTN on the 2D spectrogram of time series [58]. Very recently, Lohit et al. proposed a TTN based on 1D diffeomorphisms for time-series classification [35]; as their warps are not parametric, the method does not scale well with the signal's length; e.g., a one-second input signal at 8 kHz will yield a TTN with a final fully-connected (FC) layer of dim = 8,000 neurons, which in turn produces 8,000 trainable weights per neuron in the previous layer (for comparison, we use an FC layer of dim = 32); moreover, the nonparametric form prevents them from having an equivalent to the efficient gradient that we use. In addition, none of these methods utilized TTNs for learning time-series joint alignment.

[Figure 2 plot omitted: a CPAB transformation T(x) vs. the identity transformation, with the corresponding CPA velocity field v above.]

Figure 3: Time-series averaging methods comparison on the ECG200 dataset (each row depicts a different class). 
The Euclidean mean serves as a baseline, showing how nonlinear misalignment of the data confounds its averaging. Comparing with DTW-based methods, DTAN outperforms DBA on both train/test data. While the barycenter obtained by SoftDTW (γ = 1) is comparable to the one obtained by DTAN, it is (1) inapplicable to new signals; (2) computed on each class individually. DTAN, however, was trained on both classes together and generalized to test data (rightmost panels).

Recently, Skafte et al. [49] showed it is possible to explicitly incorporate flexible and efficient diffeomorphisms [19, 20] within DL architectures via an STN; particularly, they focused on image recognition and classification, and their framework was supervised. Inspired by [49], we propose to use a diffeomorphic TTN to solve the joint-alignment problem. Our approach differs from [49] in the following: the signal type (1D signals vs. 2D images); the task (joint alignment vs. classification); the amount of supervision (unsupervised/semi-supervised vs. supervised); and the usage of recurrent nets and warp regularization (here we use both; neither was used in [49]). In addition to [49], there are several works, particularly in medical imaging, that involve DL and diffeomorphisms. Their formulation is different from ours. E.g., while Yang et al. [55] use supervised DL to predict diffeomorphisms, their net has no STN, so the diffeomorphisms are not explicitly incorporated in it. In contrast, unsupervised diffeomorphic alignment was achieved via an STN [15, 7]. 
In all three of these works [55, 15, 7] (as well as in others omitted here due to space limits) the nets learn pairwise alignments, not joint alignment. In any case, we are unaware of works that use diffeomorphic nonlinear transformer nets for time-series data (with the exception of [35]), let alone for joint alignment of such data (with no exceptions).

3 Preliminaries: Temporal Transformer Nets and Diffeomorphisms

Temporal Transformer Nets. Given T, a spatial-warp family parameterized by θ, a Spatial Transformer (ST) layer performs a learnable input-dependent warp [28]. Reducing this from images (a 2D domain) to time series (1D), one obtains a TT layer (a TTN is a neural net with at least one TT layer). In more detail, let U denote the input of the TT layer. Its output consists of θ = f_loc(U) and V = U ∘ T^θ (the latter, i.e., the warped signal, is what is being passed downstream the TTN), where T^θ ∈ T is a 1D warp parameterized by θ. The function f_loc : U ↦ θ is itself a neural net called the localization net. Let w denote the parameters (also known as weights) of f_loc and let

    F((U_i, θ_i(U_i; w))_{i=1}^N)    (3)

denote a loss function. The TT layer is trained (i.e., optimized over w) along with the rest of the TTN. As is usual in DL, this involves back-propagation [46], which requires certain partial derivatives (see our Sup. Mat.). Also note that one of these derivatives, ∇_θ(T^θ(·)), depends on the choice of T.

Diffeomorphisms. As mentioned in § 1, T needs to be specified. In the context of time warping, diffeomorphisms are a natural choice [39]. A (C¹) diffeomorphism is a differentiable invertible map with a differentiable inverse. 
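The TT-layer forward pass described above (θ = f_loc(U), then V = U ∘ T^θ) can be sketched in numpy as follows; the localization function and the one-parameter monotone warp family below are hypothetical stand-ins (a real f_loc is a trained net, and the real T is the CPAB family explained next):

```python
import numpy as np

def toy_localization(u):
    # Hypothetical stand-in for the localization net f_loc: U -> theta.
    # (In DTAN, f_loc is a CNN and theta is a d-dimensional vector.)
    return float(np.tanh(u.mean()))

def tt_layer(u):
    theta = toy_localization(u)
    grid = np.linspace(0.0, 1.0, len(u))
    # A toy monotone warp T^theta of [0, 1] (NOT CPAB); bounding theta
    # keeps the slope positive, so the warp stays invertible.
    warped_grid = grid + 0.5 * theta * grid * (1.0 - grid)
    # V = U ∘ T^theta: resample the input at the warped time points.
    return np.interp(warped_grid, grid, u)

u = np.exp(-((np.linspace(0.0, 1.0, 64) - 0.3) ** 2) / 0.01)
v = tt_layer(u)
```

Because theta is computed from the input itself, each signal receives its own warp; this input dependence is what lets a trained net align previously-unseen signals in a single forward pass.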
Working with diffeomorphisms usually involves expensive computations. In our case, since the proposed method explicitly incorporates them in a DL architecture, it is even more important (than in traditional non-DL applications of diffeomorphisms) to drastically reduce the computational difficulties: in training, evaluations of x ↦ T^θ(x) and x ↦ ∇_θ T^θ(x) are computed at multiple time points x and for multiple θ's. Thus, until recently, explicit incorporation of highly-expressive diffeomorphism families into DL architectures used to be infeasible. This, however, is starting to change (e.g., [49, 7]). Particularly, Skafte et al. [49] utilized, in their STNs, the CPAB warps that had been proposed by Freifeld et al. [19, 20] and are also used in this work.

[Figure 3 plots omitted: per-class averages for Euclidean, DBA, SoftDTW (γ = 1.0), DTAN (train), and DTAN (test).]

Figure 4: R-DTAN joint-alignment of synthetic data. Each column depicts a different class. Top row: Source latent signals from which each class was created. Second: 10 perturbed signals and their respective mean. Last three rows illustrate R-DTAN output at each recurrence, eventually unwarping the nonlinear misalignment applied to the latent source signals. All the results shown here are on test data, and were obtained by the same single net (without knowing, at test time, the class labels).

CPAB warps combine expressiveness and efficiency, making them a natural choice in a DL context [24, 49]. Other efficient and expressive diffeomorphisms (e.g., [57, 4, 17, 3]) can also be explored in the DTAN context, provided they also offer an efficient and highly-accurate way to evaluate x ↦ ∇_θ T^θ(x) as CPAB warps do [18]. 
Below we briefly explain CPAB warps (restricting the discussion to 1D, which is the domain of interest in this work), and refer the reader to [19, 20, 18] for more details. The name CPAB, short for CPA-Based, is due to the fact that these warps are based on Continuous Piecewise-Affine (CPA) velocity fields. The term "piecewise" is w.r.t. a partition, denoted by Ω, of the signal's domain into subintervals. Let V denote the linear space of CPA velocity fields w.r.t. such a fixed Ω, let d = dim(V), and let v^θ : Ω → R, a velocity field parameterized by θ ∈ R^d, denote the generic element of V, where θ stands for the coefficients w.r.t. some basis of V. The corresponding space of CPAB warps, obtained via integration of elements of V, is

    T ≜ { T^θ : x ↦ φ^θ(x; 1)  s.t.  φ^θ(x; t) = x + ∫_0^t v^θ(φ^θ(x; τ)) dτ,  where v^θ ∈ V } ;    (4)

it can be shown that these warps are indeed (C¹) diffeomorphisms [19, 20]. See Figure 2 for a typical warp. While v^θ is CPA, T^θ : Ω → Ω is not (e.g., T^θ is differentiable). CPA velocity fields support an integration method that is faster and more accurate than typical velocity-field integration methods [19, 20]. The fineness of Ω controls the trade-off between the expressiveness of T on the one hand and the associated computational complexity and dimensionality on the other hand. Importantly in the TTN context, the CPAB gradient, ∇_θ T^θ(x), is given by the efficient solution of a system of coupled integral equations [20]; see [18] for details.

4 The Proposed Diffeomorphic Temporal Alignment Nets

Definition 1 requires the specification of T and a loss function for estimating (T^{θ_i})_{i=1}^N. 
To meet our goal, i.e., solving the joint-alignment problem while being able to generalize its solution to the alignment of new data, we propose a DL-based method which includes a TTN with diffeomorphic TT layers. Particularly, we choose T to be a family of 1D CPAB warps [19, 20] and incorporate the latter within TT layers. For simplicity, we base the data term of the training loss on least squares, but other criteria can be used as well. Altogether, this lets us propose the first DTAN for time-series joint alignment (it is also the first diffeomorphic transformer net for joint alignment of any kind of data, not just time series). Below we explain the method in more detail, including how it is used for aligning and averaging either existing or new data. We also discuss the critical role of warp regularization as well as recurrent DTANs.

Time-series Joint Alignment. Let U_i denote an input signal, let θ_i = f_loc(U_i, w) denote the corresponding output of the localization net f_loc(·, w) of weights w, and let V_i denote the result of warping U_i by T^{θ_i} ∈ T; i.e., V_i = U_i ∘ T^{θ_i}, where θ_i depends on w and U_i, as defined above. Consider first the case where all the U_i's belong to the same class. As the variance of the observed (U_i)_{i=1}^N is (at least partially) explained by the latent warps, (T^{θ_i})_{i=1}^N, we seek to minimize the empirical variance of the warped signals, (V_i)_{i=1}^N. In other words, our data term in this setting is

    F_data(w, (U_i)_{i=1}^N) ≜ (1/N) ∑_{i=1}^N ‖ V_i(U_i; w) − (1/N) ∑_{j=1}^N V_j(U_j; w) ‖²_{ℓ2} ,    (5)

where ‖·‖_{ℓ2} is the ℓ2 norm. Note this setting is unsupervised. For multi-class problems, our data term is the sum of the within-class variances:

    F_data(w, (U_i)_{i=1}^N) ≜ ∑_{k=1}^K (1/N_k) ∑_{i : z_i = k} ‖ V_i(U_i; w) − (1/N_k) ∑_{j : z_j = k} V_j(U_j; w) ‖²_{ℓ2} ,    (6)

where K is the number of classes, z_i takes values in {1, . . . , K} and is the class label associated with U_i (namely: z_i = k if and only if U_i belongs to class k), and N_k = |{i : z_i = k}| is the number of examples in class k. This is a semi-supervised setting in the following sense: the labels, (z_i)_{i=1}^N, are known during the learning (but not during the test) while the within-class alignment remains unsupervised as in the single-class case. Importantly, note that the same single network is responsible for aligning each of the classes; i.e., w does not vary with k; see Figure 2. In both the single- and multi-class cases, we (unlike Skafte et al. [49]) also use a regularization term on the warps,

    F_reg(w, (U_i)_{i=1}^N) = ∑_{i=1}^N (θ_i(w, U_i))^T Σ_CPA^{-1} θ_i(w, U_i) ,    (7)

where Σ_CPA is a CPA covariance matrix (proposed by Freifeld et al. [19, 20]) associated with a zero-mean Gaussian smoothness prior over CPA fields. Akin to the standard formulation in, e.g., Gaussian processes [45], Σ_CPA has two parameters: λ_var, which controls the overall variance, and λ_smooth, which controls the smoothness of the field. 
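A minimal numpy sketch of this training loss: the within-class variance data term plus the Mahalanobis-style warp regularizer. The identity matrix standing in for Σ_CPA^{-1} is a placeholder (the real Σ_CPA is the CPA smoothness prior of Freifeld et al., parameterized by λ_var and λ_smooth):

```python
import numpy as np

def data_term(V, labels):
    # Sum over classes of the mean squared deviation of each warped
    # signal from its class mean (the single-class case is K = 1).
    loss = 0.0
    for k in np.unique(labels):
        Vk = V[labels == k]
        loss += float(np.mean(np.sum((Vk - Vk.mean(axis=0)) ** 2, axis=1)))
    return loss

def reg_term(thetas, sigma_cpa_inv):
    # sum_i theta_i^T Sigma_CPA^{-1} theta_i over all warp parameters.
    return float(np.sum((thetas @ sigma_cpa_inv) * thetas))

rng = np.random.default_rng(1)
V = rng.normal(size=(10, 50))          # 10 warped signals of length 50
labels = np.array([0] * 5 + [1] * 5)   # two classes
thetas = rng.normal(size=(10, 8))      # warp parameters, d = 8
sigma_cpa_inv = np.eye(8)              # placeholder for Sigma_CPA^{-1}
loss = data_term(V, labels) + reg_term(thetas, sigma_cpa_inv)
```

In DTAN this loss is minimized over the localization-net weights w (which produce both the V_i's and the θ_i's), not over the θ_i's directly.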
A small λ_var favors small warps (i.e., close to the identity) and vice versa; similarly, the larger λ_smooth is, the more it favors CPA velocity fields that are almost purely affine, and vice versa. This also gives another way, an alternative to changing the resolution of Ω, to control the amount of expressiveness of the warps. In the context of our joint-alignment task (as opposed to, say, the classification task in [49]), using regularization is critical, partly since it is too easy to minimize F_data by unrealistically-large deformations that would cause most of the inter-signal variability to concentrate on a small region of the domain; the regularization term prevents that. Our loss function, to be minimized over w, is

    F(w, (U_i)_{i=1}^N) = F_data(w, (U_i)_{i=1}^N) + F_reg(w, (U_i)_{i=1}^N) .    (8)

The optimization (i.e., the training of the net) is done via standard methods for DL training (see § 5).

Generalization via the Learned Joint Alignment. Once the net is trained, a signal U (regardless of whether it is a training or a test signal) is aligned as follows. First set θ = f_loc(U); i.e., a forward pass of the net (an operation which is, as is usually the case in DL, simple and very fast). Next, obtain the aligned signal, V, via warping U by T^θ; i.e., set V = U ∘ T^θ. Especially useful and elegant is the fact that, in the multi-class case, the same single net aligns each new test signal, without knowing the label of the latter. This is in sharp contrast to other joint-alignment methods (e.g., those based on DBA, SoftDTW, atlases, etc.) that require knowing the label of the to-be-aligned signal.

Time-series Averaging. The data misalignment distorts, among other things, the sample mean [53, 23]. As discussed in § 2, averaging under the DTW distance is a common approach to this issue [44, 43, 13, 12]; however, such non-learning DTW-based methods are computationally expensive. 
This is especially problematic since, as these methods do not generalize, each batch of new signals requires them to solve another optimization problem. In contrast, since DTAN easily aligns new signals inexpensively and almost instantaneously via its forward pass, it also provides, in the single-class case, an instant mechanism for quickly averaging a new collection of previously-unseen signals (see Figure 3) by simply computing the sample mean of the warped test data: V̄ = (1/N) ∑_{j=1}^N V_j(U_j; w).

Variable-length and multi-channel data. The current work focuses on univariate time-series data and fixed-length input. The generalization to multichannel signals is trivial: DTAN can either apply the same warp to all channels (just like an STN warps RGB images) or learn and apply a different warp for each channel. To generalize DTAN to variable-length (VL) input, we need to consider f_loc, T, and the loss function. For f_loc, Recurrent Neural Networks (RNNs) are a natural choice, as they are designed to handle VL inputs. A nominal CPAB family, T, is capable of warping any time interval towards any other, even if they are of different lengths, as long as no boundary conditions are used. Finally, a loss function that can handle VL input must be chosen (e.g., SoftDTW [12]).

Table 1: Synthetic data: variance of the misaligned data ("Baseline") and of the data aligned via DTAN and Recurrent-DTAN (R-DTAN2 and R-DTAN4). For each set, Dir(k), k specifies the seriousness of the deformation, where a lower k indicates higher deformations. DTAN exhibits comparable results in terms of variance reduction between the train and test sets. 
Increasing the number of applied warps via an R-DTAN (without increasing the number of learned parameters) further decreases the variance.

    Dataset | Train set variance: Baseline, DTAN, R-DTAN2, R-DTAN4 | Test set variance: Baseline, DTAN, R-DTAN2, R-DTAN4
    Dir(32) | 0.483, 0.136, 0.130, 0.088 | 0.466, 0.234, 0.167, 0.106
    Dir(16) | 0.522, 0.240, 0.154, 0.098 | 0.514, 0.332, 0.240, 0.162
    Dir(8)  | 0.536, 0.254, 0.183, 0.122 | 0.532, 0.362, 0.248, 0.181

Recurrent DTANs. While often a coarse Ω suffices, the expressiveness of T can be increased using a finer Ω at the cost of computation speed and a higher d [19, 20]. In fact, at the limit of an infinitely-fine Ω, any diffeomorphism that is representable by integrating a Lipschitz-continuous stationary velocity field can be approximated by a CPAB diffeomorphism [19, 20]. Moreover, CPAB warps do not form a group under the composition operation [20] (even though they contain the identity warp and are closed under inversion); i.e., the composition of CPAB warps is a diffeomorphism but usually not CPAB itself. Thus, a way to increase expressiveness without refining Ω is by composing CPAB warps [20]. Concatenating CPAB warps increases expressiveness beyond T as it implies a non-stationary velocity field which is CPA w.r.t. Ω and piecewise constant w.r.t. time. Compositions increase dimensionality, but the overall cost of evaluating the composed warp scales better (in comparison with refinement of Ω), and it is also easier to infer the θ's. While this fact was not exploited in [49], we leverage it here as follows. We propose the Recurrent-DTAN (R-DTAN), a net that recurrently applies nonlinear time warps, via diffeomorphic TT layers, to the input signal (Figure 4). By sharing the learned parameters across all the TT layers, an R-DTAN increases expressiveness without increasing the number of parameters. 
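Why composition adds expressiveness can be illustrated with a toy one-parameter monotone warp family (a hypothetical stand-in, not CPAB): composing two members yields a warp that is still monotone, and hence invertible, but that no single parameter of the family can reproduce:

```python
import numpy as np

def toy_warp(theta, x):
    # A toy one-parameter monotone warp family on [0, 1] (not CPAB).
    return x + theta * x * (1.0 - x)

x = np.linspace(0.0, 1.0, 200)
w = toy_warp(-0.3, toy_warp(0.4, x))  # recurrent application: T2 ∘ T1

# Still strictly monotone, hence an invertible warp of [0, 1]:
assert np.all(np.diff(w) > 0.0)

# But no single theta reproduces it: fit theta exactly at one interior
# point, then check the fit everywhere else.
i = 100
theta_fit = (w[i] - x[i]) / (x[i] * (1.0 - x[i]))
mismatch = float(np.max(np.abs(toy_warp(theta_fit, x) - w)))
```

R-DTAN exploits the same effect with CPAB warps: since their composition is generally not CPAB, recurrently applying the shared TT layer enlarges the reachable set of warps at no extra parameter cost.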
While this is similar to, and inspired by, how Lin et al. [32] use a recurrent net with affine 2D warps, there is a key difference: since in the affine case zero-boundary conditions imply degeneracies, they explained they had to propagate warp parameters instead of the warped image as they would have liked. In contrast, as CPAB warps support optional zero-boundary conditions, propagating a warped signal through an R-DTAN is a non-issue.

Implementation. We adapted, to the 1D case, the implementation from [16] of the CPAB transformer layer, the CPAB gradient, the Tensorflow C++ API, and the Keras wrapper for the transformer layer. We also implemented in Tensorflow/Keras the CPAB regularization term as well as the recurrent net, both of which were not used in [49]. To summarize, users can benefit from our DTAN implementation in any Tensorflow [1] or Keras [9] generic DL architecture in a few lines of code.

5 Experiments and Results

We evaluated DTAN's time-series joint alignment on both synthetic and real-world data. For simplicity, in our experiments f_loc is set to be a 1D CNN consisting of 3 conv-layers (128-64-64 filters per layer, respectively), each followed by a ReLU nonlinear activation function [40], batch-normalization and max-pooling layers [27], where d = dim(θ) = 32. The learning rate was η = 10^−4, set to minimize Eq. (6) via the Adam optimizer [30]. The last activation function was tanh.

5.1 Learning Joint Alignment of Synthetic Data

We generated synthetic data by perturbing 4 synthetic signals using random warps sampled from a Dirichlet prior (see Sup. Mat. for details of the data-generation procedure). We generated 250 samples per class (1,000 in total) and used a 60-20-20% train, validation, and test split, choosing the model with the lowest validation loss (where λ_var = .01, λ_smooth = 1). 
We studied the effect of different temporal deformations on DTAN's ability to find the perturbed signals' joint alignment and thus recover the latent input signals. Unlike in the UCR datasets (see below), in the synthetic dataset the latent source signal is available and can be used as a reference for evaluation. We studied the following aspects: (1) the difficulty of the input signals (Figure 4, the different columns); (2) the seriousness of the deformation, achieved by varying k, the dimension of the Dirichlet distribution (Table 1, rows); and (3) the number of recurrences (Figure 4, rows). We also measured the timings of the alignment of single-class test data by DTAN. The test sets vary in size (10 to 10^4, log-spaced values) and signal length (64, 128, 256, 512).

Table 2: Timing test-set alignments for single-class synthetic data. There are 16 test sets. Within each set, the length of the signals is fixed. There are 4 different lengths (across the sets): 64, 128, 256, and 512. The size (i.e., the number of signals) of each test set is either 10, 10^2, 10^3, or 10^4. Taking all possible combinations of these 4 lengths and 4 sizes yielded the 16 test sets. Each entry in the table represents the time it took to align an entire such test set by DTAN's forward pass.

    Alignment timing per test set (in [sec])
    length \ # of signals |  10   | 10^2  | 10^3  | 10^4
    64                    | 0.003 | 0.003 | 0.007 | 0.109
    128                   | 0.003 | 0.004 | 0.012 | 0.211
    256                   | 0.014 | 0.038 | 0.042 | 0.455
    512                   | 0.003 | 0.007 | 0.084 | 0.660

Figure 5: Correct classification rates using NCC. Each point above the diagonal indicates an entire UCR archive dataset [8] where DTAN achieved better (or no-worse) results than the competing method. Blue: DTAN's test accuracy compared with: Euclidean (DTAN was better or no worse in 93% of the datasets), DBA (77%), and SoftDTW (62%). Red: DTAN-CNN compared with CNN (87%).
We trained DTAN on 100 samples for each signal length. For each condition, we measured how long it took to align the entire test set via DTAN's forward pass. Timing was measured on an NVIDIA GeForce GTX 1080 graphics card.
Results. Table 1 reports the average within-class variance of the misaligned signals ("Baseline") and the reduced variance after alignment by DTAN, R-DTAN2, and R-DTAN4 on both the train and test sets. The results show that DTAN generalizes well. In addition, as the number of diffeomorphic warps increases, R-DTAN performs finer alignments without increasing the number of parameters. Figure 4 illustrates how the synthetic misaligned signals are iteratively warped by R-DTAN, recovering the latent signals (up to a diffeomorphic offset). We also studied the effect of adding Gaussian noise to the perturbed signals on DTAN's performance; see the tables and discussion in the Sup. Mat. Table 2 summarizes the timing results, showing that DTAN's timing scales gracefully; e.g., aligning the largest test set (10^4 signals of length 512) took DTAN only 0.66 [sec].

5.2 UCR Time-Series Classification Archive (Real Data)

The UCR time-series classification archive [8] contains 85 real-world datasets (we used 84). The datasets differ from each other in the number of examples, signal length, application domain (e.g., ECG, medical imaging, motion sensors), and number of classes (2–60). We worked with the train and test sets provided with the archive. Here we report a summary of our results, which appear in full detail (together with a study of the effect of the regularization term) in our Sup. Mat.
Nearest Centroid Classification (NCC) experiment.
The 1-Nearest Neighbor (1-NN) classifier, when using the DTW distance, was shown [54, 5] to be on par with state-of-the-art time-series classifiers; however, 1-NN requires: 1) the entire train set to be stored; and 2) DTW to be computed between each pair of training and test examples. This scales poorly in terms of both computation and storage. The issue is mitigated considerably by performing NCC, using each class's average signal as a centroid [43]. In the absence of ground truth for the latent warps in real data, NCC success rates also provide an indicative metric for the quality of the joint alignment and/or average signal. Thus, we perform NCC on the UCR archive, comparing DTAN to: (1) the sample mean of the misaligned sets (Euclidean); (2) DBA; and (3) SoftDTW.
Experiment outline. For each of the UCR datasets, we trained DTAN in a similar fashion to Section 5.1, where λvar ∈ [10−3, 10−2] and λsmooth ∈ [0.5, 1]. We used R-DTANx, where x ∈ {1, 2, 4} is the number of TT layers. We then computed the centroid (w.r.t. the Euclidean distance) of each class in the aligned train set. NCC was conducted by aligning each test sample through the trained net and measuring the Euclidean distance to each of the centroids. DBA and SoftDTW were evaluated using the DTW distance (the distance associated with these methods). We used the Python tslearn implementation of DTW, DBA, and SoftDTW [51], limiting each to 100 iterations. The SoftDTW barycenter loss was minimized via L-BFGS [34], and the best γ was chosen among the following values: 10−3, 10−2, 10−1, 1, and 10.
Results.
Figure 5 shows the NCC experiment's results. Each point above the diagonal stands for an entire dataset on which DTAN's correct-classification rate was better than (or equal to) that of the competing method. This was the case for 93% of the datasets when compared to Euclidean, 77% for DBA, and 62% for SoftDTW. These results (1) illustrate the importance of unwarping the misaligned data (as shown by the Euclidean case) and (2) indicate that averaging via DTAN under Euclidean geometry is usually superior to DTW-based averaging. These findings are also supported by the average signals displayed in Figure 3. The Euclidean mean is strongly affected by the misalignment, while DBA falls into a bad local minimum. SoftDTW and DTAN show comparable qualitative results on this set, but note two major differences: (1) DTAN jointly aligns several classes within the same model (while SoftDTW had to be computed for each class separately), and (2) DTAN generalizes the learned alignment to new test samples (rightmost panel), whereas SoftDTW cannot (it must be recomputed for new signals). For more results, please see our Sup. Mat.
CNN classification experiment. We also tested whether DTAN can increase CNN classification accuracy. We first trained DTAN to minimize Eq. (6) using the same regularization and recurrence parameters used in the NCC experiment. After training, we froze the weights of floc, fed DTAN's outputs to another CNN (identical to floc in terms of architecture and optimization), and trained the latter for classification. We call this model DTAN-CNN. Note that other time-series averaging methods cannot be used in a similar way. We compared the average test accuracy of DTAN-CNN to that of the same CNN without DTAN, using 5 runs per dataset. DTAN-CNN achieved a higher-or-equal correct-classification rate on 87% of the datasets (see Figure 5, red).
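Once DTAN has aligned the signals, the NCC procedure used above reduces to a few lines of NumPy. The following is a minimal sketch, not the paper's code: the `aligned_train`/`aligned_test` arrays stand in for DTAN's forward-pass outputs, and the two toy "classes" are ours.

```python
import numpy as np

def ncc_fit(aligned_train, train_labels):
    """One Euclidean centroid per class, computed from aligned training signals."""
    classes = np.unique(train_labels)
    centroids = np.stack([aligned_train[train_labels == c].mean(axis=0)
                          for c in classes])
    return classes, centroids

def ncc_predict(aligned_test, classes, centroids):
    """Label each aligned test signal by its nearest centroid (Euclidean distance)."""
    # Pairwise distances, shape (n_test, n_classes).
    dists = np.linalg.norm(aligned_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Toy example: two well-separated "classes" standing in for DTAN-aligned signals.
t = np.linspace(0, 1, 64)
aligned_train = np.stack([np.sin(2 * np.pi * t) + 0.01 * i for i in range(5)]
                         + [np.cos(2 * np.pi * t) + 0.01 * i for i in range(5)])
labels = np.array([0] * 5 + [1] * 5)
classes, centroids = ncc_fit(aligned_train, labels)
pred = ncc_predict(aligned_train, classes, centroids)
```

Unlike 1-NN with DTW, only the per-class centroids need to be stored, and each test signal costs one distance computation per class.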
Figure 6, which provides a t-SNE visualization of the original and aligned data [36], illustrates how DTAN decreases intra-class variance while increasing inter-class variance, thus improving the performance of the classification net.

Figure 6: t-SNE visualization of the original and aligned test data of the 11-class FacesUCR dataset. The class labels are used here for visualization, but were not used during the test-data alignment. This highlights how DTAN decreases within-class variance while increasing inter-class variance.

6 Conclusion

Building on both recent ideas, such as STN [28, 49] and efficient highly-expressive diffeomorphisms [19, 20], and older ones, such as congealing [31, 10], we proposed DTAN, a deep net for learning time-series joint alignment. The alignment learning is done in an unsupervised way. If, however, class labels are known at training time, we use them within a semi-supervised framework that reduces the variance within each class separately. In addition, we proposed a regularization term for the warps, which is critical in an unsupervised framework. We also proposed R-DTAN, a recurrent variant of DTAN, which improves the expressiveness and performance of DTAN without increasing the number of parameters. Our experiments showed that the proposed method works well on both training and test data.
Acknowledgement: NSD was supported by research grant #15334 from the VILLUM FONDEN.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[2] A. Abid and J. Zou. Autowarp: Learning a warping distance from unlabeled time series using sequence autoencoders. arXiv preprint arXiv:1810.10107, 2018.

[3] S. Allassonniere, S. Durrleman, and E. Kuhn.
Bayesian mixed effect atlas estimation with a diffeomorphic deformation model. SIAM Journal on Imaging Sciences, 2015.

[4] V. Arsigny, O. Commowick, X. Pennec, and N. Ayache. A log-Euclidean polyaffine framework for locally rigid or affine registration. In WBIR. Springer, 2006.

[5] A. Bagnall and J. Lines. An experimental evaluation of nearest neighbour time series classification. arXiv preprint arXiv:1406.4757, 2014.

[6] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[7] G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca. An unsupervised learning model for deformable medical image registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9252–9260, 2018.

[8] Y. Chen, E. Keogh, B. Hu, N. Begum, A. Bagnall, A. Mueen, and G. Batista. The UCR time series classification archive, July 2015. www.cs.ucr.edu/~eamonn/time_series_data/.

[9] F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.

[10] M. Cox, S. Sridharan, S. Lucey, and J. Cohn. Least squares congealing for unsupervised alignment of images. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.

[11] M. Cox, S. Sridharan, S. Lucey, and J. Cohn. Least-squares congealing for large numbers of images. In ICCV, pages 1949–1956. IEEE, 2009.

[12] M. Cuturi and M. Blondel. Soft-DTW: A differentiable loss function for time-series. In Proceedings of the 34th International Conference on Machine Learning, pages 894–903. JMLR.org, 2017.

[13] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685–693, 2014.

[14] A. V. Dalca, M. Rakic, J. Guttag, and M. R. Sabuncu.
Learning conditional deformable templates with convolutional networks. In Advances in Neural Information Processing Systems, 2019.

[15] B. D. de Vos, F. F. Berendsen, M. A. Viergever, M. Staring, and I. Išgum. End-to-end unsupervised deformable image registration with a convolutional neural network. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 204–212. Springer, 2017.

[16] N. S. Detlefsen. libcpab. https://github.com/SkafteNicki/libcpab, 2018.

[17] S. Durrleman, S. Allassonnière, and S. Joshi. Sparse adaptive parameterization of variability in image ensembles. IJCV, 2013.

[18] O. Freifeld. Deriving the CPAB derivative. Technical report, Ben-Gurion University, 2018.

[19] O. Freifeld, S. Hauberg, K. Batmanghelich, and J. W. Fisher III. Highly-expressive spaces of well-behaved transformations: Keeping it simple. In ICCV, 2015.

[20] O. Freifeld, S. Hauberg, K. Batmanghelich, and J. W. Fisher III. Transformations based on continuous piecewise-affine velocity fields. IEEE TPAMI, 2017.

[21] J. Grabocka and L. Schmidt-Thieme. NeuralWarp: Time-series similarity with warping networks. arXiv preprint arXiv:1812.08306, 2018.

[22] L. Gupta, D. L. Molfese, R. Tammana, and P. G. Simos. Nonlinear alignment and averaging for estimating the evoked potential. IEEE Transactions on Biomedical Engineering, 43(4):348–356, 1996.

[23] D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.

[24] S. Hauberg, O. Freifeld, A. B. L. Larsen, J. W. Fisher III, and L. K. Hansen. Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In AISTATS, 2016.

[25] G. Huang, M. Mattar, H. Lee, and E. G. Learned-Miller. Learning to align from scratch. In NIPS, pages 764–772, 2012.
[26] G. B. Huang, V. Jain, and E. Learned-Miller. Unsupervised joint alignment of complex images. In ICCV, pages 1–8. IEEE, 2007.

[27] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[28] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.

[29] S. Joshi, B. Davis, M. Jomier, and G. Gerig. Unbiased diffeomorphic atlas construction for computational anatomy. NeuroImage, 23:S151–S160, 2004.

[30] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, 2014.

[31] E. G. Learned-Miller. Data driven image models through continuous joint alignment. IEEE TPAMI, 28(2):236–250, 2006.

[32] C.-H. Lin and S. Lucey. Inverse compositional spatial transformer networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2568–2576, 2017.

[33] J. Listgarten, R. M. Neal, S. T. Roweis, and A. Emili. Multiple alignment of continuous time series. In Advances in Neural Information Processing Systems, pages 817–824, 2005.

[34] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.

[35] S. Lohit, Q. Wang, and P. Turaga. Temporal transformer networks: Joint learning of invariant and discriminative time warping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12426–12435, 2019.

[36] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[37] M. A. Mattar, M. G. Ross, and E. G. Learned-Miller. Nonparametric curve alignment. In ICASSP, pages 3457–3460. IEEE, 2009.
[38] E. G. Miller, N. E. Matsakis, and P. A. Viola. Learning from one example through shared densities on transforms. In CVPR, volume 1, pages 464–471. IEEE, 2000.

[39] D. Mumford and A. Desolneux. Pattern Theory: The Stochastic Analysis of Real-World Signals. AK Peters/CRC Press, 2010.

[40] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

[41] J. Oh, J. Wang, and J. Wiens. Learning to exploit invariances in clinical time-series data using sequence transformer networks. arXiv preprint arXiv:1808.06725, 2018.

[42] T. J. O'Shea, L. Pemula, D. Batra, and T. C. Clancy. Radio transformer networks: Attention models for learning to synchronize in wireless systems. In 2016 50th Asilomar Conference on Signals, Systems and Computers, pages 662–666. IEEE, 2016.

[43] F. Petitjean, G. Forestier, G. I. Webb, A. E. Nicholson, Y. Chen, and E. Keogh. Dynamic time warping averaging of time series allows faster and more accurate classification. In 2014 IEEE International Conference on Data Mining (ICDM), pages 470–479. IEEE, 2014.

[44] F. Petitjean, A. Ketterlin, and P. Gançarski. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3):678–693, 2011.

[45] C. E. Rasmussen. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning, pages 63–71. Springer, 2004.

[46] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.

[47] H. Sakoe. Dynamic-programming approach to continuous speech recognition. In Proc. of the 1971 International Congress of Acoustics, Budapest, 1971.

[48] H. Sakoe and S.
Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, 1978.

[49] N. Skafte Detlefsen, O. Freifeld, and S. Hauberg. Deep diffeomorphic transformer networks. In CVPR, 2018.

[50] G.-Z. Sun, H.-H. Chen, and Y.-C. Lee. Time warping invariant neural networks. In Advances in Neural Information Processing Systems, pages 180–187, 1993.

[51] R. Tavenard, J. Faouzi, and G. Vandewiele. tslearn: A machine learning toolkit dedicated to time-series data, 2017. https://github.com/rtavenar/tslearn.

[52] L. Wang and T. Jiang. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1994.

[53] T. M. Wigley, K. R. Briffa, and P. D. Jones. On the average value of correlated time series, with applications in dendroclimatology and hydrometeorology. Journal of Climate and Applied Meteorology, 23(2):201–213, 1984.

[54] X. Xi, E. Keogh, C. Shelton, L. Wei, and C. A. Ratanamahatana. Fast time series classification using numerosity reduction. In Proceedings of the 23rd International Conference on Machine Learning, pages 1033–1040. ACM, 2006.

[55] X. Yang, R. Kwitt, M. Styner, and M. Niethammer. Quicksilver: Fast predictive image registration, a deep learning approach. NeuroImage, 2017.

[56] M. Zhang and P. T. Fletcher. Finite-dimensional Lie algebras for fast diffeomorphic image registration. In IPMI, 2015.

[57] M. Zhang and P. T. Fletcher. Fast diffeomorphic image registration via Fourier-approximated Lie algebras. IJCV, 2018.

[58] T. Zhang, K. Zhang, and J. Wu. Temporal transformer networks for acoustic scene classification. Proc. Interspeech 2018, pages 1349–1353, 2018.