{"title": "Unsupervised Scalable Representation Learning for Multivariate Time Series", "book": "Advances in Neural Information Processing Systems", "page_first": 4650, "page_last": 4661, "abstract": "Time series constitute a challenging data type for machine learning algorithms, due to their highly variable lengths and sparse labeling in practice. In this paper, we tackle this challenge by proposing an unsupervised method to learn universal embeddings of time series. Unlike previous works, it is scalable with respect to their length and we demonstrate the quality, transferability and practicability of the learned representations with thorough experiments and comparisons. To this end, we combine an encoder based on causal dilated convolutions with a novel triplet loss employing time-based negative sampling, obtaining general-purpose representations for variable length and multivariate time series.", "full_text": "Unsupervised Scalable Representation Learning\n\nfor Multivariate Time Series\n\nJean-Yves Franceschi\u2217\n\nSorbonne Universit\u00e9, CNRS, LIP6, F-75005 Paris, France\n\njean-yves.franceschi@lip6.fr\n\nAymeric Dieuleveut\n\nMLO, EPFL, Lausanne CH-1015, Switzerland\nCMAP, Ecole Polytechnique, Palaiseau, France\naymeric.dieuleveut@polytechnique.edu\n\nMartin Jaggi\n\nMLO, EPFL, Lausanne CH-1015, Switzerland\n\nmartin.jaggi@epfl.ch\n\nAbstract\n\nTime series constitute a challenging data type for machine learning algorithms,\ndue to their highly variable lengths and sparse labeling in practice. In this paper,\nwe tackle this challenge by proposing an unsupervised method to learn universal\nembeddings of time series. Unlike previous works, it is scalable with respect to\ntheir length and we demonstrate the quality, transferability and practicability of\nthe learned representations with thorough experiments and comparisons. 
To this end, we combine an encoder based on causal dilated convolutions with a novel triplet loss employing time-based negative sampling, obtaining general-purpose representations for variable length and multivariate time series.

1 Introduction

We investigate in this work the topic of unsupervised general-purpose representation learning for time series. In spite of the increasing amount of work about representation learning in fields like natural language processing (Young et al., 2018) or videos (Denton & Birodkar, 2017), few articles explicitly deal with general-purpose representation learning for time series without structural assumption on non-temporal data.
This problem is indeed challenging for various reasons. First, real-life time series are rarely or sparsely labeled; therefore, unsupervised representation learning is strongly preferable. Second, methods need to deliver compatible representations while allowing the input time series to have unequal lengths. Third, scalability and efficiency at both training and inference time are crucial, in the sense that the techniques must work for both the short and the long time series encountered in practice.
Hence, we propose in the following an unsupervised method to learn general-purpose representations for multivariate time series that copes with the varying and potentially high lengths of the studied time series. To this end, we introduce a novel unsupervised loss training a scalable encoder, shaped as a deep convolutional neural network with dilated convolutions (Oord et al., 2016) and outputting fixed-length vector representations regardless of the length of its input. 
This loss is built as a triplet loss employing time-based negative sampling, taking advantage of the encoder's resilience to time series of unequal lengths. To our knowledge, it is the first fully unsupervised triplet loss in the literature of time series.

∗Work partially done while studying at ENS de Lyon and MLO, EPFL.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We assess the quality of the learned representations on various datasets to ensure their universality. In particular, we test how our representations can be used for classification tasks on the standard datasets of the time series literature, compiled in the UCR repository (Dau et al., 2018). We show that our representations are general and transferable, and that our method outperforms concurrent unsupervised methods and even matches the state of the art of non-ensemble supervised classification techniques. Moreover, since UCR time series are exclusively univariate and mostly short, we also evaluate our representations on the recent UEA multivariate time series repository (Bagnall et al., 2018), as well as on a real-life dataset including very long time series, on which we demonstrate scalability, performance and generalization ability across different tasks beyond classification.
This paper is organized as follows. Section 2 outlines previous work on unsupervised representation learning, triplet losses and deep architectures for time series. Section 3 describes the unsupervised training of the encoder, while Section 4 details the architecture of the latter. Finally, Section 5 provides results of the experiments that we conducted to evaluate our method.

2 Related Work

Unsupervised learning for time series. 
To our knowledge, apart from those dealing with videos or high-dimensional data (Srivastava et al., 2015; Denton & Birodkar, 2017; Villegas et al., 2017; Oord et al., 2018), few recent works tackle unsupervised representation learning for time series. Fortuin et al. (2019) deal with a problem related to, but different from, ours, by learning temporal representations of time series that represent well their evolution. Hyvarinen & Morioka (2016) learn representations on evenly sized subdivisions of time series, by learning to discriminate between those subdivisions from these representations. Lei et al. (2017) present an unsupervised method designed so that the distances between learned representations mimic a standard distance (Dynamic Time Warping, DTW) between time series. Malhotra et al. (2017) design an encoder as a recurrent neural network, jointly trained with a decoder as a sequence-to-sequence model to reconstruct the input time series from its learned representation. Finally, Wu et al. (2018a) compute feature embeddings generated in the approximation of a carefully designed and efficient kernel.
However, these methods are either not scalable or unsuited to long time series (due to the sequential nature of a recurrent network, or to the quadratic complexity of DTW with respect to the input length), are tested on no or very few standard datasets with no publicly available code, or do not provide sufficient comparisons to assess the quality of the learned representations. Our scalable model and extensive analysis aim at overcoming these issues, besides outperforming these methods.

Triplet losses. 
Triplet losses have recently been widely used in various forms for representation learning in different domains (Mikolov et al., 2013; Schroff et al., 2015; Wu et al., 2018b) and have also been theoretically studied (Arora et al., 2019), but have not found much use for time series apart from audio (Bredin, 2017; Lu et al., 2017; Jansen et al., 2018), and never, to our knowledge, in a fully unsupervised setting, as existing works assume the existence of class labels or annotations in the training data. Closer to our work, even though focusing on a different, more specific task, Turpault et al. (2019) learn audio embeddings in a semi-supervised setting, while partially relying on specific transformations of the training data to sample positive samples in the triplet loss; Logeswaran & Lee (2018) train a sentence encoder to recognize, among randomly chosen sentences, the true context of another sentence, which is a difficult method to adapt to time series. Our method instead relies on a more natural choice of positive samples, learning similarities using subsampling.

Convolutional networks for time series. Deep convolutional neural networks have recently been successfully applied to time series classification tasks (Cui et al., 2016; Wang et al., 2017), showing competitive performance. Dilated convolutions, popularized by WaveNet (Oord et al., 2016) for audio generation, have been used to improve their performance and were shown to perform well as sequence-to-sequence models for time series forecasting (Bai et al., 2018), using an architecture that inspired ours. 
These works particularly show that dilated convolutions help to build networks for sequential tasks that outperform recurrent neural networks in terms of both efficiency and prediction performance.

3 Unsupervised Training

We seek to train an encoder-only architecture, avoiding the need to jointly train with a decoder as in autoencoder-based standard representation learning methods such as the one of Malhotra et al. (2017), since a decoder would induce a larger computational cost. To this end, we introduce a novel triplet loss for time series, inspired by the successful and by now classic word representation learning method known as word2vec (Mikolov et al., 2013). The proposed triplet loss uses original time-based sampling strategies to overcome the challenge of learning on unlabeled data. As far as we know, this work is the first in the time series literature to rely on a triplet loss in a fully unsupervised setting.
The objective is to ensure that similar time series obtain similar representations, with no supervision to learn such similarity. Triplet losses help to achieve the former (Schroff et al., 2015), but require pairs of similar inputs to be provided, which makes the latter challenging. While previous supervised works for time series using triplet losses assume that data is annotated, we introduce an unsupervised time-based criterion to select pairs of similar time series, while taking into account time series of varying lengths, by following word2vec's intuition. The assumption made in the CBOW model of word2vec is twofold. The representation of the context of a word should probably be, on one hand, close to the one of this word (Goldberg & Levy, 2014), and, on the other hand, distant from the one of randomly chosen words, since they are probably unrelated to the original word's context. 
The corresponding loss then pushes pairs of (context, word) and (context, random word) to be linearly separable. This is called negative sampling.
To adapt this principle to time series, we consider (see Figure 1 for an illustration) a random subseries2 x^ref of a given time series y_i. Then, on one hand, the representation of x^ref should be close to the one of any of its subseries x^pos (a positive example). On the other hand, if we consider another subseries x^neg (a negative example) chosen at random (in a different random time series y_j if several series are available, or in the same time series if it is long enough and not stationary), then its representation should be distant from the one of x^ref. Following the analogy with word2vec, x^pos corresponds to a word, x^ref to its context, and x^neg to a random word. To improve the stability and convergence of the training procedure as well as the experimental results of our learned representations, we introduce, as in word2vec, several negative samples (x^neg_k)_{k ∈ ⟦1, K⟧}, chosen independently at random.

Figure 1: Choices of x^ref, x^pos and x^neg.

The objective to minimize during training, corresponding to these choices, can be thought of as the one of word2vec with its shallow network replaced by a deep network f(·, θ) with parameters θ, or formally

    −log σ( f(x^ref, θ)⊤ f(x^pos, θ) ) − Σ_{k=1}^{K} log σ( −f(x^ref, θ)⊤ f(x^neg_k, θ) ),    (1)

where σ is the sigmoid function. This loss pushes the computed representations to distinguish between x^ref and the x^neg_k, and to assimilate x^ref and x^pos. Overall, the training procedure consists in traveling through the training dataset for several epochs (possibly using mini-batches), picking tuples (x^ref, x^pos, (x^neg_k)_k) at random as detailed in Algorithm 1, and performing a minimization step on the corresponding loss for each tuple, until training ends. The overall computational and memory cost is O(K · c(f)), where c(f) is the cost of evaluating and backpropagating through f on a time series; thus this unsupervised training is scalable as long as the encoder architecture is scalable as well.
The length of the negative examples is chosen at random in Algorithm 1 for the most general case; however, their length can also be the same for all samples and equal to size(x^pos). The latter case is suitable when all time series in the dataset have equal lengths, and speeds up the training procedure thanks to computation factorizations; the former case is only used when time series in the dataset do not have the same lengths, as we experimentally saw no difference other than time efficiency between the two cases. In our experiments, we do not cap the lengths of x^ref, x^pos and x^neg, since they are already limited by the length of the train time series, which corresponds to the scales of lengths on which our representations are tested.

2I.e., a subsequence of a time series composed of consecutive time steps of this time series.

Algorithm 1: Choices of x^ref, x^pos and (x^neg_k)_{k ∈ ⟦1, K⟧} for an epoch over the set (y_i)_{i ∈ ⟦1, N⟧}.
1 for i ∈ ⟦1, N⟧ with s_i = size(y_i) do
2   pick s_pos = size(x^pos) in ⟦1, s_i⟧ and s_ref = size(x^ref) in ⟦s_pos, s_i⟧ uniformly at random;
3   pick x^ref uniformly at random among subseries of y_i of length s_ref;
4   pick x^pos uniformly at random among subseries of x^ref of length s_pos;
5   pick uniformly at random i_k ∈ ⟦1, N⟧, then s^neg_k = size(x^neg_k) in ⟦1, size(y_{i_k})⟧, and finally x^neg_k uniformly at random among subseries of y_{i_k} of length s^neg_k, for k ∈ ⟦1, K⟧.

We highlight that this time-based triplet loss leverages the ability of the chosen encoder to take as input time series of different lengths. 
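For illustration, the sampling of Algorithm 1 and the loss of Eq. (1) can be sketched in NumPy as follows; the stand-in feature encoder and all identifiers here are ours, for illustration only (the released implementation uses PyTorch and a learned encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_triplet(dataset, K):
    """Sketch of Algorithm 1 for one training series: sample x_ref, a positive
    subseries x_pos of x_ref, and K negative subseries from random series."""
    y = dataset[rng.integers(len(dataset))]
    s_pos = int(rng.integers(1, len(y) + 1))       # size(x_pos) in [1, s_i]
    s_ref = int(rng.integers(s_pos, len(y) + 1))   # size(x_ref) in [s_pos, s_i]
    a = int(rng.integers(0, len(y) - s_ref + 1))
    x_ref = y[a:a + s_ref]                         # random subseries of y
    b = int(rng.integers(0, s_ref - s_pos + 1))
    x_pos = x_ref[b:b + s_pos]                     # random subseries of x_ref
    negs = []
    for _ in range(K):                             # negatives from random series
        y_k = dataset[rng.integers(len(dataset))]
        s_neg = int(rng.integers(1, len(y_k) + 1))
        c = int(rng.integers(0, len(y_k) - s_neg + 1))
        negs.append(y_k[c:c + s_neg])
    return x_ref, x_pos, negs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def triplet_loss(encode, x_ref, x_pos, negs):
    """Eq. (1): attract (x_ref, x_pos), repel (x_ref, x_neg_k) for each k."""
    z_ref, z_pos = encode(x_ref), encode(x_pos)
    loss = -np.log(sigmoid(z_ref @ z_pos))         # positive pair term
    for x_neg in negs:                             # K negative pair terms
        loss -= np.log(sigmoid(-(z_ref @ encode(x_neg))))
    return loss
```

In the actual method, `encode` is the dilated causal convolutional encoder of Section 4, the dataset is traversed for several epochs, and a gradient step is taken on the loss of each sampled tuple.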
By training the encoder on a range of input lengths going from one to the length of the longest time series in the train set, it becomes able to output meaningful and transferable representations regardless of the input length, as shown in Section 5.
This training procedure is interesting in that it is efficient enough to be run over long time series (see Section 5) with a scalable encoder (see Section 4), thanks to its decoder-less design and the separability of the loss, on which a backpropagation per term can be performed to save memory.3

4 Encoder Architecture

We explain and present in this section our choice of architecture for the encoder, which is motivated by three requirements: it must extract relevant information from time series; it needs to be time- and memory-efficient, both for training and testing; and it has to allow variable-length inputs. We choose to use deep neural networks with exponentially dilated causal convolutions to handle time series. While they have been popularized in the context of sequence generation (Oord et al., 2016), they have never been used for unsupervised time series representation learning. They offer several advantages.
Compared to recurrent neural networks, which are inherently designed for sequence-modeling tasks and thus sequential, these networks are scalable as they allow efficient parallelization on modern hardware such as GPUs. Besides this demonstrated efficiency, exponentially dilated convolutions have also been introduced to better capture, compared to full convolutions, long-range dependencies at constant depth, by exponentially increasing the receptive field of the network (Oord et al., 2016; Yu & Koltun, 2016; Bai et al., 2018).
Convolutional networks have also been shown to perform well on several aspects of sequential data. 
For instance, recurrent networks are known to be subject to the issue of exploding and vanishing gradients, due to their recurrent nature (Goodfellow et al., 2016, Chapter 10.9). While significant work has been done to tackle this issue and improve their ability to capture long-term dependencies, for example with the LSTM (Hochreiter & Schmidhuber, 1997), recurrent networks are still outperformed by convolutional networks in this respect (Bai et al., 2018). On the specific domains of time series classification, which is an essential part of our experimental evaluation, and forecasting, deep neural networks have recently been successfully used (Bai et al., 2018; Ismail Fawaz et al., 2019).
Our model is particularly based on stacks of dilated causal convolutions (see Figure 2a), which map a sequence to a sequence of the same length, such that the i-th element of the output sequence is computed using only values up to the i-th element of the input sequence, for all i. It is thus called causal, since the output value corresponding to a given time step is not computed using future input values. Causal convolutions also alleviate the main disadvantage of not using recurrent networks: the latter can be run in an online fashion, saving memory and computation time during testing. In our case, causal convolutions organize the computational graph so that, in order to update the output when an element is added at the end of the input time series, one only has to evaluate the highlighted graph shown in Figure 2a rather than the full graph.
Inspired by Bai et al. (2018), we build each layer of our network as a combination of causal convolutions, weight normalizations (Salimans & Kingma, 2016), leaky ReLUs and residual connections (see Figure 2b). 
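As a minimal illustration of the mechanism, assuming a single channel and a fixed toy kernel (not the actual trained architecture, which adds weight normalization, leaky ReLUs and residual connections), a dilated causal convolution and the resulting fixed-size encoder can be sketched as:

```python
import numpy as np

def causal_dilated_conv(x, w, d):
    """1-D causal convolution with dilation d: out[t] = sum_j w[j] * x[t - j*d],
    with zero-padding on the left, so no future input value is ever used."""
    pad = (len(w) - 1) * d
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[j] * xp[pad + t - j * d] for j in range(len(w)))
                     for t in range(len(x))])

def toy_encoder(x, depth=3):
    """Stacked causal convolutions with dilation 2^i at layer i, followed by a
    global max pooling over time: the output size is fixed for any input length."""
    w = np.array([0.5, 0.5])              # fixed toy kernel for illustration
    h = x
    for i in range(depth):
        h = np.tanh(causal_dilated_conv(h, w, 2 ** i))
    return np.array([h.max()])            # global max pooling over the time axis
```

Changing an input value x[t0] only affects outputs at times t ≥ t0, and stacking dilations 1, 2 and 4 with kernel size 2 yields a receptive field of 8 time steps at depth 3, growing exponentially with depth.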
Each of these layers is given an exponentially increasing dilation parameter (2^i for the i-th layer). The output of this causal network is then given to a global max pooling layer squeezing the temporal dimension and aggregating all temporal information into a fixed-size vector (as proposed by Wang et al. (2017) in a supervised setting with full convolutions). A linear transformation of this vector is then the output of the encoder, with a fixed size independent of the input length.

3We used this optimization for multivariate or long (with length higher than 10 000) time series.

Figure 2: (a) Illustration of three stacked dilated causal convolutions, with dilations 2^0 = 1, 2^1 = 2 and 2^2 = 4. Lines between each sequence represent their computational graph. Red solid lines highlight the dependency graph for the computation of the last value of the output sequence, showing that no future value of the input time series is used to compute it. (b) Composition of the i-th layer of the chosen architecture: two blocks of causal convolution (with dilation 2^i), weight normalization and leaky ReLU, plus a residual connection with a kernel-size-1 convolution if needed for up- or down-sampling.

5 Experimental Results

We review in this section experiments conducted to investigate the relevance of the learned representations. The code corresponding to these experiments is attached in the supplementary material and is publicly available.4 The full training process and hyperparameter choices are detailed in the supplementary material, Sections S1 and S2. 
We used Python 3 for implementation, with PyTorch 0.4.1 (Paszke et al., 2017) for neural networks and scikit-learn (Pedregosa et al., 2011) for SVMs. Each encoder was trained using the Adam optimizer (Kingma & Ba, 2015) on a single Nvidia Titan Xp GPU with CUDA 9.0, unless stated otherwise.
Selecting hyperparameters for an unsupervised method is challenging, since the many possible downstream tasks are usually supervised. Therefore, following Wu et al. (2018a), we choose for each considered dataset archive a single set of hyperparameters regardless of the downstream task. Moreover, we highlight that we perform no hyperparameter optimization of the unsupervised encoder architecture and training parameters for any task, unlike other unsupervised works such as TimeNet (Malhotra et al., 2017). In particular, for classification tasks, no label was used during the encoder training.

5.1 Classification

We first assess the quality of our learned representations on supervised tasks in a standard manner (Xu et al., 2003; Dosovitskiy et al., 2014) by using them for time series classification. In this setting, we show that our method (1) outperforms state-of-the-art unsupervised methods, and notably achieves performance close to the supervised state of the art, (2) strongly outperforms supervised deep learning methods when data is only sparsely labeled, and (3) produces transferable representations.
For each considered dataset with a train / test split, we train an encoder without supervision on its train set. We then train an SVM with radial basis function kernel on top of the learned features, using the train labels of the dataset, and output the corresponding classification score on the test set. 
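The evaluation protocol just described can be sketched as follows, with `encode` standing in for the trained, frozen encoder (the function name and arguments are ours, for illustration):

```python
import numpy as np
from sklearn.svm import SVC

def classification_score(encode, X_train, y_train, X_test, y_test):
    """Protocol sketch: freeze the unsupervised encoder, fit an RBF-kernel SVM
    on the train representations with the train labels, report test accuracy."""
    # Variable-length series all map to fixed-size representations.
    Z_train = np.stack([encode(x) for x in X_train])
    Z_test = np.stack([encode(x) for x in X_test])
    clf = SVC(kernel="rbf", gamma="scale")
    clf.fit(Z_train, y_train)
    return clf.score(Z_test, y_test)
```

Only the SVM sees labels; the encoder itself is trained without any, which is what makes the sparsely-labeled setting of Section 5.1.1 possible.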
4https://github.com/White-Link/UnsupervisedScalableRepresentationLearningTimeSeries.

Figure 3: Boxplot of the ratio of the accuracy versus maximum achieved accuracy (higher is better) for compared methods on the first 85 UCR datasets.

Figure 4: Accuracy of ResNet and our method with respect to the ratio of labeled data on TwoPatterns. Error bars correspond to the standard deviation over five runs per point for each method.

As our training procedure encourages representations of different time series to be separable, observing the classification performance of a simple SVM on these features allows us to check their quality (Wu et al., 2018a). Using SVMs also allows, once the encoder is trained, an efficient training both in terms of time (training is a matter of minutes in most cases) and space.
As K has a significant impact on the performance, we present a combined version of our method, where representations computed by encoders trained with different values of K (see Section S2 for more details) are concatenated. This enables our learned representations with different parameters to complement each other, and removes some noise from the classification scores.

5.1.1 Univariate Time Series

We present accuracy scores for all 128 datasets of the new iteration of the UCR archive (Dau et al., 2018), which is a standard set of varied univariate datasets. We report in Table 1 scores for only some UCR datasets, while scores for all datasets are reported in the supplementary material, Section S3.
We first compare our scores to the two concurrent methods of this work, TimeNet (Malhotra et al., 2017) and RWS (Wu et al., 2018a), which are two unsupervised methods also training a simple classifier on top of the learned representations, and reporting their results on a few UCR datasets. 
We also compare, on the first 85 datasets of the archive5, to the four best classifiers of the supervised state of the art studied by Bagnall et al. (2017): COTE – replaced by its improved version HIVE-COTE (Lines et al., 2018) –, ST (Bostrom & Bagnall, 2015), BOSS (Schäfer, 2015) and EE (Lines & Bagnall, 2015). HIVE-COTE is a powerful ensemble method using many classifiers in a hierarchical voting structure; EE is a simpler ensemble method; ST is based on shapelets, and BOSS is a dictionary-based classifier.6 We also add DTW (a one-nearest-neighbor classifier with DTW as measure) as a baseline. HIVE-COTE includes ST, BOSS, EE and DTW in its ensemble, and is thus expected to outperform them. Additionally, we compare our method to the ResNet method of Wang et al. (2017), which is the best supervised neural network method studied in the review of Ismail Fawaz et al. (2019).

Performance. Comparison with the unsupervised state of the art (Section S3, Table S3 of the supplementary material) indicates that our method consistently matches or outperforms both unsupervised methods TimeNet and RWS (on 11 out of 12 and 10 out of 11 UCR datasets, respectively), showing its

5The new UCR archive includes 43 new datasets on which no reproducible results of state-of-the-art methods have been produced yet. Still, we provide complete results for our method on these datasets in the supplementary material, Section S3, Table S4, along with those of DTW, the only other method for which they were available.

6While ST and BOSS are also ensembles of classifiers, we chose not to qualify them as ensembles, since their ensembles only include variations of the same novel classification method.

Table 1: Accuracy scores of variants of our method compared with other supervised and unsupervised methods, on some UCR datasets. 
Results for the whole archive are available in the supplementary material, Section S3, Tables S1, S2 and S4. Bold and underlined scores respectively indicate the best and second-best (when there is no tie for first place) performing methods. The FordA column reports an SVM trained on representations from an encoder trained on FordA with K = 5, testing transferability (see the corresponding paragraph below).

Dataset              |        Ours (unsupervised)         |  DTW  |  Supervised   |    Ensemble
                     |  K = 5   K = 10  Combined  FordA   |       |  ST     BOSS  |  HIVE-COTE   EE
DiatomSizeReduction  |  0.984   0.993   0.993     0.974   | 0.967 |  0.925  0.931 |  0.941       0.944
ECGFiveDays          |  1       1       1         1       | 1     |  0.984  1     |  1           0.82
FordB                |  0.781   0.793   0.81      0.798   | 0.62  |  0.807  0.711 |  0.823       0.662
Ham                  |  0.657   0.724   0.695     0.533   | 0.467 |  0.686  0.667 |  0.667       0.571
Phoneme              |  0.249   0.276   0.289     0.196   | 0.228 |  0.321  0.265 |  0.382       0.305
SwedishLeaf          |  0.925   0.914   0.931     0.925   | 0.792 |  0.928  0.922 |  0.954       0.915

performance. Unlike our work, code and full results on the UCR archive are not provided for these methods, hence the incomplete results.
When comparing to the supervised non-neural-network state of the art, we observe (see Figures S2 and S3 in the supplementary material) that our method is globally the second-best one (with average rank 2.92), only beaten by HIVE-COTE (1.71) and equivalent to ST (2.95). Thus, our unsupervised method beats several recognized supervised classifiers, and is only preceded by a powerful ensemble method, which was expected since the latter takes advantage of numerous classifiers and data representations. Additionally, Figure 3 shows that our method has the second-best median for the ratio of accuracy over maximum achieved accuracy, behind HIVE-COTE and above ST. Finally, results reported from the study of Ismail Fawaz et al. 
(2019) for the fully supervised ResNet (Section S3, Table S3 of the supplementary material) show that it expectedly outperforms our method, on 63% of 71 UCR datasets.7 Overall, our method achieves remarkable performance, as it is close to the best supervised neural network, matches the second-best studied non-neural-network supervised method and, in particular, is at the level of the best performing method included in HIVE-COTE.8

Sparse labeling. Taking advantage of their unsupervised training, we show that our representations can be successfully used on sparsely labeled datasets, compared to supervised methods, since only the SVM is restricted to being learned on the small portion of labeled data. Figure 4 shows that an SVM trained on our representations of a randomly chosen labeled set consistently outperforms the supervised neural network ResNet trained on a labeled set of the same size, especially when the percentage of labeled data is small. For example, with only 1.5% of labeled data, we achieve an accuracy of 81%, against only 26% for ResNet, equivalent to a random classifier. Moreover, we exceed 99% of accuracy starting from 11% of labeled data, while ResNet only achieves this level of accuracy with more than 50% of labeled data. This shows the relevance of our method in semi-supervised settings, compared to fully supervised methods.

Representations metric space. Besides being suitable for classification purposes, the learned representations may also be used to define a meaningful measure between time series. Indeed, we train, instead of an SVM, a one-nearest-neighbor classifier with respect to the ℓ2 distance on the same representations, and compare it to DTW, which uses the same classifier on the raw time series. As shown in Section S3, this version of our method outperforms DTW on 66% of the UCR datasets, showing the advantage of the learned representations even in a non-parametric classification setting. 
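This non-parametric variant amounts to a one-nearest-neighbor classifier under the ℓ2 distance between fixed-size representations, playing the role DTW plays on raw series; a minimal sketch (identifiers are ours):

```python
import numpy as np

def one_nn_l2(Z_train, y_train, Z_test):
    """1-NN classification with the l2 distance on fixed-size representations."""
    Z_train, Z_test = np.asarray(Z_train), np.asarray(Z_test)
    preds = []
    for z in Z_test:
        # Index of the closest train representation in l2 distance.
        nearest = int(np.argmin(np.linalg.norm(Z_train - z, axis=1)))
        preds.append(y_train[nearest])
    return np.array(preds)
```

Unlike DTW, whose cost is quadratic in the series length for each comparison, each distance here is linear in the (fixed) representation size.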
We also include quantitative experiments to assess the usefulness of comparing time series using the ℓ2 distance between their representations, with dimensionality reduction (Figure 5) and clustering (Section 5.2 and Figure 6) visualizations.

Transferability. We include in the comparisons the classification accuracy, for each dataset, of an SVM trained on this dataset using the representations computed by an encoder that was trained on another dataset (FordA, with K = 5), to test the transferability of our representations.

7Those results are incomplete as Ismail Fawaz et al. (2019) performed their experiments on the old version of the archive, whereas ours are performed on its most recent release, where some datasets were changed.

8Our method could be included in HIVE-COTE, which could improve its performance, but this is beyond the scope of this work and requires technical work, as HIVE-COTE is implemented in Java and ours in Python.

(a) DiatomSizeReduction. (b) FordB. (c) OSULeaf.
Figure 5: Two-dimensional t-SNE (Maaten & Hinton, 2008) with perplexity 30 of the learned representations of three UCR test sets. Elements' classes are distinguishable using their respective marker shapes and colors.

We observe that the scores achieved by this SVM trained on transferred representations are close to the scores reported when the encoder is trained on the same dataset as the SVM, showing the transferability of our representations from one dataset to another, and from time series to other time series with different lengths. More generally, this transferability and the performance of simple classifiers on the representations we learn indicate that they are universal and easy to make use of.

5.1.2 Multivariate Time Series

To complement our evaluation on the UCR archive, which exclusively contains univariate series, we evaluate our method on multivariate time series. 
This can be done by simply changing the number of input filters of the first convolutional layer of the proposed encoder. We test our method on all 30 datasets of the newly released UEA archive (Bagnall et al., 2018). Full accuracy scores are presented in the supplementary material, Section S4, Table S5.
The UEA archive has been designed as a first attempt to provide a standard archive for multivariate time series classification, like the UCR one for univariate series. As it has only been released recently, we could not compare our method to state-of-the-art classifiers for multivariate time series. However, we provide a comparison with DTWD as a baseline, using results provided by Bagnall et al. (2018). DTWD (dimension-Dependent DTW) is a possible extension of DTW to the multivariate setting, and is the best baseline studied by Bagnall et al. (2018). Overall, our method matches or outperforms DTWD on 69% of the UEA datasets, which indicates good performance. As this archive is destined to grow and evolve in the future, and without further comparisons, no additional conclusion can be drawn.

5.2 Evaluation on Long Time Series

We show the applicability and scalability of our method on long, unlabeled time series, using regression tasks that could correspond to an industrial application; this complements the tests performed on the UCR and UEA archives, whose datasets mostly contain short time series.
The Individual Household Electric Power Consumption (IHEPC) dataset from the UCI Machine Learning Repository (Dheeru & Karra Taniskidou, 2017) is a single time series of length 2 075 259, monitoring the minute-averaged electricity consumption of one French household for four years. We split this time series into train (first 5 × 10^5 measurements, approximately a year) and test (remaining measurements) sets. 
The encoder is trained over the train time series on a single Nvidia Tesla P100 GPU in no more than a few hours, showing that our training procedure is scalable to long time series.
We apply the learned encoder to two regression tasks involving two different input scales. We compute, for each time step of the time series, the representations of the last window corresponding to a day (1 440 measurements) and to a quarter (12 · 7 · 1 440 measurements), using the same encoder. An example of application of the day-long representations is shown in Figure 6. The considered tasks consist, for each time step, in predicting the discrepancy between the mean value of the series for the next period (either a day or a quarter) and the one for the previous period. We compare linear regressors, trained using gradient descent to minimize the mean squared error between the prediction and the target, applied either to the raw time series or to the previously computed representations.

Figure 6: Minute-averaged electricity consumption for a single day, with respect to the hour of the day. Vertical lines and colors divide the day into six clusters, obtained with k-means clustering based on representations computed on a day-long sliding window. The clustering divides the day into meaningful portions (night, morning, afternoon, evening).

Results and execution times on an Nvidia Titan Xp GPU are presented in Table 2.9 On both scales of inputs, our representations induce only a slightly degraded performance but provide a large efficiency improvement, due to their small size compared to the raw time series.
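To make the efficiency argument concrete, the following purely illustrative sketch fits linear regressors both on raw day-long windows and on compact fixed-size features, using synthetic data. The paper trains its regressors by gradient descent, whereas we use a closed-form least-squares fit; all sizes except the 1 440-minute day, including the feature dimension, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_windows, window_len, repr_dim = 200, 1440, 160  # a day = 1440 minutes; repr_dim is hypothetical

raw = rng.standard_normal((n_windows, window_len))  # raw day-long windows
reps = rng.standard_normal((n_windows, repr_dim))   # precomputed fixed-size representations
target = rng.standard_normal(n_windows)             # e.g. next-day mean minus previous-day mean

def fit_linear(features, y):
    """Least-squares linear regressor with a bias term."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

w_raw = fit_linear(raw, target)   # 1 441 parameters: one per raw measurement, plus bias
w_rep = fit_linear(reps, target)  # 161 parameters: far cheaper to fit and apply
assert w_raw.shape == (window_len + 1,) and w_rep.shape == (repr_dim + 1,)
```

At the quarter scale each raw window covers 12 · 7 · 1 440 measurements while the representation size is unchanged, so the gap between the two regressors grows accordingly.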
This shows that a single encoder trained to minimize our time-based loss is able to output representations for different input scales that are also helpful for tasks other than classification, corroborating their universality.

Table 2: Results obtained on the IHEPC dataset.

Task      Metric      Raw values     Representations
Day       Test MSE    8.92 × 10⁻²    8.92 × 10⁻²
          Wall time   3min 1s        12s
Quarter   Test MSE    6.26 × 10⁻²    7.26 × 10⁻²
          Wall time   1h 40min 15s   9s

6 Conclusion

We present an unsupervised representation learning method for time series that is scalable and produces high-quality, easy-to-use embeddings. They are generated by an encoder, formed by dilated convolutions, that admits variable-length inputs, and trained with an efficient triplet loss using novel time-based negative sampling for time series. The conducted experiments show that these representations are universal and can easily and efficiently be used for diverse tasks such as classification, for which we achieve state-of-the-art performance, and regression.

Acknowledgements

We would like to acknowledge Patrick Gallinari, Sylvain Lamprier, Mehdi Lamrayah, Etienne Simon, Valentin Guiguet, Clara Gainon de Forsan de Gabriac, Eloi Zablocki, Antoine Saporta, Edouard Delasalles, Sidak Pal Singh, Andreas Hug, Jean-Baptiste Cordonnier, Andreas Loukas and François Fleuret for helpful comments and discussions. We thank as well our anonymous reviewers for their constructive suggestions, Liljefors et al. (2019) for their extensive and positive reproducibility report on our work, and all contributors to the datasets and archives we used for this project (Dau et al., 2018; Bagnall et al., 2018; Dheeru & Karra Taniskidou, 2017).
We acknowledge financial support from the SFA-AM ETH Board initiative, the LOCUST ANR project (ANR-15-CE23-0027) and CLEAR (Center for LEArning and data Retrieval, joint laboratory with Thales10).

9While acting on representations of the same size, the quarterly linear regressor is slightly faster than the daily one because the number of quarters in the considered time series is smaller than the number of days.

10https://www.thalesgroup.com.

References

Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., and Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.

Bagnall, A., Lines, J., Bostrom, A., Large, J., and Keogh, E. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, 31(3):606–660, May 2017.

Bagnall, A., Dau, H. A., Lines, J., Flynn, M., Large, J., Bostrom, A., Southam, P., and Keogh, E. The UEA multivariate time series classification archive, 2018. arXiv preprint arXiv:1811.00075, 2018.

Bai, S., Kolter, J. Z., and Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.

Bostrom, A. and Bagnall, A. Binary shapelet transform for multiclass time series classification. In Big Data Analytics and Knowledge Discovery, pp. 257–269, Cham, 2015. Springer International Publishing.

Bredin, H. TristouNet: Triplet loss for speaker turn embedding. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5430–5434, March 2017.

Cui, Z., Chen, W., and Chen, Y. Multi-scale convolutional neural networks for time series classification. arXiv preprint arXiv:1603.06995, 2016.

Dau, H. A., Keogh, E., Kamgar, K., Yeh, C.-C. M., Zhu, Y., Gharghabi, S., Ratanamahatana, C.
A.,\nYanping, Hu, B., Begum, N., Bagnall, A., Mueen, A., and Batista, G. The UCR time series\nclassi\ufb01cation archive, October 2018.\n\nDenton, E. L. and Birodkar, V. Unsupervised learning of disentangled representations from video. In\nAdvances in Neural Information Processing Systems 30, pp. 4414\u20134423. Curran Associates, Inc.,\n2017.\n\nDheeru, D. and Karra Taniskidou, E. UCI machine learning repository, 2017.\n\nDosovitskiy, A., Springenberg, J. T., Riedmiller, M., and Brox, T. Discriminative unsupervised\nfeature learning with convolutional neural networks. In Advances in Neural Information Processing\nSystems 27, pp. 766\u2013774. Curran Associates, Inc., 2014.\n\nFortuin, V., H\u00fcser, M., Locatello, F., Strathmann, H., and R\u00e4tsch, G. SOM-VAE: Interpretable discrete\nrepresentation learning on time series. In International Conference on Learning Representations,\n2019.\n\nGoldberg, Y. and Levy, O. word2vec explained: deriving Mikolov et al.\u2019s negative-sampling word-\n\nembedding method. arXiv preprint arXiv:1402.3722, 2014.\n\nGoodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. http://www.\n\ndeeplearningbook.org.\n\nHochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735\u20131780,\n\n1997.\n\nHyvarinen, A. and Morioka, H. Unsupervised feature extraction by time-contrastive learning and\nnonlinear ICA. In Advances in Neural Information Processing Systems 29, pp. 3765\u20133773. Curran\nAssociates, Inc., 2016.\n\nIsmail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L., and Muller, P.-A. Deep learning for time\n\nseries classi\ufb01cation: a review. Data Mining and Knowledge Discovery, March 2019.\n\nJansen, A., Plakal, M., Pandya, R., Ellis, D. P. W., Hershey, S., Liu, J., Moore, R. C., and Saurous,\nR. A. Unsupervised learning of semantic audio representations. In 2018 IEEE International\nConference on Acoustics, Speech and Signal Processing (ICASSP), pp. 
126–130, April 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Lei, Q., Yi, J., Vaculin, R., Wu, L., and Dhillon, I. S. Similarity preserving representation learning for time series analysis. arXiv preprint arXiv:1702.03584, 2017.

Liljefors, F., Sorkhei, M. M., and Broome, S. [Re] Unsupervised scalable representation learning for multivariate time series. 2019. URL https://openreview.net/forum?id=HyxQr65z6S.

Lines, J. and Bagnall, A. Time series classification with ensembles of elastic distance measures. Data Mining and Knowledge Discovery, 29(3):565–592, May 2015.

Lines, J., Taylor, S., and Bagnall, A. Time series classification with HIVE-COTE: The hierarchical vote collective of transformation-based ensembles. ACM Transactions on Knowledge Discovery from Data, 12(5):52:1–52:35, July 2018.

Logeswaran, L. and Lee, H. An efficient framework for learning sentence representations. In International Conference on Learning Representations, 2018.

Lu, R., Wu, K., Duan, Z., and Zhang, C. Deep ranking: Triplet MatchNet for music metric learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 121–125, March 2017.

Maaten, L. v. d. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, November 2008.

Malhotra, P., TV, V., Vig, L., Agarwal, P., and Shroff, G. TimeNet: Pre-trained deep recurrent neural network for time series classification. arXiv preprint arXiv:1706.08838, 2017.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119. Curran Associates, Inc., 2013.

Oord, A. v.
d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, October 2011.

Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems 29, pp. 901–909. Curran Associates, Inc., 2016.

Schäfer, P. The BOSS is concerned with time series classification in the presence of noise. Data Mining and Knowledge Discovery, 29(6):1505–1530, November 2015.

Schroff, F., Kalenichenko, D., and Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823, June 2015.

Srivastava, N., Mansimov, E., and Salakhudinov, R. Unsupervised learning of video representations using LSTMs. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 843–852, Lille, France, July 2015. PMLR.

Turpault, N., Serizel, R., and Vincent, E. Semi-supervised triplet loss based learning of ambient audio embeddings.
In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 760–764, May 2019.

Villegas, R., Yang, J., Hong, S., Lin, X., and Lee, H. Decomposing motion and content for natural video sequence prediction. In International Conference on Learning Representations, 2017.

Wang, Z., Yan, W., and Oates, T. Time series classification from scratch with deep neural networks: A strong baseline. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 1578–1585, May 2017.

Wu, L., Yen, I. E.-H., Yi, J., Xu, F., Lei, Q., and Witbrock, M. Random Warping Series: A random features method for time-series embedding. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 793–802. PMLR, April 2018a.

Wu, L. Y., Fisch, A., Chopra, S., Adams, K., Bordes, A., and Weston, J. StarSpace: Embed all the things! In Thirty-Second AAAI Conference on Artificial Intelligence, 2018b.

Xu, W., Liu, X., and Gong, Y. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '03, pp. 267–273, New York, NY, USA, 2003. ACM.

Young, T., Hazarika, D., Poria, S., and Cambria, E. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75, August 2018.

Yu, F. and Koltun, V. Multi-scale context aggregation by dilated convolutions.
In International Conference on Learning Representations, 2016.