{"title": "Deep Anomaly Detection Using Geometric Transformations", "book": "Advances in Neural Information Processing Systems", "page_first": 9758, "page_last": 9769, "abstract": "We consider the problem of anomaly detection in images, and \npresent a new detection technique. Given a sample\nof images, all known to belong to a ``normal'' class (e.g., dogs), \nwe show how to train a deep neural model that can detect \nout-of-distribution images (i.e., non-dog objects). The main \nidea behind our scheme is to train a multi-class model to discriminate between\ndozens of geometric transformations applied on all the given images. The auxiliary expertise learned by the model generates feature detectors that effectively identify, at test time, anomalous images based on the softmax activation statistics of the model when applied on transformed images.\nWe present extensive experiments using the proposed detector, which indicate that our algorithm improves state-of-the-art methods by a wide margin.", "full_text": "Deep Anomaly Detection Using Geometric\n\nTransformations\n\nIzhak Golan\n\nRan El-Yaniv\n\nDepartment of Computer Science\n\nTechnion \u2013 Israel Institute of Technology\n\nDepartment of Computer Science\n\nTechnion \u2013 Israel Institute of Technology\n\nHaifa, Israel\n\nizikgo@cs.technion.ac.il\n\nHaifa, Israel\n\nrani@cs.technion.ac.il\n\nAbstract\n\nWe consider the problem of anomaly detection in images, and present a new\ndetection technique. Given a sample of images, all known to belong to a \u201cnormal\u201d\nclass (e.g., dogs), we show how to train a deep neural model that can detect\nout-of-distribution images (i.e., non-dog objects). The main idea behind our\nscheme is to train a multi-class model to discriminate between dozens of geometric\ntransformations applied on all the given images. 
The auxiliary expertise learned\nby the model generates feature detectors that effectively identify, at test time,\nanomalous images based on the softmax activation statistics of the model when\napplied on transformed images. We present extensive experiments using the\nproposed detector, which indicate that our technique consistently improves all\nknown algorithms by a wide margin.\n\n1\n\nIntroduction\n\nFuture machine learning applications such as self-driving cars or domestic robots will, inevitably,\nencounter various kinds of risks including statistical uncertainties. To be usable, these applications\nshould be as robust as possible to such risks. One such risk is exposure to statistical errors or\ninconsistencies due to distributional divergences or noisy observations. The well-known problem\nof anomaly/novelty detection highlights some of these risks, and its resolution is of the utmost\nimportance to mission critical machine learning applications. While anomaly detection has long\nbeen considered in the literature, conclusive understanding of this problem in the context of deep\nneural models is sorely lacking. For example, in machine vision applications, presently available\nnovelty detection methods can suffer from poor performance in some problems, as demonstrated by\nour experiments.\nIn the basic anomaly detection problem, we have a sample from a \u201cnormal\u201d class of instances,\nemerging from some distribution, and the goal is to construct a classi\ufb01er capable of detecting out-\nof-distribution \u201cabnormal\u201d instances [5].1 There are quite a few variants of this basic anomaly\ndetection problem. For example, in the positive and unlabeled version, we are given a sample from\nthe \u201cnormal\u201d class, as well as an unlabeled sample that is contaminated with abnormal instances.\nThis contaminated-sample variant turns out to be easier than the pure version of the problem (in the\nsense that better performance can be achieved) [2]. 
In the present paper, we focus on the basic (and\nharder) version of anomaly detection, and consider only machine vision applications for which deep\nmodels (e.g., convolutional neural networks) are essential.\nThere are a few works that tackle the basic, pure-sample-anomaly detection problem in the context\nof images. The most successful results among these are reported for methods that rely on one of\n\n1Unless otherwise mentioned, the use of the adjective \u201cnormal\u201d is unrelated to the Gaussian distribution.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fthe following two general schemes. The \ufb01rst scheme consists of methods that analyze errors in\nreconstruction, which is based either on autoencoders or generative adversarial models (GANs)\ntrained over the normal class. In the former case, reconstruction de\ufb01ciency of a test point indicates\nabnormality. In the latter, the reconstruction error of a test instance is estimated using optimization to\n\ufb01nd the approximate inverse of the generator. The second class of methods utilizes an autoencoder\ntrained over the normal class to generate a low-dimensional embedding. To identify anomalies,\none uses classical methods over this embedding, such as low-density rejection [8, 9] or single-class\nSVM [29, 30]. A more advanced variant of this approach combines these two steps (encoding and\nthen detection) using an appropriate cost function, which is used to train a single neural model that\nperforms both procedures [27].\nIn this paper we consider a completely different approach that bypasses reconstruction (as in au-\ntoencoders or GANs) altogether. The proposed method is based on the observation that learning to\ndiscriminate between many types of geometric transformations applied to normal images, encourages\nlearning of features that are useful for detecting novelties. 
Thus, we train a multi-class neural classi\ufb01er\nover a self-labeled dataset, which is created from the normal instances and their transformed versions,\nobtained by applying numerous geometric transformations. At test time, this discriminative model is\napplied on transformed instances of the test example, and the distribution of softmax response values\nof the \u201cnormal\u201d train images is used for effective detection of novelties. The intuition behind our\nmethod is that by training the classi\ufb01er to distinguish between transformed images, it must learn\nsalient geometrical features, some of which are likely to be unique to the single class.\nWe present extensive experiments of the proposed method and compare it to several state-of-the-art\nmethods for pure anomaly detection. We evaluate performance using a one-vs-all scheme over\nseveral image datasets such as CIFAR-100, which (to the best of our knowledge) have never been\nconsidered before in this setting. Our results overwhelmingly indicate that the proposed method\nachieves dramatic improvements over the best available methods. For example, on the CIFAR-10\ndataset (10 different experiments), we improved the top performing baseline AUROC by 32% on\naverage. In the CatsVsDogs dataset, we improve the top performing baseline AUROC by 67%.\n\n2 Related Work\n\nThe literature related to anomaly detection is extensive and beyond the scope of this paper (see,\ne.g., [5, 42] for wider scope surveys). Our focus is on anomaly detection in the context of images\nand deep learning. In this scope, most published works rely, implicitly or explicitly, on some form\nof (unsupervised) reconstruction learning. These methods can be roughly categorized into two\napproaches.\n\nReconstruction-based anomaly score. 
These methods assume that anomalies possess different\nvisual attributes than their non-anomalous counterparts, so it will be dif\ufb01cult to compress and\nreconstruct them based on a reconstruction scheme optimized for single-class data. Motivated by\nthis assumption, the anomaly score for a new sample is given by the quality of the reconstructed\nimage, which is usually measured by the (cid:96)2 distance between the original and reconstructed image.\nClassic methods belonging to this category include Principal Component Analysis (PCA) [18], and\nRobust-PCA [4]. In the context of deep learning, various forms of deep autoencoders are the main tool\nused for reconstruction-based anomaly scoring. Xia et al. [37] use a convolutional autoencoder with\na regularizing term that encourages outlier samples to have a large reconstruction error. Variational\nautoencoder is used by An and Cho [1], where they estimate the reconstruction probability through\nMonte-Carlo sampling, from which they extract an anomaly score. Another related method, which\nscores an unseen sample based on the ability of the model to generate a similar one, uses Generative\nAdversarial Networks (GANS) [16]. Schlegl et al. [28] use this approach on optical coherence\ntomography images of the retina. Deecke et al. [7] employ a variation of this model called ADGAN,\nreporting slightly superior results on CIFAR-10 [21] and MNIST [22].\n\nReconstruction-based representation learning. Many conventional anomaly detection methods\nuse a low-density rejection principle [8]. Given data, the density at each point is estimated, and new\nsamples are deemed anomalous when they lie in a low-density region. Examples of such methods\nare kernel density estimation (KDE) [25], and Robust-KDE [19]. This approach is known to be\nproblematic when handling high-dimensional data due to the curse of dimensionality. 
To mitigate\n\n2\n\n\fthis problem, practitioners often use a two-step approach of learning a compact representation of\nthe data, and then applying density estimation methods on the lower-dimensional representation [4].\nMore advanced techniques combine these two steps and aim to learn a representation that facilitates\nthe density estimation task. Zhai et al. [41] utilize an energy-based model in the form of a regularized\nautoencoder in order to map each sample to an energy score, which is the estimated negative log-\nprobability of the sample under the data distribution. Zong et al. [43] uses the representation layer of\nan autoencoder in order to estimate parameters of a Gaussian mixture model.\nThere are few approaches that tackled the anomaly detection problem without resorting to some form\nof reconstruction. A recent example was published by Ruff et al. [27], who have developed a deep\none-class SVM model. The model consists of a deep neural network whose weights are optimized\nusing a loss function resembling the SVDD [30] objective.\n\n3 Problem Statement\nIn this paper, we consider the problem of anomaly detection in images. Let X be the space of all\n\u201cnatural\u201d images, and let X \u2286 X be the set of images de\ufb01ned as normal. Given a sample S \u2286 X,\nand a type-II error constraint (rate of normal samples that were classi\ufb01ed as anomalies), we would\nlike to learn the best possible (in terms of type-I error) classi\ufb01er hS(x) : X \u2192 {0, 1}, where\nhS(x) = 1 \u21d4 x \u2208 X, which satis\ufb01es the constraint. Images that are not in X are referred to as\nanomalies or novelties.\nTo control the trade-off between type-I and type-II errors when classifying, a common practice is to\nlearn a scoring (ranking) function nS(x) : X \u2192 R, such that higher scores indicate that samples are\nmore likely to be in X. 
Once such a scoring function has been learned, a classi\ufb01er can be constructed\nfrom it by specifying an anomaly threshold (\u03bb):\n\n(cid:26)1 nS(x) \u2265 \u03bb\n\n0 nS(x) < \u03bb.\n\nh\u03bb\nS(x) =\n\nAs many related works [28, 31, 17], in this paper we also focus only on learning the scoring function\nnS(x), and completely ignore the constrained binary decision problem. A useful (and common\npractice) performance metric to measure the quality of the trade-off of a given scoring function is the\narea under the Receiver Operating Characteristic (ROC) curve, which we denote here as AUROC.\nWhen prior knowledge on the proportion of anomalies is available, the area under the precision-recall\ncurve (AUPR) metric might be preferred [6]. We also report on performance in term of this metric in\nthe supplementary material.\n\n4 Discriminative Learning of an Anomaly Scoring Function Using\n\nGeometric Transformations\n\nAs noted above, we aim to learn a scoring function nS (as described in Section 3) in a discriminative\nfashion. To this end, we create a self-labeled dataset of images from our initial training set S, by\nusing a class of geometric transformations T . The created dataset, denoted ST , is generated by\napplying each geometric transformation in T on all images in S, where we label each transformed\nimage with the index of the transformation that was applied on it. This process creates a self-labeled\nmulti-class dataset (with |T | classes) whose cardinality is |T ||S|. After the creation of ST , we train a\nmulti-class image classi\ufb01er whose objective is to predict, for each image, the index of its generating\ntransformation in T . At inference time, given an unseen image x, we decide whether it belongs to\nthe normal class by \ufb01rst applying each transformation on it, and then applying the classi\ufb01er on each\nof the |T | transformed images. Each such application results in a softmax response vector of size\n|T |. 
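The training-set construction and test-time scoring just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the transformation factorization, the stand-in `predict_softmax` model, and all helper names are ours; the simplified diagonal-mean score and the simplified Dirichlet score follow the definitions given below in Section 4.2.

```python
import numpy as np

def example_transformations():
    """One plausible factorization of a 72-element T: 2 flips x 9 integer
    translations x 4 rotations (an assumption; the paper's exact list is in
    its supplementary material). Returns parameter tuples, not callables."""
    flips = (False, True)
    shifts = [(tx, ty) for tx in (-1, 0, 1) for ty in (-1, 0, 1)]
    rotations = (0, 90, 180, 270)
    return [(f, s, r) for f in flips for s in shifts for r in rotations]

def make_self_labeled(images, transforms):
    """Build S_T: apply every T_j in T to every x in S; the label is the
    transformation index j, giving |T|*|S| self-labeled examples."""
    data, labels = [], []
    for x in images:
        for j, T in enumerate(transforms):
            data.append(T(x))
            labels.append(j)
    return data, labels

def normality_score(x, transforms, predict_softmax):
    """Simplified score of Section 4.2: average softmax mass the trained
    k-class model assigns to the *correct* transformation index over all
    |T| transformed versions of x."""
    k = len(transforms)
    return sum(predict_softmax(T(x))[j] for j, T in enumerate(transforms)) / k

def dirichlet_score(x, transforms, predict_softmax, alphas):
    """Simplified Dirichlet score: sum_i (alpha_i - 1) . log y(T_i(x)),
    given per-transformation estimated Dirichlet parameters alphas[i]."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for a, T in zip(alphas, transforms):
        y = np.asarray(predict_softmax(T(x)), dtype=float)
        total += float(np.dot(np.asarray(a) - 1.0, np.log(y + eps)))
    return total
```

Here `predict_softmax` stands for whatever k-class classifier was trained on S_T; for n training images the self-labeled set has |T|·n examples.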
The final normality score is defined using the combined log-likelihood of these vectors under an\nestimated distribution of \u201cnormal\u201d softmax vectors (see details below).\n\n4.1 Creating and Learning the Self-Labeled Dataset\nLet T = {T0, T1, . . . , Tk\u22121} be a set of geometric transformations, where for each 1 \u2264 i \u2264 k\u22121, Ti :\nX \u2192 X , and T0(x) = x is the identity transformation. The set T is a hyperparameter of our method,\non which we elaborate in Section 6. The self-labeled set ST is defined as\n\nST := {(Tj(x), j) : x \u2208 S, Tj \u2208 T } .\n\nThus, for any x \u2208 S, j is the label of Tj(x). We use this set to straightforwardly learn a deep\nk-class classification model, f\u03b8, which we train over the self-labeled dataset ST using the standard\ncross-entropy loss function. To this end, any useful classification architecture and optimization\nmethod can be employed for this task.\n\n4.2 Dirichlet Normality Score\nWe now define our normality score function nS(x). Fix a set of geometric transformations T =\n{T0, T1, . . . , Tk\u22121}, and assume that a k-class classification model f\u03b8 has been trained on the\nself-labeled set ST (as described above). For any image x, let y(x) := softmax (f\u03b8 (x)), i.e., the vector of\nsoftmax responses of the classifier f\u03b8 applied on x. To construct our normality score we define:\n\nnS(x) := \u03a3_{i=0}^{k\u22121} log p(y(Ti(x)) | Ti),\n\nwhich is the combined log-likelihood of a transformed image conditioned on each of the applied\ntransformations in T , under a na\u00efve (typically incorrect) assumption that all of these conditional\ndistributions are independent. We approximate each conditional distribution to be y(Ti(x)) | Ti \u223c\nDir(\u03b1i), where \u03b1i \u2208 R^k_+, x \u223c pX (x), i \u223c Uni(0, k \u2212 1), and pX (x) is the real data probability\ndistribution of \u201cnormal\u201d samples. Our choice of the Dirichlet distribution is motivated by two reasons.\nFirst, it is a common choice for distribution approximation when samples (i.e., y) reside in the\nunit k \u2212 1 simplex. Second, there are efficient methods for numerically estimating the maximum\nlikelihood parameters [24, 34]. We denote the estimation by \u02dc\u03b1i. Using the estimated Dirichlet\nparameters, the normality score of an image x is:\n\nnS(x) = \u03a3_{i=0}^{k\u22121} [ log \u0393(\u03a3_{j=0}^{k\u22121} [\u02dc\u03b1i]j) \u2212 \u03a3_{j=0}^{k\u22121} log \u0393([\u02dc\u03b1i]j) + \u03a3_{j=0}^{k\u22121} ([\u02dc\u03b1i]j \u2212 1) log y(Ti(x))j ] .\n\nSince all \u02dc\u03b1i are constant w.r.t. x, we can ignore the first two terms in the brackets and redefine a\nsimplified normality score, which is equivalent in its normality ordering:\n\nnS(x) = \u03a3_{i=0}^{k\u22121} \u03a3_{j=0}^{k\u22121} ([\u02dc\u03b1i]j \u2212 1) log y(Ti(x))j = \u03a3_{i=0}^{k\u22121} (\u02dc\u03b1i \u2212 1) \u00b7 log y(Ti(x)).\n\nAs demonstrated in our experiments, this score tightly captures normality in the sense that for two\nimages x1 and x2, nS(x1) > nS(x2) tends to imply that x1 is \u201cmore normal\u201d than x2. For each\ni \u2208 {0, . . . , k\u22121}, we estimate \u02dc\u03b1i using the fixed point iteration method described in [24], combined\nwith the initialization step proposed by Wicker et al. [34]. Each vector \u02dc\u03b1i is estimated based on the\nset Si = {y(Ti(x)) | x \u2208 S}. We note that the use of an independent image set for estimating \u02dc\u03b1i may\nimprove performance. A full and detailed algorithm is available in the supplementary material.\nA simplified version of the proposed normality score was used during preliminary stages of this\nresearch: \u02c6nS(x) := (1/k) \u03a3_{j=0}^{k\u22121} [y (Tj(x))]j . This simple score function eliminates the need for the\nDirichlet parameter estimation, is easy to implement, and still achieves excellent results that are only\nslightly worse than the above Dirichlet score.\n\n5 Experimental Results\n\nIn this section, we describe our experimental setup and evaluation method, the baseline algorithms\nwe use for comparison purposes, the datasets, and the implementation details of our technique\n(architecture used and geometric transformations applied). We then present extensive experiments\non the described publicly available datasets, demonstrating the effectiveness of our scoring function.\nFinally, we show that our method is also effective at identifying out-of-distribution samples in labeled\nmulti-class datasets.\n\n5.1 Baseline Methods\n\nWe compare our method to state-of-the-art deep learning approaches as well as a few classic methods.\n\nOne-Class SVM. The one-class support vector machine (OC-SVM) is a classic and popular kernel-based method for novelty detection [29, 30]. It is typically employed with an RBF kernel, and learns\na collection of closed sets in the input space, containing most of the training samples. Samples\nresiding outside of these enclosures are deemed anomalous. Following [41, 7], we use this model on\nraw input (i.e., a flattened array of the pixels comprising an image), as well as on a low-dimensional\nrepresentation obtained by taking the bottleneck layer of a trained convolutional autoencoder. We\nname these models RAW-OC-SVM and CAE-OC-SVM, respectively. It is very important to\nnote that in both these variants of OC-SVM, we provide the OC-SVM with an unfair significant\nadvantage by optimizing its hyperparameters in hindsight; i.e., the OC-SVM hyperparameters (\u03bd and\n\u03b3) were optimized to maximize AUROC and taken to be the best performing values among those in\nthe parameter grid: \u03bd \u2208 {0.1, 0.2, . . . 
, 0.9}, \u03b3 \u2208 {2\u22127, 2\u22126, . . . , 22}. Note that the hyperparameter\noptimization procedure has been provided with a two-class classi\ufb01cation problem. There are, in\nfact, methods for optimizing these parameters without hindsight knowledge [33, 3]. These methods\nare likely to degrade the performance of the OC-SVM models. The convolutional autoencoder is\nchosen to have a similar architecture to that of DCGAN [26], where the encoder is adapted from the\ndiscriminator, and the decoder is adapted from the generator.\nIn addition, we compare our method to a recently published, end-to-end variant of OC-SVM called\nOne-Class Deep SVDD [27]. This model, which we name E2E-OC-SVM, uses an objective similar\nto that of the classic SVDD [30] to optimize the weights of a deep architecture. However, there are\nconstraints on the used architecture, such as lack of bias terms and unbounded activation functions.\nThe experimental setup used by the authors is identical to ours, allowing us to report their published\nresults as they are, on CIFAR-10.\n\nDeep structured energy-based models. A deep structured energy-based model (DSEBM) is a\nstate-of-the-art deep neural technique, whose output is the energy function (negative log probability)\nassociated with an input sample [41]. Such models can be trained ef\ufb01ciently using score matching in\na similar way to a denoising autoencoder [32]. Samples associated with high energy are considered\nanomalous. While the authors of [41] used a very shallow architecture in their model (which\nis ineffective in our problems), we selected a deeper one when using their method. The chosen\narchitecture is the same as that of the encoder part in the convolutional autoencoder used by CAE-\nOC-SVM, with ReLU activations in the encoding layer.\n\nDeep Autoencoding Gaussian Mixture Model. 
A deep autoencoding Gaussian mixture model\n(DAGMM) is another state-of-the-art deep autoencoder-based model, which generates a low-\ndimensional representation of the training data, and leverages a Gaussian mixture model to perform\ndensity estimation on the compact representation [43]. A DAGMM jointly and simultaneously\noptimizes the parameters of the autoencoder and the mixture model in an end-to-end fashion, thus\nleveraging a separate estimation network to facilitate the parameter learning of the mixture model.\nThe architecture of the autoencoder we used is similar to that of the convolutional autoencoder from\nthe CAE-OC-SVM experiment, but with linear activation in the representation layer. The estimation\nnetwork is inspired by the one in the original DAGMM paper.\n\nAnomaly Detection with a Generative Adversarial Network. This network, given the acronym\nADGAN, is a GAN based model, which learns a one-way mapping from a low-dimensional multi-\nvariate Gaussian distribution to the distribution of the training set [7]. After training the GAN on the\n\u201cnormal\u201d dataset, the discriminator is discarded. Given a sample, the training of ADGAN uses gradient\ndescent to estimate the inverse mapping from the image to the low-dimensional seed. The seed is\nthen used to generate a sample, and the anomaly score is the (cid:96)2 distance between that image and the\noriginal one. In our experiments, for the generative model of the ADGAN we incorporated the same\narchitecture used by the authors of the original paper, namely, the original DCGAN architecture [26].\nAs described, ADGAN requires only a trained generator.\n\n5.2 Datasets\n\nWe consider four image datasets in our experiments: CIFAR-10, CIFAR-100 [21], CatsVsDogs [11],\nand fashion-MNIST [38], which are described below. We note that in all our experiments, pixel\nvalues of all images were scaled to reside in [\u22121, 1]. 
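As a concrete sketch of that scaling step (assuming 8-bit inputs; the helper name is ours, not from the paper):

```python
import numpy as np

def scale_to_unit_range(img_uint8):
    """Map 8-bit pixel values from [0, 255] into [-1, 1], the pixel
    scaling applied to all images in the experiments (Section 5.2).
    Assumes a uint8 array as input."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0
```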
No other pre-processing was applied.\n\n5\n\n\f\u2022 CIFAR-10: consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class.\nThere are 50,000 training images and 10,000 test images, divided equally across the classes.\n\u2022 CIFAR-100: similar to CIFAR-10, but with 100 classes containing 600 images each. This set has\na \ufb01xed train/test partition with 500 training images and 100 test images per class. The 100 classes in\nthe CIFAR-100 are grouped into 20 superclasses, which we use in our experiments.\n\u2022 Fashion-MNIST: a relatively new dataset comprising 28x28 grayscale images of 70,000 fashion\nproducts from 10 categories, with 7,000 images per category. The training set has 60,000 images\nand the test set has 10,000 images. In order to be compatible with the CIFAR-10 and CIFAR-100\nclassi\ufb01cation architectures, we zero-pad the images so that they are of size 32x32.\n\u2022 CatsVsDogs: extracted from the ASIRRA dataset, it contains 25,000 images of cats and dogs,\n12,500 in each class. We split this dataset into a training set containing 10,000 images, and a test set\nof 2,500 images in each class. We also rescale each image to size 64x64. The average dimension size\nof the original images is roughly 360x400.\n\n5.3 Experimental Protocol\n\nWe employ a one-vs-all evaluation scheme in each experiment. Consider a dataset with C classes,\nfrom which we create C different experiments. For each 1 \u2264 c \u2264 C, we designate class c to be the\nsingle class of normal images. We take S to be the set of images in the training set belonging to class\nc. The set S is considered to be the set of \u201cnormal\u201d samples based on which the model must learn\na normality score function. We emphasize that S contains only normal samples, and no additional\nsamples are provided to the model during training. 
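The one-vs-all protocol, together with the AUROC metric introduced in Section 3, can be sketched as follows. The data layout and function names are illustrative assumptions; the rank-statistic form of AUROC used here (the probability that a random normal sample outscores a random anomaly, ties counted as one half) is a standard equivalent of the area under the ROC curve.

```python
def one_vs_all_split(train, test, normal_class):
    """Section 5.3 protocol sketch: train only on images of the normal
    class; the test set keeps everything, with ground truth 1 = normal,
    0 = anomaly. `train` and `test` are lists of (image, label) pairs."""
    S = [x for x, y in train if y == normal_class]
    test_x = [x for x, _ in test]
    ground_truth = [1 if y == normal_class else 0 for _, y in test]
    return S, test_x, ground_truth

def auroc(normal_scores, anomaly_scores):
    """AUROC via its rank statistic: P(score of a random normal sample
    exceeds that of a random anomaly), counting ties as one half."""
    wins = sum(1.0 if sn > sa else 0.5 if sn == sa else 0.0
               for sn in normal_scores for sa in anomaly_scores)
    return wins / (len(normal_scores) * len(anomaly_scores))
```

A perfect scorer yields 1.0, a random one 0.5 in expectation, which is why 0.5 is the trivial baseline regardless of class proportions.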
The normality score function is then applied on\nall images in the test set, containing both anomalies (not belonging to class c) and normal samples\n(belonging to class c), in order to evaluate the model\u2019s performance. As stated in Section 3, we\ncompletely ignore the problem of choosing the appropriate anomaly threshold (\u03bb) on the normality\nscore, and quantify performance using the area under the ROC curve metric, which is commonly\nutilized as a performance measure for anomaly detection models. We are able to compute the ROC\ncurve since we have full knowledge of the ground truth labels of the test set.2\n\nHyperparameters and Optimization Methods For the self-labeled classi\ufb01cation task, we use 72\ngeometric transformations. These transformations are speci\ufb01ed in the supplementary material (see\nalso Section 6 discussing the intuition behind the choice of these transformations). Our model is\nimplemented using the state-of-the-art Wide Residual Network (WRN) model [40]. The parameters\nfor the depth and width of the model for all 32x32 datasets were chosen to be 10 and 4, respectively,\nand for the CatsVsDogs dataset (64x64), 16 and 8, respectively. These hyperparameters were selected\nprior to conducting any experiment, and were \ufb01xed for all runs.3 We used the Adam [20] optimizer\nwith default hyperparameters. Batch size for all methods was set to 128. The number of epochs was\nset to 200 on all benchmark models, except for training the GAN in ADGAN for which it was set to\n100 and produced superior results. We trained the WRN for (cid:100)200/|T |(cid:101) epochs on the self-labeled set\nST , to obtain approximately the same number of parameter updates as would have been performed\nhad we trained on S for 200 epochs.\n\n5.4 Results\n\nIn Table 1 we present our results. 
The table is composed of four blocks, with each block containing\nseveral anomaly detection problems derived from the same dataset (for lack of space we omit class\nnames from the tables, and those can be found in the supplementary material). For example, the\n\ufb01rst row contains the results for an anomaly detection problem where the normal class is class 0 in\nCIFAR-10 (airplane), and the anomalous instances are images from all other classes in CIFAR-10\n(classes 1-9). In this row (as in any other row), we see the average AUROC results over \ufb01ve runs and\nthe corresponding standard error of the mean for all baseline methods. The results of our algorithm\nare shown in the rightmost column. OC-SVM variants and ADGAN were run once due to their\n\n2A complete code of the proposed method\u2019s implementation and the conducted experiments is available at\n\nhttps://github.com/izikgo/AnomalyDetectionTransformations.\n\n3The parameters 16, 8 were used on CIFAR-10 by the authors. Due to the induced computational complexity,\nwe chose smaller values. When testing the parameters 16, 8 with our method on the CIFAR-10 dataset, anomaly\ndetection results improved.\n\n6\n\n\ftime complexity. The best performing method in each row appears in bold. For example, in the\nCatsVsDogs experiments where dog (class 1) is the \u201cnormal\u201d class, the best baseline (DSEBM)\nachieves 0.561 AUROC. Note that the trivial average AUROC is always 0.5, regardless of the\nproportion of normal vs. anomalous instances. Our method achieves an average AUROC of 0.888.\nSeveral interesting observations can be made by inspecting the numbers in Table 1. Our relative\nadvantage is most prominent when focusing on the larger images. All baseline methods, including\nOC-SVM variants, which enjoy hindsight information, only achieve performance that is slightly\nbetter than random guessing in the CatsVsDogs dataset. On the smaller-sized images, the baselines\ncan perform much better. 
In most cases, however, our algorithm signi\ufb01cantly outperformed the other\nmethods. Interestingly, in many cases where the baseline methods struggled with separating normal\nsamples from anomalies, our method excelled. See, for instance, the cases of automobile (class 1)\nand horse (class 7; see the CIFAR-10 section in the table). Inspecting the results on CIFAR-100\n(where 20 super-classes de\ufb01ned the partition), we observe that our method was challenged by the\ndiversity inside the normal class. In this case, there are a few normal classes on which our method\ndid not perform well; see e.g., non-insect invertebrates (class 13), insects (class 7), and household\nelectrical devices (class 5). In Section 6 we speculate why this might happen. We used the super-class\npartitioning of CIFAR-100 (instead of the 100 base classes) because labeled data for single base\nclasses is scarce. On the fashion-MNIST dataset, all methods, excluding DAGMM, performed very\nwell, with a slight advantage to our method. The fashion-MNIST dataset was designed as a drop-in\nreplacement for the original MNIST dataset, which is slightly more challenging. Classic models, such\nas SVM with an RBF kernel, can perform well on this task, achieving almost 90% accuracy [38].\n\n5.5\n\nIdentifying Out-of-distribution Samples in Labeled Multi-class Datasets\n\nAlthough it is not the main focus of this work, we have also tackled the problem of identifying out-of-\ndistribution samples in labeled multi-class datasets (i.e., identify images that belong to a different\ndistribution than that of the labeled dataset). To this end, we created a two-headed classi\ufb01cation model\nbased on the WRN architecture. The model has two separate softmax output layers. One for categories\n(e.g., cat, truck, airplane, etc.) and another for classifying transformations (our method). We use\nthe categories softmax layer only during training. 
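As a shape-level sketch of such a two-headed model (random weights in plain NumPy, all names assumed; the paper actually uses a jointly trained WRN, not this toy):

```python
import numpy as np

def _softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class TwoHeadedNet:
    """Toy stand-in for the two-headed model of Section 5.5: one shared
    trunk, one softmax head for object categories (used only in training)
    and one for transformations (used for scoring at test time).
    Random weights; shape-level illustration only."""
    def __init__(self, d_in, d_hidden, n_categories, n_transforms, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_in, d_hidden))
        self.Wc = rng.normal(size=(d_hidden, n_categories))
        self.Wt = rng.normal(size=(d_hidden, n_transforms))

    def forward(self, x):
        h = np.maximum(x @ self.W, 0.0)  # shared features (ReLU trunk)
        return _softmax(h @ self.Wc), _softmax(h @ self.Wt)
```

A real implementation would share a WRN trunk and train both heads jointly with cross-entropy losses.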
At test time, we only utilize the transformations\nsoftmax layer output as described in section 4.2, but use the simpli\ufb01ed normality score. When training\non the CIFAR-10 dataset, and taking the tiny-imagenet (resized) dataset to be anomalies as done by\nLiang et al. [23] in their ODIN method, we improved ODIN\u2019s AUROC/AUPR-In/AUPR-Out results\nfrom 92.1/89.0/93.6 to 95.7/96.1/95.4, respectively. It is important to note that in contrast to our\nmethod, ODIN is inapplicable in the pure single class setting, where there are no class labels.\n\n6 On the Intuition for Using Geometric Transformations\n\nIn this section we explain our intuition behind the choice of the set of transformations used in\nour method. Any bijection of a set (having some geometric structure) to itself is a geometric\ntransformation. Among all geometric transformations, we only used compositions of horizontal\n\ufb02ipping, translations, and rotations in our model, resulting in 72 distinct transformations (see\nsupplementary material for the entire list). In the earlier stages of this work, we tried a few non-\ngeometric transformations (e.g., Gaussian blur, sharpening, gamma correction), which degraded\nperformance and we abandoned them altogether. We hypothesize that non-geometric transformations\nperform worse since they can eliminate important features of the learned image set.\nWe speculate that the effectiveness of the chosen transformation set is affected by their ability to\npreserve spatial information about the given \u201cnormal\u201d images, as well as the ability of our classi\ufb01er\nto predict which transformation was applied on a given transformed image. In addition, for a \ufb01xed\ntype-II error rate, the type-I error rate of our method decreases the harder it gets for the trained\nclassi\ufb01er to correctly predict the identity of the transformations that were applied on anomalies.\nWe demonstrate this idea by conducting three experiments. 
Each experiment has the following structure. We train a neural classifier to discriminate between two transformations, where the normal class is taken to be images of a single digit from the MNIST [22] training set. We then evaluate our method using AUROC on a test set comprising normal images and images of another digit from the MNIST test set. The three experiments are:

Figure 1: Optimizing digit images to maximize the normality score. (a) Left: original '0's. Right: the '0's optimized to a normality score learned on '3'. (b) Left: original '3's. Right: the '3's optimized to a normality score learned on '3'.

• Normal digit: '8'. Anomaly: '3'. Transformations: identity and horizontal flip. It can be expected that, due to the invariance of '8' to horizontal flip, the classifier will have difficulty learning distinguishing features. Indeed, when presented with the test set containing '3' as anomalies (which do not exhibit such invariance), our method did not perform well, achieving an AUROC of 0.646.
• Normal digit: '3'. Anomaly: '8'. Transformations: identity and horizontal flip. In contrast to the previous experiment, the transformed variants of the digit '3' can easily be classified to the correct transformation. Indeed, our method, using the trained model for '3', achieved 0.957 AUROC in this experiment.
• Normal digit: '8'. Anomaly: '3'. Transformations: identity and translation by 7 pixels. In this experiment, the transformed images are distinguishable from each other.
As can be expected, our method performs well in this case, achieving an AUROC of 0.919.
To convince ourselves that high scores given by our scoring function indicate membership in the normal class, we tested how an image would need to change in order to obtain a high normality score. This was implemented by optimizing an input image using gradient ascent to maximize the simplified variant of the normality score described in Section 5.5 (see, e.g., [39]). Specifically, we trained a classifier on the digit '3' from the MNIST dataset, with a few geometric transformations. We then took an arbitrary image of the digit '0' and optimized it. In Figure 1(a) we present two such images, where the left one is the original, and the right is the result after taking 200 gradient ascent steps that "optimize" the original image. It is evident that the '0' digits have deformed, now resembling the digit '3'. This illustrates that the classification model has learned features relevant to the "normal" class. To further strengthen our hypothesis, we conducted the same experiment using images from the normal class (i.e., images of the digit '3'). We expected these images to maintain their appearance during the optimization process, since they already contain the features that should contribute to a high normality score. Figure 1(b) contains two examples of the process, where in each row, the left image is the initial '3', and the right is the result after taking 200 gradient ascent steps on it. As hypothesized, the images remained roughly unchanged at the end of the optimization process (regardless of their different orientations).

7 Conclusion and Future Work

We presented a novel method for anomaly detection in images, which learns a meaningful representation of the training data in a fully discriminative fashion.
The proposed method is computationally efficient, and as simple to implement as a multi-class classification task. Unlike the best methods known so far, our approach completely eliminates the need for a generative component (autoencoders/GANs). Most importantly, our method significantly advances the state of the art, improving dramatically over the best available anomaly detection methods. Our results open many avenues for future research. First, it is important to develop a theory that grounds the use of geometric transformations. It would be interesting to study the possibility of selecting transformations that would best serve a given training set, possibly with prior knowledge of the anomalous samples. Another avenue is explicitly optimizing the set of transformations. Given the effectiveness of our method, it is tempting to adapt it to other settings or utilize it in applications. Some examples are open-world classification, selective classification and regression [36, 35, 13], uncertainty estimation [15], and deep active learning [12, 14]. Finally, it would be interesting to consider using our techniques in settings where additional unlabeled "contaminated" data (consisting of both normal and novel instances) is provided, perhaps within a transductive learning framework [10].

Table 1: Average area under the ROC curve in % with SEM (over 5 runs) of anomaly detection methods. For all datasets, each model was trained on a single class and tested against all other classes. The E2E column is taken from [27]. OC-SVM hyperparameters in the RAW and CAE variants were optimized with hindsight knowledge.
The best performing method in each experiment is in bold. E2E results are available for CIFAR-10 only.

CIFAR-10 (32x32x3)

| ci | OC-SVM RAW | OC-SVM CAE | E2E | DAGMM | DSEBM | AD-GAN | OURS |
|----|------------|------------|-----|-------|-------|--------|------|
| 0 | **74.9** | 70.6 | 61.7±1.3 | 41.4±2.3 | 56.0±6.9 | 64.9 | 74.7±0.4 |
| 1 | 51.3 | 51.7 | 65.9±0.7 | 57.1±2.0 | 48.3±1.8 | 39.0 | **95.7±0.0** |
| 2 | 68.9 | 69.1 | 50.8±0.3 | 53.8±4.0 | 61.9±0.1 | 65.2 | **78.1±0.4** |
| 3 | 52.8 | 52.4 | 59.1±0.4 | 51.2±0.8 | 50.1±0.4 | 48.1 | **72.4±0.5** |
| 4 | 76.7 | 77.3 | 60.9±0.3 | 52.2±7.3 | 73.3±0.2 | 73.5 | **87.8±0.2** |
| 5 | 52.9 | 51.2 | 65.7±0.8 | 49.3±3.6 | 60.5±0.3 | 47.6 | **87.8±0.1** |
| 6 | 74.1 | 70.9 | 67.7±0.8 | 64.9±1.7 | 68.4±0.3 | 62.3 | **83.4±0.5** |
| 7 | 53.1 | 52.6 | 67.3±0.3 | 55.3±0.8 | 53.3±0.7 | 48.7 | **95.5±0.1** |
| 8 | 71.0 | 70.9 | 75.9±0.4 | 51.9±2.4 | 73.9±0.3 | 66.0 | **93.3±0.0** |
| 9 | 50.6 | 50.6 | 73.1±0.4 | 54.2±5.8 | 63.6±3.1 | 37.8 | **91.3±0.1** |
| avg | 62.0 | 62.4 | 64.8 | 53.1 | 60.9 | 55.3 | **86.0** |

CIFAR-100 (32x32x3)

| ci | OC-SVM RAW | OC-SVM CAE | DAGMM | DSEBM | AD-GAN | OURS |
|----|------------|------------|-------|-------|--------|------|
| 0 | 68.4 | 68.0 | 43.4±3.9 | 64.0±0.2 | 63.1 | **74.7±0.4** |
| 1 | 63.6 | 63.1 | 49.5±2.7 | 47.9±0.1 | 54.9 | **68.5±0.2** |
| 2 | 52.0 | 50.4 | 66.1±1.7 | 53.7±4.1 | 41.3 | **74.0±0.5** |
| 3 | 62.7 | 64.7 | 52.6±1.0 | 48.4±0.5 | 50.0 | **81.0±0.8** |
| 4 | 58.2 | 59.7 | 56.9±3.0 | 59.7±6.3 | 40.6 | **78.4±0.5** |
| 5 | 54.9 | 53.5 | 52.4±2.2 | 46.6±1.6 | 42.8 | **59.1±1.0** |
| 6 | 57.2 | 55.9 | 55.0±1.1 | 51.7±0.8 | 51.1 | **81.8±0.2** |
| 7 | 62.9 | 64.4 | 52.8±3.7 | 54.8±1.6 | 55.4 | **65.0±0.1** |
| 8 | 66.7 | 65.6 | 53.2±4.8 | 66.7±0.2 | 59.2 | **85.5±0.4** |
| 9 | 74.1 | 70.1 | 42.5±2.5 | 71.2±1.2 | 62.7 | **90.6±0.1** |
| 10 | 84.1 | 83.0 | 52.7±3.9 | 78.3±1.1 | 79.8 | **87.6±0.2** |
| 11 | 58.0 | 59.7 | 46.4±2.4 | 62.7±0.7 | 53.7 | **83.9±0.6** |
| 12 | 68.5 | 68.7 | 42.7±3.1 | 66.8±0.0 | 58.9 | **83.2±0.3** |
| 13 | **65.0** | 64.6 | 45.4±0.7 | 52.6±0.1 | 57.4 | 58.0±0.4 |
| 14 | 51.2 | 50.7 | 57.2±1.3 | 44.0±0.6 | 39.4 | **92.1±0.2** |
| 15 | 62.8 | 63.5 | 48.8±1.5 | 56.8±0.1 | 55.6 | **68.3±0.1** |
| 16 | 66.6 | 68.3 | 54.4±3.1 | 63.1±0.1 | 63.3 | **73.5±0.2** |
| 17 | 71.7 | 73.7 | 36.4±2.3 | 73.0±1.0 | 66.7 | **93.8±0.1** |
| 18 | 52.8 | 50.2 | 52.4±1.4 | 57.7±1.6 | 44.3 | **90.7±0.1** |
| 19 | 58.4 | 57.5 | 50.3±1.0 | 55.5±0.7 | 53.0 | **85.0±0.2** |
| avg | 63.1 | 62.6 | 50.5 | 58.8 | 54.7 | **78.7** |

Fashion-MNIST (32x32x1)

| ci | OC-SVM RAW | OC-SVM CAE | DAGMM | DSEBM | AD-GAN | OURS |
|----|------------|------------|-------|-------|--------|------|
| 0 | 98.2 | 97.7 | 42.1±9.1 | 91.6±1.2 | 89.9 | **99.4±0.0** |
| 1 | 89.9 | 90.3 | 55.1±3.5 | 71.8±0.5 | 81.9 | **97.6±0.1** |
| 2 | **91.4** | 90.7 | 50.4±7.3 | 88.3±0.2 | 87.6 | 91.1±0.2 |
| 3 | **94.2** | 90.7 | 57.0±6.7 | 87.3±3.6 | 91.2 | 89.9±0.4 |
| 4 | 89.1 | 89.4 | 26.9±5.4 | 85.2±0.9 | 86.5 | **92.1±0.0** |
| 5 | 91.8 | 88.5 | 70.5±9.7 | 87.1±0.0 | 89.6 | **93.4±0.9** |
| 6 | **83.4** | 81.7 | 48.3±5.0 | 73.4±4.1 | 74.3 | 83.3±0.1 |
| 7 | 98.7 | 98.8 | 83.5±11.4 | 98.1±0.0 | 97.2 | **98.9±0.1** |
| 8 | **91.9** | 90.6 | 49.9±7.2 | 86.0±3.2 | 89.0 | 90.8±0.1 |
| 9 | 99.0 | 98.6 | 34.0±3.0 | 97.1±0.3 | 97.1 | **99.2±0.0** |
| avg | 91.7 | 92.8 | 51.8 | 86.6 | 88.4 | **93.5** |

CatsVsDogs (64x64x3)

| ci | OC-SVM RAW | OC-SVM CAE | DAGMM | DSEBM | AD-GAN | OURS |
|----|------------|------------|-------|-------|--------|------|
| 0 | 55.2 | 50.4 | 43.4±0.5 | 47.1±1.7 | 50.7 | **88.3±0.3** |
| 1 | 49.9 | 53.0 | 52.0±1.9 | 56.1±1.2 | 48.1 | **89.2±0.3** |
| avg | 51.7 | 52.5 | 47.7 | 51.6 | 49.4 | **88.8** |

Acknowledgements

This research was partially supported by the Israel Science Foundation (grant No. 710/18).

References

[1] J. An and S. Cho. Variational autoencoder based anomaly detection using reconstruction probability. SNU Data Mining Center, Tech. Rep., 2015.

[2] G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection.
Journal of Machine Learning Research, 11(Nov):2973-3009, 2010.

[3] E. Burnaev, P. Erofeev, and D. Smolyakov. Model selection for anomaly detection. In Eighth International Conference on Machine Vision (ICMV 2015), volume 9875, page 987525. International Society for Optics and Photonics, 2015.

[4] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.

[5] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.

[6] J. Davis and M. Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pages 233-240. ACM, 2006.

[7] L. Deecke, R. Vandermeulen, L. Ruff, S. Mandt, and M. Kloft. Anomaly detection with generative adversarial networks, 2018. URL https://openreview.net/forum?id=S1EfylZ0Z.

[8] R. El-Yaniv and M. Nisenson. Optimal single-class classification strategies. In Advances in Neural Information Processing Systems, pages 377-384, 2007.

[9] R. El-Yaniv and M. Nisenson. On the foundations of adversarial single-class classification. CoRR, abs/1010.4466, 2010. URL http://arxiv.org/abs/1010.4466.

[10] R. El-Yaniv and D. Pechyony. Transductive Rademacher complexity and its applications. In Learning Theory, 20th Annual Conference on Learning Theory (COLT), pages 157-171, 2007.

[11] J. Elson, J. J. Douceur, J. Howell, and J. Saul. Asirra: A CAPTCHA that exploits interest-aligned manual image categorization. In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS). Association for Computing Machinery, Inc., October 2007.

[12] Y. Geifman and R. El-Yaniv. Deep active learning over the long tail. CoRR, 2017. URL http://arxiv.org/abs/1711.00941.

[13] Y. Geifman and R. El-Yaniv. Selective classification for deep neural networks.
In Advances in Neural Information Processing Systems (NIPS), pages 4878-4887, 2017.

[14] Y. Geifman and R. El-Yaniv. Deep active learning with a neural architecture search. CoRR, abs/1811.07579, 2018. URL http://arxiv.org/abs/1811.07579.

[15] Y. Geifman, G. Uziel, and R. El-Yaniv. Boosting uncertainty estimation for deep neural classifiers. CoRR, 2018. URL http://arxiv.org/abs/1805.08206.

[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

[17] T. Iwata and M. Yamada. Multi-view anomaly detection via robust probabilistic latent variable models. In Advances in Neural Information Processing Systems, pages 1136-1144, 2016.

[18] I. T. Jolliffe. Principal component analysis and factor analysis. In Principal Component Analysis, pages 115-128. Springer, 1986.

[19] J. Kim and C. D. Scott. Robust kernel density estimation. Journal of Machine Learning Research, 13(Sep):2529-2565, 2012.

[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[21] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

[22] Y. LeCun, C. Cortes, and C. Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.

[23] S. Liang, Y. Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1VGkIxRZ.

[24] T. Minka. Estimating a Dirichlet distribution, 2000.

[25] E. Parzen. On estimation of a probability density function and mode.
The Annals of Mathematical Statistics, 33(3):1065-1076, 1962.

[26] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[27] L. Ruff, N. Görnitz, L. Deecke, S. A. Siddiqui, R. Vandermeulen, A. Binder, E. Müller, and M. Kloft. Deep one-class classification. In International Conference on Machine Learning, pages 4390-4399, 2018.

[28] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pages 146-157. Springer, 2017.

[29] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector method for novelty detection. In Advances in Neural Information Processing Systems, pages 582-588, 2000.

[30] D. M. Tax and R. P. Duin. Support vector data description. Machine Learning, 54(1):45-66, 2004.

[31] A. Taylor, S. Leblanc, and N. Japkowicz. Anomaly detection in automobile control network data with long short-term memory networks. In Data Science and Advanced Analytics (DSAA), 2016 IEEE International Conference on, pages 130-139. IEEE, 2016.

[32] P. Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661-1674, 2011.

[33] S. Wang, Q. Liu, E. Zhu, F. Porikli, and J. Yin. Hyperparameter selection of one-class support vector machine by self-adaptive data shifting. Pattern Recognition, 74:198-211, 2018.

[34] N. Wicker, J. Muller, R. K. R. Kalathur, and O. Poch. A maximum likelihood approximation method for Dirichlet's parameter estimation. Computational Statistics & Data Analysis, 52(3):1315-1322, 2008.

[35] Y. Wiener and R. El-Yaniv.
Agnostic selective classification. In Advances in Neural Information Processing Systems (NIPS), pages 1665-1673, 2011.

[36] Y. Wiener and R. El-Yaniv. Pointwise tracking the optimal regression function. In Advances in Neural Information Processing Systems (NIPS), pages 2051-2059, 2012.

[37] Y. Xia, X. Cao, F. Wen, G. Hua, and J. Sun. Learning discriminative reconstructions for unsupervised outlier removal. In Proceedings of the IEEE International Conference on Computer Vision, pages 1511-1519, 2015.

[38] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.

[39] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.

[40] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.

[41] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang. Deep structured energy based models for anomaly detection. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, pages 1100-1109. JMLR.org, 2016.

[42] A. Zimek, E. Schubert, and H.-P. Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 5(5):363-387, 2012.

[43] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations, 2018.