{"title": "Multivariate Triangular Quantile Maps for Novelty Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 5060, "page_last": 5071, "abstract": "Novelty detection, a fundamental task in machine learning, has drawn a lot of recent attention due to its wide-ranging applications and the rise of neural approaches. In this work, we present a general framework for neural novelty detection that centers around a multivariate extension of the univariate quantile function. Our framework unifies and extends many classical and recent novelty detection algorithms, and opens the way to exploit recent advances in flow-based neural density estimation. We adapt the multiple gradient descent algorithm to obtain the first efficient end-to-end implementation of our framework that is free of tuning hyperparameters. Extensive experiments over a number of real datasets confirm the efficacy of our proposed method against state-of-the-art alternatives.", "full_text": "Multivariate Triangular Quantile Maps\n\nfor Novelty Detection\n\nJingjing Wang1, Sun Sun2, Yaoliang Yu1\n\nUniversity of Waterloo1, National Research Council Canada2\n\n{jingjing.wang, sun.sun, yaoliang.yu}@uwaterloo.ca\n\nAbstract\n\nNovelty detection, a fundamental task in machine learning, has drawn a lot of recent\nattention due to its wide-ranging applications and the rise of neural approaches. In\nthis work, we present a general framework for neural novelty detection that centers\naround a multivariate extension of the univariate quantile function. 
Our framework unifies and extends many classical and recent novelty detection algorithms, and opens the way to exploit recent advances in flow-based neural density estimation. We adapt the multiple gradient descent algorithm to obtain the first efficient end-to-end implementation of our framework that is free of tuning hyperparameters. Extensive experiments over a number of real datasets confirm the efficacy of our proposed method against state-of-the-art alternatives.

1 Introduction

Novelty detection refers to the fundamental task in machine learning of detecting "novel" or "unusual" samples in a data stream. It has wide-ranging applications such as network intrusion detection [14], medical signal processing [17], jet design [19], video surveillance [42, 43], image scene analysis [25, 47], document classification [29, 30], reinforcement learning [39], etc.; see the review articles [7, 31, 32, 41] for more insightful applications. Over the last two decades or so, many novelty detection algorithms have been proposed and studied in the machine learning field, of which the statistical approach that aims to identify low-density regions of the underlying data distribution has been most popular [e.g. 4, 49, 51, 53]. More recently, new novelty detection algorithms based on deep neural networks [e.g. 1, 9, 11, 18, 26, 40, 44, 46, 48, 56, 58, 59] have drawn a lot of attention as they significantly improve on their non-neural counterparts, especially in domains (such as image and video) where complex high-dimensional structures abound.

This work offers a closer look at these recent neural novelty detection algorithms by making a connection to recent flow-based generative modelling techniques [22]. 
In §2 we show that the triangular map studied in [22] for neural density estimation serves as a natural extension of the classical univariate quantile function to the multivariate setting. Since density estimation is extremely challenging in high dimensions, recent neural novelty detection algorithms all extract a lower-dimensional latent representation, whose probabilistic properties can then be captured by our multivariate triangular quantile map. Based on this observation we propose a general framework for neural novelty detection that includes as special cases many classical approaches such as one-class SVM [49] and support vector data description [53], as well as many recent neural approaches [e.g. 1, 40, 46, 58, 59]. This unified view of neural novelty detection enables us to better understand the similarities and subtle differences of the many existing approaches, and provides some guidance on designing next-generation novelty detection algorithms.

More importantly, our general framework makes it possible to effortlessly plug in recent flow-based neural density estimators, which have been shown to be surprisingly effective even in moderately high dimensions. Furthermore, centering our framework around the (multivariate) triangular quantile map (TQM) also enables us to unify the two scoring strategies in the literature [34]: we can either threshold the density function [4, 51] or the (univariate) quantile function [49, 53]. Using the multivariate triangular quantile map, for the first time we can simultaneously perform both, without incurring any additional cost. In §3, motivated by the sub-optimality of pre-training, we cast our novelty detection framework as multi-objective optimization [35] and apply the multiple gradient descent algorithm [12, 15, 36] for the first time. 

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
We present an efficient implementation that learns the TQM consistently, end-to-end, and free of tuning hyperparameters. In §4 we perform extensive experiments on a variety of datasets and verify the effectiveness of our framework against state-of-the-art alternatives.

We summarize our main contributions as follows:
• We extend the univariate quantile function to the multivariate setting through increasing triangular maps. This multivariate triangular quantile map may be of independent interest for many other problems involving multivariate probabilistic modelling.
• We present a new framework for neural novelty detection, which unifies and extends many existing approaches including the celebrated one-class SVM and many recent neural ones.
• For the first time we apply the multiple gradient descent algorithm to novelty detection and obtain an efficient end-to-end implementation of our framework that is free of any tuning hyperparameters.
• We perform extensive experiments to compare to existing novelty detection baselines and to confirm the efficacy of our proposed framework.

Our code is available at https://github.com/GinGinWang/MTQ.

2 A General Framework for Novelty Detection

In this section we present a general framework for novelty detection. Our framework builds on recent progress in generative modelling and unifies and extends many existing works.

We follow the standard setup for novelty detection [e.g. 7]: Given n i.i.d. samples ⟨X1, . . . , Xn⟩ from an unknown distribution P over R^d, we want to decide if a new (unseen) sample X̃ is "novel," i.e. if it is unlikely to come from the same distribution P. Due to lack of supervision, the notion of "novelty" is not well-defined. 
Practically, a popular surrogate is to identify the low-density regions of the distribution P [4, 49, 51], as samples from these areas are probabilistically unlikely. For simplicity we assume the underlying distribution P has a density p w.r.t. the Lebesgue measure.

We exploit the following multivariate generalization of the quantile function. Recall that the cumulative distribution function (CDF) F and the quantile function Q of a univariate random variable X are defined as:

F(x) = Pr(X ≤ x),    Q(u) = F⁻¹(u) := inf{x : F(x) ≥ u}.

While the CDF can be easily generalized to the multivariate setting, it is not so obvious for the quantile function, as its definition intrinsically relies on the total ordering of the real line. However, following [e.g. 13, 16] we observe that if U follows the uniform distribution over the interval [0, 1], then Q(U) follows the distribution F. In other words, the quantile function can be defined as a mapping that pushes the uniform distribution over [0, 1] into the distribution F of interest. This alternative interpretation allows us to extend the quantile function to the multivariate setting. We recall that a mapping T = (T1, . . . , Td) : R^d → R^d is called triangular if for all j = 1, . . . , d, the j-th component Tj depends only on the first j coordinates of the input, and it is called increasing if for all j, Tj is increasing w.r.t. the j-th coordinate when all other coordinates are fixed. We call T triangular since its derivative is always a triangular matrix (and vice versa).

Definition 1 (Triangular Quantile Map (TQM)) Let X be a random vector in R^d, and let U be uniform over the unit hypercube [0, 1]^d. We call an increasing triangular map Q = Q_X : [0, 1]^d → R^d the triangular quantile map of X if Q(U) ∼ X, where ∼ means equality in distribution.

Note that the TQM Q is vector-valued, unlike the CDF, which is always real-valued. 
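This push-forward reading of the quantile function is easy to check numerically in the univariate case. A minimal sketch (the standard Gaussian target and the grid size are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import norm

# d = 1: the quantile function Q = F^{-1} pushes the uniform
# distribution on [0, 1] forward onto F.  Take F = N(0, 1), so
# Q = norm.ppf, and push a fine uniform grid through it.
n = 10_000
u = (np.arange(n) + 0.5) / n   # deterministic stand-in for U ~ Uniform[0, 1]
x = norm.ppf(u)                # Q(u): a representative N(0, 1) sample

# The first two moments match the standard Gaussian (up to grid error).
print(x.mean(), x.std())
```

The same picture, with the uniform distribution now living on the hypercube, is what Definition 1 asks of the multivariate map Q.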
The existence and uniqueness of Q follow from results in [5]. Our definition immediately leads to the following quantile change-of-variable formula (cf. the usual change-of-variable formula for densities):

Proposition 1 Let T : R^d → R^d be an increasing triangular map. If Y = T(X), then

Q_Y = T ∘ Q_X.    (1)

Practically, eq. (1) allows us to easily stack elementary parameterizations of increasing triangular maps together and still obtain a valid TQM.

To our best knowledge, a similar definition, through conditional univariate quantiles, appeared in a number of works [2, 10, 37, 45], albeit mostly as a theoretical tool. Our definition makes the important triangular structure explicit and amenable to parameterization through deep networks. Needless to say, when d = 1, the triangular property is vacuous and our definition reduces to the classical quantile function. For a more comprehensive introduction to triangular maps and their recent rise in machine learning, see [22, 33, 50].

Remark 1 A different definition of the multivariate quantile map, based on the theory of optimal transport [54], is discussed in a number of recent works [e.g. 8, 13, 16]: Q is instead constrained to be maximally cyclically monotone, i.e. it is the subdifferential of some convex function. On one hand, this definition is invariant to permutations of the input coordinates while ours is not. On the other hand, our definition is composition-friendly (see Proposition 1) hence can easily exploit recent progress in deep generative models, as we will see shortly. The two definitions coincide with each other only when reduced to the univariate case.

We note that the recent work of Inouye and Ravikumar [21] proposed yet another similar definition where Q (termed density destructor there) is only required to be invertible. 
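Proposition 1 can be sanity-checked numerically in the d = 1 case, where T = exp turns a standard Gaussian into a standard log-normal whose quantile function scipy exposes in closed form (the choice of T here is ours, purely for illustration):

```python
import numpy as np
from scipy.stats import lognorm, norm

# Proposition 1, d = 1: for increasing T and Y = T(X), Q_Y = T o Q_X.
# Take X ~ N(0, 1) and T = exp, so Y is standard log-normal.
u = np.linspace(0.01, 0.99, 99)
lhs = lognorm.ppf(u, s=1)      # Q_Y(u), closed form from scipy
rhs = np.exp(norm.ppf(u))      # (T o Q_X)(u)
print(np.allclose(lhs, rhs))   # True
```

In higher dimensions the same identity is what justifies stacking parameterized increasing triangular layers: the composition is still a valid TQM.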
However, the density-destructor definition does not lead to a unique quantile map and is less computationally convenient.

We are now ready to present our general framework for novelty detection. Let f : R^d → R^m be a feature map and X a random sample from the unknown density p. We propose to learn the density¹ f#p of the latent random vector Z = f(X) using the approach illustrated in [22]. In detail, we learn the feature map f and the TQM Q simultaneously by minimizing the following objective:

min_{f,Q}  γ KL(f#p ‖ Q#q) + λ ℓ(f) + ζ g(Q),    (2)

where g embodies some potential constraints on the increasing triangular map Q, ℓ is some loss associated with learning the feature map f, q is a fixed reference density (in our case the uniform density over the hypercube [0, 1]^m), ζ, λ, γ ≥ 0 are regularization constants, and we use the KL-divergence to measure the discrepancy between two densities. Exploiting Proposition 1 we parameterize the TQM as the composition Q = T ∘ Φ⁻¹, where Φ = (Φ, . . . , Φ) with Φ the CDF of the standard univariate Gaussian and T : R^d → R^d an increasing triangular map. Note that unlike Q, whose support is constrained to the unit hypercube, there is no constraint on the support of T, hence the latter is easier to handle computationally.

Once the feature map f and TQM Q are estimated (see next section), we can detect novel test samples by either thresholding the density function of the latent variable Z or thresholding its TQM. 
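Before spelling out the two rules, here is a minimal sketch of both scores for a toy coordinate-wise affine map T standing in for a learned flow (the map, its parameters, and the test points are all assumptions made for illustration; the expressions mirror the density rule (3) and the ∞-norm quantile variant of §4.3):

```python
import numpy as np
from scipy.stats import norm

# Toy increasing triangular map T(x) = a * x + b acting coordinate-wise
# (a > 0), standing in for a learned SOS flow; m = 3 latent dimensions.
a = np.array([1.5, 0.5, 2.0])
b = np.array([0.1, -0.2, 0.3])

def density_score(z):
    """log|T'(T^{-1}(z))| + 0.5 * ||T^{-1}(z)||^2: higher = more novel."""
    x = (z - b) / a                    # T^{-1}(z); O(m) for triangular maps
    return np.sum(np.log(a)) + 0.5 * np.sum(x ** 2)

def quantile_score(z):
    """||Q^{-1}(z) - 1/2||_inf with Q = T o Phi^{-1}: higher = more novel."""
    u = norm.cdf((z - b) / a)          # Q^{-1}(z), lands in [0, 1]^m
    return np.max(np.abs(u - 0.5))

z_typical = b.copy()                   # T(0): the mode of the model
z_novel = b + 5 * a                    # T(5 * ones): deep in the tail
print(density_score(z_novel) > density_score(z_typical))    # True
print(quantile_score(z_novel) > quantile_score(z_typical))  # True
```

For a genuinely triangular T the inverse is computed one coordinate at a time, which is why both scores remain cheap even when T is a deep flow.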
In detail, the density of Z = f(X) = Q(U) = T(Φ⁻¹(U)), using the change-of-variable formula, is

p_Z(z) = 1 / |Q′(Q⁻¹(z))| = (1 / |T′(T⁻¹(z))|) · ∏_{j=1}^m φ([T⁻¹(z)]_j),  where φ = Φ′.

Thus, we declare a test sample X̃ to be "novel" if

log |T′(T⁻¹(f(X̃)))| + ½ ‖T⁻¹(f(X̃))‖²₂ ≥ τ,    (3)

where τ is some chosen threshold. Crucially, since T is increasing triangular, T⁻¹ and the triangular determinant |T′| can both be computed very efficiently [22]. The (slight) downside of this density approach is that the scale of an appropriate threshold τ is usually difficult to guess.

Alternatively, we can declare a test sample X̃ to be "novel" by directly thresholding the TQM Q. Indeed, let N ⊆ [0, 1]^m be a subset whose (uniform) measure is 1 − α for some α ∈ (0, 1); then we say X̃ is "novel" iff

Q⁻¹(f(X̃)) ∉ N.    (4)

For instance, we can choose N to be the cube centered at (1/2, . . . , 1/2) with side length (1 − α)^{1/m}, in which case

Q⁻¹(f(X̃)) ∉ N  ⟺  ‖Q⁻¹(f(X̃)) − 1/2‖_∞ ≥ (1 − α)^{1/m}/2.

¹The notation T#p stands for the push-forward density, i.e., the density of T(X) when X ∼ p.

The upside of this quantile approach is that we can control the Type-I error (i.e. false positives) precisely: if X̃ is indeed sampled from p, then we will declare it to be novel with probability at most α.

Before proceeding to the implementation details of (2), let us mention the advantages of our general framework (2) for novelty detection: (a) It allows us to perform feature extraction on the original sample X in an end-to-end fashion. 
As is well known, density estimation, and hence also novelty detection, becomes extremely challenging when the dimension d is high. Our framework alleviates this curse of dimensionality by setting m ≪ d and employing f to perform dimensionality reduction. (b) Our end-to-end framework enables us to adopt recent flow-based density estimation algorithms, which have been shown to be universally consistent [20, 22] and extremely effective in practice. (c) By estimating the TQM Q once, we can employ the two scoring rules, i.e. the density scoring rule (3) and the quantile scoring rule (4), simultaneously, without incurring any extra overhead. This allows us to perform a fair and comprehensive experimental comparison of the two complementary approaches. (d) Last but not least, our framework recovers, unifies, and extends many existing approaches in the literature. Let us conclude this section with some examples.

Example 1 (One-class SVM [49]) As shown in [52], the one-class SVM minimizes precisely the conditional value-at-risk, which is the average of the tail of a distribution:

min_f  CVaR_α(f(X)) + λ ‖f‖²_{H_κ},  where CVaR_α(Z) := E(Z | Z ≥ Q_Z(α)),

Q_Z(α) is the α-th quantile of the real random variable Z, and H_κ is the reproducing kernel Hilbert space (RKHS) induced by some kernel κ. This approach employs the quantile scoring rule (4). To cast one-class SVM into our framework (2), let us set m = 1, hence the TQM reduces to the classical one. Let ℓ(f) = ‖f‖²_{H_κ} and g(Q) = CVaR_α(Q#q). Now with ζ = 1 and γ = ∞ in (2) we recover the celebrated one-class SVM.

If instead of choosing f from an RKHS we represent f using a deep network, then we recover the recent approach in [6].

Example 2 (Support Vector Data Description (SVDD) [53]) Similar to one-class SVM, it is easy to show that SVDD also minimizes the conditional value-at-risk:

min_{c ∈ H_κ}  CVaR_α(‖φ(X) − c‖²_{H_κ}),

where φ : R^d → H_κ is the canonical feature map of the RKHS. This approach also employs the quantile scoring rule (4). It is well known that SVDD and one-class SVM are equivalent for radial kernels [e.g. 49].

Again in this case m = 1. Let f(X) = ‖φ(X) − c‖²_{H_κ}, ℓ ≡ 0 and g(Q) = CVaR_α(Q#q). As γ approaches ∞ in (2), we recover the SVDD formulation.

If instead of choosing φ as the canonical feature map of an RKHS we represent φ using a deep network, then we recover the recent approach in [44].

Example 3 (Latent Space Autoregression (LSA) [1]) The recent work [1], following a sequence of previous attempts [40, 46, 58, 59], proposed to learn the feature map f using an auto-encoder structure, and to learn the density of the latent variable Z = f(X) using an autoregressive model, which, as argued in [22], exactly corresponds to a triangular map. In other words, if we set f as the parameters of an auto-encoder, ℓ to be its reconstruction loss, and g ≡ 0, then our framework (2) reduces to LSA. However, our general framework opens the way to exploit more advanced flow-based density estimation algorithms, as well as the quantile scoring rule (4).

3 Estimating TQM Using Deep Networks

In this section we show how to estimate the TQM Q in (2) based on samples ⟨X1, . . .
, Xn⟩ i.i.d. ∼ p. In particular, any flow-based neural density estimator can be plugged into our framework. Our framework (2) has three components, which we implement as follows:

• A feature extractor f for performing dimensionality reduction. Following previous works [1, 40, 46, 58, 59] we implement f through a deep autoencoder that consists of one encoder Z = E(X; θ_E) and one decoder X̂ = D(Z; θ_D). We use the Euclidean reconstruction loss:

ℓ(f) = ℓ(θ_E, θ_D) = Σ_{i=1}^n ‖Xi − X̂i‖².

As argued in [3], the reconstruction error, aside from low likelihood, is an important indicator of "novelty." Indeed, since the autoencoder is trained on nominal data, a test sample will incur a large reconstruction error only when it is novel, as such samples have never been encountered before.

• A flow-based neural density estimator for Q. Here we adopt the sum-of-squares (SOS) flow proposed in [22], although other neural density estimators would apply equally well. The SOS flow consists of two parts: an increasing (univariate) polynomial P_{2r+1}(u; a) of degree 2r + 1 for modelling conditional densities, and a conditioner network C_j(u1, . . . , u_{j−1}; θ_Q) for generating the coefficients a of the polynomial:

P_{2r+1}(u; a) = c + ∫₀ᵘ Σ_{s=1}^k ( Σ_{l=0}^r a_{l,s} t^l )² dt,

where c ∈ R is an arbitrary constant, r ∈ N determines the degree of the polynomial, and k can be chosen as small as 2. In other words, the TQM Q learned using the SOS flow has the following form:

Q = T ∘ Φ⁻¹, where for all j, T_j(u1, . . . , u_j) = P_{2r+1}(u_j; C_j(u1, . . . , u_{j−1}; θ_Q)).    (5)

Any regularization term on the conditioner network weights θ_Q can be put into the function g(Q) in our framework (2).

• Lastly, the KL-divergence term in (2) can be approximated empirically using the given sample ⟨X1, . . . , Xn⟩. Upon dropping irrelevant constants we reduce the KL term in (2) to:

min_{θ_Q}  Σ_{i=1}^n [ log |Q′(Q⁻¹(f(Xi)))| − log q(Q⁻¹(f(Xi))) ],

where each component of Q is given in (5). Crucially, since Q is increasing triangular, evaluating the inverse Q⁻¹ and the Jacobian |Q′| can both be done in linear time [22].

Since q is the uniform density over the hypercube, upon simplification the final training objective we use in our experiments is as follows. Let Zi = E(Xi; θ_E); we aim to solve:

min_θ  Σ_{i=1}^n [ (1 − λ) ( log |T′(T⁻¹(Zi))| + ‖T⁻¹(Zi)‖²₂ / 2 ) + λ ‖Xi − D(Zi; θ_D)‖² ],    (6)

where the first bracketed term is the negative log-likelihood h(Xi; θ) and the second is the reconstruction loss ℓ(Xi; θ); recall that Q = T ∘ Φ⁻¹ is parameterized through the conditioner network weights θ_Q in (5). We did not find it necessary to further regularize Q, hence set g ≡ 0 in (2) and w.l.o.g. γ = 1 − λ. The first KL term in (2), as is well known, reduces to the negative log-likelihood of the latent random vectors Zi in (6), and the second term is the standard reconstruction loss. The two terms share the encoder weights θ_E and the trade-off is balanced through the hyperparameter λ. This design choice conforms to the psychology findings in [3]. In practice, we found that the variance
In practice, we found that the variance\nof the log-likelihood is much larger than that of the reconstruction loss, and as a consequence we\nobserved substantial dif\ufb01culty in directly minimizing the weighted objective in (6). A popular pre-\ntraining heuristic is to train the whole model in two stages: we \ufb01rst minimize the reconstruction\nloss (cid:96)(\u03b8E, \u03b8D) and then, with the learned hidden vector Z, we estimate the TQM Q by maximum\nlikelihood. However, as shown in [59], the latent representation learned in the \ufb01rst stage does not\nnecessarily help the task in the second stage.\nInstead, we cast the two competing objectives in (6) as multi-objective optimization, which we solve\nusing the multiple gradient descent algorithm (MGDA) [12, 15, 36]. Our motivation comes from the\nfollowing observation: the two-stage procedure amounts to \ufb01rst setting \u03bb = 1 and running gradient\ndescent (GD) for a number of iterations, then switching to \u03bb = 0 (or \u03bb = 0.5 say) and running GD\nfor the remaining iterations. Naturally, instead of any pre-determined schedule for the hyperparameter\n\u03bb (such as switching from 1 to 0 or 0.5), why not let GD decide what \u03bb to use in each iteration? This\nis precisely the main idea behind MGDA, where at iteration t we solve\n\n(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)2\n\n(cid:110)\n\n(cid:110)\n\n= min\n\n1, max\n\n0,\n\n(cid:104)\u2207hI\u2212\u2207(cid:96)I ,\u2207hI(cid:105)\n(cid:107)\u2207hI\u2212\u2207(cid:96)I(cid:107)2\n\n(cid:111)(cid:111)\n\n,\n\n(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88)\n\ni\u2208I\n\n\u03bbt = argmin\n0\u2264\u03bb\u22641\n\n\fwhere I \u2286 {1, . . . , n} is a minibatch of samples, and obviously \u2207hI = (cid:80)\n\nsimilarly for \u2207(cid:96)I. 
With λt calculated we can continue the gradient update:

θ_{t+1} = θt − η [ (1 − λt) ∇h_I + λt ∇ℓ_I ],

where η ≥ 0 is the step size. As shown in [12], this algorithm converges to a Pareto-optimal solution under fairly general conditions. Pleasantly, MGDA eliminates the need to tune the hyperparameter λ, as it is determined automatically on the fly. To our best knowledge, our work is the first to demonstrate the effectiveness of MGDA on novelty detection tasks.

We end our discussion by pointing out that the algorithm we develop here can easily be adapted to other design choices that fit into our general framework (2). For instance, if we use a variational autoencoder [23] or a denoising autoencoder [55], then we need only replace the squared reconstruction loss in (6) accordingly.

4 Empirical Results

In this section, we evaluate the performance of our proposed method for novelty detection and compare it with traditional and state-of-the-art alternatives. For evaluation, we use precision, recall, F1 score, and the Area Under the Receiver Operating Characteristic (AUROC) curve as our performance metrics, which are commonly used in previous works.

4.1 Datasets

In our experiments, we use two public image datasets, MNIST and Fashion-MNIST, as well as two non-image datasets, KDDCUP and Thyroid. A detailed description of these datasets, the applied network architectures, and the training hyperparameters can be found in Appendix A. For MNIST and Fashion-MNIST, each of the ten classes is in turn deemed the nominal class while the remaining nine classes are deemed novel. We use the standard training and test splits. 
For every class, we hold out 10% of the training set as a validation set, which is used to tune hyperparameters and to monitor the training process.

4.2 Competitor Algorithms

We compare our method with the following alternative algorithms:
• OC-SVM [49]. OC-SVM is a traditional kernel-based quantile approach which has been widely used in practice for novelty detection. We use the RBF kernel in our experiments. We consider two OC-SVM-based methods for comparison: 1) RAW-OC-SVM, where the input is directly fed to OC-SVM; and 2) CAE-OC-SVM, where a convolutional autoencoder is first applied to the input data for dimensionality reduction, and the low-dimensional latent representation is then fed to OC-SVM.
• Geometric transformation (GT) [18]. A self-labeled multi-class dataset is first created by applying a set of geometric transformations to the original nominal examples. Then, a multi-class classifier is trained to discriminate between the geometric transformations of each nominal example. The scoring function in GT is the conditional probability of the softmax responses of the classifier given the geometric transformations.
• Variational autoencoder (VAE) [23]. The evidence lower bound is used as the scoring function.
• Denoising autoencoder (DAE) [55]. The reconstruction error is used as the scoring function.
• Deep structured energy-based models (DSEBM) [58]. DSEBM employs a deterministic deep neural network to output the energy function (i.e., negative log-likelihood), which is used to form the density of nominal data. The network is trained by score matching, in a way similar to training a DAE. Two scoring functions, based on the reconstruction error and the energy score, are considered.
• Deep autoencoding Gaussian mixture model (DAGMM) [59]. 
DAGMM consists of a compression network implemented using a deep autoencoder and a Gaussian mixture estimation network that outputs the joint density of the latent representations and some reconstruction features from the autoencoder. The energy function is used as the scoring function.
• Generative probabilistic novelty detection (GPND) [40]. GPND, based on adversarial autoencoders, employs an extra adversarial loss to impose priors on the output distribution. The density is used as the scoring function. By linearizing the manifold that nominal data resides on, its density is factorized into two product terms, which are then approximately computed using nominal data.

Table 1: AUROC of Variants of Our Method on MNIST

Scoring function   λ = 0.99   0.9      0.5      0.1      Optimized
NLL                0.9729     0.9692   0.9537   0.9389   0.9728
TQM1               0.9622     0.9616   0.9430   0.9319   0.9666
TQM2               0.9666     0.9645   0.9465   0.9347   0.9699
TQM∞               0.9499     0.9527   0.9371   0.9128   0.9531

Table 2: Average Precision, Recall, and F1 Score on Non-image Datasets

                 Thyroid                       KDDCUP
Method           Precision  Recall   F1        Precision  Recall   F1
RAW-OC-SVM *     0.4239     0.3639   0.3887    0.8523     0.7457   0.7954
DSEBM *          0.0404     0.0403   0.0403    0.7369     0.7477   0.7423
DAGMM *          0.4834     0.4766   0.4782    0.9442     0.9297   0.9369
Ours-REC         –          –        –         0.6287     0.6305   0.6296
Ours-NLL         0.7312     0.7312   0.7312    0.9622     0.9622   0.9622
Ours-TQM1        0.5269     0.5269   0.5269    0.9621     0.9621   0.9621
Ours-TQM2        0.5806     0.5806   0.5806    0.9622     0.9622   0.9622
Ours-TQM∞        0.7527     0.7527   0.7527    0.9622     0.9622   0.9622

• Latent space autoregression (LSA) [1]. A parametric autoregressive model is used to estimate the density of the latent representation generated by a deep autoencoder, where the conditional probability densities are modeled as multinomials over quantized latent representations. 
The sum of the normalized reconstruction error and log-likelihood is used as the scoring function.

4.3 Variants of Our Method

In this subsection, we first compare some variants of our proposed method. With regard to the network configuration, except on Thyroid, whose dimension is too small to require any form of dimensionality reduction, all other experiments contain both an autoencoder and an estimation network.

We consider the following five scoring functions, which we threshold at some level τ. In particular, given a test example X̃, we denote its reconstruction by X̂ and its latent representation by Z̃ = f(X̃).
• Reconstruction error (REC): ‖X̃ − X̂‖²;
• Negative log-likelihood (NLL): log |T′(T⁻¹(Z̃))| + ‖T⁻¹(Z̃)‖²₂ / 2;
• 1-norm of quantile (TQM1): ‖Φ(T⁻¹(Z̃)) − 1/2‖₁;
• 2-norm of quantile (TQM2): ‖Φ(T⁻¹(Z̃)) − 1/2‖₂;
• Infinity norm of quantile (TQM∞): ‖Φ(T⁻¹(Z̃)) − 1/2‖_∞.

In Table 1, we compare two approaches on MNIST for selecting the hyperparameter λ in the training phase: 1) chosen from a pre-set family using the validation set; and 2) automatically optimized using MGDA [12, 15, 36]. We report the average AUROC over the 10 classes. It is clear that for all scoring functions, the optimized λ generally leads to the highest AUROC. This is also observed on other datasets such as Fashion-MNIST. Among the proposed variants, NLL results in the highest AUROC of all scoring functions, followed by TQM2. In Table 2, on the two non-image datasets we evaluate the average precision, recall, and F1 score. The superscript * on a baseline indicates that the results are directly quoted from the respective references. 
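All of the scoring statistics above are used the same way at test time: scores are compared against a level τ matched to an assumed novelty ratio, as done for Table 2. A toy sketch of this thresholding step (the Gaussian score distributions and sample sizes are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scoring statistics: higher = more novel.  900 nominal test
# points and 100 novel ones, i.e. an assumed novelty ratio of 0.1.
nominal = rng.normal(0.0, 1.0, 900)
novel = rng.normal(4.0, 1.0, 100)
scores = np.concatenate([nominal, novel])
labels = np.concatenate([np.zeros(900), np.ones(100)])

# Threshold at the (1 - ratio)-quantile of the test scores, so the
# number of flagged points matches the assumed novelty ratio.
ratio = 0.1
tau = np.quantile(scores, 1.0 - ratio)
pred = scores >= tau

tp = np.sum(pred & (labels == 1))
precision = tp / np.sum(pred)
recall = tp / np.sum(labels == 1)
# With continuous scores, roughly `ratio * len(scores)` points are
# flagged, so precision and recall (and hence F1) essentially coincide.
print(precision, recall)
```

This is why a single number per method suffices in Table 2 once the novelty ratio is assumed known.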
The threshold is chosen by assuming prior knowledge of the ratio between the novel and nominal examples in the test set. Under this assumption, the number of false positives equals the number of false negatives, so the values of the three metrics coincide. On Thyroid, TQM∞ is slightly better than the density-based method. On KDDCUP, the density- and quantile-based approaches have the same performance, while REC performs worst. On both datasets, our proposed methods are superior to the benchmarks.

Table 3: AUROC on MNIST and Fashion-MNIST

MNIST
Class  OC-SVM(RAW)  OC-SVM(CAE)  VAE    DAE    LSA    GT     DAGMM  GPND   DSEBM  Ours-NLL  Ours-TQM2
0      0.995        0.990        0.985  0.982  0.998  0.982  0.500  0.999  0.320  0.995     0.993
1      0.999        0.999        0.997  0.998  0.999  0.893  0.766  0.999  0.987  0.998     0.997
2      0.926        0.919        0.943  0.936  0.923  0.993  0.326  0.980  0.482  0.953     0.948
3      0.936        0.939        0.916  0.929  0.974  0.987  0.319  0.968  0.753  0.963     0.957
4      0.967        0.946        0.945  0.940  0.955  0.993  0.368  0.980  0.696  0.966     0.963
5      0.955        0.936        0.929  0.928  0.966  0.994  0.490  0.987  0.727  0.962     0.960
6      0.987        0.979        0.977  0.982  0.992  0.999  0.515  0.998  0.954  0.992     0.990
7      0.966        0.951        0.975  0.971  0.969  0.966  0.500  0.988  0.911  0.969     0.966
8      0.903        0.896        0.864  0.857  0.935  0.974  0.467  0.929  0.536  0.955     0.951
9      0.962        0.960        0.967  0.974  0.969  0.993  0.813  0.993  0.905  0.977     0.976
avg    0.960        0.952        0.950  0.950  0.968  0.977  0.508  0.982  0.727  0.973     0.970

Fashion-MNIST
Class  OC-SVM(RAW)  OC-SVM(CAE)  VAE    DAE    LSA    GT     DAGMM  GPND   DSEBM  Ours-NLL  Ours-TQM2
0      0.919        0.908        0.874  0.867  0.916  0.903  0.303  0.917  0.891  0.922     0.917
1      0.990        0.987        0.977  0.978  0.983  0.993  0.311  0.983  0.560  0.958     0.950
2      0.894        0.884        0.816  0.808  0.878  0.927  0.475  0.878  0.861  0.899     0.899
3      0.942        0.911        0.912  0.914  0.923  0.906  0.481  0.945  0.903  0.930     0.925
4      0.907        0.913        0.872  0.865  0.897  0.907  0.499  0.906  0.884  0.922     0.921
5      0.918        0.865        0.916  0.921  0.907  0.954  0.413  0.924  0.859  0.894     0.884
6      0.834        0.820        0.738  0.738  0.841  0.832  0.420  0.785  0.782  0.844     0.838
7      0.988        0.984        0.976  0.977  0.977  0.981  0.374  0.984  0.981  0.980     0.972
8      0.903        0.877        0.795  0.782  0.910  0.976  0.518  0.916  0.865  0.945     0.943
9      0.982        0.955        0.965  0.963  0.984  0.994  0.378  0.876  0.967  0.983     0.983
avg    0.928        0.910        0.884  0.881  0.922  0.937  0.472  0.911  0.855  0.928     0.923

4.4 Comparison with Baseline Methods

In this section, we compare our method with the baseline approaches. Note that except for RAW-OC-SVM and GT, all other methods, including ours, are based on autoencoders.
Table 3 shows the comparison of AUROC on the image datasets. Among the proposed quantile scoring functions we list only TQM2, which attains the highest AUROC. We observe that on both datasets our proposed methods are superior to most of the benchmarks, with the density scoring function being slightly better than the quantile one. On MNIST, GPND and GT perform better; on Fashion-MNIST, GT attains the highest AUROC, followed by Ours-NLL and RAW-OC-SVM. However, since GT explicitly extracts features using a set of geometric transformations, it inevitably suffers from high computational and space complexity. In Appendix B, we further compare and discuss the proposed density- and quantile-based approaches in detail.

4.5 Comparison with Two-Stage Training

In our proposed algorithm the autoencoder and the estimation network are trained jointly by employing MGDA.
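For two objectives, the MGDA step underlying our joint training admits a closed form: the common descent direction is the minimum-norm point in the convex hull of the two gradients [12], so the combination weight need not be hand-tuned. A minimal NumPy sketch of this two-objective subproblem (the gradients g_rec and g_nll below are illustrative placeholders, not gradients computed from our networks):

```python
import numpy as np

def mgda_weight(g1, g2):
    """Min-norm combination weight for two gradients (two-objective MGDA).

    Solves min_{lam in [0, 1]} ||lam * g1 + (1 - lam) * g2||^2 in closed form.
    """
    diff = g1 - g2
    denom = float(np.dot(diff, diff))
    if denom == 0.0:  # identical gradients: any convex weight is optimal
        return 0.5
    lam = float(np.dot(g2 - g1, g2)) / denom
    return float(np.clip(lam, 0.0, 1.0))

# Placeholder gradients of the reconstruction loss and the NLL loss
# with respect to the shared parameters.
g_rec = np.array([1.0, 0.0])
g_nll = np.array([0.0, 2.0])
lam = mgda_weight(g_rec, g_nll)          # here: 0.8
d = lam * g_rec + (1.0 - lam) * g_nll    # common descent direction (0.8, 0.4)
```

When the optimal weight lies strictly inside (0, 1), stepping along -d decreases both losses simultaneously; when it is clipped to 0 or 1, descending one objective alone already does not increase the other.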
For comparison, we also consider the following two-stage training strategies:
• We first train the autoencoder, then fix the autoencoder and train the estimation network alone (denoted as Fix-).
• We first pretrain the autoencoder, then jointly train the autoencoder and the estimation network with the weight λ fixed to 0.5 (denoted as Pretrain-).

The comparison in terms of AUROC on MNIST is shown in Table 4. We find that the proposed joint training method leads to the best performance for both the density-based and the quantile-based scoring functions. This is consistent with the findings in many existing works [e.g. 1, 6, 44, 58]. For the fixed two-stage method, our understanding is that the latent representation learned in the first stage may not be the most beneficial for training the estimation network in the second stage, which in turn degrades the overall performance. For the pretrained two-stage method, although the two parts are trained jointly in the second stage, the autoencoder is initialized with the parameters learned in the first stage, which might prevent it from being updated toward a more suitable local optimum. The comparison on the Fashion-MNIST dataset is similar and is shown in Appendix C.

Table 4: Comparison between joint and two-stage training: AUROC on MNIST

Class  Fix-NLL  Pretrain-NLL  Ours-NLL  Fix-TQM2  Pretrain-TQM2  Ours-TQM2
0      0.9939   0.9954        0.9951    0.9904    0.9939         0.9925
1      0.9971   0.9988        0.9977    0.9972    0.9985         0.9969
2      0.9403   0.9677        0.9526    0.9188    0.9568         0.9479
3      0.9568   0.9496        0.9627    0.9481    0.9414         0.9567
4      0.9703   0.9445        0.9657    0.9700    0.9388         0.9625
5      0.9612   0.9564        0.9618    0.9525    0.9486         0.9601
6      0.9878   0.9907        0.9915    0.9841    0.9881         0.9895
7      0.9629   0.9676        0.9686    0.9587    0.9656         0.9660
8      0.9549   0.9587        0.9551    0.9397    0.9527         0.9512
9      0.9736   0.9733        0.9768    0.9742    0.9641         0.9756
avg    0.9699   0.9703        0.9728    0.9634    0.9649         0.9699

4.6 Visualization

Figure 1: Distributional comparison of training and test scoring statistics on MNIST (nominal: digit 1). From left to right: 1) NLL; 2) TQM1; 3) TQM2; and 4) TQM∞.

In Figure 1, we show the violin plots of the scoring statistics NLL, TQM1, TQM2, and TQM∞ on the MNIST test set (with digit 1 serving as the nominal class). We use the network parameters produced every 20 epochs during training to generate each curve. We can see that in the beginning the nominal and novel data have a large region of overlap, and after more training epochs they are gradually separated. After about 20 epochs of training they can be clearly distinguished under NLL, TQM1, and TQM2, which indicates the effectiveness of these scoring functions. For TQM∞, the distribution of the novel data is concentrated within a narrow region near the boundary of that of the nominal data. More visualization results can be found in Appendix D.

5 Conclusion

We extended the univariate quantile function to the multivariate setting through increasing triangular maps, which motivated a general framework for neural novelty detection. Our framework unifies and extends many existing algorithms in novelty detection. We adapted the multiple gradient descent algorithm to obtain an efficient, end-to-end implementation of our framework that is free of any tuning hyperparameters. We performed extensive experiments on a number of datasets to confirm the competitiveness of our method against state-of-the-art alternatives. In the future we will study the consistency of our estimation algorithm for the multivariate triangular quantile map, and we plan to apply it to other multivariate probabilistic modelling tasks.

Acknowledgement

We thank the reviewers for their constructive comments.
We thank Priyank Jaini for bringing Decurninge's work to our attention. This work is supported by NSERC.

References

[1] Davide Abati, Angelo Porrello, Simone Calderara, and Rita Cucchiara. Latent Space Autoregression for Novelty Detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[2] Elja Arjas and Tapani Lehtonen. Approximating Many Server Queues by Means of Single Server Queues. Mathematics of Operations Research, 3(3):205–223, 1978.
[3] Andrew Barto, Marco Mirolli, and Gianluca Baldassarre. Novelty or Surprise? Frontiers in Psychology, 4:907, 2013.
[4] Shai Ben-David and Michael Lindenbaum. Learning Distributions by Their Density Levels: A Paradigm for Learning without a Teacher. Journal of Computer and System Sciences, 55(1):171–182, 1997.
[5] Vladimir Igorevich Bogachev, Aleksandr Viktorovich Kolesnikov, and Kirill Vladimirovich Medvedev. Triangular transformations of measures. Sbornik: Mathematics, 196(3):309–335, 2005.
[6] Raghavendra Chalapathy, Aditya Krishna Menon, and Sanjay Chawla. Anomaly Detection using One-Class Neural Networks, 2018. arXiv:1802.06360.
[7] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly Detection: A Survey. ACM Computing Surveys, 41(3):15:1–15:58, 2009.
[8] Victor Chernozhukov, Alfred Galichon, Marc Hallin, and Marc Henry. Monge–Kantorovich depth, quantiles, ranks and signs. The Annals of Statistics, 45(1):223–256, 2017.
[9] Sanjoy Dasgupta, Timothy C. Sheehan, Charles F. Stevens, and Saket Navlakha. A neural data structure for novelty detection. Proceedings of the National Academy of Sciences, 2018.
[10] Alexis Decurninge. Univariate and multivariate quantiles, probabilistic and statistical approaches; radar applications. PhD thesis, 2015.
[11] Lucas Deecke, Robert Vandermeulen, Lukas Ruff, Stephan Mandt, and Marius Kloft.
Image Anomaly Detection with Generative Adversarial Networks. In Machine Learning and Knowledge Discovery in Databases, pages 3–17, 2019.
[12] Jean-Antoine Désidéri. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique, 350(5):313–318, 2012.
[13] Ivar Ekeland, Alfred Galichon, and Marc Henry. Comonotonic Measures of Multivariate Risks. Mathematical Finance, 22(1):109–132, 2012.
[14] W. Fan, M. Miller, S. Stolfo, W. Lee, and P. Chan. Using artificial anomalies to detect unknown and known network intrusions. Knowledge and Information Systems, 6(5):507–527, 2004.
[15] Jörg Fliege and Benar Fux Svaiter. Steepest descent methods for multicriteria optimization. Mathematical Methods of Operations Research, 51(3):479–494, 2000.
[16] Alfred Galichon and Marc Henry. Dual theory of choice with multivariate risks. Journal of Economic Theory, 147(4):1501–1516, 2012.
[17] Andrew B. Gardner, Abba M. Krieger, George Vachtsevanos, and Brian Litt. One-Class Novelty Detection for Seizure Analysis from Intracranial EEG. Journal of Machine Learning Research, 7:1025–1044, 2006.
[18] Izhak Golan and Ran El-Yaniv. Deep Anomaly Detection Using Geometric Transformations. In Advances in Neural Information Processing Systems 31, pages 9758–9769, 2018.
[19] Paul M. Hayton, Bernhard Schölkopf, Lionel Tarassenko, and Paul Anuzis. Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra. In Advances in Neural Information Processing Systems 13, pages 946–952, 2001.
[20] Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural Autoregressive Flows. In ICML, 2018.
[21] David Inouye and Pradeep Ravikumar. Deep Density Destructors. In Proceedings of the 35th International Conference on Machine Learning, pages 2167–2175, 2018.
[22] Priyank Jaini, Kira A.
Selby, and Yaoliang Yu. Sum-of-Squares Polynomial Flow. In Proceedings of the 36th International Conference on Machine Learning, 2019.
[23] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations, 2014.
[24] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[25] W. Li, V. Mahadevan, and N. Vasconcelos. Anomaly Detection and Localization in Crowded Scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1):18–32, 2014.
[26] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks. In International Conference on Learning Representations, 2018.
[27] Moshe Lichman. UCI machine learning repository. http://kdd.ics.uci.edu/databases/kddcup99.
[28] Moshe Lichman. UCI machine learning repository. http://archive.ics.uci.edu/ml.
[29] Larry Manevitz and Malik Yousef. One-class document classification via Neural Networks. Neurocomputing, 70(7):1466–1481, 2007.
[30] Larry M. Manevitz and Malik Yousef. One-Class SVMs for Document Classification. Journal of Machine Learning Research, 2:139–154, 2001.
[31] Markos Markou and Sameer Singh. Novelty detection: a review—part 1: statistical approaches. Signal Processing, 83(12):2481–2497, 2003.
[32] Markos Markou and Sameer Singh. Novelty detection: a review—part 2: neural network based approaches. Signal Processing, 83(12):2499–2521, 2003.
[33] Youssef Marzouk, Tarek Moselhy, Matthew Parno, and Alessio Spantini. Sampling via Measure Transport: An Introduction, pages 1–41. Springer, 2016.
[34] Aditya Krishna Menon and Robert C. Williamson. A loss framework for calibrated anomaly detection. In Advances in Neural Information Processing Systems 31, pages 1494–1504, 2018.
[35] Mary M. Moya and Don R. Hush.
Network constraints and multi-objective optimization for one-class classification. Neural Networks, 9(3):463–474, 1996.
[36] H. Mukai. Algorithms for multicriterion optimization. IEEE Transactions on Automatic Control, 25(2):177–186, 1980.
[37] G. L. O'Brien. The Comparison Method for Stochastic Processes. The Annals of Probability, 3(1):80–88, 1975.
[38] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.
[39] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven Exploration by Self-supervised Prediction. In Proceedings of the 34th International Conference on Machine Learning, pages 2778–2787, 2017.
[40] Stanislav Pidhorskyi, Ranya Almohsen, and Gianfranco Doretto. Generative Probabilistic Novelty Detection with Adversarial Autoencoders. In Advances in Neural Information Processing Systems 31, pages 6823–6834, 2018.
[41] Marco A. F. Pimentel, David A. Clifton, Lei Clifton, and Lionel Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, 2014.
[42] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. Regazzoni, and N. Sebe. Abnormal event detection in videos using generative adversarial nets. In IEEE International Conference on Image Processing (ICIP), pages 1577–1581, 2017.
[43] M. Ravanbakhsh, E. Sangineto, M. Nabi, and N. Sebe. Training Adversarial Discriminators for Cross-Channel Abnormal Event Detection in Crowds. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1896–1904, 2019.
[44] Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep One-Class Classification.
In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 4393–4402, 2018.
[45] Ludger Rüschendorf. Stochastically ordered distributions and monotonicity of the oc-function of sequential probability ratio tests. Series Statistics, 12(3):327–338, 1981.
[46] M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli. Adversarially Learned One-Class Classifier for Novelty Detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3379–3388, 2018.
[47] Mohammad Sabokrou, Mohsen Fayyaz, Mahmood Fathy, Zahra Moayed, and Reinhard Klette. Deep-anomaly: Fully convolutional neural network for fast anomaly detection in crowded scenes. Computer Vision and Image Understanding, 172:88–97, 2018.
[48] Thomas Schlegl, Philipp Seeböck, Sebastian M. Waldstein, Georg Langs, and Ursula Schmidt-Erfurth. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis, 54:30–44, 2019.
[49] Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. Estimating the Support of a High-Dimensional Distribution. Neural Computation, 13(7):1443–1471, 2001.
[50] Alessio Spantini, Daniele Bigoni, and Youssef Marzouk. Inference via low-dimensional couplings. Journal of Machine Learning Research, 19:1–71, 2018.
[51] Ingo Steinwart, Don Hush, and Clint Scovel. A classification framework for anomaly detection. Journal of Machine Learning Research, 6:211–232, 2005.
[52] Akiko Takeda and Masashi Sugiyama. ν-Support Vector Machine as Conditional Value-at-Risk Minimization. In 25th International Conference on Machine Learning, pages 1056–1063, 2008.
[53] David M. J. Tax and Robert P. W. Duin. Support vector data description. Machine Learning, 54(1):45–66, 2004.
[54] Cédric Villani. Optimal Transport: Old and New, volume 338.
Springer, 2008.
[55] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103, 2008.
[56] Y. Xia, X. Cao, F. Wen, G. Hua, and J. Sun. Learning Discriminative Reconstructions for Unsupervised Outlier Removal. In IEEE International Conference on Computer Vision (ICCV), pages 1511–1519, 2015.
[57] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.
[58] Shuangfei Zhai, Yu Cheng, Weining Lu, and Zhongfei Zhang. Deep Structured Energy Based Models for Anomaly Detection. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pages 1100–1109, 2016.
[59] Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection. In International Conference on Learning Representations, 2018.