{"title": "Generative Probabilistic Novelty Detection with Adversarial Autoencoders", "book": "Advances in Neural Information Processing Systems", "page_first": 6822, "page_last": 6833, "abstract": "Novelty detection is the problem of identifying whether a new data point is considered to be an inlier or an outlier. We assume that training data is available to describe only the inlier distribution. Recent approaches primarily leverage deep encoder-decoder network architectures to compute a reconstruction error that is used to either compute a novelty score or to train a one-class classifier. While we too leverage a novel network of that kind, we take a probabilistic approach and effectively compute how likely it is that a sample was generated by the inlier distribution. We achieve this with two main contributions. First, we make the computation of the novelty probability feasible because we linearize the parameterized manifold capturing the underlying structure of the inlier distribution, and show how the probability factorizes and can be computed with respect to local coordinates of the manifold tangent space. Second, we improve the training of the autoencoder network. An extensive set of results show that the approach achieves state-of-the-art performance on several benchmark datasets.", "full_text": "Generative Probabilistic Novelty Detection with\n\nAdversarial Autoencoders\n\nStanislav Pidhorskyi\n\nRanya Almohsen\n\nDonald A. Adjeroh\n\nGianfranco Doretto\n\nLane Department of Computer Science and Electrical Engineering\n\nWest Virginia University, Morgantown, WV 26506\n\n{stpidhorskyi, ralmohse, daadjeroh, gidoretto}@mix.wvu.edu\n\nAbstract\n\nNovelty detection is the problem of identifying whether a new data point is con-\nsidered to be an inlier or an outlier. We assume that training data is available to\ndescribe only the inlier distribution. 
Recent approaches primarily leverage deep encoder-decoder network architectures to compute a reconstruction error that is used to either compute a novelty score or to train a one-class classifier. While we too leverage a novel network of that kind, we take a probabilistic approach and effectively compute how likely it is that a sample was generated by the inlier distribution. We achieve this with two main contributions. First, we make the computation of the novelty probability feasible because we linearize the parameterized manifold capturing the underlying structure of the inlier distribution, and show how the probability factorizes and can be computed with respect to local coordinates of the manifold tangent space. Second, we improve the training of the autoencoder network. An extensive set of results shows that the approach achieves state-of-the-art performance on several benchmark datasets.\n\n1 Introduction\n\nNovelty detection is the problem of identifying whether a new data point is considered to be an inlier or an outlier. From a statistical point of view, this process usually occurs when prior knowledge of the distribution of inliers is the only information available. This is also the most difficult and relevant scenario, because outliers are often very rare, or even dangerous to experience (e.g., in industry process fault detection [1]), and there is a need to rely only on inlier training data. Novelty detection has received significant attention in application areas such as medical diagnoses [2], drug discovery [3], and, among others, several computer vision applications, such as anomaly detection in images [4, 5] and videos [6], and outlier detection [7, 8]. We refer to [9] for a general review on novelty detection. 
The most recent approaches are based on learning deep network architectures [10, 11], and they tend to either learn a one-class classifier [12, 11] or to somehow leverage, as a novelty score, the reconstruction error of the encoder-decoder architecture they are based on [13, 7].\nIn this work, we introduce a new encoder-decoder architecture as well, which is based on adversarial autoencoders [14]. However, we do not train a one-class classifier; instead, we learn the probability distribution of the inliers. Therefore, the novelty test simply becomes the evaluation of the probability of a test sample, and rare samples (outliers) fall below a given threshold. We show that this approach allows us to effectively use the decoder network to learn the parameterized manifold shaping the inlier distribution, in conjunction with the probability distribution of the (parameterizing) latent space. The approach is made computationally feasible because for a given test sample we linearize the manifold, and show that with respect to the local manifold coordinates the data model distribution factorizes into a component dependent on the manifold (decoder network plus latent distribution), and another one dependent on the noise, which can also be learned offline.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nWe named the approach generative probabilistic novelty detection (GPND) because we compute the probability distribution of the full model, which includes the signal plus noise portion, and because it relies on being able to also generate data samples. We are mostly concerned with novelty detection using images, and with controlling the distribution of the latent space to ensure good generative reproduction of the inlier distribution. This is essential not so much to ensure good image generation, but for the correct computation of the novelty score. 
This aspect has been overlooked by the deep learning literature so far, since the focus has been only on leveraging the reconstruction error. We do leverage that as well, but we show in our framework that the reconstruction error affects only the noise portion of the model. In order to control the latent distribution and image generation, we learn an adversarial autoencoder network with two discriminators that address these two issues.\nSection 2 reviews the related work. Section 3 introduces the GPND framework, and Section 4 describes the training and architecture of the adversarial autoencoder network. Section 5 covers implementation details and complexity. Section 6 presents a rich set of experiments showing that GPND is very effective and produces state-of-the-art results on several benchmarks.\n\n2 Related Work\n\nNovelty detection is the task of recognizing abnormality in data. The literature in this area is sizable. Novelty detection methods can be statistical and probabilistic [15, 16], distance based [17], or based on self-representation [8]. Recently, deep learning approaches [7, 11] have also been used, greatly improving the performance of novelty detection.\nStatistical methods [18, 19, 15, 16] usually focus on modeling the distribution of inliers by learning the parameters defining the probability, and outliers are identified as those having low probability under the learned model. Distance based outlier detection methods [20, 17, 21] identify outliers by their distance to neighboring examples. They assume that inliers are close to each other while the abnormal samples are far from their nearest neighbors. A well-known work in this category is LOF [22], which is based on k-nearest neighbors and density-based estimation. 
More recently, [23] introduced the Kernel Null Foley-Sammon Transform (KNFST) for multi-class novelty detection, where training samples of each known category are projected onto a single point in the null space, and then distances between the projection of a test sample and the class representatives are used to obtain a novelty measure. [24] improves on previous approaches by proposing an incremental procedure called Incremental Kernel Null Space Based Discriminant Analysis (IKNDA).\nSince outliers do not have sparse representations, self-representation approaches have been proposed for outlier detection in a union of subspaces [4, 25]. Similarly, deep learning based approaches have used neural networks and leveraged the reconstruction error of encoder-decoder architectures. [26, 27] used deep learning based autoencoders to learn the model of normal behaviors and employed a reconstruction loss to detect outliers. [28] used a GAN [29] based method that generates new samples similar to the training data, demonstrated its ability to describe the training data, and then transformed the implicit description of normal data into a novelty score. [10] trained GANs using optical flow images to learn a representation of scenes in videos. [7] minimized the reconstruction error of an autoencoder to remove outliers from noisy data, and by utilizing the gradient magnitude of the autoencoder made the reconstruction error more discriminative for positive samples.\nIn [11], a framework for one-class classification and novelty detection was proposed. It consists of two main modules learned in an adversarial fashion. 
The first is a decoder-encoder convolutional neural network trained to reconstruct inliers accurately, while the second is a one-class classifier made with another network that produces the novelty score.\nThe proposed approach relates to the statistical methods because it aims at computing the probability distribution of test samples as the novelty score, but it does so by learning the manifold structure of the distribution with an encoder-decoder network. Moreover, the method is different from those that learn a one-class classifier or rely on the reconstruction error to compute the novelty score, because in our framework the reconstruction error represents only one component of the score computation, which allows us to achieve improved performance.\nState-of-the-art works on density estimation for image compression include Pixel Recurrent Neural Networks [30] and derivatives [31, 32]. These pixel-based methods allow sequential prediction of pixels in an image along the two spatial dimensions. Because they model the joint distribution of the raw pixels along with their sequential correlation, it is possible to use them for image compression.\n\nFigure 1: Manifold schematic representation. This figure shows the connection between the parametrized manifold M, its tangent space T, a data point x, and its projection x∥.\n\nFigure 2: Reconstruction of inliers and outliers. This figure shows reconstructions for the autoencoder network that was trained on inliers of label \"7\" of the MNIST [37] dataset. The first line is the input of inliers of label \"7\", and the second line shows the corresponding reconstructions. The third line corresponds to the input of outliers of label \"0\", and the fourth line shows the corresponding reconstructions.\n\nAlthough they could also model the probability distribution of known samples, they work at a local scale in a patch-based fashion, which makes non-local pixels loosely correlated. 
Our approach, instead, does not model the probability density of individual pixels, but works with the whole image. It is not suitable for image compression, and while its generative nature allows it, in principle, to produce novel images, in this work we focus only on novelty detection by evaluating the inlier probability distribution on test samples.\nA recent line of work has focused on detecting out-of-distribution samples by analyzing the output entropy of a prediction made by a pre-trained deep neural network [33, 34, 35, 36]. This is done by either simply thresholding the maximum softmax score [34], or by first applying perturbations to the input, scaled proportionally to the gradients w.r.t. the input, and then combining the softmax score with temperature scaling, as is done in Out-of-distribution Image Detection in Neural Networks (ODIN) [36]. While these approaches require labels for the in-distribution data to train the classifier network, our method does not use label information. Therefore, it can be applied to the case when in-distribution data is represented by one class, or when label information is not available.\n\n3 Generative Probabilistic Novelty Detection\n\nWe assume that training data points x1, . . . , xN, where xi ∈ Rm, are sampled, possibly with noise ξi, from the model\n\nxi = f(zi) + ξi ,   i = 1, · · · , N ,   (1)\n\nwhere zi ∈ Ω ⊂ Rn. The mapping f : Ω → Rm defines M ≡ f(Ω), which is a parameterized manifold of dimension n, with n < m. We also assume that the Jacobi matrix of f is full rank at every point of the manifold. In addition, we assume that there is another mapping g : Rm → Rn, such that for every x ∈ M it follows that f(g(x)) = x, which means that g acts as the inverse of f on such points.\nGiven a new data point x̄ ∈ Rm, we design a novelty test to assert whether x̄ was sampled from model (1). 
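As a toy numerical instance of model (1) (a sketch with an assumed one-dimensional latent space and a hand-picked smooth map f, not the learned decoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy decoder f : R^1 -> R^3, a smooth curve playing the role of M = f(Omega).
def f(z):
    z = np.asarray(z)
    return np.stack([np.cos(z), np.sin(z), 0.5 * z], axis=-1)

# Draw latent codes z_i in Omega and add ambient noise xi_i, as in model (1).
N, sigma = 1000, 0.01
z = rng.uniform(-np.pi, np.pi, size=N)
x = f(z) + sigma * rng.normal(size=(N, 3))

print(x.shape)  # (1000, 3): points scattered around a 1-D manifold in R^3
```

Here n = 1 < m = 3, and the noise ξ pushes each sample slightly off the manifold, which is exactly the situation the novelty test has to handle.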
We begin by observing that x̄ can be non-linearly projected onto x̄∥ ∈ M via x̄∥ = f(z̄), where z̄ = g(x̄). Assuming f to be smooth enough, we perform a linearization based on its first-order Taylor expansion\n\nf(z) = f(z̄) + Jf(z̄)(z − z̄) + O(‖z − z̄‖²) ,   (2)\n\nwhere Jf(z̄) is the Jacobi matrix computed at z̄, and ‖·‖ is the L2 norm. We note that T = span(Jf(z̄)) represents the tangent space of f at x̄∥ that is spanned by the n independent column vectors of Jf(z̄), see Figure 1. Also, we have T = span(U∥), where Jf(z̄) = U∥SV⊤ is the singular value decomposition (SVD) of the Jacobi matrix. The matrix U∥ has rank n, and if we define U⊥ such that U = [U∥ U⊥] is a unitary matrix, we can represent the data point x̄ with respect to the local coordinates that define the tangent space T and its orthogonal complement T⊥. This is done by computing\n\nw̄ = U⊤x̄ = [U∥⊤x̄ ; U⊥⊤x̄] = [w̄∥ ; w̄⊥] ,   (3)\n\nwhere the rotated coordinates w̄ are decomposed into w̄∥, which are parallel to T, and w̄⊥, which are orthogonal to T.\nWe now indicate with pX(x) the probability density function describing the random variable X, from which training data points have been drawn. Also, pW(w) is the probability density function of the random variable W representing X after the change of coordinates. The two distributions are identical. 
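The change of coordinates in (3) can be sketched numerically (NumPy; the decoder f and all names here are toy stand-ins, with the Jacobi matrix estimated by finite differences):

```python
import numpy as np

# Toy decoder f : R^2 -> R^4 (an assumed map for illustration, not the learned network).
def f(z):
    z1, z2 = z
    return np.array([z1, z2, z1 * z2, np.sin(z1)])

def jacobian(func, z, eps=1e-6):
    """Finite-difference estimate of the Jacobi matrix, one column per latent coordinate."""
    f0 = func(z)
    cols = [(func(z + eps * e) - f0) / eps for e in np.eye(len(z))]
    return np.stack(cols, axis=1)                    # shape (m, n)

z_bar = np.array([0.3, -0.7])
J = jacobian(f, z_bar)                               # 4 x 2
U, S, Vt = np.linalg.svd(J)                          # full SVD: U is 4 x 4 and unitary
U_par, U_perp = U[:, :2], U[:, 2:]                   # tangent basis and orthogonal complement

x_bar = f(z_bar) + np.array([0.0, 0.0, 0.05, -0.02]) # a noisy test point near the manifold
w_par = U_par.T @ x_bar                              # coordinates parallel to T
w_perp = U_perp.T @ x_bar                            # coordinates orthogonal to T

# U is unitary, so the rotation preserves norms: ||x||^2 = ||w_par||^2 + ||w_perp||^2.
print(np.allclose(np.linalg.norm(x_bar) ** 2,
                  np.linalg.norm(w_par) ** 2 + np.linalg.norm(w_perp) ** 2))  # True
```

The parallel block carries the signal on the manifold and the orthogonal block carries (mostly) noise, which is what the factorization below exploits.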
However, we make the assumption that the coordinates W∥, which are parallel to T, and the coordinates W⊥, which are orthogonal to T, are statistically independent. This means that the following holds\n\npX(x) = pW(w) = pW(w∥, w⊥) = pW∥(w∥) pW⊥(w⊥) .   (4)\n\nThis is motivated by the fact that in (1) the noise ξ is assumed to predominantly deviate the point x away from the manifold M in a direction orthogonal to T. This means that W⊥ is primarily responsible for the noise effects, and since noise and drawing from the manifold are statistically independent, so are W∥ and W⊥.\nFrom (4), given a new data point x̄, we propose to perform novelty detection by executing the following test\n\npX(x̄) = pW∥(w̄∥) pW⊥(w̄⊥) { ≥ γ ⇒ Inlier ; < γ ⇒ Outlier } ,   (5)\n\nwhere γ is a suitable threshold.\n\n3.1 Computing the distribution of data samples\n\nThe novelty detector (5) requires the computation of pW∥(w∥) and pW⊥(w⊥). Given a test data point x̄ ∈ Rm, its non-linear projection onto M is x̄∥ = f(g(x̄)). Therefore, w̄∥ can be written as w̄∥ = U∥⊤x̄ = U∥⊤(x̄ − x̄∥) + U∥⊤x̄∥ = U∥⊤x̄∥, where we have made the approximation that U∥⊤(x̄ − x̄∥) ≈ 0. Since x̄∥ ∈ M, in its neighborhood it can be parameterized as in (2), which means that w∥(z) = U∥⊤f(z̄) + SV⊤(z − z̄) + O(‖z − z̄‖²). 
Therefore, if Z represents the random variable from which samples are drawn from the parameterized manifold, and pZ(z) is its probability density function, then it follows that\n\npW∥(w∥) = |det S⁻¹| pZ(z) ,   (6)\n\nsince V is a unitary matrix. We note that pZ(z) is a quantity that is independent from the linearization (2), and therefore it can be learned offline, as explained in Section 5.\nIn order to compute pW⊥(w⊥), we approximate it with its average over the hypersphere S^(m−n−1) of radius ‖w⊥‖, giving rise to\n\npW⊥(w⊥) ≈ Γ((m−n)/2) / (2 π^((m−n)/2) ‖w⊥‖^(m−n−1)) · p‖W⊥‖(‖w⊥‖) ,   (7)\n\nwhere Γ(·) represents the gamma function. This is motivated by the fact that noise of a given intensity will be equally present in every direction. Moreover, its computation depends on p‖W⊥‖(‖w⊥‖), which is the distribution of the norms of w⊥, and which can easily be learned offline by histogramming the norms of w̄⊥ = U⊥⊤x̄.\n\n4 Manifold learning with adversarial autoencoders\n\nIn this section we describe the network architecture and the training procedure for learning the mapping f that defines the parameterized manifold M, and also the mapping g. The mappings g and f are modeled by an encoder network and a decoder network, respectively.\n\nFigure 3: Architecture overview. Architecture of the network for manifold learning. It is based on training an Adversarial Autoencoder (AAE) [14]. Similarly to [43, 11], it has an additional adversarial component to improve the generative capabilities of decoded images and achieve better manifold learning. 
The architecture layers of the AAE and of the discriminator Dx are specified on the right.\n\nSimilarly to previous work on novelty detection [38, 39, 40, 7, 11, 13], such networks are based on autoencoders [41, 42].\nThe autoencoder network and training should be such that they reproduce the manifold M as closely as possible. For instance, if M represents the distribution of images depicting a certain object category, we would want the estimated encoder and decoder to be able to generate images as if they were drawn from the real distribution. Differently from previous work, we require the latent space, represented by z, to be close to a known distribution, preferably a normal distribution, and we would also want each of the components of z to be maximally informative, which is why we require them to be independent random variables. Doing so facilitates learning a distribution pZ(z) from training data mapped onto the latent space Ω. This means that the autoencoder has generative properties, because by sampling from pZ(z) we would generate data points x ∈ M. Note that, differently from GANs [29], we also require an encoder function g.\nVariational Auto-Encoders (VAEs) [44] are known to work well in the presence of continuous latent variables, and they can generate data from a randomly sampled latent space. VAEs utilize stochastic variational inference and minimize the Kullback-Leibler (KL) divergence penalty to impose a prior distribution on the latent space that encourages the encoder to learn the modes of the prior distribution. Adversarial Autoencoders (AAEs) [14], in contrast to VAEs, use an adversarial training paradigm to match the posterior distribution of the latent space with the given distribution. 
One of the advantages of AAEs over VAEs is that the adversarial training procedure encourages the encoder to match the whole distribution of the prior.\nUnfortunately, since we are concerned with working with images, both AAEs and VAEs tend to produce examples that are often far from the real data manifold. This is because the decoder part of the network is updated only from a reconstruction loss that is typically a pixel-wise cross-entropy between the input and output images. Such a loss often causes the generated images to be blurry, which has a negative effect on the proposed approach. Similarly to AAEs, PixelGAN autoencoders [45] introduce an adversarial component to impose a prior distribution on the latent code, but the architecture is significantly different, since it is conditioned on the latent code.\nSimilarly to [43, 11], we add an adversarial training criterion to match the output of the decoder with the distribution of real data. This allows us to reduce blurriness and to add more local details to the generated images. Moreover, we also combine the adversarial training criterion with AAEs, which results in having two adversarial losses: one to impose a prior on the latent space distribution, and a second one to impose a prior on the output distribution.\nOur full objective consists of three terms. First, we use an adversarial loss for matching the distribution of the latent space with the prior distribution, which is a normal with zero mean and unit standard deviation, N(0, 1). 
Second, we use an adversarial loss for matching the distribution of the images decoded from z with the known training data distribution. Third, we use an autoencoder loss between the decoded images and the input images. Figure 3 shows the architecture configuration.\n\n4.1 Adversarial losses\n\nFor the discriminator Dz, we use the following adversarial loss:\n\nLadv−dz(x, g, Dz) = E[log(Dz(N(0, 1)))] + E[log(1 − Dz(g(x)))] ,   (8)\n\nwhere the encoder g tries to encode x to a z with a distribution close to N(0, 1). Dz aims to distinguish between the encoding produced by g and the prior normal distribution. Hence, g tries to minimize this objective against an adversary Dz that tries to maximize it.\nSimilarly, we add the adversarial loss for the discriminator Dx:\n\nLadv−dx(x, Dx, f) = E[log(Dx(x))] + E[log(1 − Dx(f(N(0, 1))))] ,   (9)\n\nwhere the decoder f tries to generate x from a normal distribution N(0, 1), in such a way that x looks as if it was sampled from the real distribution. Dx aims to distinguish between the decoding generated by f and the real data points x. 
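As a minimal numerical sketch of the loss in (8) (a fixed toy logistic discriminator on scalar codes; in the paper Dz is a small fully connected network, and all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy discriminator D_z on scalar codes: a fixed logistic score (illustrative only).
def D_z(v):
    return sigmoid(1.5 * v)

z_prior = rng.normal(size=1000)                 # samples from the prior N(0, 1)
z_encoded = 0.2 + 1.3 * rng.normal(size=1000)   # stand-in for g(x): a mismatched posterior

# L_adv-dz(x, g, D_z) = E[log D_z(N(0,1))] + E[log(1 - D_z(g(x)))], as in (8)
loss = np.mean(np.log(D_z(z_prior))) + np.mean(np.log(1.0 - D_z(z_encoded)))
print(loss < 0.0)  # True: both expectations are logs of probabilities
```

During training, the discriminator's weights are updated to push this objective up, while the encoder's weights are updated to push it down.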
Hence, f tries to minimize this objective against an adversary Dx that tries to maximize it.\n\n4.2 Autoencoder loss\n\nWe also jointly optimize the encoder g and the decoder f so that we minimize the reconstruction error for the input x that belongs to the known data distribution:\n\nLerror(x, g, f) = −Ez[log(p(f(g(x))|x))] ,   (10)\n\nwhere Lerror is minus the expected log-likelihood, i.e., the reconstruction error. This loss does not have an adversarial component, but it is essential to train an autoencoder. By minimizing this loss we encourage g and f to better approximate the real manifold.\n\n4.3 Full objective\n\nThe combination of all the previous losses gives\n\nL(x, g, Dz, Dx, f) = Ladv−dz(x, g, Dz) + Ladv−dx(x, Dx, f) + λLerror(x, g, f) ,   (11)\n\nwhere λ is a parameter that strikes a balance between the reconstruction and the other losses. The autoencoder network is obtained by minimizing (11), giving:\n\nĝ, f̂ = arg min_(g,f) max_(Dx,Dz) L(x, g, Dz, Dx, f) .   (12)\n\nThe model is trained using stochastic gradient descent by alternating updates of each component as follows:\n\n• Maximize Ladv−dx by updating the weights of Dx;\n• Minimize Ladv−dx by updating the weights of f;\n• Maximize Ladv−dz by updating the weights of Dz;\n• Minimize Lerror and Ladv−dz by updating the weights of g and f.\n\n5 Implementation Details and Complexity\n\nAfter learning the encoder and decoder networks, by mapping the training set onto the latent space through g, we fit a generalized Gaussian distribution to the data and estimate pZ(z). In addition, by histogramming the quantities ‖U⊥⊤(x − x∥)‖ we estimate p‖W⊥‖(‖w⊥‖). The entire training procedure takes about one hour on a high-end PC with one NVIDIA TITAN X.\nWhen a sample is tested, the procedure entails mainly computing a derivative, i.e. 
the Jacobi matrix Jf, with a subsequent SVD. Jf is computed numerically around the test sample representation z̄, and takes approximately 20.4 ms for an individual sample and 0.55 ms if computed as part of a batch of size 512, while the SVD takes approximately 4.0 ms.\n\nTable 1: F1 scores on MNIST [37]. Inliers are taken to be images of one category, and outliers are randomly chosen from other categories.\n\n% of outliers | D(R(X)) [11] | D(X) [11] | LOF [22] | DRAE [7] | GPND (Ours)\n10 | 0.97 | 0.93 | 0.92 | 0.95 | 0.983\n20 | 0.92 | 0.90 | 0.83 | 0.91 | 0.971\n30 | 0.92 | 0.87 | 0.72 | 0.88 | 0.961\n40 | 0.91 | 0.84 | 0.65 | 0.82 | 0.950\n50 | 0.88 | 0.82 | 0.55 | 0.73 | 0.939\n\n6 Experiments\n\nWe evaluate our novelty detection approach, which we call Generative Probabilistic Novelty Detection (GPND), against several state-of-the-art approaches and with several performance measures. We use the F1 measure, the area under the ROC curve (AUROC), the FPR at 95% TPR (i.e., the probability of an outlier to be misclassified as an inlier), the Detection Error (i.e., the misclassification probability when TPR is 95%), and the area under the precision-recall curve (AUPR) when inliers (AUPR-In) or outliers (AUPR-Out) are specified as positives. All reported results are from our publicly available implementation1, based on the deep machine learning framework PyTorch [46]. An overview of the architecture is provided in Figure 3.\n\n6.1 Datasets\n\nWe evaluate GPND on the following datasets.\nMNIST [37] contains 70,000 handwritten digits from 0 to 9. Each of the ten categories is used as the inlier class, and the rest of the categories are used as outliers.\nThe Coil-100 dataset [47] contains 7,200 images of 100 different objects. Each object has 72 images taken at pose intervals of 5 degrees. We downscale the images to size 32 × 32. We randomly take n categories, where n ∈ {1, 4, 7}, and randomly sample from the rest of the categories for outliers. 
We repeat this procedure 30 times.\nFashion-MNIST [48] is a new dataset comprising 28 × 28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images and the test set has 10,000 images. Fashion-MNIST shares the same image size, data format, and structure of training and testing splits with the original MNIST.\nOthers. We compare GPND with ODIN [36] using their protocol. For inliers, we use samples from CIFAR-10 (CIFAR-100) [49], which is a publicly available dataset of small images of size 32 × 32, each labeled with one of 10 (100) classes. Each class is represented by 6,000 (600) images, for a total of 60,000 samples. For outliers, we use samples from TinyImageNet [50], LSUN [51], and iSUN [52]. For more details please refer to [36]. We reuse the prepared datasets of outliers provided by the ODIN GitHub project page.\n\n6.2 Results\n\nMNIST dataset. We follow the protocol described in [11, 7] with some differences discussed below. Results are averages from a 5-fold cross-validation. Each fold takes 20% of each class. 60% of each class is used for training, 20% for validation, and 20% for testing. Once pX(x̄) is computed for each validation sample, we search for the γ that gives the highest F1 measure. For each digit class, we train the proposed model and simulate outliers as randomly sampled images from other categories, in proportions from 10% to 50%. Results for D(R(X)) and D(X) reported in [11] correspond to the protocol for which data is not split into separate training, validation and testing sets, meaning that the same inliers used for training were also used for testing. We diverge from this protocol and do not reuse the same inliers for training and testing. We follow the 60%/20%/20% splits for training, validation and testing. 
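The validation-time search for the threshold γ described above can be sketched as a plain grid search (synthetic scores standing in for log pX(x̄); not the released implementation):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-ins for the scores of validation samples: inliers tend to score higher.
scores_in = rng.normal(1.0, 1.0, size=500)     # validation inliers (positives)
scores_out = rng.normal(-2.0, 1.0, size=100)   # simulated outliers

scores = np.concatenate([scores_in, scores_out])
labels = np.concatenate([np.ones(500, bool), np.zeros(100, bool)])

def f1_at(gamma):
    pred = scores >= gamma                     # predicted inlier, per the test in (5)
    tp = np.sum(pred & labels)
    if tp == 0:
        return 0.0
    precision = tp / np.sum(pred)
    recall = tp / np.sum(labels)
    return 2 * precision * recall / (precision + recall)

# Scan every observed score as a candidate threshold and keep the F1-maximizing one.
best_gamma = max(np.unique(scores), key=f1_at)
print(f1_at(best_gamma) > 0.9)  # True for these well-separated synthetic scores
```

The selected γ is then frozen and applied unchanged to the held-out test split.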
This makes our testing harder, but more realistic, while we still compare our numbers against those obtained by others with easier settings. Results on the MNIST dataset are shown in Table 1 and Figure 4, where we compare with [11, 22, 7].\n\n1 https://github.com/podgorskiy/GPND\n\nTable 2: Results on Coil-100. Inliers are taken to be images of one, four, or seven randomly chosen categories, and outliers are randomly chosen from other categories (at most one from each category).\n\nMetric | OutRank [55, 56] | CoP [57] | REAPER [58] | OutlierPursuit [59] | LRR [60] | DPCP [61] | ℓ1 thresholding [25] | R-graph [8] | Ours\nInliers: one category of images, Outliers: 50%\nAUC | 0.836 | 0.843 | 0.900 | 0.908 | 0.847 | 0.900 | 0.991 | 0.997 | 0.968\nF1 | 0.862 | 0.866 | 0.892 | 0.902 | 0.872 | 0.882 | 0.978 | 0.990 | 0.979\nInliers: four categories of images, Outliers: 25%\nAUC | 0.613 | 0.628 | 0.877 | 0.837 | 0.687 | 0.859 | 0.992 | 0.996 | 0.945\nF1 | 0.491 | 0.500 | 0.703 | 0.686 | 0.541 | 0.684 | 0.941 | 0.970 | 0.960\nInliers: seven categories of images, Outliers: 15%\nAUC | 0.570 | 0.580 | 0.824 | 0.822 | 0.628 | 0.804 | 0.991 | 0.996 | 0.919\nF1 | 0.342 | 0.346 | 0.541 | 0.528 | 0.366 | 0.511 | 0.897 | 0.955 | 0.941\n\nTable 3: Results on Fashion-MNIST [48]. Inliers are taken to be images of one category, and outliers are randomly chosen from other categories.\n\n% of outliers | 10 | 20 | 30 | 40 | 50\nF1 | 0.968 | 0.945 | 0.917 | 0.891 | 0.864\nAUC | 0.928 | 0.932 | 0.933 | 0.933 | 0.933\n\nFigure 4: Results on the MNIST [37] dataset.\n\nFigure 5: Ablation study. Comparison on MNIST of the model components of GPND.\n\nCoil-100 dataset. We follow the protocol described in [8] with some differences discussed below. Results are averages from 5-fold cross-validation. Each fold takes 20% of each class. Because the count of samples per category is very small, we use 80% of each class for training, and 20% for testing. 
We find the optimal threshold γ on the training set. Results reported in [8] correspond to not splitting data into separate training, validation and testing sets, because it is not essential in their case, since they leverage a VGG [53] network pretrained on ImageNet [54]. We diverge from that protocol, do not reuse inliers, and follow 80%/20% splits for training and testing.\nResults on Coil-100 are shown in Table 2. We do not outperform R-graph [8]; however, as mentioned before, R-graph uses a pretrained VGG network, while we train an autoencoder from scratch on a very limited number of samples, which is on average only 70 per category.\nFashion-MNIST dataset. We repeat the same experiment with the same protocol that we have used for MNIST, but on Fashion-MNIST. Results are provided in Table 3.\nCIFAR-10 (CIFAR-100) dataset. We follow the protocol described in [36], where different datasets are used for inliers and outliers. ODIN relies on a pretrained classifier and thus requires label information provided with the training samples, while our approach does not use label information. The results are reported in Table 4. Despite the fact that ODIN relies upon powerful classifier networks such as Dense-BC and WRN with more than 100 layers, the much smaller network of GPND competes well with ODIN. Note that for CIFAR-100, GPND significantly outperforms both ODIN architectures. We think this might be due to the fact that ODIN relies on the perturbation of the network classifier output, which becomes less accurate as the number of classes grows from 10 to 100. 
On the other hand, GPND does not use class label information and copes much better with the additional complexity induced by the increased number of classes.

[Figure 4 plots F1-score versus percentage of outliers (10 to 50%) for GPND (Ours), D(R(X)), D(X), LOF, and DRAE; Figure 5 plots the same for GPND complete, Residual component only, Parallel component only, and P(z) only.]

Table 4: Comparison with ODIN [36]. ↑ indicates larger value is better, and ↓ indicates lower value is better. Each entry reports ODIN-WRN-28-10 / ODIN-Dense-BC / GPND.

CIFAR-10
Outlier dataset         FPR(95%TPR)↓     Detection↓       AUROC↑              AUPR in↑            AUPR out↑
TinyImageNet (crop)     23.4/4.3/29.1    14.2/4.7/15.7    94.2/99.1/90.1      90.8/97.1/89.1      94.7/99.1/99.5
TinyImageNet (resize)   25.5/7.5/11.8    15.2/6.3/8.3     92.1/98.5/96.5      84.0/90.7/95.9      93.6/98.5/99.8
LSUN (resize)           17.6/3.8/4.9     11.3/4.4/4.9     95.4/99.2/98.7      86.0/91.5/98.3      96.1/99.2/99.7
iSUN                    21.3/6.3/11.0    13.2/5.7/7.8     93.7/98.8/96.9      85.6/90.1/96.2      94.9/98.8/99.7
Uniform                 0.0/0.0/0.0      2.5/2.5/0.1      100.0/99.9/99.9     99.1/99.5/100.0     100.0/99.9/99.5
Gaussian                0.0/0.0/0.0      2.5/2.5/0.0      100.0/100.0/100.0   98.5/99.6/100.0     100.0/100.0/99.8

CIFAR-100
TinyImageNet (crop)     43.9/17.3/33.2   24.4/11.2/17.2   92.8/99.1/84.1      91.4/97.4/83.8      90.0/96.8/98.7
TinyImageNet (resize)   55.9/44.3/15.0   30.4/24.6/9.5    89.0/98.6/95.0      82.8/91.4/94.6      84.4/90.1/99.4
LSUN (resize)           56.5/44.0/6.8    30.8/24.5/5.8    93.8/99.3/98.4      86.2/92.4/98.0      84.9/90.6/99.6
iSUN                    57.3/49.5/14.3   31.1/27.2/9.3    91.2/98.9/96.1      85.9/91.1/95.6      84.8/88.9/99.3
Uniform                 0.1/0.5/0.0      2.5/2.8/0.0      100.0/100.0/100.0   99.4/99.6/100.0     97.5/99.0/99.7
Gaussian                1.0/0.2/0.0      3.0/2.6/0.0      100.0/100.0/100.0   99.1/99.7/100.0     95.9/99.1/100.0

Table 5: Comparison with baselines. All values are percentages.
\u2191 indicates larger value is better, and\n\u2193 indicates lower value is better.\n\n10% 20% 30% 40% 50% 10% 20% 30% 40% 50% 10% 20% 30% 40% 50%\n\nF1\u2191\n96.1\n79.5\n94.2\n94.0\n\n97.1\n95.0\n79.6\n77.6\n95.8\n92.4\n95.5\n92.0\nDetection error\u2193\n5.9\n5.8\n12.0\n11.4\n9.7\n9.7\n9.3\n9.8\n\n5.8\n11.6\n9.7\n9.5\n\nAUROC\u2191\n\n93.9\n75.6\n90.5\n90.2\n\n6.0\n12.2\n9.5\n9.8\n\n98.1\n93.4\n95.2\n95.2\n\n99.7\n98.9\n99.3\n99.2\n\n98.0\n93.8\n95.7\n95.6\n\n99.4\n97.8\n98.7\n98.6\n\n98.0\n93.4\n95.6\n95.3\n\nAUPR in\u2191\n\n99.1\n95.8\n97.8\n97.4\n\n98.0\n92.9\n95.8\n95.2\n\n98.6\n93.2\n96.7\n96.0\n\n98.0\n92.8\n95.9\n95.3\n\n98.0\n90.0\n95.6\n94.3\n\n8.1\n24.3\n18.8\n20.7\n\n86.3\n78.0\n81.7\n79.3\n\nFPR(95%TPR)\u2193\n8.8\n9.1\n23.9\n24.6\n18.0\n17.3\n18.9\n19.3\n\n8.7\n24.7\n17.4\n19.0\n\nAUPR out\u2191\n\n92.2\n86.0\n89.2\n87.7\n\n95.0\n89.7\n92.5\n91.5\n\n96.5\n92.0\n94.6\n93.7\n\n8.9\n23.7\n17.0\n18.6\n\n97.5\n94.0\n96.3\n95.4\n\nGPND\nAE\nP-VAE\nP-AAE\n\nGPND\nAE\nP-VAE\nP-AAE\n\n98.2\n84.8\n97.6\n97.3\n\n5.4\n11.4\n9.8\n9.4\n\n6.3 Ablation\n\nTable 5 compares GPND with some baselines to better appreciate the improvement provided by the\narchitectural choices. 
The baselines are: i) vanilla AE with thresholding of the reconstruction error and the same pipeline (AE); ii) the proposed approach where the AAE is replaced by a VAE (P-VAE); iii) the proposed approach where the AAE is without the additional adversarial component induced by the discriminator applied to the decoded image (P-AAE).

To motivate the importance of each component of pX(x̄) in (5), we repeat the experiment with MNIST under the following conditions: a) GPND Complete is the unmodified approach, where pX(x̄) is computed as in (5); b) Parallel component only drops pW⊥ and assumes pX(x̄) = pW∥(w̄∥); c) Perpendicular component only drops pW∥ and assumes pX(x̄) = pW⊥(w̄⊥); d) pZ(z) only drops also |det S−1| and assumes pX(x̄) = pZ(z). The results are shown in Figure 5. It can be noticed that the scaling factor |det S−1| plays an essential role in the Parallel component only, and that the Parallel component only and the Perpendicular component only play an essential role in composing the GPND Complete model.

Additional implementation details include the choice of hyperparameters. For MNIST and COIL-100 the latent space size was chosen to maximize F1 on the validation set; it is 16, and we varied it from 16 to 64 without significant performance change. For CIFAR-10 and CIFAR-100, the latent space size was set to 256. The hyperparameters of all losses are one, except for Lerror and Ladv−dz when optimizing for Dz, which are equal to 2.0. For CIFAR-10 and CIFAR-100, the hyperparameter of Lerror is 10.0. We use the Adam optimizer with learning rate of 0.002, batch size of 128, and 80 epochs.

7 Conclusion

We introduced GPND, an approach and a network architecture for novelty detection that is based on learning mappings f and g that define the parameterized manifold M, which captures the underlying structure of the inlier distribution.
Unlike prior deep learning based methods, GPND detects that a given sample is an outlier by evaluating the probability that it was generated by the inlier distribution. We have shown how each architectural and model component is essential to novelty detection. In addition, with a relatively simple architecture we have shown how GPND provides state-of-the-art performance using different measures, different datasets, and different protocols, demonstrating that it also compares favorably with the out-of-distribution detection literature.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. IIS-1761792.

References

[1] Z. Ge, Z. Song, and F. Gao. Review of recent research on data-based process monitoring. Ind. Eng. Chem. Res., 52(10):3543–3562, 2013.

[2] Thomas Schlegl, Philipp Seeböck, Sebastian M. Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In Marc Niethammer, Martin Styner, Stephen Aylward, Hongtu Zhu, Ipek Oguz, Pew-Thian Yap, and Dinggang Shen, editors, Information Processing in Medical Imaging, pages 146–157, Cham, 2017.

[3] Artur Kadurin, Sergey Nikolenko, Kuzma Khrabrov, Alex Aliper, and Alex Zhavoronkov. druGAN: An advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico. Molecular Pharmaceutics, 14(9):3098–3104, 2017. PMID: 28703000.

[4] Yang Cong, Junsong Yuan, and Ji Liu. Sparse reconstruction cost for abnormal event detection. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3449–3456. IEEE, 2011.

[5] W. Li, V. Mahadevan, and N. Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1):18–32, 2014.

[6] M. Sabokrou, M. Fayyaz, M. Fathy, and R. Klette.
Deep-cascade: Cascading 3d deep neural networks for fast anomaly detection and localization in crowded scenes. IEEE Transactions on Image Processing, 26(4):1992–2004, 2017.

[7] Yan Xia, Xudong Cao, Fang Wen, Gang Hua, and Jian Sun. Learning discriminative reconstructions for unsupervised outlier removal. In Proceedings of the IEEE International Conference on Computer Vision, pages 1511–1519, 2015.

[8] Chong You, Daniel P Robinson, and René Vidal. Provable self-representation based outlier detection in a union of subspaces. arXiv preprint arXiv:1704.03925, 2017.

[9] Marco A.F. Pimentel, David A. Clifton, Lei Clifton, and Lionel Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, 2014.

[10] Mahdyar Ravanbakhsh, Moin Nabi, Enver Sangineto, Lucio Marcenaro, Carlo Regazzoni, and Nicu Sebe. Abnormal event detection in videos using generative adversarial nets. arXiv preprint arXiv:1708.09644, 2017.

[11] Mohammad Sabokrou, Mohammad Khalooei, Mahmood Fathy, and Ehsan Adeli. Adversarially learned one-class classifier for novelty detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3379–3388, 2018.

[12] Shehroz S. Khan and Michael G. Madden. One-class classification: taxonomy of study and review of techniques. The Knowledge Engineering Review, 29(3):345–374, 2014.

[13] M. Sabokrou, M. Fathy, and M. Hoseini. Video anomaly detection and localisation based on the sparsity and reconstruction error of auto-encoder. Electronics Letters, 52(13):1122–1124, 2016.

[14] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

[15] JooSeuk Kim and Clayton D Scott. Robust kernel density estimation. Journal of Machine Learning Research, 13(Sep):2529–2565, 2012.

[16] Eleazar Eskin.
Anomaly detection over noisy data using learned probability distributions. In Proceedings of the International Conference on Machine Learning. Citeseer, 2000.

[17] Ville Hautamaki, Ismo Karkkainen, and Pasi Franti. Outlier detection using k-nearest neighbour graph. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 3, pages 430–433. IEEE, 2004.

[18] Vic Barnett and Toby Lewis. Outliers in statistical data. Wiley, 1974.

[19] Kenji Yamanishi, Jun-Ichi Takeuchi, Graham Williams, and Peter Milne. On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Mining and Knowledge Discovery, 8(3):275–300, 2004.

[20] Edwin M Knorr, Raymond T Ng, and Vladimir Tucakov. Distance-based outliers: algorithms and applications. The VLDB Journal: The International Journal on Very Large Data Bases, 8(3-4):237–253, 2000.

[21] Eleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy, and Sal Stolfo. A geometric framework for unsupervised anomaly detection. In Applications of data mining in computer security, pages 77–101. Springer, 2002.

[22] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. LOF: identifying density-based local outliers. In ACM SIGMOD Record, volume 29, pages 93–104. ACM, 2000.

[23] Paul Bodesheim, Alexander Freytag, Erik Rodner, Michael Kemmler, and Joachim Denzler. Kernel null space methods for novelty detection. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3374–3381. IEEE, 2013.

[24] Juncheng Liu, Zhouhui Lian, Yi Wang, and Jianguo Xiao. Incremental kernel null space discriminant analysis for novelty detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 792–800, 2017.

[25] Mahdi Soltanolkotabi, Emmanuel J Candes, et al.
A geometric analysis of subspace clustering with outliers. The Annals of Statistics, 40(4):2195–2238, 2012.

[26] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 733–742. IEEE, 2016.

[27] Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe. Learning deep representations of appearance and motion for anomalous event detection. arXiv preprint arXiv:1510.01553, 2015.

[28] Huan-gang Wang, Xin Li, and Tao Zhang. Generative adversarial network based novelty detection using minimized reconstruction error. Frontiers of Information Technology & Electronic Engineering, 19(1):116–125, 2018.

[29] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[30] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[31] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.

[32] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.

[33] A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NIPS, pages 5574–5584, 2017.

[34] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR, 2017.

[35] Terrance DeVries and Graham W Taylor.
Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865, 2018.

[36] Shiyu Liang, Yixuan Li, and R Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In ICLR, 2018.

[37] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[38] Nathalie Japkowicz, Catherine Myers, Mark Gluck, et al. A novelty detection approach to classification. In IJCAI, volume 1, pages 518–523, 1995.

[39] Larry Manevitz and Malik Yousef. One-class document classification via neural networks. Neurocomputing, 70(7-9):1466–1481, 2007.

[40] Mayu Sakurada and Takehisa Yairi. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, page 4. ACM, 2014.

[41] Hervé Bourlard and Yves Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291–294, 1988.

[42] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.

[43] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.

[44] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[45] Alireza Makhzani and Brendan J Frey. PixelGAN autoencoders. In Advances in Neural Information Processing Systems, pages 1972–1982, 2017.

[46] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch.
In NIPS-W, 2017.

[47] Sameer A Nene, Shree K Nayar, Hiroshi Murase, et al. Columbia object image library (COIL-20). Technical report, 1996.

[48] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[49] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[50] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

[51] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

[52] Pingmei Xu, Krista A Ehinger, Yinda Zhang, Adam Finkelstein, Sanjeev R Kulkarni, and Jianxiong Xiao. Turkergaze: Crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755, 2015.

[53] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[54] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[55] HDK Moonesinghe and Pang-Ning Tan. Outlier detection using random walks. In Tools with Artificial Intelligence, 2006. ICTAI'06. 18th IEEE International Conference on, pages 532–539. IEEE, 2006.

[56] HDK Moonesinghe and Pang-Ning Tan. Outrank: a graph-based outlier detection framework using random walk. International Journal on Artificial Intelligence Tools, 17(01):19–36, 2008.

[57] Mostafa Rahmani and George K Atia.
Coherence pursuit: Fast, simple, and robust principal component analysis. IEEE Transactions on Signal Processing, 65(23):6260–6275, 2016.

[58] Gilad Lerman, Michael B McCoy, Joel A Tropp, and Teng Zhang. Robust computation of linear models by convex relaxation. Foundations of Computational Mathematics, 15(2):363–410, 2015.

[59] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust pca via outlier pursuit. In Advances in Neural Information Processing Systems, pages 2496–2504, 2010.

[60] Guangcan Liu, Zhouchen Lin, and Yong Yu. Robust subspace segmentation by low-rank representation. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 663–670, 2010.

[61] Manolis C Tsakiris and René Vidal. Dual principal component pursuit. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 10–18, 2015.