NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
## After rebuttal
The authors addressed my questions well. I think the fact that even inliers do not have a unified learning target in AE-based methods is the key reason why AE-based methods fail. It would be nice to verify this empirically; for example, the authors could run an experiment similar to the one in Figure 2. In any case, I believe this paper makes a contribution to the community. I will raise my score to 7.
----------------------------------------------------------------------
Strengths:
- I like the idea of creating high-level supervision from unlabeled data rather than relying only on low-level, pixel-wise supervision as in the autoencoder. Surprisingly, the algorithm greatly improves outlier detection performance compared to multiple AE-based methods.

Questions:
- The key point of this method is to create pseudo-labels for the unlabeled data, which augments the training data by default. When training the AE-based methods, did the authors use the same kind of data augmentation, and what is the latent dimension of the AE? In my opinion, a fair comparison requires applying the same data augmentation to the AE-based methods and keeping the AE latent dimension smaller than K, the number of different operations, because network capacity matters a great deal when using an AE to detect outliers. If the network capacity were infinite and the latent dimension equaled the input dimension, the model would perfectly memorize all the data and lose any ability to detect outliers. Using data augmentation and choosing a small latent dimension might therefore improve outlier detection simply by leaving the network little spare capacity to fit the outliers. Of course the proposed method uses extra high-level information, but I am curious whether it is the pseudo-labels that do the work, or whether the data augmentation / limited latent dimension helps more (see the AE sketch after this review).
- What is the point of all the derivations from line 155 to line 180? With a randomly initialized network, isn't it obvious that the inliers will dominate the gradients since they have more samples? These derivations make things look complicated without adding intuition or information. I understand the authors are trying to be rigorous, but the derivation requires strong, impractical assumptions and the conclusion is trivial. What I care more about is whether the gradients from the inliers or from the outliers dominate as training proceeds. My intuition is that the inliers dominate in the early stage, as the authors suggest. However, as training continues, each inlier sample will have a very small gradient. If the network capacity is very large, the network will also try to fit the outliers; otherwise the gradient of each inlier sample will be much smaller than that of each outlier sample, even though the gradient over ALL inlier samples may still be larger than that over all outlier samples. Either a rigorous derivation or an empirical validation would be far more interesting than the trivial derivations from line 155 to 180 (see the gradient-tracking sketch after this review).
- The analysis between lines 181 and 197 is essentially the same argument as the analysis between lines 155 and 180. Suppose you have two vectors g_{in} and g_{out}, where the magnitude of g_{in} is much larger than that of g_{out}, and let g_{sum} = g_{in} + g_{out}.
Of course the angle between g_{in} and g_{sum} is then much smaller, so g_{sum} has a larger projection in the direction of g_{in} than in the direction of g_{out}. I am glad the authors validate this empirically in Figure 2, which confirms the intuition.
- It seems the choice of the operation set is also important. As the authors mentioned, a digit "8" will still be an "8" after a flip operation. The authors argued this causes a misclassification; although I agree, I think such cases also harm training. Ideally, for a digit "8" the network would assign probability 0.5 to FLIP and 0.5 to NOT FLIP, which would make every "8" look like an outlier based on this operation alone. The model can still remedy the problem through the other operations, but this is an example of how certain operations may make things worse. A good way to check this would be to train another model without the flip operation on digit "8" and see whether performance increases. It would also be nice to provide general guidance on how to select the operations.
- How did the authors deal with regions of the transformed image that have no correspondence in the original image? For example, when shifting the image one step to the left, how is the rightmost column of the transformed image filled? If that column is simply filled with zeros, the network can easily distinguish the shift for both inlier and outlier samples without learning any high-level representation (see the shift example after this review).
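Regarding the first question, a minimal sketch of the controlled AE baseline I have in mind follows, assuming PyTorch and 28x28 grayscale inputs (MNIST-like); the names `ConvAE` and `latent_dim` are hypothetical and not taken from the paper. The intent is to train this model on exactly the same transformed copies of the data used for the surrogate task while sweeping `latent_dim` around K.

```python
# Hypothetical sketch: an AE baseline trained on the SAME augmented data,
# with a bottleneck latent dimension that can be set smaller than K.
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    """Small convolutional autoencoder with a configurable bottleneck."""
    def __init__(self, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, latent_dim),                     # bottleneck (< K)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 7 * 7), nn.ReLU(),
            nn.Unflatten(1, (32, 7, 7)),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_scores(model, x):
    """Per-sample reconstruction error, used as the outlier score."""
    with torch.no_grad():
        return (model(x) - x).pow(2).flatten(1).mean(dim=1)  # higher = more outlying
```

Comparing this baseline with and without the transformation-based augmentation, and across latent dimensions above and below K, would isolate whether the pseudo-labels or the augmentation / restricted capacity is responsible for the gain.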
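Regarding the second question, the empirical check could be as simple as the following sketch, again assuming PyTorch; `model`, `loss_fn`, and the inlier/outlier splits are hypothetical placeholders.

```python
# Hypothetical sketch: track how the average per-sample gradient norm of
# inliers vs. outliers evolves over training, to see which group dominates.
import torch

def mean_grad_norm(model, loss_fn, xs, ys):
    """Average L2 norm of the per-sample parameter gradient."""
    norms = []
    params = [p for p in model.parameters() if p.requires_grad]
    for x, y in zip(xs, ys):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        norms.append(torch.sqrt(sum(g.pow(2).sum() for g in grads)).item())
    return sum(norms) / max(len(norms), 1)

# At the end of each epoch (pseudo-usage):
# inlier_norm  = mean_grad_norm(model, loss_fn, x_inliers,  y_inliers)
# outlier_norm = mean_grad_norm(model, loss_fn, x_outliers, y_outliers)
# Logging the ratio inlier_norm / outlier_norm over epochs would show whether
# per-sample outlier gradients eventually overtake per-sample inlier gradients.
```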
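Regarding the last question, a tiny NumPy illustration of the concern: a zero-filled shift introduces an all-zero column that is a trivial low-level cue for the transformation classifier, whereas a circular shift leaves no such artifact.

```python
# Tiny illustration: zero-filled shift vs. circular shift by one column.
import numpy as np

img = np.random.rand(28, 28)

# Shift one step to the left, filling the now-empty rightmost column with zeros.
shift_zero = np.zeros_like(img)
shift_zero[:, :-1] = img[:, 1:]

# Circular shift: the leftmost column wraps around to the right side.
shift_wrap = np.roll(img, -1, axis=1)

print(shift_zero[:, -1].sum())  # exactly 0.0 -> an easy shortcut cue
print(shift_wrap[:, -1].sum())  # non-zero -> no trivial padding artifact
```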
Reviewer 2
Originality: The use of self-supervision for this particular application is novel as far as I know. The idea of incorporating the self-supervised transformations during inference (outlier scoring) adds further novelty.

Quality and Clarity: The motivation for inlier priority seems overly complicated. The result already seems intuitive (there is a lot of calculation just to conclude that the expected gradient norm is proportional to the inlier/outlier ratio), and it feels as if it could be motivated more simply, or the mathematical rigor could be left to the supplementary material. The paper is clear and introduces both outlier detection and self-supervision to the unfamiliar reader. However, I do feel the paper could cite more self-supervision papers (right now it cites only the two methods it uses directly) and should at least once use the popular name "self-supervision". I suggest this to bring the paper to the attention of that community, since knowing there is an application for their work could inspire them to use this benchmark; this would increase the mutual benefit between the two communities (self-supervision and outlier detection).

Significance: I think this is an interesting new application of self-supervision. The experimental results look compelling on established benchmarks (although I was not previously familiar with them). From my perspective (familiar with self-supervision but not outlier detection), this looks like a significant contribution, both for connecting the two areas and for the method itself.
Reviewer 3
In this paper, the authors provide a novel unsupervised outlier detection framework for images. They first propose a surrogate-supervision-based framework to learn a representation of each datum, and then give a theoretical explanation of why the proposed framework prioritizes the inliers in the data set. They further define several outlier scores based on the framework and conduct a series of experiments that verify the effectiveness of the proposed method and show substantial improvement over baselines.

Outlier detection is a classical research problem in unsupervised learning with many applications. The authors provide a novel methodology to attack the outlier detection problem for images. Unlike previous methods, which largely aim to learn a model/distribution from which normal data can be recovered/generated, the authors introduce surrogate supervision to learn the representation. I think the idea is brilliant. Although the transformations in this paper are image-specific, the approach is definitely inspiring for outlier detection methods in other fields such as text or social networks.

The theoretical analysis in Section 3.2 is insightful. It shows how the proposed network becomes dominated by inliers and, more importantly, the relation between how rare the outliers are and how much they disrupt training. I believe this part would be much more convincing if the authors could also provide an analysis for AE/CAE-based methods and make a comparison; this would give more insight into why the proposed method outperforms the baselines.

The authors conduct thorough experiments on multiple data sets, and the results seem promising. A minor concern is the setting of \rho. In many real scenarios of outlier detection, \rho could be a much smaller value such as 1% or 0.5%. Do the authors have experimental results for these configurations? (A sketch of such a low-\rho evaluation appears after this review.)

%==After rebuttal==%
Thanks to the authors for their response. Based on the reviews, I think the authors need to clarify the value of the analysis in Section 3.2, i.e. "the quantitative correlation between inliers/outliers' gradient magnitude and the inlier/outlier ratio". I still think it would be interesting to perform a similar analysis for other algorithms, even an extremely naive one. Overall, I maintain my positive opinion.
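Regarding the \rho question, a minimal sketch of how an evaluation pool with a much smaller contamination ratio (e.g., 1% or 0.5%) could be assembled, assuming NumPy and scikit-learn are available; `inlier_x`, `outlier_x`, and `score_fn` are hypothetical placeholders, not names from the paper.

```python
# Hypothetical sketch: build a test pool with contamination ratio rho and
# evaluate an outlier scorer with AUROC.
import numpy as np
from sklearn.metrics import roc_auc_score

def make_contaminated_pool(inlier_x, outlier_x, rho, seed=0):
    """Subsample outliers so that they form a fraction rho of the pool."""
    rng = np.random.default_rng(seed)
    n_in = len(inlier_x)
    n_out = int(round(rho * n_in / (1.0 - rho)))
    picked = rng.choice(len(outlier_x), size=n_out, replace=False)
    x = np.concatenate([inlier_x, outlier_x[picked]], axis=0)
    y = np.concatenate([np.zeros(n_in), np.ones(n_out)])  # 1 = outlier
    return x, y

# Usage (pseudo): for rho in (0.05, 0.01, 0.005):
#     x, y = make_contaminated_pool(inlier_x, outlier_x, rho)
#     print(rho, roc_auc_score(y, score_fn(x)))
```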