NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at Vancouver Convention Center
Paper ID 8301: Likelihood Ratios for Out-of-Distribution Detection

### Reviewer 1

- When comparing the distributions of log-likelihoods for in-distribution vs. out-of-distribution inputs (Fig. 1a and line 94), the authors state that they "largely overlap". Can they be more specific, i.e. report the difference in means, or compute a p-value using e.g. a Wilcoxon rank-sum test, rather than simply saying the distributions "largely overlap"?
- The authors assume that (i) each dimension of the input space can be described as either background or semantic, and (ii) these components are independent (eq. 1). How true is this in practice? One could imagine a background effect such as darkening an image, in which case the probability of observing an input depends on an interaction between the semantic and background components; similarly, when classifying bacterial sequences, the GC content of a sequence is itself a function of the semantic component. Can the authors demonstrate that this assumption holds?
- The LLR as defined in equation (5) depends only on the semantic features. How are these identified in practice on the test set, given that, as the authors note, z is unknown a priori? The independence of semantic and background features seems to be a key component of the model, and the likelihood ratio should be computed on the semantic elements alone, yet it is not obvious how these are identified in practice. Or is the full feature set used, with the approximate equality assumed to hold?
- Do the authors have an explanation for the AUROC being significantly worse than random on the Fashion-MNIST dataset?
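To make the comparison I am requesting concrete, here is a minimal sketch of the kind of test I have in mind. All log-likelihood values are synthetic placeholders, not outputs of the paper's models, and the rank-sum test uses the standard normal approximation (no tie correction, which is fine for continuous log-likelihoods) so the snippet is dependency-free:

```python
# Hypothetical check: Wilcoxon rank-sum test on in-distribution vs. OOD
# log-likelihoods. The numbers below are synthetic illustrations only.
import math
import random

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation
    (ignores ties, which are improbable for continuous values)."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    # Rank sum of the first sample; ranks are 1-based positions in the pool.
    r1 = sum(i + 1 for i, (_, grp) in enumerate(pooled) if grp == 0)
    mu = n1 * (n1 + n2 + 1) / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (r1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p

random.seed(0)
ll_in = [random.gauss(-700.0, 30.0) for _ in range(500)]   # in-dist log p(x)
ll_ood = [random.gauss(-690.0, 30.0) for _ in range(500)]  # OOD log p(x)

z, p = rank_sum_test(ll_in, ll_ood)
mean_diff = sum(ll_in) / len(ll_in) - sum(ll_ood) / len(ll_ood)
print("difference in means: %.1f nats" % mean_diff)
print("rank-sum z = %.2f, two-sided p = %.3g" % (z, p))
```

Reporting the mean difference together with such a p-value would make the "largely overlap" statement precise: two distributions can overlap heavily while still being significantly shifted.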
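On the eq. (5) question: if my reading is correct, the intended argument is that no explicit identification of the semantic dimensions is needed at test time, because background terms cancel in the ratio when the full and background models agree on background dimensions. It would help if the authors confirmed this. A toy numeric sketch of the cancellation (all numbers and the semantic/background labels are made up for illustration):

```python
# Toy illustration: if the background model matches the full model on
# background dimensions, those terms cancel, so the LLR over all
# dimensions equals the LLR over semantic dimensions alone.
semantic = [0, 3, 5]                                    # hypothetical semantic dims
lp_full = [-1.2, -0.8, -0.5, -2.0, -0.9, -1.5]          # per-dim log p_theta(x_d)
lp_bg = list(lp_full)                                   # background model log p_theta0(x_d)
for d in semantic:                                      # the two models differ only
    lp_bg[d] = lp_full[d] - 0.7                         # on the semantic dimensions

llr_full = sum(lp_full) - sum(lp_bg)                    # LLR over all dimensions
llr_sem = sum(lp_full[d] - lp_bg[d] for d in semantic)  # LLR over semantic dims only
print(llr_full, llr_sem)                                # equal: background cancels
```

Under this reading, the "approximate equality" holds exactly to the extent that the background model reproduces the full model on background dimensions, which is itself an empirical question the authors could quantify.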