NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center

### Reviewer 1

This paper addresses an interesting problem and proposes a solution with demonstrated empirical advantages. However, the text of the paper is poorly executed. The notation is not clearly defined or explained: in line 75, $x^*$ and $\hat{\theta}$ are not defined, yet these variables are used throughout the rest of the paper, and I had to read [18] to understand the notation. In lines 74-80 the definition of a prior network is unclear, and in lines 196-199 the intuitive explanation for why prior networks are more robust to adversarial attacks is also unclear. This diminishes the quality of the paper as a standalone piece of work.

The main contribution of this work is the improved training criterion: in previous work, prior networks were trained under the forward KL divergence, while this paper proposes to use the reverse KL divergence instead, which yields empirical benefits in training. It is also shown empirically that these networks have better out-of-distribution detection performance and, in some cases, are more robust to adversarial attacks. However, on complex datasets like CIFAR-100 the improvement shown is only modest, so it would be nice to see the performance of these networks on more datasets (e.g., ImageNet).

---

In light of the author response I tend to keep my overall score (6).
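The asymmetry between the forward and reverse KL criteria the review discusses can be seen directly by evaluating the standard closed-form KL divergence between two Dirichlet distributions in both directions. This is a minimal sketch; the concentration values below are illustrative only and do not come from the paper:

```python
import numpy as np
from scipy.special import digamma, gammaln


def kl_dirichlet(alpha, beta):
    """Closed-form KL( Dir(alpha) || Dir(beta) )."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    a0 = alpha.sum()
    return (gammaln(a0) - gammaln(alpha).sum()
            - gammaln(beta.sum()) + gammaln(beta).sum()
            + ((alpha - beta) * (digamma(alpha) - digamma(a0))).sum())


# Illustrative concentrations: a sharp target Dirichlet on class 0
# vs. a flatter model Dirichlet.
target = np.array([100.0, 1.0, 1.0])
model = np.array([5.0, 5.0, 5.0])

forward = kl_dirichlet(target, model)  # KL(target || model), as in prior work
reverse = kl_dirichlet(model, target)  # KL(model || target), the proposed criterion
print(forward, reverse)                # the two directions differ
```

Because KL divergence is asymmetric, the two orderings penalize mismatches differently, which is the source of the mode-seeking vs. mean-seeking behavior Reviewer 3 points to below.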

### Reviewer 3

The authors present a novel algorithm with theoretical analysis and empirical results. We have a few comments and suggestions for the work:

The comparison of forward vs. reverse KL divergence as the objective criterion resembles the choice between the mode-seeking and mean-seeking forms of the objective in variational inference, respectively (in this case applied to a Dirichlet distribution). We recommend that the authors reference and draw connections to this related literature.

It would be great if the authors could expand upon the distinction between in-domain and out-of-domain training data in lines 105-106. How are these datasets created, and is the purpose of separating the data to improve generalization?

How is the optimization performed in practice? In the algorithm, the authors propose to set the in-domain \beta parameters to a large value of 1e2 and the out-of-domain parameters to a small value of 0. How sensitive are the results to these specific choices? The authors also note that the losses were equally weighted when using the forward KL divergence but had a large relative weighting \gamma when using the reverse loss. What criterion was used to optimally choose the \gamma parameter?

Lastly, we have a few minor suggestions for the text: using the conventional indicator notation instead of \mathcal{I} may be clearer, and defining all notation (e.g., \pi) in the main text would improve readability.
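The training setup the reviewer is asking about (in-domain targets with a large \beta of 1e2, flat out-of-domain targets with \beta = 0, and a relative weight \gamma on the out-of-domain term) can be sketched as a combined loss over Dirichlet KL terms. All concrete values here, including the \gamma and the model outputs, are hypothetical placeholders, not the paper's settings:

```python
import numpy as np
from scipy.special import digamma, gammaln


def kl_dirichlet(alpha, beta):
    """Closed-form KL( Dir(alpha) || Dir(beta) )."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    a0 = alpha.sum()
    return (gammaln(a0) - gammaln(alpha).sum()
            - gammaln(beta.sum()) + gammaln(beta).sum()
            + ((alpha - beta) * (digamma(alpha) - digamma(a0))).sum())


def target_alpha(num_classes, true_class=None, beta_in=1e2):
    """Target concentrations: beta_in is added to the true class for an
    in-domain example; out of domain, beta = 0 gives a flat Dirichlet
    (all concentrations equal to 1)."""
    alpha = np.ones(num_classes)
    if true_class is not None:
        alpha[true_class] += beta_in
    return alpha


gamma = 10.0  # hypothetical relative weight on the out-of-domain term
model_in = np.array([80.0, 2.0, 2.0])   # model's Dirichlet on an in-domain input
model_ood = np.array([4.0, 1.5, 2.0])   # model's Dirichlet on an OOD input

# Reverse-KL form: KL(model || target) for each data source.
loss = (kl_dirichlet(model_in, target_alpha(3, true_class=0))
        + gamma * kl_dirichlet(model_ood, target_alpha(3)))
print(loss)
```

This makes the reviewer's sensitivity question concrete: beta_in, the choice of 0 for the out-of-domain \beta, and \gamma all enter the loss directly, so how they were tuned matters.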