| Paper ID | 1695 |
|---|---|
| Title | Copula-like Variational Inference |

After rebuttal
--------------

Thank you for the rebuttal. After having read all the reviews and the rebuttal, I will keep my score of 7. That being said, I agree with Reviewer #3 that the experimental part can be improved, and I still think the authors could do a better job analyzing and explaining the representative power and flexibility of the proposed variational family.

Summary
-------

The paper proposes a new variational family for variational inference as an alternative to classic mean-field Gaussians and full-rank Gaussians. The new variational family is constructed as follows. The starting point is a copula-like density, meaning a density that lives on the hypercube but does not have uniform marginals. The authors then propose to apply a normalizing-flow-type map and a sequence of Givens rotations to the random variable in order to make the resulting density more expressive. The authors present an algorithm to sample from the proposed distribution, which is meant as a plug-in replacement for Gaussians. Finally, the paper concludes with a set of numerical experiments.

Clarity
-------

Overall, the paper is well written and well structured.

Quality
-------

The paper appears technically correct. The claims of the paper are supported to some degree by empirical evidence, but there is no theoretical analysis. The simple 2D experiments (Sec. 6.1 & 6.2) show that the proposed method is capable of modelling asymmetric posterior distributions. The code is not included in the submission.

Significance
------------

The proposed construction is indeed interesting. This work would be of interest to both researchers and practitioners.

Originality
-----------

To the best of my knowledge, both the idea of using the distribution in eq. (6) as a base distribution and the idea of using Givens rotations in variational inference are novel.

Other comments
--------------

Is it obvious that the expression given in eq. (6) is a valid density?

Figure 3: The quality of the figure is very poor in print.

# Summary

This paper presents a new family of variational distributions that preserves posterior dependence and targets high-dimensional problems. The construction of the variational distribution is motivated by Sklar's theorem, in which the dependence structure and the univariate margins are handled separately. While the marginal distributions of any copula have to be uniform, the proposed copula-like density allows for non-uniform marginal distributions. As a result, the dependency structure can be parametrized flexibly with linear complexity, and sampling from the copula-like density is easy, whereas the number of parameters in an unconstrained Gaussian covariance is quadratic. Note, however, that Opper, M., & Archambeau, C. (2009) pointed out that the number of parameters in Gaussian variational inference can be O(d) instead of O(d^2) with a parameterized covariance matrix. Please refer to Khan, M. E. et al. (2013) and the references therein. Admittedly, although the Gaussian covariance matrix is flexible and easier to optimize, the resulting approximation may not necessarily be close to the true posterior either.

The dependency parameters presented in this paper are hard to optimize. To alleviate this problem, a sequence of Givens rotations is used as a normalizing flow to make further adjustments. It is nice to see that the transformation follows an FFT-style butterfly architecture; it could serve as a standardization step and potentially improve variational Gaussian approximations as well.

Experimental results demonstrate that the proposed method yields a higher ELBO than the mean-field Gaussian and full-covariance Gaussian methods in several tasks. Potentially, the ELBO can be further improved by adopting more flexible margins.

# Originality

This paper presents a novel way of constructing variational distributions motivated by Sklar's theorem. To capture high-dimensional dependence, both the copula-like density and the sparse rotation matrix are carefully designed, leading to a tradeoff between flexibility and computational efficiency.

# Quality

The mathematical derivations look solid. The experiments are well designed and the results are clearly presented.

# Clarity

The presentation is clear and the paper is well organized. It would be good to add more discussion of the intuition behind the copula-like density before presenting its complex mathematical form. More discussion is needed on the efficiency of sampling and on how the numerical value of delta reflects the strength of dependence.

# Significance

Existing multivariate copulas are often restrictive. This paper relaxes the constraint of uniform margins in the copula density in exchange for better scalability of structured stochastic variational inference to high-dimensional problems. The method is appealing as a tool for approximate Bayesian inference in deep models.
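To illustrate the FFT-style butterfly arrangement of Givens rotations mentioned in the summary, here is a minimal NumPy sketch. It is not the authors' implementation: the pairing rule (index `i` coupled with `i + 2**level`), the function names `butterfly_pairs` and `apply_butterfly`, and the restriction to `d` being a power of two are all assumptions made for illustration. The point is only that log2(d) layers of d/2 planar rotations give an orthogonal map that costs O(d log d) to apply.

```python
import numpy as np

def butterfly_pairs(d, level):
    """Index pairs (i, j) coupled at one butterfly level, with j = i + 2**level.

    Hypothetical pairing rule, chosen to mimic an FFT butterfly.
    """
    stride = 1 << level
    i = np.array([k for k in range(d) if (k & stride) == 0])
    return i, i + stride

def apply_butterfly(x, angles):
    """Apply log2(d) layers of d/2 Givens rotations to a length-d vector x.

    angles has shape (log2(d), d // 2); angles[l, p] rotates the p-th pair
    of layer l. Assumes d is a power of two.
    """
    d = x.shape[0]
    n_levels = angles.shape[0]
    assert d == 1 << n_levels, "this sketch assumes d is a power of two"
    y = x.astype(float).copy()
    for level in range(n_levels):
        i, j = butterfly_pairs(d, level)
        c, s = np.cos(angles[level]), np.sin(angles[level])
        yi, yj = y[i].copy(), y[j].copy()
        # Each pair (y_i, y_j) is rotated in its own 2D plane.
        y[i] = c * yi - s * yj
        y[j] = s * yi + c * yj
    return y

rng = np.random.default_rng(0)
d = 8
angles = rng.uniform(0.0, 2.0 * np.pi, size=(3, d // 2))  # 3 = log2(8) layers
x = rng.normal(size=d)
y = apply_butterfly(x, angles)
# The map is orthogonal, so the Euclidean norm of x is preserved.
print(np.linalg.norm(x), np.linalg.norm(y))
```

Under these assumptions the rotation has (d/2) log2(d) free angles, compared with d(d-1)/2 for a general rotation, which is where the tradeoff between flexibility and cost comes from.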

The paper proposes a copula-like variational distribution with rotation. The approach seems to work in theory, but the authors should offer more detailed empirical evidence. Here are several major comments:

1. Since the proposed copula-like construction is a composition of multiple components, an ablation analysis would help identify the source of its representative power. I am particularly interested in the performance of copula-like densities without rotation, copula-like densities with (any other) normalizing flow, and mean-field Gaussians with rotation under the experimental setting of the paper.
2. What is the role of the transformation H defined in (7)? Since $\delta$ is fixed, H doesn't improve the representative power of the variational family.
3. The comparison in Table 4 is somewhat unfair. The authors should provide the prediction errors for a 400×400 network with a Gaussian prior using the copula-like variational approximation.
4. In the related work, the authors mention that their technique can be applied in high dimensions. But the experiments show that the rotation trick is not used in the Bayesian neural network. The computational complexity of $O(d\log d)$ seems too high to deal with Bayesian neural networks.

In addition, Figure 3 is barely readable and eq. (7) seems to lack brackets.
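To put the $O(d\log d)$ concern in point 4 in concrete terms, here is a rough count. It assumes an FFT-style butterfly with $\log_2 d$ layers of $d/2$ Givens rotations each, and takes $d = 2^{20}$ as a purely illustrative dimension; neither figure comes from the paper.

$$
N_{\mathrm{rot}}(d) = \frac{d}{2}\log_2 d, \qquad N_{\mathrm{rot}}(2^{20}) = 2^{19}\cdot 20 = 10{,}485{,}760 .
$$

Under these assumptions, each sample requires about $\tfrac{1}{2}\log_2 d = 10$ planar rotations per dimension, applied in 20 sequential layers.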