__ Summary and Contributions__: This paper expands the variational family for the branch length distributions of phylogenetic trees using normalizing flows (NF). The main contribution over the recent existing work (Zhang and Matsen 2019) is an adaptation of NF to the phylogenetic tree problem. Motivated by the problem, each NF transformation is designed to be permutation equivariant. The authors adapt two normalizing flows (planar and RealNVP) to this problem.

__ Strengths__: 1- Calculation of approximation and amortization gaps in Table 2 for different variational families
2- Development of permutation equivariant transformations for NF

__ Weaknesses__: Main comments:
The paper is difficult to follow because some details are only deferred to previous work and are not stated in the paper. Please consider adding more details about the background, at least in the appendix.
Since the main reason for using VI is to speed up Bayesian inference, it is really important to provide experimental results on the time complexity of the method and its convergence. I think adding a plot similar to Figure 4 in (Zhang and Matsen 2019) would be helpful. Since your paper provides a more flexible family, it is not expected to converge as fast as (Zhang and Matsen 2019), but it is essential to see how much better it performs than MCMC.
Minor comments
In the equation after line 248, the notation q is borrowed directly from the paper Chris et al. (2018) to refer to the variational distribution; however, it would be better to use another notation to avoid confusion with the branch length parameters. Also, the notation w_{x_i} in equation 5 would be better written as w_{n(x_i)} because ‘w’ does not depend on the value of x_i.
It would be valuable to report the ELBO (k=1) for both PSP and the proposed method on the real datasets and to show that there is a gain.
After the authors' response:
I thank the authors for providing more empirical results on their model's computational complexity and flexibility. I think it would also be better to elaborate further on the novelty of the method.

__ Correctness__: Yes

__ Clarity__: More detail about the background is needed

__ Relation to Prior Work__: Yes

__ Reproducibility__: No

__ Additional Feedback__:

__ Summary and Contributions__: The paper presents a new type of variational Bayesian phylogenetic inference that makes use of normalizing flows to construct more expressive posteriors for branch lengths. A permutation equivariant construction of normalizing flows (planar and RealNVP) is proposed to handle the non-Euclidean branch lengths of phylogenetic models. Results on benchmark datasets demonstrate the effectiveness of the proposed family of posteriors.

__ Strengths__: The paper presents a more expressive model for branch length approximation in phylogenetic inference. The proposed construction of permutation equivariant normalizing flows is grounded in and connected to DeepSets, a SOTA theoretical framework for equivariance in deep models. The proposed variational inference approach is a faster and more efficient alternative to existing sampling-based methods, with the additional advantage of providing a tighter bound. Both the permutation equivariant construction of flows and the variational inference of phylogenetic trees are relevant to the NeurIPS community.

__ Weaknesses__: Lack of experimental evidence for significant improvements in lower bounds and marginal likelihood estimates compared with the existing diagonal Lognormal branch length approximation. The bijectivity of flow-based models mandates learning mappings in the same dimensionality as the posterior space. There is no discussion of this dimensionality or of whether it introduces a risk of overfitting. The experiments lack comparisons to sampling-based methods.

__ Correctness__: Flow-based models are bounded by the manifold structure of the posterior distribution: they have difficulty modeling discontinuous manifolds and can potentially assign mass to unsupported regions. Given that, it is not clear whether a flow model is the ideal candidate for phylogenetic posteriors, although it does improve performance, if only marginally.

__ Clarity__: The paper is well written, structured, and easy to follow with enough background material.

__ Relation to Prior Work__: The proposed equivariant constructions of flow models assume sets of scalars, which could be considered as a special case of sets of vectors. The connection to existing works that extend flow models to unordered data and whether they can handle sets of scalars are not well justified.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper proposes permutation equivariant normalizing flows as a variational inference method for phylogenetic tree estimation. While there is prior work that has done variational inference for phylogenetic tree estimation, the novelty comes from 1) using normalizing flows and 2) permutation equivariance in the architecture.
[Comments in brackets denote post-rebuttal responses. My score remains unchanged at "Good submission. Accept," and I appreciate the authors' rebuttal as they clarified many of my questions.]

__ Strengths__: Relevance: Variational inference and its applications are, to me, pretty core to the community.
Significance: While not really a core ML subfield, phylogenetics, and phylogenetic tree inference more specifically, is a really fascinating and difficult problem with lots of real-world importance.
Theoretical grounding: This paper goes beyond a fair number of "ML applied to a specific problem" papers by taking the time to think about what sorts of inductive biases are needed for the problem at hand, and it devises a permutation equivariant normalizing flow architecture for that. The authors also spend some time proving that the flows they use are equivariant.
Empirical evaluation: This really isn't my subfield but the results look statistically significant; there are also some of the obvious ablation studies that I would have wanted (number of layers, knocking out the effect of the tree, etc.).
The performance differences between flows and standard VI are clearly pointed out in Figure 3 which is really nice.
Novelty: Beyond just solving a really nice applied problem, this is one of the first papers to really devise and study equivariant normalizing flows. Some googling points to https://proceedings.icml.cc/static/paper_files/icml/2020/6711-Paper.pdf as an extremely recent normalizing flows paper in this vein (not necessary to cite), but your work goes pretty far beyond that to encode a different type of symmetry and to use it for an inference problem.

__ Weaknesses__: Relevance: The application is really quite niche which could mean that the equivariant architecture solution could get lost in the swamp of NeurIPS papers.
Significance: I would really have hoped that the authors had devoted a little more time to answering the following two questions:
1) Why is phylogenetic inference interesting to the broader ML community, and more specifically what can we do with better phylogenetic inference tools? Or, is there a larger problem that the development of this tool helps to solve? [Thanks for the clarification. I hope that some of the extra space for the camera ready goes to discuss your response.]
2) Why exactly is the "diagonal Lognormal branch design distribution ... not ... flexible enough"? My understanding is that MCMC methods are essentially the state of the art in this area, so is there evidence for substantial correlation between branches for these runs? [Thanks for the argument, but what I was really hoping for was some sort of traceplot demonstrating that the MCMC (and flow) methods actually pick up correlations over these parameters...]
Empirical Evaluation:
See point 2) above.
You probably need to justify the usage of the importance sampling method in the VI approximation a bit better, although some googling on my end finds that the importance sampling from the VI estimate works pretty well - https://escholarship.org/content/qt77d8v106/qt77d8v106.pdf. The natural choice in the ML community is probably annealed importance sampling. [Thanks for the clarification and promise of experiment.]
From an ML perspective, it's well known that normalizing flows produce really high quality likelihood estimates (Section 6.1 of https://arxiv.org/pdf/1912.02762.pdf), which could explain why adding new layers just seems to continue decreasing the MLL. Is there a tradeoff between the number of layers and the MLL beyond which adding new layers to the flow just stops improving the likelihood (or the flow even becomes untrainable)?
[Somewhat unaddressed.]
Novelty: Although this is one of the first papers to study equivariant normalizing flows, it's not the first, and Section 5.6 of https://arxiv.org/pdf/1912.02762.pdf contains the broad portions of your proofs. However, they broadly just point to two workshop papers (one of which seems to have been extended into the ICML paper cited above).
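For readers less familiar with the importance sampling estimate discussed under Empirical Evaluation, here is a minimal sketch of a K-sample lower bound on a toy conjugate Gaussian model. This is not the authors' estimator or model: the model (z ~ N(0, 1), x | z ~ N(z, 1)), the function names, and the proposal settings are all assumptions chosen so that the exact log marginal likelihood is available in closed form for comparison.

```python
import math
import random


def log_normal_pdf(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((v - mean) / std) ** 2


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def multi_sample_bound(x, q_mean, q_std, k, rng):
    """K-sample importance-weighted bound on log p(x); k=1 is a single ELBO sample.

    Toy model (an assumption, not the phylogenetic model):
    z ~ N(0, 1), x | z ~ N(z, 1), proposal q(z) = N(q_mean, q_std^2).
    """
    log_weights = []
    for _ in range(k):
        z = rng.gauss(q_mean, q_std)
        log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
        log_weights.append(log_joint - log_normal_pdf(z, q_mean, q_std))
    return log_mean_exp(log_weights)


x_obs = 1.3
# For this toy model the marginal is x ~ N(0, 2).
exact_log_marginal = log_normal_pdf(x_obs, 0.0, math.sqrt(2.0))

rng = random.Random(0)
# With q equal to the exact posterior N(x/2, 1/2), every weight equals p(x).
estimate_exact_q = multi_sample_bound(x_obs, x_obs / 2.0, math.sqrt(0.5), 1000, rng)
# With a mismatched q, the averaged k=1 bound (the ELBO) sits below log p(x).
elbo_mismatched_q = sum(
    multi_sample_bound(x_obs, 0.0, 1.0, 1, rng) for _ in range(2000)
) / 2000
```

In this sketch the gap between `elbo_mismatched_q` and `exact_log_marginal` is the KL divergence from the proposal to the true posterior, which is the quantity a more flexible variational family is meant to shrink.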

__ Correctness__: From reading the two proofs, things look broadly correct.
I think that the empirical methodology is pretty much fine.
I'd like to see a time comparison against standard VI if possible. I've trained some flows for inference previously and have been disappointed at how slow they are, which could prove problematic for the claims of "speed compared to MCMC." [Thanks for the iteration complexity comparison.]

__ Clarity__: Yes, the paper is well-written and flows nicely throughout.
I think the arrow in the amortization gap pointer in Figure 3 could be pushed slightly higher.
The figures are of good size and the captions tell the story of the figures well.

__ Relation to Prior Work__: Yes, in general, it seems to be well related to the prior ML works in this area. I don't know the phylogenetic literature that well, but it seems to be novel in that space too.
It seems like there's broadly some concurrent work on equivariant normalizing flows for physics problems that overlaps a bit (basically just Section 5.6 of https://arxiv.org/pdf/1912.02762.pdf that summarizes two NeurIPS '19 workshop papers). However, this isn't really a major deal at all.

__ Reproducibility__: Yes

__ Additional Feedback__: You probably need to reference the fact that the proofs to the propositions are in the Appendix right after the statements; it took me a few minutes to find them.
Overall, I really like this work and think that it represents an improvement on the current state of the art. In the rebuttal, I'd like to see 1) a bit more discussion of how this work solves open phylogenetic problems and 2) a time comparison to VI and/or MCMC.
Minor comments:
Line 29 of appendix: "numerical stable" --> numerically stable.
Line 30: is the determinant invariant for the same reason as the Jacobian being invariant?
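On the last question, a small numerical check can make the expected behavior concrete. The sketch below assumes a planar-type layer z' = z + gamma * tanh(<w, z> + b) whose per-edge parameters are looked up by an edge label, so that relabeling the edges permutes inputs and parameters together. The dictionary-based lookup, the labels, and the parameter values are hypothetical and are not the authors' parameterization.

```python
import math


def planar_layer(z, labels, gamma, w, b):
    """Apply one planar-type layer; z[i] is the value on the edge named labels[i].

    gamma and w are dicts keyed by edge label (a hypothetical stand-in for
    parameters tied to edge features).
    """
    a = sum(w[e] * ze for e, ze in zip(labels, z)) + b
    out = [ze + gamma[e] * math.tanh(a) for e, ze in zip(labels, z)]
    # Matrix determinant lemma: det(I + h'(a) * gamma w^T) = 1 + h'(a) * <gamma, w>.
    dot = sum(gamma[e] * w[e] for e in labels)
    log_det = math.log(abs(1.0 + (1.0 - math.tanh(a) ** 2) * dot))
    return out, log_det


labels = ["e1", "e2", "e3"]
gamma = {"e1": 0.3, "e2": -0.2, "e3": 0.5}
w = {"e1": 0.4, "e2": 0.1, "e3": -0.3}
z = [0.2, -1.0, 0.7]

out, log_det = planar_layer(z, labels, gamma, w, b=0.1)

# Reorder the edges and apply the same layer to the reordered input.
perm = [2, 0, 1]
z_perm = [z[i] for i in perm]
labels_perm = [labels[i] for i in perm]
out_perm, log_det_perm = planar_layer(z_perm, labels_perm, gamma, w, b=0.1)

# Equivariance: the output is reordered in the same way as the input.
equivariant = all(
    math.isclose(p, q, abs_tol=1e-12) for p, q in zip(out_perm, [out[i] for i in perm])
)
# Invariance: the log-determinant depends only on sums over edges.
invariant = math.isclose(log_det_perm, log_det, abs_tol=1e-12)
```

Under these assumptions the output is equivariant while the log-determinant is invariant, because the determinant depends on the edges only through the order-independent sums <w, z> and <gamma, w>.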

__ Summary and Contributions__: In this work the authors consider the problem of creating normalizing flows as the approximating distribution for branch lengths in (Bayesian) phylogenetic variational inference problems. They begin by referencing previous work that builds approximating distributions based on tree splits and the associated primary subsplit pairs (PSPs). Under the existing framework, a log-normal variational distribution is established, where the variational parameters for the mean and variance are composed by summing over variational parameters associated with splits and PSPs, thus making inference amortized. The overall trick for introducing flows into this problem seems twofold: (i) instead of establishing flow-specific variational parameters, use the parameters associated with the splits and PSPs from previous work as the parameters of the NF, and (ii) ensure the flows are permutation equivariant, since there is no consistent alignment for non-pendant tree edges. The authors introduce permutation equivariant versions of previously established flows (planar and RealNVP) with the parameters corresponding to split- and PSP-specific variables. They discuss inference methodology and apply it to a number of real datasets for phylogenetic reconstruction problems, demonstrating state-of-the-art lower bound estimates.

__ Strengths__: Overall I think this is a very strong paper: it blends flow based/change of measure inference methods (an extremely active area of research) in a novel manner, applying it to a particular domain/inference problem (phylogenetics). The paper is methodologically exciting and achieves state of the art results, and acts as a strong foundation for any future work that would attempt to fit the entire inference procedure (topology + branch lengths) into a normalizing flow framework, rather than just the branch lengths.

__ Weaknesses__: The only weakness I would level (and it’s very minor) is that I don’t think the marginal likelihood is significantly improved over the previous (PSP) model in any case (Table 1) when the error bounds are taken into account (though admittedly this will be high variance), and in some cases PSP appears to achieve a higher marginal likelihood (on datasets DS5, DS6, and DS8). Can the authors comment on this? Admittedly all variance estimates are lower.

__ Correctness__: To the best of my knowledge the paper looks correct.

__ Clarity__: Exceptionally minor comment: presumably \psi^\sigma is constrained > 0 ? (line 124)
Equation 5: the authors introduce (z, x), presumably to standardize with existing NF notation, but x is typically continuous, leading to gamma and w being indexed by a continuous value. This is somewhat confusing and doesn’t match Eq. 6, where x -> q and the indexing is performed on edges e rather than on x. Can the authors clarify?

__ Relation to Prior Work__: Yes, the paper references and explains previous work on variational inference for phylogenetics well. The paper appears to be the first of its kind connecting normalizing flows to Bayesian phylogenetic inference.

__ Reproducibility__: Yes

__ Additional Feedback__: Update: the authors addressed my (minor) concerns and I keep my score unchanged at 8