NIPS 2018
Sunday, December 2 through Saturday, December 8, 2018, at the Palais des Congrès de Montréal
Paper ID: 6577
Generalizing Point Embeddings using the Wasserstein Space of Elliptical Distributions

### Reviewer 1

The paper proposes using elliptical distributions (specifically, uniform elliptical distributions) as probability measures for representing data instances in an embedding space. The work is inspired by prior work in which the KL divergence was used for Gaussian-distribution-based embeddings. A motivating factor versus previous work is ensuring consistent behavior as the distributions collapse to Dirac delta functions. The paper uses the 2-Wasserstein divergence/distance (motivated by its connection to optimal transport) as the metric for the embedding space. This divergence/distance has a known closed-form expression for uniform elliptical distributions that relies on the Bures/Kakutani distance metric. The paper then explores manifold optimization of the divergence in terms of the covariance matrix, using a factored parameterization of the covariance matrix for the optimization. To compute the gradient of the Bures distance, a Newton–Schulz algorithm is used to find the matrix square roots and their inverses, which is more parallelizable than an eigendecomposition. The approach is applied to a simple multidimensional scaling example and then to word-embedding data sets that have previously been used with other probability-distribution-based embeddings.

Strengths: A novel idea that is fairly well conveyed. The discussion of the manifold optimization and Figure 2 are useful. The optimized gradient implementation seems to be well designed. It seems like a fairly complete piece of work with the potential for further refinement and application.

Weaknesses: The text should include more direct motivation for why the added flexibility of an elliptical embedding is useful for modeling versus a standard point embedding, especially in relation to the hypernymy experiment. The equation between lines 189 and 190 appears to be wrong: by Equation 2, the trace term should be divided by $d+2$, unless the covariances are pre-scaled.
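For context, the closed form and the Newton–Schulz computation the review refers to can be sketched directly. For Gaussian measures $N(a, A)$ and $N(b, B)$, the squared 2-Wasserstein distance is $\|a-b\|^2 + \mathrm{Tr}\big(A + B - 2(A^{1/2} B A^{1/2})^{1/2}\big)$; the paper's uniform elliptical case differs only by a rescaling of the trace term, as noted above. The matrix square roots can be obtained with coupled Newton–Schulz iterations that use only matrix products. Below is a minimal NumPy sketch (illustrative only; function names are my own, not the paper's):

```python
import numpy as np

def newton_schulz_sqrt(A, num_iters=25):
    """Approximate the square root and inverse square root of an SPD
    matrix A with coupled Newton-Schulz iterations. Uses only matrix
    multiplications, so it parallelizes better than an eigendecomposition."""
    d = A.shape[0]
    norm = np.linalg.norm(A)           # Frobenius norm, for normalization
    Y = A / norm                       # Y_k -> (A/norm)^{1/2}
    Z = np.eye(d)                      # Z_k -> (A/norm)^{-1/2}
    I = np.eye(d)
    for _ in range(num_iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y = Y @ T
        Z = T @ Z
    return Y * np.sqrt(norm), Z / np.sqrt(norm)

def bures_wasserstein_sq(a, A, b, B, num_iters=25):
    """Squared 2-Wasserstein distance between Gaussians N(a, A), N(b, B):
    ||a - b||^2 + Tr(A) + Tr(B) - 2 Tr((A^{1/2} B A^{1/2})^{1/2})."""
    sqrt_A, _ = newton_schulz_sqrt(A, num_iters)
    cross, _ = newton_schulz_sqrt(sqrt_A @ B @ sqrt_A, num_iters)
    return np.sum((a - b) ** 2) + np.trace(A) + np.trace(B) - 2.0 * np.trace(cross)
```

Note that the normalization by the Frobenius norm puts the eigenvalues in $(0, 1]$, which guarantees convergence of the iteration for SPD inputs; this is the same trick that makes the scheme usable for gradient computation in the paper's setting.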
The arguments (including Figure 3) for using the precision matrix versus the covariance for visualization could be removed. The counterintuitive nature appears to be subjective and distracts from the modeling purpose of a probability-distribution-based embedding.

The reference for the tractable (computable) form of the Bures distance should be Uhlmann: A. Uhlmann, "The 'transition probability' in the state space of a *-algebra," Rep. Math. Phys. 9, 273 (1976).

Some issues with typesetting appear to be caused by inline figures and sectioning.

Minor errors and corrections:
- Line 25: "are extremely small (2 or 3 numbers" -> "can be represented in two- or three-dimensional space".
- Line 37: add commas around the phrase "or more ... $\langle y_i,y_j\rangle$".
- Line 55: the phrase "test time" is not clear in this context.
- Line 55: I cannot find any relevant comment in Table 1.
- Line 63: "scenarii".
- Line 69: should the notation for $M$ in the pseudo-inverse match the other cases (i.e., boldface)?
- Line 117: typesetting and line numbering are messed up by inline Figure 1.
- Line 130: $B$ should be $b$, and $C$ should be $B$.
- Line 132: the reasoning for the counterintuitive visualizations is that, for uniform distributions, the divergence decreases as the scale decreases, since the density is more concentrated.
- Line 146: "adressed".
- Line 152: here and elsewhere, references sometimes appear with both parentheses and brackets.
- Line 155: "on the contrary" to what? This seems to be incorrectly copied from the introductory paragraph of Section 3.
- Line 157: "to expensive".
- Line 166: "w.r.t".
- Line 177: should be "supplementary material" or "supplement".
- Line 183.1: should be "a general loss function", since $\mathcal{L}$ is indicated.
- Line 214: "matrics".
- Line 215: "would require to perform" -> "would require one to perform".
- Line 233: "Elliptical" should be lower case.
- Line 236: "is to use their precision matrix their covariances"?
- Line 242: $y_i,y_j$ -> $y_i-y_j$.
- Line 243: missing comma before $\mathbf{a}_n$.
- Figure 5 caption: fix the initial quotation mark before "Bach".
- Line 289: "asymetry".
- Line 310: "embedd".
- Line 315: should clarify whether this is the Bures dot product or the Bures cosine.
- Line 323: "interesections".

The reference list is poorly done: lack of capitalization of proper nouns, inconsistent references to journal names (some abbreviated, some not, some missing title case), URLs and ISSNs inserted sporadically, and inconsistent references to arXiv articles.

------ Author rebuttal response ------

The author rebuttal helped clarify matters, and I have raised my score and edited the above. The authors have promised to clarify the dot-product equation that appears inconsistent to me. I agree with Reviewer 3 that the visualization issue is not convincing and would caution against elaborating further on this somewhat tangential issue. The original intuition was that the distance between a point and a uniform ellipsoid increases as the scale of the latter increases. However, with the additive nature of the similarity metric (dot product) used for the word embeddings, this does not appear to be the case. Switching the covariance with the precision for the uniform elliptical embeddings seems to make Figure 5 more confusing, since that embedding is trained using the dot product rather than the distance. I also hope the authors take care to proofread the manuscript carefully and adhere to the NIPS template for typesetting.