{"title": "Calculating Optimistic Likelihoods Using (Geodesically) Convex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 13942, "page_last": 13953, "abstract": "A fundamental problem arising in many areas of machine learning is the evaluation of the likelihood of a given observation under different nominal distributions. Frequently, these nominal distributions are themselves estimated from data, which makes them susceptible to estimation errors. We thus propose to replace each nominal distribution with an ambiguity set containing all distributions in its vicinity and to evaluate an optimistic likelihood, that is, the maximum of the likelihood over all distributions in the ambiguity set. When the proximity of distributions is quantified by the Fisher-Rao distance or the Kullback-Leibler divergence, the emerging optimistic likelihoods can be computed efficiently using either geodesic or standard convex optimization techniques. We showcase the advantages of working with optimistic likelihoods on a classification problem using synthetic as well as empirical data.", "full_text": "Calculating Optimistic Likelihoods\n\nUsing (Geodesically) Convex Optimization\n\nViet Anh Nguyen\nSoroosh Sha\ufb01eezadeh-Abadeh\n\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne, Switzerland\n{viet-anh.nguyen, soroosh.shafiee}@epfl.ch\n\nMan-Chung Yue\n\nThe Hong Kong Polytechnic University, Hong Kong\n\nmanchung.yue@polyu.edu.hk\n\nDaniel Kuhn\n\n\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne, Switzerland\n\ndaniel.kuhn@epfl.ch\n\nImperial College Business School, United Kingdom\n\nWolfram Wiesemann\n\nww@imperial.ac.uk\n\nAbstract\n\nA fundamental problem arising in many areas of machine learning is the evaluation\nof the likelihood of a given observation under different nominal distributions.\nFrequently, these nominal distributions are themselves estimated from data, which\nmakes them susceptible to estimation errors. 
We thus propose to replace each\nnominal distribution with an ambiguity set containing all distributions in its vicinity\nand to evaluate an optimistic likelihood, that is, the maximum of the likelihood\nover all distributions in the ambiguity set. When the proximity of distributions\nis quantified by the Fisher-Rao distance or the Kullback-Leibler divergence, the\nemerging optimistic likelihoods can be computed efficiently using either geodesic\nor standard convex optimization techniques. We showcase the advantages of\nworking with optimistic likelihoods on a classification problem using synthetic as\nwell as empirical data.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1 Introduction\n\nAssume that a set of i.i.d. data points x_1^M \u225c x1, . . . , xM \u2208 R^n is generated from one of several\nGaussian distributions Pc, c \u2208 C with |C| < \u221e. To determine the distribution Pc\u22c6, c\u22c6 \u2208 C, under\nwhich x_1^M has the highest likelihood \u2113(x_1^M, Pc) across all Pc, c \u2208 C, we can solve the problem\n\n$$c^\\star \\in \\arg\\max_{c \\in \\mathcal{C}} \\left\\{ \\ell(x_1^M, \\mathbb{P}_c) \\triangleq -\\frac{1}{M} \\sum_{m=1}^{M} (x_m - \\mu_c)^\\top \\Sigma_c^{-1} (x_m - \\mu_c) - \\log\\det \\Sigma_c \\right\\}, \\qquad (1)$$\n\nwhere \u00b5c and \u03a3c denote the means and covariance matrices that unambiguously characterize the\ndistributions Pc, c \u2208 C, and the log-likelihood function \u2113(x_1^M, Pc) quantifies the (logarithm of the)\nrelative probability of observing x_1^M under the Gaussian distribution Pc. Problem (1) naturally\narises in various machine learning applications. In quadratic discriminant analysis, for example,\nx_1^M denotes the input values of data samples whose categorical outputs y1, . . . , yM \u2208 C are to be\npredicted based on the class-conditional distributions Pc, c \u2208 C [25]. 
Likewise, in Bayesian inference\nwith synthetic likelihoods, a Bayesian belief about the models Pc, c \u2208 C, assumed to be Gaussian for\ncomputational tractability, is updated based on an observation x_1^M [31, 41]. Problem (1) also arises\nin likelihood-ratio tests where the null hypothesis \u2018x_1^M is generated by a distribution Pc, c \u2208 C0\u2019 is\ncompared with the alternative hypothesis \u2018x_1^M is generated by a distribution Pc, c \u2208 C1\u2019 [17, 18].\nIn practice, the parameters (\u00b5c, \u03a3c) of the candidate distributions Pc, c \u2208 C, are typically not known\nand need to be estimated from data. In quadratic discriminant analysis, for example, it is common to\nreplace the means \u00b5c and covariance matrices \u03a3c with their empirical counterparts \u02c6\u00b5c and \u02c6\u03a3c that are\nestimated from data. Similarly, the rival model distributions Pc, c \u2208 C, in Bayesian inference with\nsynthetic likelihoods are Gaussian estimates derived from (costly) sampling processes. Unfortunately,\nproblem (1) is highly sensitive to misspecification of the candidate distributions Pc. To combat this\nproblem, we propose to replace the likelihood function in (1) with the optimistic likelihood\n\n$$\\max_{\\mathbb{P} \\in \\mathcal{P}_c} \\ell(x_1^M, \\mathbb{P}) \\quad \\text{with} \\quad \\mathcal{P}_c = \\left\\{ \\mathbb{P} \\in \\mathcal{M} : \\varphi(\\hat{\\mathbb{P}}_c, \\mathbb{P}) \\leq \\rho_c \\right\\}, \\qquad (2)$$\n\nwhere M is the set of all non-degenerate Gaussian distributions on R^n, \u03d5 is a dissimilarity measure\nsatisfying \u03d5(P, P) = 0 for all P \u2208 M, and \u03c1c \u2208 R+ are the radii of the ambiguity sets Pc. Problem (2)\nassumes that the true candidate distributions Pc are unknown but close to the nominal distributions\n\u02c6Pc that are estimated from the training data. In contrast to the log-likelihood \u2113(x_1^M, Pc) that is\nmaximized in problem (1), the optimistic likelihood (2) is of interest in its own right. 
A common\nproblem in constrained likelihood estimation, for example, is to determine a Gaussian distribution\nP\u22c6 \u223c (\u00b5\u22c6, \u03a3\u22c6) that is close to a nominal distribution P0 \u223c (\u00b50, \u03a30) reflecting the available prior\ninformation such that x_1^M has high likelihood under P\u22c6 [34]. This task is an instance of the optimistic\nlikelihood evaluation problem (2) with a suitably chosen dissimilarity measure \u03d5.\nOf crucial importance in the generalized likelihood problem (2) is the choice of the dissimilarity\nmeasure \u03d5 as it impacts both the statistical properties and the computational complexity\nof the estimation procedure. A natural choice appears to be the Wasserstein distance, which has\nrecently been popularized in the field of optimal transport [37, 40]. The Wasserstein distance on\nthe space of Gaussian distributions is a Riemannian distance, that is, the distance corresponding to the\ncurvilinear geometry on the set of Gaussian distributions induced by the Wasserstein distance, as\nopposed to the usual distance obtained by treating it as a subset of the space of symmetric matrices.\nHowever, since the Wasserstein manifold has a non-negative sectional curvature [37], calculating\nthe associated optimistic likelihood (2) appears to be computationally intractable. Instead, we study\nthe optimistic likelihood under the Fisher-Rao (FR) distance, which is commonly used in signal and\nimage processing [30, 2] as well as computer vision [21, 39]. The FR distance is also a Riemannian\nmetric, and it enjoys many attractive statistical properties that we review in Section 2 of this paper.\nMost importantly, the FR distance has a non-positive sectional curvature, which implies that the\noptimistic likelihood (2) reduces to the solution of a geodesically convex optimization problem that\nis amenable to an efficient solution [6, 35, 38, 42, 43, 44]. 
We also study problem (2) under the\nKullback\u2013Leibler (KL) divergence (or relative entropy), which is intimately related to the FR metric.\nWhile the KL divergence lacks some of the desirable statistical features of the FR metric, we will\nshow that it gives rise to optimistic likelihoods that can be evaluated in quasi-closed form by reduction\nto a one-dimensional problem.\nWhile this paper focuses on the parametric approximation of the likelihood where P belongs to the\nfamily of Gaussian distributions, we emphasize that the optimistic likelihood approach can also be\nutilized in a non-parametric setting [28].\nThe contributions of this paper may be summarized as follows.\n\n1. We show that for Fisher-Rao ambiguity sets, the optimistic likelihood (2) reduces to a geodesically\nconvex problem and is hence amenable to an efficient solution via a Riemannian gradient descent\nalgorithm. We analyze the optimality as well as the convergence of the resulting algorithm.\n\n2. We show that for Kullback-Leibler ambiguity sets, the optimistic likelihood (2) can be evaluated\nin quasi-closed form by reduction to a one-dimensional convex optimization problem.\n\n3. We evaluate the numerical performance of our optimistic likelihoods on a classification problem\nwith artificially generated as well as standard benchmark instances.\n\nOur optimistic likelihoods follow a broader optimization paradigm that exercises optimism in the\nface of ambiguity. This strategy has been shown to perform well, among others, in multi-armed\nbandit problems and Bayesian optimization, where the Upper Confidence Bound algorithm takes\ndecisions based on optimistic estimates of the reward [12, 13, 26, 36]. Optimistic optimization\nhas also been successfully applied in support vector machines [8], and it closely relates to sparsity\ninducing non-convex regularization schemes [29].\nThe remainder of the paper proceeds as follows. 
We study the optimistic likelihood (2) under FR and\nKL ambiguity sets in Sections 2 and 3, respectively. We test our theoretical findings in the context\nof a classification problem, and we report on numerical experiments in Section 4. Supplementary\nmaterial and all proofs are provided in the online companion.\nNotation. Throughout this paper, S^n, S^n_+ and S^n_{++} denote the spaces of n-dimensional symmetric,\nsymmetric positive semi-definite and symmetric positive definite matrices, respectively. For any\nA \u2208 R^{n\u00d7n}, the trace of A is defined as Tr(A) = \u2211_{i=1}^n A_{ii}. For any A \u2208 S^n, \u03bbmin(A) and \u03bbmax(A)\ndenote the minimum and maximum eigenvalues of A, respectively. The base of log(\u00b7) is e.\n\n2 Optimistic Likelihood Problems under the FR Distance\n\nConsider a family of distributions with density functions p\u03b8(x), where the parameter \u03b8 ranges over\na finite-dimensional smooth manifold \u0398. At each point \u03b8 \u2208 \u0398, the Fisher information matrix\nI\u03b8 = E_x[\u2207\u03b8 log(p\u03b8(x)) \u2207\u03b8 log(p\u03b8(x))^\u22a4 | \u03b8] defines an inner product \u27e8\u00b7,\u00b7\u27e9\u03b8 on the tangent space T\u03b8\u0398\nby \u27e8\u03b61, \u03b62\u27e9\u03b8 = \u03b61^\u22a4 I\u03b8 \u03b62 for \u03b61, \u03b62 \u2208 T\u03b8\u0398. The family of inner products {\u27e8\u00b7,\u00b7\u27e9\u03b8}\u03b8\u2208\u0398 on the tangent\nspaces then defines a Riemannian metric, called the FR metric. 
The FR distance on \u0398 is the geodesic\ndistance associated with the FR metric, i.e., the FR distance between the two points \u03b80, \u03b81 \u2208 \u0398 is\n\n$$d(\\theta_0, \\theta_1) = \\inf_{\\gamma} \\int_0^1 \\sqrt{\\langle \\gamma'(t), \\gamma'(t) \\rangle_{\\gamma(t)}} \\, \\mathrm{d}t,$$\n\nwhere the infimum is taken over all smooth curves \u03b3 : [0, 1] \u2192 \u0398 with \u03b3(0) = \u03b80 and \u03b3(1) = \u03b81.\nAny curve \u03b3 attaining the infimum is said to be a geodesic from \u03b80 to \u03b81. The FR metric represents a\nnatural distance measure for parametric families of probability distributions as it is invariant under\ntransformations on the data space (the x space) by a class of statistically important mappings, and it\nis the unique (up to a scaling) Riemannian metric enjoying such a property, see [14, 15, 5].\nSince the covariance matrix is more difficult to estimate than the mean (see Appendix A), we focus\nhere on the family of all Gaussian distributions with a fixed mean vector \u02c6\u00b5 \u2208 R^n. These distributions\nare parameterized by \u03b8 = \u03a3, that is, the covariance matrix. The parameter manifold is thus given by\n\u0398 = S^n_{++}. On this manifold, the FR distance is available in closed form.\u00b9\nProposition 2.1 (FR distance for Gaussian distributions [3]). If N (\u02c6\u00b5, \u03a30) and N (\u02c6\u00b5, \u03a31) are Gaussian\ndistributions with identical mean \u02c6\u00b5 \u2208 R^n and covariance matrices \u03a30, \u03a31 \u2208 S^n_{++}, we have\n\n$$d(\\Sigma_0, \\Sigma_1) = \\frac{1}{\\sqrt{2}} \\left\\| \\log\\left( \\Sigma_1^{-\\frac{1}{2}} \\Sigma_0 \\Sigma_1^{-\\frac{1}{2}} \\right) \\right\\|_F, \\qquad (3)$$\n\nwhere log(\u00b7) represents the matrix logarithm, and \u2016 \u00b7 \u2016F stands for the Frobenius norm.\nThe distance d(\u00b7,\u00b7) is invariant under inversions and congruent transformations of the input parameters\n[32, Proposition 1], i.e., for any \u02c6\u03a3, \u03a3 \u2208 S^n_{++} and invertible matrix A \u2208 R^{n\u00d7n}, we have\n\n$$d(\\hat\\Sigma^{-1}, \\Sigma^{-1}) = d(\\hat\\Sigma, \\Sigma) \\qquad (4)$$\n\nand\n\n$$d(A \\hat\\Sigma A^\\top, A \\Sigma A^\\top) = d(\\hat\\Sigma, \\Sigma). \\qquad (5)$$\n\nBy the inversion invariance (4), the distance d(\u00b7,\u00b7) is independent of whether we use the covariance\nmatrix \u03a3 or the precision matrix \u03a3\u22121 to parametrize normal distributions. Note that if x1 \u223c N (\u00b5, \u03a31)\nand x2 \u223c N (\u00b5, \u03a32), then Ax1 + b \u223c N (A\u00b5 + b, A\u03a31A^\u22a4) and Ax2 + b \u223c N (A\u00b5 + b, A\u03a32A^\u22a4). By\nthe congruence invariance (5), the distance d(\u00b7,\u00b7) thus remains unchanged under affine transformations\nx \u2192 Ax + b. Remarkably, the invariance property (5) uniquely characterizes the distance d(\u00b7,\u00b7).\nMore precisely, any Riemannian distance satisfying the invariance property (5) coincides (up to a\nscaling) with the distance d(\u00b7,\u00b7), see, for example, [33, Section 3] and [9, Section 2].\n\n\u00b9 We can also handle the case where the covariance matrix is fixed but the mean is subject to ambiguity, see\nAppendix B. However, as there is no closed-form expression for the FR distance between two generic Gaussian\ndistributions, we cannot handle the case where both the mean and the covariance matrix are subject to ambiguity.\n\nWe now study the optimistic likelihood problem (2), where the FR distance is used as the dissimilarity\nmeasure. Given a data batch x_1^M and a radius \u03c1 > 0, the optimistic likelihood problem reduces to\n\n$$\\min_{\\Sigma \\in \\mathcal{B}_{\\mathrm{FR}}} L(\\Sigma), \\quad \\text{where} \\quad \\left\\{ \\begin{array}{l} L(\\Sigma) \\triangleq \\langle S, \\Sigma^{-1} \\rangle + \\log\\det \\Sigma, \\\\ \\mathcal{B}_{\\mathrm{FR}} \\triangleq \\{ \\Sigma \\in \\mathbb{S}^n_{++} : d(\\Sigma, \\hat\\Sigma) \\leq \\rho \\}, \\end{array} \\right. \\qquad (6)$$\n\nand S = M\u22121 \u2211_{m=1}^M (xm \u2212 \u02c6\u00b5)(xm \u2212 \u02c6\u00b5)^\u22a4 stands for the sample covariance matrix.\nWe next prove that problem (6) is solvable, which justifies the use of the minimization operator.\nLemma 2.2. The optimal value of problem (6) is finite and is attained by some \u03a3\u22c6 \u2208 BFR.\nEven though the objective function of (6) involves a concave log-det term, it can be shown to be\nconvex over the region 0 \u227a \u03a3 \u2aaf 2S [10, Exercise 7.4]. However, in practice S may be singular, in\nwhich case this region becomes empty. Maximum likelihood estimation problems akin to (6) are\noften reparameterized in terms of the precision matrix X = \u03a3\u22121. In this case, (6) becomes\n\n$$\\min \\left\\{ \\langle S, X \\rangle - \\log\\det X : X \\in \\mathbb{S}^n_{++}, \\ \\left\\| \\log\\left( X^{\\frac{1}{2}} \\hat\\Sigma X^{\\frac{1}{2}} \\right) \\right\\|_F \\leq \\sqrt{2} \\rho \\right\\}.$$\n\nEven though this reparameterization convexifies the objective, it renders the feasible set non-convex.\n\n2.1 Geodesic Convexity of the Optimistic Likelihood Problem\n\nAs problem (6) cannot be addressed with standard methods from convex optimization, we re-interpret\nit as a constrained minimization problem on the Riemannian manifold \u0398 = S^n_{++} endowed with the FR\nmetric. The key advantage of this approach is that we can show problem (6) to be geodesically convex.\nGeodesic convexity generalizes the usual notion of convexity in Euclidean spaces to Riemannian\nmanifolds. We can thus solve problem (6) via algorithms from geodesically convex optimization,\nwhich inherit many benefits of the standard algorithms of convex optimization in Euclidean spaces.\nThe Riemannian manifold \u0398 = S^n_{++} endowed with the FR metric is in fact a Hadamard manifold,\nthat is, a complete simply connected Riemannian manifold with non-positive sectional curvature,\nsee [22, Theorem XII 1.2]. Thus, any two points are connected by a unique geodesic [11]. By [7,\nTheorem 6.1.6], for \u03a30, \u03a31 \u2208 S^n_{++}, the unique geodesic \u03b3 : [0, 1] \u2192 S^n_{++} from \u03a30 to \u03a31 is given by\n\n$$\\gamma(t) = \\Sigma_0^{\\frac{1}{2}} \\left( \\Sigma_0^{-\\frac{1}{2}} \\Sigma_1 \\Sigma_0^{-\\frac{1}{2}} \\right)^t \\Sigma_0^{\\frac{1}{2}}. \\qquad (7)$$\n\nWe are now ready to give precise definitions of geodesically convex sets and functions on Hadamard\nmanifolds. We emphasize that these definitions would be more subtle for general Riemannian\nmanifolds, which can have several geodesics between two points.\nDefinition 2.3 (Geodesically convex set). 
A set U \u2286 S^n_{++} is said to be geodesically convex if for all\n\u03a30, \u03a31 \u2208 U, the image of the unique geodesic from \u03a30 to \u03a31 is contained in U, i.e., \u03b3([0, 1]) \u2286 U.\nDefinition 2.4 (Geodesically convex function). A function f : S^n_{++} \u2192 R is said to be geodesically\nconvex if for all \u03a30, \u03a31 \u2208 S^n_{++}, the unique geodesic \u03b3 from \u03a30 to \u03a31 satisfies\nf (\u03b3(t)) \u2264 (1 \u2212 t)f (\u03a30) + tf (\u03a31) \u2200t \u2208 [0, 1].\nIn order to prove that (6) is a geodesically convex optimization problem, we need to establish the\ngeodesic convexity of the feasible region BFR and the loss function L(\u00b7). Note that, in stark contrast\nto Euclidean geometry, a geodesic ball on a general manifold may not be geodesically convex.\u00b2\nTheorem 2.5 (Geodesic convexity of problem (6)). For any \u02c6\u03a3 \u2208 S^n_{++}, S \u2208 S^n_+ and \u03c1 \u2208 R+, BFR is\na geodesically convex set, and L(\u00b7) is a geodesically convex function over S^n_{++}.\nTheorem 2.5 establishes that the optimistic likelihood problem (6), which is non-convex with respect\nto the usual Euclidean geometry on the embedding space R^{n\u00d7n}, is actually convex with respect to\nthe Riemannian geometry on S^n_{++} induced by the FR metric.\n\n\u00b2 For example, consider the circle S^1 \u225c {x \u2208 R^2 : \u2016x\u2016_2 = 1} which is a 1-dimensional manifold. Any\nmajor arc is a geodesic ball but not a geodesically convex subset of S^1.\n\nAlgorithm 1 Projected Geodesic Gradient Descent Algorithm\nInput: \u02c6\u03a3 \u227b 0, \u03c1 > 0, S \u2ab0 0, K \u2208 N, {\u03b1k}_{k=1}^K \u2286 R_{++}\nInitialization: Set \u03a31 \u2190 \u02c6\u03a3, \u00af\u03a31 \u2190 \u02c6\u03a3\nfor k = 1, 2, . . . , K \u2212 1 do\n  Compute the Riemannian gradient at \u03a3k: Gk \u2190 2(\u03a3k \u2212 S)\n  Perform a gradient descent step using the exponential map:\n    $\\Sigma_{k+\\frac{1}{2}} \\leftarrow \\mathrm{Exp}_{\\Sigma_k}(-\\alpha_k G_k) = \\Sigma_k^{\\frac{1}{2}} \\exp\\left( -\\alpha_k \\Sigma_k^{-\\frac{1}{2}} G_k \\Sigma_k^{-\\frac{1}{2}} \\right) \\Sigma_k^{\\frac{1}{2}}$\n  Project $\\Sigma_{k+\\frac{1}{2}}$ onto BFR: $\\Sigma_{k+1} \\leftarrow \\mathrm{Proj}_{\\mathcal{B}_{\\mathrm{FR}}}(\\Sigma_{k+\\frac{1}{2}})$\n  Compute the new iterate by interpolation: $\\bar\\Sigma_{k+1} \\leftarrow \\mathrm{Exp}_{\\bar\\Sigma_k}\\left( \\frac{1}{k+1} \\mathrm{Exp}^{-1}_{\\bar\\Sigma_k}(\\Sigma_{k+1}) \\right)$\nend for\nOutput: Report the last iterate \u00af\u03a3K as an approximate solution\n\n2.2 Projected Geodesic Gradient Descent Algorithm\n\nFigure 1: Visualization of the FR ball BFR (yellow set) within the manifold S^n_{++} (white set).\n\nIn the same way as the convexity of a standard constrained optimization problem can be exploited to\nfind a global minimizer via a projected gradient descent algorithm, the geodesic convexity of problem (6)\ncan be exploited to find a global minimizer by using a projected geodesic gradient descent algorithm\nof the type described in [43]. The mechanics of a generic iteration are visualized in Figure 1. As\nin any gradient descent method, given the current iterate \u03a3, we first need to compute the direction\nalong which the objective function L decreases fastest. In the context of optimization on manifolds,\nthis direction corresponds to the negative Riemannian gradient \u2212G at point \u03a3, which belongs to the\ntangent space T\u03a3S^n_{++} \u2243 S^n. Unfortunately, the\ncurve \u03b3(\u03b1) = \u03a3 \u2212 \u03b1G fails to be a geodesic and will eventually leave the manifold for sufficiently\nlarge step sizes \u03b1. This prompts us to construct the (unique) geodesic that emanates from point \u03a3\nwith initial velocity \u2212G. 
Formally, this geodesic can be represented as \u03b3(\u03b1) = Exp\u03a3(\u2212\u03b1G), where\nExp\u03a3(\u00b7) denotes the exponential map at \u03a3. As we will see below, this geodesic (visualized by the red\ncurve) remains within the manifold for any \u03b1 > 0 but may eventually leave the feasible region BFR.\nIf this happens for the chosen step size \u03b1, we project Exp\u03a3(\u2212\u03b1G) back onto the feasible region, that\nis, we map it to its closest point in BFR with respect to the FR distance (visualized by the yellow\ncross). Denoting this FR projection by ProjBFR (\u00b7), the next iterate of the projected geodesic gradient\ndescent algorithm can thus be expressed as \u03a3+ = ProjBFR (Exp\u03a3(\u2212\u03b1G)).\nStarting from \u03a31 = \u02c6\u03a3, the proposed algorithm constructs K iterates {\u03a3k}K\nk=1 via the above recursion.\nAs in [43], the algorithm also constructs a second sequence { \u00af\u03a3k}K\nk=1 of feasible covariance matrices\nwith \u00af\u03a31 = \u02c6\u03a3 and \u00af\u03a3k+1 = \u00af\u03b3(1/(k + 1)) for k = 1, . . . , K \u2212 1, where \u00af\u03b3(t) represents the geodesic (7)\nconnecting \u00af\u03a3k with \u03a3k+1. 
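
For intuition, the recursion above can be sketched in the scalar case n = 1, where every matrix function collapses to an ordinary real function. This is only an illustrative sketch under stated assumptions and not the authors' implementation: the function names, the step size 0.05 and the test values are hypothetical, and the projection is written as the point on the geodesic from the nominal variance to the trial point that lies at FR distance rho, which is what the closest-point definition reduces to for n = 1.

```python
import math

SQRT2 = math.sqrt(2.0)


def fr_distance(s0, s1):
    # Scalar version of the FR distance: |log(s0 / s1)| / sqrt(2).
    return abs(math.log(s0 / s1)) / SQRT2


def geodesic(s0, s1, t):
    # Scalar version of the geodesic from s0 to s1: s0^(1 - t) * s1^t.
    return s0 ** (1.0 - t) * s1 ** t


def exp_map(s, g):
    # Scalar exponential map at s applied to the tangent vector g.
    return s * math.exp(g / s)


def project(s_new, s_hat, rho):
    # Closest point (in FR distance) of the ball around s_hat with radius rho.
    rho_new = fr_distance(s_hat, s_new)
    if rho_new <= rho:
        return s_new
    return geodesic(s_hat, s_new, rho / rho_new)


def loss(s, sample_var):
    # Scalar objective: sample_var / s + log(s).
    return sample_var / s + math.log(s)


def projected_geodesic_descent(s_hat, sample_var, rho, alpha, num_iter):
    # Returns the last projected iterate and the geodesically averaged iterate.
    s = s_bar = s_hat
    for k in range(1, num_iter):
        grad = 2.0 * (s - sample_var)  # Riemannian gradient for n = 1
        s = project(exp_map(s, -alpha * grad), s_hat, rho)
        s_bar = geodesic(s_bar, s, 1.0 / (k + 1))
    return s, s_bar
```

In this scalar setting the optimum is known in closed form (the sample variance clipped to the interval spanned by the FR ball), which makes the sketch easy to sanity-check before moving to matrix-valued iterates.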
Thus, \u00af\u03a3k+1 is defined as a geodesic convex combination of \u00af\u03a3k and \u03a3k+1.\nA precise description of the proposed algorithm in pseudocode is provided in Algorithm 1.\nIn the following we show that the Riemannian gradient, the exponential map Exp\u03a3(\u00b7) as well as the\nprojection ProjBFR(\u00b7) can all be evaluated in closed form in O(n^3).\nBy [3, Page 362], the FR metric on the tangent space T\u03a3S^n_{++} at \u03a3 \u2208 S^n_{++} can be re-expressed as\n\n$$\\langle \\Omega_1, \\Omega_2 \\rangle_\\Sigma \\triangleq \\frac{1}{2} \\mathrm{Tr}\\left( \\Omega_1 \\Sigma^{-1} \\Omega_2 \\Sigma^{-1} \\right) \\quad \\forall \\Omega_1, \\Omega_2 \\in T_\\Sigma \\mathbb{S}^n_{++}. \\qquad (8)$$\n\nUsing (8) and [1, Equation 3.32], the Riemannian gradient G = grad L of the objective function L(\u00b7)\nat \u03a3 can be computed from the Euclidean gradient \u2207L(\u03a3) as\n\n$$\\mathrm{grad}\\, L(\\Sigma) = 2\\Sigma (\\nabla L(\\Sigma)) \\Sigma = 2\\Sigma (\\Sigma^{-1} - \\Sigma^{-1} S \\Sigma^{-1}) \\Sigma = 2(\\Sigma - S). \\qquad (9)$$\n\nMoreover, by [35, Equation (3.2)], the exponential map Exp\u03a3 : T\u03a3S^n_{++} \u2192 S^n_{++} at \u03a3 is given by\n\n$$\\mathrm{Exp}_\\Sigma(G) = \\Sigma^{\\frac{1}{2}} \\exp\\left( \\Sigma^{-\\frac{1}{2}} G \\Sigma^{-\\frac{1}{2}} \\right) \\Sigma^{\\frac{1}{2}}, \\quad G \\in T_\\Sigma \\mathbb{S}^n_{++} \\simeq \\mathbb{S}^n,$$\n\nwhere exp(\u00b7) denotes the matrix exponential. The inverse map Exp\u03a3^{\u22121} : S^n_{++} \u2192 T\u03a3S^n_{++} satisfies\n\n$$\\mathrm{Exp}_\\Sigma^{-1}(A) = \\Sigma^{\\frac{1}{2}} \\log\\left( \\Sigma^{-\\frac{1}{2}} A \\Sigma^{-\\frac{1}{2}} \\right) \\Sigma^{\\frac{1}{2}}, \\quad A \\in \\mathbb{S}^n_{++}.$$\n\nFinally, the projection ProjBFR(\u00b7) onto BFR with respect to the FR distance is defined through\n\n$$\\mathrm{Proj}_{\\mathcal{B}_{\\mathrm{FR}}}(\\Sigma') \\triangleq \\arg\\min_{\\Sigma \\in \\mathcal{B}_{\\mathrm{FR}}} d(\\Sigma, \\Sigma'), \\quad \\Sigma' \\in \\mathbb{S}^n_{++}. \\qquad (10)$$\n\nThe following lemma ensures that this projection is well-defined and admits a closed-form expression.\nLemma 2.6 (Projection onto BFR). For any \u03a3\u2032 \u2208 S^n_{++} with d( \u02c6\u03a3, \u03a3\u2032) = \u03c1\u2032 the following hold.\n(i) The arg min-mapping in (10) is a singleton, and thus ProjBFR (\u03a3\u2032) is well-defined.\n(ii) The projection of \u03a3\u2032 onto BFR is given by\n\n$$\\mathrm{Proj}_{\\mathcal{B}_{\\mathrm{FR}}}(\\Sigma') = \\begin{cases} \\hat\\Sigma^{\\frac{1}{2}} \\left( \\hat\\Sigma^{-\\frac{1}{2}} \\Sigma' \\hat\\Sigma^{-\\frac{1}{2}} \\right)^{\\frac{\\rho}{\\rho'}} \\hat\\Sigma^{\\frac{1}{2}} & \\text{if } \\rho' > \\rho, \\\\ \\Sigma' & \\text{otherwise.} \\end{cases} \\qquad (11)$$\n\nBy comparison with (7), one easily verifies that ProjBFR (\u03a3\u2032) constitutes a geodesic convex combination\nbetween \u03a3\u2032 and \u02c6\u03a3. Figure 1 visualizes the geodesic from \u02c6\u03a3 to \u03a3\u2032 by the blue dashed line.\nTherefore, the projection ProjBFR onto the FR ball BFR within S^n_{++} endowed with the FR metric is\nconstructed in a similar manner as the projection onto a Euclidean ball within a Euclidean space.\nThe following theorem asserts that Algorithm 1 enjoys a sublinear convergence rate.\nTheorem 2.7 (Sublinear convergence rate). With a constant stepsize\n\n$$\\alpha_k \\equiv \\frac{2^{1/4} \\sqrt{\\rho \\tanh(2\\sqrt{2}\\rho)}}{\\Gamma \\sqrt{K}}, \\quad \\text{where} \\quad \\Gamma \\triangleq 2^{-1/2} \\sqrt{n} \\cdot e^{2\\sqrt{2}\\rho} \\cdot \\lambda_{\\min}^{-2}(\\hat\\Sigma) \\cdot \\max\\left\\{ \\left| 1 - e^{\\sqrt{2}\\rho} \\lambda_{\\min}^{-1}(\\hat\\Sigma) \\lambda_{\\max}(S) \\right|, 1 \\right\\},$$\n\nAlgorithm 1 satisfies\n\n$$L(\\bar\\Sigma_K) - L(\\Sigma^\\star) \\leq \\frac{2^{\\frac{7}{4}} \\rho^{\\frac{3}{2}} \\Gamma}{\\sqrt{K \\tanh(2\\sqrt{2}\\rho)}}.$$\n\nThe proof of Theorem 2.7 closely follows that of [43, Theorem 9]. The difference is that [43,\nTheorem 9] requires the objective function to be Lipschitz continuous on S^n_{++}. Unfortunately, such\nan assumption is not satisfied by L(\u00b7). 
We circumvent this by proving that the Riemannian gradient\nof L(\u00b7) is bounded uniformly on BFR.\nEndeavors are currently underway to devise algorithms for minimizing a geodesically strongly convex\nobjective function over a geodesically convex feasible set that offer a linear convergence guarantee,\nsee, e.g., [43, Theorem 15]. The next lemma shows that the objective function of problem (6) is\nindeed geodesically smooth and geodesically strongly convex\u00b3 whenever S \u227b 0. This suggests that\nthe empirical performance of Algorithm 1 could be significantly better than the theoretical guarantee\nof Theorem 2.7. Indeed, our numerical results in Section 4.1 confirm that if S \u227b 0, then Algorithm 1\ndisplays a linear convergence rate.\nLemma 2.8 (Strong convexity and smoothness of L(\u00b7)). The objective function L(\u00b7) of problem (6)\nis geodesically \u03b2-smooth on BFR with\n\n$$\\beta = \\frac{2 \\lambda_{\\max}(S)}{\\lambda_{\\min}(\\hat\\Sigma) \\exp(-\\sqrt{2}\\rho)}.$$\n\nIf S \u227b 0, then L(\u00b7) is also geodesically \u03c3-strongly convex on BFR with\n\n$$\\sigma = \\frac{2 \\lambda_{\\min}(S)}{\\lambda_{\\max}(\\hat\\Sigma) \\exp(\\sqrt{2}\\rho)}.$$\n\nRemark 2.9. Problem (6) could also be addressed with the algorithmic framework developed in [24].\nDue to space limitations, we leave this for future research.\n\n\u00b3 The strong convexity and smoothness properties are defined in Definitions C.4 and C.5, respectively.\n\n3 Generalized Likelihood Estimation under the KL Divergence\n\nThe KL divergence, which is widely used in information theory [16, \u00a7 2], can be employed as\nan alternative dissimilarity measure in the optimistic likelihood problem (2). If both \u02c6P and P are\nGaussian distributions, then the KL divergence from \u02c6P to P admits an analytical expression.\nProposition 3.1 (KL divergence for Gaussian distributions). 
For any \u02c6\u00b5 \u2208 R^n and \u03a30, \u03a31 \u2208 S^n_{++},\nthe KL divergence from P0 = N (\u02c6\u00b5, \u03a30) to P1 = N (\u02c6\u00b5, \u03a31) amounts to\n\n$$\\mathrm{KL}(\\mathbb{P}_0 \\,\\|\\, \\mathbb{P}_1) = \\frac{1}{2} \\left( \\mathrm{Tr}\\left( \\Sigma_1^{-1} \\Sigma_0 \\right) + \\log\\det\\left( \\Sigma_1 \\Sigma_0^{-1} \\right) - n \\right).$$\n\nUnlike the FR distance, the KL divergence is not symmetric. Proposition 3.1 implies that if the KL\ndivergence is used as the dissimilarity measure, then the optimistic likelihood problem (2) reduces to\n\n$$\\min_{\\Sigma \\succ 0} \\left\\{ \\mathrm{Tr}\\left( S \\Sigma^{-1} \\right) + \\log\\det \\Sigma : \\mathrm{Tr}\\left( \\Sigma^{-1} \\hat\\Sigma \\right) + \\log\\det\\left( \\Sigma \\hat\\Sigma^{-1} \\right) - n \\leq 2\\rho \\right\\}, \\qquad (12)$$\n\nwhere S = M\u22121 \u2211_{m=1}^M (xm \u2212 \u02c6\u00b5)(xm \u2212 \u02c6\u00b5)^\u22a4 denotes again the sample covariance matrix. Because\nof the concave log-det terms in the objective and the constraints, problem (12) is non-convex. By\nusing the variable substitution X \u2190 \u03a3\u22121, however, problem (12) can be reduced to a univariate\nconvex optimization problem and thereby solved in quasi-closed form.\nTheorem 3.2. For any \u02c6\u03a3 \u227b 0 and \u03c1 > 0, the optimal value of problem (12) amounts to\n\n$$(1 + \\gamma^\\star) \\, \\mathrm{Tr}\\left( S (S + \\gamma^\\star \\hat\\Sigma)^{-1} \\right) + \\log\\det(S + \\gamma^\\star \\hat\\Sigma) - n \\log(1 + \\gamma^\\star),$$\n\nwhere \u03b3\u22c6 is the unique optimal solution of the univariate convex optimization problem\n\n$$\\min_{\\gamma > 0} \\left\\{ \\gamma (2\\rho + \\log\\det \\hat\\Sigma) + n (1 + \\gamma) \\log(1 + \\gamma) - (1 + \\gamma) \\log\\det(S + \\gamma \\hat\\Sigma) \\right\\}. \\qquad (13)$$\n\nProblem (13) can be solved efficiently using state-of-the-art first- or second-order methods, see\nAppendix E. However, in each iteration we still need to evaluate the determinant of a positive definite\nn-by-n matrix, which requires O(n^3) arithmetic operations. 
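
As a rough illustration of Theorem 3.2, the following sketch treats the scalar case n = 1 with a strictly positive sample variance, where traces and determinants are plain numbers. It is a sketch under stated assumptions rather than the authors' code: the helper names are hypothetical, the derivative of the objective of (13) was worked out by hand for n = 1, and plain bisection stands in for the first- or second-order methods mentioned in the text.

```python
import math


def kl_twice(sigma, sigma_hat):
    # Left-hand side of the constraint in (12) for n = 1 (twice the KL divergence).
    return sigma_hat / sigma + math.log(sigma / sigma_hat) - 1.0


def dual_slope(gamma, s, sigma_hat, rho):
    # Derivative of the objective of (13) for n = 1; assumes s > 0.
    x = (1.0 + gamma) / (s + gamma * sigma_hat)
    return 2.0 * rho + 1.0 + math.log(sigma_hat * x) - sigma_hat * x


def solve_gamma(s, sigma_hat, rho, steps=200):
    # The objective of (13) is convex, so its derivative is nondecreasing
    # and a sign change can be located by bisection.
    if dual_slope(0.0, s, sigma_hat, rho) >= 0.0:
        return 0.0  # the KL constraint is inactive
    lo, hi = 0.0, 1.0
    while dual_slope(hi, s, sigma_hat, rho) < 0.0:
        lo, hi = hi, 2.0 * hi
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if dual_slope(mid, s, sigma_hat, rho) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def optimistic_value(s, sigma_hat, rho):
    # Optimal value of (12) via the formula of Theorem 3.2, specialised to n = 1.
    gamma = solve_gamma(s, sigma_hat, rho)
    pooled = s + gamma * sigma_hat
    value = (1.0 + gamma) * s / pooled + math.log(pooled) - math.log(1.0 + gamma)
    sigma_star = pooled / (1.0 + gamma)  # variance attaining the optimum
    return value, sigma_star, gamma
```

For example, optimistic_value(4.0, 1.0, 0.1) returns a variance strictly between the nominal value 1 and the sample value 4 at which the KL constraint holds with equality, whereas a sample variance already inside the KL ball yields gamma equal to zero.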
The following corollary shows that this\ncomputational burden can be alleviated when the sample covariance matrix S has low rank.\nCorollary 3.3 (Singular sample covariance matrices). If S = \u039b\u039b^\u22a4 for some \u039b \u2208 R^{n\u00d7k} and k \u2208 N\nwith k < n, then problem (13) simplifies to\n\n$$\\min_{\\gamma > 0} \\left\\{ 2\\gamma\\rho + n(1+\\gamma)\\log(1+\\gamma) - (n-k)(1+\\gamma)\\log\\gamma - (1+\\gamma)\\log\\det\\left( \\gamma I_k + \\Lambda^\\top \\hat\\Sigma^{-1} \\Lambda \\right) \\right\\}.$$\n\nWe will see that for classification problems the matrix S has rank 1, in which case the log-det term\nin the above univariate convex program reduces to the scalar logarithm. In Appendix E we provide\nexplicit first- and second-order derivatives of the objective of problem (13) and its simplification.\n\n4 Numerical Results\n\nWe investigate the empirical behavior of our projected geodesic gradient descent algorithm (Section 4.1)\nand the predictive power of our flexible discriminant rules (Section 4.2). Our algorithm\nand all tests are implemented in Python, and the source code is available from\nhttps://github.com/sorooshafiee/Optimistic_Likelihoods.\n\n4.1 Convergence Behavior of the Projected Geodesic Descent Algorithm\nTo study the empirical convergence behavior of Algorithm 1, for n \u2208 {10, 20, . . . , 100} we generate\n100 covariance matrices \u02c6\u03a3 according to the following procedure. We (i) draw a standard normal\nrandom matrix B \u2208 R^{n\u00d7n} and compute A = B + B^\u22a4; we (ii) conduct an eigenvalue decomposition\nA = RDR^\u22a4; we (iii) replace D with a random diagonal matrix \u02c6D whose diagonal elements are\nsampled uniformly from [1, 10]^n; and we (iv) set \u02c6\u03a3 = R \u02c6DR^\u22a4. 
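
The sampling steps (i) to (iv) can be sketched for the smallest nontrivial case n = 2, where the eigenvectors of a symmetric matrix are given by a single rotation angle. This is a dependency-free illustration only: the function names are hypothetical, and the experiments in the paper use n between 10 and 100, for which a numerical eigenvalue routine would be needed instead of the closed-form angle.

```python
import math
import random


def rotation_from_symmetric(a, b, d):
    # Cosine and sine of the angle whose rotation diagonalizes [[a, b], [b, d]].
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    return math.cos(theta), math.sin(theta)


def random_covariance_2x2(rng):
    # (i) Draw a standard normal matrix B and form A = B + B^T.
    b11, b12, b21, b22 = (rng.gauss(0.0, 1.0) for _ in range(4))
    a, b, d = 2.0 * b11, b12 + b21, 2.0 * b22
    # (ii) Eigenvectors of A, stored as the rotation R = [[c, -s], [s, c]].
    c, s = rotation_from_symmetric(a, b, d)
    # (iii) Replace the eigenvalues by draws from the uniform law on [1, 10].
    d1, d2 = rng.uniform(1.0, 10.0), rng.uniform(1.0, 10.0)
    # (iv) Assemble R * diag(d1, d2) * R^T entry by entry.
    s11 = d1 * c * c + d2 * s * s
    s12 = (d1 - d2) * c * s
    s22 = d1 * s * s + d2 * c * c
    return [[s11, s12], [s12, s22]]
```

Calling random_covariance_2x2(random.Random(0)) returns a symmetric 2-by-2 matrix whose eigenvalues lie in [1, 10], so its condition number is at most 10 by construction.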
For each of these covariance matrices, we set $\\hat{\\mu} = 0$, $M = 1$, $x_1^M \\triangleq x$ for a standard normal random vector $x \\in \\mathbb{R}^n$ and calculate the optimistic likelihood (6) for $\\rho = \\sqrt{n}/100$. This choice of $\\rho$ ensures that the radius of the ambiguity set scales with $n$ in the same way as the Frobenius norm. Figures 2(a) and 2(b) report the number of iterations as well as the overall execution time of Algorithm 1 when we terminate the algorithm as soon as the relative improvement $|[L(\\Sigma_{k+1}) - L(\\Sigma_k)]/L(\\Sigma_{k+1})|$ drops below 0.01%. Notice that the number of required iterations scales linearly with $n$ while the overall runtime grows polynomially with $n$. Figure 2(c) shows the relative improvement as a function of the iteration count. Empirically, the relative improvement decays as $O(1/k^2)$ in the iteration count $k$, which is faster than the theoretical rate established in Theorem 2.7. We also study the empirical convergence behavior of Algorithm 1 when the input matrix $S$ is positive definite. We repeat the first experiment with $M = 100$, and we set $S = \\delta I + \\sum_{i=1}^{M} x_i x_i^\\top / M$ for $\\delta = 10^{-6}$ to ensure that $S$ is positive definite. Figure 2(d) indicates that, in this case, the empirical convergence rate of Algorithm 1 is linear.\n\nFigure 2: Convergence behavior of the projected geodesic gradient descent algorithm. Panels: (a) scaling of iteration count for $S \\succeq 0$; (b) scaling of execution time for $S \\succeq 0$; (c) convergence for $n = 100$ for $S \\succeq 0$; (d) convergence for $n = 100$ for $S \\succ 0$. Solid lines (shaded regions) represent averages (ranges) across 100 independent simulations.\n\n4.2 Application: Flexible Discriminant Rules\n\nConsider a classification problem where a categorical response $Y \\in \\mathcal{C}$, $\\mathcal{C} = \\{1, \\ldots, C\\}$, should be predicted from continuous inputs $X \\in \\mathbb{R}^n$. 
In this context, Bayes\u2019 Theorem implies that $\\mathbb{P}(Y = c \\mid X = x) \\propto \\pi_c \\cdot f_c(x)$, $c \\in \\mathcal{C}$, where $\\pi_c = \\mathbb{P}(Y = c)$ denotes the prior probability of the response belonging to class $c$, and $f_c$ is the density function of $X$ for an observation of class $c$. In practice, $\\pi_c$ and $f_c$ are unknown and need to be estimated from a training data set $(\\hat{x}_1, \\hat{y}_1), \\ldots, (\\hat{x}_N, \\hat{y}_N)$. Assuming that the densities $f_c$, $c \\in \\mathcal{C}$, correspond to Gaussian distributions with (unknown) class-specific means $\\mu_c$ and covariance matrices $\\Sigma_c$, the quadratic discriminant analysis (QDA) replaces $\\pi_c$ with $\\hat{\\pi}_c = N_c/N$, where $N_c = |\\{i : \\hat{y}_i = c\\}|$, and $f_c$ with the density of the Gaussian distribution $\\hat{\\mathbb{P}}_c \\sim \\mathcal{N}(\\hat{\\mu}_c, \\hat{\\Sigma}_c)$, whose mean and covariance matrix are estimated from the training data, to classify a new observation $x$ using the discriminant rule\n\n$$C_{\\mathrm{QDA}}(x) \\in \\arg\\max_{c \\in \\mathcal{C}} \\left\\{ \\frac{1}{2} \\ell(x, \\hat{\\mathbb{P}}_c) + \\log(\\hat{\\pi}_c) \\right\\}.$$\n\nTable 1: Average correct classification rates on the benchmark instances\n\nAustralian\nBanknote authentication\nClimate model\nCylinder\nDiabetic\nFourclass\nGerman credit\nHaberman\nHeart\nHousing\nIlpd\nMammographic mass\nPima\nRingnorm\n\nFQDA KQDA QDA RQDA SQDA 
WQDA\n79.94\n80.68\n98.54\n99.07\n94.46\n92.78\n70.34\n70.69\n75.97\n75.04\n80.13\n79.33\n74.50\n76.31\n74.97\n74.87\n84.23\n82.42\n88.31\n88.89\n55.15\n57.42\n80.66\n80.65\n75.97\n75.04\n98.69\n98.56\n\n83.68\n99.47\n94.55\n70.67\n74.53\n79.97\n74.60\n75.41\n83.31\n92.90\n57.83\n80.85\n74.53\n98.65\n\n79.76\n98.54\n92.72\n70.33\n74.60\n79.32\n76.18\n74.96\n82.62\n87.01\n54.97\n80.88\n74.60\n98.56\n\n80.73\n98.53\n94.42\n70.99\n74.70\n79.32\n74.99\n75.04\n84.17\n81.69\n55.45\n81.05\n74.70\n98.65\n\n80.03\n98.56\n91.78\n67.10\n74.19\n79.32\n71.41\n74.92\n81.42\n88.54\n55.18\n80.37\n74.19\n98.56\n\nHere, the likelihood $\\ell(x, \\hat{\\mathbb{P}}_c)$ is defined as in (1) for $M = 1$. If $\\hat{\\pi}_1 = \\ldots = \\hat{\\pi}_C$, this classification rule reduces to the maximum likelihood discriminant rule [20, \u00a7 14].\n\nQDA can be sensitive to misspecifications of the empirical moments. To reduce this sensitivity, we replace the nominal Gaussian distributions $\\hat{\\mathbb{P}}_c$ with the Gaussian distributions $\\mathbb{P}_c^\\star$ that would have generated the sample $x$ with the highest likelihood, among all Gaussian distributions in the vicinity of the nominal distributions $\\hat{\\mathbb{P}}_c$. This results in a flexible discriminant rule of the form\n\n$$C_{\\mathrm{flex}}(x) \\in \\arg\\max_{c \\in \\mathcal{C}} \\left\\{ \\frac{1}{2} \\max_{\\mathbb{P} \\in \\mathcal{P}_c} \\ell(x, \\mathbb{P}) + \\log(\\hat{\\pi}_c) \\right\\},$$\n\nwhich makes use of the optimistic likelihoods (2). Here, $\\mathcal{P}_c$ is the FR or KL ball centered at the nominal distribution $\\hat{\\mathbb{P}}_c$. To ensure that $\\hat{\\Sigma}_c \\succ 0$ for all $c \\in \\mathcal{C}$, we use the Ledoit-Wolf covariance estimator [23], which is parameter-free and returns a well-conditioned matrix by minimizing the mean squared error between the estimated and the real covariance matrix.\n\nWe compare the performance of our flexible discriminant rules with standard QDA implementations from the literature on datasets from the UCI repository [4]. 
Specifically, we compare the following methods.\n\u2022 FQDA and KQDA: our flexible discriminant rules based on FR (FQDA) and KL (KQDA) ambiguity sets with radii $\\rho_c$;\n\u2022 QDA: regular QDA with empirical means and covariance matrices estimated from data;\n\u2022 RQDA: regularized QDA based on the linear shrinkage covariance estimator $\\hat{\\Sigma}_c + \\rho_c I_n$;\n\u2022 SQDA: sparse QDA based on the graphical lasso covariance estimator [19] with parameter $\\rho_c$;\n\u2022 WQDA: Wasserstein QDA based on the nonlinear shrinkage approach [27] with parameter $\\rho_c$.\nAll results are averaged across 100 independent trials for $\\rho_c \\in \\{a \\sqrt{n} \\cdot 10^b : a \\in \\{1, \\ldots, 9\\}, b \\in \\{-3, -2, -1\\}\\}$. In each trial, we randomly select 75% of the data for training and the remaining 25% for testing. The size of the ambiguity set and the regularization parameter are selected using stratified 5-fold cross validation. The performance of the classifiers is measured by the correct classification rate (CCR). The average CCR scores over the 100 trials are reported in Table 1.\n\nAcknowledgments We gratefully acknowledge financial support from the Swiss National Science Foundation under grant BSCGI0_157733 as well as the EPSRC grants EP/M028240/1, EP/M027856/1 and EP/N020030/1.\n\nReferences\n\n[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.\n\n[2] M. Arnaudon, F. Barbaresco, and L. Yang. Riemannian medians and means with applications to radar signal processing. IEEE Journal of Selected Topics in Signal Processing, 7(4):595\u2013604, 2013.\n\n[3] C. Atkinson and A. F. Mitchell. Rao\u2019s distance measure. Sankhy\u0101: The Indian Journal of Statistics, Series A, 43(3):345\u2013365, 1981.\n\n[4] K. Bache and M. Lichman. UCI machine learning repository, 2013. Available from http://archive.ics.uci.edu/ml.\n\n[5] M. Bauer, M. 
Bruveris, and P. W. Michor. Uniqueness of the Fisher\u2013Rao metric on the space of smooth densities. Bulletin of the London Mathematical Society, 48(3):499\u2013506, 2016.\n\n[6] G. C. Bento, O. P. Ferreira, and J. G. Melo. Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds. Journal of Optimization Theory and Applications, 173(2):548\u2013562, 2017.\n\n[7] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2009.\n\n[8] J. Bi and T. Zhang. Support vector classification with input data uncertainty. In Advances in Neural Information Processing Systems, pages 161\u2013168, 2005.\n\n[9] S. Bonnabel and R. Sepulchre. Riemannian metric and geometric mean for positive semidefinite matrices of fixed rank. SIAM Journal on Matrix Analysis and Applications, 31(3):1055\u20131070, 2009.\n\n[10] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[11] M. R. Bridson and A. Haefliger. Metric Spaces of Non-Positive Curvature. Springer, 2013.\n\n[12] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report UBC TR-2009-023 and arXiv:1012.2599, University of British Columbia, Department of Computer Science, 2009.\n\n[13] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends\u00ae in Machine Learning, 5(1):1\u2013122, 2012.\n\n[14] L. L. Campbell. An extended \u010cencov characterization of the information metric. Proceedings of the American Mathematical Society, 98(1):135\u2013141, 1986.\n\n[15] N. N. \u010cencov. Statistical Decision Rules and Optimal Inference. American Mathematical Society, 2000.\n\n[16] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 2006.\n\n[17] D. R. Cox. 
Tests of separate families of hypotheses. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 105\u2013123, 1961.\n\n[18] D. R. Cox. A return to an old paper: \u2018tests of separate families of hypotheses\u2019. Journal of the Royal Statistical Society: Series B, 75(2):207\u2013215, 2013.\n\n[19] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432\u2013441, 2008.\n\n[20] W. K. H\u00e4rdle and L. Simar. Applied Multivariate Statistical Analysis. Springer, 2015.\n\n[21] S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi. Kernel methods on the Riemannian manifold of symmetric positive definite matrices. In IEEE Conference on Computer Vision and Pattern Recognition, pages 73\u201380, 2013.\n\n[22] S. Lang. Fundamentals of Differential Geometry. Springer, 2012.\n\n[23] O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365\u2013411, 2004.\n\n[24] C. Liu and N. Boumal. Simple Algorithms for Optimization on Riemannian Manifolds with Constraints. Applied Mathematics and Optimization, 2019.\n\n[25] G. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. John Wiley & Sons, 2004.\n\n[26] R. Munos. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends\u00ae in Machine Learning, 7(1):1\u2013129, 2014.\n\n[27] V. A. Nguyen, D. Kuhn, and P. M. Esfahani. Distributionally robust inverse covariance estimation: The Wasserstein shrinkage estimator. arXiv preprint arXiv:1805.07194, 2018.\n\n[28] V. A. Nguyen, S. Shafieezadeh-Abadeh, M.-C. Yue, D. Kuhn, and W. Wiesemann. Optimistic distributionally robust optimization for nonparametric likelihood approximation. In Advances in Neural Information Processing Systems, 2019.\n\n[29] M. 
Norton, A. Takeda, and A. Mafusalov. Optimistic robust optimization with applications to machine learning. arXiv preprint arXiv:1711.07511, 2017.\n\n[30] X. Pennec, P. Fillard, and N. Ayache. A Riemannian framework for tensor computing. International Journal of Computer Vision, 66(1):41\u201366, 2006.\n\n[31] L. F. Price, C. C. Drovandi, A. Lee, and D. J. Nott. Bayesian synthetic likelihood. Journal of Computational and Graphical Statistics, 27(1):1\u201311, 2018.\n\n[32] S. Said, L. Bombrun, Y. Berthoumieu, and J. H. Manton. Riemannian Gaussian distributions on the space of symmetric positive definite matrices. IEEE Transactions on Information Theory, 63(4):2153\u20132170, 2017.\n\n[33] R. P. Savage. The space of positive definite matrices and Gromov\u2019s invariant. Transactions of the American Mathematical Society, 274(1):239\u2013263, 1982.\n\n[34] R. Schoenberg. Constrained maximum likelihood. Computational Economics, 10(3):251\u2013266, 1997.\n\n[35] S. Sra and R. Hosseini. Conic geometric optimization on the manifold of positive definite matrices. SIAM Journal on Optimization, 25(1):713\u2013739, 2015.\n\n[36] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In International Conference on Machine Learning, pages 1015\u20131022, 2010.\n\n[37] A. Takatsu. Wasserstein geometry of Gaussian measures. Osaka Journal of Mathematics, 48(4):1005\u20131026, 2011.\n\n[38] N. Tripuraneni, N. Flammarion, F. Bach, and M. Jordan. Averaging stochastic gradient descent on Riemannian manifolds. In Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 650\u2013687, 2018.\n\n[39] O. Tuzel, F. Porikli, and P. Meer. Pedestrian detection via classification on Riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10):1713\u20131727, 2008.\n\n[40] C. Villani. 
Optimal Transport: Old and New. Springer, 2008.\n\n[41] S. N. Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102\u20131104, 2010.\n\n[42] H. Zhang, S. J. Reddi, and S. Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4592\u20134600, 2016.\n\n[43] H. Zhang and S. Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1617\u20131638, 2016.\n\n[44] H. Zhang and S. Sra. An estimate sequence for geodesically convex optimization. In Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1703\u20131723, 2018.\n", "award": [], "sourceid": 7779, "authors": [{"given_name": "Viet Anh", "family_name": "Nguyen", "institution": "EPFL"}, {"given_name": "Soroosh", "family_name": "Shafieezadeh Abadeh", "institution": "EPFL"}, {"given_name": "Man-Chung", "family_name": "Yue", "institution": "The Hong Kong Polytechnic University"}, {"given_name": "Daniel", "family_name": "Kuhn", "institution": "EPFL"}, {"given_name": "Wolfram", "family_name": "Wiesemann", "institution": "Imperial College"}]}