{"title": "Inexact trust-region algorithms on Riemannian manifolds", "book": "Advances in Neural Information Processing Systems", "page_first": 4249, "page_last": 4260, "abstract": "We consider an inexact variant of the popular Riemannian trust-region algorithm for structured big-data minimization problems. The proposed algorithm approximates the gradient and the Hessian in addition to the solution of a trust-region sub-problem. Addressing large-scale finite-sum problems, we specifically propose sub-sampled algorithms with a fixed bound on sub-sampled Hessian and gradient sizes, where the gradient and Hessian are computed by a random sampling technique. Numerical evaluations demonstrate that the proposed algorithms outperform state-of-the-art Riemannian deterministic and stochastic gradient algorithms across different applications.", "full_text": "Inexact trust-region algorithms on\n\nRiemannian manifolds\n\nHiroyuki Kasai\n\nThe University of Electro-Communications\n\nJapan\n\nkasai@is.uec.ac.jp\n\nBamdev Mishra\n\nMicrosoft\n\nIndia\n\nbamdevm@microsoft.com\n\nAbstract\n\nWe consider an inexact variant of the popular Riemannian trust-region algorithm\nfor structured big-data minimization problems. The proposed algorithm approx-\nimates the gradient and the Hessian in addition to the solution of a trust-region\nsub-problem. Addressing large-scale \ufb01nite-sum problems, we speci\ufb01cally propose\nsub-sampled algorithms with a \ufb01xed bound on sub-sampled Hessian and gradient\nsizes, where the gradient and Hessian are computed by a random sampling tech-\nnique. Numerical evaluations demonstrate that the proposed algorithms outper-\nform state-of-the-art Riemannian deterministic and stochastic gradient algorithms\nacross different applications.\n\n1\n\nIntroduction\n\nWe consider the optimization problem\n\nf (x),\n\nmin\nx\u2208M\n\nn!n\n\n(1)\nwhere f : M\u2192 R is a smooth real-valued function on a Riemannian manifold M [1]. 
The focus of this paper is the case where f has a finite-sum structure, which frequently arises in big-data problems in machine learning applications. Specifically, we consider the form f(x) := (1/n) Σ_{i=1}^n f_i(x), where n is the total number of samples and f_i(x) is the cost function for the i-th (i ∈ [n]) sample.
Riemannian optimization translates the constrained optimization problem (1) into an unconstrained optimization problem over the manifold M. This viewpoint has shown benefits in many applications. The principal component analysis (PCA) and subspace tracking problems are defined on the Grassmann manifold [2, 3]. The low-rank matrix completion (MC) and tensor completion problems are examples on the manifold of fixed-rank matrices and tensors [4, 5, 6, 7, 8, 9, 10]. The linear regression problem is defined on the manifold of fixed-rank matrices [11, 12]. The independent component analysis (ICA) problem requires a whitening step that is posed as a joint diagonalization problem on the Stiefel manifold [13, 14].
A popular choice for solving (1) is the Riemannian steepest descent (RSD) algorithm [1, Sec. 4], which traces back to [15]. RSD calculates the Riemannian full gradient gradf(x) at every iteration, which can be computationally heavy when the data size n is extremely large. As an alternative, the Riemannian stochastic gradient descent (RSGD) algorithm is a computationally efficient approach [16], which extends stochastic gradient descent (SGD) in the Euclidean space to general Riemannian manifolds [17, 18, 19]. The benefit of RSGD is that it calculates only the Riemannian stochastic gradient gradf_i(x) corresponding to a particular i-th sample at every iteration. Consequently, the complexity per iteration of RSGD is independent of the sample size n, which leads to higher scalability for large-scale data.
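As a concrete illustration of an RSGD iteration (project a sampled Euclidean gradient onto the tangent space, step, then retract), the following is a minimal NumPy sketch on the unit sphere for a toy leading-eigenvector problem. The cost, stepsize schedule, and all names here are illustrative choices of ours, not the paper's implementation:

```python
import numpy as np

def rsgd_sphere(Z, steps=3000, lr0=0.01, seed=0):
    """Riemannian SGD sketch for min_{||x||=1} f(x) = -(1/n) sum_i (z_i^T x)^2,
    whose minimizer is the leading eigenvector of Z^T Z."""
    rng = np.random.default_rng(seed)
    n, d = Z.shape
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    for k in range(steps):
        i = rng.integers(n)                       # one random sample per iteration
        egrad = -2.0 * (Z[i] @ x) * Z[i]          # Euclidean gradient of f_i at x
        rgrad = egrad - (x @ egrad) * x           # project onto the tangent space T_x S^{d-1}
        x = x - lr0 / (1.0 + 0.01 * k) * rgrad    # step with a decaying stepsize
        x /= np.linalg.norm(x)                    # retraction: renormalize onto the sphere
    return x
```

Note that the per-iteration cost touches a single sample, which is exactly the scalability property discussed above.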
Although the iterates generated by RSGD are not guaranteed to decrease the objective value, −gradf_i(x) is a descent direction in expectation. However, similar to SGD, RSGD suffers from slow convergence due to a decaying stepsize sequence. To address this issue, variance reduction (VR) methods on Riemannian manifolds, including RSVRG [20, 21] and RSRG [22], have recently been proposed to accelerate the convergence of RSGD; they are generalizations of the corresponding algorithms in the Euclidean space [23, 24, 25, 26, 27, 28]. The core idea is to reduce the variance of noisy stochastic gradients by periodic full-gradient estimations, resulting in a linear convergence rate. It should, however, be pointed out that such Riemannian VR methods require retraction and vector transport operations at every iteration. As the computational cost of a retraction and a vector transport operation is similar to that of a Riemannian stochastic gradient computation, Riemannian VR methods may have slower wall-clock performance per iteration than RSGD.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

All the above algorithms are first-order algorithms, which guarantee convergence to the first-order optimality condition, i.e., ∥gradf(x)∥_x = 0, using only gradient information. As a result, their performance on ill-conditioned problems suffers due to poor curvature approximation. Second-order algorithms, on the other hand, alleviate the effect of ill-conditioning by exploiting curvature information effectively. Therefore, they are expected to converge to a solution that satisfies the second-order optimality conditions, i.e., ∥gradf(x)∥_x = 0 and Hessf(x) ⪰ 0, where Hessf(x) is the Riemannian Hessian of f at x [29]. The Riemannian Newton method is a second-order algorithm, which has a superlinear local convergence rate [1, Thm. 6.3.2].
The Riemannian Newton method, however, lacks global convergence, and a practical variant of it is computationally expensive to implement. A popular alternative to the Riemannian Newton method is the Riemannian limited-memory BFGS algorithm (RLBFGS), which requires less memory. It, however, exhibits only a linear convergence rate and requires many vector transports of curvature-information pairs [30, 31, 32]. Finally, the Riemannian trust-region algorithm (RTR) comes with a global convergence property [1, Thm. 7.4.4] and a superlinear local convergence rate [1, Thm. 7.4.11]. It can alleviate a poor approximation of the local quadratic model (e.g., the one the Newton method uses) by adjusting a trust radius at every iteration. Considering an ϵ-approximate second-order optimality condition (Def. 2.1), RTR can return an (ϵ_g, ϵ_H)-optimal point in O(max{1/ϵ_H^3, 1/(ϵ_g^2 ϵ_H)}) iterations when the true Hessian is used in the model and a second-order retraction is used [33]. On the stochastic front, the VR methods have recently been extended to take curvature information into account [34]. Although they achieve practical improvements on ill-conditioned problems, their convergence rates are worse than those of RSVRG and RSRG.
A common issue among second-order algorithms is the higher computational cost of dealing with exact or approximate Hessian matrices, which is prohibitive in a large-scale setting. To address this issue, inexact techniques, including sub-sampling techniques, have recently been proposed in the Euclidean space [35, 36, 37, 38, 39]. However, no such work has been reported in the Riemannian setting. To this end, we propose an inexact Riemannian trust-region algorithm, inexact RTR, for (1). Additionally, we propose a sub-sampled trust-region algorithm, Sub-RTR, as a practical but efficient variant of inexact RTR for finite-sum problems.
The theoretical convergence proof heavily relies on those of the original works in the Euclidean space [37, 38, 39] and of the RTR algorithm [33]. We particularly derive bounds on the sample sizes of the sub-sampled Riemannian Hessian and gradient, and show practical performance improvements of our algorithms over other Riemannian algorithms. We specifically address the case of compact submanifolds of R^n by following [33]. Additionally, the numerical experiments include problems on the Grassmann manifold to show the effectiveness of our algorithms on more general quotient manifolds.
The paper is organized as follows. Section 2 describes the preliminaries and assumptions. We propose a novel inexact trust-region algorithm in the Riemannian setting in Section 3. In Section 4, we propose sub-sampled trust-region algorithms as its practical variants. Building upon the results in the Euclidean space [37, 38, 39] and those of the RTR algorithm [33], we derive bounds on the sample sizes of the sub-sampled gradients and Hessians in Theorem 4.1, which only require a fixed sample size [37]. This has not been addressed in [37, 38, 39, 33]. In Section 5, numerical experiments on three different problems demonstrate significant speed-ups compared with state-of-the-art Riemannian deterministic and stochastic algorithms when the sample size n is large.
The implementation of the proposed algorithms uses the MATLAB toolbox Manopt [40] and is available at https://github.com/hiroyuki-kasai/Subsampled-RTR. The proofs of the theorems and additional experiments are provided as supplementary material.

2 Preliminaries and assumptions

We assume that M is endowed with a Riemannian metric structure, i.e., a smooth inner product ⟨·,·⟩_x of tangent vectors is associated with the tangent space T_xM for all x ∈ M.
The norm ∥·∥_x of a tangent vector in T_xM is the norm associated with the Riemannian metric. We also assume that f is twice continuously differentiable throughout this paper.

2.1 Riemannian trust-region algorithm (RTR)

RTR is the generalization of the classical trust-region algorithm in the Euclidean space [41] to Riemannian manifolds [1, Chap. 7]. In comparison with the Euclidean case, in RTR, the approximation model m_x of f around x is obtained from the Taylor expansion of the pullback f̂_x := f ∘ R_x defined on the tangent space, where R_x is the retraction operator that maps a tangent vector onto the manifold with a local rigidity condition that preserves the gradient at x [1, Chap. 4]. The exponential mapping is an instance of a retraction. f̂_x is a real-valued function on the vector space T_xM, namely the pullback of f at x to T_xM through R_x, considered around the origin 0_x of T_xM. The model of f̂_x is denoted as m̂_x, where m_x = m̂_x ∘ R_x^{-1}, and is chosen for ξ ∈ T_xM as

m̂_x(ξ) = f(x) + ⟨gradf(x), ξ⟩_x + (1/2)⟨H(x)[ξ], ξ⟩_x,    (2)

where H(x) : T_xM → T_xM is some symmetric operator on T_xM. The RTR algorithm starts with an initial point x_0 ∈ M, an initial radius Δ_0, and a maximum radius Δ_max. At iteration k, RTR defines a trust region of radius Δ_k around the current point x_k ∈ M, within which the local model m̂_{x_k} can be trusted to be a reasonable approximation of the real objective function f̂_{x_k}. It then finds the direction and the length of the step, denoted as η_k, simultaneously by solving a sub-problem based on the approximate model in this region. It should be noted that this calculation is performed in the vector space T_{x_k}M.
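As an illustration of how such a sub-problem is treated, the following NumPy sketch computes the classical Cauchy point of a quadratic model in tangent-space coordinates, i.e., the minimizer of the model along the negative gradient restricted to the trust-region ball. The function names and test problem are ours, not the paper's:

```python
import numpy as np

def cauchy_step(g, H, delta):
    """Cauchy point of the model m(eta) = f + <g, eta> + 0.5 <eta, H[eta]>
    on the ball ||eta|| <= delta, working in coordinates of the tangent space."""
    gn = np.linalg.norm(g)
    gHg = g @ H @ g
    if gHg <= 0.0:
        tau = 1.0                                 # negative curvature along -g: go to the boundary
    else:
        tau = min(gn**3 / (delta * gHg), 1.0)     # clip the 1-D minimizer to the ball
    return -(tau * delta / gn) * g

def model_decrease(g, H, eta):
    """m(0) - m(eta) for the quadratic model above."""
    return -(g @ eta + 0.5 * eta @ H @ eta)
```

One can verify numerically that this step achieves at least the Cauchy-type decrease (1/2)∥g∥ min{∥g∥/(1+∥H∥), Δ} used in the sufficient-descent conditions later on.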
The next candidate iterate x_k^+ = R_{x_k}(η_k) is accepted as x_{k+1} = x_k^+ when the decrease of the true objective function f̂_k(0_{x_k}) − f̂_k(η_k) is sufficiently large relative to that of the approximate model m̂_k(0_{x_k}) − m̂_k(η_k). Otherwise, we set x_{k+1} = x_k. Here, f̂_k and m̂_k represent f̂_{x_k} and m̂_{x_k}, respectively, and hereinafter we use them for notational simplicity. The trust-region radius Δ_k is enlarged, unchanged, or shrunk by the parameter γ > 1 according to the degree of agreement between the model decrease and the true function decrease.

2.2 Essential assumptions

Since the first-order optimality condition, i.e., ∥gradf(x)∥_x = 0, is not sufficient in non-convex minimization problems due to the existence of saddle points and local maxima, we typically design algorithms that guarantee convergence to a point satisfying the second-order optimality conditions ∥gradf(x)∥_x = 0 and Hessf(x) ⪰ 0. In practice, however, we use an approximate condition, defined as (ϵ_g, ϵ_H)-optimality below.
Definition 2.1 ((ϵ_g, ϵ_H)-optimality [42]). Given 0 < ϵ_g, ϵ_H < 1, x is said to be an (ϵ_g, ϵ_H)-optimal point of (1) when

∥gradf(x)∥_x ≤ ϵ_g, and Hessf(x) ⪰ −ϵ_H Id,

where gradf(x) is the Riemannian gradient and Hessf(x) is the Riemannian Hessian of f at x. Id is the identity mapping.
We now provide the essential assumptions. We consider the inexact Hessian H(x_k) : T_{x_k}M → T_{x_k}M and the inexact gradient G(x_k) ∈ T_{x_k}M for gradf(x) in (2). Hereinafter, we write H_k := H(x_k) and G_k := G(x_k) for notational simplicity.
Assumption 1 (Compact submanifold in R^n and second-order retraction). We consider compact submanifolds of R^n.
We also assume that the retraction is a second-order retraction.
It should be noted that, although the Hessian ∇²f̂_x(0_x) and the Riemannian Hessian Hessf(x) are in general different from each other, they are identical under a second-order retraction [33, Lem. 17]. This assumption ensures that, as stated in Theorem 3.1, Algorithm 1 provides a solution that satisfies the (ϵ_g, ϵ_H)-optimality. Otherwise, it gives a solution satisfying λ_min(H(x)) ≥ −ϵ_H. It should be stressed that second-order retractions are available on many submanifolds, e.g., R_x(η) = (x+η)/∥x+η∥_x in the case of the sphere manifold [1, Sec. 4].

Assumption 2 (Restricted Lipschitz Hessian [33, A.5]). If ϵ_H < ∞, there exists L_H ≥ 0 such that, for all x_k, f̂_k satisfies

|f̂_k(η_k) − f(x_k) − ⟨gradf(x_k), η_k⟩_{x_k} − (1/2)⟨η_k, ∇²f̂_k(0_{x_k})[η_k]⟩_{x_k}| ≤ (1/2) L_H ∥η_k∥³_{x_k},

for all η_k ∈ T_{x_k}M such that ∥η_k∥_{x_k} ≤ Δ_k.
It should be noted that the retraction R_x needs to be defined only within the radius Δ_k. Since the manifold under consideration is compact, Assumption 2 holds [33, Lem. 9]. We also assume a bound on the norm of the inexact Riemannian Hessian H_k [33, A.6].
Assumption 3 (Norm bound on H_k). There exists K_H ≥ 0 such that, for all x_k, H_k satisfies

∥H_k∥_{x_k} := sup_{η∈T_{x_k}M, ∥η∥_{x_k}≤1} ⟨η, H_k[η]⟩_{x_k} ≤ K_H.

We now provide the essential assumptions on the approximation-error bounds of the inexact Riemannian gradient G_k and the inexact Riemannian Hessian H_k at iteration k. As seen later in Section 4, these ensure that the sub-sampling sizes can be fixed.
Assumption 4 (Approximation error bounds on inexact gradient and Hessian).
There exist constants 0 < δ_g, δ_H < 1 such that the approximation of the gradient, G_k, and the approximation of the Hessian, H_k, at iteration k satisfy

∥G_k − gradf(x_k)∥_{x_k} ≤ δ_g,    (3)
∥(H_k − ∇²f̂_k(0_{x_k}))[η_k]∥_{x_k} ≤ δ_H ∥η_k∥_{x_k}.    (4)

The latter is a weaker condition than the condition below [33, A7]:

∥H_k − ∇²f̂_k(0_{x_k})∥_{x_k} ≤ δ_H.

It should be emphasized that the approximation error bound for H_k is defined with the Hessian of the pullback of f at x_k, i.e., ∇²f̂_k(0_{x_k}), instead of the Riemannian Hessian of f, i.e., Hessf(x_k). Furthermore, it should be noted that Assumption 4 is a relaxed form of a typical condition in the Euclidean setting, which is defined as [43, AM.4]

∥(H_k − ∇²f̂_k(0_{x_k}))[η_k]∥_{x_k} ≤ δ_H ∥η_k∥²_{x_k}.    (5)

This typical form (5) is different from (4). It should be noted that condition (5) requires the sizes of the sub-sampled Hessian and gradient to increase as the iterates approach convergence, whereas our new condition (4) allows the sizes to be fixed, as seen later in Section 4 [37, 38].
Finally, we give an assumption on the step η_k. We need a sufficient decrease in m̂_k(η_k), and there exist ways to solve the sub-problem (see [41, 1] for more details). However, the calculation of the exact solution of the sub-problem is prohibitive, especially in large-scale problems. To this end, various approximate solvers have been investigated in the literature that require certain conditions to be met. The popular conditions are the Cauchy and Eigenpoint conditions [41]. The assumptions required for the convergence analysis of Algorithm 1, generalizing [37, Cond. 2], are provided below.
Assumption 5 (Sufficient descent relative to the Cauchy and Eigen directions [41, 37]).
We assume, for the first-order step, called the Cauchy step,

m̂_k(0_{x_k}) − m̂_k(η_k) ≥ m̂_k(0_{x_k}) − m̂_k(η_k^C) ≥ (1/2) ∥G_k∥_{x_k} min{∥G_k∥_{x_k}/(1 + ∥H_k∥), Δ_k}.

We assume, for the second-order step, called the Eigen step, for some ν ∈ (0, 1] when λ_min(H_k) < −ϵ_H,

m̂_k(0_{x_k}) − m̂_k(η_k) ≥ m̂_k(0_{x_k}) − m̂_k(η_k^E) ≥ (1/2) ν |λ_min(H_k)| Δ_k².

Here, η_k^C is the negative gradient direction and η_k^E is an approximation of the negative curvature direction such that ⟨η_k^E, H_k[η_k^E]⟩_{x_k} ≤ ν λ_min(H_k) ∥η_k^E∥²_{x_k} < 0. Assumption 5 is ensured by using TR sub-problem solvers, e.g., the Steihaug-Toint truncated conjugate gradient algorithm [44].

Algorithm 1 Inexact Riemannian trust-region (Inexact RTR) algorithm
Require: 0 < Δ_max < ∞, ϵ_g, ϵ_H ∈ (0, 1), ρ_TH, γ > 1.
1: Initialize 0 < Δ_0 < Δ_max, and a starting point x_0 ∈ M.
2: for k = 1, 2, . . . do
3:   Set the approximate (inexact) gradient G_k and Hessian H_k.
4:   if ∥G_k∥ ≤ ϵ_g and λ_min(H_k) ≥ −ϵ_H then return x_k. end if
5:   if ∥G_k∥ ≤ ϵ_g then G_k = 0. end if
6:   Calculate η_k ∈ T_{x_k}M by solving η_k ≈ arg min_{∥η∥≤Δ_k} f(x_k) + ⟨G_k, η⟩_{x_k} + (1/2)⟨η, H_k[η]⟩_{x_k}.
7:   Set ρ_k = (f̂_k(0_{x_k}) − f̂_k(η_k)) / (m̂_k(0_{x_k}) − m̂_k(η_k)).
8:   if ρ_k ≥ ρ_TH then x_{k+1} = R_{x_k}(η_k) and Δ_{k+1} = γΔ_k.
9:   else x_{k+1} = x_k and Δ_{k+1} = Δ_k/γ. end if
10: end for
11: Output x_k.

3 Riemannian trust-regions with inexact Hessian and gradient

This section proposes an inexact variant of the Riemannian trust-region algorithm, i.e., inexact RTR, which approximates the gradient and the Hessian as well as the solution of a sub-problem. The proposed algorithm is summarized in Algorithm 1. The inexact RTR algorithm approximately solves a sub-problem with m̂_k(η) : T_{x_k}M → R for η ∈ T_{x_k}M of the form

η_k ≈ arg min_{η∈T_{x_k}M} m̂_k(η) subject to ∥η∥_{x_k} ≤ Δ_k,    (6)

where m̂_k(η) is notably defined as

m̂_k(η) = f(x_k) + ⟨G_k, η⟩_{x_k} + (1/2)⟨η, H_k[η]⟩_{x_k},  if ∥G_k∥_{x_k} ≥ ϵ_g,    (7a)
m̂_k(η) = f(x_k) + (1/2)⟨η, H_k[η]⟩_{x_k},  otherwise.    (7b)

It should be stressed that, as (7b) shows, we ignore the gradient when it is smaller than ϵ_g, i.e., ∥G_k∥_{x_k} < ϵ_g, which is crucial for the convergence analysis in Theorem 3.1 [38].
Now, we present the convergence analysis of the proposed inexact RTR. To this end, we assume an additional approximation condition on the inexact gradient and Hessian for the constants in Assumption 4 [38, Cond. 1]. This additional assumption is essential for the relaxed form (4).
Assumption 6 (Gradient and Hessian approximations for Algorithm 1 [38]). Let ρ_TH be the threshold parameter for the reduction ratio between the true objective function and the approximate model in Algorithm 1.
For ν ∈ (0, 1] in Assumption 5, we assume that the constants of the inexact gradient and Hessian satisfy δ_g < ((1 − ρ_TH)/4) ϵ_g and δ_H < min{((1 − ρ_TH)/2) ν ϵ_H, 1}.
This implies that we only need δ_g ∈ O(ϵ_g) and δ_H ∈ O(ϵ_H) [38, Cond. 1].
Theorem 3.1 (Optimal complexity of Algorithm 1). Consider 0 < ϵ_g, ϵ_H < 1. Suppose Assumptions 1, 2, and 3 hold. Also, suppose that the inexact Hessian H_k and gradient G_k satisfy Assumption 4 with the approximation tolerances δ_g and δ_H. Suppose that the solution of the sub-problem (6) satisfies Assumption 5 and that Assumption 6 holds. Then, Algorithm 1 returns an (ϵ_g, ϵ_H)-optimal solution in, at most, T ∈ O(max{ϵ_g^{-2} ϵ_H^{-1}, ϵ_H^{-3}}) iterations.
The proof of Theorem 3.1 follows those of [37, 38, 33]. Therefore, we only provide a proof sketch in Section B.1 of the supplementary material file.

4 Sub-sampled Riemannian trust-regions for finite-sum problems

Particularly addressing large-scale finite-sum minimization problems, we propose an inexact gradient and Hessian trust-region algorithm, Sub-RTR, that exploits a sub-sampling technique to generate the inexact gradient and Hessian. The generated inexact gradient and Hessian satisfy Assumption 4 in a probabilistic way. More concretely, we derive sampling conditions based on probabilistic deviation bounds for random matrices, which originate from the Bernstein inequality in Lemma B.2 of the supplementary material file.
We first define the sub-sampled inexact gradient and Hessian as

G_k := (1/|S_g|) Σ_{i∈S_g} gradf_i(x_k) and H_k := (1/|S_H|) Σ_{i∈S_H} Hessf_i(x_k),

where S_g, S_H ⊂ {1, . . . , n} are the sets of sub-sampled indices for the estimates of the approximate gradient and Hessian, respectively. Their sizes, i.e., cardinalities, are denoted as |S_g| and |S_H|, respectively.
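As an illustration, the following NumPy sketch forms such sub-sampled estimates for a toy Euclidean finite sum (least squares); on a manifold, the sampled gradients and Hessians would additionally be the Riemannian ones (projected onto the tangent space). The function and the toy cost are ours, not the paper's:

```python
import numpy as np

def subsampled_grad_hessvec(A, b, x, v, sg, sh, rng):
    """Sub-sampled gradient G_k and Hessian-vector product H_k[v] for the toy
    finite sum f(x) = (1/n) sum_i 0.5 * (a_i^T x - b_i)^2, where gradf_i(x) =
    (a_i^T x - b_i) a_i and Hessf_i(x)[v] = (a_i^T v) a_i."""
    n = A.shape[0]
    Sg = rng.choice(n, size=sg, replace=False)   # uniform sampling for the gradient
    SH = rng.choice(n, size=sh, replace=False)   # uniform sampling for the Hessian
    Gk = A[Sg].T @ (A[Sg] @ x - b[Sg]) / sg      # (1/|S_g|) sum_{i in S_g} gradf_i(x)
    Hkv = A[SH].T @ (A[SH] @ v) / sh             # (1/|S_H|) sum_{i in S_H} Hessf_i(x)[v]
    return Gk, Hkv
```

With |S_g| = |S_H| = n the estimates coincide with the exact gradient and Hessian-vector product; smaller sets trade accuracy for the per-iteration cost discussed next.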
Next, we provide the sampling conditions. For simplicity, we use the standard Riemannian metric in the analysis. Equivalently, M is endowed with the smooth inner product ⟨·,·⟩_2 and the norm ∥·∥_2. We suppose that

sup_{x∈M} ∥gradf_i(x)∥_2 ≤ K^i_g and sup_{x∈M} ∥Hessf_i(x)∥_2 ≤ K^i_H, i = 1, 2, . . . , n,

and we also define K^max_g := max_i K^i_g and K^max_H := max_i K^i_H. As for the sufficient sub-sampling sizes to guarantee the convergence in Theorem 3.1, we have the following theorem.
Theorem 4.1 (Bounds on sampling size). Given K^i_g, K^i_H, K^max_g, K^max_H, and 0 < δ, δ_g, δ_H < 1, we set

|S_g| ≥ (32(K^max_g)² log(1/δ) + 1/4) / δ_g² and |S_H| ≥ (32(K^max_H)² log(1/δ) + 1/4) / δ_H².

At any x_k ∈ M, suppose that the sampling is done uniformly at random to generate S_g and S_H. Then, we have

Pr(∥G_k − gradf(x_k)∥_2 ≤ δ_g) ≥ 1 − δ,
Pr(∥(H_k − ∇²f̂_k(0_x))[η_k]∥_2 ≤ δ_H ∥η_k∥_2) ≥ 1 − δ.

From Theorem 4.1, it can easily be seen that Assumption 4 follows with the same probability with K_g = K^max_g and K_H = K^max_H. It should be emphasized that if we use the typical condition (5) instead of Assumption 4, we obtain, e.g., |S_H| ≥ (32(K^max_H)² log(1/δ) + 1/4) / (δ_H² ∥η_k∥²) for the sub-sampled Hessian H_k. Considering that ∥η_k∥ goes to nearly zero as the iterations proceed, this bound indicates that |S_H| increases accordingly. Consequently, the size of the sub-sampled Hessian needs to be increased towards the convergence.
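As a quick numerical illustration, the two bounds in Theorem 4.1 can be evaluated once for given constants; the constants used in the test are hypothetical, chosen only to exercise the formulas:

```python
import numpy as np

def sample_size_bounds(Kg_max, KH_max, delta, delta_g, delta_H):
    """Sufficient |S_g| and |S_H| from Theorem 4.1. Neither bound depends on
    the iterate x_k or on ||eta_k||, so both stay fixed over the whole run."""
    Sg = (32.0 * Kg_max**2 * np.log(1.0 / delta) + 0.25) / delta_g**2
    SH = (32.0 * KH_max**2 * np.log(1.0 / delta) + 0.25) / delta_H**2
    return int(np.ceil(Sg)), int(np.ceil(SH))
```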
On the other hand, our results ensure that the sample size can be fixed while still guaranteeing the convergence of Algorithm 1.

5 Numerical comparisons

This section evaluates the performance of our two proposed inexact RTR algorithms: the sub-sampled Hessian RTR (Sub-H-RTR) and the sub-sampled Hessian and gradient RTR (Sub-HG-RTR). We compare them with the Riemannian deterministic algorithms RSD, Riemannian conjugate gradient (RCG), RLBFGS, and RTR. We also show comparisons with RSVRG [20, 21]. We compare the algorithms in terms of the total number of oracle calls and run time, i.e., "wall-clock" time. The former measures the number of function, gradient, and Hessian-vector product computations. The sub-sampled RTR requires (n + |S_g| + r_s|S_H|) oracle calls per iteration, whereas the original RTR requires (2n + r_s n) oracle calls. Here, r_s is the number of iterations required for solving the trust-region sub-problem approximately. RSD, RCG, and RLBFGS require (n + r_l n) oracle calls per iteration, where r_l is the number of line searches carried out. RSVRG requires (n + mn) oracle calls per outer iteration, where m is the update frequency of the outer loop. Algorithms are initialized randomly and stopped when the gradient norm falls below a particular threshold. Multiple constant stepsizes from {10^{-10}, 10^{-9}, . . . , 1} are used for RSVRG and the best-tuned results are shown. Following [38], we set |S_g| = n/10 and |S_H| = n/10² except in Cases P5, P6, M4, and M5. We set the batch size to n/10 in RSVRG. All simulations are performed in MATLAB on a 4.0 GHz Intel Core i7 machine with 32 GB RAM.
We address the independent component analysis (ICA) problem on the Stiefel manifold and two problems on the Grassmann manifold, namely the principal component analysis (PCA) and the low-rank matrix completion (MC) problems.
Figure 1: Performance evaluations on the ICA problem. Panels: (a) Case I1, (b) Case I2, (c) Case I3; oracle calls and run time.

The Stiefel manifold is the set of orthogonal r-frames in R^d for some r ≤ d and is viewed as an embedded submanifold of R^{d×r} [1, Sec. 3.3]. On the other hand, the Grassmann manifold Gr(r, d) is the set of r-dimensional subspaces in R^d and is a Riemannian quotient manifold of the Stiefel manifold [1, Sec. 3.4]. The motivation behind including the latter two applications is to show that our proposed algorithms empirically work very well even if the manifold is not a submanifold. In all these problems, the full gradient methods, i.e., RSD, RCG, RLBFGS, and RTR, become prohibitively computationally expensive when n is very large, and the inexact approach is one promising way to achieve scalability.
The details of the manifolds and the derivations of the Riemannian gradient and Hessian are provided as supplementary material.

5.1 ICA problem

The ICA or blind source separation problem refers to separating a signal into components so that the components are as independent as possible [45]. A particular preprocessing step is the whitening step, which is posed as a joint diagonalization problem on the Stiefel manifold [13], i.e., min_{U∈R^{d×r}} −(1/n) Σ_{i=1}^n ∥diag(U^T C_i U)∥²_F, where ∥diag(A)∥²_F denotes the sum of the squared diagonal elements of A. The symmetric matrices C_i are of size d × d and can be cumulant matrices or time-lagged covariance matrices of different signal samples [13].
We use three real-world datasets: YaleB [46], COIL-100 [47], and CIFAR-100 [48]. From these datasets, we create a Gabor-based region covariance matrix (GRCM) descriptor [49, 50, 51]. A 43 × 43 GRCM is computed from the pixel coordinates and Gabor features that are obtained by convolving Gabor kernels with an intensity image. We set m = 1 in RSVRG. Figures 1(a), (b), and (c) show the results on the YaleB dataset with (n, d, r) = (2015, 43, 43) (Case I1), the COIL-100 dataset with (n, d, r) = (7.2 × 10³, 43, 43) (Case I2), and the CIFAR-100 dataset with (n, d, r) = (6 × 10⁴, 43, 43) (Case I3), respectively.
As seen, the proposed Sub-H-RTR and Sub-HG-RTR perform better than the others, except RSVRG, in terms of both the number of oracle calls and run time. It should be emphasized that, though RSVRG performs comparably to or slightly better than our proposed algorithms, its results require fine tuning of stepsizes.

5.2 PCA problem

Given an orthonormal projection matrix U ∈ St(r, d), the PCA problem is to minimize the sum of squared residual errors between the projected data points and the original data as min_{U∈St(r,d)} (1/n) Σ_{i=1}^n ∥z_i − UU^T z_i∥²_2, where z_i is a data vector of size d × 1. This problem is equivalent to min_{U∈St(r,d)} −(1/n) Σ_{i=1}^n z_i^T UU^T z_i. Here, the critical points in the space St(r, d) are not isolated because the cost function remains unchanged under the group action U ↦ UO for all orthogonal matrices O of size r × r. Consequently, the PCA problem is an optimization problem on the Grassmann manifold Gr(r, d).

Figure 2: Performance evaluations on the PCA problem. Panels: (a) Case P1 and (b) Case P2 (oracle calls and run time), (c) Case P3: MNIST dataset, (d) Case P4: Covertype dataset, (e) Case P5: sampling size insensitivity, (f) Case P6: sampling algorithms.

Figures 2(a) and (b) show the results on two synthetic datasets with (n, d, r) = (5 × 10⁶, 10², 5) (Case P1) and (n, d, r) = (5 × 10⁵, 10³, 5) (Case P2). We set m = 5 in RSVRG. It should be noted that, although RSVRG is competitive in terms of oracle calls in (a), its run-time performance is poorer than that of the others. This is attributed to RSVRG requiring retraction and vector transport operations at every iteration. Overall, the proposed Sub-H-RTR outperforms the others, whereas the proposed Sub-HG-RTR is inferior to the others. Figures 2(c) and (d) show the results on two real-world datasets with r = 10, where Case P3 deals with the MNIST dataset [52] with (n, d) = (6 × 10⁴, 784) and Case P4 deals with the Covertype dataset [53] with (n, d) = (581012, 54). From the figure, our proposed Sub-H-RTR outperforms the others.
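For concreteness, the PCA cost above and its Riemannian gradient on the Grassmann manifold (the projection of the Euclidean gradient onto the horizontal space at an orthonormal U) can be sketched in NumPy as follows; this is an illustrative re-implementation of ours, not the paper's Manopt code:

```python
import numpy as np

def pca_cost_rgrad(Z, U):
    """PCA cost f(U) = -(1/n) sum_i z_i^T U U^T z_i and its Riemannian gradient
    on the Grassmann manifold, for Z of shape (n, d) and orthonormal U (d, r)."""
    n = Z.shape[0]
    ZU = Z @ U                          # n x r projections
    cost = -np.sum(ZU * ZU) / n
    egrad = -2.0 * (Z.T @ ZU) / n       # Euclidean gradient, d x r
    rgrad = egrad - U @ (U.T @ egrad)   # remove the component along span(U)
    return cost, rgrad
```

At a U spanning a top-r eigenspace of the sample covariance, the Riemannian gradient vanishes, reflecting that such subspaces are the critical points of this cost.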
We also change the sample size in Sub-H-RTR as |S_H| ∈ {n/10, n/10², n/10³} in Case P1. From Figure 2(e) (Case P5), we observe that Sub-H-RTR has low sensitivity to the size |S_H|. Additionally, we compare three different ways to decide the sample sizes |S_H| and |S_g|: the (i) "fixed", (ii) "linear", and (iii) "adaptive" variants (Case P6). The "fixed" variant keeps the sizes at the initial |S_g| and |S_H|, as theoretically supported by Theorem 4.1. The "linear" variant uses k|S_g| and k|S_H| at iteration k. The "adaptive" variant decides the sizes based on (5) [39]. The results on the same synthetic dataset as Case P2 show that all the proposed algorithms except Sub-HG-RTR with a fixed sample size outperform the original RTR.

5.3 MC problem
The MC problem amounts to completing an incomplete matrix Z, say of size d × n, from a small number of entries by assuming a low-rank model for the matrix. If Ω is the set of indices for which we know the entries in Z, the rank-r MC problem amounts to solving min_{U∈R^{d×r}, A∈R^{r×n}} ‖P_Ω(UA) − P_Ω(Z)‖²_F, where the operator defined by P_Ω(Z)_{pq} = Z_{pq} if (p, q) ∈ Ω and P_Ω(Z)_{pq} = 0 otherwise is called the orthogonal sampling operator and is a mathematically convenient way to represent the subset of known entries. Partitioning Z = [z_1, z_2, ..., z_n], the problem is equivalent to min_{U∈R^{d×r}, a_i∈R^r} (1/n) Σ_{i=1}^{n} ‖P_{Ω_i}(U a_i) − P_{Ω_i}(z_i)‖²₂, where z_i ∈ R^d and the operator P_{Ω_i} is the sampling operator for the i-th column.

[Figure 3 here. Panels: (a-1) Case M1: Oracle calls; (a-2) Case M1: Run time; (b-1) Case M2: Oracle calls; (b-2) Case M2: Run time; (c-1) Case M3: Oracle calls; (c-2) Case M3: Run time; (d) Case M4: Sampling algorithms; (e) Case M5: Sampling algorithms. The y-axes show the mean square error on the test set; the curves are RSD, RCG, RLBFGS, RSVRG, RTRMC, RTR, Sub-H-RTR, and Sub-HG-RTR.]
Figure 3: Performance evaluations on the MC problem.

We also compared our proposed algorithms with RTRMC [10], a state-of-the-art MC algorithm. The code of RTRMC is optimized for the MC problem; therefore, for a fair comparison, we mainly compare against the oracle calls of RTRMC. We first consider a synthetic dataset with (n, d, r) = (10^5, 10^2, 5). We show the mean square error (MSE) on a test set, which is different from the training set. The over-sampling ratio (OS) is 4, where the OS determines the number of entries that are known: an OS of 4 implies that 4(n + d − r)r randomly and uniformly selected entries are known a priori out of the total nd entries. We also impose an exponential decay of the singular values; the ratio of the largest to the smallest singular value is known as the condition number (CN) of the matrix. We set m = 5 in RSVRG. We consider a well-conditioned case with CN = 5 (Case M1) and an ill-conditioned case with CN = 20 (Case M2). Figures 3(a) and (b) show the relatively good performance of RSVRG for Case M1. RTRMC is, as expected, extremely fast in terms of run time (owing to its optimized code).
Sub-H-RTR and Sub-HG-RTR show superior performance to the others, especially for the ill-conditioned Case M2. Next, we consider the Jester dataset [54], consisting of ratings of 100 jokes by 24983 users (Case M3). Each rating is a real number between −10 and 10. The algorithms are run with the rank fixed to r = 5. Figure 3(c) shows the comparable or superior performance of the sub-sampled RTR algorithms on the test sets against the state-of-the-art algorithms. Finally, we compare the three variants "fixed", "linear", and "adaptive" for deciding the sample size in Cases M4 and M5 under the same conditions as Cases M2 and M3, respectively. Figures 3(d) and (e) show that all the proposed algorithms outperform the original RTR. In particular, the "fixed" variant gives superior performance to the others, as supported by Theorem 4.1.

6 Conclusion
We have proposed an inexact trust-region algorithm in the Riemannian setting with a worst-case total complexity bound. Additionally, we have proposed sub-sampled trust-region algorithms for finite-sum problems, which need only fixed sample-size bounds for the sub-sampled gradient and Hessian. The numerical comparisons show the benefits of our proposed inexact RTR algorithms on a number of applications.

Acknowledgements

H. Kasai was partially supported by JSPS KAKENHI Grant Numbers JP16K00031 and JP17H01732. We thank Nicolas Boumal and Hiroyuki Sato for insightful discussions and also express our sincere appreciation to Jonas Moritz Kohler for sharing his expertise on sub-sampled algorithms in the Euclidean case.

References
[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
[2] L. Balzano, R. Nowak, and B. Recht. Online identification and tracking of subspaces from highly incomplete information. In Allerton, 2010.
[3] B. Mishra, H. Kasai, P. Jawanpuria, and A. Saroop.
A Riemannian gossip approach to subspace learning on Grassmann manifold. Machine Learning (to appear), 2019.
[4] B. Mishra and R. Sepulchre. R3MC: A Riemannian three-factor algorithm for low-rank matrix completion. In IEEE CDC, pages 1137–1142, 2014.
[5] H. Kasai and B. Mishra. Low-rank tensor completion: a Riemannian manifold preconditioning approach. In ICML, 2016.
[6] M. Nimishakavi, P. Jawanpuria, and B. Mishra. A dual framework for low-rank tensor completion. In NeurIPS, 2018.
[7] D. Kressner, M. Steinlechner, and B. Vandereycken. Low-rank tensor completion by Riemannian optimization. BIT Numer. Math., 54(2):447–468, 2014.
[8] B. Vandereycken. Low-rank matrix completion by Riemannian optimization. SIAM J. Optim., 23(2):1214–1236, 2013.
[9] C. Da Silva and F. J. Herrmann. Optimization on the hierarchical Tucker manifold: applications to tensor completion. Linear Algebra Its Appl., 481:131–173, 2015.
[10] N. Boumal and P.-A. Absil. Low-rank matrix completion via preconditioned optimization on the Grassmann manifold. Linear Algebra Its Appl., 475(15):200–239, 2015.
[11] G. Meyer, S. Bonnabel, and R. Sepulchre. Linear regression under fixed-rank constraints: a Riemannian approach. In ICML, 2011.
[12] U. Shalit, D. Weinshall, and G. Chechik. Online learning in the embedded manifold of low-rank matrices. J. Mach. Learn. Res., 13(Feb):429–458, 2012.
[13] F. J. Theis, T. P. Cason, and P.-A. Absil. Soft dimension reduction for ICA by joint diagonalization on the Stiefel manifold. In ICA, 2009.
[14] W. Huang, P.-A. Absil, and K. A. Gallivan. A Riemannian BFGS method for nonconvex optimization problems. In ENUMATH 2015. Springer, 2016.
[15] D. G. Luenberger. The gradient projection method along geodesics. Manag. Sci., 18(11):620–631, 1972.
[16] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Trans.
on Automatic Control, 58(9):2217–2229, 2013.
[17] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, pages 400–407, 1951.
[18] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Rev., 60(2):223–311, 2018.
[19] H. Kasai. SGDLibrary: A MATLAB library for stochastic optimization algorithms. JMLR, 18(215):1–5, 2018.
[20] H. Sato, H. Kasai, and B. Mishra. Riemannian stochastic variance reduced gradient. arXiv preprint arXiv:1702.05594, 2017.
[21] H. Zhang, S. J. Reddi, and S. Sra. Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In NIPS, 2016.
[22] H. Kasai, H. Sato, and B. Mishra. Riemannian stochastic recursive gradient algorithm. In ICML, 2018.
[23] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.
[24] N. L. Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, 2012.
[25] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 14:567–599, 2013.
[26] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, 2014.
[27] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In ICML, 2016.
[28] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takac. SARAH: a novel method for machine learning problems using stochastic recursive gradient. In ICML, 2017.
[29] W. H. Yang, L.-H. Zhang, and R. Song. Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pac. J. Optim., 10(2):415–434, 2014.
[30] W. Huang, K. A. Gallivan, and P.-A. Absil.
A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM J. Optim., 25(3):1660–1685, 2015.
[31] D. Gabay. Minimizing a differentiable function over a differential manifold. J. Optim. Theory Appl., 37(2):177–219, 1982.
[32] W. Ring and B. Wirth. Optimization methods on Riemannian manifolds and their application to shape space. SIAM J. Optim., 22(2):596–627, 2012.
[33] N. Boumal, P.-A. Absil, and C. Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA J. Numer. Anal., 2018.
[34] H. Kasai, H. Sato, and B. Mishra. Riemannian stochastic quasi-Newton algorithm with variance reduction and its convergence analysis. In AISTATS, 2018.
[35] R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optim., 21(3):977–995, 2011.
[36] M. A. Erdogdu and A. Montanari. Convergence rates of sub-sampled Newton methods. In NIPS, 2015.
[37] P. Xu, F. Roosta-Khorasani, and M. W. Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. arXiv preprint arXiv:1708.07164, 2017.
[38] Z. Yao, P. Xu, F. Roosta-Khorasani, and M. W. Mahoney. Inexact non-convex Newton-type methods. arXiv preprint arXiv:1802.06925, 2018.
[39] J. M. Kohler and A. Lucchi. Sub-sampled cubic regularization for non-convex optimization. In ICML, 2017.
[40] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res., 15(1):1455–1459, 2014.
[41] A. R. Conn, N. I. M. Gould, and P. L. Toint. Trust Region Methods. MOS-SIAM Series on Optimization. SIAM, 2000.
[42] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, USA, 2006.
[43] C. Cartis, N. I. M. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization.
Part I: motivation, convergence and numerical results. Math. Program., 127(2):245–295, 2011.
[44] P. L. Toint. Towards an efficient sparsity exploiting Newton method for minimization. In Sparse Matrices and Their Uses, 1981.
[45] A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
[46] The extended Yale Face Database B. http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html.
[47] Columbia University Image Library (COIL-100). http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php.
[48] The CIFAR-100 dataset. http://www.cs.toronto.edu/~kriz/cifar.html.
[49] F. Porikli and O. Tuzel. Fast construction of covariance matrices for arbitrary size image windows. In ICIP, 2006.
[50] O. Tuzel, F. Porikli, and P. Meer. Region covariance: a fast descriptor for detection and classification. In ECCV, 2006.
[51] Y. Pang, Y. Yuan, and X. Li. Gabor-based region covariance matrices for face recognition. IEEE Trans. Circuits Syst. Video Technol., 18(7):989–993, 2008.
[52] The MNIST database. http://yann.lecun.com/exdb/mnist/.
[53] Covertype dataset. https://archive.ics.uci.edu/ml/datasets/covertype.
[54] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: a constant time collaborative filtering algorithm. Inform. Retrieval, 4(2):133–151, 2001.
[55] D. Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. on Inf. Theory, 57(3):1548–1566, 2011.
[56] R. Kueng and D. Gross. RIPless compressed sensing from anisotropic measurements. Linear Algebra and its Applications, 441:110–123, 2014.