NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 6438 On Markov Chain Gradient Descent

### Reviewer 1

There are great many analyses of stochastic gradient descent. The present paper studies sampling w.r.t. to a Markov chain and parametrising the rate of convergence with a proxy for the mixing time, which has been studied previously . Clearly, one could replace the whole SGD e.g., with the famous hit-and-run Markov chain (at least for convex bodies), and some balance between the complexity of sampling the MC and the iteration complexity of the SGD using it would be of interest. The constants in the "O(n^4)" mixing-time result of Lovasz  are exorbitant, though. (It would seem like a shame to consider trajectories of lengths in the millions so as to decide on how to sample the gradient.) The contribution of the authors is to -- study the use of all samples in the trajectory, rather than only the first sample beyond the mixing time (upper bound). -- consider non-reversible (but still time-homogeneous, irreducible, and aperiodic) MC. Although I like the the general idea, there are three issues in the proof of the main theorem. In Eq. 53, line 382, there is a typo. Instead of ln(k / 2 C_P H), there should be ln(2 C_P H k). In Eq. 61, line 299, there is a typo. Instead of 1/2k, there should be M/2k. Much more importantly, the inequality in Eq. 54, line 384, holds only for sufficiently large k. In particular, (54) holds only if k >= K_P, which is the assumption in Lemma 1. Since the authors write "With Lemma 1 and...", this may be a language issue and they actual mean that they make the same assumption. But in (61), (62), (63), the authors do not assume k >= K_P any longer, while they use (54). Consequently, step d) in Eq. 61, line 399, does not work and (unless they sink the whole lot prior to mixing time into C_4) the whole proof fails. The authors should explain in the rebuttal, whether the proof is wrong, the C_4 is exorbitant, or what is the story. Minor comments: 1. There is no clear motivating application "carried through". Initially, the authors mention the counting of discrete solutions. One could imagine many other applications, but the actual experimental evaluation is performed on a trivial example. 2. Also, The definitions are not paritcularly clear or standard. The distance between probability distributions considered in the mixing time, where TV, Lp, and and various divergencies and entropies are used (c.f. 2.9-2.12 in ), is l\infty, which is not a common choice. 3. Further, the definition is obfuscated with some eigengap condition base don Fill's , which for MC of real interest is very likely not computable. Perhaps a parametrisation by the mixing time would do just as well? 4. Computationally, the paper is not convincing. The figures presented look ok, but when one looks at the number of parameters "in the first test problem", it is clear that there well may be many variants, for which the MC would not work as well. 5. Generally, the paper is poorly written. Just look at the introduction, there are numerous typos: "We aim to solving" -> "we aim to solve" "direct sampling from \Pi is expansive" -> "is expensive" "cardinal of" -> "cardinality of" "This is a huge save to the number of samples." -> "This presents huge savings in the number"  https://link.springer.com/article/10.1007/s101070050099  https://people.cs.uchicago.edu/~laci/students/ganapathy.pdf  https://www.researchgate.net/profile/Mikael_Johansson6/publication/220133540_A_Randomized_Incremental_Subgradient_Method_for_Distributed_Optimization_in_Networked_Systems/links/0046352543f47cc8de000000.pdf  https://projecteuclid.org/euclid.aoap/1177005981 POST REBUTTAL: I do think that the edit to the proof suggested by the authors could work, but would lead to some exorbitant constant C4, a subject not addressed by the authors. Still, I have increased my score from "clear reject" to "accept" in the light of the fact that I am now happy with the validity of the proofs.