NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center
Paper ID: 3075 Parameter-Free Online Learning via Model Selection

### Reviewer 1

Online learning has recently attracted much attention due to its connections with stochastic gradient descent in deep learning. Oracle inequalities form an important part of online learning theory. This paper proposes an online algorithm, MultiScaleFTPL, parameterized by a scale c_i for the regret to each expert i, and presents an oracle inequality for its regret. An essential term in the bound is \sqrt{n \log(n c_i/\pi_i)}, which imposes a tight complexity requirement on the hypothesis set and stems from the linear design of p_t in Algorithm 3. This special design restricts the application of the main oracle inequality in supervised learning to Lipschitz losses, as in Theorems 8 and 9. Nevertheless, the results in the paper are interesting in general.
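The multi-scale idea the review refers to can be illustrated with a minimal follow-the-perturbed-leader sketch in which each expert's cumulative loss is perturbed at its own scale c_i. This is only a generic illustration of the multi-scale principle, not the paper's Algorithm 3 (which selects the distribution p_t by solving a linear min-max problem); the function name, the Gaussian perturbation, and its magnitude are assumptions chosen for illustration.

```python
import math
import random

def multiscale_ftpl_sketch(loss_matrix, scales, seed=0):
    """Generic follow-the-perturbed-leader with per-expert perturbation
    scales -- an illustration of the multi-scale idea only, NOT the
    paper's Algorithm 3.

    loss_matrix[t][i]: loss of expert i at round t, assumed |loss| <= c_i.
    scales[i]: the scale c_i associated with expert i.
    Returns (algorithm_total_loss, per_expert_cumulative_losses).
    """
    rng = random.Random(seed)
    n = len(loss_matrix)
    k = len(scales)
    cum = [0.0] * k   # cumulative loss of each expert so far
    total = 0.0       # cumulative loss of the algorithm
    for t in range(n):
        # Perturb each expert's cumulative loss at its own scale c_i;
        # the sqrt(n) magnitude mirrors the c_i * sqrt(n log(...)) scale
        # appearing in the oracle inequality.
        perturbed = [cum[i] - scales[i] * abs(rng.gauss(0.0, math.sqrt(n)))
                     for i in range(k)]
        i_t = min(range(k), key=lambda i: perturbed[i])  # follow the leader
        total += loss_matrix[t][i_t]
        for i in range(k):
            cum[i] += loss_matrix[t][i]
    return total, cum
```

On a toy sequence where one expert is always better, the perturbed argmin locks onto the good expert after a handful of rounds, so the algorithm's total loss stays far below that of the bad expert.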

### Reviewer 2

SUMMARY

While I am not heavily familiar with the literature on adaptive online learning, this paper seems to be a breakthrough, offering in the form of Theorem 1 a very strong result that can be leveraged to obtain adaptive (in the model-complexity sense) online learning bounds in a number of settings. The efficiency, at least in the polynomial-time sense, of the algorithms for the various settings makes these results all the more interesting. I was very surprised by the "aside" on the 1-mixability of logistic loss and the argument for circumventing the lower bound of Hazan, Koren, and Levy in the supervised learning setting. I wish the authors could give more detail on this observation and its consequences, as the implications are so interesting that I would be (almost) sold on acceptance from this fact alone. I found the results of this paper to be very interesting, technically strong, and important, so I would strongly recommend acceptance.

DETAILED REVIEW

This paper is very well written and clearly identifies the core technical contribution. I do not really have any criticisms of the paper. I found Theorem 1 to be a very powerful result, as it readily gives rise to a number of corollaries by appropriately using the MultiScaleFTPL algorithm. I checked part of the proof of Theorem 1 (which is to say I did not look at the proof of Lemma 1), and I think the result is correct. The proof of Theorem 2 also appears to be correct, and this result may be thought of as the main bound that is specialized in the different applications. The ease of the analysis after starting from Theorem 2 is impressive.

Regarding the various applications: these all either seem interesting or provide useful examples of how to apply Theorem 2. However, for some of the results (such as Theorem 3), even if the previous best results involved inefficient algorithms, it would be good to establish the proper context by providing citations to those results.
In the Discussion section, where you discuss "losses with curvature": do you believe the $\log(n)$ factor arising from the $1/n$-cover to be necessary, or is it just an artifact of using a cover?

MINOR COMMENTS

Line 154: is the norm $\|\cdot\|_i$ meant to be an $\ell_p$ norm for $p = i$? I suspect not; this is too much abuse of notation. I would put parentheses around the subscript, i.e. $\|\cdot\|_{(i)}$, to distinguish it from an $\ell_p$ norm.

Line 193: "convexity chosen" --> "convexity was chosen".

Line 194: "incorporate generalize" doesn't make any sense; please fix this.