NeurIPS 2019
Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
Reviewer 1
- This paper considers the online forecasting problem for non-convex machine learning methods. It first introduces a new concept for measuring the regret of an online learning algorithm, then proposes a method with sublinear regret under the new definition. The experiments are not very comprehensive, but sufficient for a theoretically inclined paper.
- The problem studied by this paper is important and interesting: non-convex machine learning with gradient-based optimization. The proposed concept of dynamic regret is simple and intuitive, and the theorems and results are clearly stated and proved.
- The proposed technique of using exponential weights is sound, but seems rather straightforward and incremental compared with Hazan et al.
Reviewer 2
This paper introduces a local regret for non-convex models in a dynamic environment. The authors present an update rule that incurs a cost sublinear in the time horizon T, evaluate its performance on a real-world dataset, and show improved performance over baselines. I am outside this area and cannot comment on the significance of the theoretical results; the experiments look reasonable to me.
Reviewer 3
This paper considers online forecasting problems with non-convex models. One of the main challenges in forecasting is concept drift, which refers to changes in the underlying relationship between inputs and outputs over time. Classic online learning algorithms have performance bounds in terms of regret, which compares the algorithm's performance to the best fixed action in hindsight. While regret makes sense when losses are convex, it is no longer appealing when losses are non-convex. Hazan et al. 2017 introduced a notion of local regret for online non-convex problems, as well as algorithms that achieve sublinear local regret. The local regret is a sum over time steps, where each term is the squared norm of the gradient of the average of the most recent loss functions.

The main contribution of this paper is to propose a new form of local regret, called dynamic local regret, and to give algorithms that achieve low dynamic local regret. Dynamic local regret is similar to local regret in that it includes a sum of norms of gradients of a time-average. However, the time-average in dynamic local regret is weighted with a decay parameter, and the functions within the time-average are evaluated at previous iterates. The proposed algorithm for minimizing this regret is an exponentially time-smoothed SGD, and the authors prove a bound on the dynamic local regret it suffers. The authors also give experimental results showing that their algorithm does well compared to the one from Hazan et al. 2017.

On the positive side, this paper presents a new notion of regret for non-convex forecasting problems with concept drift. The authors provide analysis and theoretical bounds for their algorithm in terms of this new regret. The paper also gives a number of experimental results showing that their algorithm is more stable and generally more computationally efficient than alternatives.
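To make the update concrete, here is a minimal sketch of an exponentially time-smoothed SGD step as described in the review: the update direction is a normalized, exponentially decayed average of recent gradients, each evaluated at the iterate that was current when that loss arrived. The function name, the window/decay defaults, and the quadratic losses in the usage below are this reviewer's illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pts_sgd(grad_fns, x0, lr=0.1, window=10, alpha=0.9):
    """Sketch of exponentially time-smoothed SGD (illustrative, not the paper's code).

    At step t, the update direction averages the gradients of the last
    `window` losses with weights alpha**i (i = 0 is the most recent loss),
    each gradient evaluated at the iterate from that loss's round.
    """
    x = np.asarray(x0, dtype=float)
    past = []  # (gradient_fn, iterate) pairs for recent rounds, oldest first
    for grad in grad_fns:
        past.append((grad, x.copy()))
        if len(past) > window:
            past.pop(0)
        # exponentially weighted average of past gradients at past iterates
        weights = [alpha**i for i in range(len(past))]  # i = 0: most recent
        W = sum(weights)
        g = sum(w * g_fn(xi) for w, (g_fn, xi) in zip(weights, reversed(past))) / W
        x = x - lr * g
    return x
```

Note that the unnormalized weighted sum of past gradients at past iterates is exactly the momentum buffer of SGD with momentum, which is the correspondence (as the decay factor approaches 1) that the review mentions below.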
On the negative side, the overall motivation and intuition behind the new definition of dynamic local regret are not that compelling. Without a formal model, or even a toy scenario, for concept drift, it is not clear what theoretical basis there is to prefer this notion of regret over other notions, beyond some vague heuristics. Empirically, while the authors' algorithm (PTS-SGD) seems more stable and faster than the algorithm of Hazan et al. 2017, plain SGD seems to perform roughly as well as PTS-SGD, albeit with less stability. However, as the authors note, previous work has shown that PTS-SGD coincides with SGD with momentum as the decay factor approaches 1, so a better empirical comparison might be against SGD with momentum.

Overall, this paper is a reject. The new notion of regret is interesting, but needs more theoretical justification for why one would use it instead of more established notions. The algorithm given is very similar to previous algorithms, and while the experiments show an interesting stability property of PTS-SGD, they do not give further evidence that dynamic local regret is important. The contributions seem like small modifications of previous results, and the benefits seem relatively unclear. The paper is reasonably clear, but could use more motivation for and explanation of the new notion of regret.

======== Post-rebuttal ========

I have read the other reviews and the rebuttal. I appreciated the toy example, the theoretical motivation via calibration, and the comments on SGD with momentum. I am happy to accept the paper provided that the authors include the content from the rebuttal in the final paper.