Export Reviews, Discussions, Author Feedback and Meta-Reviews

Paper ID:	1436
Title:	Path-SGD: Path-Normalized Optimization in Deep Neural Networks

Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

Deep rectified neural networks are over-parameterized in the sense that scaling the weights in one layer can be compensated for exactly in the subsequent layer. This paper introduces Path-SGD, a simple modification to the SGD update rule whose update is invariant to such rescaling. The method is derived from the proximal form of gradient descent, whereby a constraint term is added which preserves the norm of the "product weight" formed along each path in the network (from input to output node). Path-SGD is thus principled and is shown to yield faster convergence for a standard two-layer rectifier network across a variety of datasets (MNIST, CIFAR-10, CIFAR-100, SVHN). As the method implicitly regularizes the network weights, this also translates to better generalization performance on half of the datasets.
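To make the rescaling property concrete, here is a minimal sketch (my own illustration, not code from the paper) of a small two-layer ReLU network with plain Python lists. The helper names `forward` and `rescale_hidden_unit`, the layer sizes, and the choice c = 10 are all assumptions made for the example. It relies only on the fact that relu(c z) = c relu(z) for c > 0.

```python
import random

def relu(z):
    return z if z > 0.0 else 0.0

def forward(W1, W2, x):
    """Two-layer ReLU network y = W2 * relu(W1 * x), using lists of rows."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

def rescale_hidden_unit(W1, W2, unit, c):
    """Scale weights into hidden `unit` by c and weights out of it by 1/c.

    Hypothetical helper; only valid for c > 0, since ReLU is positively
    homogeneous but not invariant under negative scaling.
    """
    assert c > 0
    W1_new = [[w * c for w in row] if i == unit else list(row)
              for i, row in enumerate(W1)]
    W2_new = [[w / c if j == unit else w for j, w in enumerate(row)]
              for row in W2]
    return W1_new, W2_new

rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 4 hidden units, 3 inputs
W2 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # 2 outputs
x = [rng.uniform(-1, 1) for _ in range(3)]

y = forward(W1, W2, x)
W1s, W2s = rescale_hidden_unit(W1, W2, unit=1, c=10.0)
y_scaled = forward(W1s, W2s, x)

# The two parameterizations differ, but the outputs agree up to rounding.
max_diff = max(abs(a - b) for a, b in zip(y, y_scaled))
print(max_diff)
```

The weights change while the function computed by the network does not, which is the redundancy that an update rule such as Path-SGD is meant to be insensitive to.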

As an algorithm, Path-SGD appears effective, is simple to implement, and addresses an obvious flaw in first-order updates to ReLU networks. As a paper, however, it could be improved, especially with respect to notation, which sometimes obfuscates simple concepts. Certain sections of the paper seem rushed and would require a careful rewrite. See the detailed feedback below.

There are also a number of glaring omissions of prior work. At its core, Path-SGD belongs to the family of learning algorithms which aim to be invariant to model reparametrizations. This is the central tenet of Amari's natural gradient (NG), whose importance has resurfaced in the area of deep learning (see, e.g., [R1-R4]). Path-SGD can thus be cast as an approximation to NG which focuses on a particular type of rescaling between neighboring layers. The paper would greatly benefit from such a discussion, in my opinion. I also believe NG to be a much more direct way to motivate Path-SGD than the heuristics of max-norm regularization.

The experimental section could also benefit from a few extra experiments. Can the authors validate experimentally that $\pi(w)$ is more stable during optimization? It is also regrettable that the authors chose a fixed model architecture for the experiments. One would expect the advantage of Path-SGD to be even more pronounced for deeper networks, which have longer paths from input to output. Such experiments would be much more informative than the dropout experiments of Figure 3, which do not add much to the narrative.
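As an illustration of the kind of quantity that could be logged for such a stability experiment, here is a minimal sketch. It assumes that $\pi(w)$ denotes the vector of per-path weight products and that its squared L2 norm is the quantity of interest; the function names and the toy weights are hypothetical and not taken from the paper. For a layered network the sum over all paths can be accumulated layer by layer rather than by enumerating paths.

```python
def squared_path_norm(layers):
    """Sum over all input-to-output paths of the product of squared weights.

    `layers` is a list of weight matrices (lists of rows), first layer first.
    gamma[v] accumulates the total squared path weight reaching node v,
    so no explicit path enumeration is needed.
    """
    n_inputs = len(layers[0][0])
    gamma = [1.0] * n_inputs
    for W in layers:
        gamma = [sum((w * w) * g for w, g in zip(row, gamma)) for row in W]
    return sum(gamma)

def squared_l2_norm(layers):
    """Ordinary squared Euclidean norm of all weights, for comparison."""
    return sum(w * w for W in layers for row in W for w in row)

# Toy two-layer network: 2 inputs, 2 hidden units, 1 output.
W1 = [[0.5, -1.0], [2.0, 0.25]]
W2 = [[1.0, -0.5]]

# Rescale hidden unit 0: incoming weights times c, outgoing weight divided by c.
c = 4.0
W1s = [[w * c for w in W1[0]], list(W1[1])]
W2s = [[W2[0][0] / c, W2[0][1]]]

path_before = squared_path_norm([W1, W2])
path_after = squared_path_norm([W1s, W2s])
l2_before = squared_l2_norm([W1, W2])
l2_after = squared_l2_norm([W1s, W2s])

print(path_before, path_after)  # unchanged by the rescaling
print(l2_before, l2_after)      # changes with the rescaling
```

Logging such a value once per epoch for both SGD and Path-SGD would give a direct view of how the path-based quantity evolves, whereas the plain weight norm is confounded by rescaling.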

Detailed Feedback:
Theorems should be self-contained and mathematically precise. Path-SGD is not invariant to arbitrary rescaling, contrary to the claim of Theorem 4.1, but only to the rescaling function $\rho$ for $c > 0$. Please correct accordingly.
Eq. 5 (and most others): missing an outer summation over $i$ and $j$.
Eq. 5: the notation $v_{in}[i] \rightarrow \dots \rightarrow v_{out}[j]$ found throughout the paper is quite cumbersome. One could instead use $\mathcal{P}$ to denote the set of all possible paths in the network, and then, for each path $p \in \mathcal{P}$, sum over edges $e_k \in p$. The same applies to line 198 and the other equations involving paths.
line 212: missing $\frac{1}{2}$ term.
line 212: missing absolute values before exponentiating by $p$.
line 216: "hard to calculate". Please be more precise: is it intractable, ill-defined, or expensive to compute?
line 216: "Instead, we will update each coordinate" -> "perform coordinate descent". The original wording was confusing to me, as SGD also updates each parameter "independently", in the sense that it does not exploit curvature / covariance information.
Eq. 7 vs. line 224: inconsistent notation between $e'$ and $e_k$.
line 227: as in the previous comments, it is clumsy for $v_{in}[i] \rightarrow \dots e \dots \rightarrow v_{out}[j]$ to be implicitly defined over all $i$ and $j$.

[R1] Enhanced Gradient for Training Restricted Boltzmann Machines. KyungHyun Cho, Tapani Raiko, Alexander Ilin. ICML 2011.
[R2] Deep Learning Made Easier by Linear Transformations in Perceptrons. Tapani Raiko, Harri Valpola, Yann LeCun. AISTATS 2012.
[R3] Optimizing Neural Networks with Kronecker-factored Approximate Curvature. James Martens, Roger Grosse. ICML 2015.
[R4] Scaling up Natural Gradient by Sparsely Factorizing the Inverse Fisher Matrix. Roger Grosse, Ruslan Salakhutdinov. ICML 2015.