Paper ID: 378
Title: Dirty Statistical Models
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This is a well-written paper that extends current work on parameter estimation in "superposition-structured" problems. This includes sparse + group-sparse structure, sparse + low-rank structure, etc. The paper establishes error bounds for regularized M-estimators where the loss function is convex and the penalty is a convex combination of regularizers, each promoting a specific type of structure on the corresponding component of the model parameter.
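For concreteness, the estimator considered takes roughly the following form (my notation, not verbatim from the paper):

  \hat{\theta} \in \arg\min_{\theta = \sum_\alpha \theta_\alpha} \mathcal{L}\big(\textstyle\sum_\alpha \theta_\alpha; Z_1^n\big) + \sum_\alpha \lambda_\alpha \mathcal{R}_\alpha(\theta_\alpha),

where \mathcal{L} is the convex loss and each \mathcal{R}_\alpha is a decomposable norm promoting one structure (e.g., the \ell_1 norm for sparsity, the nuclear norm for low rank).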

The most relevant prior work to the results of this paper appears to be Agarwal et al. '12; the paper compares the statements of Theorem 1 and Corollary 4 with the corresponding results in that paper, and claims that the bounds are tighter (converging to 0 as the number of samples n increases) and that the imposed conditions are weaker. From the point of view of a non-expert, it would be nice to explain these differences more carefully. For instance, Agarwal et al. prove matching lower bounds, so the improvements must arise from the difference in the class of models considered.

Otherwise, the theoretical conclusions of the paper seem sound and proofs are clearly written (I only read through a subset of them). Other comments:

(1) You might want to mention that condition (C3) is only required to hold locally around \theta^*, and not globally over all pairs (\theta_1, \theta_2).

(2) It's a bit confusing to figure out what exactly \Omega_\alpha refers to. In (C3) and (C4), this is called a "restricted set." Shouldn't the statement of Theorem 1 have some dependence on what restricted sets \Omega_\alpha are involved in (C3) and (C4)? For the sake of the corollaries in the paper, it seems \Omega_\alpha = \Theta, since all models considered are actually linear with an \ell_2 loss.

(3) Following the previous point, it would be interesting to say something about cases when (C3) and (C4) might not hold over the entire space \Theta, and one would need to work a bit harder to construct the appropriate analog of Theorem 1 over restricted sets. It would also be interesting to have some concrete examples with more general loss functions.

(4) The use of "local"/"global" terminology to refer to RSC conditions (lines 421 and 555) is confusing. Here, the authors seem to use "local" RSC to refer to deviations in one component only, whereas "global" refers to arbitrary deviations from \theta^*. However, local/global RSC would more naturally be used to describe RSC conditions which hold over a restricted radius vs. the entire space.

(5) Although the contribution of this paper is entirely theoretical, it would be nice to have some more motivating examples for why such a general, unified framework might be useful in practice (and what the impact of this paper is beyond the special cases already considered by previous authors).
Q2: Please summarize your review in 1-2 sentences
This is a solid theoretical paper that is clear and well-written. It could be improved in its comparison to past work and arguments for the relevance of the theoretical contribution.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a unified framework for the high-dimensional analysis of dirty statistical models. The theoretical analysis is sound and covers many statistical models as special cases, and it provides a guideline for analyzing hybrid regularized learning problems. The paper analyzes the equivalence between the “dirty” (superposition-structured) regularization and the “clean” regularization. Furthermore, the paper extends the analysis of M-estimators with a single regularizer in [11] to the setting of multiple regularizers, where each regularizer captures a specific structure; the theoretical analysis in this paper is thus more general. The paper is well organized and well written.

My only concern is whether the unified theoretical analysis framework is consistent with analyses of specific settings. The dirty multi-task feature learning model [8] comes with very strong theoretical analysis. How do the results of this paper compare with those of [8] when the loss function and regularizers take those specific forms?
Q2: Please summarize your review in 1-2 sentences
This paper provides a unified framework for dirty (superposition-structured) statistical models. The theoretical analysis is solid.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper provides bounds for M-estimators that minimize a loss function with superposition-structured regularizers. This type of statistical model features a parameter vector which is a superposition of several vectors, each having a specific structure enforced by its own regularizer. Examples of such models have been the subject of very recent results (see in particular the referenced papers from Chandrasekaran et al.). The present work provides bounds on the estimation of the parameter vector by extending the framework of decomposable regularizers proposed in (Negahban et al. 2012). While the general result is provided in Theorem 1, the resulting bound involves several unknown constants which are specific to the statistical model. The remainder of the paper studies several examples of such models and derives explicit bounds for each of them. Similarly to (Negahban et al. 2012), the resulting bounds emphasize the good behavior of structured regularizers for high-dimensional problems when the target vector lies in an appropriate structured subspace.
Pros:
The type of statistical models ("dirty" models) for which bounds are derived has recently triggered considerable interest in the statistical community, elegantly providing convex relaxations to challenging problems. These problems moreover touch on areas directly relevant to machine learning (graphical models, regression, ...) and have clear potential applications.
The paper provides important insights into the factors determining the performance of the superposition-structured M-estimator. In particular, the concept of Structural Incoherence is introduced (assumption C4) to account for the interactions between the regularizers associated with each structure.
As stated in the paper, the provided bounds seem to represent a considerable improvement over the state of the art and are more general.
Finally, the authors convincingly illustrate the application of their general methodology on three interesting examples of sparse statistical models.
Cons:
Although the authors claim their result is very general, the assumptions on the model seem in practice quite constraining: in particular, the family of appropriate penalties, which must be both norms and decomposable (as a sum of penalties over orthogonal subspaces), seems restricted to immediate generalizations of the l1-norm. It would be useful to discuss the possibility and consequences of including additional regularizers such as the squared Euclidean norm (for instance: can this framework deal with elastic-net penalties?).
Another issue is the applicability of the bounds to practical cases. As far as I understand, the assumptions required to prove these bounds are numerous and involve several unknown parameters. In particular, in the first two presented examples (Proposition 2, Corollary 2, Proposition 3), the conditions under which the assumptions hold are not established; notably, all results assume the C-linearity condition (7) to hold without further justification. Corollary 4 assumes a (simpler) C-linearity condition (12) as well. Moreover, in all examples, the bounds depend on \bar{\kappa} or directly on the curvature parameter \kappa_{\mathcal{L}}, for which no range of acceptable values is provided. Is it possible to show that these assumptions hold with some degree of generality, and for some values of the parameters (curvature, ...) that might depend on the specific model?
Although very well written, the paper is also very dense, and some definitions/notations are difficult to understand without referring to (Negahban et al. 2012); see the following comments for details.
Comments:
-line 142: Proposition 1 is important; please provide at least a reference for the concept of infimal convolution. Although the result seems intuitive, please make sure all necessary assumptions are correctly listed.
-line 157: please mention that the target parameter is the minimizer of the population risk
-line 168: the concept of a restricted set is introduced; however, without referring to other readings, it is difficult to understand how these sets are defined and used for each structure. Please provide some explanation.
-Corollaries 2, 3 and 4 do not mention whether the bounds hold with high probability; please clarify.

Typos:
-line 230: "subspaces are chosen so that $\prod$...$", I guess the projector on the subspace *orthogonal* to $\mathcal{M}$ should vanish.
-line 303: "we are now ready *to* derive"
Q2: Please summarize your review in 1-2 sentences
Good theoretical paper, providing insights into the performance of recently proposed sparse statistical models with direct relevance to machine learning. However, the range of models for which the numerous required assumptions are valid is not clear.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.

We thank the reviewers for their careful and encouraging comments and feedback.


Reviewer 4:

Our comparisons with Agarwal et al. were terse for lack of space; we will expand upon them in the final version. As we noted in the paper, their bounds did not go to zero as a function of n; this was in part because they did not impose any counterpart of our novel structural incoherence condition (C4) on the interaction between the different structured components, and instead imposed \ell_\infty bounds on the parameters.

On \Omega_\alpha: As the reviewer astutely notes, we set \Omega_\alpha = \Theta in all our corollaries; in fact, we intended to specify in Theorem 1 that (C3) and (C4) be satisfied with \Omega_\alpha = \Theta (alternatively, we could remove references to \Omega_\alpha entirely). This is because conditions (C3) and (C4) have 'tolerance' terms, and thus do not need a restricted set of directions per se. (In other words, in (C3), instead of requiring a quadratic lower bound over a restricted set of directions, we can require a quadratic lower bound over all directions, minus a tolerance term.) See the discussion of the restricted strong convexity condition in ref [12], which likewise uses only a tolerance term, rather than the restricted-directions version in the earlier conference paper, Negahban et al. (2009). (In any case, the restricted subset of \Theta over which we actually need conditions (C3) and (C4) to be satisfied is specified by the set in eqn 16 of the supplementary materials; but restricting to this set instead of all of \Theta does not lead to better bounds given the choice of the tolerance term, so we could set \Omega_\alpha = \Theta.)
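Schematically (with a tolerance function \tau_{\mathcal{L}}(n, p) introduced here purely for illustration), such a condition takes roughly the form

  \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \langle \nabla\mathcal{L}(\theta^*), \Delta \rangle \ge \kappa_{\mathcal{L}} \|\Delta\|^2 - \tau_{\mathcal{L}}(n, p) \quad \text{for all } \Delta,

so that no restriction to a subset of directions is needed: the tolerance term absorbs the directions along which the quadratic lower bound would otherwise fail.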

On more examples: we were constrained by lack of space, but we will add more examples to the supplementary materials (e.g., low-rank + row-sparse + column-sparse structure, and so on).

Reviewer 5:

On comparison to ref [8]: there, they provided a different class of guarantees, analyzing the sparsistency (recovery of the joint sparsity pattern) of their dirty multi-task estimator (specifically, sparse + block-sparse multiple linear regression), for which they needed additional, stronger irrepresentable conditions. In this paper, we restrict ourselves to providing parameter error bounds, for which we only impose much milder (restricted strong convexity and structural incoherence) conditions.

Reviewer 6:

The class of decomposable regularizers is fairly broad (see Negahban et al. 2012), but we agree with the reviewer that extending the analysis beyond decomposable regularizers is an interesting and important question for further research.

For \kappa_{\mathcal{L}} (and therefore \bar{\kappa}), we can directly appeal to previous results for the individual structures. For all the structures considered in our examples, \kappa_{\mathcal{L}} is a (small) constant; when the design X is drawn from a multivariate Gaussian, and with the linear regression loss, it depends on a restricted minimum eigenvalue of the population covariance matrix.
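For illustration only (a schematic statement of the standard Gaussian-design result, with constants c_1, c_2 depending on \Sigma): for the linear regression loss \mathcal{L}(\theta) = \frac{1}{2n}\|y - X\theta\|_2^2, with the rows of X drawn i.i.d. from a multivariate Gaussian with covariance \Sigma, one has with high probability

  \frac{1}{n}\|X\Delta\|_2^2 \ge c_1 \lambda_{\min}(\Sigma)\,\|\Delta\|_2^2 - c_2 \frac{\log p}{n}\,\|\Delta\|_1^2 \quad \text{for all } \Delta,

so that \kappa_{\mathcal{L}} scales with the (restricted) minimum eigenvalue of \Sigma.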

Thank you for correcting typos. We will fix them.