NeurIPS 2020

### Review 1

Summary and Contributions: In this paper, the authors present an algorithm for constrained (multi-class) classification problems where both the objective and the constraints depend on the confusion matrix. In particular, the results of the paper may be applied to fair multi-class problems. The authors present an algorithm inspired from both augmented Lagrangian and Frank Wolfe algorithms. Since one of the two linear minimizations is not directly possible, the authors use plug-in estimators to circumvent this issue. For a consistent plug-in estimator, the authors show that the output given by the algorithm converges to the optimal solution (when the number of sample is infinite). The theoretical result is accompanied with empirical results. UPDATE: I would like to thank the authors for their feedback. I keep thinking that the paper deserves to be accepted for publication. However, note that the statistical results can be improved (see reviews).

Strengths: The paper widely improve the results from "Learning with complex loss functions and constraints": 1) It holds under weaker assumption 2) The dependence with respect to the number of iteration is improved 3) The dependence with respect to the number of constraints is also improved. Moreover, the result is very general and can be applied to many multi-class classification problems. The empirical results highlight all these improvements.

Weaknesses: The paper is a little bit abstract. It could be nice to see if all the assumptions of the paper are verified for usual problems (fairness for example). For example, is it the case that $\psi$ is Lipschitz and bounded ? It is very clear, but it would be even clearer with the precise presentation of an example.

Correctness: There is a mistake in the main Theorem: $\rho$ should be $\sqrt \rho$ (see Theorem supplementary material). It's minor and is just a typo I guess !

Clarity: The paper is VERY clearly written. Everything is fluent and easy to follow. It was a great pleasure to review this article !

Relation to Prior Work: I am not an expert of the field, but the bibliography looks complete and well-organize.

Reproducibility: Yes

Additional Feedback: TYPOS: -Line 90: a space is missing -Line 167: 1 should be $j$ in the conditional probability -Line 211: problem in this the second limit. It should not be 0. COMMENTS : Here are some (minor) details that could improve the clarity of the paper. 1) It could be nice to add the proof of proposition 2 2) Since the parameter $\lambda$ seems to be important (see the dependence in the main theorem), it would be nice to write the augmented Lagrangian with an index depending on $\lambda$. Could you precise what is the choice of lambda ? Normally, there should be a lower bound on $\lambda$. The main theorem should hold for any $\lambda \geq \bar \lambda$, with a specific $\bar \lambda$ 3) In the main Theorem: it would be nice to explicit the constant $\bar \epsilon$ 4) In the main theorem: say with respect to which random variables the expectation is taken in line 199 5) Put the definition of $Delta_n$ (the simplex in R^n) at least say it.

### Review 2

Summary and Contributions: The authors look at multi-class classification problems under constraints depending (via a convex function) on general functions of the (ideal) confusion matrix, where the learning algorithm is assumed to output a stochastic classifer. Essentially, they break down the problem into two sub-tasks, related to the primary objective (also convexified here) and constraints, respectively, and working this sub-tasks into a linear objective. This formulation makes it reasonable to use traditional Frank-Wolfe type updates, and starting from a pre-trained base classifier (randomized), they use this update strategy to iteratively ensure the resulting (randomized) classifier does a “good” job (in a relative sense) of satisfying the constraints while still similarly achieving “good” performance in the primary objective. They give a procedure which is somewhat abstract, but one expects it can readily be implemented by users who flesh out the key abstracted components (sufficient stats, pre-trained classifier, linear minimization sub-routine). The authors give error bounds for this general-purpose procedure, depending on the approximation error of the pre-trained classifier and statistical error of typical empirical estimates in their sub-routines. The exposition comes with a fairly solid discussion of their algorithm design, and empirical tests are conducted to evaluate a practical implementation of their high-level procedure.

Strengths: The problem of interest is clear (if somewhat abstract), and the algorithm proposed here is implementable, grounded in clear computational and statistical principles, and comes with both theoretical and empirical analyses, yielding initial evidence for the efficacy of their approach, claiming that it is an order of magnitude more efficient (in terms of oracle calls) than a closely related prior work.

Weaknesses: The main limitation of the proposal here is that it is just a general-purpose wrapper around a pre-fixed classifier; in particular, this means that the actual quality of the output depends entirely on the quality of this base classifier, and that for certain special cases (of classifiers, sufficient stats, and constraints) it is perfectly plausible that a more efficient approach tuned to the setting could be found. This last point is the price that is paid for generality, of course.

Correctness: For the most part, the paper appears correct; however there is a clear issue with the statistical error term of the underlying procedure. The statistical error (the sqrt(log(1/delta)/N) term) is stated so roughly as to be on the verge of being patently false, as dimension dependence is completely hidden in the big-O notation. The authors’ procedure requires estimating the mean of a d-dimensional random vector, at each step. The authors casually mention (in the appendix) that a uniform version of Hoeffding’s inequality can be made with proper capacity notions, but that this is “not significant.” While somewhat technical, hiding the dependence on d (an important parameter from the perspective of the user constructing the performance metrics) is misleading.

Clarity: The main exposition and layout of the paper is fairly clear.

Relation to Prior Work: Prior work is discussed clearly.

Reproducibility: Yes

Additional Feedback: For the most part the paper is well-written, but I think that it should be a priority of the authors to present their main statistical result in the most clear and genuine way possible. The current form hides dependence on the dimension of the sufficient statistics and the iterative nature of the process (as mentioned above), and I believe needs revision. UPDATE: Having seen the feedback, I maintain my initial assessment, including the issues raised above.

### Review 3

Summary and Contributions: The paper proposes a novel approach to learning a multiclass classifier under structural constraints motivated from fairness applications. The performance of a candidate classifier~$h$ -- a general mapping of the feature space on a probability simplex -- is measured through the so-called confusion matrix'' $C[h]$ whose vectorization stacks the vector of expected sufficient statistics extracted from a datapoint $(X,Y)$ and the classifier output~$\hat Y$. The goal is then to solve a convex program, where a smooth and convex los is minimized over the set of achievable confusion matrices, corresponding to the fixed (and unknown to the learner) population distribution, under a number of (convex) functional constraints. This setup has been earlier considered by Narasimkhan (2018). The authors advocate the following approach: the problem is cast as convex program with smooth objective, and with constraint set given as the intersection of the two sets (of confusion matrices): - the feasible'' set $\F$ corresponding to the functional constraints; - the achievable'' set $\C$ corresponding to all possible confusion matrices for the data-generating distribution. As such, only the set $\C$ depends on the unknown data-generating distribution. Following the approach in the earlier work of Narasimhan (N18), the authors then use that a linear minimization oracle for $\C$ corresponds to the Bayes-optimal predictor of the confusion matrix; thus one can approximate this oracle via the plug-in approach. Using this idea, N18 advocated an approach based on the Frank-Wolfe method. The departure of the present work from N18 is in how they the functional constraints are treated: instead of incorporating them into the Lagrangian, as done in N18, they use the augmented-Lagrangian type approach which leads to a better optimization complexity estimate.

Strengths: Theoretical analysis seems overall sound, although there is laxity in some of the derivations as I describe below. Another merit of the paper is a thorough experimental analysis; the results demonstrate the superiority of the proposed method to the baseline of N18 on a few datasets.

Weaknesses: The main issue I see with the paper is what I perceive as incremental nature of its contributions with respect to earlier work, notably Naramsimhan (2015; 2018-N18) and Gidel et al. (2018-G18). The connection between LMO for the set $\C$ and Bayesian predictor (Prop. 2 of the present paper, given without proof and explicit reference) has been first observed in N18; notably, nor N18 nor the present paper gives explicit proof. The sample complexity analysis for the plug-in approximation of LMO has been given in N18. In fact, to my (limited) understanding, essentially all components in the proof of Theorem 1 can already be found in existing works (N18 and G18). Besides, the notation closely follows that of N18 which I personally find a bit disturbing. Apart from these problems, I can see the following minor technical issue: as far as I can tell, the second bound in Lemma 1 of the supplementary must include a dimension-dependent factor. Indeed, the authors give this bound without a proof, with a brief informal discussion. Yet, from this very discussion I can see that: (1) even without covering the bound for the l2-norm is $O(\sqrt{d})$ larger than the coordinatewise bound; (2) covering might result in another~$O(\sqrt{d})$ factor. Perhaps the authours meant that the constants $c_1, c_2, ...$ are allowed to be dimension-dependent; in this case this could have been mentioned explicitly.

Clarity: The clarity is overall reasonable but could have been improved. In particular, Proposition 2, given without proof, looks a bit cryptic. Mentioning standard uniform convergence argument'' in the proof of Lemma 1 is problematic (as I discussed in my previous answer).

Relation to Prior Work: The relevant work is mentioned and the contributions are positioned overall correctly with respect to it.

Reproducibility: Yes

Additional Feedback: UPDATE: In their rebuttal, the authors did address the minor of two issues I mentioned, the one with the dimension-dependent constant'' factor. Hence I'm improving my score from 5 to 6. I am still not convinced that technical contributions here are significant enough to be the key decision factor in whether the paper should be accepted. But the approach might be interesting to the broader community,

### Review 4

Summary and Contributions: This paper presents a consistent algorithm for constrained classification problems where the constraints and objective are general functions of the confusion matrix. The proposed method reduces the constrained learning problem into a sequence of sub-problems that learns a plug-in classifier. The main contributions are a reduction in the number of calls to the plug-in sub-problem by formulating the learning task as optimization over the intersection of two convex sets.

Strengths: Compared to a previous but similar approach, the current method achieves a better convergence rate, extends to any non-smooth convex constraints (as opposed to requiring smooth and differentiable), and is able to take advantage of specialized solvers for optimization.

Weaknesses: Limitations of this work currently lie in the requirements of convex constraints, and also for the consistency proof, smoothness and Lipschitz are required. The empirical results don't appear to be largely better, motivating the question, is this paper's contribution mostly on a faster convergence rate?

Correctness: The claims appear to be correct

Clarity: For the most part, the paper reads well though there are a few typos/grammatical errors. Line 78 what is \Delta_n? It is present throughout the paper (in algorithm 2, line 175). Line 192, Is B(u, r) a ball with radius r?

Relation to Prior Work: Relation to prior work is clearly discussed, and builds off previous work requiring a larger number of calls to the plug-in learning routine.

Reproducibility: Yes

Additional Feedback: UPDATE: I have read the authors rebuttal, but will keep my original score unchanged.