Review for NeurIPS paper: Certified Monotonic Neural Networks

NeurIPS 2020

Certified Monotonic Neural Networks

Review 1

Summary and Contributions: This paper deals with learning network that are monotonous with regards to the inputs. As opposed to approaches based on achieving monotonicity by construction, the authors heurisitically encourage monotonicity at training time by penalizing negative gradients at sample points and verify that monotonicity is achieved at test time. The verification is accomplished by using standard MIP formulations of piecewise linear neural network to verify that two layer neural network are monotonous, and decomposing deep networks into a stack of two layers neural network, requiring that each two layers neural network be monotonous.

Strengths: - This paper deals with an important problem. In addition to the motivation of fairness that the authors give, enforcing monotonicity also often makes sense when we are attempting to model physical systems for which we have domain knowledge. - The contribution relies on known techniques from Neural Network verification but builds novel elements on top (such as formulating the verification objective of the gradient sign) and apply those techniques to a new setting where they haven't been applied yet. - Experimental results are convincing and clearly shown, both on small examples where visualisation is possible (Figure 2) and on quantified datasets (Table 1 to 4) where the authors obtain strong results with less parameters than other methods.

Weaknesses: The whole section 3.1 feels out of place and unnecessary at the moment. Unless I'm mistaken, if verification of global monotonicity can be achieved, there is no point in performing individual monotonicity certification. As it is, this subsection feels un-necessary and is also not empirically tested.

Correctness: The method is correct and the empirical methodology tests their training algorithm correctly. One point that would be worth discussing is the convergence of their training algorithm described in Section 3.3. Is there a possibility for the problem to never converge to a solution with positive U_\alpha, leading to an infinite loop or would there be at least some arguments for why this would not happen?

Clarity: The paper is extremely well written, and quite easy to follow. The authors do a great job at progressively building up the description of their method.

Relation to Prior Work: Yes, the paper appropriately describe its relation to both the verification literature whose methods it uses, and the "monotonous by construction" literature which it aims to improve upon. To the best of my knowledge, there is no missing discussion of related works.

Reproducibility: Yes

Additional Feedback: ====== UPDATE AFTER REBUTTAL =============== I read the reviews by the other reviewers and the rebuttal by the authors. Most of the reviews seem positive and I personally think that the combination of making verification work (by decomposing the network in 2-stack) + sampling based regularizer to encourage monotonicity is interesting, so I'm quite happy to maintain my score where it is, especially if the author include the missing related works pointed by the other reviewers. I would nevertheless advise the authors to include a rigorous justification of the comments they gave in G#2 of the rebuttal in the final version of the paper. ====== UPDATE AFTER REBUTTAL =============== l.45-46: I would be wary of claiming support for "arbitrary" neural network architecture when using MIP would restrict the supported architectures to piecewise linear one. I understand that the distinction is made with regards to special purpose monotonous architecture like DeepLattice. Section 4: Are the networks that are used and for which performance is reported all guaranteed to be monotonic over the whole input domain? If so, I would advise the authors to make a note of it. Section 4.1: Is there actually a layer with only two neurons in the architecture used?

Review 2

Summary and Contributions: The paper presents 2 interconnected methods, with the second being more significant than the first. The first is a technique for training a ReLu neural network with monotonicity constraints by augmenting the usual objective function with a term which penalizes violation of monotonicity. Although this method on its own is not as novel as it is presented as being, the second method appears to be an important and novel contribution: a method for verifying whether a ReLu neural network is exactly monotone. The overall training technique consists of a back and forth between the first method and the verification step, with the penalty for monotonicity violation being strengthened if verification fails. Verification for deep networks (more than 1 hidden layer) is addressed by decomposing a deep network into a series of 2-layer networks. Positive experimental results (superior generalization) as compared to several competing techniques are presented on several real-world problems. A method for enhancing the interpretability of more general (non-monotone) neural networks by adding a monotonic layer is also presented.

Strengths: The ability to certify exact monotonicity for the general ReLu network case (not just the trivial all-positive-weight case) is an exciting and worthwhile contribution to the literature. The experimental results are also strong. It's also encouraging to see that the method learns non-trivial monotonicity implementations, i.e. negative weights but nonetheless a monotonic function overall.

Weaknesses: I'd probably be OK with the paper being published as is, but I feel strongly that the authors should clarify one key part of their algorithm. I would be happy to increase my rating if they could make this clarification. The objective function presented in eq. 11 after line 165 is shown only at a very theoretical level which skips over important technical details about how the R(f) monotonicity-violation term is actually evaluated. As the authors themselves say, if R(f) equals 0 in the true, exact, analytic sense of the expectation w.r.t the uniform distribution over X, then the network is indeed monotonic (aside from pathological measure-0 scenarios which we will leave aside). But if it is indeed truly monotonic, then you would not need to verify it with the MILP solver. So as I tried to figure how how R(f) is implemented in practice, I deduced that it must be approximated by sampling the partial derivatives at various locations in the input space rather than being R(f) being calculated analytically. However, as far as I can tell, this sampling procedure is not described or even acknowledged anywhere in the main paper. I had to go to line 394 in the supplementary to see an acknowledgment that the regularization involves sampling. Even in the supplementary, the details of the sampling are not described. Furthermore, I need more details on the lambda multiplying R(f). If lambda is a single, real-valued parameter and the number of samples in the estimation of R(f) is fixed, then increasing lambda and retraining doesn't make that much sense to me. Obviously, you wouldn't stop training the first time unless the sampled R(f) =0, since the network is certainly not monotonic overall if the sampled R(f) > 0. Buf if the sampled R(f) does equal 0 but the certification via U_alpha fails, then I don't see why increasing lambda would necessarily help, because the parameter solution the ReLu network found the first time, which satisfied "sampled R(f)=0" and therefore lambda*sampled R(f)=0, would still be available to the network for a larger lambda and would once again achieve 0 for the regularization term and just as low a loss (L(f) ) as before on the training data. Now, I suppose in practice, with a higher lambda, the network empirically might tend to find a different local minimum wherein the network is truly globally, certifiably monotonic everywhere, but if this occurs in practice, it seems like kind of a fluky effect which I would not want to rely on in the future, since it seems very unlikely to me that the training loss component would be as good in the certifiably monotonic case as it was in the case where sampled R(f) was zero but certification fails. So is lambda just a tunable single real number and is the number of samples of the partial derivative in the empirical, sampled R(f) fixed ? Or does "lambda" really somehow represent the number of samples in the sampled R(f) ? It would make a lot more sense to me if the method involved increasing the number of samples and retraining if the certification fails, rather than increasing a real-valued lambda for a fixed sampling. Assuming I am correct that R(f) is sampled at points within the input space, how many such points were used?

Correctness: The methods are correct as far as I can tell. My main concern is the vagueness of the presentation of the regularization methodology, as previously explained. Line 63: Min/Max "does not work well in practice" seems a bit unfair given that it is probably tied with Crystal overall when looking at the 4 experiments.

Clarity: Here are several suggested edits: Line 95 This kind of problems -> This kind of problem Line 142 Following the same spirit with the precious section -> Following the same spirit as the **previous** section Line 152 abundant of heuristics -> abundant heuristics (or an abundance of heuristics) Line 157 has been an active research areas -> active research area Line 195 scarification -> increase Line 275 Since it is a fundamental machine learning methodology, We -> we

Relation to Prior Work: Training a neural net with a regularization term which penalizes violation of monotonicity on a sample of the input space was first presented in Sill/Abu-Mostafa, Monotonicity Hints, NeurIPS 1997, so this paper should be cited. Although Sill/Abu-Mostafa penalize function value differences rather than true gradients, the method is close enough that it should be cited. [13], A. Gupta et. al is not the first example of this type of technique.

Reproducibility: No

Additional Feedback: I think I would feel very good about the paper if I can get enough clarification about how R(f) is implemented in practice. **** Update after reading author feedback **** Thank you for describing the details of the R(f) implementation. After reading this explanation, I have chosen to raise my score to a 7. I *strongly recommend* that this description (512 samples, with different samples at each iteration) be included in the text of the paper itself. This is the kind of experimental detail which can make the difference between someone else successfully re-implementing the paper or not. It was very valuable for me to learn that the samples are re-generated and different at each iteration. If the samples were the same throughout training (across all iterations) then increasing lambda and re-training if certification fails would not have made any sense, since the model parameter settings resulting from the first training would have sampled R(f)=0 on those fixed samples and there would be no reason to expect the second training with increased lambda not to find that same failed parameter setting. Knowing that different samples are used at each step makes all the difference, but this info was not in the first version of the paper. So I really think that needs to be in the paper. I realize there are tight space constraints for NeurIPS papers, but there must be something else the authors can cut to fit that in.

Review 3

Summary and Contributions: In this paper, the authors proposed methods to train and verify monotonic deep learning models. The authors adopted a two-step iterative approach: first train a deep learning model with penalty on the violation of monotonicity; second, verify whether the model is monotonic through an optimization problem which can be solved using mixed integer linear programming; then, iterate to increase the penalty weight until the monotonicity is verified to be satisfied. Overall, I think this is a novel paper, with a few weaknesses. Hence I gave a score of 5.

Strengths: Training monotonic models is an important problem, with vast applications in interpretability, security and fairness. This paper proposes a method to train flexible deep learning models and are yet monotonic. The proposal is novel, and numerical studies show that the proposal is statistically on par or better than some competing methods.

Weaknesses: The biggest weakness of this work is the challenge in the optimization problem in (10). As a result, various simplifications need to make, which sacrifices the performance of the proposal. First, it is computationally intractable to verify the overall monotonicity of a deep learning model; rather, we need to decompose the deep learning model into stacks of two-layer models for verification. Second, we may need to stop MLIP early. Thus, even if no adversarial example is found, there is still no guarantee that the model is fully monotonic. Another weakness of this method lies in the numerical studies. First, the proposed method is not statistically better than DLN in Loan Defaulter or Chest X-Ray datasets. Second, the authors did not show train metrics, which made it hard to understand whether the improvement comes from more flexible modeling, or from better generalization. Third, the results are averaged over 3 runs (it is unclear what the difference is between the 3 runs), making the validity of the error bars questionable. Fourth, the author does not make it clear the specifics of the MLIP solver (error tolerance etc.). Thus, as I mentioned above, the final model may not be guaranteed monotonic, and it would be important for the author to show the final verification result of the monotonicity of the model. Finally, due the iterative nature of the proposal: train -> verify -> train ..., hyperparameter selection becomes more difficult, i.e., the smallest lambda that makes the model monotonic should also depend on the model architecture, and the performance of the model depends both on lambda and model architecture. Thus, it is unclear to me how to efficiently choose the optimal lambda and model architecture that gives us 1) a monotonic model 2) that achieves the best performance.

Correctness: The methodological aspect of the method seems correct, though I did not carefully check the math. The numerical studies have room for improvement; see above.

Clarity: The presentation of the method is clear. The presentation of the numerical studies need improvement; see above. Other comments: Numerical study results need to be expanded; see above. What are the hyperparameters used for competing methods? How did you choose them? The right panel of Figure 1 is not clear. Please elaborate. Minor points: In the product price example provided at the end of page 2, the presentation is a bit vague. I suggest the author to make this example more specific. What can this feature x be? How could adversaries utilize the attack in practice? etc.

Relation to Prior Work: The authors discussed the connection and differences with existing methods.

Reproducibility: Yes

Additional Feedback: =========================================================== I appreciate the authors to address my concerns and questions. I especially appreciate the authors to include training loss in the rebuttle. I have updated my rating accordingly.

Review 4

Summary and Contributions: The idea of monotonic networks is the ability to impose monotonicity constraints on the subset of input features. For example, it might be unfair to offer a job to X over Y when X has better test scores and all other features are similar. Most trained neural networks are not guaranteed to satisfy monotonicity constraints. To enforce monotonicity in a trained model, it requires special network architectures that can lead to restricted models. Otherwise, it is not possible to guarantee monotonicity constraints in the general networks. In this paper, the authors use MILP to detect the violation of monotonicity constraints and use appropriate monotonicity enforcing regularization to ensure that the trained model satisfies monotonicity constraints. At a high level the main idea is to keep increasing the regularization till we guarantee that the trained model satisfies monotonicity constraints.

Strengths: 1) The paper is well written and addresses an important problem in the context of enforcing monotonicity in the solution of deep neural networks. 2) The solution using MILP to check for monotonicity is elegant. 3) The experimental results are convincing and shows improvement over existing baselines on multiple datasets. 4) The application to CNNs and interpretability benefits are shown on MNIST datasets.

Weaknesses: 1) The general idea seems to use a monotonicity enforcing penalty function as the regularization term and keep increasing this till the network satisfies motonocity conditions. Is there any guarantee that we can always achieve this through training? 2) This might be a general criticism regarding the work on monotonicity enforcing networks. In many real world situations, non-monotonic features change as well. There may be cases where some monotonic features increase one way and some decrease the other way. In such cases, it is not clear as to what the proposed algorithm will do. There are more realistic cases, and the monotonicity conditions may still not be very useful. 3) The MNIST example for using monotonicity for better interpretability is interesting. It would be good to know if the monotonic network achieved better classification accuracy or not. 4) Table 5 needs better explanation. What are A,B, C, D in Table 5. Also, there should be some explanation as to why positive paths lead to monotonic solution and under what scenarios? 5) The use of consecutive 2-layer networks for enforcing monotonicity seems like a good trick to improve the computational efficiency. As the paper mentions, this may lead to a more restricted monotonic network. It would be good to know if removing this restriction at the cost of computational efficiency would lead to better performance or not. 6) I am also a bit unclear about extending this method to larger networks with CNNs or ResNet architectures.

Correctness: The claims and the method seem to be correct.

Clarity: The paper is well written except the part in Table 5 mentioned before.

Relation to Prior Work: The paper discusses the prior work. The MILP formulation used in this paper reminds me of the MILP used in the recent formulations used in Fischetti et al. 2018 and Serra et al. 2018: [a] Matteo Fischetti and Jason Jo. Deep neural networks and mixed integer linear optimization. Con- straints, 23(3):296–309, 2018. [b] Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and counting linear regions of deep neural networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018. It would be good to discuss the MILP with respect to some of these existing formulations for DNNs, since the formulation used in the submission is only for 2 layers.

Reproducibility: Yes

Additional Feedback: I went through other reviews and the rebuttal. Overall, the authors have responded to my major concerns, and also provided the necessary clarification for Table 5.