Reviews: Optimal Sparse Decision Trees

-Figure 1 of the evaluation section only has OSDT-WarmStart; why is OSDT missing? -A runtime benchmark against BinOCT and CART needs to be included in the evaluation section as well. -Figure 4 shows the optimal tree generated by OSDT on COMPAS; what do BinOCT and CART generate? Other nits: Line 98: the upper bound of the sum should be "D - 1", instead of "d - 1". Line 124: "alternately" should be "alternatively"

Reviewer 2

Originality: This work is an improvements on existing methods. The originality comes from identifying various bounds that one can impose to prune the search space. It is clear how this work differs from existing works. Quality: The work is technically correct. The experiments demonstrate the claims that the method is better at finding optimal trees than existing methods. The experiments seem thorough though there is no description of the weaknesses of the method. Clarity: The paper has been written in a manner that is straightforward to read and follow. Significance: There are two factors which dent the significance of this work. 1. The work uses only binary features. Real world data is usually a mix of binary, real and categorical features. It is not clear if the method is applicable to real and categorical features too. 2. The method does not seem to be scalable, unless a distributed version of it is developed. It's not reasonable to expect a single instance can hold all the training data that the real world datasets ususally contain.

Reviewer 3

Originality: Training of optimal decision trees is clearly a problem that has seen a lot of prior work. A distinguishing feature of this submission is that it focuses on optimal *sparse* decision trees for binary variables, and that the approach seems to be feasible in practice, which is achieved by a combination of analytical bounds that reduce the search space as well as efficient implementation techniques. The work builds upon the CORLES algorithm and its approach to creating optimal decision lists. However, the authors extend this approach to decision trees in a non-trivial manner that adds substantial novelty. Quality: The claims of the paper are very well supported by theoretical analysis as well as experiments. Besides a quantitative comparison, the experiments also give some insight into what optimal decision trees look like, and where non-optimal approaches fail. Clarity: The manuscript is very well-written and organized. The main manuscript is very condensed, which can make it hard to read at times; but the supplementary includes all relevant material at great detail. Significance: The paper mainly focuses on improved training accuracy. To further underline the practical relevance of this algorithm, the authors would need to demonstrate convincingly that improved training accuracy also translates into improved test set generalization. This can be harder to demonstrate than training accuracy, because it depends heavily on the data, the number of leaves, etc. The paper has no claims in this direction, instead focusing on the training problem. The supplementary material does provides numbers for accuracy on the test set for a number of problems, but the results are less conclusive than for training.

Paper ID:	3960
Title:	Optimal Sparse Decision Trees

Reviewer 1

Reviewer 2

Reviewer 3