Paper ID: | 5087 |
---|---|

Title: | Random Tessellation Forests |

The paper is a novel generalization of the well-known Mondrian Process. However, some of the portions could have been explained in words to make the paper more readable. The content appears original and it provides a convenient and general approach to classification. Although not much is discussed about the large sample consistency of the method, the approach appears to be intuitive in itself.

The authors present a very interesting new extension to random partition type priors with an appealing inference scheme. I think it is a solid extension of existing work. Numerical results suggest that it can perform just as well or better than existing random partition models, though I'm curious about the comparison of runtimes against various methods. One other thing I noticed is that you choose the particle with the highest weight after the SMC procedure. It seems like prediction would be improved by using an average of the prediction from each particle with respect to the particle weight. Is there a more principled reason as to why you do not do this? I think a discussion of your method with in comparison BART type methods would be helpful, as many people are more familiar with BART than the Mondrian method, and could help the reader unfamiliar with this literature. After reading the author rebuttals and the other reviews comments, I am inclined to vote "accept" for this paper. I think the idea is quite novel and well written. I think the addition of comparisons, even in the supplementary materials will help strengthen this paper. Also, please add standard errors for the results in Table 1.

General metrics: Originality: Yes this is an original contribution. The authors have also properly highlighted the context in which it sits. Quality: This work is high quality and polished and well presented. Some improvements could be made, see comments below. Clarity: The work is clearly presented, the figures and algorithms are useful. A handful of minor presentation issues highlighted below. Significant: The work feels like a significant new contribution and I can envision practitioners using these ideas and other theoretical research building on this work. Major comments 1. The idea presented is theoretically interesting and novel in an ML context. I thank the authors for a pleasurable read. 2. The approach would only seem to work in a bounded domain. (Since the measure $\Lambda$ is partly made up of Lebesgue measure on the half-line, $\Lambda([a])$ the normalising constant would be infinite otherwise.) This is stated by the authors at the head of the algorithm though possibly could be highlighted more clearly. eg. on line 61 it should read $W \subset R^d$ (ie strict subset). 3. I was sceptical that rejection sampler would work as written in a space of even moderately high dimension. As the dimension grows, does the smallest ball containing the convex hull of the data not increase in relative volume to an infeasible level? Or does the hyperplane (of dim d-1) still intersect a with reasonable probability? The example in Sec 3.2 seems to suggest it's still fine. Some reference to this problem, and the performance of the rejection sampler in relation to the dimension of the problem, should be included. 4. Similarly the authors state eg on line 168/169 that explicit computation of the polytopes a is not required. In this way they do not have access to the the value of $\Lambda([a])$, which is the normalising constant required in Algorithm 1. They state instead that they approximate the value with the volume of the covering ball. Is this approximation not increasingly poor in higher dimensions? Please refer to this issue, possibly in conjunction with the previous point. 5. Line 170-173. This seems very strange. What if lots of the test data is outside the collection of convex polytopes? Isn't the whole point of the inference to label the test data rather than average over the possible values of the labels? Or is this just used to form possibly larger convex hulls to include the test data? (In which case, this does not seem very robust in the case of inference for a second test dataset.) I am possibly misunderstanding what you are doing here but either way this section requires significant clarification. 6. line 206. The tesselations in T will not be equal wp1. So what is the meaning of the mode of T? Is it simply the single tesselation with highest marginal likelihood? Please explain more clearly in the text what you are choosing. 7. What is the effect of setting the Dirichlet parameter $\alpha$ as you have in line 219. Are there higher level features of the model that would change based on the setting of this value? cf Polya trees where eg continuous or discrete distributions can result depending on simply the value chosen for the equivalent parameter. Minor & notational comments 1. line 61 define Y. (I think you should just put 'Y' between the words `collection' and `of'). 2. line 86 H in italics 3. line 93 $\lambda^d$ should be superscript 4. line 133 P(Z) does not appear in the preceding equation. Do you mean P(V,Z)? Or P(V|Z)P(Z)? 5. line 139 what is $\phi^{-1}?$ Define. 6. line 146 Is it not sufficient to sample $u$ from [0,r]? As written this contradicts the definition of u earlier. 7. line 161. The notation of intersection with V is confusing (refers to intersection with convex hull). Your own notation on line 16 of algorithm 2 is much clearer, suggest you use that. 8. line 163 insert "strict" between "a" and "subset". 9. line 219. typo "to \alpha to n" 10. line 229. what does centre the points mean? write mathematically? 11. line 232. which category is coloured red and which black? (would be easier if you just stated it) 12. line 259-264 several typos principle -> principal. also analysis. 13. line 268 what is uRTF.i and MRTF.i? not defined anywhere. 14. line 272 if the budget is infinite, when does the partitioning procedure stop? UPDATE AFER RESPONSE: Thanks for detailed response. Please include the updated polytope sampling procedure (comment 3) in the final version. While the dimensionality issue still exists I am satisfied the authors are not sweeping it under the carpet and it is good that they are attempting to find strategies to deal with it. Explaining these clearly in the final submission would improve it. Include the D=85 figure to answer any similar questions raised by future readers. I am still not totally clear about the response to Comment 5. What is the effect of `snapping to the nearest'? This seems like it could be a significant source of error. Please further explain the assertion wrt missing-at-random data in the final submission. Please also include in the final submission the clarifications asked for in the minor comments which you did not specifically address in your response (presumably due to space constraints).