Paper ID: | 9259 |
---|---|

Title: | PIDForest: Anomaly Detection via Partial Identification |

This paper proposes a new scoring criterion (PIDScore) for the "anomalousness" of data points, based on (loosely) information-theoretic intuitions around how difficult it is to identify or characterize the point, which can be expressed in terms of the sparsity of the containing subcube. This is a richer notion that goes significantly beyond previous notions of "low density" regions for outlier detection, and also has benefits around interpretability. A corresponding algorithm (PIDForest) is developed, similar to Isolation Forests but with some differences that are motivated by PIDScore designed to overcome known weaknesses. Overall, I found the paper to be well-written and well-contextualized with respect to other work. The experiments show good performance against sound baselines, and some targeted "lesion-style" experiments support the claims around the underlying reasons why the approach works. The ideas behind the scoring function could be a good foundation for other research in anomaly detection. MORE DETAILED COMMENTS AND QUESTIONS I found the presentation of the core ideas of PIDScore (Section 2) to be clear. The differences between this approach and iForest are described and motivated well. L107: this idea gets better context in the Appendix, it might help to at least put the appropriate cites here. L42: the comparison to the "implicit" scoring definition of anomalies in iForest is very helpful for context. L74: at this point in the paper it is not very clear what this means. A simple example might help. L198: should be "in any partition _of_ C" ? L229: the parenthetical VC-dimension comment should be backed up with either a cite or clarification in the Appendix. L230: this is very well put, this paragraph captures the key distinction with iForest. L249: spelling error, "Seismic" L283: nice to see some concrete wall-clock numbers here. L289 and L297: these "lesion-style" analyses do a good job supporting the proposed claims about the benefits of the specific mechanisms of PIDForest. L313: "iForestunder" to "iForest under" L307: the synthetic experiments successfully highlight issues with other approaches. L453 (Appendix): in eq on LHS, it should be \bar{f(J(j)} (not I(J))? The Related Work section in the Appendix was informative. The Wigderson and Yehudayoff reference seems important enough to include in the main paper. The associated supplemental code submission was exceptionally thorough, it was especially helpful to browse the synthetic experiment notebooks. UPDATE: I have read the author response. The clarifications and additional detail are helpful, and if possible it would be good to slightly modify the discussions in the paper based on these. The addition of the other approaches to the "noise dimension" plot and associated discussion were also valuable, and could be included as well (perhaps in the Appendix).

This paper presents a novel outlier measure and an outlier detection algorithm to estimate the proposed measure. The tree-based measure designed can handle data sets with attributes with different nature (categorical, numerical, binary, etc.) and different scales (for numerical attributes). The authors propose a random forest-like algorithm to estimate the outlier measure. They also conduct experiments on 12 data sets where the proposed algorithm achieves the best performance on 6 data sets. The problem studied in this paper is interesting. Outlier detection is a classical task with various applications. Finding outliers in large-scale, heterogeneous data sets remains a challenging problem. One of the major contributions is the proposed outlier measure. The authors provide an outlier measure PIDScore defined based on inverse density (i.e. sparsity). They also point out how this measure unifies binary attributes and numerical attributes and achieves scale invariance for numerical attributes. I find the outlier measure definition simple yet quite intuitive. The algorithm provided is a heuristic estimate of the PIDScore. The algorithm aims to generate subcubes with large variance in terms of sparsity. However, the authors do not provide detailed explanation on why this is desired. The authors do not provide theoretical analysis on how well the algorithm can approximate the PIDScore either. Generally, the clarity of the algorithm needs improvement. The experimental results show that the algorithm outperforms baselines in some data sets but shows lower (sometimes substantially worse, e.g. on Vowels) performance on other data sets. Do authors have detailed analysis on why the algorithm does not work well on some data sets? The authors provide relatively thorough comparison against iForest, which is insightful. However, the authors may also want to include in-depth comparative analysis of the algorithm against other baselines such as PCA and KNN, especially on data sets where they achieve better performance. Detailed comments: 1) Line 17, 66, ...: heterogenous -> heterogeneous? 2) Line 168: it in -> it is 3) Line 177: We do not how -> We do not know how 4) Does the algorithm work for numerical attributes with unbounded values? %=== After Rebuttal ===% Thanks the authors for their responses. There are something in the responses that I think is worth being included in the paper. 1) I think the authors could consider to include their comparison between kNN/PCA to their methods into the final paper as it provides a more complete picture to the readers. 2) I also think it is reasonable to include some other possible options other than the proposed algorithm, and why the proposed algorithm is chosen. This can provide better understanding of the connection between the algorithm and the proposed method, and could also be inspiring for future research. 3) I suggest the authors to also investigate some earlier literature of grid-based clustering, which might be relevant. With these points addressed, I believe the paper will be more solid.

Besides the above-mentioned strong points of contributions, the following concerns should be taken into account to improve the current paper. (1) Presentation: - The Appendix is too long, but it is an optional for reviewers and readers to further read. The important contents, such as the Table 2 describing the parameters of data sets, figures of experiment results, and the related work, should not be put at the Appendix. The authors are suggested to try to make the introduction to the key idea of their approach more compact, so that these key information can be included into the main body of the paper. - The title with the keyword of "Certification" is misleading to the reviewers and readers. Actually, in the paper, there is nothing addressed that is related to how the PIDForest is doing the certification of anomaly. The authors are suggested to reconsider a more appropriate title. (2) Experiments: - In the algorithm of PIDForest, there are several parameters that will have significant impact on the performance of efficiency and accuracy. Such as $k$, the partition number would influence the size of the forest structure, which will finally influence the speed of finding optimal split for a node. These parameters $t$, $m$, $k$, and $h$ should be varying on the synthetic data sets to show their impacts on the PIDForest algorithm. - In Fig. 1 (b), the authors only showed the results of comparisons with three related methods, but indeed there are six comparative methods that are used in this paper. Why do not show the results with others, including RRCF, SVM, kNN and PCA? Is it probably that the others would have higher ability of handling random noise? Or the proposed algorithm would still show higher ability with random noise? It is not clear to reviewers and readers. This important point should be clarified in addition. - A minor concern is that the word "EM" is the first time appearing in the paper but without any descriptions. Although its meaning should be straightforward, it is still worthy making it clear explicitly for readers.