NeurIPS 2020

Synbols: Probing Learning Algorithms with Synthetic Datasets

Meta Review

This paper proposes Synbols, a tool for generating datasets of (augmented) images rendered from Unicode symbols sourced from thousands of fonts. The motivation is that the recent trend of using large, high-resolution image datasets to develop and evaluate methods requires a huge amount of compute that may not be available to everyone, and also leads to slow iteration cycles, not to mention the energy used for training. The proposed tool allows researchers to create smaller-scale synthetic datasets with control over parameters such as font, language, resolution, and texture, which can facilitate the debugging, iteration, and development of new methods. The authors demonstrate the usefulness of the tool on a number of tasks (supervised learning, active learning, out-of-distribution generalization, representation learning, object counting), and show that the generated datasets can already be used to clearly identify limitations and flaws of existing well-known algorithms in their respective paradigms. The reviewers raised concerns that were partially addressed in the authors’ response, which satisfied some of the reviewers. During the review process and discussion, we also clarified that this work is not about proposing a new data augmentation method; all reviewers agreed on this, and the final evaluations and reviews are based on this understanding (although IMO this could be made clearer in the paper's narrative to avoid confusion). One weakness of the work is that the datasets may lack the sophistication of natural images, which may limit their applicability, although I think they are fine as a tool for iterating on new algorithms. After the review process and discussion, I agree with a few reviewers and feel, like R4, "more confident in the good contribution of the paper", and I think this will be a fine addition to the NeurIPS community.
I recommend acceptance (Poster), and I hope that the tool will be made readily available to the community once the work is published.