__ Summary and Contributions__: The paper presents an approach to image clustering that computes the distance of each input image to a cluster prototype image by first transforming the prototype to match the image and then computing the distance between the transformed prototype and the input image.
The main idea is to train deep neural networks (one per transformation) that take an image as input and output the corresponding transformation parameters.
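To make the summarized distance concrete, here is a minimal sketch under stated assumptions, not the authors' implementation. Images are flat lists, the only transformation is a toy circular shift, and `peak_predictor` is a hypothetical hand-written stand-in for the learned network that outputs the transformation parameter. All names are illustrative.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[float]


def shift(proto: Sequence[float], offset: int) -> Image:
    """Toy transformation: circularly shift a 1-D 'image' by `offset` positions."""
    n = len(proto)
    return [proto[(i - offset) % n] for i in range(n)]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two images of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def peak_predictor(proto: Sequence[float]) -> Callable[[Sequence[float]], int]:
    """Hypothetical stand-in for a learned parameter-prediction network.

    Given an input image, it returns the shift that aligns the prototype's
    first peak with the image's first peak.
    """
    proto_peak = list(proto).index(max(proto))
    return lambda x: list(x).index(max(x)) - proto_peak


def assign_to_cluster(
    x: Sequence[float],
    prototypes: Sequence[Sequence[float]],
    predictors: Sequence[Callable[[Sequence[float]], int]],
) -> Tuple[int, float]:
    """Transform each prototype toward x, then pick the closest one."""
    losses = [sq_dist(x, shift(c, f(x))) for c, f in zip(prototypes, predictors)]
    k = min(range(len(losses)), key=losses.__getitem__)
    return k, losses[k]
```

In this sketch the prototype, not the sample, is transformed before the distance is taken, which matches the description above; a real system would replace `shift` and `peak_predictor` with differentiable modules.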

__ Strengths__: An image clustering approach is presented that jointly learns to cluster and align images.
The method provides good empirical results on challenging web image collections.

__ Weaknesses__: The main drawback of the paper is the lack of important details concerning the transformation parameters that are predicted (section 4.1). No information is provided about the outputs of the networks that predict the transformation parameters (e.g. how many parameters are to be predicted).
The optimization problem solved in the M-step seems to be hard.
Performance depends on cluster initialization and network initialization. There is no comment on this issue.

__ Correctness__: The method seems to be correct, however critical information is missing.

__ Clarity__: Section 4.1 needs to be improved to provide information about the network outputs and the transformation parameters that are predicted.

__ Relation to Prior Work__: Description of previous work is sufficient.

__ Reproducibility__: No

__ Additional Feedback__: The paper presents an approach to image clustering that jointly learns to cluster and align images. An interesting aspect of the paper is the application of the method to real photograph collections.
The method relies on training deep neural networks that output appropriate transformation parameters for each image. No information is provided about the outputs of the networks that predict the transformation parameters (e.g. how many parameters are to be predicted).
The optimization problem to be solved seems to be a hard one, since it involves both the GMM parameters and the network weights. The paper does not present convincing information about the viability of this task and how it depends on the initialization of both the image clusters and the network parameters.
In order to apply the method, the exact sequence of transformations should be specified. What happens if one or more transformations are redundant for a specific dataset?

__ Summary and Contributions__: ***Post rebuttal update***
I have read the authors' rebuttal and thank the authors for answering my questions. I am in favour of the paper's acceptance.
This paper introduces a new method for clustering directly in image space. Existing methods either build features and cluster in feature space, or use explicit image transformations to align the images before clustering in a joint optimisation manner. This paper also learns the transformations while clustering, with a single loss and a joint optimisation algorithm, for both K-means and Gaussian Mixture Model (GMM). However, the authors propose to predict the transformations of each data point with a neural network instead of optimising them. The method thus builds on Spatial Transformer Networks and integrates them into the clustering problem. Experiments are performed on standard benchmarks and on more challenging real images (web images) to validate the relevance of the method.

__ Strengths__: Strengths:
* This work is not a theoretical contribution, yet all claims are supported by strong empirical evaluation (an ablation study is performed and an extended comparison with existing methods is provided) and proof.
* The method is novel, although a comparison with relevant methods is missing (see below).
* The significance of the paper is good: it improves on minLoss across the different datasets, and on interpretability.
* This paper is of high relevance to the NeurIPS community as it is simple to implement, leads to interpretable results, and is shown to work on real web images.

__ Weaknesses__: Weaknesses:
* The experimental evaluation lacks an analysis (unless I missed it) of the effect of the number of clusters K, which is an important parameter of the model, especially on real images where the number of clusters is unknown. This is an important point when deciding on the applicability of the method.
* The related work section makes no comparison with the literature on equivariant models, that is, models that learn to encode the natural transformations of the data (e.g. rotations, translations), either with or without prior knowledge. See for example:
https://arxiv.org/pdf/1901.11399.pdf (and references therein)
https://arxiv.org/abs/1411.5908
https://arxiv.org/pdf/2002.06991.pdf
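
The requested analysis of K could take the form of a simple sweep. The sketch below is only an illustration under stated assumptions: it uses plain Lloyd's K-means on scalar data, not the paper's method, with a deterministic initialization chosen for reproducibility, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence


def kmeans_1d(data: Sequence[float], k: int, iters: int = 50) -> float:
    """Plain Lloyd's K-means on scalars; returns the final sum of squared distances."""
    pts = sorted(data)
    n = len(pts)
    # Deterministic init: evenly spaced order statistics (an illustrative choice).
    centers = [pts[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups: List[List[float]] = [[] for _ in range(k)]
        for p in pts:
            # Assign each point to its nearest center.
            j = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            groups[j].append(p)
        # Recompute centers; keep the old center if a group is empty.
        new_centers = [
            sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min((p - c) ** 2 for c in centers) for p in pts)


def sweep_k(data: Sequence[float], ks: Sequence[int]) -> Dict[int, float]:
    """Run the clustering once per candidate K and collect the final losses."""
    return {k: kmeans_1d(data, k) for k in ks}
```

Plotting the returned losses against K (or a clustering-quality metric, when labels exist) would show how sensitive the results are to this choice.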

__ Correctness__: Yes

__ Clarity__: The paper is very clearly written and easy to follow.

__ Relation to Prior Work__: Yes, but the paper lacks a comparison with the literature on equivariant models (see the corresponding point in weaknesses).

__ Reproducibility__: Yes

__ Additional Feedback__: * How does their method compare to transforming the samples x_i instead of the prototypes (apart from the fact that it would require transforming each data sample instead of the prototypes for the entire batch)?
* The authors claim that their method provides state-of-the-art results by a large margin on USPS and F-MNIST. In what sense? MinLoss? It seems that DEPICT and DCGAN give the best average results.
* Inconsistent notation: prototypes are sometimes written c_k and other times m_k.
* I would encourage the authors to explore the impact of the value of K

__ Summary and Contributions__: This paper presents a novel approach for transformation-invariant clustering, called Deep Transformation-Invariant (DTI) clustering. The main idea is to jointly learn image transformations (to align images) and to cluster them (previous work learns to cluster with explicit transformations). The main novelty of this work is to learn the transformations from the pixels while learning to cluster. The deep image transformation module is designed to learn image alignments. The module can model three types of transformation: spatial transforms (as in [34, 38]), color transforms, and morphological ones (dilation, erosion). Experiments are conducted on standard benchmarks for image clustering, as well as on web image benchmarks, with strong results. The written presentation is clear and easy to understand.

__ Strengths__: - The proposed method can simultaneously learn to align and cluster images, which is new and interesting.
- The design of the transformation module is interesting and includes some new aspects, namely color and morphological transforms.
- Experiments are strong compared with current methods.

__ Weaknesses__: - The transformations are specifically applied to image data while the DTI framework can be more generic.

__ Correctness__: I believe that most of the claims made in this paper are correct. The claim in the final sentence of the conclusion is not quite correct, since the transformations are specifically image-based (spatial, color, morphological). To apply DTI to other types of data, one needs to design a data- and domain-specific transformation module, while the DTI framework (objective, optimization, etc.) can stay the same.

__ Clarity__: The paper is well written and easy to understand.

__ Relation to Prior Work__: The paper covers enough relevant prior work.

__ Reproducibility__: Yes

__ Additional Feedback__: === post rebuttal comment ===
The rebuttal addressed my comment well; I keep my rating unchanged.

__ Summary and Contributions__: The paper proposes a novel approach to deep image clustering which, unlike previous approaches, does not aim at learning suitable latent space representations but at learning to predict image transformations in order to cluster in image space. The proposed approach is a deep transformation-invariant clustering approach that jointly learns to cluster and align images. Transformations such as spatial alignment, color modifications, or morphological transformations are learnt in an image transformation module.
The paper provides a comparison to SotA image clustering approaches on MNIST, Fashion MNIST and USPS and shows on-par performance or small improvements.

__ Strengths__: The proposed approach is conceptually different from existing approaches and performs well.
The idea to use transformers in this context is novel and interesting.
The paper is clearly written and well illustrated.
The proposed approach is evaluated in the context of kmeans clustering and Gaussian mixture models and performs well in both cases.
The proposed approach provides interpretable qualitative results.

__ Weaknesses__: The results depend on the initialization (kmeans), yet, they are reported without standard deviation.
Mean, median and standard deviation over several runs should be reported.
The improvement over the SotA is small.

__ Correctness__: The paper is technically correct and the empirical methodology for evaluation is correct. The paper compares to the relevant literature. Yet, mean, median and standard deviation over several runs should be reported in table 1.
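
As an illustration of the requested reporting, the following minimal sketch summarizes per-run scores with Python's standard `statistics` module. The function name and the idea of one score per random initialization are assumptions made for this example; no numbers from the paper are used.

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize a metric (e.g. clustering accuracy) over several runs.

    Each entry of `scores` is assumed to come from one run with a
    different random initialization.
    """
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        # Sample standard deviation (n - 1 denominator); needs >= 2 runs.
        "std": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }
```

Reporting the minimum and maximum alongside the mean and standard deviation also makes it visible how much a best-of-several-runs selection differs from typical behaviour.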

__ Clarity__: Relevant competing approaches are discussed and compared to.

__ Relation to Prior Work__: The relation to prior work is sufficiently discussed.

__ Reproducibility__: Yes

__ Additional Feedback__: