NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
1. I found it hard to understand what the "finetuning" method referred to in this paper is. L217 explains it as a method that fine-tunes the weights of the network. Is it the same as the proposed approach (same objective function), except that instead of using the weight transform kernels (L130), the original parameters of the supervised network are updated during the unsupervised adaptation process? This is an important baseline for the paper because it shows the importance of (a) using the weight transform kernel and (b) keeping the parameters of the landmark model itself fixed (a minimal sketch of the contrast appears after this list). Please clarify this in the author response.
2. What is the performance of the method on the original MPII dataset after the unsupervised domain adaptation process? The authors allude to catastrophic forgetting as a motivation for the weight transform kernels, but do not show whether they actually help. For such an experiment, it is important to show what happens when (a) the weight transform kernels are used on MPII, and (b) the weight transform kernels are set to identity on MPII.
3. I find the analysis in this paper a bit lacking. For example, beyond the two extremes (use the weight transform kernels OR finetune all weights, assuming my understanding in 1 is correct), the authors do not show any approach in between. How about a model that optimizes the same loss as in L170, but does not use the weight transform kernels and only finetunes the last few layers? Although this approach may suffer from catastrophic forgetting, it is a good baseline for demonstrating why the particular design decisions in the paper matter.
4. What are the training hyperparameters for the scratch network? Is it trained for longer than the other methods? With enough data augmentation (e.g., affine transforms of images)?
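For concreteness, the three training regimes the reviewer contrasts (proposed kernels, full finetuning, and the in-between last-layers baseline of point 3) could be toggled as in the following minimal sketch; `model.source` (the pretrained landmark network, assumed to be an `nn.Sequential`) and `model.kernels` (the 1x1 weight transform layers) are hypothetical names, not the authors' code.

```python
import torch.nn as nn

def set_trainable(model: nn.Module, mode: str) -> None:
    """Select which parameters update under each regime.

    `model.source` and `model.kernels` are assumed submodule names; this is
    an illustrative sketch of the baselines, not the paper's implementation.
    """
    for p in model.parameters():
        p.requires_grad = False
    if mode == "proposed":            # kernels train; source weights stay fixed
        for p in model.kernels.parameters():
            p.requires_grad = True
    elif mode == "finetune":          # same loss, but the original weights update
        for p in model.source.parameters():
            p.requires_grad = True
    elif mode == "finetune_last":     # the in-between baseline of point 3
        for p in model.source[-1].parameters():
            p.requires_grad = True
```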
Reviewer 2
Given a set of images, each containing a single object of the target category near the center, the submission presents a neural network method that can discover a fixed number of landmarks in an unsupervised way. The basic idea of this work is to leverage the learned knowledge of a source network pre-trained on a source object category to discover the landmarks of a target object category. To this end, the proposed landmark estimation network is designed as a simplified variant of Progressive Networks [29]. All the activations obtained from each layer of the pre-trained, fixed source network are transformed by respective 1x1 convolution layers (i.e., the linear transform in Eq. (1)); a minimal sketch of this scheme is given after the bullet points below. The last convolution layer is modified to output the pre-defined number of landmarks for the target object. All the linear transforms are then trained by backpropagation with the unsupervised task loss, landmark-conditional image generation [13], on the target category data. The number of parameters is significantly lower than when training the whole network from scratch, because only the linear feature transforms are trained while the source network is kept fixed.

The proposed model is largely based on [13], so the authors reproduced [13] and used it for the baselines, called Scratch and Fine-tuning. The authors show that the proposed method outperforms the fine-tuning baseline, and hypothesize that this is due to the fewer degrees of freedom from the smaller number of parameters to optimize compared with [13]. The paper is written clearly, and the experiments show the effectiveness of the proposed method. However, the authors fail to provide an understanding of the proposed algorithm's behavior and its failure cases, which would contribute a lot toward informative takeaways. Due to this missing piece, it is a bit vague where the performance gain actually comes from and how stable the algorithm is.

- Strong assumption: the proposed approach rests on a strong assumption that useful intermediate activations for the target object can be linearly spanned from another category's activations; in other words, for a target object, there exist effective convolution filters constructed as linear combinations of the filters trained specifically for a source object. The authors overlooked mentioning this point.
- Category dependency: there could be source-target pairs that fail to represent each other through linear combination, but the authors do not demonstrate the effects across categories. Currently, the source network is trained for human pose (16 keypoints), and the target network is adapted to different object categories. Most of the evaluations are done on the "human => face" case. The "human => {cat head, shoes}" cases are only demonstrated with the consistency measure in Table 3 and Figure 3, which does not truly assess landmark accuracy (discussed further in the comments below). These show only very limited generalization across categories.
- Taxonomy: as mentioned above, there must be effective relationships between different category pairs. Showing how performance and knowledge transferability vary across pairs would be a critical evaluation for this submission, given that its technical contribution centers on adaptation.
- The proposed method requires a pre-trained landmark detector for a different object category a priori, trained in a fully supervised way. Thus, the system would largely depend on the quality of this core network.
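For concreteness, the adaptation scheme described above might look like the following toy reconstruction from the review's description (frozen source layers, a learnable 1x1 transform after each, and a re-headed output layer); all class, attribute, and argument names are hypothetical rather than the authors' code.

```python
import torch.nn as nn

class AdaptedLandmarkNet(nn.Module):
    """Frozen source network with a learnable 1x1 conv (the linear
    transform of Eq. (1)) after each layer; an illustrative sketch."""

    def __init__(self, source_layers, channels, num_target_landmarks):
        super().__init__()
        self.source_layers = nn.ModuleList(source_layers)
        for p in self.source_layers.parameters():
            p.requires_grad = False  # the pre-trained source network stays fixed
        # One 1x1 conv per source layer: a linear recombination of its filters.
        self.transforms = nn.ModuleList(
            nn.Conv2d(c, c, kernel_size=1, bias=False) for c in channels
        )
        # Re-headed last layer: K heatmaps for the target category.
        self.head = nn.Conv2d(channels[-1], num_target_landmarks, kernel_size=1)

    def forward(self, x):
        for layer, transform in zip(self.source_layers, self.transforms):
            x = transform(layer(x))  # frozen features, learned linear mix
        return self.head(x)
```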
While the system thus hinges on the pre-trained source detector, the authors do not provide any study of this dependency.

The following comments would be relevant to better understanding the proposed method.

- L250-251 and Table 2: I found the authors' interpretation of these lines reasonable. But if 3D rotation of the landmarks in the LS3D dataset is a factor that none of the methods can handle, then the LS3D errors reported in Table 2 are not reliable to interpret, because large outliers will dominate the errors. What we can parse from the reported error is only that the models do not work. In this case, it would be more interpretable to measure errors only over the visible landmarks rather than over all landmarks (a sketch of such a measurement follows below).
- Landmark consistency against random similarity transforms is interesting, but not strong evidence of outperformance. In an extreme case, if the network always produces the average landmarks regardless of the input, it will have perfect consistency. Thus, this metric alone cannot be used to argue accurate landmark detection.
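To illustrate the suggested fix for the LS3D evaluation, here is a minimal sketch of a normalized mean error restricted to visible landmarks; the function, its array conventions, and the bounding-box normalization are illustrative assumptions, not the paper's evaluation protocol.

```python
import numpy as np

def nme(pred, gt, visible=None, norm=None):
    """Normalized mean error over landmarks.

    pred, gt: (K, 2) arrays of predicted / ground-truth landmark coordinates.
    visible:  optional (K,) boolean mask; when given, only visible landmarks
              are scored, so occluded 3D-rotated points cannot dominate.
    norm:     normalization constant; defaults to the ground-truth bounding
              box diagonal (an illustrative choice, not the paper's).
    """
    err = np.linalg.norm(pred - gt, axis=-1)  # per-landmark Euclidean error
    if visible is not None:
        err = err[visible]
    if norm is None:
        norm = np.linalg.norm(gt.max(axis=0) - gt.min(axis=0))
    return float(err.mean() / norm)
```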
Reviewer 3
The paper proposes a method to adapt an existing landmark detector, using an unsupervised objective, to discover landmarks on new categories of objects. In particular, it uses a projection matrix to adapt the network. Experiments are performed on face, cat head, and shoe datasets.

The idea of using the knowledge in an existing landmark detector is interesting, and the experiments show that the method works to some extent. However, the technical novelty of this paper is not very significant. It is a relatively straightforward combination of existing unsupervised landmark learning and domain adaptation methods, and the domain adaptation is done in a generic way that is not specific to the landmark discovery problem. A relevant question (not a con itself): is it possible to adapt an image classification or object detection model to unsupervised landmark learning?

In Table 1, the proposed method is not consistently better than the previous methods and the baselines. It is discouraging that using a pretrained network led to worse performance than previous work, which trained the model from scratch. In the remainder of the experiments, comparisons are done only with the two baselines; for example, the consistency metric (which is interesting) is not tested on previous methods.

It would be interesting to see fine-tuning with a proper learning rate after the matrix projection (a sketch of such a schedule follows below). It would also be interesting to see how different pretrained networks impact the final landmark discovery performance.
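One way to read the fine-tuning suggestion is as a two-stage schedule: first train the projections with the backbone frozen, then unfreeze the backbone at a smaller learning rate. A minimal sketch under that assumption follows; `model.backbone`, `model.transforms`, and `train_fn` are hypothetical names, not anything from the paper.

```python
import torch
import torch.nn as nn

def two_stage_finetune(model: nn.Module, train_fn, lr_proj=1e-4, lr_backbone=1e-5):
    """Stage 1: adapt only the projection layers; Stage 2: unfreeze the
    backbone with a gentler learning rate. `train_fn(model, opt)` runs one
    training phase; all names and rates here are illustrative guesses.
    """
    # Stage 1: backbone frozen, projections trained (the paper's regime).
    for p in model.backbone.parameters():
        p.requires_grad = False
    train_fn(model, torch.optim.Adam(model.transforms.parameters(), lr=lr_proj))

    # Stage 2: everything trainable, backbone updated at a smaller rate,
    # to test whether fine-tuning after matrix projection helps.
    for p in model.backbone.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam([
        {"params": model.transforms.parameters(), "lr": lr_proj},
        {"params": model.backbone.parameters(), "lr": lr_backbone},
    ])
    train_fn(model, opt)
```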