Reviews: Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition

Strengths of the paper are listed as follows:
S1. The paper tackles the important problem of scene de-biasing for action recognition. It is of high concern for the computer vision community to sanity-check whether proposed models really learn the dynamics of actions, rather than merely exploiting spurious biases such as the co-occurrence of scenes and actions.
S2. The authors develop a sensible solution: forcing the model to consider the human region for recognition, in order to reduce the sensitivity of the action representation to the surrounding context. This is achieved by borrowing ideas from adversarial learning; that is, the scene recognition ability of the action code is reduced by directly using gradient reversal [8], a domain confusion method that has been well known in the literature since 2015.
S3. The authors conduct several action-related experiments to showcase the ability of their models.
Weaknesses of the paper are listed as follows:
W1. (Problem-wise weakness): The first concern about the paper is theoretical. This paper frames the surrounding 'scene' as a 'degenerative bias' that leads to misclassification. While we agree that scene bias is evident in action recognition datasets (due to sampling bias, see the "Computer vision meets fairness" talk, https://visualai.princeton.edu/slides/Fairness_CVPR2018.pdf), many actions have a preferred scene: diving and swimming happen in a pool, whereas baseball happens in a stadium. Action-scene duality is highly visible in the visual world. So, in the end, any action recognition dataset will exhibit such bias, which can carry useful supplemental information about the action. This raises the following important questions:
- Should we really learn to ignore the surround (and be invariant to it, as is done in this paper)?
- Or learn to adaptively use it only when it is correlated with the scene (more in a disentangling sense, see for example "Recognize Actions by Disentangling Components of Dynamics", CVPR 2018)?
- Or should we design tasks that happen in the same environment, where only the action class changes (see, for example, the 20BN-SOMETHING-SOMETHING dataset, which has been shown to be not solvable without considering human-object appearance)?
The paper is missing a discussion of these aspects of the action-scene duality. The quantitative results presented in the paper may indeed tell us that the scene should not be ignored, but should be factored out from the action, so that it is up to the classifier when to rely on it and when to ignore it. In this respect, the authors are missing an important comparison to a highly relevant paper ("Pulling Actions out of Context: Explicit Separation for Effective Combination", Wang and Hoai, CVPR 2018) that factors out the action from the surrounding context by masking out the human region, as is done in this submission. See the experimental weaknesses for detailed comments.
W2. (Methodological weakness):
W.2.1. The method developed in this paper is borrowed from (Ganin and Lempitsky, ICML 2015) without modification. (Ganin and Lempitsky, ICML 2015) proposes a domain confusion objective, where the bottleneck network is trained so that the domain the input comes from cannot be identified. The authors adopt the same idea, except that the source of confusion is the scene factor instead of the domain. The authors fairly cite this paper, but this limits the methodological novelty of this work.
W.2.2. A similar objective to the second part of the loss, which strives for negative correlation on scene-only input where the human is masked out, is explored by (Wang and Hoai, CVPR 2018). In that paper, the authors utilize masked input to factorize the contextual factors from the action code, whereas in this paper it is used for de-correlation purposes only.
W.2.3. A bounding box is a rough localization of a non-rigid object like a human.
This means more than 60% of pixels within the bounding box still belong to the surrounding scene and leak into the action code. This limits the use of the method when there is a high co-occurrence of surround pixels within the bounding box.
W3. (Experimental weakness):
W.3.1. No comparison to (Wang and Hoai, CVPR 2018). Reducing the effect of the context in the action code for video action recognition has been previously explored in (Wang and Hoai, CVPR 2018), using C3D and I3D backbone architectures and the same set of datasets. This makes (Wang and Hoai, CVPR 2018) a natural competitor for the proposed solution. However, the authors neither mention nor compare their solution to this highly relevant paper.
W.3.2. No comparison to (Li et al, ECCV 2018, RESOUND: Towards Action Recognition without Representation Bias). That paper tackles dataset bias in action recognition, not limited to the scene. Even though the authors list it as relevant, there is no quantitative comparison against its approach either. That paper also proposes a new dataset called Diving48, a fine-grained action recognition dataset and a natural testbed for the authors to showcase the ability of their model to ignore the surrounding information (and focus solely on the actor).
W.3.3. No evaluation of (altered) scene recognition performance is provided. One of the objectives that the authors use forces the model to misclassify the scene from the action code. However, the ability of this objective to reduce sensitivity to scene variability is only indirectly evaluated (through action recognition performance). The authors do not show that, with this objective, the action code indeed performs poorly in scene recognition (as provided by the Places-365 ResNet-50 model) as opposed to the vanilla action code that is learned without the adversarial objective (e.g., I3D).
As a result, we do not have a good idea of the extent to which the objective is met, or the extent to which vanilla I3D or C3D indeed encodes scene information (although, since no ground-truth scene information is provided, any such results should be read with this in mind).
W.3.4. The choice of baseline models for de-biasing is unjustified. The authors choose to de-bias 3D-ResNet-18, R-C3D, and ROAD for the three aforementioned tasks. As listed in the tables, none of these models is the current state of the art for the considered datasets. The authors thus choose to de-bias a low-performing model for each of the three tasks, leading to results inferior to those of the best models. This raises the questions: Is the proposed solution not suitable for better-performing models? Or is scene bias more severe in inferior models, making it infeasible to apply the method to the current state of the art? How pronounced would the performance improvement be for I3D or S3D?
W.3.5. The improvement obtained on the three tasks is insignificant. The authors try to justify the low, and in some cases even lower, improvements across different datasets and splits via the amount of scene bias measured by (Li et al, 2018). However, a concrete scatter plot of the obtained improvement (if any) against the amount of scene bias within these datasets is not provided. Without it, it is almost impossible to judge where the improvement is coming from.
W4. (Related work weakness, Minor):
W.4.1. The severe scene bias of UCF-101 has been recognized before by (He et al, 2016, Human Action Recognition without Human), where the authors classify actions after masking out the human. The paper may cite this as relevant work.
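The bias-versus-improvement analysis requested under W3 could be sketched roughly as follows. This is only a minimal illustration, assuming one scene-bias score and one pair of accuracies (baseline and de-biased) per dataset or split. The function names `pearson_r` and `bias_vs_gain`, the record layout, and all numbers are hypothetical placeholders, not values or code from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def bias_vs_gain(records):
    """Turn (name, scene_bias, baseline_acc, debiased_acc) records into
    scatter points (scene_bias, gain) plus their Pearson correlation."""
    points = [(bias, debiased - baseline) for _, bias, baseline, debiased in records]
    r = pearson_r([p[0] for p in points], [p[1] for p in points])
    return points, r


# PLACEHOLDER values only, chosen to exercise the code; NOT taken from the paper.
toy_records = [
    ("split_a", 0.0, 50.0, 53.0),
    ("split_b", 1.0, 50.0, 52.0),
    ("split_c", 2.0, 50.0, 51.0),
]
points, r = bias_vs_gain(toy_records)
```

With real per-dataset bias scores and accuracy gains in place of the placeholders, the returned points are what the requested scatter plot would show, and the correlation gives a one-number summary of whether the gain tracks the measured scene bias.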

This work investigates an open problem in action recognition, namely the strong correlation between action and scene. Such a correlation can lead to overfitting to scene features over action-specific features, which in turn can reduce generalization and hurt recognition out of context. The proposed method makes sense, the link between actions and scenes is fairly investigated, and the approach is evaluated on multiple tasks, both to show that the method generalizes across settings and to investigate scene bias in each of them.
The work does have a number of limitations:
The experiments are not thoroughly convincing. The introduction, method, and figures highlight the importance of reducing scene bias. The method is, however, still evaluated on well-known action datasets, all with a known high scene bias. As a result, the experiments paint a picture that scene debiasing is far from important. While there is some gain on HMDB51, there is hardly any gain on UCF-101 and THUMOS-14 (temporal detection), and there is no direct gain on UCF-24 (spatio-temporal detection). The low impact of the proposed approach invariably leads to the conclusion that scene debiasing is not warranted. Figure 4 shows that the focus is now more on the actors, but the results do not really gain from this focus.
Why did the experiments focus on UCF-101, THUMOS-14, UCF-24, and HMDB51? Why not investigate a dataset with static backgrounds or with fixed scenes? There, big gains are possible, because the only way to recognize the action is to look at the actor. With the current set of experiments, the only conclusion that the reader can draw is that scene debiasing is intuitive but not important.
Lastly, a few smaller points:
- Why only show results for HMDB51 in Tables 1 and 2? Is that because UCF-101 does not show good improvements?
- What is the reason for not including L_{ENT} for spatio-temporal detection?
- Why is scene debiasing important for (spatio-)temporal detection compared to classification?
- Throughout the text, there are a number of typos and awkward sentences, e.g. lines 27 and 163.

Paper ID:	467
Title:	Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition
