Summary and Contributions: The paper proposes an approach in which a robot explores an unknown environment in order to build a model of its affordances. The robot executes a policy learned through reinforcement learning, in which it is rewarded for maximizing successful novel interactions. At the same time, the agent learns an image segmentation model that predicts the affordances of regions in the image. All experimentation and validation were performed in the AI2-THOR environment.
Strengths: The authors correctly identify that learning the properties of an environment through exploration is an important feature of human learning, and it is one of the most promising approaches through which a robot can learn the affordances of its environment. As far as I can tell, learning a segmentation-based model for affordances is novel.
Weaknesses: Equation 1: The authors do not discuss the fact that, as written, the reward function is non-Markovian: it depends on the history of states. It could be made Markovian by folding the visitation frequency into the state, but then Equation (1) is not of the correct form. The approach essentially tries every single object in the environment and checks whether a certain action can be performed on it or not. It does not perform a fine-grained differentiation between actions: taking a knife or an apple is the same action, and toggling the fireplace and the coffee maker are also the same action. Thus the number of affordances is very low. The paper also does not really address the question of what impact trying every possible action on every possible object has on the environment. Clearly, this looseness in the definition precludes any real-world evaluation.
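To make the Markovianity point concrete, here is a minimal sketch in my own notation (the paper's exact symbols may differ):

```latex
% Novelty reward as I read Equation (1): an indicator over the
% interaction history h_t, hence dependent on more than (s_t, a_t):
r_t = \mathbb{1}\!\left[(o_t, a_t) \notin h_t\right],
  \qquad h_t = (s_1, a_1, \dots, s_t).
% Folding interaction counts N_t into an augmented state restores the
% Markov property, but Equation (1) then no longer has the stated form:
\tilde{s}_t = (s_t, N_t), \qquad
r(\tilde{s}_t, a_t) = \mathbb{1}\!\left[N_t(o_t, a_t) = 0\right].
```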
Correctness: As far as I can tell, the claims, method and evaluation approach are correct.
Clarity: Overall, the paper is well written.
Relation to Prior Work: I am not aware of a prior paper on the same subject.
Reproducibility: Yes
Additional Feedback: I read the authors' feedback. I would like to point out that environment-resetting techniques do not solve the problem of more informed exploration (especially when the exploration might involve doing things with a knife). There are certain things that simply cannot be learned by trying out actions. Overall, the feedback does not change my rating.
Summary and Contributions: The paper explores the important problem of learning affordances by interaction. Most previous work on learning affordances was based on manual annotations and passive approaches. In contrast, this paper explores an active approach in a dynamic environment to learn affordances. The paper proposes to learn an exploration policy and an affordance map jointly. This is a difficult search problem in the space of all objects, different types of affordances, agent locations, etc. The paper outperforms a number of baseline approaches and also provides ablation results. More interestingly, it shows the effect of pre-training using this method on a set of downstream tasks.
Strengths:
- The paper explores the interesting direction of learning affordances by interaction, which is a novel perspective compared to previous passive approaches.
- The proposed approach has been used as a pre-training step for a set of downstream tasks and shows improvement over alternative ways of pre-training.
- The experiment section is comprehensive. It provides comparisons with a set of baseline approaches, as well as a variety of ablation experiments.
- The proposed approach outperforms the baselines in terms of the precision and coverage metrics defined in the paper.
Weaknesses:
- One of the main drawbacks of the paper is that it uses perfect odometry to compute the 3D world coordinates of points. It would be much nicer if it used a noisy estimate of the odometry (from SLAM, for example). It would be interesting to see how the noise affects the results; see the sketch after this list for the kind of experiment I have in mind.
- Some of the details are not clear: (a) In the IntExp(Obj) scenario, when the agent picks up the kettle, how does it know which pixels are pickupable? How does it know what the extent of the object is? (b) Lines 167-172 are not clear. The text says "Classifier output y_A scores whether each interaction is successful at a location", while the condition for the indicator function is y=0 or y=1 (being either successful or unsuccessful). These are inconsistent.
- It would be nice to provide the result of training with fully annotated images as an upper bound. I believe it is easy to obtain these annotations in THOR.
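For the odometry point, even a simple synthetic noise model would be informative. A minimal sketch of what I have in mind (the function name and noise parameters are hypothetical, not from the paper):

```python
import numpy as np

def noisy_pose(true_pose, sigma_xy=0.05, sigma_theta=np.deg2rad(2.0)):
    """Perturb a ground-truth pose (x, y, theta) with Gaussian noise,
    mimicking per-step odometry error instead of perfect localization."""
    x, y, theta = true_pose
    return (x + np.random.normal(0.0, sigma_xy),
            y + np.random.normal(0.0, sigma_xy),
            theta + np.random.normal(0.0, sigma_theta))
```

Back-projecting pixels to world coordinates through such perturbed (and accumulating) poses would show how robust the learned affordance map is to localization error.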
Correctness: Overall, the methodology seems correct.
Clarity: It is a well-written paper, but there are missing details (mentioned in the weaknesses section).
Relation to Prior Work: The paper does a good job of comparison with previous work.
Reproducibility: Yes
Additional Feedback: Comments after rebuttal: I read the other reviews and the rebuttal. The authors did a good job of addressing the concerns, so I keep my initial rating of 7. I encourage the authors to include the new results in the revision.
Summary and Contributions: The paper presents an exploration strategy for indoor embodied agents. Essentially, it adds an auxiliary 2D affordance map segmentation task on top of the main RL problem and feeds the predicted affordance map as an extra input to the policy network. Experiments on exploration in the AI2-THOR simulator demonstrate its effectiveness over heuristic-based counterparts on the proposed interaction coverage and precision metrics.
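For reference, my reading of the architecture is that the predicted per-pixel affordance scores are simply stacked with the RGB observation as extra input channels to the policy encoder. A rough sketch under that assumption (all module names and sizes are mine, not the authors'):

```python
import torch
import torch.nn as nn

class AffordanceConditionedPolicy(nn.Module):
    """Toy sketch: stack RGB with predicted per-pixel affordance scores
    (A affordance types) and feed the result to the policy encoder."""
    def __init__(self, num_affordances=5, num_actions=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3 + num_affordances, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten())
        self.policy_head = nn.LazyLinear(num_actions)  # infers input dim

    def forward(self, rgb, affordance_map):
        # rgb: (B, 3, H, W); affordance_map: (B, A, H, W) segmentation scores
        x = torch.cat([rgb, affordance_map], dim=1)
        return self.policy_head(self.encoder(x))
```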
Strengths:
+ The paper is overall clearly written and easy to follow.
+ I can't find any technical issues within the main methodology. The proposed method is technically sound.
+ The baseline comparisons are sufficient and cover a broad range of SOTA exploration methods, especially for embodied agents.
Weaknesses:
- Some technical details deserve more elaboration, especially on the multi-task learning. The main idea of this paper is to train an RL agent simultaneously with an affordance map segmentation network. Though these are essentially treated as two orthogonal objectives, the learning procedure as a whole is still unclear to me. The authors should provide more details on how training proceeds: whether the two tasks are trained concurrently or alternately; if their learning processes are not synchronous, how the ratio of learning iterations between the tasks is chosen; and how these extra hyperparameters affect performance (in an additional ablation study). A complete loss function and pseudocode would be preferred; a toy version of what I mean appears after this list.
- There are still gaps in the evaluation that should be filled to improve its sufficiency. To name a few:
a) The selected metrics are evaluated only over a limited range. There are curves over time (training steps) showing convergence, but the success rates should also be evaluated in this way, not just as final quantities, since it would be insightful to see how the proposed method improves the interaction skills. If it performs as expected (and likewise for the counterparts), the success rate should be low at the beginning (as the agent tends to interact more with objects) but improve faster than with the other methods.
b) The authors demonstrate how some downstream RL tasks can benefit from the proposed method and compare with the seemingly strongest baseline (obj coverage). Given the overall quality of the contribution, more evaluation effort should be included here. I would like to see this part extended in the following direction:
* Combine the proposed method with other exploration strategies. Most of the considered baselines focus less on interaction than on navigation, which seems somewhat opposite to what this paper specifically works with; it may therefore be more interesting to see results on how the proposed method can actually mitigate their drawbacks, rather than simply contrasting on some interaction-oriented tasks. This would also further verify the main motivation of this paper: an efficient solution of exploration for interaction. In addition, the selected downstream tasks could be more challenging, say, ones with a significant need for both navigation and interaction.
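To be concrete about the loss function and pseudocode I am asking for, here is a runnable toy sketch of the kind of joint training loop I mean (all networks, shapes, the REINFORCE-style update, and the seg_lambda weighting are my own placeholders, not the paper's; the open questions are flagged in the comments):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins, purely illustrative: a policy net and a segmentation net.
policy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 8))  # 8 actions
seg_net = nn.Conv2d(3, 5, kernel_size=1)                         # 5 affordance classes
opt = torch.optim.Adam(
    list(policy.parameters()) + list(seg_net.parameters()), lr=1e-4)
seg_lambda = 1.0  # loss weighting: one of the undocumented hyperparameters

for iteration in range(10):
    obs = torch.rand(16, 3, 32, 32)                 # fake rollout observations
    returns = torch.rand(16)                        # fake novelty-reward returns
    actions = torch.randint(0, 8, (16,))            # fake actions taken
    seg_labels = torch.randint(0, 5, (16, 32, 32))  # fake interaction-derived labels

    # RL term (REINFORCE-style placeholder for the paper's actual RL objective).
    log_probs = torch.log_softmax(policy(obs), dim=-1)
    rl_loss = -(returns * log_probs[torch.arange(16), actions]).mean()

    # Self-supervised segmentation term on labels mined from interactions.
    seg_loss = F.cross_entropy(seg_net(obs), seg_labels)

    # Joint (summed) update. Are the real objectives optimized concurrently
    # like this, or alternately with some k:1 iteration ratio? Unclear.
    opt.zero_grad()
    (rl_loss + seg_lambda * seg_loss).backward()
    opt.step()
```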
Correctness: I can't find more than minor issues within the main methodology. The results partly verify the claim of better interaction exploration; however, since some technical details on the multi-task learning are missing, I cannot be fully confident that the comparisons are fair. The evaluation protocols seem reasonable.
Clarity: The paper is clear and well written. I can't find any language issues.
Relation to Prior Work: The authors do a good job of positioning their method w.r.t. prior work. Related papers are cited appropriately.
Reproducibility: No
Additional Feedback: I find it confusing why point-based labels can gain an advantage over object-mask-based labels. As shown in the left table of Fig. 5, mask-based labels deliver better affordance prediction but score lower on the proposed metrics. This somewhat contradicts the main idea of the paper, namely that the affordance map segmentation task improves interaction exploration. The authors are expected to clarify this.
post-rebuttal ===
I've read the rebuttal and thank the authors for the additional results and clarification.