NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 3182 Learning to Teach with Dynamic Loss Functions

### Reviewer 1

The paper studies the framework of teaching a loss function to a machine learning algorithm (the student model). Inspired by ideas from machine teaching and the recent "learning to teach" work [Fan et al. 18], the paper proposes the L2T-DLF framework, in which a teacher model is trained jointly with a student model. The teacher's goal is to learn a policy for generating dynamic loss functions for the student by accounting for the student's current state (e.g., training iteration, training error, test error). As shown in Algorithm 1, the teacher/student interaction proceeds in episodes: (1) the teacher's parameter $\theta$ is fixed within an episode, (2) the student model is trained end-to-end, and (3) $\theta$ is then updated. In Section 3.3, the paper proposes a gradient-based method to update the teacher's parameters. Extensive experiments on two different tasks demonstrate the effectiveness of the proposed framework. (A minimal sketch of my reading of this episode loop is appended at the end of this review.) Please see a few comments below:

(i) In my opinion, the Abstract and Introduction overemphasize the connections with the machine teaching literature and with real-life classroom teaching. The technical results of this paper have only a weak connection to either. In the proposed L2T-DLF framework, the teacher and student models are simply two components of an end-to-end learning system. For instance, in Algorithm 1, the teacher model essentially tunes its policy by training the same student model again and again. This is fine for a learning system but somewhat disconnected from real-life teaching scenarios and from the machine teaching literature.

(ii) The authors should expand the discussion of how their work technically differs from the learning-to-teach framework [Fan et al. 18] as well as from curriculum learning / self-paced learning techniques. For an iterative learner such as those studied in [Liu et al. 2017, reference 30], the teacher's choice of the training instance $x_t$ at time $t$ can equivalently be seen as choosing an appropriate loss function $l_t$, namely the per-step loss $l_t(w) = \ell(w; x_t)$, which is a special case of the dynamic losses considered here.

(iii) A few points to improve the presentation:
- Line 121: Perhaps use $N$ for the number of data points instead of $M$; $m$ is already used for the task-specific objective and $\mathcal{M}$ for a score.
- The terms "dev", "development", and "test" are mixed; please use consistent terminology.

(iv) Appendix: I believe Section 2.4 "Teacher Optimization" and Figure 1 are important and should be moved to the main paper.
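
For concreteness, here is a minimal sketch of one teacher/student episode as I read Algorithm 1. This is my own paraphrase in PyTorch, not the authors' implementation; the names (`Teacher`, `dynamic_loss`, `run_episode`) and the particular loss parameterization are hypothetical placeholders, and the update of $\theta$ from Section 3.3 is only indicated by the dev score returned at the end of the episode.

```python
# Reviewer's sketch of one teacher/student episode (Algorithm 1), not the
# authors' code. Teacher, dynamic_loss, and run_episode are hypothetical names;
# the loss parameterization below is only an illustrative choice.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Teacher(nn.Module):
    """Maps the student's state (iteration, train error, dev error) to the
    coefficients of the dynamic loss function."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 16), nn.Tanh(), nn.Linear(16, 2))

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)


def dynamic_loss(logits, y, coeffs):
    # Illustrative parameterization: a convex combination of two base losses.
    ce = F.cross_entropy(logits, y)
    mse = F.mse_loss(torch.softmax(logits, dim=-1),
                     F.one_hot(y, logits.size(-1)).float())
    return coeffs[0] * ce + coeffs[1] * mse


def run_episode(teacher, student, train_batches, dev_batches, lr=0.1):
    """One episode: theta (the teacher's parameters) is held fixed while the
    student is trained end-to-end under the teacher-generated loss; the dev
    score returned here is what the Section 3.3 update of theta acts on."""
    opt = torch.optim.SGD(student.parameters(), lr=lr)
    for t, (x, y) in enumerate(train_batches):
        with torch.no_grad():
            train_err = (student(x).argmax(-1) != y).float().mean().item()
        state = torch.tensor([float(t), train_err, 0.0])  # dev error omitted for brevity
        coeffs = teacher(state).detach()  # theta is fixed within the episode
        loss = dynamic_loss(student(x), y, coeffs)
        opt.zero_grad()
        loss.backward()
        opt.step()
    correct = sum((student(x).argmax(-1) == y).float().sum() for x, y in dev_batches)
    total = sum(len(y) for _, y in dev_batches)
    return correct / total  # score used to update theta across episodes
```

If this reading is correct, it reinforces comment (i): within an episode the teacher is just another component of the training pipeline rather than anything resembling a classroom teacher.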