Sun, Dec 8 through Sat, Dec 14, 2019, at the Vancouver Convention Center
This paper considers the problem of teaching an active sequential *machine learner* (e.g., an active learning algorithm) with a teacher that can "fake" the labels/outcomes of training examples, with the goal of steering the learner faster to the goal state. The authors refer to such a teacher as a "planning teacher," as opposed to the "naive teacher" often considered in classical machine teaching problems. This setting differs from conventional machine teaching, in which the teacher can only choose among a given set of training examples consistent with the target concept and is typically not allowed to provide inconsistent examples. The majority of existing work in machine teaching considers teaching a "passive learner," with a few exceptions (see the additional reference in the comments below). The assumption that the teacher can choose the data-generating distribution makes it a very powerful teacher with a much richer action set than in conventional teaching. This makes the problem considered in this paper an interesting scenario with useful applications. The formulation of the teaching problem as a planning problem over a teaching MDP is also interesting. However, the paper seems to lack strong theoretical support (e.g., to rigorously reason about the gain from the learner being aware that it is being taught), while the algorithmic contribution, including the experimental study, seems incremental.

=== Post-rebuttal comments

The rebuttal to some extent strengthens the empirical results by including additional results on pool-based active learning. I'll be happy to revise my review to a borderline score (4 -> 5). I would also suggest the authors revise the title to reflect the specific setting (e.g., "planning teacher") of the machine teaching problem. Overall, I think the results could offer useful insights for practitioners, as the paper provides an interesting new direction in machine teaching.
While I appreciate the existing empirical results demonstrating scenarios in which a planning teacher helps, understanding why adding fake labels helps, and under what conditions it helps, seems an important question that remains undiscussed.
This paper frames the learning problem as a cooperative game between the machine and the human, where the human has a mental model of the machine's task. In this situation, the learner poses queries (similar to query-driven active learning) and the teacher, having knowledge the learner lacks, responds strategically to best guide the learner. This is called "machine teaching." The learner may or may not be aware of what the teacher is doing; both cases are considered, as well as a mixture model of the two. The learner then has an "inverse reinforcement learning" (IRL) problem to solve, complementary to the teacher's (in this case the human user's) "inverse machine learning" problem. The learner queries the teacher, who makes a possibly strategic recommendation based on state knowledge that the learner does not have; the learner then makes a move knowing that the teacher has this knowledge. This model contributes to our understanding of user models, where the user has a mental model of how the system they engage with should perform.

In the case presented of a multi-armed bandit, the teacher's knowledge comprises the reward probabilities of the individual arms, with model parameters \theta, which allows the teacher to strategically alter responses y_i by means of an MDP. For this to work, one "bandit" must inform another; were they independent, no strategy of offering the "wrong" y_i would help. Figure 2 offers an example of this; however, how this works in the general case calls for more elaboration, beyond the claim that the teacher uses a modified reward based on R_t(h_t; \theta*). Exactly how does this work for Bernoulli bandits? Should we then assume that this modified reward, and specifically \theta*, is what the learner seeks via inverse reinforcement learning? What complicates matters is that the reward probabilities do depend on \theta (Equation 1), so shouldn't the learner be able to infer \theta without the actions of the teacher?
Maybe another way to state this is that dependencies among bandits due to \theta should be evident in the naive case, irrespective of the teacher's strategy. An experiment revealing the differences in the sequences of y_i and R_i for the naive versus the strategic teacher would be revealing.

Insights into the claimed success of this method are limited by the lack of a complete model that comprehends both teacher and learner in a cooperative game. Both apparently share the same "uber" reward to be optimized, perhaps with different horizons. As a cooperative game, one might expect an equilibrium solution that reveals properties of their combined actions. Without a joint game-theoretic model (perhaps obtained by solving the I-POMDP), it is not apparent why the combination works. This is a fascinating area with rich implications for understanding human-machine behavior, but currently the paper is limited to computational results suggestive of general system properties.

Specifics: The paper mentions available source code, but none was provided; perhaps an oversight?
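To make the dependence concrete: if, as Equation 1 suggests, each arm's Bernoulli reward probability is a function of a shared parameter \theta, then observations from one arm constrain the estimates for all arms. The following is a minimal sketch under a logistic-link assumption; the feature vectors, the grid posterior, and all numbers are hypothetical illustrations, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: arm i has feature vector x_i, and its Bernoulli
# reward probability is sigmoid(x_i . theta) for a shared theta.
theta_true = np.array([1.5, -0.5])
arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Crude grid posterior over theta, flat prior on [-3, 3]^2.
g = np.linspace(-3.0, 3.0, 61)
grid = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
log_post = np.zeros(len(grid))

# Pull only arms 0 and 1; arm 2 is never observed.
for t in range(200):
    k = t % 2
    y = rng.random() < sigmoid(arms[k] @ theta_true)
    p = sigmoid(grid @ arms[k])
    log_post += np.log(p) if y else np.log(1.0 - p)

post = np.exp(log_post - log_post.max())
post /= post.sum()
theta_hat = post @ grid               # posterior mean of theta

# Beliefs about ALL arms improve, including the never-pulled arm 2,
# because they share theta.
p_hat = sigmoid(arms @ theta_hat)
print(theta_hat, p_hat)
```

In this sketch the estimate of arm 2's reward probability tightens even though arm 2 is never pulled, which is exactly the kind of cross-arm dependence the naive-versus-strategic comparison would need to isolate.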
Originality: The paper presents an MDP model of the machine teaching problem and builds a framework around it to improve learning with a teacher.

Quality: Good-quality work (idea, presentation, empirical studies).

Clarity: Except for the abstract, I found the paper well written and easy to follow.

Significance: A relevant contribution to machine teaching.

Below I list some comments that provide further details on my review and might be helpful in improving the manuscript.

Main comments:

1) From the abstract, I was not able to tell what the paper is about and what its main ideas and contributions are. Some examples of what did not become clear:
- novelty and benefit of the MDP model of the teacher
- key goal of using a model of the teacher to improve learning performance
- point (1) is unclear
I suggest rewording the abstract.

2) l. 98: Why is a deterministic learning algorithm considered? I suppose this is because otherwise, capturing the learner's dynamics might be more challenging. But could one not also consider stochastic learning in the same framework? Maybe this could be an aspect to discuss.

3) Is the teaching MDP formulation of Sec. 3.2 a novel contribution (the review of related work seems to suggest this)? If yes, I suggest explicitly stating this somewhere in Sec. 3.

4) l. 139: Is the teacher's reward the same as the reward previously defined for the learner? Specifically, is R_t(h_t) = R_t as in l. 110? Or does the teacher follow a different goal? This should be discussed and made explicit. If it is not the same, different letters should be used for the variables to avoid confusion.

5) l. 160: The choice of the reward function as the scalar product between the next sample location and the true parameters is unclear to me. Can you elaborate on why this choice was made?

6) Why is the model in (3) chosen? Specifically, the exponential terms with the temperature parameter. Is this a common model in inverse RL?

Minor:

7) ll. 49-50: It would be helpful if the authors added to this sentence what exactly they do differently. How is an agent other than the teacher involved in designing the learning data?

8) ll. 96-100: I find it slightly confusing that (1), (2), (3) are used for enumeration here, as parentheses usually denote equation references. Consider changing to, e.g., (i), (ii), ... Similarly in other parts of the paper.

9) l. 101: T should be in math font.

10) Unless I missed this, "boundedly optimal teachers" is never explained in the paper. I suggest doing so.

11) ll. 200-206: This is a nice illustrative example and discussion.

============================================================

AFTER REBUTTAL: I have read the authors' response. I appreciate the authors' explanations, which address my questions and concerns in a satisfactory manner. I will thus keep my original score.
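On comment 6 above: an exponential form with a temperature parameter is the standard Boltzmann (softmax) noisy-rationality model, which is indeed common in inverse RL. A minimal sketch with made-up Q-values (not from the paper):

```python
import numpy as np

def boltzmann_policy(q_values, tau):
    """Softmax ("Boltzmann-rational") action distribution over Q-values.

    As tau -> 0 the choice becomes greedy (perfectly rational);
    as tau -> infinity it becomes uniform (uninformative).
    """
    z = np.asarray(q_values, dtype=float) / tau
    z -= z.max()                      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

q = [1.0, 0.5, 0.0]                   # hypothetical Q-values
print(boltzmann_policy(q, tau=0.1))   # nearly deterministic
print(boltzmann_policy(q, tau=10.0))  # nearly uniform
```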
Advantages: This paper considers combining machine teaching with active learning. Under this setting, the active learner decides which instances to query, and the teacher decides which label to return. The paper discloses a non-trivial fact: the active oracle can be sub-optimal even if it generates labels from the underlying true distribution. This reveals the importance of combining active learning and teaching; to my understanding, this is the most significant contribution of the paper. Furthermore, the Bernoulli bandit problem (in fact a generalized linear bandit) is taken as an example to show how effective teaching can be achieved. An MDP-based sequential decision framework is used to model the teacher-student learning process. At a high level, the proposed approach seems to be a reasonable framework for solving this kind of problem. The most interesting part of the experimental results is the demonstration that if the student can model the teacher's behavior properly, then learning performance can be boosted. The results also suggest the potential usefulness of the proposed framework in scenarios with a human oracle. These two results could be very useful guidance for further studies.

Disadvantages: The writing can be significantly improved. The main contributions are not clearly described, and the algorithmic parts are also not clear enough to follow. The explanation of why providing labels from the underlying true distribution can be sub-optimal is insufficient. The bandit algorithm seems incorrect: the description of how to estimate \theta in lines 114-119 is not clear, and it seems to rely on the assumption that the x_t received in all iterations are independent, which is not true since the choice of arms is decided by the learner. The discussed bandit problem is actually a special case of the generalized linear bandit; I suggest referring to the reference linked below for the Thompson sampling algorithm under this setting.
Overall, the paper discusses an interesting and important problem. The novelty is significant, and it discloses several non-trivial facts. The high-level framework can be followed by future research. However, the writing can be significantly improved, and the issue in the bandit learning algorithm should be addressed; this lowers my score (if I am wrong about the bandit algorithm, feel free to point it out).

Reference: https://arxiv.org/abs/1611.06534

------ After rebuttal: I agree with the author response that the bandit algorithm in the paper follows from the general framework of TS. On the other hand, I find no theoretical guarantees for this algorithm under this setting, either in the paper or in the literature. I made a mistake in my earlier review: the Bernoulli bandit in the paper is not the generalized linear bandit model, since the randomness does not come from additive noise. In spite of the lack of theoretical guarantees, I appreciate the novel insights of the paper, and I would thus like to raise my score. I also encourage the authors to include more discussion of possible further theoretical studies.
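For context on the TS discussion above, the generic Thompson-sampling template for Bernoulli rewards looks like the following. This is the standard independent-arms Beta-Bernoulli sketch with made-up reward probabilities, not the paper's shared-parameter variant (which would maintain a posterior over \theta instead).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-arm reward probabilities (independent arms).
true_p = np.array([0.3, 0.5, 0.7])
K = len(true_p)
alpha = np.ones(K)            # Beta(1, 1) prior for each arm
beta = np.ones(K)
pulls = np.zeros(K, dtype=int)

for t in range(5000):
    samples = rng.beta(alpha, beta)   # one posterior sample per arm
    k = int(np.argmax(samples))       # act greedily w.r.t. the sample
    y = rng.random() < true_p[k]      # Bernoulli reward
    alpha[k] += y                     # conjugate posterior update
    beta[k] += 1 - y
    pulls[k] += 1

print(pulls)   # the best arm (index 2) dominates
```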