Reviews: Teaching Multiple Concepts to a Forgetful Learner

Reviewer 1

I think this is a very solid but unspectacular paper, as explained below. It looks like a safe paper to accept as a poster.

1. The paper studies an interesting problem: selecting the next item to maximize retention in memory-related tasks (German, Biodiversity). The difference between this work and prior work is the multi-concept setting; while this is an important contribution, it is not a fundamental one.

2. I like the theoretical contributions: the formulation is solid, and the results obtained using submodularity are more or less expected. The model-specific result for exponential forgetting curves is somewhat useful: memory strength and the number of concepts have a log relationship, which makes sense.

3. For the human user experiments, the results are pretty good: a statistically significant improvement despite using only 80/320 users. However, these results are heavily restricted; the whole session lasts only 25 minutes, which makes it hard to conclude that the same results will hold in real-life learning settings. I also wish there were a parameter sensitivity study (lines 264-270), since the authors simply ran the algorithm with some prior parameter choices and didn't tune them. Can you use the collected data to fit some parameters and compare them with your choices?

Writing is good.

I read the author response and it is good. I'll keep my score.
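To make the parameter-fitting suggestion in point 3 concrete, here is a minimal sketch of what such a check could look like. It assumes a generic exponential forgetting curve with a single half-life parameter, p = 2^(-elapsed / half_life), and fits that half-life by maximum likelihood over a grid of candidates. The function names, the grid, and the observations are all hypothetical; this is not the paper's parameterization or data.

```python
import math

def recall_probability(elapsed, half_life):
    """Exponential forgetting curve: recall halves every `half_life` time units."""
    return 2.0 ** (-elapsed / half_life)

def log_likelihood(half_life, observations):
    """Bernoulli log-likelihood of (elapsed_time, recalled) pairs."""
    total = 0.0
    for elapsed, recalled in observations:
        p = recall_probability(elapsed, half_life)
        # Clamp to avoid log(0) when the curve predicts certainty.
        p = min(max(p, 1e-9), 1.0 - 1e-9)
        total += math.log(p) if recalled else math.log(1.0 - p)
    return total

def fit_half_life(observations, candidates):
    """Return the candidate half-life with the highest log-likelihood."""
    return max(candidates, key=lambda h: log_likelihood(h, observations))

# Hypothetical observations: (time since last review, whether recall succeeded).
observations = [(1, True), (2, True), (4, True), (8, False), (16, False)]
candidates = [0.5, 1, 2, 4, 8, 16, 32]
best = fit_half_life(observations, candidates)
```

The fitted value could then be compared against the prior parameter choice used in the experiments, which is the comparison the question above asks for.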

Reviewer 2

Originality: The work is a nice combination of developments in a few fields -- applying analysis and concepts from machine teaching to concept learning in forgetful humans. To the best of my knowledge, the theoretical analysis is novel. However, the authors missed some critical work in the field and missed an opportunity to compare their method to very relevant pre-existing work. This includes the following references:

1) Patil, Zhu, Kopec, & Love (2011). Optimal teaching for Limited-Capacity Human Learners. NIPS. (http://papers.nips.cc/paper/5541-optimal-teaching-for-limited-capacity-human-learners.pdf) -- they examine concept learning and machine teaching, but with a different variant of limited capacity: a limit on the ability "to use" exemplars. It is similar to the case in this submission and may even be isomorphic to it (though I doubt it is).

2) Nosofsky, R. N., Sanders, C. A., Zhu, X., & McDaniel, M. A. (2019). Model-based search for optimal natural-science-category training exemplars: A work in progress. Psychonomic Bulletin & Review, 26, 48-76.

Quality: The strongest portion of the article is the derivation of theoretical results. Although the simulations and behavioral experiments are a valuable contribution, their value is weakened by the weak baselines chosen by the authors. Setting aside the missing previous work, I would have liked to see a comparison to optimal teaching of a "forgetless" learner, or an analysis of how different levels of forgetting affect teaching and learning. The models mentioned above would also have served as good baselines. Here are a few other minor issues:

Line 190-192: HLR memory model. This really has a much longer provenance prior to Settles and Meeder (2016). It would be good to acknowledge previous researchers who have used forgetting functions of a similar form (references through the Pavlik and Anderson (2005) work on ACT-R activation functions and learning should get you there. References in Walsh, Gluck, Gunzelmann, Jastrezembski, Krusmark, Myung, Pitt, & Zhang, 2018. Mechanisms underlying the Spacing Effect in Learning: A Comparison of Three Computational Models. Journal of Experimental Psychology: General could also help).

Line 282-284: The statistical tests aren't appropriate for binary outcome responses. The authors should instead use logistic regression or chi-squared tests based on contingency tables.

Clarity: The paper is well-written, though I did have some difficulty following the mathematical derivations at times. One reason might be the use of \gamma, which is more traditionally a decay rate of some sort. I was also unclear on the precise definition of \tau, as it seemed to change from Section 3.2 to Section 3.3. I appreciate that space is limited, and my difficulty may partly stem from my expertise being more in computational cognitive science than in machine teaching analysis.

Author feedback response: Thank you for your thoughts on my criticism of the submission. I apologize for my confusion and I appreciate your clarification. It may be worth including that clarification in the revised manuscript, as other readers might have the same confusion. I am glad you are including the hypothesis testing statistics in the revised manuscript.
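As an illustration of the test suggested in the comment on lines 282-284, the following is a minimal sketch of a Pearson chi-squared test on a 2x2 contingency table (two teaching conditions by correct/incorrect responses). The function name and the counts are hypothetical, not the paper's data; no continuity correction is applied, and the p-value uses the fact that a chi-squared variable with one degree of freedom is a squared standard normal.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for the table
    [[a, b], [c, d]]; returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-squared with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: rows are conditions, columns are (correct, incorrect).
stat, p = chi2_2x2(60, 40, 45, 55)
```

With repeated responses per participant, the independence assumption behind this test is questionable, which is one reason a logistic regression (ideally with a per-participant effect) may be the better of the two options.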

Reviewer 3

Originality: I think the specific teaching model proposed in the paper has never been considered in the literature.

Quality: Owing to time constraints, I only managed to check the proofs of Theorems 1 and 2; I think they are correct. Overall, I think the quality of the main paper is generally very good, with very few typos.

Clarity: The paper is quite clearly written.

Significance: The development of a framework for modelling the teaching of multiple concepts to memory-limited learners is quite significant.

Minor Comments:
- Page 2, lines 67-68: Perhaps state what the acronym ACT-R stands for (just like the other model acronyms used in the same sentence).
- Page 4, equation (5): Should this definition/notation be extended to the conditional marginal gain of teaching a _sequence_ of concepts at time t? (On page 15, in the last equality before line 476, \Delta is applied to such a sequence.)
- Page 6, Section 4.2: Perhaps explain what an HLR memory model is in more detail.
- Page 9, references [8] and [22]: I suggest either spelling out the acronym PNAS in [8] or using the acronym PNAS in [22] (i.e., stick to only one formatting style).
- Page 10, reference [27]: "The Generalization of [`student's'] problem..."
- Page 14, line 449: Do we need to use the submodularity of \mu to show that g_i(\tau+1, (\sigma_{1:\min(\tau,t)},\cdot)) \leq 1 (if so, it might be helpful to mention this, since string submodular functions were not defined in the paper)?
- Page 14, line 460: "...denote such [a] case by..."
- Page 15, definition of conditional marginal gain of a policy: How is item i (mentioned in Line 468) used in the definition? It might be helpful to give an intuitive explanation for why, on the right-hand side of (12), one takes the concatenation of \sigma_{1:t} with \sigma^{\pi}(\Phi) (similarly for y_{1:t} and y^{\pi}(\Phi)); in particular, why does the sequence of items selected by \pi from t' = 1 to t' = t appear twice (once in \sigma_{1:t} and again in \sigma^{\pi}(\Phi))? (Did I interpret the definition wrongly?)
- Page 16, inequality between lines 481 and 482: I could not see why this inequality follows directly from Definition 2. According to Definition 2, \omega_t is computed by taking the maximum over all (\sigma_{1:t},y_{1:t}) of the maximum expectation over all policies \pi with respect to (\sigma^{\pi}(\Phi),y^{\pi}(\Phi)); however, imposing the additional condition \Phi \sim (\sigma_{1:t},y_{1:t}) seems to reduce the value of E[f(\sigma_{1:t} \oplus \sigma^{\pi}(\Phi),y_{1:t}\oplus y^{\pi}(\Phi)) - f(\sigma^{\pi}(\Phi),y^{\pi}(\Phi))]/f(\sigma_{1:t},y_{1:t}).
- Page 16, inequality (17): Do we need to take the expectation of the second summand?

Response to author feedback: Thank you very much for the detailed feedback. I am keen on following future work on this topic, especially any possible long-term experiments on language learning using the paper's algorithm. I will keep my current score.

Paper ID:	2238
Title:	Teaching Multiple Concepts to a Forgetful Learner
