{"title": "Teaching Multiple Concepts to a Forgetful Learner", "book": "Advances in Neural Information Processing Systems", "page_first": 4048, "page_last": 4058, "abstract": "How can we help a forgetful learner learn multiple concepts within a limited time frame? While there have been extensive studies in designing optimal schedules for teaching a single concept given a learner's memory model, existing approaches for teaching multiple concepts are typically based on heuristic scheduling techniques without theoretical guarantees. In this paper, we look at the problem from the perspective of discrete optimization and introduce a novel algorithmic framework for teaching multiple concepts with strong performance guarantees. Our framework is both generic, allowing the design of teaching schedules for different memory models, and also interactive, allowing the teacher to adapt the schedule to the underlying forgetting mechanisms of the learner. Furthermore, for a well-known memory model, we are able to identify a regime of model parameters where our framework is guaranteed to achieve high performance. We perform extensive evaluations using simulations along with real user studies in two concrete applications: (i) an educational app for online vocabulary teaching; and (ii) an app for teaching novices how to recognize animal species from images. 
Our results demonstrate the effectiveness of our algorithm compared to popular heuristic approaches.", "full_text": "Teaching Multiple Concepts to a Forgetful Learner\n\nAnette Hunziker\u2020 Yuxin Chen\u00b6 Oisin Mac Aodha\u00a7 Manuel Gomez Rodriguez*\n\nAndreas Krause\u2021\n\nPietro Perona(cid:63) Yisong Yue(cid:63) Adish Singla*\n\n\u2020University of Zurich, anette.hunziker@gmail.com,\n\u00b6University of Chicago, chenyuxin@uchicago.edu,\n\u00a7University of Edinburgh, oisin.macaodha@ed.ac.uk,\n\n\u2021ETH Zurich, krausea@ethz.ch,\n\n(cid:63)Caltech, {perona, yyue}@caltech.edu,\n\n*MPI-SWS, {manuelgr, adishs}@mpi-sws.org\n\nAbstract\n\nHow can we help a forgetful learner learn multiple concepts within a limited time\nframe? While there have been extensive studies in designing optimal schedules for\nteaching a single concept given a learner\u2019s memory model, existing approaches for\nteaching multiple concepts are typically based on heuristic scheduling techniques\nwithout theoretical guarantees. In this paper, we look at the problem from the\nperspective of discrete optimization and introduce a novel algorithmic framework\nfor teaching multiple concepts with strong performance guarantees. Our framework\nis both generic, allowing the design of teaching schedules for different memory\nmodels, and also interactive, allowing the teacher to adapt the schedule to the\nunderlying forgetting mechanisms of the learner. Furthermore, for a well-known\nmemory model, we are able to identify a regime of model parameters where our\nframework is guaranteed to achieve high performance. We perform extensive eval-\nuations using simulations along with real user studies in two concrete applications:\n(i) an educational app for online vocabulary teaching; and (ii) an app for teaching\nnovices how to recognize animal species from images. 
Our results demonstrate the\neffectiveness of our algorithm compared to popular heuristic approaches.\n\n1\n\nIntroduction\n\nIn many real-world educational applications, human learners often intend to learn more than one\nconcept. For example, in a language learning scenario, a learner aims to memorize many vocabulary\nwords from a foreign language. In citizen science projects such as eBird [34] and iNaturalist [38],\nthe goal of a learner is to recognize multiple animal species from a given geographic region. As the\nnumber of concepts increases, the learning problem can become very challenging due to the learner\u2019s\nlimited memory and propensity to forget. It has been well established in the psychology literature that\nin the context of human learning, the knowledge of a learner decays rapidly without reconsolidation\n[7]. Somewhat analogously, in the sequential machine learning setting, modern machine learning\nmethods, such as arti\ufb01cial neural networks, can be drastically disrupted when presented with new\ninformation from different domains, which leads to catastrophic interference and forgetting [19, 14].\nTherefore, to retain long-term memory (for both human and machine learners), it is crucial to devise\nteaching strategies that adapt to the underlying forgetting mechanisms of the learner.\nTeaching forgetful learners requires repetition. Properly scheduled repetitions and reconsolidations of\nprevious knowledge have proven effective for a wide variety of real-world learning tasks, including\npiano [30], surgery [39, 33], video games [29], and vocabulary learning [4], among others. For many\nof the above applications, it has been shown that by carefully designing the scheduling policy, one can\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Illustration of our adaptive teaching framework applied to German vocabulary learning, shown here for\nsix time steps in the learning phase. 
Each time step proceeds in three stages: (1) the system displays a \ufb02ashcard\nwith an image and its English description, (2) the learner inputs the German translation, and (3) the system\nprovides feedback in the form of the correct answer if the input is incorrect.\n\nachieve substantial gains over simple heuristics (such as spaced repetition at \ufb01xed time intervals, or a\nsimple round robin schedule) [3]. Unfortunately, while there have been extensive (theoretical) results\nin teaching a single concept using spaced repetition algorithms, existing approaches for teaching\nmultiple concepts are typically based on heuristics without theoretical guarantees.\nIn this paper, we explore the following research question: Given limited time, can we help a forgetful\nlearner ef\ufb01ciently learn multiple concepts in a principled manner? More concretely, we consider an\nadaptive setting where at each time step, the teacher needs to pick a concept from a \ufb01nite set based on\nthe learner\u2019s previous responses, and the process iterates until the learner\u2019s time budget is exhausted.\nGiven a memory model of the learner, what is an optimal teaching curriculum? How should this\nsequence be adapted based on the learner\u2019s performance history?\n\n1.1 Overview of our approach\n\nFor a high-level overview of our approach, consider the example in Fig. 1, which illustrates one of our\napplications on German vocabulary learning [2]. Here, our goal is to teach the learner three German\nwords in six time steps. One trivial approach could be to show the \ufb02ashcards in a round robin fashion.\nHowever, the round robin sequence is deterministic and thus not capable of adapting to the learner\u2019s\nperformance. In contrast, our algorithm outputs an adaptive teaching sequence based on the learner\u2019s\nperformance.\nOur algorithm is based on a novel formulation of the adaptive teaching problem. 
In \u00a72, we propose\na novel discrete optimization problem, where we seek to maximize a natural surrogate objective\nfunction that characterizes the learner\u2019s expected performance throughout the teaching session.\nNote that constructing the optimal teaching policy boils down to solving a stochastic sequence\noptimization problem, which is NP-hard in general. In \u00a73, we introduce our greedy algorithm,\nand derive performance guarantees based on two intuitive data-dependent properties. While it can\nbe challenging to compute these performance bounds, we show that for certain learner memory\nmodels, these bounds can be estimated ef\ufb01ciently. Furthermore, we identify parameter settings of the\nmemory models where the greedy algorithm is guaranteed to achieve high performance. Finally, we\ndemonstrate that our algorithm achieves signi\ufb01cant improvements over baselines for both simulated\nlearners (cf. \u00a74) and human learners (cf. \u00a75).\n\n2 The Teaching Model\n\nWe now formalize the problem addressed in this paper.\n\n2.1 Problem setup\n\nSuppose that the teacher aims to teach the learner n concepts in a \ufb01nite time horizon T . We highlight\nthe notion of a concept via two concrete examples: (i) when teaching the vocabulary of a foreign\nlanguage, each concept corresponds to a word, and (ii) when teaching to recognize different animal\nspecies, each concept corresponds to an animal name. We consider \ufb02ashcard-based teaching, where\neach concept is associated with a \ufb02ashcard (cf. Fig. 
1).\n\nWe study the following interactive teaching protocol: At time step t, the teacher picks a concept from the set {1, . . . , n} and presents its corresponding flashcard to the learner without revealing its correct answer. The learner then tries to recall the concept. Let us use y_t ∈ {0, 1} to denote the learner’s recall at time step t. Here, y_t = 1 means that the learner successfully recalls the concept (e.g., the learner correctly recognizes the animal species), and y_t = 0 otherwise. After the learner makes an attempt, the teacher observes the outcome y_t and reveals the correct answer.\n\n2.2 Learner’s memory model\n\nLet us use (σ, y) to denote any sequence of concepts and observations. In particular, we use σ_{1:t} to denote the sequence of concepts picked by the teacher up to time t. Similarly, we use y_{1:t} to denote the sequence of observations up to time t. Given the history (σ_{1:t}, y_{1:t}), we are interested in modeling the learner’s probability to recall concept i at a future time τ ∈ [t + 1, T]. In general, the learner’s probability to recall concept i could depend on the history of teaching concept i or related concepts.¹ Formally, we capture the learner’s recall probability for concept i by a memory model g_i(τ, (σ_{1:t}, y_{1:t})) that depends on the entire history (σ, y). In §3.2, we study an instance of the learner model captured by the exponential forgetting curve (see Eq. 
(9)).\n\n2.3 The teaching objective\n\nThere are several objectives of interest to the teacher, for instance, maximizing the learner’s performance in recalling all concepts measured at the end of the teaching session. However, given that the learning phase might stretch over a long time duration for language learning, another natural objective is to measure the learner’s performance across the entire teaching session. For any given sequence of concepts and observations (σ_{1:T}, y_{1:T}) of length T, we consider the following objective:\n\nf(σ_{1:T}, y_{1:T}) = (1/(nT)) Σ_{i=1}^{n} Σ_{τ=1}^{T} g_i(τ + 1, (σ_{1:τ}, y_{1:τ})).  (1)\n\nHere, g_i(·) denotes the recall probability of concept i at τ + 1, given the history up to time step τ. Concretely, for a given concept i ∈ [n], our objective function can be interpreted as the (discrete) area under the learner’s recall curve for concept i across the teaching session.\nThe teacher’s teaching strategy can be represented as a policy π : (σ, y) → {1, . . . , n}, which maps any history (i.e., sequence of concepts selected σ and observations y) to the next concept to be taught. For a given policy π, we use (σ^π_{1:T}, y^π_{1:T}) to denote a random trajectory from the policy until time T. The average utility of a policy π is defined as:\n\nF(π) = E_{σ^π, y^π}[f(σ^π_{1:T}, y^π_{1:T})].  (2)\n\nGiven the learner’s memory model for each concept i and the time horizon T, we seek the optimal teaching policy that achieves the maximal average utility:\n\nπ* ∈ arg max_π F(π).  (3)\n\nIt can be shown that finding the optimal solution for Eq. (3) is NP-hard (proof is provided in the supplemental materials).\nTheorem 1. 
Problem (3) is NP-hard, even when the objective function does not depend on the learner’s responses.\n\n3 Teaching Algorithm and Analysis\n\nWe now present a simple, greedy approach for constructing teaching policies. To measure the teaching progress at time t < T, we introduce the following generalization of the objective defined in Eq. (1):\n\nf(σ_{1:t}, y_{1:t}) = (1/(nT)) Σ_{i=1}^{n} Σ_{τ=1}^{T} g_i(τ + 1, (σ_{1:min(τ,t)}, y_{1:min(τ,t)})).  (4)\n\nNote that this is equivalent to extending (σ_{1:t}, y_{1:t}) to length T by filling in the remaining sequence from t + 1 to T with empty concepts and observations. Given the history (σ_{1:t−1}, y_{1:t−1}), we define the conditional marginal gain of teaching a concept i at time t as:\n\nΔ(i | σ_{1:t−1}, y_{1:t−1}) = E_{y_t}[f(σ_{1:t−1} ⊕ i, y_{1:t−1} ⊕ y_t) − f(σ_{1:t−1}, y_{1:t−1})],  (5)\n\nwhere ⊕ denotes the concatenation operation, and the expectation is taken over the randomness of the learner’s recall y_t, conditioned on the history (σ_{1:t−1}, y_{1:t−1}). The greedy algorithm, as described in Algorithm 1, iteratively selects the concept that maximizes this conditional marginal gain.\n\nAlgorithm 1 Adaptive Teaching Algorithm\n\nSequence σ ← ∅; observation history y ← ∅\nfor t = {1, . . . , T} do\n  Select i_t ← arg max_i Δ(i | σ, y)\n  Show i_t to the learner; observe y_t\n  Update σ ← σ ⊕ i_t, y ← y ⊕ y_t\n\n¹As an example, for German vocabulary learning, the recall probability for the concept “Apfelsaft” (apple juice) could depend on the flashcards shown for “Apfelsaft” and “Apfel” (apple).\n\n3.1 Theoretical guarantees\n\nWe now present a general theoretical framework for analyzing the performance of the adaptive teaching algorithm (Algorithm 1). 
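To make the procedure concrete, the greedy teacher of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the paper’s implementation: since this excerpt does not spell out the memory model of Eq. (9), we assume a hypothetical exponential forgetting curve whose per-concept strength doubles after each successful recall.

```python
import math

# Minimal sketch of Algorithm 1 (greedy marginal-gain teaching).
# The memory model g_i of Eq. (9) is not shown in this excerpt, so we
# assume a hypothetical exponential forgetting curve: the recall
# probability of concept i decays as exp(-(tau - last) / strength),
# where last is the last time i was taught and strength doubles after
# every successful recall.

def recall_prob(tau, history, i, base_strength=2.0):
    # g_i(tau, history) under the assumed exponential forgetting curve.
    last, strength = None, base_strength
    for t, (concept, y) in enumerate(history, start=1):
        if concept == i:
            last = t
            if y == 1:
                strength *= 2.0  # successful recalls slow down forgetting
    return 0.0 if last is None else math.exp(-(tau - last) / strength)

def objective(history, T, n):
    # f(sigma_{1:t}, y_{1:t}) as in Eq. (4): discrete area under recall curves.
    total = 0.0
    for i in range(n):
        for tau in range(1, T + 1):
            truncated = history[:min(tau, len(history))]
            total += recall_prob(tau + 1, truncated, i)
    return total / (n * T)

def marginal_gain(history, i, T, n):
    # Delta(i | history) as in Eq. (5): expectation over the learner's recall.
    p = recall_prob(len(history) + 1, history, i)
    base = objective(history, T, n)
    return sum(w * (objective(history + [(i, y)], T, n) - base)
               for y, w in ((1, p), (0, 1.0 - p)))

def greedy_teacher(n, T, sample_recall):
    # Algorithm 1: at each step teach the concept with maximal marginal gain.
    history = []
    for _ in range(T):
        i = max(range(n), key=lambda c: marginal_gain(history, c, T, n))
        y = sample_recall(history, i)  # observe the learner's response
        history.append((i, y))
    return history
```

For instance, `greedy_teacher(3, 6, lambda h, i: 1)` mimics the setting of Fig. 1 (three words, six time steps) with a learner that always answers correctly; substituting the paper’s memory model for `recall_prob` recovers the algorithm analyzed below.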
Our bound depends on two natural properties of the objective function f, both related to a notion of diminishing returns of a sequence function. Intuitively, the following two properties reflect how much a greedy choice can affect the optimality of the solution.\nDefinition 1 (Online stepwise submodular coefficient). Consider policy π for time T. The online submodular coefficient of function f with respect to policy π at step t is defined as\n\nwhere γ(σ, y) = min_{i,(σ′,y′): |σ|+|σ′| …\n\n… s > 0 denotes how far in the future we choose to evaluate the learner’s recall.\n\nBaselines To demonstrate the performance of our adaptive greedy policy (referred to as GR), we consider three baseline algorithms. The first baseline, denoted by RD, is a random teacher that presents a random concept at each time step. The second baseline, denoted by RR, is a round robin teaching policy that picks concepts according to a fixed round robin schedule, i.e., iterating through concepts at each time step. Our third baseline is a variant of the teaching strategy employed by [28], which can be considered as a generalization of the popular Leitner and Pimsleur systems [16, 25]. At each time step, the teacher chooses to display the concept with the lowest recall probability according to the HLR memory model of the learner. We refer to this algorithm as LR.\n\n4.2 Simulation results\n\nWe first evaluate the performance as a function of the teaching horizon T. In Fig. 2a and Fig. 2b, we plot the objective value and average recall at T + s for all algorithms over 10 random trials, where we set s = 10, n = 20 with half easy and half difficult concepts, and vary T ∈ [40, 80]. As we can see from both plots, GR consistently outperforms baselines in all scenarios. The gap between the performances of GR and the baselines is more significant for smaller T. 
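The LR baseline can be sketched as follows. Since this excerpt does not reproduce the paper’s exact HLR variant, the sketch assumes the standard half-life regression form of [28], p = 2^(−Δ/h) with half-life h = 2^(θ·x), and a hypothetical feature vector x = (number of correct recalls, number of incorrect recalls, 1).

```python
# Sketch of the LR baseline: always display the concept with the lowest
# predicted recall probability. The paper's exact HLR variant is not
# reproduced in this excerpt; we assume the standard half-life regression
# form p = 2^(-delta / h) with half-life h = 2^(theta . x), where the
# feature vector x = (#correct, #incorrect, 1) is a hypothetical choice.

def hlr_recall(delta, n_correct, n_wrong, theta):
    # Predicted recall probability delta steps after the last review.
    half_life = 2.0 ** (theta[0] * n_correct + theta[1] * n_wrong + theta[2])
    return 2.0 ** (-delta / half_life)

def lr_schedule(n, T, theta, sample_recall):
    # Teach for T steps, greedily picking the most-likely-forgotten concept.
    last_seen = [0] * n                   # time of last review (0 = never shown)
    counts = [[0, 0] for _ in range(n)]   # per concept: [#correct, #incorrect]
    history = []
    for t in range(1, T + 1):
        def predicted_recall(i):
            if last_seen[i] == 0:
                return -1.0               # unseen concepts are taught first
            return hlr_recall(t - last_seen[i], counts[i][0], counts[i][1], theta)
        i = min(range(n), key=predicted_recall)
        y = sample_recall(i)              # observe the learner's response
        counts[i][0 if y == 1 else 1] += 1
        last_seen[i] = t
        history.append((i, y))
    return history
```

With identical concepts and a learner that always answers correctly, this policy degenerates to a round-robin order; LR and RR only diverge once some recalls fail and the per-concept half-lives drift apart.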
As we increase the time budget, the performance of all algorithms improves—this behavior is expected, as it corresponds to the scenario where all concepts get a fair chance of repetition with abundant time budget. In Fig. 2c and Fig. 2d, we show the performance plot for a fixed teaching horizon of T = 60 when we vary the number of concepts n ∈ [10, 30]. Here we observe a similar behavior as before—GR is consistently better; as n increases, the gap between the performances of GR and the baselines becomes more significant. Our results suggest that the advantage of GR is most pronounced for more challenging settings, i.e., when we have a tight time budget (small T) or a large number of concepts (large n).\n\nFigure 2: Objective value and recall at T + s achieved by GR, RR, RD, and LR, (a, b) as a function of the teaching horizon T, and (c, d) as a function of the number of concepts n.\n\navg gain (p-value vs. GR):\nGerman: GR 0.572 (–), LR 0.487 (0.0652), RR 0.462 (0.0197), RD 0.467 (0.0151)\nBiodiversity: GR 0.475 (–), LR 0.411 (0.0017), RR 0.390 (<0.0001), RD 0.251 (<0.0001)\nBiodiversity (common): GR 0.143 (–), LR 0.118 (0.3111), RR 0.150 (0.8478), RD 0.086 (0.0047)\nBiodiversity (rare): GR 0.766 (–), LR 0.668 (0.0001), RR 0.601 (<0.0001), RD 0.396 (<0.0001)\n\nTable 1: Summary of the user study results. Here, the performance is measured as the avg gain in learner’s performance from prequiz phase to postquiz phase, with the p-value of the χ² test comparing each baseline against GR given in parentheses (see main text for details). We have n = 15, T = 40, and ran the algorithms with a total of 80 participants for the German app and 320 participants for the Biodiversity app.\n\n5 User Study\n\nWe have developed online apps for two concrete real-world applications: (i) German vocabulary teaching [2], and (ii) teaching novices to recognize animal species from images, motivated by citizen science projects for biodiversity monitoring [1]. 
Next, we brie\ufb02y introduce the datasets used for these\ntwo apps and then present the user study results of teaching human learners.\n\n5.1 Experimental setup\n\nDataset For the German vocabulary teaching app, we collected 100 English-German word pairs in\nthe form of \ufb02ashcards, each associated with a descriptive image. These word pairs were provided by\na language expert (see [8]) and consist of popular vocabulary words taught in an entry-level German\nlanguage course. For the biodiversity teaching app, we collected images of 50 animal species. To\nextract a \ufb01ne-grained signal for our user study, we further categorize the Biodiversity dataset into two\ndif\ufb01culty levels, namely \u201ccommon\u201d and \u201crare\u201d, based on the prevalence of these species. Examples\nfrom both datasets are provided in the supplemental materials.\nFor real-world experiments, we do not know the learner\u2019s memory model. While it is possible to\n\ufb01t the HLR model through an extensive pre-study as in [28], we instead simply choose a \ufb01xed set\nof parameters. For the Biodiversity dataset, we set the parameters of each concept based on their\ndif\ufb01culty level. Namely, we set \u03b81 = (10, 5, 0) for \u201ccommon\u201d (i.e., easy) species and \u03b82 = (3, 1.5, 0)\nfor \u201crare\u201d (i.e., dif\ufb01cult) species, as also used in our simulation. For the German dataset, since\nthe parameters associated with a concept (i.e., vocabulary word) depend heavily on learner\u2019s prior\nknowledge, we chose a more robust set of parameters for each of the concepts given by \u03b8 = (6, 2, 0).\nWe defer the details of our sensitivity study of the HLR parameters to the supplemental materials.\n\nOnline teaching interface Our apps provide an online teaching interface where a user (i.e., human\nlearner) can participate in a \u201cteaching session\u201d. 
As in the simulations, here each session corresponds to teaching n concepts (sampled randomly from our dataset) via flashcards over T time steps. We demonstrate the teaching interface and present the detailed design ideas in the supplemental materials.\n\n5.2 User study results\n\nResults for German We now present the user study results for our German vocabulary teaching app [2]. We ran our candidate algorithms with n = 15, T = 40 on a total of 80 participants (i.e., 20 per algorithm) recruited from Amazon Mechanical Turk. Results are shown in Table 1, where we computed the average gain of each algorithm, and performed statistical analysis on the collected results. The first row (avg gain) is obtained by treating the performance for each (participant, word) pair as a separate sample (e.g., we get 20 × 15 samples per algorithm for the German app). The second row (p-value) indicates the statistical significance of the results measured by the χ² tests [6] (with contingency tables where rows are algorithms and columns are observed outcomes), when comparing GR with the baselines. Overall, GR achieved higher gains compared to the baselines.\n\nResults for Biodiversity Next, we present the user study results on our Biodiversity teaching app [1]. We recruited a total of 320 participants (i.e., 80 per algorithm). Here, we used different parameters for the learner’s memory as described in §5.1; all other conditions (i.e., n = 15, T = 40, and interface) were kept the same as for the German app. In Table 1, in addition to the overall performance of the algorithms across all concepts, we also provide separate statistics on teaching the “common” and “rare” concepts. 
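The χ² analysis described above can be reproduced along the following lines; the contingency-table counts below are hypothetical and do not come from the study.

```python
# Sketch of the chi-squared test over a contingency table, as used above
# to compare GR against a baseline: rows are algorithms, columns are
# observed outcomes (e.g., word recalled vs. not recalled in the postquiz).
# The counts here are hypothetical, not the study's actual data.

def chi_squared_statistic(table):
    # Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected.
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_sums[r] * col_sums[c] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table: rows = (GR, baseline), cols = (recalled, not recalled).
stat = chi_squared_statistic([[30, 10], [20, 20]])
```

The resulting statistic is then compared against the χ² distribution with (rows − 1)(cols − 1) degrees of freedom, one degree of freedom for a 2 × 2 table, to obtain the p-values reported in Table 1.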
Note that, while the performance of GR is close to the baselines\nwhen teaching the \u201ccommon\" species (given the high prequiz score due to learner\u2019s prior knowledge\nabout these species), GR is signi\ufb01cantly more effective in teaching the \u201crare\u201d species.\n\nRemarks This user study provides a proof-of-concept that the performance of our algorithm GR\ndemonstrated on simulated learners is consistent with the performance observed on human learners.\nWhile teaching sessions in our current user study were limited to a span of 25 mins with participants\nrecruited from Mechanical Turk, we expect that the teaching applications we have developed could\nbe adapted to real-life educational scenarios for conducting long-term studies.\n\n6 Related Work\n\nSpaced repetition and memory models Numerous studies in neurobiology and psychology have\nemphasized the importance of the spacing effects in human learning. The spacing effect is the\nobservation that spaced repetition (i.e., introducing appropriate time gaps when learning a concept)\nproduces greater improvements in learning compared to massed repetition (i.e., \u201ccramming\u201d) [37].\nThese \ufb01ndings have inspired many computational models of human memory, including the Adaptive\nCharacter of Thought\u2013Rational model (ACT-R) [24], the Multiscale Context model (MCM) [22], and\nthe Half-life Regression model (HLR) [28]. In particular, HLR is a trainable spaced repetition model,\nwhich can be viewed as a generalization of the popular Leitner [16] and Pimsleur [25] systems. In this\npaper, we adopt a variant of HLR to model the learner. One of the key characteristics of these memory\nmodels is the function used to model the forgetting curve. 
Power-law and exponential functions are\ntwo popular ways of modeling the forgetting curve (for detailed discussion, see [27, 41, 24, 40]).\n\nOptimal scheduling with spaced repetition models\n[13] and [17] studied the ACT-R model and\nthe MCM model respectively for the optimal review scheduling problem where the goal is to maximize\na learner\u2019s retention through an intelligent review scheduler. One of the key differences between their\nsetting and ours is that, they consider a \ufb01xed curriculum of new concepts to teach, and the scheduler\nadditionally chooses which previous concept(s) to review at each step; whereas our goal is to design\na complete teaching curriculum. Even though the problem settings are somewhat different, we would\nlike to note that our theoretical framework can be adapted to their setting.\nRecently, [26] presented a queuing model for \ufb02ashcard learning based on the Leitner system and\nconsider a \u201cmean-recall approximation\" heuristic to tractably optimize the review schedule. One\nlimitation is that their approach does not adapt to the learner\u2019s performance over time. Furthermore,\nthe authors leave the problem of obtaining guarantees for the original review scheduling problem as\na question for future work. [35] considered optimizing learning schedules in continuous time for a\nsingle concept, and use control theory to derive optimal scheduling to minimize a penalized recall\nprobability area-under-the-curve loss function. In addition to being discrete time, the key difference\nof our setting is that we aim to teach multiple concepts.\n\nSequence optimization Our theoretical framework is inspired by recent results on sequence sub-\nmodular function maximization [43, 36] and adaptive submodular optimization [10]. 
In particular,\n[43] introduced the notion of string submodular functions, which, analogous to the classical notion of\nsubmodular set functions [15], enjoy similar performance guarantees for maximization of determinis-\ntic sequence functions. Our setting has two key differences in that we focus on the stochastic setting\nwith potentially non-submodular objective functions. In fact, our theoretical framework (in particular\nCorollary 3) generalizes string submodular function maximization to the adaptive setting.\n\nForgetful learners in machine learning Here, we highlight the differences with some recent work\nin the machine learning literature involving forgetful learners. In particular, [44] aimed to teach\nthe learner a binary classi\ufb01er by sequentially providing training examples, where the learner has an\nexponential decaying memory of the training examples. In contrast, we study a different problem,\nwhere we focus on teaching multiple concepts, and assume that the learner\u2019s memory of each concept\n\n8\n\n\fdecays over time. [14] explored the problem of how to construct a neural network for learning\nmultiple concepts. Instead of designing the optimal training schedule, their goal is to design a good\nlearner that suffers less from the forgetting behavior.\n\nMachine teaching Our work is also closely related to machine/algorithmic teaching literature\n(e.g., [46, 45, 32, 9]). Most of these works in machine teaching consider a non-adaptive setting\nwhere the teacher provides a batch of teaching examples at once without any adaptation. In this\npaper, we focus primarily on designing interactive teaching algorithms that adaptively select teaching\nexamples for a learner based on their responses. The problem of adaptive teaching has been studied\nrecently (e.g., [12, 42, 11, 5, 18, 31]). However, these works in machine teaching have not considered\nthe phenomena of forgetting. 
[23, 21] have studied the problem of concept learning and machine\nteaching when learner has \u201climited-capacity\" in terms of retrieving exemplars in memory during\nthe decision-making process. They model the learner via the Generalized Context Model [20] and\ninvestigated the problem of choosing the optimal exemplars for teaching a classi\ufb01cation task. In our\nsetting, the exemplars for each class are already given (in other words, we have only one exemplar\nper class), and we aim at optimally teaching the learner to memorize the (label of) exemplars.\n\n7 Conclusions\n\nWe presented an algorithmic framework for teaching multiple concepts to a forgetful learner. We\nproposed a novel discrete formulation of teaching based on stochastic sequence function optimization,\nand provided a general theoretical framework for deriving performance bounds. We have implemented\nteaching apps for two real-world applications. We believe our results have made an important\nstep towards bringing the theoretical understanding of algorithmic teaching closer to real-world\napplications where the forgetting phenomenon is an intrinsic factor.\n\nAcknowledgements\n\nThis work was done when Yuxin Chen and Oisin Mac Aodha were at Caltech. This work was\nsupported in part by NSF Award #1645832, Northrop Grumman, Bloomberg, AWS Research Credits,\nGoogle as part of the Visipedia project, and a Swiss NSF Early Mobility Postdoctoral Fellowship.\n\nReferences\n[1] App-Biodiversity. Website for teaching animal species. https://www.teaching-biodiversity.cc, 2018.\n\n[2] App-German. Website for teaching German vocabulary. https://www.teaching-german.cc, 2018.\n\n[3] David A Balota, Janet M Duchek, and Jessica M Logan. Is expanded retrieval practice a\nsuperior form of spaced retrieval? A critical review of the extant literature. Psychology Press\nNew York, NY, 2007.\n\n[4] Kristine C Bloom and Thomas J Shuell. 
Effects of massed and distributed practice on the learning and retention of second-language vocabulary. The Journal of Educational Research, 74(4):245–248, 1981.

[5] Yuxin Chen, Adish Singla, Oisin Mac Aodha, Pietro Perona, and Yisong Yue. Understanding the role of adaptivity in machine teaching: The case of version space learners. In NeurIPS, 2018.

[6] William G Cochran. The χ² test of goodness of fit. The Annals of Mathematical Statistics, pages 315–345, 1952.

[7] Hermann Ebbinghaus. Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie. Duncker & Humblot, 1885.

[8] germanwordoftheday. German Word of the Day: Website for learning German vocabulary. https://germanwordoftheday.de, 2018.

[9] Sally A Goldman and Michael J Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31, 1995.

[10] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.

[11] Luis Haug, Sebastian Tschiatschek, and Adish Singla. Teaching inverse reinforcement learners via features and demonstrations. In NeurIPS, 2018.

[12] Parameswaran Kamalaruban, Rati Devidze, Volkan Cevher, and Adish Singla. Interactive teaching algorithms for inverse reinforcement learning. In IJCAI, pages 2692–2700, 2019.

[13] Mohammad M Khajah, Robert V Lindsey, and Michael C Mozer. Maximizing students' retention via spaced review: Practical guidance from computational models of memory. Topics in Cognitive Science, 2014.

[14] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, et al. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13):3521–3526, 2017.

[15] Andreas Krause and Daniel Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, February 2014.

[16] S. Leitner and R. Totter. So lernt man lernen. Angewandte Lernpsychologie ein Weg zum Erfolg. Herder, 1972.

[17] Robert V Lindsey, Jeffery D Shroyer, Harold Pashler, and Michael C Mozer. Improving students' long-term knowledge retention through personalized review. Psychological Science, 25(3):639–647, 2014.

[18] Weiyang Liu, Bo Dai, Ahmad Humayun, Charlene Tay, Chen Yu, Linda B. Smith, James M. Rehg, and Le Song. Iterative machine teaching. In ICML, pages 2149–2158, 2017.

[19] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation. 1989.

[20] Robert M Nosofsky. Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115(1):39, 1986.

[21] Robert M Nosofsky, Craig A Sanders, Xiaojin Zhu, and Mark A McDaniel. Model-guided search for optimal natural-science-category training exemplars: A work in progress. Psychonomic Bulletin & Review, 26(1):48–76, 2019.

[22] Harold Pashler, Nicholas Cepeda, Robert V Lindsey, Ed Vul, and Michael C Mozer. Predicting the optimal spacing of study: A multiscale context model of memory. In NIPS, pages 1321–1329, 2009.

[23] Kaustubh R Patil, Jerry Zhu, Łukasz Kopeć, and Bradley C Love. Optimal teaching for limited-capacity human learners. In NIPS, pages 2465–2473, 2014.

[24] Philip I Pavlik Jr and John R Anderson. Practice and forgetting effects on vocabulary memory: An activation-based model of the spacing effect. Cognitive Science, 29(4):559–586, 2005.

[25] Paul Pimsleur. A memory schedule. The Modern Language Journal, 51(2):73–75, 1967.

[26] Siddharth Reddy, Igor Labutov, Siddhartha Banerjee, and Thorsten Joachims. Unbounded human learning: Optimal scheduling for spaced repetition. In KDD, pages 1815–1824, 2016.

[27] David C Rubin and Amy E Wenzel. One hundred years of forgetting: A quantitative description of retention. Psychological Review, 1996.

[28] Burr Settles and Brendan Meeder. A trainable spaced repetition model for language learning. In ACL, volume 1, pages 1848–1858, 2016.

[29] Wayne L Shebilske, Barry P Goettl, Kip Corrington, and Eric Anthony Day. Interlesson spacing and task-related processing during complex skill acquisition. Journal of Experimental Psychology: Applied, 5(4):413, 1999.

[30] Amy L Simmons. Distributed practice and procedural memory consolidation in musicians' skill learning. Journal of Research in Music Education, 59(4):357–368, 2012.

[31] Adish Singla, Ilija Bogunovic, Gábor Bartók, Amin Karbasi, and Andreas Krause. On actively teaching the crowd to classify. In NIPS Workshop on Data Driven Education, 2013.

[32] Adish Singla, Ilija Bogunovic, Gábor Bartók, Amin Karbasi, and Andreas Krause. Near-optimally teaching the crowd to classify. In ICML, pages 154–162, 2014.

[33] Edward N Spruit, Guido PH Band, and Jaap F Hamming. Increasing efficiency of surgical training: effects of spacing practice on skill acquisition and retention in laparoscopy training. Surgical Endoscopy, 29(8):2235–2243, 2015.

[34] Brian L Sullivan, Christopher L Wood, Marshall J Iliff, Rick E Bonney, Daniel Fink, and Steve Kelling. eBird: A citizen-based bird observation network in the biological sciences. Biological Conservation, 142(10):2282–2292, 2009.

[35] Behzad Tabibian, Utkarsh Upadhyay, Abir De, Ali Zarezade, Bernhard Schölkopf, and Manuel Gomez-Rodriguez. Enhancing human learning via spaced repetition optimization. PNAS, 116(10):3988–3993, 2019.

[36] Sebastian Tschiatschek, Adish Singla, and Andreas Krause. Selecting sequences of items via submodular maximization. In AAAI, 2017.

[37] Ovid J Tzeng. Stimulus meaningfulness, encoding variability, and the spacing effect. Journal of Experimental Psychology, 99(2):162–166, 1973.

[38] Grant Van Horn, Oisin Mac Aodha, Yang Song, et al. The iNaturalist species classification and detection dataset. In CVPR, 2018.

[39] EGG Verdaasdonk, LPS Stassen, RPJ Van Wijk, and J Dankelman. The influence of different training schedules on the learning of psychomotor skills for endoscopic surgery. Surgical Endoscopy, 21(2):214–219, 2007.

[40] Matthew M Walsh, Kevin A Gluck, Glenn Gunzelmann, Tiffany Jastrzembski, Michael Krusmark, Jay I Myung, Mark A Pitt, and Ran Zhou. Mechanisms underlying the spacing effect in learning: A comparison of three computational models. Journal of Experimental Psychology: General, 147(9):1325, 2018.

[41] Thomas D Wickens. Measuring the time course of retention. 1999.

[42] Teresa Yeo, Parameswaran Kamalaruban, Adish Singla, Arpit Merchant, Thibault Asselborn, Louis Faucon, Pierre Dillenbourg, and Volkan Cevher. Iterative classroom teaching. In AAAI, pages 5684–5692, 2019.

[43] Zhenliang Zhang, Edwin KP Chong, Ali Pezeshki, and William Moran. String submodular functions with curvature constraints. IEEE Transactions on Automatic Control, 61(3):601–616, 2016.

[44] Yao Zhou, Arun Reddy Nelakurthi, and Jingrui He. Unlearn what you have learned: Adaptive crowd teaching with exponentially decayed memory learners. In KDD, pages 2817–2826, 2018.

[45] Xiaojin Zhu. Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI, pages 4083–4087, 2015.

[46] Xiaojin Zhu, Adish Singla, Sandra Zilles, and Anna N. Rafferty. An overview of machine teaching. CoRR, abs/1801.05927, 2018.