{"title": "Understanding the Role of Adaptivity in Machine Teaching: The Case of Version Space Learners", "book": "Advances in Neural Information Processing Systems", "page_first": 1476, "page_last": 1486, "abstract": "In real-world applications of education, an effective teacher adaptively chooses the next example to teach based on the learner\u2019s current state. However, most existing work in algorithmic machine teaching focuses on the batch setting, where adaptivity plays no role. In this paper, we study the case of teaching consistent, version space learners in an interactive setting. At any time step, the teacher provides an example, the learner performs an update, and the teacher observes the learner\u2019s new state. We highlight that adaptivity does not speed up the teaching process when considering existing models of version space learners, such as the \u201cworst-case\u201d model (the learner picks the next hypothesis randomly from the version space) and the \u201cpreference-based\u201d model (the learner picks hypothesis according to some global preference). Inspired by human teaching, we propose a new model where the learner picks hypotheses according to some local preference defined by the current hypothesis. We show that our model exhibits several desirable properties, e.g., adaptivity plays a key role, and the learner\u2019s transitions over hypotheses are smooth/interpretable. We develop adaptive teaching algorithms, and demonstrate our results via simulation and user studies.", "full_text": "Understanding the Role of Adaptivity in Machine\nTeaching: The Case of Version Space Learners\n\nYuxin Chen\u2020 Adish Singla\u2021 Oisin Mac Aodha\u2020\n\nPietro Perona\u2020 Yisong Yue\u2020\n\n\u2020Caltech, {chenyux, macaodha, perona, yyue}@caltech.edu,\n\n\u2021MPI-SWS, adishs@mpi-sws.org\n\nAbstract\n\nIn real-world applications of education, an effective teacher adaptively chooses\nthe next example to teach based on the learner\u2019s current state. 
However, most\nexisting work in algorithmic machine teaching focuses on the batch setting,\nwhere adaptivity plays no role.\nIn this paper, we study the case of teaching\nconsistent, version space learners in an interactive setting. At any time step, the\nteacher provides an example, the learner performs an update, and the teacher\nobserves the learner\u2019s new state. We highlight that adaptivity does not speed up\nthe teaching process when considering existing models of version space learners,\nsuch as the \u201cworst-case\u201d model (the learner picks the next hypothesis randomly\nfrom the version space) and the \u201cpreference-based\u201d model (the learner picks\nhypothesis according to some global preference). Inspired by human teaching, we\npropose a new model where the learner picks hypotheses according to some local\npreference de\ufb01ned by the current hypothesis. We show that our model exhibits\nseveral desirable properties, e.g., adaptivity plays a key role, and the learner\u2019s\ntransitions over hypotheses are smooth/interpretable. We develop adaptive teaching\nalgorithms, and demonstrate our results via simulation and user studies.\n\n1\n\nIntroduction\n\nAlgorithmic machine teaching studies the interaction between a teacher and a student/learner where\nthe teacher\u2019s objective is to \ufb01nd an optimal training sequence to steer the learner towards a desired\ngoal [36]. 
Recently, there has been a surge of interest in machine teaching as several different\ncommunities have found connections to this problem setting: (i) machine teaching provides a rigorous\nformalism for a number of real-world applications including personalized educational systems [35],\nadversarial attacks [24], imitation learning [6, 14], and program synthesis [18]; (ii) the complexity of\nteaching (\u201cTeaching-dimension\u201d) has strong connections with the information complexity of learning\n(\u201cVC-dimension\u201d) [9]; and (iii) the optimal teaching sequence has properties captured by new models\nof interactive machine learning, such as curriculum learning [4] and self-paced learning [25].\nIn the above-mentioned applications, adaptivity clearly plays an important role. For instance, in\nautomated tutoring, adaptivity enables personalization of the content based on the student\u2019s current\nknowledge [31, 33, 17]. In this paper, we explore the adaptivity gain in algorithmic machine teaching,\ni.e., how much speedup a teacher can achieve via adaptively selecting the next example based on the\nlearner\u2019s current state? While this question has been well-studied in the context of active learning\nand sequential decision making [15], the role of adaptivity is much less understood in algorithmic\nmachine teaching. A deeper understanding would, in turn, enable us to develop better teaching\nalgorithms and more realistic learner models to exploit the adaptivity gain.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fWe consider the well-studied case of teaching a consistent, version space learner. A learner in\nthis model class maintains a version space (i.e., a subset of hypotheses that are consistent with\nthe examples received from a teacher) and outputs a hypothesis from this version space. Here, a\nhypothesis can be viewed as a function that assigns a label to any unlabeled example. 
Existing work\nhas studied this class of learner model to establish theoretical connections between the information\ncomplexity of teaching vs. learning [13, 37, 11]. Our main objective is to understand, when and\nby how much, a teacher can bene\ufb01t by adapting the next example based on the learner\u2019s current\nhypothesis. We compare two types of teachers: (i) an adaptive teacher that observes the learner\u2019s\nhypothesis at every time step, and (ii) a non-adaptive teacher that only knows the initial hypothesis of\nthe learner and does not receive any feedback during teaching. The non-adaptive teacher operates in\na batch setting where the complete sequence of examples can be constructed before teaching begins.\nInspired by real-world teaching scenarios and as a generaliza-\ntion of the global \u201cpreference-based\u201d model [11], we propose\na new model where the learner\u2019s choice of next hypothesis h1\ndepends on some local preferences de\ufb01ned by the current hy-\npothesis h. For instance, the local preference could encode that\nthe learner prefers to make smooth transitions by picking a con-\nsistent hypothesis h1 which is \u201cclose\u201d to h. Local preferences,\nas seen in Fig. 1, are an important aspect of many machine\nlearning algorithms (e.g., incremental or online learning algo-\nrithms [27, 28]) in order to increase robustness and reliability.\nWe present results in the context of two different hypotheses\nclasses, and show through simulation and user studies that adap-\ntivity can play a crucial role when teaching learners with local\npreferences.\n\nFigure 1: Local update preference.\nUsers were asked to update the posi-\ntion of the orange rectangle so that\ngreen cells were inside and blue\nones outside. 
The heatmap on the\nright displays the updated positions.\n\n2 Related Work\n\nModels of version space learners Within the model class of version space learners, there are\ndifferent variants of learner models depending upon their anticipated behavior, and these models lead\nto different notions of teaching complexity. For instance, (i) the \u201cworst-case\u201d model [13] essentially\nassumes nothing and the learner\u2019s behavior is completely unpredictable, (ii) the \u201ccooperative\u201d model\n[37] assumes a smart learner who anticipates that she is being taught, and (iii) the \u201cpreference-based\u201d\nmodel [11] assumes that she has a global preference over the hypotheses. Recently, some teaching\ncomplexity results have been extended beyond version space learners, such as Bayesian learners [34],\nprobabilistic/randomized learners [30, 3], learners implementing an optimization algorithm [22], and\nfor iterative learning algorithms based on gradient updates [23]. Here, we focus on the case of version\nspace learners, leaving the extension to other types of learners for future work.\nBatch vs. sequential teaching Most existing work on algorithmic machine teaching has focused\non the batch setting, where the teacher constructs a set of examples and provides it to the learner at\nthe beginning of teaching [13, 37, 11, 7]. There has been some work on sequential teaching models\nthat are more suitable for understanding the role of adaptivity. Recently, [23] studied the problem\nof iteratively teaching a gradient learner by providing a sequence of carefully constructed examples.\nHowever, since the learner\u2019s update rule is completely deterministic, a non-adaptive teacher with\nknowledge of the learner\u2019s initial hypothesis h0 would behave exactly the same as an adaptive teacher\n(i.e., the adaptivity gain is zero). 
[3] studied randomized version-space learners with limited memory,\nand demonstrated the power of adaptivity for a speci\ufb01c class of hypotheses. Sequential teaching\nhas also been studied in the context of crowdsourcing applications by [19] and [29], empirically\ndemonstrating the improved performance of adaptive vs. non-adaptive teachers. However, these\napproaches do not provide any theoretical understanding of the adaptivity gain as done in our work.\nIncremental learning and teaching Our learner model with local preferences is quite natural in\nreal-world applications. A large class of iterative machine learning algorithms are based on the idea\nof incremental updates which in turn is important for the robustness and generalization of learning\n[27, 28]. From the perspective of a human learner, the notion of incremental learning aligns well\nwith the concept of the \u201cZone of Proximal Development (ZPD)\u201d in the educational research and\npsychology literature [32]. The ZPD suggests that teaching is most effective when focusing on a\ntask slightly beyond the current abilities of the student as the human learning process is inherently\n\n2\n\n\fincremental. Different variants of learner model studied in the cognitive science literature [21, 5, 26]\nhave an aspect of incremental learning. For instance, the \u201cwin stay lose shift\u201d model [5] is a special\ncase of the local preference model that we propose in our work. Based on the idea of incremental\nlearning, [2] studied the case of teaching a variant of the version space learner when restricted to\nincremental learning and is closest to our model with local preferences. 
However, there are two key differences between their model and ours: (i) they allow learners to select inconsistent hypotheses (i.e., outside the version space); (ii) the restricted movement in their model is a hard constraint, which in turn means that teaching is not always feasible – given a problem instance, it is NP-hard to decide whether a given target hypothesis is teachable.

3 The Teaching Model

We now describe the teaching domain, present a generic model of the learner and the teacher, and then state the teacher's objective.

3.1 The Teaching Domain
Let X denote a ground set of unlabeled examples, and Y denote the set of possible labels that could be assigned to elements of X. We denote by H a finite class of hypotheses; each element h ∈ H is a function h : X → Y. In this paper, we only consider boolean functions, hence Y = {0, 1}. In our model, X, H, and Y are known to both the teacher and the learner. There is a target hypothesis h* ∈ H that is known to the teacher, but not to the learner. Let Z ⊆ X × Y be the ground set of labeled examples. Each element z = (xz, yz) ∈ Z represents a labeled example whose label is given by the target hypothesis h*, i.e., yz = h*(xz). Here, we define the notion of version space needed to formalize our model of the learner. Given a set Z of labeled examples, the version space induced by Z is the subset of hypotheses H(Z) ⊆ H that are consistent with the labels of all the examples, i.e., H(Z) := {h ∈ H : ∀z = (xz, yz) ∈ Z, h(xz) = yz}.

3.2 Model of the Learner

We now introduce a generic model of the learner by formalizing our assumptions about how she adapts her hypothesis based on the labeled examples received from the teacher. A key ingredient of this model is the learner's preference function over the hypotheses, described below.
As we show in the next section, by providing specific instances of this preference function, our generic model reduces to existing models of version space learners, such as the “worst-case” model [13] and the global “preference-based” model [11].
Intuitively, the preference function encodes the learner's transition preferences. Consider that the learner's current hypothesis is h, and there are two hypotheses h1, h2 that she could possibly pick as the next hypothesis. We want to encode whether the learner has any preference in choosing h1 or h2. Formally, we define the preference function as σ : H × H → R+. Given a current hypothesis h and any two hypotheses h1, h2, we say that h1 is preferred to h2 from h iff σ(h1; h) < σ(h2; h). If σ(h1; h) = σ(h2; h), then the learner could pick either of the two.
The learner starts with an initial hypothesis h0 ∈ H before receiving any labeled examples from the teacher. Then, the interaction between the teacher and the learner proceeds in discrete time steps. At any time step t, let us denote the labeled examples received by the learner up to (but not including) time step t by a set Zt, the learner's version space by Ht = H(Zt), and the current hypothesis by ht. At time step t, we model the learning dynamics as follows: (i) the learner receives a new example zt; and (ii) the learner updates the version space to Ht+1, and picks the next hypothesis based on the current hypothesis ht, the version space Ht+1, and the preference function σ:

ht+1 ∈ {h ∈ Ht+1 : σ(h; ht) = min_{h' ∈ Ht+1} σ(h'; ht)}.   (3.1)

3.3 Model of the Teacher and the Objective
The teacher's goal is to steer the learner towards the target hypothesis h* by providing a sequence of labeled examples.
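Before turning to the teacher, the learner model of §3.2 can be made concrete with a short sketch of the update rule (3.1). The toy class of 1-D thresholds and the local preference σ(h'; h) = |h' − h| below are illustrative choices of ours, not constructions from the paper:

```python
# Sketch of the learner update in Eq. (3.1) on a toy class of 1-D
# thresholds (illustrative, not a class from the paper): hypothesis t
# labels x as 1 iff x >= t, and sigma(h'; h) = |h' - h| encodes a
# "small edit" local preference.

def label(t, x):
    return int(x >= t)

def learner_update(h_cur, version_space, example, sigma):
    """Shrink the version space with one labeled example, then pick a
    consistent hypothesis minimizing sigma(.; h_cur), as in Eq. (3.1)."""
    x, y = example
    new_vs = [h for h in version_space if label(h, x) == y]
    best = min(sigma(h, h_cur) for h in new_vs)
    # Ties may be broken arbitrarily; we simply take the first minimizer.
    h_next = next(h for h in new_vs if sigma(h, h_cur) == best)
    return h_next, new_vs

sigma = lambda h_next, h_cur: abs(h_next - h_cur)
h, vs = 0, list(range(5))            # initial hypothesis h0 = 0
for z in [(1, 0), (3, 1)]:           # examples labeled by a target in {2, 3}
    h, vs = learner_update(h, vs, z, sigma)
print(h, vs)                         # -> 2 [2, 3]
```

Note that the learner ends at threshold 2 rather than anywhere in the surviving version space {2, 3}: the local preference pulls her to the consistent hypothesis closest to her previous state.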
At time step t, the teacher selects a labeled example zt ∈ Z, and the learner transitions from the current hypothesis ht to the next hypothesis ht+1 as per the model described above. Teaching finishes at time step t if the learner's hypothesis ht = h*. Our objective is to design teaching algorithms that can achieve this goal in a minimal number of time steps. We study the worst-case number of steps needed, as is common when measuring the information complexity of teaching [13, 37, 11].
We assume that the teacher knows the learner's initial hypothesis h0 as well as the preference function σ(·; ·). In order to quantify the gain from adaptivity, we compare two types of teachers: (i) an adaptive teacher who observes the learner's hypothesis ht before providing the next labeled example zt at any time step t; and (ii) a non-adaptive teacher who only knows the initial hypothesis of the learner and does not receive any feedback from the learner during the teaching process. Given these two types of teachers, we want to measure the adaptivity gain by quantifying the difference in teaching complexity between the optimal adaptive teacher and the optimal non-adaptive teacher.

4 The Role of Adaptivity

In this section, we study different variants of the learner's preference function, and formally state the adaptivity gain for two concrete problem instances.

4.1 State-independent Preferences

We first consider a class of preference models where the learner's preference for the next hypothesis does not depend on her current hypothesis. The simplest state-independent preference is captured by the “worst-case” model [13], where the learner's preference over all hypotheses is uniform, i.e., ∀h, h', σ(h'; h) = c, where c is some constant.
A more generic state-independent preference model is captured by non-uniform, global preferences. More concretely, for any h' ∈ H, we have σ(h'; h) = c_{h'} ∀h ∈ H, a constant depending only on h'. This is similar to the notion of the global “preference-based” version space learner introduced by [11].

Proposition 1 For state-independent preferences, adaptivity plays no role, i.e., the sample complexities of the optimal adaptive teacher and the optimal non-adaptive teacher are the same.

In fact, for the uniform preference model, the teaching complexity of the adaptive teacher is the same as the teaching dimension of the hypothesis class with respect to teaching h*, given by

TD(h*, H) := min_Z |Z|, s.t. H(Z) = {h*}.   (4.1)

For the global preference model, similar to the notion of preference-based teaching dimension [11], the teaching complexity of the adaptive teacher is given by

min_Z |Z|, s.t. ∀h ∈ H(Z)\{h*}, σ(h; ·) > σ(h*; ·).   (4.2)

4.2 State-dependent Preferences

In real-world teaching scenarios, human learners incrementally build up their knowledge of the world, and their preference for the next hypothesis naturally depends on their current state. To better understand the behavior of an adaptive teacher under a state-dependent preference model, we investigate the following two concrete examples:
Example 1 (2-REC) H consists of up to two disjoint rectangles1 on a grid and X represents the grid cells (cf. Fig. 1 and Fig. 3a). Consider an example z = (xz, yz) ∈ Z: yz = 1 (positive) if the grid cell xz lies inside the target hypothesis, and 0 (negative) elsewhere.
The 2-REC hypothesis class consists of two subclasses, namely H1: all hypotheses with one rectangle, and H2: those with exactly two (disjoint) rectangles. The 2-REC class is inspired by teaching a union of disjoint objects. Here, objects correspond to rectangles, and any h ∈ H represents one or two rectangles. Furthermore, each hypothesis h is associated with a complexity measure given by the number of objects in the hypothesis. [10] recently studied the problem of teaching a union of disjoint geometric objects, and [1] studied the problem of teaching a union of monomials. Their results show that, in general, teaching a target hypothesis of lower complexity starting from higher-complexity hypotheses is the most challenging task.

1For simplicity of discussion, we assume that for a 2-REC hypothesis that contains two rectangles, the edges of the two rectangles do not overlap.

For the 2-REC class, we assume the following local preferences: (i) in general, the learner prefers to transition to a hypothesis with the same complexity as the current one (i.e., H1 → H1 or H2 → H2); (ii) when transitioning within the same subclass, the learner prefers small edits, e.g., moving the smallest number of edges possible when changing her hypothesis; and (iii) the learner can switch to a subclass of lower complexity (i.e., H2 → H1) in specific cases. We provide a detailed description of the preference function in the extended version of this paper [8].
Example 2 (LATTICE) H and X both correspond to nodes in a 2-dimensional integer lattice of length n. For a node v in the grid, we have an associated hv ∈ H and xv ∈ X. Consider an example zv = (xzv, yzv) ∈ Z: yzv = 0 (negative) if the target hypothesis corresponds to the same node v, and 1 (positive) elsewhere.
We consider the problem of teaching with positive-only examples.

The LATTICE class is inspired by teaching in a physical world from positive-only (or negative-only) reinforcements, for instance, teaching a robot to navigate to a target state by signaling that the current location is not the target. The problem of learning and teaching with positive-only examples is an important question with applications to learning languages and reinforcement learning tasks [12, 20].
For the LATTICE class, we assume that the learner prefers to move to a close-by hypothesis measured via the L1 (Manhattan) distance; when hypotheses are at equal distances, we assume that the learner prefers hypotheses with larger coordinates.

Theorem 2 For teaching the 2-REC class, the ratio between the cost of the optimal non-adaptive teacher and the optimal adaptive teacher is Ω(|h0| / log|h0|), where |h0| denotes the number of positive examples induced by the learner's initial hypothesis h0; for teaching the LATTICE class, the difference between the cost of the optimal non-adaptive teacher and the optimal adaptive teacher is Ω(n).

In the above theorem, we show that for both problems, under the natural behavior of an incremental learner, adaptivity plays a key role. The proof of Theorem 2 is provided in the extended version of this paper [8]. Specifically, we exhibit teaching sequences for an adaptive teacher that match the above bounds for the 2-REC and LATTICE classes. We also provide lower bounds for any non-adaptive algorithm for these two classes. Here, we highlight two necessary conditions under which adaptivity can possibly help: (i) preferences are local, and (ii) there are ties among the learner's preferences over hypotheses.
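The LATTICE setting can be simulated in a few lines. In the sketch below, the teacher repeatedly signals that the learner's current node is not the target, and the learner moves to the nearest consistent node in L1 distance; lexicographic tie-breaking stands in for the "larger coordinates" preference (the exact tie-breaking rule here is our own assumption, chosen for concreteness):

```python
# Toy adaptive-teaching simulation for LATTICE on an n x n grid.
# Each positive example "node v is not the target" removes h_v from the
# version space; the learner then moves to the closest remaining node in
# L1 (Manhattan) distance, breaking ties toward lexicographically larger
# coordinates (our stand-in for the paper's "larger coordinates" rule).

def nearest(h, vs):
    d = min(abs(h[0] - v[0]) + abs(h[1] - v[1]) for v in vs)
    return max(v for v in vs if abs(h[0] - v[0]) + abs(h[1] - v[1]) == d)

def adaptive_teach(h0, target, n):
    """Count examples provided until the learner's hypothesis is the target."""
    vs = {(i, j) for i in range(n) for j in range(n)}
    h, steps = h0, 0
    while h != target:
        vs.discard(h)      # adaptive choice: rule out where the learner is now
        h = nearest(h, vs)
        steps += 1
    return steps

print(adaptive_teach((0, 0), (2, 2), n=3))   # -> 4
```

The adaptive teacher walks the learner to the target one neighbor at a time. A non-adaptive teacher, by contrast, cannot observe which of the equidistant neighbors the learner actually picked, which is exactly where the Ω(n) gap of Theorem 2 comes from.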
The learner's current hypothesis, combined with the local preference structure, gives the teacher a handle to steer the learner in a controlled way.

5 Adaptive Teaching Algorithms

In this section, we first characterize the optimal teaching algorithm, and then propose a non-myopic adaptive teaching framework.

5.1 The Optimality Condition
Assume that the learner's current hypothesis is h, and the current version space is H ⊆ H. Let D*(h, H, h*) denote the minimal number of examples required in the worst case to teach h*. We identify the following optimality condition for an adaptive teacher:

Proposition 3 A teacher achieves the minimal teaching cost if and only if, for all states (h, H) of the learner, it picks an example such that

z* ∈ argmin_z max_{h' ∈ C(h, H, σ, z)} (1 + D*(h', H ∩ H({z}), h*)),

where C(h, H, σ, z) denotes the set of candidate hypotheses in the next round as defined in (3.1), and for all (h, H), it holds that

D*(h, H, h*) = min_z max_{h' ∈ C(h, H, σ, z)} (1 + D*(h', H ∩ H({z}), h*)).

Algorithm 1 Non-myopic adaptive teaching
  input: H, σ, initial h0, target h*.
  Initialize t ← 0, H0 ← H
  while ht ≠ h* do
    h*_t ← Oracle(ht, Ht, h*)
    z_{t+1} ← Teacher(σ, ht, Ht, h*_t)
    Learner makes an update
    t ← t + 1

Figure 2: An illustrative example for 2-REC. ht, h*, and h*_t are represented by the orange rectangles, the solid green rectangle, and the dashed green rectangles, respectively. (Left) The teaching task. (Middle) Sub-task 1. (Right) Sub-task 2.

In general, computing the optimal cost D* for non-trivial preference functions, including the uniform/global preference, requires solving a system of linear equations of size |H| · 2^|H|.
State-independent preference When the learner's preference is uniform, the optimal cost D*_u(h, H, h*) = TD(h*, H) (Eq. 4.1) is the set cover number of the version space, which is NP-hard to compute. A myopic heuristic that gives the best approximation guarantee achievable by a polynomial-time algorithm (with cost within a logarithmic factor of the optimal cost [13]) is given by D̃_u(h, H, h*) = |H|. For the global preference, the optimal cost D̃_g's exact counterpart D*_g(h, H, h*) is given by Eq. (4.2), i.e., the set cover number of all hypotheses in the version space that are preferred at least as much as h*. Similarly, one can follow the greedy heuristic D̃_g(h, H, h*) = |{h' ∈ H : σ(h'; ·) ≤ σ(h*; ·)}| to achieve a logarithmic-factor approximation.

General preference Inspired by the two myopic heuristics above, we propose the following heuristic for general preference models:

D̃(h, H, h*) = |{h' ∈ H : σ(h'; h) ≤ σ(h*; h)}|.   (5.1)

In words, D̃ denotes the index of the target hypothesis h* in the preference vector associated with h over the version space H. Notice that for the uniform (resp. global) preference model, the function D̃ reduces to D̃_u (resp. D̃_g). In the following theorem, we provide a sufficient condition under which the myopic adaptive algorithm that greedily minimizes Eq. (5.1) attains provable guarantees:
Theorem 4 Let h0 ∈ H be the learner's initial hypothesis, and h* ∈ H be the target hypothesis. For any H ⊆ H, let H̄({z}) = {h' ∈ H : h'(xz) ≠ yz} be the set of hypotheses in H that are inconsistent with the teaching example z ∈ Z.
If, for all learner states (h, H), the preference function and the structure of the teaching examples satisfy:

1. ∀hi, hj ∈ H, σ(hi; h) ≤ σ(hj; h) ≤ σ(h*; h) ⟹ σ(hj; hi) ≤ σ(h*; hi),
2. ∀H' ⊆ H̄({z}), there exists z' ∈ Z s.t. H̄({z'}) = H',

then the cost of the myopic algorithm that greedily minimizes2 (5.1) is within a factor of 2(log D̃(h0, H, h*) + 1) of the cost of the optimal adaptive algorithm.

2In the case of ties, we assume that the teacher prefers examples that make the learner stay at the same hypothesis.

We defer the proof of the theorem to the extended version of this paper [8]. Note that both the uniform preference model and the global preference model satisfy Condition 1. Intuitively, the first condition states that there does not exist any hypothesis between h and h* that provides a “short-cut” to the target. Condition 2 implies that we can always find teaching examples that ensure smooth updates of the version space. For instance, a feasible setting that fits Condition 2 is one where the teacher can synthesize an example to remove any subset of hypotheses of size at most k, for some constant k.

5.2 Non-Myopic Teaching Algorithms

When the conditions provided in Theorem 4 do not hold, the myopic heuristic (5.1) can perform poorly. An important observation from Theorem 4 is that, when D̃(h, H, h*) is small, i.e., h* is close to the learner's current hypothesis in terms of preference ordering, we need less stringent constraints on the preference function. This motivates adaptively devising intermediate target hypotheses that ground the teaching task into multiple, separate sub-tasks.
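A minimal sketch of the myopic teacher that greedily minimizes the heuristic of Eq. (5.1) is given below, again on the illustrative 1-D threshold class with σ(h'; h) = |h' − h| (both our own stand-ins, not constructions from the paper); when the conditions of Theorem 4 fail, the same loop can instead be pointed at intermediate targets supplied by an oracle, as in Algorithm 1:

```python
# Myopic teacher greedily minimizing the heuristic of Eq. (5.1): the index
# of the target h* in the learner's preference ordering over the surviving
# version space.  Toy 1-D threshold class: hypothesis t labels x as 1 iff
# x >= t; sigma(h'; h) = |h' - h| is an illustrative local preference.

def label(t, x):
    return int(x >= t)

def d_tilde(h, vs, target, sigma):
    return sum(1 for h2 in vs if sigma(h2, h) <= sigma(target, h))

def myopic_example(h, vs, target, sigma, X):
    def worst_index(x):
        nvs = [h2 for h2 in vs if label(h2, x) == label(target, x)]
        m = min(sigma(h2, h) for h2 in nvs)
        # worst case over the learner's tie set from Eq. (3.1)
        return max(d_tilde(h2, nvs, target, sigma)
                   for h2 in nvs if sigma(h2, h) == m)
    x = min(X, key=worst_index)          # teacher labels x with h*(x)
    return x, label(target, x)

sigma = lambda a, b: abs(a - b)
h, vs, target, X = 0, list(range(5)), 3, range(5)
steps = 0
while h != target:
    x, y = myopic_example(h, vs, target, sigma, X)
    vs = [h2 for h2 in vs if label(h2, x) == y]
    m = min(sigma(h2, h) for h2 in vs)
    h = next(h2 for h2 in vs if sigma(h2, h) == m)
    steps += 1
print(steps, h)   # -> 1 3
```

In this tiny instance a single well-chosen example suffices: labeling x = 2 negative leaves only thresholds {3, 4}, and the local preference pulls the learner directly onto the target.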
Such divide-and-conquer approaches have proven useful for many practical problems, e.g., constructing a hierarchical decomposition for reinforcement learning tasks [16]. In the context of machine teaching, we assume that there is an oracle, Oracle(h, H, h*), that maps the learner's state (h, H) and the target hypothesis h* to an intermediate target hypothesis, which defines the current sub-task.
We outline the non-myopic adaptive teaching framework in Algorithm 1. Here, the subroutine Teacher aims to provide teaching examples that bring the learner closer to the intermediate target hypothesis. As an example, consider the 2-REC hypothesis class. In particular, we consider the challenging case where the target hypothesis h* ∈ H1 represents a single rectangle r*, and the learner's initial hypothesis h0 ∈ H2 has two rectangles (r1, r2). Imagine that the first rectangle r1 overlaps with r*, and the second rectangle r2 is disjoint from r*. To teach the hypothesis h*, the first sub-task (as provided by the oracle) is to eliminate the rectangle r2 by providing negative examples, so that the learner's hypothesis represents the single rectangle r1. The next sub-task (as provided by the oracle) is then to teach h* from r1. We illustrate the sub-tasks in Fig. 2, and provide the full details of the adaptive teaching algorithm (i.e., Ada-R as used in our experiments) in the extended version of this paper [8].

6 Experiments

In this section, we empirically evaluate our teaching algorithms on the 2-REC hypothesis class via simulated learners.

6.1 Experimental Setup

For the 2-REC hypothesis class (cf. Fig. 3a and Example 1), we consider a grid with size varying from 5 × 5 to 20 × 20. The ground set of unlabeled teaching examples X consists of all grid cells.
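To give a sense of scale, the single-rectangle subclass H1 alone already grows quadratically in the number of cells; the back-of-the-envelope count below is ours, not a figure from the paper, and assumes non-degenerate axis-aligned rectangles:

```python
# Back-of-the-envelope size of the single-rectangle subclass H1 on an
# n x n grid (our own count, not from the paper): an axis-aligned
# rectangle is fixed by choosing 2 of the n + 1 horizontal grid lines
# and 2 of the n + 1 vertical grid lines.
from math import comb

def num_single_rectangles(n):
    return comb(n + 1, 2) ** 2

for n in (5, 10, 15, 20):
    print(n, num_single_rectangles(n))   # 5 -> 225, ..., 20 -> 44100
```

Even before pairing rectangles into the two-rectangle subclass H2, the hypothesis space grows quickly with the grid length.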
In our simulations, we consider all four possible teaching scenarios, H1→1, H1→2, H2→1, H2→2, where i, j in Hi→j specify the subclasses of the learner's initial hypothesis h0 and the target hypothesis h*. In each simulated teaching session, we sample a random pair of hypotheses (h0, h*) from the corresponding subclasses.

Teaching algorithms We consider three different teaching algorithms, described below. The first algorithm, SC, is a greedy set cover algorithm, where the teacher greedily minimizes D̃_u = |H| (see §5.1). In words, the teacher acts according to the uniform preference model, and greedily picks the teaching example that eliminates the most inconsistent hypotheses from the version space. The second algorithm, denoted Non-R for the class 2-REC, represents the non-adaptive teaching algorithm that matches the non-adaptive lower bounds provided in Theorem 2, with implementation details provided in the extended version of this paper [8]. Note that both SC and Non-R are non-adaptive. The third algorithm, Ada-R, represents the non-myopic adaptive teaching algorithm instantiated from Algorithm 1. The details of the subroutines Oracle and Teacher for Ada-R are provided in the extended version of this paper [8]. We note that all teaching algorithms have the same stopping criterion: the teacher stops when the learner reaches the target hypothesis, that is, ht = h*.

6.2 Results

We measure the performance of the teaching algorithms by their teaching complexity, and all results are averaged over 50 trials with random samples of (h0, h*).

Noise-free setting Here, we consider the “noise-free” setting, i.e., the learner acts according to the state-dependent preference models described in §4.2. In Fig. 3b, we show the results for the 2-REC class with a fixed grid size of 15 × 15 for all four teaching scenarios. As we can see from Fig. 3b, Ada-R has a consistent advantage over the non-adaptive baselines across all four scenarios. As expected, teaching H1→1, H1→2, and H2→2 is easier, and the non-adaptive algorithms (SC and Non-R) perform well. In contrast, when teaching H2→1, we see a significant gain from Ada-R over the non-adaptive baselines. In the worst case, SC has to explore all the negative examples to teach h*, whereas Non-R needs to consider all negative examples within the learner's initial hypothesis h0 to make the learner jump from the subclass H2 to H1. In Fig. 3c, we observe that the adaptivity gain increases drastically as we increase the grid size. This matches our analysis of the logarithmic adaptivity gain in Theorem 2 for 2-REC.

(a) 2-REC class  (b) 2-REC, size 15 × 15  (c) 2-REC, H2→1  (d) 2-REC, robustness
Figure 3: Illustration and simulation results for 2-REC. (a) illustrates the 2-REC hypothesis class. The initial hypothesis h0 ∈ H2 is represented by the orange rectangles, and the target hypothesis h* ∈ H1 is represented by the green rectangle. The green and blue cells represent a positive and a negative teaching example, respectively. Simulation results are shown in (b)-(d).

Robustness in a noisy setting In real-world teaching tasks, the learner's preference may deviate from the preference σ of an “ideal” learner that the teacher is modeling. In this experiment, we consider a more realistic scenario, where we simulate noisy learners by randomly perturbing the preference of the “ideal” learner at each time step. With probability 1 − ε the learner follows σ, and with probability ε the learner switches to a random hypothesis in the version space. In Fig. 3d, we show the results for the 2-REC hypothesis class with different noise levels ε ∈ [0, 1].
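The perturbed learner just described can be sketched as follows; the 1-D threshold class stands in for 2-REC purely for illustration:

```python
# Sketch of the noisy learner from the robustness experiment: with
# probability 1 - eps it follows the preference sigma, and with
# probability eps it jumps to a uniformly random hypothesis in the
# version space.  Toy 1-D threshold class (ours) stands in for 2-REC.
import random

def label(t, x):
    return int(x >= t)

def noisy_update(h, vs, example, sigma, eps, rng):
    x, y = example
    new_vs = [h2 for h2 in vs if label(h2, x) == y]
    if rng.random() < eps:
        return rng.choice(new_vs), new_vs   # random jump within the version space
    m = min(sigma(h2, h) for h2 in new_vs)
    return next(h2 for h2 in new_vs if sigma(h2, h) == m), new_vs

sigma = lambda a, b: abs(a - b)
rng = random.Random(0)
h, vs = noisy_update(0, list(range(5)), (2, 0), sigma, eps=0.5, rng=rng)
```

With eps = 0 this reduces exactly to the noise-free update of Eq. (3.1); an adaptive teacher can recover from the random jumps because it re-observes the learner's state after every example.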
We observe that even for highly noisy learners, e.g., ε = 0.9, our algorithm Ada-R performs much better than SC.3,4

7 User Study

Here we describe experiments performed with human participants from Mechanical Turk using the 2-REC hypothesis class. We created a web interface in order to (i) elicit the preferences over hypotheses of human participants, and (ii) evaluate our adaptive algorithm when teaching human learners.

Eliciting human preferences We consider a two-step process for the elicitation experiments. At the beginning of the session (first step), participants were shown a grid of green, blue, or white cells and asked to draw a hypothesis from the 2-REC class, represented by one or two rectangles. Participants could only draw "valid" hypotheses that are consistent with the observed labels (i.e., the hypothesis must contain all the green cells and exclude all the blue cells), cf. Fig. 3a. The color of the revealed cells is defined by an underlying target hypothesis h*. In the second step, the interface updated the configuration of cells (either by adding or deleting green/blue cells) and participants were asked to redraw their rectangle(s) (or move the edges of the previously drawn rectangle(s)), which ensures that the updated hypothesis remains consistent.

We consider five types of sessions, depending on the class of h* and the configurations presented to a participant in the first and second steps. These configurations are listed in Fig. 4a. For instance, the session type in the third row, (H2, (1/2), 2), means the following: the labels were generated based on a hypothesis h* ∈ H2; in the first step, both subclasses H1 and H2 had consistent hypotheses; and in the second step, only the subclass H2 had consistent hypotheses.

We tested 215 participants, where each individual performed 10 trials on a grid of size 12 × 12.
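The validity constraint above (a drawn hypothesis must contain every green cell and exclude every blue cell) amounts to a simple consistency check. The sketch below uses our own encoding of a 2-REC hypothesis as up to two inclusive axis-aligned rectangles; this representation is an assumption for illustration, not the paper's interface code:

```python
def consistent(rectangles, positives, negatives):
    """Check whether a 2-REC hypothesis is consistent with the labels.

    A hypothesis is a list of up to two axis-aligned rectangles, each
    given as (x1, y1, x2, y2) with inclusive bounds (our own encoding).
    Consistency requires covering every positive (green) cell and
    excluding every negative (blue) cell.
    """
    def covered(cell):
        x, y = cell
        return any(x1 <= x <= x2 and y1 <= y <= y2
                   for (x1, y1, x2, y2) in rectangles)

    return (all(covered(p) for p in positives)
            and not any(covered(n) for n in negatives))
```

For example, the hypothesis `[(0, 0, 2, 2), (5, 5, 6, 6)]` is consistent with a green cell at `(1, 1)` and a blue cell at `(3, 3)`, but not with a blue cell at `(5, 6)`, which falls inside the second rectangle.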
For each trial, we randomly selected one of the five types of sessions as discussed above. In Fig. 4a, we see

3 In general, the teaching sequence constructed by the non-adaptive algorithm Non-R (resp. Non-L) would not be sufficient to reach the target in the noisy setting. Hence, we did not include the results of these non-adaptive algorithms in the robustness plots. Note that one can tweak Non-R (resp. Non-L) by concatenating its teaching sequence with teaching examples generated by SC; however, in a worst-case sense, no non-adaptive algorithm in the noisy setting will perform better than SC.

4 The performance of SC is non-monotone w.r.t. the noise level. This is attributed to the stopping criterion of the algorithm, as an increase in the noise level increases the chance for the learner to randomly jump to h*.

(a) Transitions across subclasses Hi → Hj, shown as user transition frequencies for each session type (h*, 1st, 2nd):

Session Type          H1→H1   H1→H2   H2→H1   H2→H2
(H1, (1/2), (1/2))     0.54    0.13    0.06    0.27
(H2, (1/2), (1/2))     0.30    0.31    0.01    0.38
(H2, (1/2), 2)         0.00    0.61    0.00    0.39
(H2, 2, 2)             0.00    0.00    0.00    1.00
(H2, 2, (1/2))         0.00    0.00    0.11    0.89

(b) Transitions within H1    (c) Teaching results

Figure 4: (a)-(b) present results for eliciting human preferences for different session types, as explained in the text below, and (c) shows results for teaching human learners. (a) Participants prefer staying within the same hypothesis subclass when possible, displayed as the fraction of time they switched subclasses for different session types.
(b) Considering the transitions within subclass H1, participants favor staying at their current hypothesis if it remains valid, along with preferring smaller updates, measured as the L1 distance between the initial and updated rectangle. (c) The adaptive teaching algorithm Ada-R is significantly better than SC and Rand.

that participants tend to favor staying in the same hypothesis subclass when possible. Within the same subclass, they have a preference towards updates that are close to their initial hypothesis, cf. Fig. 4b.5

Teaching human learners Next we evaluate our teaching algorithms on human learners. As in the simulations, we consider four teaching scenarios: H1→1, H1→2, H2→1, and H2→2. At the beginning of the teaching session, a participant was shown a blank 8 × 8 grid with either one or two initial rectangles, corresponding to h0. At every iteration, the participant was provided with a new teaching example (i.e., a new green or blue cell was revealed) and asked to update the current hypothesis. We evaluate three algorithms, namely Ada-R, SC, and Rand, where Rand denotes a teaching strategy that picks examples at random. The non-adaptive algorithm Non-R was not included in the user study for the same reasons as explained in Footnote 3. We enlisted 200 participants to evaluate the teaching algorithms; each participant completed five trials. For each trial, we randomly selected one of the three teaching algorithms and one of the four teaching scenarios. We then recorded the number of examples required to learn the target hypothesis. Teaching was terminated when 60% of the cells were revealed; if the learner did not reach the target hypothesis by this point, we set the number of teaching examples to this upper limit. We illustrate a teaching session in the extended version of this paper [8].

Fig. 4c illustrates the superiority of the adaptive teacher Ada-R, while Rand performs the worst. In both cases where the target hypothesis is in H2, the SC teacher performs nearly as well as the adaptive teacher, as at most 12 teaching examples are required to fully characterize the location of both rectangles. However, we observe a large gain from the adaptive teacher for the scenario H2→1.

8 Conclusions

We explored the role of adaptivity in algorithmic machine teaching and showed that the adaptivity gain is zero under well-studied learner models (e.g., "worst-case" and "preference-based") for the case of version space learners. This is in stark contrast to real-life scenarios, where adaptivity is an important ingredient of effective teaching. We highlighted the importance of local preferences (i.e., preferences dependent on the current hypothesis) when the learner transitions to the next hypothesis. We presented hypothesis classes where such local preferences arise naturally, given that machines and humans have a tendency to learn incrementally. Furthermore, we characterized the structure of optimal adaptive teaching algorithms, designed near-optimal general-purpose and application-specific adaptive algorithms, and validated these algorithms via simulation and user studies.

5 Given that a participant is allowed to move edges when updating the hypothesis, our interface could bias the participants' choice of the next hypothesis towards a preference structure that favors local edits, as assumed by our algorithm.
As future work, one could consider an alternative interface that requires participants to draw the rectangle(s) from scratch at every step.

Acknowledgments This work was supported in part by Northrop Grumman, Bloomberg, AWS Research Credits, Google as part of the Visipedia project, and a Swiss NSF Early Mobility Postdoctoral Fellowship.

References

[1] Frank J Balbach. Measuring teachability using variants of the teaching dimension. Theoretical Computer Science, 397(1-3):94-113, 2008.

[2] Frank J Balbach and Thomas Zeugmann. Teaching learners with restricted mind changes. In ALT, pages 474-489. Springer, 2005.

[3] Frank J Balbach and Thomas Zeugmann. Teaching randomized learners with feedback. Information and Computation, 209(3):296-319, 2011.

[4] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, pages 41-48, 2009.

[5] Elizabeth Bonawitz, Stephanie Denison, Alison Gopnik, and Thomas L Griffiths. Win-stay, lose-sample: A simple sequential algorithm for approximating Bayesian inference. Cognitive Psychology, 74:35-65, 2014.

[6] Maya Cakmak and Manuel Lopes. Algorithmic and human teaching of sequential decision tasks. In AAAI, 2012.

[7] Yuxin Chen, Oisin Mac Aodha, Shihan Su, Pietro Perona, and Yisong Yue. Near-optimal machine teaching via explanatory teaching sets. In AISTATS, April 2018.

[8] Yuxin Chen, Adish Singla, Oisin Mac Aodha, Pietro Perona, and Yisong Yue. Understanding the role of adaptivity in machine teaching: The case of version space learners. CoRR, abs/1802.05190, 2018.

[9] Thorsten Doliwa, Gaojian Fan, Hans Ulrich Simon, and Sandra Zilles. Recursive teaching dimension, VC-dimension and sample compression.
JMLR, 15(1):3107-3131, 2014.

[10] Ziyuan Gao, David Kirkpatrick, Christoph Ries, Hans Simon, and Sandra Zilles. Preference-based teaching of unions of geometric objects. In ALT, pages 185-207, 2017.

[11] Ziyuan Gao, Christoph Ries, Hans U Simon, and Sandra Zilles. Preference-based teaching. JMLR, 18(31):1-32, 2017.

[12] E Mark Gold. Language identification in the limit. Information and Control, 10(5), 1967.

[13] Sally A Goldman and Michael J Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20-31, 1995.

[14] Luis Haug, Sebastian Tschiatschek, and Adish Singla. Teaching inverse reinforcement learners via features and demonstrations. In Advances in Neural Information Processing Systems, 2018.

[15] Lisa Hellerstein, Devorah Kletenik, and Patrick Lin. Discrete stochastic submodular maximization: Adaptive vs. non-adaptive vs. offline. In CIAC, pages 235-248, 2015.

[16] Bernhard Hengst. Hierarchical Reinforcement Learning, pages 495-502. 2010.

[17] Anette Hunziker, Yuxin Chen, Oisin Mac Aodha, Manuel Gomez-Rodriguez, Andreas Krause, Pietro Perona, Yisong Yue, and Adish Singla. Teaching multiple concepts to a forgetful learner. CoRR, abs/1805.08322, 2018.

[18] Susmit Jha and Sanjit A. Seshia. A theory of formal synthesis via inductive learning. Acta Inf., 54(7):693-726, 2017.

[19] Edward Johns, Oisin Mac Aodha, and Gabriel J Brostow. Becoming the expert: Interactive multi-class machine teaching. In CVPR, pages 2616-2624, 2015.

[20] Steffen Lange and Thomas Zeugmann. Incremental learning from positive data. Journal of Computer and System Sciences, 53(1):88-103, 1996.

[21] Marvin Levine. A cognitive theory of learning: Research on hypothesis testing. Lawrence Erlbaum, 1975.

[22] Ji Liu and Xiaojin Zhu. The teaching dimension of linear learners.
JMLR, 17(162):1-25, 2016.

[23] Weiyang Liu, Bo Dai, Ahmad Humayun, Charlene Tay, Chen Yu, Linda B. Smith, James M. Rehg, and Le Song. Iterative machine teaching. In ICML, pages 2149-2158, 2017.

[24] Shike Mei and Xiaojin Zhu. Using machine teaching to identify optimal training-set attacks on machine learners. In AAAI, pages 2871-2877, 2015.

[25] Deyu Meng, Qian Zhao, and Lu Jiang. A theoretical understanding of self-paced learning. Inf. Sci., 414:319-328, 2017.

[26] Anna N Rafferty, Emma Brunskill, Thomas L Griffiths, and Patrick Shafto. Faster teaching via POMDP planning. Cognitive Science, 40(6):1290-1332, 2016.

[27] David A Ross, Jongwoo Lim, Ruei-Sung Lin, and Ming-Hsuan Yang. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1-3):125-141, 2008.

[28] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 2012.

[29] Adish Singla, Ilija Bogunovic, Gábor Bartók, Amin Karbasi, and Andreas Krause. On actively teaching the crowd to classify. In NIPS Workshop on Data Driven Education, 2013.

[30] Adish Singla, Ilija Bogunovic, Gábor Bartók, Amin Karbasi, and Andreas Krause. Near-optimally teaching the crowd to classify. In ICML, pages 154-162, 2014.

[31] Cem Tekin, Jonas Braun, and Mihaela van der Schaar. eTutor: Online learning for personalized education. In ICASSP, pages 5545-5549, 2015.

[32] Lev Vygotsky. Zone of proximal development. Mind in Society: The Development of Higher Psychological Processes, 5291:157, 1987.

[33] Daniel S Weld, Eytan Adar, Lydia Chilton, Raphael Hoffmann, Eric Horvitz, Mitchell Koch, James Landay, Christopher H Lin, and Mausam Mausam. Personalized online education: a crowdsourcing challenge. In HCOMP, 2012.

[34] Xiaojin Zhu. Machine teaching for Bayesian learners in the exponential family.
In Advances in Neural Information Processing Systems, pages 1905-1913, 2013.

[35] Xiaojin Zhu. Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI, pages 4083-4087, 2015.

[36] Xiaojin Zhu, Adish Singla, Sandra Zilles, and Anna N. Rafferty. An overview of machine teaching. CoRR, abs/1801.05927, 2018.

[37] Sandra Zilles, Steffen Lange, Robert Holte, and Martin Zinkevich. Models of cooperative teaching and learning. JMLR, 12(Feb):349-384, 2011.