{"title": "How Do Humans Teach: On Curriculum Learning and Teaching Dimension", "book": "Advances in Neural Information Processing Systems", "page_first": 1449, "page_last": 1457, "abstract": "We study the empirical strategies that humans follow as they teach a target concept with a simple 1D threshold to a robot. Previous studies of computational teaching, particularly the teaching dimension model and the curriculum learning principle, offer contradictory predictions on what optimal strategy the teacher should follow in this teaching task. We show through behavioral studies that humans employ three distinct teaching strategies, one of which is consistent with the curriculum learning principle, and propose a novel theoretical framework as a potential explanation for this strategy. This framework, which assumes a teaching goal of minimizing the learner's expected generalization error at each iteration, extends the standard teaching dimension model and offers a theoretical justification for curriculum learning.", "full_text": "How Do Humans Teach:\n\nOn Curriculum Learning and Teaching Dimension\n\nFaisal Khan, Xiaojin Zhu, Bilge Mutlu\n\nDepartment of Computer Sciences, University of Wisconsin\u2013Madison\n\nMadison, WI, 53706 USA. {faisal, jerryzhu, bilge}@cs.wisc.edu\n\nAbstract\n\nWe study the empirical strategies that humans follow as they teach a target concept\nwith a simple 1D threshold to a robot.1 Previous studies of computational teach-\ning, particularly the teaching dimension model and the curriculum learning prin-\nciple, offer contradictory predictions on what optimal strategy the teacher should\nfollow in this teaching task. We show through behavioral studies that humans em-\nploy three distinct teaching strategies, one of which is consistent with the curricu-\nlum learning principle, and propose a novel theoretical framework as a potential\nexplanation for this strategy. 
This framework, which assumes a teaching goal of\nminimizing the learner\u2019s expected generalization error at each iteration, extends\nthe standard teaching dimension model and offers a theoretical justi\ufb01cation for\ncurriculum learning.\n\n1\n\nIntroduction\n\nWith machine learning comes the question of how to effectively teach. Computational teaching\nhas been well studied in the machine learning community [9, 12, 10, 1, 2, 11, 13, 18, 4, 14, 15].\nHowever, whether these models can predict how humans teach is less clear. The latter question is\nimportant not only for such areas as education and cognitive psychology but also for applications of\nmachine learning, as learning agents such as robots become commonplace and learn from humans.\nA better understanding of the teaching strategies that humans follow might inspire the development\nof new machine learning models and the design of learning agents that more naturally accommodate\nthese strategies.\n\nStudies of computational teaching have followed two prominent threads. The \ufb01rst thread, devel-\noped by the computational learning theory community, is exempli\ufb01ed by the \u201cteaching dimension\u201d\nmodel [9] and its extensions [12, 10, 1, 2, 11, 13, 18]. The second thread, motivated partly by ob-\nservations in psychology [16], is exempli\ufb01ed by the \u201ccurriculum learning\u201d principle [4, 14, 15]. We\nwill discuss these two threads in the next section. However, they make con\ufb02icting predictions on\nwhat optimal strategy a teacher should follow in a simple teaching task. This con\ufb02ict serves as an\nopportunity to compare these predictions to human teaching strategies in the same task.\n\nThis paper makes two main contributions: (i) it enriches our empirical understanding of human\nteaching and (ii) it offers a theoretical explanation for a particular teaching strategy humans follow.\nOur approach combines cognitive psychology and machine learning. 
We first conduct a behavioral study with human participants in which participants teach a robot, following teaching strategies of their choice. This approach differs from most previous studies of computational teaching in machine learning and psychology, which involve a predetermined teaching strategy and focus on the behavior of the learner rather than the teacher. We then compare the observed human teaching strategies to those predicted by the teaching dimension model and the curriculum learning principle.\n\n1Our data is available at http://pages.cs.wisc.edu/\u223cjerryzhu/pub/humanteaching.tgz.\n\nFigure 1: The target concept hj.\n\nEmpirical results indicate that human teachers follow the curriculum learning principle, while no evidence of the teaching dimension model is observed. Finally, we provide a novel theoretical analysis that extends recent ideas in the teaching dimension model [13, 3] and offers curriculum learning a rigorous underpinning.\n\n2 Competing Models of Teaching\n\nWe first review the classic teaching dimension model [9, 1]. Let X be an input space, Y the label space, and (x1, y1), . . . , (xn, yn) \u2208 X \u00d7 Y a set of instances. We focus on binary classification in the unit interval: X = [0, 1], Y = {0, 1}. We call H \u2286 2^{x1,...,xn} a concept class and h \u2208 H a concept. A concept h is consistent with an instance (x, y) iff x \u2208 h \u21d4 y = 1. h is consistent with a set of instances if it is consistent with every instance in the set. A set of instances is called a teaching set of a concept h with respect to H if h is the only concept in H that is consistent with the set. The teaching dimension of h with respect to H is the minimum size of its teaching set. The teaching dimension of H is the maximum teaching dimension over its concepts.\n\nConsider the task in Figure 1, which we will use throughout the paper. Let x1 \u2264 . . . \u2264 xn. 
Let H be\nall threshold labelings: H = {h | \u2203\u03b8 \u2208 [0, 1],\u2200i = 1 . . . n : xi \u2208 h \u21d4 xi \u2265 \u03b8}. The target concept\nhj has the threshold between xj and xj+1: hj = {xj+1, . . . , xn}. Then, the teaching dimension\nof most hj is 2, as one needs the minimum teaching set {(xj, 0), (xj+1, 1)}; for the special cases\nh0 = {x1, . . . , xn} and hn = \u2205 the teaching dimension is 1 with the teaching set {(x1, 1)} and\n{(xn, 0)}, respectively. The teaching dimension of H is 2. For our purpose, the most important\nargument is the following: The teaching strategy for most hj\u2019s suggested by teaching dimension is\nto show two instances {(xj, 0), (xj+1, 1)} closest to the decision boundary. Intuitively, these are the\ninstances most confusable by the learner.\n\nAlternatively, curriculum learning suggests an easy-to-hard (or clear-to-ambiguous) teaching strat-\negy [4]. For the target concept in Figure 1, \u201ceasy\u201d instances are those farthest from the de-\ncision boundary in each class, while \u201chard\u201d ones are the closest to the boundary. One such\nteaching strategy is to present instances from alternating classes, e.g., in the following order:\n(x1, 0), (xn, 1), (x2, 0), (xn\u22121, 1), . . . , (xj, 0), (xj+1, 1). Such a strategy has been used for second-\nlanguage teaching in humans. For example, to train Japanese listeners on the English [r]-[l] distinc-\ntion, McCandliss et al. linearly interpolated a vocal tract model to create a 1D continuum similar\nto Figure 1 along [r] and [l] sounds. They showed that participants were better able to distinguish\nthe two phonemes if they were given easy (over-articulated) training instances \ufb01rst [16]. 
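To make the contrast concrete, the two predicted teaching orders can be sketched in a few lines of Python. This is an illustration of ours, not code from the study; the function names `boundary_order` and `extreme_order` are invented.

```python
# Two predicted teaching orders for the 1D threshold task of Figure 1.
# x: sorted ratings x_1 <= ... <= x_n; y: labels, a run of 0s followed by 1s.

def boundary_order(x, y):
    """Teaching-dimension prediction: the minimal teaching set is the
    pair of instances straddling the decision boundary."""
    j = max(i for i, label in enumerate(y) if label == 0)  # last negative
    return [(x[j], 0), (x[j + 1], 1)]

def extreme_order(x, y):
    """Curriculum-learning prediction: alternate classes, most extreme
    (easiest) instances first, moving inward toward the boundary."""
    neg = [(xi, 0) for xi, yi in zip(x, y) if yi == 0]        # left to right
    pos = [(xi, 1) for xi, yi in zip(x, y) if yi == 1][::-1]  # right to left
    seq = [item for pair in zip(neg, pos) for item in pair]
    seq += neg[len(pos):] + pos[len(neg):]  # leftovers if classes unbalanced
    return seq

x = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
y = [0, 0, 0, 1, 1, 1]
print(boundary_order(x, y))     # [(0.4, 0), (0.6, 1)]
print(extreme_order(x, y)[:2])  # [(0.1, 0), (0.9, 1)]
```

Note that the last pair emitted by `extreme_order` is exactly the boundary pair that `boundary_order` begins with, reflecting how the extreme strategy approaches the boundary from both sides.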
Computationally, curriculum learning has been justified as a heuristic related to the continuation method in optimization, used to avoid poor local optima [4].\n\nHence, for the task in Figure 1, we have two sharply contrasting teaching strategies at hand: the boundary strategy starts near the decision boundary, while the extreme strategy starts with extreme instances and gradually approaches the decision boundary from both sides. Our goal in this paper is to compare human teaching strategies with these two predictions to shed more light on models of teaching. While the teaching task used in our exploration is simple, as most real-world teaching situations do not involve a threshold in a 1D space, we believe that it is important to lay the foundation in a tractable task before studying more complex tasks.\n\n3 A Human Teaching Behavioral Study\n\nUnder IRB approval, we conducted a behavioral study with human participants to explore human teaching behaviors in a task similar to that illustrated in Figure 1. In our study, participants teach the target concept of \u201cgraspability\u201d\u2014whether an object can be grasped and picked up with one hand\u2014to a robot. We chose graspability because it corresponds nicely to a 1D space empirically studied before [17].\n\nFigure 2: (a) A participant performing the card sorting/labeling and teaching tasks. (b) Human teaching sequences that follow the extreme strategy gradually shrink the version space V1.\n\nWe chose to use a robot learner because it offers great control and consistency while facilitating natural interaction and teaching. The robot keeps its behavior consistent across conditions and trials, thus providing us with the ability to isolate various interactional factors. This level of experimental control is hard to achieve with a human learner. 
The robot also affords\nembodied behavioral cues that facilitate natural interaction and teaching strategies that computers\ndo not afford.\nParticipants were 31 paid subjects recruited from the University of Wisconsin\u2013Madison campus.\nAll were native English speakers with an average age of 21 years.\nMaterials. We used black-and-white photos of n = 31 objects chosen from the norming study\nof Salmon et al. [17]. The photos were of common objects (e.g., food, furniture, animals) whose\naverage subjective graspability ratings evenly span the whole range. We printed each photo on a 2.5-\nby-4.5 inch card. The robot was a Wakamaru humanlike robot manufactured by Mitsubishi Heavy\nIndustries, Ltd. It neither learned nor responded to teaching. Instead, it was programmed to follow\nmotion in the room with its gaze. Though seemingly senseless, this behavior in fact provides a\nconsistent experience to the participants without extraneous factors to bias them. It also corresponds\nto the no-feedback assumption in most teaching models [3]. Participants were not informed that the\nrobot was not actually learning.\nProcedure. Each participant completed the experiment alone. The experiment involved two sub-\ntasks that were further broken down into multiple steps. In the \ufb01rst subtask, participants sorted the\nobjects based on their subjective ratings of their graspability following the steps below.\n\nIn step 1, participants were instructed to place each object along a ruler provided on a long table\nas seen in Figure 2(a). To provide baselines on the two ends of the graspability spectrum, we \ufb01xed\na highly graspable object (a toothbrush) and a highly non-graspable object (a building) on the two\nends of the ruler. We captured the image of the table and later converted the position of each card\ninto a participant-speci\ufb01c, continuous graspability rating x1, . . . , xn \u2208 [0, 1]. 
For our purpose, there is no need to enforce inter-participant agreement.\n\nIn step 2, participants assigned a binary \u201cgraspable\u201d (y = 1) or \u201cnot graspable\u201d (y = 0) label to each object by writing the label on the back of the corresponding card. This gave us the labels y1, . . . , yn. The sorted cards and the decision boundary from one of the participants are illustrated in Figure 3.\n\nIn step 3, we asked participants to leave the room for a short duration so that \u201cthe robot could examine the sorted cards on the table without looking at the labels provided at the back,\u201d creating the impression that the learner would associate the cards with the corresponding values x1, . . . , xn.\n\nIn the second subtask, participants taught the robot the (binary) concept of graspability using the cards. In this task, participants picked up a card from the table, turned toward the robot, and held the card up while providing a verbal description of the object\u2019s graspability (i.e., the binary label y) as seen in Figure 2(a). The two cards, \u201ctoothbrush\u201d and \u201cbuilding,\u201d were fixed to the table and not available for teaching. The participants were randomly assigned to two conditions: (1) natural and (2) constrained. In the \u201cnatural\u201d condition, participants were allowed to use natural language to describe the graspability of the objects, while those in the \u201cconstrained\u201d condition were only allowed to say either \u201cgraspable\u201d or \u201cnot graspable.\u201d They were instructed to use as few cards as they felt necessary. There was no time limit on either subtask.\n\nResults. The teaching sequences from all participants are presented in Figure 4. The title of each plot contains the participant ID and condition. The participant\u2019s rating and classification of all objects are presented above the x-axis. 
Objects labeled as \u201cnot graspable\u201d are indicated with blue\ncircles and those labeled as \u201cgraspable\u201d are marked with red plus signs. The x-axis position of the\nobject represents its rating x \u2208 [0, 1]. The vertical blue and red lines denote an \u201cambiguous region\u201d\naround the decision boundary; objects to the left of the blue line have the label \u201cnot graspable;\u201d\nthose to the right of the red line are labeled as \u201cgraspable,\u201d and objects between these lines could\nhave labels in mixed order. In theory, following the boundary strategy, the teacher should start with\nteaching instances on these two lines as suggested by the teaching dimension model. The y-axis is\ntrial t = 1, . . . , 15, which progresses upwards. The black line and dots represent the participant\u2019s\nteaching sequence. For example, participant P01 started teaching at t = 1 with an object she rated\nas x = 1 and labeled as \u201cgraspable;\u201d at t = 2, she chose an example with rating x = 0 and label\n\u201cnot graspable;\u201d and so on. The average teaching sequence had approximately 8 examples, while\nthe longest teaching sequence had a length of 15 examples.\n\nWe observed three major human teaching strategies in our data: (1) the extreme strategy, which\nstarts with objects with extreme ratings and gradually moves toward the decision boundary; (2)\nthe linear strategy, which follows a prominent left-to-right or right-to-left sequence; and (3) the\npositive-only strategy, which involves only positively labeled examples. We categorized most\nteaching sequences into these three strategies following a simple heuristic. First, sequences that\ninvolved only positive examples were assigned to the positive-only strategy. Then, we assigned\nthe sequences whose \ufb01rst two teaching examples had different labels to the extreme strategy and\nthe others to the linear strategy. 
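The categorization heuristic just described can be written out directly. This is our own re-implementation sketch (it omits the two manual exceptions noted below); `seq` is a list of (rating, label) pairs in teaching order.

```python
# Heuristic for categorizing a teaching sequence into one of the three
# observed strategies: positive-only, extreme, or linear.

def classify_strategy(seq):
    labels = [y for _, y in seq]
    if all(y == 1 for y in labels):
        return "positive-only"   # only positively labeled examples used
    if labels[0] != labels[1]:
        return "extreme"         # first two examples have different labels
    return "linear"              # e.g. a left-to-right sweep

print(classify_strategy([(1.0, 1), (0.0, 0), (0.9, 1)]))  # extreme
print(classify_strategy([(0.0, 0), (0.1, 0), (1.0, 1)]))  # linear
print(classify_strategy([(0.9, 1), (1.0, 1)]))            # positive-only
```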
While this simplistic approach does not guarantee perfect clas-\nsi\ufb01cation (e.g., P30 can be labeled differently), it minimizes hand-tuning and reduces the risk of\nover\ufb01tting. We made two exceptions, manually assigning P14 and P16 to the extreme strategy.\nNonetheless, these few potential misclassi\ufb01cations do not change our conclusions below.\n\nNone of the sequences followed the boundary strategy. In fact, among all 31 participants, 20 started\nteaching with the most graspable object (according to their own rating), 6 with the least graspable,\nnone in or around the ambiguous region (as boundary strategy would predict), and 5 with some\nother objects. In brief, people showed a tendency to start teaching with extreme objects, especially\nthe most graspable ones. During post-interview, when asked why they did not start with objects\naround their decision boundary, most participants mentioned that they wanted to start with clear\nexamples of graspability.\n\nFor participants who followed the extreme strategy, we are interested in whether their teaching\nsequences approach the decision boundary as curriculum learning predicts. Speci\ufb01cally, at any\ntime t, let the partial teaching sequence be (x1, y1), . . . , (xt, yt). The aforementioned ambiguous\nregion with respect to this partial sequence is the interval between the inner-most pair of teaching\nexamples with different labels. This can be written as V1 \u2261 [maxj:yj =0 xj, minj:yj =1 xj] where j is\nover 1 . . . t. V1 is exactly the version space of consistent threshold hypotheses (the subscript 1 will\nbecome clear in the next section). Figure 2(b) shows a box plot of the size of V1 for all participants\nas a function of t. 
The red lines mark the median and the blue boxes indicate the 1st & 3rd quartiles. As expected, the size of the version space decreases.\n\nFigure 3: Sorted cards and the decision boundary from one of the participants.\n\nFigure 4: Teaching sequences of all participants, grouped into the extreme strategy, the linear strategy, and the positive-only strategy.\n\nFinally, the positive-only strategy was observed significantly more in the \u201cnatural\u201d condition (3/16 \u2248 19%) than in the \u201cconstrained\u201d condition (0/15 = 0%), \u03c72(1, N = 31) = 4.27, p = .04. We observed that these participants elaborated in English to the robot why they thought that their objects were graspable. We speculate that they might have felt that they had successfully described the rules and that there was no need to use negative examples. In contrast, the constrained condition did not have the rich expressivity of natural language, necessitating the use of negative examples.\n\n4 A Theoretical Account of the \u201cExtreme\u201d Teaching Strategy\n\nWe build on our empirical results and offer a theoretical analysis as a possible rationalization for the extreme strategy. Research in cognitive psychology has consistently shown that humans represent everyday objects with a large number of features (e.g., [7, 8]). We posit that although our teaching task was designed to mimic the one-dimensional task illustrated in Figure 1 (e.g., the linear layout of the cards in Figure 3), our teachers might still have believed (perhaps subconsciously) that the robot learner, like humans, associates each teaching object with multiple feature dimensions.\n\nUnder the high-dimensional assumption, we show that the extreme strategy is an outcome of minimizing the per-iteration expected error of the learner. Note that the classic teaching dimension model [9] fails to predict the extreme strategy even under this assumption. 
Our analysis is inspired by recent advances in teaching dimension, which assume that teaching progresses in iterations and learning is to be maximized after each iteration [13, 3]. Unlike those analyses, we minimize the expected error instead of the worst-case error and employ different techniques.\n\n4.1 Problem Setting and Model Assumptions\n\nOur formal setup is as follows. The instance space is the d-dimensional hypercube X = [0, 1]^d. We use boldface x \u2208 X to denote an instance and xij for the j-th dimension of instance xi. The binary label y is determined by the threshold 1/2 in the first dimension: yi = 1{xi1 \u2265 1/2}. This formulation idealizes our empirical study, where the continuous rating is the first dimension. It implies that the target concept is unrelated to any of the other d \u2212 1 features. In practice, however, there may be other features that are correlated with the target concept. 
But our analysis carries through by replacing d with the number of irrelevant dimensions.\n\nDeparting from classic teaching models, we consider a \u201cpool-based sequential\u201d teaching setting. In this setting, a pool of n instances is sampled iid x1, . . . , xn \u223c p(x), where we assume that p(x) is uniform on X for simplicity. Their labels y1, . . . , yn may be viewed as being sampled from the conditional distribution p(yi = 1 | xi) = 1{xi1 > 1/2}. The teacher can only sequentially teach instances selected from the pool (e.g., in our empirical study, the pool consists of the 29 objects). Her goal is for the learner to generalize well on test instances outside the pool (also sampled from p(x, y) = p(x)p(y | x)) after each iteration.\n\nAt this point, we make two strong assumptions about the learner. First, we assume that the learner entertains axis-parallel hypotheses. That is, each hypothesis has the form hk\u03b8s(x) = 1{s(x\u00b7k \u2212 \u03b8) \u2265 0} for some dimension k \u2208 {1 . . . d}, threshold \u03b8 \u2208 [0, 1], and orientation s \u2208 {\u22121, 1}. The cognitive interpretation of an axis-parallel hypothesis is that the learner attends to a single dimension at any given time.2 As in classic teaching models, our learner is consistent (i.e., it never contradicts the teaching instances it receives). The version space V(t) of the learner, i.e., the set of hypotheses consistent with the teaching sequence (x1, y1), . . . , (xt, yt) so far, takes the form V(t) = \u222a_{k=1}^{d} Vk(t), where Vk(t) = {hk\u03b8,1 | max_{j:yj=0} xjk \u2264 \u03b8 \u2264 min_{j:yj=1} xjk} \u222a {hk\u03b8,\u22121 | max_{j:yj=1} xjk \u2264 \u03b8 \u2264 min_{j:yj=0} xjk}. 
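As an illustration (ours, not from the paper), the per-dimension interval sizes |Vk(t)| can be computed from a teaching sequence as follows; the helper names are invented.

```python
# Sizes of the per-dimension version-space intervals V_k(t) for the
# axis-parallel learner. `X` is a list of d-dimensional teaching
# instances, `y` their binary labels.

def interval_size(lo, hi):
    return max(0.0, hi - lo)

def version_space_sizes(X, y):
    d = len(X[0])
    sizes = []
    for k in range(d):
        neg = [x[k] for x, yi in zip(X, y) if yi == 0]
        pos = [x[k] for x, yi in zip(X, y) if yi == 1]
        # orientation s = +1: thresholds between the largest negative
        # and the smallest positive coordinate in dimension k
        up = interval_size(max(neg, default=0.0), min(pos, default=1.0))
        # orientation s = -1: the mirror case
        down = interval_size(max(pos, default=0.0), min(neg, default=1.0))
        sizes.append(up + down)
    return sizes  # |V_k(t)| per dimension; |V(t)| is their sum

X = [(0.9, 0.2), (0.1, 0.7)]  # one positive, one negative instance
y = [1, 0]
print(version_space_sizes(X, y))  # approximately [0.8, 0.5]
```

In this two-instance example the first entry is a \u2212 b in the relevant dimension and the second is |x12 \u2212 x22| in an irrelevant one, matching the derivation that follows.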
The version space can be thought of as the union of the inner intervals surviving the teaching examples.\n\nSecond, similar to the randomized learners in [2], our learner selects a hypothesis h uniformly from the version space V(t), follows it until h is no longer in V(t), and then randomly selects a replacement hypothesis\u2014a strategy known as \u201cwin stay, lose shift\u201d in cognitive psychology [5]. It is thus a Gibbs classifier. In particular, the risk, defined as the expected 0-1 loss of the learner on a test instance, is R(t) \u2261 E_{(x,y)\u223cp(x,y)} E_{h\u2208V(t)} 1{h(x) \u2260 y}. We point out that our assumptions are psychologically plausible and greatly simplify the derivation below.\n\n4.2 Starting with Extreme Teaching Instances is Asymptotically Optimal\n\nWe now show why starting with extreme teaching instances, as in curriculum learning, rather than with the boundary strategy is optimal under our setting. Specifically, we consider the problem of selecting an optimal teaching sequence of length t = 2, one positive and one negative: (x1, 1), (x2, 0). Introducing the shorthand a \u2261 x11, b \u2261 x21, the teacher seeks a, b to minimize the risk:\n\nmin_{a,b \u2208 [0,1]} R(2)    (1)\n\nNote that we allow a, b to take any value within their domains, which is equivalent to having an infinite pool for the teacher to choose from. We will tighten this later. Also note that we assume the teacher does not pay attention to irrelevant dimensions, whose feature values can then be modeled as uniform random variables.\n\nFor any teaching sequence of length 2, the individual intervals of the version space have sizes |V1(2)| = a \u2212 b and |Vk(2)| = |x1k \u2212 x2k| for k = 2, . . . , d, respectively. The total size of the version space is |V(2)| = a \u2212 b + \u2211_{k=2}^{d} |x1k \u2212 x2k|. 
Figure 5: (a) A hypothesis h1\u03b811 \u2208 V1(2) is parallel to the true decision boundary, with test error |\u03b81 \u2212 1/2| (shaded area). (b) A hypothesis h2\u03b82s \u2208 V2(2) is orthogonal to the true decision boundary, with test error 1/2 (shaded area). (c) Theoretical teaching sequences gradually shrink |V1|, similar to human behaviors.\n\nFootnote 2: A generalization to arbitrary non-axis-parallel linear separators is possible in theory and would be interesting. However, non-axis-parallel linear separators (known as \u201cinformation integration\u201d in psychology) are more challenging for human learners. Consequently, our human teachers might not have expected the robot learner to perform information integration either.\n\nFigure 5(a) shows that for all h1\u03b811 \u2208 V1(2), the decision boundary is parallel to the true decision boundary and the test error is E_{(x,y)\u223cp(x,y)} 1{h1\u03b811(x) \u2260 y} = |\u03b81 \u2212 1/2|. Figure 5(b) shows that for all hk\u03b8ks \u2208 \u222a_{k=2}^{d} Vk(2), the decision boundary is orthogonal to the true decision boundary and the test error is 1/2. Therefore, we have\n\nR(2) = (1/|V(2)|) ( \u222b_b^a |\u03b81 \u2212 1/2| d\u03b81 + \u2211_{k=2}^{d} \u222b_{min(x1k,x2k)}^{max(x1k,x2k)} (1/2) d\u03b8k ) = (1/|V(2)|) ( (1/2)(1/2 \u2212 b)^2 + (1/2)(a \u2212 1/2)^2 + \u2211_{k=2}^{d} (1/2)|x1k \u2212 x2k| ).\n\nIntroducing the shorthand ck \u2261 |x1k \u2212 x2k|, c \u2261 \u2211_{k=2}^{d} ck, one can write R(2) = ((1/2 \u2212 b)^2 + (a \u2212 1/2)^2 + c) / (2(a \u2212 b + c)). The intuition is that a pair of teaching instances leads to a version space V(2) consisting of one interval per dimension. A random hypothesis selected from the interval in the first dimension V1(2) can range from good (if \u03b81 is close to 1/2) to poor (\u03b81 far away from 1/2), while one selected from \u222a_{k=2}^{d} Vk(2) is always bad. 
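This closed-form risk is easy to check numerically. The sketch below is ours, not from the paper; it uses the symmetric parameterization b = 1 \u2212 a (established in the proof of Theorem 1 below) and the analytic minimizer a* = (\u221a(c^2 + 2c) \u2212 c + 1)/2 stated there.

```python
import math

# Risk of the Gibbs learner after two teaching instances, as derived above:
# R(2) = ((1/2 - b)^2 + (a - 1/2)^2 + c) / (2(a - b + c)).
def risk(a, b, c):
    return ((0.5 - b) ** 2 + (a - 0.5) ** 2 + c) / (2 * (a - b + c))

# Analytic minimizer (Theorem 1): a* = (sqrt(c^2 + 2c) - c + 1) / 2, b* = 1 - a*.
def a_star(c):
    return (math.sqrt(c * c + 2 * c) - c + 1) / 2

c = 3.0  # e.g. the average irrelevant version-space size for d = 10
a = a_star(c)
print(round(a, 2))  # 0.94

# Grid search over symmetric pairs (a, 1 - a) agrees with the closed form.
best = min((risk(x / 1000, 1 - x / 1000, c), x / 1000) for x in range(500, 1001))
print(abs(best[1] - a) < 0.01)  # True
```

For large c the minimizer approaches a = 1, b = 0, i.e., the extreme instances, which is the content of Corollary 2.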
The teacher can optimize the risk by choosing the size of V1(2) relative to the total version space size. The optimal choice is specified by the following theorem.\n\nTheorem 1. The minimum risk R(2) is achieved at a = (\u221a(c^2 + 2c) \u2212 c + 1)/2, b = 1 \u2212 a.\n\nProof. First, we show that at the minimum a, b are symmetric around 1/2, i.e., b = 1 \u2212 a. Suppose not. Then (a + b)/2 = 1/2 + \u03b5 for some \u03b5 \u2260 0. Let a\u2032 = a \u2212 \u03b5, b\u2032 = b \u2212 \u03b5. Then ((1/2 \u2212 b\u2032)^2 + (a\u2032 \u2212 1/2)^2 + c) / (2(a\u2032 \u2212 b\u2032 + c)) = ((1/2 \u2212 b)^2 + (a \u2212 1/2)^2 + c \u2212 2\u03b5^2) / (2(a \u2212 b + c)) < ((1/2 \u2212 b)^2 + (a \u2212 1/2)^2 + c) / (2(a \u2212 b + c)), contradicting the assumption that (a, b) attains the minimum. Next, substituting b = 1 \u2212 a in R(2) and setting the derivative w.r.t. a to 0 proves the theorem.\n\nRecall that c is the size of the part of the version space in the irrelevant dimensions. When d \u2192 \u221e, c \u2192 \u221e and the solution is a = 1, b = 0. Here, the learner can form so many bad hypotheses in the many wrong dimensions that the best strategy for the teacher is to make V1(2) as large as possible, even though many hypotheses in V1(2) have nonzero error.\n\nCorollary 2. The minimizer of (1) is a = 1, b = 0 when the dimensionality d \u2192 \u221e.\n\nProof. We characterize the distribution of ck by considering the distance between two random variables x1k, x2k sampled uniformly in [0, 1]. Let z(1), z(2) be the values of x1k, x2k sorted in ascending order. Then ck = z(2) \u2212 z(1) is an instance of an order statistic [6]. One can show that, in general, with t independent unif[0, 1] random variables sorted in ascending order as z(1), . . . , z(j), z(j+1), . . . , z(t), the gap z(j+1) \u2212 z(j) follows a Beta(1, t) distribution. In our case with t = 2, ck \u223c Beta(1, 2), whose mean is 1/3 as expected. It follows that c is the sum of d \u2212 1 independent Beta random variables. As d \u2192 \u221e, c \u2192 \u221e. Let \u03b3 = 1/c. 
Applying l\u2019H\u00f4pital\u2019s rule, lim_{c\u2192\u221e} a = lim_{c\u2192\u221e} (\u221a(c^2 + 2c) \u2212 c + 1)/2 = lim_{\u03b3\u21920} (\u221a(1 + 2\u03b3) \u2212 1 + \u03b3)/(2\u03b3) = 1.\n\nCorollary 2 has an interesting cognitive interpretation: the teacher only needs to pay attention to the relevant (first) dimension x11, x21 when selecting the two teaching instances. She does not need to consider the irrelevant dimensions, as those add up to a large c. This simplifies the teacher\u2019s task in choosing a teaching sequence: she simply picks two extreme instances in the first dimension. We also note that in practice d does not need to be very large for a to be close to 1. For example, with d = 10 dimensions, the average c is (d \u2212 1)/3 = 3 and the corresponding a = 0.94; with d = 100, a = 0.99. This observation lends further psychological plausibility to our model.\n\nSo far, we have assumed an infinite pool, such that the teacher can select the extreme teaching instances with x11 = 1, x21 = 0. In practice, the pool is finite and the optimal a, b values specified in Theorem 1 may not be attainable within the pool. However, it is straightforward to show that lim_{c\u2192\u221e} R\u2032(t) < 0, where the derivative is taken w.r.t. a after substituting b = 1 \u2212 a. That is, as c \u2192 \u221e, the objective in (1) is a monotonically decreasing function of a. Therefore, the optimal strategy for a finite pool is to choose the negative instance with the smallest x\u00b71 value and the positive instance with the largest x\u00b71 value. Note the similarity to curriculum learning, which starts with extreme (easy) instances.\n\n4.3 The Teaching Sequence Should Gradually Approach the Boundary\n\nThus far, we have focused on choosing the first two teaching instances. 
We now show that, as teaching continues, the teacher should choose instances with a and b gradually approaching 1/2. This is a direct consequence of minimizing the risk R(t) at each iteration, as c decreases to 0. In this section, we study the speed at which c decreases to 0 and a to 1/2.\n\nConsider the moment when the teacher has already presented a teaching sequence (x1, y1), . . . , (xt\u22122, yt\u22122) and is about to select the next pair of teaching instances (xt\u22121, 1), (xt, 0). Teaching with pairs is not crucial but simplifies the analysis. Following the discussion after Corollary 2, we assume that the teacher only pays attention to the first dimension when selecting teaching instances. This assumption allows us to again model the other dimensions as random variables. The teacher wishes to determine the optimal a = xt\u22121,1, b = xt,1 values according to Theorem 1. What is the value of c for a teaching sequence of length t?\n\nTheorem 3. Let the teaching sequence contain t0 negative labels and t \u2212 t0 positive ones. Then the random variables ck = \u03b1k\u03b2k, where \u03b1k \u223c Bernoulli(2/C(t, t0)) (taking value 1 with probability 2/C(t, t0) and 0 otherwise) and \u03b2k \u223c Beta(1, t), independently for k = 2 . . . d; here C(t, t0) = t!/(t0!(t \u2212 t0)!) is the binomial coefficient. Consequently, E(c) = 2(d \u2212 1)/(C(t, t0)(1 + t)).\n\nProof. We show that for each irrelevant dimension k = 2 . . . d, after t teaching instances, |Vk(t)| = \u03b1k\u03b2k. As mentioned above, these t teaching instances can be viewed as unif[0, 1] random variables in the kth dimension. Sort the values x1k, . . . , xtk in ascending order. Denote the sorted values as z(1), . . . , z(t). Vk(t) is non-empty only if the labels happen to be linearly separable, i.e., either z(1) . . . 
z(t0) have negative labels while the rest have positive labels, or the other way around. Consider the corresponding analogy in which one randomly selects a permutation of t items (there are t! permutations) such that the selected permutation has its first t0 items labeled negative and the rest positive (there are t0!(t − t0)! such permutations). Accounting for both label orderings, this probability, 2 t0!(t − t0)!/t! = 2/(t choose t0), corresponds to α_k. When V_k(t) is nonempty, its size |V_k(t)| is characterized by the order-statistic spacing z(t0+1) − z(t0), which corresponds to the Beta random variable β_k mentioned earlier in the proof of Corollary 2.

As the binomial coefficient in the denominator of E(c) suggests, c decreases to 0 rapidly with t, because t randomly placed labels in 1D are increasingly unlikely to be linearly separable. Following Theorem 1, the corresponding optimal a, b approach 1/2, though, due to the form of Theorem 1, at a slower pace than c decreases. To illustrate how fast the optimal teaching sequence approaches 1/2 in the first dimension, Figure 5(c) plots |V1| = a − b as a function of t, obtained by plugging E(c) into Theorem 1 (note that in general this is not E(|V1|), but only a typical value). We set t0 = t/2. This plot is similar to the one we produced from human behavioral data in Figure 2(b); for comparison, that plot is copied here in the background. Because the effective number of independent dimensions d is unknown, we present several curves for different values of d. Some of these curves provide a qualitatively reasonable fit to human behavior, despite the several simplifying assumptions in our model.

5 Conclusion and Future Work

We conducted a human teaching experiment and observed three distinct human teaching strategies. Empirical results yielded no evidence for the boundary strategy but showed that the extreme strategy is consistent with the curriculum learning principle.
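As an aside, the per-dimension expectation in Theorem 3 can be sanity-checked by simulation. The sketch below (function names are ours) draws t uniform values in one irrelevant dimension, keeps the gap between the two label groups only when the labels happen to be linearly separable, and compares the average against 2/((t choose t0)(1 + t)):

```python
import math
import random

def expected_vk(t, t0):
    # Theorem 3: E|V_k(t)| = 2 / ((t choose t0) * (1 + t)) per irrelevant
    # dimension, so E(c) = (d - 1) times this quantity.
    return 2.0 / (math.comb(t, t0) * (1 + t))

def simulate_vk(t, t0, trials=200_000, seed=0):
    # Monte Carlo estimate: t uniform values in an irrelevant dimension, t0 of
    # them labeled negative; the surviving interval |V_k(t)| is the gap between
    # the two label groups, nonempty only when the labels are separable.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        neg = [rng.random() for _ in range(t0)]
        pos = [rng.random() for _ in range(t - t0)]
        if max(neg) < min(pos):        # negatives entirely on the left
            total += min(pos) - max(neg)
        elif max(pos) < min(neg):      # negatives entirely on the right
            total += min(neg) - max(pos)
    return total / trials

t, t0 = 4, 2
print(expected_vk(t, t0))   # 2 / (6 * 5) ≈ 0.0667
print(simulate_vk(t, t0))   # Monte Carlo estimate, close to 0.0667
```

Increasing t makes the separable event, and hence |V_k(t)|, vanish quickly, which is exactly the driver behind c → 0.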
We presented a theoretical framework that extends teaching dimension and explains two defining properties of the extreme strategy: (1) teaching starts with extreme instances, and (2) teaching gradually approaches the decision boundary. Our framework predicts that, in the absence of irrelevant dimensions (d = 1), teaching should start at the decision boundary. To verify this prediction, in future work we plan to conduct additional human teaching studies in which the objects have no irrelevant attributes. We also plan to further investigate and explain the linear strategy and the positive-only strategy that we observed in our current study.

Acknowledgments: We thank Li Zhang and Eftychios Sifakis for helpful comments. Research supported by NSF IIS-0953219, IIS-0916038, AFOSR FA9550-09-1-0313, Wisconsin Alumni Research Foundation, and Mitsubishi Heavy Industries, Ltd.

References

[1] D. Angluin. Queries revisited. Theoretical Computer Science, 313(2):175–194, 2004.
[2] F. J. Balbach and T. Zeugmann. Teaching randomized learners. In Proceedings of the 19th Annual Conference on Computational Learning Theory (COLT), pages 229–243. Springer, 2006.
[3] F. J. Balbach and T. Zeugmann. Recent developments in algorithmic teaching. In Proceedings of the 3rd International Conference on Language and Automata Theory and Applications, pages 1–18, 2009.
[4] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In L. Bottou and M. Littman, editors, Proceedings of the 26th International Conference on Machine Learning, pages 41–48, Montreal, June 2009. Omnipress.
[5] J. S. Bruner, J. J. Goodnow, and G. A. Austin. A Study of Thinking. New York: Wiley, 1956.
[6] H. A. David and H. N. Nagaraja. Order Statistics. Wiley, 3rd edition, 2003.
[7] S. De Deyne and G. Storms. Word associations: Network and semantic properties. Behavior Research Methods, 40:213–231, 2008.
[8] S. De Deyne and G.
Storms. Word associations: Norms for 1,424 Dutch words in a continuous task. Behavior Research Methods, 40:198–205, 2008.
[9] S. Goldman and M. Kearns. On the complexity of teaching. Journal of Computer and Systems Sciences, 50(1):20–31, 1995.
[10] S. Goldman and H. Mathias. Teaching a smarter learner. Journal of Computer and Systems Sciences, 52(2):255–267, 1996.
[11] S. Hanneke. Teaching dimension and the complexity of active learning. In Proceedings of the 20th Annual Conference on Computational Learning Theory (COLT), pages 66–81, 2007.
[12] T. Hegedüs. Generalized teaching dimensions and the query complexity of learning. In Proceedings of the 8th Annual Conference on Computational Learning Theory (COLT), pages 108–117, 1995.
[13] H. Kobayashi and A. Shinohara. Complexity of teaching by a restricted number of examples. In Proceedings of the 22nd Annual Conference on Computational Learning Theory (COLT), pages 293–302, 2009.
[14] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, 2010.
[15] Y. J. Lee and K. Grauman. Learning the easy things first: Self-paced visual category discovery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[16] B. D. McCandliss, J. A. Fiez, A. Protopapas, M. Conway, and J. L. McClelland. Success and failure in teaching the [r]-[l] contrast to Japanese adults: Tests of a Hebbian model of plasticity and stabilization in spoken language perception. Cognitive, Affective, & Behavioral Neuroscience, 2(2):89–108, 2002.
[17] J. P. Salmon, P. A. McMullen, and J. H. Filliter. Norms for two types of manipulability (graspability and functional usage), familiarity, and age of acquisition for 320 photographs of objects. Behavior Research Methods, 42(1):82–95, 2010.
[18] S. Zilles, S. Lange, R. Holte, and M. Zinkevich. Models of cooperative teaching and learning.
Journal of Machine Learning Research, 12:349–384, 2011.