{"title": "Preference-Based Batch and Sequential Teaching: Towards a Unified View of Models", "book": "Advances in Neural Information Processing Systems", "page_first": 9199, "page_last": 9209, "abstract": "Algorithmic machine teaching studies the interaction between a teacher and a learner where the teacher selects labeled examples aiming at teaching a target hypothesis. In a quest to lower teaching complexity and to achieve more natural teacher-learner interactions, several teaching models and complexity measures have been proposed for both the batch settings (e.g., worst-case, recursive, preference-based, and non-clashing models) as well as the sequential settings (e.g., local preference-based model). To better understand the connections between these different batch and sequential models, we develop a novel framework which captures the teaching process via preference functions $\\Sigma$. In our framework, each function $\\sigma \\in \\Sigma$ induces a teacher-learner pair with teaching complexity as $\\TD(\\sigma)$. We show that the above-mentioned teaching models are equivalent to specific types/families of preference functions in our framework. This equivalence, in turn, allows us to study the differences between two important teaching models, namely $\\sigma$ functions inducing the strongest batch (i.e., non-clashing) model and $\\sigma$ functions inducing a weak sequential (i.e., local preference-based) model. 
Finally, we identify preference functions inducing a novel family of sequential models with teaching complexity linear in the VC dimension of the hypothesis class: this is in contrast to the best known complexity result for the batch models which is quadratic in the VC dimension.", "full_text": "Preference-Based Batch and Sequential Teaching:\n\nTowards a Uni\ufb01ed View of Models\n\nFarnam Mansouri\u2020 Yuxin Chen\u2021 Ara Vartanian\u2039 Xiaojin Zhu\u2039 Adish Singla\u2020\n\u2020Max Planck Institute for Software Systems (MPI-SWS), {mfarnam, adishs}@mpi-sws.org,\n\n\u2021University of Chicago, chenyuxin@uchicago.edu,\n\n\u2039University of Wisconsin-Madison, {aravart, jerryzhu}@cs.wisc.edu\n\nAbstract\n\nAlgorithmic machine teaching studies the interaction between a teacher and a\nlearner where the teacher selects labeled examples aiming at teaching a target\nhypothesis. In a quest to lower teaching complexity and to achieve more natural\nteacher-learner interactions, several teaching models and complexity measures have\nbeen proposed for both the batch settings (e.g., worst-case, recursive, preference-\nbased, and non-clashing models) as well as the sequential settings (e.g., local\npreference-based model). To better understand the connections between these dif-\nferent batch and sequential models, we develop a novel framework which captures\nthe teaching process via preference functions \u03a3. In our framework, each function\n\u03c3 P \u03a3 induces a teacher-learner pair with teaching complexity as TDp\u03c3q. We show\nthat the above-mentioned teaching models are equivalent to speci\ufb01c types/families\nof preference functions in our framework. This equivalence, in turn, allows us to\nstudy the differences between two important teaching models, namely \u03c3 functions\ninducing the strongest batch (i.e., non-clashing) model and \u03c3 functions induc-\ning a weak sequential (i.e., local preference-based) model. 
Finally, we identify preference functions inducing a novel family of sequential models with teaching complexity linear in the VC dimension of the hypothesis class: this is in contrast to the best known complexity result for the batch models which is quadratic in the VC dimension.

1 Introduction

Algorithmic machine teaching studies the interaction between a teacher and a learner where the teacher's goal is to find an optimal training sequence to steer the learner towards a target hypothesis [GK95, ZLHZ11, Zhu13, SBB`14, Zhu15, ZSZR18]. An important quantity of interest is the teaching dimension (TD) of the hypothesis class, representing the worst-case number of examples needed to teach any hypothesis in a given class. Given that the teaching complexity depends on what assumptions are made about teacher-learner interactions, different teaching models lead to different notions of teaching dimension. In the past two decades, several such teaching models have been proposed, primarily driven by the motivation to lower teaching complexity and to find models for which the teaching complexity has better connections with learning complexity as measured by the Vapnik–Chervonenkis dimension (VCD) [VC71] of the class.
Most of the well-studied teaching models are for the batch setting (e.g., worst-case [GK95, Kuh99], recursive [ZLHZ08, ZLHZ11, DFSZ14], preference-based [GRSZ17], and non-clashing [KSZ19] models). In these batch models, the teacher first provides a set of examples to the learner and then the learner outputs a hypothesis. 
In a quest to achieve more natural teacher-learner interactions and enable richer applications, various different models have been proposed for the sequential setting (e.g., the local preference-based model for version space learners [CSMA`18], models for gradient learners [LDH`17, LDL`18, KDCS19], models inspired by control theory [Zhu18, LZZ19], models for sequential tasks [CL12, HTS18, TGH`19], and models for human-centered applications that require adaptivity [SBB`13, HCMA`19]).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper, we seek to gain a deeper understanding of how different teaching models relate to each other. To this end, we develop a novel teaching framework which captures the teaching process via preference functions Σ. Here, a preference function σ ∈ Σ models how a learner navigates in the version space as it receives teaching examples (see §2 for the formal definition); in turn, each function σ induces a teacher-learner pair with teaching dimension TD(σ) (see §3). We highlight some of the key results below:

• We show that the well-studied teaching models in the batch setting correspond to specific families of σ functions in our framework (see §4 and Table 1).
• We study the differences between the family of σ functions inducing the strongest batch model [KSZ19] and the functions inducing a weak sequential model [CSMA`18] (§5.2); also, see the relationship between Σgvs and Σlocal in Figure 1.
• We identify preference functions inducing a novel family of sequential models with teaching complexity linear in the VCD of the hypothesis class. We provide a constructive procedure to find such σ functions with low teaching complexity (§5.3).

Our key findings are highlighted in Figure 1 and Table 1. 
Here, Figure 1 illustrates the relationship between different families of preference functions that we introduce, and Table 1 summarizes the key complexity results we obtain for different families. Our unified view of the existing teaching models in turn opens up several intriguing new directions such as (i) using our constructive procedures to design preference functions for addressing open questions of whether RTD/NCTD is linear in VCD, and (ii) understanding the notion of collusion-free teaching in sequential models. We discuss these directions further in §6.

[Figure 1: Venn diagram for different families of preference functions.]

Families            | Σconst    | Σglobal                             | Σgvs         | Σlocal               | Σlvs
Reduction           | TD [GK95] | RTD / PBTD [ZLHZ11, GRSZ17, HWLW17] | NCTD [KSZ19] | Local-PBTD [CSMA`18] | –
Complexity Results  | –         | O(VCD²)                             | O(VCD²)      | O(VCD²)              | O(VCD)

Table 1: Overview of our main results – reduction to existing models and teaching complexity.

2 The Teaching Model

The teaching domain. Let X, Y be a ground set of unlabeled instances and the set of labels. Let H be a finite class of hypotheses; each element h ∈ H is a function h : X → Y. Here, we only consider boolean functions and hence Y = {0, 1}. In our model, X, H, and Y are known to both the teacher and the learner. There is a target hypothesis h* ∈ H that is known to the teacher, but not the learner. Let Z ⊆ X × Y be the ground set of labeled examples. Each element z = (x_z, y_z) ∈ Z represents a labeled example where the label is given by the target hypothesis h*, i.e., y_z = h*(x_z). 
For any\nZ \u010e Z, the version space induced by Z is the subset of hypotheses HpZq \u010e H that are consistent\nwith the labels of all the examples, i.e., HpZq :\u201c th P H | @z \u201c pxz, yzq P Z, hpxzq \u201c yzu.\nLearner\u2019s preference function. We consider a generic model of the learner that captures our\nassumptions about how the learner adapts her hypothesis based on the labeled examples received from\nthe teacher. A key ingredient of this model is the learner\u2019s preference function over the hypotheses.\nThe learner, based on the information encoded in the inputs of preference function\u2014which include the\ncurrent hypothesis and the current version space\u2014will choose one hypothesis in H. Our model of the\nlearner strictly generalizes the local preference-based model considered in [CSMA`18], where the\nlearner\u2019s preference was only encoded by her current hypothesis. Formally, we consider preference\nfunctions of the form \u03c3 : H \u02c6 2H \u02c6 H \u00d1 R. For any two hypotheses h1, h2, we say that the learner\nprefers h1 to h2 based on the current hypothesis h and version space H \u010e H, iff \u03c3ph1; H, hq \u0103\n\u03c3ph2; H, hq. If \u03c3ph1; H, hq \u201c \u03c3ph2; H, hq, then the learner could pick either one of these two.\n\n2\n\n\fInteraction protocol and teaching objective. The teacher\u2019s goal is to steer the learner towards\nthe target hypothesis h\u2039 by providing a sequence of labeled examples. The learner starts with an\ninitial hypothesis h0 P H before receiving any labeled examples from the teacher. At time step t,\nthe teacher selects a labeled example zt P Z, and the learner makes a transition from the current\nhypothesis to the next hypothesis. Let us denote the labeled examples received by the learner up to\n(and including) time step t via Zt. Further, we denote the learner\u2019s version space at time step t as\nHt \u201c HpZtq, and the learner\u2019s hypothesis before receiving zt as ht\u00b41. 
The learner picks the next\nhypothesis based on the current hypothesis ht\u00b41, version space Ht, and preference function \u03c3:\n\nht P arg min\nh1PHt\n\n\u03c3ph1; Ht, ht\u00b41q.\n\n(2.1)\n\nUpon updating the hypothesis ht, the learner sends ht as feedback to the teacher. Teaching \ufb01nishes\nhere if the learner\u2019s updated hypothesis ht equals h\u2039. We summarize the interaction in Protocol 1.1\n\nProtocol 1 Interaction protocol between the teacher and the learner\n1: learner\u2019s initial version space is H0 \u201c H and learner starts from an initial hypothesis h0 P H\n2: for t \u201c 1, 2, 3, . . . do\n3:\n4:\n5:\n\nlearner receives zt \u201c pxt, ytq; updates Ht \u201c Ht\u00b41 X Hptztuq; picks ht per Eq. (2.1);\nteacher receives ht as feedback from the learner;\nif ht \u201c h\u2039 then teaching process terminates\n\n3 The Complexity of Teaching\n\n3.1 Teaching Dimension for a Fixed Preference Function\n\n1,\n\nOur objective is to design teaching algorithms that can steer the learner towards the target hypothesis\nin a minimal number of time steps. We study the worst-case number of steps needed, as is common\nwhen measuring information complexity of teaching [GK95, ZLHZ11, GRSZ17, Zhu18]. Fix the\nground set of instances X and the learner\u2019s preference \u03c3. For any version space H \u010e H, the\nworst-case optimal cost for steering the learner from h to h\u2039 is characterized by\n\n\"\n1 ` minz maxh2PC\u03c3pH,h,zq D\u03c3pH X Hptzuq, h2, h\u2039q, otherwise\n\nD\u03c3pH, h, h\u2039q \u201c\nwhere C\u03c3pH, h, zq \u201c arg minh1PHXHptzuq \u03c3ph1; H X Hptzuq, hq denotes the set of candidate hy-\npotheses most preferred by the learner. Note that our de\ufb01nition of teaching dimension is similar in\nspirit to the local preference-based teaching complexity de\ufb01ned by [CSMA`18]. 
We shall see in the\nnext section, this complexity measure in fact reduces to existing notions of teaching complexity for\nspeci\ufb01c families of preference functions.\nGiven a preference function \u03c3 and the learner\u2019s initial hypothesis h0, the teaching dimension w.r.t. \u03c3\nis de\ufb01ned as the worst-case optimal cost for teaching any target h\u2039:\nh\u2039 D\u03c3pH, h0, h\u2039q.\n\nDz, s.t. C\u03c3pH, h, zq \u201c th\u02dau\n\nTDX ,H,h0p\u03c3q \u201c max\n\n(3.1)\n\n3.2 Teaching Dimension for a Family of Preference Functions\n\nIn this paper, we will investigate several families of preference functions (as illustrated in Figure 1).\nFor a family of preference functions \u03a3, we de\ufb01ne the teaching dimension w.r.t the family \u03a3 as the\nteaching dimension w.r.t. the best \u03c3 in that family:\n\u03a3-TDX ,H,h0 \u201c min\n\u03c3P\u03a3\n\nTDX ,H,h0p\u03c3q.\n\n(3.2)\n\n1It is important to note that in our teaching model, the teacher and the learner use the same preference\nfunction. This assumption of shared knowledge of the preference function is also considered in existing teaching\nmodels for both the batch settings (e.g., as in [ZLHZ11, GRSZ17]) and the sequential settings [CSMA`18]).\n\n3\n\n\f3.3 Collusion-free Preference Functions\n\nAn important consideration when designing teaching models is to ensure that the teacher and the\nlearner are \u201ccollusion-free\u201d, i.e., they are not allowed to collude or use some \u201ccoding-trick\u201d to\nachieve arbitrarily low teaching complexity. A well-accepted notion of collusion-freeness in the\nbatch setting is one proposed by [GM96] (also see [AK97, OS99, KSZ19]). Intuitively, it captures\nthe idea that a learner conjecturing hypothesis h will not change its mind when given additional\ninformation consistent with h. In comparison to batch models, the notion of collusion-free teaching\nin the sequential models is not well understood. 
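On small classes, the recursion D_σ of §3.1 and the teaching dimension TD(σ) of Eq. (3.1) can be evaluated directly by brute force. Below is a minimal Python sketch; the three-hypothesis class and the depth cutoff are illustrative assumptions, and σ is the uniform preference (so the candidate set contains all consistent hypotheses):

```python
H = {"h1": (0, 0), "h2": (0, 1), "h3": (1, 1)}  # hypothetical class over x0, x1
X = range(2)

def consistent(names, z):
    """Version space H({z}) restricted to `names` for one labeled example z."""
    x, y = z
    return frozenset(n for n in names if H[n][x] == y)

def D(sigma, names, h, h_star, depth=0):
    """Worst-case optimal teaching cost D_sigma(H, h, h*) from Section 3.1."""
    if depth > len(H):                       # cut off unproductive branches
        return float("inf")
    best = float("inf")
    for x in X:
        z = (x, H[h_star][x])                # labels always come from the target
        names_z = consistent(names, z)
        m = min(sigma(n, names_z, h) for n in names_z)
        C = {n for n in names_z if sigma(n, names_z, h) == m}  # candidate set
        if C == {h_star}:                    # base case: z uniquely forces h*
            best = min(best, 1)
        else:                                # worst case over the learner's picks
            best = min(best, 1 + max(D(sigma, names_z, h2, h_star, depth + 1)
                                     for h2 in C))
    return best

sigma_const = lambda n, Ht, h: 0             # uniform preference (Sigma_const)
# Eq. (3.1): teaching dimension w.r.t. sigma = worst case over targets.
td = max(D(sigma_const, frozenset(H), "h1", t) for t in H)
```

Here `td` evaluates to 2: the hypothesis (0, 1) requires two examples under the uniform preference, while each of the other two is forced by a single example.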
We introduce a novel notion of collusion-freeness\nfor the sequential setting, which captures the following idea: if h is the only hypothesis in the most\npreferred set de\ufb01ned by \u03c3, then the learner will always stay at h as long as additional information\nreceived by the learner is consistent with h. We formalize this notion in the de\ufb01nition below. Note\nthat for \u03c3 functions corresponding to batch models (see \u00a74), De\ufb01nition 1 reduces to the collusion-free\nde\ufb01nition of [GM96].\n\nDe\ufb01nition 1 (Collusion-free preference) Consider a time t where the learner\u2019s current hypothesis\nis ht\u00b41 and version space is Ht (see Protocol 1). Further assume that the learner\u2019s preferred\nhypothesis for time t is uniquely given by arg minh1PHt \u03c3ph1; Ht, ht\u00b41q \u201c t\u02c6hu. Let S be additional\nexamples provided by an adversary from time t onwards. We call a preference function collusion-free,\nif for any S consistent with \u02c6h, it holds that arg minh1PHtXHpSq \u03c3ph1; Ht X HpSq, \u02c6hq \u201c t\u02c6hu.\nIn this paper, we study preference functions that are collusion-free. In particular, we use \u03a3CF to\ndenote the set of preference functions that induce collusion-free teaching:\n\n\u03a3CF \u201c t\u03c3 | \u03c3 is collusion-freeu.\n\n4 Preference-based Batch Models\n4.1 Families of Preference Functions\n\nWe consider three families of preference functions which do not depend\non the learner\u2019s current hypothesis. The \ufb01rst one is the family of uniform\npreference functions, denoted by \u03a3const, which corresponds to constant\npreference functions:\n\n\u03a3const \u201c t\u03c3 P \u03a3CF | Dc P R, s.t. @h1, H, h, \u03c3ph1; H, hq \u201c cu\n\nThe second family, denoted by \u03a3global, corresponds to the preference\nfunctions that do not depend on the learner\u2019s current hypothesis and\nversion space. 
In other words, the preference functions capture some\nglobal preference ordering of the hypotheses:\n\n\u03a3global\n\n\u03a3const\n\n\u03a3gvs\n\nFigure 2: Batch models.\n\n\u03a3global \u201c t\u03c3 P \u03a3CF | D g : H \u00d1 R, s.t. @h1, H, h, \u03c3ph1; H, hq \u201c gph1qu\n\nThe third family, denoted by \u03a3gvs, corresponds to the preference functions that depend on the learner\u2019s\nversion space, but do not depend on the learner\u2019s current hypothesis:\n\n\u03a3gvs \u201c t\u03c3 P \u03a3CF | D g : H \u02c6 2H \u00d1 R, s.t. @h1, H, h, \u03c3ph1; H, hq \u201c gph1, Hqu\n\nFigure 2 illustrates the relationship between these preference families.\n\n4.2 Complexity Results\n\nWe \ufb01rst provide several de\ufb01nitions, including the formal de\ufb01nition of VC dimension as well as several\nexisting notions of teaching dimension.\nDe\ufb01nition 2 (Vapnik\u2013Chervonenkis dimension [VC71]) The VC dimension for H \u010e H w.r.t. a\n\ufb01xed set of unlabeled instances X \u010e X , denoted by VCDpH, Xq, is the cardinality of the largest set\nof points X1 \u010e X that are \u201cshattered\u201d.2 Formally, let H|X \u201c tphpx1q, ..., hpxnqq | @h P Hu denote\nall possible patterns of H on X. Then VCDpH, Xq \u201c max|X1|, s.t. X1 \u010e X and |H|X1| \u201c 2|X1|.\n2In the classical de\ufb01nition of VCD, only the \ufb01rst argument H is present; the second argument X is omitted\nand is by default the ground set of unlabeled instances X .\n\n4\n\n\fDe\ufb01nition 3 (Teaching dimension [GK95]) For any hypothesis h P H, we call a set of instances\nTphq \u010e X a teaching set for h, if it can uniquely identify h P H. The teaching dimension for H,\ndenoted by TDpHq, is the maximum size of the minimum teaching set for any h P H: TDpHq \u201c\nmaxhPH min|Tphq|.\nAs noted by [ZLHZ08], the teaching dimension of [GK95] does not always capture the intuitive idea\nof cooperation between teacher and learner. 
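Definitions 2 and 3 can both be computed by brute force on a finite boolean class, which makes the contrast between the two quantities concrete. A minimal sketch; the four-hypothesis class is a hypothetical illustration, not taken from the paper:

```python
from itertools import combinations

# A hypothetical four-hypothesis boolean class over instances X = {0, 1, 2}.
H = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
X = [0, 1, 2]

def is_teaching_set(h, S):
    """S is a teaching set for h if h is the only hypothesis consistent with it."""
    return [g for g in H if all(g[x] == h[x] for x in S)] == [h]

def teaching_dim():
    """TD(H): max over h of the size of h's smallest teaching set (Definition 3)."""
    def min_ts(h):
        return next(k for k in range(len(X) + 1)
                    for S in combinations(X, k) if is_teaching_set(h, S))
    return max(min_ts(h) for h in H)

def vc_dim():
    """VCD(H, X): size of the largest shattered subset of X (Definition 2)."""
    def shattered(S):
        return len({tuple(h[x] for x in S) for h in H}) == 2 ** len(S)
    return max(len(S) for k in range(len(X) + 1)
               for S in combinations(X, k) if shattered(S))
```

For this toy class, `teaching_dim()` and `vc_dim()` both return 2; in general the two can differ substantially, which is part of what motivates the refined teaching models discussed next.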
The authors then introduced a model of cooperative\nteaching that resulted in the complexity notion of recursive teaching dimension, as de\ufb01ned below.\nDe\ufb01nition 4 (Recursive teaching dimension [ZLHZ08, ZLHZ11]) The recursive teaching dimen-\nsion (RTD) of H, denoted by RTDpHq, is the smallest number k, such that one can \ufb01nd an ordered\nsequence of hypotheses in H, denoted by ph1, . . . , hi, . . . , h|H|q, where every hypothesis hi has a\nteaching set of size no more than k to be distinguished from the hypotheses in the remaining sequence.\n\nIn this paper we consider \ufb01nite hypothesis classes. Under this setting, RTD is equivalent to preference-\nbased teaching dimension (PBTD) [GRSZ17].\nIn a recent work of [KSZ19], a new notion of teaching complexity, called non-clashing teaching\ndimension or NCTD, was introduced (see de\ufb01nition below). Importantly, NCTD is the optimal\nteaching complexity among teaching models in the batch setting that satisfy the collusion-free\nproperty of [GM96].\nDe\ufb01nition 5 (Non-clashing teaching dimension [KSZ19]) Let H be a hypothesis class and T :\nH \u00d1 2X be a \u201cteacher mapping\u201d on H, i.e., mapping a given hypothesis to a teaching set.3 We say\nthat T is non-clashing on H iff there are no two distinct h, h1 P H such that Tphq is consistent with h1\nand Tph1q is consistent with h. The non-clashing Teaching Dimension of H, denoted by NCTDpHq,\nis de\ufb01ned as NCTDpHq \u201c minT is non-clashingtmaxhPH |Tphq|u.\nWe show in the following, that the teaching dimension \u03a3-TD in Eq. (3.2) uni\ufb01es the above de\ufb01nitions\nof TD\u2019s for batch models.\nTheorem 1 (Reduction to existing notions of TD\u2019s) Fix X ,H, h0. The teaching complexity for the\nthree families reduces to the existing notions of teaching dimensions:\n\n1. \u03a3const-TDX ,H,h0 \u201c TDpHq\n2. \u03a3global-TDX ,H,h0 \u201c RTDpHq \u201c OpVCDpH,Xq2q\n3. 
\u03a3gvs-TDX ,H,h0 \u201c NCTDpHq \u201c OpVCDpH,Xq2q\n\nOur teaching model strictly generalizes the local-preference based model of [CSMA`18], which\nreduces to the \u201cworst-case\u201d model when \u03c3 P \u03a3const (corresponding to TD) [GK95] and the global\n\u201cpreference-based\u201d model when \u03c3 P \u03a3global. Hence we get \u03a3const-TDX ,H,h0 \u201c TDpHq and\n\u03a3global-TDX ,H,h0 \u201c RTDpHq. To establish the equivalence between \u03a3gvs-TDX ,H,h0 and NCTDpHq,\nit suf\ufb01ces to show that for any X ,H, h0, the following holds: (i) \u03a3gvs-TDX ,H,h0 \u011b NCTDpHq, and\n(ii) \u03a3gvs-TDX ,H,h0 \u010f NCTDpHq. The full proof is provided in Appendix A.2 of the supplementary.\nIn Table 2, we consider the well known Warmuth hypothesis class [DFSZ14] where \u03a3const-TD \u201c 3,\n\u03a3global-TD \u201c 3, and \u03a3gvs-TD \u201c 2. Table 2b and Table 2d show preference functions \u03c3 P \u03a3const,\n\u03c3 P \u03a3global, and \u03c3 P \u03a3gvs that achieve the minima in Eq. (3.2). Table 2a shows the teaching sequences\nachieving these teaching dimensions for these preference functions. In Appendix A.1, we provide\nanother hypothesis class where \u03a3const-TD \u201c 3, \u03a3global-TD \u201c 2, and \u03a3gvs-TD \u201c 1.\n5 Preference-based Sequential Models\n5.1 Families of Preference Functions\n\nIn this section, we investigate two families of preference functions that depend on the learner\u2019s\ncurrent hypothesis ht\u00b41. The \ufb01rst one is the family of local preference-based functions [CSMA`18],\ndenoted by \u03a3local, which corresponds to preference functions that depend on the learner\u2019s current\n(local) hypothesis, but do not depend on the learner\u2019s version space:\n\n\u03a3local \u201c t\u03c3 P \u03a3CF | D g : H \u02c6 H \u00d1 R, s.t. 
∀h′, H, h, σ(h′; H, h) = g(h′, h)}

³We refer the reader to the original paper [KSZ19] for a more formal description of "teacher mapping".

        x1  x2  x3  x4  x5 | Sconst = Sglobal | Sgvs     | Slocal   | Slvs
  h1     1   1   0   0   0 | (x1, x2, x4)     | (x1, x2) | (x1)     | (x1)
  h2     0   1   1   0   0 | (x2, x3, x5)     | (x2, x3) | (x3)     | (x2)
  h3     0   0   1   1   0 | (x1, x3, x4)     | (x3, x4) | (x3, x4) | (x3)
  h4     0   0   0   1   1 | (x2, x4, x5)     | (x4, x5) | (x5, x4) | (x4)
  h5     1   0   0   0   1 | (x1, x3, x5)     | (x1, x5) | (x5)     | (x5)
  h6     1   1   0   1   0 | (x1, x2, x4)     | (x2, x4) | (x4)     | (x3)
  h7     0   1   1   0   1 | (x2, x3, x5)     | (x3, x5) | (x3, x5) | (x4)
  h8     1   0   1   1   0 | (x1, x3, x4)     | (x1, x4) | (x4, x3) | (x5)
  h9     0   1   0   1   1 | (x2, x4, x5)     | (x2, x5) | (x4, x5) | (x1)
  h10    1   0   1   0   1 | (x1, x3, x5)     | (x1, x3) | (x5, x3) | (x2)

(a) The Warmuth hypothesis class and the corresponding teaching sequences (denoted by S).
(b) σconst and σglobal: σconst(h′; ·, ·) = σglobal(h′; ·, ·) = 0 for all h′ ∈ H.
(c) σlocal(h′; ·, h): the Hamming distance between h′ and h.
(d) σgvs(h′; H, ·): excerpt, e.g., σgvs(h1; {h1, h6}, ·) = σgvs(h2; {h2, h7}, ·) = 0; see Appendix B.
(e) σlvs(h′; H, h): excerpt, where {·}* denotes all subsets; see Appendix B.

Table 2: Teaching sequences with different preference functions for the Warmuth hypothesis class [DFSZ14].⁴ Full preference functions are given in Appendix B of the supplementary.

The second family, denoted by Σlvs, corresponds to the preference functions that depend on all three arguments of σ(h′; H, h). 
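The local preference of Table 2c can be made concrete before moving on: a σ ∈ Σlocal that ranks hypotheses by Hamming distance to the learner's current hypothesis. Below is a minimal sketch over the Warmuth class of Table 2a (the 0/1 tuple encoding, and the convention that tuple index 3 stands for x4, are illustrative assumptions):

```python
# The Warmuth hypothesis class of Table 2a, encoded over x1..x5 as 0/1 tuples.
WARMUTH = {
    "h1": (1, 1, 0, 0, 0), "h2": (0, 1, 1, 0, 0), "h3": (0, 0, 1, 1, 0),
    "h4": (0, 0, 0, 1, 1), "h5": (1, 0, 0, 0, 1), "h6": (1, 1, 0, 1, 0),
    "h7": (0, 1, 1, 0, 1), "h8": (1, 0, 1, 1, 0), "h9": (0, 1, 0, 1, 1),
    "h10": (1, 0, 1, 0, 1),
}

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

# A sigma in Sigma_local as in Table 2c: the preference for h' depends only on
# the Hamming distance to the current hypothesis h, not on the version space.
def sigma_local(h_new, H_t, h_cur):
    return hamming(WARMUTH[h_new], WARMUTH[h_cur])

def learner_update(sigma, examples, h_cur):
    """One step of Protocol 1: restrict to consistent hypotheses, take argmin sigma."""
    H_t = [name for name, h in WARMUTH.items()
           if all(h[x] == y for x, y in examples)]
    return min(H_t, key=lambda name: sigma(name, H_t, h_cur))

# Starting from h1, the single example (x4, 1) steers the learner: among the
# five consistent hypotheses, h6 is the unique one closest to h1.
h_next = learner_update(sigma_local, [(3, 1)], "h1")
```

Starting from h1, one example suffices to reach h6, which is consistent with the length-one sequence Slocal = (x4) listed for h6 in Table 2a.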
The dependence of \u03c3 on the learner\u2019s current (local) hypothesis and the\nversion space renders a powerful family of preference functions:\n\n\u03a3lvs \u201c t\u03c3 P \u03a3CF | D g : H \u02c6 2H \u02c6 H \u00d1 R, s.t. @h1, H, h, \u03c3ph1; H, hq \u201c gph1, H, hqu\n\nFigure 1 illustrates the relationship between these preference families. As an example, in Table 2c\nand Table 2e, we provide the preference functions \u03c3local and \u03c3lvs for the Warmuth hypothesis class\nthat achieve the minima in Eq. (3.2).\n\n5.2 Comparing \u03a3gvs-TD and \u03a3local-TD\n\nIn the following, we show that substantial differences arise as we transition from \u03c3 functions\ninducing the strongest batch (i.e., non-clashing) model to \u03c3 functions inducing a weak sequential\n(i.e., local preference-based) model. We provide the full proof of Theorem 2 in Appendix C of the\nsupplementary.\n\nTheorem 2 Neither of the families \u03a3gvs and \u03a3local dominates the other. Speci\ufb01cally,\n\n1. \u03a3gvs X \u03a3local \u201c \u03a3global\n2. There exist H, X , where @h0 P H, \u03a3local-TDX ,H,h0 \u0105 \u03a3gvs-TDX ,H,h0\n3. There exist H, X , where @h0 P H, \u03a3local-TDX ,H,h0 \u0103 \u03a3gvs-TDX ,H,h0\n\n5.3 Complexity Results\n\nWe now connect the teaching complexity of the sequential models with the VC dimension.\nTheorem 3 \u03a3local-TDX ,H,h0 \u201c OpVCDpH,Xq2q, and \u03a3lvs-TDX ,H,h0 \u201c OpVCDpH,Xqq.\nTo establish the proof, we \ufb01rst introduce an important de\ufb01nition (De\ufb01nition 6) and a key lemma\n(Lemma 4).\n\n4The Warmuth hypothesis class is the smallest concept class for which RTD exceeds VCD.\n\n6\n\n\fDe\ufb01nition 6 (Compact-Distinguishable Set) Fix H \u010e H and X \u010e X , where X \u201c tx1, ..., xnu.\nLet H|X \u201c tphpx1q, ..., hpxnqq | @h P Hu denote all possible patterns of H on X. Then, we say that\nX is compact-distinguishable on H, if |H|X| \u201c |H| and @X1 \u0102 X, |H|X1| \u0103 |H|. 
We will use \u03a8H\nto denote a compact-distinguishable set on H.\n\nIn words, one can uniquely identify any hypothesis in H with a (sub)set of examples from \u03a8H (also\nsee the de\ufb01nition of distinguishing sets in [DFSZ14]). Our de\ufb01nition of compact-distinguishable\nset further implies that there are no \u201credundant\u201d examples in \u03a8H. It can be shown that a compact-\ndistinguishable set satis\ufb01es the following two properties: (i) it does not contain any pair of distinct\ninstances x, x1 such that p@h P H : hpxq \u201c hpx1qq or p@h P H : hpxq \u2030 hpx1qq; and (ii) it does not\ncontain any instance x such that p@h P H : hpxq \u201c 1q or p@h P H : hpxq \u201c 0q.\nLemma 4 Consider a subset H \u010e H and any compact-distinguishable set \u03a8H \u201c tx1, ..., x|\u03a8H|u.\nFix any hypothesis hH P H. Let d \u201c VCDpH, \u03a8Hq denote the VC dimension of H on \u03a8H. If d \u011b 1,\nwe can divide H into m \u201c |\u03a8H| ` 1 separate hypothesis classes tH 1, ..., H mu, such that\n(i) @j P rms, there exists a compact-distinguishable set \u03a8H j s.t. VCDpH j, \u03a8H jq \u010f d \u00b4 1.\n(ii) @j P rm \u00b4 1s, H j is not empty and H j|txju \u201c tp1 \u00b4 hHpxjqqu.\n(iii) H m \u201c thHu.\nLemma 4 suggests that for any H,X , one can partition the hypothesis class H into m \u010f |X| ` 1\nsubsets with lower VC dimension with respect to some compact-distinguishable set.5 The main idea\nof the lemma is similar to the reduction of a concept class w.r.t. some instance x to lower VCD as done\nin Theorem 9 of [FW95]. The key distinction of Lemma 4 is that we consider compact-distinguishable\nsets for this partitioning, which in turn ensures the uniqueness of the version spaces associated with\nthese partitions (see proof of Theorem 3). 
Another key novelty in our proof of Theorem 3 is to recursively apply the reduction step from the lemma.
To prove the lemma, we provide a constructive procedure to partition the hypothesis class, and show that the resulting partitions have reduced VC dimensions on some compact-distinguishable set. We highlight the procedure for constructing the partitions in Algorithm 2 (Line 7–Line 10). In Figure 3, we provide an illustrative example for creating such partitions for the Warmuth hypothesis class from Table 2a. We sketch the proof of Lemma 4 below, and defer the detailed proof to Appendix D.1.
Proof [Proof Sketch of Lemma 4] Let us define H_x = {h ∈ H : h△x|_{Ψ_H} ∈ H|_{Ψ_H}}. Here, h△x denotes the hypothesis that only differs with h on the label of x, and h|_{Ψ_H} denotes the patterns of h on Ψ_H. Fix a reference hypothesis h_H. For all j ∈ [m − 1], let y_j = 1 − h_H(x_j) be the opposite of the label of x_j ∈ Ψ_H as provided by h_H. As shown in Line 9 of Algorithm 2, we consider the set H¹ := H^{y1}_{x1} = {h ∈ H_{x1} : h(x1) = y1} as the first partition. In the appendix, we show that |H¹| > 0.
Next, we show that VCD(H¹, Ψ_H \ {x1}) ≤ d − 1. When d > 1, we prove the statement as follows:

    VCD(H¹, Ψ_H \ {x1}) ≤ VCD(H^{y1}_{x1}, Ψ_H) = VCD(H_{x1}, Ψ_H) − 1 ≤ VCD(H, Ψ_H) − 1 ≤ d − 1.

In the appendix, we prove the statement for d = 1, and further show that there exists a compact-distinguishable set Ψ_{H¹} ⊆ Ψ_H \ {x1} for the first partition H¹. Then, we conclude that the first partition H¹ has VCD(H¹, Ψ_{H¹}) ≤ d − 1.
Next, we remove the first partition H¹ from H, and continue to create the above mentioned partitions on Hrest = H \ H¹ and Xrest = Ψ_H \ {x1}. As discussed in the appendix, we show that Xrest is a compact-distinguishable set on Hrest. 
Therefore, we can repeat the above procedure (Line 7–Line 10, Algorithm 2) to create the subsequent partitions. This process continues until the size of Xrest reduces to 1, i.e., Xrest = {x_{m−1}}. Until then, we obtain partitions {H¹, ..., H^{m−2}}. By construction, H^j satisfies properties (i) and (ii) for all j ∈ [m − 2].
It remains to show that H^{m−1} and H^m also satisfy the properties in Lemma 4. Since Xrest = {x_{m−1}} before we start iteration m − 1, and Xrest is a compact-distinguishable set for Hrest, there must exist exactly two hypotheses in Hrest, and therefore |H^{m−1}| = |H^m| = 1. This implies that VCD(H^{m−1}, Ψ_{H^{m−1}}) = VCD(H^m, Ψ_{H^m}) = 0. Furthermore, ∀j ∈ [m − 1] and h ∈ H^j, we have h_H(x_j) ≠ h(x_j). This indicates h_H ∈ H^m, and hence H^m = {h_H}, which completes the proof.

⁵When VCD(H, Ψ_H) = 0, this implies |H| = 1.

Algorithm 2 Recursive procedure for constructing σlvs achieving TD_{X,H,h_0}(σlvs) ≤ VCD(H, X)
Input: X, H, h_0
1: Let I : H → {1, . . . , |H|} be any bijective mapping
2: For all h′ ∈ H, H ⊆ H, h ∈ H, initialize σlvs(h′; H, h) ← 0 if h′ = h, and ← |H| + 1 otherwise
3: SETPREFERENCE(H, H, X, h_0)
4: function SETPREFERENCE(V, H, X, h)
5:   Create compact-distinguishable set Ψ_H ⊆ X
6:   Hrest := H, Xrest := Ψ_H
7:   for x ∈ Ψ_H do
8:     y = 1 − h(x)
9:     H^y_x ← {h′ ∈ Hrest : h′△x|_{Xrest} ∈ Hrest|_{Xrest}, h′(x) = y}
10:    Hrest ← Hrest \ H^y_x, Xrest ← Xrest \ {x}
11:    Vnext ← V ∩ H({(x, y)})
12:    for h′ ∈ H^y_x do σlvs(h′; Vnext, h) ← I(h′)
13:    hnext ← argmin_{h′ ∈ H^y_x} I(h′)
14:    SETPREFERENCE(Vnext, H^y_x, Ψ_H \ {x}, hnext)

[Figure 3: Illustration of Lemma 4 on the Warmuth class. The grouped hypotheses in the leaf clusters correspond to the sets H^y_x created in Line 9 of Algorithm 2.]

[Figure 4: Illustration of Theorem 3 proof – constructing a σlvs ∈ Σlvs for the Warmuth class.]

Recursive construction of σlvs. As a part of the Theorem 3 proof, we provide a recursive procedure for constructing a σlvs ∈ Σlvs achieving TD_{X,H,h_0}(σlvs) = O(VCD(H, X)).
Proof [Proof of Theorem 3] In a nutshell, the proof consists of three steps: (i) initialization of σlvs, (ii) setting the preferences by recursively invoking the constructive procedure for Lemma 4, and (iii) showing that there exists a teaching sequence of length at most d for any target hypothesis h*. We summarize the recursive procedure in Algorithm 2.
Step (i). 
To begin with, we initialize $\sigma_{lvs}$ with default values which induce high $\sigma$ values (i.e., low preference), except for $\sigma(h'; H, h) = 0$ where $h' = h$ (cf. Line 2 of Algorithm 2). The self-preference guarantees that $\sigma_{lvs}$ is collusion-free as per Definition 1.

Step (ii). The recursion begins at the top level with $H = \mathcal{H}$, current version space $V = \mathcal{H}$, and initial hypothesis $h = h_0$. Lemma 4 suggests that we can partition $H$ into $m = |\Psi_H| + 1$ groups $\{H^1, \dots, H^m\}$, where for all $j \in [m]$, there exists a compact-distinguishable set $\Psi_{H^j}$ that satisfies the properties in Lemma 4.

Now consider the hypothesis $h := h_0$. We show that for $j \in [m-1]$, every $(x_j, y_j)$, where $x_j \in \Psi_H$ and $y_j = 1 - h(x_j)$, corresponds to a unique version space $V^j := \{h \in V : h(x_j) = y_j\}$. To prove this statement, we consider $R^j := V^j \cap H = \{h \in H : h(x_j) = y_j\}$. As discussed in Appendix D.2 of the supplementary material, we know that no two of the $R^j$ for $j \in [m-1]$ are equal. This indicates that no two of the $V^j$ for $j \in [m-1]$ are equal.

We then set the values of the preference function $\sigma_{lvs}(\cdot\,; V^j, h)$ for all $j \in [m-1]$ and $y_j = 1 - h(x_j)$ (Line 12). Upon receiving $(x_j, y_j)$, the learner will be steered to the next "search space" $H^j$, with version space $V^j$. By Lemma 4 we have $\mathrm{VCD}(H^j, \Psi_{H^j}) \leq \mathrm{VCD}(H, \Psi_H) - 1$.

We then build the preference function $\sigma_{lvs}$ recursively, $m-1$ times, once for each $(V^j, H^j, \Psi_{H^j}, h_{next})$, where $h_{next}$ corresponds to the unique hypothesis identified by the function $I$ (Line 13–Line 14). At each level of recursion, VCD reduces by 1. We stop the recursion when $\mathrm{VCD}(H^j, \Psi_{H^j}) = 0$, which corresponds to the scenario $|H^j| = 1$.

Step (iii). Given the preference function constructed in Algorithm 2, we can build up the set of (labeled) teaching examples recursively.
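The teaching dynamics underlying this step can be illustrated with a small, self-contained simulation. The sketch below is an illustrative toy, not Algorithm 2 itself: it runs a "win-stay, lose-shift" sequential learner on the powerset class over $d$ instances (which has VCD $= d$), using a hypothetical minimal-Hamming-move preference; all function names are ours.

```python
from itertools import product

# Toy sketch (not the paper's Algorithm 2): a sequential learner with a
# local preference function on the powerset class over d instances,
# which has VCD = d.  The learner keeps its current hypothesis while it
# stays consistent ("win-stay, lose-shift"); otherwise it moves to a
# consistent hypothesis at minimal Hamming distance from the current one
# (an assumed preference, chosen for illustration).

def powerset_class(d):
    """All 2^d binary labelings of the instances 0, ..., d-1."""
    return [tuple(bits) for bits in product((0, 1), repeat=d)]

def next_hypothesis(current, example, hypotheses):
    """Win-stay: keep `current` if consistent; lose-shift: minimal move."""
    x, y = example
    if current[x] == y:
        return current
    consistent = [h for h in hypotheses if h[x] == y]
    return min(consistent,
               key=lambda h: sum(a != b for a, b in zip(h, current)))

def teach(target, h0, hypotheses):
    """Greedy teacher: correct one disagreement with the target per round."""
    h, sequence = h0, []
    while h != target:
        x = next(i for i in range(len(target)) if h[i] != target[i])
        example = (x, target[x])
        sequence.append(example)
        h = next_hypothesis(h, example, hypotheses)
    return sequence

d = 4
H = powerset_class(d)
h0 = (0, 0, 0, 0)
target = (1, 0, 1, 1)
seq = teach(target, h0, H)
print(seq)       # [(0, 1), (2, 1), (3, 1)]
print(len(seq))  # 3 examples, at most d = VCD
```

In this toy setting the teacher needs exactly one example per disagreeing instance, so any target is reached in at most $d$ steps; this mirrors, in a much simpler form, how Algorithm 2 shrinks the learner's search space at every round.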
Consider the beginning of the teaching process, where the learner's current hypothesis is $h_0$, the version space is $\mathcal{H}$, and the goal of the teacher is to teach $h^*$. Consider the first level of the recursion in Algorithm 2, where we divide $\mathcal{H}$ into $m = |\Psi_{\mathcal{H}}| + 1$ groups $\{H^1, \dots, H^m\}$. Let us consider the case where $h^* \in H^{j^*}$ with $j^* \in [m-1]$. The teacher provides the example $(x = x_{j^*}, y = h^*(x_{j^*}))$. After receiving this teaching example, the partition $H^{j^*}$ will stay in the version space, while $h_0$ will be removed from it. The new version space will be $V^{j^*}$. The learner's new hypothesis induced by the preference function is $h_{next} \in H^{j^*}$. By repeating this teaching process for at most $d$ steps, the learner reaches a partition of size 1 (see Step (ii) for details). At this step $h^*$ must be the only hypothesis left in the search space. Therefore, $h_{next} = h^*$, and the learner has reached $h^*$.

Figure 4 illustrates the recursive construction of a $\sigma_{lvs} \in \Sigma_{lvs}$ for the Warmuth class, with $\mathrm{TD}_{\mathcal{X},\mathcal{H},h_0}(\sigma_{lvs}) = 2$.

6 Discussion and Conclusion

We now discuss a few thoughts related to the different families of preference functions. First of all, the size of the families grows exponentially as we change our model from $\Sigma_{const}$ and $\Sigma_{global}$ to $\Sigma_{gvs}/\Sigma_{local}$, and finally to $\Sigma_{lvs}$, thus resulting in more powerful models with lower teaching complexity. While run time has not been the focus of this paper, it would be interesting to characterize the presumably increased run-time complexity of sequential learners and teachers with complex preference functions. Furthermore, as the size of the families grows, the problem of finding the best preference function $\sigma$ in a given family $\Sigma$ that achieves the minimum in Eq.
(3.2) becomes more computationally challenging.

The recursive procedure in Algorithm 2 creates a preference function $\sigma_{lvs} \in \Sigma_{lvs}$ with teaching complexity at most VCD. It is interesting to note that the resulting preference function $\sigma_{lvs}$ has the characteristic of "win-stay, lose-shift" [BDGG14, CSMA+18]: given that for any hypothesis we have $\sigma(h; \cdot, h) = 0$, the learner prefers her current hypothesis as long as it remains consistent. Preference functions with this characteristic naturally exhibit the collusion-free property of Definition 1. For some problems, one can achieve lower teaching complexity with a $\sigma \in \Sigma_{lvs}$: in fact, the preference function $\sigma_{lvs}$ we provided for the Warmuth class in Table 2e has teaching complexity 1, while the preference function constructed in Figure 4 has teaching complexity 2.

One fundamental aspect of modeling teacher-learner interactions is the notion of collusion-free teaching. Collusion-freeness for the batch setting is well established in the research community, and NCTD characterizes the complexity of the strongest collusion-free batch model. In this paper, we introduced a new notion of collusion-freeness for the sequential setting (Definition 1). As discussed above, a stricter condition is the "win-stay, lose-shift" model, which is easier to validate without running the teaching algorithm. In contrast, the condition of Definition 1 is more involved to validate and is a joint property of the teacher-learner pair. One intriguing question for future work is defining further notions of collusion-free teaching in sequential models and understanding their implications on teaching complexity.

Another interesting direction of future work is to better understand the properties of the teaching parameter $\Sigma$-TD.
One question of particular interest is showing that the teaching parameter is not upper bounded by any constant independent of the hypothesis class; a constant upper bound would suggest a strong collusion in our model. We can show that for certain hypothesis classes, $\Sigma$-TD is lower bounded by a function of VCD. In particular, for the power set class of size $d$ (which has $\mathrm{VCD} = d$), $\Sigma$-TD is lower bounded by $\Omega\big(\tfrac{d}{\log d}\big)$. Another direction of future work is to understand whether this parameter is additive or subadditive over disjoint domains. We also consider the generalization of our results to infinite VC classes a very interesting direction for future work.

Our framework provides novel tools for reasoning about teaching complexity by constructing preference functions. This opens up an interesting direction of research to tackle important open problems, such as proving whether NCTD or RTD is linear in VCD [SZ15, CCT16, HWLW17, KSZ19]. In this paper, we showed that neither of the families $\Sigma_{gvs}$ and $\Sigma_{local}$ dominates the other (Theorem 2). As a direction for future work, it would be important to further quantify the complexity of the $\Sigma_{local}$ family.

Acknowledgements

This work was done in part when Yuxin Chen was at Caltech. Xiaojin Zhu is supported by NSF 1545481, 1561512, 1623605, 1704117, 1836978 and the MADLab AF CoE FA9550-18-1-0166.

References

[AK97] Dana Angluin and Mārtiņš Kriķis. Teachers, learners and black boxes. In Proceedings of the Tenth Annual Conference on Computational Learning Theory, pages 285–297. ACM, 1997.

[BDGG14] Elizabeth Bonawitz, Stephanie Denison, Alison Gopnik, and Thomas L. Griffiths. Win-stay, lose-sample: A simple sequential algorithm for approximating Bayesian inference. Cognitive Psychology, 74:35–65, 2014.

[CCT16] Xi Chen, Yu Cheng, and Bo Tang.
On the recursive teaching dimension of VC classes. In Advances in Neural Information Processing Systems, pages 2164–2171, 2016.

[CL12] Maya Cakmak and Manuel Lopes. Algorithmic and human teaching of sequential decision tasks. In AAAI, 2012.

[CSMA+18] Yuxin Chen, Adish Singla, Oisin Mac Aodha, Pietro Perona, and Yisong Yue. Understanding the role of adaptivity in machine teaching: The case of version space learners. In Advances in Neural Information Processing Systems, pages 1476–1486, 2018.

[DFSZ14] Thorsten Doliwa, Gaojian Fan, Hans Ulrich Simon, and Sandra Zilles. Recursive teaching dimension, VC-dimension and sample compression. JMLR, 15(1):3107–3131, 2014.

[FW95] Sally Floyd and Manfred Warmuth. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine Learning, 21(3):269–304, 1995.

[GK95] Sally A. Goldman and Michael J. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31, 1995.

[GM96] Sally A. Goldman and H. David Mathias. Teaching a smarter learner. Journal of Computer and System Sciences, 52(2):255–267, 1996.

[GRSZ17] Ziyuan Gao, Christoph Ries, Hans U. Simon, and Sandra Zilles. Preference-based teaching. JMLR, 18(31):1–32, 2017.

[HCMA+19] Anette Hunziker, Yuxin Chen, Oisin Mac Aodha, Manuel Gomez Rodriguez, Andreas Krause, Pietro Perona, Yisong Yue, and Adish Singla. Teaching multiple concepts to a forgetful learner. In Advances in Neural Information Processing Systems, 2019.

[HTS18] Luis Haug, Sebastian Tschiatschek, and Adish Singla. Teaching inverse reinforcement learners via features and demonstrations. In Advances in Neural Information Processing Systems, pages 8464–8473, 2018.

[HWLW17] Lunjia Hu, Ruihan Wu, Tianhong Li, and Liwei Wang. Quadratic upper bound for recursive teaching dimension of finite VC classes.
In Proceedings of the 30th Conference on Learning Theory, COLT, pages 1147–1156, 2017.

[KDCS19] Parameswaran Kamalaruban, Rati Devidze, Volkan Cevher, and Adish Singla. Interactive teaching algorithms for inverse reinforcement learning. In IJCAI, pages 2692–2700, 2019.

[KSZ19] David Kirkpatrick, Hans U. Simon, and Sandra Zilles. Optimal collusion-free teaching. In Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98, pages 506–528, 2019.

[Kuh99] Christian Kuhlmann. On teaching and learning intersection-closed concept classes. In European Conference on Computational Learning Theory, pages 168–182. Springer, 1999.

[LDH+17] Weiyang Liu, Bo Dai, Ahmad Humayun, Charlene Tay, Chen Yu, Linda B. Smith, James M. Rehg, and Le Song. Iterative machine teaching. In ICML, pages 2149–2158, 2017.

[LDL+18] Weiyang Liu, Bo Dai, Xingguo Li, Zhen Liu, James M. Rehg, and Le Song. Towards black-box iterative machine teaching. In ICML, pages 3147–3155, 2018.

[LZZ19] Laurent Lessard, Xuezhou Zhang, and Xiaojin Zhu. An optimal control approach to sequential machine teaching. In AISTATS, pages 2495–2503, 2019.

[OS99] Matthias Ott and Frank Stephan. Avoiding coding tricks by hyperrobust learning. In European Conference on Computational Learning Theory, pages 183–197. Springer, 1999.

[SBB+13] Adish Singla, Ilija Bogunovic, Gábor Bartók, Amin Karbasi, and Andreas Krause. On actively teaching the crowd to classify. In NIPS Workshop on Data Driven Education, 2013.

[SBB+14] Adish Singla, Ilija Bogunovic, Gábor Bartók, Amin Karbasi, and Andreas Krause. Near-optimally teaching the crowd to classify. In ICML, pages 154–162, 2014.

[SZ15] Hans U. Simon and Sandra Zilles. Open problem: Recursive teaching dimension versus VC dimension.
In Conference on Learning Theory, pages 1770–1772, 2015.

[TGH+19] Sebastian Tschiatschek, Ahana Ghosh, Luis Haug, Rati Devidze, and Adish Singla. Learner-aware teaching: Inverse reinforcement learning with preferences and constraints. In Advances in Neural Information Processing Systems, 2019.

[VC71] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264, 1971.

[Zhu13] Xiaojin Zhu. Machine teaching for Bayesian learners in the exponential family. In Advances in Neural Information Processing Systems, pages 1905–1913, 2013.

[Zhu15] Xiaojin Zhu. Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI, pages 4083–4087, 2015.

[Zhu18] Xiaojin Zhu. An optimal control view of adversarial machine learning. arXiv preprint arXiv:1811.04422, 2018.

[ZLHZ08] Sandra Zilles, Steffen Lange, Robert Holte, and Martin Zinkevich. Teaching dimensions based on cooperative learning. In COLT, pages 135–146, 2008.

[ZLHZ11] Sandra Zilles, Steffen Lange, Robert Holte, and Martin Zinkevich. Models of cooperative teaching and learning. JMLR, 12(Feb):349–384, 2011.

[ZSZR18] Xiaojin Zhu, Adish Singla, Sandra Zilles, and Anna N. Rafferty. An overview of machine teaching. CoRR, abs/1801.05927, 2018.