{"title": "Machine Teaching of Active Sequential Learners", "book": "Advances in Neural Information Processing Systems", "page_first": 11204, "page_last": 11215, "abstract": "Machine teaching addresses the problem of finding the best training data that can guide a learning algorithm to a target model with minimal effort. In conventional settings, a teacher provides data that are consistent with the true data distribution. However, for sequential learners which actively choose their queries, such as multi-armed bandits and active learners, the teacher can only provide responses to the learner\u2019s queries, not design the full data. In this setting, consistent teachers can be sub-optimal for finite horizons. We formulate this sequential teaching problem, which current techniques in machine teaching do not address, as a Markov decision process, with the dynamics nesting a model of the learner and the actions being the teacher's responses. Furthermore, we address the complementary problem of learning from a teacher that plans: to recognise the teaching intent of the responses, the learner is endowed with a model of the teacher. We test the formulation with multi-armed bandit learners in simulated experiments and a user study. The results show that learning is improved by (i) planning teaching and (ii) the learner having a model of the teacher. 
The approach gives tools to taking into account strategic (planning) behaviour of users of interactive intelligent systems, such as recommendation engines, by considering them as boundedly optimal teachers.", "full_text": "Machine Teaching of Active Sequential Learners\n\nTomi Peltola\n\ntomi.peltola@aalto.fi\n\nMustafa Mert \u00c7elikok\n\nmustafa.celikok@aalto.fi\n\nPedram Daee\n\npedram.daee@aalto.fi\n\nSamuel Kaski\n\nsamuel.kaski@aalto.fi\n\nHelsinki Institute for Information Technology HIIT\n\nDepartment of Computer Science, Aalto University, Helsinki, Finland\n\nAbstract\n\nMachine teaching addresses the problem of \ufb01nding the best training data that can\nguide a learning algorithm to a target model with minimal effort. In conventional\nsettings, a teacher provides data that are consistent with the true data distribution.\nHowever, for sequential learners which actively choose their queries, such as multi-\narmed bandits and active learners, the teacher can only provide responses to the\nlearner\u2019s queries, not design the full data. In this setting, consistent teachers can\nbe sub-optimal for \ufb01nite horizons. We formulate this sequential teaching problem,\nwhich current techniques in machine teaching do not address, as a Markov decision\nprocess, with the dynamics nesting a model of the learner and the actions being\nthe teacher\u2019s responses. Furthermore, we address the complementary problem of\nlearning from a teacher that plans: to recognise the teaching intent of the responses,\nthe learner is endowed with a model of the teacher. We test the formulation\nwith multi-armed bandit learners in simulated experiments and a user study. The\nresults show that learning is improved by (i) planning teaching and (ii) the learner\nhaving a model of the teacher. 
The approach gives tools for taking into account strategic (planning) behaviour of users of interactive intelligent systems, such as recommendation engines, by considering them as boundedly optimal teachers.\n\n1 Introduction\n\nHumans, casual users and domain experts alike, are increasingly interacting with artificial intelligence or machine learning based systems. As the number of interactions in human\u2013computer and other types of agent\u2013agent interaction is usually limited, these systems are often based on active sequential machine learning methods, such as multi-armed bandits, Bayesian optimization, or active learning. These methods explicitly optimise for the efficiency of the interaction from the system\u2019s perspective. On the other hand, for goal-oriented tasks, humans create mental models of the environment for planning their actions to achieve their goals [1, 2]. In AI systems, recent research has shown that users form mental models of the AI\u2019s state and behaviour [3, 4]. Yet, the statistical models underlying the active sequential machine learning methods treat the human actions as passive data, rather than acknowledging the strategic thinking of the user.\n\nMachine teaching studies a complementary problem to active learning: how to provide a machine learner with data to learn a target model with minimal effort [5\u20137]. Apart from its fundamental machine learning interest, machine teaching has been applied to domains such as education [8] and adversarial attacks [9]. In this paper, we study the machine teaching problem of active sequential machine learners: the learner sequentially chooses queries and the teacher provides responses to them. Importantly, to steer the learner towards the teaching goal, the teacher needs to appreciate the order of the learner\u2019s queries and the effect of the responses on it. 
Current techniques in machine teaching do not address such interaction. Furthermore, by viewing users as boundedly optimal teachers, and solving the (inverse machine teaching) problem of how to learn from the teacher\u2019s responses, our approach provides a way to formulate models of strategically planning users in interactive AI systems.\n\nOur main contributions are (i) formulating the problem of machine teaching of active sequential learners as planning in a Markov decision process, (ii) formulating learning from the teacher\u2019s responses as probabilistic inverse reinforcement learning, (iii) implementing the approach in Bayesian Bernoulli multi-armed bandit learners with arm dependencies, and (iv) empirically studying the performance in simulated settings and a user study. Source code is available at https://github.com/AaltoPML/machine-teaching-of-active-sequential-learners.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n2 Related work\n\nMost work in machine teaching considers a batch setting, where the teacher designs a minimal dataset to make the learner learn the target model [5\u20137]. Some works have also studied sequential teaching, but in different settings from ours: teaching methods have been developed to construct batches of state-action trajectories for inverse reinforcement learners [10, 11]. Variations on teaching online learners, such as gradient descent algorithms, by providing them with a sequence of (x, y) data points have also been considered [12\u201314]. Teaching in the context of education, with uncertainty about the learner\u2019s state, has been formulated as planning in partially-observable Markov decision processes [8, 15]. A theoretical study of teacher-aware learners was presented in [16, 17], where the teacher and the learner are aware of their cooperation. 
Compared to our setting, in these works, the teacher is in control of designing all of the learning data (while possibly using interaction to probe the state of the learner) and is not allowed to be inconsistent with regard to the true data distribution. Apart from [11, 16, 17], they also do not consider teacher-aware learners. Machine teaching can also be used towards attacking learning systems [9], and adversarial attacks against multi-armed bandits have been developed, by poisoning historical data [18] or modifying rewards online [19]. The goal, settings, and proposed methods differ from ours. Relatedly, our teaching approach for the case of a bandit learner can be seen as a form of reward shaping, which aims to make the environment more supportive of reinforcement learning by alleviating the temporal credit assignment problem [20].\n\nThe proposed model of the interaction between a teacher and an active sequential learner is a probabilistic multi-agent model. It can be connected to the overarching framework of interactive partially observable Markov decision processes (I-POMDPs; see Supplementary Section 1 for more details) [21] and other related multi-agent models [22\u201325]. I-POMDPs provide, in a principled decision-theoretic framework, a general approach to define multi-agent models that have recursive beliefs about other agents. This also forms a rich basis for computational models of theory of mind, which is the ability to attribute mental states, such as beliefs and desires, to oneself and other agents and is essential for efficient social collaboration [26, 27]. Our teaching problem nests a model of a teacher-unaware learner, forming a learner\u2013teacher model. Teaching-aware learning adds a further layer, forming a nested learner\u2013teacher\u2013learner model, where the higher level learner models a teacher modelling a teaching-unaware learner. Learning from humans with recursive reasoning was opined in [28]. 
To our knowledge, our work is the first to propose a multi-agent recursive reasoning model in the practically important case of multi-armed bandits, allowing us to learn online from the scarce data emerging from human\u2013computer interaction.\n\nUser modelling in human\u2013computer interaction aims at improving the usability and usefulness of collaborative human\u2013computer systems and providing personalised user experiences [29]. Machine learning based interactive systems extend user modelling to encompass statistical models interpreting users\u2019 actions. For example, in information exploration and discovery, the system needs to iteratively recommend items to the user and update the recommendations based on the user feedback [30, 31]. The current underlying statistical models use the user\u2019s response to the system\u2019s queries, such as did you like this movie?, as data for building a relevance profile of the user. Recent works have investigated more advanced user models [32, 33]; however, as far as we know, no previous work has proposed statistical user models that incorporate a model of the user\u2019s mental model of the system.\n\nFinally, our approach can be grounded in computational rationality, which models human behaviour and decision making under uncertainty as expected utility maximisation, subject to computational constraints [34]. Our model assumes that the teacher chooses actions proportional to their likelihood to maximise, for a limited horizon, the future accumulated utility.\n\n3 Model and computation\n\nWe consider machine teaching of an active sequential learner, with the iterations consisting of the learner querying an input point x and the teacher providing a response y. First, the teaching problem is formulated as a Markov decision process, the solution of which provides a teaching policy. 
Then,\nlearning from the responses provided by the teacher is formulated as an inverse reinforcement learning\nproblem. We formulate the approach for general learners, and give a detailed implementation for the\nspeci\ufb01c case of a Bayesian Bernoulli multi-armed bandit learner, which models arm dependencies.\n\n3.1 Active sequential learning\n\nBefore considering machine teaching, we \ufb01rst de\ufb01ne the type of active sequential learners considered.\nThis also provides a baseline to which the teacher\u2019s performance is compared. The general de\ufb01nition\nencompasses multiple popular sequential learning approaches, including Bayesian optimisation and\nmulti-armed bandits, which aim to learn fast, with few queries.\nAn active sequential learner is de\ufb01ned by (i) a machine learning model relating the response y to the\ninputs x through a function f, y = f\u03b8(x), parameterised by \u03b8, or through a conditional distribution\np(y | x, \u03b8), (ii) a deterministic learning algorithm, \ufb01tting the parameters \u03b8 or their posterior p(\u03b8 | D)\ngiven a dataset D = {(x1, y1), . . . , (xt, yt)}, (iii) a query function that, possibly stochastically,\nchooses an input point x to query for a response y, usually formulated as utility maximisation.\nThe dynamics of the learning process then, for t = 1, . . . , T , consists of iterating the following steps:\n\n1. Use the query function to choose a query xt.\n2. Obtain the response yt for the query xt from a teacher (or some other information source).\n3. Update the training set to Dt = Dt\u22121 \u222a {(xt, yt)} and the model correspondingly.\n\nThe data produced by the dynamics forms a sequence, or history, hT = x1, y1, x2, y2, . . . 
, xT (we define the history to end at the input xT, before yT, for notational convenience in the following).\n\nBayesian Bernoulli multi-armed bandit learner As our main application in this paper, we consider Bayesian Bernoulli bandits. At each iteration t, the learner chooses an arm it \u2208 {1, . . . , K} and receives a stochastic reward yt \u2208 {0, 1}, depending on the chosen arm. The goal of the learner is to maximise the expected accumulated reward R_T = E[\u03a3_{t=1}^T y_t]. This presents an exploration\u2013exploitation problem, as the learner needs to learn which arms produce reward with high probability. The learner associates each arm k with a feature vector xk \u2208 R^M and models the rewards as Bernoulli-distributed binary random variables\n\np_B(y_t | \u00b5_{i_t}) = Bernoulli(y_t | \u00b5_{i_t}),   (1)\n\nwith reward probabilities \u00b5_k = \u03c3(x_k^T \u03b8), k = 1, . . . , K, where \u03b8 \u2208 R^M is a weight vector and \u03c3(\u00b7) the logistic sigmoid function. The linearity assumption could be relaxed, for example, by encoding the xk\u2019s using suitable basis functions or Gaussian processes. The Bayesian learner has a prior distribution on the model parameters, here assumed to be a multivariate normal, \u03b8 \u223c N(0, \u03c4\u00b2I), with mean zero and diagonal covariance matrix \u03c4\u00b2I. Given a collected set of arm selections and reward observations at step t, Dt = {(i1, y1), . . . , (it, yt)} (or equivalently Dt = (ht, yt)), the posterior distribution of \u03b8, p(\u03b8 | Dt), is computed.\n\nThe learner uses a bandit arm selection strategy to select the next arm to query about. Here, we use Thompson sampling [35], a practical and empirically and theoretically well-performing algorithm [36]; other methods could easily be used instead. 
The next arm is sampled with probability equal to the posterior probability of that arm maximising the expected reward, estimated over the current posterior distribution:\n\nPr(i_{t+1} = k) = \u222b I(argmax_j \u00b5_j = k | \u03b8) p(\u03b8 | D_t) d\u03b8,   (2)\n\nwhere I is the indicator function. This can be realised by first sampling a weight vector \u03b8 from p(\u03b8 | Dt), computing the corresponding \u00b5(\u03b8), and choosing the arm with the maximal reward probability, i_{t+1} = argmax_k \u00b5_k^{(\u03b8)}.\n\nFigure 1: Example of teaching effect on pool-based logistic regression active learner. Using uncertainty sampling for queries, the learner fails to sample useful points from the pool in 10 iterations to learn a good decision boundary ("Without teacher"; starting from blue training data). A planning teacher can help the learner sample more representative points by switching some labels ("With teacher"; switched labels are shown in red). The average accuracy improvement is shown in the right panel. Details of the setting are given in Supplementary Section 2.\n\n3.2 Machine teaching of active sequential learner\n\nFigure 2: Example of the teaching effect on a multi-armed bandit learner. With the environmental reward probabilities shown in the figure, consider the first query being arm 6. The reward probability for the arm is low, so y1 = 0 with high probability for a naive teacher. Yet, the optimal action for a planning teacher is y1 = 1, because the teacher can anticipate that this will lead to a higher probability for the learner to sample the next arm near the higher peak. Details on the setting are given in Supplementary Section 3.\n\nIn standard active sequential learning, the responses yt are assumed to be generated by a stationary data-generating mechanism as independent and identically distributed samples. We call such a mechanism a naive teacher. 
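The Thompson sampling step realising Equation 2 reduces to a few lines; the following is a minimal NumPy sketch under the assumption of a Gaussian (e.g. Laplace) approximation to the posterior. The function and variable names are ours for illustration, not from the paper's released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def thompson_select(X, post_mean, post_cov, rng):
    """Realise Equation 2 by sampling: draw one weight vector from the
    (approximate) posterior and pick the arm with maximal reward probability."""
    theta = rng.multivariate_normal(post_mean, post_cov)
    mu = sigmoid(X @ theta)   # per-arm reward probabilities under the sampled theta
    return int(np.argmax(mu))

# One step of the learner's query loop (hypothetical 3-arm, 2-feature setup):
rng = np.random.default_rng(0)
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
arm = thompson_select(X, np.zeros(2), np.eye(2), rng)
```

Repeating this draw gives arm-selection frequencies that approximate the integral in Equation 2.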
Our machine teaching formulation replaces it with a planning teacher which, by choosing yt carefully, aims to steer the learner towards a teaching goal with minimal effort.\n\nWe formulate the teaching problem as a Markov decision process (MDP), where the transition dynamics follow from the dynamics of the sequential learner and the responses yt are the actions. The teaching MDP is defined by the tuple M = (H, Y, T, R, \u03b3), where states ht \u2208 H correspond to the history, actions are the responses yt \u2208 Y, transition probabilities p(ht+1 | ht, yt) \u2208 T are defined by the learner\u2019s sequential dynamics, rewards Rt(ht) \u2208 R are used to define the teacher\u2019s goal, and \u03b3 \u2208 (0, 1] is a discount factor (optional if T is finite). The objective of the teacher is to choose actions yt to maximise the cumulative reward, called value, V^\u03c0(h1) = E^\u03c0[\u03a3_{t=1}^T \u03b3^{t\u22121} R_t(h_t)], where T is the teacher\u2019s planning horizon and the expectation is over the possible stochasticity in the learner\u2019s queries and the teacher\u2019s policy. The teacher\u2019s policy \u03c0(ht, yt) = p(yt | ht, \u03c0) maps the state ht to probabilities over the action space Y. The solution to the teaching problem corresponds to finding the optimal teaching policy \u03c0\u2217.\n\nThe reward function Rt(ht) defines the goal of the teacher. In designing a teaching MDP, as in reinforcement learning, its choice is crucial. In machine teaching, a natural assumption is that the reward function is parameterized by an optimal model parameter \u03b8\u2217, or some other ground truth, known to the teacher but not the learner. For teaching of a supervised learning algorithm, the reward Rt(ht; \u03b8\u2217) can, for example, be defined based on the distance of the learner\u2019s estimate of \u03b8 to \u03b8\u2217 or by evaluation of the learner\u2019s predictions against the teacher\u2019s privileged knowledge of outcomes (Figure 1).\n\nIn the multi-armed bandit application, it is assumed that the teacher knows the true parameter \u03b8\u2217 of the underlying environmental reward distribution and aims to teach the learner such that the accumulated environmental reward is maximised (Figure 2). We define the teacher\u2019s reward function as R_t(h_t; \u03b8\u2217) = x_t^T \u03b8\u2217 (leaving out \u03c3(\u00b7) to simplify the formulas for the teacher model).\n\nProperties of the teaching MDP In Supplementary Section 4, we briefly discuss the transition dynamics and state definition of the teaching MDP, and contrast it to Bayes-adaptive MDPs to better understand its properties. Finding the optimal teaching policy presents similar challenges to planning in Bayes-adaptive MDPs. Methods such as Monte Carlo tree search [37] have been found to provide effective approaches.\n\n3.3 Learning from teacher\u2019s responses\n\nWe next describe how the learner can interpret the teacher\u2019s responses, acknowledging the teaching intent. Having formulated the teaching as an MDP, the teacher-aware learning follows naturally as inverse reinforcement learning [38, 39]. We formulate a probabilistic teacher model to make the learning more robust towards suboptimal teaching and to allow using the teacher model as a block in probabilistic modelling.\n\nAt each iteration t, the learner assumes that the teacher chooses the action yt with probability proportional to the action being optimal in value:\n\np_M(y_t | h_t, \u03b8\u2217) = exp(\u03b2 Q\u2217(h_t, y_t; \u03b8\u2217)) / \u03a3_{y\u2032\u2208Y} exp(\u03b2 Q\u2217(h_t, y\u2032; \u03b8\u2217)),   (3)\n\nwhere Q\u2217(ht, yt; \u03b8\u2217) is the optimal state-action value function of the teaching MDP for the action yt (that is, the value of taking action yt at t and following an optimal policy afterwards). Here \u03b2 is a teacher optimality parameter (or inverse temperature; for \u03b2 = 0, the distribution of yt is uniform; for \u03b2 \u2192 \u221e, the action with the highest value is chosen deterministically). From the teaching-aware learner\u2019s perspective, the teacher\u2019s \u03b8\u2217 is unknown, and Equation 3 functions as the likelihood for learning about \u03b8 from the observed teaching. In the bandit case, this replaces Equation 1. Note that the teaching MDP dynamics still follow from the teaching-unaware learner.\n\nOne-step planning Since our main motivating application is modelling users as boundedly optimal teachers, implemented for a Bernoulli multi-armed bandit system, it is interesting to consider the special case of a one-step planning horizon, T = 1. The state-action value function Q\u2217(ht, yt; \u03b8\u2217) then simplifies to the rewards at the next possible arms, and the action observation model to\n\np_M(y_t | h_t, \u03b8\u2217) \u221d exp(\u03b2 (\u03b8\u2217)^T X^T p_{h_t,y_t}),   (4)\n\nwhere p_{h_t,y_t} = [p_{1,h_t,y_t}, . . . , p_{K,h_t,y_t}]^T collects the probabilities of the next arm given action yt \u2208 {0, 1} at the current arm xt in ht, as estimated according to the teaching MDP, and X \u2208 R^{K\u00d7M} collects the arm features into a matrix. Note that the reward of the current arm does not appear in the action probability1. 
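Concretely, for binary answers the one-step model of Equation 4 is a two-way softmax over the expected rewards of the learner's next arm. A small sketch follows; the inputs are hypothetical (in the full method the next-arm probability vectors p_{h,y} would come from simulating the learner's Thompson sampling step):

```python
import numpy as np

def teacher_action_probs(theta_star, X, p_next_y0, p_next_y1, beta):
    """One-step teacher model (Equation 4): a softmax over the two answers,
    scored by the expected reward of the learner's next arm under each answer."""
    # Expected next-arm feature vectors xbar_y = X^T p_{h,y} for y = 0 and y = 1.
    xbar = np.stack([X.T @ p_next_y0, X.T @ p_next_y1])
    logits = beta * (xbar @ theta_star)   # value of answering y = 0 and y = 1
    logits -= logits.max()                # numerical stability
    p = np.exp(logits)
    return p / p.sum()                    # [P(y=0), P(y=1)]
```

With two independent arms and an answer y = 1 that steers the learner towards the higher-reward arm, the probability of y = 1 approaches one as beta grows, mirroring the step-function limit discussed in the text.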
For deterministic bandit arm selection strategies, the transition probabilities p_{k,h_t,y_t} for each of the two actions would have a single 1 and K \u2212 1 zeroes (essentially picking one of the possible arms), giving the action probability an interpretation as a preference for one of the possible next arms. For stochastic selection strategies, such as Thompson sampling, the interpretation is similar, but the two arms are now weighted averages, x\u0304_{y_t=0} = X^T p_{h_t,y_t=0} and x\u0304_{y_t=1} = X^T p_{h_t,y_t=1}. An algorithmic overview of learning with a one-step planning teacher model is given in Supplementary Section 5.\n\nFor an illustrative example, consider a case with two independent arms (x1 = [1, 0] and x2 = [0, 1]), with the first arm having a larger reward probability than the other (\u03b8\u2217_1 > \u03b8\u2217_2). The optimal teaching action is then to give yt = 1 for queries on arm 1 and yt = 0 for arm 2. A teaching-unaware learner will still need to query both arms multiple times to identify the better arm. A teaching-aware learner (when \u03b2 \u2192 \u221e) can identify the better arm from a single query (on either arm), since the likelihood function tends to the step function I(\u03b8\u2217_1 > \u03b8\u2217_2). This demonstrates that the teaching-aware learner can use a query to reduce uncertainty about other arms even in the extreme case of independent arms.\n\nIncorporating uncertainty about the teacher Teachers can exhibit different kinds of strategies. To make the learner\u2019s model of the teacher robust to different types of teachers, we formulate a mixture model over a set of alternative strategies. Here, for the multi-armed bandit case, we consider a combination of a teacher that just passes on the environmental reward (naive teacher, Equation 1) and the planning teacher (Equation 3):\n\np_{B/M}(y_t | h_t, \u03b8\u2217, \u03b1) = (1 \u2212 \u03b1) p_B(y_t | \u00b5_{i_t}) + \u03b1 p_M(y_t | h_t, \u03b8\u2217),   (5)\n\n1 It cancels out. 
The teacher cannot affect the arm choice anymore, as it has already been made.\n\nwhere \u03b1 \u2208 (0, 1) is a mixing weight and \u00b5_{i_t} = \u03c3(x_{i_t}^T \u03b8\u2217) is the reward probability of the latest arm in the history ht. A beta prior distribution, \u03b1 \u223c Beta(1, 1), is assumed for the mixing weight.\n\n3.4 Computational details for Bayesian Bernoulli multi-armed bandits\n\nComputation presents three challenges: (i) computing the analytically intractable posterior distribution of the model parameters, p(\u03b8 | Dt) or p(\u03b8\u2217, \u03b1 | Dt), (ii) solving the state-action value functions Q\u2217 for the teaching MDP, and (iii) computing the Thompson sampling probabilities that are needed for the state-action value functions.\n\nWe implemented the models in the probabilistic programming language Pyro (version 0.3, under PyTorch v1.0) [40] and approximate the posterior distributions with Laplace approximations [41, Section 4.1]. In brief, the posterior is approximated as a multivariate Gaussian, with the mean defined by the maximum a posteriori (MAP) estimate and the covariance matrix being the negative of the inverse Hessian matrix at the MAP estimate. In the mixture model, the mixture coefficient \u03b1 \u2208 (0, 1) is transformed to the real axis via the logit function before computing the approximation.\n\nThe inference requires computing the gradient of the logarithm of the unnormalised posterior probability. For the teacher model, this entails computing the gradient of the logarithm of Equation 3 at any value of the model parameters, which requires solving and computing the gradients of the optimal state-action value functions Q\u2217 with respect to \u03b8\u2217. To solve the Q\u2217 for both of the possible observable actions yt = 0 and yt = 1, we compute all the possible trajectories in the MDP until the horizon T and choose the ones giving maximal expected cumulative reward. 
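The Laplace approximation described above can be sketched for the plain Bernoulli likelihood (Equation 1). The paper's implementation uses Pyro, but the idea reduces to a Newton iteration to the MAP estimate followed by a Gaussian fit; this hypothetical minimal version is ours, written for a N(0, tau^2 I) prior:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_logistic(X, y, tau=1.0, n_newton=25):
    """Laplace approximation for Bayesian logistic regression: Newton
    iterations to the MAP estimate, then a Gaussian whose covariance is
    the negative inverse Hessian of the log posterior at that point."""
    n, m = X.shape
    w = np.zeros(m)
    prior_prec = np.eye(m) / tau ** 2
    for _ in range(n_newton):
        mu = sigmoid(X @ w)
        grad = X.T @ (y - mu) - prior_prec @ w
        hess = -(X.T * (mu * (1.0 - mu))) @ X - prior_prec
        w = w - np.linalg.solve(hess, grad)      # Newton step
    # Recompute the Hessian at the MAP estimate for the covariance.
    mu = sigmoid(X @ w)
    hess = -(X.T * (mu * (1.0 - mu))) @ X - prior_prec
    return w, np.linalg.inv(-hess)
```

The returned mean and covariance can then stand in for p(theta | D_t) in the Thompson sampling step.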
Choi and Kim [39] show that the gradients of Q\u2217 exist almost everywhere, and that the direct computation gives a subgradient at the boundaries where the gradient does not exist.\n\nWe mainly focus on one-step planning (T = 1) in the experiments. For long planning horizons and stochastic arm selection strategies, the number of possible trajectories grows too fast for the exact exhaustive computation to be feasible (K^T trajectories for each initial action). In our multi-step experiments, we approximate the forward simulation of the MDP with virtual arms: instead of considering all possible next arms given an action yt and weighting them with their selection probabilities p_{h_t,y_t}, we update the model with a virtual arm that is the selection-probability-weighted average of the next possible arms, x\u0304_{h_t,y_t} = X^T p_{h_t,y_t} (for deterministic strategies, this is exact computation). The virtual arms do not correspond to real arms in the system but are expectations over the next arms. This leads to 2^{T\u22121} trajectories to simulate for each initial action. Moreover, for any trajectory of actions y1, . . . , yT, this approximation gives Q(h1, y1; \u03b8\u2217) \u2248 (\u03b8\u2217)^T X^T \u03a3_{t=1}^T \u03b3^{t\u22121} p_{h_t,y_t}, and if we cache the sum of the discounted transition probabilities for each trajectory from the forward simulation, we can easily find the optimal Q\u2217 at any value of \u03b8\u2217 as required for the inference.\n\nComputing the next arm probabilities for the Q\u2217 values requires computing the actual Thompson sampling probabilities in Equation 2 instead of just sampling from it. As the sigmoid function is monotonic, one can equivalently compute the probabilities as Pr(i_{t+1} = k) = \u222b I(argmax_j z_j = k) p(z | D_t) dz, where z = X\u03b8\u2217. As p(\u03b8\u2217 | Dt) \u2248 N(\u03b8\u2217 | m, \u03a3), z has a multivariate normal distribution with mean Xm and covariance X\u03a3X^T. The selection probabilities can then be estimated with Monte Carlo sampling. We further use Rao-Blackwellized estimates, Pr(i_{t+1} = k) \u2248 (1/L) \u03a3_{l=1}^L Pr(z_k > max_{j\u2260k} z_j | z^{(l)}_{\u2212k}), with L Monte Carlo samples drawn for z_{\u2212k} (z with the kth component removed) and Pr(z_k > max_{j\u2260k} z_j | z^{(l)}_{\u2212k}) being the conditional normal probability of component z_k being larger than the largest component in z_{\u2212k}.\n\n4 Experiments\n\nWe perform simulation experiments for the Bayesian Bernoulli multi-armed bandit learner, based on a real dataset, to study (i) whether a teacher can efficiently steer the learner towards a target to increase learning performance, (ii) whether the ability of the learner to recognise the teaching intent increases the performance, (iii) whether the mixture model is robust to assumptions about the teacher\u2019s strategy, and (iv) whether planning multiple steps ahead improves teaching performance. We then present results from a proof-of-concept study with humans. Supplementary Section 6.1 includes an additional experiment studying the teaching of an uncertainty-sampling-based logistic regression active learner, showing that teaching can improve learning performance markedly.\n\nFigure 3: Left-side panels: Planning teacher improves performance, both when the learner\u2019s teacher model is naive (P-N) or planning (P-P), over naive teacher (N-N). Right-side panels: Naive teacher with a learner expecting a planning teacher (N-P) degrades performance. Learners with the mixture teacher model attain similar performance to matched models (P-M vs P-P and N-M vs N-N (left)). Lines show the mean over 100 replications and shaded area the 95% confidence intervals for the mean. 
See Table 1 for key to the abbreviations.\n\n4.1 Simulation experiments\n\nWe use a word relevance dataset for simulating an information retrieval task. In this task, the user is trying to teach a relevance profile to the learner in order to reach her target word. The Word dataset is a random selection of 10,000 words from Google\u2019s Word2Vec vectors, pre-trained on the Google News dataset [42]. We reduce the dimensionality of the word embeddings from the original 300 to 10 using PCA. Feature vectors are mean-centred and normalised to unit length. We report results, with similar conclusions, on two other datasets in Supplementary Section 6.2.\n\nWe randomly generate 100 replicate experiments: a set of 100 arms is sampled without replacement and one arm is randomly chosen as the target x\u0302 \u2208 R^M. The ground-truth relevance profile is generated by first setting \u03b8\u0302\u2217 = [c, d x\u0302] \u2208 R^{M+1}, where c = \u22124 is a weight for an intercept term (a constant element of 1 is added to the xs) and d = 8 is a scaling factor. Then, the ground-truth reward probabilities are computed as \u00b5\u0302_k = \u03c3(x_k^T \u03b8\u0302\u2217) for each arm k (Supplementary Figure 2 shows the mean reward probability profile). To reduce experimental variance for method comparison, we choose one of the arms randomly as the initial query for all methods.\n\nWe compare the learning performances of different pairs of simulated teachers and learners (Table 1). A naive teacher (N), which does not intentionally teach, passes on a stochastic binary reward (Equation 1) based on the ground truth \u00b5\u0302_k as its action for arm k (the standard bandit assumption). A planning teacher (P) uses the probabilistic teaching MDP model (Equation 4 for one-step and Equation 3 for multi-step) based on the ground truth \u03b8\u0302\u2217 to plan its action. 
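The replicate-generation recipe above is easy to reproduce. A sketch under our reading of the text (c = -4 intercept weight, d = 8 scaling, unit-normalised arm features; the function name is ours):

```python
import numpy as np

def make_reward_profile(X, target_idx, c=-4.0, d=8.0):
    """Ground-truth generation as described in the text: theta* = [c, d * x_target]
    with an intercept, and reward probabilities mu_k = sigmoid(x_k^T theta*)."""
    theta_star = np.concatenate(([c], d * X[target_idx]))
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the constant 1 feature
    mu = 1.0 / (1.0 + np.exp(-(X1 @ theta_star)))
    return theta_star, mu
```

With unit-length features, the target arm gets reward probability sigmoid(-4 + 8) which is about 0.98, while arms dissimilar to the target get probabilities near sigmoid(-4).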
We use \u03b2\u0302 = 20 as the planning teacher\u2019s optimality parameter and also set \u03b2 of the learner\u2019s teacher model to the same value. For multi-step models, we set \u03b3_t = 1/T, so that they plan to maximise the average return up to horizon T. The learners are named based on their models of the teacher: a teaching-unaware learner learns based on the naive teacher model (N; Equation 1) and a teaching-aware learner models the planning teacher (P; Equation 4 or Equation 3). Mixture model (M) refers to the learner with a mixture of the two teacher models (Equation 5).\n\nExpected cumulative reward and concordance index are used as performance measures (higher is better for both). Expected cumulative reward measures how efficiently the system can find high reward arms and is a standard bandit benchmark value. Concordance index is equivalent to the area under the receiver operating characteristic curve. It is a common performance measure for information retrieval tasks. It estimates the probability that a random pair of arms is ordered in the same order by their ground truth relevances and the model\u2019s estimated relevances; 0.5 corresponds to random and 1.0 to perfect performance.\n\nTable 1: Teacher\u2013learner pairs.\n\n                 Learner\u2019s model of teacher\nTeacher      naive    planning    mixture\nnaive        N-N      N-P         N-M\nplanning     P-N      P-P         P-M\n\n4.2 Simulation results\n\nTeaching improves performance Figure 3 shows the performance of different combinations of pairs of teachers and learners (where planning teachers have planning horizon T = 1). 
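The concordance index used as a performance measure above can be computed directly from its definition as a probability over pairs; a small O(K^2) sketch (names are ours, ties in the estimates counted as half-concordant):

```python
def concordance_index(true_rel, est_rel):
    """Probability that a random pair of arms is ordered the same way by the
    ground-truth and the estimated relevances."""
    n = len(true_rel)
    concordant, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if true_rel[i] == true_rel[j]:
                continue                  # skip ties in the ground truth
            pairs += 1
            d = (true_rel[i] - true_rel[j]) * (est_rel[i] - est_rel[j])
            if d > 0:
                concordant += 1.0
            elif d == 0:
                concordant += 0.5
    return concordant / pairs
```

A perfectly ordered estimate gives 1.0, a reversed ordering 0.0, and an uninformative one about 0.5, matching the interpretation in the text.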
The planning teacher can steer a teaching-unaware learner to achieve a marked increase in performance compared to a naive teacher (P-N vs N-N; left-side panels), showing that intentional teaching makes the reward signal more supportive of learning. The performance increases markedly further when the learner models the planning teacher (P-P; left-side panels). The improvements are seen in both performance measures, and the concordance index in particular implies that the proposed model learns faster about relevant arms and also achieves higher overall performance at the end of the 30 steps.

Mixture model increases robustness to assumptions about the teacher   A mismatch of a naive teacher with a learner expecting a planning teacher (N-P) is markedly detrimental to performance (Figure 3, right-side panels). The mixture model guards against the mismatch and attains a performance similar to that with matching assumptions (P-M vs P-P and N-M vs N-N).

Planning for multiple steps increases performance further   Figure 4 shows the cumulative reward difference for matching planning teacher–learner pairs (P-P) when planning two to four steps ahead compared to one step.
There is a marked improvement especially when going to a 3-step or 4-step planning horizon.

Sensitivity analysis   Sensitivity of the results to the simulated teacher's optimality parameter β̂ (performance degrades markedly for small values of β̂) and to the number of arms (500 instead of 100; results remain qualitatively similar) is shown in Supplementary Section 6.2.

Figure 4: Teachers planning for multiple steps ahead improve over 1-step (P-P) in performance.

Figure 5: The accumulated reward was consistently higher for the participants when interacting with a learner having the mixture teacher model, compared to a learner with the naive teacher model. Shaded lines show the mean performance (over the 20 target words) of individual participants. Solid lines show the mean over the participants. Random arm sampling is shown as a baseline.

4.3 User experiment

We conducted a proof-of-concept user study for the task introduced above, using a subset of 20 words, with ten university students and researchers. The goal of the study was introduced to the participants as helping a system find a target word, as fast as possible, by providing binary answers (yes/no) to the system's questions: "Is this word relevant to the target?" A target word was given to the participants at the beginning of each round (twenty rounds; each word chosen once as the target word). Details of the study setting are provided in Supplementary Section 7.

Participants achieved noticeably higher average cumulative reward when interacting with a learner having the mixture teacher model, compared to a learner with the naive teacher model (Figure 5, red vs blue).
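The per-step significance computation reported for this comparison (a paired sample t-test over participants) can be sketched as follows. The data here are synthetic stand-ins, not the study's measurements, and the function name is illustrative.

```python
import numpy as np
from scipy import stats


def paired_step_test(rewards_mixture, rewards_naive):
    """Two-sided paired sample t-test on per-participant cumulative
    rewards at a given question step; returns (t-statistic, p-value)."""
    t, p = stats.ttest_rel(np.asarray(rewards_mixture),
                           np.asarray(rewards_naive))
    return float(t), float(p)


# Synthetic example: ten participants, mixture condition clearly higher.
rng = np.random.default_rng(0)
mixture = rng.normal(6.0, 1.0, size=10)
naive = mixture - 2.0 + rng.normal(0.0, 0.2, size=10)
t_stat, p_value = paired_step_test(mixture, naive)
```

The test is paired because each participant serves as their own control: the same person interacted with both learner variants, so differencing within participants removes between-participant variability.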
This difference was statistically significant (p-value < 0.01) after 12 questions, computed using a paired sample t-test (see Supplementary Section 7 for p-values per step).

5 Discussion and conclusions

We introduced a new sequential machine teaching problem, where the learner actively chooses queries and the teacher provides responses to them. This encompasses teaching popular sequential learners, such as active learners and multi-armed bandits. The teaching problem was formulated as a Markov decision process, the solution of which provides the optimal teaching policy. We then formulated teacher-aware learning from the teacher's responses as probabilistic inverse reinforcement learning. Experiments on Bayesian Bernoulli multi-armed bandits and logistic regression active learners demonstrated improved performance from teaching and from learning with teacher awareness. Better theoretical understanding of the setting, and studying a more varied set of assumptions and approaches to planning for both the teacher and the teacher-aware learner, are important future directions.

Our formulation provides a way to model users with strategic behaviour as boundedly optimal teachers in interactive intelligent systems. We conducted a proof-of-concept user study, in which the user was tasked with steering a bandit system towards a target word, with encouraging results. To scale the approach to more realistic systems, for example, to interactive exploratory information retrieval [43], of which our user study is a simplified instance, or to human-in-the-loop Bayesian optimisation [44], where the user might not possess exact knowledge of the goal, future work should consider incorporating more advanced cognitive models of users.
As an efficient teacher (user) needs to be able to model the learner (system), our results also highlight the role of understandability and predictability of interactive systems for the user as an important design factor, not only for user experience but also for the statistical modelling in the system.

While we focused here on teachers with bounded, short-horizon planning (as we would not expect human users to be able to predict the behaviour of interactive systems over long horizons), scaling the computation to larger problems is of interest. Given the similarity of the teaching MDP to Bayes-adaptive MDPs (and partially observable MDPs), planning methods developed for them could be used for an efficient search for teaching actions. The teaching setting has some advantages here: as the teacher is assumed to have privileged information, such as a target model, that information could be used to generate a reasonable initial policy for choosing actions y. Such a policy could then be refined, for example, using Monte Carlo tree search. The teacher-aware learning problem is more challenging, as inverse reinforcement learning requires handling the planning problem in an inner loop. Considering the application and adaptation of state-of-the-art inverse reinforcement learning methods for teacher-aware learning is future work.

Acknowledgments

This work was financially supported by the Academy of Finland (Flagship programme: Finnish Center for Artificial Intelligence, FCAI; grants 319264, 313195, 305780, 292334). Mustafa Mert Çelikok is partially funded by the Finnish Science Foundation for Technology and Economics KAUTE. We acknowledge the computational resources provided by the Aalto Science-IT Project. We thank Antti Oulasvirta and Marta Soare for comments that improved the article.

References

[1] Allen Newell and Herbert Alexander Simon. Human Problem Solving.
Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1972.

[2] Keith J. Holyoak. Problem solving. In Edward E. Smith and Daniel N. Osherson, editors, Thinking: An Invitation to Cognitive Science, Vol. 3, pages 267–296. The MIT Press, 2nd edition, 1995.

[3] Arjun Chandrasekaran, Deshraj Yadav, Prithvijit Chattopadhyay, Viraj Prabhu, and Devi Parikh. It takes two to tango: Towards theory of AI's mind. arXiv preprint arXiv:1704.00717, 2017.

[4] Randi Williams, Hae Won Park, and Cynthia Breazeal. A is for artificial intelligence: The impact of artificial intelligence activities on young children's perceptions of robots. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 447:1–447:11, 2019.

[5] Sally A. Goldman and Michael J. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31, 1995.

[6] Xiaojin Zhu. Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 4083–4087, 2015.

[7] Xiaojin Zhu, Adish Singla, Sandra Zilles, and Anna N. Rafferty. An overview of machine teaching. arXiv preprint arXiv:1801.05927, 2018.

[8] Anna N. Rafferty, Emma Brunskill, Thomas L. Griffiths, and Patrick Shafto. Faster teaching via POMDP planning. Cognitive Science, 40(6):1290–1332, 2016.

[9] Shike Mei and Xiaojin Zhu. Using machine teaching to identify optimal training-set attacks on machine learners. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2871–2877, 2015.

[10] Maya Cakmak and Manuel Lopes. Algorithmic and human teaching of sequential decision tasks. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pages 1536–1542, 2012.

[11] Daniel S. Brown and Scott Niekum.
Machine teaching for inverse reinforcement learning: Algorithms and applications. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019.

[12] Laurent Lessard, Xuezhou Zhang, and Xiaojin Zhu. An optimal control approach to sequential machine teaching. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, AISTATS, pages 2495–2503, 2019.

[13] Weiyang Liu, Bo Dai, Ahmad Humayun, Charlene Tay, Chen Yu, Linda B. Smith, James M. Rehg, and Le Song. Iterative machine teaching. In Proceedings of the 34th International Conference on Machine Learning, ICML, pages 2149–2158, 2017.

[14] Weiyang Liu, Bo Dai, Xingguo Li, Zhen Liu, James Rehg, and Le Song. Towards black-box iterative machine teaching. In Proceedings of the 35th International Conference on Machine Learning, ICML, pages 3141–3149, 2018.

[15] Jacob Whitehill and Javier Movellan. Approximately optimal teaching of approximately optimal learners. IEEE Transactions on Learning Technologies, 11(2):152–164, 2017.

[16] Sandra Zilles, Steffen Lange, Robert Holte, and Martin Zinkevich. Models of cooperative teaching and learning. Journal of Machine Learning Research, 12:349–384, 2011.

[17] Thorsten Doliwa, Gaojian Fan, Hans Ulrich Simon, and Sandra Zilles. Recursive teaching dimension, VC-dimension and sample compression. Journal of Machine Learning Research, 15:3107–3131, 2014.

[18] Yuzhe Ma, Kwang-Sung Jun, Lihong Li, and Xiaojin Zhu. Data poisoning attacks in contextual bandits. In International Conference on Decision and Game Theory for Security, pages 186–204, 2018.

[19] Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, NeurIPS, pages 3640–3649, 2018.

[20] Andrew Y. Ng, Daishi Harada, and Stuart Russell.
Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning, ICML, volume 99, pages 278–287, 1999.

[21] Piotr J. Gmytrasiewicz and Prashant Doshi. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research, 24:49–79, 2005.

[22] David V. Pynadath and Milind Tambe. The communicative multiagent team decision problem: Analyzing teamwork theories and models. Journal of Artificial Intelligence Research, 16:389–423, 2002.

[23] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, NIPS, pages 3909–3917, 2016.

[24] Frans A. Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. SpringerBriefs in Intelligent Systems. Springer, May 2016.

[25] Stefano V. Albrecht and Peter Stone. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258:66–95, 2018.

[26] Chris L. Baker, Julian Jara-Ettinger, Rebecca Saxe, and Joshua B. Tenenbaum. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour, 1(4):0064, 2017.

[27] Jaime F. Fisac, Monica A. Gates, Jessica B. Hamrick, Chang Liu, Dylan Hadfield-Menell, Malayandi Palaniappan, Dhruv Malik, S. Shankar Sastry, Thomas L. Griffiths, and Anca D. Dragan. Pragmatic-pedagogic value alignment. In International Symposium on Robotics Research, ISRR, 2017.

[28] Mark P. Woodward and Robert J. Wood. Learning from humans as an I-POMDP. arXiv preprint arXiv:1204.0274, 2012.

[29] Gerhard Fischer. User modeling in human–computer interaction. User Modeling and User-Adapted Interaction, 11(1-2):65–86, 2001.

[30] Gary Marchionini.
Exploratory search: From finding to understanding. Communications of the ACM, 49(4):41–46, 2006.

[31] Tuukka Ruotsalo, Giulio Jacucci, Petri Myllymäki, and Samuel Kaski. Interactive intent modeling: Information discovery beyond search. Communications of the ACM, 58(1):86–92, 2015.

[32] Sven Schmit and Carlos Riquelme. Human interaction with recommendation systems. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, AISTATS, pages 862–870, 2018.

[33] Pedram Daee, Tomi Peltola, Aki Vehtari, and Samuel Kaski. User modelling for avoiding overfitting in interactive knowledge elicitation for prediction. In 23rd International Conference on Intelligent User Interfaces, IUI, pages 305–310, 2018.

[34] Samuel J. Gershman, Eric J. Horvitz, and Joshua B. Tenenbaum. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 349(6245):273–278, 2015.

[35] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[36] Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.

[37] Arthur Guez, David Silver, and Peter Dayan. Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search. Journal of Artificial Intelligence Research, 48:841–883, 2013.

[38] Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, IJCAI, pages 2586–2591, 2007.

[39] Jaedeug Choi and Kee-Eung Kim.
MAP inference for Bayesian inverse reinforcement learning. In Advances in Neural Information Processing Systems, NIPS, pages 1989–1997, 2011.

[40] Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. Pyro: Deep universal probabilistic programming. Journal of Machine Learning Research, 20(28):1–6, 2019.

[41] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 3rd edition, 2014.

[42] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, NIPS, pages 3111–3119, 2013.

[43] Tuukka Ruotsalo, Jaakko Peltonen, Manuel J. A. Eugster, Dorota Glowacka, Patrik Floréen, Petri Myllymäki, Giulio Jacucci, and Samuel Kaski. Interactive intent modeling for exploratory search. ACM Trans. Inf. Syst., 36(4):44:1–44:46, 2018.

[44] Eric Brochu, Tyson Brochu, and Nando de Freitas. A Bayesian interactive optimization approach to procedural animation design. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 103–112, 2010.