{"title": "Active Learning from Peers", "book": "Advances in Neural Information Processing Systems", "page_first": 7008, "page_last": 7017, "abstract": "This paper addresses the challenge of learning from peers in an online multitask setting. Instead of always requesting a label from a human oracle, the proposed method first determines if the learner for each task can acquire that label with sufficient confidence from its peers either as a task-similarity weighted sum, or from the single most similar task. If so, it saves the oracle query for later use in more difficult cases, and if not it queries the human oracle. The paper develops the new algorithm to exhibit this behavior and proves a theoretical mistake bound for the method compared to the best linear predictor in hindsight. Experiments over three multitask learning benchmark datasets show clearly superior performance over baselines such as assuming task independence, learning only from the oracle and not learning from peer tasks.", "full_text": "Active Learning from Peers\n\nKeerthiram Murugesan\n\nJaime Carbonell\n\nSchool of Computer Science\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\n{kmuruges,jgc}@cs.cmu.edu\n\nAbstract\n\nThis paper addresses the challenge of learning from peers in an online multitask\nsetting. Instead of always requesting a label from a human oracle, the proposed\nmethod \ufb01rst determines if the learner for each task can acquire that label with\nsuf\ufb01cient con\ufb01dence from its peers either as a task-similarity weighted sum, or\nfrom the single most similar task. If so, it saves the oracle query for later use in\nmore dif\ufb01cult cases, and if not it queries the human oracle. The paper develops the\nnew algorithm to exhibit this behavior and proves a theoretical mistake bound for\nthe method compared to the best linear predictor in hindsight. 
Experiments over three multitask learning benchmark datasets show clearly superior performance over baselines such as assuming task independence, learning only from the oracle, and not learning from peer tasks.

1 Introduction

Multitask learning leverages the relationships between tasks to transfer relevant knowledge from information-rich tasks to information-poor ones. Most existing work in multitask learning focuses on how to take advantage of these task relationships, either by sharing data directly [1] or by learning model parameters via cross-task regularization techniques [2, 3, 4]. This paper focuses on a specific multitask setting where tasks are allowed to interact by requesting labels from other tasks for difficult cases.

In a broad sense, there are two settings for learning multiple related tasks together: 1) batch learning, in which an entire training set is available to the learner; 2) online learning, in which the learner sees the data sequentially. In recent years, online multitask learning has attracted increasing attention [5, 6, 7, 8, 9, 10]. In the online multitask setting, at each round t the learner receives an example (along with a task identifier) and predicts its output label. One may also consider learning multiple tasks simultaneously by receiving K examples for K tasks at each round t. Subsequently, the learner receives the true label and updates the model(s) as necessary. This sequence is repeated over the entire data, simulating a data stream. In this setting, the assumption is that the true label is readily available for the task learner, which is impractical in many applications.

Recent work in active learning for sequential problems has addressed this concern by allowing the learner to decide whether to ask the oracle for the true label of the current example and incur a cost, or to skip the example.
Most approaches to active learning for sequential problems use a measure such as the confidence of the learner on the current example [11, 12, 13, 14, 15]. In online multitask learning, one can utilize the task relationships to further reduce the total number of labels requested from the oracle. This paper presents a novel active learning method for sequential decision problems using peers, i.e., related tasks. The key idea is that when the learner is not confident on the current example, it is allowed to query its peers, which usually has a low cost, before requesting a true label from the oracle and incurring a high cost. Our approach follows a perceptron-based update rule in which the model for a given task is updated only when the prediction for that task is

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

in error. The goal of an online learner in this setting is to minimize errors, attempting to reach the performance of the full hindsight learner while at the same time reducing the total number of queries issued to the oracle.

There are many useful application areas for online multitask learning with selective sampling, including optimizing financial trading, email prioritization and filtering, personalized news, crowdsource-based annotation, spam filtering, and spoken dialog systems. Consider the latter, where several automated agents/bots service several clients. Each agent is specialized or trained to answer customer questions on a specific subject such as automated payment, troubleshooting, adding or cancelling services, etc. In such a setting, when one of the automated agents cannot answer a customer's question, it may request the assistance of another automated agent that is an expert in the subject related to that question.
For example, an automated agent for customer retention may request help from an automated agent for new services in order to offer new deals to the customer. When neither agent can answer the customer's question, the system may then direct the call to a live agent. This can reduce the number of service calls directed to live agents and the cost associated with such requests.

Similarly in spam filtering, some spam is universal to all users (e.g. financial scams), some messages might be useful to certain affinity groups but spam to most others (e.g. announcements of meditation classes or other special-interest activities), and some may depend on evolving user interests. In spam filtering each user is a task, and shared interests and dis-interests formulate the inter-task relationship matrix. If we can learn the task relationship matrix as well as improve per-task models from specific peer decisions on difficult examples, we can perform mass customization of spam filtering, borrowing from spam/not-spam feedback from users with similar preferences. The primary contribution of this paper is precisely active learning for multiple related tasks and its use in estimating per-task model parameters in an online setting.

1.1 Related Work

While there is considerable literature on online multitask learning, many crucial aspects remain largely unexplored. Most existing work in online multitask learning focuses on how to take advantage of task relationships. To achieve this, Lugosi et al. [7] imposed a hard constraint on the K simultaneous actions taken by the learner in the expert setting, Agarwal et al. [16] used matrix regularization, and Dekel et al. [6] proposed a global loss function, an absolute norm, to tie together the loss values of the individual tasks.
In all these works, the proposed algorithms assume that the true labels are available for each instance.

Selective sampling-based learners in the online setting, on the other hand, decide whether to ask the human oracle for labels of difficult instances [11, 12, 13, 14, 15]. This can be easily extended to the online multitask learning setting by applying selective sampling to each individual task separately. Saha et al. [9] formulated the learning of a task relationship matrix as a Bregman-divergence minimization problem w.r.t. positive definite matrices and used this task-relationship matrix to naively select the instances for labelling by the human oracle.

Several recent works in online multitask learning recommended updating all the task learners on each round t [10, 9, 8]. When a task learner makes a mistake on an example, all the tasks' model parameters are updated to account for the new example. This significantly increases the computational complexity at each round, especially when the number of tasks is large [17]. Our proposed method avoids this issue by updating only the learner of the current example and utilizing the knowledge from peers only when the current learner requests it.

This work is motivated by the recent interest in active learning from multiple (strong or weak) teachers [18, 19, 12, 20, 21, 22]. Instead of a single all-knowing oracle, these papers assume multiple oracles (or teachers), each with a different area of expertise. At round t, some of the teachers are experts on the current instance but the others may not be confident in their predicted labels. Such a learning setting is very common in crowd-sourcing platforms where multiple annotators are used to label an instance.
Our learning setting is different from these approaches: instead of learning from multiple oracles, we learn from our peers (or related tasks) without any associated high cost. Finally, our proposed method is closely related to learning with a rejection option [23, 24], where the learner may choose not to predict a label for an instance. To reject an instance, these methods use a measure of confidence to identify difficult instances. We use a similar approach to identify when to query peers and when to query the human oracle for the true label.

1. Receive an example x^{(t)} for the task k.
2. If the task k is not confident in the prediction for this example, ask the peers or related tasks whether they can give a confident label to this example.
3. If the peers are not confident enough, ask the oracle for the true label y^{(t)}.

Figure 1: Proposed learning approach from peers.

2 Problem Setup

Suppose we are given K tasks where the kth task is associated with N_k training examples. For brevity, we consider a binary classification problem for each task, but the methods generalize to multi-class settings and are also applicable to regression tasks. We denote by [N] the consecutive integers ranging from 1 to N. Let \{(x_k^{(i)}, y_k^{(i)})\}_{i=1}^{N_k} be the data for task k, where x_k^{(i)} \in \mathbb{R}^d is the ith instance from the kth task and y_k^{(i)} is its corresponding true label. When the notation is clear from the context, we drop the index k and write ((x^{(i)}, k), y^{(i)}).

Let \{w_k^*\}_{k \in [K]} be any set of arbitrary vectors with w_k^* \in \mathbb{R}^d. The hinge losses on the example ((x^{(t)}, k), y^{(t)}) are given by \ell_{kk}^{*(t)} = (1 - y^{(t)} \langle x^{(t)}, w_k^* \rangle)_+ and \ell_{km}^{*(t)} = (1 - y^{(t)} \langle x^{(t)}, w_m^* \rangle)_+, respectively, where (z)_+ = \max(0, z). Similarly, we define hinge losses \ell_{kk}^{(t)} and \ell_{km}^{(t)} for the linear predictors \{w_k^{(t)}\}_{k \in [K]} learned at round t.
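The hinge loss and margin quantities just defined can be illustrated in a few lines. This is a minimal sketch with variable names of our own choosing, not code from the paper:

```python
import numpy as np

def hinge_loss(w, x, y):
    # Hinge loss (1 - y * <x, w>)_+ of a linear predictor w on example (x, y).
    return max(0.0, 1.0 - y * float(np.dot(x, w)))

def margin(w, x):
    # Signed margin h(x) = <w, x>; its magnitude |h(x)| is the confidence.
    return float(np.dot(w, x))

w = np.array([0.5, -1.0])
x = np.array([1.0, 1.0])
print(hinge_loss(w, x, +1))   # 1 - (+1)(-0.5) = 1.5
print(abs(margin(w, x)))      # confidence 0.5
```

A small margin magnitude signals low confidence, which is exactly the quantity the selective sampling rules below act on.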
Let Z^{(t)} be a Bernoulli random variable indicating whether the learner requested a true label for the example x^{(t)}, and let M^{(t)} be a binary variable indicating whether the learner made a mistake on x^{(t)}. We use the following expected hinge losses for our theoretical analysis: \tilde{L}_{kk} = \mathbb{E}[\sum_t M^{(t)} Z^{(t)} \ell_{kk}^{*(t)}] and \tilde{L}_{km} = \mathbb{E}[\sum_t M^{(t)} Z^{(t)} \ell_{km}^{*(t)}].

We start with our proposed active-learning-from-peers algorithm, based on selective sampling for online multitask problems, and study the mistake bound for the algorithm in Section 3. We report our experimental results and analysis in Section 4. Additionally, we extend our learning algorithm to learning multiple tasks in parallel in the supplementary material.

3 Learning from Peers

We consider a multitask perceptron for our online learning algorithm. On each round t, we receive an example (x^{(t)}, k) from task k (footnote 1). Each perceptron learner for task k maintains a model represented by w_k^{(t-1)}, learned from examples received until round t-1. Task k predicts a label for the received example x^{(t)} using h_k(x^{(t)}) = \langle w_k^{(t-1)}, x^{(t)} \rangle (footnote 2). As in previous work [11, 12, 23], we use |h_k(x^{(t)})| to measure the confidence of the kth task learner on this example. When the confidence is high, the learner does not need to request the true label y^{(t)} from the oracle.

Building on this idea, [11] proposed a selective sampling algorithm that uses the margin |h_k(x^{(t)})| to decide whether to query the oracle. Intuitively, if |h_k(x^{(t)})| is small, then the kth task learner is not confident in its prediction for x^{(t)}, and vice versa.
They consider a Bernoulli random variable P^{(t)} for the event |h_k(x^{(t)})| \le b_1, with probability b_1 / (b_1 + |h_k(x^{(t)})|) for some predefined constant b_1 \ge 0. If P^{(t)} = 1 (confidence is low), then the kth learner requests the true label from the oracle. Similarly, when P^{(t)} = 0 (confidence is high), the learner skips the request to the oracle. This considerably reduces the number of label requests to the oracle. When dealing with multiple tasks, one may apply the same idea via selective sampling for each task individually [25]. Unfortunately, such an approach does not take into account the inherent relationships between the tasks.

Footnote 1: We consider a different online learning setting later, in the supplementary section, where we simultaneously receive K examples at each round, one for each task k.
Footnote 2: We also use the notation \hat{p}_{kk} = \langle w_k^{(t-1)}, x^{(t)} \rangle and \hat{p}_{km} = \langle w_m^{(t-1)}, x^{(t)} \rangle.

In this paper, we consider a novel active learning (or selective sampling) method for online multitask learning that addresses the concerns discussed above. Our proposed learning approach is summarized in Figure 1. Unlike previous work [8, 9, 10], we update only the current task parameter w_k when we make a mistake at round t, instead of updating all the task model parameters w_m, \forall m \in [K], m \ne k. Our proposed method thus avoids this overhead, sharing knowledge from peers only when assistance is needed. In addition, the task relationships are taken into account to measure whether the peers are confident in predicting this example.
This approach provides a compromise between learning the tasks independently and updating all the learners whenever a specific learner makes a mistake.

As in the traditional selective sampling algorithm [11], we consider a Bernoulli random variable P^{(t)} for the event |h_k(x^{(t)})| \le b_1 with probability b_1 / (b_1 + |h_k(x^{(t)})|). In addition, we consider a second Bernoulli random variable Q^{(t)} for the event |h_m(x^{(t)})| \le b_2 with probability b_2 / (b_2 + \sum_{m \in [K], m \ne k} \tau_{km}^{(t-1)} |h_m(x^{(t)})|). The idea is that when the weighted sum of the peers' confidence on the current example is high, we use the predicted label \tilde{y}^{(t)} from the peers for the perceptron update instead of requesting a true label y^{(t)} from the oracle. In our experiments in Section 4, we consider the confidence of the most related task instead of the weighted sum, to reduce the computational complexity at each round. We set Z^{(t)} = P^{(t)} Q^{(t)}, and set M^{(t)} = 1 if we made a mistake at round t, i.e., y^{(t)} \ne \hat{y}^{(t)} (only when the label is revealed/queried).

The pseudo-code is in Algorithm 1. Line 14 is executed when we request a label from the oracle or when the peers are confident about the label for the current example. Note that the two terms in (M^{(t)} Z^{(t)} y^{(t)} + \tilde{Z}^{(t)} \tilde{y}^{(t)}) are mutually exclusive (when P^{(t)} = 1). Lines 15-16 compute the relationship between tasks, \tau_{km}, based on the recent work of [10]. They maintain a distribution over peers w.r.t. the current task. The value of \tau is updated at each round using the cross-task error \ell_{km}. In addition, we use \tau to get the confidence of the most related task, rather than the weighted sum of the peers' confidences, to obtain the predicted label from the peers (see Section 4 for more details). When learning with many tasks [17], this provides faster computation without significantly compromising the performance of the learning algorithm.
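The margin-based query rule used for both random variables above can be sketched compactly. This is an illustrative sketch of the sampling rule described in the text (names are ours):

```python
import random

def query_probability(confidence, b=1.0):
    # P(query) = b / (b + |h(x)|): small margins are queried with
    # probability near 1, confident predictions near 0.
    return b / (b + abs(confidence))

def should_query(confidence, b=1.0, rng=random):
    # Draw the Bernoulli variable; True means ask for a label.
    return rng.random() < query_probability(confidence, b)

# A barely-confident prediction is almost always queried; a confident one rarely.
print(query_probability(0.01))  # ~0.990
print(query_probability(50.0))  # ~0.020
```

The same rule is applied twice in the algorithm: once with b_1 on the learner's own margin, and once with b_2 on the peers' aggregated margin.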
One may use a different notion of task relationship depending on the application at hand. We now give a bound on the expected number of mistakes.

Theorem 1. Let S_k = \{((x^{(t)}, k), y^{(t)})\}_{t=1}^T be a sequence of T examples given to Algorithm 1, where x^{(t)} \in \mathbb{R}^d, y^{(t)} \in \{-1, +1\} and X = \max_t \|x^{(t)}\|. Let P^{(t)} be a Bernoulli random variable for the event |h_k(x^{(t)})| \le b_1 with probability b_1 / (b_1 + |h_k(x^{(t)})|), and let Q^{(t)} be a Bernoulli random variable for the event |h_m(x^{(t)})| \le b_2 with probability b_2 / (b_2 + \max_{m \in [K], m \ne k} |h_m(x^{(t)})|). Let Z^{(t)} = P^{(t)} Q^{(t)} and M^{(t)} = I(y^{(t)} \ne \hat{y}^{(t)}). If Algorithm 1 is run with b_1 > 0 and b_2 > 0 (b_2 \ge b_1), then \forall t \ge 1 and \gamma > 0 we have

\mathbb{E}\Big[\sum_{t \in [T]} M^{(t)}\Big] \le \frac{b_2}{\gamma} \Big[ \frac{(2b_1 + X^2)^2}{8 b_1 \gamma} \Big( \|w_k^*\|^2 + \max_{m \in [K], m \ne k} \|w_m^*\|^2 \Big) + \Big(1 + \frac{X^2}{2b_1}\Big) \Big( \tilde{L}_{kk} + \max_{m \in [K], m \ne k} \tilde{L}_{km} \Big) \Big]

Then, the expected number of label requests to the oracle by the algorithm is

\sum_t \frac{b_1}{b_1 + |h_k(x^{(t)})|} \cdot \frac{b_2}{b_2 + \max_{m \in [K], m \ne k} |h_m(x^{(t)})|}

Algorithm 1: Active Learning from Peers
Input: b_1 > 0, b_2 > 0 s.t. b_2 \ge b_1, \lambda > 0, number of rounds T
1  Initialize w_m^{(0)} = 0 \ \forall m \in [K], and \tau^{(0)}.
2  for t = 1 ... T do
3     Receive (x^{(t)}, k)
4     Compute \hat{p}_{kk}^{(t)} = \langle x^{(t)}, w_k^{(t-1)} \rangle
5     Predict \hat{y}^{(t)} = \mathrm{sign}(\hat{p}_{kk}^{(t)})
6     Draw a Bernoulli random variable P^{(t)} with probability b_1 / (b_1 + |\hat{p}_{kk}^{(t)}|)
7     if P^{(t)} = 1 then
8        Compute \hat{p}_{km}^{(t)} = \langle x^{(t)}, w_m^{(t-1)} \rangle \ \forall m \ne k, m \in [K]
9        Compute \tilde{p}^{(t)} = \sum_{m \ne k, m \in [K]} \tau_{km}^{(t-1)} \hat{p}_{km}^{(t)} and \tilde{y}^{(t)} = \mathrm{sign}(\tilde{p}^{(t)})
10       Draw a Bernoulli random variable Q^{(t)} with probability b_2 / (b_2 + |\tilde{p}^{(t)}|)
11    end
12    Set Z^{(t)} = P^{(t)} Q^{(t)} and \tilde{Z}^{(t)} = P^{(t)} (1 - Q^{(t)})
13    Query the true label y^{(t)} if Z^{(t)} = 1, and set M^{(t)} = 1 if \hat{y}^{(t)} \ne y^{(t)}
14    Update w_k^{(t)} = w_k^{(t-1)} + (M^{(t)} Z^{(t)} y^{(t)} + \tilde{Z}^{(t)} \tilde{y}^{(t)}) x^{(t)}
15-16 Update \tau:
      \tau_{km}^{(t)} = \frac{\tau_{km}^{(t-1)} e^{-Z^{(t)} \ell_{km}^{(t)} / \lambda}}{\sum_{m' \in [K], m' \ne k} \tau_{km'}^{(t-1)} e^{-Z^{(t)} \ell_{km'}^{(t)} / \lambda}}, \quad m \in [K], m \ne k \qquad (1)
17 end

The proof is given in Appendix A. It follows from Theorem 1 in [11] and Theorem 1 in [10], setting b_1 = X^2/2 and b_2 to b_1 plus a term depending on \|w_k^*\|, X, \gamma and \tilde{L}_{kk}. Theorem 1 states that the quality of the bound depends on both \tilde{L}_{kk} and the maximum of \{\tilde{L}_{km}\}_{m \in [K], m \ne k}. In other words, the worst-case regret will be lower if the kth true hypothesis w_k^* can predict the labels of training examples from both the kth task itself and all the other related tasks with high confidence.
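Under the notation above, one round of Algorithm 1 can be sketched as follows. This is a simplified, unoptimized sketch: the function name, the oracle callback, and the in-place array updates are our own assumptions, not the paper's implementation.

```python
import numpy as np

def peer_learning_round(k, x, W, tau, oracle, b1=1.0, b2=1.0, lam=1.0,
                        rng=np.random.default_rng()):
    # One round of the peer-query perceptron for task k (a sketch).
    # W: (K, d) per-task weight vectors; tau: (K, K) task-relationship weights.
    K = W.shape[0]
    p_kk = W[k] @ x
    y_hat = np.sign(p_kk) if p_kk != 0 else 1.0
    P = rng.random() < b1 / (b1 + abs(p_kk))        # own confidence low?
    if P:
        peers = [m for m in range(K) if m != k]
        p_km = np.array([W[m] @ x for m in peers])
        p_tilde = float(np.dot(tau[k, peers], p_km))  # PEERsum confidence
        y_tilde = np.sign(p_tilde) if p_tilde != 0 else 1.0
        Q = rng.random() < b2 / (b2 + abs(p_tilde))   # peers unsure too?
        if Q:                                         # Z = 1: ask the oracle
            y = oracle(x)
            if y_hat != y:                            # M = 1: perceptron update
                W[k] += y * x
            # exponential-weights update of tau from cross-task hinge losses
            losses = np.maximum(0.0, 1.0 - y * p_km)
            tau[k, peers] *= np.exp(-losses / lam)
            tau[k, peers] /= tau[k, peers].sum()
        else:                                         # Z~ = 1: trust the peers
            W[k] += y_tilde * x
    return W, tau
```

Note that the tau update only fires when Z^{(t)} = 1, mirroring the Z^{(t)} factor in the exponent of equation (1); when the oracle is not queried, the true label (and hence the cross-task loss) is unavailable.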
In addition, we consider a related problem setting in which all K tasks receive an example simultaneously. We give the learning algorithm and mistake bound for this setting in Appendix B.

4 Experiments

We evaluate the performance of our algorithm in the online setting. All reported results in this section are averaged over 10 random runs on permutations of the training data. We set the value of b_1 = 1 for all the experiments and the value of b_2 is tuned from 20 different values. Unless otherwise specified, all model parameters are chosen via 5-fold cross validation.

4.1 Benchmark Datasets

We use three datasets for our experiments. Details are given below:

Landmine Detection (footnote 3) consists of 19 tasks collected from different landmine fields. Each task is a binary classification problem: landmines (+) or clutter (-), and each example consists of 9 features extracted from radar images: four moment-based features, three correlation-based features, one energy-ratio feature and one spatial-variance feature. Landmine data is collected from two different terrains: tasks 1-10 are from highly foliated regions and tasks 11-19 are from desert regions, therefore the tasks naturally form two clusters. Any hypothesis learned from a task should be able to utilize the information available from other tasks belonging to the same cluster.

Footnote 3: http://www.ee.duke.edu/~lcarin/LandmineData.zip

Spam Detection (footnote 4) We use the dataset obtained from the ECML PAKDD 2006 Discovery Challenge for the spam detection task. We used the Task B challenge dataset, which consists of labeled training data from the inboxes of 15 users. We consider each user as a single task and the goal is to build a personalized spam filter for each user.
Each task is a binary classification problem: spam (+) or non-spam (-), and each example consists of approximately 150K features representing term frequencies of word occurrences. Some spam is universal to all users (e.g. financial scams), while some messages might be useful to certain affinity groups but spam to most others. Such adaptive behavior of users' interests and dis-interests can be modeled efficiently by utilizing the data from other users to learn per-user model parameters.

Sentiment Analysis (footnote 5) We evaluated our algorithm on product reviews from Amazon, using a dataset containing reviews from 24 domains. We consider each domain as a binary classification task. Reviews with rating > 3 were labeled positive (+), those with rating < 3 were labeled negative (-), and reviews with rating = 3 were discarded, as their sentiments were ambiguous and hard to predict. As in the previous dataset, each example consists of approximately 350K features representing term frequencies of word occurrences.

We choose 3040 examples (160 training examples per task) for landmine, 1500 emails for spam (100 emails per user inbox) and 2400 reviews for sentiment (100 reviews per domain) for our experiments. We use the rest of the examples as the test set. On average, each task in landmine, spam and sentiment has 509, 400 and 2000 examples respectively. Note that we intentionally kept the size of the training data small to drive the need for learning from other tasks, which diminishes as the training sets per task become large.

4.2 Results

To evaluate the performance of our proposed approach, we compare our proposed methods to 2 standard baselines. The first baseline selects the examples to query randomly (Random) and the second baseline chooses the examples via selective sampling independently for each task (Independent) [11].
We compare these baselines against two versions of our proposed Algorithm 1 with different confidence measures for predictions from peer tasks: PEERsum, where the confidence \tilde{p}^{(t)} at round t is computed as the weighted sum of the confidences of the tasks, as shown in Algorithm 1, and PEERone, where the confidence \tilde{p}^{(t)} is set to the confidence \hat{p}_{km}^{(t)} of a single most related task m, sampled from the probability distribution \tau_{km}^{(t)}, m \in [K], m \ne k. The intuition is that, for multitask learning with many tasks [17], PEERone provides faster computation without significantly compromising the performance of the learning algorithm. The task weights \tau are computed based on the relationships between the tasks. As mentioned earlier, the \tau values can easily be replaced by other functions depending on the application at hand (footnote 6).

In addition to PEERsum and PEERone, we evaluated a method that queries the peer with the highest confidence, instead of the most related task as in PEERone, to provide the label. Since this method uses only local information, the task with the highest confidence is not necessarily the best peer in hindsight, and the results are worse than or comparable (in some cases) to the Independent baseline. Hence, we do not report its results in our experiments.

Figure 2 shows the performance of the models during training. We measure the average rate of mistakes (a cumulative measure), the number of label requests to the oracle, and the number of peer query requests to evaluate performance during training. From Figure 2 (top and middle), we can see that our proposed methods (PEERsum and PEERone) outperform both baselines. Among the proposed methods, PEERsum outperforms PEERone, as it uses the confidences of all the tasks (weighted by task relationship) to compute the final confidence.
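PEERone's step of drawing a single most-related peer from the distribution \tau can be sketched as follows (a minimal illustration; the function and variable names are ours, and any categorical sampler would do):

```python
import numpy as np

def sample_peer(tau_k, k, rng=np.random.default_rng()):
    # Sample one peer index m != k from the task-relationship weights tau_k
    # (PEERone), rather than forming the tau-weighted sum over all peers
    # (PEERsum).
    peers = np.array([m for m in range(len(tau_k)) if m != k])
    probs = tau_k[peers] / tau_k[peers].sum()
    return int(rng.choice(peers, p=probs))

tau_k = np.array([0.0, 0.7, 0.2, 0.1])  # task 0's weights over 4 tasks
m = sample_peer(tau_k, k=0)             # one of 1, 2, 3; most often 1
```

Drawing one peer replaces a K-term weighted sum with a single inner product per round, which is the source of PEERone's speed advantage when K is large.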
We notice that during the earlier part of learning, all the methods issue more queries to the oracle. After an initial set of label requests, peer requests (dotted lines) steadily take over in our proposed methods. We can see three noticeable phases in our learning algorithm: initial label requests to the oracle, then label requests to peers, and, as task confidence grows, learning with less dependency on other tasks.

Footnote 4: http://ecmlpkdd2006.org/challenge.html
Footnote 5: http://www.cs.jhu.edu/~mdredze/datasets/sentiment
Footnote 6: Our algorithm and theorem can be easily generalized to other types of functions on \tau.

Figure 2: Average rate of mistakes vs. number of examples for the compared models on the three datasets (top). Average number of label and peer requests on the three datasets (middle). Average rate of (training) mistakes vs. number of examples with a query budget of (10%, 20%, 30%) of the total number of examples T on sentiment (bottom). These plots are generated during training.

In order to efficiently evaluate the proposed methods, we restrict the total number of label requests issued to the oracle during training; that is, we give all the methods the same query budget: (10%, 20%, 30%) of the total number of examples T on the sentiment dataset. After the number of label requests to the oracle reaches the budget limit, the baseline methods predict labels for new examples based on the earlier assistance from the oracle. Our proposed methods, on the other hand, continue to reduce the average mistake rate by requesting labels from peers.
This shows the power of learning from peers when human expert assistance is expensive, scarce or unavailable.

Table 1 summarizes the performance of all the above algorithms on the test set for the three datasets. In addition to the average accuracy (ACC) scores, we report the average total number of queries or label requests to the oracle (#Queries) and the CPU time in seconds taken for learning from T examples (Time). From the table, it is evident that PEER* outperforms all the baselines in terms of both ACC and #Queries. On landmine and sentiment, we get a significant improvement in test set accuracy while reducing the total number of label requests to the oracle. As in the training set results, PEERsum performs slightly better than PEERone. Our methods perform only slightly better than Independent on spam; as Figure 2 (middle) shows for the spam dataset, the number of peer queries is lower compared to the other datasets.

The results justify our claim that relying on assistance from peers in addition to human intervention leads to improved performance. Moreover, our algorithm consumes CPU time less than or comparable to the baselines that take into account inter-task relationships and peer requests. Note that PEERone takes a little more training time than PEERsum. This is due to our implementation, which spends more time in (MATLAB's) built-in sampler to draw the most related task. One may improve the sampling procedure to get a better run time. However, the time spent on selecting the most related task is small compared to the other operations when dealing with many tasks.

Figure 3 (left) shows the average test set accuracy computed for 20 different values of b_2 for the PEER* methods on sentiment. We set b_1 = 1. Each point in the plot corresponds to ACC (y-axis) and #Queries (x-axis) computed for a specific value of b_2.
Table 1: Average test accuracy on three datasets: means and standard errors over 10 random shuffles.

                 Landmine Detection                    Spam Detection                       Sentiment Analysis
Models           ACC             #Queries     Time(s)  ACC             #Queries     Time(s) ACC             #Queries       Time(s)
Random           0.8905 (0.007)  1519.4 (31.9)  0.38   0.8117 (0.021)  753.4 (29.1)   8     0.7443 (0.028)  1221.8 (22.78)  35.6
Independent      0.9040 (0.016)  1802.8 (35.5)  0.29   0.8309 (0.022)  1186.6 (18.3)  7.9   0.7522 (0.015)  2137.6 (19.1)   35.6
PEERsum          0.9403 (0.001)  265.6 (18.7)   0.38   0.8497 (0.007)  1108.8 (32.1)  8     0.8141 (0.001)  1494.4 (68.59)  36
PEERone          0.9377 (0.003)  303 (17)       1.01   0.8344 (0.018)  1084.2 (24.2)  8.3   0.8120 (0.01)   1554.6 (92.2)   36.3

Figure 3: Average test set ACC calculated for different values of b_2 (left). A visualization of the peer query requests among the tasks in sentiment learned by PEERone (middle). Comparison of the proposed methods against SHAMPO in the parallel setting, reporting average test set accuracy (right).

We find the algorithm performs well for b_2 > b_1 and small values of b_2. As we increase the value of b_2 toward infinity, our algorithm reduces to the baseline (Independent), as all requests are directed to the oracle instead of the peers.

Figure 3 (middle) shows a snapshot of the total number of peer requests between the tasks in sentiment at the end of training PEERone. Each edge indicates that there was a peer query request from a task/domain to another related task/domain (based on the task relationship matrix \tau). Edges with similar colors show the total number of peer requests from a task. It is evident from the figure that all the tasks are collaborative in terms of learning from each other.

Figure 3 (right) compares the PEER* implementation of Algorithm 2 in Appendix B against SHAMPO in terms of test set accuracy on the sentiment dataset (see the supplementary material for more details on the algorithm). The algorithm learns multiple tasks in parallel, where at most \kappa out of K label requests to the oracle are allowed at each round. While SHAMPO ignores the other tasks, our PEER* methods allow peer queries to related tasks and thereby improve overall performance. As the figure shows, when \kappa is set to small values, PEER* performs significantly better than SHAMPO.

5 Conclusion

We proposed a novel online multitask learning algorithm that learns to perform each task jointly with learning inter-task relationships. The primary intuition we leveraged in this paper is that task performance can be improved both by querying external oracles and by querying peer tasks.
Querying the oracle incurs a cost, or at least consumes a bounded query budget, whereas querying peers requires no human attention. Hence, our hypothesis was that with bounded queries to the human expert, additionally querying peers should improve task performance. Querying peers requires estimating the relations among tasks. The key idea is to smooth the loss function of each task with respect to a probabilistic distribution over all tasks, and to adaptively refine this distribution over time. In addition to closed-form update rules, we provided a theoretical bound on the expected number of mistakes. The effectiveness of our algorithm is empirically verified over three benchmark datasets, where in all cases task accuracy improves both for PEERsum (sum of peer recommendations weighted by task similarity) and PEERone (peer recommendation from the most highly related task) over baselines such as assuming task independence.

[Figure 3 panels: average test set accuracy versus number of label requests for PEERsum and PEERone (left); peer query requests among the 24 sentiment domains (middle); average test accuracy versus κ for SHAMPO, PEERsum, and PEERone (right).]

References

[1] Koby Crammer and Yishay Mansour. Learning multiple tasks using shared hypotheses. In Advances in Neural Information Processing Systems, pages 1475–1483, 2012.

[2] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

[3] Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8:35–63, 2007.

[4] Yu Zhang and Dit-Yan Yeung.
A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):12, 2014.

[5] Jacob Abernethy, Peter Bartlett, and Alexander Rakhlin. Multitask learning with expert advice. In Learning Theory, pages 484–498. Springer, 2007.

[6] Ofer Dekel, Philip M. Long, and Yoram Singer. Online learning of multiple tasks with a shared loss. Journal of Machine Learning Research, 8(10):2233–2264, 2007.

[7] Gábor Lugosi, Omiros Papaspiliopoulos, and Gilles Stoltz. Online multi-task learning with hard constraints. arXiv preprint arXiv:0902.3526, 2009.

[8] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11:2901–2934, 2010.

[9] Avishek Saha, Piyush Rai, Suresh Venkatasubramanian, and Hal Daumé III. Online learning of multiple tasks and their relationships. In International Conference on Artificial Intelligence and Statistics, pages 643–651, 2011.

[10] Keerthiram Murugesan, Hanxiao Liu, Jaime Carbonell, and Yiming Yang. Adaptive smoothed online multi-task learning. In Advances in Neural Information Processing Systems, pages 4296–4304, 2016.

[11] Nicolò Cesa-Bianchi, Claudio Gentile, and Luca Zaniboni. Worst-case analysis of selective sampling for linear classification. Journal of Machine Learning Research, 7(Jul):1205–1230, 2006.

[12] Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13(Sep):2655–2697, 2012.

[13] Francesco Orabona and Nicolò Cesa-Bianchi. Better algorithms for selective sampling. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 433–440, 2011.

[14] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile.
Linear classification and selective sampling under low noise conditions. In Advances in Neural Information Processing Systems, pages 249–256, 2009.

[15] Alekh Agarwal. Selective sampling algorithms for cost-sensitive multiclass prediction. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 1220–1228, 2013.

[16] Alekh Agarwal, Alexander Rakhlin, and Peter Bartlett. Matrix regularization techniques for online multitask learning. Technical Report UCB/EECS-2008-138, EECS Department, University of California, Berkeley, 2008.

[17] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113–1120. ACM, 2009.

[18] Pinar Donmez and Jaime G. Carbonell. Proactive learning: cost-sensitive active learning with multiple imperfect oracles. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 619–628. ACM, 2008.

[19] Pinar Donmez, Jaime G. Carbonell, and Jeff Schneider. Efficiently learning the accuracy of labeling sources for selective sampling. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268. ACM, 2009.

[20] Ruth Urner, Shai Ben-David, and Ohad Shamir. Learning from weak teachers. In AISTATS, 2012.

[21] Chicheng Zhang and Kamalika Chaudhuri. Active learning from weak and strong labelers. In Advances in Neural Information Processing Systems, pages 703–711, 2015.

[22] Yan Yan, Glenn M. Fung, Rómer Rosales, and Jennifer G. Dy. Active learning from crowds. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1161–1168, 2011.

[23] Peter L. Bartlett and Marten H. Wegkamp.
Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(Aug):1823–1840, 2008.

[24] Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Learning with rejection. In International Conference on Algorithmic Learning Theory, pages 67–82. Springer, 2016.

[25] Haim Cohen and Koby Crammer. Learning multiple tasks in parallel with a shared annotator. In Advances in Neural Information Processing Systems, pages 1170–1178, 2014.