{"title": "Improved Algorithms for Collaborative PAC Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 7631, "page_last": 7639, "abstract": "We study a recent model of collaborative PAC learning where $k$ players with $k$ different tasks collaborate to learn a single classifier that works for all tasks. Previous work showed that when there is a classifier that has very small error on all tasks, there is a collaborative algorithm that finds a single classifier for all tasks and has $O((\\ln (k))^2)$ times the worst-case sample complexity for learning a single task. In this work, we design new algorithms for both the realizable and the non-realizable setting, having sample complexity only $O(\\ln (k))$ times the worst-case sample complexity for learning a single task. The sample complexity upper bounds of our algorithms match previous lower bounds and in some range of parameters are even better than previous algorithms that are allowed to output different classifiers for different tasks.", "full_text": "Improved Algorithms for Collaborative PAC\n\nLearning\n\nHuy L\u00ea Nguy\u02dc\u00ean\n\nNortheastern University\n\nBoston, MA 02115\n\nLydia Zakynthinou\n\nNortheastern University\n\nBoston, MA 02115\n\nCollege of Computer and Information Science\n\nCollege of Computer and Information Science\n\nhu.nguyen@northeastern.edu\n\nzakynthinou.l@northeastern.edu\n\nAbstract\n\nWe study a recent model of collaborative PAC learning where k players with\nk different tasks collaborate to learn a single classi\ufb01er that works for all tasks.\nPrevious work showed that when there is a classi\ufb01er that has very small error\non all tasks, there is a collaborative algorithm that \ufb01nds a single classi\ufb01er for all\ntasks and has O((ln(k))2) times the worst-case sample complexity for learning\na single task. 
In this work, we design new algorithms for both the realizable and the non-realizable setting, having sample complexity only O(ln(k)) times the worst-case sample complexity for learning a single task. The sample complexity upper bounds of our algorithms match previous lower bounds and in some range of parameters are even better than previous algorithms that are allowed to output different classifiers for different tasks.

1 Introduction

There has been a lot of work in machine learning concerning learning multiple tasks simultaneously, ranging from multi-task learning [3, 4], to domain adaptation [10, 11], to distributed learning [2, 7, 14]. Another area in similar spirit to this work is meta-learning, where one leverages samples from many different tasks to train a single algorithm that adapts well to all tasks (see e.g. [8]).

In this work, we focus on a model of collaborative PAC learning, proposed by [5]. In the classic PAC learning setting introduced by [13], where PAC stands for probably approximately correct, the goal is to learn a task by drawing from a distribution of samples. The optimal classifier that achieves the lowest error on the task with respect to the given distribution is assumed to come from a concept class F of VC dimension d. The VC theorem [1] states that for any instance, m_{ε,δ} = O((1/ε)(d ln(1/ε) + ln(1/δ))) labeled samples suffice to learn a classifier that achieves low error with probability at least 1 − δ, where the error depends on ε.

In the collaborative model, there are k players attempting to learn their own tasks, each task involving a different distribution of samples. The goal is to learn a single classifier that also performs well on all the tasks. 
One example from [5], which motivates this problem, is having k hospitals with different patient demographics which want to predict the overall occurrence of a disease. In this case, it would be more fitting as well as cost efficient to develop and distribute a single classifier to all the hospitals. In addition, the requirement for a single classifier is imperative in settings where there are fairness concerns. For example, consider the case that the goal is to find a classifier that predicts loan defaults for a bank by gathering information from bank stores located in neighborhoods with diverse socioeconomic characteristics. In this setting, the samples provided by each bank store come from different distributions, while it is desired to guarantee low error rates for all the neighborhoods. Again, in this setting, the bank should employ a single classifier among all the neighborhoods.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.

If each player were to learn a classifier for their task without collaboration, they would each have to draw a sufficient number of samples from their distribution to train their classifier. Therefore, solving k tasks independently would require k · m_{ε,δ} samples in the worst case. Thus, we are interested in algorithms that utilize samples from all players and solve all k tasks with sample complexity o((k/ε)(d ln(1/ε) + ln(1/δ))).

Blum et al. 
[5] give an algorithm with sample complexity O((ln^2(k)/ε)((d + k) ln(1/ε) + k ln(k/δ))) for the realizable setting, that is, assuming the existence of a single classifier with zero error on all the tasks. They also extend this result by proving that a slightly modified algorithm returns a classifier with error ε, under the relaxed assumption that there exists a classifier with error ε/100 on all the tasks. In addition, they prove a lower bound showing that there is a concept class with d = Θ(k) for which Ω((k/ε) log(k/δ)) samples are necessary.

In this work, we give two new algorithms based on multiplicative weight updates, which have sample complexities O((ln(k)/ε)(d ln(1/ε) + k ln(k/δ))) and O((1/ε) ln(k/δ)(d ln(1/ε) + k + ln(1/δ))) for the realizable setting. Our first algorithm matches the sample complexity of [5] for the variant of the problem in which the algorithm is allowed to return different classifiers to the players, and our second algorithm has sample complexity almost matching the lower bound of [5] when d = Θ(k) and for typical values of δ. Both are presented in Section 3. Independently of our work, [6] use the multiplicative weight update approach and achieve the same bounds as we do in that section.

Moreover, in Section 4, we extend our results to the non-realizable setting, presenting two algorithms that generalize the algorithms for the realizable setting. These algorithms learn a classifier with error at most (2 + α)OPT + ε on all the tasks, where α is set to a constant value, and have sample complexities O((ln(k)/(α^4 ε))(d ln(1/ε) + k ln(k/δ))) and O((1/(α^4 ε)) ln(k/δ)(d ln(1/ε) + k ln(1/α) + ln(1/δ))). With constant α, these sample complexities are the same as in the realizable case. 
Finally, we give two algorithms with randomized classifiers whose error probability, over the random choice of the example and the classifier's randomness, is at most (1 + α)OPT + ε for all tasks. The sample complexities of these algorithms are O((ln(k)/(α^3 ε^2))(d ln(1/ε) + k ln(k/δ))) and O((1/(α^3 ε^2)) ln(k/δ)((d + k) ln(1/ε) + ln(1/δ))).

2 Model

In the traditional PAC learning model, there is a space of instances X and a set Y = {0, 1} of possible labels for the elements of X. A classifier f : X → Y, which matches each element of X to a label, is called a hypothesis. The error of a hypothesis with respect to a distribution D on X × Y is defined as err_D(f) = Pr_{(x,y)∼D}[f(x) ≠ y]. Let OPT = inf_{f∈F} err_D(f), where F is a class of hypotheses. In the realizable setting we assume that there exists a target classifier with zero error, that is, there exists f* ∈ F with err_{D_i}(f*) = OPT = 0 for all i ∈ [k]. Given parameters (ε, δ), the goal is to learn a classifier that has error at most ε, with probability at least 1 − δ. In the non-realizable setting, the optimal classifier f* is defined to have err_D(f*) ≤ OPT + ξ for any ξ > 0. 
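These definitions map directly to code. As a toy illustration (the finite threshold class and the instance below are ours, purely for concreteness, not from the paper):

```python
def err_on_sample(f, sample):
    """Empirical error err_S(f): the fraction of pairs (x, y) in S with f(x) != y."""
    return sum(f(x) != y for x, y in sample) / len(sample)

def opt_on_sample(hypotheses, sample):
    """The analogue of OPT over a finite class: the best achievable error on S."""
    return min(err_on_sample(f, sample) for f in hypotheses)

# Toy instance: X = {0, ..., 9} with true label 1 iff x >= 5, plus one noisy
# point, and F = all integer thresholds (a class of VC dimension 1).
sample = [(x, int(x >= 5)) for x in range(10)] + [(7, 0)]
thresholds = [lambda x, t=t: int(x >= t) for t in range(11)]
```

On this toy set the best threshold (t = 5) errs only on the injected noisy point, so the best achievable error on the sample is 1/11 > 0, i.e., the instance is non-realizable.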
Given parameters (ε, δ) and a new parameter α, which can be considered to be a constant, the goal is to learn a classifier that has error at most (1 + α)OPT + ε, with probability at least 1 − δ.

By the VC theorem and its known extension, the desired guarantee can be achieved in both settings by drawing a set of samples of size m_{ε,δ} = O((1/ε)(d ln(1/ε) + ln(1/δ))) and returning the classifier with minimum error on that sample. More precisely, in the non-realizable setting, m_{ε,δ} = C(1/(εα))(d ln(1/ε) + ln(1/δ)), where C is also a constant. We consider an algorithm O_F(S), where S is a set of samples drawn from an arbitrary distribution D over the domain X × {0, 1}, that returns a hypothesis f_0 whose error on the sample set satisfies err_S(f_0) ≤ inf_{f∈F} err_S(f) + ξ for any ξ > 0, if such a hypothesis exists. The VC theorem guarantees that if |S| = m_{ε,δ}, then err_D(f_0) ≤ (1 + α)err_S(f_0) + ε.

In the collaborative model, there are k players with distributions D_1, . . . , D_k. Similarly, OPT = inf_{f∈F} max_{i∈[k]} err_{D_i}(f) and the goal is to learn a single good classifier for all distributions. In [5], the authors consider two variants of the model for the realizable setting, the personalized and the centralized. In the former the algorithm can return a different classifier to each player, while in the latter it must return a single good classifier. For the personalized variant, Blum et al. give an algorithm with almost the same sample complexity as the lower bound they provide. We focus on the more restrictive centralized variant of the model, for which the algorithm that Blum et al. 
give does not match the lower bound. We note that the algorithms we present are improper, meaning that the classifier they return is not necessarily in the concept class F.

3 Sample complexity upper bounds for the realizable setting

In this section, we present two algorithms and prove their sample complexity.

Both algorithms employ multiplicative weight updates, meaning that in each round they find a classifier with low error on the weighted mixture of the distributions and double the weights of the players for whom the classifier did not perform well. In this way, the next sample set drawn will include more samples from these players' distributions, so that the next classifier will perform better on them. To identify the players for whom the classifier of the round did not perform well, the algorithms test the classifier on a small number of samples drawn from each player's distribution. If the error of the classifier on the sample is low, then the error on the player's distribution cannot be too high, and vice versa. In the end, both algorithms return the majority function over all the classifiers of the rounds; that is, for each point x ∈ X, the label assigned to x is the label that the majority of the classifiers assign to x.

We note that for typical values of δ, Algorithm R2 is better than Algorithm R1. However, Algorithm R1 is always better than the algorithm of [5] for the centralized variant of the problem and matches their number of samples in the personalized variant, so we present both algorithms in this section.

In the algorithms of [5], the players are divided into classes based on the number of rounds for which that player's task is not solved with low error. The number of classes could be as large as the number of rounds, which is Θ(log(k)), and their algorithm uses roughly m_{ε,δ} samples from each class. 
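As a toy illustration of this shared scheme (the learner, sample sizes, round count, and test threshold below are simplified placeholders chosen by us, not the parameters analyzed in this section):

```python
import random

def majority(classifiers):
    """The majority function over the classifiers of all rounds."""
    def f(x):
        return int(2 * sum(c(x) for c in classifiers) >= len(classifiers))
    return f

def collaborative_mw(draw, k, rounds, n_train, n_test, learn, test_eps):
    """Toy version of the shared scheme: train on the weighted mixture of the
    k players' distributions, test the round's classifier on a few fresh
    samples per player, double the weights of the players it failed, and
    return the majority vote over all rounds.
    draw(i, n) yields n labeled pairs from player i; learn is the ERM oracle."""
    w = [1.0] * k
    classifiers = []
    for _ in range(rounds):
        sample = []
        for _ in range(n_train):
            i = random.choices(range(k), weights=w)[0]  # sample the mixture
            sample.extend(draw(i, 1))
        f = learn(sample)
        classifiers.append(f)
        for i in range(k):
            test = draw(i, n_test)
            if sum(f(x) != y for x, y in test) / n_test > test_eps:
                w[i] *= 2  # focus subsequent rounds on this player
    return majority(classifiers)

# Toy realizable instance: player i draws x uniformly from its own interval,
# and a single target classifier (x >= 0.5) has zero error for everyone.
random.seed(0)
intervals = [(0.0, 0.6), (0.4, 1.0), (0.4, 0.6)]

def draw(i, n):
    lo, hi = intervals[i]
    return [(x, int(x >= 0.5)) for x in (random.uniform(lo, hi) for _ in range(n))]

def learn(sample):  # ERM over a grid of thresholds
    best = min((sum(int(x >= t) != y for x, y in sample), t)
               for t in (j / 200 for j in range(201)))[1]
    return lambda x, t=best: int(x >= t)

f_hat = collaborative_mw(draw, k=3, rounds=15, n_train=60, n_test=40,
                         learn=learn, test_eps=0.05)
```

With the seed fixed, the returned majority classifier recovers a threshold near 0.5 and achieves low error for all three players, including the one whose distribution is concentrated around the decision boundary.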
On the other hand, Algorithm R1 uses only m_{ε,δ} samples across all classes and saves a factor of Θ(log(k)) in the sample complexity. This requires analyzing the change in all classes together as opposed to class by class.

Algorithm R1

Initialize: ∀i ∈ [k] w_i^(0) := 1; t := 5⌈log(k)⌉; ε' := ε/6; δ' := δ/(3t);
for r = 1 to t do
    D̃^(r−1) ← (1/Φ^(r−1)) Σ_{i=1}^k w_i^(r−1) D_i, where Φ^(r−1) = Σ_{i=1}^k w_i^(r−1);
    Draw a sample set S^(r) of size m_{ε'/16,δ'} from D̃^(r−1);
    f^(r) ← O_F(S^(r));
    G_r ← TEST(f^(r), k, ε', δ');
    Update: w_i^(r) = 2w_i^(r−1) if i ∉ G_r, and w_i^(r) = w_i^(r−1) otherwise;
end for
return f_R1 = maj({f^(r)}_{r=1}^t)

Procedure TEST(f^(r), k, ε', δ')
for i = 1 to k do
    Draw a sample set T_i of size O((1/ε') ln(k/δ')) from D_i;
end for
return {i | err_{T_i}(f^(r)) ≤ (3/4)ε'}

Algorithm R1 runs for t = Θ(log(k)) rounds and learns a classifier f^(r) in each round r that has low error on the weighted mixture of the distributions D̃^(r−1). For each player, at least 0.6t of the learned classifiers are "good", meaning that they have error at most ε' = ε/6 on the player's distribution. Since the algorithm returns the majority of the classifiers, in order for an instance to be mislabeled, at least 0.5t of the total number of classifiers should mislabel it. 
This implies that at least 0.1t of the "good" classifiers of that player should mislabel it, which amounts to 1/6 of the "good" classifiers. Therefore, the error of the majority of the functions for that player is at most 6ε' = ε.

To identify the players for whom the classifier of the round does not perform well, Algorithm R1 uses a procedure called TEST. This procedure draws O((1/ε') ln(k/δ')) samples from each player's distribution and tests the classifier on these samples. If the error for a player's sample set is at most 3ε'/4, then TEST concludes that the classifier is good for that player and adds them to the returned set G_r. The samples that TEST requires from each player suffice to make it capable of distinguishing between the players with error more than ε' and players with error at most ε'/2 with respect to their distributions, with high probability.

Theorem 1. For any ε, δ ∈ (0, 1), and hypothesis class F of VC dimension d, Algorithm R1 returns a classifier f_R1 with err_{D_i}(f_R1) ≤ ε ∀i ∈ [k] with probability at least 1 − δ using m samples, where

m = O((ln(k)/ε)(d ln(1/ε) + k ln(k/δ))).

The proof of Theorem 1 is very similar to the one for Algorithm R2, so we omit it and refer the reader to Appendix A. Algorithm R1 is the natural boosting alternative to the algorithm of [5] for the centralized variant of the model. Although it is discussed in [5] and mentioned to have the same sample complexity as their algorithm, it turns out that it is more efficient. 
Its sample complexity is slightly better (or the same, depending on the parameter regime) compared to the one of the algorithm for the personalized setting presented in [5], which is O((log(k)/ε)((d + k) ln(1/ε) + k ln(k/δ))).

However, in the setting of the lower bound in [5], where k = Θ(d), there is a gap of log(k) multiplicatively between the sample complexity of Algorithm R1 and the lower bound. This difference stems from the fact that in every round, the algorithm uses roughly Θ(k) samples to find a classifier but roughly Θ(k log(k)) samples to test the classifier for k tasks. Motivated by this discrepancy, we develop Algorithm R2, which is similar to Algorithm R1 but uses fewer samples to test the performance of each classifier on the players' distributions. To achieve high success probability, Algorithm R2 uses a higher number of rounds.

Algorithm R2

Initialize: ∀i ∈ [k] w_i^(0) := 1; t := 150⌈log(k/δ)⌉; ε' := ε/6; δ' := δ/(4t);
for r = 1 to t do
    D̃^(r−1) ← (1/Φ^(r−1)) Σ_{i=1}^k w_i^(r−1) D_i, where Φ^(r−1) = Σ_{i=1}^k w_i^(r−1);
    Draw a sample set S^(r) of size m_{ε'/16,δ'} from D̃^(r−1);
    f^(r) ← O_F(S^(r));
    G_r ← FASTTEST(f^(r), k, ε', δ');
    Update: w_i^(r) = 2w_i^(r−1) if i ∉ G_r, and w_i^(r) = w_i^(r−1) otherwise;
end for
return f_R2 = maj({f^(r)}_{r=1}^t);

Procedure FASTTEST(f^(r), k, ε', δ')
for i = 1 to k do
    Draw a sample set T_i of size O(1/ε') from 
D_i;
end for
return {i | err_{T_i}(f^(r)) ≤ (3/4)ε'}

More specifically, Algorithm R2 runs for t = 150⌈log(k/δ)⌉ rounds. In addition, the test it uses to identify the players for whom the classifier of the round does not perform well requires only O(1/ε') samples from each player. This helps us save one logarithmic factor in the second term of the sample complexity of Algorithm R1. We call this new test FASTTEST. The fact that FASTTEST uses fewer samples causes it to be less successful at distinguishing the players for whom the classifier was "good" from the players for whom it was not, meaning that it has constant probability of making a mistake for a given player at a given round. There are two types of mistakes that FASTTEST can make: to return i ∉ G_r and double the weight of i when the classifier is good for i's distribution, and to return i ∈ G_r and not double the weight of i when the classifier is not good.

Theorem 2. For any ε, δ ∈ (0, 1), and hypothesis class F of VC dimension d, Algorithm R2 returns a classifier f_R2 with err_{D_i}(f_R2) ≤ ε ∀i ∈ [k] with probability at least 1 − δ using m samples, where

m = O((1/ε) ln(k/δ)(d ln(1/ε) + k + ln(1/δ))).

To prove the correctness and sample complexity of Algorithm R2, we need Lemma 2.1, which describes the set G_r that FASTTEST returns. Its proof uses the multiplicative forms of the Chernoff bounds and is in Appendix A.

Lemma 2.1. 
FASTTEST(f^(r), k, ε', δ') is such that the following two properties hold, each with probability at least 0.99, for given round r ∈ [t] and player i ∈ [k].

(a) If err_{D_i}(f^(r)) > ε', then i ∉ G_r.
(b) If err_{D_i}(f^(r)) ≤ ε'/2, then i ∈ G_r.

Proof of Theorem 2. First, we prove that Algorithm R2 indeed learns a good classifier, meaning that, with probability at least 1 − δ, for every player i ∈ [k] the returned classifier f_R2 has error err_{D_i}(f_R2) ≤ ε. Let e_i^(t) be the number of rounds for which the classifier's error on D_i was more than ε', i.e. e_i^(t) = |{r | r ∈ [t] and err_{D_i}(f^(r)) > ε'}|.

Claim 2.1. With probability at least 1 − δ, e_i^(t) < 0.4t ∀i ∈ [k].

If the claim holds, then with probability at least 1 − δ, less than 0.4t functions have error more than ε' on D_i, ∀i ∈ [k]. Therefore, with probability at least 1 − δ, err_{D_i}(f_R2) ≤ (0.6/0.1) ε' ≤ ε for every i ∈ [k].

Proof of Claim 2.1. Let us denote by I^(r) the set of players having err_{D_i}(f^(r)) > ε'/2 in round r, i.e., I^(r) = {i ∈ [k] | err_{D_i}(f^(r)) > ε'/2}. We condition on the randomness in the first r − 1 rounds and compute E[Φ^(r) | Φ^(r−1)]. By linearity of expectation, the following holds for round r:

err_{D̃^(r−1)}(f^(r)) = (1/Φ^(r−1)) Σ_{i=1}^k w_i^(r−1) err_{D_i}(f^(r)) ≥ (1/Φ^(r−1)) Σ_{i∈I^(r)\G_r} w_i^(r−1) err_{D_i}(f^(r))    (1)

By the definition of I^(r), err_{D_i}(f^(r)) > ε'/2 for i ∈ I^(r). 
From the VC theorem, with probability at least 1 − δ', err_{D̃^(r−1)}(f^(r)) ≤ ε'/16. Using these two bounds and inequality (1), it follows that with probability at least 1 − δ',

Σ_{i∈I^(r)\G_r} w_i^(r−1) ≤ (1/8) Φ^(r−1).    (2)

For the rest of the analysis, we will condition our probability space on the event that inequality (2) holds for all t rounds. By the union bound, this event happens with probability at least 1 − tδ' = 1 − δ/4. Consider the set of players i ∉ I^(r) ∪ G_r. These are the players for whom the classifier of the round performed well but FASTTEST made a mistake and did not include them in the set G_r. By linearity of expectation:

E[Σ_{i∉G_r} w_i^(r−1) | Φ^(r−1)] = E[Σ_{i∈I^(r)\G_r} w_i^(r−1) + Σ_{i∉I^(r)∪G_r} w_i^(r−1) | Φ^(r−1)] ≤ (0.125 + 0.01)Φ^(r−1),    (3)

where the first sum is bounded using inequality (2) and the second using Lemma 2.1(b). Thus, the expected value of the potential function in round r conditioned on its value in the previous round is bounded, using inequality (3), by

E[Φ^(r) | Φ^(r−1)] = E[Σ_{i=1}^k w_i^(r−1) + Σ_{i∉G_r} w_i^(r−1) | Φ^(r−1)] ≤ 1.135 Φ^(r−1).

By the definition of the expected value, this implies that E[Φ^(r)] ≤ 1.135 E[Φ^(r−1)]. Conditioned on the fact that inequality (2) holds for all rounds, which is true with probability at least 1 − δ/4, we can conclude that E[Φ^(t)] ≤ k(1.135)^t, by induction. 
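Spelled out, the induction simply unrolls the conditional bound round by round, using that all weights are initialized to 1 so that Φ^(0) = k:

```latex
\mathbb{E}[\Phi^{(t)}] \le 1.135\,\mathbb{E}[\Phi^{(t-1)}] \le (1.135)^2\,\mathbb{E}[\Phi^{(t-2)}] \le \cdots \le (1.135)^t\,\Phi^{(0)} = k\,(1.135)^t .
```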
Using Markov's inequality, we can state that Pr[Φ^(t) ≥ E[Φ^(t)]/(δ/2)] ≤ δ/2. It follows that with probability at least 1 − δ/4 − δ/2 = 1 − 3δ/4,

Φ^(t) ≤ (2k/δ)(1.135)^t.    (4)

We now need a lower bound for w_i^(t). Let m_i^(r) denote the number of rounds r', up until and including round r, for which the procedure FASTTEST made a mistake and returned i ∈ G_{r'} although err_{D_i}(f^(r')) > ε'. From Lemma 2.1(a), it follows that E[m_i^(r) − m_i^(r−1) | m_i^(r−1)] ≤ 0.01, so for M_i^(r) = m_i^(r) − 0.01r it holds that E[M_i^(r) | M_i^(r−1)] ≤ M_i^(r−1). Therefore, the sequence {M_i^(r)}_{r=0}^t is a super-martingale. In addition to this, since we can make at most one mistake in each round, it holds that M_i^(r) − M_i^(r−1) < 1. Using the Azuma-Hoeffding inequality with M_i^(0) = m_i^(0) − 0.01 · 0 = 0 and the fact that t ≥ 150, we calculate that

Pr[m_i^(t) ≥ 0.18t] ≤ exp(−(0.17t)^2/(2t)) ≤ δ/(4k).

By the union bound, m_i^(t) < 0.18t holds ∀i ∈ [k] with probability at least 1 − δ/4.

The number of times a weight is doubled throughout the algorithm is log(w_i^(t)), and it is at least the number of times the error of the classifier was more than ε' minus the number of times the error was more than ε' but FASTTEST made a mistake, which is exactly e_i^(t) − m_i^(t). So w_i^(t) ≥ 2^{e_i^(t) − m_i^(t)} ≥ 2^{e_i^(t) − 0.18t} holds for all i ∈ [k] with probability at least 1 − δ/4. 
Combining this with the bound from inequality (4), we have that with probability at least 1 − δ:

w_i^(t) ≤ Φ^(t) ⇒ 2^{e_i^(t) − 0.18t} < (2k/δ)(1.135)^t ⇒ e_i^(t) − 0.18t < 1 + log(k/δ) + t log(1.135) ⇒ e_i^(t) < 0.18t + 1 + (1/150)t + 0.183t < 0.4t,

where we used that log(k/δ) ≤ t/150 and log(1.135) ≤ 0.183. This proves Claim 2.1.

It remains to bound the number of samples. FASTTEST is called t = 150⌈log(k/δ)⌉ times, so it requires O(log(k/δ) · k · (1/ε')) = O((k/ε) log(k/δ)) samples in total. The number of samples required to learn each round's classifier is m_{ε'/16,δ'}, so for all rounds O(log(k/δ) · (1/ε')(d ln(1/ε') + ln(1/δ'))) samples are required. Substituting ε' = ε/6 and δ' = δ/(4t) = δ/(600⌈log(k/δ)⌉), this is O((log(k/δ)/ε)(d ln(1/ε) + ln(1/δ))) samples up to lower order terms. From the addition of the two bounds above, the overall sample complexity bound is:

O((1/ε) ln(k/δ)(d ln(1/ε) + k + ln(1/δ))). □

4 Sample complexity upper bounds for the non-realizable setting

We design Algorithms NR1 and NR2 for the non-realizable setting, which generalize the results of Algorithms R1 and R2, respectively.

Theorem 3. 
For any ε, δ ∈ (0, 1), 7ε/6 < α < 1, and hypothesis class F of VC dimension d, Algorithm NR1 returns a classifier f_NR1 such that err_{D_i}(f_NR1) ≤ (2 + α)OPT + ε holds for all i ∈ [k] with probability 1 − δ using m samples, where

m = O((ln(k)/(α^4 ε))(d ln(1/ε) + k ln(k/δ))).

Theorem 4. For any ε, δ ∈ (0, 1), 5ε/4 < α < 1, and hypothesis class F of VC dimension d, Algorithm NR2 returns a classifier f_NR2 such that err_{D_i}(f_NR2) ≤ (2 + α)OPT + ε holds for all i ∈ [k] with probability 1 − δ using m samples, where

m = O((1/(α^4 ε)) ln(k/δ)(d ln(1/ε) + k ln(1/α) + ln(1/δ))).

Their main modification compared to the algorithms in the previous section is that these algorithms use a smoother update rule. Algorithm NR2 is presented here and Algorithm NR1 is in Appx. B.1.

Algorithm NR2
1: Initialization: ∀i ∈ [k] w_i^(0) := 1; α' := α/40; t := 2⌈ln(4k/δ)/α'^3⌉; ε' := ε/64; δ' := δ/(4t);
2: for r = 1, . . . 
, t do
3:     D̃^(r−1) ← (1/Φ^(r−1)) Σ_{i=1}^k w_i^(r−1) D_i, where Φ^(r−1) := Σ_{i=1}^k w_i^(r−1);
4:     Draw a sample set S^(r) of size O((1/(α'ε'))(d ln(1/(α'ε')) + ln(1/δ'))) from D̃^(r−1);
5:     f^(r) ← O_F(S^(r));
6:     for i = 1, . . . , k do
7:         Draw a sample set T_i of size O((1/(α'ε')) ln(1/α')) from D_i;
8:         s_i^(r) ← min( α'^2 · err_{T_i}(f^(r)) / ((1 + 3α') err_{S^(r)}(f^(r)) + 3ε') , α' );
9:         Update: w_i^(r) ← w_i^(r−1) (1 + s_i^(r));
10:     end for
11: end for
12: return f_NR2 = maj({f^(r)}_{r=1}^t);

Algorithm NR2 faces a similar challenge as Algorithm R2. Given a player i, since the number of samples T_i used to estimate err_{D_i}(f^(r)) in each round is low, the estimation is not very accurate. Ideally, we would want the inequality

|err_{T_i}(f^(r)) − err_{D_i}(f^(r))| ≤ α' · err_{D_i}(f^(r)) + ε'

to hold for all players and all rounds with high probability. The "good" classifiers are now defined as the ones corresponding to rounds for which the inequality holds and err_{T_i}(f^(r)) is not very high (an indication of which is that s_i^(r) < α'). The expected number of rounds for which either one of these properties does not hold is a constant fraction of the rounds (≈ tα'), and due to the high number of rounds it is concentrated around that value, as in Algorithm R2. The proof of Theorem 4 is in Appx. B.2.

We note that the classifiers returned by these algorithms have a multiplicative approximation factor of almost 2 on the error. A different approach would be to allow for randomized classifiers with low error probability over both the randomness of the example and the classifier. We design two algorithms, NR1-AVG and NR2-AVG, that return a classifier which satisfies this form of guarantee on the error without the 2-approximation factor but use roughly α/ε times more samples. 
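Generically, the construction behind these -AVG variants can be sketched as follows (the toy hypotheses below are ours; in the algorithms, the list consists of the classifiers of all rounds):

```python
import random

def randomized_average(classifiers, rng=random):
    """A randomized classifier: each query is answered by a uniformly random
    member of `classifiers`.  Over any distribution on examples, its error
    probability is the average of the members' errors, not a majority vote."""
    def f(x):
        return rng.choice(classifiers)(x)
    return f

# Three hypotheses on a point whose true label is 1: two correct, one wrong,
# so the randomized classifier errs with probability exactly 1/3.
random.seed(0)
f = randomized_average([lambda x: 1, lambda x: 1, lambda x: 0])
est = sum(f(0) != 1 for _ in range(30000)) / 30000
```

Averaging is what removes the factor of 2: a majority vote can be wrong even when most individual errors are small, whereas the average error degrades gracefully.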
The returned classifier is a randomized algorithm that, given an element x, chooses one of the classifiers of all rounds uniformly at random and returns the label that this classifier gives to x. For any distribution over examples, the error probability of this randomized classifier is exactly the average of the error probabilities of the classifiers f^(1), f^(2), . . . , f^(t), hence the AVG in the names. The algorithms as well as the proofs of their corresponding theorems can be found in Appx. B.3 and B.4.

Theorem 5. For any ε, δ ∈ (0, 1), 24ε/25 < α < 1, and hypothesis class F of VC dimension d, Algorithm NR1-AVG returns a classifier f_NR1-AVG such that the expected error satisfies err_{D_i}(f_NR1-AVG) ≤ (1 + α)OPT + ε for all i ∈ [k] with probability 1 − δ using m samples, where

m = O((ln(k)/(α^3 ε^2))(d ln(1/ε) + k ln(k/δ))).

Theorem 6. For any ε, δ ∈ (0, 1), 30ε/29 < α < 1, and hypothesis class F of VC dimension d, Algorithm NR2-AVG returns a classifier f_NR2-AVG such that the expected error satisfies err_{D_i}(f_NR2-AVG) ≤ (1 + α)OPT + ε for all i ∈ [k] with probability 1 − δ using m samples, where

m = O((1/(α^3 ε^2)) ln(k/δ)((d + k) ln(1/ε) + ln(1/δ))).

5 Discussion

The problem has four parameters, d, k, ε and δ, so there are many ways to compare the sample complexity of the algorithms. In the non-realizable setting there is one more parameter, α, but this is set to be a constant in the beginning of the algorithms. 
Our sample complexity upper bounds are summarized in the following table.

Table 1: Sample complexity upper bounds

                                   Algorithm 1                                      Algorithm 2
Realizable                         O( (ln(k)/ε) (d ln(1/ε) + k ln(k/δ)) )           O( (ln(k/δ)/ε) (d ln(1/ε) + k + ln(1/δ)) )
Non-realizable (2 + α approx.)     O( (ln(k)/(α^4 ε)) (d ln(1/ε) + k ln(k/δ)) )     O( (ln(k/δ)/(α^4 ε)) (d ln(1/ε) + k ln(1/α) + ln(1/δ)) )
Non-realizable (randomized)        O( (ln(k)/(α^3 ε^2)) (d ln(1/ε) + k ln(k/δ)) )   O( (ln(k/δ)/(α^3 ε^2)) ((d + k) ln(1/ε) + ln(1/δ)) )

Usually δ can be considered constant, since it represents the allowed failure probability, or, in the high success probability regime, δ = 1/poly(k). For both of these natural settings, we can see that Algorithm 2 is better than Algorithm 1, except for the case of the expected error guarantee. If we assume k = Θ(d), then Algorithm 2 is always better than Algorithm 1.

In the realizable setting, Algorithm R1 is always better than the algorithm of [5] for the centralized variant of the problem and matches their number of samples in the personalized variant.
In addition, Theorem 4.1 of [5] states that the sample complexity of any algorithm in the collaborative model is Ω( (k/ε) ln(k/δ) ), given that d = Θ(k) and ε, δ ∈ (0, 0.1), and this holds even for the personalized variant. For d = Θ(k), the sample complexity of Algorithm R2 is exactly ln(k/δ) times the sample complexity for learning one task. Furthermore, when |F| = 2^d (e.g. the hard instance for the lower bound of [5]), only m_{ε,δ} = O( (1/ε) (d + ln(1/δ)) ) samples are required in the non-collaborative setting instead of the general bound of the VC theorem, so the sample complexity bound for Algorithm R2 is O( (1/ε) ln(k/δ) (d + k + ln(1/δ)) ) and matches the lower bound of [5] exactly, up to lower order terms.

In the non-realizable setting, our generalizations of Algorithms R1 and R2, NR1 and NR2 respectively, have the same sample complexity as in the realizable setting and match the error guarantee for OPT = 0. If OPT ≠ 0, they guarantee an error within a multiplicative factor of almost 2 of OPT. The randomized classifiers returned by Algorithms NR1-AVG and NR2-AVG avoid this factor of 2 in their expected error guarantee; however, learning such classifiers requires O(1/ε) times more samples.

Acknowledgements

We thank the anonymous reviewers for their helpful remarks and for pointing us to the idea of slightly modifying the algorithms in the non-realizable setting so that they do not need to know the optimal error. This work was partially supported by NSF CAREER 1750716 and a Graduate Fellowship from Northeastern University's College of Computer and Information Science.

References

[1] Martin Anthony and Peter L. Bartlett.
Neural Network Learning: Theoretical Foundations. Cambridge University Press, New York, NY, USA, 1st edition, 2009.

[2] Maria-Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. Distributed learning, communication complexity and privacy. In Proceedings of the 25th Conference on Computational Learning Theory (COLT), pages 26.1–26.22, 2012.

[3] Jonathan Baxter. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1):7–39, July 1997.

[4] Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12(1):149–198, March 2000.

[5] Avrim Blum, Nika Haghtalab, Ariel D. Procaccia, and Mingda Qiao. Collaborative PAC learning. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems (NIPS), pages 2389–2398, 2017.

[6] Jiecao Chen, Qin Zhang, and Yuan Zhou. Tight bounds for collaborative PAC learning via multiplicative weights. CoRR, abs/1805.09217, 2018.

[7] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 713–720, 2011.

[8] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1126–1135, 2017.

[9] Christos Koufogiannakis and Neal E. Young. A nearly linear-time PTAS for explicit fractional packing and covering linear programs. Algorithmica, 70(4):648–674, December 2014.

[10] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Proceedings of the 22nd Conference on Computational Learning Theory (COLT), pages 19–30, 2009.

[11] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh.
Domain adaptation with multiple sources. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS), pages 1041–1048, 2009.

[12] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, 2nd edition, 2017.

[13] L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, November 1984.

[14] Jialei Wang, Mladen Kolar, and Nathan Srebro. Distributed multi-task learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 751–760, 2016.