{"title": "Tight Bounds for Collaborative PAC Learning via Multiplicative Weights", "book": "Advances in Neural Information Processing Systems", "page_first": 3598, "page_last": 3607, "abstract": "We study the collaborative PAC learning problem recently proposed in Blum  et al.~\\cite{BHPQ17}, in which we have $k$ players and they want to learn a target function collaboratively, such that the learned function approximates the target function well on all players' distributions simultaneously. The quality of the collaborative learning algorithm is measured by the ratio between the sample complexity of the algorithm and that of the learning algorithm for a single distribution (called the overhead).  We obtain a collaborative learning algorithm with overhead $O(\\ln k)$, improving the one with overhead $O(\\ln^2 k)$ in \\cite{BHPQ17}.  We also show that an $\\Omega(\\ln k)$ overhead is inevitable when $k$ is polynomial bounded by the VC dimension of the hypothesis class.  Finally, our experimental study has demonstrated the superiority of our algorithm compared with the one in Blum  et al.~\\cite{BHPQ17} on real-world datasets.", "full_text": "Tight Bounds for Collaborative PAC Learning via\n\nMultiplicative Weights\u2217\n\nJiecao Chen\n\nComputer Science Department\n\nIndiana University at Bloomington\n\njiecchen@iu.edu\n\nQin Zhang\n\nComputer Science Department\n\nIndiana University at Bloomington\n\nqzhangcs@indiana.edu\n\nYuan Zhou\n\nComputer Science Department\n\nIndiana University at Bloomington\n\nand\n\nDepartment of Industrial and Enterprise Systems Engineering\n\nUniversity of Illinois at Urbana-Champaign\n\nyuanz@illinois.edu\n\nAbstract\n\nWe study the collaborative PAC learning problem recently proposed in Blum\net al. [3], in which we have k players and they want to learn a target function\ncollaboratively, such that the learned function approximates the target function\nwell on all players\u2019 distributions simultaneously. 
The quality of the collaborative learning algorithm is measured by the ratio between the sample complexity of the algorithm and that of the learning algorithm for a single distribution (called the overhead). We obtain a collaborative learning algorithm with overhead O(ln k), improving the one with overhead O(ln² k) in [3]. We also show that an Ω(ln k) overhead is inevitable when k is polynomially bounded by the VC dimension of the hypothesis class. Finally, our experimental study demonstrates the superiority of our algorithm compared with the one in Blum et al. [3] on real-world datasets.

1 Introduction

In this paper we study the collaborative PAC learning problem recently proposed in Blum et al. [3]. In this problem we have an instance space X, a label space Y, and an unknown target function f* : X → Y chosen from the hypothesis class F. We have k players with distributions D_1, D_2, ..., D_k labeled by the target function f*. Our goal is to probably approximately correctly (PAC) learn the target function f* for every distribution D_i. That is, for any given parameters ε, δ > 0, we need to return a function f so that with probability 1 − δ, f agrees with the target f* on instances of probability mass at least 1 − ε in D_i for every player i.

As a motivating example, consider a scenario of personalized medicine where a pharmaceutical company wants to obtain a prediction model for the dose-response relationship of a certain drug based on the genomic profiles of individual patients. While existing machine learning methods can efficiently learn a model with good accuracy for the whole population, for fairness considerations it is also desirable to ensure the model's accuracy on demographic subgroups, e.g.
defined by gender, ethnicity, age, socio-economic status, etc., where each subgroup is associated with a labeled distribution.

∗A full version of this paper is available at https://arxiv.org/abs/1805.09217

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We will be interested in the ratio between the sample complexity required by the best collaborative learning algorithm and that of the learning algorithm for a single distribution, which is called the overhead ratio. A naïve approach to collaborative learning is to allocate a uniform sample budget to each player distribution, and learn the model using all collected samples. In this method the players do minimal collaboration with each other, and it leads to an Ω(k) overhead for many hypothesis classes (in particular for classes with fixed VC dimension, the ones we focus on in this paper). In this paper we aim to develop a collaborative learning algorithm with the optimal overhead ratio.

Our Results. We will focus on the hypothesis class F = {f : X → Y} with VC dimension d. For every ε, δ > 0, let S_{ε,δ} be the sample complexity needed to (ε, δ)-PAC learn the class F. It is known that there exists an (ε, δ)-PAC learning algorithm L_{ε,δ,F} with S_{ε,δ} = O((1/ε)·(d + ln δ⁻¹)) [10]. We remark that we will use the algorithm L as a black box, and therefore our algorithms can easily be extended to other hypothesis classes given their single-distribution learning algorithms.

Given a function g and a set of samples T, let err_T(g) = Pr_{(x,y)∈T}[g(x) ≠ y] be the error of g on T.
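For concreteness, the empirical error err_T(g) just defined can be computed in a few lines; the following Python snippet and its names are our illustration, not code from the paper.

```python
def empirical_error(g, samples):
    """err_T(g): the fraction of labeled samples (x, y) in T with g(x) != y."""
    return sum(1 for x, y in samples if g(x) != y) / len(samples)

# Example with a 1-D threshold classifier.
g = lambda x: 1 if x >= 0.5 else 0
T = [(0.1, 0), (0.2, 0), (0.6, 1), (0.9, 1), (0.4, 1)]
print(empirical_error(g, T))  # one of the five samples is misclassified -> 0.2
```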
Given a distribution D over X × Y, define err_D(g) = Pr_{(x,y)∼D}[g(x) ≠ y] to be the error of g on D. The (ε, δ)-PAC k-player collaborative learning problem can then be rephrased as follows: for player distributions D_1, D_2, ..., D_k and a target function f* ∈ F, our goal is to learn a function g : X → Y so that Pr[∀i = 1, 2, ..., k : err_{D_i}(g) ≤ ε] ≥ 1 − δ. Here we allow the learning algorithm to be improper, that is, the learned function g does not have to be a member of F.

Blum et al. [3] showed an algorithm with sample complexity O((ln² k / ε) · ((d + k)·ln ε⁻¹ + k·ln δ⁻¹)). When k = O(d), this leads to an overhead ratio of O(ln² k) (assuming ε, δ are constants). In this paper we propose an algorithm with sample complexity O((ln k + ln δ⁻¹)·(d + k) / ε) (Theorem 4), which gives an overhead ratio of O(ln k) when k = O(d) and δ is a constant, matching the Ω(ln k) lower bound proved in Blum et al. [3].

Similarly to the algorithm in Blum et al. [3], our algorithm runs in rounds and returns the plurality vote of the functions computed in the rounds as the learned function g. In each round, the algorithm adaptively decides the number of samples to be taken from each player distribution, and calls L to learn a function. While the algorithm in Blum et al. [3] uses a grouping idea and takes samples evenly from the distributions in each group, our algorithm adopts the multiplicative weights method: each player distribution is associated with a weight that directs how the algorithm distributes the sample budget among the player distributions. After each round, the weight of a player distribution increases if the function learned in that round is not accurate on the distribution, letting the algorithm pay more attention to it in future rounds.
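To make the round structure concrete, here is a minimal Python sketch of this multiplicative-weights scheme (a simplification of Algorithm 1 below; the oracles `learn` and `accurate`, the doubling rule, and all constants are placeholder choices of ours, standing in for the paper's single-distribution learner L and the TEST subroutine).

```python
import random
from collections import Counter

def mweights(distributions, learn, accurate, num_rounds):
    """Collaborative learning by multiplicative weights (sketch).

    distributions: list of k samplers, one per player
    learn(mixture):  returns a classifier trained on samples drawn from
                     the weighted mixture sampler (stand-in for L)
    accurate(g, i):  True if classifier g passes player i's accuracy
                     test (stand-in for TEST)
    """
    k = len(distributions)
    weights = [1.0] * k                  # unit initial weights
    classifiers = []
    for _ in range(num_rounds):
        total = sum(weights)
        probs = [w / total for w in weights]

        # Sampler for the weighted average distribution D^(t).
        def mixture():
            i = random.choices(range(k), weights=probs)[0]
            return distributions[i]()

        g = learn(mixture)
        classifiers.append(g)
        for i in range(k):
            if not accurate(g, i):
                weights[i] *= 2          # failing players get more attention

    def plurality(x):
        # Plurality vote of the per-round classifiers.
        return Counter(h(x) for h in classifiers).most_common(1)[0][0]
    return plurality
```

A usage sketch: each player supplies a sampler over its instance space, `learn` fits any single-distribution model to samples from the mixture, and `accurate` estimates the per-player error from a small test set, exactly mirroring the keep-or-double weight update described above.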
We will first present a direct application of the multiplicative weights method, which leads to a slightly worse sample complexity bound (Theorem 3), and then prove Theorem 4 with more refined algorithmic ideas.

On the lower bound side, the lower bound in Blum et al. [3] covers only the special case k = d. We extend their result to every k and d. In particular, we show that the sample complexity of collaborative learning has to be Ω(max{d ln k, k ln d}/ε) for constant δ (Theorem 6). Therefore, the sample complexity of our algorithm is optimal when k = d^{O(1)}.²

Finally, we have implemented our algorithms and compared them with the one in Blum et al. [3] and the naïve method on several real-world datasets. Our experimental results demonstrate the superiority of our algorithm in terms of sample complexity.

Related Work. As mentioned, collaborative PAC learning was first studied in Blum et al. [3]. Besides the problem of learning one hypothesis that is good for all players' distributions (called centralized collaborative learning in [3]), the authors also studied the case in which we may use different hypotheses for different distributions (called personalized collaborative learning). For the personalized version they obtained an O(ln k) overhead in sample complexity. Our results show that we can obtain the same overhead for the (more difficult) centralized version.

²We note that this is a stronger statement than the earlier one on "the optimal overhead ratio of O(ln k) for k = O(d)" in several aspects. First, showing the optimal overhead ratio only needs a minimax lower bound, while the latter statement claims the optimal sample complexity for every k and d in the range. Second, the latter statement covers a much wider parameter range for k and d.
In a concurrent work [15], the authors show results similar to ours.

Both our algorithms and AdaBoost [7] use the multiplicative weights method. While AdaBoost places weights on the samples of a fixed training set, our algorithms place weights on the distributions of data points, and adaptively acquire new samples to achieve better accuracy. Another important feature of our improved algorithm is that it tolerates a few "failed rounds" in the multiplicative weights method, which requires more effort in the analysis and is crucial for shaving the extra ln k factor when k = Θ(d).

Balcan et al. [1] studied the problem of finding a hypothesis that approximates the target function well on the joint mixture of the k distributions of k players. They focused on minimizing the communication between the players, and allowed players to exchange not only samples but also hypotheses and other information. Daume et al. [11, 12] studied the problem of computing linear separators in a similar distributed communication model. The communication complexity of distributed learning has also been studied for a number of other problems, including principal component analysis [13], clustering [2, 9], and multi-task learning [16].

Another related direction of research is the multi-source domain adaptation problem [14], where we are given k distributions and a hypothesis with error at most ε on each of them. The task is to combine the k hypotheses into a single one that has error at most kε on any mixture of the k distributions. This problem differs from our setting in that we want to learn the "global" hypothesis from scratch instead of combining existing ones.

2 The Basic Algorithm

In this section we propose an algorithm for collaborative learning using the multiplicative weights method.
The algorithm is described in Algorithm 1, which uses Algorithm 2 as a subroutine.

We briefly describe Algorithm 1 in words. We start by giving a unit weight to each of the k players. The algorithm runs in T = O(ln k) rounds, and the players' weights change at each round. At round t, we take a set of samples S^(t) from the average distribution of the k players, weighted by their weights. We then learn a classifier g^(t) from the samples in S^(t), and test for each player i whether g^(t) agrees with the target function f* on probability mass at least 1 − ε/6 of distribution D_i. If yes, we keep the weight of the i-th player; otherwise we multiply its weight by a factor of 2, so that D_i attracts more attention in the future learning process. Finally, we return a classifier g which takes the plurality vote³ of the T classifiers g^(0), g^(1), ..., g^(T−1) that we have constructed. We note that we make no effort to optimize the constants in the algorithms and their theoretical analysis; in the experiment section, we tune the constants for better empirical performance.

The following lemma shows that TEST returns, with high probability, the desired set of players, namely those for which g is an accurate hypothesis on their own distributions. We say a call to TEST is successful if the returned set has the properties described in Lemma 1. The omitted proofs in this section can be found in Appendix ??.

Algorithm 1 BASICMW
1: Let the initial weight w_i^(0) ← 1 for each player i ∈ {1, 2, ..., k}.
2: Let T ← 10 ln k.
3: for t ← 0 to T − 1 do
4:   Let p^(t)(i) ← w_i^(t) / Σ_{j=1}^k w_j^(t) for each i ∈ {1, 2, ..., k}, so that p^(t)(·) defines a probability distribution.
5:   Let D^(t) ← Σ_{i=1}^k p^(t)(i)·D_i.
6:   Let S^(t) be a set of S_{ε/120, δ/(4(t+1)²)} samples from D^(t).
Let g^(t) ← L_{ε/120, δ/(4(t+1)²), F}(S^(t)).
7:   Let Z^(t) ← TEST(g^(t), k, t, ε, δ).
8:   for each i ∈ {1, 2, ..., k} do
9:     if i ∈ Z^(t) then
10:      w_i^(t+1) ← w_i^(t).
11:    else
12:      w_i^(t+1) ← 2 · w_i^(t).
13: return g = Plurality(g^(0), ..., g^(T−1)).

Algorithm 2 Accuracy Test (TEST(g, k, t, ε, δ))
1: for each i ∈ {1, 2, ..., k} do Let T_i be a set of (432/ε)·ln(4k(t+1)²/δ) samples from D_i.
2: return {i | err_{T_i}(g) ≤ ε/6}.

³I.e., the most frequent value, with ties broken arbitrarily.

Lemma 1 With probability at least 1 − δ/(4(t+1)²), TEST(g, k, t, ε, δ) returns a set of players that includes 1) each i such that err_{D_i}(g) ≤ ε/12, and 2) none of the i such that err_{D_i}(g) > ε/4.

Given a function g and a distribution D, we say that g is a good candidate for D if err_D(g) ≤ ε/4. The following lemma shows that if we have a set of functions of which most are good candidates for D, then the plurality vote of these functions also has good accuracy for D.

Lemma 2 Let g_1, g_2, ..., g_m be a set of functions such that more than 70% of them are good candidates for D. Letting g = Plurality(g_1, g_2, ..., g_m), we have that err_D(g) ≤ ε.

Let E be the event that every call of the learner L and of TEST is successful. It is straightforward to see that

  Pr[E] ≥ 1 − Σ_{t=0}^{+∞} (δ/(4(t+1)²))·2 = 1 − δ·π²/12 > 1 − δ.   (1)

Now we are ready to prove the main theorem for Algorithm 1.

Theorem 3 Algorithm 1 has the following properties.

1. With probability at least 1 − δ, it returns a function g such that err_{D_i}(g) ≤ ε for all i ∈ {1, 2, ..., k}.

2.
Its sample complexity is O((ln k / ε) · (d + k·ln δ⁻¹ + k·ln k)).

Proof. While the sample complexity is easy to verify, we focus on the proof of the first property. In particular, we show that when E happens (which is the case with probability at least 1 − δ by (1)), we have err_{D_i}(g) ≤ ε for all i ∈ {1, 2, ..., k}. From now until the end of the proof, we assume that E happens.

For each round t, we have ε/120 ≥ err_{D^(t)}(g^(t)) = E_{i∼p^(t)(·)}[err_{D_i}(g^(t))]. Therefore, by Markov's inequality, we have Pr_{i∼p^(t)(·)}[err_{D_i}(g^(t)) > ε/12] ≤ .1. In other words,

  .1 ≥ Σ_{i: err_{D_i}(g^(t)) > ε/12} p^(t)(i) = (1/Σ_{i=1}^k w_i^(t)) · Σ_{i: err_{D_i}(g^(t)) > ε/12} w_i^(t).   (2)

Now consider the total weight Σ_{i=1}^k w_i^(t+1). We have

  Σ_{i=1}^k w_i^(t+1) = Σ_{i=1}^k w_i^(t) + Σ_{i∉Z^(t)} w_i^(t).   (3)

By Lemma 1 and E, we have

  Σ_{i∉Z^(t)} w_i^(t) ≤ Σ_{i: err_{D_i}(g^(t)) > ε/12} w_i^(t).   (4)

Combining (2), (3), and (4), we have Σ_{i=1}^k w_i^(t+1) ≤ 1.1·Σ_{i=1}^k w_i^(t). Since Σ_{i=1}^k w_i^(0) = k, the following inequality holds for every t = 0, 1, 2, ...: Σ_{i=1}^k w_i^(t) ≤ 1.1^t · k.

Now let us focus on an arbitrary player i. We will show that for at least 70% of the rounds t, we have err_{D_i}(g^(t)) ≤ ε/4, and this will conclude the proof of this theorem thanks to Lemma 2.

Suppose the contrary: for more than 30% of the rounds, we have err_{D_i}(g^(t)) > ε/4. At each such round t, we have i ∉ Z^(t) because of Lemma 1 and E, and therefore w_i^(t+1) = 2 · w_i^(t). Therefore, we have w_i^(T) ≥ 2^{.3T}.
Together with the bound Σ_{i=1}^k w_i^(T) ≤ 1.1^T · k, this gives 2^{.3T} ≤ w_i^(T) ≤ Σ_{i=1}^k w_i^(T) ≤ 1.1^T · k, which is a contradiction for T = 10 ln k. □

3 The Quest for Optimality via Robust Multiplicative Weights

In this section we improve the result in Theorem 3 to obtain an optimal algorithm when k is polynomially bounded by d (see Theorem 4; the optimality will be shown in Section 4). In fact, our improved algorithm (Algorithm 3, using Algorithm 4 as a subroutine) is almost the same as Algorithm 1 (which uses Algorithm 2 as a subroutine). We highlight the differences as follows.

1. The total number of iterations at Line 2 of Algorithm 1 is changed to T̃ = 2000 ln(k/δ).

2. The failure probability of the single-distribution learning algorithm L at Line 6 of Algorithm 1 is increased to a constant 1/100.

3. The number of samples taken from each distribution at Line 1 of Algorithm 2 is reduced to (432/ε)·ln(100).

Although these changes seem minor, it takes substantial technical effort to establish Theorem 4. We describe the challenge and sketch our solution as follows. While the 2nd and 3rd items lead to the key reduction in the sample complexity, they make it impossible to use the union bound and claim that with high probability "every call of L and TEST is successful" (see Inequality (1) in the analysis of Algorithm 1). To address this problem, we make our multiplicative weights analysis robust against occasionally failed rounds, so that it works whenever "most calls of L and WEAKTEST are successful".

In more detail, we first work with the total weight W^(t) = Σ_{i=1}^k w_i^(t) at the t-th round, and show that, conditioned on the t-th round, E[W^(t+1)] is upper bounded by 1.13·W^(t) (where in contrast we had the stronger and deterministic statement Σ_{i=1}^k w_i^(t+1) ≤ 1.1·Σ_{i=1}^k w_i^(t) in the analysis for
the basic algorithm). Using Jensen's inequality, we then derive that E[ln W^(t+1)] is upper bounded by ln 1.13 + E[ln W^(t)]. Then, using Azuma's inequality for supermartingales, we show that with high probability ln W^(T̃) ≤ T̃·(ln 1.18) + ln W^(0), i.e., W^(T̃) ≤ 1.18^T̃ · k, which corresponds to Σ_{i=1}^k w_i^(t) ≤ 1.1^t · k in the basic proof. On the other hand, recall that in the basic proof we had to show that if for more than 30% of the rounds the function g^(t) is not a good candidate for a player distribution D_i, then w_i^(T) ≥ 2^{.3T}. In the analysis of the improved algorithm, because the WEAKTEST procedure fails with much higher probability, we need to use concentration inequalities and derive a slightly weaker statement (w_i^(T̃) ≥ 2^{.25T̃}). Finally, we put everything together using the same proof-by-contradiction argument, and prove the following theorem.

Algorithm 3 MWEIGHTS
1: Let the initial weight w_i^(0) ← 1 for each player i ∈ {1, 2, 3, ..., k}.
2: Let T̃ ← 2000 ln(k/δ).
3: for t ← 0 to T̃ − 1 do
4:   Let p^(t)(i) ← w_i^(t) / Σ_{j=1}^k w_j^(t) for each i ∈ {1, 2, 3, ..., k}, so that p^(t)(·) defines a probability distribution.
5:   Let D^(t) ← Σ_{i=1}^k p^(t)(i)·D_i.
6:   Let S^(t) be a set of S_{ε/120, 1/100} samples from D^(t). Let g^(t) ← L_{ε/120, 1/100, F}(S^(t)).
7:   Let Z^(t) ← WEAKTEST(g^(t), k, ε, δ).
8:   for each i ∈ {1, 2, 3, ..., k} do
9:     if i ∈ Z^(t) then
10:      w_i^(t+1) ← w_i^(t).
11:    else
12:      w_i^(t+1) ← 2 · w_i^(t).
13: return g = Plurality(g^(0), ..., g^(T̃−1)).

Algorithm 4 Weak Accuracy Test (WEAKTEST(g, k, ε, δ))
1: for each i ∈ {1, 2, 3, ...
, k} do Let T_i be a set of (432/ε)·ln(100) samples from D_i.
2: return {i | err_{T_i}(g) ≤ ε/6}.

Theorem 4 Algorithm 3 has the following properties.

1. With probability at least 1 − δ, it returns a function g such that err_{D_i}(g) ≤ ε for all i ∈ {1, 2, ..., k}.

2. Its sample complexity is O((ln k + ln δ⁻¹)·(d + k) / ε).

Now we prove Theorem 4. Similarly to Lemma 1, applying Proposition ?? (but without the union bound), we have the following lemma for WEAKTEST.

Lemma 5 For each player i, with probability at least 1 − 1/100, the following hold: 1) if err_{D_i}(g) ≤ ε/12, then i ∈ WEAKTEST(g, k, ε, δ); 2) if err_{D_i}(g) > ε/4, then i ∉ WEAKTEST(g, k, ε, δ).

Let the indicator variable ψ_i^(t) = 1 if the desired event described in Lemma 5 for player i and time t does not happen, and let ψ_i^(t) = 0 otherwise. By Lemma 5, we have E[ψ_i^(t)] ≤ 1/100. By Proposition ??, for each player i, we have Pr[Σ_{t=0}^{T̃−1} ψ_i^(t) > .05·T̃] ≤ exp(−(1/3)·4²·(T̃/100)) ≤ exp(−5T̃/100) ≤ δ/k⁵.

Now let J_1 be the event that Σ_{t=0}^{T̃−1} ψ_i^(t) ≤ .05·T̃ for every i. Via a union bound, we have that

  Pr[J_1] ≥ 1 − δ/k⁴.   (5)

Let the indicator variable χ^(t) = 1 if the learner L fails at time t, and let χ^(t) = 0 otherwise. Let W^(t) = Σ_{i=1}^k w_i^(t) be the total weight at time t. We have

  E[χ^(t) | time 0, 1, ...
, t − 1] ≤ 1/100.   (6)

For each t, similarly to (3), we have

  W^(t+1) = W^(t) + Σ_{i∉Z^(t)} w_i^(t).   (7)

For each i such that err_{D_i}(g^(t)) ≤ ε/12, by Lemma 5 we know that Pr[i ∉ Z^(t)] ≤ 1/100. Therefore, if we take the expectation over the randomness of WEAKTEST at time t, we have

  E[Σ_{i∉Z^(t)} w_i^(t)] ≤ Σ_{i: err_{D_i}(g^(t)) > ε/12} w_i^(t) + (1/100)·Σ_{i=1}^k w_i^(t).   (8)

When χ^(t) = 0, similarly to the proof of Theorem 3, we have

  .1 ≥ Σ_{i: err_{D_i}(g^(t)) > ε/12} p^(t)(i) = (1/Σ_{i=1}^k w_i^(t)) · Σ_{i: err_{D_i}(g^(t)) > ε/12} w_i^(t).   (9)

Combining (7), (8), and (9), we have (when χ^(t) = 0)

  E[W^(t+1) | χ^(t) = 0 and W^(0), ..., W^(t)] ≤ 1.11 · W^(t).   (10)

Together with (6), we have E[W^(t+1) | W^(0), ..., W^(t)] ≤ E[W^(t+1) | χ^(t) = 0 and W^(0), ..., W^(t)] · Pr[χ^(t) = 0 | W^(0), ..., W^(t)] + 2W^(t) · Pr[χ^(t) = 1 | W^(0), ..., W^(t)] ≤ (1.11 + 0.02)·W^(t) = 1.13·W^(t).

Let Q^(t) = ln(W^(t+1)/W^(t)). By Jensen's inequality, we have E[Q^(t) | W^(0), ..., W^(t)] ≤ ln E[W^(t+1)/W^(t) | W^(0), ..., W^(t)]. Therefore, E[Q^(t) | Q^(0), ..., Q^(t−1)] = E[Q^(t) | W^(0), ..., W^(t)] ≤ ln E[W^(t+1)/W^(t) | W^(0), ...
, W^(t)] ≤ ln(1.11 + .02) = ln 1.13.

Now let Q̃^(t) = Σ_{z=0}^{t−1} Q^(z) − t·ln 1.13 for all t = 0, 1, 2, .... We have that {Q̃^(t)} is a supermartingale and |Q̃^(t+1) − Q̃^(t)| ≤ ln 2 for all t = 0, 1, 2, .... By Proposition ?? and noticing that ln 1.18 − ln 1.13 > .04, we have Pr[Σ_{t=0}^{T̃−1} Q^(t) > (ln 1.18)·T̃] ≤ Pr[Q̃^(T̃) − Q̃^(0) > .04·T̃] ≤ exp(−(.04²·T̃)/(2·(ln 2)²)) ≤ δ/k². Let J_2 be the event that W^(T̃) ≤ 1.18^T̃ · k, which is equivalent to Σ_{t=0}^{T̃−1} Q^(t) ≤ (ln 1.18)·T̃; we have that

  Pr[J_2] ≥ 1 − δ/k².   (11)

Now let J = J_1 ∩ J_2. Combining (5) and (11), for k ≥ 2 we have

  Pr[J] ≥ 1 − δ/k.   (12)

Now we are ready to prove Theorem 4 for Algorithm 3.

Proof. [of Theorem 4] While the sample complexity is easy to verify, we focus on the proof of the first property. In particular, we show that when J happens (which occurs with probability at least 1 − δ by (12)), we have err_{D_i}(g) ≤ ε for all i ∈ {1, 2, 3, ..., k}.

Let us consider an arbitrary player i. We will show that when J happens, for at least 70% of the times t we have err_{D_i}(g^(t)) ≤ ε/4, and this will conclude the proof of this theorem thanks to Lemma 2.

Suppose the contrary: for more than 30% of the times, we have err_{D_i}(g^(t)) > ε/4. Because of J_1, for more than 30% − 5% = 25% of the times t, we have i ∉ Z^(t). Therefore, we have w_i^(T̃) ≥ 2^{.25T̃}.
On the other hand, by J_2 we have w_i^(T̃) ≤ W^(T̃) ≤ 1.18^T̃ · k. Therefore we reach 2^{.25T̃} ≤ 1.18^T̃ · k, which is a contradiction for T̃ = 2000 ln(k/δ). □

4 Lower Bound

We show the following lower bound result, which matches our upper bound (Theorem 4) when k = (1/δ)^{Ω(1)} and k = d^{O(1)}.

Theorem 6 In collaborative PAC learning with k players and a hypothesis class of VC dimension d, for any ε, δ ∈ (0, 0.01), there exists a hard input distribution on which any (ε, δ)-learning algorithm A needs Ω(max{d ln k, k ln d}/ε) samples in expectation, where the expectation is taken over the randomness used in obtaining the samples and the randomness used in drawing the input from the input distribution.

The proof of Theorem 6 is similar to that of the lower bound result in [3]; however, we need to generalize the hard instance provided in [3] in two different cases. We briefly discuss the high-level ideas of our generalization here, and leave the full proof to Appendix ?? due to space constraints. The lower bound proof in [3] (for k = d) performs a reduction from a single-player problem to a k-player problem, such that if we can (ε, δ)-PAC learn the k-player problem using m samples in total, then we can (ε, 10δ/(9k))-PAC learn the single-player problem using O(m/k) samples. Now for the case d > k, we need to change the single-player problem used in [3], whose hypothesis class has VC dimension Θ(1), to one whose hypothesis class has VC dimension Θ(d/k). For the case d ≤ k, we essentially duplicate the hard instance for a d-player problem k/d times, obtaining a hard instance for a k-player problem, and then perform the random embedding reduction from the single-player problem to the k-player problem. See Appendix ??
for details.

5 Experiments

We present in this section a set of experimental results that demonstrate the effectiveness of our proposed algorithms.

Our algorithms are based on the assumption that, given a hypothesis class, we are able to compute its VC dimension d and access an oracle that computes an (ε, δ)-classifier with sample complexity S_{ε,δ}. In practice, however, it is usually computationally difficult to compute the exact VC dimension of a given hypothesis class. Moreover, the VC dimension usually only gives a very loose upper bound on the sample complexity needed for an (ε, δ)-classifier.

To address these practical difficulties, in our experiments we treat the VC dimension d as a parameter that controls the sample budget. More specifically, we first choose a concrete model as the oracle; in our implementation, we choose the decision tree. We then set the parameter δ = 0.9 and gradually increase d to determine the sample budget. For each fixed sample budget (i.e., each fixed d), we run the algorithm 100 times and test whether the following holds:

  P̂r[max_i err_{D_i}(g) ≤ ε] ≥ 0.9.   (13)

Here ε is a parameter we choose, and g is the classifier returned by the collaborative learning algorithm being tested. The empirical probability P̂r[·] in (13) is calculated over the 100 runs. We finally report the minimum number of samples consumed by the algorithm to achieve (13).

Note that in our theoretical analysis we did not try to optimize the constants. Instead, we tune the constants for both CENLEARN and MWEIGHTS for better performance. More implementation details can be found in the appendix.

Datasets. We test the collaborative learning algorithms on the following data sets.

MAGIC-EVEN [4].
This data set is generated to simulate registration of high-energy gamma particles in an atmospheric Cherenkov telescope. There are 19,020 instances, each belonging to one of two classes (gamma and hadron), with 11 attributes per data point. We randomly partition this data set into k = 10 subsets (namely, D_1, ..., D_k).

MAGIC-1. The raw data set is the same as in MAGIC-EVEN. Instead of random partitioning, we partition the data set into D_1 and D_2 based on the two classes, and make k − 2 more copies of D_2 so that D_2, D_3, ..., D_k are identical. Here we set k = 10.

MAGIC-2. This data set differs from MAGIC-1 in the way D_1 and D_2 are constructed: we partition the original data set into D_1 and D_2 based on the first dimension of the feature vectors; we then make duplicates of D_2. Here we again set k = 10.

WINE [5]. This data set contains physicochemical tests for white wine, with wine scores ranging from 0 to 10. There are 4,898 instances and 12 attributes in the feature vectors. We partition the data set into D_1, ..., D_4 based on the first two dimensions.

EYE. This data set consists of 14 EEG values and a value indicating the eye state. There are 14,980 instances in this data set. We partition it into D_1, ..., D_4 based on the first two dimensions.

LETTER [8]. This data set has 20,000 instances, each in R^16. There are 26 classes, each representing one of the 26 capital letters. We partition this data set into k = 12 subsets based on the first 4 dimensions of the feature vectors.

Tested Algorithms. We compare our algorithms with the following two baseline algorithms.

NAIVE. In this algorithm we treat all distributions D_1, ..., D_k equally. That is, given a budget z, we sample z training samples from D = (1/k)·Σ_{i=1}^k D_i.
We then train a classi\ufb01er (decision tree) using\nk\nthose samples.\nCENLEARN, this is the implementation of the algorithm proposed by Blum et al. [3].\nSince our Algorithm 1 and Algorithm 3 are very similar, and Algorithm 3 has better theoretical\nguarantee, we will only test Algorithm 3, denoted as MWEIGHTS, in our experiments.\n\n(cid:80)k\n\nExperimental Results and Discussion. The experimental results are presented in Figure 1. We\ntest the algorithms for each data set using multiple values of the error threshold \u0001, and report the\nsample complexity for NAIVE, MWEIGHTS and CENLEARN.\nIn Figure 1a, we notice that NAIVE uses less samples than its competitors. This phenomenon is\npredictable because in MAGIC-EVEN, D1, . . . , Dk are constructed via random partitioning, which is\nthe easiest case for NAIVE. Since MWEIGHTS and CENLEARN need to train multiple classi\ufb01ers,\neach classi\ufb01er will get fewer training samples than NAIVE when the total budgets are the same.\n\n8\n\n\f(a) MAGIC-EVEN\n\n(b) MAGIC-1\n\n(c) MAGIC-2\n\n(d) WINE\n\n(e) EYE\n\n(f) LETTER\n\nFigure 1: Sample complexity versus error threshold \u0001.\n\n(cid:80)k\n\nIn Figure 1b and Figure 1c, D1, . . . , Dk are constructed in a way that D2, D3, . . . , Dk are identical,\nand D1 is very different from other distributions. Thus the overall distribution (i.e., D = 1\ni=1 Di)\nk\nused to train NAIVE is quite different from the original data set. One can observe from those two\n\ufb01gures that MWEIGHTS still works quite well while NAIVE suffers.\nIn Figure 1b-Figure 1f, one can observe that MWEIGHTS uses fewer samples than its competitors in\nalmost all cases, which shows the superiority of our proposed algorithm. CENLEARN outperforms\nNAIVE in general. However, NAIVE uses slightly fewer samples than CENLEARN in some cases\n(e.g., Figure 1d). This may due to the fact that the distributions D1, . . . 
, Dk in those cases are not hard enough to show the superiority of CENLEARN over NAIVE.

To summarize, our experimental results show that MWEIGHTS and CENLEARN need fewer samples than NAIVE when the input distributions D1, . . . , Dk are sufficiently different. MWEIGHTS consistently outperforms CENLEARN, which may be due to the fact that MWEIGHTS has better theoretical guarantees and is more straightforward to implement.

6 Conclusion

In this paper we have considered the collaborative PAC learning problem. We have proved the optimal overhead ratio and sample complexity, and conducted experimental studies demonstrating the superior performance of our proposed algorithms.

One open question is the balance of the number of queries made to each player, which can be measured by the ratio between the largest number of queries made to any single player and the average number of queries made to the k players. The algorithms proposed in this paper may incur a balance ratio of Ω(k) in the worst case. It would be interesting to investigate:

1. Is there an algorithm with the same sample complexity but a better balance ratio?
2. What is the optimal trade-off between sample complexity and balance ratio?

Acknowledgments

Jiecao Chen and Qin Zhang are supported in part by NSF CCF-1525024, CCF-1844234 and IIS-1633215. Part of this work was done while Yuan Zhou was visiting the Shanghai University of Finance and Economics.

References

[1] M. Balcan, A. Blum, S. Fine, and Y. Mansour. Distributed learning, communication complexity and privacy. In COLT, pages 26.1–26.22, 2012.

[2] M. Balcan, S. Ehrlich, and Y. Liang. Distributed k-means and k-median clustering on general communication topologies. In NIPS, pages 1995–2003, 2013.

[3] A. Blum, N. Haghtalab, A. D. Procaccia, and M. Qiao. Collaborative PAC learning. In NIPS, pages 2389–2398, 2017.

[4] R. Bock, A. Chilingarian, M. Gaug, F. Hakl, T. Hengstebeck, M.
Jirina, J. Klaschka, E. Kotrc, P. Savický, S. Towers, et al. Methods for multidimensional event classification: a case study. Internal Note, CERN, 2003.

[5] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 2009.

[6] A. Ehrenfeucht, D. Haussler, M. J. Kearns, and L. G. Valiant. A general lower bound on the number of examples needed for learning. Inf. Comput., 82(3):247–261, 1989.

[7] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[8] P. W. Frey and D. J. Slate. Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6:161–182, 1991.

[9] S. Guha, Y. Li, and Q. Zhang. Distributed partial clustering. In SPAA, pages 143–152, 2017.

[10] S. Hanneke. The optimal sample complexity of PAC learning. The Journal of Machine Learning Research, 17(1):1319–1333, 2016.

[11] H. Daumé III, J. M. Phillips, A. Saha, and S. Venkatasubramanian. Efficient protocols for distributed classification and optimization. In ALT, pages 154–168, 2012.

[12] H. Daumé III, J. M. Phillips, A. Saha, and S. Venkatasubramanian. Protocols for learning classifiers on distributed data. In AISTATS, pages 282–290, 2012.

[13] Y. Liang, M. Balcan, V. Kanchanapally, and D. P. Woodruff. Improved distributed principal component analysis. In NIPS, pages 3113–3121, 2014.

[14] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation with multiple sources. In NIPS, pages 1041–1048, 2008.

[15] H. L. Nguyen and L. Zakynthinou. Improved algorithms for collaborative PAC learning. arXiv preprint arXiv:1805.08356, 2018.

[16] J. Wang, M. Kolar, and N. Srebro. Distributed multi-task learning.
In AISTATS, pages 751–760, 2016.