{"title": "(Almost) No Label No Cry", "book": "Advances in Neural Information Processing Systems", "page_first": 190, "page_last": 198, "abstract": "In Learning with Label Proportions (LLP), the objective is to learn a supervised classifier when, instead of labels, only label proportions for bags of observations are known. This setting has broad practical relevance, in particular for privacy-preserving data processing. We first show that the mean operator, a statistic which aggregates all labels, is minimally sufficient for the minimization of many proper scoring losses with linear (or kernelized) classifiers without using labels. We provide a fast learning algorithm that estimates the mean operator via a manifold regularizer with guaranteed approximation bounds. Then, we present an iterative learning algorithm that uses this as initialization. We ground this algorithm in Rademacher-style generalization bounds that fit the LLP setting, introducing a generalization of Rademacher complexity and a Label Proportion Complexity measure. This latter algorithm optimizes tractable bounds for the corresponding bag-empirical risk. Experiments are provided on fourteen domains, whose sizes range up to 300K observations. They display that our algorithms are scalable and tend to consistently outperform the state of the art in LLP. Moreover, in many cases, our algorithms compete with, or are just a few percent of AUC away from, the Oracle that learns knowing all labels. On the largest domains, half a dozen proportions can suffice, i.e.
roughly 40K times less than the total number of labels.", "full_text": "(Almost) No Label No Cry\n\nGiorgio Patrini1,2, Richard Nock1,2, Paul Rivera1,2, Tiberio Caetano1,3,4\n\nAustralian National University1, NICTA2, University of New South Wales3, Ambiata4\n\nSydney, NSW, Australia\n\n{name.surname}@anu.edu.au\n\nAbstract\n\nIn Learning with Label Proportions (LLP), the objective is to learn a supervised\nclassi\ufb01er when, instead of labels, only label proportions for bags of observations\nare known. This setting has broad practical relevance, in particular for privacy\npreserving data processing. We \ufb01rst show that the mean operator, a statistic which\naggregates all labels, is minimally suf\ufb01cient for the minimization of many proper\nscoring losses with linear (or kernelized) classi\ufb01ers without using labels. We pro-\nvide a fast learning algorithm that estimates the mean operator via a manifold\nregularizer with guaranteed approximation bounds. Then, we present an itera-\ntive learning algorithm that uses this as initialization. We ground this algorithm\nin Rademacher-style generalization bounds that \ufb01t the LLP setting, introducing\na generalization of Rademacher complexity and a Label Proportion Complexity\nmeasure. This latter algorithm optimizes tractable bounds for the corresponding\nbag-empirical risk. Experiments are provided on fourteen domains, whose size\nranges up to \u2248300K observations. They display that our algorithms are scalable\nand tend to consistently outperform the state of the art in LLP. Moreover, in many\ncases, our algorithms compete with or are just percents of AUC away from the\nOracle that learns knowing all labels. On the largest domains, half a dozen pro-\nportions can suf\ufb01ce, i.e. 
roughly 40K times less than the total number of labels.

1 Introduction

Machine learning has recently experienced a proliferation of problem settings that, to some extent, enrich the classical dichotomy between supervised and unsupervised learning. Settings such as multiple-instance labels, noisy labels and partial labels, as well as semi-supervised learning, have been studied, motivated by applications where fully supervised learning is no longer realistic. In the present work, we are interested in learning a binary classifier from information provided at the level of groups of instances, called bags. The type of information we assume available is the label proportion per bag, indicating the fraction of positive binary labels among its instances. Inspired by [1], we refer to this framework as Learning with Label Proportions (LLP). Settings that perform a bag-wise aggregation of labels include Multiple Instance Learning (MIL) [2]. In MIL, the aggregation is logical rather than statistical: each bag is provided with a binary label expressing an OR condition on all the labels contained in the bag. More general settings also exist [3] [4] [5].
Many practical scenarios fit the LLP abstraction. (a) Only aggregated labels can be obtained due to the physical limits of measurement tools [6] [7] [8] [9]. (b) The problem is semi-supervised or unsupervised, but domain experts have knowledge about the unlabelled samples in the form of expectations, as pseudo-measurements [5]. (c) Labels existed once but are now given in an aggregated fashion for privacy-preserving reasons, as in medical databases [10], fraud detection [11], the housing market, election results, census data, etc. (d) This setting also arises in computer vision [12] [13] [14].
Related work. Two papers independently introduced the problem, [12] and [9].
In the \ufb01rst the authors\npropose a hierarchical probabilistic model which generates labels consistent with the proportions,\nand make inference through MCMC sampling. Similarly, the second and its follower [6] offer a\n\n1\n\n\fvariety of standard machine learning methods designed to generate self-consistent labels. [15] gives\na Bayesian interpretation of LLP where the key distribution is estimated through an RBM. Other\nideas rely on structural learning of Bayesian networks with missing data [7], and on K-MEANS clus-\ntering to solve preliminary label assignment in order to resort to fully supervised methods [13] [8].\nRecent SVM implementations [11] [16] outperform most of the other known methods. Theoretical\nworks on LLP belong to two main categories. The \ufb01rst contains uniform convergence results, for the\nestimators of label proportions [1], or the estimator of the mean operator [17]. The second contains\napproximation results for the classi\ufb01er [17]. Our work builds upon their Mean Map algorithm, that\nrelies on the trick that the logistic loss may be split in two, a convex part depending only on the\nobservations, and a linear part involving a suf\ufb01cient statistic for the label, the mean operator. Being\nable to estimate the mean operator means being able to \ufb01t a classi\ufb01er without using labels. In [17],\nthis estimation relies on a restrictive homogeneity assumption that the class-conditional estimation\nof features does not depend on the bags. Experiments display the limits of this assumption [11][16].\nContributions. In this paper we consider linear classi\ufb01ers, but our results hold for kernelized for-\nmulations following [17]. 
We first show that the trick about the logistic loss can be generalized: the mean operator is actually minimally sufficient for a wide set of "symmetric" proper scoring losses with no class-dependent misclassification cost, which encompass the logistic, square and Matsushita losses [18]. We then provide an algorithm, LMM, which estimates the mean operator via a Laplacian-based manifold regularizer without calling on the homogeneity assumption. We show that, under a weak distinguishability assumption between bags, our estimation of the mean operator improves as the norms of the observations increase. This, as we show, cannot hold for the Mean Map estimator. Then, we provide a data-dependent approximation bound for our classifier with respect to the optimal classifier, which is shown to be better than previous bounds [17]. We also show that the manifold regularizer's solution is tightly related to the linear separability of the bags. We then provide an iterative algorithm, AMM, that takes as input the solution of LMM and optimizes it further over the set of consistent labelings. We ground the algorithm in a uniform convergence result involving a generalization of Rademacher complexities for the LLP setting. The bound involves a bag-empirical surrogate risk, for which we show that AMM optimizes tractable bounds. All our theoretical results hold for any symmetric proper scoring loss. Experiments are provided on fourteen domains, ranging from hundreds to hundreds of thousands of examples, comparing AMM and LMM to their contenders: Mean Map, InvCal [11] and ∝SVM [16]. They display that AMM and LMM outperform their contenders, and sometimes even compete with the fully supervised learner while requiring only a few proportions. Tests on the largest domains display the scalability of both algorithms.
Such experimental evidence seriously questions the safety of privacy-preserving summarization of data, whenever accurate aggregates and informative individual features are available. Section (2) presents our algorithms and related theoretical results. Section (3) presents experiments. Section (4) concludes. A Supplementary Material [19] includes proofs and additional experiments.

2 LLP and the mean operator: theoretical results and algorithms

Learning setting  Hereafter, boldfaces like p denote vectors, whose coordinates are denoted p_l for l = 1, 2, .... For any m ∈ N*, let [m] := {1, 2, ..., m}, Σ_m := {σ ∈ {−1, 1}^m} and X ⊆ R^d. Examples are couples (observation, label) ∈ X × Σ_1, sampled i.i.d. according to some unknown but fixed distribution D. Let S := {(x_i, y_i), i ∈ [m]} ∼ D^m denote a size-m sample. In Learning with Label Proportions (LLP), we do not observe S directly but S|y, which denotes S with labels removed; we are given its partition into n > 0 bags, S|y = ∪_j S_j, j ∈ [n], along with their respective label proportions π̂_j := P̂[y = +1 | S_j] and bag proportions p̂_j := m_j/m with m_j = card(S_j). (This generalizes to a cover of S, by copying examples among bags.) The "bag assignment function" that partitions S is unknown but fixed. In real-world domains, it would rather be known, e.g. state, gender, age band. A classifier is a function h : X → R, from a set of classifiers H. H_L denotes the set of linear classifiers, noted h_θ(x) := θ⊤x with θ ∈ X. A (surrogate) loss is a function F : R → R_+. We let F(S, h) := (1/m) Σ_i F(y_i h(x_i)) denote the empirical surrogate risk on S corresponding to loss F. For the sake of clarity, indexes i, j and k respectively refer to examples, bags and features.

The mean operator and its minimal sufficiency  We define the (empirical) mean operator as:

    μ_S := (1/m) Σ_i y_i x_i .    (1)

Algorithm 1 Laplacian Mean Map (LMM)
  Input: S_j, π̂_j, j ∈ [n]; γ > 0 (7); w (7); V (8); permissible φ (2); λ > 0
  Step 1: let B̃± ← argmin_{X ∈ R^{2n×d}} ℓ(L, X) using (7) (Lemma 2)
  Step 2: let μ̃_S ← Σ_j p̂_j (π̂_j b̃⁺_j − (1 − π̂_j) b̃⁻_j)
  Step 3: let θ̃* ← argmin_θ F_φ(S|y, θ, μ̃_S) + λ‖θ‖₂² (3)
  Return θ̃*

Table 1: Correspondence between permissible functions φ and the corresponding loss F_φ.

  loss name        | F_φ(x)            | −φ(x)
  logistic loss    | log(1 + exp(−x))  | −x log x − (1 − x) log(1 − x)
  square loss      | (1 − x)²          | x(1 − x)
  Matsushita loss  | −x + √(1 + x²)    | √(x(1 − x))

The estimation of the mean operator μ_S appears to be a learning bottleneck in the LLP setting [17]. The fact that the mean operator is sufficient to learn a classifier without the label information motivates the notion of minimal sufficient statistic for features in this context. Let F be a set of loss functions, H be a set of classifiers, I be a subset of features.
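The loss-splitting trick recalled earlier (a convex part depending only on the observations, plus a linear part involving the mean operator) is easy to check numerically in the logistic case. A minimal numpy sketch, with variable names of our own:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 5
X = rng.normal(size=(m, d))                    # observations
y = rng.choice([-1.0, 1.0], size=m)            # labels
theta = rng.normal(size=d)                     # any linear classifier

h = X @ theta
# Direct empirical logistic risk: (1/m) sum_i log(1 + exp(-y_i h_i))
risk = np.mean(np.logaddexp(0.0, -y * h))

# Mean operator, eq. (1): mu_S = (1/m) sum_i y_i x_i
mu = (y[:, None] * X).mean(axis=0)

# Split: log(1 + e^{-y h}) = log(2 cosh(h/2)) - y h / 2, so the risk is a
# label-free convex term plus a linear term in mu_S.
label_free = np.mean(np.logaddexp(h / 2.0, -h / 2.0))
assert abs(risk - (label_free - 0.5 * theta @ mu)) < 1e-9
```

The labels enter the risk only through μ_S, which is the phenomenon the sufficiency results below formalize for general symmetric proper scoring losses.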
Some quantity t(S) is said to be\na minimal suf\ufb01cient statistic for I with respect to F and H iff: for any F \u2208 F, any h \u2208 H and\nany two samples S and S(cid:48), the quantity F (S, h) \u2212 F (S(cid:48), h) does not depend on I iff t(S) = t(S(cid:48)).\nThis de\ufb01nition can be motivated from the one in statistics by building losses from log likelihoods.\nThe following Lemma motivates further the mean operator in the LLP setting, as it is the minimal\nsuf\ufb01cient statistic for a broad set of proper scoring losses that encompass the logistic and square\nlosses [18]. The proper scoring losses we consider, hereafter called \u201csymmetric\u201d (SPSL), are twice\ndifferentiable, non-negative and such that misclassi\ufb01cation cost is not label-dependent.\n\nLemma 1 \u00b5S is a minimal suf\ufb01cient statistic for the label variable, with respect to SPSL and HL.\n\n([19], Subsection 2.1) This property, very useful for LLP, may also be exploited in other weakly\nsupervised tasks [2]. Up to constant scalings that play no role in its minimization, the empirical\nsurrogate risk corresponding to any SPSL, F\u03c6(S, h), can be written with loss:\n\n.=\n\nF\u03c6(x)\n\n(2)\nand \u03c6 is a permissible function [20, 18], i.e. dom(\u03c6) \u2287 [0, 1], \u03c6 is strictly convex, differentiable and\nsymmetric with respect to 1/2. \u03c6(cid:63) is the convex conjugate of \u03c6. Table 1 shows examples of F\u03c6. 
It\nfollows from Lemma 1 and its proof, that any F\u03c6(S\u03b8), can be written for any \u03b8 \u2261 h\u03b8 \u2208 HL as:\n\nb\u03c6\n\n,\n\n.= a\u03c6 +\n\n\u03c6(0) + \u03c6(cid:63)(\u2212x)\n\u03c6(0) \u2212 \u03c6(1/2)\n\n\u03c6(cid:63)(\u2212x)\n\n(cid:32)(cid:88)\n\n(cid:88)\n\ni\n\n\u03c3\n\n(cid:33)\n\nF\u03c6(\u03c3\u03b8(cid:62)xi)\n\n\u03b8(cid:62)\u00b5S\n\n\u2212 1\n2\n\n.= F\u03c6(S|y, \u03b8, \u00b5S) ,\n\n(3)\n\nF\u03c6(S, \u03b8) =\n\nb\u03c6\n2m\n\nwhere \u03c3 \u2208 \u03a31.\n\nThe Laplacian Mean Map (LMM) algorithm The sum in eq. (3) is convex and differentiable\nin \u03b8. Hence, once we have an accurate estimator of \u00b5S, we can then easily \ufb01t \u03b8 to minimize\nF\u03c6(S|y, \u03b8, \u00b5S). This two-steps strategy is implemented in LMM in algorithm 1. \u00b5S can be retrieved\nfrom 2n bag-wise, label-wise unknown averages b\u03c3\nj :\n\n\u00b5S = (1/2)\n\n\u02c6pj\n\n(2\u02c6\u03c0j + \u03c3(1 \u2212 \u03c3))b\u03c3\nj ,\n\n(1/mj)(cid:80)\n\nwith b\u03c3\nj\n\n.= ES[x|\u03c3, j] denoting these 2n unknowns (for j \u2208 [n], \u03c3 \u2208 \u03a31), and let bj\nxi\u2208Sj\n\nj s are solution of a set of n identities that are (in matrix form):\n\nxi. The 2n b\u03c3\n\n(4)\n\n.=\n\n(5)\n\nn(cid:88)\n\nj=1\n\n(cid:88)\n\n\u03c3\u2208\u03a31\n\nB \u2212 \u03a0\n\n(cid:62)B\u00b1 = 0 ,\n\n3\n\n\fwhere B .= [b1|b2|...|bn](cid:62) \u2208 Rn\u00d7d, \u03a0\nthe matrix of unknowns:\n\n.= [DIAG( \u02c6\u03c0)|DIAG(1 \u2212 \u02c6\u03c0)](cid:62) \u2208 R2n\u00d7n and B\u00b1 \u2208 R2n\u00d7d is\n\nB\u00b1\n\n.=\n\n(cid:104)\n\n(cid:12)(cid:12)(cid:12) b-1\n(cid:124)\n\n(cid:124)\n\n(cid:123)(cid:122)\n(cid:125)\n2 |...|b+1\n1 |b+1\nb+1\n)(cid:62)\n(B+\n\nn\n\n(cid:123)(cid:122)\n(cid:125)\n2|...|b-1\n1|b-1\n)(cid:62)\n(B\u2013\n\nn\n\n(cid:105)(cid:62)\n\n.\n\n(6)\n\n\u00b1\n\nand \u03b3 > 0. 
Dw\n\n= arg minX\u2208R2n\u00d7d (cid:96)(L, X), with:\n\n(cid:62)X)(cid:1) + \u03b3tr(cid:0)X(cid:62)LX(cid:1) ,\n\n.= tr(cid:0)(B(cid:62) \u2212 X(cid:62)\n(cid:20) La\n\nSystem (5) is underdetermined, unless one makes the homogeneity assumption that yields the Mean\nMap estimator [17]. Rather than making such a restrictive assumption, we regularize the cost that\nbrings (5) with a manifold regularizer [21], and search for \u02dcB\n\u03a0)Dw(B \u2212 \u03a0\n\n(cid:96)(L, X)\n.= DIAG(w) is a user-\ufb01xed bias matrix with w \u2208 Rn\n\n(7)\n+,\u2217 (and w (cid:54)= \u02c6p in general) and:\n(8)\n.= D \u2212 V \u2208 Rn\u00d7n is the Laplacian of the bag similarities. V is a symmetric similarity\nwhere La\nj(cid:48) vjj(cid:48),\u2200j \u2208 [n].\nmatrix with non negative coordinates, and the diagonal matrix D satis\ufb01es djj\nThe size of the Laplacian is O(n2), which is very small compared to O(m2) if there are not many\nbags. One can interpret the Laplacian regularization as smoothing the estimates of b\u03c3\nj w.r.t the\nsimilarity of the respective bags.\n\n\u2208 R2n\u00d72n ,\n\n|\n0\n| La\n\n.= \u03b5I +\n\n.=(cid:80)\n=(cid:0)\u03a0Dw\u03a0(cid:62) + \u03b3L(cid:1)\u22121\n\nLemma 2 The solution \u02dcB\n\n\u00b1 to minX\u2208R2n\u00d7d (cid:96)(L, X) is \u02dcB\n\n\u00b1\n\n\u03a0DwB.\n\n(cid:21)\n\nL\n\n0\n\n([19], Subsection 2.2). This Lemma explains the role of penalty \u03b5I in (8) as \u03a0Dw\u03a0(cid:62) and L have\nrespectively n- and (\u2265 1)-dim null spaces, so the inversion may not be possible. Even when this does\nnot happen exactly, this may incur numerical instabilities in computing the inverse. For domains\nwhere this risk exists, picking a small \u03b5 > 0 solves the problem. 
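Lemma 2 is straightforward to verify numerically: the closed form zeroes the gradient of ℓ(L, X). The sketch below (our variable names) builds V from the Gaussian bag similarity of eq. (11), reading its exponent as −‖b_j − b_j′‖₂/s (one plausible reading of the garbled display), and then assembles μ̃_S as in Step 2 of Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 3
gamma, eps, s = 0.1, 1e-3, 1.0

pi = rng.uniform(0.2, 0.8, size=n)             # bag label proportions pi_hat_j
B = rng.normal(size=(n, d))                    # rows: observed bag means b_j
w = np.ones(n)                                 # bias matrix D_w = diag(w)

# Gaussian bag similarity, eq. (11): v_{jj'} = exp(-||b_j - b_j'||_2 / s)
dist = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=2)
V = np.exp(-dist / s)
La = eps * np.eye(n) + np.diag(V.sum(axis=1)) - V            # eq. (8)
L = np.block([[La, np.zeros((n, n))], [np.zeros((n, n)), La]])

Pi = np.vstack([np.diag(pi), np.diag(1 - pi)])               # 2n x n
Dw = np.diag(w)

# Lemma 2: the minimizer of l(L, X) in closed form
Btilde = np.linalg.solve(Pi @ Dw @ Pi.T + gamma * L, Pi @ Dw @ B)

# First-order optimality of eq. (7): the gradient vanishes at Btilde
grad = -2 * Pi @ Dw @ (B - Pi.T @ Btilde) + 2 * gamma * L @ Btilde
assert np.abs(grad).max() < 1e-8

# Algorithm 1, Step 2: mean-operator estimate from the 2n recovered rows
b_plus, b_minus = Btilde[:n], Btilde[n:]
p_hat = np.full(n, 1.0 / n)                    # toy equal-size bags
mu_tilde = (p_hat[:, None]
            * (pi[:, None] * b_plus - (1 - pi)[:, None] * b_minus)).sum(axis=0)
print(mu_tilde)
```

The ε-shift in eq. (8) is what makes the 2n×2n system positive definite here; with ε = 0 and a single bag the solve can fail, exactly as the remark after Lemma 2 warns.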
Let \u02dcb\u03c3\nj denote the row-wise\nfollowing (6), from which we compute \u02dc\u00b5S following (4) when we use these\ndecomposition of \u02dcB\nj ,\u2200j \u2208 [n] to our estimates\n2n estimates in lieu of the true b\u03c3\n\u02dc\u00b5j\n\n,\u2200j \u2208 [n], granted that \u00b5S =(cid:80)\n\nj \u2212 (1\u2212 \u02c6\u03c0j)b\u2212\n\nj . We compare \u00b5j\n\nj \u2212 (1 \u2212 \u02c6\u03c0j)\u02dcb\u2212\n\u02dcb+\n\n.= \u02c6\u03c0jb+\n\nj \u02c6pj \u02dc\u00b5j.\n\n.= \u02c6\u03c0j\n\n\u00b1\n\nj\n\nTheorem 3 Suppose that \u03b3 satis\ufb01es \u03b3\n[\u00b51|\u00b52|...|\u00b5n](cid:62) \u2208 Rn\u00d7d, \u02dcM .= [ \u02dc\u00b51| \u02dc\u00b52|...| \u02dc\u00b5n](cid:62) \u2208 Rn\u00d7d and \u03c2(V, B\u00b1)\nmaxj(cid:54)=j(cid:48) vjj(cid:48))2(cid:107)B\u00b1(cid:107)F . The following holds:\n(cid:107)M \u2212 \u02dcM(cid:107)F \u2264 \u221a\n\n(cid:18)\u221a\n\n2 min\n\n2 \u2264 ((\u03b5(2n)\u22121) + maxj(cid:54)=j(cid:48) vjj(cid:48))/ minj wj. Let M .=\n.= ((\u03b5(2n)\u22121) +\n\n(9)\n\nn\n\nw2\nj\n\nj\n\nj \u02c6pj\u00b5j and \u02dc\u00b5S =(cid:80)\n(cid:19)\u22121 \u00d7 \u03c2(V, B\u00b1) .\n\n\u221a\n\n(cid:18) ASSOC(Sj, Sj)\n\n([19], Subsection 2.3) The multiplicative factor to \u03c2 in (9) is roughly O(n5/2) when there is no large\ndiscrepancy in the bias matrix Dw, so the upperbound is driven by \u03c2(., .) when there are not many\nbags. We have studied its variations when the \u201cdistinguishability\u201d between bags increases. This\nsetting is interesting because in this case we may kill two birds in one shot, with the estimation of\nM and the subsequent learning problem potentially easier, in particular for linear separators. 
We\nconsider two examples for vjj(cid:48), the \ufb01rst being (half) the normalized association [22]:\n\n1\n2\n\nASSOC(Sj(cid:48), Sj(cid:48))\n\n+\n\nvnc\njj(cid:48)\n\nvG,s\njj(cid:48)\n\n= NASSOC(Sj, Sj(cid:48)) ,\n\n(10)\n\nASSOC(Sj, Sj \u222a Sj(cid:48))\n\nASSOC(Sj(cid:48), Sj \u222a Sj(cid:48))\n\nHere, ASSOC(Sj, Sj(cid:48)) .= (cid:80)\n\n.=\n.= exp(\u2212(cid:107)bj \u2212 bj(cid:48)(cid:107)2/s) , s > 0 .\nj(cid:48) (cid:107)x \u2212 x(cid:48)(cid:107)2\n\n(11)\n2 [22]. To put these two similarity measures in\nthe context of Theorem 3, consider the setting where we can make assumption (D1) that there\nexists a small constant \u03ba > 0 such that (cid:107)bj \u2212 bj(cid:48)(cid:107)2\n2,\u2200j, j(cid:48) \u2208 [n]. This is a\nweak distinguishability property as if no such \u03ba exists, then the centers of distinct bags may just\nbe confounded. Consider also the additional assumption, (D2), that there exists \u03ba(cid:48) > 0 such that\ni\u2208Sj (cid:107)xi \u2212 xi(cid:48)(cid:107)2 is a bag\u2019s diameter. In the following\nmaxj d2\nLemma, the little-oh notation is with respect to the \u201clargest\u201d unknown in eq. (4), i.e. 
max\u03c3,j (cid:107)b\u03c3\nj (cid:107)2.\n\nj \u2264 \u03ba(cid:48),\u2200j \u2208 [n], where dj\n\n2 \u2265 \u03ba max\u03c3,j (cid:107)b\u03c3\n\n.= maxxi,x(cid:48)\n\nx\u2208Sj ,x(cid:48)\u2208S\n\nj (cid:107)2\n\n(cid:19)\n\n4\n\n\fAlgorithm 2 Alternating Mean Map (AMMOPT)\n\nInput LMM parameters + optimization strategy OPT \u2208 {min, max} + convergence predicate PR\nStep 1 : let \u02dc\u03b80 \u2190 LMM(LMM parameters) and t \u2190 0\nStep 2 : repeat\n\nStep 2.1 : let \u03c3t \u2190 arg OPT\u03c3\u2208\u03a3 \u02c6\u03c0 F\u03c6(S|y, \u03b8t, \u00b5S(\u03c3))\nStep 2.2 : let \u02dc\u03b8t+1 \u2190 arg min\u03b8 F\u03c6(S|y, \u03b8, \u00b5S(\u03c3t)) + \u03bb(cid:107)\u03b8(cid:107)2\nStep 2.3 : let t \u2190 t + 1\nuntil predicate PR is true\n\n2\n\nReturn \u02dc\u03b8\u2217 .= arg mint F\u03c6(S|y, \u02dc\u03b8t+1, \u00b5S(\u03c3t))\n\nLemma 4 There exists \u03b5\u2217 > 0 such that \u2200\u03b5 \u2264 \u03b5\u2217, the following holds: (i) \u03c2(Vnc, B\u00b1) = o(1) under\nassumptions (D1 + D2); (ii) \u03c2(VG,s, B\u00b1) = o(1) under assumption (D1), \u2200s > 0.\n\n([19], Subsection 2.4) Hence, provided a weak (D1) or stronger (D1+D2) distinguishability assump-\ntion holds, the divergence between M and \u02dcM gets smaller with the increase of the norm of the\nunknowns b\u03c3\nj . The proof of the Lemma suggests that the convergence may be faster for VG,s. The\nfollowing Lemma shows that both similarities also partially encode the hardness of solving the clas-\nsi\ufb01cation problem with linear separators, so that the manifold regularizer \u201climits\u201d the distortion of\nthe \u02dcb\u00b1\nLemma 5 Take vjj(cid:48) \u2208 {vG,.\nSj, Sj(cid:48) are not linearly separable, and if vjj(cid:48) < \u03bal then Sj, Sj(cid:48) are linearly separable.\n\njj(cid:48)}. There exists 0 < \u03bal < \u03ban < 1 such that (i) if vjj(cid:48) > \u03ban then\n\n. 
s between two bags that tend not to be linearly separable.\n\njj(cid:48) , vnc\n\n([19], Subsection 2.5) This Lemma is an advocacy to \ufb01t s in a data-dependent way in vG,s\njj(cid:48) . The\nquestion may be raised as to whether \ufb01nite samples approximation results like Theorem 3 can be\nproven for the Mean Map estimator [17]. [19], Subsection 2.6 answers by the negative.\nIn the Laplacian Mean Map algorithm (LMM, Algorithm 1), Steps 1 and 2 have now been described.\nStep 3 is a differentiable convex minimization problem for \u03b8 that does not use the labels, so it does\nnot present any technical dif\ufb01culty. An interesting question is how much our classi\ufb01er \u02dc\u03b8\u2217 in Step 3\ndiverges from the one that would be computed with the true expression for \u00b5S, \u03b8\u2217. It is not hard to\nshow that Lemma 17 in Altun and Smola [23], and Corollary 9 in Quadrianto et al. [17] hold for\nLMM so that (cid:107) \u02dc\u03b8\u2217 \u2212 \u03b8\u2217(cid:107)2\n2. The following Theorem shows a data-dependent\n\u2217 xi \u2208 \u03c6(cid:48)([0, 1]),\u2200i\napproximation bound that can be signi\ufb01cantly better, when it holds that \u03b8(cid:62)\n(\u03c6(cid:48) is the \ufb01rst derivative). We call this setting proper scoring compliance (PSC) [18]. PSC always\nholds for the logistic and Matsushita losses for which \u03c6(cid:48)([0, 1]) = R. For other losses like the square\nloss for which \u03c6(cid:48)([0, 1]) = [\u22121, 1], shrinking the observations in a ball of suf\ufb01ciently small radius\nis suf\ufb01cient to ensure this.\nTheorem 6 Let fk \u2208 Rm denote the vector encoding the kth feature variable in S : fki = xik\n(k \u2208 [d]). Let \u02dcF denote the feature matrix with column-wise normalized feature vectors: \u02dcfk\n.=\n2, with:\n\n2)(d\u22121)/(2d)fk. 
Under PSC, we have (cid:107) \u02dc\u03b8\u2217 \u2212 \u03b8\u2217(cid:107)2\n\n2 \u2264 (2\u03bb + q)\u22121(cid:107) \u02dc\u00b5S \u2212 \u00b5S(cid:107)2\n\n2 \u2264 (2\u03bb)\u22121(cid:107) \u02dc\u00b5S \u2212 \u00b5S(cid:107)2\n\n(d/(cid:80)\n\n\u2217 xi, \u02dc\u03b8(cid:62)\n\nk(cid:48) (cid:107)fk(cid:48)(cid:107)2\n\nq\n\n.=\n\n(cid:62)\u02dcF\n\n\u00d7\n\ndet \u02dcF\nm\n\n2e\u22121\n\nb\u03c6\u03c6(cid:48)(cid:48) (\u03c6(cid:48)\u22121(q(cid:48)/\u03bb))\n\n(> 0) ,\n\n(12)\n\nfor some q(cid:48) \u2208 I .= [\u00b1(x\u2217 + max{(cid:107)\u00b5S(cid:107)2,(cid:107) \u02dc\u00b5S(cid:107)2})]. Here, x\u2217 .= maxi (cid:107)xi(cid:107)2 and \u03c6(cid:48)(cid:48)\n\n.= (\u03c6(cid:48))(cid:48).\n\n([19], Subsection 2.7) To see how large q can be, consider the simple case where all eigenvalues of\n(cid:62)\u02dcF, \u03bbk(\u02dcF\n(cid:62)\u02dcF) \u2208 [\u03bb\u25e6 \u00b1 \u03b4] for small \u03b4. In this case, q is proportional to the average feature \u201cnorm\u201d:\n\u02dcF\n\ntr(cid:0)F(cid:62)F(cid:1)\n\nmd\n\n(cid:62)\u02dcF\n\ndet \u02dcF\nm\n\n=\n\n(cid:80)\n\ni (cid:107)xi(cid:107)2\nmd\n\n2\n\n+ o(\u03b4) .\n\n+ o(\u03b4) =\n\n5\n\n\f.= {\u03c3 \u2208 \u03a3m :(cid:80)\n\ni:xi\u2208Sj\n\n\u00b5S(\u03c3) .= (1/m)(cid:80)\n\nThe Alternating Mean Map (AMM) algorithm Let us denote \u03a3 \u02c6\u03c0\n\u03c3i =\n(2\u02c6\u03c0j \u2212 1)mj,\u2200j \u2208 [n]} the set of labelings that are consistent with the observed proportions \u02c6\u03c0, and\ni \u03c3ixi the biased mean operator computed from some \u03c3 \u2208 \u03a3 \u02c6\u03c0. Notice that the\ntrue mean operator \u00b5S = \u00b5S(\u03c3) for at least one \u03c3 \u2208 \u03a3 \u02c6\u03c0. The Alternating Mean Map algorithm,\n(AMM, Algorithm 2), starts with the output of LMM and then optimizes it further over the set of\nconsistent labelings. 
At each iteration, it \ufb01rst picks a consistent labeling in \u03a3 \u02c6\u03c0 that is the best (OPT\n= min) or the worst (OPT = max) for the current classi\ufb01er (Step 2.1) and then \ufb01ts a classi\ufb01er \u02dc\u03b8 on the\ngiven set of labels (Step 2.2). The algorithm then iterates until a convergence predicate is met, which\ntests whether the difference between two values for F\u03c6(., ., .) is too small (AMMmin), or the number\nof iterations exceeds a user-speci\ufb01ed limit (AMMmax). The classi\ufb01er returned \u02dc\u03b8\u2217 is the best in the\nsequence. In the case of AMMmin, it is the last of the sequence as risk F\u03c6(S|y, ., .) cannot increase.\nAgain, Step 2.2 is a convex minimization with no technical dif\ufb01culty. Step 2.1 is combinatorial. It\ncan be solved in time almost linear in m [19] (Subsection 2.8).\n\nLemma 7 The running time of Step 2.1 in AMM is \u02dcO(m), where the tilde notation hides log-terms.\n\nBag-Rademacher generalization bounds for LLP We relate the \u201cmin\u201d and \u201cmax\u201d strategies of\nAMM by uniform convergence bounds involving the true surrogate risk, i.e. integrating the unknown\ndistribution D and the true labels (which we may never know). Previous uniform convergence\nbounds for LLP focus on coarser grained problems, like the estimation of label proportions [1].\nWe rely on a LLP generalization of Rademacher complexity [24, 25]. Let F : R \u2192 R+ be a\nloss function and H a set of classi\ufb01ers. The bag empirical Rademacher complexity of sample S,\nES[\u03c3(x)F (\u03c3(cid:48)(x)h(x))]. The usual empirical\nRb\nm for card(\u03a3 \u02c6\u03c0) = 1. 
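For intuition on this card(Σ_π̂) = 1 reduction: for the linear class {x ↦ θ⊤x : ‖θ‖₂ ≤ 1}, the usual empirical Rademacher complexity has the standard closed form E_σ‖(1/m) Σ_i σ_i x_i‖₂, which a quick Monte Carlo confirms sits below Jensen's √(Σ_i ‖x_i‖²)/m bound (a sketch, our code, not tied to the bag variant):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 50, 10
X = rng.normal(size=(m, d))

# For {x -> theta^T x : ||theta||_2 <= 1},
# sup_theta (1/m) sum_i sigma_i theta^T x_i = ||(1/m) sum_i sigma_i x_i||_2,
# so the usual empirical Rademacher complexity is an expectation of norms.
sigmas = rng.choice([-1.0, 1.0], size=(20_000, m))   # Monte Carlo over sigma
estimate = np.linalg.norm(sigmas @ X, axis=1).mean() / m

# Jensen: E||v||_2 <= sqrt(E||v||_2^2) = sqrt(sum_i ||x_i||^2) / m
jensen = np.sqrt((X ** 2).sum()) / m
print(estimate, jensen)   # the estimate sits just below the O(1/sqrt(m)) bound
```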
The Label Proportion Complexity of H is:\nRademacher complexity equals Rb\n.= ED2m\nEI/2\n(13)\n\nES[\u03c31(x)(\u02c6\u03c0s|2(x) \u2212 \u02c6\u03c0(cid:96)|1(x))h(x)] .\n\n.= E\u03c3\u223c\u03a3m suph\u2208H{E\u03c3(cid:48)\u223c\u03a3 \u02c6\u03c0\n\nm, is de\ufb01ned as Rb\nm\n\nL2m\n\n1 ,I/2\n\n2\n\nsup\nh\u2208H\n\nl , l = 1, 2 is a random (uniformly) subset of [2m] of cardinal m. Let S(I/2\n\nl ) be the\nthen\nl ). Else, \u02c6\u03c0s|2(xi) is its bag\u2019s\n2 ) and \u02c6\u03c0(cid:96)|1(xi) is its label (i.e. a bag\u2019s label proportion that would\n1 ) \u2212 1 \u2208 \u03a31. L2m tends to be all the smaller as\n\nHere, each of I/2\nsize-m subset of S that corresponds to the indexes. Take l = 1, 2 and any xi \u2208 S. If i (cid:54)\u2208 I/2\n\u02c6\u03c0s|l(xi) = \u02c6\u03c0(cid:96)|l(xi) is xi\u2019s bag\u2019s label proportion measured on S\\S(I/2\nlabel proportion measured on S(I/2\ncontain only xi). Finally, \u03c31(x) .= 2 \u00d7 1x\u2208S(I/2\nclassi\ufb01ers in H have small magnitude on bags whose label proportion is close to 1/2.\nTheorem 8 Suppose \u2203h\u2217 \u2265 0 s.t. |h(x)| \u2264 h\u2217,\u2200x,\u2200h. Then, for any loss F\u03c6, any training sample\nof size m and any 0 < \u03b4 \u2264 1, with probability > 1 \u2212 \u03b4, the following bound holds over all h \u2208 H:\nED[F\u03c6(yh(x))] \u2264 E\u03a3 \u02c6\u03c0\n\nES[F\u03c6(\u03c3(x)h(x))] + 2Rb\n\n(cid:19)(cid:114) 1\n\n(cid:18) 2h\u2217\n\nm + L2m + 4\n\n.(14)\n\n+ 1\n\nlog\n\nl\n\nb\u03c6\n\n2m\n\n2\n\u03b4\n\nFurthermore, under PSC (Theorem 6), we have for any F\u03c6:\n\nm \u2264 2b\u03c6E\u03a3m sup\nRb\nh\u2208H\n\n{ES[\u03c3(x)(\u02c6\u03c0(x) \u2212 (1/2))h(x)]} .\n\n(15)\n\n([19], Subsection 2.9) Despite similar shapes (13) (15), Rb\nare pure (\u02c6\u03c0j \u2208 {0, 1},\u2200j), L2m = 0. 
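The pure vs. impure contrast also shows up in the size of Σ_π̂ itself: each bag contributes a binomial factor, so pure bags pin down a unique consistent labeling while maximally impure ones make the set explode. A quick count (function name ours):

```python
from math import comb, prod

def card_sigma_pi(bag_sizes, pi_hat):
    """card(Sigma_pi_hat): per bag, choose which pi_j * m_j examples get +1."""
    return prod(comb(m_j, round(p * m_j)) for m_j, p in zip(bag_sizes, pi_hat))

# Pure bags (pi_j in {0, 1}): a single consistent labeling.
print(card_sigma_pi([10, 10], [0.0, 1.0]))   # 1
# Maximally impure bags (pi_j = 1/2): C(10, 5)^2 = 63504 labelings.
print(card_sigma_pi([10, 10], [0.5, 0.5]))   # 63504
```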
When bags are impure (\u02c6\u03c0j = 1/2,\u2200j), Rb\nimpure, the bag-empirical surrogate risk, E\u03a3 \u02c6\u03c0\nand AMMmax respectively minimize a lowerbound and an upperbound of this risk.\n\nm and L2m behave differently: when bags\nm = 0. As bags get\nES[F\u03c6(\u03c3(x)h(x))], also tends to increase. AMMmin\n\n3 Experiments\n\nAlgorithms We compare LMM, AMM (F\u03c6 = logistic loss) to the original MM [17], InvCal [11], conv-\n\u221dSVM and alter-\u221dSVM [16] (linear kernels). To make experiments extensive, we test several ini-\ntializations for AMM that are not displayed in Algorithm 2 (Step 1): (i) the edge mean map estimator,\n.= 1 (AMM1), and \ufb01nally\n\u02dc\u00b5EMM\nAMM10ran which runs 10 random initial models ((cid:107)\u03b80(cid:107)2 \u2264 1), and selects the one with smallest risk;\nS\n\ni xi) (AMMEMM), (ii) the constant estimator \u02dc\u00b51\n\n.= 1/m2((cid:80)\n\ni yi)((cid:80)\n\nS\n\n6\n\n\f(a)\n\n(b)\n\n(c)\n\n(d)\n\nFigure 1: Relative AUC (wrt MM) as homogeneity assumption is violated (a). Relative AUC (wrt\nOracle) vs entropy on heart for LMM(b), AMMmin(c). Relative AUC vs n/m for AMMmin\n\nG,s (d).\n\nTable 2: Small domains results. #win/#lose for row vs column. Bold faces means p-val < .001 for\nWilcoxon signed-rank tests. Top-left subtable is for one-shot methods, bottom-right iterative ones,\nbottom-left compare the two. Italic is state-of-the-art. 
Grey cells highlight the best of all (AMMmin\nG ).\n\nalgorithm\n\nMM\n\nM G\nG,s\nM\nL\nnc\nInvCal\nMM\nG\nG,s\n10ran\n\nM\nM\nA\n\nn\ni\nm\n\nx MM\na\nm\n\nM\nM\nA\n\nG\nG,s\n10ran\nM conv-\u221d\nalter-\u221d\n\nV\nS\n\n36/4\n38/3\n28/12\n4/46\n33/16\n38/11\n35/14\n27/22\n25/25\n27/23\n25/25\n23/27\n21/29\n0/50\n\nG\n\n30/6\n3/37\n3/47\n26/24\n35/14\n33/17\n24/26\n23/27\n22/28\n21/29\n21/29\n2/48\n0/50\n\nLMM\nG,s\n\nnc\n\nInvCal\n\nMM\n\nAMMmin\nG\n\nG,s\n\n10ran\n\nMM\n\nAMMmax\nG\n\nG,s\n\n10ran\n\nconv-\n\u221dSVM\n\n2/37\n4/46\n25/25\n30/20\n30/20\n22/28\n22/28\n21/28\n22/28\n19/31\n2/48\n0/50\n\n4/46\n32/18\n37/13\n35/15\n26/24\n25/25\n26/24\n24/26\n24/26\n2/48\n0/50\n\n46/4\n47/3\n47/3\n44/6\n45/5\n45/5\n45/5\n50/0\n2/48\n20/30\n\n31/7\n24/11\n20/30\n15/35\n17/33\n15/35\n19/31\n4/46\n0/50\n\n7/15\n16/34\n13/37\n14/36\n13/37\n15/35\n3/47\n0/50\n\n(cid:46) e.g. AMMmin\n\nG,s wins on AMMmin\n\nG 7 times, loses 15, with 28 ties\n\n19/31\n13/37\n14/36\n13/37\n17/33\n3/47\n0/50\n\n8/42\n10/40\n12/38\n7/43\n4/46\n3/47\n\n13/14\n15/22\n19/30\n3/47\n3/47\n\n16/22\n20/29\n3/47\n2/48\n\n17/32\n4/46\n1/49\n\n0/50\n0/50\n\n27/23\n\nthis is the same procedure of alter-\u221dSVM. Matrix V (eqs. (10), (11)) used is indicated in subscript:\nLMM/AMMG, LMM/AMMG,s, LMM/AMMnc respectively denote vG,s with s = 1, vG,s with s learned\non cross validation (CV; validation ranges indicated in [19]) and vnc. For space reasons, results\nnot displayed in the paper can be found in [19], Section 3 (including runtime comparisons, and de-\ntailed results by domain). We split the algorithms in two groups, one-shot and iterative. The latter,\nincluding AMM, (conv/alter)-\u221dSVM, iteratively optimize a cost over labelings (always consistent\nwith label proportions for AMM, not always for (conv/alter)-\u221dSVM). The former (LMM, InvCal) do\nnot and are thus much faster. Tests are done on a 4-core 3.2GHz CPUs Mac with 32GB of RAM.\nAMM/LMM/MM are implemented in R. 
Code for InvCal and ∝SVM is from [16].
Simulated domains, MM and the homogeneity assumption The testing metric is the AUC. Prior to testing on our domains, we generate 16 domains that gradually move the bσj away from each other (wrt j), thus increasingly violating the homogeneity assumption [17]. The degree of violation is measured as ‖B± − B̃±‖F, where B̃± is the homogeneity assumption matrix, which replaces all bσj by bσ for σ ∈ {−1, 1}, see eq. (5). Figure 1 (a) displays the ratio of the AUC of LMM to the AUC of MM. It shows that LMM fares all the better with respect to MM as the homogeneity assumption is violated. Furthermore, learning s in LMM improves the results. Experiments on the simulated domain of [16], on which MM obtains zero accuracy, also display that our algorithms perform better (a single iteration of AMMmax brings 100% AUC).
Small and large domains experiments We convert 10 small domains [19] (m ≤ 1000) and 4 bigger ones (m > 8000) from UCI [26] into the LLP framework. We cast the problem to one-against-all classification when it is multiclass. On large domains, the bag assignment function is inspired by [1]: we craft bags according to the values of a selected feature, and then remove that feature from the data. This conforms to the idea that bag assignment is structured and non-random in real-world problems. Most of our small domains, however, do not have many features, so instead of clustering on one feature and then discarding it, we run K-MEANS on the whole data to make the bags, for K = n ∈ 2^[5].
Small domains results We perform 5-fold nested CV comparisons on the 10 domains, i.e. 50 AUC values for each algorithm. Table 2 synthesises the results [19], splitting one-shot and iterative algorithms. LMMG,s outperforms all one-shot algorithms. LMMG and LMMG,s are competitive with many iterative algorithms, but lose against their AMM counterparts, which proves that the additional optimization over labels is beneficial. AMMG and AMMG,s are confirmed as the best variants of AMM, the first being the best in this case. Surprisingly, all mean map algorithms, even one-shot ones, are clearly superior to ∝SVMs. Further results [19] reveal that ∝SVM performances are dampened by learning classifiers with "inverted polarity": flipping the sign of the classifier improves its performances. Figure 1 (b, c) presents the AUC relative to the Oracle (which learns the classifier knowing all labels and minimizing the logistic loss), as a function of the Gini entropy of the bag assignment, gini(S) .= 4Ej[π̂j(1 − π̂j)]. For an entropy close to 1, we were expecting a drop in performances. The unexpected [19] is that on some domains, large entropies (≥ .8) do not prevent AMMmin from competing with the Oracle. No such pattern clearly emerges for ∝SVM and AMMmax [19].

Table 3: AUCs on big domains (name: #instances×#features). I=cap-shape, II=habitat, III=cap-colour, IV=race, V=education, VI=country, VII=poutcome, VIII=job (number of bags in parentheses); for each feature, the best result over one-shot, and over iterative, algorithms is bold faced.

              mushroom: 8124×108      adult: 48842×89         marketing: 45211×41     census: 299285×381
algorithm     III(10)  I(6)    II(7)  IV(5)   V(16)   VI(42)  V(4)    VII(4) VIII(12) IV(5)  VIII(9) VI(42)
EMM           76.68    55.61   59.80  43.91   47.50   66.61   63.49   54.50   44.31   56.05   56.25   57.87
MM             5.02    51.99   98.79  80.93   76.65   74.01   54.64   50.71   49.70   75.21   90.37   75.52
LMMG          14.70    73.92   98.57  81.79   78.40   78.78   54.66   51.00   51.93   75.80   71.75   76.31
LMMG,s        89.43    94.91   98.24  84.89   78.94   80.12   49.27   51.00   65.81   84.88   60.71   69.74
AMMmin-EMM    69.43    85.12   99.45  49.97   56.98   70.19   61.39   55.73   43.10   87.86   87.71   40.80
AMMmin-MM     15.74    89.81   99.01  83.73   77.39   80.67   52.85   75.27   58.19   89.68   84.91   68.36
AMMmin-G      89.18    50.44   99.45  83.41   82.55   81.96   51.61   75.16   57.52   87.61   88.28   76.99
AMMmin-G,s     3.28    89.24   99.57  81.18   78.53   81.96   52.03   75.16   53.98   89.93   83.54   52.13
AMMmin-1      95.90    97.31   98.49  81.32   75.80   80.05   65.13   64.96   66.62   89.09   88.94   56.72
AMMmax-EMM    26.67    93.04    3.32  54.46   69.63   56.62   51.48   55.63   57.48   71.20   77.14   66.71
AMMmax-MM     99.70    59.45   55.16  82.57   71.63   81.39   48.46   51.34   56.90   50.75   66.76   58.67
AMMmax-G      99.30    95.50   65.32  82.75   72.16   81.39   50.58   47.27   34.29   48.32   67.54   77.46
AMMmax-G,s    84.26    95.84   65.32  82.69   70.95   81.39   66.88   47.27   34.29   80.33   74.45   52.70
AMMmax-1      95.01     1.29   73.48  75.22   67.52   77.67   66.70   61.16   71.94   57.97   81.07   53.42
Oracle        99.80    99.82   99.81  90.55   90.55   90.50   79.52   75.55   79.43   94.31   94.37   94.45

Big domains results We adopt a 1/5 hold-out method. Scalability results [19] display that every method using vnc, as well as ∝SVM, does not scale to big domains; in particular, the estimated time for a single run of alter-∝SVM is >100 hours on the adult domain. Table 3 presents the results on the big domains, distinguishing the feature used for bag assignment. Big domains confirm the efficiency of LMM+AMM.
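As a concrete recap of the experimental protocol above, the bag-assignment function (one bag per value of a selected feature, which is then removed from the data) and the Gini entropy gini(S) .= 4Ej[π̂j(1 − π̂j)] can be sketched as follows. This is our own toy illustration, not the paper's R code; function names are ours, and we take Ej to be a uniform average over bags:

```python
import numpy as np

def make_bags(X, y, feature):
    """Bag assignment inspired by [1]: one bag per value of a chosen
    feature, which is then removed from the data. Labels y are in {0, 1};
    in LLP, only the per-bag label proportions pi_hat would be retained."""
    _, bag_of = np.unique(X[:, feature], return_inverse=True)
    n_bags = bag_of.max() + 1
    pi_hat = np.array([y[bag_of == j].mean() for j in range(n_bags)])
    X_reduced = np.delete(X, feature, axis=1)   # drop the bagging feature
    return X_reduced, bag_of, pi_hat

def gini_entropy(pi_hat):
    """gini(S) = 4 * E_j[pi_hat_j * (1 - pi_hat_j)], E_j uniform over bags:
    0 for pure bags (pi_hat_j in {0, 1}), 1 for fully impure ones (1/2)."""
    return float(4.0 * np.mean(pi_hat * (1.0 - pi_hat)))

# Toy usage: 8 observations, bagging feature in column 0 with two values.
X = np.array([[0., 1.0], [0., 2.0], [0., 0.5], [0., 1.5],
              [1., 3.0], [1., 2.5], [1., 3.5], [1., 4.0]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 1])
Xr, bag_of, pi_hat = make_bags(X, y, feature=0)
print(pi_hat)                 # [0.75 0.25]
print(gini_entropy(pi_hat))   # 0.75
```

With both bags at proportions 0.75 and 0.25, gini(S) = 4 · ½(0.1875 + 0.1875) = 0.75, i.e. a rather impure bag assignment.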
No approach clearly outperforms the rest, although LMMG,s is often the best one-shot method.
Synthesis Figure 1 (d) gives the AUCs of AMMmin-G relative to the Oracle on all domains [19], as a function of the "degree of supervision", n/m (= 1 if the problem is fully supervised). Noticeably, on 90% of the runs, AMMmin-G gets an AUC representing at least 70% of the Oracle's. Results on big domains can be remarkable: on the census domain with bag assignment on race, 5 proportions are sufficient for an AUC 5 points below the Oracle's (which learns with 200K labels).

4 Conclusion

In this paper, we have shown that efficient learning in the LLP setting is possible for general loss functions, via the mean operator and without resorting to the homogeneity assumption. Through its estimation, the sufficiency allows one to resort to standard learning procedures for binary classification, practically implementing a reduction between machine learning problems [27]; hence, mean operator estimation may be a viable shortcut to tackle other weakly supervised settings [2, 3, 4, 5]. Approximation results and generalization bounds are provided. Experiments display results that are superior to the state of the art, with algorithms that scale to big domains at affordable computational costs. Performances sometimes compete with the Oracle's (which learns knowing all labels), even on big domains. Such experimental findings have severe implications for the reliability of privacy-preserving aggregation techniques based on simple group statistics like proportions.

Acknowledgments

NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program. The first author would like to acknowledge that part of this research was conducted during his internship at the Commonwealth Bank of Australia. We thank A.
Menon and D. García-García for useful discussions.

References

[1] F.-X. Yu, S. Kumar, T. Jebara, and S.-F. Chang. On learning with label proportions. CoRR, abs/1402.5902, 2014.
[2] T.-G. Dietterich, R.-H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89:31-71, 1997.
[3] G.-S. Mann and A. McCallum. Generalized expectation criteria for semi-supervised learning of conditional random fields. In 46th ACL, 2008.
[4] J. Graça, K. Ganchev, and B. Taskar. Expectation maximization and posterior constraints. In NIPS*20, pages 569-576, 2007.
[5] P. Liang, M.-I. Jordan, and D. Klein. Learning from measurements in exponential families. In 26th ICML, pages 641-648, 2009.
[6] D.-J. Musicant, J.-M. Christensen, and J.-F. Olson. Supervised learning by training on aggregate outputs. In 7th ICDM, pages 252-261, 2007.
[7] J. Hernández-González, I. Inza, and J.-A. Lozano. Learning bayesian network classifiers from label proportions. Pattern Recognition, 46(12):3425-3440, 2013.
[8] M. Stolpe and K. Morik. Learning from label proportions by optimizing cluster model selection. In 15th ECMLPKDD, pages 349-364, 2011.
[9] B.-C. Chen, L. Chen, R. Ramakrishnan, and D.-R. Musicant. Learning from aggregate views. In 22nd ICDE, pages 3-3, 2006.
[10] J. Wojtusiak, K. Irvin, A. Birerdinc, and A.-V. Baranova. Using published medical results and non-homogenous data in rule learning. In 10th ICMLA, pages 84-89, 2011.
[11] S. Rüping. SVM classifier estimation from group probabilities. In 27th ICML, pages 911-918, 2010.
[12] K. Hendrik and N. de Freitas. Learning about individuals from group statistics. In 21st UAI, pages 332-339, 2005.
[13] S. Chen, B. Liu, M. Qian, and C. Zhang.
Kernel k-means based framework for aggregate outputs classification. In 9th ICDMW, pages 356-361, 2009.
[14] K.-T. Lai, F.-X. Yu, M.-S. Chen, and S.-F. Chang. Video event detection by inferring temporal instance labels. In 11th CVPR, 2014.
[15] K. Fan, H. Zhang, S. Yan, L. Wang, W. Zhang, and J. Feng. Learning a generative classifier from label proportions. Neurocomputing, 139:47-55, 2014.
[16] F.-X. Yu, D. Liu, S. Kumar, T. Jebara, and S.-F. Chang. ∝SVM for learning with label proportions. In 30th ICML, pages 504-512, 2013.
[17] N. Quadrianto, A.-J. Smola, T.-S. Caetano, and Q.-V. Le. Estimating labels from label proportions. JMLR, 10:2349-2374, 2009.
[18] R. Nock and F. Nielsen. Bregman divergences and surrogates for learning. IEEE Trans. PAMI, 31:2048-2059, 2009.
[19] G. Patrini, R. Nock, P. Rivera, and T.-S. Caetano. (Almost) no label no cry - supplementary material. In NIPS*27, 2014.
[20] M.-J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. In 28th ACM STOC, pages 459-468, 1996.
[21] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7:2399-2434, 2006.
[22] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22:888-905, 2000.
[23] Y. Altun and A.-J. Smola. Unifying divergence minimization and statistical inference via convex duality. In 19th COLT, pages 139-153, 2006.
[24] P.-L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. JMLR, 3:463-482, 2002.
[25] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. of Stat., 30:1-50, 2002.
[26] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[27] A.
Beygelzimer, V. Dani, T. Hayes, J. Langford, and B. Zadrozny. Error limiting reductions between classification tasks. In 22nd ICML, pages 49-56, 2005.