{"title": "Fundamental Limits of Online and Distributed Algorithms for Statistical Learning and Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 163, "page_last": 171, "abstract": "Many machine learning approaches are characterized by information constraints on how they interact with the training data. These include memory and sequential access constraints (e.g. fast first-order methods to solve stochastic optimization problems); communication constraints (e.g. distributed learning); partial access to the underlying data (e.g. missing features and multi-armed bandits) and more. However, currently we have little understanding how such information constraints fundamentally affect our performance, independent of the learning problem semantics. For example, are there learning problems where any algorithm which has small memory footprint (or can use any bounded number of bits from each example, or has certain communication constraints) will perform worse than what is possible without such constraints? In this paper, we describe how a single set of results implies positive answers to the above, for several different settings.", "full_text": "Fundamental Limits of Online and Distributed\n\nAlgorithms for Statistical Learning and Estimation\n\nOhad Shamir\n\nWeizmann Institute of Science\n\nohad.shamir@weizmann.ac.il\n\nAbstract\n\nMany machine learning approaches are characterized by information constraints\non how they interact with the training data. These include memory and sequential\naccess constraints (e.g. fast \ufb01rst-order methods to solve stochastic optimization\nproblems); communication constraints (e.g. distributed learning); partial access\nto the underlying data (e.g. missing features and multi-armed bandits) and more.\nHowever, currently we have little understanding how such information constraints\nfundamentally affect our performance, independent of the learning problem se-\nmantics. 
For example, are there learning problems where any algorithm which has a small memory footprint (or can use only a bounded number of bits from each example, or has certain communication constraints) will perform worse than what is possible without such constraints? In this paper, we describe how a single set of results implies positive answers to the above, for several different settings.

1 Introduction

Information constraints play a key role in machine learning. Of course, the main constraint is the availability of only a finite data set to learn from. However, many current problems in machine learning can be characterized as learning with additional information constraints, arising from the manner in which the learner may interact with the data. Some examples include:

• Communication constraints in distributed learning: There has been much recent work on learning when the training data is distributed among several machines. Since the machines may work in parallel, this potentially allows significant computational speed-ups and the ability to cope with large datasets. On the flip side, communication rates between machines are typically much slower than their processing speeds, and a major challenge is to perform these learning tasks with minimal communication.

• Memory constraints: The standard implementation of many common learning tasks requires memory which is super-linear in the data dimension. For example, principal component analysis (PCA) requires us to estimate eigenvectors of the data covariance matrix, whose size is quadratic in the data dimension and can be prohibitive for high-dimensional data. Another example is kernel learning, which requires manipulation of the Gram matrix, whose size is quadratic in the number of data points. There has been considerable effort in developing and analyzing algorithms for such problems with reduced memory footprint (e.g. 
[20, 7, 27, 24]).

• Online learning constraints: The need for fast and scalable learning algorithms has popularised the use of online algorithms, which work by sequentially going over the training data and incrementally updating a (usually small) state vector. Well-known special cases include gradient descent and mirror descent algorithms. The requirement of sequentially passing over the data can be seen as a type of information constraint, whereas the small state these algorithms often maintain can be seen as another type of memory constraint.

• Partial-information constraints: A common situation in machine learning is when the available data is corrupted, sanitized (e.g. due to privacy constraints), has missing features, or is otherwise only partially accessible. There has also been considerable interest in online learning with partial information, where the learner only gets partial feedback on its performance. This has been used to model various problems in web advertising, routing and multiclass learning. Perhaps the most well-known case is the multi-armed bandit problem, with many other variants being developed, such as contextual bandits, combinatorial bandits, and more general models such as partial monitoring [10, 11].

Although these examples come from very different domains, they all share the common feature of information constraints on how the learning algorithm can interact with the training data. In some specific cases (most notably multi-armed bandits, and also in the context of certain distributed protocols, e.g. [6, 29]) we can even formalize the price we pay for these constraints, in terms of degraded sample complexity or regret guarantees. However, we currently lack a general information-theoretic framework which directly quantifies how such constraints can impact performance. 
For example, are there cases where any online algorithm, which goes over the data one-by-one, must have a worse sample complexity than (say) empirical risk minimization? Are there situations where a small memory footprint provably degrades the learning performance? Can one quantify how a constraint of getting only a few bits from each example affects our ability to learn?

In this paper, we make a first step in developing such a framework. We consider a general class of learning processes, characterized only by information-theoretic constraints on how they may interact with the data (and independent of any specific problem semantics). As special cases, these include online algorithms with memory constraints, certain types of distributed algorithms, as well as online learning with partial information. We identify cases where any such algorithm must perform worse than what can be attained without such information constraints. The tools developed allow us to establish several results for specific learning problems:

• We prove a new and generic regret lower bound for partial-information online learning with expert advice, of the form Ω(√((d/b)T)), where T is the number of rounds, d is the dimension of the loss/reward vector, and b is the number of bits extracted from each loss vector. It is optimal up to log-factors (without further assumptions), and holds no matter what these b bits are - a single coordinate (as in multi-armed bandits), some information on several coordinates (as in semi-bandit feedback), a linear projection (as in bandit linear optimization), some feedback signal from a restricted set (as in partial monitoring), etc. Interestingly, it holds even if the online learner is allowed to adaptively choose which bits of the loss vector it can retain at each round. 
The lower bound directly quantifies how information constraints in online learning degrade the attainable regret, independent of the problem semantics.

• We prove that for some learning and estimation problems - in particular, sparse PCA and sparse covariance estimation in R^d - no online algorithm can attain statistically optimal performance (in terms of sample complexity) with less than Ω̃(d²) memory. To the best of our knowledge, this is the first formal example of a memory/sample complexity trade-off in a statistical learning setting.

• We show that for similar types of problems, there are cases where no distributed algorithm (which is based on a non-interactive or serial protocol on i.i.d. data) can attain optimal performance with less than Ω̃(d²) communication per machine. To the best of our knowledge, this is the first formal example of a communication/sample complexity trade-off in the regime where the communication budget is larger than the data dimension, and the examples at each machine come from the same underlying distribution.

• We demonstrate the existence of (synthetic) stochastic optimization problems where any algorithm which uses memory linear in the dimension (e.g. stochastic gradient descent or mirror descent) cannot be statistically optimal.

Related Work

In stochastic optimization, there has been much work on lower bounds for sequential algorithms (e.g. [22, 1, 23]). However, these results all hold in an oracle model, where data is assumed to be made available in a specific form (such as a stochastic gradient estimate). 
As already pointed out in [22], this does not directly translate to the more common setting, where we are given a dataset and wish to run a simple sequential optimization procedure.

In the context of distributed learning and statistical estimation, information-theoretic lower bounds were recently shown in the pioneering work [29], which identifies cases where communication constraints affect statistical performance. These results differ from ours (in the context of distributed learning) in two important ways. First, they pertain to parametric estimation in R^d, where the communication budget per machine is much smaller than what is needed to even specify the answer with constant accuracy (O(d) bits). In contrast, our results pertain to simpler detection problems, where the answer requires only O(log(d)) bits, yet lead to non-trivial lower bounds even when the budget size is much larger (in some cases, much larger than d). The second difference is that their work focuses on distributed algorithms, while we address a more general class of algorithms, which includes other information-constrained settings. Strong lower bounds in the context of distributed learning have also been shown in [6], but they do not apply to a regime where examples across machines come from the same distribution, and where the communication budget is much larger than what is needed to specify the output.

There are well-known lower bounds for multi-armed bandit problems and other online learning settings with partial information. However, they crucially depend on the semantics of the information feedback considered. For example, the standard multi-armed bandit lower bound [5] pertains to a setting where we can view a single coordinate of the loss vector, but doesn't apply as-is when we can view more than one coordinate (e.g. [4, 25]), get side-information (e.g. 
[19]), receive a linear or non-linear projection (as in bandit linear and convex optimization), or receive a different type of partial feedback (e.g. partial monitoring [11]). In contrast, our results are generic and can directly apply to any such setting.

Memory and communication constraints have been extensively studied within theoretical computer science (e.g. [3, 21]). Unfortunately, almost all these results pertain to data which was either adversarially generated, ordered (in streaming algorithms) or split (in distributed algorithms), and do not apply to statistical learning tasks, where the data is drawn i.i.d. from an underlying distribution. [28, 15] do consider i.i.d. data, but focus on problems such as detecting graph connectivity and counting distinct elements, and not on learning problems such as those considered here. Also, there are works on provably memory-efficient algorithms for statistical problems (e.g. [20, 7, 17, 13]), but these do not consider lower bounds or provable trade-offs.

Finally, there has been a line of work on hypothesis testing and statistical estimation with finite memory (see [18] and references therein). However, the limitations shown in these works apply when the required precision exceeds the amount of memory available. Due to finite sample effects, this regime is usually relevant only when the data size is exponential in the memory size. In contrast, we do not rely on finite precision considerations.

2 Information-Constrained Protocols

We begin with a few words about notation. We use bold-face letters (e.g. x) to denote vectors, and let e_j ∈ R^d denote the j-th standard basis vector. When convenient, we use the standard asymptotic notation O(·), Ω(·), Θ(·) to hide constants, and an additional tilde (e.g. Õ(·)) to also hide log-factors. 
log(·) refers to the natural logarithm, and log2(·) to the base-2 logarithm.

Our main object of study is the following generic class of information-constrained algorithms:

Definition 1 ((b, n, m) Protocol). Given access to a sequence of mn i.i.d. instances (vectors in R^d), an algorithm is a (b, n, m) protocol if it has the following form, for some functions f_t returning an output of at most b bits, and some function f:

• For t = 1, . . . , m
  – Let X^t be a batch of n i.i.d. instances
  – Compute message W^t = f_t(X^t, W^1, W^2, . . . , W^{t−1})
• Return W = f(W^1, . . . , W^m)

Note that the functions {f_t}_{t=1}^m and f are completely arbitrary, may depend on m, and can also be randomized. The crucial assumption is that the outputs W^t are constrained to be only b bits.

As the definition above may appear quite abstract, let us consider a few specific examples:

• b-memory online protocols: Consider any algorithm which goes over examples one-by-one, and incrementally updates a state vector W^t of bounded size b. We note that a majority of online learning and stochastic optimization algorithms have bounded memory. For example, for linear predictors, most gradient-based algorithms maintain a state whose size is proportional to the size of the parameter vector being optimized. Such algorithms correspond to (b, n, m) protocols where W^t is the state vector after round t, with an update function f_t depending only on W^{t−1}, and f depending only on W^m. n = 1 corresponds to algorithms which use one example at a time, whereas n > 1 corresponds to algorithms using mini-batches.

• Non-interactive and serial distributed algorithms: There are m machines and each machine receives an independent sample X^t of size n. It then sends a message W^t = f_t(X^t) (which here depends only on X^t). A centralized server then combines the messages to compute an output f(W^1, . . . , W^m). 
This includes, for instance, divide-and-conquer style algorithms proposed for distributed stochastic optimization (e.g. [30]). A serial variant of the above is when there are m machines, and one-by-one, each machine t broadcasts some information W^t to the other machines, which depends on X^t as well as the previous messages sent by machines 1, 2, . . . , (t − 1).

• Online learning with partial information: Suppose we sequentially receive d-dimensional loss vectors, and from each of these we can extract and use only b bits of information, where b ≪ d. This includes most types of bandit problems [10].

In our work, we contrast the performance attainable by any algorithm corresponding to such a protocol with that of constraint-free protocols, which are allowed to interact with the data in any manner.

3 Basic Results

Our results are based on a simple 'hide-and-seek' statistical estimation problem, for which we show a strong gap between the performance of information-constrained protocols and constraint-free protocols. It is parameterized by a dimension d, bias ρ, and sample size mn, and defined as follows:

Definition 2 (Hide-and-seek Problem). Consider the set of product distributions {Pr_j(·)}_{j=1}^d over {−1, 1}^d defined via E_{x∼Pr_j(·)}[x_i] = 2ρ·1_{i=j} for all coordinates i = 1, . . . , d. Given an i.i.d. sample of mn instances generated from Pr_j(·), where j is unknown, detect j.

In words, Pr_j(·) corresponds to picking all coordinates other than j to be ±1 uniformly at random, and independently picking coordinate j to be +1 with a higher probability (1/2 + ρ). The goal is to detect the biased coordinate j based on a sample.

First, we note that without information constraints, it is easy to detect the biased coordinate with O(log(d)/ρ²) instances. 
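To make the hide-and-seek problem concrete, here is a minimal sketch (in Python; the function names and parameter values are ours, not the paper's) of sampling from Pr_j as in Definition 2, together with the constraint-free estimator of Thm. 1, which simply returns the coordinate with the highest empirical average:

```python
import numpy as np

def sample_hide_and_seek(d, j, rho, size, rng):
    """Draw `size` i.i.d. instances from Pr_j over {-1, +1}^d:
    every coordinate is +/-1 uniformly at random, except coordinate j,
    which equals +1 with probability 1/2 + rho (so E[x_j] = 2*rho)."""
    p = np.full(d, 0.5)
    p[j] = 0.5 + rho
    return np.where(rng.random((size, d)) < p, 1, -1)

def detect_biased_coordinate(X):
    """Constraint-free estimator of Thm. 1: the coordinate with the
    highest empirical average over the whole sample."""
    return int(np.argmax(X.mean(axis=0)))

rng = np.random.default_rng(0)
d, j, rho = 32, 7, 0.2
X = sample_hide_and_seek(d, j, rho, size=2000, rng=rng)  # mn >> log(d)/rho^2
assert detect_biased_coordinate(X) == j
```

With mn ≫ log(d)/ρ², the detector succeeds with overwhelming probability, in line with the 1 − 2d·exp(−mnρ²/2) guarantee; the information-constrained lower bounds concern exactly this problem.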
This is formalized in the following theorem, which is an immediate consequence of Hoeffding's inequality and a union bound:

Theorem 1. Consider the hide-and-seek problem defined earlier. Given mn samples, if J̃ is the coordinate with the highest empirical average, then Pr_j(J̃ = j) ≥ 1 − 2d·exp(−(1/2)mnρ²).

We now show that for this hide-and-seek problem, there is a large regime where detecting j is information-theoretically possible (by Thm. 1), but any information-constrained protocol will fail to do so with high probability. We first show this for (b, 1, m) protocols (i.e. protocols which process one instance at a time, such as bounded-memory online algorithms, and distributed algorithms where each machine holds a single instance):

Theorem 2. Consider the hide-and-seek problem on d > 1 coordinates, with some bias ρ ≤ 1/4 and sample size m. Then for any estimate J̃ of the biased coordinate returned by any (b, 1, m) protocol, there exists some coordinate j such that

Pr_j(J̃ = j) ≤ 3/d + 21·√(mρ²b/d).

The theorem implies that any algorithm corresponding to a (b, 1, m) protocol requires a sample size m ≥ Ω((d/b)/ρ²) to reliably detect some j. When b is polynomially smaller than d (e.g. a constant), we get an exponential gap compared to constraint-free protocols, which only require O(log(d)/ρ²) instances.

Moreover, Thm. 2 is tight up to log-factors: Consider a b-memory online algorithm which splits the d coordinates into O(d/b) segments of O(b) coordinates each, and sequentially goes over the segments, each time using Õ(1/ρ²) independent instances to determine whether one of the coordinates in the current segment is biased by ρ (assuming ρ is not exponentially smaller than b, this can be done with O(b) memory by maintaining the empirical average of each coordinate in the segment). 
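The segment-scanning strategy just described can be sketched as follows (a hedged illustration, assuming we may draw a fresh batch of instances per segment; the names and the batch-size constant are ours):

```python
import numpy as np

def segment_scan(sample_stream, d, b, rho):
    """b-memory online sketch of the matching upper bound: scan the d
    coordinates in O(d/b) segments of at most b coordinates, spend a
    fresh batch of ~O(log(d)/rho^2) instances per segment maintaining
    only that segment's b empirical averages, and keep the single
    best-looking coordinate seen so far."""
    batch = int(np.ceil(8 * np.log(max(d, 2)) / rho ** 2))
    best_coord, best_mean = None, -np.inf
    for start in range(0, d, b):
        stop = min(start + b, d)
        means = np.zeros(stop - start)  # the state: at most b numbers
        for _ in range(batch):
            means += next(sample_stream)[start:stop]
        means /= batch
        k = int(np.argmax(means))
        if means[k] > best_mean:
            best_mean, best_coord = means[k], start + k
    return best_coord

def hide_and_seek_stream(d, j, rho, rng):
    """Yield i.i.d. instances from Pr_j (coordinate j biased upward by rho)."""
    while True:
        x = np.where(rng.random(d) < 0.5, 1, -1)
        x[j] = 1 if rng.random() < 0.5 + rho else -1
        yield x

rng = np.random.default_rng(1)
j_hat = segment_scan(hide_and_seek_stream(64, 19, 0.25, rng), d=64, b=8, rho=0.25)
assert j_hat == 19  # ~(d/b) * O(log(d)/rho^2) instances in total
```

The per-round state is only the current segment's running averages plus the best candidate so far, i.e. O(b) numbers, matching the Õ((d/b)/ρ²) sample complexity discussed in the text.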
This allows detecting the biased coordinate using Õ((d/b)/ρ²) instances.

We now turn to provide an analogous result for general (b, n, m) protocols (where n is possibly greater than 1). However, it is a bit weaker in terms of the dependence on the bias parameter[1]:

Theorem 3. Consider the hide-and-seek problem on d > 1 coordinates, with some bias ρ ≤ 1/4n and sample size mn. Then for any estimate J̃ of the biased coordinate returned by any (b, n, m) protocol, there exists some coordinate j such that

Pr_j(J̃ = j) ≤ 3/d + 5·√( mn·min{ 10ρb/d, ρ² } ).

The theorem implies that any (b, n, m) protocol will require a sample size mn which is at least Ω( max{ (d/b)/ρ, 1/ρ² } ) in order to detect the biased coordinate. This is larger than the O(log(d)/ρ²) instances required by constraint-free protocols whenever ρ > b·log(d)/d, and establishes trade-offs between sample complexity and information complexities such as memory and communication.

Due to lack of space, all our proofs appear in the supplementary material. However, the technical details may obfuscate the high-level intuition, which we now turn to explain.

From an information-theoretic viewpoint, our results are based on analyzing the mutual information between j and W^t in a graphical model as illustrated in Figure 1. In this model, the unknown message j (i.e. the identity of the biased coordinate) is correlated with one of d independent binary-valued random vectors (one for each coordinate across the data instances X^t). All these random vectors are noisy, and the mutual information in bits between X^t_j and j can be shown to be on the order of nρ². 
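In equations, the counting argument behind these bounds reads as follows (a heuristic summary of the paper's argument, not a formal statement):

```latex
% information about j carried by the biased coordinate of one batch X^t:
I(X^t_j ; j) = O(n\rho^2),
% so without constraints, m batches identify j once  m n \rho^2 \gtrsim \log d.
% With a b-bit message W^t, the data processing inequality alone only gives
I(W^t ; j) \le \min\{\, n\rho^2,\; b/d \,\},
% whereas the stronger contraction bound proved in the paper is a product:
I(W^t ; j) = O(\rho^2 b/d) \;\; (n = 1), \qquad O(n\rho\, b/d) \;\; (\text{general } n),
% which forces  m \rho^2 b/d \gtrsim \log d,  i.e.  m = \Omega\big((d/b)/\rho^2\big)
% up to log-factors, matching Thm. 2.
```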
Without information constraints, it follows that given m instantiations of X^t, the total amount of information conveyed on j by the data is Θ(mnρ²), and if this quantity is larger than log(d), then there is enough information to uniquely identify j. Note that no stronger bound can be established with standard statistical lower-bound techniques, since these do not consider information constraints internal to the algorithm used.

Indeed, in our information-constrained setting there is an added complication, since the output W^t can only contain b bits. If b ≪ d, then W^t cannot convey all the information on X^t_1, . . . , X^t_d. Moreover, it will likely convey only little information if it doesn't already “know” j. For example, W^t may provide a little bit of information on all d coordinates, but then the amount of information conveyed on each (and in particular, the random variable X^t_j which is correlated with j) will be very small. Alternatively, W^t may provide accurate information on O(b) coordinates, but since the relevant coordinate j is not known, it is likely to “miss” it. The proof therefore relies on the following components:

• No matter what, a (b, n, m) protocol cannot provide more than b/d bits of information (in expectation) on X^t_j, unless it already “knows” j.

• Even if the mutual information between W^t and X^t_j is only b/d, and the mutual information between X^t_j and j is nρ², standard information-theoretic tools such as the data processing inequality only imply that the mutual information between W^t and j is bounded by min{nρ², b/d}. We essentially prove a stronger information contraction bound, which is the product of the two terms

[1] The proof of Thm. 
2 also applies in the case n > 1, but the dependence on n is exponential - see the proof for details.

Figure 1: Illustration of the relationship between j, the coordinates 1, 2, . . . , j, . . . , d of the sample X^t, and the message W^t. The coordinates are independent of each other, and most of them just output ±1 uniformly at random. Only X^t_j has a slightly different distribution and hence contains some information on j.

O(ρ²b/d) when n = 1, and O(nρb/d) for general n. At a technical level, this is achieved by considering the relative entropy between the distributions of W^t with and without a biased coordinate j, relating it to the χ²-divergence between these distributions (using relatively recent analytic results on Csiszár f-divergences [16], [26]), and performing algebraic manipulations to upper bound it by ρ² times the mutual information between W^t and X^t_j, which is on average b/d as discussed earlier. This eventually leads to the mρ²b/d term in Thm. 2, as well as to Thm. 3 using somewhat different calculations.

4 Applications

4.1 Online Learning with Partial Information

Consider the setting of learning with expert advice, defined as a game over T rounds, where at each round t a loss vector ℓ_t ∈ [0, 1]^d is chosen, and the learner (without knowing ℓ_t) needs to pick an action i_t from a fixed set {1, . . . , d}, after which the learner suffers loss ℓ_{t,i_t}. The goal of the learner is to minimize the regret with respect to any fixed action i, namely Σ_{t=1}^T ℓ_{t,i_t} − Σ_{t=1}^T ℓ_{t,i}. We are interested in variants where the learner only gets some partial information on ℓ_t. For example, in multi-armed bandits, the learner can only view ℓ_{t,i_t}. The following theorem is a simple corollary of Thm. 2:

Theorem 4. Suppose d > 3. For any (b, 1, T) protocol, there is an i.i.d. distribution over loss vectors ℓ_t ∈ [0, 1]^d for which min_j E[ Σ_{t=1}^T ℓ_{t,i_t} − Σ_{t=1}^T ℓ_{t,j} ] ≥ c·min{ T, √((d/b)T) }, where c > 0 is a numerical constant.

As a result, we get that for any algorithm with any partial-information feedback model (where b bits are extracted from each d-dimensional loss vector), it is impossible to get regret lower than Ω(√((d/b)T)) for sufficiently large T. Without further assumptions on the feedback model, the bound is optimal up to log-factors, as shown by O(√((d/b)T)) upper bounds for linear or coordinate measurements (where b is the number of measurements or coordinates seen[2]) [2, 19, 25]. However, the lower bound extends beyond these specific settings, and includes cases such as arbitrary non-linear measurements of the loss vector, or receiving feedback signals of bounded size (although some setting-specific lower bounds may be stronger). It also simplifies previous lower bounds tailored to specific types of partial-information feedback, or relying on careful reductions to multi-armed bandits (e.g. [12, 25]). Interestingly, the bound holds even if the algorithm is allowed to examine each loss vector ℓ_t and adaptively choose which b bits of information it wishes to retain.

4.2 Stochastic Optimization

We now turn to consider an example from stochastic optimization, where our goal is to approximately minimize F(h) = E_Z[f(h; Z)] given access to m i.i.d. instantiations of Z, whose distribution is unknown. This setting has received much attention in recent years, and can be used to model many statistical learning problems. 
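The generic plug-in recipe in this setting - estimate F by an empirical average over the sample and minimize the estimate - can be sketched as follows (an illustrative sample-average sketch over a finite candidate set; the names and the toy objective are ours, not the paper's):

```python
import numpy as np

def empirical_minimizer(candidates, f, zs):
    """Estimate F(h) = E_Z[f(h; Z)] by the empirical average over the
    i.i.d. sample `zs`, and return the candidate minimizing the estimate."""
    scores = [np.mean([f(h, z) for z in zs]) for h in candidates]
    return candidates[int(np.argmin(scores))]

# Toy usage: f(h; Z) = (h - Z)^2 with Z ~ N(2, 1), so F(h) = (h - 2)^2 + 1
# is minimized at h = 2 among the candidates below.
rng = np.random.default_rng(0)
zs = rng.normal(2.0, 1.0, size=4000)
h_hat = empirical_minimizer([0.0, 1.0, 2.0, 3.0], lambda h, z: (h - z) ** 2, zs)
assert h_hat == 2.0
```

Note that such a plug-in estimator must hold one empirical mean per candidate; this memory cost is exactly what the information-constrained lower bounds of this section exploit.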
In this section, we demonstrate a stochastic optimization problem where information-constrained protocols provably pay a performance price compared to non-constrained algorithms. We emphasize that it is a simple toy problem, not meant to represent anything realistic. We present it for two reasons: First, it illustrates another type of situation where information-constrained protocols may fail (in particular, problems involving matrices). Second, the intuition behind the construction is also used in the more realistic problems of sparse PCA and covariance estimation, considered in the next section.

[2] Strictly speaking, if the losses are continuous-valued, these require arbitrary-precision measurements, but in any practical implementation we can assume the losses and measurements are discrete.

Specifically, suppose we wish to solve min_{(w,v)} F(w, v) = E_Z[f((w, v); Z)], where

f((w, v); Z) = w⊤Zv ,  Z ∈ [−1, +1]^{d×d},

and w, v range over all vectors in the simplex (i.e. w_i, v_i ≥ 0 and Σ_{i=1}^d w_i = Σ_{i=1}^d v_i = 1). A minimizer of F(w, v) is (e_{i*}, e_{j*}), where (i*, j*) are the indices of the matrix entry with minimal mean. Moreover, by a standard concentration of measure argument, given m i.i.d. instantiations Z^1, . . . , Z^m from any distribution over Z, the solution (e_Ĩ, e_J̃), where (Ĩ, J̃) = arg min_{i,j} (1/m)Σ_{t=1}^m Z^t_{i,j} are the indices of the entry with empirically smallest mean, satisfies F(e_Ĩ, e_J̃) ≤ min_{w,v} F(w, v) + O(√(log(d)/m)) with high probability.

However, computing (Ĩ, J̃) as above requires us to track d² empirical means, which may be expensive when d is large. If instead we constrain ourselves to (b, 1, m) protocols where b = O(d) (e.g. any sort of stochastic gradient optimization algorithm whose memory is linear in the number of parameters), then we claim that we have a lower bound of Ω(min{1, √(d/m)}) on the expected error, which is much higher than the O(√(log(d)/m)) upper bound for constraint-free protocols. This claim is a straightforward consequence of Thm. 2: We consider distributions where Z ∈ {−1, +1}^{d×d} with probability 1, each of the d² entries is chosen independently, and E[Z] is zero except at some coordinate (i*, j*), where it equals O(√(d/m)). For such distributions, getting optimization error smaller than O(√(d/m)) reduces to detecting (i*, j*), and this in turn reduces to the hide-and-seek problem defined earlier, over d² coordinates and with bias ρ = O(√(d/m)). However, Thm. 2 shows that no (b, 1, m) protocol (where b = O(d)) will succeed if mdρ² ≪ d², which indeed happens if ρ is small enough.

Similar kinds of gaps can be shown using Thm. 3 for general (b, n, m) protocols, which apply to any special case such as non-interactive distributed learning.

4.3 Sparse PCA, Sparse Covariance Estimation, and Detecting Correlations

The sparse PCA problem ([31]) is a standard and well-known statistical estimation problem, defined as follows: We are given an i.i.d. sample of vectors x ∈ R^d, and we assume that there is some direction, corresponding to some sparse vector v (of cardinality at most k), such that the variance E[(v⊤x)²] along that direction is larger than along any other direction. Our goal is to find that direction. We will focus here on the simplest possible form of this problem, where the maximizing direction v is assumed to be 2-sparse, i.e. there are only 2 non-zero coordinates v_i, v_j. In that case, E[(v⊤x)²] = v_i²·E[x_i²] + v_j²·E[x_j²] + 2v_iv_j·E[x_ix_j]. Following previous work (e.g. [8]), we even assume that E[x_i²] = 1 for all i, in which case the sparse PCA problem reduces to detecting a coordinate pair (i*, j*), i* < j*, for which x_{i*}, x_{j*} are maximally correlated. A special case is a simple and natural sparse covariance estimation problem [9], where we assume that all covariates are uncorrelated (E[x_ix_j] = 0) except for a unique correlated pair (i*, j*) which we need to detect.

This setting bears a resemblance to the example seen in the context of stochastic optimization in section 4.2: We have a d×d stochastic matrix xx⊤, and we need to detect an off-diagonal biased entry at location (i*, j*). Unfortunately, these stochastic matrices are rank-1 and do not have independent entries as in the example considered in section 4.2. Instead, we use a more delicate construction, relying on distributions supported on sparse vectors. The intuition is that each instantiation of xx⊤ is then sparse, and the situation can be reduced to a variant of our hide-and-seek problem where only a few coordinates are non-zero at a time. The theorem below establishes performance gaps between constraint-free protocols (in particular, a simple plug-in estimator) and any (b, n, m) protocol for a specific choice of n, or any b-memory online protocol (see Sec. 2).

Theorem 5. Consider the class of 2-sparse PCA (or covariance estimation) problems in d ≥ 9 dimensions as described above, and all distributions such that E[x_i²] = 1 for all i, and:

1. For a unique pair of distinct coordinates (i*, j*), it holds that E[x_{i*}x_{j*}] = τ > 0, whereas E[x_ix_j] = 0 for all distinct coordinate pairs (i, j) ≠ (i*, j*).

2. For any i < j, if x̃ᵢⱼ is the empirical average of x_ix_j over m i.i.d. instances, then Pr( |x̃ᵢⱼ − E[x_ix_j]| ≥ τ

• Let (Ĩ, J̃) = arg maxi