{"title": "The committee machine: Computational to statistical gaps in learning a two-layers neural network", "book": "Advances in Neural Information Processing Systems", "page_first": 3223, "page_last": 3234}

The committee machine: Computational to statistical gaps in learning a two-layers neural network

Benjamin Aubin⋆†, Antoine Maillard†, Jean Barbier⊗♦†, Florent Krzakala†, Nicolas Macris⊗, Lenka Zdeborová⋆

Abstract

Heuristic tools from statistical physics have been used in the past to locate the phase transitions and compute the optimal learning and generalization errors in the teacher-student scenario in multi-layer neural networks. In this contribution, we provide a rigorous justification of these approaches for a two-layer neural network model called the committee machine. We also introduce a version of the approximate message passing (AMP) algorithm for the committee machine that makes it possible to perform optimal learning in polynomial time for a large set of parameters.
We find that there are regimes in which a low generalization error is information-theoretically achievable while the AMP algorithm fails to deliver it, strongly suggesting that no efficient algorithm exists for those cases and unveiling a large computational gap.

While the traditional approach to learning and generalization follows the Vapnik-Chervonenkis [1] and Rademacher [2] worst-case type bounds, there has been a considerable body of theoretical work on calculating the generalization ability of neural networks for data arising from a probabilistic model, within the framework of statistical mechanics [3, 4, 5, 6, 7]. In the wake of the need to understand the effectiveness of neural networks and also the limitations of the classical approaches [8], it is of interest to revisit the results that have emerged thanks to the physics perspective. This direction is currently experiencing a strong revival, see e.g. [9, 10, 11, 12].

Of particular interest is the so-called teacher-student approach, where labels are generated by feeding i.i.d. random samples to a neural network architecture (the teacher) and are then presented to another neural network (the student) that is trained using these data. Early studies computed the information-theoretic limits on the student's ability to recover the teacher weights when given m independent n-dimensional examples, with α ≡ m/n = Θ(1) and n → ∞ [3, 4, 7].
These works relied on non-rigorous heuristic approaches, such as the replica and cavity methods [13, 14]. Additionally, no provably efficient algorithm was provided to achieve the predicted learning abilities, and it was thus difficult to test those predictions or to assess the computational difficulty. Recent developments in statistical estimation and information theory, in particular approximate message passing (AMP) algorithms [15, 16, 17, 18] and a rigorous proof of the replica formula for the optimal generalization error [11], have made it possible to settle these two missing points for single-layer neural networks (i.e. without any hidden variables). In the present paper, we build on these works and provide rigorous asymptotic predictions, and a corresponding message passing algorithm, for a class of two-layer networks.

⋆ Institut de Physique Théorique, CNRS & CEA & Université Paris-Saclay, Saclay, France.
† Laboratoire de Physique Statistique, CNRS & Sorbonne Universités & École Normale Supérieure, PSL University, Paris, France.
⊗ Laboratoire de Théorie des Communications, Faculté Informatique et Communications, École Polytechnique Fédérale de Lausanne, Switzerland.
♦ International Center for Theoretical Physics, Trieste, Italy.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1 Summary of contributions and related works

While our results hold for a rather large class of non-linear activation functions, we illustrate our findings on a case considered most commonly in the early literature: the committee machine. This is possibly the simplest version of a two-layer neural network, in which all the weights of the second layer are fixed to unity.
Denoting Y_μ the label associated with an n-dimensional sample X_μ, and W*_il the weight connecting the i-th coordinate of the input to the l-th node of the hidden layer, the committee machine is defined by:

    Y_μ = sign[ Σ_{l=1}^K sign( Σ_{i=1}^n X_μi W*_il ) ] .    (1)

We concentrate here on the teacher-student scenario: the teacher generates i.i.d. data samples with i.i.d. standard Gaussian coordinates X_μi ∼ N(0, 1), then generates the associated labels Y_μ using a committee machine as in (1), with i.i.d. weights W*_il unknown to the student (in the proof section we will consider the more general case of a distribution for the weights of the form ∏_{i=1}^n P_0({W*_il}_{l=1}^K), but in practice we consider the fully separable case). The student is then given the m input-output pairs (X_μ, Y_μ)_{μ=1}^m and knows the distribution P_0 used to generate W*_il. The goal of the student is to learn the weights W*_il from the available examples (X_μ, Y_μ)_{μ=1}^m in order to reach the smallest possible generalization error (i.e. to be able to predict the label the teacher would generate for a new sample not present in the training set).

There have been several studies of this model within the non-rigorous statistical physics approach in the limit where α ≡ m/n = Θ(1), K = Θ(1) and n → ∞ [19, 20, 21, 22, 6, 7]. A particularly interesting result in the teacher-student setting is the specialization of hidden neurons (see sec.
12.6 of [7], or [23] in the context of online learning): for α < α_spec (where α_spec is a certain critical value of the sample complexity), the permutational symmetry between hidden neurons remains conserved even after optimal learning, and the learned weights of each of the hidden neurons are identical. For α > α_spec, however, this symmetry gets broken, as each of the hidden units correlates strongly with one of the hidden units of the teacher. Another remarkable result is the calculation of the optimal generalization error as a function of α.

Our first contribution is a proof of the replica formula conjectured in the statistical physics literature, using the adaptive interpolation method of [24, 11], which puts several of these results on a firm rigorous basis. Our second contribution is the design of an AMP-type algorithm that is able to achieve the optimal learning error in the above limit of large dimensions for a wide range of parameters. The study of AMP, which is widely believed to be optimal among all polynomial algorithms in the above setting [25, 26, 27, 28], unveils, in the case of the committee machine with a large number of hidden neurons, the existence of a large hard phase in which learning is information-theoretically possible, leading to a good generalization error decaying asymptotically as 1.25K/α (in the α = Θ(K) regime), but where AMP fails and provides only a poor generalization error that does not decay when increasing α. This strongly suggests that no efficient algorithm exists in this hard region, and therefore that there is a computational gap in learning in such neural networks.
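To make the setting concrete, the teacher-student data of model (1) can be generated in a few lines. This is a minimal illustrative sketch; the values of n, K, α and the Gaussian choice for P_0 are our own arbitrary picks, not the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, alpha = 1000, 2, 2.0            # input dimension, hidden units, sample ratio
m = int(alpha * n)

# Teacher weights W* drawn i.i.d. from P0 (here standard Gaussian) and i.i.d. inputs
W_star = rng.standard_normal((n, K))
X = rng.standard_normal((m, n))

# Committee machine labels, eq. (1): sign of the sum of the hidden-unit signs
Y = np.sign(np.sign(X @ W_star).sum(axis=1))
```

The student receives (X, Y) together with the prior P_0 and tries to recover W_star. Note that for even K a label can be 0 when the hidden votes split, matching the sign(0) = 0 convention used later in the paper.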
In other problems where a hard phase was identified, its study boosted the development of algorithms able to match the predicted thresholds, and we anticipate this will translate to the present model.

We also want to comment on a related line of work that studies the loss-function landscape of neural networks. While a range of works show, under various assumptions, that spurious local minima are absent in neural networks, others show under different conditions that they do exist, see e.g. [29]. The regime of parameters that is hard for AMP must have spurious local minima, but the converse is not true in general: there might be spurious local minima and yet the AMP approach may succeed. Moreover, in all previously studied models in the Bayes-optimal setting, the (generalization) error obtained with AMP is the best known, and other approaches, e.g. (noisy) gradient-based methods, spectral algorithms or semidefinite programming, do not generalize better, even in cases where the "student" models are overparametrized. Of course, in order to be in the Bayes-optimal setting one needs to know the model used by the teacher, which is not the case in practice.

2 Main technical results

A general model — While in the illustration of our results we shall focus on the model (1), all our formulas are valid for a broader class of models: given m input samples (X_μi)^{m,n}_{μ,i=1}, we denote W*_il the teacher weight connecting the i-th input (i.e. visible unit) to the l-th node of the hidden layer. For a generic function φ_out : R^K × R → R one can formally write the output as

    Y_μ = φ_out( { (1/√n) Σ_{i=1}^n X_μi W*_il }^K_{l=1} , A_μ )   or   Y_μ ∼ P_out( · | { (1/√n) Σ_{i=1}^n X_μi W*_il }^K_{l=1} ) ,    (2)

where (A_μ)^m_{μ=1} are i.i.d. real-valued random variables with known distribution P_A, which form the probabilistic part of the model, generally accounting for noise. For deterministic models the second argument is simply absent (or is a Dirac mass). We can alternatively view (2) as a channel, where the transition kernel P_out is directly related to φ_out. As discussed above, we focus on the teacher-student scenario where the teacher generates Gaussian i.i.d. data X_μi ∼ N(0, 1) and i.i.d. weights W*_il ∼ P_0. The student then learns W* from the data (X_μ, Y_μ)^m_{μ=1} by computing marginal means of the posterior probability distribution (5).

Different scenarios fit into this general framework. Among those, the committee machine is obtained by choosing φ_out(h) = sign(Σ^K_{l=1} sign(h_l)), while another previously considered model, the parity machine, is given by φ_out(h) = ∏^K_{l=1} sign(h_l), see e.g. [7]. A number of layers beyond two has also been considered, see [22]. Other activation functions can be used, and many more problems can be described, e.g.
compressed pooling [30, 31] or multi-vector compressed sensing [32].

Two auxiliary inference problems — Denote S_K the finite-dimensional vector space of K × K matrices, S_K^+ the convex and compact set of semi-definite positive K × K matrices, S_K^++ the set of positive definite K × K matrices, and for all N ∈ S_K^+ set S_K^+(N) ≡ {M ∈ S_K^+ s.t. N − M ∈ S_K^+}. Stating our results requires introducing two simpler auxiliary K-dimensional estimation problems:

• The first one consists in retrieving a K-dimensional input vector W_0 ∼ P_0 from the output of a Gaussian vector channel with K-dimensional observations Y_0 = r^{1/2} W_0 + Z_0, Z_0 ∼ N(0, I_{K×K}), where the "channel gain" matrix r is in S_K^+. The associated posterior distribution on w = (w_l)^K_{l=1} is

    P(w|Y_0) = (1/Z_{P_0}) P_0(w) e^{Y_0^⊺ r^{1/2} w − (1/2) w^⊺ r w} ,    (3)

and the associated free entropy (or minus free energy) is given by the expectation over Y_0 of the log-partition function, ψ_{P_0}(r) ≡ E ln Z_{P_0}; it involves K-dimensional integrals.

• The second problem considers K-dimensional i.i.d. vectors V, U* ∼ N(0, I_{K×K}), where V is considered to be known, and one has to retrieve U* from a scalar observation obtained as Ỹ_0 ∼ P_out(·|q^{1/2} V + (ρ − q)^{1/2} U*), where the second moment matrix ρ ≡ E[W_0 W_0^⊺] (W_0 ∼ P_0) is in S_K^+ and the so-called "overlap matrix" q is in S_K^+(ρ). The associated posterior is

    P(u|Ỹ_0, V) = (1/Z_{P_out}) (e^{−(1/2) u^⊺ u} / (2π)^{K/2}) P_out(Ỹ_0 | q^{1/2} V + (ρ − q)^{1/2} u) ,    (4)

and the free entropy reads this time Ψ_{P_out}(q; ρ) ≡ E ln Z_{P_out} (with the expectation over Ỹ_0 and V); it also involves K-dimensional integrals.

The free entropy — The central object of study leading to the optimal learning and generalization errors in the present setting is the posterior distribution of the weights:

    P({w_il}^{n,K}_{i,l=1} | {X_μi, Y_μ}^{m,n}_{μ,i=1}) = (1/Z_n) ∏_{i=1}^n P_0({w_il}^K_{l=1}) ∏_{μ=1}^m P_out(Y_μ | {(1/√n) Σ_{i=1}^n X_μi w_il}^K_{l=1}) ,    (5)

where the normalization factor is nothing else than a partition function, i.e. the integral of the numerator over {w_il}^{n,K}_{i,l=1}. The expected² free entropy is by definition f_n ≡ E ln Z_n / n. The replica formula gives an explicit (conjectural) expression of f_n in the high-dimensional limit n, m → ∞ with α = m/n fixed. We discuss in the long version of this paper [33] how the heuristic replica method [13, 14] yields the formula. This computation was first performed, to the best of our knowledge, by [19] in the case of the committee machine. Our first contribution is a rigorous proof of the corresponding free entropy formula using an interpolation method [34, 35, 24].

² The symbol E will generally denote an expectation over all random variables in the ensuing expression (here {X_μi, Y_μ}).
Subscripts will be used only when we take partial expectations or if there is an ambiguity.

In order to formulate our rigorous results, we add an (arbitrarily small) Gaussian regularization noise √Δ Z_μ to the first expression of the model (2), where Δ > 0 and Z_μ ∼ N(0, 1), so that the channel kernel is (u ∈ R^K)

    P_out(y|u) = (1/√(2πΔ)) ∫_R dP_A(a) e^{−(1/(2Δ)) (y − φ(u, a))²} .    (6)

Theorem 2.1 (Replica formula) Suppose (H1): the prior P_0 has bounded support in R^K; (H2): the activation φ_out : R^K × R → R is a bounded C² function with bounded first and second derivatives w.r.t. its first argument (in R^K-space); and (H3): for all μ = 1, ..., m and i = 1, ..., n we have i.i.d. X_μi ∼ N(0, 1). Then for the model (2) with kernel (6) the limit of the free entropy is:

    lim_{n→∞} f_n ≡ lim_{n→∞} (1/n) E ln Z_n = sup_{r ∈ S_K^+} inf_{q ∈ S_K^+(ρ)} { ψ_{P_0}(r) + α Ψ_{P_out}(q; ρ) − (1/2) Tr(rq) } ,    (7)

where α ≡ m/n, and where ψ_{P_0}(r) and Ψ_{P_out}(q; ρ) are the free entropies of the two simpler K-dimensional estimation problems (3) and (4).

This theorem extends the recent progress for generalized linear models of [11], which covers the case K = 1 of the present contribution, to the phenomenologically richer case of two-layer problems such as the committee machine. The proof sketch, based on the adaptive interpolation method recently developed in [24], is outlined in sec. 4, and the details can be found in [33].
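As a sanity check on the quantities entering (7), the free entropy ψ_P0 of the auxiliary Gaussian channel (3) can be estimated numerically and, for a Gaussian prior, compared with a closed form. The sketch below is our own verification for K = 1 with P_0 = N(0, 1), where a direct Gaussian integral gives ψ_P0(r) = r/2 − ln(1+r)/2; the grid and sample sizes are arbitrary choices:

```python
import numpy as np

def psi_P0(r, n_samples=10_000, seed=1):
    """Estimate psi_P0(r) = E ln Z_P0 for K = 1 with prior P0 = N(0, 1):
    quadrature over w, Monte Carlo over the channel output Y0."""
    rng = np.random.default_rng(seed)
    w, dw = np.linspace(-8.0, 8.0, 801, retstep=True)   # quadrature grid for w
    P0 = np.exp(-w**2 / 2) / np.sqrt(2 * np.pi)
    # Channel output Y0 = sqrt(r) W0 + Z0 with W0, Z0 ~ N(0, 1)
    Y0 = np.sqrt(r) * rng.standard_normal(n_samples) + rng.standard_normal(n_samples)
    # Z_P0(Y0) = ∫ P0(w) exp(Y0 sqrt(r) w - r w^2 / 2) dw, then average ln Z_P0 over Y0
    expo = np.outer(Y0, np.sqrt(r) * w) - r * w**2 / 2
    return np.log(np.exp(expo) @ (P0 * dw)).mean()

r = 1.3
exact = r / 2 - np.log1p(r) / 2   # closed form for the Gaussian prior
approx = psi_P0(r)
```

The same quadrature-plus-sampling pipeline applies to any bounded prior; for K ≥ 2 the integral over w becomes one of the K-dimensional integrals mentioned in the text.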
Note that, following approximation arguments similar to those in [11], hypothesis (H1) can be relaxed to the existence of the second moment of the prior (thus covering the Gaussian case), (H2) can be dropped (thus covering model (1) and its sign(·) activation), and (H3) can be extended to data matrices X with i.i.d. entries of zero mean, unit variance and finite third moment. Moreover, the case Δ = 0 can be considered when the outputs are discrete, as in the committee machine (1), see [11]. The channel kernel then becomes P_out(y|u) = ∫ dP_A(a) 1(y − φ(u, a)), and the replica formula is the Δ → 0 limit of the one provided in Theorem 2.1. In general this regularizing noise is needed for the free entropy limit to exist.

Learning the teacher weights and optimal generalization error — A classical result in Bayesian estimation is that the estimator Ŵ minimizing the mean-square error with the ground truth W* is given by the mean of the posterior distribution. Denoting q* the extremizer in the replica formula (7), we expect from the replica method that in the limit n → ∞ with m/n = α, with high probability Ŵ^⊺ W*/n → q*. We refer to proposition 4.2 and to the proof in [33] for the precise statement, which remains rigorously valid only in the presence of an additional (possibly infinitesimal) side information. From the overlap matrix q*, one can compute the Bayes-optimal generalization error when the student tries to classify a new, yet unseen, sample X_new. The estimator Ŷ_new of the new label that minimizes the mean-square error with the true label is given by the posterior mean of φ_out(X_new w) (X_new is a row vector).
Given the new sample, the optimal generalization error is then

    (1/2) E[ ( E_{w|X,Y}[φ_out(X_new w)] − φ_out(X_new W*) )² ] −−→_{n→∞} ε_g(q*) ,    (8)

where w is distributed according to the posterior measure (5) (note that this Bayes-optimal computation differs from the so-called Gibbs estimator by a factor 2, see [33]). In particular, when the data X is drawn from the standard Gaussian distribution on R^{m×n}, and is thus rotationally invariant, this error only depends on the overlap w^⊺ W*/n, which converges to q*. A direct algebraic computation then gives a lengthy but explicit formula for ε_g(q*), as shown in [33].

Approximate message passing, and its state evolution — Our next result is based on an adaptation of a popular algorithm for solving random instances of generalized linear models, the AMP algorithm [15, 16], to the case of the committee machine and the models described by (2). The AMP algorithm can be obtained as a Taylor expansion of loopy belief propagation (see [33]) and also originates in earlier statistical physics works [36, 37, 38, 39, 40, 26]. It is conjectured to perform best among all polynomial algorithms in the framework of these models. It thus gives us a tool to evaluate both the intrinsic algorithmic hardness of the learning problem and the performance of existing algorithms with respect to the optimal one in this model. The AMP algorithm is summarized by its pseudo-code in Algorithm 1, where the update functions g_out, ∂_ω g_out, f_W and f_C are related, again, to the two auxiliary problems (3) and (4).
Algorithm 1 Approximate Message Passing for the committee machine

Input: vector Y ∈ R^m and matrix X ∈ R^{m×n}.
Initialize: Ŵ_i, g_out,μ ∈ R^K and Ĉ_i, ∂_ω g_out,μ ∈ S_K^+ for 1 ≤ i ≤ n and 1 ≤ μ ≤ m, at t = 0.
repeat
    Update of the mean ω_μ ∈ R^K and covariance V_μ ∈ S_K^+:
        V_μ^t = Σ_{i=1}^n (X_μi²/n) Ĉ_i^t ,
        ω_μ^t = Σ_{i=1}^n (X_μi/√n) Ŵ_i^t − V_μ^t g_out,μ^{t−1} .
    Update of g_out,μ ∈ R^K and ∂_ω g_out,μ ∈ S_K^+:
        g_out,μ^t = g_out(ω_μ^t, Y_μ, V_μ^t) ,
        ∂_ω g_out,μ^t = ∂_ω g_out(ω_μ^t, Y_μ, V_μ^t) .
    Update of the mean T_i ∈ R^K and covariance Σ_i ∈ S_K^+:
        Σ_i^t = −[ Σ_{μ=1}^m (X_μi²/n) ∂_ω g_out,μ^t ]^{−1} ,
        T_i^t = Ŵ_i^t + Σ_i^t Σ_{μ=1}^m (X_μi/√n) g_out,μ^t .
    Update of the estimated marginals Ŵ_i ∈ R^K and Ĉ_i ∈ S_K^+:
        Ŵ_i^{t+1} = f_W(Σ_i^t, T_i^t) ,
        Ĉ_i^{t+1} = f_C(Σ_i^t, T_i^t) .
    t = t + 1
until convergence on Ŵ, Ĉ.
Output: Ŵ and Ĉ.

The functions f_W(Σ, T) and f_C(Σ, T) are respectively the mean and variance under the posterior distribution (3) when r → Σ^{−1} and Y_0 → Σ^{1/2} T, while g_out(ω_μ, Y_μ, V_μ) is given by the product of V_μ^{−1/2} and the mean of u under the posterior (4), using Ỹ_0 → Y_μ, ρ − q → V_μ and q^{1/2} V → ω_μ (see [33] for more details).
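The structure of Algorithm 1 may be easier to see in code. The following sketch is our own scalar (K = 1) instantiation of the same message-passing loop, with a Gaussian likelihood standing in for P_out so that g_out, f_W and f_C have simple closed forms; the committee machine itself requires the K-dimensional integrals described above, but the skeleton is identical:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, delta = 1000, 2.0, 0.01
m = int(alpha * n)

# Teacher with Gaussian weights; Gaussian channel y = X w*/sqrt(n) + noise
X = rng.standard_normal((m, n))
w_star = rng.standard_normal(n)
y = X @ w_star / np.sqrt(n) + np.sqrt(delta) * rng.standard_normal(m)

X2 = X**2
w_hat, c_hat, g = np.zeros(n), np.ones(n), np.zeros(m)
for t in range(50):
    V = X2 @ c_hat / n                          # covariance V_mu
    omega = X @ w_hat / np.sqrt(n) - V * g      # mean omega_mu, with Onsager term
    g = (y - omega) / (V + delta)               # g_out for the Gaussian channel
    dg = -1.0 / (V + delta)                     # its derivative w.r.t. omega
    Sigma = -1.0 / (X2.T @ dg / n)              # Sigma_i
    T = w_hat + Sigma * (X.T @ g) / np.sqrt(n)  # T_i
    w_hat = T / (1.0 + Sigma)                   # f_W: posterior mean for prior N(0,1)
    c_hat = Sigma / (1.0 + Sigma)               # f_C: posterior variance

mse = np.mean((w_hat - w_star) ** 2)
```

With α = 2 and small noise, the estimate aligns closely with the teacher. Replacing the Gaussian g_out, f_W and f_C with the committee-machine versions built from the posteriors (3) and (4) recovers Algorithm 1.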
After convergence, Ŵ estimates the weights of the teacher network. The label of a sample X_new not seen in the training set is estimated by the AMP algorithm as

    Ŷ_new^t = ∫ dy ( ∏_{l=1}^K dz_l ) y P_out(y | {z_l}^K_{l=1}) N(z; ω_new^t, V_new^t) ,    (9)

where ω_new^t = Σ_{i=1}^n X_new,i Ŵ_i^t / √n is the mean of the normally distributed variable z ∈ R^K, and V_new^t = ρ − q_AMP^t is the K × K covariance matrix (see below for the definition of q_AMP^t). We provide a demo of the algorithm on github [41].

AMP is particularly interesting because its performance can be tracked rigorously, again in the asymptotic limit n → ∞, via a procedure known as state evolution (a rigorous version of the cavity method in physics [14]), see [18]. State evolution tracks the overlap between the hidden ground truth W* and the AMP estimate Ŵ^t, defined as q_AMP^t ≡ lim_{n→∞} (Ŵ^t)^⊺ W*/n, via:

    q_AMP^{t+1} = 2 (∂ψ_{P_0}/∂r)(r_AMP^t) ,        r_AMP^{t+1} = 2α (∂Ψ_{P_out}/∂q)(q_AMP^t; ρ) .    (10)

The fixed points of these equations correspond to the critical points of the replica free entropy (7). Let us comment further on the convergence of the algorithm. In the large n limit, and if the integrals are performed without errors, the algorithm is guaranteed to converge; this is a consequence of state evolution combined with the Bayes-optimal setting. In practice, of course, n is finite and the integrals are approximated. In that case convergence is not guaranteed, but it was robustly achieved in all the cases presented in this paper. We also expect (by experience with the single-layer case) that if the input-data matrix is not random (i.e.
beyond our assumptions) then we will encounter convergence issues, which could be fixed by moving to a variant of the algorithm such as VAMP [42].

3 From two to more hidden neurons, and the specialization phase transition

Two neurons — Let us now discuss how the above results can be used to study optimal learning in the simplest non-trivial case of a two-layer neural network with two hidden neurons, i.e. when model (1) is simply

    Y_μ = sign[ sign( Σ_{i=1}^n X_μi W*_i1 ) + sign( Σ_{i=1}^n X_μi W*_i2 ) ] ,

with the convention that sign(0) = 0. We remind that the input-data matrix X has i.i.d. N(0, 1) entries, and that the teacher weights W* used to generate the labels Y are taken i.i.d. from P_0.

In Fig. 1 we plot the optimal generalization error as a function of the sample complexity α = m/n. In the left panel the weights are Gaussian (for both the teacher and the student), while in the center panel they are binary/Rademacher. The full line is obtained from the fixed point of the state evolution (SE) of the AMP algorithm (10), corresponding to the extremizer of the replica free entropy (7). The points are results of the AMP algorithm run till convergence, averaged over 10 instances of size n = 10^4. In this case, and with random initial conditions, the AMP algorithm converged in all our trials. As expected, we observe excellent agreement between the SE and AMP.

In both the left and center panels of Fig. 1 we observe the so-called specialization phase transition. Indeed (10) has two types of fixed points: a non-specialized fixed point where every element of the K × K order parameter q is the same (so that both hidden neurons learn the same function), and a specialized fixed point where the diagonal elements of the order parameter are different from the non-diagonal ones.
We checked for other types of fixed points for K = 2 (e.g. one where the two diagonal elements differ), but have not found any. In terms of weight learning, this means that at the non-specialized fixed point the estimators for both W_1 and W_2 are the same, whereas at the specialized fixed point the estimators of the weights corresponding to the two hidden neurons are different: the network has "figured out" that the data are better described by a non-linearly-separable model. The specialized fixed point is associated with lower error than the non-specialized one (as one can see in Fig. 1). The existence of this phase transition was discussed in the statistical physics literature on the committee machine, see e.g. [20, 23].

For Gaussian weights (Fig. 1, left), the specialization phase transition arises continuously at α_spec^G(K = 2) ≃ 2.04. This means that for α < α_spec^G(K = 2) the number of samples is too small, and the student network is not able to learn that two different teacher vectors W_1 and W_2 were used to generate the observed labels. For α > α_spec^G(K = 2), however, it is able to distinguish the two different weight vectors, and the generalization error decreases fast to low values (see Fig. 1). For completeness we recall that in the case K = 1, corresponding to a single-layer neural network, no such specialization transition exists. We show in [33] that it is absent also in multi-layer neural networks as long as the activations remain linear; the non-linearity of the activation function is therefore an essential ingredient for observing a specialization phase transition. The center panel of Fig. 1 depicts the fixed point reached by the state evolution of AMP for the case of binary weights.
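The necessity of the non-linearity admits a direct check: with a linear hidden activation, the output of a K-neuron committee only depends on the summed weight vector, so the machine collapses to a single effective neuron and there is nothing for the hidden units to specialize on. A small numerical confirmation (our own illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(7)
n, K, m = 500, 2, 2000
W = rng.standard_normal((n, K))   # weights of the K hidden neurons
X = rng.standard_normal((m, n))

# Linear hidden activation: sign(sum_l x.W_l) = sign(x.(W_1 + ... + W_K)),
# i.e. a single effective neuron with weight vector sum_l W_l
Y_linear = np.sign((X @ W).sum(axis=1))
Y_single = np.sign(X @ W.sum(axis=1))
assert np.array_equal(Y_linear, Y_single)

# With the sign non-linearity of eq. (1) the collapse fails:
Y_committee = np.sign(np.sign(X @ W).sum(axis=1))
assert not np.array_equal(Y_committee, Y_single)
```

Only in the non-linear case do the individual vectors W_1, W_2 matter beyond their sum, which is what the specialization transition detects.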
We observe two phase transitions in the performance of AMP in this case: (a) the specialization phase transition at α_spec^B(K = 2) ≃ 1.58, and, at slightly larger sample complexity, (b) a transition towards perfect generalization (beyond which the generalization error is asymptotically zero) at α_perf^B(K = 2) ≃ 1.99. The binary case with K = 2 differs from the Gaussian one in that perfect generalization is achievable at finite α. While the specialization transition is continuous here, the error has a discontinuity at the transition to perfect generalization. This discontinuity is associated with a 1st order phase transition (in the physics nomenclature), leading to a gap between the algorithmic (AMP, in our case) performance and the information-theoretically optimal performance reachable by exponential algorithms. To quantify the optimal performance we need to evaluate the global extremum of the replica free entropy (not the local one reached by the state evolution). In doing so, we find that information-theoretically there is a single discontinuous phase transition towards perfect generalization, at α_IT^B(K = 2) ≃ 1.54.

While the information-theoretic and specialization phase transitions were identified in the physics literature on the committee machine [20, 21, 3, 4], the gap between the information-theoretic performance and the performance of AMP (conjectured to be optimal among polynomial algorithms) had not yet been discussed in the context of this model. Indeed, even its understanding in simpler models, such as the single-layer case, is more recent [15, 26, 25].

More is different — It becomes more difficult to study the replica formula for larger values of K, as it involves (at least) K-dimensional integrals. Quite interestingly, it is possible to work out the solution of the replica formula in the large K limit (taken after the large n limit, so that K/n vanishes). It is indeed natural to look for solutions of the replica formula, as suggested in [19], of the form q = q_d I_{K×K} + (q_a/K) 1_K 1_K^⊺, with the unit vector 1_K = (1)^K_{l=1}. Since both q and ρ are assumed to be positive, this scaling implies [33] that 0 ≤ q_d ≤ 1 and 0 ≤ q_a + q_d ≤ 1, as it should. We also detail in [33] the corresponding large-K expansion of the free entropy for the teacher-student scenario with Gaussian weights. Only the information-theoretically reachable generalization error was computed in [19], so we concentrated on the analysis of the performance of AMP by tracking the state evolution equations. In doing so, we unveil a large computational gap.

Figure 1: Order parameter and optimal generalization error for a committee machine with two hidden neurons with Gaussian weights (left), binary/Rademacher weights (center), and for Gaussian weights in the large number of hidden units limit (right). These are shown as a function of the ratio α = m/n between the number of samples m and the dimensionality n. Lines are obtained from the state evolution equations (the dominating solution is shown as a full line), data points from the AMP algorithm averaged over 10 instances of the problem of size n = 10^4. q_00 and q_01 denote diagonal and off-diagonal overlaps, and their values are given by the labels on the far right of the figure.

In the right plot of Fig. 1 we show the fixed-point values of the two overlaps q_00 = q_d + q_a/K and q_01 = q_a/K, and the resulting generalization error. As discussed in [19], it can be written in closed form as ε_g = arccos[2(q_a + arcsin q_d)/π]/π. The specialization transition arises for α = Θ(K), so we define α̃ ≡ α/K. The specialization is now a 1st order phase transition, meaning that the specialization fixed point first appears at α̃_spinodal^G ≃ 7.17, but the free entropy global extremizer remains the one of the non-specialized fixed point until α̃_spec^G ≃ 7.65. This has interesting implications for the optimal generalization error: it stays on a plateau of value ε_plateau ≃ 0.28 for α̃ < α̃_spec^G, and then jumps discontinuously down to reach a decay asymptotically as 1.25/α̃.

AMP is conjectured to be optimal among all polynomial algorithms (in the considered limit), and thus analyzing its state evolution sheds light on possible computational-to-statistical gaps that come hand in hand with 1st order phase transitions. In the regime α = Θ(K) for large K, the non-specialized fixed point is always stable, implying that AMP will not be able to give a lower generalization error than ε_plateau. Analyzing the replica formula for large K in more detail [33], we concluded that AMP will not reach the optimal generalization error for any α < Θ(K²). This implies a rather sizable gap between the performance that can be reached information-theoretically and the one reachable tractably. Such large computational gaps have been previously identified in a range of inference problems, most famously in the planted clique problem [27], but the committee machine is the first model of a multi-layer neural network with realistic non-linearities presenting such a large gap (the parity machine is another example, but it uses a very peculiar non-linearity).

4 Sketch of proof of Theorem 2.1

We denote K-dimensional column vectors by underlined letters. In particular W*_i = (W*_il)^K_{l=1} and w_i = (w_il)^K_{l=1}. For μ = 1, ..., m, let V_μ, U*_μ be K-dimensional vectors with i.i.d.
$\mathcal{N}(0,1)$ components. Let $s_n \in (0, 1/2]$ be a sequence that goes to 0 as $n$ increases, and let $\mathcal{M}$ be the compact subset of matrices in $\mathcal{S}^{++}_K$ with eigenvalues in the interval $[1, 2]$. For all $M \in s_n\mathcal{M}$, $2 s_n I_{K\times K} - M \in \mathcal{S}^+_K$. Let $\epsilon = (\epsilon_1, \epsilon_2) \in (s_n\mathcal{M})^2$. Let then $q : [0,1] \to \mathcal{S}^+_K(\rho)$ and $r : [0,1] \to \mathcal{S}^+_K$ be two "interpolation functions" (that will later on depend on $\epsilon$), and set $R_1(t) \equiv \epsilon_1 + \int_0^t r(v)\,dv$, $R_2(t) \equiv \epsilon_2 + \int_0^t q(v)\,dv$. For $t \in [0,1]$, define the $K$-dimensional vector

$$\underline{S}_{t,\mu} \equiv \sqrt{\frac{1-t}{n}} \sum_{i=1}^n X_{\mu i}\, \underline{W}^*_i + \sqrt{R_2(t)}\; \underline{V}_\mu + \sqrt{t\rho - R_2(t) + 2 s_n I_{K\times K}}\; \underline{U}^*_\mu\,,$$

where the matrix square-roots are well defined. We interpolate with auxiliary problems related to those discussed in Sec. 2:

$$Y_{t,\mu} \sim P_{\rm out}(\,\cdot\,|\, \underline{S}_{t,\mu}), \quad 1 \le \mu \le m, \qquad \underline{Y}'_{t,i} = \sqrt{R_1(t)}\, \underline{W}^*_i + \underline{Z}'_i, \quad 1 \le i \le n, \qquad (11)$$

where $\underline{Z}'_i$ is (for each $i$) a $K$-vector with i.i.d. $\mathcal{N}(0,1)$ components, and $\underline{Y}'_{t,i}$ is a $K$-vector as well. Recall that in our notation the $*$-variables have to be retrieved, while the other random variables are assumed to be known (except for the noise variables, obviously). Define now $\underline{s}_{t,\mu}$ by the expression of $\underline{S}_{t,\mu}$ but with $\underline{w}_i$ replacing $\underline{W}^*_i$ and $\underline{u}_\mu$ replacing $\underline{U}^*_\mu$. We introduce the interpolating posterior:

$$P_{t,\epsilon}(w, u \,|\, Y_t, Y'_t, X, V) = \frac{1}{\mathcal{Z}_n(t,\epsilon)} \prod_{i=1}^n P_0(\underline{w}_i)\, e^{-\frac12 \|\underline{Y}'_{t,i} - \sqrt{R_1(t)}\,\underline{w}_i\|^2} \prod_{\mu=1}^m \frac{e^{-\frac12\|\underline{u}_\mu\|^2}}{(2\pi)^{K/2}}\, P_{\rm out}(Y_{t,\mu}\,|\,\underline{s}_{t,\mu})\,,$$

where the normalization factor $\mathcal{Z}_n(t,\epsilon)$ equals the numerator integrated over all components of $w$ and $u$. The average free entropy at time $t$ is by definition $f_{n,\epsilon}(t) \equiv \mathbb{E} \ln \mathcal{Z}_n(t,\epsilon)/n$. One verifies, using in particular continuity and boundedness properties of $\psi_{P_0}$ and $\Psi_{P_{\rm out}}$ (see [33] for details):

$$f_{n,\epsilon}(0) = f_n - \frac{K}{2} + \mathcal{O}_n(1)\,, \qquad f_{n,\epsilon}(1) = \psi_{P_0}\Big(\int_0^1 r(t)\,dt\Big) + \alpha\,\Psi_{P_{\rm out}}\Big(\int_0^1 q(t)\,dt;\, \rho\Big) - \frac12 {\rm Tr}\Big(\rho \int_0^1 r(v)\,dv\Big) - \frac{K}{2} + \mathcal{O}_n(1)\,.$$

Here $\mathcal{O}_n(1) \to 0$ in the $n, m \to \infty$ limit uniformly in $t, q, r, \epsilon$.
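A useful way to see why this interpolation is well posed is that the second moment of $\underline{S}_{t,\mu}$ is (approximately) independent of $t$: the three terms trade off so that the covariance stays close to $\rho + 2 s_n I_K$ at every $t$. A minimal numerical sketch of this invariance (toy sizes, and we drop the $\epsilon$-perturbation of $R_2$ for simplicity; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, s_n = 500, 2, 0.05
rho = np.eye(K)                      # teacher second moment for N(0,1) weights
W = rng.standard_normal((n, K))      # teacher weights W*_i

def sample_S(t, R2, n_samples=8000):
    """Sample S_{t,mu} = sqrt((1-t)/n) X W* + sqrt(R2) V + sqrt(t rho - R2 + 2 s_n I) U*."""
    X = rng.standard_normal((n_samples, n))
    V = rng.standard_normal((n_samples, K))
    U = rng.standard_normal((n_samples, K))
    A = np.linalg.cholesky(R2 + 1e-12 * np.eye(K))            # matrix sqrt of R2(t)
    B = np.linalg.cholesky(t * rho - R2 + 2 * s_n * np.eye(K))
    return np.sqrt((1 - t) / n) * X @ W + V @ A.T + U @ B.T

# Empirical covariance of S_{t,mu}: (1-t) W^T W / n + R2 + (t rho - R2 + 2 s_n I),
# i.e. approximately rho + 2 s_n I at every t, here with two different valid R2(t).
C = {}
for t, R2 in [(0.0, np.zeros((K, K))), (1.0, 0.5 * np.eye(K))]:
    S = sample_S(t, R2)
    C[t] = S.T @ S / len(S)
    print(t, np.round(C[t], 2))
```

At both endpoints the empirical covariance is close to $\rho + 2 s_n I_K$, which is the design property the interpolation path exploits.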
The next step is to compute the free entropy variation along the interpolation path [33]:

Proposition 4.1 (Free entropy variation) Denote by $\langle - \rangle_{n,t,\epsilon}$ the (Gibbs) expectation w.r.t. the posterior $P_{t,\epsilon}$. Set $u_y(x) \equiv \ln P_{\rm out}(y|x)$. For all $t \in [0,1]$ the derivative $\frac{df_{n,\epsilon}(t)}{dt}$ equals

$$-\frac12\, \mathbb{E}\Big\langle {\rm Tr}\Big[\Big(\frac1n \sum_{\mu=1}^m \nabla u_{Y_{t,\mu}}(\underline{s}_{t,\mu})\, \nabla u_{Y_{t,\mu}}(\underline{S}_{t,\mu})^\intercal - r(t)\Big)\big(Q - q(t)\big)\Big]\Big\rangle_{n,t,\epsilon} - \frac12 {\rm Tr}\big[r(t)(q(t) - \rho)\big] + \mathcal{O}_n(1)\,,$$

where $\nabla$ is the $K$-dimensional gradient w.r.t. the argument of $u_{Y_{t,\mu}}(\cdot)$, and $\mathcal{O}_n(1) \to 0$ in the $n, m \to \infty$ limit uniformly in $t, q, r, \epsilon$. Here, $Q_{ll'} \equiv \sum_{i=1}^n W^*_{il} w_{il'}/n$ is a $K \times K$ overlap matrix.

A crucial step of the adaptive interpolation method is to show that the overlap concentrates (see [33]):

Proposition 4.2 (Overlap concentration) Assume that for any $t \in (0,1)$ the transformation $\epsilon \in (s_n\mathcal{M})^2 \mapsto (R_1(t,\epsilon), R_2(t,\epsilon))$ is a $\mathcal{C}^1$ diffeomorphism with a Jacobian greater or equal to 1. Then one can find a sequence $s_n$ going to 0 slowly enough such that there exists a constant $C > 0$ and a $\gamma > 0$, that only depend on the support and moments of $P_0$ and on the activation $\varphi_{\rm out}$ and $\alpha$, such that ($\|-\|_F$ is the Frobenius norm)

$${\rm Vol}(s_n\mathcal{M})^{-2} \int_{(s_n\mathcal{M})^2} d\epsilon \int_0^1 dt\; \mathbb{E}\big\langle \|Q - \mathbb{E}\langle Q\rangle_{n,t,\epsilon}\|_F^2 \big\rangle_{n,t,\epsilon} \le C n^{-\gamma}\,.$$

Let $f_{\rm RS}(q, r) \equiv \psi_{P_0}(r) + \alpha\,\Psi_{P_{\rm out}}(q; \rho) - {\rm Tr}(rq)/2$ be the replica symmetric (RS) potential. We have:

Proposition 4.3 (Lower bound) $\liminf_{n\to\infty} f_n \ge \sup_{r \in \mathcal{S}^+_K} \inf_{q \in \mathcal{S}^+_K(\rho)} f_{\rm RS}(q, r)$.

Proof: Choose first $r(t) = r \in \mathcal{S}^+_K$ a fixed matrix. Then $R(t) = (R_1(t), R_2(t))$ can be fixed as the solution to the first order differential equation $\partial_t R_1(t,\epsilon) = r$, $\partial_t R_2(t,\epsilon) = \mathbb{E}\langle Q\rangle_{n,t,\epsilon}$ and $R(0,\epsilon) = \epsilon$.
We denote it $R(t,\epsilon) = (rt + \epsilon_1,\ \int_0^t q(v,\epsilon; r)\,dv + \epsilon_2)$. It is possible to check (see [33]) that this ODE satisfies the hypotheses of the parametric Cauchy-Lipschitz theorem, and that by the Liouville formula the determinant of the Jacobian of $\epsilon \mapsto R(t,\epsilon)$ satisfies $J_{n,\epsilon}(t) = \exp\{\int_0^t \sum_{l \ge l'}^K \partial_{(R_2)_{ll'}} \mathbb{E}\langle Q_{ll'}\rangle_{n,s,\epsilon}(s, R(s,\epsilon))\,ds\} \ge 1$; indeed, this sum of partial derivatives is always positive, see [33]. Using then Prop. 4.1 and Prop. 4.2, we obtain $f_n = {\rm Vol}(s_n\mathcal{M})^{-2}\int_{(s_n\mathcal{M})^2} d\epsilon\, f_{\rm RS}(\int_0^1 q(v,\epsilon; r)\,dv,\ r) + \mathcal{O}_n(1)$. This implies the lower bound.

Proposition 4.4 (Upper bound) $\limsup_{n\to\infty} f_n \le \sup_{r\in\mathcal{S}^+_K}\inf_{q\in\mathcal{S}^+_K(\rho)} f_{\rm RS}(q,r)$.

Proof: We now fix $R(t) = (R_1(t), R_2(t))$ as the solution $R(t,\epsilon) = (\int_0^t r(v,\epsilon)\,dv + \epsilon_1,\ \int_0^t q(v,\epsilon)\,dv + \epsilon_2)$ to the following Cauchy problem: $\partial_t R_1(t,\epsilon) = 2\alpha\nabla\Psi_{P_{\rm out}}(\mathbb{E}\langle Q\rangle_{n,t,\epsilon})$, $\partial_t R_2(t,\epsilon) = \mathbb{E}\langle Q\rangle_{n,t,\epsilon}$, and $R(0,\epsilon) = \epsilon$. We denote this equation as $\partial_t R(t,\epsilon) = F_n(R(t,\epsilon), t)$. It is then possible to verify that $F_n(R(t,\epsilon), t)$ is a bounded $\mathcal{C}^1$ function of $R(t,\epsilon)$, and thus a direct application of the Cauchy-Lipschitz theorem implies that $R(t,\epsilon)$ is a $\mathcal{C}^1$ function of $t$ and $\epsilon$, and by unicity of the solution the function $\epsilon \mapsto R(t,\epsilon)$ is injective for any $t$. Since $\mathbb{E}\langle Q\rangle_{n,t,\epsilon}$ and $\rho - \mathbb{E}\langle Q\rangle_{n,t,\epsilon}$ are positive matrices (see [33]) we also have $q(t,\epsilon) \in \mathcal{S}^+_K(\rho)$; and since by the differential equation $r(t,\epsilon) = 2\alpha\nabla\Psi_{P_{\rm out}}(q(t,\epsilon))$ and $\nabla\Psi_{P_{\rm out}}(q) \in \mathcal{S}^+_K$ (see [33]), then $r(t,\epsilon) \in \mathcal{S}^+_K$. Moreover, the Liouville formula for the Jacobian of the map $\epsilon \in (s_n\mathcal{M})^2 \mapsto R(t,\epsilon) \in R(t, (s_n\mathcal{M})^2)$ yields $J_{n,\epsilon}(t) = \exp\{\int_0^t \sum_{l\ge l'}^K [\partial_{(R_1)_{ll'}}(F_{n,1})_{ll'} + \partial_{(R_2)_{ll'}}(F_{n,2})_{ll'}](s, R(s,\epsilon))\,ds\}$. For all $s \in [0,1]$ the integrand $\sum_{l\ge l'}[\dots] \ge 0$ (see [33]). We can again apply Prop. 4.2, and obtain

$$f_n = {\rm Vol}(s_n\mathcal{M})^{-2}\int_{(s_n\mathcal{M})^2} d\epsilon\, \Big\{\psi_{P_0}\Big(\int_0^1 r(\epsilon,v)\,dv\Big) + \alpha\,\Psi_{P_{\rm out}}\Big(\int_0^1 q(\epsilon,v)\,dv;\, \rho\Big) - \frac12 {\rm Tr}\int_0^1 q(\epsilon,v)\, r(\epsilon,v)\,dv\Big\} + \mathcal{O}_n(1)\,.$$

By convexity of $\psi_{P_0}$ and $\Psi_{P_{\rm out}}$, $f_n \le {\rm Vol}(s_n\mathcal{M})^{-2}\int d\epsilon \int_0^1 dv\, f_{\rm RS}(q(\epsilon,v), r(\epsilon,v)) + \mathcal{O}_n(1)$. We now remark that $f_{\rm RS}(q(\epsilon,v), r(\epsilon,v)) = \inf_{q\in\mathcal{S}^+_K(\rho)} f_{\rm RS}(q, r(\epsilon,v))$. Indeed, for every $r \in \mathcal{S}^+_K$, the function $g_r : q \in \mathcal{S}^+_K(\rho) \mapsto f_{\rm RS}(q, r) \in \mathbb{R}$ can be shown to be convex (see [33]) and its $q$-derivative is $\nabla g_r(q) = \alpha\nabla\Psi_{P_{\rm out}}(q) - r/2$. Since $\nabla g_{r(\epsilon,v)}(q(\epsilon,v)) = 0$ by definition of $r(\epsilon,v)$, and $\mathcal{S}^+_K(\rho)$ is convex, the minimum of $g_{r(\epsilon,v)}(q)$ is necessarily achieved at $q = q(\epsilon,v)$. Therefore

$$f_n \le {\rm Vol}(s_n\mathcal{M})^{-2}\int_{(s_n\mathcal{M})^2} d\epsilon \int_0^1 dv\, \inf_{q\in\mathcal{S}^+_K(\rho)} f_{\rm RS}(q, r(\epsilon,v)) + \mathcal{O}_n(1) \le \sup_{r\in\mathcal{S}^+_K}\,\inf_{q\in\mathcal{S}^+_K(\rho)} f_{\rm RS}(q, r) + \mathcal{O}_n(1)\,,$$

which concludes the proof of Prop. 4.4. Combined with Prop. 4.3 we obtain Thm. 2.1.
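The variational formula behind Theorem 2.1 is a sup-inf of the RS potential. The following toy sketch illustrates the nested extremization for a scalar ($K = 1$) analogue; `psi` and `Psi` below are hypothetical placeholder potentials of our own choosing, not the paper's $\psi_{P_0}$ and $\Psi_{P_{\rm out}}$, and all names are ours:

```python
import numpy as np

# Scalar stand-ins: a concave "prior" term in r and a convex "channel" term in q,
# so that f_RS(q, r) = psi(r) + alpha * Psi(q) - r*q/2 has the same sup-inf structure.
alpha = 2.0
psi = lambda r: 0.5 * np.log1p(r)          # hypothetical prior potential
Psi = lambda q: 0.25 * q**2                # hypothetical channel potential

def f_rs(q, r):
    return psi(r) + alpha * Psi(q) - 0.5 * r * q

qs = np.linspace(0.0, 1.0, 201)            # q constrained to [0, rho] with rho = 1
rs = np.linspace(0.0, 10.0, 2001)

# sup over r of inf over q, in the same order as in Propositions 4.3-4.4.
inner = np.array([f_rs(qs, r).min() for r in rs])
r_star = rs[inner.argmax()]
q_star = qs[f_rs(qs, r_star).argmin()]
print(f"sup-inf value {inner.max():.4f} at (q*, r*) = ({q_star:.3f}, {r_star:.3f})")

# The inner minimum satisfies the stationarity condition alpha * Psi'(q) = r/2,
# i.e. q* = r*/alpha with these placeholders.
```

For these placeholder potentials the extremizer can be computed analytically: $r^* = 1$, $q^* = 1/2$, with value $\frac12\ln 2 - \frac18$, which the grid search recovers.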
5 Discussion

One of the contributions of this paper is the design of an AMP-type algorithm that is able to achieve the Bayes-optimal learning error in the limit of large dimensions for a range of parameters out of the so-called hard phase. The hard phase is associated with first-order phase transitions appearing in the solution of the model. In the case of the committee machine with a large number of hidden neurons we identify a large hard phase in which learning is possible information-theoretically but not efficiently. In other problems where such a hard phase was identified, its study boosted the development of algorithms able to match the predicted threshold. We anticipate the same will happen for the present model. We should, however, note that for $K > 2$ the present AMP algorithm involves higher-dimensional integrals that hamper its speed. Our current strategy to tackle this is to combine it with the large-$K$ expansion and use the latter in the algorithm. A detailed account of the corresponding results is left for future work.

We studied the Bayes-optimal setting where the student network is the same as the teacher network, for which the replica method can be readily applied. The method still applies when the numbers of hidden units in the student and the teacher differ, although our proof does not generalize easily to this case.
It is an interesting subject for future work to see how the hard phase evolves under over-parametrization and what the interplay is between the simplicity of the loss landscape and the achievable generalization error. We conjecture that in the present model over-parametrization will not improve the generalization error achieved by AMP in the Bayes-optimal case.

Even though we focused in this paper on a two-layers neural network, the analysis and algorithm can be readily extended to a multi-layer setting, see [22], as long as the number of layers as well as the number of hidden neurons in each layer is held constant, and as long as one learns only the weights of the first layer, for which the proof already applies. The numerical evaluation of the phase diagram would be more challenging than the cases presented in this paper, as multiple integrals would appear in the corresponding formulas. In future work, we also plan to analyze the case where the weights of the second and subsequent layers (including the biases of the activation functions) are also learned. This could be done for instance with a combination of EM and AMP along the lines of [43, 44], where this was done for the simpler single-layer case.

Concerning extensions of the present work, an important open case is the one where the number of samples per dimension $\alpha = \Theta(1)$ and also the size of the hidden layer per dimension $K/n = \Theta(1)$ as $n \to \infty$, while in this paper we treated the case $K = \Theta(1)$ and $n \to \infty$. This other scaling, where $K/n = \Theta(1)$, is challenging even for the non-rigorous replica method.

6 Acknowledgements

This work is supported by the ERC under the European Union's Horizon 2020 Research and Innovation Program 714608-SMiLe, as well as by the French Agence Nationale de la Recherche under grant ANR-17-CE23-0023-01 PAIL.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. Jean Barbier was supported by the Swiss National Foundation grant no 200021-156672. We also thank Léo Miolane for fruitful discussions.

References

[1] Vladimir Vapnik. Statistical learning theory. Wiley, New York, 1998.

[2] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.

[3] Sebastian Seung, Haim Sompolinsky, and Naftali Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056, 1992.

[4] Timothy LH Watkin, Albrecht Rau, and Michael Biehl. The statistical mechanics of learning a rule. Reviews of Modern Physics, 65(2):499, 1993.

[5] Rémi Monasson and Riccardo Zecchina. Learning and generalization theories of large committee-machines. Modern Physics Letters B, 9(30):1887-1897, 1995.

[6] Rémi Monasson and Riccardo Zecchina. Weight space structure and internal representations: a direct approach to learning and generalization in multilayer neural networks. Physical Review Letters, 75(12):2432, 1995.

[7] Andreas Engel and Christian PL Van den Broeck. Statistical Mechanics of Learning. Cambridge University Press, 2001.

[8] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016. In ICLR 2017.

[9] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016. In ICLR 2017.

[10] Charles H Martin and Michael W Mahoney. Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior. arXiv preprint arXiv:1710.09553, 2017.

[11] Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. Phase transitions, optimal errors and optimality of message-passing in generalized linear models. arXiv preprint arXiv:1708.03395, 2017.

[12] Marco Baity-Jesi, Levent Sagun, Mario Geiger, Stefano Spigler, Gérard Ben-Arous, Chiara Cammarota, Yann LeCun, Matthieu Wyart, and Giulio Biroli. Comparing dynamics: Deep neural networks versus glassy systems. arXiv preprint arXiv:1803.06969, 2018.

[13] Marc Mézard, Giorgio Parisi, and Miguel Virasoro. Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, volume 9. World Scientific Publishing Company, 1987.

[14] Marc Mézard and Andrea Montanari. Information, Physics, and Computation. Oxford University Press, 2009.

[15] David L Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914-18919, 2009.

[16] Sundeep Rangan. Generalized approximate message passing for estimation with random linear mixing. In Information Theory Proceedings (ISIT), 2011 IEEE International Symposium on, pages 2168-2172. IEEE, 2011.

[17] Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764-785, 2011.

[18] Adel Javanmard and Andrea Montanari. State evolution for general approximate message passing algorithms, with applications to spatial coupling. Information and Inference: A Journal of the IMA, 2(2):115-144, 2013.

[19] Henry Schwarze. Learning a rule in a multilayer neural network. Journal of Physics A: Mathematical and General, 26(21):5781, 1993.

[20] Henry Schwarze and John Hertz. Generalization in a large committee machine. EPL (Europhysics Letters), 20(4):375, 1992.

[21] Henry Schwarze and John Hertz. Generalization in fully connected committee machines. EPL (Europhysics Letters), 21(7):785, 1993.

[22] German Mato and Nestor Parga. Generalization properties of multilayered neural networks. Journal of Physics A: Mathematical and General, 25(19):5047, 1992.

[23] David Saad and Sara A Solla. On-line learning in soft committee machines. Physical Review E, 52(4):4225, 1995.

[24] Jean Barbier and Nicolas Macris. The adaptive interpolation method: A simple scheme to prove replica formulas in Bayesian inference. To appear in Probability Theory and Related Fields, arXiv preprint arXiv:1705.02780 [cs.IT], 2017.

[25] David L Donoho, Iain Johnstone, and Andrea Montanari. Accurate prediction of phase transitions in compressed sensing via a connection to minimax denoising. IEEE Transactions on Information Theory, 59(6):3396-3433, 2013.

[26] Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: thresholds and algorithms. Advances in Physics, 65(5):453-552, 2016.

[27] Yash Deshpande and Andrea Montanari. Finding hidden cliques of size $\sqrt{N/e}$ in nearly linear time. Foundations of Computational Mathematics, 15(4):1069-1128, 2015.

[28] Afonso S Bandeira, Amelia Perry, and Alexander S Wein. Notes on computational-to-statistical gaps: predictions using statistical physics. arXiv preprint arXiv:1803.11132, 2018.

[29] Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.

[30] Ahmed El Alaoui, Aaditya Ramdas, Florent Krzakala, Lenka Zdeborová, and Michael I Jordan. Decoding from pooled data: Sharp information-theoretic bounds. arXiv preprint arXiv:1611.09981, 2016.

[31] Ahmed El Alaoui, Aaditya Ramdas, Florent Krzakala, Lenka Zdeborová, and Michael I Jordan. Decoding from pooled data: Phase transitions of message passing. In Information Theory (ISIT), 2017 IEEE International Symposium on, pages 2780-2784. IEEE, 2017.

[32] Junan Zhu, Dror Baron, and Florent Krzakala. Performance limits for noisy multimeasurement vector problems. IEEE Transactions on Signal Processing, 65(9):2444-2454, 2017.

[33] Benjamin Aubin, Antoine Maillard, Jean Barbier, Florent Krzakala, Nicolas Macris, and Lenka Zdeborová. The committee machine: Computational to statistical gaps in learning a two-layers neural network. arXiv preprint arXiv:1806.05451, 2018.

[34] Francesco Guerra. Broken replica symmetry bounds in the mean field spin glass model. Communications in Mathematical Physics, 233(1):1-12, 2003.

[35] Michel Talagrand. Spin Glasses: A Challenge for Mathematicians: Cavity and Mean Field Models, volume 46. Springer Science & Business Media, 2003.

[36] David J Thouless, Philip W Anderson, and Robert G Palmer. Solution of 'Solvable model of a spin glass'. Philosophical Magazine, 35(3):593-601, 1977.

[37] Marc Mézard. The space of interactions in neural networks: Gardner's computation with the cavity method. Journal of Physics A: Mathematical and General, 22(12):2181-2190, 1989.

[38] Manfred Opper and Ole Winther. Mean field approach to Bayes learning in feed-forward neural networks. Physical Review Letters, 76(11):1964, 1996.

[39] Yoshiyuki Kabashima. Inference from correlated patterns: a unified theory for perceptron learning and linear vector channels. Journal of Physics: Conference Series, 95(1):012001, 2008.

[40] Carlo Baldassi, Alfredo Braunstein, Nicolas Brunel, and Riccardo Zecchina. Efficient supervised learning in networks with binary synapses. Proceedings of the National Academy of Sciences, 104(26):11079-11084, 2007.

[41] Benjamin Aubin, Antoine Maillard, Jean Barbier, Florent Krzakala, Nicolas Macris, and Lenka Zdeborová. AMP implementation of the committee machine. https://github.com/benjaminaubin/TheCommitteeMachine, 2018.

[42] Philip Schniter, Sundeep Rangan, and Alyson K Fletcher. Vector approximate message passing for the generalized linear model. In Signals, Systems and Computers, 2016 50th Asilomar Conference on, pages 1525-1529. IEEE, 2016.

[43] Florent Krzakala, Marc Mézard, Francois Sausset, Yifan Sun, and Lenka Zdeborová. Probabilistic reconstruction in compressed sensing: algorithms, phase diagrams, and threshold achieving matrices. Journal of Statistical Mechanics: Theory and Experiment, 2012(08):P08009, 2012.

[44] Ulugbek Kamilov, Sundeep Rangan, Michael Unser, and Alyson K Fletcher. Approximate message passing with consistent parameter estimation and applications to sparse learning. In Advances in Neural Information Processing Systems, pages 2438-2446, 2012.