{"title": "The spiked matrix model with generative priors", "book": "Advances in Neural Information Processing Systems", "page_first": 8366, "page_last": 8377, "abstract": "Using a low-dimensional parametrization of signals is a generic and powerful way to enhance performance in signal processing and statistical inference. A very popular and widely explored type of dimensionality reduction is sparsity; another type is generative modelling of signal distributions. Generative models based on neural networks, such as GANs or variational auto-encoders, are particularly performant and are gaining on applicability. In this paper we study spiked matrix models, where a low-rank matrix is observed through a noisy channel. This problem with sparse structure of the spikes has attracted broad attention in the past literature. Here, we replace the sparsity assumption by generative modelling, and investigate the consequences on statistical and algorithmic properties. We analyze the Bayes-optimal performance under specific generative models for the spike. In contrast with the sparsity assumption, we do not observe regions of parameters where statistical performance is superior to the best known algorithmic performance. We show that in the analyzed cases the approximate message passing algorithm is able to reach optimal performance. We also design enhanced spectral algorithms and analyze their performance and thresholds using random matrix theory, showing their superiority to the classical principal component analysis. 
We complement our theoretical results by illustrating the performance of the spectral algorithms when the spikes come from real datasets.", "full_text": "The spiked matrix model with generative priors\n\nBenjamin Aubin†, Bruno Loureiro†, Antoine Maillard⋆, Florent Krzakala⋆, Lenka Zdeborová†\n\nAbstract\n\nUsing a low-dimensional parametrization of signals is a generic and powerful way to enhance performance in signal processing and statistical inference. A very popular and widely explored type of dimensionality reduction is sparsity; another type is generative modelling of signal distributions. Generative models based on neural networks, such as GANs or variational auto-encoders, are particularly performant and are gaining in applicability. In this paper we study spiked matrix models, where a low-rank matrix is observed through a noisy channel. This problem with sparse structure of the spikes has attracted broad attention in the past literature. Here, we replace the sparsity assumption by generative modelling, and investigate the consequences on statistical and algorithmic properties. We analyze the Bayes-optimal performance under specific generative models for the spike. In contrast with the sparsity assumption, we do not observe regions of parameters where statistical performance is superior to the best known algorithmic performance. We show that in the analyzed cases the approximate message passing algorithm is able to reach optimal performance. We also design enhanced spectral algorithms and analyze their performance and thresholds using random matrix theory, showing their superiority to the classical principal component analysis.
We complement our theoretical results by illustrating the performance of the spectral algorithms when the spikes come from real datasets.\n\n1 Introduction\n\nA key idea of modern signal processing is to exploit the structure of the signals under investigation. A traditional and powerful way of doing so is via sparse representations of the signals. Images are typically sparse in the wavelet domain, sound in the Fourier domain, and sparse coding [1] is designed to search automatically for dictionaries in which the signal is sparse. This compressed representation of the signal can be used to enable efficient signal processing under larger noise or with fewer samples, leading to the ideas behind compressed sensing [2] or sparsity-enhancing regularizations. Recent years brought a surge of interest in another powerful and generic way of representing signals – generative modeling. In particular the generative adversarial networks (GANs) [3] provide an impressively powerful way to represent classes of signals. A recent series of works on compressed sensing and other regression-related problems successfully explored the idea of replacing the traditionally used sparsity by generative models [4–10]. These results and performances conceivably suggest that [11]:\n\nGenerative models are the new sparsity.\n\nNext to compressed sensing and regression, another technique in statistical analysis that uses sparsity in a fruitful way is sparse principal component analysis (PCA) [12].
† Université Paris-Saclay, CNRS, CEA, Institut de physique théorique, 91191, Gif-sur-Yvette, France.\n⋆ Laboratoire de Physique de l'École Normale Supérieure, PSL University & CNRS & Sorbonne Universités, Paris, France.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nCompared to the standard PCA, in sparse-PCA the principal components are linear combinations of a few of the input variables, specifically k of them. This means (for rank-one) that we aim to decompose the observed data matrix Y ∈ R^{n×p} as Y = u v^T + ξ, where the spike v ∈ R^p is a vector with only k ≪ p non-zero components, and u, ξ are commonly modelled as independent and identically distributed (i.i.d.) Gaussian variables.\n\nThe main goal of this paper is to explore the idea of replacing sparsity of the spike v by the assumption that the spike belongs to the range of a generative model. Sparse-PCA with structured sparsity-inducing priors is well studied, e.g. [13]; in this paper we remove the sparsity entirely and, in a sense, replace it by the lower dimensionality of the latent space of the generative model. For the purpose of comparing generative-model priors and sparsity we focus on the rich range of properties in the noisy high-dimensional regime (denoted below, borrowing statistical physics jargon, as the thermodynamic limit) where the spike v cannot be estimated consistently, but can be estimated better than by random guessing. In particular we analyze two spiked-matrix models as considered in a series of existing works on sparse-PCA, e.g.
[14–20], defined as follows:\n\nSpiked Wigner model (vv^T): Consider an unknown vector (the spike) v⋆ ∈ R^p drawn from a distribution P_v; we observe a matrix Y ∈ R^{p×p} with a symmetric noise term ξ ∈ R^{p×p} and Δ > 0:\n\nY = (1/√p) v⋆ v⋆^T + √Δ ξ ,   (1)\n\nwhere ξ_ij ∼ N(0, 1) i.i.d. The aim is to find back the hidden spike v⋆ from Y (up to a global sign).\n\nSpiked Wishart (or spiked covariance) model (uv^T): Consider two unknown vectors u⋆ ∈ R^n and v⋆ ∈ R^p drawn from distributions P_u and P_v, and let ξ ∈ R^{n×p} with ξ_μi ∼ N(0, 1) i.i.d. and Δ > 0; we observe\n\nY = (1/√p) u⋆ v⋆^T + √Δ ξ ;   (2)\n\nthe goal is to find back the hidden spikes u⋆ and v⋆ from Y ∈ R^{n×p}.\n\nThe noisy high-dimensional limit that we consider in this paper (the thermodynamic limit) is p, n → ∞ while β ≡ n/p = Θ(1), and the noise ξ has a variance Δ = Θ(1). The prior P_v represents the spike v via a k-dimensional parametrization with α ≡ p/k = Θ(1). In the sparse case, k is the number of non-zero components of v⋆, while in generative models k is the number of latent variables.\n\n1.1 Considered generative models\n\nThe simplest non-separable prior P_v that we consider is the Gaussian model with a covariance matrix Σ, that is P_v(v) = N(v; 0, Σ). This prior is not compressive, yet it captures some structure and can be simply estimated from data via the empirical covariance. We use this prior later to produce Fig. 4. To exploit the practically observed power of generative models, it would be desirable to consider models (e.g. GANs, variational auto-encoders, restricted Boltzmann machines, or others) trained on datasets of examples of possible spikes.
Such training, however, leads to correlations between the weights of the underlying neural networks for which the theoretical part of the present paper does not apply readily. To keep tractability in a closed form, and subsequent theoretical insights, we focus on multi-layer generative models where all the weight matrices W^(l) ∈ R^{k_{l+1}×k_l}, l = 1, . . . , L (with k_1 = k, k_{L+1} = p), are fixed, layer-wise independent, i.i.d. Gaussian with zero mean and unit variance. Let v ∈ R^p be the output of such a generative model\n\nv = φ^(L)( (1/√k_L) W^(L) . . . φ^(1)( (1/√k_1) W^(1) z ) . . . ) ,   (3)\n\nwith z ∈ R^k a latent variable drawn from a separable distribution P_z, with ρ_z = E_{P_z}[z²], and φ^(l) element-wise activation functions that can be either deterministic or stochastic. In the setting considered in this paper the ground-truth spike v⋆ is generated using a ground-truth value of the latent variable z⋆. The spike is then estimated from the knowledge of the data matrix Y, and the known form of the spiked-matrix model and of the generative model. In particular the matrices W^(l) are known, as are the parameters β, Δ, P_z, P_u, P_v, φ^(l). Only the spikes v⋆, u⋆ and the latent vector z⋆ are unknown, and are to be inferred.\n\nFor concreteness and simplicity, the generative model that will be analyzed in most examples given in the present paper is the single-layer case of (3) with L = 1:\n\nv = φ( (1/√k) W z )  ⇔  v ∼ P_out( · | (1/√k) W z ) .   (4)\n\nWe define the compression ratio α ≡ p/k.
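As a concrete illustration, the observation model (1) with the single-layer prior (4) can be simulated in a few lines of numpy. This is a minimal sketch; the sizes, the seed and the choice φ = sign are illustrative, not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
k, alpha, delta = 100, 4, 1.0         # latent size, compression ratio, noise
p = alpha * k

# Single-layer generative prior, eq. (4): v* = phi(W z* / sqrt(k)),
# with fixed i.i.d. Gaussian weights W and latent z* ~ N(0, I_k).
W = rng.standard_normal((p, k))
z_star = rng.standard_normal(k)
phi = np.sign                          # could also be the identity or ReLU
v_star = phi(W @ z_star / np.sqrt(k))

# Spiked Wigner observation, eq. (1): Y = v* v*^T / sqrt(p) + sqrt(Delta) xi,
# with xi symmetric; off-diagonal entries are N(0, 1) after symmetrization.
xi = rng.standard_normal((p, p))
xi = (xi + xi.T) / np.sqrt(2)
Y = np.outer(v_star, v_star) / np.sqrt(p) + np.sqrt(delta) * xi
```

For the sign activation the spike entries are ±1, so its second moment is ρ_v = 1, as used in the phase diagrams below.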
In what follows we will illustrate our results for φ being the linear, sign and ReLU functions.\n\n1.2 Summary of main contributions\n\nWe analyze how the availability of generative priors, defined in section 1.1, influences the statistical and algorithmic properties of the spiked-matrix models (1) and (2). Both sparse-PCA and generative priors provide statistical advantages when the effective dimensionality k is small, k ≪ p. However, we show that from the algorithmic perspective the two cases are quite different. This is why our main findings are best presented in the context of the results known for sparse-PCA. We draw two main conclusions from the present work:\n\n(i) No algorithmic gap with generative-model priors: Sharp and detailed results are known in the thermodynamic limit (as defined above) when the spike v⋆ is sampled from a separable distribution P_v. A detailed account of several examples can be found in [21]. The main finding for sparse priors P_v is that when the sparsity ρ = k/p = 1/α is large enough there exist optimal algorithms [15], while for ρ small enough there is a striking gap between the statistically optimal performance and that of the best known algorithms [16]. The small-ρ expansion studied in [21] is consistent with the well-known results for exact recovery of the support of v⋆ [22, 23], which is one of the best-known cases in which gaps between statistical and best-known algorithmic performance were described. Our analysis of the spiked-matrix models with generative priors reveals that in the investigated cases the algorithmic gap disappears and known algorithms are able to obtain (asymptotically) optimal performance even when the dimension is greatly reduced, i.e. α ≫ 1. An analogous conclusion about the lack of algorithmic gaps was reached for the problem of phase retrieval under a deep generative prior in [9].
This result suggests that plausibly generative priors are better than sparsity, as they lead to algorithmically easier problems and give back the hope that the structure can be exploited not only information-theoretically but also tractably.\n\n(ii) Spectral algorithms reaching the statistical threshold: Arguably the most basic algorithm used to solve the spiked-matrix model is based on the leading singular vectors of the matrix Y. We will refer to this as PCA. Previous work on spiked-matrix models [17, 21] established that, in the thermodynamic limit and for separable priors of zero mean, PCA reaches the best performance of all known efficient algorithms in terms of the value of the noise Δ below which it is able to provide positive correlation between its estimator and the ground-truth spike. While for sparse priors positive correlation is statistically reachable even for larger values of Δ [17, 21], no efficient algorithm beating the PCA threshold is known².\n\nIn the case of generative priors we find in this paper that other spectral methods improve on the canonical PCA. We design a spectral method, called LAMP, that (under certain assumptions, e.g. zero mean of the spikes) reaches the statistically optimal threshold, meaning that for larger values of the noise variance no other (even exponential-time) algorithm is able to reach positive correlation with the spike. Again this is a striking difference with the sparse separable prior, making generative priors algorithmically more attractive. We demonstrate the performance of LAMP on the spiked-matrix model when the spike is taken to be one of the fashion-MNIST images, showing considerable improvement over canonical PCA.\n\n2 Analysis of information-theoretically optimal estimation\n\nWe first discuss the information-theoretic results on the estimation of the spike, regardless of the computational cost.
A considerable amount of results has been obtained for the spiked-matrix models with separable priors [14, 15, 18, 19, 25–29]. Here, we extend these results to the case where the spike v⋆ ∈ R^p is generated from a generic non-separable prior P_v on R^p.\n\n² This result holds only for sparsity ρ = Θ(1). A line of works shows that when the sparsity k scales slower than linearly with p, algorithms more performant than PCA exist [22, 24].\n\n2.1 Mutual Information and Minimal Mean Squared Error\n\nWe consider the mutual information between the ground-truth spike v⋆ and the observation Y, defined as I(Y; v⋆) = D_KL(P_{(v⋆,Y)} ‖ P_{v⋆} P_Y). Next, we consider the best possible value of the mean-squared error on recovering the spike, commonly called the minimum mean-squared error (MMSE). The MMSE estimator is computed from the marginal means of the posterior distribution P(v|Y).\n\nTheorem 1. [Mutual information for the spiked Wigner model with structured spike] Informally (see SM section 3 for details and proof), assume the spikes v⋆ come from a sequence (of growing dimension p) of generic structured priors P_v on R^p; then\n\nlim_{p→∞} I(Y; v⋆)/p ≡ lim_{p→∞} i_p = inf_{ρ_v ≥ q_v ≥ 0} i_RS(Δ, q_v) ,   (5)\n\nwith\n\ni_RS(Δ, q_v) ≡ (ρ_v − q_v)²/(4Δ) + lim_{p→∞} (1/p) I( v ; v + √(Δ/q_v) ξ ) ,   (6)\n\nand ξ being a Gaussian vector with zero mean, unit diagonal variance, and ρ_v = lim_{p→∞} E_{P_v}[v^T v]/p.\n\nThis theorem connects the asymptotic mutual information of the spiked model with generative prior P_v to the mutual information between v taken from P_v and its noisy version, I(v; v + √(Δ/q_v) ξ). Computing this latter mutual information is itself a high-dimensional task, hard in full generality, but it can be done for a range of models.
The simplest tractable case is when the prior P_v is separable; then it yields back exactly the formula known from [18, 19, 26]. It can be computed also for the Gaussian generative model, P_v(v) = N(v; 0, Σ), leading to I(v; v + √(Δ/q_v) ξ) = Tr(log(I_p + q_v Σ/Δ))/2. More interestingly, the mutual information associated to the generative prior in eq. (6) can also be asymptotically computed for the multi-layer generative model with random weights, defined in eq. (3). Indeed, for the single-layer prior (4) the corresponding formula for the mutual information has been derived and proven in [30]. For the multi-layer case the mutual information formula has been derived in [6] and proven for the case of two layers in [31]. Theorem 1 together with the results from [6, 30, 31] yields the following formula (see SM sec. 3 for details) for the spiked Wigner model (1) with L-layer generative prior (3):\n\ni_RS(Δ, q_v) = ρ_v²/(4Δ) + (1/α) extr_{{q̂_l, q_l}_l} [ α q_v²/(4Δ) + (1/2) Σ_{l=1}^{L} α_l q̂_l q_l − Ψ_z(q̂_z) − Σ_{l=2}^{L} α_l Ψ_out^(l)(q̂_l, q_{l−1}) − α Ψ_out^(L+1)(q_v/Δ, q_L) ] ,   (7)\n\nwhere α_l = k_l/k (note that in particular α_1 = 1, and we denote q̂_z ≡ q̂_1, q_z ≡ q_1) and the functions Ψ_z, Ψ_out are defined by\n\nΨ_z(x) ≡ E_ξ[ Z_z(x^{1/2} ξ, x) log Z_z(x^{1/2} ξ, x) ] ,   (8)\n\nΨ_out^(l)(x, y) ≡ E_{ξ,η}[ Z_out^(l)(x^{1/2} ξ, x, y^{1/2} η, ρ_l − y) log Z_out^(l)(x^{1/2} ξ, x, y^{1/2} η, ρ_l − y) ] ,   (9)\n\nwith ξ, η ∼ N(0, 1) i.i.d., ρ_{l+1} the second moment of the hidden variable h^(l+1) = φ^(l)( (1/√k_l) W^(l) h^(l) ) ∈ R^{k_{l+1}}, and Z_z, Z_out^(l) the normalizations of the following denoising scalar distributions:\n\nQ_z^{γ,Λ}(z) ≡ (P_z(z)/Z_z(γ, Λ)) e^{−Λz²/2 + γz} ,  Q_out^{(l),B,A,ω,V}(v, x) ≡ (P_out^(l)(v|x)/Z_out^(l)(B, A, ω, V)) e^{−Av²/2 + Bv} e^{−(x−ω)²/(2V)}/√(2πV) .   (10)\n\nResult (7) is remarkable in that it connects the asymptotic mutual information of a high-dimensional model with a simple scalar formula that can be easily evaluated. In the SM sec. 2 we show how this formula is obtained using the heuristic replica method from statistical physics and, once we have the formula in hand, we prove it using the interpolation method in SM sec. 3. In SM sec. 2.2 we also give the corresponding formula for the spiked Wishart model.\n\nBeyond its theoretical interest, the main point of the mutual information formula is that it yields the optimal value of the mean-squared error (MMSE). It is well known [32] that the mean-squared error is minimized by the estimator evaluating the conditional expectation of the signal given the observations. Following generic theorems on the connection between the mutual information and the MMSE [33], one can prove in particular that for the spiked-matrix model [27] the MMSE on the spike v⋆ is asymptotically given by:\n\nMMSE_v = ρ_v − q_v⋆ ,   (11)\n\nwhere q_v⋆ is the optimizer of the function i_RS(Δ, q_v).\n\n2.2 Examples of phase diagrams\n\nTaking the extremization over q_v, q̂_z, q_z in eq.
(7), we obtain the following fixed point equations:\n\nq_v = 2 ∂_{q_v} Ψ_out( q_v/Δ , q_z ) ,  q_z = 2 ∂_{q̂_z} Ψ_z( q̂_z ) ,  q̂_z = 2α ∂_{q_z} Ψ_out( q_v/Δ , q_z ) .   (12)\n\nUsing (11), analyzing the fixed points of eqs. (12) provides all the information about the performance of the Bayes-optimal estimator in the models under consideration.\n\nPhase transition: A first question is whether better estimation than random guessing from the prior is possible. In terms of fixed points of eqs. (12), this corresponds to the existence of the non-informative fixed point q_v⋆ = 0 (i.e. zero overlap with the spike, or maximum MSE_v = ρ_v). Evaluating the right-hand side of eqs. (12) at q_v = 0, we can see that q_v⋆ = 0 is a fixed point if\n\nE_{P_z}[z] = 0 and E_{Q_out^0}[v] = 0 ,   (13)\n\nwhere Q_out^0(v, x) ≡ Q_out^{0,0,0,ρ_z}(v, x) from eq. (10). Note that for a deterministic channel the second condition is equivalent to φ being an odd function.\n\nWhen the condition (13) holds, (q_v, q̂_z, q_z) = (0, 0, 0) is a fixed point of eq. (12). The numerical stability of this fixed point determines a phase transition point Δ_c, defined as the noise below which the fixed point (0, 0, 0) becomes unstable. This corresponds to the value of Δ for which the largest eigenvalue of the Jacobian of eqs. (12) at (0, 0, 0), given in the variables (q_v, q̂_z, q_z) by\n\n2 d( ∂_{q_v} Ψ_out , α ∂_{q_z} Ψ_out , ∂_{q̂_z} Ψ_z )|_{(0,0,0)} =\n( (1/Δ)(E_{Q_out^0}[v²])²    0    (1/ρ_z²)(E_{Q_out^0}[vx])²\n  (α/Δ)(E_{Q_out^0}[vx])²    0    (α/ρ_z²)(E_{Q_out^0}[x²] − ρ_z)²\n  0    (E_{P_z}[z²])²    0 ) ,   (14)\n\nbecomes greater than one. The details of this calculation can be found in sec.
6 of the SM.\n\nIt is instructive to compute Δ_c in specific cases. We therefore fix P_z = N(0, 1) and P_out(v|x) = δ(v − φ(x)) and discuss two different choices of (odd) activation function φ.\n\nLinear activation: For φ(x) = x the leading eigenvalue of the Jacobian becomes one at Δ_c = 1 + α. For L > 1 the result is derived in SM sec. 2.3 and reads Δ_c = 1 + Σ_{l=1}^{L} α/α_l. Note that in the limit α = 0 we recover the phase transition Δ_c = 1 known from the case with a separable prior [21]. For α > 0, we have Δ_c > 1, meaning the spike can be estimated more efficiently when its structure is accounted for.\n\nSign activation: For φ(x) = sgn(x) the leading eigenvalue of the Jacobian becomes one at Δ_c = 1 + 4α/π². As above, it generalizes for L > 1 as Δ_c = 1 + Σ_{l=1}^{L} (4/π²)^l α/α_l. For α = 0, P_v = Bern(1/2), and the transition Δ_c = 1 agrees with the one found for a separable prior distribution [21]. As in the linear case, for α > 0 we can estimate the spike for larger values of noise than in the separable case.\n\nIn Fig. 1 we solve the fixed point equations (12) and plot the MMSE obtained from the fixed point in a heat map, for the linear, sign and relu activations. The white dashed line marks the above-stated threshold Δ_c. The property that we find the most striking is that in these three evaluated cases, for all values of Δ, α and L that we analyzed, we always found that eq. (12) has a unique stable fixed point.\n\nFigure 1: Spiked Wigner model MMSE_v on the spike as a function of the noise-to-signal ratio Δ/ρ_v², and generative prior (4) with compression ratio α for L = 1 linear (left, ρ_v = 1), sign (center, ρ_v = 1), and relu (right, ρ_v = 1/2) activations.
Dashed white lines mark the phase transitions Δ_c, matched by both the AMP and LAMP algorithms. The dotted white line marks the phase transition of canonical PCA.\n\nFigure 2: Spiked Wigner model: MMSE_v as a function of the noise Δ - (upper) for a wide range of compression ratios α = 0, 1, 10, 100, 1000, for L = 1 linear (left), sign (center), and relu (right) activations; a unique stable fixed point of (12) is found for all these cases - (lower) for different depths L = 1, 2, 3 with constant compression ratio α_1 = α_2 = α_3 = 1, for linear (left), sign (center), and relu (right) activations. The second moment of the variable v for L = 1, 2, 3 is ρ_v^(L) = 1 for linear and sign, while for ReLU ρ_v^(L) = 1/2^L. Similarly, a unique stable fixed point is found in these cases.\n\nThus we have not identified any first-order phase transition (in the physics terminology). This is illustrated in Fig. 2 for larger values of α (upper) and for different depths L (lower), where we solved eq. (12) iteratively from an uncorrelated initial condition, and from the initial condition corresponding to the ground-truth signal, and found that both lead to the same fixed point. In particular, as a unique fixed point is found, the Bayes-optimal errors are continuous and we did not observe any algorithmic gap. The expressions equivalent to eqs. (12-14) for L ≥ 1 are detailed in SM sec.
2.3.\n\n3 Approximate message passing with generative priors\n\nA straightforward algorithmic evaluation of the Bayes-optimal estimator is exponentially costly. This section is devoted to the analysis of an approximate message passing (AMP) algorithm that for the analyzed cases is able to reach the optimal performance (in the thermodynamic limit). For the purpose of presentation, we focus again on the spiked Wigner model (see SM for the spiked Wishart model). For separable priors, the AMP for the spiked Wigner model is well known [14–16]. It can, however, be extended to non-separable priors [6, 34, 35]. We show in SM sec. 4 how AMP can be generalized to handle the generative model (4). Iterating this derivation leads naturally to its multi-layer version ML-AMP for L ≥ 1.
In particular AMP for L = 1 reads:\n\nAlgorithm 1: AMP algorithm for the spiked Wigner model with single-layer generative prior.\nInput: Y ∈ R^{p×p} and W ∈ R^{p×k}.\nInitialize to zero: (g, v̂, B_v, A_v)^{t=0}.\nInitialize with: v̂^{t=1} ∼ N(0, σ²), ẑ^{t=1} ∼ N(0, σ²), ĉ_v^{t=1} = 1_p, ĉ_z^{t=1} = 1_k, t = 1.\nrepeat\n  Spiked layer:\n    B_v^t = (1/Δ)( (Y/√p) v̂^t − (1/p)(1_p^T ĉ_v^t) v̂^{t−1} ) and A_v^t = (1/(Δp)) ‖v̂^t‖_2² I_p.\n  Generative layer:\n    V^t = (1/k) 1_k^T ĉ_z^t, ω^t = (1/√k) W ẑ^t − V^t g^{t−1} and g^t = f_out(B_v^t, A_v^t, ω^t, V^t),\n    Λ^t = (1/k) ‖g^t‖_2² I_k and γ^t = (1/√k) W^T g^t + Λ^t ẑ^t.\n  Update of the estimated marginals:\n    v̂^{t+1} = f_v(B_v^t, A_v^t, ω^t, V^t) and ĉ_v^{t+1} = ∂_B f_v(B_v^t, A_v^t, ω^t, V^t),\n    ẑ^{t+1} = f_z(γ^t, Λ^t) and ĉ_z^{t+1} = ∂_γ f_z(γ^t, Λ^t),\n    t = t + 1.\nuntil convergence.\nOutput: v̂, ẑ.\n\nwhere I_s and 1_s denote respectively the identity matrix and the vector of ones of size s. The update functions f_out and f_v are the means of V^{−1}(x − ω) and v with respect to Q_out, eq. (10), while the update function f_z is the mean of z with respect to Q_z, eq. (10).\n\nThe algorithm for the spiked Wishart model is very similar and both derivations are given in SM sec. 4. We define the overlap of the AMP estimator with the ground-truth spike as (v̂^t)^T v⋆/p → q_v^t as p → ∞. Perhaps the most important virtue of AMP-type algorithms is that their asymptotic performance can be tracked exactly via a set of scalar equations called state evolution.
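Schematically, such a state-evolution tracking is just a scalar fixed-point iteration. Below is a minimal sketch of the recursion of eq. (15); the Ψ-derivatives are problem-specific and left as user-supplied callables, so this is a skeleton under that assumption, not the paper's full implementation:

```python
def state_evolution(dPsi_out_dqv, dPsi_out_dqz, dPsi_z, alpha, delta,
                    eps=1e-3, tol=1e-10, max_iter=100_000):
    """Iterate the scalar state-evolution updates (eq. (15)) from the
    weakly informative initialization q_v = q_z = eps."""
    qv, qz = eps, eps
    for _ in range(max_iter):
        qv_new = 2 * dPsi_out_dqv(qv / delta, qz)        # q_v^{t+1}
        qhat_z = 2 * alpha * dPsi_out_dqz(qv / delta, qz)
        qz_new = 2 * dPsi_z(qhat_z)                      # q_z^{t+1}
        if abs(qv_new - qv) + abs(qz_new - qz) < tol:
            break
        qv, qz = qv_new, qz_new
    return qv_new, qz_new
```

As a sanity check one can feed in simple contracting maps and verify convergence to their analytic fixed point.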
This fact has been proven for a range of models, including the spiked matrix models with separable priors in [36], and with non-separable priors in [35]. To help the reader understand the state evolution equations we provide a heuristic derivation in the SM, section 4.4. For L = 1, the state evolution states that the overlap q_v^t evolves under iterations of the AMP algorithm as:\n\nq_v^{t+1} = 2 ∂_{q_v} Ψ_out( q_v^t/Δ , q_z^t ) ,  q_z^{t+1} = 2 ∂_{q̂_z} Ψ_z( q̂_z^t ) ,  q̂_z^t = 2α ∂_{q_z} Ψ_out( q_v^t/Δ , q_z^t ) ,   (15)\n\nwith initialization q_v^{t=0} = ε, q_z^{t=0} = ε and a small ε > 0. We notice immediately that (15) are the same equations as the fixed point equations related to the Bayes-optimal estimation (12), with specific time indices and initialization, but crucially the same fixed points. This observation generalizes naturally to L > 1. Thus the analysis of fixed points in sec. 2.2 applies also to the behaviour of AMP. In particular, in all the scenarios for which we solved the corresponding equations numerically we found the stable fixed point of (12) to be unique, or equivalently the Bayes-optimal error as a function of the noise to be continuous. Hence, under the assumption that the data was created using the model from eq. (1) and the spike from eq. (3) with fixed weight matrices W^(l) with i.i.d. Gaussian entries, the AMP algorithm is able to reach asymptotically the optimal performance in all these cases. This is further illustrated in Fig. 3 where we explicitly compare runs of AMP on finite-size instances with the results of the asymptotic state evolution, thus also giving an idea of the amplitude of the finite-size effects. Note that we provide a demonstration notebook in [37] that compares AMP, LAMP and PCA numerical performances.
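The location of the instability discussed in sec. 2.2 can also be checked numerically. The sketch below builds the Jacobian of eq. (14) at the non-informative fixed point from the scalar moments, here for the linear activation with P_z = N(0, 1), where E[v²] = (E[vx])² = ρ_z = E[z²] = 1 and E[x²] − ρ_z = 0, and verifies that its spectral radius crosses one at Δ_c = 1 + α:

```python
import numpy as np

def jacobian_at_zero(delta, alpha, Ev2, Evx, Ex2m, rho_z, Ez2):
    # Jacobian of eqs. (12) at (q_v, qhat_z, q_z) = (0, 0, 0), eq. (14),
    # written in the variables (q_v, qhat_z, q_z).
    return np.array([
        [Ev2**2 / delta,         0.0,    Evx**2 / rho_z**2],
        [alpha * Evx**2 / delta, 0.0,    alpha * Ex2m**2 / rho_z**2],
        [0.0,                    Ez2**2, 0.0],
    ])

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

# Linear activation, alpha = 2: the non-informative fixed point should be
# unstable (radius > 1) for Delta < 1 + alpha = 3 and stable above it.
alpha = 2.0
unstable = spectral_radius(jacobian_at_zero(2.9, alpha, 1, 1, 0, 1, 1))
stable = spectral_radius(jacobian_at_zero(3.1, alpha, 1, 1, 0, 1, 1))
```

The moments passed in are the ones quoted in sec. 2.2; other activations only change these scalar inputs.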
Finally, as has been done in previous works for compressed sensing and denoising, e.g. [5, 8–10], translating our results to practical situations by designing an AMP algorithm that takes care of correlated GAN or VAE weights is still under investigation.\n\nFigure 3: Comparison between PCA, LAMP and AMP - (upper) for the linear (left) and sign (center) activations, for L = 1 and compression ratio α = 2. Lines correspond to the theoretical asymptotic performance of PCA (red line), LAMP (green line) and AMP (blue line). Dots correspond to simulations of PCA (red squares) and LAMP (green crosses) for k = 10⁴, and AMP (blue points) for k = 5·10³, σ² = 1. (Right) Illustration of the spectral phase transition in the matrix Γ_p^{vv}, eq. (18), at α = 2, with an informative leading eigenvector whose eigenvalue, equal to 1, is out of the bulk for Δ ≤ 1 + α. We show the bulk spectral density μ(α, Δ); the inset shows the two leading eigenvalues - (lower) for (left) a three-layer generative model with (α_1, α_2, α_3) = (1, 1, 1) using linear activations (k = 10⁴) and (right) a two-layer generative model with (α_1, α_2) = (1, 1) using sign activations (k = 2·10⁴). The vertical lines show the PCA and the optimal threshold respectively.\n\n4 Spectral methods for generative priors\n\nSpectral methods are the most common class of algorithms used for spiked matrix estimation. For instance, canonical PCA estimates the spike from the leading eigenvector of the matrix Y. A classical result from Baik, Ben Arous and Péché (BBP) [38] shows that this eigenvector is correlated with the signal if and only if the signal-to-noise ratio ρ_v²/Δ > 1. For sparse separable priors (with ρ_v² = Θ(1)), Δ_PCA = ρ_v² is also the threshold for AMP and it is conjectured that no polynomial algorithm can improve upon it [21]. In the previous section we showed that for the analyzed generative priors AMP has a better threshold than PCA.
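The BBP transition itself is easy to observe numerically. The sketch below (sizes and seed are illustrative choices, not values from the paper) builds a spiked Wigner observation with a Gaussian spike of second moment ρ_v = 1 and checks that the top eigenvalue of Y/√p detaches from the bulk edge 2√Δ, sitting near ρ_v + Δ/ρ_v, only when ρ_v²/Δ > 1:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 1500

def top_eigenvalue(delta, rho_v=1.0):
    # Rescaled spiked Wigner observation: Y/sqrt(p) = v v^T / p + noise,
    # whose noise bulk is a semicircle with edge 2*sqrt(delta).
    v = np.sqrt(rho_v) * rng.standard_normal(p)
    xi = rng.standard_normal((p, p))
    M = np.outer(v, v) / p + np.sqrt(delta / p) * (xi + xi.T) / np.sqrt(2)
    return np.linalg.eigvalsh(M)[-1]

lam_easy = top_eigenvalue(delta=0.25)  # rho_v^2/Delta = 4 > 1: outlier ~ 1.25
lam_hard = top_eigenvalue(delta=4.0)   # rho_v^2/Delta = 1/4 < 1: bulk edge ~ 4
```

Above the threshold the outlier eigenvalue carries an eigenvector correlated with the spike; below it the top eigenvalue sticks to the bulk edge and the eigenvector is uninformative.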
Here we design a spectral method, called LAMP, that matches the AMP threshold and is hence superior to canonical PCA. In order to do so, we follow the powerful strategy pioneered in [39] and linearize the AMP around its non-informative fixed point. In the spiked Wigner model with a single-layer prior (L = 1), the linearized AMP leads to the following operator:

Γ^{vv}_p = (1/∆) [ (a − b) I_p + b (W W^⊺)/k + c (1_p 1_k^⊺ W^⊺)/(k√k) ] ( Y/√p − a I_p ) ,   (16)

where the parameters are moments of the distributions P_z and Q⁰_out according to

a ≡ ρ_v ,   b ≡ ρ_z^{−1} E_{Q⁰_out}[vx]² ,   c ≡ (1/2) ρ_z^{−3} E_{P_z}[z³] E_{Q⁰_out}[vx²] E_{Q⁰_out}[vx] .   (17)

We denote the spectral algorithm that takes the leading eigenvectors of (16) as LAMP (for linearized-AMP). Its derivation is presented in SM sec. 5, together with the one for the spiked Wishart model. For the specific case of Gaussian z and prior (4) with the sign activation function we obtain (a, b, c) = (1, 2/π, 0).
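The value (a, b, c) = (1, 2/π, 0) for the sign activation follows from Gaussian moments: with x ~ N(0, 1) and v = sign(x), one has ρ_v = E[v²] = 1 and E[vx] = E[|x|] = √(2/π), so E[vx]² = 2/π, while the vanishing third moment of the Gaussian latent z kills the c-term. A quick Monte-Carlo sanity check (our code, sample size arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10**6

# Sign activation: v = sign(x) with x standard Gaussian (rho_z = 1).
x = rng.standard_normal(n)
v = np.sign(x)

a_mc = np.mean(v ** 2)       # rho_v = E[v^2], exactly 1 for the sign activation
b_mc = np.mean(v * x) ** 2   # E[vx]^2 = E[|x|]^2, should approach 2/pi ~ 0.6366
c_mc = np.mean(x ** 3)       # E[z^3] = 0 for Gaussian z, so the c-term vanishes

print(a_mc, b_mc, c_mc)
```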
For linear activation we get (a, b, c) = (1, 1, 0), leading to

Γ^{vv}_p = (1/∆) K_p ( Y/√p − I_p ) ,   with K_p = (W W^⊺)/k = Σ ≈ (1/n) Σ_{α=1}^{n} v^α (v^α)^⊺ ,   (18)

where the last two equalities come from the fact that for the model (4) with linear activation and Gaussian separable P_z, K_p is asymptotically equal to Σ, the covariance matrix between samples of spikes. The same observation holds for the sign activation function. In fact, the spectral method based on the matrix in eq. (18) can also be derived by linearizing AMP with a Gaussian prior with covariance Σ. Interestingly, since the matrix K_p in eq. (18) can be estimated empirically from n samples of spikes, v^α, α = 1, . . . , n, without knowledge of the generative model (φ, W) itself, it suggests a simple practical implementation of LAMP, Alg. 2, for any prior P_v:

Algorithm 2: LAMP spectral algorithm
Input: observed matrix Y ∈ R^{p×p}, prior P_v on v ∈ R^p, with K_p = E_{P_v}[vv^⊺]
Take the leading eigenvector v̂ ∈ R^p of K_p [ Y/√p − I_p ]

Analogously to the state evolution for AMP, the asymptotic performance of both PCA and LAMP can be evaluated in closed form for the spiked Wigner model with a single-layer generative prior with linear activation (4). The corresponding expressions are derived in SM sec. 5 and plotted in Fig. 3 for the three considered algorithms, illustrating that the LAMP spectral method reaches the same threshold as ML-AMP for different depths L and activations.

For illustration purposes, we display the behaviour of this spectral method on the spiked Wigner model with spikes coming from the Fashion-MNIST dataset in Fig. 4. A demonstration notebook is provided in [37], illustrating the PCA and LAMP performances on the Fashion-MNIST dataset.

Figure 4: Illustration of the canonical PCA (top line) and the LAMP (bottom line) spectral methods, Alg. 2, on the spiked Wigner model. The covariance K_p is estimated empirically, see (18), from the FashionMNIST database [40]. The estimation of the spike is shown for two images from FashionMNIST, with (from left to right) noise variance ∆ = 0.01, 0.1, 1, 2, 10.

Remarkably, the performance of the spectral method based on the matrix (18) can be investigated independently of AMP using random matrix theory. An analysis of the random matrix (18) shows that a spectral phase transition for generative priors with linear activations appears at ∆_c = 1 + α (as for AMP). This transition is analogous to the well-known BBP transition [38], but a non-GOE random matrix (18) needs to be analyzed. For the spiked Wigner model with linear generative prior we prove two theorems describing the behavior of the supremum of the bulk spectral density, the transition of the largest eigenvalue, and the correlation of the corresponding eigenvector:

Theorem 2 (Bulk of the spectral density, spiked Wigner, linear activation). Let α, ∆ > 0. Then:
(i) The spectral measure of Γ^{vv}_p converges almost surely and in the weak sense to a compactly supported probability measure µ(α, ∆).
We denote λ_max the supremum of the support of µ(α, ∆).
(ii) For any α > 0, as a function of ∆, λ_max has a unique global maximum, reached exactly at the point ∆ = ∆_c(α) = 1 + α. Moreover, λ_max(α, ∆_c(α)) = 1.

Theorem 3 (Transition of the largest eigenvalue and eigenvector, spiked Wigner, linear activation). Let α > 0. We denote λ_1 ≥ λ_2 the first and second eigenvalues of Γ^{vv}_p. If ∆ ≥ ∆_c(α), then as p → ∞ we have a.s. λ_1 → λ_max and λ_2 → λ_max. If ∆ ≤ ∆_c(α), then as p → ∞ we have a.s. λ_1 → 1 and λ_2 → λ_max. Further, denoting ṽ a normalized (‖ṽ‖² = p) eigenvector of Γ^{vv}_p with eigenvalue λ_1, then |ṽ^⊺v*|²/p² → ε(∆) a.s., where ε(∆) = 0 for all ∆ ≥ ∆_c(α), ε(∆) > 0 for all ∆ < ∆_c(α), and lim_{∆→0} ε(∆) = 1.

Thm. 2 and Thm. 3 are illustrated in Fig. 3. The proof gives the value of ε(∆), which turns out to lead to the same MSE as in Fig. 3 in the linear case. We state the counterparts of these theorems for the uv^⊺ linear case in SM sec. 7. The proofs of the theorems and the precise arguments used to derive the eigenvalue density, the transition of λ_1 and the computation of ε(∆) are given in SM sec. 7, and a Mathematica demonstration notebook is also provided in [37]. We also describe in the SM the difficulties that must be circumvented to generalize the analysis to a non-linear activation function with random matrix theory.

5 Acknowledgments

This work is supported by the ERC under the European Union's Horizon 2020 Research and Innovation Program 714608-SMiLe, as well as by the French Agence Nationale de la Recherche under grant ANR-17-CE23-0023-01 PAIL.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. We thank Google Cloud for providing us access to their platform through the Research Credits Application program. We would also like to thank the Kavli Institute for Theoretical Physics (KITP) for welcoming us during part of this research, with the support of the National Science Foundation under Grant No. NSF PHY-1748958. We thank Ahmed El Alaoui for insightful discussions about the proof of the Bayes-optimal performance, and Rémi Monasson, whose insightful lecture series partly inspired this work. AM acknowledges additional funding from the 'Chaire de recherche sur les modèles et sciences des données', Fondation CFM pour la Recherche-ENS.

References

[1] Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[2] David L Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
[3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[4] Eric W Tramel, Andre Manoel, Francesco Caltagirone, Marylou Gabrié, and Florent Krzakala. Inferring sparsity: Compressed sensing using generalized restricted Boltzmann machines. In 2016 IEEE Information Theory Workshop (ITW), pages 265–269. IEEE, 2016.
[5] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 537–546. JMLR.org, 2017.
[6] Andre Manoel, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Multi-layer generalized linear estimation.
In 2017 IEEE International Symposium on Information Theory (ISIT), pages 2098–2102. IEEE, 2017.
[7] Paul Hand and Vladislav Voroninski. Global guarantees for enforcing deep generative priors by empirical risk. In Conference On Learning Theory, pages 970–978, 2018.
[8] Alyson K Fletcher, Sundeep Rangan, and Philip Schniter. Inference in deep networks in high dimensions. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1884–1888. IEEE, 2018.
[9] Paul Hand, Oscar Leong, and Vlad Voroninski. Phase retrieval under a generative prior. In Advances in Neural Information Processing Systems, pages 9136–9146, 2018.
[10] Dustin G Mixon and Soledad Villar. Sunlayer: Stable denoising with generative networks. arXiv preprint arXiv:1803.09319, 2018.
[11] Soledad Villar. Generative models are the new sparsity? https://solevillar.github.io/2018/03/28/SUNLayer.html, 2018.
[12] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.
[13] Rodolphe Jenatton, Guillaume Obozinski, and Francis Bach. Structured sparse principal component analysis. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 366–373, 2010.
[14] Sundeep Rangan and Alyson K Fletcher. Iterative estimation of constrained rank-one matrices in noise. In 2012 IEEE International Symposium on Information Theory (ISIT), pages 1246–1250. IEEE, 2012.
[15] Yash Deshpande and Andrea Montanari. Information-theoretically optimal sparse PCA. In 2014 IEEE International Symposium on Information Theory (ISIT), pages 2197–2201. IEEE, 2014.
[16] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. Phase transitions in sparse PCA. In 2015 IEEE International Symposium on Information Theory (ISIT), pages 1635–1639.
IEEE, 2015.
[17] Amelia Perry, Alexander S. Wein, Afonso S. Bandeira, and Ankur Moitra. Optimality and sub-optimality of PCA I: Spiked random matrix models. The Annals of Statistics, 46(5):2416–2451, 2018.
[18] Marc Lelarge and Léo Miolane. Fundamental limits of symmetric low-rank matrix estimation. Probability Theory and Related Fields, 173(3-4):859–929, 2019.
[19] Jean Barbier, Mohamad Dia, Nicolas Macris, Florent Krzakala, Thibault Lesieur, and Lenka Zdeborová. Mutual information for symmetric rank-one matrix estimation: A proof of the replica formula. In Advances in Neural Information Processing Systems, pages 424–432, 2016.
[20] Léo Miolane. Fundamental limits of low-rank matrix estimation: the non-symmetric case. arXiv preprint arXiv:1702.00473, 2017.
[21] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. Constrained low-rank matrix estimation: phase transitions, approximate message passing and applications. Journal of Statistical Mechanics: Theory and Experiment, 2017(7):073403, 2017.
[22] Arash A Amini and Martin J Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics, pages 2877–2921, 2009.
[23] Quentin Berthet and Philippe Rigollet. Computational lower bounds for sparse PCA. arXiv preprint arXiv:1304.0828, 2013.
[24] Yash Deshpande and Andrea Montanari. Sparse PCA via covariance thresholding. In Advances in Neural Information Processing Systems, pages 334–342, 2014.
[25] Yash Deshpande, Emmanuel Abbe, and Andrea Montanari. Asymptotic mutual information for the balanced binary stochastic block model. Information and Inference: A Journal of the IMA, 6(2):125–170, 2016.
[26] Florent Krzakala, Jiaming Xu, and Lenka Zdeborová. Mutual Information in Rank-One Matrix Estimation. In
2016 IEEE Information Theory Workshop (ITW), pages 71\u201375, September 2016.\narXiv: 1603.08447.\n\n[27] Ahmed El Alaoui and Florent Krzakala. Estimation in the spiked Wigner model: A short proof\nof the replica formula. In 2018 IEEE International Symposium on Information Theory (ISIT),\npages 1874\u20131878, June 2018.\n\n[28] Ahmed El Alaoui, Florent Krzakala, and Michael I Jordan. Finite size corrections and likelihood\n\nratio \ufb02uctuations in the spiked Wigner model. arXiv preprint arXiv:1710.02903, 2017.\n\n[29] Jean-Christophe Mourrat. Hamilton-Jacobi equations for \ufb01nite-rank matrix inference. arXiv\n\npreprint arXiv:1904.05294, 2019.\n\n[30] Jean Barbier, Florent Krzakala, Nicolas Macris, L\u00e9o Miolane, and Lenka Zdeborov\u00e1. Optimal\nerrors and phase transitions in high-dimensional generalized linear models. Proceedings of the\nNational Academy of Sciences, 116(12):5451\u20135460, 2019.\n\n[31] Marylou Gabri\u00e9, Andre Manoel, Cl\u00e9ment Luneau, Jean Barbier, Nicolas Macris, Florent\nKrzakala, and Lenka Zdeborov\u00e1. Entropy and mutual information in models of deep neural\nnetworks.\nIn S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and\nR. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1821\u20131831.\nCurran Associates, Inc., 2018.\n\n[32] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons,\n\n2012.\n\n[33] Dongning Guo, S. Shamai, and S. Verd\u00fa. Mutual information and minimum mean-square error\nin gaussian channels. IEEE Transactions on Information Theory, 51(4):1261\u20131282, April 2005.\n[34] Christopher A Metzler, Arian Maleki, and Richard G Baraniuk. From denoising to compressed\n\nsensing. IEEE Transactions on Information Theory, 62(9):5117\u20135144, 2016.\n\n[35] Rapha\u00ebl Berthier, Andrea Montanari, and Phan-Minh Nguyen. State evolution for approximate\nmessage passing with non-separable functions. 
Information and Inference: A Journal of the IMA, 2019.
[36] Adel Javanmard and Andrea Montanari. State evolution for general approximate message passing algorithms, with applications to spatial coupling. Information and Inference: A Journal of the IMA, 2(2):115–144, 2013.
[37] Benjamin Aubin, Bruno Loureiro, Antoine Maillard, Florent Krzakala, and Lenka Zdeborová. Demonstration codes - the spiked matrix model with generative priors. https://github.com/benjaminaubin/StructuredPrior_demo.
[38] Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability, 33(5):1643–1697, 2005.
[39] F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, and P. Zhang. Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences, 110(52):20935–20940, December 2013.
[40] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.