{"title": "Phase Retrieval Under a Generative Prior", "book": "Advances in Neural Information Processing Systems", "page_first": 9136, "page_last": 9146, "abstract": "We introduce a novel deep-learning inspired formulation of the \\textit{phase retrieval problem}, which asks to recover a signal $y_0 \\in \\R^n$ from $m$ quadratic observations, under structural assumptions on the underlying signal. As is common in many imaging problems, previous methodologies have considered natural signals as being sparse with respect to a known basis, resulting in the decision to enforce a generic sparsity prior. However, these methods for phase retrieval have encountered possibly fundamental limitations, as no computationally efficient algorithm for sparse phase retrieval has been proven to succeed with fewer than $O(k^2\\log n)$ generic measurements, which is larger than the theoretical optimum of $O(k \\log n)$. In this paper, we sidestep this issue by considering a prior that a natural signal is in the range of a generative neural network $G : \\R^k \\rightarrow \\R^n$.  We introduce an empirical risk formulation that has favorable global geometry for gradient methods, as soon as $m = O(k)$, under the model of a multilayer fully-connected neural network with random weights.  Specifically, we show that there exists a descent direction outside of a small neighborhood around the true $k$-dimensional latent code and a negative multiple thereof.  This formulation for structured phase retrieval thus benefits from two effects: generative priors can more tightly represent natural signals than sparsity priors, and this empirical risk formulation can exploit those generative priors at an information theoretically optimal sample complexity, unlike for a sparsity prior. 
We corroborate these results with experiments showing that exploiting generative models in phase retrieval tasks outperforms both sparse and general phase retrieval methods.", "full_text": "Phase Retrieval Under a Generative Prior\n\nPaul Hand*\n\nNortheastern University\n\np.hand@northeastern.edu\n\nOscar Leong\n\nRice University\n\noscar.f.leong@rice.edu\n\nVladislav Voroninski\n\nHelm.ai\n\nvlad@helm.ai\n\nAbstract\n\nWe introduce a novel deep learning inspired formulation of the phase retrieval\nproblem, which asks to recover a signal $y_0 \\in \\R^n$ from $m$ quadratic observations,\nunder structural assumptions on the underlying signal. As is common in many\nimaging problems, previous methodologies have considered natural signals as\nbeing sparse with respect to a known basis, resulting in the decision to enforce a\ngeneric sparsity prior. However, these methods for phase retrieval have encountered\npossibly fundamental limitations, as no computationally efficient algorithm for\nsparse phase retrieval has been proven to succeed with fewer than $O(k^2 \\log n)$\ngeneric measurements, which is larger than the theoretical optimum of $O(k \\log n)$.\nIn this paper, we propose a new framework for phase retrieval by modeling natural\nsignals as being in the range of a deep generative neural network $G : \\R^k \\rightarrow \\R^n$.\nWe introduce an empirical risk formulation that has favorable global geometry for\ngradient methods, as soon as $m = O(kd^2 \\log n)$, under the model of a $d$-layer\nfully-connected neural network with random weights. Specifically, when suitable\ndeterministic conditions on the generator and measurement matrix are met, we\nconstruct a descent direction for any point outside of a small neighborhood around\nthe true $k$-dimensional latent code and a negative multiple thereof. 
This formulation\nfor structured phase retrieval thus benefits from two effects: generative priors can\nmore tightly represent natural signals than sparsity priors, and this empirical risk\nformulation can exploit those generative priors at an information theoretically\noptimal sample complexity, unlike for a sparsity prior. We corroborate these results\nwith experiments showing that exploiting generative models in phase retrieval tasks\noutperforms both sparse and general phase retrieval methods.\n\n*Authors are listed in alphabetical order.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n1 Introduction\n\nWe study the problem of recovering a signal $y_0 \\in \\R^n$ given $m \\ll n$ phaseless observations of the\nform $b = |Ay_0|$ where the measurement matrix $A \\in \\R^{m \\times n}$ is known and $|\\cdot|$ is understood to act\nentrywise. This is known as the phase retrieval problem. In this work, we assume, as a prior, that\nthe signal $y_0$ is in the range of a generative model $G : \\R^k \\rightarrow \\R^n$ so that $y_0 = G(x_0)$ for some\n$x_0 \\in \\R^k$. To recover $y_0$, we first recover the original latent code $x_0$ corresponding to it, from which\n$y_0$ is obtained by applying $G$. Hence we study the phase retrieval problem under a generative prior\nwhich asks:\n\nfind $x \\in \\R^k$ such that $b = |AG(x)|$.\n\nWe will refer to this formulation as Deep Phase Retrieval (DPR). The phase retrieval problem has\napplications in X-ray crystallography [21, 29], optics [34], astronomical imaging [14], diffraction\nimaging [5], and microscopy [28]. In these problems, the phase information of an object is lost due to\nphysical limitations of scientific instruments. 
In crystallography, the linear measurements in practice\nare typically Fourier modes because they are the far field limit of a diffraction pattern created by\nemitting a quasi-monochromatic wave on the object of interest.\nIn many applications, the signals to be recovered are compressible or sparse with respect to a\ncertain basis (e.g. wavelets). Many researchers have attempted to leverage sparsity priors in phase\nretrieval to yield more efficient recovery algorithms. However, these methods have been met with\npotentially severe fundamental limitations. In the Gaussian measurement regime where $A$ has i.i.d.\nGaussian entries, one would hope that recovery of a $k$-sparse $n$-dimensional signal is possible\nwith $O(k \\log n)$ measurements. However, there is no known method to succeed with fewer than\n$O(k^2 \\log n)$ measurements. Moreover, [26] proved that the semidefinite program PhaseLift cannot\noutperform this suboptimal sample complexity by direct $\\ell_1$ penalization. This is in stark contrast to\nthe success of leveraging sparsity in linear compressed sensing to yield optimal sample complexity.\nHence enforcing sparsity as a generic prior in phase retrieval may be fundamentally limiting sample\ncomplexity.\n\nOur contribution. We show information theoretically optimal sample complexity$^2$ for structured\nphase retrieval under generic measurements and a novel nonlinear formulation based on empirical risk\nand a generative prior. In this work, we suppose that the signal of interest is the output of a generative\nmodel. In particular, the generative model is a $d$-layer, fully-connected, feed forward neural network\nwith Rectified Linear Unit (ReLU) activation functions and no bias terms. Let $W_i \\in \\R^{n_i \\times n_{i-1}}$\ndenote the weights in the $i$-th layer of our network for $i = 1, \\ldots, d$ where $k = n_0 < n_1 < \\cdots < n_d$.\nGiven an input $x \\in \\R^k$, the output of the generative model $G : \\R^k \\rightarrow \\R^{n_d}$ can be expressed as\n\n$$G(x) := \\mathrm{relu}(W_d \\ldots \\mathrm{relu}(W_2(\\mathrm{relu}(W_1 x))) \\ldots)$$\n\nwhere $\\mathrm{relu}(x) = \\max(x, 0)$ acts entrywise. We further assume that the measurement matrix $A$\nand each weight matrix $W_i$ have i.i.d. Gaussian entries. The Gaussian assumption of the weight\nmatrices is supported by empirical evidence showing that neural networks learned from data have\nweights that obey statistics similar to Gaussians [1]. Furthermore, there has also been work done in\nestablishing a relationship between deep networks and Gaussian processes [25]. Nevertheless, we\nwill introduce deterministic conditions on the weights for which our results hold, allowing the use of\nother distributions.\nTo recover $x_0$, we study the following $\\ell_2$ empirical risk minimization problem:\n\n$$\\min_{x \\in \\R^k} f(x) := \\frac{1}{2} \\big\\| |AG(x)| - |AG(x_0)| \\big\\|^2. \\tag{1}$$\n\nDue to the non-convexity of the objective function, there is no a priori guarantee that gradient descent\nschemes can solve (1) as many local minima may exist. In spite of this, our main result illustrates\nthat the objective function exhibits favorable geometry for gradient methods. Moreover, our result\nholds with information theoretically optimal sample complexity:\nTheorem 1 (Informal). If we have a sufficient number of measurements $m = \\Omega(kd \\log(n_1 \\ldots n_d))$\nand our network is sufficiently expansive at each layer $n_i = \\Omega(n_{i-1} \\log n_{i-1})$, then there exists a\ndescent direction $v_{x,x_0} \\in \\R^k$ for any non-zero $x \\in \\R^k$ outside of two small neighborhoods centered\nat the true solution $x_0$ and a negative multiple $-\\rho_d x_0$ with high probability. In addition, the origin is\na local maximum of $f$. Here $\\rho_d > 0$ depends on the number of layers $d$ and $\\rho_d \\rightarrow 1$ as $d \\rightarrow \\infty$.\nOur main result asserts that the objective function does not have any spurious local minima away\nfrom neighborhoods of the true solution and a negative multiple of it. Hence if one were to solve (1)\nvia gradient descent and the algorithm converged, the final iterate would be close to the true solution\nor a negative multiple thereof. The proof of this result is a concentration argument. We first prove the\nsufficiency of two deterministic conditions on the weights $W_i$ and measurement matrix $A$. We then\nshow that Gaussian $W_i$ and $A$ satisfy these conditions with high probability. Finally, using these two\nconditions, we argue that the specified descent direction $v_{x,x_0}$ concentrates around a vector $h_{x,x_0}$\nthat is continuous for non-zero $x \\in \\R^k$ and vanishes only when $x \\approx x_0$ or $x \\approx -\\rho_d x_0$.\nRather than working against potentially fundamental limitations of polynomial time algorithms, we\nexamine more sophisticated priors using generative models. Our results illustrate that these priors are,\nin reality, less limiting in terms of sample complexity, both by providing more compressibility and by\nbeing able to be more tightly enforced.\n\n$^2$ With respect to the dimensionality of the latent code given to the generative network.\n\nPrior methodologies for general phase retrieval.\nIn the Gaussian measurement regime, most of\nthe techniques to solve phase retrieval problems can be classified as convex or non-convex methods.\nIn terms of convex techniques, lifting-based methods transform the signal recovery problem into a\nrank-one matrix recovery problem by lifting the signal into the space of positive semidefinite matrices.\nThese semidefinite programming (SDP) approaches, such as PhaseLift [9], can provably recover\nany $n$-dimensional signal with $O(n \\log n)$ measurements. A refinement on this analysis by [7] for\nPhaseLift showed that recovery is in fact possible with $O(n)$ measurements. 
Other convex methods\ninclude PhaseCut [33], an SDP approach, and linear programming algorithms such as PhaseMax,\nwhich has been shown to achieve $O(n)$ sample complexity [17].\nNon-convex methods encompass alternating minimization approaches such as the original Gerchberg-\nSaxton [16] and Fienup [15] algorithms and direct optimization algorithms such as Wirtinger Flow\n[8]. These latter methods directly tackle the least squares objective function\n\n$$\\min_{y \\in \\R^n} \\frac{1}{2} \\big\\| |Ay|^2 - |Ay_0|^2 \\big\\|^2. \\tag{2}$$\n\nIn the seminal work, [8] show that through an initialization via the spectral method, a gradient\ndescent scheme can solve (2) where the gradient is understood in the sense of Wirtinger calculus with\n$O(n \\log n)$ measurements. Expanding on this, a later study on the minimization of (2) in [31] showed\nthat with $O(n \\log^3 n)$ measurements, the energy landscape of the objective function exhibited global\nbenign geometry which would allow it to be solved efficiently by gradient descent schemes without\nspecial initialization. There also exist amplitude flow methods that solve the following non-smooth\nvariation of (2):\n\n$$\\min_{y \\in \\R^n} \\frac{1}{2} \\big\\| |Ay| - |Ay_0| \\big\\|^2. \\tag{3}$$\n\nThese methods have found success with $O(n)$ measurements [13] and have been shown to empirically\nperform better than intensity-based methods using the squared formulation in (2) [37].\n\nSparse phase retrieval. Many of the successful methodologies for general phase retrieval have\nbeen adapted to try to solve sparse phase retrieval problems. In terms of non-convex optimization,\nWirtinger Flow type methods such as Thresholded Wirtinger Flow [6] create a sparse initializer via\nthe spectral method and perform thresholded gradient descent updates to generate sparse iterates\nto solve (2). Another non-convex method, SPARTA [35], estimates the support of the signal for\nits initialization and performs hard thresholded gradient updates to the amplitude-based objective\nfunction (3). 
Both of these methods require $O(k^2 \\log n)$ measurements for a generic $k$-sparse\n$n$-dimensional signal to succeed, which is more than the theoretical optimum $O(k \\log n)$.\nWhile lifting-based methods such as PhaseLift have been proven unable to beat the suboptimal sample\ncomplexity $O(k^2 \\log n)$, there has been some progress towards breaking this barrier. In [19], the\nauthors show that with an initializer that sufficiently correlates with the true solution, a linear program\ncan recover the sparse signal from $O(k \\log \\frac{n}{k})$ measurements. However, the best known initialization\nmethods require at least $O(k^2 \\log n)$ measurements [6]. Outside of the Gaussian measurement regime,\nthere have been other results showing that if one were able to design their own measurement matrices,\nthen the optimal sample complexity could be reached [22]. For example, [2] showed that assuming\nthe measurement vectors were chosen from an incoherent subspace, then recovery is possible with\n$O(k \\log \\frac{n}{k})$ measurements. However, these results would be difficult to generalize to the experimental\nsetting as their design architectures are often unrealistic. Moreover, the Gaussian measurement regime\nmore closely models the experimental Fourier diffraction measurements observed in, for example,\nX-ray crystallography. As Fourier models are the ultimate goal, results towards lowering this sample\ncomplexity in the Gaussian measurement regime must be made or new modes of regularization must\nbe explored in order for phase retrieval to advance.\n\nRelated work. There has been recent empirical evidence supporting applying a deep learning based\napproach to holographic imaging, a phase retrieval problem. The authors in [18] show that a neural\nnetwork with ReLU activation functions can learn to perform holographic image reconstruction. 
In\nparticular, they show that compared to current approaches, this neural network based method requires\nfewer measurements to succeed and is computationally more efficient, needing only one hologram to\nreconstruct the necessary images.\nFurthermore, there have been a number of recent advancements in leveraging generative priors over\nsparsity priors in compressed sensing. In [4], the authors considered the least squares objective\n\n$$\\min_{x \\in \\R^k} \\frac{1}{2} \\| AG(x) - AG(x_0) \\|^2. \\tag{4}$$\n\nThey provided empirical evidence showing that 5-10X fewer measurements were needed to succeed\nin recovery compared to standard sparsity-based approaches such as Lasso. In terms of theory, they\nshowed that if $A$ satisfied a restricted eigenvalue condition and if one were able to solve (4), then\nthe solution would be close to optimal. The authors in [20] analyze the same optimization problem\nas in [4] but exhibit global guarantees regarding the non-convex objective function. Under particular\nconditions about the expansivity of each neural network layer and randomness assumptions on their\nweights, they show that the energy landscape of the objective function does not have any spurious\nlocal minima. Furthermore, there is always a descent direction outside of two small neighborhoods\nof the global minimum and a negative scalar multiple of it. The success of leveraging generative\npriors in compressed sensing along with the sample complexity bottlenecks in sparse phase retrieval\nhave influenced this work to consider enforcing a generative prior in phase retrieval to surpass sparse\nphase retrieval\u2019s current theoretical and practical limitations.\nNotation. Let $(\\cdot)^\\top$ denote the real transpose. Let $[n] = \\{1, \\ldots, n\\}$. Let $B(x, r)$ denote the\nEuclidean ball centered at $x$ with radius $r$. Let $\\|\\cdot\\|$ denote the $\\ell_2$ norm for vectors and spectral norm\nfor matrices. For any non-zero $x \\in \\R^n$, let $\\hat{x} = x/\\|x\\|$. Let $\\Pi_{i=d}^{1} W_i = W_d W_{d-1} \\ldots W_1$. 
Let $I_n$ be\nthe $n \\times n$ identity matrix. Let $S^{k-1}$ denote the unit sphere in $\\R^k$. We write $c = \\Omega(\\delta)$ when $c \\geq C\\delta$\nfor some positive constant $C$. Similarly, we write $c = O(\\delta)$ when $c \\leq C\\delta$ for some positive constant\n$C$. When we say that a constant depends polynomially on $\\epsilon^{-1}$, this means that it is at least $C\\epsilon^{-k}$\nfor some positive $C$ and positive integer $k$. For notational convenience, we write $a = b + O_1(\\epsilon)$ if\n$\\|a - b\\| \\leq \\epsilon$ where $\\|\\cdot\\|$ denotes $|\\cdot|$ for scalars, $\\ell_2$ norm for vectors, and spectral norm for matrices.\nDefine $\\mathrm{sgn} : \\R \\rightarrow \\R$ to be $\\mathrm{sgn}(x) = x/|x|$ for non-zero $x \\in \\R$ and $\\mathrm{sgn}(x) = 0$ otherwise. For a\nvector $v \\in \\R^n$, $\\mathrm{diag}(\\mathrm{sgn}(v))$ is $\\mathrm{sgn}(v_i)$ in the $i$-th diagonal entry and $\\mathrm{diag}(v > 0)$ is 1 in the $i$-th\ndiagonal entry if $v_i > 0$ and 0 otherwise.\n\n2 Algorithm\n\nWhile our main result illustrates that the objective function exhibits favorable geometry for optimization,\nit does not guarantee recovery of the signal as gradient descent algorithms could, in principle,\nconverge to the negative multiple of our true solution. Hence we propose a gradient descent scheme\nto recover the desired solution by escaping this region. First, consider Figure 1 which illustrates the\nbehavior of our objective function in expectation, i.e. when the number of measurements $m \\rightarrow \\infty$.\nWe observe two important attributes of the objective function\u2019s landscape: (1) there exist two minima,\nthe true solution $x_0$ and a negative multiple $-\\beta x_0$ for some $\\beta > 0$ and (2) if $z \\approx x_0$ while $w \\approx -\\beta x_0$,\nwe have that $f(z) < f(w)$, i.e. the objective function value is lower near the true solution than near\nits negative multiple. This is due to the fact that the true solution is in fact the global optimum.\nBased on these attributes, we will introduce a gradient descent scheme to converge to the global\nminimum. First, we define some useful quantities. For any $x \\in \\R^k$ and matrix $W \\in \\R^{n \\times k}$, define\n$W_{+,x} := \\mathrm{diag}(Wx > 0)W$. 
That is, $W_{+,x}$ keeps the rows of $W$ that have a positive dot product with\n$x$ and zeroes out the rows that do not. We will extend the definition of $W_{+,x}$ to each layer of weights\n$W_i$ in our neural network. For $W_1 \\in \\R^{n_1 \\times k}$ and $x \\in \\R^k$, define $W_{1,+,x} := \\mathrm{diag}(W_1 x > 0)W_1$. For\neach layer $i \\in [d]$, define\n\n$$W_{i,+,x} := \\mathrm{diag}(W_i W_{i-1,+,x} \\ldots W_{2,+,x} W_{1,+,x} x > 0) W_i.$$\n\n$W_{i,+,x}$ keeps the rows of $W_i$ that are active when the input to the generative model is $x$. Then, for\nany $x \\in \\R^k$, the output of our generative model can be written as $G(x) = (\\Pi_{i=d}^{1} W_{i,+,x})x$. For any\n$z \\in \\R^n$, define $A_z := \\mathrm{diag}(\\mathrm{sgn}(Az))A$. Note that $|AG(x)| = A_{G(x)}G(x)$ for any $x \\in \\R^k$.\n\nFigure 1: Surface (left) and contour plot (right) of objective function with $m \\rightarrow \\infty$ and true solution\n$x_0 = [1, 0]^\\top \\in \\R^2$.\n\nSince a gradient descent scheme could in principle be attracted to the negative multiple, we exploit\nthe geometry of the objective function\u2019s landscape to escape this region. First, choose a random initial\niterate for gradient descent $x_1 \\neq 0$. At each iteration $i = 1, 2, \\ldots$, compute the descent direction\n\n$$v_{x_i,x_0} := (\\Pi_{i=d}^{1} W_{i,+,x_i})^\\top A_{G(x_i)}^\\top (|AG(x_i)| - |AG(x_0)|).$$\n\nThis is the gradient of our objective function $f$ where $f$ is differentiable. Once computed, we then\ntake a step in the direction of $-v_{x_i,x_0}$. However, prior to taking this step, we compare the objective\nfunction value for $x_i$ and its negation $-x_i$. If $f(-x_i) < f(x_i)$, then we set $x_i$ to its negation,\ncompute the descent direction and update the iterate. The intuition for this algorithm relies on the\nlandscape illustrated in Figure 1: since the true solution $x_0$ is the global minimum, the objective\nfunction value near $x_0$ is smaller than near $-\\rho_d x_0$. Hence if we begin to converge towards $-\\rho_d x_0$,\nthis algorithm will escape this region by choosing a point with lower objective function value, which\nwill be in a neighborhood of $x_0$. 
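\n\nThe masked weights, the signed measurement matrix, and the descent direction defined above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names, the tiny layer sizes, and the random seed are illustrative assumptions. It verifies that the masked product applied to x reproduces G(x), and compares the descent direction with a central finite-difference gradient of f at a generic point.

```python
import numpy as np

def generator(weights, x):
    # G(x) = relu(W_d ... relu(W_1 x)), with no bias terms.
    h = x
    for W in weights:
        h = np.maximum(W @ h, 0.0)
    return h

def linearized_map(weights, x):
    # Product W_{d,+,x} ... W_{1,+,x}: each W_i with the rows that are
    # inactive for input x zeroed out.
    P, h = np.eye(x.size), x
    for W in weights:
        W_plus = (W @ h > 0)[:, None] * W   # diag(W h > 0) W
        P, h = W_plus @ P, W_plus @ h
    return P

def objective(A, weights, x, b):
    # f(x) = 0.5 * || |A G(x)| - b ||^2 with b = |A G(x0)|.
    r = np.abs(A @ generator(weights, x)) - b
    return 0.5 * float(r @ r)

def descent_direction(A, weights, x, b):
    P = linearized_map(weights, x)
    Ay = A @ (P @ x)                        # A G(x)
    A_signed = np.sign(Ay)[:, None] * A     # A_{G(x)} = diag(sgn(A G(x))) A
    return P.T @ (A_signed.T @ (np.abs(Ay) - b))

# Small random instance; sizes are illustrative only.
rng = np.random.default_rng(0)
k, n1, n2, m = 3, 8, 12, 10
weights = [rng.normal(0.0, 1.0 / np.sqrt(n1), (n1, k)),
           rng.normal(0.0, 1.0 / np.sqrt(n2), (n2, n1))]
A = rng.normal(0.0, 1.0 / np.sqrt(m), (m, n2))
x0, x = rng.normal(size=k), rng.normal(size=k)
b = np.abs(A @ generator(weights, x0))

v = descent_direction(A, weights, x, b)

# Central finite differences of f at x, for comparison with v.
step = 1e-6
fd = np.array([(objective(A, weights, x + step * e, b)
                - objective(A, weights, x - step * e, b)) / (2 * step)
               for e in np.eye(k)])
print(np.allclose(linearized_map(weights, x) @ x, generator(weights, x)))
print(np.max(np.abs(v - fd)))
```

Because f is piecewise quadratic in x, the finite-difference comparison is only meaningful at points where no ReLU or sign pattern changes within the step, which holds for a generic random x.\n\n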
Algorithm 1 formally outlines this process.\n\nAlgorithm 1 Deep Phase Retrieval (DPR) Gradient method\nRequire: Weights $W_i$, measurement matrix $A$, observations $|AG(x_0)|$, and step size $\\alpha > 0$\n1: Choose an arbitrary initial point $x_1 \\in \\R^k \\setminus \\{0\\}$\n2: for $i = 1, 2, \\ldots$ do\n3:   if $f(-x_i) < f(x_i)$ then\n4:     $x_i \\leftarrow -x_i$;\n5:   end if\n6:   Compute $v_{x_i,x_0} = (\\Pi_{i=d}^{1} W_{i,+,x_i})^\\top A_{G(x_i)}^\\top (|AG(x_i)| - |AG(x_0)|)$;\n7:   $x_{i+1} = x_i - \\alpha v_{x_i,x_0}$;\n8: end for\n\nRemark. We note that while the function is not differentiable, the descent direction is well-defined\nfor all $x \\in \\R^k$ due to the definitions of $W_{i,+,x}$ and $A_{G(x)}$. When the objective function\nis differentiable, $v_{x,x_0}$ agrees with the true gradient. Otherwise, the descent direction only takes\ncomponents of the formula for which the inputs to each ReLU are nonnegative.\n\n3 Main Theoretical Analysis\n\nWe now formally present our main result. While the objective function is not smooth, its one-sided\ndirectional derivatives exist everywhere due to the continuity and piecewise linearity of $G$.\nLet $D_v f(x)$ denote the unnormalized one-sided directional derivative of $f$ at $x$ in the direction $v$:\n$D_v f(x) = \\lim_{t \\rightarrow 0^+} \\frac{f(x+tv) - f(x)}{t}$.\nTheorem 2. Fix $\\epsilon > 0$ such that $K_1 d^8 \\epsilon^{1/4} \\leq 1$ and let $d \\geq 2$. Suppose $G$ is such that $W_i \\in \\R^{n_i \\times n_{i-1}}$\nhas i.i.d. $N(0, 1/n_i)$ entries for $i = 1, \\ldots, d$. Suppose that $A \\in \\R^{m \\times n_d}$ has i.i.d. $N(0, 1/m)$\nentries independent from $\\{W_i\\}$. Then if $m \\geq C_\\epsilon dk \\log(n_1 n_2 \\ldots n_d)$ and $n_i \\geq C_\\epsilon n_{i-1} \\log n_{i-1}$ for\n$i = 1, \\ldots, d$, then with probability at least $1 - \\sum_{i=1}^{d} \\gamma n_i e^{-c_\\epsilon n_{i-1}} - \\gamma m^{4k+1} e^{-c_\\epsilon m}$, the following\nholds: for all non-zero $x, x_0 \\in \\R^k$, there exists $v_{x,x_0} \\in \\R^k$ such that the one-sided directional\nderivatives of $f$ satisfy\n\n$$D_{-v_{x,x_0}} f(x) < 0, \\quad \\forall x \\notin B(x_0, K_2 d^3 \\epsilon^{1/4} \\|x_0\\|) \\cup B(-\\rho_d x_0, K_2 d^{14} \\epsilon^{1/4} \\|x_0\\|) \\cup \\{0\\},$$\n\n$$D_x f(0) < 0, \\quad \\forall x \\neq 0,$$\n\nwhere $\\rho_d > 0$ converges to 1 as $d \\rightarrow \\infty$ and $K_1$ and $K_2$ are universal constants. 
Here $C_\\epsilon$ depends\npolynomially on $\\epsilon^{-1}$, $c_\\epsilon$ depends on $\\epsilon$, and $\\gamma$ is a universal constant.\nSee Section 3.1 for the definition of the descent direction $v_{x,x_0}$. We note that while we assume the\nweights to have i.i.d. Gaussian entries, we make no assumption about the independence between\nlayers. The result will be shown by proving the sufficiency of two deterministic conditions on the\nweights $W_i$ of our generative network and the measurement matrix $A$.\n\nWeight Distribution Condition. The first condition quantifies the Gaussianity and spatial arrangement\nof the neurons in each layer. We say that $W \\in \\R^{n \\times k}$ satisfies the Weight Distribution Condition\n(WDC) with constant $\\epsilon > 0$ if for any non-zero $x, y \\in \\R^k$:\n\n$$\\| W_{+,x}^\\top W_{+,y} - Q_{x,y} \\| \\leq \\epsilon \\quad \\text{where} \\quad Q_{x,y} := \\frac{\\pi - \\theta_{x,y}}{2\\pi} I_k + \\frac{\\sin \\theta_{x,y}}{2\\pi} M_{\\hat{x} \\leftrightarrow \\hat{y}}.$$\n\nHere $\\theta_{x,y} = \\angle(x, y)$ and $M_{\\hat{x} \\leftrightarrow \\hat{y}}$ (footnote 3) is the matrix that sends $\\hat{x} \\mapsto \\hat{y}$, $\\hat{y} \\mapsto \\hat{x}$, and $z \\mapsto 0$ for any $z \\in\n\\mathrm{span}(\\{x, y\\})^\\perp$. If $W_{ij} \\sim N(0, 1/n)$, then an elementary calculation gives $\\mathbb{E}[W_{+,x}^\\top W_{+,y}] = Q_{x,y}$.\n[20] proved that Gaussian $W$ satisfies the WDC with high probability (Lemma 1 in the Appendix).\n\nRange Restricted Concentration Property. The second condition is similar in the sense that it\nquantifies whether the measurement matrix behaves like a Gaussian when acting on the difference of\npairs of vectors given by the output of the generative model. 
We say that $A \\in \\R^{m \\times n}$ satisfies the\nRange Restricted Concentration Property (RRCP) with constant $\\epsilon > 0$ if for all non-zero $x, y \\in \\R^k$,\nthe matrices $A_{G(x)}$ and $A_{G(y)}$ satisfy the following for all $x_1, x_2, x_3, x_4 \\in \\R^k$:\n\n$$|\\langle (A_{G(x)}^\\top A_{G(y)} - \\Phi_{G(x),G(y)})(G(x_1) - G(x_2)), G(x_3) - G(x_4) \\rangle| \\leq 31\\epsilon \\|G(x_1) - G(x_2)\\| \\|G(x_3) - G(x_4)\\|$$\n\nwhere\n\n$$\\Phi_{z,w} := \\frac{\\pi - 2\\theta_{z,w}}{\\pi} I_n + \\frac{2 \\sin \\theta_{z,w}}{\\pi} M_{\\hat{z} \\leftrightarrow \\hat{w}}.$$\n\nIf $A_{ij} \\sim N(0, 1/m)$, then for any $z, w \\in \\R^n$, a calculation similar to the one for Gaussian $W$ gives\n$\\mathbb{E}[A_z^\\top A_w] = \\Phi_{z,w}$. In our work, we establish that Gaussian $A$ satisfies the RRCP with high\nprobability. Please see Section 6 in the Appendix for a complete proof.\nWe emphasize that these two conditions are deterministic, meaning that other distributions could be\nconsidered. We now state our main deterministic result.\nTheorem 3. Fix $\\epsilon > 0$ such that $K_1 d^8 \\epsilon^{1/4} \\leq 1$ and let $d \\geq 2$. Suppose that $G$ is such that\n$W_i \\in \\R^{n_i \\times n_{i-1}}$ satisfies the WDC with constant $\\epsilon$ for all $i = 1, \\ldots, d$. Suppose $A \\in \\R^{m \\times n_d}$ satisfies\nthe RRCP with constant $\\epsilon$. Then the same conclusion as Theorem 2 holds.\n\n3.1 Proof sketch for Theorem 2\n\nBefore we outline the proof of Theorem 2, we specify the descent direction $v_{x,x_0}$. For any $x \\in \\R^k$\nwhere $f$ is differentiable, we have that\n\n$$\\nabla f(x) = (\\Pi_{i=d}^{1} W_{i,+,x})^\\top A_{G(x)}^\\top A_{G(x)} (\\Pi_{i=d}^{1} W_{i,+,x}) x - (\\Pi_{i=d}^{1} W_{i,+,x})^\\top A_{G(x)}^\\top A_{G(x_0)} (\\Pi_{i=d}^{1} W_{i,+,x_0}) x_0.$$\n\nThis is precisely the descent direction specified in Algorithm 1, expanded with our notation. When $f$\nis not differentiable at $x$, choose a direction $w$ such that $f$ is differentiable at $x + \\delta w$ for sufficiently\nsmall $\\delta > 0$. Such a direction $w$ exists by the piecewise linearity of the generative model $G$. In fact,\nnot only is the function piecewise linear, each of the pieces is the intersection of a finite number\nof half spaces. Thus, with probability 1 any randomly chosen direction $w$ moves strictly into one\npiece, allowing for differentiability at $x + \\delta w$ for sufficiently small $\\delta$. We note that any such $w$ can\nbe chosen arbitrarily. Hence we define our descent direction $v_{x,x_0}$ as\n\n$$v_{x,x_0} = \\begin{cases} \\nabla f(x) & f \\text{ differentiable at } x \\in \\R^k \\\\\\\\ \\lim_{\\delta \\rightarrow 0^+} \\nabla f(x + \\delta w) & \\text{otherwise.} \\end{cases}$$\n\nFootnote 3: A formula for this matrix is as follows: consider a rotation matrix $R$ that sends $\\hat{x} \\mapsto e_1$ and $\\hat{y} \\mapsto\n\\cos \\theta_0 e_1 + \\sin \\theta_0 e_2$ where $\\theta_0 = \\angle(x, y)$. Then $M_{\\hat{x} \\leftrightarrow \\hat{y}} = R^\\top \\begin{bmatrix} \\cos \\theta_0 & \\sin \\theta_0 & 0 \\\\\\\\ \\sin \\theta_0 & -\\cos \\theta_0 & 0 \\\\\\\\ 0 & 0 & 0_{k-2} \\end{bmatrix} R$ where $0_{k-2}$\nis the $(k-2) \\times (k-2)$ matrix of zeros. Note that if $\\theta_0 = 0$ or $\\pi$, $M_{\\hat{x} \\leftrightarrow \\hat{y}} = \\hat{x}\\hat{x}^\\top$ or $-\\hat{x}\\hat{x}^\\top$, respectively.\n\nThe following is a sketch of the proof of Theorem 2:\n\n\u2022 By the WDC and RRCP, we have that the descent direction $v_{x,x_0}$ concentrates uniformly\nfor all non-zero $x, x_0 \\in \\R^k$ around a particular vector $\\overline{v}_{x,x_0}$ defined by equation (5) in the\nAppendix.\n\u2022 The WDC establishes that $\\overline{v}_{x,x_0}$ concentrates uniformly for all non-zero $x, x_0 \\in \\R^k$ around\na continuous vector $h_{x,x_0}$ defined by equation (7) in the Appendix.\n\u2022 A direct analysis shows that $h_{x,x_0}$ is only small in norm for $x \\approx x_0$ and $x \\approx -\\rho_d x_0$. See\nSection 5.3 for a complete proof. Since $v_{x,x_0} \\approx \\overline{v}_{x,x_0} \\approx h_{x,x_0}$, $v_{x,x_0}$ is also only small in\nnorm in neighborhoods around $x_0$ and $-\\rho_d x_0$, establishing Theorem 3.\n\u2022 Gaussian $W_i$ and $A$ satisfy the WDC and RRCP with high probability (Lemma 1 and\nProposition 2 in the Appendix).\nTheorem 2 is a combination of Lemma 1, Proposition 2, and Theorem 3. 
The full proofs of these\nresults can be found in the Appendix.\n\nRemark.\nIn comparison to the results in [20], considerable technical advances were needed in our\ncase, including establishing concentration of $A_{G(x)}$ over the range of $G$. The quantity $A_{G(x)}$ acts\nlike a spatially dependent sensing matrix, requiring a condition similar to the Restricted Isometry\nProperty that must hold simultaneously over a finite number of subspaces given by the range($G$).\n\n4 Experiments\n\nIn this section, we investigate the use of enforcing generative priors in phase retrieval tasks. We\ncompared our results with the sparse truncated amplitude flow algorithm (SPARTA) [35] and three\npopular general phase retrieval methods: Fienup [15], Gerchberg Saxton [16], and Wirtinger Flow [8].\nA MATLAB implementation of the SPARTA algorithm was made publicly available by the authors\nat https://gangwg.github.io/SPARTA/. We implemented the last three algorithms using the\nMATLAB phase retrieval library PhasePack [10]. While these methods are not intended for sparse\nrecovery, we include them to serve as baselines.\n\n4.1 Experiments for Gaussian signals\n\nWe first consider synthetic experiments using Gaussian measurements on Gaussian signals. In\nparticular, we considered a two layer network given by $G(x) = \\mathrm{relu}(W_2 \\mathrm{relu}(W_1 x))$ where each\n$W_i$ has i.i.d. $N(0, 1)$ entries for $i = 1, 2$. We set $k = 10$, $n_1 = 500$, and $n_2 = 1000$. We let\nthe entries of $A \\in \\R^{m \\times n_2}$ and $x_0 \\in \\R^k$ be i.i.d. $N(0, 1)$. We ran Algorithm 1 for 25 random\ninstances of $(A, W_1, W_2, x_0)$. A reconstruction $x^\\star$ is considered successful if the relative error\n$\\|G(x^\\star) - G(x_0)\\|/\\|G(x_0)\\| \\leq 10^{-3}$. We also compared our results with SPARTA. In this setting,\nwe chose a $k = 10$-sparse $y_0 \\in \\R^{n_2}$, where the nonzero coefficients are i.i.d. $N(0, 1)$. As before, we\nran SPARTA with 25 random instances of $(A, y_0)$ and considered a reconstruction $y^\\star$ successful if\n$\\|y^\\star - y_0\\|/\\|y_0\\| \\leq 10^{-3}$. We also experimented with sparsity levels $k = 3, 5$. 
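\n\nThe synthetic setup above, combined with the steps of Algorithm 1, can be sketched in a few lines of NumPy. This is a hedged illustration rather than the authors' code: the function names, the reduced problem sizes, the variance-normalized weights (as in Theorem 2, not the N(0, 1) scaling of this section), the step size, and the iteration count are all assumptions chosen so the example runs quickly.

```python
import numpy as np

def generator(weights, x):
    # G(x) = relu(W_d ... relu(W_1 x)), with no bias terms.
    h = x
    for W in weights:
        h = np.maximum(W @ h, 0.0)
    return h

def objective(A, weights, x, b):
    # f(x) = 0.5 * || |A G(x)| - b ||^2 with b = |A G(x0)|.
    r = np.abs(A @ generator(weights, x)) - b
    return 0.5 * float(r @ r)

def descent_direction(A, weights, x, b):
    # (prod_i W_{i,+,x})^T A_{G(x)}^T (|A G(x)| - b)
    P, h = np.eye(x.size), x
    for W in weights:
        W_plus = (W @ h > 0)[:, None] * W   # zero out inactive rows
        P, h = W_plus @ P, W_plus @ h
    Ay = A @ h                               # h equals G(x) here
    return P.T @ ((np.sign(Ay)[:, None] * A).T @ (np.abs(Ay) - b))

def dpr_gradient(A, weights, b, x1, step, iters):
    # Gradient method with the negation check of Algorithm 1.
    x = x1.copy()
    history = [objective(A, weights, x, b)]
    for _ in range(iters):
        if objective(A, weights, -x, b) < objective(A, weights, x, b):
            x = -x
        x = x - step * descent_direction(A, weights, x, b)
        history.append(objective(A, weights, x, b))
    return x, history

# Reduced-size instance (assumed sizes, smaller than in the experiments).
rng = np.random.default_rng(0)
k, n1, n2, m = 5, 50, 200, 60
weights = [rng.normal(0.0, 1.0 / np.sqrt(n1), (n1, k)),
           rng.normal(0.0, 1.0 / np.sqrt(n2), (n2, n1))]
A = rng.normal(0.0, 1.0 / np.sqrt(m), (m, n2))
x0 = rng.normal(size=k)
b = np.abs(A @ generator(weights, x0))

x_hat, history = dpr_gradient(A, weights, b, rng.normal(size=k),
                              step=0.2, iters=500)
y0, y_hat = generator(weights, x0), generator(weights, x_hat)
rel_err = np.linalg.norm(y_hat - y0) / np.linalg.norm(y0)
print(history[0], history[-1], rel_err)
```

The relative error printed at the end is the same success metric used in this section; whether a given run falls below the stated threshold depends on the assumed sizes, step size, and seed.\n\n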
Figure 2 displays the\npercentage of successful trials for different ratios $m/n$ where $n = n_2 = 1000$ and $m$ is the number\nof measurements.\n\nFigure 2: Empirical success rate with ratios $m/n$ where DPR\u2019s latent code dimension is $k = 10$,\nSPARTA\u2019s sparsity level ranges from $k = 3$, 5, and 10, and $n = 1000$. DPR achieves nearly the\nsame empirical success rate of recovering a 10-dimensional latent code as SPARTA in recovering a\n3-sparse 1000-dimensional signal.\n\n4.2 Experiments for MNIST and CelebA\n\nWe next consider image recovery tasks, where we use two different generative models for the MNIST\nand CelebA datasets. In each task, the goal is to recover an image $y_0 \\in \\R^n$ given $|Ay_0|$ where\n$A \\in \\R^{m \\times n}$ has i.i.d. $N(0, 1/m)$ entries. We found an estimate image $G(x^\\star)$ in the range of\nour generator via gradient descent, using the Adam optimizer [23]. Empirically, we noticed that\nAlgorithm 1 would typically only negate the latent code (Lines 3\u20134) at the initial iterate, if necessary.\nHence we use a modified version of Algorithm 1 in these image experiments: we ran two sessions of\ngradient descent for a random initial iterate $x_1$ and its negation $-x_1$ and chose the most successful\nreconstruction.\nIn the first image experiment, we used a pretrained Variational Autoencoder (VAE) from [4] that\nwas trained on the MNIST dataset [24]. This dataset consists of 60,000 images of handwritten\ndigits. Each image is of size $28 \\times 28$, resulting in vectorized images of size 784. As described in\n[4], the recognition network is of size 784-500-500-20 while the generator network is of size\n20-500-500-784. The latent code space dimension is $k = 20$.\n\nFigure 3: Top left: Example reconstructions with 200 measurements. Top right: Example reconstructions\nwith 500 measurements. 
Bottom: A comparison of DPR\u2019s reconstruction error versus each\nalgorithm for different numbers of measurements.\n\nFor SPARTA, we performed sparse recovery by transforming the images using the 2-D Discrete\nCosine Transform (DCT). We allowed 10 random restarts for each algorithm, including the sparse\nand general phase retrieval methods. The results in Figure 3 demonstrate the success of our algorithm\nwith very few measurements. For 200 measurements, we can achieve reasonable recovery. SPARTA\ncan achieve good recovery with 500 measurements while the other algorithms cannot. In addition,\nour algorithm exhibits recovery with 500 measurements compared to the alternatives requiring 1000\nand 1500 measurements, which is where they begin to succeed. The performance for the general\nphase retrieval methods is to be expected as they are known to succeed only when $m = \\Omega(n)$ where\n$n = 784$.\nWe note that while our algorithm succeeds with fewer measurements than the other methods, our\nperformance, as measured by per-pixel reconstruction error, saturates as the number of measurements\nincreases since our reconstruction accuracy is ultimately bounded by the generative model\u2019s representational\nerror. As generative models improve, their representational errors will decrease. Nonetheless,\nas can be seen in the reconstructed digits, the recoveries are semantically correct (the correct digit is\nlegibly recovered) even though the reconstruction error does not decay to zero. In applications, such\nas MRI and molecular structure estimation via X-ray crystallography, semantic error measures would\nbe more informative estimates of recovery performance than per-pixel error measures.\nIn the second experiment, we used a pretrained Deep Convolutional Generative Adversarial Network\n(DCGAN) from [4] that was trained on the CelebA dataset [27]. This dataset consists of 200,000\nfacial images of celebrities. 
The RGB images were cropped to be of size 64 \u00d7 64, resulting in\nvectorized images of dimension 64 \u00d7 64 \u00d7 3 = 12288. The latent code space dimension is k = 100.\nWe allowed 2 random restarts. We ran numerical experiments with the other methods and they did not\nsucceed at measurement levels below 5000. The general phase retrieval methods began reconstructing\nthe images when m = \u03a9(n), where n = 12288. Figure 4 showcases our results on\nreconstructing 10 images from the DCGAN\u2019s test set with 500 measurements.\n\nFigure 4: 10 reconstructed images from CelebA\u2019s test set using DPR with 500 measurements (panels: Original; DPR with DCGAN).\n\nAcknowledgments\nOL acknowledges support by the NSF Graduate Research Fellowship under Grant No. DGE-1450681.\nPH acknowledges funding by the grant NSF DMS-1464525.\n\nReferences\n[1] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. Why are deep nets reversible: A simple theory\nwith implications for training. CoRR, abs/1511.05653, 2015.\n\n[2] Sohail Bahmani and Justin Romberg. Efficient compressive phase retrieval with constrained\nsensing vectors. Advances in Neural Information Processing Systems (NIPS 2015), pages\n523\u2013531, 2015.\n\n[3] Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael Wakin. A simple proof of the\nrestricted isometry property for random matrices. Constructive Approximation, 28(3):253\u2013263,\n2008.\n\n[4] Ashish Bora, Alexandros G. Dimakis, Ajil Jalal, and Eric Price. Compressed sensing using\ngenerative models. arXiv preprint arXiv:1703.03208, 2017.\n\n[5] Oliver Bunk, Ana Diaz, Franz Pfeiffer, Christian David, Bernd Schmitt, Dillip K. Satapathy, and\nJ. Friso van der Veen. Diffractive imaging for periodic samples: Retrieving one-dimensional concentration\nprofiles across microfluidic channels. 
Acta Crystallographica Section A: Foundations\nof Crystallography, 63(4):306\u2013314, 2007.\n\n[6] Tony Cai, Xiaodong Li, and Zongming Ma. Optimal rates of convergence for noisy sparse\nphase retrieval via thresholded Wirtinger flow. The Annals of Statistics, 44(5):2221\u20132251, 2016.\n\n[7] Emmanuel J. Cand\u00e8s and Xiaodong Li. Solving quadratic equations via PhaseLift when there are\nabout as many equations as unknowns. Foundations of Computational Mathematics, 14(5):1017\u20131026,\n2014.\n\n[8] Emmanuel J. Cand\u00e8s, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger\nflow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985\u20132007,\n2015.\n\n[9] Emmanuel J. Cand\u00e8s, Thomas Strohmer, and Vladislav Voroninski. PhaseLift: Exact and stable\nsignal recovery from magnitude measurements via convex programming. Comm. Pure Applied\nMath, 66(8):1241\u20131274, 2013.\n\n[10] Rohan Chandra, Ziyuan Zhong, Justin Hontz, Val McCulloch, Christoph Studer, and Tom\nGoldstein. PhasePack: A phase retrieval library. Asilomar Conference on Signals, Systems, and\nComputers, 2017.\n\n[11] Mark A. Davenport. Proof of the RIP for sub-Gaussian matrices. OpenStax CNX,\nhttp://cnx.org/contents/f37687c1-d62b-4ede-8064-794a7e7da7da@5, 2013.\n\n[12] S. W. Drury. Honours analysis lecture notes, McGill University.\nhttp://www.math.mcgill.ca/drury/notes354.pdf, 2001.\n\n[13] Yonina C. Eldar, Georgios B. Giannakis, and Gang Wang. Solving systems of random quadratic\nequations via truncated amplitude flow. IEEE Transactions on Information Theory, 23(26):773\u2013794,\n2017.\n\n[14] C. Fienup and J. Dainty. Phase retrieval and image reconstruction for astronomy. Image\nRecovery: Theory and Application, pages 231\u2013275, 1987.\n\n[15] J.R. Fienup. Phase retrieval algorithms: A comparison. Applied Optics, 21:2758\u20132768, 1982.\n\n[16] R.W. Gerchberg and W.O. Saxton. 
A practical algorithm for the determination of phase from\nimage and diffraction plane pictures. Optik, 35:237\u2013246, 1972.\n\n[17] Tom Goldstein and Christoph Studer. PhaseMax: Convex phase retrieval via basis pursuit.\narXiv preprint arXiv:1610.07531, 2016.\n\n[18] Harun G\u00fcnaydin, Da Teng, Yair Rivenson, Aydogan Ozcan, and Yibo Zhang. Phase recovery\nand holographic image reconstruction using deep learning in neural networks. arXiv preprint\narXiv:1705.04286, 2017.\n\n[19] Paul Hand and Vladislav Voroninski. Compressed sensing from phaseless Gaussian measurements\nvia linear programming in the natural parameter space. arXiv preprint arXiv:1611.05985,\n2016.\n\n[20] Paul Hand and Vladislav Voroninski. Global guarantees for enforcing generative priors by\nempirical risk. arXiv preprint arXiv:1705.07576, 2017.\n\n[21] Robert W. Harrison. Phase problem in crystallography. J. Opt. Soc. Am. A, 10(5):1046\u20131055,\n1993.\n\n[22] Kishore Jaganathan, Samet Oymak, and Babak Hassibi. Sparse phase retrieval: Convex\nalgorithms and limitations. 2013 IEEE International Symposium on Information Theory\nProceedings (ISIT), pages 1022\u20131026, 2013.\n\n[23] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv\npreprint arXiv:1412.6980, 2014.\n\n[24] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning\napplied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[25] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and\nJascha Sohl-Dickstein. Deep neural networks as Gaussian processes. International Conference\non Learning Representations (ICLR 2018), 2018.\n\n[26] Xiaodong Li and Vladislav Voroninski. Sparse signal recovery from quadratic measurements\nvia convex programming. SIAM Journal on Mathematical Analysis, 45(5):3019\u20133033, 2013.\n\n[27] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 
Deep learning face attributes in the\nwild. Proceedings of the IEEE International Conference on Computer Vision, pages 3730\u20133738,\n2015.\n\n[28] Jianwei Miao, Tetsuya Ishikawa, Qun Shen, and Thomas Earnest. Extending X-ray crystallography\nto allow the imaging of noncrystalline materials, cells, and single protein complexes. Annu.\nRev. Phys. Chem., 59:387\u2013410, 2008.\n\n[29] RP Millane. Phase retrieval in crystallography and optics. J. Opt. Soc. Am. A, 7(3):394\u2013411,\n1990.\n\n[30] Yaniv Plan and Roman Vershynin. One-bit compressed sensing by linear programming.\nCommunications on Pure and Applied Mathematics, 66(8):1275\u20131297, 2013.\n\n[31] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. 2016 IEEE\nInternational Symposium on Information Theory (ISIT), pages 2379\u20132383, 2016.\n\n[32] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. Compressed\nSensing: Theory and Applications, Cambridge University Press, 2012.\n\n[33] Ir\u00e8ne Waldspurger, Alexandre d\u2019Aspremont, and St\u00e9phane Mallat. Phase recovery, maxcut and\ncomplex semidefinite programming. Mathematical Programming, 149(1-2):47\u201381, 2015.\n\n[34] Adriaan Walther. The question of phase retrieval in optics. Journal of Modern Optics, 10(1):41\u201349,\n1963.\n\n[35] Gang Wang, Liang Zhang, Georgios B. Giannakis, Mehmet Ak\u00e7akaya, and Jie Chen. Sparse\nphase retrieval via truncated amplitude flow. IEEE Transactions on Signal Processing, 66:479\u2013491,\n2018.\n\n[36] James G. Wendel. A problem in geometric probability. Math. Scand., 11:109\u2013111, 1962.\n\n[37] Li-Hao Yeh, Jonathan Dong, Jingshan Zhong, Lei Tian, Michael Chen, Gongguo Tang, Mahdi\nSoltanolkotabi, and Laura Waller. Experimental robustness of Fourier ptychography phase\nretrieval algorithms. 
Optics Express, PP(99):33214\u201333240, 2015.", "award": [], "sourceid": 5491, "authors": [{"given_name": "Paul", "family_name": "Hand", "institution": "Northeastern University"}, {"given_name": "Oscar", "family_name": "Leong", "institution": "Rice University"}, {"given_name": "Vlad", "family_name": "Voroninski", "institution": "Helm.ai"}]}