{"title": "Global Guarantees for Blind Demodulation with Generative Priors", "book": "Advances in Neural Information Processing Systems", "page_first": 11535, "page_last": 11545, "abstract": "We study a deep learning inspired formulation for the blind demodulation problem, which is the task of recovering two unknown vectors from their entrywise multiplication. We consider the case where the unknown vectors are in the range of known deep generative models, $\\mathcal{G}^{(1)}:\\mathbb{R}^n\\rightarrow\\mathbb{R}^\\ell$ and $\\mathcal{G}^{(2)}:\\mathbb{R}^p\\rightarrow\\mathbb{R}^\\ell$. In the case when the networks corresponding to the generative models are expansive, the weight matrices are random, and the dimension of the unknown vectors satisfies $\\ell = \\Omega(n^2+p^2)$, up to log factors, we show that the empirical risk objective has a favorable landscape for optimization. That is, the objective function has a descent direction at every point outside of a small neighborhood around four hyperbolic curves. We also characterize the local maximizers of the empirical risk objective and, hence, show that there do not exist any other stationary points outside of these neighborhoods around the four hyperbolic curves and the set of local maximizers. We also implement a gradient descent scheme inspired by the geometry of the landscape of the objective function. In order to converge to a global minimizer, this gradient descent scheme exploits the fact that exactly one of the hyperbolic curves corresponds to the global minimizer, and thus points near this hyperbolic curve have a lower objective value than points close to the other, spurious hyperbolic curves. We show that this gradient descent scheme can effectively remove distortions synthetically introduced to the MNIST dataset.", "full_text": "Global Guarantees for Blind Demodulation with Generative Priors

Paul Hand
Dept. of Mathematics and College of Computer Science and Information
Northeastern University, MA
p.hand@northeastern.edu

Babhru Joshi
Dept. of Mathematics
University of British Columbia, BC
b.joshi@math.ubc.ca

Abstract

We study a deep learning inspired formulation for the blind demodulation problem, which is the task of recovering two unknown vectors from their entrywise multiplication. We consider the case where the unknown vectors are in the range of known deep generative models, G^(1) : R^n → R^ℓ and G^(2) : R^p → R^ℓ. In the case when the networks corresponding to the generative models are expansive, the weight matrices are random, and the dimension of the unknown vectors satisfies ℓ = Ω(n² + p²), up to log factors, we show that the empirical risk objective has a favorable landscape for optimization. That is, the objective function has a descent direction at every point outside of a small neighborhood around four hyperbolic curves. We also characterize the local maximizers of the empirical risk objective and, hence, show that there do not exist any other stationary points outside of these neighborhoods around the four hyperbolic curves and the set of local maximizers. We also implement a gradient descent scheme inspired by the geometry of the landscape of the objective function. In order to converge to a global minimizer, this gradient descent scheme exploits the fact that exactly one of the hyperbolic curves corresponds to the global minimizer, and thus points near this hyperbolic curve have a lower objective value than points close to the other, spurious hyperbolic curves. We show that this gradient descent scheme can effectively remove distortions synthetically introduced to the MNIST dataset.

1 Introduction

We study the problem of recovering two unknown vectors x_0 ∈ R^ℓ and w_0 ∈ R^ℓ from observations y_0 ∈ R^ℓ of the form

y_0 = w_0 ⊙ x_0,   (1)

where ⊙ is entrywise multiplication. 
This bilinear inverse problem (BIP) is known as the blind demodulation problem. BIPs, in general, have been extensively studied and include problems such as blind deconvolution/demodulation [Ahmed et al., 2014, Stockham et al., 1975, Kundur and Hatzinakos, 1996, Aghasi et al., 2016, 2019], phase retrieval [Fienup, 1982, Candès and Li, 2012, Candès et al., 2013], dictionary learning [Tosic and Frossard, 2011], matrix factorization [Hoyer, 2004, Lee and Seung, 2001], and self-calibration [Ling and Strohmer, 2015]. A significant challenge of BIPs is the ambiguity of solutions. These ambiguities are challenging because they cause the set of solutions to be non-convex.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

A common ambiguity, also shared by the BIP in (1), is the scaling ambiguity: any member of the set {(c w_0, (1/c) x_0) | c ≠ 0} solves (1). In addition to the scaling ambiguity, this BIP is difficult to solve because the solutions are non-unique, even when excluding the scaling ambiguity. For example, (w_0, x_0) and (1, w_0 ⊙ x_0) both satisfy (1). This structural ambiguity can be resolved by assuming a prior model on the unknown vectors. In past works relating to blind deconvolution and blind demodulation [Ahmed et al., 2014, Aghasi et al., 2019], this structural ambiguity issue was addressed by assuming a subspace prior, i.e. the unknown signals belong to known subspaces. Additionally, in many applications, the signals are compressible or sparse with respect to a basis like a wavelet basis or the Discrete Cosine Transform basis, which can address this structural ambiguity issue. In contrast to subspace and sparsity priors, we address the structural ambiguity issue by assuming the signals w_0 and x_0 belong to the range of known generative models G^(1) : R^n → R^ℓ and G^(2) : R^p → R^ℓ, respectively. 
That is, we assume that w_0 = G^(1)(h_0) for some h_0 ∈ R^n and x_0 = G^(2)(m_0) for some m_0 ∈ R^p. So, to recover the unknown vectors w_0 and x_0, we first recover the latent codes h_0 and m_0 and then apply G^(1) and G^(2) to h_0 and m_0, respectively. Thus, the blind demodulation problem under a generative prior that we study is:

find h ∈ R^n and m ∈ R^p, up to the scaling ambiguity, such that y_0 = G^(1)(h) ⊙ G^(2)(m).

In recent years, advances in generative modeling of images [Karras et al., 2017] have significantly increased the scope of using a generative model as a prior in inverse problems. Generative models are now used in speech synthesis [van den Oord et al., 2016], image in-painting [Iizuka et al., 2017], image-to-image translation [Zhu et al., 2017], superresolution [Sønderby et al., 2017], compressed sensing [Bora et al., 2017, Lohit et al., 2018], blind deconvolution [Asim et al., 2018], blind ptychography [Shamshad et al., 2018], and many more fields. Most of these papers empirically show that using a generative model as a prior for solving inverse problems outperforms classical methods. For example, in compressed sensing, optimizing over the latent code space to recover images from their compressive measurements has been empirically shown to succeed with 10x fewer measurements than classical sparsity based methods [Bora et al., 2017]. Similarly, the authors of Asim et al. [2018] empirically show that using generative priors in the image deblurring inverse problem provides a very effective regularization that produces sharp deblurred images from very blurry images.

In the present paper, we use generative priors to solve the blind demodulation problem (1). The generative model we consider is an expansive, fully connected, feed forward neural network with Rectified Linear Unit (ReLU) activation functions and no bias terms. 
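As a quick illustration of the forward model and its scaling ambiguity, the following sketch (our own toy example, not from the paper; the tiny network sizes and weights are arbitrary) builds two small random ReLU generators, forms the entrywise product, and checks that (c·w_0, x_0/c) is indistinguishable from (w_0, x_0):

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(weights, z):
    # ReLU network with no bias terms: relu(W_k ... relu(W_1 z)), relu entrywise.
    for W in weights:
        z = np.maximum(W @ z, 0.0)
    return z

# Expansive toy generators: R^2 -> R^20 and R^3 -> R^20.
W1 = [rng.normal(size=(8, 2)), rng.normal(size=(20, 8))]
W2 = [rng.normal(size=(8, 3)), rng.normal(size=(20, 8))]

h0, m0 = rng.normal(size=2), rng.normal(size=3)
w0, x0 = generator(W1, h0), generator(W2, m0)

y0 = w0 * x0  # entrywise (Hadamard) product: the observation in (1)

# Scaling ambiguity: (c*w0, x0/c) yields the same observation for any c > 0.
c = 3.7
assert np.allclose((c * w0) * (x0 / c), y0)
```

The assertion holds for every c > 0, which is why the recovery problem is only posed up to the scaling ambiguity.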
Our main contribution is that we show that the empirical risk objective function, for a sufficiently expansive random generative model, has a landscape favorable for gradient based methods to converge to a global minimizer. Our result implies that if the dimension of the unknown signals satisfies ℓ = Ω(n² + p²), up to log factors, then the landscape is favorable. In comparison, classical sparsity based methods for similar BIPs, like sparse blind demodulation [Lee et al., 2017] and sparse phase retrieval [Li and Voroninski, 2013], showed that exact recovery of the unknown signals is possible if the number of measurements scales quadratically, up to a log factor, with the sparsity level of the signals. While we show a similar scaling of the number of measurements w.r.t. the latent code dimension, the latent code dimension can be smaller than the sparsity level for the same signal, and thus recovering the signal using a generative prior can require fewer measurements.

1.1 Main results

We study the problem of recovering two unknown signals w_0 and x_0 in R^ℓ from observations y_0 = w_0 ⊙ x_0, where ⊙ denotes the entrywise product. We assume, as a prior, that the vectors w_0 and x_0 belong to the ranges of d-layer and s-layer neural networks G^(1) : R^n → R^ℓ and G^(2) : R^p → R^ℓ, respectively. The task of recovering w_0 and x_0 reduces to finding the latent codes h_0 ∈ R^n and m_0 ∈ R^p such that G^(1)(h_0) = w_0 and G^(2)(m_0) = x_0. More precisely, we consider generative networks modeled by

G^(1)(h) = relu(W_d^(1) · · · relu(W_2^(1) relu(W_1^(1) h)) · · · ) and G^(2)(m) = relu(W_s^(2) · · · relu(W_2^(2) relu(W_1^(2) m)) · · · ),

where relu(x) = max(x, 0) applies entrywise, W_i^(1) ∈ R^{n_i × n_{i−1}} for i = 1, ..., d with n = n_0 < n_1 < · · · < n_d = ℓ, and W_i^(2) ∈ R^{p_i × p_{i−1}} for i = 1, ..., s with p = p_0 < p_1 < · · · < p_s = ℓ. The blind demodulation problem we consider is:

Let: y_0 ∈ R^ℓ, h_0 ∈ R^n, m_0 ∈ R^p such that y_0 = G^(1)(h_0) ⊙ G^(2)(m_0),
Given: G^(1), G^(2) and measurements y_0,
Find: h_0 and m_0, up to the scaling ambiguity.

In order to recover h_0 and m_0, up to the scaling ambiguity, we consider the following empirical risk minimization program:

minimize over h ∈ R^n, m ∈ R^p:  f(h, m) := (1/2) ‖G^(1)(h_0) ⊙ G^(2)(m_0) − G^(1)(h) ⊙ G^(2)(m)‖₂².   (2)

Figure 1: Plots showing the landscape of the objective function with h_0 = 1 and m_0 = 1. (a) Landscape of the empirical risk function. (b) Note the four hyperbolic branches visible.

Figures 1a and 1b show the landscape of the objective function in the case when h_0 = m_0 = 1, s = d = 2, the networks are expansive, and the weight matrices W_i^(1) and W_i^(2) contain i.i.d. Gaussian entries. Clearly, the objective function in (2) is non-convex and, as a result, there is no a priori guarantee that gradient based methods will converge to a global minimum. Additionally, the objective function does not contain any regularizer, which would generally be used to resolve the scaling ambiguity, and thus every point in {(c h_0, (1/c) m_0) | c > 0} is a global optimum of (2). Nonetheless, we show that under certain conditions on the networks, the minimizers of (2) are in the neighborhood of four hyperbolic curves, one of which is the hyperbolic curve containing the global minimizers.

In order to define these hyperbolic neighborhoods, let

A_{ε,(h̃,m̃)} = { (h, m) ∈ R^n × R^p | ∃ c > 0 s.t. ‖(h, m) − (c h̃, (1/c) m̃)‖₂ ≤ ε ‖(c h̃, (1/c) m̃)‖₂ },   (3)

where (h̃, m̃) ∈ R^n × R^p is fixed. This set is an ε-neighborhood of the hyperbolic set {(c h̃, (1/c) m̃) | c > 0}. We show that the minimizers of (2) are contained in the four hyperbolic sets given by A_{ε,(h_0,m_0)}, A_{ε,(−ρ_d^(1) h_0, m_0)}, A_{ε,(h_0, −ρ_s^(2) m_0)}, and A_{ε,(−ρ_d^(1) h_0, −ρ_s^(2) m_0)}. 
Here, ε depends on the expansivity and number of layers of the networks, and both ρ_d^(1) and ρ_s^(2) are positive constants close to 1. We also show that the points in the set {(h, 0) | h ∈ R^n} ∪ {(0, m) | m ∈ R^p} are local maximizers. This result holds for networks with the following assumptions:

A1. The weight matrices are random.
A2. The weight matrices of the inner layers satisfy n_i ≥ c n_{i−1} log n_{i−1} for i = 1, ..., d−1 and p_i ≥ c p_{i−1} log p_{i−1} for i = 1, ..., s−1.
A3. The weight matrices of the last layer of each generator satisfy ℓ ≥ c((n_{d−1} log n_{d−1})² + (p_{s−1} log p_{s−1})²).

In the above assumptions, c is a constant that depends polynomially on the expansivity parameter of G^(1) and G^(2). Figures 1a and 1b show the landscape of the objective function and corroborate our findings. In the paper we provide two deterministic conditions that are sufficient to characterize the landscape of the objective function, and we show that Gaussian matrices satisfy these conditions. In essence, we only require approximately Gaussian matrices. We also note that the state-of-the-art literature on provable convergence of neural network training for regression and classification admits proofs only in the case that the final trained weights are close to their random initialization. Thus, our neural network assumptions are consistent with the best known cases for which networks can be provably trained.

Theorem 1 (Informal). Let

A = A_{ε,(h_0,m_0)} ∪ A_{ε,(−ρ_d^(1) h_0, m_0)} ∪ A_{ε,(h_0, −ρ_s^(2) m_0)} ∪ A_{ε,(−ρ_d^(1) h_0, −ρ_s^(2) m_0)},

where ε > 0 depends on the expansivity of our networks and ρ_d^(1), ρ_s^(2) → 1 as d, s → ∞, respectively. Suppose the networks are sufficiently expansive such that the numbers of neurons in the inner layers and the last layers satisfy assumptions A2 and A3, respectively. Then, with high probability, there exists a descent direction, given by one of the one-sided partial derivatives of the objective function in (2), for every (h, m) ∉ A ∪ {(h, 0) | h ∈ R^n} ∪ {(0, m) | m ∈ R^p}. In addition, elements of the set {(h, 0) | h ∈ R^n} ∪ {(0, m) | m ∈ R^p} are local maximizers.

Our main result states that the objective function in (2) does not have any spurious minimizers outside of the four hyperbolic neighborhoods. Thus, a gradient descent algorithm will converge to a point inside the four neighborhoods, one of which contains the global minimizers of (2). However, this does not by itself guarantee convergence to a global minimizer, and it does not resolve the inherent scaling ambiguity present in the problem. So, in order to converge to a global minimizer, we implement a gradient descent scheme that exploits the landscape of the objective function. That is, we exploit the fact that points near the hyperbolic curve corresponding to the global minimizer have a lower objective value than points close to the remaining three spurious hyperbolic curves. Second, in order to resolve the scaling ambiguity, we promote solutions that have equal ℓ₂ norm by normalizing the estimates at each iteration of the gradient descent scheme (see Section 2). In principle, a convergence result to a global minimizer by gradient descent is possible, and would require showing a convexity-like property around the hyperbola. We leave this for possible future work.

Theorem 1 also provides a global guarantee for the landscape of the objective function in (2) if the dimension of the unknown signals scales quadratically w.r.t. the dimension of the latent codes, i.e. ℓ = Ω(n² + p²), up to log factors. 
Our result, which we obtain by enforcing generative priors, may enjoy better sample complexity than classical priors like sparsity because: i) existing recovery guarantees for unstructured signals require a number of measurements that scales quadratically with the sparsity level, and ii) a signal can have a latent code dimension with respect to a GAN that is smaller than its sparsity level with respect to a wavelet basis. For example, consider a set of images that correspond to a single train going down a single track. This set of images forms a one dimensional sub-manifold of the manifold of natural images. If properly parameterized by a generative model, it would have a latent dimensionality of approximately 1, whereas the number of wavelet coefficients needed to describe any of those images is much greater. The work in Bora et al. [2017] shows that compressed sensing can be done with 5-10x fewer measurements than sparsity models. This provides evidence that generative models give a more economical representation than sparsity models. Additionally, it is more natural to view the natural signal manifold as a low-dimensional manifold, as opposed to a combinatorially-large union of low dimensional subspaces. Performance gains are provided by the fact that the natural signal manifold can be directly exploited, whereas the union of subspaces can only be indirectly exploited via convex relaxations. Thus, our result may be less limiting in terms of sample complexity.

1.2 Prior work on problems related to blind demodulation

A common approach to solving the BIP in (1) is to assume a subspace or sparsity prior on the unknown vectors. In these cases the unknown vectors w_0 and x_0 are assumed to be in the ranges of known matrices B ∈ R^{ℓ×n} and C ∈ R^{ℓ×p}, respectively. In Ahmed et al. [2014], the authors assumed a subspace prior and cast the BIP as a linear low rank matrix recovery problem. 
They introduced a semidefinite program based on nuclear norm minimization to recover the unknown matrix. For the case where the rows of B and C are Fourier and Gaussian vectors, respectively, they provide a recovery guarantee when the number of measurements satisfies ℓ = Ω(n + p), up to log factors. However, because this method operates in the space of matrices, it is computationally prohibitively expensive. Another limitation of the lifted approach is that efficiently recovering a low rank and sparse matrix from linear observations of the matrix has been challenging. Recently, Lee et al. [2017] provided a recovery guarantee with near optimal sample complexity for the low rank and sparse matrix recovery problem using an alternating minimization method, for a class of signals that satisfy a peakiness condition. However, for general signals the same work established a recovery result only for the case where the number of measurements scales quadratically with the sparsity level.

In order to address the computational cost of working in the lifted domain, a recent theme has been to introduce convex and non-convex programs that work in the natural parameter space. For example, in Bahmani and Romberg [2016], Goldstein and Studer [2016], the authors introduced PhaseMax, a convex program for phase retrieval that is based on finding a simple convex relaxation via the convex hull of the feasibility set. The authors showed that PhaseMax enjoys rigorous recovery guarantees if a good anchor is available. This formulation was extended to the sparse case in Hand and Voroninski [2016], where the authors considered SparsePhaseMax and provided a recovery guarantee with optimal sample complexity. The idea of formulating a convex program using a simple convex relaxation via the convex hull of the feasibility set was used in the blind demodulation problem as well [Aghasi et al., 2019, 2018]. In particular, Aghasi et al. 
[2018] introduced a convex program in the natural parameter space for the sparse blind demodulation problem in the case where the signs of the unknown signals are known. As in Lee et al. [2017], the authors in Aghasi et al. [2019] provide a recovery guarantee with optimal sample complexity for a class of signals. However, the result does not extend to unconstrained signals. Other approaches that operate in the natural parameter space are methods based on Wirtinger Flow. For example, in Candès et al. [2015], Wang et al. [2016], Li et al. [2016], the authors use Wirtinger Flow and its variants to solve the phase retrieval and blind deconvolution problems. These methods are non-convex and require a good initialization to converge to a global solution. However, they are simple to implement and enjoy rigorous recovery guarantees.

1.3 Other related work

In this paper, we consider the blind demodulation problem with the unknown signals assumed to be in the range of known generative models. Our work is motivated by experimental results in deep compressed sensing and deep blind deconvolution presented in Bora et al. [2017], Asim et al. [2018] and theoretical work in deep compressed sensing presented in Hand and Voroninski [2017]. In Bora et al. [2017], the authors consider the compressed sensing problem where, instead of a sparsity prior, a generative prior is used. They solved an empirical risk optimization program over the latent code space to recover images and empirically showed that their method succeeds with 10x fewer measurements than previous sparsity based methods. Following the empirical successes of deep compressed sensing, the authors in Hand and Voroninski [2017] provided a theoretical understanding of these successes by characterizing the landscape of the empirical risk objective function. 
In the random case, with the layers of the generative model sufficiently expansive, they showed that every point outside of a small neighborhood around the true solution and a negative multiple of the true solution has a descent direction with high probability. Another instance where generative models currently outperform sparsity based methods is sparse phase retrieval [Hand et al., 2018]. In sparse phase retrieval, current algorithms that enjoy a provable recovery guarantee of an unknown n-dimensional k-sparse signal require at least O(k² log n) measurements; whereas, when the unknown signal is assumed to be an output of a known d-layer generator G : R^k → R^n, the authors in Hand et al. [2018] showed that, under favorable conditions on the generator and with at least O(k d² log n) measurements, the empirical risk objective enjoys a favorable landscape.

Similarly, in Asim et al. [2018], the authors consider the blind deconvolution problem with a generative prior over the unknown signals. They empirically showed that using generative priors in the image deblurring inverse problem provides a very effective regularization that produces sharp deblurred images from very blurry images. The algorithm used to recover these deblurred images is an alternating minimization approach which solves the empirical risk minimization problem with ℓ₂ regularization on the unknown signals. The ℓ₂ regularization promotes solutions with the least ℓ₂ norm and resolves the scaling ambiguity present in the blind deconvolution problem. We consider a related problem, namely the blind demodulation problem with a generative prior on the unknown signals, and show that under certain conditions on the generators the empirical risk objective has a favorable landscape.

1.4 Notation

Vectors and matrices are written in boldface, while scalars and entries of vectors are written in plain font. We write 1 for the vector of all ones with dimensionality appropriate to the context. 
Let S^{n−1} be the unit sphere in R^n. We write I_n for the n × n identity matrix. For x ∈ R^K and y ∈ R^N, (x, y) is the corresponding vector in R^K × R^N. Let relu(x) = max(x, 0) apply entrywise for x ∈ R^n. Let diag(Wx > 0) be the diagonal matrix whose (i, i) entry is 1 if (Wx)_i > 0 and 0 otherwise. Let A ⪯ B mean that B − A is a positive semidefinite matrix. We write γ = O(δ) to mean that there exists a positive constant C such that γ ≤ Cδ, where δ is understood to be positive. Similarly, we write γ = Ω(δ) to mean that there exists a positive constant C such that γ ≥ Cδ. When we say that a constant depends polynomially on ε^{−1}, we mean that it is at most C ε^{−k} for some positive C and positive integer k. For notational convenience, we write a = b + O_1(ε) if ‖a − b‖ ≤ ε, where the norm is understood to be the absolute value for scalars, the ℓ₂ norm for vectors, and the spectral norm for matrices.

2 Algorithm

In this section, we propose a gradient descent scheme that solves (2). The scheme exploits the global geometry present in the landscape of the objective function in (2) and avoids regions containing spurious minimizers. It is based on two observations. The first observation is that the minimizers of (2) are close to the four hyperbolic curves given by {(c h_0, (1/c) m_0) | c > 0}, {(−c ρ_d^(1) h_0, (1/c) m_0) | c > 0}, {(c h_0, −(1/c) ρ_s^(2) m_0) | c > 0}, and {(−c ρ_d^(1) h_0, −(1/c) ρ_s^(2) m_0) | c > 0}, where ρ_d^(1) and ρ_s^(2) are close to 1. The second observation is that f(c h_0, (1/c) m_0) is less than f(−c ρ_d^(1) h_0, (1/c) m_0), f(c h_0, −(1/c) ρ_s^(2) m_0), and f(−c ρ_d^(1) h_0, −(1/c) ρ_s^(2) m_0) for any c > 0. This is because the curve {(c h_0, (1/c) m_0) | c > 0} corresponds to the global minimizers of (2).

We now introduce some quantities which are useful in stating the gradient descent algorithm. For any h ∈ R^n and W ∈ R^{ℓ×n}, define W_{+,h} = diag(Wh > 0)W. That is, W_{+,h} zeros out the rows of W that do not have a positive inner product with h and keeps the remaining rows. 
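A minimal numpy sketch of this masking operation (our own illustration; `W_plus` is a hypothetical helper name, not notation from the paper):

```python
import numpy as np

def W_plus(W, h):
    # diag(Wh > 0) W: zero out the rows of W whose inner product with h is not positive.
    mask = (W @ h > 0).astype(W.dtype)
    return mask[:, None] * W

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 3))
h = rng.normal(size=3)

# For a ReLU layer with no bias, relu(W h) = W_{+,h} h.
assert np.allclose(np.maximum(W @ h, 0.0), W_plus(W, h) @ h)
```

The identity relu(Wh) = W_{+,h} h is what makes the piecewise linear structure of the generators explicit.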
We extend the definition of W_{+,h} to each layer of weights W_i^(1) in our neural network. For W_1^(1) ∈ R^{n_1×n} and h ∈ R^n, define W_{1,+,h}^(1) := (W_1^(1))_{+,h} = diag(W_1^(1) h > 0) W_1^(1). For each layer i > 1, define

W_{i,+,h}^(1) = diag(W_i^(1) W_{i−1,+,h}^(1) · · · W_{2,+,h}^(1) W_{1,+,h}^(1) h > 0) W_i^(1).

Lastly, define Λ_{d,+,h}^(1) := Π_{i=d}^{1} W_{i,+,h}^(1) = W_{d,+,h}^(1) · · · W_{1,+,h}^(1). Using this notation, G^(1)(h) can be compactly written as Λ_{d,+,h}^(1) h. Similarly, we may write G^(2)(m) compactly as Λ_{s,+,m}^(2) m.

The gradient descent scheme is an alternating descent direction algorithm. We first pick an initial iterate (h_1, m_1) such that h_1 ≠ 0 and m_1 ≠ 0. At each iteration i = 1, 2, ..., we first compare the objective value at (h_i, m_i), (−h_i, m_i), (h_i, −m_i), and (−h_i, −m_i) and reset (h_i, m_i) to be the point with the least objective value. Second, we descend along a direction. We compute the descent direction g̃_{1,(h,m)}, given by the partial derivative of f in (2) w.r.t. h,

g̃_{1,(h,m)} := Λ_{d,+,h}^(1)ᵀ ( diag(Λ_{s,+,m}^(2) m)² Λ_{d,+,h}^(1) h − diag(Λ_{s,+,m}^(2) m) diag(Λ_{s,+,m_0}^(2) m_0) Λ_{d,+,h_0}^(1) h_0 ),

and take a step along this direction. Next, we compute the descent direction g̃_{2,(h,m)}, given by the partial derivative of f w.r.t. m,

g̃_{2,(h,m)} := Λ_{s,+,m}^(2)ᵀ ( diag(Λ_{d,+,h}^(1) h)² Λ_{s,+,m}^(2) m − diag(Λ_{d,+,h}^(1) h) diag(Λ_{d,+,h_0}^(1) h_0) Λ_{s,+,m_0}^(2) m_0 ),

and again take a step along this direction. Lastly, we normalize the iterates so that at each iteration i, ‖h_i‖₂ = ‖m_i‖₂. We repeat this process until convergence. 
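The alternating scheme just described can be sketched as follows. This is a minimal illustration rather than the paper's implementation: two-layer generators with arbitrary small sizes, a fixed step size η, and a helper `lam` (our name) that computes Λ_{+,z} so that G(z) = lam(weights, z) @ z:

```python
import numpy as np

rng = np.random.default_rng(2)

def lam(weights, z):
    # Lambda_{+,z}: product of ReLU-masked layer matrices, so G(z) = lam(weights, z) @ z.
    A = np.eye(z.shape[0])
    for W in weights:
        mask = (W @ (A @ z) > 0).astype(float)
        A = mask[:, None] * (W @ A)
    return A

def f(W1, W2, h0, m0, h, m):
    # Empirical risk (2): 0.5 * || G1(h0)*G2(m0) - G1(h)*G2(m) ||_2^2.
    y0 = (lam(W1, h0) @ h0) * (lam(W2, m0) @ m0)
    y = (lam(W1, h) @ h) * (lam(W2, m) @ m)
    return 0.5 * np.sum((y0 - y) ** 2)

def descent_step(W1, W2, h0, m0, h, m, eta):
    # Step 1: keep the sign pattern (+/-h, +/-m) with the least objective value.
    h, m = min(((sh * h, sm * m) for sh in (1, -1) for sm in (1, -1)),
               key=lambda p: f(W1, W2, h0, m0, *p))
    L10, L20 = lam(W1, h0), lam(W2, m0)
    # Step 2: descend along the partial-derivative directions g1 then g2.
    L1, L2 = lam(W1, h), lam(W2, m)
    g1 = L1.T @ ((L2 @ m) ** 2 * (L1 @ h) - (L2 @ m) * (L20 @ m0) * (L10 @ h0))
    h = h - eta * g1
    L1, L2 = lam(W1, h), lam(W2, m)
    g2 = L2.T @ ((L1 @ h) ** 2 * (L2 @ m) - (L1 @ h) * (L10 @ h0) * (L20 @ m0))
    m = m - eta * g2
    # Step 3: renormalize so that ||h||_2 = ||m||_2 (addresses the scaling ambiguity).
    c = np.sqrt(np.linalg.norm(h) / np.linalg.norm(m))
    return h / c, m * c

# Small expansive example with Gaussian weights (sizes chosen arbitrarily).
W1 = [rng.normal(scale=1 / np.sqrt(20), size=(20, 2)),
      rng.normal(scale=1 / np.sqrt(80), size=(80, 20))]
W2 = [rng.normal(scale=1 / np.sqrt(20), size=(20, 2)),
      rng.normal(scale=1 / np.sqrt(80), size=(80, 20))]
h0, m0 = rng.normal(size=2), rng.normal(size=2)
h, m = rng.normal(size=2), rng.normal(size=2)

f_start = f(W1, W2, h0, m0, h, m)
for _ in range(200):
    h, m = descent_step(W1, W2, h0, m0, h, m, eta=0.1)
f_end = f(W1, W2, h0, m0, h, m)
```

Note that the renormalization in step 3 does not change the objective value, since ReLU networks without bias are positively homogeneous, so G(h/c) ⊙ G(cm) = G(h) ⊙ G(m) for c > 0.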
Algorithm 1 outlines this process.\nAlgorithm 1 Alternating descent algorithm for (2)\n\ns,+,mm diag(\u21e4(1)\n\nd,+,hh diag(\u21e4(2)\n\ns,+,mm \u21e4(2)\n\nd,+,hh \u21e4(1)\n\n\u02dcg2,(h,m) := \u21e4(2)\n\n\u02dcg1,(h,m) := \u21e4(1)\n\ns,+,m0m0)\u21e4(1)\n\ns,+,mm)2\u21e4(1)\n\nd,+,hh)2\u21e4(2)\n\nh0)\u21e4(2)\n\nd,+,h0\n\nd,+,h\n\ns,+,m\n\nd,+,h0\n\nInput: Weight matrices, W (1)\nOutput: An estimate of a global minimizer of (2)\n\nand W (2)\n\ni\n\ni\n\n, observation y0 and step size \u2318> 0.\n\n1: Choose an arbitrary point (h1, m1) such that h1 6= 0 and m1 6= 0\n2: for i = 1, 2, . . . do:\n3:\n4:\n5:\n6: end for\n\n(hi, mi) arg min(f (hi, mi), f (hi, mi), f (hi,mi), f (hi,mi))\nhi+1 hi \u2318\u02dcg1,(hi,mi), mi+1 mi \u2318\u02dcg2,(hi+1,mi)\nc pkhi+1k2/kmi+1k2, hi+1 hi+1/c, mi+1 mi+1 \u00b7 c\n\n6\n\n\f1\n\n3 Proof Outline\n\nWe now present our main results which states that the objective function has a descent direction at\nevery point outside of four hyperbolic regions. In order to state these directions, we \ufb01rst note that the\npartial derivatives of f at a differentiable point (h, m) are\n\nrhf (h, m) = \u02dcg1,(h,m) and rmf (h, m) = \u02dcg2,(h,m).\n\n0\n\n0\n\nlim\n\nThe function f is not differentiable everywhere because of the behavior of the RELU activation\nfunction in the neural network. However, since G(1) and G(2) are piecewise linear, f is differentiable\nat (h, m) + w for all (h, m) and w and suf\ufb01ciently small . The directions we consider are\ng1,(h,m) 2 Rn+p and g2,(h,m) 2 Rn+p, where\n!0+ rmf ((h, m) + w) , and\ng1,(h,m) =\uf8ff lim\n\n , g2,(h,m) =\uf8ff\n\n!0+ rhf ((h, m) + w)\n\n(4)\nw is \ufb01xed. Let Dgf (h, m) be the unnormalized one-sided directional derivative of f (h, m) in the\ndirection of g: Dgf (h, m) = limt!0+\nTheorem 2. Fix \u270f> 0 such that K1(d7s2 + d2s7)\u270f1/4 < 1, d 2, and s 2. Assume the\nnetworks satisfy assumptions A2 and A3. Assume W (1)\nI ni1) for i = 1, . . . 
, d 1\nand ith row of W (1)\n` I nd1). Sim-\nd\nilarly, assume W (2)\nsatis\ufb01es\n(w(2)\n` I ps1). Let K = {(h, 0) 2 Rn\u21e5p |h 2 Rn}[\n{(0, m) 2 Rn\u21e5p |m 2 Rp} and A = AK2d3s3\u270f\nd h0,m0\u2318 [\ns h0,m0\u2318. Then on an event of probability at least\n4 ,\u21e3\u21e2(2)\nAK2d3s8\u270f\ni=1 \u02dccnieni1Ps\n1Pd\ni=1 \u02dccpiepi1\u02dcce`/(nd1 log nd1+ps1 log ps1) we have the following:\nfor (h0, m0) 6= (0, 0), and\n\nd )|\nI pi1) for i = 1, . . . , s 1 and ith row of W (2)\n\nsatis\ufb01es (w(1)\ni \u21e0N (0, 1\nkwk2\uf8ff3pps1/` with w \u21e0N (0, 1\n\nkwk2\uf8ff3pnd1/` with w \u21e0N (0, 1\n\ns h0,m0\u2318 [A K2d8s8\u270f\n\n4 ,(h0,m0) [A K2d8s3\u270f\n\ni \u21e0N (0, 1\n\ni = w| \u00b7 1\n\nf ((h,m)+tg)f (h,m)\n\ni = w| \u00b7 1\n\n4 ,\u21e3\u21e2(1)\n\n4 ,\u21e3\u21e2(1)\n\ns )|\n\n1\n\nd \u21e2(2)\n\n1\n\n1\n\nt\n\n.\n\npi\n\ns\n\nni\n\n(h, m) /2A[K\n\nd\n\nthe one-sided directional derivative of f in the direction of g = g1,(h,m) or g = g2,(h,m), de\ufb01ned in\n(4), satisfy Dgf (h, m) < 0. Additionally, elements of the set K are local maximizers. Here, \u21e2(k)\nare positive numbers that converge to 1 as d ! 1, c and 1 are constants that depend polynomially\non \u270f1 and \u02dcc, K1, and K2 are absolute constants.\nWe prove Theorem 2 by showing that neural networks with random weights satisfy two deterministic\nconditions. These conditions are the Weight Distributed Condition (WDC) and the joint Weight\nDistributed Condition (joint-WDC). The WDC is a slight generalization of the WDC introduced in\nHand and Voroninski [2017]. We say a matrix W 2 R`\u21e5n satis\ufb01es the WDC with constants \u270f> 0\nand 0 <\u21b5 \uf8ff 1 if for all nonzero x, y 2 Rk,\ni \u21b5Qx,y \uf8ff \u270f, with Qx,y =\n`Xi=1\nwhere wi 2 Rn is the ith row of W ; M \u02c6x$\u02c6y 2 Rn\u21e5n is the matrix such that \u02c6x ! \u02c6y, \u02c6y ! \u02c6x, and\nz ! 
0 for all z 2 span({x, y})?; \u02c6x = x/kxk2 and \u02c6y = y/kyk2; \u27130 = \\(x, y); and 1S is the\nindicator function on S. If wi \u21e0N (0, 1\n` I n) for all i, then an elementary calculation shows that\nii = Qx,y and if x = y then Qx,y is an isometry up to a factor\n\nof 1/2. Also, note that if W satis\ufb01es WDC with constants \u270f and \u21b5, then 1p\u21b5 W satis\ufb01es WDC with\nconstants \u270f/\u21b5 and 1.\nWe now state the joint Weight Distributed Condition. We say that B 2 R`\u21e5n and C 2 R`\u21e5p satisfy\njoint-WDC with constants \u270f> 0 and 0 <\u21b5 \uf8ff 1 if for all nonzero h, x 2 Rn and nonzero m,\ny 2 Rp,\n\n\nEhP`\n\ni=1 1wi\u00b7x>01wi\u00b7y>0 \u00b7 wiw|\n\n1wi\u00b7x>01wi\u00b7y>0 \u00b7 wiw|\n\n\u21e1 \u27130\n2\u21e1\n\nM \u02c6x$\u02c6y,\n\nsin \u27130\n\nI n +\n\n(5)\n\n2\u21e1\n\n+,hdiag (C+,mm C+,yy) B+,x \n\nB|\n\nm|Qm,yy \u00b7 Qh,x \uf8ff\n\n\u21b5\n`\n\n7\n\n\u270f\n`kmk2kyk2, and\n\n(6)\n\n\f+,mdiag (B+,hh B+,xx) C+,y \n\nC|\n\n\u21b5\n`\n\nh|Qh,xx \u00b7 Qm,y \uf8ff\n\n\u270f\n`khk2kxk2\n\n(7)\n\nWe analyze networks G(1) and G(2) where the weight matrices corresponding to the inner layers\nsatisfy the WDC with constants \u270f> 0 and 1 and for the two matrices corresponding to the outer\nlayers, we assume that one of them satis\ufb01es WDC with constants \u270f and 0 <\u21b5 1 \uf8ff 1 and the other\nsati\ufb01es WDC with constants \u270f and 0 <\u21b5 2 \uf8ff 1. We also assume that the two outer layer matrices\nsatisfy joint-WDC with constants \u270f> 0 and \u21b5 = \u21b51 \u00b7 \u21b52. We now state the main deterministic result:\nTheorem 3. Fix \u270f> 0, 0 <\u21b5 1 \uf8ff 1 and 0 <\u21b5 2 \uf8ff 1 such that K1(d7s2 + d2s7)\u270f1/4/(\u21b51\u21b52) < 1,\nd 2, and s 2. Let K = {(h, 0) 2 Rn\u21e5p |h 2 Rn}[{ (0, m) 2 Rn\u21e5p |m 2 Rp}. Suppose that\nW (1)\ni 2 Rni\u21e5ni1 for i = 1, . . . , d 1 and W (2)\ni 2 Rpi\u21e5pi1 for i = 1, . . . , s 1 satisfy the WDC\nwith constant \u270f and 1. 
Suppose $W^{(1)}_d \in \mathbb{R}^{\ell \times n_{d-1}}$ satisfies the WDC with constants $\epsilon$ and $\alpha_1$, and $W^{(2)}_s \in \mathbb{R}^{\ell \times p_{s-1}}$ satisfies the WDC with constants $\epsilon$ and $\alpha_2$. Also, suppose $\big(W^{(1)}_d, W^{(2)}_s\big)$ satisfies the joint-WDC with constants $\epsilon$ and $\alpha = \alpha_1 \cdot \alpha_2$. Let
$$\mathcal{A} = \mathcal{A}_{K_2 d^3 s^3 \epsilon^{1/4}/\alpha_1,\, (h_0, m_0)} \cup \mathcal{A}_{K_2 d^8 s^3 \epsilon^{1/4}/\alpha_1,\, (-\rho^{(1)}_d h_0,\, m_0)} \cup \mathcal{A}_{K_2 d^3 s^8 \epsilon^{1/4}/\alpha_1,\, (h_0,\, -\rho^{(2)}_s m_0)} \cup \mathcal{A}_{K_2 d^8 s^8 \epsilon^{1/4}/\alpha_1,\, (-\rho^{(1)}_d h_0,\, -\rho^{(2)}_s m_0)}.$$
Then, for $(h_0, m_0) \neq (0, 0)$ and
$$(h, m) \notin \mathcal{A} \cup \mathcal{K},$$
the one-sided directional derivative of $f$ in the direction of $g = g_{1,(h,m)}$ or $g = g_{2,(h,m)}$ satisfies $D_g f(h, m) < 0$. Additionally, elements of the set $\mathcal{K}$ are local maximizers. Here, $\rho^{(1)}_d$ and $\rho^{(2)}_s$ are positive numbers that converge to $1$ as $d \to \infty$ and $s \to \infty$, respectively, and $K_1$ and $K_2$ are absolute constants.

We prove the theorems by showing that the descent directions $g_{1,(h,m)}$ and $g_{2,(h,m)}$ concentrate around their expectations, and by then characterizing the set of points where the corresponding expectations are simultaneously zero.
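As an illustrative sanity check of the expectation calculation underlying the WDC (not part of the paper's code; the dimension $n = 4$ and the number of rows $\ell$ are arbitrary choices), the following sketch estimates $\sum_{i=1}^{\ell} 1_{w_i \cdot x > 0} 1_{w_i \cdot y > 0}\, w_i w_i^\top$ by Monte Carlo and compares it against the closed form $Q_{x,y}$ from (5):

```python
import numpy as np

rng = np.random.default_rng(0)

n, ell = 4, 100_000          # ambient dimension and number of rows (our choice)
x = rng.standard_normal(n)   # two arbitrary nonzero directions
y = rng.standard_normal(n)

xh, yh = x / np.linalg.norm(x), y / np.linalg.norm(y)
theta = np.arccos(np.clip(xh @ yh, -1.0, 1.0))

# Build M_{x-hat <-> y-hat}: it swaps x-hat and y-hat and annihilates
# span({x, y})-perp.  In the orthonormal basis (u1, u2) of span({x, y})
# it is the reflection [[cos t, sin t], [sin t, -cos t]].
u1 = xh
u2 = (yh - np.cos(theta) * xh) / np.sin(theta)
U = np.stack([u1, u2], axis=1)
M = U @ np.array([[np.cos(theta),  np.sin(theta)],
                  [np.sin(theta), -np.cos(theta)]]) @ U.T

# Closed-form expectation Q_{x,y} appearing in the WDC.
Q = (np.pi - theta) / (2 * np.pi) * np.eye(n) + np.sin(theta) / (2 * np.pi) * M

# Monte Carlo estimate: rows w_i ~ N(0, I_n); dividing the truncated Gram
# matrix by ell matches the N(0, I_n / ell) row scaling used in the text.
W = rng.standard_normal((ell, n))
mask = (W @ x > 0) & (W @ y > 0)
emp = (W[mask].T @ W[mask]) / ell

print(np.max(np.abs(emp - Q)))  # small; on the order of 1/sqrt(ell)
```

The deviation shrinks like $1/\sqrt{\ell}$, which is the concentration phenomenon that the WDC packages deterministically.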
The outline of the proof is:

• The WDC and joint-WDC imply that the one-sided partial directional derivatives of $f$ concentrate uniformly, for all nonzero $h, h_0 \in \mathbb{R}^n$ and $m, m_0 \in \mathbb{R}^p$, around continuous vectors $t^{(1)}_{(h,m),(h_0,m_0)}$ and $t^{(2)}_{(h,m),(h_0,m_0)}$, respectively, defined in equations (10) and (11) in the Appendix.

• Direct analysis shows that $t^{(1)}_{(h,m),(h_0,m_0)}$ and $t^{(2)}_{(h,m),(h_0,m_0)}$ are simultaneously approximately zero only around the four hyperbolic sets $\mathcal{A}_{\epsilon,(h_0,m_0)}$, $\mathcal{A}_{\epsilon,(-\rho^{(1)}_d h_0,\, m_0)}$, $\mathcal{A}_{\epsilon,(h_0,\, -\rho^{(2)}_s m_0)}$, and $\mathcal{A}_{\epsilon,(-\rho^{(1)}_d h_0,\, -\rho^{(2)}_s m_0)}$, where $\epsilon$ depends on the expansivity and the number of layers of the networks, and both $\rho^{(1)}_d$ and $\rho^{(2)}_s$ are positive constants close to $1$ that depend on the number of layers in the two neural networks.

• Using sphere-covering arguments, Gaussian and truncated Gaussian matrices with appropriate dimensions satisfy the WDC and joint-WDC conditions with high probability.

The full proof of Theorem 3 is provided in the Appendix.

4 Numerical Experiment

We now empirically show that Algorithm 1 can remove distortions present in a dataset. We consider the image recovery task of removing distortions that were synthetically introduced to the MNIST dataset. The distortion dataset contains 8100 images of size $28 \times 28$, where each distortion is generated using a 2D Gaussian function, $g(x, y) = e^{-\left((x - c)^2 + (y - c)^2\right)/\sigma^2}$, where $c$ is the center and $\sigma$ controls its tail behavior. For each of the 8100 images, we fix $c$ and $\sigma$, which vary uniformly in the intervals $[-3, 3]$ and $[20, 35]$, respectively, and $x$ and $y$ lie in the interval $[-5, 5]$. Prior to training the generators, the images in the MNIST dataset and the distortion dataset were resized to $64 \times 64$ images. We used DCGAN [Radford et al., 2016] with a learning rate of 0.0002 and a latent code dimension of 50 to train a generator, $\mathcal{G}^{(2)}$, for the distortion images.
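As an aside, the synthetic distortion fields just described can be sketched in a few lines. This is our own illustration, not the authors' code; in particular, the grid layout and the $\sigma^2$ normalization in the exponent are our assumptions about the stated 2D Gaussian:

```python
import numpy as np

def gaussian_distortion(c, sigma, size=28):
    # One distortion field g(x, y) = exp(-((x - c)^2 + (y - c)^2) / sigma^2),
    # sampled on a [-5, 5] x [-5, 5] grid.  The sigma^2 normalization is an
    # assumption; the paper only states the Gaussian form and the ranges.
    coords = np.linspace(-5.0, 5.0, size)
    X, Y = np.meshgrid(coords, coords)
    return np.exp(-((X - c) ** 2 + (Y - c) ** 2) / sigma ** 2)

rng = np.random.default_rng(0)
centers = rng.uniform(-3.0, 3.0, size=8100)   # c ~ Uniform[-3, 3]
widths = rng.uniform(20.0, 35.0, size=8100)   # sigma ~ Uniform[20, 35]
dataset = np.stack([gaussian_distortion(c, s) for c, s in zip(centers, widths)])
print(dataset.shape)  # (8100, 28, 28)
```

With $\sigma \ge 20$ each field stays close to $1$, i.e., the distortions are mild multiplicative perturbations; each field multiplies an MNIST image entrywise to form $y_0 = w_0 \odot x_0$.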
Similarly, we used DCGAN with a learning rate of 0.0002 and a latent code dimension of 100 to train a generator, $\mathcal{G}^{(1)}$, for the MNIST images. Finally, a distorted image $y_0$ is generated via the pixelwise multiplication of an image $w_0$ from the MNIST dataset and an image $x_0$ from the distortion dataset, i.e., $y_0 = w_0 \odot x_0$.

Figure 2: The figure shows the result of removing the distortion in an image by solving (2) using Algorithm 1. The top row corresponds to the input distorted images. The second and third rows correspond to the images recovered using empirical risk minimization.

Figure 2 shows the result of using Algorithm 1 to remove the distortion from $y_0$. In the implementation of Algorithm 1, $\tilde{g}_{1,(h_i,m_i)}$ and $\tilde{g}_{2,(h_i,m_i)}$ correspond to the partial derivatives of $f$ with the generators $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$. We used the Stochastic Gradient Descent algorithm with the step size set to 1 and momentum set to 0.9. For each image in the first row of Figure 2, the corresponding images in the second and third rows are the output of Algorithm 1 after 500 iterations.

References

Ali Ahmed, Benjamin Recht, and Justin Romberg. Blind deconvolution using convex programming. IEEE Trans. Inform. Theory, 60(3):1711–1732, 2014.

Thomas G. Stockham, Thomas M. Cannon, and Robert B. Ingebretsen. Blind deconvolution through digital signal processing. Proceedings of the IEEE, 63(4):678–692, 1975.

Deepa Kundur and Dimitrios Hatzinakos. Blind image deconvolution. IEEE Signal Processing Magazine, 13(3):43–64, 1996.

Alireza Aghasi, Barmak Heshmat, Albert Redo-Sanchez, Justin Romberg, and Ramesh Raskar. Sweep distortion removal from terahertz images via blind demodulation. Optica, 3(7):754–762, 2016.

Alireza Aghasi, Ali Ahmed, and Paul Hand. BranchHull: Convex bilinear inversion from the entrywise product of signals with known signs. Applied and Computational Harmonic Analysis, 2019.
doi: https://doi.org/10.1016/j.acha.2019.03.002.

James R. Fienup. Phase retrieval algorithms: a comparison. Applied Optics, 21(15):2758–2769, 1982.

E. Candès and X. Li. Solving quadratic equations via PhaseLift when there are about as many equations as unknowns. Found. Comput. Math., pages 1–10, 2012.

E. Candès, T. Strohmer, and V. Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Commun. Pure Appl. Math., 66(8):1241–1274, 2013.

Ivana Tosic and Pascal Frossard. Dictionary learning. IEEE Signal Processing Magazine, 28(2):27–38, 2011.

Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5(Nov):1457–1469, 2004.

Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.

Shuyang Ling and Thomas Strohmer. Self-calibration and biconvex compressive sensing. Inverse Problems, 31(11):115002, 2015.

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017. URL http://arxiv.org/abs/1710.10196.

Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016. URL http://arxiv.org/abs/1609.03499.

Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and locally consistent image completion. ACM Trans. Graph., 36(4):107:1–107:14, July 2017. doi: 10.1145/3072959.3073659. URL http://doi.acm.org/10.1145/3072959.3073659.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks.
CoRR, abs/1703.10593, 2017. URL http://arxiv.org/abs/1703.10593.

Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=S1RP6GLle.

Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G. Dimakis. Compressed sensing using generative models. 2017. URL https://arxiv.org/abs/1703.03208.

S. Lohit, K. Kulkarni, R. Kerviche, P. Turaga, and A. Ashok. Convolutional neural networks for noniterative reconstruction of compressively sensed images. IEEE Transactions on Computational Imaging, 4(3):326–340, Sep. 2018. doi: 10.1109/TCI.2018.2846413.

Muhammad Asim, Fahad Shamshad, and Ali Ahmed. Solving bilinear inverse problems using deep generative priors. CoRR, abs/1802.04073, 2018. URL http://arxiv.org/abs/1802.04073.

Fahad Shamshad, Farwa Abbas, and Ali Ahmed. Deep Ptych: Subsampled Fourier ptychography using generative priors. CoRR, abs/1812.11065, 2018.

Kiryung Lee, Yihong Wu, and Yoram Bresler. Near optimal compressed sensing of a class of sparse low-rank matrices via sparse power factorization. arXiv preprint arXiv:1702.04342, 2017.

Xiaodong Li and Vladislav Voroninski. Sparse signal recovery from quadratic measurements via convex programming. SIAM Journal on Mathematical Analysis, 45(5):3019–3033, 2013.

Sohail Bahmani and Justin Romberg. Phase retrieval meets statistical learning theory: A flexible convex relaxation. arXiv preprint arXiv:1610.04210, 2016.

Tom Goldstein and Christoph Studer. PhaseMax: Convex phase retrieval via basis pursuit. arXiv preprint arXiv:1610.07531, 2016.

Paul Hand and Vladislav Voroninski. Compressed sensing from phaseless Gaussian measurements via linear programming in the natural parameter space.
CoRR, abs/1611.05985, 2016. URL http://arxiv.org/abs/1611.05985.

Alireza Aghasi, Ali Ahmed, Paul Hand, and Babhru Joshi. A convex program for bilinear inversion of sparse vectors. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8548–8558. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8074-a-convex-program-for-bilinear-inversion-of-sparse-vectors.pdf.

Emmanuel Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Trans. Inform. Theory, 61(4):1985–2007, 2015.

Gang Wang, Georgios B. Giannakis, and Yonina C. Eldar. Solving systems of random quadratic equations via truncated amplitude flow. arXiv preprint arXiv:1605.08285, 2016.

Xiaodong Li, Shuyang Ling, Thomas Strohmer, and Ke Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. arXiv preprint arXiv:1606.04933, 2016.

Paul Hand and Vladislav Voroninski. Global guarantees for enforcing deep generative priors by empirical risk. CoRR, abs/1705.07576, 2017. URL http://arxiv.org/abs/1705.07576.

Paul Hand, Oscar Leong, and Vladislav Voroninski. Phase retrieval under a generative prior. CoRR, abs/1807.04261, 2018. URL http://arxiv.org/abs/1807.04261.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

R. Vershynin. Compressed sensing: theory and applications. Cambridge University Press, 2012.

Halyun Jeong, Xiaowei Li, Yaniv Plan, and Ozgur Yilmaz. Non-Gaussian random matrices on sets: Optimal tail dependence and applications.
In 13th International Conference on Sampling Theory and Applications, 2019.