{"title": "Optimal Linear Estimation under Unknown Nonlinear Transform", "book": "Advances in Neural Information Processing Systems", "page_first": 1549, "page_last": 1557, "abstract": "Linear regression studies the problem of estimating a model parameter $\\beta^* \\in \\mathbb{R}^p$ from $n$ observations $\\{(y_i,x_i)\\}_{i=1}^n$ of the linear model $y_i = \\langle x_i,\\beta^* \\rangle + \\epsilon_i$. We consider a significant generalization in which the relationship between $\\langle x_i,\\beta^* \\rangle$ and $y_i$ is noisy, quantized to a single bit, potentially nonlinear, noninvertible, as well as unknown. This model is known as the single-index model in statistics, and, among other things, it represents a significant generalization of one-bit compressed sensing. We propose a novel spectral-based estimation procedure and show that we can recover $\\beta^*$ in settings (i.e., classes of link function $f$) where previous algorithms fail. In general, our algorithm requires only very mild restrictions on the (unknown) functional relationship between $y_i$ and $\\langle x_i,\\beta^* \\rangle$. We also consider the high dimensional setting where $\\beta^*$ is sparse, and introduce a two-stage nonconvex framework that addresses estimation challenges in high dimensional regimes where $p \\gg n$. 
For a broad class of link functions between $\\langle x_i,\\beta^* \\rangle$ and $y_i$, we establish minimax lower bounds that demonstrate the optimality of our estimators in both the classical and high dimensional regimes.", "full_text": "Optimal Linear Estimation under Unknown Nonlinear Transform\n\nXinyang Yi\nThe University of Texas at Austin\nyixy@utexas.edu\n\nZhaoran Wang\nPrinceton University\nzhaoran@princeton.edu\n\nConstantine Caramanis\nThe University of Texas at Austin\nconstantine@utexas.edu\n\nHan Liu\nPrinceton University\nhanliu@princeton.edu\n\nAbstract\n\nLinear regression studies the problem of estimating a model parameter β∗ ∈ R^p from n observations {(yi, xi)}_{i=1}^n of the linear model yi = ⟨xi, β∗⟩ + εi. We consider a significant generalization in which the relationship between ⟨xi, β∗⟩ and yi is noisy, quantized to a single bit, potentially nonlinear, noninvertible, as well as unknown. This model is known as the single-index model in statistics, and, among other things, it represents a significant generalization of one-bit compressed sensing. We propose a novel spectral-based estimation procedure and show that we can recover β∗ in settings (i.e., classes of link function f) where previous algorithms fail. In general, our algorithm requires only very mild restrictions on the (unknown) functional relationship between yi and ⟨xi, β∗⟩. We also consider the high dimensional setting where β∗ is sparse, and introduce a two-stage nonconvex framework that addresses estimation challenges in high dimensional regimes where p ≫ n. 
For a broad class of link functions between ⟨xi, β∗⟩ and yi, we establish minimax lower bounds that demonstrate the optimality of our estimators in both the classical and high dimensional regimes.\n\n1 Introduction\n\nWe consider a generalization of the one-bit quantized regression problem, where we seek to recover the regression coefficient β∗ ∈ R^p from one-bit measurements. Specifically, suppose that X is a random vector in R^p and Y is a binary random variable taking values in {−1, 1}. We assume the conditional distribution of Y given X takes the form\n\nP(Y = 1 | X = x) = (1/2) f(⟨x, β∗⟩) + 1/2,   (1.1)\n\nwhere f : R → [−1, 1] is called the link function. We aim to estimate β∗ from n i.i.d. observations {(yi, xi)}_{i=1}^n of the pair (Y, X). In particular, we assume the link function f is unknown. Without any loss of generality, we take β∗ to be on the unit sphere S^{p−1} since its magnitude can always be incorporated into the link function f.\n\nThe model in (1.1) is simple but general. Under specific choices of the link function f, (1.1) immediately leads to many practical models in machine learning and signal processing, including logistic regression and one-bit compressed sensing. In the settings where the link function is assumed to be known, a popular estimation procedure is to calculate an estimator that minimizes a certain loss function. However, for particular link functions, this approach involves minimizing a nonconvex objective function for which the global minimizer is in general intractable to obtain. Furthermore, it is difficult or even impossible to know the link function in practice, and a poor choice of link function may result in inaccurate parameter estimation and high prediction error. We take a more general approach, and in particular, target the setting where f is unknown. 
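As a concrete illustration of the observation model (1.1), here is a minimal sketch of drawing i.i.d. pairs from it (the helper name sample_model is ours, not the paper's):

```python
import numpy as np

def sample_model(beta_star, n, f, seed=None):
    """Draw n i.i.d. pairs (y_i, x_i): x_i ~ N(0, I_p) and
    P(y_i = 1 | x_i) = f(<x_i, beta*>)/2 + 1/2, as in (1.1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, beta_star.shape[0]))
    prob_plus = 0.5 * f(x @ beta_star) + 0.5   # probability that y_i = +1
    y = np.where(rng.random(n) < prob_plus, 1.0, -1.0)
    return y, x

# Example: f(z) = sign(z) makes y deterministic given x (one-bit measurements).
beta = np.array([3.0, 4.0]) / 5.0              # unit norm, as assumed in the paper
y, x = sample_model(beta, 1000, np.sign, seed=0)
assert np.allclose(y, np.sign(x @ beta))
```

Any link f : R → [−1, 1] can be plugged in the same way, including the noisy and noninvertible examples of §2.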
We propose an algorithm that can estimate the parameter β∗ in the absence of prior knowledge on the link function f. As our results make precise, our algorithm succeeds as long as the function f satisfies a single moment condition. As we demonstrate, this moment condition is only a mild restriction on f. In particular, our methods and theory are widely applicable even to settings where f is non-smooth, e.g., f(z) = sign(z), or noninvertible, e.g., f(z) = sin(z). In particular, as we show in §2, our restrictions on f are sufficiently flexible so that our results provide a unified framework that encompasses a broad range of problems, including logistic regression, one-bit compressed sensing, one-bit phase retrieval, as well as their robust extensions. We use these important examples to illustrate our results, and discuss them at several points throughout the paper.\n\nMain contributions. The key conceptual contribution of this work is a novel use of the method of moments. Rather than considering moments of the covariate, X, and the response variable, Y, we look at moments of differences of covariates and differences of response variables. This simple yet critical observation enables everything that follows and leads to our spectral-based procedure.\n\nWe also make two theoretical contributions. First, we simultaneously establish the statistical and computational rates of convergence of the proposed spectral algorithm. We consider both the low dimensional setting where the number of samples exceeds the dimension and the high dimensional setting where the dimensionality may (greatly) exceed the number of samples. In both settings, our proposed algorithm achieves the same statistical rate of convergence as that of linear regression applied to data generated by the linear model without quantization. 
Second, we provide minimax lower bounds for the statistical rate of convergence, and thereby establish the optimality of our procedure within a broad model class. In the low dimensional setting, our results obtain the optimal rate with the optimal sample complexity. In the high dimensional setting, our algorithm requires estimating a sparse eigenvector, and thus our sample complexity coincides with what is believed to be the best achievable via polynomial time methods [2]; the error rate itself, however, is information-theoretically optimal. We discuss this further in §3.4.\n\nRelated works. Our model in (1.1) is close to the single-index model (SIM) in statistics. In the SIM, we assume that the response-covariate pair (Y, X) is determined by\n\nY = f(⟨X, β∗⟩) + W   (1.2)\n\nwith unknown link function f and noise W. Our setting is a special case of this, as we restrict Y to be a binary random variable. The single-index model is a classical topic, and there is an extensive literature – too much to review exhaustively. We therefore outline the work most relevant to our setting and our results. For estimating β∗ in (1.2), a feasible approach is M-estimation [8, 9, 12], in which the unknown link function f is jointly estimated using nonparametric estimators. Although these M-estimators have been shown to be consistent, they are not computationally efficient since they involve solving a nonconvex optimization problem. Another approach to estimating β∗ is the average derivative estimator (ADE; [24]). Further improvements of ADE are considered in [13, 22]. ADE and its related methods require that the link function f is at least differentiable, and thus exclude important models such as one-bit compressed sensing with f(z) = sign(z). 
Beyond estimating β∗, the works in [15, 16] focus on iteratively estimating a function f and vector β that are good for prediction, and they attempt to control the generalization error. Their algorithms are based on isotonic regression, and are therefore only applicable when the link function is monotonic and satisfies Lipschitz constraints. The work discussed above focuses on the low dimensional setting where p ≪ n. Another related line of work is sufficient dimension reduction, where the goal is to find a subspace U of the input space such that the response Y only depends on the projection U^⊤X. The single-index model and our problem can be regarded as special cases of this problem, as we are primarily interested in recovering a one-dimensional subspace. Due to space limits, we refer readers to the long version of this paper for a detailed survey [29].\n\nIn the high dimensional regime where p ≫ n and β∗ has some structure (for us this means sparsity), we note there is some recent progress [1] on estimating f via PAC-Bayesian methods. In the special case when f is a linear function, sparse linear regression has attracted extensive study over the years. The recent work by Plan et al. [21] is closest to our setting. They consider the setting of normal covariates, X ∼ N(0, Ip), and they propose a marginal regression estimator for estimating β∗ that, like our approach, requires no prior knowledge about f. Their proposed algorithm relies on the assumption that E_{z∼N(0,1)}[z f(z)] ≠ 0, and hence cannot work for link functions that are even. As we will describe below, our algorithm is based on a novel moment-based estimator, and avoids requiring such a condition, thus allowing us to handle even link functions under a very mild moment restriction, which we describe in detail below. Generally, the work in [21] requires different conditions, and thus beyond the discussion above, is not directly comparable to the work here. In cases where both approaches apply, the results are minimax optimal.\n\n2 Example models\n\nIn this section, we discuss several popular (and important) models in machine learning and signal processing that fall into our general model (1.1) under specific link functions. Variants of these models have been studied extensively in the recent literature. These examples trace through the paper, and we use them to illustrate the details of our algorithms and results.\n\nLogistic regression. In logistic regression (LR), we assume that P(Y = 1 | X = x) = 1/(1 + exp(−⟨x, β∗⟩ − ζ)), where ζ is the intercept. The link function corresponds to f(z) = (exp(z + ζ) − 1)/(exp(z + ζ) + 1). One robust variant of LR is called flipped logistic regression, where we assume that the labels Y generated from the standard LR model are flipped with probability pe, i.e., P(Y = 1 | X = x) = (1 − pe)/(1 + exp(−⟨x, β∗⟩ − ζ)) + pe/(1 + exp(⟨x, β∗⟩ + ζ)). This reduces to the standard LR model when pe = 0. For flipped LR, the link function f can be written as\n\nf(z) = (exp(z + ζ) − 1)/(exp(z + ζ) + 1) + 2pe · (1 − exp(z + ζ))/(1 + exp(z + ζ)).   (2.1)\n\nFlipped LR has been studied by [19, 25]. In both papers, estimating β∗ is based on minimizing a surrogate loss function involving a certain tuning parameter connected to pe. However, pe is unknown in practice. In contrast to their approaches, our method does not hinge on the unknown parameter pe. Our approach has the same formulation for both standard and flipped LR, and thus unifies the two models.\n\nOne-bit compressed sensing. One-bit compressed sensing (CS) aims at recovering sparse signals from quantized linear measurements (see e.g., [11, 20]). 
In detail, we define B0(s, p) := {β ∈ R^p : |supp(β)| ≤ s} as the set of sparse vectors in R^p with at most s nonzero elements. We assume (Y, X) ∈ {−1, 1} × R^p satisfies\n\nY = sign(⟨X, β∗⟩),   (2.2)\n\nwhere β∗ ∈ B0(s, p). In this paper, we also consider its robust version with noise ε, i.e., Y = sign(⟨X, β∗⟩ + ε). Assuming ε ∼ N(0, σ^2), the link function f of robust one-bit CS thus corresponds to\n\nf(z) = (2/(√(2π)·σ)) ∫_0^∞ e^{−(u−z)^2/(2σ^2)} du − 1.   (2.3)\n\nNote that (2.2) also corresponds to the probit regression model without the sparsity constraint on β∗. Throughout the paper, we do not distinguish between the two model names. Model (2.2) is referred to as one-bit compressed sensing even in the case where β∗ is not sparse.\n\nOne-bit phase retrieval. The goal of phase retrieval (e.g., [5]) is to recover signals based on linear measurements with phase information erased, i.e., the pair (Y, X) ∈ R × R^p is determined by the equation Y = |⟨X, β∗⟩|. Analogous to one-bit compressed sensing, we consider a new model named one-bit phase retrieval, where the linear measurement with phase information erased is quantized to one bit. In detail, the pair (Y, X) ∈ {−1, 1} × R^p is linked through Y = sign(|⟨X, β∗⟩| − θ), where θ is the quantization threshold. 
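For concreteness, the three example link functions can be written out as follows (a sketch; function and parameter names are ours). The integral in the robust one-bit CS link equals P(N(z, σ²) > 0) = Φ(z/σ), so that link simplifies to 2Φ(z/σ) − 1 = erf(z/(σ√2)):

```python
import numpy as np
from math import erf, sqrt

def flipped_lr_link(z, p_e=0.1, zeta=0.0):
    """(2.1): logistic link shrunk by the label-flip probability p_e;
    algebraically equal to (1 - 2*p_e) * tanh((z + zeta)/2)."""
    return (1.0 - 2.0 * p_e) * np.tanh((z + zeta) / 2.0)

def robust_one_bit_cs_link(z, sigma=1.0):
    """(2.3): f(z) = 2/(sqrt(2 pi) sigma) * int_0^inf e^{-(u-z)^2/(2 sigma^2)} du - 1,
    which equals 2*Phi(z/sigma) - 1 = erf(z/(sigma*sqrt(2)))."""
    return erf(z / (sigma * sqrt(2.0)))

def one_bit_pr_link(z, theta=1.0):
    """One-bit phase retrieval: f(z) = sign(|z| - theta); even and non-monotonic."""
    return np.sign(np.abs(z) - theta)

assert abs(flipped_lr_link(2.0, p_e=0.0) - np.tanh(1.0)) < 1e-12   # p_e = 0: standard LR
assert abs(robust_one_bit_cs_link(0.5, sigma=1e-8) - 1.0) < 1e-9   # sigma -> 0: sign(z)
assert one_bit_pr_link(-2.0) == one_bit_pr_link(2.0) == 1.0        # even link
```

Note that the one-bit phase retrieval link is the only one of the three that is neither odd nor monotonic, which is exactly the case that motivates the moment condition of §3.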
Compared with one-bit compressed sensing, this problem is more difficult because Y depends on β∗ only through the magnitude of ⟨X, β∗⟩ instead of the value of ⟨X, β∗⟩. It is also more difficult than the original phase retrieval problem due to the additional quantization. In our general model, the link function thus corresponds to\n\nf(z) = sign(|z| − θ).   (2.4)\n\nIt is worth noting that, unlike the previous models, here f is neither odd nor monotonic.\n\n3 Main results\n\nWe now turn to our algorithms for estimating β∗ in both low and high dimensional settings. We first introduce a second moment estimator based on pairwise differences. We prove that the eigenstructure of the constructed second moment estimator encodes the information of β∗. We then propose algorithms to estimate β∗ based upon this second moment estimator. In the high dimensional setting where β∗ is sparse, computing the top eigenvector of our pairwise-difference matrix reduces to computing a sparse eigenvector. Beyond algorithms, we discuss the minimax lower bound in §3.5. We present simulation results in §3.6.\n\n3.1 Conditions for success\n\nWe now introduce several key quantities, which allow us to state precisely the conditions required for the success of our algorithm.\n\nDefinition 3.1. For any (unknown) link function f, define the quantity φ(f) as follows:\n\nφ(f) := μ1^2 − μ0μ2 + μ0^2,   (3.1)\n\nwhere μ0, μ1 and μ2 are given by\n\nμk := E[f(Z) Z^k],   k = 0, 1, 2, . . . ,   (3.2)\n\nwhere Z ∼ N(0, 1).\n\nAs we discuss in detail below, the key condition for the success of our algorithm is φ(f) ≠ 0. As we show below, this is a relatively mild condition, and in particular, it is satisfied by the three examples introduced in §2. 
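As a quick numerical sanity check of Definition 3.1 (a sketch; the Monte Carlo helper phi below is ours and is not part of the paper's estimation procedure):

```python
import numpy as np

def phi(f, n_mc=1_000_000, seed=0):
    """Monte Carlo estimate of phi(f) = mu_1^2 - mu_0*mu_2 + mu_0^2,
    with mu_k = E[f(Z) Z^k] and Z ~ N(0, 1) (Definition 3.1)."""
    z = np.random.default_rng(seed).standard_normal(n_mc)
    fz = f(z)
    mu0, mu1, mu2 = fz.mean(), (fz * z).mean(), (fz * z**2).mean()
    return mu1**2 - mu0 * mu2 + mu0**2

# f(z) = sign(z): mu0 = mu2 = 0 by symmetry and mu1 = E|Z| = sqrt(2/pi),
# so phi(f) = 2/pi, which is nonzero and the condition holds.
assert abs(phi(np.sign) - 2.0 / np.pi) < 0.01
# An even link, f(z) = sign(|z| - 1): mu1 = 0, yet phi(f) = mu0*(mu0 - mu2) != 0.
assert abs(phi(lambda z: np.sign(np.abs(z) - 1.0))) > 0.01
```

The second check illustrates the point made below: for even links the condition reduces to μ0 ≠ μ2, which marginal-regression-style conditions of the form E[z f(z)] ≠ 0 cannot capture.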
For odd and monotonic f, φ(f) > 0 unless f(z) = 0 for all z, in which case no algorithm is able to recover β∗. For even f, we have μ1 = 0; thus φ(f) ≠ 0 if and only if μ0 ≠ μ2.\n\n3.2 Second moment estimator\n\nWe describe a novel moment estimator that enables our algorithm. Let {(yi, xi)}_{i=1}^n be the n i.i.d. observations of (Y, X). Assuming without loss of generality that n is even, we consider the following key transformation\n\nΔyi := y_{2i} − y_{2i−1},  Δxi := x_{2i} − x_{2i−1},   (3.3)\n\nfor i = 1, 2, ..., n/2. Our procedure is based on the following second moment\n\nM := (2/n) Σ_{i=1}^{n/2} Δyi^2 Δxi Δxi^⊤ ∈ R^{p×p}.   (3.4)\n\nThe intuition behind this second moment is as follows. By (1.1), the variation of X along the direction β∗ has the largest impact on the variation of ⟨X, β∗⟩. Thus, the variation of Y directly depends on the variation of X along β∗. Consequently, {(Δyi, Δxi)}_{i=1}^{n/2} encodes the information of such a dependency relationship. In the following, we make this intuition more rigorous by analyzing the eigenstructure of E(M) and its relationship with β∗.\n\nLemma 3.2. For β∗ ∈ S^{p−1}, we assume that (Y, X) ∈ {−1, 1} × R^p satisfies (1.1). 
For X ∼ N(0, Ip), we have\n\nE(M) = 4φ(f) · β∗β∗^⊤ + 4(1 − μ0^2) · Ip,   (3.5)\n\nwhere μ0 and φ(f) are defined in (3.2) and (3.1).\n\nLemma 3.2 proves that β∗ is the leading eigenvector of E(M) as long as the eigengap φ(f) is positive. If instead we have φ(f) < 0, we can use a related moment estimator which has analogous properties. To this end, define M′ := (2/n) Σ_{i=1}^{n/2} (y_{2i} + y_{2i−1})^2 Δxi Δxi^⊤. In parallel to Lemma 3.2, we have a similar result for M′, as stated below.\n\nCorollary 3.3. Under the setting of Lemma 3.2,\n\nE(M′) = −4φ(f) · β∗β∗^⊤ + 4(1 + μ0^2) · Ip.\n\nCorollary 3.3 therefore shows that when φ(f) < 0, we can construct another second moment estimator M′ such that β∗ is the leading eigenvector of E(M′). As discussed above, this is precisely the setting for one-bit phase retrieval when the quantization threshold satisfies θ < θm. For simplicity of the discussion, hereafter we assume that φ(f) > 0 and focus on the second moment estimator M defined in (3.4).\n\nA natural question to ask is whether φ(f) ≠ 0 holds for specific models. The following lemma demonstrates exactly this, for the example models introduced in §2.\n\nLemma 3.4. (a) Consider flipped logistic regression, where f is given in (2.1). Setting the intercept to ζ = 0, we have φ(f) ≳ (1 − 2pe)^2. (b) For robust one-bit compressed sensing, where f is given in (2.3), we have φ(f) ≳ min{((1 − σ^2)/(1 + σ^2))^2, C′σ^4/(1 + σ^3)^2}. (c) For one-bit phase retrieval, where f is given in (2.4): for Z ∼ N(0, 1), we let θm be the median of |Z|, i.e., P(|Z| ≥ θm) = 1/2. We have |φ(f)| ≳ θ|θ − θm| exp(−θ^2) and sign[φ(f)] = sign(θ − θm). We thus obtain φ(f) > 0 for θ > θm.\n\n3.3 Low dimensional recovery\n\nWe consider estimating β∗ in the classical (low dimensional) setting where p ≪ n. Based on the second moment estimator M defined in (3.4), estimating β∗ amounts to solving a noisy eigenvalue problem. We solve this by a simple iterative algorithm: provided an initial vector β0 ∈ S^{p−1} (which may be chosen at random), we perform power iterations as shown in Algorithm 1.\n\nTheorem 3.5. We assume X ∼ N(0, Ip) and (Y, X) follows (1.1). Let {(yi, xi)}_{i=1}^n be n i.i.d. samples of the response-input pair (Y, X). Consider any link function f in (1.1) with μ0 and φ(f) defined in (3.2) and (3.1), and with φ(f) > 0 (recall that we have an analogous treatment, and thus analogous results, for φ(f) < 0). We let\n\nγ := [(1 − μ0^2)/(φ(f) + 1 − μ0^2) + 1] / 2  and  ξ := [γφ(f) + (γ − 1)(1 − μ0^2)] / [(1 + γ)(φ(f) + 1 − μ0^2)].   (3.6)\n\nThere exist constants Ci such that when n ≥ C1 p/ξ^2, for Algorithm 1 we have, with probability at least 1 − 2 exp(−C2 p),\n\n‖βt − β∗‖2 ≤ C3 · [(φ(f) + 1 − μ0^2)/φ(f)] · √(p/n)  (statistical error)  +  √((1 − α^2)/α^2) · γ^t  (optimization error),  for t = 1, . . . , Tmax.   (3.7)\n\nHere α = ⟨β0, β̂⟩, where β̂ is the first leading eigenvector of M.\n\nNote that by (3.6) we have γ ∈ (0, 1). Thus, the optimization error term in (3.7) decreases at a geometric rate to zero as t increases. For Tmax sufficiently large such that the statistical error and optimization error terms in (3.7) are of the same order, we have ‖βTmax − β∗‖2 ≲ √(p/n).\n\nThis statistical rate of convergence matches the rate of estimating a p-dimensional vector in linear regression without any quantization, and will later be shown to be optimal. This result shows that the lack of prior knowledge on the link function and the information loss from quantization do not keep our procedure from obtaining the optimal statistical rate.\n\n3.4 High dimensional recovery\n\nNext we consider the high dimensional setting where p ≫ n and β∗ is sparse, i.e., β∗ ∈ S^{p−1} ∩ B0(s, p) with s being the support size. Although this high dimensional estimation problem is closely related to the well-studied sparse PCA problem, the existing works [4, 6, 17, 23, 27, 28, 31, 32] on sparse PCA do not provide a direct solution to our problem. In particular, they either lack statistical guarantees on the convergence rate of the obtained estimator [6, 23, 28] or rely on the properties of the sample covariance matrix of Gaussian data [4, 17], which are violated by the second moment estimator defined in (3.4). For the sample covariance matrix of sub-Gaussian data, [27] prove that the convex relaxation proposed by [7] achieves a suboptimal s√(log p/n) rate of convergence. 
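The second moment estimator (3.4) and the power iterations of Algorithm 1 (described in §3.3 above) can be sketched as follows; this is a minimal illustration under the Gaussian design of Theorem 3.5, with helper names of our choosing rather than the authors' reference implementation:

```python
import numpy as np

def second_moment(y, x):
    """M = (2/n) * sum_i dy_i^2 * dx_i dx_i^T over the n/2 disjoint pairs, per (3.3)-(3.4)."""
    dy = y[1::2] - y[0::2]                 # Delta y_i = y_{2i} - y_{2i-1}
    dx = x[1::2] - x[0::2]                 # Delta x_i = x_{2i} - x_{2i-1}
    return (2.0 / len(y)) * (dx * dy[:, None] ** 2).T @ dx

def power_iteration(M, t_max=100, seed=0):
    """Algorithm 1: beta_t = M beta_{t-1} / ||M beta_{t-1}||_2, random unit start."""
    beta = np.random.default_rng(seed).standard_normal(M.shape[0])
    beta /= np.linalg.norm(beta)
    for _ in range(t_max):
        beta = M @ beta
        beta /= np.linalg.norm(beta)
    return beta

# One-bit measurements y = sign(<x, beta*>) with beta* = e_1:
rng = np.random.default_rng(1)
p, n = 5, 40000
beta_star = np.eye(p)[0]
x = rng.standard_normal((n, p))
y = np.sign(x @ beta_star)
est = power_iteration(second_moment(y, x))
err = min(np.linalg.norm(est - beta_star), np.linalg.norm(est + beta_star))
assert err < 0.2                           # recovers beta*, up to global sign
```

Note the global sign ambiguity: the leading eigenvector of E(M) is β∗ only up to sign, hence the min over ±β∗ in the error check.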
Yuan and Zhang [31] propose the truncated power method, and show that it attains the optimal √(s log p/n) rate locally; that is, it exhibits this rate of convergence only in a neighborhood of the true solution where ⟨β0, β∗⟩ > C, where C > 0 is some constant. It is well understood that for a random initialization on S^{p−1}, such a condition fails with probability going to one as p → ∞.\n\nInstead, we propose a two-stage procedure for estimating β∗ in our setting. In the first stage, we adapt the convex relaxation proposed by [27] and use it as an initialization step, in order to obtain a good enough initial point satisfying the condition ⟨β0, β∗⟩ > C. The convex optimization problem can be easily solved by the alternating direction method of multipliers (ADMM) algorithm (see [3, 27] for details). Then we adapt the truncated power method. This procedure is illustrated in Algorithm 2. In particular, we define the truncation operator trunc(·, ·) as [trunc(β, s)]_j = 1(j ∈ S)·β_j, where S is the index set corresponding to the top s largest |β_j|. The initialization phase of our algorithm requires O(s^2 log p) samples (see below for more precise details) to succeed. As the work in [2] suggests, it is unlikely that a polynomial time algorithm can avoid such dependence. However, once we are near the solution, as we show, this two-step procedure achieves the optimal error rate of √(s log p/n).\n\nAlgorithm 1 Low dimensional recovery\nInput: {(yi, xi)}_{i=1}^n, number of iterations Tmax\n1: Second moment estimation: Construct M from samples according to (3.4).\n2: Initialization: Choose a random vector β0 ∈ S^{p−1}.\n3: For t = 1, 2, . . . , Tmax do\n4:   βt ← M · βt−1\n5:   βt ← βt/‖βt‖2\n6: end For\nOutput: βTmax\n\nAlgorithm 2 Sparse recovery\nInput: {(yi, xi)}_{i=1}^n, number of iterations Tmax, regularization parameter ρ, sparsity level ŝ\n1: Second moment estimation: Construct M from samples according to (3.4).\n2: Initialization:\n3:   Π0 ← argmin_{Π ∈ R^{p×p}} {−⟨M, Π⟩ + ρ‖Π‖_{1,1} | Tr(Π) = 1, 0 ⪯ Π ⪯ I}   (3.8)\n4:   β0 ← first leading eigenvector of Π0\n5:   β0 ← trunc(β0, ŝ)\n6:   β0 ← β0/‖β0‖2\n7: For t = 1, 2, . . . , Tmax do\n8:   βt ← trunc(M · βt−1, ŝ)\n9:   βt ← βt/‖βt‖2\n10: end For\nOutput: βTmax\n\nTheorem 3.6. Let\n\nκ := [4(1 − μ0^2) + φ(f)] / [4(1 − μ0^2) + 3φ(f)] < 1,   (3.9)\n\nand the minimum sample size be\n\nnmin := C · s^2 log p · [(1 − μ0^2) + φ(f)]^2 / (φ(f)^2 · min{κ(1 − κ^{1/2})/2, κ/8}).   (3.10)\n\nSuppose ρ = C[φ(f) + (1 − μ0^2)]·√(log p/n) with a sufficiently large constant C, where φ(f) and μ0 are specified in (3.1) and (3.2). Meanwhile, assume the sparsity parameter ŝ in Algorithm 2 is set to be ŝ = C′′ max{⌈1/(κ^{−1/2} − 1)^2⌉, 1} · s∗. For n ≥ nmin with nmin defined in (3.10), we have\n\n‖βt − β∗‖2 ≤ C · ([φ(f) + (1 − μ0^2)]^{5/2} (1 − μ0^2)^{1/2}) / (φ(f)^3 · min{(1 − κ^{1/2})/2, 1/8}) · √(s log p/n)  (statistical error)  +  κ^t · √(1 − ⟨β0, β∗⟩^2)  (optimization error)   (3.11)\n\nwith high probability. Here κ is defined in (3.9).\n\nThe first term on the right-hand side of (3.11) is the statistical error, while the second term gives the optimization error. Note that the optimization error decays at a geometric rate since κ < 1. For Tmax sufficiently large, we have\n\n‖βTmax − β∗‖2 ≲ √(s log p/n).\n\nIn the sequel, we show that the right-hand side gives the optimal statistical rate of convergence for a broad model class under the high dimensional setting with p ≫ n.\n\n3.5 Minimax lower bound\n\nWe establish the minimax lower bound for estimating β∗ in the model defined in (1.1). In the sequel, we define the family of link functions that are Lipschitz continuous and bounded away from ±1. Formally, for any m ∈ (0, 1) and L > 0, we define\n\nF(m, L) := {f : |f(z)| ≤ 1 − m,  |f(z) − f(z′)| ≤ L|z − z′|  for all z, z′ ∈ R}.   (3.12)\n\nLet X^n_f := {(yi, xi)}_{i=1}^n be the n i.i.d. realizations of (Y, X), where X follows N(0, Ip) and Y satisfies (1.1) with link function f. Correspondingly, we denote the estimator of β∗ ∈ B by β̂(X^n_f), where B is the domain of β∗. We define the minimax risk for estimating β∗ as\n\nR(n, m, L, B) := inf_{f ∈ F(m,L)} inf_{β̂(X^n_f)} sup_{β∗ ∈ B} E‖β̂(X^n_f) − β∗‖2.   (3.13)\n\nIn the above definition, we not only take the infimum over all possible estimators β̂, but also over all possible link functions in F(m, L). For a fixed f, our formulation recovers the standard definition of minimax risk [30]. 
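Returning to Algorithm 2 of §3.4, the truncation operator and the stage-two truncated power iterations can be sketched as follows (a toy illustration run on the population matrix E(M) from Lemma 3.2 rather than a sample M; helper names are ours):

```python
import numpy as np

def trunc(beta, s):
    """[trunc(beta, s)]_j = beta_j if j is among the top-s entries by |beta_j|, else 0."""
    out = np.zeros_like(beta)
    top = np.argsort(-np.abs(beta))[:s]
    out[top] = beta[top]
    return out

def truncated_power(M, beta0, s_hat, t_max=100):
    """Stage two of the sparse recovery: beta_t = trunc(M beta_{t-1}, s_hat), renormalized.
    beta0 is assumed to come from a good initializer, e.g. the convex relaxation (3.8)."""
    beta = trunc(beta0, s_hat)
    beta /= np.linalg.norm(beta)
    for _ in range(t_max):
        beta = trunc(M @ beta, s_hat)
        beta /= np.linalg.norm(beta)
    return beta

# Planted sparse spike: E(M) = 4*phi * b b^T + 4*I with b = (e_1 + e_2)/sqrt(2).
p = 50
b = np.zeros(p)
b[:2] = 1.0 / np.sqrt(2.0)
M = 4 * 0.6 * np.outer(b, b) + 4 * np.eye(p)
beta0 = b + 0.1 * np.ones(p)               # crude but sufficiently correlated initializer
est = truncated_power(M, beta0, s_hat=2)
assert min(np.linalg.norm(est - b), np.linalg.norm(est + b)) < 1e-6
```

The hard truncation keeps every iterate exactly ŝ-sparse, which is what turns the plain power method of Algorithm 1 into a sparse eigenvector routine.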
By taking the infimum over all link functions, our formulation characterizes the minimax lower bound under the least challenging f in F(m, L). In the sequel, we prove that our procedure attains such a minimax lower bound for the least challenging f given any unknown link function in F(m, L). That is to say, even when f is unknown, our estimation procedure is as accurate as in the setting where we are provided the least challenging f, and the achieved accuracy is not improvable due to the information-theoretic limit. The following theorem establishes the minimax lower bound in the high dimensional setting.\n\nTheorem 3.7. Let B = S^{p−1} ∩ B0(s, p). We assume that n > m(1 − m)/(2L^2) · [Cs log(p/s)/2 − log 2]. For any s ∈ (0, p/4], the minimax risk defined in (3.13) satisfies\n\nR(n, m, L, B) ≥ C′ · (√(m(1 − m))/L) · √(s log(p/s)/n).\n\nHere C and C′ are absolute constants, while m and L are defined in (3.12).\n\nTheorem 3.7 establishes the minimax optimality of the statistical rate attained by our procedure for p ≫ n and s-sparse β∗. In particular, for arbitrary f ∈ F(m, L) ∩ {f : φ(f) > 0}, the estimator β̂ attained by Algorithm 2 is minimax-optimal in the sense that its √(s log p/n) rate of convergence is not improvable, even when the information on the link function f is available. For general β∗ ∈ R^p, one can show the best possible convergence rate is Ω(√(m(1 − m)p/n)/L) by setting s = p/4 in Theorem 3.7.\n\nIt is worth noting that our lower bound becomes trivial for m = 0, i.e., when there exists some z such that |f(z)| = 1. One example is noiseless one-bit compressed sensing, for which we have f(z) = sign(z). In fact, for noiseless one-bit compressed sensing, the √(s log p/n) rate is not optimal. For example, Jacques et al. [14] provide an algorithm (with exponential running time) that achieves rate s log p/n. Understanding such a rate transition phenomenon for link functions with zero margin, i.e., m = 0 in (3.12), is an interesting future direction.\n\n3.6 Numerical results\n\nWe now turn to the numerical results that support our theory. For the three models introduced in §2, we apply Algorithm 1 and Algorithm 2 for parameter estimation in the classic and high dimensional regimes. Our simulations are based on synthetic data. For classic recovery, β∗ is randomly chosen from S^{p−1}; for sparse recovery, we set β∗_j = s^{−1/2}·1(j ∈ S) for all j ∈ [p], where S is a random index subset of [p] with size s. In Figure 1, as predicted by Theorem 3.5, we observe that the same √(p/n) leads to nearly identical estimation error. Figure 2 demonstrates similar results for the predicted rate √(s log p/n) of sparse recovery, and thus validates Theorem 3.6.\n\nFigure 1: Estimation error of low dimensional recovery. (a) Flipped Logistic Regression, pe = 0.1. (b) One-bit Compressed Sensing, σ^2 = 0.1. (c) One-bit Phase Retrieval, θ = 1.\n\nFigure 2: Estimation error of sparse recovery. (a) Flipped Logistic Regression, pe = 0.1. (b) One-bit Compressed Sensing, σ^2 = 0.1. (c) One-bit Phase Retrieval, θ = 1.\n\n4 Discussion\n\nSample complexity. In the high dimensional regime, while our algorithm achieves the optimal convergence rate, the sample complexity we need is Ω(s^2 log p). The natural question is whether it can be reduced to O(s log p). We note that breaking the barrier s^2 log p is challenging. 
Consider the simpler problem of sparse phase retrieval, where $y_i = |\langle x_i, \beta^* \rangle|$. Despite a fairly extensive body of literature, the state-of-the-art efficient algorithms (i.e., those with polynomial running time) for recovering sparse $\beta^*$ require sample complexity $\Omega(s^2 \log p)$ [10]. It remains open whether any polynomial-time algorithm can achieve consistent sparse recovery with $O(s \log p)$ samples.

Acknowledgment

XY and CC would like to acknowledge NSF grants 1056028, 1302435 and 1116955. This research was also partially supported by the U.S. Department of Transportation through the Data-Supported Transportation Operations and Planning (D-STOP) Tier 1 University Transportation Center. HL is grateful for the support of NSF CAREER Award DMS1454377, NSF IIS1408910, NSF IIS1332109, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841. ZW was partially supported by an MSR PhD fellowship while this work was done.

References

[1] Alquier, P. and Biau, G. (2013). Sparse single-index model. Journal of Machine Learning Research, 14 243–280.

[2] Berthet, Q. and Rigollet, P. (2013). Complexity theoretic lower bounds for sparse principal component detection. In Conference on Learning Theory.

[3] Boyd, S., Parikh, N., Chu, E., Peleato, B. and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3 1–122.

[4] Cai, T. T., Ma, Z. and Wu, Y. (2013). Sparse PCA: Optimal rates and adaptive estimation. Annals of Statistics, 41 3074–3110.

[5] Candès, E. J., Eldar, Y. C., Strohmer, T. and Voroninski, V. (2013). Phase retrieval via matrix completion. SIAM Journal on Imaging Sciences, 6 199–225.

[6] d'Aspremont, A., Bach, F. and El Ghaoui, L. (2008). Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9 1269–1294.

[7] d'Aspremont, A., El Ghaoui, L., Jordan, M. I. and Lanckriet, G. R. (2007). A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 434–448.

[8] Delecroix, M., Hristache, M. and Patilea, V. (2000). Optimal smoothing in semiparametric index approximation of regression functions. Tech. rep., Interdisciplinary Research Project: Quantification and Simulation of Economic Processes.

[9] Delecroix, M., Hristache, M. and Patilea, V. (2006). On semiparametric M-estimation in single-index regression. Journal of Statistical Planning and Inference, 136 730–769.

[10] Eldar, Y. C. and Mendelson, S. (2014). Phase retrieval: Stability and recovery guarantees. Applied and Computational Harmonic Analysis, 36 473–494.

[11] Gopi, S., Netrapalli, P., Jain, P. and Nori, A. (2013). One-bit compressed sensing: Provable support and vector recovery. In International Conference on Machine Learning.

[12] Härdle, W., Hall, P. and Ichimura, H. (1993). Optimal smoothing in single-index models. Annals of Statistics, 21 157–178.

[13] Hristache, M., Juditsky, A. and Spokoiny, V. (2001). Direct estimation of the index coefficient in a single-index model. Annals of Statistics, 29 595–623.

[14] Jacques, L., Laska, J. N., Boufounos, P. T. and Baraniuk, R. G. (2011). Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. arXiv preprint arXiv:1104.3160.

[15] Kakade, S. M., Kanade, V., Shamir, O. and Kalai, A. (2011). Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems.

[16] Kalai, A. T. and Sastry, R. (2009). The Isotron algorithm: High-dimensional isotonic regression. In Conference on Learning Theory.

[17] Ma, Z. (2013). Sparse principal component analysis and iterative thresholding. Annals of Statistics, 41 772–801.

[18] Massart, P. and Picard, J. (2007). Concentration Inequalities and Model Selection, vol. 1896. Springer.

[19] Natarajan, N., Dhillon, I., Ravikumar, P. and Tewari, A. (2013). Learning with noisy labels. In Advances in Neural Information Processing Systems.

[20] Plan, Y. and Vershynin, R. (2013). One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66 1275–1297.

[21] Plan, Y., Vershynin, R. and Yudovina, E. (2014). High-dimensional estimation with geometric constraints. arXiv preprint arXiv:1404.3749.

[22] Powell, J. L., Stock, J. H. and Stoker, T. M. (1989). Semiparametric estimation of index coefficients. Econometrica, 57 1403–1430.

[23] Shen, H. and Huang, J. (2008). Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99 1015–1034.

[24] Stoker, T. M. (1986). Consistent estimation of scaled coefficients. Econometrica, 54 1461–1481.

[25] Tibshirani, J. and Manning, C. D. (2013). Robust logistic regression using shift parameters. arXiv preprint arXiv:1305.4987.

[26] Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.

[27] Vu, V. Q., Cho, J., Lei, J. and Rohe, K. (2013). Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In Advances in Neural Information Processing Systems.

[28] Witten, D., Tibshirani, R. and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10 515–534.

[29] Yi, X., Wang, Z., Caramanis, C. and Liu, H. (2015). Optimal linear estimation under unknown nonlinear transform. arXiv preprint arXiv:1505.03257.

[30] Yu, B. (1997). Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam. Springer, 423–435.

[31] Yuan, X.-T. and Zhang, T. (2013). Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14 899–925.

[32] Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15 265–286.