{"title": "Interaction Hard Thresholding: Consistent Sparse Quadratic Regression in Sub-quadratic Time and Space", "book": "Advances in Neural Information Processing Systems", "page_first": 7926, "page_last": 7936, "abstract": "Quadratic regression involves modeling the response as a (generalized) linear function of not only the features $x^{j_1}$ but also of quadratic terms $x^{j_1}x^{j_2}$. The inclusion of such higher-order \u201cinteraction terms\" in regression often provides an easy way to increase accuracy in already-high-dimensional problems. However, this explodes the problem dimension from linear $O(p)$ to quadratic $O(p^2)$, and it is common to look for sparse interactions (typically via heuristics). In this paper, we provide a new algorithm \u2013 Interaction Hard Thresholding (IntHT) which is the first one to provably accurately solve this problem in sub-quadratic time and space. It is a variant of Iterative Hard Thresholding; one that uses the special quadratic structure to devise a new way to (approx.) extract the top elements of a $p^2$ size gradient in sub-$p^2$ time and space. Our main result is to theoretically prove that, in spite of the many speedup-related approximations, IntHT linearly converges to a consistent estimate under standard high-dimensional sparse recovery assumptions. We also demonstrate its value via synthetic experiments. 
Moreover, we numerically show that IntHT can be extended to higher-order regression problems, and also theoretically analyze an SVRG variant of IntHT.", "full_text": "Interaction Hard Thresholding: Consistent Sparse Quadratic Regression in Sub-quadratic Time and Space

Shuo Yang ∗
Department of Computer Science
University of Texas at Austin
Austin, TX 78712
yangshuo_ut@utexas.edu

Yanyao Shen ∗
ECE Department
University of Texas at Austin
Austin, TX 78712
shenyanyao@utexas.edu

Sujay Sanghavi
ECE Department
University of Texas at Austin
Austin, TX 78712
sanghavi@mail.utexas.edu

Abstract

Quadratic regression involves modeling the response as a (generalized) linear function of not only the features x^{j1}, but also of quadratic terms x^{j1}x^{j2}. The inclusion of such higher-order "interaction terms" in regression often provides an easy way to increase accuracy in already-high-dimensional problems. However, this explodes the problem dimension from linear O(p) to quadratic O(p²), and it is common to look for sparse interactions (typically via heuristics).
In this paper we provide a new algorithm – Interaction Hard Thresholding (IntHT) – which is the first one to provably accurately solve this problem in sub-quadratic time and space. It is a variant of Iterative Hard Thresholding; one that uses the special quadratic structure to devise a new way to (approximately) extract the top elements of a p²-size gradient in sub-p² time and space.
Our main result is to theoretically prove that, in spite of the many speedup-related approximations, IntHT linearly converges to a consistent estimate under standard high-dimensional sparse recovery assumptions.
We also demonstrate its value via\nsynthetic experiments.\nMoreover, we numerically show that IntHT can be extended to higher-order regres-\nsion problems, and also theoretically analyze an SVRG variant of IntHT.\n\nIntroduction\n\n1\nSimple linear regression aims to predict a response y via a (possibly generalized) linear function \u03b8(cid:62)x\nof the feature vector x. Quadratic regression aims to predict y as a quadratic function x(cid:62)\u0398x of the\nfeatures x\n\nLinear Model\ny \u223c \u03b8(cid:62)x\n\nQuadratic Model\ny \u223c x(cid:62)\u0398 x\n\nThe inclusion of such higher-order interaction terms \u2013 in this case second-order terms of the form\nxj1xj2 \u2013 is common practice, and has been seen to provide much more accurate predictions in\n\n\u2217equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fseveral high-dimensional problem settings like recommendation systems, advertising, social network\nmodeling and computational biology [23, 11, 3]. In this paper we consider quadratic regression with\nan additional (possibly non-linear) link function relating y to x(cid:62)\u0398 x.\nOne problem with explicitly adding quadratic interaction terms is that the dimension of the problem\nnow goes from p to p2. In most cases, the quadratic problem is high-dimensional and will likely\nover\ufb01t the data; correspondingly, it is common to implicitly / explicitly impose low-dimensional\nstructure on the \u0398 \u2013 with sparsity of \u0398 being a natural choice. A concrete example for sparse\ninteraction would be the genome-wide association study, where for a given phenotype, the associated\ngenetic variants are usually a sparse subset of all possible variants. 
Those genes usually interact with each other and lead to the given phenotype [15].
The naive approach to solving this problem involves recasting it as a big linear model that is now in p² dimensions, with the corresponding p² features being all pairs of the form x^{j1}x^{j2}. However, this approach takes Ω(p²) time and space, since sparse linear regression cannot be done in time and space smaller than its dimension – which in this case is p² – even in cases where statistical properties like restricted strong convexity / incoherence etc. hold. Fundamentally, the problem lies in the fact that one needs to compute a gradient of the loss, and this is an Ω(p²) operation.
Our motivation: Can we learn a sparse quadratic model with time and space complexity that is sub-quadratic? In particular, suppose we have data which is well modeled by a Θ∗ that is K-sparse, with K being O(p^γ) and γ < 1. Statistically, this can possibly be recovered from O(K log p) samples, each of which is p-dimensional. Thus we have a setting where the input is sub-quadratic with size O(Kp log p), and the final output is sub-quadratic with size O(K). Our aim is to have an algorithm whose time and space complexity is also sub-quadratic for this case.
In this paper, we develop a new algorithm which has this desired sub-quadratic complexity, and subsequently theoretically establish that it consistently recovers a sparse Θ∗.
We briefly overview our setting and results below.

1.1 Main Contributions
Given n samples {(x_i, y_i)}_{i=1}^n, we are interested in minimizing the following loss function corresponding to a quadratic model:

(Quadratic Structure)    min_{Θ: ‖Θ‖_0 ≤ K}  (1/n) Σ_{i=0}^{n−1} f(x_i⊤ Θ x_i, y_i) := F_n(Θ)    (1)

We develop a new algorithm – Interaction Hard Thresholding (IntHT), outlined in Algorithm 1 – for this problem, and provide a rigorous proof of consistency for it under the standard settings (restricted strong convexity and smoothness of the loss) in which consistency is established for sparse recovery problems. At a high level, it is based on the following key ideas:

(1) Because of the special quadratic structure, we show that the top 2k entries of the gradient can be found in sub-quadratic time and space, using ideas from hashing and coding. The subroutine in Algorithm 2 for doing this is based on the idea of [21], and Theorem 1 characterizes its performance and approximation guarantee.

(2) We note a simple but key fact: in (stochastic) iterative hard thresholding, the new k-sparse Θ_{t+1} that is produced has its support inside the union of two sets of size k and 2k: the support of the previous Θ_t, and the top-2k elements of the gradient.

(3) While we do not find the precise top-2k elements of the gradient, we do find an approximation. Using a new theoretical analysis, we show that this approximate top-2k is still sufficient to establish linear convergence to a consistent solution.
This is our main result, described in Theorem 4.

(4) As an extension, we show that our algorithm also works with popular SGD variants like SVRG (Algorithm 4 in Appendix B), with provable linear convergence and consistency in Appendix C. We also demonstrate the extension of our algorithm to estimating higher-order interaction terms with a numerical experiment in Section 5.

Notation We use [n] to represent the set {0, · · · , n − 1}. We use f_B(Θ) to denote the average loss on batch B, where B is a subset of [n] with batch size m. We define ⟨A, B⟩ = tr(A⊤B), and supp(A) to be the index set of the non-zero entries of A. We let P_S be the projection operator onto the index set S. We use standard Big-O notation for time/space complexity analysis, and Big-Õ notation, which ignores log factors.

2 Related Work

Learning with high-order interactions Regression with interaction terms has been studied in the statistics community. However, many existing results operate under the assumption of strong/weak hierarchical (SH/WH) structure: the coefficient of the interaction term x^{j1}x^{j2} is non-zero only when both coefficients of x^{j1} and x^{j2} are (or at least one of them is) non-zero. Greedy heuristics [32, 11] and regularization-based methods [7, 3, 16, 25, 10] have been proposed accordingly. However, they could potentially miss important signals that only contain interaction effects. Furthermore, several of these methods also suffer from scaling problems due to the quadratic scaling of the parameter size. There are also results considering the more general tensor regression; see, e.g., [34, 9], among many others. However, these results do not focus on solutions with efficient memory usage and time complexity, which may become an issue as the dimension scales up.
From a combinatorial perspective, [18, 13] learn sparse polynomials over the Boolean domain using quite different approaches.
Sparse recovery, IHT and stochastic-IHT IHT [4] is a sparse recovery algorithm proven to be effective for M-estimation [12] under the standard RSC/RSM assumptions. [20] proposes and analyzes a stochastic version of IHT. [14, 26] further consider variance-reduced acceleration algorithms in this high-dimensional setting, and [35] studies IHT in the high-dimensional setting with nonlinear measurements. Notice that IHT, if used for our quadratic problem, still suffers from quadratic space, similar to other techniques, e.g., the Lasso, basis pursuit, and least angle regression [29, 6, 8]. On the other hand, [19] recently considers a variant of IHT where, for each sample, only a random subset of features is observed. This makes each update cheap, but their sample size has linear dependence on the ambient dimension, which is again quadratic. Apart from that, [20, 17] also show that IHT can potentially tolerate a small amount of error per iteration.
Maximum inner product search One key technique of our method is extracting the top elements (by absolute value) of the gradient matrix, which can be expressed as the product of two matrices. This can be formulated as finding Maximum Inner Products (MIP) between two sets of vectors. In practice, algorithms specifically designed for MIP have been proposed based on locality-sensitive hashing [27], alongside many greedy-type algorithms [2, 33]. But they either cannot fit into the regression setting or suffer from quadratic complexity. In theory, MIP is treated as a fundamental problem in the recent development of complexity theory [1, 31]. [1, 5] show the hardness of MIP, even for Boolean vector inputs.
While in general hard, there are data-dependent approximation guarantees, using the compressed matrix multiplication method [21], which inspired our work.
Others The quadratic problem we study also shares similarities with several other problem settings, including factorization machines [23] and kernel learning [24, 22]. Unlike factorization machines, we do not require the input data to be sparse; and while a factorization machine tries to learn a low-rank representation, we are interested in learning a sparse representation. Compared to kernel learning, especially with quadratic / polynomial kernels, our task is to do feature selection and identify the correct interactions.

3 Interaction Hard Thresholding

We now describe the main ideas motivating our approach, and then formally describe the algorithm.
Naively recasting as a linear model has p² time and space complexity: As a first step to our method, let us see what happens with the simplest approach. Specifically, as noted before, problem (1) can be recast as one of finding a sparse (generalized) linear model in the p²-size variable Θ:

(Recasting as linear model)    min_{Θ: ‖Θ‖_0 ≤ K}  (1/n) Σ_{i=0}^{n−1} f(⟨X_i, Θ⟩, y_i)

where the matrix X_i := x_i x_i⊤. Iterative hard thresholding (IHT) [4] is a state-of-the-art method (both in terms of speed and statistical accuracy) for such sparse (generalized) linear problems. This involves

Algorithm 1 INTERACTION HARD THRESHOLDING (INTHT)
1: Input: Dataset {x_i, y_i}_{i=1}^n, dimension p
2: Parameters: Step size η, estimation sparsity k, batch size m, round number T
3: Output: The parameter estimate Θ̂
4: Initialize Θ_0 as a p × p zero matrix.
5: for t = 0 to T − 1 do
6:   Draw a subset of indices B_t from [n] randomly.
7:   Calculate the residual u_i = u(Θ_t, x_i, y_i) based on eq. (2), for every i ∈ B_t.
8:   Set A_t ∈ R^{p×m}, where each column of A_t is u_i x_i, i ∈ B_t.
9:   Set B_t ∈ R^{p×m}, where each column of B_t is x_i, i ∈ B_t. (so that A_t B_t⊤ / m gives the gradient)
10:  Compute S̃_t = ATEE(A_t, B_t, 2k).   /* approximate top elements extraction */
11:  Set S_t = S̃_t ∪ supp(Θ_t).
12:  Compute P_{S_t}(G_t) ← the gradient value G_t = (1/m) Σ_{i∈B_t} u_i x_i x_i⊤, calculated only on S_t.
13:  Update Θ_{t+1} = H_k(Θ_t − η P_{S_t}(G_t)).   /* inaccurate hard thresholding update */
14: Return: Θ̂ = Θ_T

Algorithm 2 APPROXIMATED TOP ELEMENTS EXTRACTION (ATEE)
1: Input: Matrix A, matrix B, top selection size k
2: Parameters: Output set size upper bound b, repetition number d, significance level Δ
3: Expected Output: Set Λ: the top-k elements in AB⊤ with absolute value greater than Δ
4: Output: Set Λ̃ of indices, with size at most b (approximately contains Λ)
5: Short Description: This algorithm is adopted directly from [21]. It follows from the compressed matrix product via FFT (see Section 2.2 of [21]) and sub-linear result extraction by error-correcting codes (see Section 4 of [21]), which drastically reduces the complexity. The whole process is repeated d times to boost the success probability. The notation here matches [21] exactly, except that we use p for the dimension where [21] uses n.
6: Intuitively, the algorithm puts all the elements of AB⊤ into b different "baskets", with each element assigned a positive or negative sign. It then selects the baskets whose magnitude is greater than Δ.
Further, one large element is recovered from each of the selected baskets.

the following update rule:

(standard IHT)    Θ_{t+1} = H_k(Θ_t − η ∇F_n(Θ_t))

where F_n(·) is the average loss defined in (1), and H_k(·) is the hard-thresholding operator that keeps the largest k elements (in terms of absolute value) of the matrix given to it, and sets the rest to 0. Here, k is the estimation sparsity parameter. In this update equation, the current iterate Θ_t has k non-zero elements and so can be stored efficiently. But the gradient ∇F_n(Θ_t) is p²-dimensional; this causes IHT to have Ω(p²) complexity. This issue remains even if the gradient is replaced by a stochastic gradient that uses fewer samples, since even in a stochastic gradient the number of variables remains p².
A key observation: We only need to know the top-2k elements of this gradient ∇F_n(Θ_t), because of the following simple fact: if A is a k-sparse matrix, and B is any matrix, then

supp(H_k(A + B)) ⊂ supp(A) ∪ supp(H_{2k}(B)).

That is, the support of the top k elements of the sum A + B is inside the union of the support of A and the support of the top-2k elements of B. The size of this union set is at most 3k.
Thus, in the context of standard IHT, we do not really need to know the full (stochastic) gradient ∇F_n(Θ_t); instead we only need to know (a) the values and locations of its top-2k elements, and (b) the values of at most k extra elements of it – those corresponding to the support of the current Θ_t.
The key idea of our method is to exploit the special structure of the quadratic model to find the top-2k elements of the batch gradient ∇f_B in sub-quadratic time.
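The support-containment fact above is easy to check numerically. A minimal sketch (our illustrative code, not from the paper; `hard_threshold` is our name for the operator H_k):

```python
import numpy as np

def hard_threshold(M, k):
    """H_k: keep the k largest-magnitude entries of M, zero out the rest."""
    out = np.zeros_like(M)
    idx = np.unravel_index(np.argsort(np.abs(M), axis=None)[-k:], M.shape)
    out[idx] = M[idx]
    return out

def support(M):
    """Index set of the non-zero entries of M."""
    return set(zip(*np.nonzero(M)))

rng = np.random.default_rng(0)
p, k = 30, 5
A = hard_threshold(rng.normal(size=(p, p)), k)  # a k-sparse matrix
B = rng.normal(size=(p, p))                     # an arbitrary dense matrix

# supp(H_k(A + B)) is contained in supp(A) ∪ supp(H_2k(B))
lhs = support(hard_threshold(A + B, k))
rhs = support(A) | support(hard_threshold(B, 2 * k))
assert lhs <= rhs and len(rhs) <= 3 * k
```

The union set has at most 3k indices, so a k-sparse update only ever needs 3k gradient entries per iteration.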
Specifically, ∇f_B has the following form:

∇f_B(Θ) ≜ (1/m) Σ_{i∈B} ∇f(x_i⊤ Θ x_i, y_i) = (1/m) Σ_{i∈B} u(Θ, x_i, y_i) x_i x_i⊤,    (2)

where u(Θ, x_i, y_i) is a scalar related to the residual and the derivative of the link function, and B represents the mini-batch, with B ⊂ [n], |B| = m. This allows us to approximately find the top-2k elements of the p²-dimensional stochastic gradient in Õ(k(p + k)) time and space, which is sub-quadratic when k is O(p^γ) for γ < 1.
Our algorithm is formally described in Algorithm 1. We use Approximate Top Elements Extraction (ATEE) to approximately find the top-2k elements of the gradient; it is briefly summarized in Algorithm 2 and based on the idea of Pagh [21]. The full algorithm is re-organized and provided in Appendix A for completeness. Our method, Interaction Hard Thresholding (IntHT), builds on IHT but needs a substantially new analysis for its proof of consistency. The subsequent section goes into the details of this analysis.

4 Theoretical Guarantees

In this section, we establish the consistency of Interaction Hard Thresholding in the standard setting in which sparse recovery is established.
Specifically, we establish convergence results under deterministic assumptions on the data and function, including restricted strong convexity (RSC) and smoothness (RSM). Then, we analyze the sample complexity when features are generated from a sub-gaussian distribution in the quadratic regression setting, in order to have well-controlled RSC and RSM parameters. The analysis of the required sample complexity yields an overall complexity that is sub-quadratic in time and space.

4.1 Preliminaries

We first describe the standard deterministic setting in which sparse recovery is typically analyzed. Specifically, the samples (x_i, y_i) are fixed and known.
Our first assumption defines how our intended recovery target Θ⋆ relates to the resulting loss function F_n(·).
Assumption 1 (Standard identifiability assumption). There exists a Θ⋆ which is K-sparse such that the following holds: given any batch B ⊂ [n] of m samples, the norm of the batch gradient at Θ⋆ is bounded by a constant G. That is, ‖∇f_B(Θ⋆)‖_F ≤ G, and ‖Θ⋆‖_∞ ≤ ω.
In words, this says that the gradient at Θ⋆ is small. In a noiseless setting where data is generated from Θ⋆, e.g. when y_i = x_i⊤ Θ⋆ x_i, this gradient is 0; i.e. the above is satisfied with G = 0, and Θ⋆ would be the exact sparse optimum of F_n(·). The above assumption generalizes this notion to noisy and non-linear cases, relating our recovery target Θ⋆ to the loss function. This is a standard setup assumption in sparse recovery.
Now that we have specified what Θ⋆ is and why it is special, we specify the properties the loss function needs to satisfy. These are again standard in the sparse recovery literature [20, 26, 14].
Assumption 2 (Standard landscape properties of the loss).
For any pair Θ_1, Θ_2 and s ≤ p² such that |supp(Θ_1 − Θ_2)| ≤ s:
• The overall loss F_n satisfies α_s-Restricted Strong Convexity (RSC):

F_n(Θ_1) − F_n(Θ_2) ≥ ⟨Θ_1 − Θ_2, ∇_Θ F_n(Θ_2)⟩ + (α_s/2) ‖Θ_1 − Θ_2‖²_F

• The mini-batch loss f_B satisfies L_s-Restricted Strong Smoothness (RSM):

‖∇f_B(Θ_1) − ∇f_B(Θ_2)‖_F ≤ L_s ‖Θ_1 − Θ_2‖_F,  ∀B ⊂ [n], |B| = m

• f_B satisfies Restricted Convexity (RC) (but not strong):

f_B(Θ_1) − f_B(Θ_2) − ⟨∇f_B(Θ_2), Θ_1 − Θ_2⟩ ≥ 0,  ∀B ⊂ [n], |B| = m, s = 3k + K

Note: While our assumptions are standard, our result does not follow immediately from existing analyses – because we cannot find the exact top elements of the gradient. We need a new analysis to show that even with our approximate top element extraction, linear convergence to Θ⋆ still holds.

4.2 Main Results

Here we proceed to establish the sub-quadratic complexity and consistency of IntHT for parameter estimation. Theorem 1 presents the analysis of ATEE. It provides the computational complexity analysis, as well as the statistical guarantee of support recovery. Based on this, we show the per-round convergence property of Algorithm 1 in Theorem 3. We then establish our main statistical result, the linear convergence of Algorithm 1, in Theorem 4.
Next, we discuss the batch size that guarantees support recovery in Theorem 5, focusing on the quadratic regression setting, i.e. the model is linear in both interaction terms and linear terms. Combining all the established results, the sub-quadratic complexity is established in Corollary 6.
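The basket-hashing step at the core of ATEE is an instance of count-sketch-based compressed matrix multiplication. A simplified sketch of that step (our illustrative code under simplifying assumptions: it estimates individual entries of AB⊤ from size-b sketches, whereas ATEE additionally uses error-correcting codes to locate heavy entries sub-linearly; all function names are ours):

```python
import numpy as np

def sketch_product(A, B, b, seed):
    """Count-sketch of A @ B.T (compressed matrix multiplication in the
    style of [21]): entry (i, j) lands in basket (h1[i] + h2[j]) % b with
    sign s1[i] * s2[j]; basket sums are formed via FFT-based convolution,
    one rank-1 term per column pair."""
    p, m = A.shape
    r = np.random.default_rng(seed)
    h1, h2 = r.integers(0, b, p), r.integers(0, b, p)
    s1, s2 = r.choice([-1.0, 1.0], p), r.choice([-1.0, 1.0], p)
    c = np.zeros(b)
    for k in range(m):
        fa = np.fft.rfft(np.bincount(h1, weights=s1 * A[:, k], minlength=b), b)
        fb = np.fft.rfft(np.bincount(h2, weights=s2 * B[:, k], minlength=b), b)
        c += np.fft.irfft(fa * fb, b)  # circular convolution of the two sketches
    return c, (h1, h2, s1, s2)

def estimate_entry(sketches, i, j, b):
    """Median-over-repetitions estimate of (A @ B.T)[i, j]."""
    ests = [s1[i] * s2[j] * c[(h1[i] + h2[j]) % b]
            for c, (h1, h2, s1, s2) in sketches]
    return float(np.median(ests))

# tiny demo: a single dominant entry of A @ B.T is recovered exactly
p, m, b, d = 16, 3, 64, 5
A = np.zeros((p, m)); A[3] = [3.0, 4.0, 5.0]
B = np.zeros((p, m)); B[7] = [10.0, 10.0, 2.0]  # (A @ B.T)[3, 7] == 80
sketches = [sketch_product(A, B, b, seed) for seed in range(d)]
est = estimate_entry(sketches, 3, 7, b)
```

Each sketch costs Õ(m(p + b)) time, never forming the p × p product; the bΔ² ≥ 432‖AB⊤‖²_F-style condition of Theorem 1 controls the collision noise in each basket.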
All the proofs in this subsection can be found in Appendix E.
Analysis of ATEE Consider ATEE with parameters set to b, d, Δ. Recall this means that ATEE returns an index set (Λ̃) of size at most b, which is expected to contain the desired index set (Λ). Note that the desired index set (Λ) is composed of the top-2k elements of the gradient ∇f_B(Θ) whose absolute value is greater than Δ. Suppose now the current estimate is Θ, and B is the batch. The following theorem establishes when this output set (Λ̃) captures the top elements of the gradient.
Theorem 1 (Recovering top-2k elements of the gradient, modified from [21]). With the setting above, if we choose b, d, Δ so that bΔ² ≥ 432 ‖∇f_B(Θ)‖²_F and d ≥ 48 log 2ck, then the index set (Λ̃) returned by ATEE contains the desired index set (Λ) with probability at least 1 − 1/c. Also, in this case, the time complexity of ATEE is Õ(m(p + b)), and the space complexity is Õ(m(p + b)).
Theorem 1 requires that the parameters b, Δ are set to satisfy bΔ² ≥ 432 ‖∇f_B(Θ)‖²_F. Note that Δ controls the minimum magnitude of the top-k elements we can find. To avoid a trivial extraction result, we need to set Δ as a constant that does not scale with p. In order to control the scale of Δ and b, to get a consistent estimate, and to achieve sub-quadratic complexity, we need to upper bound ‖∇f_B(Θ)‖²_F. This is the compressibility estimation problem that was left open in [21]. In our case, the batch gradient norm can be controlled via the RSM property. More formally, we have
Lemma 2 (Frobenius norm bound of gradient).
The Frobenius norm of the batch gradient at an arbitrary k-sparse Θ with ‖Θ‖_∞ ≤ ω can be bounded as ‖∇f_B(Θ)‖_F ≤ 2L_{2k} √k ω + G, where G is the uniform bound on ‖∇f_B(Θ⋆)‖_F over all batches B and ω bounds ‖Θ⋆‖_∞ (see Assumption 1).
Lemma 2 directly implies that Theorem 1 can allow b to scale linearly with k while keeping Δ a constant². This is the key ingredient for achieving sub-quadratic complexity and a consistent estimate. We postpone the discussion of complexity to a later paragraph, and proceed to finish the statistical analysis of gradient descent.
Convergence of IntHT: Consider IntHT with parameters set to η, k. For the purpose of analysis, we keep the definitions of Λ and Λ̃ from the analysis of ATEE, and further define k_Δ to be the number of top-2k elements whose magnitude is below Δ. Recall that K is the sparsity of Θ⋆; define ν = 1 + (ρ + √((4 + ρ)ρ))/2 and ρ = K/k, where ν measures the error induced by exact IHT (see Lemma 9 for detail). Denote B^t = {B_0, B_1, ..., B_t}. We have
Theorem 3 (Per-round convergence of IntHT). Following the above notation, the per-round convergence of Algorithm 1 satisfies the following:

²For now, we assume L_{2k} to be a constant independent of p, k.
We will discuss this in Theorem 5.

• If ATEE succeeds, i.e., Λ ⊆ Λ̃, then

E_{B^t}[‖Θ_t − Θ⋆‖²_F] ≤ κ_1 E_{B^{t−1}}[‖Θ_{t−1} − Θ⋆‖²_F] + σ²_GD + σ²_{Δ|GD},

where κ_1 = ν(1 − 2ηα_{2k} + 2η²L²_{2k}), σ²_{Δ|GD} = 4√(k_Δ) η √k ω Δ + 2 k_Δ η² Δ², and

σ²_GD = max_{|Ω|≤2k+K} [ 4νη √(2k) √k ω ‖P_Ω(∇F(Θ⋆))‖_F + 2νη² E_{B^t}[‖P_Ω(∇f_{B_t}(Θ⋆))‖²_F] ].

• If ATEE fails, i.e., Λ ⊄ Λ̃, then

E_{B^t}[‖Θ_t − Θ⋆‖²_F] ≤ κ_2 E_{B^{t−1}}[‖Θ_{t−1} − Θ⋆‖²_F] + σ²_GD + σ²_{Fail|GD},

where κ_2 = κ_1 + 2νηL_{2k} and σ²_{Fail|GD} = max_{|Ω|≤2k+K} [ 4νη √k ω E_{B^t}[‖P_Ω(∇f_{B_t}(Θ⋆))‖_F] ].

Remark 1. It is worth noting that σ_GD and σ_{Fail|GD} are both statistical errors, which in the noiseless case are 0. In the case that the magnitudes of the top-2k elements of the gradient are all greater than Δ, we have k_Δ = 0, which implies σ_{Δ|GD} = 0. In this case ATEE's approximation does not incur any additional error compared with exact IHT.

Theorem 3 shows that by setting k = Θ(K L²_{2k}/α²_{2k}) and η = α_{2k}/(2L²_{2k}), the parameter estimation can be improved geometrically when ATEE succeeds.
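This geometric error decay is easy to reproduce in a miniature simulation. The sketch below is our illustrative code, not the paper's implementation: it runs the exact-extraction analogue of the update (ATEE replaced by exact hard thresholding), with Rademacher features and squared loss, restricted to strictly upper-triangular interaction terms:

```python
import numpy as np

def hard_threshold(M, k):
    """H_k: keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(M)
    idx = np.unravel_index(np.argsort(np.abs(M), axis=None)[-k:], M.shape)
    out[idx] = M[idx]
    return out

rng = np.random.default_rng(0)
p, K, n, eta, T = 20, 4, 1000, 0.5, 150
upper = np.triu(np.ones((p, p), dtype=bool), 1)  # interaction terms j < k only

Theta_star = np.zeros((p, p))
supp = rng.choice(np.flatnonzero(upper), K, replace=False)
Theta_star.flat[supp] = rng.uniform(10, 20, K) * rng.choice([-1.0, 1.0], K)

X = rng.choice([-1.0, 1.0], size=(n, p))         # Rademacher features, as in the
y = np.einsum('ij,jk,ik->i', X, Theta_star, X)   # noiseless model y_i = x_i^T Θ* x_i

Theta = np.zeros((p, p))
errs = []
for _ in range(T):
    resid = np.einsum('ij,jk,ik->i', X, Theta, X) - y
    G = (X * resid[:, None]).T @ X / n           # (1/n) sum_i u_i x_i x_i^T
    Theta = hard_threshold(np.where(upper, Theta - eta * G, 0.0), K)
    errs.append(np.linalg.norm(Theta - Theta_star))
```

In this noiseless run the error ‖Θ_t − Θ⋆‖_F contracts by a roughly constant factor per round, matching the κ_1-contraction in the success branch of Theorem 3.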
We will show in Theorem 5 that with a sufficiently large batch size m, α_{2k} and L_{2k} are controlled and do not scale with k, p. When ATEE fails, it cannot make the Θ estimate worse by too much. Given that the success rate of ATEE is controlled in Theorem 1, this naturally suggests that we can obtain linear convergence in expectation. This leads to Theorem 4.
Define σ²_1 = σ²_GD + σ²_{Δ|GD}, and σ²_2 = σ²_GD + σ²_{Fail|GD}. Let φ_t be the success indicator of ATEE at time step t, and Φ^t = {φ_0, φ_1, ..., φ_t}. By Theorem 1, with d = 48 log 2ck, ATEE recovers the top-2k with probability at least (1 − 1/c), and we can show the convergence of Algorithm 1 as
Theorem 4 (Main result). Following the above notation, and writing κ̄ = κ_1 + (1/c)(κ_2 − κ_1), the expectation of the parameter recovery error of Algorithm 1 is bounded by

E_{B^t,Φ^t}[‖Θ_t − Θ⋆‖²_F] ≤ κ̄^t ‖Θ_0 − Θ⋆‖²_F + ((1 − κ̄^t)/(1 − κ̄)) ((1 − 1/c) σ²_1 + (1/c) σ²_2).

This shows that Algorithm 1 achieves linear convergence by setting c ≥ (κ_2 − κ_1)/(1 − κ_1). With c increasing, the error ball converges to σ²_1/(1 − κ_1). The proof follows directly by taking the expectation of the result obtained in Theorem 3 with the recovery success probability established in Theorem 1.
Computational analysis With linear convergence, the computational complexity is dominated by the complexity per iteration.
Before discussing the complexity, we first establish the dependency between L_k, α_k and m in the special case of quadratic regression, where the link function is the identity. Notice that similar results would hold for more general quadratic problems as well.
Theorem 5 (Minimum batch size). Consider a feature vector x ∈ R^p whose first p − 1 coordinates are drawn i.i.d. from a bounded distribution, and whose p-th coordinate is the constant 1. W.l.o.g., we assume the first p − 1 coordinates have zero mean and variance 1, and are bounded by B. With batch size m ≳ kB log p/ε², we have α_k ≥ 1 − ε, L_k ≤ 1 + ε with high probability.
Note that the sample complexity requirement matches the known information-theoretic lower bound for recovering a k-sparse Θ up to a constant factor. The proof is similar to the analysis of the restricted isometry property in sparse recovery. Recall that by Theorem 1, we have per-iteration complexity Õ(m(p + b)). Combining the results of Lemma 2, Theorems 4 and 5, we have the following corollary on the complexity:
Corollary 6 (Achieving sub-quadratic space and time complexity). In the case of quadratic regression, by setting the parameters as above, IntHT recovers Θ⋆ in expectation up to a noise ball, with linear convergence. The time and space complexity of IntHT is Õ(k(k + p)), which is sub-quadratic when k is O(p^γ) for γ < 1.
Note that the optimal time and space complexity is Ω(kp), since a minimum of Ω(k) samples are required for recovery, and Ω(p) for reading all entries. Corollary 6 shows the time and space complexity of IntHT is Õ(k(k + p)), which is nearly optimal.

5 Synthetic Experiments

To examine the sub-quadratic time and space complexity, we design three tasks to answer the following three questions: (i) Does Algorithm 1 maintain linear convergence despite the hard thresholding not being accurate? (ii) What is the dependency between b and k that guarantees successful recovery? (iii) What is the dependency between m and p that guarantees successful recovery? Recall that the per-iteration complexity of Algorithm 1 is Õ(m(p + b)), where b upper bounds the size of ATEE's output set, p is the dimension of the features, m is the batch size, and k is the sparsity of the estimate. It will become clear as we proceed how the three questions support sub-quadratic complexity.
Experimental setting We generate feature vectors x_i whose coordinates follow an i.i.d. uniform distribution on [−1, 1]. A constant 1 is appended to each feature vector to model the linear terms and the intercept. The true support is uniformly selected from all the interaction and linear terms, and the non-zero parameters are then generated uniformly on [−20, −10] ∪ [10, 20]. Note that for the experiment concerning the minimum batch size m, we instead use a Bernoulli distribution to generate both the features and the parameters, which reduces the variance across random runs and makes our phase transition plot clearer. The outputs y_i are generated following x_i⊤ Θ⋆ x_i. On the algorithm side, by default, we set p = 200, d = 3, K = 20, k = 3K, η = 0.2. Support recovery results for different b-K combinations are averaged over 3 independent runs; results for m-p combinations are averaged over 5 independent runs.
All experiments are terminated after 150 iterations.

Figure 1: Synthetic experiment results. Here b and m are the parameters used for IntHT and ATEE, where b upper-bounds the size of ATEE's output set and m is the batch size used for IntHT; recall that p is the dimension of the features and K is the sparsity of Θ⋆. (a) Inaccurate recovery using different ATEE output set sizes b: linear convergence holds even for small b, e.g., b = 360, when the parameter space has size around 20,000. (b) Support recovery results with different b and K: we observe a linear dependence between b and K. (c) Support recovery results with different m and p: m scales sub-linearly with p to ensure successful recovery.

Inaccurate support recovery with different b's  Figure 1-(a) shows convergence, measured by ‖Θ − Θ⋆‖_F, for multiple choices of b for ATEE in Algorithm 1. The dashed curve is obtained by replacing ATEE with exact top-element extraction (computing the gradient exactly and picking its top elements). This is statistically optimal, but comes with quadratic complexity. With a moderately large b, the inaccuracy induced by ATEE has a negligible impact on convergence. Therefore, Algorithm 1 maintains linear convergence even though the support recovery in each iteration is inexact. This aligns with Theorem 3. Given linear convergence, the per-iteration complexity dominates the overall complexity.

Dependency between b and sparsity k  We next examine the proper choice of b under different sparsity levels k (we use k = 3K). We vary the sparsity K from 1 to 30 and apply Algorithm 1 with b ranging from 30 to 600.
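The exact-top-element baseline behind the dashed curve can be sketched as follows. This is our own minimal reconstruction, not the paper's code: it exploits the rank-one-sum gradient structure Σ_i r_i x_i x_i^⊤ of the squared loss, and the dense gradient plus exact top-k projection here is precisely the quadratic-complexity step that ATEE replaces with sketching:

```python
import numpy as np

def hard_threshold(M, k):
    """Keep the k largest-magnitude entries of M and zero out the rest."""
    out = np.zeros_like(M)
    idx = np.argsort(np.abs(M), axis=None)[-k:]  # flat indices of top-k magnitudes
    out.flat[idx] = M.flat[idx]
    return out

def iht_quadratic(X, y, k, eta=0.2, iters=150):
    """IHT baseline with exact top-element extraction (quadratic complexity).

    Loss: (1/2m) * sum_i (x_i^T Theta x_i - y_i)^2. Its gradient w.r.t. Theta
    has the rank-one-sum structure (1/m) * sum_i r_i x_i x_i^T, with r_i the
    residual of sample i.
    """
    m, p = X.shape
    Theta = np.zeros((p, p))
    for _ in range(iters):
        r = np.einsum('ij,jk,ik->i', X, Theta, X) - y  # residuals r_i
        grad = (X * r[:, None]).T @ X / m              # (1/m) sum_i r_i x_i x_i^T
        Theta = hard_threshold(Theta - eta * grad, k)  # gradient step + exact top-k
    return Theta
```

The step size eta = 0.2 matches the default setting above; since the gradient of x^⊤Θx is the symmetric matrix x x^⊤, the iterates stay symmetric.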
As shown in Figure 1-(b), the minimum adequate choice of b scales at most linearly with k. This agrees with our analysis in Theorem 1. The per-iteration complexity then collapses to Õ(m(p + k)).

Dependency between batch size m and dimension p  Finally, we characterize the dependency between the minimum batch size m and the input dimension p; this completes our discussion of the per-iteration complexity. The batch size varies from 1 to 99, and the input dimension varies from 10 to 1000. In this experiment, we employ Algorithm 1 with ATEE replaced by exact top-k element extraction. Figure 1-(c) shows the support recovery success rate for each (m, p) combination. The minimum batch size scales logarithmically with the dimension p, as we proved in Theorem 5. Together with the previous experiment, this establishes the sub-quadratic complexity.

Figure 2: 3-order regression support recovery using different ATEE output set sizes b.

Higher order interaction  IntHT is also extensible to higher-order interactions. Specifically, by exploiting the similar gradient structure Σ_i r_i x_i ⊗ x_i ⊗ x_i, where r_i denotes the residual for (x_i, y_i) and ⊗ denotes the vector outer product, we can again combine sketching with high-dimensional optimization to achieve nearly linear time and space (for constant sparsity). For this experiment, we adopt a setting similar to that of the inaccurate support recovery with different b's experiment. The main difference is that we change the response from y_i = x_i^⊤ Θ⋆ x_i to y_i = Σ_{j,k,l} Θ_{j,k,l} x^j x^k x^l, where Θ is now a three-dimensional tensor. Further, we set the dimension of x to 30 and the sparsity K = 20.
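The third-order gradient structure mentioned above can be illustrated as follows. This is our own sketch (names are hypothetical): it forms the p³-entry gradient tensor densely and scans it for the top elements, which is exactly the costly step that the sketching-based ATEE avoids in the actual algorithm:

```python
import numpy as np

def cubic_response(X, Theta3):
    """y_i = sum_{j,k,l} Theta3[j,k,l] * x_i[j] * x_i[k] * x_i[l]."""
    return np.einsum('jkl,ij,ik,il->i', Theta3, X, X, X)

def cubic_gradient(X, y, Theta3):
    """Gradient of (1/2m) * sum_i r_i^2, with the rank-one-sum structure
    (1/m) * sum_i r_i * (x_i ⊗ x_i ⊗ x_i)."""
    m = X.shape[0]
    r = cubic_response(X, Theta3) - y  # residuals r_i
    return np.einsum('i,ij,ik,il->jkl', r, X, X, X) / m

def top_elements(T, k):
    """Exact top-k extraction: multi-indices of the k largest-magnitude entries.

    The sub-quadratic algorithm replaces this dense scan with ATEE (sketching).
    """
    flat = np.argsort(np.abs(T), axis=None)[-k:]
    return np.unravel_index(flat, T.shape)
```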
Figure 2 shows the support recovery results for third-order interaction terms under different settings of b, where b still bounds the size of ATEE's output set. We can see that IntHT maintains linear convergence in the higher-order setting as well.

Acknowledgement

We would like to acknowledge NSF grants 1302435 and 1564000 for supporting this research.