{"title": "Wavelet regression and additive models for irregularly spaced data", "book": "Advances in Neural Information Processing Systems", "page_first": 8973, "page_last": 8983, "abstract": "We present a novel approach for nonparametric regression using wavelet basis functions. Our proposal, waveMesh, can be applied to non-equispaced data with sample size not necessarily a power of 2. We develop an efficient proximal gradient descent algorithm for computing the estimator and establish adaptive minimax convergence rates. The main appeal of our approach is that it naturally extends to additive and sparse additive models for a potentially large number of covariates. We prove minimax optimal convergence rates under a weak compatibility condition for sparse additive models. The compatibility condition holds when we have a small number of covariates. Additionally, we establish convergence rates for when the condition is not met. We complement our theoretical results with empirical studies comparing waveMesh to existing methods.", "full_text": "Wavelet regression and additive models for\n\nirregularly spaced data\n\nAsad Haris\u2217\n\nDepartment of Biostatistics\nUniversity of Washington\n\nSeattle, WA 98195\naharis@uw.edu\n\nNoah Simon\n\nDepartment of Biostatistics\nUniversity of Washington\n\nSeattle, WA 98195\nnrsimon@uw.edu\n\nAli Shojaie\n\nDepartment of Biostatistics\nUniversity of Washington\n\nSeattle, WA 98195\nashojaie@uw.edu\n\nAbstract\n\nWe present a novel approach for nonparametric regression using wavelet basis\nfunctions. Our proposal, waveMesh, can be applied to non-equispaced data with\nsample size not necessarily a power of 2. We develop an ef\ufb01cient proximal gradient\ndescent algorithm for computing the estimator and establish adaptive minimax\nconvergence rates. 
The main appeal of our approach is that it naturally extends to additive and sparse additive models for a potentially large number of covariates. We prove minimax optimal convergence rates under a weak compatibility condition for sparse additive models; the compatibility condition holds when the number of covariates is small. We also establish convergence rates for the case where the condition fails. We complement our theoretical results with empirical studies comparing waveMesh to existing methods.

1 Introduction

We consider the canonical task of estimating a regression function, f, from observations {(x_i, y_i) : i = 1, . . . , n}, with x_i ∈ [0, 1]^p, y_i ∈ R and y_i = f(x_i) + ε_i (i = 1, . . . , n), where the ε_i are independent, mean-zero, sub-Gaussian random variables. A popular approach for estimating f is to use linear combinations of a pre-specified set of basis functions, e.g., polynomials, splines [Wahba, 1990], wavelets [Daubechies, 1992], or other systems [Čencov, 1962]. The weights, or coefficients, in such a linear combination are often determined by some form of penalized regression. In this paper, we focus on estimators that use wavelets. Wavelet-based estimators have compelling theoretical properties; however, a number of issues have limited their adoption in many nonparametric applications. The approach proposed in this paper overcomes these issues. Throughout the paper, we assume basic knowledge of wavelet methods, though some key points will be reviewed. For a detailed introduction to wavelets, see the books by Daubechies [1992], Percival and Walden [2006], Vidakovic [2009], Nason [2010], and Ogden [2012].

Wavelets are a system of orthonormal basis functions for L²([0, 1]). Wavelets are popular for representing functions because they allow time and frequency localization [Daubechies, 1990], as opposed to, say, Fourier bases, which allow only frequency localization.
Additionally, wavelet-based methods are computationally efficient. The main ingredient of wavelet regression is the discrete wavelet transform (DWT) and its inverse (IDWT), which can be computed in O(n) operations [Mallat, 1989]. Unfortunately, traditional wavelet methods require stringent conditions on the data, specifically that x_i = i/n with n = 2^J for some integer J. This is not a problem in many signal processing applications with regularly sampled signals; however, in general nonparametric regression, this condition will rarely be satisfied. A simple solution for general data types is to ignore the irregular spacing of the data [Cai and Brown, 1999, Sardy et al., 1999] and/or artificially extend the signal such that n = 2^J [Strang and Nguyen, 1996, Ch. 8]. Other solutions include transformations [Cai and Brown, 1998, Pensky and Vidakovic, 2001] or interpolation [Hall and Turlach, 1997, Kovac and Silverman, 2000, Antoniadis and Fan, 2001] of the data to a regular grid of size 2^J. The literature on univariate wavelet methods is quite extensive and cannot be adequately discussed within this manuscript. In contrast, the literature on wavelet methods for multiple covariates is rather limited, particularly when the number of covariates is large.

For the multivariate setting with x_i ∈ [0, 1]^p for p ≥ 2, we consider estimating an additive model, i.e., f̂(x_i) = Σ_j f̂_j(x_ij). Additive models naturally extend linear models to capture non-linear conditional relationships, while retaining some interpretability; they also do not suffer from the curse of dimensionality. Despite these benefits, wavelet-based additive models have received limited attention. This is most likely because data with multiple covariates are rarely available on a regular grid of size n = 2^J. Sardy and Tseng [2004] fit additive wavelet models by treating the data as if regularly spaced; however, they do not discuss the case when n is not a power of 2. A number of proposals transform the data to a regular grid [Amato and Antoniadis, 2001, Zhang and Wong, 2003, Grez and Vidakovic, 2018]. However, to do this, the density of the covariates must be estimated, which unnecessarily invokes the curse of dimensionality. In addition, to the best of our knowledge, there are no wavelet-based methods for fitting additive models in high dimensions (when p > n) that induce sparsity, i.e., that for many j give a solution with f̂_j ≡ 0.

In this paper, we give a simple proposal that effectively extends wavelet-based methods to nonparametric modeling with a potentially large number of covariates. We present an interpolation-based approach for dealing with irregularly spaced data when n is not necessarily a power of 2. However, unlike existing interpolation methods, we do not transform the raw data (x_i, y_i). As a result, our method naturally extends to additive and sparse additive models. We also propose a penalized estimation framework to induce sparsity in high dimensions. We develop a proximal gradient descent method for computing our estimator, which leverages fast algorithms for the DWT and sparse matrix multiplication. Furthermore, we establish adaptive minimax convergence rates (up to a log n factor) similar to those of existing wavelet methods for regularly spaced data. We also establish convergence rates for our (sparse) additive proposal for a potentially large number of covariates. We discuss an extension of our proposal to general convex loss functions, and a weighted variation of our penalty which exhibits improved performance.

In Section 2 we present our univariate, additive and sparse additive proposals.

∗Mailing address: Box 357232, University of Washington, Seattle, WA 98195-7232

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
The univariate case (p = 1) is mainly presented to motivate our proposal. We also present our main algorithm for computing the estimator. We establish convergence rates of our estimators in Section 3, and present empirical studies in Section 4. Concluding remarks are given in Section 5.

2 Methodology

2.1 Short background on wavelets

We begin with a quick review of wavelet methods for nonparametric regression covering 3 main ingredients: (1) wavelet basis functions, (2) the discrete wavelet transform (DWT) and, (3) shrinkage.

First, wavelets are a system of orthonormal basis functions for L²([0, 1]) or L²(R). The bases are generated by translations and dilations of special functions φ(·) and ψ(·), called the father and mother wavelet, respectively. In greater detail, for any j₀ ≥ 0, a function f ∈ L²([0, 1]) can be written as

    f(x) = Σ_{k=0}^{2^{j₀}−1} α_{j₀k} φ_{j₀k}(x) + Σ_{j=j₀}^{∞} Σ_{k=0}^{2^j−1} β_{jk} ψ_{jk}(x),    (1)

where

    φ_{jk}(x) = 2^{j/2} φ(2^j x − k),  ψ_{jk}(x) = 2^{j/2} ψ(2^j x − k).

The coefficients α_{j₀k} and β_{jk} are called the father and mother wavelet coefficients, respectively. The index j is called the resolution level and j₀ is the minimum resolution level. Different choices of φ and ψ generate various wavelet families. Popular choices are Daubechies [Daubechies, 1988], Coiflets [Daubechies, 1993], Meyer wavelets [Meyer, 1985], and Spline wavelets [Chui, 1992]; for an overview of wavelet families, see Ogden [2012]. Often functions with a truncated basis expansion are considered, i.e., functions of the form f(x) = Σ_{k=0}^{2^{j₀}−1} α_{j₀k} φ_{j₀k}(x) + Σ_{j=j₀}^{J} Σ_{k=0}^{2^j−1} β_{jk} ψ_{jk}(x), for some J. For regular data with x_i = i/n (i = 1, . . . , n) and n = 2^J for some J, we can calculate the vector f = [f(1/n), f(2/n), . . . , f(n/n)]^⊤ efficiently via our second ingredient, described next.

Any vector f = [f(1/n), f(2/n), . . . , f(n/n)]^⊤, for a function f with truncated wavelet basis expansion of order J, can be written as a linear combination of that truncated wavelet basis. In particular, f = W^⊤ d, where d = (α_{j₀0}, . . . , α_{j₀ 2^{j₀}−1}, β_{j₀0}, β_{j₀1}, . . . , β_{J 2^J−1})^⊤ is the vector of wavelet coefficients, and the rows of W contain the corresponding wavelet basis functions evaluated at x_i = i/n. Specifically, W is an orthogonal matrix with W_{li} ≈ φ_{jk}(i/n)/√n, or W_{li} ≈ ψ_{jk}(i/n)/√n, for some l; the √n factor is due to convention in the literature and software implementations. By orthogonality, d = W f; this transformation from f to its wavelet coefficients via multiplication by W is known as the discrete wavelet transform (DWT). The transformation from wavelet coefficients to fitted values, via multiplication by W^⊤, is known as the inverse discrete wavelet transform (IDWT). The DWT and IDWT can be computed in O(n) operations via Mallat's pyramid algorithm [Mallat, 1989]. However, this is only possible for n = 2^J.

Finally, shrinkage is employed to obtain estimates of the form f̂ = W^⊤ d̂; for ease of exposition, we will assume j₀ = 0, i.e., all except the first element of d correspond to mother wavelet coefficients. Our methodology and theoretical results do not depend on the choice of j₀. The wavelet shrinkage estimator is given by

    d̂ ← argmin_{d ∈ R^n} (1/2) ‖y − W^⊤ d‖²₂ + λ Σ_{i=2}^{n} |d_i|,    (2)

for a positive tuning parameter λ, and given data {(i/n, y_i) ∈ R² : i = 1, . . . , n}.
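To make the three ingredients concrete, the following is a minimal sketch (not the authors' code) of the closed form of problem (2): an orthonormal Haar DWT and IDWT in pure Python, followed by soft-thresholding of all but the scaling coefficient. The Haar family and all function names are our illustrative choices.

```python
import math

def haar_dwt(v):
    """Orthonormal Haar DWT of a vector whose length is a power of 2.
    Returns [scaling coeff, coarsest details, ..., finest details]."""
    n = len(v)
    assert n & (n - 1) == 0, "length must be a power of 2"
    approx, details_stack = list(v), []
    while len(approx) > 1:
        s = [(approx[2*i] + approx[2*i+1]) / math.sqrt(2) for i in range(len(approx)//2)]
        d = [(approx[2*i] - approx[2*i+1]) / math.sqrt(2) for i in range(len(approx)//2)]
        details_stack.append(d)
        approx = s
    out = approx[:]                  # single scaling coefficient (j0 = 0)
    for d in reversed(details_stack):
        out.extend(d)                # coarsest detail level first
    return out

def haar_idwt(c):
    """Inverse of haar_dwt (the IDWT)."""
    approx, pos = c[:1], 1
    while pos < len(c):
        d = c[pos:pos + len(approx)]
        pos += len(approx)
        nxt = []
        for s, w in zip(approx, d):
            nxt.append((s + w) / math.sqrt(2))
            nxt.append((s - w) / math.sqrt(2))
        approx = nxt
    return approx

def soft(z, lam):
    """Soft-thresholding operator sgn(z) * (|z| - lam)_+."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

def wavelet_shrink(y, lam):
    """Closed-form solution of a problem like (2): DWT, threshold every
    coefficient except the first (scaling) one, then invert."""
    d = haar_dwt(y)
    d_hat = d[:1] + [soft(z, lam) for z in d[1:]]
    return haar_idwt(d_hat)
```

With a very large λ all detail coefficients are zeroed and the fit collapses to the sample mean, illustrating why the ℓ₁ penalty induces sparsity across resolution levels.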
The ℓ₁ penalty, Σ_{i=2}^{n} |d_i| ≡ ‖d_{−1}‖₁, shrinks the wavelet coefficients and also induces sparsity; the sparsity is motivated by the desirable parsimony property of wavelets: many functions in L²([0, 1]) are sparse linear combinations of wavelet bases. The optimization problem (2) can be solved exactly as follows: define d̃ = W y, the DWT of y. Then d̂₁ = d̃₁ and d̂_i = sgn(d̃_i)(|d̃_i| − λ)₊ (i = 2, . . . , n), where (x)₊ = max(x, 0). Thus, for regularly spaced data with n = 2^J, wavelet bases provide an efficient nonparametric estimator. In the following subsection, we discuss some existing methods for dealing with irregularly spaced data and present our novel proposal, waveMesh.

2.2 A novel interpolation scheme

The common approach to dealing with irregularly spaced data is to map the observed outcomes {(x_i, y_i) ∈ [0, 1] × R : i = 1, . . . , n} to approximate outcomes on the regular grid {(i/K, y′_i) ∈ R² : i = 1, . . . , K} for K = 2^J for some integer J, via either interpolation or transformation of the data. The novelty of our approach is a reversal of the direction of interpolation, i.e., interpolation from fitted values on the regular grid i/K (i = 1, . . . , K) to approximated fits on the raw data x_i (i = 1, . . . , n). For our proposal, we require an interpolation scheme which can be written as a linear map. In greater detail, for any function f evaluated at a regular grid, f = [f(1/K), . . . , f(K/K)]^⊤, we require an interpolation scheme f̃(·) such that [f̃(x₁), . . . , f̃(x_n)]^⊤ = R f for some interpolation matrix R ∈ R^{n×K}. Linear interpolation is a natural choice, where

    f̃(x) = f(i/K) · [(i+1) − Kx] / [(i+1) − i] + f((i+1)/K) · [Kx − i] / [(i+1) − i],    (3)

for x ∈ (i/K, (i+1)/K], and f̃(x) = f(1/K) for x ≤ 1/K; and the interpolation matrix is

    R_ij = { 1,                j = 1, x_i ≤ 1/K
           { (j+1) − K x_i,    j = ⌊K x_i⌋, x_i ∈ (1/K, 1]
           { K x_i − (j−1),    j = ⌈K x_i⌉, x_i ∈ (1/K, 1]
           { 0,                otherwise.    (4)

Our proposal, waveMesh, solves the following convex optimization problem:

    d̂ ← argmin_{d ∈ R^K} (1/2) ‖y − R W^⊤ d‖²₂ + λ ‖d_{−1}‖₁,    (5)

where K = 2^⌈log₂ n⌉, d_{−1} = [d₂, . . . , d_K]^⊤ ∈ R^{K−1}, and W ∈ R^{K×K} is the usual DWT matrix. To evaluate the waveMesh estimate at a new point x ∈ R, one can use r(x)^⊤ W^⊤ d̂, where r is given by the chosen interpolation scheme. The advantage of waveMesh over existing methods is that it naturally extends to additive models. Given data {(x_i, y_i) ∈ R^{p+1} : i = 1, . . . , n}, let R_j ∈ R^{n×K} be the interpolation matrix corresponding to covariate j, i.e., R_j f = [f̃(x_{1j}), . . . , f̃(x_{nj})]^⊤. Then, waveMesh can be extended to fitting additive models via the following optimization problem:

    d̂₁, . . . , d̂_p ← argmin_{d₁,...,d_p ∈ R^K} (1/2) ‖y − Σ_{j=1}^{p} R_j W^⊤ d_j‖²₂ + λ Σ_{j=1}^{p} ‖d_{j,−1}‖₁,    (6)

and f̂ = [f̂(x₁), . . . , f̂(x_n)]^⊤ = Σ_{j=1}^{p} f̂_j = Σ_{j=1}^{p} R_j W^⊤ d̂_j. Finally, we can extend additive waveMesh to fitting sparse additive models for a potentially large number of covariates. This can be achieved by adding a sparsity-inducing penalty for each component f_j as follows:

    d̂₁, . . . , d̂_p ← argmin_{d₁,...,d_p ∈ R^K} (1/2) ‖y − Σ_{j=1}^{p} R_j W^⊤ d_j‖²₂ + Σ_{j=1}^{p} [ λ₁ ‖d_{j,−1}‖₁ + λ₂ ‖R_j W^⊤ d_j‖₂ ].    (7)

2.3 Algorithm for waveMesh and sparse additive waveMesh

We now present a proximal gradient descent algorithm [Parikh and Boyd, 2014] for solving the optimization problem (5). For convex loss ℓ and penalty P, proximal gradient descent iteratively finds the minimizer of {ℓ(d) + P(d)} via the iteration

    d^{(l+1)} ← argmin_{d ∈ R^K} (1/2) ‖( d^{(l)} − t_l ∇ℓ(d^{(l)}) ) − d‖²₂ + t_l P(d),

for a step size t_l > 0. The algorithm is guaranteed to converge as long as t_l ≤ L^{−1}, where L is the Lipschitz constant of ∇ℓ(·). The step size can be fixed or selected via a line search. For (5), we obtain the following iterative scheme:

    d^{(l+1)} ← argmin_{d ∈ R^K} (1/2) ‖{ (I_K − t_l R^⊤ R) W^⊤ d^{(l)} + t_l R^⊤ y } − W^⊤ d‖²₂ + t_l λ ‖d_{−1}‖₁.    (8)

Our algorithm has a number of desirable features which make it computationally efficient. Firstly, (8) is the traditional wavelet problem for regularly spaced data (2), with response vector r = {(I_K − t_l R^⊤ R) W^⊤ d^{(l)} + t_l R^⊤ y}. The vector r can be efficiently calculated via the sparsity of R and Mallat's algorithm for the DWT [Mallat, 1989]. Secondly, we can use a fixed step size t_l = L_max^{−1}, where L_max is the maximum eigenvalue of R^⊤ R.
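The piecewise rule (4) is easy to implement directly; below is an illustrative sketch (function name ours) that builds R row by row and handles the boundary case x_i ≤ 1/K and points that fall exactly on a grid point. Each row has at most two nonzero entries, which is the sparsity the algorithm of Section 2.3 exploits.

```python
import math

def interp_matrix(x, K):
    """Linear-interpolation matrix R (n x K) in the spirit of (4):
    maps values on the grid {1/K, 2/K, ..., K/K} to values at
    arbitrary points x_i in [0, 1]."""
    rows = []
    for xi in x:
        row = [0.0] * K
        t = K * xi
        if xi <= 1.0 / K:
            row[0] = 1.0                  # constant extrapolation below 1/K
        else:
            lo, hi = math.floor(t), math.ceil(t)
            if lo == hi:                  # x_i sits exactly on a grid point
                row[lo - 1] = 1.0
            else:                         # weights of the two bracketing grid points
                row[lo - 1] = (lo + 1) - t
                row[hi - 1] = t - (hi - 1)
        rows.append(row)
    return rows
```

Because the two weights in each row sum to one, R maps a constant vector to a constant vector, a quick sanity check on any implementation.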
Again, the maximum eigenvalue can be efficiently computed for sparse matrices; e.g., if R is the linear interpolation matrix, then R^⊤ R is tridiagonal, and its eigenvalues can be calculated in O(K log K) operations. The matrix R for linear interpolation needs to be computed only once and requires a sorting of the observations, i.e., O(n log n) operations. Finally, by taking advantage of Nesterov-style acceleration [Nesterov, 2007], the worst-case convergence rate of the algorithm after k steps can be improved from O(k^{−1}) to O(k^{−2}).

The procedure (8) can also be used to solve the additive (6) and sparse additive (7) extensions via a block coordinate descent algorithm. Specifically, given a set of estimates d_j (j = 1, . . . , p), we can fix all but one of the vectors d_j and optimize over the non-fixed vector by solving

    minimize_{d ∈ R^K} (1/2) ‖r_j − R_j W^⊤ d‖²₂ + λ₁ ‖d_{−1}‖₁ + λ₂ ‖R_j W^⊤ d‖₂,    (9)

for some vector r_j ∈ R^n. For additive waveMesh (λ₂ = 0), this reduces to the univariate problem, which can be solved via the algorithm (8). For sparse additive waveMesh (λ₂ ≠ 0), the problem can be solved by first solving (9) with λ₂ = 0, followed by a soft-scaling operation [Petersen et al., 2016, Lemma 7.1]. We detail our algorithm for sparse additive waveMesh in the supplementary material.

2.4 Some extensions and variations

In this subsection, we discuss some variations and extensions of waveMesh, namely (1) using a conservative order for the wavelet basis expansion, (2) extending waveMesh to more general loss functions and, (3) using a weighted ℓ₁ penalty for shrinkage of the wavelet coefficients.

While in (5) we set K = 2^⌈log₂ n⌉, we could instead set K to be any power of 2. Since the main computational step in our algorithm is the DWT and IDWT, which require O(K) operations, a smaller value of K can greatly reduce the computation time. Furthermore, using a smaller K can lead to superior predictive performance in some settings; this is formalized in our theoretical results of Section 3 and observed in the simulation studies of Section 4. In the supplementary material we present additional simulation studies comparing the prediction performance and computation time of waveMesh for various values of K.

Secondly, waveMesh can be extended to other loss functions appropriate for various data types. For example, we can extend our methodology to the setting of binary classification via a logistic loss function. Let y_i ∈ {−1, 1} (i = 1, . . . , n) be the observed response. For the univariate case, we get

    d̂ ← argmin_{d ∈ R^K} (1/2) Σ_{i=1}^{n} log( 1 + exp[ −y_i (R W^⊤ d)_i ] ) + λ ‖d_{−1}‖₁.    (10)

Like the least squares loss, (10) naturally extends to (sparse) additive models. The problem can be efficiently solved via a proximal gradient descent algorithm described in the supplementary material.

Finally, we consider a variation of our ℓ₁ penalty motivated by the SURESHRINK procedure of Donoho and Johnstone [1995]. For a vector d ∈ R^K of discrete father and mother wavelet coefficients, denote by d[j] the discrete mother wavelet coefficients at resolution level j. For this particular variation, we require that the minimum resolution level j₀ > 1.
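The proximal gradient scheme of Section 2.3 is easy to sketch on a generic ℓ₁-penalized least squares problem: a gradient step on the smooth loss followed by soft-thresholding (the proximal operator of the ℓ₁ penalty). The sketch below is a toy stand-in, not the paper's implementation: the design matrix A replaces RW^⊤, the whole coefficient vector is penalized, and names are ours.

```python
def soft(z, lam):
    """Soft-thresholding, the proximal operator of lam * |.|."""
    if abs(z) <= lam:
        return 0.0
    return (abs(z) - lam) * (1.0 if z > 0 else -1.0)

def ista(A, y, lam, t, iters=500):
    """Proximal gradient descent (ISTA) for
    (1/2) * ||y - A d||^2 + lam * ||d||_1,
    with fixed step size t <= 1 / L_max(A^T A)."""
    n, k = len(A), len(A[0])
    d = [0.0] * k
    for _ in range(iters):
        r = [sum(A[i][j] * d[j] for j in range(k)) - y[i] for i in range(n)]  # A d - y
        g = [sum(A[i][j] * r[i] for i in range(n)) for j in range(k)]         # A^T (A d - y)
        d = [soft(d[j] - t * g[j], t * lam) for j in range(k)]                # prox step
    return d
```

When A is the identity, the iteration converges in one step to the soft-thresholded data, matching the closed-form solution of the regular-grid problem (2).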
We then propose to solve

    d̂ ← argmin_{d ∈ R^K} (1/2) ‖y − R W^⊤ d‖²₂ + λ Σ_{j=j₀}^{log₂ K} √(2 log j) ‖d[j]‖₁.    (11)

In the supplementary material we show that the above estimator outperforms the usual waveMesh estimator (5) in terms of prediction error.

3 Theoretical results

In this section, we study finite sample properties of our univariate estimator (5) and sparse additive estimator (7). We begin with a quick introduction to Besov spaces and their connection to wavelet bases. We establish minimax convergence rates (up to a log n factor) for our univariate proposal. We note that our estimator (5) can be seen as a lasso estimator [Tibshirani, 1996] with design matrix R W^⊤; this allows us to use well-known results for the lasso estimator to easily establish minimax rates, which we present below. Additionally, the lasso formulation allows us to establish sufficient conditions for the uniqueness of our estimator. Specifically, the fitted values f̂ = R W^⊤ d̂ are unique, whereas uniqueness of d̂ depends on the matrix R W^⊤. In the interest of brevity, we omit the derivation of sufficient conditions for uniqueness of d̂ and refer the interested reader to Tibshirani [2013]. Finally, we also establish rates for the sparse additive waveMesh proposal for a specific penalty.

Besov spaces on the unit interval, B^s_{q₁,q₂}, are function spaces with specific degrees of smoothness in their derivatives: for the Besov norm ‖·‖_{B^s_{q₁,q₂}}, B^s_{q₁,q₂} = {g ∈ L²([0, 1]) : ‖g‖_{B^s_{q₁,q₂}} < C}. The constants (s, q₁, q₂) are the parameters of the Besov space; for a function g ∈ L²([0, 1]) with the wavelet basis expansion (1), the Besov norm is defined as

    ‖g‖_{B^s_{q₁,q₂}} = ‖α_{j₀}‖_{q₁} + [ Σ_{j=j₀}^{∞} { 2^{j(s+1/2−1/q₁)} ‖β_j‖_{q₁} }^{q₂} ]^{1/q₂},    (12)

where α_{j₀} ∈ R^{2^{j₀}} is the vector of father wavelet coefficients with minimum resolution level j₀, and β_j ∈ R^{2^j} is the vector of mother wavelet coefficients at resolution level j. For completeness, we also define ‖g‖_{B^s_{q₁,∞}} = ‖α_{j₀}‖_{q₁} + sup_{j≥j₀} { 2^{j(s+1/2−1/q₁)} ‖β_j‖_{q₁} }. We consider Besov spaces because they generalize well-known classes such as the Sobolev (B^s_{2,2}, s = 1, 2, . . .) and Hölder (B^s_{∞,∞}, s > 0) spaces, and the class of functions of bounded total variation (sandwiched between B¹_{1,1} and B¹_{1,∞}). Our first result below establishes near-minimax convergence rates for the prediction error of our estimator. An attractive feature of our estimator is that it achieves this rate without any information about the parameters (s, q₁, q₂). We recover the usual wavelet rates of Donoho [1995] in the special case when x_i = i/n and R = I_n.
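For finite q₁, q₂, the norm (12) is a direct computation on the level-ordered coefficients; the following sketch (our names, toy inputs, finite q only) spells it out.

```python
def besov_norm(alpha_j0, betas, s, q1, q2, j0=0):
    """Discrete Besov norm in the spirit of (12), finite q1 and q2 only:
    ||alpha_{j0}||_{q1} plus the l_{q2} norm, over levels j >= j0, of
    2^{j(s + 1/2 - 1/q1)} * ||beta_j||_{q1}.
    `betas` is a list of per-level detail vectors [beta_{j0}, beta_{j0+1}, ...]."""
    def lq(v, q):
        return sum(abs(x) ** q for x in v) ** (1.0 / q)
    head = lq(alpha_j0, q1)
    terms = [(2.0 ** (j * (s + 0.5 - 1.0 / q1))) * lq(b, q1)
             for j, b in enumerate(betas, start=j0)]
    return head + lq(terms, q2)
```

For s = q₁ = q₂ = 1 the level weight reduces to 2^{j/2}, so finer levels are penalized more heavily, which is exactly the behavior the discrete penalty (14) mimics for B¹_{1,1}.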
Additionally, the theorem justifies the use of K < n basis functions: if the true function is sufficiently smooth, we recover the usual rates with an additional log K factor instead of log n.

Theorem 1 Suppose y_i = f⁰(x_i) + ε_i (i = 1, . . . , n) for mean-zero, sub-Gaussian noise ε_i. Define the estimator f̂ = R W^⊤ d̂ = [f̂(x₁), . . . , f̂(x_n)]^⊤ for the linear interpolation matrix R of (4), where

    d̂ ← argmin_{d ∈ R^K} (1/2) ‖y − R W^⊤ d‖²₂ + λ ‖d_{−1}‖₁,

for the usual DWT transform matrix W ∈ R^{K×K} associated with some orthogonal wavelet family. Further, define f⁰ = [f⁰(x₁), . . . , f⁰(x_n)]^⊤ and f̃⁰ = [f⁰(1/K), . . . , f⁰(K/K)]^⊤. Assume that f⁰ ∈ B^s_{q₁,q₂} and that the mother wavelet ψ has r null moments and r continuous derivatives, where r > max{1, s}. Suppose λ ≥ c₁ √(t² + 2 log K) for some t > 0. Then, for sufficiently large K (specifically K ≥ c₁ n^{1/(2s+1)} for some constant c₁), with probability at least 1 − 2 exp(−t²/2), we have

    (1/n) ‖f⁰ − f̂‖²₂ ≤ C (log K / n)^{2s/(2s+1)} + (1/n) ‖f⁰ − R f̃⁰‖²₂,

where the constant c₁ depends on R and the distribution of ε_i, and the constant C depends on R.

The above theorem includes an approximation error term ‖f⁰ − R f̃⁰‖²₂, which depends on the type of interpolation matrix R. For example, for linear interpolation of a twice continuously differentiable function, the approximation error scales as O(K^{−2}). Thus, for sufficiently large K (particularly K = n), the approximation error will disappear.
In fact, as long as the approximation error is of the order (log K/n)^{2s/(2s+1)}, we obtain the usual near-minimax rate.

For the sparse additive model, we consider a different penalty, motivated by the Besov norm (12). Our next theorem provides convergence rates for the estimated function f̂ = Σ_{j=1}^{p} f̂_j = Σ_{j=1}^{p} R_j W^⊤ d̂_j, where

    d̂₁, . . . , d̂_p ← argmin_{d₁,...,d_p ∈ R^K} (1/2) ‖y − Σ_{j=1}^{p} R_j W^⊤ d_j‖²₂ + Σ_{j=1}^{p} [ λ₁ P_s(d_j) + λ₂ ‖R_j W^⊤ d_j‖₂ ],    (13)

and the penalty P_s is the discrete version of the Besov norm for B¹_{1,1}. Specifically, for d a vector of father wavelet coefficients α_{j₀k} (k = 0, . . . , 2^{j₀} − 1) and mother wavelet coefficients β_{jk} (j = j₀, . . . , J; k = 0, . . . , 2^j − 1), the penalty is

    P_s(d) = Σ_{k=0}^{2^{j₀}−1} |α_{j₀k}| + Σ_{j=j₀}^{J} 2^{j(s−1/2)} ( Σ_{k=0}^{2^j−1} |β_{jk}| ).    (14)

Before presenting our next result, we state and discuss the so-called compatibility condition. This condition is common in the high-dimensional literature [van de Geer and Bühlmann, 2009] and crucial for proving minimax rates for sparse additive models. Briefly, our proof requires the semi-norms Σ_{j∈S} ‖f_j‖₂ and ‖Σ_{j=1}^{p} f_j‖₂ to be somehow 'compatible', for an index set S ⊆ {1, . . . , p}. In the low-dimensional/non-sparse case, i.e., S = {1, . . . , p}, the semi-norms are compatible by the inequality Σ_{j∈S} ‖f_j‖₂ ≤ √|S| ‖Σ_{j=1}^{p} f_j‖₂. The compatibility condition ensures such an inequality holds for proper subsets S. Furthermore, the compatibility condition can be relaxed at the cost of proving a slower rate; this is similar to the lasso slow rate [Dalalyan et al., 2017].

Definition 1 The compatibility condition is said to hold for an index set S ⊂ {1, 2, . . . , p}, with compatibility constant ϑ(S) > 0, if for all γ > 0 and any set of discrete wavelet coefficient vectors (d₁, . . . , d_p) that satisfy Σ_{j∈S^c} n^{−1} ‖R_j W^⊤ d_j‖₂ + γ Σ_{j=1}^{p} P_s(d_j) ≤ 3 Σ_{j∈S} n^{−1} ‖R_j W^⊤ d_j‖₂, it holds that

    Σ_{j∈S} ‖R_j W^⊤ d_j‖₂ ≤ √|S| ‖Σ_{j=1}^{p} R_j W^⊤ d_j‖₂ / ϑ(S).

Theorem 2 Assume the model y_i = f⁰(x_i) + ε_i (i = 1, . . . , n) with mean-zero, sub-Gaussian ε_i. Let f̂ = Σ_{j=1}^{p} f̂_j be as defined in (13), and let f* = Σ_{j∈S*} f*_j = Σ_{j∈S*} R_j W^⊤ d*_j be an arbitrary sparse additive function with S* ⊂ {1, 2, . . . , p}. Let ρ = κ max{n^{−2s/(2s+1)}, (log p/n)^{1/2}} for a constant κ that depends on the distribution of ε_i and s. Suppose λ ≥ 4ρ. Then, with probability at least 1 − 2 exp(−c₁ n ρ²) − c₂ exp(−c₃ n ρ²), we have

    n^{−1} ‖f⁰ − f̂‖²₂ ≤ C₁ max{ |S*| n^{−s/(2s+1)}, |S*| (log p/n)^{1/2} } + n^{−1} ‖f⁰ − f*‖²₂,

where the constants c₁, c₂ depend on the distribution of ε_i and s, and C₁ depends on κ and |S*|^{−1} Σ_{j∈S*} P_s(d*_j). Furthermore, if the compatibility condition holds for S* with constant ϑ(S*), we have

    n^{−1} ‖f⁰ − f̂‖²₂ ≤ C₂ max{ |S*| n^{−2s/(2s+1)}, |S*| log p / n } + 4 n^{−1} ‖f⁰ − f*‖²₂,

where the constant C₂ depends on ϑ(S*) and |S*|^{−1} Σ_{j∈S*} P_s(d*_j).

4 Numerical experiments

4.1 Experiments for univariate regression

We begin with a simulation to compare the performance of univariate waveMesh to the traditional interpolation method of Kovac and Silverman [2000], the isometric wavelet method of Sardy et al. [1999]—which treats the data as if it were regularly spaced—and the adaptive lifting method of Nunes et al. [2006]. The former two methods are implemented in the R package wavethresh [Nason, 2016] and the latter is implemented in the adlift package [Nunes and Knight, 2017].

We generate the data as y_i = f⁰(x_i) + ε_i (i = 1, . . . , n) for different choices of the function f⁰ and of n. The errors are distributed as ε_i ∼ N(0, σ²) with σ² chosen such that SNR = 5, where SNR = var(f⁰)/σ². We consider two different choices of covariate distribution: x_i ∼ U[0, 1] and x_i ∼ N(0, 1) scaled to lie in [0, 1]. We consider 6 different choices for the function f⁰: 1. polynomial, 2. sine, 3. piecewise polynomial, 4. heavy sine, 5. bumps and, 6. doppler. These functions are shown in Figure 1 of the supplementary material. We apply our proposal, waveMesh, the interpolation proposal of Kovac and Silverman [2000], and the isometric wavelet proposal of Sardy et al. [1999], for a sequence of 50 λ values linear on the log scale, and select the λ value that minimizes the mean squared error, MSE = n^{−1} ‖f⁰ − f̂‖²₂. For adaptive lifting, the R implementation automatically selects a tuning parameter.
We implement waveMesh using the linear interpolation matrix (4). We also implement waveMesh using a smaller grid, i.e., we fit (5) with K = 2⁵ and 2⁶. The R implementation of isometric wavelets requires the sample size to be a power of two; if it is not, we pad the response vector with zeros.

We also analyze the motorcycle data studied by Silverman [1985], consisting of 133 head acceleration measurements in a simulated motorcycle accident taken at 94 unequally spaced time points. To avoid the issue of repeated measurements, we average acceleration measurements taken at the same time, leading to a sample size of n = 94. Selection of the tuning parameter for waveMesh is done via 5-fold cross-validation. For the interpolation [Kovac and Silverman, 2000] and isometric [Sardy et al., 1999] wavelet proposals, we use the universal thresholding rule for tuning parameter selection [Donoho and Johnstone, 1994]; this rule leads to near-minimax convergence rates like that of Theorem 1.

Table 1 shows the ratio of MSE between our proposal with K = 2^⌈log₂ n⌉ and the other proposals for uniformly distributed x_i. We observe that our proposal has the smallest MSE for all functions except the Bumps function; even for the Bumps function, waveMesh exhibits superior prediction performance over the other methods for n = 512. We also observe that waveMesh with smaller values of K often outperforms the full waveMesh (K = 2^⌈log₂ n⌉) method in terms of MSE. Results for normally distributed x_i are given in the supplementary material. In that case, we again observe that waveMesh outperforms existing methods in a number of simulation scenarios, except for a few cases with the polynomial and bumps functions. Results for sample sizes that are not powers of two
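The universal thresholding rule mentioned above is simple to state: λ = σ̂ √(2 log n), with σ estimated from the finest-level detail coefficients via the median absolute deviation. The sketch below is the standard textbook form of the rule, not necessarily the exact variant in wavethresh; names are ours.

```python
import math

def universal_threshold(fine_detail_coeffs, n):
    """Universal threshold of Donoho and Johnstone [1994]:
    lambda = sigma_hat * sqrt(2 * log(n)), with sigma_hat the MAD of the
    finest-level detail coefficients divided by the normal consistency
    constant 0.6745."""
    m = sorted(abs(c) for c in fine_detail_coeffs)
    k = len(m)
    med = m[k // 2] if k % 2 == 1 else 0.5 * (m[k // 2 - 1] + m[k // 2])
    sigma_hat = med / 0.6745
    return sigma_hat * math.sqrt(2.0 * math.log(n))
```

Because the threshold grows like √(log n), it removes (with high probability) all coefficients that are pure noise, which is what drives the near-minimax guarantees cited above.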
Results for sample sizes that are not powers of two were similar to the results provided here. In the interest of brevity, these results are presented in the supplementary material.

Table 1: Results for xi ∼ U[0, 1] averaged over 100 replicates; the ratio MSE / MSE_FG is shown along with 100× the standard error, where MSE_FG is the MSE of waveMesh with K = 2^⌈log2 n⌉. Boldface values represent the method with the smallest MSE within each row; in rows with no boldface, the full-grid waveMesh baseline (ratio 1.00) had the smallest MSE.

| Function | n | waveMesh K = 2^5 | waveMesh K = 2^6 | Interpolation | Isometric | Adaptive Lifting |
|---|---|---|---|---|---|---|
| Polynomial | 64 | 1.19 (5.51) | 1.00 (0.00) | 1.24 (4.11) | 1.78 (7.56) | 4.28 (29.86) |
| | 128 | 0.92 (5.57) | **0.77 (3.07)** | 1.12 (6.00) | 1.33 (7.18) | 3.57 (31.27) |
| | 256 | 1.00 (6.20) | **0.85 (3.15)** | 1.61 (9.04) | 1.50 (7.67) | 4.29 (31.29) |
| | 512 | 0.78 (3.18) | **0.72 (2.58)** | 1.76 (6.11) | 1.13 (2.64) | 3.61 (26.47) |
| Sine | 64 | **0.97 (3.14)** | 1.00 (0.00) | 1.47 (5.81) | 1.59 (6.72) | 3.62 (33.65) |
| | 128 | **0.76 (3.18)** | **0.76 (1.96)** | 1.29 (6.08) | 1.46 (5.24) | 2.98 (19.78) |
| | 256 | **0.66 (2.50)** | 0.70 (2.22) | 1.93 (9.49) | 1.34 (4.23) | 3.41 (18.80) |
| | 512 | 0.57 (2.34) | **0.56 (2.22)** | 2.13 (7.78) | 1.24 (3.66) | 3.63 (28.42) |
| Piecewise Polynomial | 64 | **0.85 (1.97)** | 1.00 (0.00) | 1.18 (3.12) | 1.31 (3.62) | 1.63 (9.07) |
| | 128 | **0.77 (2.00)** | 0.82 (1.52) | 1.26 (2.75) | 1.22 (2.61) | 1.40 (7.36) |
| | 256 | 0.82 (1.92) | **0.79 (1.59)** | 1.42 (3.18) | 1.14 (2.11) | 1.15 (6.04) |
| | 512 | 1.01 (2.43) | **0.86 (1.70)** | 1.71 (3.56) | 1.15 (1.99) | 1.25 (7.24) |
| Heavy Sine | 64 | **0.84 (2.44)** | 1.00 (0.00) | 1.12 (3.04) | 1.41 (3.17) | 1.70 (8.35) |
| | 128 | **0.75 (2.66)** | 0.82 (1.16) | 1.17 (3.32) | 1.50 (4.75) | 1.56 (8.26) |
| | 256 | **0.66 (1.64)** | 0.72 (1.14) | 1.37 (2.98) | 1.33 (2.58) | 1.53 (6.74) |
| | 512 | **0.58 (1.59)** | 0.60 (1.18) | 1.58 (3.05) | 1.29 (1.60) | 1.50 (9.21) |
| Bumps | 64 | 2.11 (2.30) | 1.00 (0.00) | 1.70 (1.75) | **0.72 (1.34)** | 1.07 (5.12) |
| | 128 | 2.86 (2.77) | 2.11 (1.62) | 1.40 (1.59) | **0.63 (0.83)** | 0.85 (2.43) |
| | 256 | 4.81 (6.82) | 3.47 (4.39) | 1.43 (1.89) | **0.88 (0.99)** | 0.97 (2.00) |
| | 512 | 7.45 (9.13) | 5.69 (6.77) | 1.32 (1.35) | 1.19 (1.03) | 1.23 (2.34) |
| Doppler | 64 | **0.98 (1.69)** | 1.00 (0.00) | 1.15 (3.45) | 1.33 (3.20) | 1.30 (3.65) |
| | 128 | 1.24 (2.02) | **0.89 (1.04)** | 1.07 (2.13) | 1.44 (2.57) | 1.18 (3.22) |
| | 256 | 1.71 (3.92) | **0.94 (1.38)** | 1.20 (2.11) | 1.29 (1.99) | 1.30 (3.44) |
| | 512 | 2.58 (4.85) | 1.26 (2.01) | 1.21 (1.48) | 1.10 (1.31) | 1.23 (3.36) |

In Figure 1, we plot the motorcycle data and the fitted functions for each method. Here, waveMesh models the data reasonably via a smooth function; the interpolation method yields a similar but slightly more biased fit around 10 to 25 ms. Adaptive lifting and isometric wavelets lead to highly variable estimates.

4.2 Experiments for multivariate additive regression

We proceed with a simulation study to illustrate the performance of additive waveMesh compared to AMlet, the proposal of Sardy and Tseng [2004]. We use the author-provided R implementation of AMlet; due to a lack of R packages for other proposals, we defer those comparisons to future work. We consider the following simulation setting: we generate data with yi = f1(xi1) + f2(xi2) + f3(xi3) + f4(xi4) + εi (i = 1, . . . , 2^10), where εi ∼ N(0, σ²), xi ∼ U[0, 1], and σ² is chosen such that SNR = 10. The four functions f1, . . . , f4 are the polynomial, sine, piecewise polynomial, and heavy sine functions presented in Figure 1 of the supplementary material. We consider sample sizes n = 64, 100, 128, 256, 500, 512, and results were averaged over 100 data sets. For sample sizes that are not a power of 2, the response vector was padded with zeros for the R implementation of AMlet.
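Zero-padding a response vector to the next power of two, as done above for implementations that require dyadic sample sizes, can be sketched as follows (a generic helper, not the paper's code):

```python
import numpy as np

def pad_to_pow2(y):
    """Append zeros to y so that its length becomes the next power of two."""
    n = len(y)
    m = 1 << max(n - 1, 0).bit_length()   # smallest power of 2 with m >= n
    return np.concatenate([y, np.zeros(m - n)])

assert len(pad_to_pow2(np.ones(500))) == 512   # 500 -> 512
assert len(pad_to_pow2(np.ones(512))) == 512   # already a power of two
```

Padding is used here purely as a device to satisfy the dyadic-length requirement of these implementations; it can bias the fit near the right boundary.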
The universal threshold rule was used for AMlet, as detailed in Sardy and Tseng [2004]; 5-fold cross-validation was used to select λ for additive waveMesh.

For a real-world data analysis, we consider the Boston housing data analyzed by Ravikumar et al. [2009]. The goal is to predict the median value of homes from 10 predictors. The data consist of n = 506 observations; we use 256 observations for training and compute the test error on the remaining observations. Tuning parameters are selected in the same way as in the simulation study above.

Table 2 shows the MSE of both proposals for the various choices of n in the simulation study. The results clearly indicate that additive waveMesh offers a substantial improvement over AMlet, especially for smaller values of n. We observe similar results for the Boston housing data: the average test error is 21.2 for waveMesh (standard error 0.34) and 25.1 for AMlet (standard error 0.42). These results support our theoretical analysis and underscore the advantages of waveMesh in sparse high-dimensional additive models.

Figure 1: Fitted functions to the motorcycle accident dataset for each of the four methods.

5 Conclusion

In this paper, we introduced waveMesh, a novel method for non-parametric regression using wavelets. Unlike traditional methods, waveMesh requires neither that the covariates be uniformly spaced on the unit interval nor that the sample size be a power of 2. We achieve this through a novel interpolation approach for wavelets. The main appeal of our proposal is that it naturally extends to multivariate additive models with a potentially large number of covariates.

To compute the estimator, we proposed an efficient proximal gradient descent algorithm, which leverages existing techniques for fast computation of the DWT. We established minimax convergence rates for our univariate proposal over a large class of Besov spaces.
For a particular Besov space, we also established minimax convergence rates for our (sparse) additive framework. The R package waveMesh, which implements our methodology, will soon be publicly available on GitHub.

Table 2: MSE and standard error of waveMesh and AMlet averaged over 100 data sets.

| Method | n = 64 | n = 100 | n = 128 | n = 256 | n = 500 | n = 512 |
|---|---|---|---|---|---|---|
| waveMesh | 10.76 (0.31) | 11.35 (0.33) | 8.82 (0.24) | 5.45 (0.11) | 4.34 (0.08) | 4.08 (0.07) |
| AMlet | 100.48 (1.83) | 34.58 (1.05) | 45.49 (1.09) | 19.57 (0.33) | 10.67 (0.12) | 8.90 (0.11) |

Acknowledgments

We thank three anonymous referees for insightful comments that substantially improved the manuscript. We thank Professor Sylvain Sardy for providing software. This work was partially supported by National Institutes of Health grants to A.S. and N.S., and National Science Foundation grants to A.S.

References

Umberto Amato and Anestis Antoniadis. Adaptive wavelet series estimation in separable nonparametric regression models. Statistics and Computing, 11(4):373–394, 2001.

Anestis Antoniadis and Jianqing Fan. Regularization of wavelet approximations. Journal of the American Statistical Association, 96(455):939–967, 2001.

Tony Cai and Lawrence D Brown. Wavelet shrinkage for nonequispaced samples. The Annals of Statistics, 26(5):1783–1799, 1998.

Tony Cai and Lawrence D Brown. Wavelet estimation for samples with random uniform design. Statistics & Probability Letters, 42(3):313–321, 1999.

Nikolai N Čencov. Evaluation of an unknown distribution density from observations. Doklady, 3:1559–1562, 1962.

Charles K Chui. An introduction to wavelets, volume 38. SIAM, Philadelphia, 1992.

Arnak S Dalalyan, Mohamed Hebiri, and Johannes Lederer. On the prediction performance of the lasso. Bernoulli, 23(1):552–581, 2017.

Ingrid Daubechies. Orthonormal bases of compactly supported wavelets. Communications on Pure and Applied Mathematics, 41(7):909–996, 1988.

Ingrid Daubechies. The wavelet transform, time-frequency localization and signal analysis. IEEE Transactions on Information Theory, 36(5):961–1005, 1990.

Ingrid Daubechies. Ten lectures on wavelets, volume 61. SIAM, 1992.

Ingrid Daubechies. Orthonormal bases of compactly supported wavelets II. Variations on a theme. SIAM Journal on Mathematical Analysis, 24(2):499–519, 1993.

David L Donoho. De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41(3):613–627, 1995.

David L Donoho and Iain M Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.

David L Donoho and Iain M Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90(432):1200–1224, 1995.

German A Schnaidt Grez and Brani Vidakovic. Empirical wavelet-based estimation for non-linear additive regression models. ArXiv e-prints, March 2018.

Peter Hall and Berwin A Turlach. Interpolation methods for nonlinear wavelet regression with irregularly spaced design. The Annals of Statistics, 25(5):1912–1925, 1997.

Arne Kovac and Bernard W Silverman. Extending the scope of wavelet regression methods by coefficient-dependent thresholding. Journal of the American Statistical Association, 95(449):172–183, 2000.

Stephane G Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693, 1989.

Yves Meyer. Principe d'incertitude, bases hilbertiennes et algèbres d'opérateurs. Séminaire Bourbaki, 662:1985–1986, 1985.

Guy Nason. Wavelet methods in statistics with R. Springer Science & Business Media, 2010.

Guy Nason. wavethresh: Wavelets statistics and transforms, 2016. URL https://CRAN.R-project.org/package=wavethresh. R package version 4.6.8.

Yurii Nesterov. Gradient methods for minimizing composite objective function. Technical report, UCL, 2007.

Matthew A Nunes and Marina I Knight. adlift: An adaptive lifting scheme algorithm, 2017. URL https://CRAN.R-project.org/package=adlift. R package version 1.3-3.

Matthew A Nunes, Marina I Knight, and Guy P Nason. Adaptive lifting for nonparametric regression. Statistics and Computing, 16(2):143–159, 2006.

Todd Ogden. Essential wavelets for statistical applications and data analysis. Springer Science & Business Media, 2012.

Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

Marianna Pensky and Brani Vidakovic. On non-equally spaced wavelet regression. Annals of the Institute of Statistical Mathematics, 53(4):681–690, 2001.

Donald B Percival and Andrew T Walden. Wavelet methods for time series analysis, volume 4. Cambridge University Press, 2006.

Ashley Petersen, Daniela Witten, and Noah Simon. Fused lasso additive model. Journal of Computational and Graphical Statistics, 25(4):1005–1025, 2016.

Pradeep Ravikumar, John Lafferty, Han Liu, and Larry Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009–1030, 2009.

Sylvain Sardy and Paul Tseng. AMlet, RAMlet, and GAMlet: automatic nonlinear fitting of additive models, robust and generalized, with wavelets. Journal of Computational and Graphical Statistics, 13(2):283–309, 2004.

Sylvain Sardy, Donald B Percival, Andrew G Bruce, Hong-Ye Gao, and Werner Stuetzle. Wavelet shrinkage for unequally spaced data. Statistics and Computing, 9(1):65–75, 1999.

Bernhard W Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society: Series B (Methodological), pages 1–52, 1985.

Gilbert Strang and Truong Nguyen. Wavelets and filter banks. SIAM, 1996.

Robert J Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), pages 267–288, 1996.

Ryan J Tibshirani. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490, 2013.

Sara van de Geer and Peter Bühlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.

Brani Vidakovic. Statistical modeling by wavelets, volume 503. John Wiley & Sons, 2009.

Grace Wahba. Spline models for observational data. SIAM, 1990.

Shuanglin Zhang and Man-Yu Wong. Wavelet threshold estimation for additive regression models. The Annals of Statistics, 31(1):152–173, 2003.