{"title": "A Sharp Error Analysis for the Fused Lasso, with Application to Approximate Changepoint Screening", "book": "Advances in Neural Information Processing Systems", "page_first": 6884, "page_last": 6893, "abstract": "In the 1-dimensional multiple changepoint detection problem, we derive a new fast error rate for the fused lasso estimator, under the assumption that the mean vector has a sparse number of changepoints. This rate is seen to be suboptimal (compared to the minimax rate) by only a factor of $\\log\\log{n}$. Our proof technique is centered around a novel construction that we call a lower interpolant. We extend our results to misspecified models and exponential family distributions. We also describe the implications of our error analysis for the approximate screening of changepoints.", "full_text": "A Sharp Error Analysis for the Fused Lasso, with Application to Approximate Changepoint Screening

Kevin Lin, Carnegie Mellon University, Pittsburgh, PA 15213, kevinl1@andrew.cmu.edu
James Sharpnack, University of California, Davis, Davis, CA 95616, jsharpna@ucdavis.edu
Alessandro Rinaldo, Carnegie Mellon University, Pittsburgh, PA 15213, arinaldo@stat.cmu.edu
Ryan J. Tibshirani, Carnegie Mellon University, Pittsburgh, PA 15213, ryantibs@stat.cmu.edu

Abstract

In the 1-dimensional multiple changepoint detection problem, we derive a new fast error rate for the fused lasso estimator, under the assumption that the mean vector has a sparse number of changepoints. This rate is seen to be suboptimal (compared to the minimax rate) by only a factor of log log n. Our proof technique is centered around a novel construction that we call a lower interpolant. We extend our results to misspecified models and exponential family distributions. We also describe the implications of our error analysis for the approximate screening of changepoints.

1 Introduction

Consider the 1-dimensional multiple changepoint model

    yi = θ0,i + εi,    i = 1, ..., n,    (1)

where εi, i = 1, ..., n are i.i.d. errors, and θ0,i, i = 1, ..., n is a piecewise constant mean sequence, having a set of changepoints

    S0 = { i ∈ {1, ..., n−1} : θ0,i ≠ θ0,i+1 }.    (2)

This is a well-studied setting, and there is a large body of literature on estimation of the piecewise constant mean vector θ0 ∈ R^n and its changepoints S0 using various estimators; refer, e.g., to the surveys Brodsky and Darkhovski (1993); Chen and Gupta (2000); Eckley et al. (2011).

In this work, we consider the 1-dimensional fused lasso (also called 1d fused lasso, or simply fused lasso) estimator, which, given a data vector y ∈ R^n from a model as in (1), is defined by

    θ̂ = argmin_{θ ∈ R^n} (1/2) Σ_{i=1}^n (yi − θi)² + λ Σ_{i=1}^{n−1} |θi − θi+1|,    (3)

where λ ≥ 0 serves as a tuning parameter. This was proposed and named by Tibshirani et al. (2005), but the same idea was proposed earlier in signal processing, under the name total variation denoising, by Rudin et al. (1992). Variants of the fused lasso have been used in biology to detect regions where two genomic samples differ due to genetic variations (Tibshirani and Wang, 2008), in finance to detect shifts in the stock market (Chan et al., 2014), and in neuroscience to detect changes in stationary behaviors of the brain (Aston and Kirch, 2012).
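To make the optimization in (3) concrete, the following is a minimal sketch of our own (not the implementation used in the paper) that solves (3) by projected gradient on its dual problem, min over ‖u‖∞ ≤ λ of (1/2)‖y − Dᵀu‖²₂, recovering θ̂ = y − Dᵀu; dedicated solvers are far faster, but this suffices for small examples.

```python
import numpy as np

def fused_lasso(y, lam, n_iter=20000):
    """Sketch solver for (3): minimize 0.5*||y - theta||^2 + lam*sum_i |theta_i - theta_{i+1}|.
    Works by projected gradient on the dual, min_{||u||_inf <= lam} 0.5*||y - D^T u||^2,
    where D is the (n-1) x n difference operator in (6), and theta_hat = y - D^T u.
    Illustrative only; specialized dynamic-programming solvers are much faster."""
    y = np.asarray(y, dtype=float)
    u = np.zeros(y.size - 1)
    step = 0.25  # 1/L, with L = ||D D^T||_2 <= 4 for the 1d difference operator
    for _ in range(n_iter):
        theta = y.copy()
        theta[:-1] += u          # theta = y - D^T u, since (D^T u)_i = u_{i-1} - u_i
        theta[1:] -= u
        u = np.clip(u + step * np.diff(theta), -lam, lam)  # gradient step, then box projection
    theta = y.copy()
    theta[:-1] += u
    theta[1:] -= u
    return theta
```

With lam = 0 the output is y itself, and as lam grows the output approaches the constant vector whose value is the mean of y.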
Popularity of the fused lasso can be attributed in part to its computational scalability, the optimization problem in (3) being convex and highly structured. There has also been plenty of supporting statistical theory developed for the fused lasso, which we review in Section 2.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Notation. We will make use of the following quantities that are defined in terms of the mean θ0 in (1) and its changepoint set S0 in (2). We denote the size of the changepoint set by s0 = |S0|. We enumerate S0 = {t1, ..., ts0}, where 1 ≤ t1 < ... < ts0 < n, and for convenience we set t0 = 0, ts0+1 = n. The smallest distance between changepoints in θ0 is denoted by

    Wn = min_{i=0,1,...,s0} (ti+1 − ti),    (4)

and the smallest distance between consecutive levels of θ0 by

    Hn = min_{i ∈ S0} |θ0,i+1 − θ0,i|.    (5)

We use D ∈ R^{(n−1)×n} to denote the difference operator

    D = [ −1   1   0  ...   0   0
           0  −1   1  ...   0   0
           ...            ...
           0   0   0  ...  −1   1 ],    (6)

Note that s0 = ‖Dθ0‖0. We write D_S to extract the rows of D indexed by a subset S ⊆ {1, ..., n−1}, and D_{−S} to extract the rows in S^c = {1, ..., n−1} \ S.

For a vector x ∈ R^n, we use ‖x‖²_n = ‖x‖²₂/n to denote its length-scaled ℓ2 norm. For sequences an, bn, we use standard asymptotic notation: an = O(bn) to denote that an/bn is bounded for large enough n, an = Ω(bn) to denote that bn/an is bounded for large enough n, an = Θ(bn) to denote that both an = O(bn) and an = Ω(bn), an = o(bn) to denote that an/bn → 0, and an = ω(bn) to denote that bn/an → 0.
For random sequences An, Bn, we write An = OP(Bn) to denote that An/Bn is bounded in probability. A random variable Z is said to have a sub-Gaussian distribution provided that E(Z) = 0 and P(|Z| > t) ≤ 2 exp(−t²/(2σ²)) for all t ≥ 0 and some constant σ > 0.

Summary of results. Our main focus is on deriving a sharp estimation error bound for the fused lasso, parametrized by the number of changepoints s0 in θ0. We also study several consequences of our error bound and its analysis. A summary of our contributions is as follows.

• New error analysis for the fused lasso. In Section 3, we develop a new error analysis for the fused lasso, in the model (1) with sub-Gaussian errors. Our analysis leverages a novel quantity that we call a lower interpolant to approximate the fused lasso estimate (once it has been orthogonalized with respect to the changepoint structure of the mean θ0) with 2s0 + 2 monotonic segments, which allows for finer control of the empirical process term. When s0 = O(1), and the changepoint locations in S0 are (asymptotically) evenly spaced, our main result implies E‖θ̂ − θ0‖²_n = O(log n (log log n)/n) for the fused lasso estimator θ̂ in (3). This is slower than the minimax rate by a log log n factor. Our result improves on previously established results from Dalalyan et al. (2017), and after the completion of this paper, was itself improved upon by Guntuboyina et al. (2017) (who are able to remove the extraneous log log n factor).

• Extension to misspecified and exponential family models. In Section 4, we extend our error analysis to cover a mean vector θ0 that is not necessarily piecewise constant (or in other words, has potentially many changepoints). In Section 5, we extend our analysis to exponential family models. The latter extension, especially, is of practical importance, as many applications, e.g., CNV data analysis, call for changepoint detection on count data.

• Application to approximate screening and recovery. In Section 6, we establish that the maximum distance between any true changepoint and its nearest estimated changepoint is OP(log n (log log n)/H²_n) using the fused lasso, when s0 = O(1) and all changepoints are (asymptotically) evenly spaced. After applying a simple post-processing step, we show that the maximum distance between any estimated changepoint and its nearest true changepoint is of the same order. Our proof technique relies only on the estimation error rate of the fused lasso, and therefore immediately generalizes to any estimator of θ0, where the distance (for approximate changepoint screening and recovery) is a function of the inherent error rate.

The supplementary document gives numerical simulations that support the theory in this paper.

2 Preliminary review of existing theory

We begin by describing known results on the quantity ‖θ̂ − θ0‖²_n, the estimation error between the fused lasso estimate θ̂ in (3) and the mean θ0 in (1).

Early results on the fused lasso are found in Mammen and van de Geer (1997) (see also Tibshirani (2014) for a translation to a setting more consistent with that of the current paper). These authors study what may be called the weak sparsity case, in which it is assumed that ‖Dθ0‖1 ≤ Cn, with D being the difference operator in (6).
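As a quick aside, the quantities defined so far, namely s0 = ‖Dθ0‖0, the weak sparsity measure Cn = ‖Dθ0‖1, and the spacings Wn and Hn in (4) and (5), are all directly computable from θ0; here is a small numpy sketch of our own (`changepoint_summary` is a hypothetical helper name, not from the paper):

```python
import numpy as np

def changepoint_summary(theta0):
    """Compute s0 = ||D theta0||_0, Cn = ||D theta0||_1, and the spacing
    quantities Wn in (4) and Hn in (5), for a piecewise constant theta0."""
    theta0 = np.asarray(theta0, dtype=float)
    d = np.diff(theta0)                 # D theta0, with entries theta_{i+1} - theta_i
    S0 = np.flatnonzero(d)              # changepoint locations (0-indexed here)
    s0 = S0.size                        # l0 sparsity: the "strong sparsity" measure
    Cn = np.abs(d).sum()                # l1 total variation: the "weak sparsity" measure
    # block boundaries t_0 = 0 < t_1 < ... < t_{s0} < t_{s0+1} = n
    t = np.concatenate(([0], S0 + 1, [theta0.size]))
    Wn = int(np.diff(t).min())          # smallest gap between changepoints, as in (4)
    Hn = np.abs(d[S0]).min() if s0 else np.inf  # smallest level jump, as in (5)
    return s0, Cn, Wn, Hn
```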
Assuming additionally that the errors in (1) are sub-Gaussian, Mammen and van de Geer (1997) show that for a choice of tuning parameter λ = Θ(n^{1/3} Cn^{−1/3}), the fused lasso estimate θ̂ in (3) satisfies

    ‖θ̂ − θ0‖²_n = OP(n^{−2/3} Cn^{2/3}).    (7)

The weak sparsity setting is not the focus of our paper, but we still recall the above result to give a sense of the difference between the weak and strong sparsity settings, the latter being the setting in which we assume control over s0 = ‖Dθ0‖0, as we do in the current paper. Prior to this paper, the strongest result in the strong sparsity setting was given by Dalalyan et al. (2017), who assume N(0, σ²) errors in (1), and show that for λ = σ√(2n log(n/δ)), the fused lasso estimate satisfies

    ‖θ̂ − θ0‖²_n ≤ Cσ² (s0 log(n/δ)/n) [ log n + n/Wn ],    (8)

with probability at least 1 − 2δ, for large enough n, and a constant C > 0, where recall Wn is the minimum distance between changepoints in θ0, as in (4). Our main result in Theorem 1 improves upon (8) in two ways: by reducing the first log n term inside the brackets to log s0 + log log n, and reducing the second n/Wn term to √(n/Wn).

After our paper was completed, Guntuboyina et al. (2017) gave an even sharper error rate for the fused lasso (and more broadly, for the family of higher-order trend filtering estimates as defined in Steidl et al. (2006); Kim et al. (2009); Tibshirani (2014)).
Again assuming N(0, σ²) errors in (1), as well as Wn ≥ cn/(s0 + 1) for some constant c ≥ 1, these authors show that the family of fused lasso estimates {θ̂λ, λ ≥ 0} (using subscripts here to explicitly denote the dependence on the tuning parameter λ) satisfies

    inf_{λ ≥ 0} ‖θ̂λ − θ0‖²_n ≤ Cσ² ((s0 + 1)/n) log(en/(s0 + 1)) + 4σ²δ/n,    (9)

with probability at least 1 − exp(−δ), for large enough n, and a constant C > 0. The above bound is sharper than ours in Theorem 1 in that (log s0 + log log n) log n + √(n/Wn) is replaced essentially by log Wn. (Also, the result in (9) does not actually require Wn ≥ cn/(s0 + 1), but only requires the distance between changepoints where jumps alternate in sign to be larger than cn/(s0 + 1), which is another improvement.) Further comparisons will be made in Remark 1 following Theorem 1.

There are numerous other estimators, e.g., based on segmentation techniques or wavelets, that admit estimation results comparable to those above. These are described in Remark 2 following Theorem 1. Lastly, it can be seen that the minimax estimation error over the class of signals θ0 with s0 changepoints, assuming N(0, σ²) errors in (1), satisfies

    inf_{θ̂} sup_{‖Dθ0‖0 ≤ s0} E‖θ̂ − θ0‖²_n ≥ Cσ² (s0/n) log(n/s0),    (10)

for large enough n, and a constant C > 0. This says that one cannot hope to improve the rate in (9). The minimax result in (10) follows from standard minimax theory for sparse normal means problems, as in, e.g., Johnstone (2015); for a proof, see Padilla et al. (2016).

3 Sharp error analysis for the fused lasso estimator

Here we derive a sharper error bound for the fused lasso, improving upon the previously established result of Dalalyan et al. (2017) as stated in (8). Our proof is based on a concept that we call a lower interpolant, which as far as we can tell, is a new idea that may be of interest in its own right.

Theorem 1. Assume the data model in (1), with errors εi, i = 1, ..., n i.i.d. from a sub-Gaussian distribution. Then under a choice of tuning parameter λ = (nWn)^{1/4}, the fused lasso estimate θ̂ in (3) satisfies

    ‖θ̂ − θ0‖²_n ≤ γ²c (s0/n) [ (log s0 + log log n) log n + √(n/Wn) ],

with probability at least 1 − exp(−Cγ), for all γ > 1 and n ≥ N, where c, C, N > 0 are constants that depend only on σ (the parameter appearing in the sub-Gaussian distribution of the errors).

An immediate corollary is as follows.

Corollary 1. Under the same assumptions as in Theorem 1, we have

    E‖θ̂ − θ0‖²_n ≤ c (s0/n) [ (log s0 + log log n) log n + √(n/Wn) ],

for some constant c > 0.

We give some remarks comparing Theorem 1 to related results in the literature.

Remark 1 (Comparison to Dalalyan et al. (2017); Guntuboyina et al. (2017)). We can see that the result in Theorem 1 is sharper than that in (8) from Dalalyan et al. (2017) for any s0, Wn, as log s0 ≤ log n and √(n/Wn) ≤ n/Wn. Moreover, when s0 = O(1) and Wn = Θ(n), the rates are log n (log log n)/n and log² n/n from Theorem 1 and (8), respectively.

Comparing the result in Theorem 1 to that in (9) from Guntuboyina et al. (2017), the latter is sharper in that it reduces the factor of (log s0 + log log n) log n + √(n/Wn) to a single term of log Wn. In the case s0 = O(1) and Wn = Θ(n), the rates are log n (log log n)/n and log n/n from Theorem 1 and (9), respectively, and the latter rate cannot be improved, owing to the minimax lower bound in (10). Similar to our expectation bound in Corollary 1, Guntuboyina et al. (2017) establish

    inf_{λ ≥ 0} E‖θ̂λ − θ0‖²_n ≤ Cσ² ((s0 + 1)/n) log(en/(s0 + 1)),    (11)

for the family of fused lasso estimates {θ̂λ, λ ≥ 0}, for large enough n, and a constant C > 0. Like their high probability result in (9), their expectation result in (11) is stated in terms of an infimum over λ ≥ 0, and does not provide an explicit value of λ that attains the bound. (Inspection of their proofs suggests that it is not at all easy to make such a value of λ explicit.) Meanwhile, Theorem 1 and Corollary 1 have the advantage that this choice is made explicit, namely λ = (nWn)^{1/4}.

Remark 2 (Comparison to other estimators). Various other estimators obtain comparable estimation error rates. In what follows, all results are stated in the case s0 = O(1). The Potts estimator, defined by replacing the ℓ1 penalty Σ_{i=1}^{n−1} |θi − θi+1| in (3) with the ℓ0 penalty Σ_{i=1}^{n−1} 1{θi ≠ θi+1}, and denoted say by θ̂Potts, satisfies a bound ‖θ̂Potts − θ0‖²_n = O(log n/n) a.s., as shown by Boysen et al. (2009).
Wavelet denoising (placing weak conditions on the wavelet basis), denoted by θ̂wav, satisfies E‖θ̂wav − θ0‖²_n = O(log² n/n), as shown by Donoho and Johnstone (1994). Pairing unbalanced Haar (UH) wavelets with a basis selection method, Fryzlewicz (2007) developed an estimator θ̂UH with E‖θ̂UH − θ0‖²_n = O(log² n/n). Though they are not written in this form, the results in Fryzlewicz (2016) imply that his "tail-greedy" unbalanced Haar (TGUH) estimator, θ̂TGUH, satisfies ‖θ̂TGUH − θ0‖²_n = O(log² n/n) with probability tending to 1.

Here is an overview of the proof of Theorem 1. The full proof is deferred until the supplement, as with all proofs in this paper. We begin by deriving a basic inequality (stemming from the optimality of the fused lasso estimate θ̂ in (3)):

    ‖θ̂ − θ0‖²₂ ≤ 2εᵀ(θ̂ − θ0) + 2λ( ‖Dθ0‖1 − ‖Dθ̂‖1 ).    (12)

To precisely control the empirical process term εᵀ(θ̂ − θ0), we consider a decomposition

    εᵀ(θ̂ − θ0) = εᵀδ̂ + εᵀx̂,

where we define δ̂ = P0(θ̂ − θ0) and x̂ = P1 θ̂. Here P0 is the projection matrix onto the piecewise constant structure inherent in θ0, and P1 = I − P0. More precisely, writing S0 = {t1, ..., ts0} for the set of ordered changepoints in θ0, we define Bj = {tj + 1, ..., tj+1}, and denote by 1_{Bj} ∈ R^n the indicator of block Bj, for j = 0, ..., s0. In this notation, P0 is the projection onto the (s0 + 1)-dimensional linear subspace R = span{1_{B0}, ..., 1_{Bs0}}. The parameter δ̂ lies in a low-dimensional subspace, which makes bounding the term εᵀδ̂ relatively easy. Bounding the term εᵀx̂ requires a much more intricate argument, which is spelled out in the following lemmas.

Lemma 1 is a deterministic result ensuring the existence of what we call a lower interpolant ẑ to x̂. This interpolant approximates x̂ using 2s0 + 2 monotonic segments, and its empirical process term εᵀẑ can be finely controlled, as shown in Lemma 2. The residual from the interpolant approximation, denoted ŵ = x̂ − ẑ, has an empirical process term εᵀŵ that is more crudely controlled, in Lemma 3. Put together, as in εᵀx̂ = εᵀẑ + εᵀŵ, this gives the final control on εᵀx̂.

Before stating Lemma 1, we define the class of vectors containing the lower interpolant. Given any collection of changepoints t1 < ... < ts0 (and t0 = 0, ts0+1 = n), let M be the set of "piecewise monotonic" vectors z ∈ R^n, with the following properties, for each i = 0, ..., s0:

(i) there exists a point t′i such that ti + 1 ≤ t′i ≤ ti+1, and such that the absolute value |zj| is nonincreasing over the segment j ∈ {ti + 1, ..., t′i}, and nondecreasing over the segment j ∈ {t′i, ..., ti+1};

(ii) the signs remain constant on the monotone pieces,

    sign(z_{ti}) · sign(zj) ≥ 0,    j = ti + 1, ..., t′i,
    sign(z_{ti+1}) · sign(zj) ≥ 0,    j = t′i + 1, ..., ti+1.

Now we state our lemma that characterizes the lower interpolant.

Lemma 1. Given changepoints t0 < ... < ts0+1, and any x ∈ R^n, there exists a vector z ∈ M (not necessarily unique), such that the following statements hold:

    ‖D_{−S0} x‖1 = ‖D_{−S0} z‖1 + ‖D_{−S0}(x − z)‖1,    (13)
    ‖D_{S0} x‖1 = ‖D_{S0} z‖1 ≤ ‖D_{−S0} z‖1 + 4 (√s0/√Wn) ‖z‖2,    (14)
    ‖z‖2 ≤ ‖x‖2  and  ‖x − z‖2 ≤ ‖x‖2,    (15)

where D ∈ R^{(n−1)×n} is the difference matrix in (6). We call a vector z with these properties a lower interpolant to x.

Loosely speaking, the lower interpolant ẑ can be visualized by taking a string that lies initially on top of x̂, is nailed down at the changepoints t0, ..., ts0+1, and then pulled taut while maintaining that it is not greater (elementwise) than x̂, in magnitude. Here "pulling taut" means that ‖Dẑ‖1 is made small. Figure 1 provides illustrations of the interpolant ẑ to x̂ for a few examples.

Note that ẑ consists of 2s0 + 2 monotonic pieces. This special structure leads to a sharp concentration inequality. The next lemma is the primary contributor to the fast rate given in Theorem 1.

Lemma 2. Given changepoints t1 < ... < ts0, there exist constants cI, CI, NI > 0 such that when ε ∈ R^n has i.i.d. sub-Gaussian components,

    P( sup_{z ∈ M} |εᵀz| / ( ‖z‖2 √((log s0 + log log n) s0 log n) ) > γ cI ) ≤ 2 exp( −CI γ² c²I (log s0 + log log n) ),

for any γ > 1, and n ≥ NI.

Finally, the following lemma controls the residual ŵ = x̂ − ẑ.

Lemma 3. Given changepoints t1 < ... < ts0, there exist constants cR, CR > 0 such that when ε ∈ R^n has i.i.d. sub-Gaussian components,

    P( sup_{w ∈ R⊥} |εᵀw| / √( ‖D_{−S0} w‖1 ‖w‖2 ) > γ cR (n s0)^{1/4} ) ≤ 2 exp( −CR γ² c²R √s0 ),

for any γ > 1, where R⊥ is the orthogonal complement of R = span{1_{B0}, ..., 1_{Bs0}}.

Figure 1: The lower interpolants for two examples (in the left and right columns), each with n = 800 points. In the top row, the data y (in gray) and underlying signal θ0 (red) are plotted across the locations 1, ..., n; also shown is the fused lasso estimate θ̂ (blue). In the bottom row, the error vector x̂ = P1 θ̂ is plotted (blue) as well as the interpolant (black), and the dotted vertical lines (red) denote the changepoints t1, ..., ts0 of θ0.

4 Extension to misspecified models

We consider data from the model in (1) but where the mean θ0 is not necessarily piecewise constant (i.e., where s0 is potentially large). Let us define

    θ0(s) = argmin_{θ ∈ R^n} ‖θ0 − θ‖2  subject to  ‖Dθ‖0 ≤ s,    (16)

which we call the best s-approximation to θ0. We now present an extension of Theorem 1.

Theorem 2. Assume the data model in (1), with errors εi, i = 1, ..., n i.i.d. from a sub-Gaussian distribution.
For any s, consider the best s-approximation θ0(s) to θ0, as in (16), and let Wn(s) be the minimum distance between the s changepoints in θ0(s). Then under a choice of tuning parameter λ = (nWn(s))^{1/4}, the fused lasso estimate θ̂ in (3) satisfies

    ‖θ̂ − θ0‖²_n ≤ ‖θ0(s) − θ0‖²_n + γ²c (s/n) [ (log s + log log n) log n + √(n/Wn(s)) ],    (17)

with probability at least 1 − exp(−Cγ), for all γ > 1 and n ≥ N, where c, C, N > 0 are constants that depend only on σ. Further, if λ is chosen large enough so that ‖Dθ̂‖0 ≤ s on an event E, then

    ‖θ̂ − θ0(s)‖²_n ≤ γ²c (s/n) [ (log s + log log n) log n + √(n/Wn(s)) + λ²/n ],    (18)

on E intersected with an event of probability at least 1 − exp(−Cγ), for all γ > 1, n ≥ N, where c, C, N > 0 are the same constants as above.

The first result in (17) in Theorem 2 is a standard oracle inequality. It provides a bound on the error of the fused lasso estimator that decomposes into two parts, the first term being the approximation error, determined by the proximity of θ0(s) to θ0, and the second term being the usual bound we would encounter if the mean truly had s changepoints.

The second result in (18) in the theorem is a direct bound on the estimation error ‖θ̂ − θ0(s)‖²_n. We see that the estimation error can be small, apparently regardless of the size of ‖θ0(s) − θ0‖²_n, if we take λ to be large enough for θ̂ to itself have s changepoints. But the rate worsens as λ grows larger, so implicitly, the proximity of θ0(s) to θ0 does play a role (if θ0 were actually far away from a signal with s changepoints, then we may have to take λ very large to ensure that θ̂ has s changepoints).
Remark 3 (Comparison to other results). Dalalyan et al. (2017) and Guntuboyina et al. (2017) also provide oracle inequalities, and their results could be adapted to take forms as in Theorem 2. It is not clear to us that previous results on other estimators, such as those from Remark 2, adapt as easily.

5 Extension to exponential family models

We consider data y = (y1, ..., yn) ∈ R^n with independent components distributed according to

    p(yi; θ0,i) = h(yi) exp( yi θ0,i − Λ(θ0,i) ),    i = 1, ..., n.    (19)

Here, for each i = 1, ..., n, the parameter θ0,i is the natural parameter in the exponential family, and Λ is the cumulant generating function. As before, in the location model, we are mainly interested in the case in which the natural parameter vector θ0 is piecewise constant (with s0 denoting its number of changepoints, as before). Estimation is now based on penalization of the negative log-likelihood:

    θ̂ = argmin_{θ ∈ R^n} Σ_{i=1}^n ( −yi θi + Λ(θi) ) + λ Σ_{i=1}^{n−1} |θi − θi+1|.    (20)

Since the cumulant generating function Λ is always convex in exponential families, the above is a convex optimization problem. We present an estimation error bound for the present setting.

Theorem 3. Assume the data model in (19), with a strictly convex, twice continuously differentiable cumulant generating function Λ. Assume that θ0,i ∈ [l, u], i = 1, ..., n for constants l, u ∈ R, and add the constraints θi ∈ [l, u], i = 1, ..., n to the optimization problem in (20). Finally, assume that the random variables yi − E(yi), i = 1, ..., n obey a sub-Gaussian distribution, with parameter σ. Then under a choice of tuning parameter λ = (nWn)^{1/4}, the exponential family fused lasso estimate θ̂ in (20) (subject to the additional boundedness constraints) satisfies

    ‖θ̂ − θ0‖²_n ≤ γ²c (s0/n) [ (log s0 + log log n) log n + √(n/Wn) ],

with probability at least 1 − exp(−Cγ), for all γ > 1 and n ≥ N, where c, C, N > 0 are constants that depend only on l, u, σ.

Remark 4 (Roles of l, u). The restriction of θ0,i and the optimization parameters in (20) to [l, u], for i = 1, ..., n, is used to ensure that the second derivative of Λ is bounded away from zero. (The same property could be accomplished by instead adding a small squared ℓ2 penalty on θ in (20).) A more refined analysis could alleviate the need for this bounded domain (or extra squared ℓ2 penalty), but we do not pursue this, for simplicity.

Remark 5 (Sub-Gaussianity in exponential families). When are the random variables yi − E(yi), i = 1, ..., n sub-Gaussian, in an exponential family model (19)? A simple sufficient condition (not specific to exponential families, in fact) is that these centered variates are bounded. This covers the binomial model yi ~ Bin(k, μ(θ0,i)), where μ(θ0,i) = 1/(1 + e^{−θ0,i}), i = 1, ..., n, and k is a fixed constant. Hence Theorem 3 applies to binomial data.

For Poisson data yi ~ Pois(μ(θ0,i)), where μ(θ0,i) = e^{θ0,i}, i = 1, ..., n, we now give two options for the analysis. The first is to assume a maximum achievable count (which may be reasonable in CNV data) and then apply Theorem 3, owing again to boundedness.
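To illustrate the Poisson case of (20), here is a heuristic sketch of our own (not the paper's algorithm): the smooth part Σi(−yiθi + e^{θi}) has gradient −y + e^θ, the prox of the total variation penalty is itself a Gaussian fused lasso problem (solved crudely below by projected gradient on its dual), and the box constraint θi ∈ [l, u] (passed as lo, hi) is imposed by simple clipping, which only approximates the exact box-constrained prox.

```python
import numpy as np

def tv_prox(v, t, n_iter=2000):
    """prox of t*||D .||_1 at v, i.e., a Gaussian fused lasso subproblem,
    solved by projected gradient on its dual (a crude inner solver)."""
    u = np.zeros(v.size - 1)
    for _ in range(n_iter):
        theta = v.copy()
        theta[:-1] += u          # theta = v - D^T u
        theta[1:] -= u
        u = np.clip(u + 0.25 * np.diff(theta), -t, t)
    theta = v.copy()
    theta[:-1] += u
    theta[1:] -= u
    return theta

def poisson_fused_lasso(y, lam, lo=-2.0, hi=3.0, n_iter=200):
    """Heuristic proximal-gradient sketch for the Poisson case of (20):
    sum_i (-y_i*theta_i + exp(theta_i)) + lam*||D theta||_1, with each iterate
    clipped to [lo, hi] (clipping stands in for the exact constrained prox)."""
    y = np.asarray(y, dtype=float)
    theta = np.clip(np.log(np.maximum(y, 1e-1)), lo, hi)  # crude init near the MLE
    step = np.exp(-hi)  # 1/L: the Hessian of the smooth part is diag(exp(theta)) <= e^hi
    for _ in range(n_iter):
        grad = -y + np.exp(theta)                         # gradient of the Poisson NLL part
        theta = np.clip(tv_prox(theta - step * grad, step * lam), lo, hi)
    return theta
```

This is illustrative only; the very conservative step size 1/e^hi makes each proximal gradient step a descent step on the penalized likelihood whenever the clipping is inactive.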
The second is to invoke the fact that Poisson random variables have sub-exponential (rather than sub-Gaussian) tails, and then use a truncation argument, to show that for the Poisson fused lasso estimate θ̂ in (20) (under the additional boundedness constraints), with λ = log n (nWn)^{1/4},

    ‖θ̂ − θ0‖²_n ≤ γ²c (s0 log n/n) [ (log s0 + log log n) log n + √(n/Wn) ],    (21)

with probability at least 1 − exp(−Cγ) − 1/n, for all γ > 1 and n ≥ N, where c, C, N > 0 are constants depending on l, u. This is slower than the rate in Theorem 3 by a factor of log n.

Remark 6 (Comparison to other results). The results in Dalalyan et al. (2017); Guntuboyina et al. (2017) assume normal errors. It seems believable to us that the results of Dalalyan et al. (2017) could be extended to sub-Gaussian errors and hence exponential family data, in a manner similar to what we have done above in Theorem 3. To us, this is less clear for the results of Guntuboyina et al. (2017), which rely on some technical calculations involving Gaussian widths. It is even less clear to us how results from other estimators, as in Remark 2, extend to exponential family data.

6 Approximate changepoint screening and recovery

In many applications of changepoint detection, one may be interested in estimation of the changepoint locations in θ0, rather than the mean vector θ0 as a whole. In this section, we show that estimation of the changepoint locations and of θ0 itself are two very closely linked problems, in the following sense: any procedure with guarantees on its error in estimating θ0 automatically has certain approximate changepoint detection guarantees, and not surprisingly, a faster error rate (in estimating θ0) translates into a stronger statement about approximate changepoint detection.
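The distances used throughout this section (the changepoint set S(θ), the one-sided screening distance, and the Hausdorff distance, all defined formally below) are simple to compute; here is a small sketch of our own, with hypothetical helper names:

```python
import numpy as np

def changepoints(theta, tol=0.0):
    """S(theta): the set of indices i (1-indexed, i <= n-1) with theta_i != theta_{i+1}."""
    d = np.abs(np.diff(np.asarray(theta, dtype=float)))
    return set((np.flatnonzero(d > tol) + 1).tolist())

def d_screen(A, B):
    """d(A|B) = max_{b in B} min_{a in A} |a - b|: the furthest distance from an
    element of B to its closest element of A. Assumes A and B are nonempty."""
    return max(min(abs(a - b) for a in A) for b in B)

def hausdorff(A, B):
    """d_H(A, B) = max{ d(A|B), d(B|A) }: the Hausdorff distance."""
    return max(d_screen(A, B), d_screen(B, A))
```

Note the asymmetry of d(A|B): a small d(S(θ̃) | S0) says every true changepoint is near some estimated one (screening), while a small Hausdorff distance additionally rules out spurious estimated changepoints (recovery).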
We use this general link to prove new approximate changepoint screening results for the fused lasso. We also show that in general a simple post-processing step may be used to discard spurious detected changepoints, and again apply this to the fused lasso to yield new approximate changepoint recovery results.

It helps to introduce some additional notation. For a vector θ ∈ R^n, we write S(θ) for the set of its changepoint indices, i.e.,

    S(θ) = {i ∈ {1, . . . , n − 1} : θi ≠ θi+1}.

Recall, we abbreviate S0 = S(θ0) for the changepoints of the underlying mean θ0. For two discrete sets A, B, we define the metrics

    d(A|B) = max_{b ∈ B} min_{a ∈ A} |a − b|   and   dH(A, B) = max{d(A|B), d(B|A)}.

The first metric above can be seen as a one-sided screening distance from B to A, measuring the furthest distance of an element in B to its closest element in A. The second metric above is known as the Hausdorff distance between A and B.

Approximate changepoint screening. We present our general theorem on changepoint screening. The basic idea behind the result is quite simple: if an estimator misses a (large) changepoint in θ0, then its estimation error must suffer, and we can use this fact to bound the screening distance.

Theorem 4. Let θ̃ ∈ R^n be an estimator such that ‖θ̃ − θ0‖_n^2 = OP(Rn). Assume that nRn/Hn^2 = o(Wn), where, recall, Hn is the minimum gap between adjacent levels of θ0, defined in (5), and Wn is the minimum distance between adjacent changepoints of θ0, defined in (4). Then

    d(S(θ̃) | S0) = OP(nRn/Hn^2).

Remark 7 (Generic setting: no specific data model, and no assumptions on estimator). Importantly, Theorem 4 assumes no data model whatsoever, and treats θ̃ as a generic estimator of θ0. (Of course, through the statement ‖θ̃ − θ0‖_n^2 = OP(Rn), one can see that θ̃ is random, constructed from data that depends on θ0, but no specific data model is required, nor are any specific properties of θ̃, other than its error rate.) This flexibility allows the result to be applied in any problem setting in which one has control of the error in estimating a piecewise constant parameter θ0 (in some cases this may be easier to obtain, compared to direct analysis of detection properties). A similar idea was used (concurrently and independently) by Fryzlewicz (2016) in the analysis of the TGUH estimator.

Combining the above theorem with known error rates for the fused lasso estimator ((7) in the weak sparsity case, and Theorem 1 in the strong sparsity case) gives the following result.

Corollary 2. Assume the data model in (1), with errors εi, i = 1, . . . , n i.i.d. from a sub-Gaussian distribution. Let Cn = ‖Dθ0‖1, and assume that Hn = ω(n^{1/6} Cn^{1/3} / √Wn). Then the fused lasso estimator θ̂ in (3) with λ = Θ(n^{1/3} Cn^{−1/3}) satisfies

    d(S(θ̂) | S0) = OP(n^{1/3} Cn^{2/3} / Hn^2).     (22)

Alternatively, assume s0 = O(1), Wn = Θ(n), and Hn = ω(√(log n (log log n)/n)). Then the fused lasso with λ = Θ(√n) satisfies

    d(S(θ̂) | S0) = OP(log n (log log n) / Hn^2).     (23)

Remark 8 (Changepoint detection limit). The restriction Hn = ω(√(log n (log log n)/n)) for (23) in Corollary 2 is very close to the optimal detection limit of Hn = ω(1/√n): Duembgen and Walther (2008) showed that in the Gaussian changepoint model with a single elevated region, and Wn = Θ(n), there is no test for detecting a changepoint that has asymptotic power 1 unless Hn = ω(1/√n).

Combining Theorem 4 with (21) gives the following (a similar result holds for the binomial model).

Corollary 3. Assume yi ∼ Pois(e^{θ0,i}), independently, for i = 1, . . . , n, and assume ‖θ0‖∞ = O(1), s0 = O(1), Wn = Θ(n), and Hn = ω(log n √(log log n / n)). Then for the Poisson fused lasso estimator θ̂ in (20) (subject to appropriate boundedness constraints) with λ = Θ(log n √n), we have

    d(S(θ̂) | S0) = OP(log^2 n (log log n) / Hn^2).

Approximate changepoint recovery. We present a post-processing procedure for the estimated changepoints in θ̃, to eliminate changepoints of θ̃ that lie far away from changepoints of θ0. Our procedure is based on convolving θ̃ with a filter that resembles the mother Haar wavelet. Consider

    Fi(θ̃) = (1/bn) Σ_{j=i+1}^{i+bn} θ̃j − (1/bn) Σ_{j=i−bn+1}^{i} θ̃j,   for i = bn, . . . , n − bn,     (24)

for an integral bandwidth bn > 0.
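The Haar-style filter and the thresholding step that follows can be sketched in a few lines of Python (a minimal illustration, not the paper's implementation; the signal, bandwidth b, threshold tau, and helper names are all invented for this example):

```python
import numpy as np

def haar_filter(theta, b):
    """F_i(theta): average of the b entries after position i minus the
    average of the b entries ending at i, for 1-indexed i = b, ..., n - b."""
    n = len(theta)
    return {i: theta[i:i + b].mean() - theta[i - b:i].mean()
            for i in range(b, n - b + 1)}

def d(A, B):
    """One-sided screening distance d(A|B) = max_{b in B} min_{a in A} |a - b|."""
    return max(min(abs(a - b) for a in A) for b in B)

# Toy estimate: one large changepoint at i = 50 and one tiny spurious one
# at i = 80 (signal, bandwidth, and threshold are illustrative choices).
theta = np.concatenate([np.zeros(50), np.ones(30), np.full(20, 1.01)])
b, tau = 10, 0.5

F = haar_filter(theta, b)
# Keep locations where the filter is large in magnitude; the tiny jump at
# i = 80 only yields |F_i| of about 0.01 < tau, so it is screened out.
detected = sorted(i for i, f in F.items() if abs(f) >= tau)

# Hausdorff distance to the true changepoint set S0 = {50}:
dH = max(d(detected, [50]), d([50], detected))
print(detected, dH)  # detections cluster around 50; dH = 5 <= 2b
```

The surviving locations form a short run around the true changepoint and lie within the 2bn Hausdorff radius that Theorem 5 below guarantees, while the small spurious jump is discarded.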
By evaluating the filter Fi(θ̃) at all locations i = bn, . . . , n − bn, and retaining only locations at which the filter value is large (in magnitude), we can approximately recover the changepoints of θ0, in the Hausdorff metric.

Theorem 5. Let θ̃ ∈ R^n be such that ‖θ̃ − θ0‖_n^2 = OP(Rn). Consider the following procedure: we evaluate the filter in (24) with bandwidth bn at locations in

    IF(θ̃) = {i ∈ {bn, . . . , n − bn} : i ∈ S(θ̃), or i + bn ∈ S(θ̃), or i − bn ∈ S(θ̃)} ∪ {bn, n − bn},

and define a set of filtered points SF(θ̃) = {i ∈ IF(θ̃) : |Fi(θ̃)| ≥ τn}, for a threshold level τn. If bn, τn satisfy bn = ω(nRn/Hn^2), 2bn ≤ Wn, and τn/Hn → ρ ∈ (0, 1) as n → ∞, then

    P( dH(SF(θ̃), S0) ≤ 2bn ) → 1   as n → ∞.

Note that the set of filtered points SF(θ̃) in Theorem 5 is not necessarily a subset of the original set of estimated changepoints S(θ̃), but it has the property |SF(θ̃)| ≤ 3|S(θ̃)| + 2.

We finish with corollaries for the fused lasso. For space reasons, remarks comparing them to related approximate recovery results in the literature are deferred to the supplement.

Corollary 4. Assume the data model in (1), with errors εi, i = 1, . . . , n i.i.d. from a sub-Gaussian distribution. Let Cn = ‖Dθ0‖1. If we apply the post-processing procedure in Theorem 5 to the fused lasso estimator θ̂ in (3) with λ = Θ(n^{1/3} Cn^{−1/3}), bn = ⌊n^{1/3} Cn^{2/3} νn^2 / Hn^2⌋ ≤ Wn/2 for a sequence νn → ∞, and τn/Hn → ρ ∈ (0, 1), then

    P( dH(SF(θ̂), S0) ≤ 2 n^{1/3} Cn^{2/3} νn^2 / Hn^2 ) → 1   as n → ∞.     (25)

Alternatively, assuming s0 = O(1), Wn = Θ(n), if we apply the same post-processing procedure to the fused lasso with λ = Θ(√n), bn = ⌊log n (log log n) νn^2 / Hn^2⌋ ≤ Wn/2 for a sequence νn → ∞, and τn/Hn → ρ ∈ (0, 1), then

    P( dH(SF(θ̂), S0) ≤ 2 log n (log log n) νn^2 / Hn^2 ) → 1   as n → ∞.     (26)

Corollary 5. Assume yi ∼ Pois(e^{θ0,i}), independently, for i = 1, . . . , n, and assume ‖θ0‖∞ = O(1), s0 = O(1), Wn = Θ(n). If we apply the post-processing method in Theorem 5 to the Poisson fused lasso estimator θ̂ in (20) (subject to appropriate boundedness constraints) with λ = Θ(log n √n), bn = ⌊log^2 n (log log n) νn^2 / Hn^2⌋ ≤ Wn/2 for a sequence νn → ∞, and τn/Hn → ρ ∈ (0, 1), then

    P( dH(SF(θ̂), S0) ≤ 2 log^2 n (log log n) νn^2 / Hn^2 ) → 1   as n → ∞.

7 Summary

We gave a new error analysis for the fused lasso, with extensions to misspecified models and data from exponential families. We showed that error bounds for general changepoint estimators lead to approximate changepoint screening results, and, after post-processing, approximate recovery results.

Acknowledgements.
JS was supported by NSF Grant DMS-1712996. RT was supported by NSF Grant DMS-1554123.

References

John A. D. Aston and Claudia Kirch. Evaluating stationarity via change-point alternatives with applications to fMRI data. The Annals of Applied Statistics, 6(4):1906–1948, 2012.

Leif Boysen, Angela Kempe, Volkmar Liebscher, Axel Munk, and Olaf Wittich. Consistencies and rates of convergence of jump-penalized least squares estimators. The Annals of Statistics, 37(1):157–183, 2009.

Boris Brodsky and Boris Darkhovski. Nonparametric Methods in Change-Point Problems. Springer, 1993.

Ngai Hang Chan, Chun Yip Yau, and Rong-Mao Zhang. Group lasso for structural break time series. Journal of the American Statistical Association, 109(506):590–599, 2014.

Jie Chen and Arjun Gupta. Parametric Statistical Change Point Analysis. Birkhauser, 2000.

Arnak S. Dalalyan, Mohamed Hebiri, and Johannes Lederer. On the prediction performance of the lasso. Bernoulli, 23(1):552–581, 2017.

David L. Donoho and Iain M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.

Lutz Duembgen and Guenther Walther. Multiscale inference about a density. The Annals of Statistics, 36(4):1758–1785, 2008.

Idris Eckley, Paul Fearnhead, and Rebecca Killick. Analysis of changepoint models. In David Barber, Taylan Cemgil, and Silvia Chiappa, editors, Bayesian Time Series Models, chapter 10, pages 205–224. Cambridge University Press, Cambridge, 2011.

Piotr Fryzlewicz. Unbalanced Haar technique for nonparametric function estimation. Journal of the American Statistical Association, 102(480):1318–1327, 2007.

Piotr Fryzlewicz. Tail-greedy bottom-up data decompositions and fast multiple change-point detection. 2016. URL http://stats.lse.ac.uk/fryzlewicz/tguh/tguh.pdf.

Adityanand Guntuboyina, Donovan Lieu, Sabyasachi Chatterjee, and Bodhisattva Sen. Spatial adaptation in trend filtering. arXiv preprint arXiv:1702.05113, 2017.

Iain M. Johnstone. Gaussian Estimation: Sequence and Wavelet Models. Cambridge University Press, 2015. Draft version.

Seung-Jean Kim, Kwangmoo Koh, Stephen Boyd, and Dimitry Gorinevsky. ℓ1 trend filtering. SIAM Review, 51(2):339–360, 2009.

Enno Mammen and Sara van de Geer. Locally adaptive regression splines. The Annals of Statistics, 25(1):387–413, 1997.

Oscar Hernan Madrid Padilla, James Sharpnack, James Scott, and Ryan J. Tibshirani. The DFS fused lasso: Linear-time denoising over general graphs. arXiv preprint arXiv:1608.03384, 2016.

Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1–4):259–268, 1992.

Gabriel Steidl, Stephan Didas, and Julia Neumann. Splines in higher order TV regularization. International Journal of Computer Vision, 70(3):214–255, 2006.

Robert Tibshirani and Pei Wang. Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics, 9(1):18–29, 2008.

Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.

Ryan J. Tibshirani. Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics, 42(1):285–323, 2014.