{"title": "Sparse High-Dimensional Isotonic Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 12872, "page_last": 12882, "abstract": "We consider the problem of estimating an unknown coordinate-wise monotone function given noisy measurements, known as the isotonic regression problem. Often, only a small subset of the features affects the output. This motivates the sparse isotonic regression setting, which we consider here. We provide an upper bound on the expected VC entropy of the space of sparse coordinate-wise monotone functions, and identify the regime of statistical consistency of our estimator. We also propose a linear program to recover the active coordinates, and provide theoretical recovery guarantees. We close with experiments on cancer classification, and show that our method significantly outperforms several standard methods.", "full_text": "Sparse High-Dimensional Isotonic Regression\n\nDavid Gamarnik \u2217\n\nSloan School of Management\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\ngamarnik@mit.edu\n\nJulia Gaudio\u2020\n\nOperations Research Center\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\njgaudio@mit.edu\n\nAbstract\n\nWe consider the problem of estimating an unknown coordinate-wise monotone\nfunction given noisy measurements, known as the isotonic regression problem.\nOften, only a small subset of the features affects the output. This motivates the\nsparse isotonic regression setting, which we consider here. We provide an upper\nbound on the expected VC entropy of the space of sparse coordinate-wise mono-\ntone functions, and identify the regime of statistical consistency of our estimator.\nWe also propose a linear program to recover the active coordinates, and provide\ntheoretical recovery guarantees. 
We close with experiments on cancer classification, and show that our method significantly outperforms several standard methods.

1 Introduction

Given a partial order ≼ on R^d, we say that a function f : R^d → R is monotone if for all x1, x2 ∈ R^d such that x1 ≼ x2, it holds that f(x1) ≤ f(x2). In this paper, we study the univariate isotonic regression problem under the standard Euclidean partial order. Namely, we define the partial order ≼ on R^d as follows: x1 ≼ x2 if x1,i ≤ x2,i for all i ∈ {1, ..., d}. If f is monotone according to the Euclidean partial order, we say f is coordinate-wise monotone.
This paper introduces the sparse isotonic regression problem, defined as follows. Write x1 ≼_A x2 if x1,i ≤ x2,i for all i ∈ A. We say that a function f on R^d is s-sparse coordinate-wise monotone if for some set A ⊆ [d] with |A| = s, it holds that x1 ≼_A x2 ⇒ f(x1) ≤ f(x2). We call A the set of active coordinates. The sparse isotonic regression problem is to estimate the s-sparse coordinate-wise monotone function f from samples, knowing the sparsity level s but not the set A. Observe that if x and y are such that x_i = y_i for all i ∈ A, then x ≼_A y and y ≼_A x, so that f(x) = f(y). In other words, the value of f is determined by the active coordinates.
We consider two different noise models. In the Noisy Output Model, the input X is a random variable supported on [0, 1]^d, and W is zero-mean noise that is independent from X. The model is Y = f(X) + W. Let R be the range of f and let supp(W) be the support of W. We assume that both R and supp(W) are bounded. Without loss of generality, let R + supp(W) ⊆ [0, 1], where + is the Cartesian sum. In the Noisy Input Model, Y = f(X + W), and we exclusively consider the classification problem, namely f : R^d → {0, 1}.
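In code, the A-restricted dominance relation ≼_A and the observation that f is determined by its active coordinates can be sketched as follows (an illustrative Python helper; the function and variable names are ours, not from the paper):

```python
import numpy as np

def dominates(x1, x2, active=None):
    """Return True if x1 <=_A x2, i.e. x1[i] <= x2[i] for every i in A.

    With active=None the full Euclidean partial order on R^d is used.
    Illustrative helper; names are ours, not the paper's.
    """
    x1, x2 = np.asarray(x1), np.asarray(x2)
    idx = np.arange(len(x1)) if active is None else np.asarray(sorted(active))
    return bool(np.all(x1[idx] <= x2[idx]))

# A hypothetical 2-sparse monotone function in d = 4 with A = {0, 2}:
A = {0, 2}
f = lambda x: x[0] + max(x[2] - 0.5, 0.0)

x, y = np.array([0.1, 0.9, 0.2, 0.3]), np.array([0.4, 0.1, 0.6, 0.0])
assert dominates(x, y, A)   # x <=_A y even though x[1] > y[1]
assert f(x) <= f(y)         # monotonicity only constrains f through A
```

Note that `dominates(x, y, A)` holds here even though x exceeds y in the inactive coordinate 1, which is exactly why only the coordinates in A matter.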
In either noise model, we assume that n independent samples (X_1, Y_1), ..., (X_n, Y_n) are given.
The goal of our paper is to produce an estimator f̂_n and give statistical guarantees for it. To our knowledge, the only work that provides statistical guarantees on isotonic regression estimators in the Euclidean partial order setting with d ≥ 3 is the paper of Han et al. ([13]). The authors give guarantees on the empirical L2 loss, defined as

$$R(\hat f_n, f) = \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n \left( \hat f_n(X_i) - f(X_i) \right)^2 \right],$$

where the expectation is over the samples X_1, ..., X_n. In this paper, we instead expand on the approach of Gamarnik ([11]) to the high-dimensional sparse setting. It is shown in [11] that the expected Vapnik-Chervonenkis entropy of the class of coordinate-wise monotone functions grows subexponentially. The main result of [11] is a guarantee on the tail of ‖f̂_n − f‖_2. When X ∈ [0, 1]^2 and Y ∈ [0, 1] almost surely, it is claimed that

$$P\left( \|\hat f_n - f\|_2 > \epsilon \right) \le e^{\lceil 4/\epsilon^2 \rceil \sqrt{n} - \epsilon^4 n / 256},$$

where f̂_n is a coordinate-wise monotone function, estimated based on empirical mean squared error. However, the constants of the result are incorrect due to a calculation error, which we correct.

*http://web.mit.edu/gamarnik/www/home.html
†http://web.mit.edu/jgaudio/www/index.html

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
This result shows that the estimated function converges to the true function in L2, almost surely ([11]).
In this paper, we extend the work of [11] to the sparse high-dimensional setting, where the problem dimension d and the sparsity s may diverge to infinity as the sample size n goes to infinity. We propose two algorithms for the estimation of the unknown s-sparse coordinate-wise monotone function f. The simultaneous algorithm determines the active coordinates and the estimated function values in a single optimization formulation based on integer programming. The two-stage algorithm first determines the active coordinates via a linear program, and then estimates function values. The sparsity level is treated as constant or moderately growing. We give statistical consistency and support recovery guarantees for the Noisy Output Model, analyzing both the simultaneous and two-stage algorithms. We show that when n = max{e^{ω(s²)}, ω(s log d)}, the estimator f̂_n from the simultaneous procedure is statistically consistent. In particular, when the sparsity level s is of constant order, the dimension d is allowed to be much larger than the sample size. We note that, remarkably, when the maximum is dominated by ω(s log d), our sample performance nearly matches that of high-dimensional linear regression [2, 10]. For the two-stage approach, we show that if a certain signal strength condition holds and n = max{s e^{ω(s²)}, ω(s³ log d)}, the estimator is consistent.
We also give statistical consistency guarantees for the simultaneous and two-stage algorithms in the Noisy Input Model, assuming that the components of W are independent. We show that in the regime where a signal strength condition holds, s is of constant order, and n = ω(log d), the estimators from both algorithms are consistent.
The isotonic regression problem has a long history in the statistics literature; see for example the books [19] and [20].
The emphasis of most research in the area of isotonic regression has been the design of algorithms: for example, the Pool Adjacent Violators algorithm ([15]), active set methods ([1], [5]), and the Isotonic Recursive Partitioning algorithm ([16]). In addition to the univariate setting (f : R^d → R), the multivariate setting (f : R^d → R^q, q ≥ 2) has also been considered; see e.g. [21] and [22]. In the multivariate setting, whenever x1 ≼ x2 according to some defined partial order ≼, it holds that f(x1) ≼̃ f(x2), where ≼̃ is some other defined partial order. There are many applications for the coordinate-wise isotonic regression problem. For example, Dykstra and Robertson (1982) showed that isotonic regression could be used to predict college GPA from standardized test scores and high school GPA. Luss et al (2012) applied isotonic regression to the prediction of baseball players' salaries, from the number of runs batted in and the number of hits. Isotonic regression has found rich applications in biology and medicine, particularly to build disease models ([16], [23]).
The rest of the paper is structured as follows. Section 2 gives the simultaneous and two-stage algorithms for sparse isotonic regression. Section 3 and Section A of the supplementary material provide statistical consistency and recovery guarantees for the Noisy Output and Noisy Input models. All proofs can be found in the supplementary material. In Section 4, we provide experimental evidence for the applicability of our algorithms. We test our algorithm on a cancer classification task, using gene expression data.
Our algorithm achieves a success rate of about 96% on this task, significantly outperforming the k-Nearest Neighbors classifier and the Support Vector Machine.

2 Algorithms for sparse isotonic regression

In this section, we present our two algorithmic approaches for sparse isotonic regression: the simultaneous and two-stage algorithms. Recall that R is the range of f. In the Noisy Output Model, R ⊆ [0, 1], and in the Noisy Input Model, R = {0, 1}. We assume the following throughout.

Assumption 1. For each i ∈ A, the function f(x) is not constant with respect to x_i, i.e.

$$\int_{x \in \mathcal{X}} \left| f(x) - \int_{z \in \mathcal{X}} f(z)\, dz \right| dx > 0.$$

2.1 The Simultaneous Algorithm

The simultaneous algorithm solves the following problem.

min_{A,F} Σ_{i=1}^n (Y_i − F_i)²    (1)
s.t. |A| = s    (2)
     F_i ≤ F_j    if X_i ≼_A X_j    (3)
     F_i ∈ R    ∀ i    (4)

The estimated function f̂_n is determined by interpolating from the pairs (X_1, F_1), ..., (X_n, F_n) in a straightforward way. In particular, f̂_n(x) = max{F_i : X_i ≼ x}. In other words, we identify all points X_i such that X_i ≼ x and select the smallest consistent function value. We call this the "min" interpolation rule because it selects the smallest interpolation values. The "max" interpolation rule is f̂_n(x) = min{F_i : X_i ≽ x}.
Definition 1. For inputs X_1, ..., X_n, let q(i, j, k) = 1 if X_{i,k} > X_{j,k}, and q(i, j, k) = 0 otherwise.
Problem (1)-(4) can be encoded as a single mixed-integer convex minimization problem. We refer to the resulting Algorithm 1 as Integer Programming Isotonic Regression (IPIR) and provide its formulation below. Binary variables v_k indicate the estimated active coordinates; v_k = 1 means that the optimization program has determined that coordinate k is active.
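The two interpolation rules can be sketched as follows (a minimal Python illustration of the "min" and "max" rules; the fallback values for query points that dominate, or are dominated by, no training point are our own choice, since the paper does not specify them):

```python
import numpy as np

def dominates(a, b):
    """Euclidean partial order: a <= b coordinate-wise."""
    return bool(np.all(np.asarray(a) <= np.asarray(b)))

def f_hat_min(train_X, F, x):
    """The paper's 'min' rule: f_hat(x) = max{F_i : X_i <= x}.
    Falling back to min(F) on an empty set is our choice."""
    vals = [Fi for Xi, Fi in zip(train_X, F) if dominates(Xi, x)]
    return max(vals) if vals else min(F)

def f_hat_max(train_X, F, x):
    """The paper's 'max' rule: f_hat(x) = min{F_i : X_i >= x}.
    Falling back to max(F) on an empty set is our choice."""
    vals = [Fi for Xi, Fi in zip(train_X, F) if dominates(x, Xi)]
    return min(vals) if vals else max(F)

train_X = [(0.2, 0.2), (0.5, 0.4), (0.8, 0.9)]
F = [0.1, 0.4, 0.9]   # fitted values from the optimization step
assert f_hat_min(train_X, F, (0.6, 0.5)) == 0.4   # dominated points: first two
assert f_hat_max(train_X, F, (0.6, 0.5)) == 0.9   # dominating points: last one
```

Both rules produce a coordinate-wise monotone function that agrees with the fitted values F_i at the training points.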
The variables F_i represent the estimated function values at the data points X_i.

Algorithm 1 Integer Programming Isotonic Regression (IPIR)
Input: Values (X_1, Y_1), ..., (X_n, Y_n); sparsity level s
Output: An estimated function f̂_n
1: Solve the following optimization problem.

min_{v,F} Σ_{i=1}^n (Y_i − F_i)²    (5)
s.t. Σ_{k=1}^d v_k = s    (6)
     Σ_{k=1}^d q(i, j, k) v_k ≥ F_i − F_j    ∀ i, j ∈ {1, ..., n}    (7)
     v_k ∈ {0, 1}    ∀ k ∈ {1, ..., d}    (8)
     F_i ∈ R    ∀ i ∈ {1, ..., n}    (9)

2: Return the function f̂_n(x) = max{F_i : X_i ≼ x}.

We claim that Problem (5)-(9) is equivalent to Problem (1)-(4). Indeed, the monotonicity requirement is X_i ≼_A X_j ⇒ f(X_i) ≤ f(X_j). The contrapositive of this statement is f(X_i) > f(X_j) ⇒ X_i ⋠_A X_j; alternatively, f(X_i) > f(X_j) ⇒ ∃ k ∈ A s.t. X_{ik} > X_{jk}. The contrapositive is expressed by Constraints (7).
Recall that in the Noisy Input Model, the function f is binary-valued, i.e. R = {0, 1}. Let S⁺ = {i : Y_i = 1} and S⁻ = {i : Y_i = 0}. When {F_i}_{i=1}^n are binary-valued, it holds that Σ_{i=1}^n (Y_i − F_i)² = Σ_{i∈S⁺} (1 − F_i) + Σ_{i∈S⁻} F_i. Therefore, if we replace the objective function (5) by Σ_{i∈S⁺} (1 − F_i) + Σ_{i∈S⁻} F_i, we obtain an equivalent linear integer program.
Algorithm 1 when applied to the Noisy Output Model is a mixed-integer convex optimization program. When applied to the Noisy Input Model, it is a mixed-integer linear optimization program. While both are formally NP-hard in general, moderately-sized instances are solvable in practice.
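The paper solves Problem (5)-(9) with an integer programming solver (Gurobi). As a solver-free illustration for the binary-valued case R = {0, 1}, tiny instances of the equivalent Problem (1)-(4) can be brute-forced by enumerating supports A and labelings F; this is only a didactic stand-in for the MIP, with O(C(d, s) · 2^n) work:

```python
from itertools import combinations, product

def ipir_bruteforce(X, Y, s):
    """Exhaustive solver for Problem (1)-(4) with R = {0, 1} on tiny
    instances; an illustrative stand-in for the paper's MIP, not its code."""
    n, d = len(X), len(X[0])
    best = None
    for A in combinations(range(d), s):
        for F in product((0, 1), repeat=n):
            # monotonicity: X_i <=_A X_j implies F_i <= F_j
            ok = all(F[i] <= F[j]
                     for i in range(n) for j in range(n)
                     if all(X[i][k] <= X[j][k] for k in A))
            if not ok:
                continue
            err = sum((y - f) ** 2 for y, f in zip(Y, F))
            if best is None or err < best[0]:
                best = (err, A, F)
    return best  # (misclassifications, support, fitted values)

# 1-sparse example in d = 3: the label depends only on coordinate 0
X = [(0.1, 0.9, 0.5), (0.8, 0.2, 0.4), (0.3, 0.5, 0.9), (0.9, 0.1, 0.2)]
Y = [0, 1, 0, 1]
err, A, F = ipir_bruteforce(X, Y, s=1)
```

On this noiseless instance the brute force attains zero misclassifications with support A = {0}, mirroring what the integer program would return.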
The two-stage algorithm estimates the active coordinates through a linear program, using these to then estimate the function values. The process of estimating the active coordinates is referred to as support recovery. The active coordinates may be estimated all at once (Algorithm 2) or sequentially (Algorithm 3). Algorithm 2 is referred to as Linear Programming Support Recovery (LPSR) and Algorithm 3 is referred to as Sequential Linear Programming Support Recovery (S-LPSR). The two-stage algorithm for estimating f̂_n first estimates the set of active coordinates using the LPSR or S-LPSR algorithm, and then estimates the function values. The resulting algorithm is referred to as Two Stage Isotonic Regression (TSIR) (Algorithm 4).

Algorithm 2 Linear Programming Support Recovery (LPSR)
Input: Values (X_1, Y_1), ..., (X_n, Y_n); sparsity level s
Output: The estimated support, Â
1: Solve the following optimization problem.

min_{v,c} Σ_{i=1}^n Σ_{j=1}^n Σ_{k=1}^d c^{ij}_k    (10)
s.t. Σ_{k=1}^d v_k = s    (11)
     Σ_{k=1}^d q(i, j, k) (v_k + c^{ij}_k) ≥ 1    if Y_i > Y_j and Σ_{k=1}^d q(i, j, k) ≥ 1    (12)
     0 ≤ v_k ≤ 1    ∀ k ∈ {1, ..., d}    (13)
     c^{ij}_k ≥ 0    ∀ i ∈ {1, ..., n}, j ∈ {1, ..., n}, k ∈ {1, ..., d}    (14)

2: Determine the s largest values v_k, breaking ties arbitrarily. Let Â be the set of the corresponding s indices.

In Problem (10)-(14), the v_k variables are meant to indicate the active coordinates, while the c^{ij}_k variables act as corrections in the monotonicity constraints. For example, if for one of the constraints (12) we have Σ_{k=1}^d q(i, j, k) v_k = 0.7, then we will need to set c^{ij}_k = 0.3 for some (i, j, k) such that q(i, j, k) = 1.
The v_k's should therefore be chosen in a way that minimizes the total correction.
Algorithm 3 determines the active coordinates one at a time, setting s = 1 in Problem (10)-(14). Once a coordinate i is included in the set of active coordinates, the variable v_i is set to zero in future iterations.

Algorithm 3 Sequential Linear Programming Support Recovery (S-LPSR)
Input: Values (X_1, Y_1), ..., (X_n, Y_n); sparsity level s
Output: The estimated support, Â
1: B ← ∅
2: while |B| < s do
3:   Solve the optimization problem in Algorithm 2 with s = 1:

min_{v,c} Σ_{i=1}^n Σ_{j=1}^n Σ_{k=1}^d c^{ij}_k    (15)
s.t. Σ_{k=1}^d v_k = 1    (16)
     v_i = 0    ∀ i ∈ B    (17)
     Σ_{k=1}^d q(i, j, k) (v_k + c^{ij}_k) ≥ 1    if Y_i > Y_j and Σ_{k=1}^d q(i, j, k) ≥ 1    (18)
     0 ≤ v_k ≤ 1    ∀ k ∈ {1, ..., d}    (19)
     c^{ij}_k ≥ 0    ∀ i ∈ {1, ..., n}, j ∈ {1, ..., n}, k ∈ {1, ..., d}    (20)

4:   Identify i* such that v_{i*} = max_i {v_i}, breaking ties arbitrarily. Set B ← B ∪ {i*}.
5: end while
6: Return Â = B.

Algorithm 3' is defined to be the batch version of Algorithm 3. Namely, the n samples are divided into s batches of n/s samples each. The first iteration of the sequential procedure is performed on the first batch, the second iteration on the second batch, and so on.
We are now ready to state the two-stage algorithm for estimating the function f̂_n.

Algorithm 4 Two Stage Isotonic Regression (TSIR)
Input: Values (X_1, Y_1), ..., (X_n, Y_n); sparsity level s
Output: The estimated function, f̂_n
1: Estimate Â by using Algorithm 2, 3, or 3'.
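A minimal sketch of the LPSR linear program (10)-(14), using scipy's `linprog` in place of the paper's Gurobi/Java implementation. Variable names follow the paper; the solver choice, and introducing correction variables only for pairs with Y_i > Y_j (the unconstrained ones are zero at any optimum anyway), are ours:

```python
import numpy as np
from scipy.optimize import linprog

def lpsr(X, Y, s):
    """Sketch of Algorithm 2 (LPSR): minimize total correction subject to
    sum_k v_k = s and sum_k q(i,j,k)(v_k + c^ij_k) >= 1 for violating pairs."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    n, d = X.shape
    q = (X[:, None, :] > X[None, :, :]).astype(float)   # q[i,j,k] = 1{X_ik > X_jk}
    pairs = [(i, j) for i in range(n) for j in range(n)
             if Y[i] > Y[j] and q[i, j].sum() >= 1]
    nvar = d + d * len(pairs)                           # v block, then one c block per pair
    cost = np.zeros(nvar)
    cost[d:] = 1.0                                      # objective: total correction
    A_ub, b_ub = [], []
    for p, (i, j) in enumerate(pairs):                  # -sum_k q(v_k + c_k) <= -1
        row = np.zeros(nvar)
        row[:d] = -q[i, j]
        row[d + p * d: d + (p + 1) * d] = -q[i, j]
        A_ub.append(row)
        b_ub.append(-1.0)
    A_eq = np.zeros((1, nvar))
    A_eq[0, :d] = 1.0                                   # sum_k v_k = s
    bounds = [(0, 1)] * d + [(0, None)] * (d * len(pairs))
    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=[float(s)], bounds=bounds, method="highs")
    v = res.x[:d]
    return set(int(k) for k in np.argsort(-v)[:s])      # s largest v_k

# Noiseless 1-sparse example: the label depends only on coordinate 0
X = [[0.9, 0.1, 0.1], [0.2, 0.8, 0.9], [0.7, 0.3, 0.2], [0.1, 0.6, 0.4]]
Y = [1, 0, 1, 0]
assert lpsr(X, Y, 1) == {0}
```

Here every violating pair exceeds in coordinate 0 only, so the unique zero-correction optimum puts all mass on v_0.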
Let v_k = 1 if k ∈ Â and v_k = 0 otherwise.
2: Solve the following optimization problem.

min_F Σ_{i=1}^n (Y_i − F_i)²    (21)
s.t. Σ_{k=1}^d q(i, j, k) v_k ≥ F_i − F_j    ∀ i, j ∈ {1, ..., n}    (22)
     F_i ∈ R    ∀ i ∈ {1, ..., n}    (23)

In the Noisy Input Model, replace the objective with Σ_{i∈S⁺} (1 − F_i) + Σ_{i∈S⁻} F_i.
3: Return the function f̂_n(x) = max{F_i : X_i ≼ x}.

Both algorithms for support recovery are linear programs, which can be solved in polynomial time. The second step of Algorithm 4 when applied to the Noisy Output Model is a linearly-constrained quadratic minimization program that can be solved in polynomial time. The following lemma shows that Step 2 of Algorithm 4 when applied to the Noisy Input Model can be reduced to a linear program.
Lemma 1. Under the Noisy Input Model, replacing the constraints F_i ∈ {0, 1} with F_i ∈ [0, 1] in Problems (5)-(9) and (21)-(23) does not change the optimal value. Furthermore, there always exists an integer optimal solution that can be constructed from an optimal solution in polynomial time.

3 Results on the Noisy Output Model

Recall the Noisy Output Model: Y = f(X) + W, where f is an s-sparse coordinate-wise monotone function with active coordinates A. We assume throughout this section that X is a uniform random variable on [0, 1]^d, W is a zero-mean random variable independent from X, and the domain of f is [0, 1]^d. We additionally assume that Y ∈ [0, 1] almost surely. Up to shifting and scaling, this is equivalent to assuming that f has a bounded range and W has a bounded support.

3.1 Statistical consistency

In this section, we extend the results of [11], in order to demonstrate the statistical consistency of the estimator produced by Algorithm 1. The consistency will be stated in terms of the L2 norm error.
Definition 2 (L2 Norm Error).
For an estimator f̂_n, define

$$\|\hat f_n - f\|_2^2 \triangleq \int_{x \in [0,1]^d} \left( \hat f_n(x) - f(x) \right)^2 dx.$$

We call ‖f̂_n − f‖_2 the L2 norm error.
Definition 3 (Consistent Estimator). Let f̂_n be an estimator for the function f. We say that f̂_n is consistent if for all ε > 0, it holds that

$$\lim_{n \to \infty} P\left( \|\hat f_n - f\|_2 \ge \epsilon \right) = 0.$$

Theorem 1. The L2 error of the estimator f̂_n obtained from Algorithm 1 is upper bounded as

$$P\left( \|\hat f_n - f\|_2 \ge \epsilon \right) \le 8 \binom{d}{s} \left( \frac{128 \log(2)}{\epsilon^2} + 2 \right) \exp\left[ \frac{64}{\epsilon^2} 2^s n^{\frac{s-1}{s}} - \frac{\epsilon^4 n}{512} \right].$$

Corollary 1. When n = max{e^{ω(s²)}, ω(s log(d))}, the estimator f̂_n from Algorithm 1 is consistent. Namely, ‖f̂_n − f‖_2 → 0 in probability as n → ∞. In particular, if the sparsity level s is constant, the sample complexity is only logarithmic in the dimension.

3.2 Support recovery

In this subsection, we give support recovery guarantees for Algorithm 3. The guarantees will be in terms of the values p_k, defined below.
Definition 4. Let Y_1 = f(X_1) + W_1 and Y_2 = f(X_2) + W_2 be two independent samples from the model. For k ∈ A, let

$$p_k \triangleq P\left( Y_1 > Y_2 \mid q(1, 2, k) = 1 \right) - P\left( Y_1 < Y_2 \mid q(1, 2, k) = 1 \right).$$

Assume without loss of generality that A = {1, 2, ..., s} and p_1 ≤ p_2 ≤ ··· ≤ p_s.
Lemma 2. It holds that p_k > 0 for all k. In other words, when X_1 is greater than X_2 in at least one active coordinate, the output corresponding to X_1 is likely to be larger than the one corresponding to X_2.
Theorem 2.
Let B be the set of indices returned by running Algorithm 3' using n samples. Then it holds that B = A with probability at least

$$1 - ds \exp\left( -\frac{p_1^2 n}{64 s^3} \right).$$

Corollary 2. Assume that p_1 = Θ(1). Let n be the number of samples used by Algorithm 3'. If n = ω(s³ log(d)), then Algorithm 3' recovers the true support w.h.p. as n → ∞.
For x ∈ R^d, let x_A denote x restricted to the coordinates defined by the set A. Suppose that s is constant, and the sequence of functions {f_d} extends a function on s variables, i.e. f_d is defined as f_d(x) = g(x_A) for some g : [0, 1]^s → R. In that case, p_1 = Θ(1).
We can now give a guarantee of the success of Algorithm 4, using Algorithm 3' for support recovery.
Corollary 3. Assume that p_1 = Θ(1). Consider running Algorithm 4 using n samples for sequential recovery. Let m = n/s. Consider using an additional m samples for function value estimation, so that the total sample size is n + m. Let f̂_{n+m} be the estimated function. If n = max{ω(s³ log(d)), s e^{ω(s²)}}, then f̂_{n+m} is a consistent estimator.
Corollary 3 shows that if s is constant and the sequence of functions {f_d} extends a function of s variables, then Algorithm 4 produces a consistent estimator with n = ω(log(d)) samples. In the supplementary material, we state the statistical consistency results for the Noisy Input Model.

4 Experimental results

All algorithms were implemented in Java version 8, using Gurobi version 6.0.0.

4.1 Support recovery

We test the support recovery algorithms on random synthetic instances. Let A = {1, ..., s} without loss of generality. First, randomly sample r "anchor points" in [0, 1]^d, calling them Z_1, ..., Z_r. The parameter r governs the complexity of the function produced. In our experiment, we set r = 10. Next, randomly sample X1, . . .
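The quantities p_k of Definition 4 can be estimated empirically by replacing the conditional probabilities with pair frequencies over the sample. A hedged sketch (our construction, not code from the paper):

```python
import numpy as np

def estimate_pk(X, Y, k):
    """Empirical analogue of Definition 4, averaging over all ordered sample
    pairs (i, j) with q(i, j, k) = 1; an illustration, not the paper's code."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    gt = X[:, None, k] > X[None, :, k]            # q(i, j, k) = 1
    if not gt.any():
        return 0.0
    up = (Y[:, None] > Y[None, :])[gt].mean()     # estimates P(Y_i > Y_j | q = 1)
    down = (Y[:, None] < Y[None, :])[gt].mean()   # estimates P(Y_i < Y_j | q = 1)
    return float(up - down)

rng = np.random.default_rng(0)
X = rng.random((2000, 2))
Y = (X[:, 0] > 0.5).astype(float)                 # noiseless, 1-sparse: A = {0}
p_active, p_inactive = estimate_pk(X, Y, 0), estimate_pk(X, Y, 1)
assert p_active > p_inactive                      # the active coordinate stands out
```

For this toy model p_0 concentrates near 1/2 while p_1 concentrates near 0, which is the separation Lemma 2 and Theorem 2 rely on.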
, Xn in [0, 1]^d. For i ∈ {1, ..., n}, assign Y_i = 1 + W_i if Z_j ≼_A X_i for some j ∈ {1, ..., r}, and assign Y_i = W_i otherwise. The linear programming based algorithms for support recovery, LPSR and S-LPSR, are compared to the simultaneous approach, IPIR, which estimates the active coordinates while also estimating the function values. Note that even though the proof of support recovery using S-LPSR requires fresh data at each iteration, our experiments do not use fresh data. We keep s = 3 fixed and vary d and n. The error is Gaussian with mean 0 and variance 0.1, independent across coordinates. We report the percentages of successful recovery (see Table 1).
The IPIR algorithm performs the best in nearly all settings of (n, d). This suggests that the objective of the IPIR algorithm, to minimize the number of misclassifications on the data, gives the algorithm an advantage in selecting the true active coordinates. The LPSR algorithm outperforms the S-LPSR algorithm when d = 5, but the situation reverses for d ∈ {10, 20}. For n = 200 samples and d = 5, the LPSR algorithm correctly recovers the coordinates all but one time, while S-LPSR succeeds 86% of the time. For d = 10, LPSR and S-LPSR succeed 46% and 75% of the time, respectively; for d = 20, LPSR and S-LPSR succeed 30% and 63% of the time, respectively. It appears that determining the coordinates one at a time provides implicit regularization for larger d.
We additionally compare the accuracy in function estimation (Table 2). We found these results to be extremely encouraging. For n = 200 samples, the IPIR and S-LPSR algorithms had accuracy rates in the range of 87-90%.

Table 1: Performance of support recovery algorithms on synthetic instances.
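The synthetic instances described above can be generated as follows (our reimplementation from the description; the name `make_instance` and its defaults are ours, with noise variance 0.1 as in the experiments):

```python
import numpy as np

def make_instance(n, d, s, r=10, var=0.1, seed=None):
    """Synthetic data as in Section 4.1 (our reimplementation): anchors
    Z_1..Z_r, and Y_i = 1{Z_j <=_A X_i for some j} + W_i, where A is taken
    to be the first s coordinates and W_i is Gaussian with variance `var`."""
    rng = np.random.default_rng(seed)
    Z = rng.random((r, d))                              # anchor points
    X = rng.random((n, d))
    covered = np.zeros(n, dtype=bool)
    for j in range(r):                                  # f(x) = 1{some Z_j <=_A x}
        covered |= np.all(X[:, :s] >= Z[j, :s], axis=1)
    Y = covered.astype(float) + np.sqrt(var) * rng.standard_normal(n)
    return X, Y, covered

X, Y, labels = make_instance(n=200, d=10, s=3, seed=1)
# The noiseless label is 3-sparse coordinate-wise monotone in coordinates {0, 1, 2}:
for i in range(200):
    for j in range(200):
        if labels[i] and np.all(X[i, :3] <= X[j, :3]):
            assert labels[j]
```

The monotonicity check passes by construction: if some anchor is ≼_A-below X_i and X_i ≼_A X_j, the same anchor is ≼_A-below X_j.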
Each line of the table corresponds to 100 trials.

           IPIR               LPSR               S-LPSR
  n     d=5  d=10  d=20    d=5  d=10  d=20    d=5  d=10  d=20
  50     62    55    57     76    29     1     62    33    26
 100     92    85    90     92    33    13     76    56    49
 150     98    94    91     99    50    16     86    71    65
 200     95    99    92     99    46    30     86    75    63

Table 2: Accuracy of isotonic regression on synthetic instances. Each line of the table corresponds to 100 trials.

           IPIR                  LPSR                  S-LPSR
  n     d=5   d=10  d=20     d=5   d=10  d=20     d=5   d=10  d=20
  50   78.2   77.8  74.3    77.1   74.2  65.9    77.4   76.1  77.6
 100   85.1   85.8  81.7    84.2   77.6  75.0    84.1   83.9  84.6
 150   87.9   87.8  85.9    87.1   81.3  77.9    87.8   86.6  86.8
 200   89.2   89.8  87.5    89.0   83.6  83.4    89.1   88.9  88.3

4.2 Cancer classification using gene expression data

The presence or absence of a disease is believed to follow a monotone relationship with respect to gene expression. Similarly, classifying patients as having one of two diseases amounts to applying the monotonicity principle to a subpopulation of individuals having one of the two diseases. In order to assess the applicability of our sparse monotone regression approach, we apply it to cancer classification using gene expression data. The motivation for using a sparse model for disease classification is that certain genes should be more responsible for disease than others. Sparsity can be viewed as a kind of regularization; to prevent overfitting, we allow the regression to explain the results using only a small number of genes.
The data is drawn from the COSMIC database [9], which is widely used in quantitative research in cancer biology. Each patient in the database is identified as having a certain type of cancer. For each patient, gene expressions of tumor cells are reported as a z-score.
Namely, if µ_G and σ_G are the mean and standard deviation of the gene expression of gene G and x is the gene expression of a certain patient, then his or her z-score would be equal to (x − µ_G)/σ_G. We filter the patients by cancer type, selecting those with skin and lung cancer, two common cancer types. There are 236698 people with lung or skin cancer in the database, though the database only includes gene expression data for 1492 of these individuals. Of these, 1019 have lung cancer and 473 have skin cancer. A classifier always selecting "lung" would have an expected correct classification rate of 1019/1492 ≈ 68%. Therefore this rate should be regarded as the baseline classification rate.
Our goal is to use gene expression data to classify the patients as having either skin or lung cancer. We associate skin cancer with a "0" label and lung cancer with a "1" label. We only include the 20 most associated genes for each of the two types, according to the COSMIC website. This leaves 31 genes, since some genes appear on both lists. We additionally include the negations of the gene expression values as coordinates, since a lower gene expression of certain genes may promote lung cancer over skin cancer. The number of coordinates is therefore equal to 62. The number of active genes is varied between 1 and 5.
We perform both simultaneous and two-stage isotonic regression, comparing the IPIR and TSIR algorithms, using S-LPSR to recover the coordinates in the two-stage approach. Since for every gene, its negation also corresponds to a coordinate, we added additional constraints. In IPIR, we use variables v_k ∈ {0, 1} to indicate whether coordinate k is in the estimated set of active coordinates. In LPSR and S-LPSR, we use variables v_k ∈ [0, 1] instead.
In order to incorporate the constraints regarding negation of coordinates in IPIR, we included the constraint v_i + v_j ≤ 1 for pairs (i, j) such that coordinate j is the negation of coordinate i. In S-LPSR, once a coordinate v_i was selected, its negation was set to zero in future iterations. The LPSR algorithm, however, could not be modified to take this additional structure into account without using integer variables. Adding the constraints v_i + v_j ≤ 1 when coordinate j is the negation of coordinate i proved to be insufficient. Therefore, we do not include the LPSR algorithm in our experiments on the COSMIC database.
We compare our isotonic regression algorithms to two classical algorithms: k-Nearest Neighbors ([8]) and the Support Vector Machine ([4]). Given a test sample x and an odd number k, the k-Nearest Neighbors algorithm finds the k closest training samples to x. The label of x is chosen according to the majority of the labels of the k closest training samples. The SVM algorithm used is the soft-margin classifier with penalty C and polynomial kernel given by K(x, y) = (1 + x · y)^m. We have additionally implemented a version of kNN with dimensionality reduction, in an effort to reduce the curse of dimensionality suffered by kNN. Data points are compressed by Principal Component Analysis ([18]) prior to nearest-neighbor classification. However, this version of kNN performed worse than the basic version, and we omit the results.
In Table 3, each row is based on 10 trials, with 1000 test data points chosen uniformly and separately from the training points. The two-stage method was generally faster than the simultaneous method. With 200 training points and s = 3, the simultaneous method took 260 seconds on average per trial, while the two-stage method took only 42 seconds per trial. The simultaneous method became prohibitively slow for higher values of n.
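The feature-negation construction is simple to reproduce: each gene's z-score column is paired with its negation, and the pairing (k, k + d) is what the v_i + v_j ≤ 1 constraints above refer to. A small sketch (our code, not the paper's Java implementation):

```python
import numpy as np

def add_negations(X):
    """Append the negation of every coordinate (the Section 4.2 construction),
    so that a decreasing dependence on gene k becomes an increasing dependence
    on column k + d; column k + d is the negation partner of column k."""
    X = np.asarray(X, float)
    return np.hstack([X, -X])

Z = np.array([[1.2, -0.3, 0.7],
              [0.4,  2.0, -1.1]])      # hypothetical z-scores: 2 patients, 3 genes
X = add_negations(Z)
assert X.shape == (2, 6)               # d = 3 genes -> 6 coordinates
assert np.all(X[:, 3:] == -Z)          # pair (k, k + 3) are negations
```

In the isotonic formulations, at most one of the two partnered columns should be selected as active, which is exactly what the pairing constraints enforce.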
The averages for k-Nearest Neighbors and Support Vector Machine are taken as the best over parameter choices in hindsight. For k-Nearest Neighbors, k ∈ {1, 3, 5, 7, 9, 11, 15}, and for SVM, C ∈ {10, 100, 500, 1000} and m ∈ {1, 2, 3, 4}. The fact that the sparse isotonic regression method outperforms the k-NN classifier and the polynomial kernel SVM by such a large margin can be explained by a difference in structural assumptions; the results suggest that monotonicity, rather than proximity or a polynomial functional relationship, is the correct property to leverage.

Table 3: Comparison of classifier success rates on COSMIC data. In each cell, the first number is obtained with the "min" interpolation rule and the second with the "max" interpolation rule.

        IPIR (s =)                                             TSIR + S-LPSR (s =)                                    k-NN   SVM
  n     1          2          3          4          5          1          2          3          4          5
 100  83.1/83.9  84.6/91.8  76.8/91.0  66.2/85.7  53.8/75.7  82.4/82.9  84.6/90.4  77.8/88.9  73.0/87.4  65.4/83.3   69.8   63.8
 200  85.4/85.8  88.1/92.6  84.3/96.4  73.9/88.9  62.7/83.9  85.4/85.8  89.3/94.5  86.7/95.9  81.2/95.3  76.9/93.0   76.6   72.6
 300      -          -          -          -          -      84.7/85.1  91.7/94.2  89.0/95.6  84.4/95.9  80.2/94.8   76.6   74.2
 400      -          -          -          -          -      85.6/85.8  91.8/94.0  89.7/95.7  87.3/96.4  81.7/95.7   78.6   77.4

The results suggest that the correct sparsity level is s = 3. With n = 400 samples, the classification accuracy rate is 95.7%. When the sparsity level is too low, the monotonicity model is too simple to accurately describe the monotonicity pattern. On the other hand, when the sparsity level is too high, fewer points are comparable, which leads to fewer monotonicity constraints.
For n ∈ {100, 200} and s ∈ {1, 2, 3, 4, 5}, TSIR + S-LPSR does at least as well as IPIR on 15 out of the 20 (n, s) comparisons (counting the two interpolation rules separately), and strictly outperforms it on 12 of these. This result is surprising, because the synthetic experiments show that IPIR outperforms S-LPSR on support recovery.
We further investigate the TSIR + S-LPSR algorithm. Figure 1 shows how the two-stage procedure labels the training points. The high success rate of the sparse isotonic regression method suggests that this nonlinear picture is quite close to reality. The observed clustering of points may be a feature of the distribution of patients, or could be due to a saturation in measurement. Figure 2 studies the robustness of TSIR + S-LPSR. Additional synthetic zero-mean Gaussian noise is added to the inputs, with varying standard deviation. The "max" classification rule is used. 200 training points and 1000 test points were used. Ten trials were run, with one standard deviation error bars indicated in gray. The results indicate that TSIR + S-LPSR is robust to moderate levels of noise.
We note that because the gene expression is measured from tumor cells, much of the variation between the lung and skin cancer patients can be attributed to intrinsic differences between lung and skin cells. Still, this classification task is highly non-linear and challenging, as evidenced by the poor performance of other classifiers. We view these experiments as a proof-of-concept, showing that our algorithm can perform well on real data. An example of a more medically relevant application of our algorithm would be identifying patients as having cancer or not, using blood protein levels [3].

(a) s = 2.    (b) s = 3.

Figure 1: Illustration of the TSIR + S-LPSR algorithm.
Blue and red markers correspond to lung and skin cancer, respectively.

Figure 2: Robustness to error of TSIR + S-LPSR. (Accuracy (%) as a function of the standard deviation of the additional synthetic noise, ranging from 0 to 0.5.)

5 Conclusion

In this paper, we have considered the sparse isotonic regression problem under two noise models: Noisy Output and Noisy Input. We have formulated optimization problems to recover the active coordinates, and then estimate the underlying monotone function. We provide explicit guarantees on the performance of these estimators. We leave the analysis of Linear Programming Support Recovery (Algorithm 2) as an open problem. Finally, we demonstrate the applicability of our approach to a cancer classification task, showing that our methods outperform widely-used classifiers. While the task of classifying patients with two cancer types is relatively simple, the accuracy rates illustrate the modeling power of the sparse monotone regression approach.

References

[1] Michael J. Best and Nilotpal Chakravarti. Active set algorithms for isotonic regression; a unifying framework. Mathematical Programming, 47:425-439, 1990.

[2] Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.

[3] Joshua D. Cohen, Lu Li, Yuxuan Wang, Christopher Thoburn, Bahman Afsari, Ludmila Danilova, Christopher Douville, Ammar A. Javed, Fay Wong, Austin Mattox, et al. Detection and localization of surgically resectable cancers with a multi-analyte blood test. Science, 359(6378):926-930, 2018.

[4] Corinna Cortes and Vladimir Vapnik. Support-Vector networks.
Machine Learning, 20:273-297, 1995.

[5] Jan de Leeuw, Kurt Hornik, and Patrick Mair. Isotone optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and active set methods. Journal of Statistical Software, 32(5):1-24, 2009.

[6] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

[7] Richard L. Dykstra and Tim Robertson. An algorithm for isotonic regression for two or more independent variables. The Annals of Statistics, 10(3):708-716, 1982.

[8] E. Fix and J. L. Hodges. Discriminatory analysis: Nonparametric discrimination; consistency properties. Technical Report 4, Project 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.

[9] Simon A. Forbes, Nidhi Bindal, Sally Bamford, Charlotte Cole, Chai Yin Kok, David Beare, Mingming Jia, Rebecca Shepherd, Kenric Leung, Andrew Menzies, Jon W. Teague, Peter J. Campbell, Michael R. Stratton, and P. Andrew Futreal. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Research, 39(1):D945-D950, 2011.

[10] Simon Foucart and Holger Rauhut. A mathematical introduction to compressive sensing. Birkhäuser Springer, 2013.

[11] D. Gamarnik. Efficient learning of monotone concepts via quadratic optimization. In COLT, 1999.

[12] Dimitris Bertsimas, David Gamarnik, and John N. Tsitsiklis. Estimation of time-varying parameters in statistical models: an optimization approach. Machine Learning, 35(3):225-245, 1999.

[13] Qiyang Han, Tengyao Wang, Sabyasachi Chatterjee, and Richard J. Samworth. Isotonic regression in general dimensions. arXiv:1708.09468v1, 2017.

[14] David Haussler. Overview of the Probably Approximately Correct (PAC) learning framework. https://hausslergenomics.ucsc.edu/wp-content/uploads/2017/08/smo_0.pdf, 1995.

[15] J. B. Kruskal.
Nonmetric multidimensional scaling: A numerical method. Psychometrika, 29(2):115-129, 1964.

[16] Ronny Luss, Saharon Rosset, and Moni Shahar. Efficient regularised isotonic regression with application to gene-gene interaction search. The Annals of Applied Statistics, 6(1):253-283, 2012.

[17] Guy Moshkovitz and Asaf Shapira. Ramsey theory, integer partitions and a new proof of the Erdős-Szekeres theorem. Advances in Mathematics, 262:1107-1129, 2014.

[18] Karl Pearson. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559-572, 1901.

[19] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical inference under order restrictions. John Wiley & Sons, 1973.

[20] T. Robertson, F. T. Wright, and R. L. Dykstra. Order restricted statistical inference. John Wiley & Sons, 1988.

[21] S. Sasabuchi, M. Inutsuka, and D. D. S. Kulatunga. A multivariate version of isotonic regression. Biometrika, 70(2):465-472, 1983.

[22] Syoichi Sasabuchi, Makoto Inutsuka, and D. D. Sarath Kulatunga. An algorithm for computing multivariate isotonic regression. Hiroshima Mathematical Journal, 22:551-560, 1992.

[23] Michael J. Schell and Bahadur Singh. The reduced monotonic regression method. Journal of the American Statistical Association, 92(437):128-135, 1997.

[24] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.