{"title": "Nonparametric Reduced Rank Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1628, "page_last": 1636, "abstract": "We propose an approach to multivariate nonparametric regression that generalizes reduced rank regression for linear models. An additive model is estimated for each dimension of a $q$-dimensional response, with a shared $p$-dimensional predictor variable. To control the complexity of the model, we employ a functional form of the Ky-Fan or nuclear norm, resulting in a set of function estimates that have low rank. Backfitting algorithms are derived and justified using a nonparametric form of the nuclear norm subdifferential. Oracle inequalities on excess risk are derived that exhibit the scaling behavior of the procedure in the high dimensional setting. The methods are illustrated on gene expression data.", "full_text": "Nonparametric Reduced Rank Regression\n\nRina Foygel\u2020,\u2217, Michael Horrell\u2020, Mathias Drton\u2020,\u2021, John Lafferty\u2020\n\n\u2217Department of Statistics\n\nStanford University\n\n\u2020Department of Statistics\nUniversity of Chicago\n\n\u2021Department of Statistics\nUniversity of Washington\n\nAbstract\n\nWe propose an approach to multivariate nonparametric regression that generalizes\nreduced rank regression for linear models. An additive model is estimated for each\ndimension of a q-dimensional response, with a shared p-dimensional predictor\nvariable. To control the complexity of the model, we employ a functional form of\nthe Ky-Fan or nuclear norm, resulting in a set of function estimates that have low\nrank. Back\ufb01tting algorithms are derived and justi\ufb01ed using a nonparametric form\nof the nuclear norm subdifferential. Oracle inequalities on excess risk are derived\nthat exhibit the scaling behavior of the procedure in the high dimensional setting.\nThe methods are illustrated on gene expression data.\n\n1\n\nIntroduction\n\ndimensional covariate vector. 
This is also referred to as multi-task learning in the machine learning\n\nIn the multivariate regression problem the objective is to estimate the conditional mean E(Y\u2223 X)=\nm(X) = (m1(X), . . . , mq(X))\u22ba where Y is a q-dimensional response vector and X is a p-\nliterature. We are given a sample of n iid pairs{(Xi, Yi)} from the joint distribution of X and Y .\nUnder a linear model, the mean is estimated as m(X)= BX where B \u2208 R\nq\u00d7p is a q\u00d7 p matrix\nIn reduced rank regression the matrix B is estimated under a rank constraint r= rank(B)\u2264 C, so\n\nof regression coef\ufb01cients. When the dimensions p and q are large relative to the sample size n, the\ncoef\ufb01cients of B cannot be reliably estimated, without further assumptions.\n\np. Intuitively, this implies\nthat the rows or columns of B lie in an r-dimensional subspace of R\nthat the model is based on a smaller number of features than the ambient dimensionality p would\nsuggest, or that the tasks representing the components Y k of the response are closely related. In low\ndimensions, the constrained rank model can be computed as an orthogonal projection of the least\nsquares solution; but in high dimensions this is not well de\ufb01ned.\n\nThe nuclear norm\u2225B\u2225\u2217, also known as the trace or Ky-Fan norm, is the sum of the singular vectors\n\nRecent research has studied the use of the nuclear norm as a convex surrogate for the rank constraint.\n\nq or R\n\nIn this paper we study nonparametric parallels of reduced rank linear models. We focus our attention\n\nof B. A rank constraint can be thought of as imposing sparsity, but in an unknown basis; the nuclear\nnorm plays the role of the (cid:1)1 norm in sparse estimation. Its use for low rank estimation problems\nwas proposed by Fazel in [2]. More recently, nuclear norm regularization in multivariate linear\nregression has been studied by Yuan et al. 
[10], and by Negahban and Wainwright [4], who analyzed the scaling properties of the procedure in high dimensions.\n\nIn this paper we study nonparametric parallels of reduced rank linear models. We focus our attention on additive models, so that the regression function m(X) = (m^1(X), . . . , m^q(X))⊺ has each component m^k(X) = ∑_{j=1}^p m^k_j(X_j) equal to a sum of p functions, one for each covariate. The objective is then to estimate the q×p matrix of functions M(X) = [m^k_j(X_j)].\n\nThe first problem we address, in Section 2, is to determine a replacement for the regularization penalty ∥B∥∗ in the linear model. Because we must estimate a matrix of functions, the analogue of the nuclear norm is not immediately apparent. We propose two related regularization penalties for\n\n1\n\n\fnonparametric low rank regression, and show how they specialize to the linear case. We then study, in Section 4, the (infinite dimensional) subdifferential of these penalties. In the population setting, this leads to stationary conditions for the minimizer of the regularized mean squared error. This subdifferential calculus then justifies penalized backfitting algorithms for carrying out the optimization for a finite sample. Constrained rank additive models (CRAM) for multivariate regression are analogous to sparse additive models (SPAM) for the case where the response is 1-dimensional [6] (studied also in the reproducing kernel Hilbert space setting by [5]), but with the goal of recovering a low-rank matrix rather than an entry-wise sparse vector. The backfitting algorithms we derive in Section 5 are analogous to the iterative smoothing and soft thresholding backfitting algorithms for SPAM proposed in [6]. A uniform bound on the excess risk of the estimator relative to an oracle is given in Section 6. This shows the statistical scaling behavior of the methods for prediction.
The analysis requires a concentration result for nonparametric covariance matrices in the spectral norm. Experiments with gene data are given in Section 7, which are used to illustrate different facets of the proposed nonparametric reduced rank regression techniques.\n\n2 Nonparametric Nuclear Norm Penalization\n\nWe begin by presenting the penalty that we will use to induce nonparametric regression estimates to be low rank. To motivate our choice of penalty and provide some intuition, suppose that f^1(x), . . . , f^q(x) are q smooth one-dimensional functions with a common domain. What does it mean for this collection of functions to be low rank? Let x_1, x_2, . . . , x_n be a collection of points in the common domain of the functions. We require that the n×q matrix of function values F(x_{1:n}) = [f^k(x_i)] is low rank. This matrix is of rank at most r < q for every set {x_i} of arbitrary size n if and only if the functions {f^k} are r-linearly independent: each function can be written as a linear combination of r of the other functions.\n\nIn the multivariate regression setting, but still assuming the domain is one-dimensional for simplicity (q > 1 and p = 1), we have a random sample X_1, . . . , X_n. Consider the n×q sample matrix M = [m^k(X_i)] associated with a vector M = (m^1, . . . , m^q) of q smooth (regression) functions, and suppose that n > q. We would like for this to be a low rank matrix. This suggests the penalty\n\n∥M∥∗ = ∑_{s=1}^q σ_s(M) = ∑_{s=1}^q √λ_s(M⊺M),\n\nwhere {λ_s(A)} denotes the eigenvalues of a symmetric matrix A and {σ_s(B)} denotes the singular values of a matrix B. Now, assuming the columns of M are centered, and E[m^k(X)] = 0 for each k, we recognize (1/n) M⊺M as the sample covariance Σ̂(M) of the population covariance Σ(M) := Cov(M(X)) = [E(m^k(X) m^l(X))].
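The rank characterization above is easy to check numerically. The following small sketch is ours, not part of the paper, and the functions are purely illustrative: three functions, the third an exact linear combination of the first two, are evaluated at n sample points; the resulting sample matrix has rank 2, and the nuclear norm computed from singular values agrees with the eigenvalue expression just given.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, size=n)

# q = 3 functions; f3 is a linear combination of f1 and f2,
# so the n x q matrix of function values should have rank 2.
f1 = np.sin(np.pi * x)
f2 = x ** 2
f3 = 2.0 * f1 - 0.5 * f2
M = np.column_stack([f1, f2, f3])
M = M - M.mean(axis=0)  # center each column, as assumed in the text

rank = np.linalg.matrix_rank(M)

# Nuclear norm: sum of singular values of M, which equals the sum of
# square roots of the eigenvalues of M^T M.
nuc_from_svd = np.linalg.svd(M, compute_uv=False).sum()
nuc_from_eig = np.sqrt(np.clip(np.linalg.eigvalsh(M.T @ M), 0, None)).sum()

print(rank)                                    # 2
print(np.isclose(nuc_from_svd, nuc_from_eig))  # True
```

Centering does not change the rank here, since a linear combination of columns remains a linear combination after each column is centered.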
This motivates the following population and sample penalties, where A^{1/2} denotes the matrix square root:\n\npopulation penalty: ∥Σ(M)^{1/2}∥∗ = ∥Cov(M(X))^{1/2}∥∗ (2.1)\n\nsample penalty: ∥Σ̂(M)^{1/2}∥∗ = (1/√n) ∥M∥∗ (2.2)\n\nWe recall that if A ⪰ 0 has spectral decomposition A = U D U⊺ then A^{1/2} = U D^{1/2} U⊺. With Y denoting the n×q matrix of response values for the sample (X_i, Y_i), this leads to the following population and empirical regularized risk functionals for low rank nonparametric regression:\n\npopulation penalized risk: (1/2) E∥Y − M(X)∥²_2 + λ ∥Σ(M)^{1/2}∥∗ (2.3)\n\nempirical penalized risk: (1/2n) ∥Y − M∥²_F + (λ/√n) ∥M∥∗ (2.4)\n\n3 Constrained Rank Additive Models (CRAM)\n\nWe now consider the case where X is p-dimensional. Throughout the paper we use superscripts to denote indices of the q-dimensional response, and subscripts to denote indices of the p-dimensional covariate. We consider the family of additive models, with regression functions of the form m(X) = (m^1(X), . . . , m^q(X))⊺ = ∑_{j=1}^p M_j(X_j), where each term M_j(X_j) = (m^1_j(X_j), . . . , m^q_j(X_j))⊺ is a q-vector of functions evaluated at X_j.\n\n2\n\n\fIn this setting we propose two different penalties. The first penalty, intuitively, encourages the vector (m^1_j(X_j), . . . , m^q_j(X_j)) to be low rank, for each j. Assume that the functions m^k_j all have mean zero; this is required for identifiability in the additive model. As a shorthand, let Σ_j = Σ(M_j) = Cov(M_j(X_j)) denote the covariance matrix of the j-th component functions, with sample version Σ̂_j. The population and sample versions of the first penalty are then given by\n\n∥Σ_1^{1/2}∥∗ + ∥Σ_2^{1/2}∥∗ + ⋯ + ∥Σ_p^{1/2}∥∗ (3.1)\n\n∥Σ̂_1^{1/2}∥∗ + ∥Σ̂_2^{1/2}∥∗ + ⋯ + ∥Σ̂_p^{1/2}∥∗ = (1/√n) ∑_{j=1}^p ∥M_j∥∗ (3.2)\n\nThe second penalty, intuitively, encourages the set of q vector-valued functions (m^k_1, m^k_2, . . . , m^k_p)⊺ to be low rank. This penalty is given by\n\n∥(Σ_1^{1/2} ⋯ Σ_p^{1/2})∥∗ (3.3)\n\n∥(Σ̂_1^{1/2} ⋯ Σ̂_p^{1/2})∥∗ = (1/√n) ∥M_{1:p}∥∗ (3.4)\n\nwhere, for convenience of notation, M_{1:p} = (M_1⊺ ⋯ M_p⊺)⊺ is an np×q matrix. The corresponding population and empirical risk functionals, for the first penalty, are then\n\n(1/2) E∥Y − ∑_{j=1}^p M_j(X)∥²_2 + λ ∑_{j=1}^p ∥Σ_j^{1/2}∥∗ (3.5)\n\n(1/2n) ∥Y − ∑_{j=1}^p M_j∥²_F + (λ/√n) ∑_{j=1}^p ∥M_j∥∗ (3.6)\n\nand similarly for the second penalty.\n\nNow suppose that each X_j is normalized so that E(X_j²) = 1. In the linear case we have M_j(X_j) = X_j B_j where B_j ∈ R^q. Let B = (B_1 ⋯ B_p) ∈ R^{q×p}. Some straightforward calculation shows that the penalties reduce to ∑_{j=1}^p ∥Σ_j^{1/2}∥∗ = ∑_{j=1}^p ∥B_j∥_2 for the first penalty, and ∥(Σ_1^{1/2} ⋯ Σ_p^{1/2})∥∗ = ∥B∥∗ for the second. Thus, in the linear case the first penalty is encouraging B to be column-wise sparse, so that many of the B_j's are zero, meaning that X_j doesn't appear in the fit. This is a version of the group lasso [11]. The second penalty reduces to the nuclear norm regularization ∥B∥∗ used for high-dimensional reduced-rank regression.\n\n4 Subdifferentials for Functional Matrix Norms\n\nA key to deriving algorithms for functional low-rank regression is computation of the subdifferentials of the penalties. We are interested in (q×p)-dimensional matrices of functions F = [f^k_j]. For each column index j and row index k, f^k_j is a function of a random variable X_j, and we will take expectations with respect to X_j implicitly. We write F_j to mean the jth column of F, which is a q-vector of functions of X_j. We define the inner product between two matrices of functions as\n\n⟨F, G⟩ := ∑_{j=1}^p E(F_j⊺ G_j) = ∑_{j=1}^p ∑_{k=1}^q E(f^k_j g^k_j) = tr(E(F G⊺)),\n\nand write ∥F∥_2 = √⟨F, F⟩. Note that ∥F∥_2 = ∥√E(F F⊺)∥_F, where E(F F⊺) = ∑_j E(F_j F_j⊺) ⪰ 0 is a positive semidefinite q×q matrix.\n\nWe define two further norms on a matrix of functions F, namely\n\n∣∣∣F∣∣∣∗ := ∥√E(F F⊺)∥∗ and ∣∣∣F∣∣∣sp := √∥E(F F⊺)∥sp = ∥√E(F F⊺)∥sp, (4.1)\n\nwhere ∥A∥sp is the spectral norm (operator norm), the largest singular value of A, and it is convenient to write the matrix square root as √A = A^{1/2}. Each of the norms depends on F only through √E(F F⊺). In fact, these two norms are dual: for any F,\n\n∣∣∣F∣∣∣∗ = sup_{∣∣∣G∣∣∣sp ≤ 1} ⟨G, F⟩, (4.2)\n\n3\n\n\fwhere the supremum is attained by setting G = (√E(F F⊺))^{−1} F, with A^{−1} denoting the matrix pseudo-inverse.\n\nProposition 4.1. The subdifferential of ∣∣∣F∣∣∣∗ is the set\n\nS(F) := {(√E(F F⊺))^{−1} F + H : ∣∣∣H∣∣∣sp ≤ 1, E(F H⊺) = 0_{q×q}, E(F F⊺)H = 0_{q×p} a.e.}. (4.3)\n\nProof. The fact that S(F) contains the subdifferential ∂∣∣∣F∣∣∣∗ can be proved by comparing our setting (matrices of functions) to the ordinary matrix case; see [9, 7]. Here, we show the reverse inclusion, S(F) ⊆ ∂∣∣∣F∣∣∣∗. Let D ∈ S(F) and let G be any element of the function space. We need to show\n\n∣∣∣F + G∣∣∣∗ ≥ ∣∣∣F∣∣∣∗ + ⟨G, D⟩, (4.4)\n\nwhere D = (√E(F F⊺))^{−1} F + H =: F̃ + H for some H satisfying the conditions in (4.3) above. Expanding the right-hand side of (4.4), we have\n\n∣∣∣F∣∣∣∗ + ⟨G, D⟩ = ∣∣∣F∣∣∣∗ + ⟨G, F̃ + H⟩ = ⟨F + G, F̃ + H⟩ ≤ ∣∣∣F + G∣∣∣∗ ∣∣∣D∣∣∣sp,\n\nwhere the second equality follows from ∣∣∣F∣∣∣∗ = ⟨F, F̃⟩, and the fact that ⟨F, H⟩ = tr(E(F H⊺)) = 0. The inequality follows from the duality of the norms.\n\nFinally, we show that ∣∣∣D∣∣∣sp ≤ 1. We have\n\nE(D D⊺) = E(F̃ F̃⊺) + E(F̃ H⊺) + E(H F̃⊺) + E(H H⊺) = E(F̃ F̃⊺) + E(H H⊺),\n\nwhere we use the fact that E(F H⊺) = 0_{q×q}, implying E(F̃ H⊺) = 0_{q×q}. Next, let E(F F⊺) = V D V⊺ be a reduced singular value decomposition, where D is a positive diagonal matrix of size q′ ≤ q. Then E(F̃ F̃⊺) = V V⊺, and we have\n\nE(F F⊺) ⋅ H = 0_{q×p} a.e. ⇔ V⊺ H = 0_{q′×p} a.e. ⇔ E(F̃ F̃⊺) H = 0_{q×p} a.e.
This implies that E(F̃ F̃⊺) ⋅ E(H H⊺) = 0_{q×q}, and so these two symmetric matrices have orthogonal row spans and orthogonal column spans. Therefore,\n\n∥E(D D⊺)∥sp = ∥E(F̃ F̃⊺) + E(H H⊺)∥sp = max{∥E(F̃ F̃⊺)∥sp, ∥E(H H⊺)∥sp} ≤ 1,\n\nwhere the last bound comes from the fact that ∣∣∣F̃∣∣∣sp, ∣∣∣H∣∣∣sp ≤ 1. Therefore ∣∣∣D∣∣∣sp ≤ 1.\n\nThis gives the subdifferential of penalty 2, defined in (3.3). We can view the first penalty update as just a special case of the second penalty update. For penalty 1 in (3.1), if we are updating F_j and fix all the other functions, we are now penalizing the norm\n\n∣∣∣F_j∣∣∣∗ = ∥√E(F_j F_j⊺)∥∗, (4.5)\n\nwhich is clearly just a special case of penalty 2 with a single q-vector of functions instead of p different q-vectors of functions. So, we have\n\n∂∣∣∣F_j∣∣∣∗ = {(√E(F_j F_j⊺))^{−1} F_j + H_j : ∣∣∣H_j∣∣∣sp ≤ 1, E(F_j H_j⊺) = 0, E(F_j F_j⊺) H_j = 0 a.e.}. (4.6)\n\n5 Stationary Conditions and Backfitting Algorithms\n\nReturning to the base case of p = 1 covariate, consider the population regularized risk optimization\n\nmin_M {(1/2) E∥Y − M(X)∥²_2 + λ ∣∣∣M∣∣∣∗}, (5.1)\n\nwhere M is a vector of q univariate functions. The stationary condition for this optimization is\n\nE(Y ∣ X) = M(X) + λ V(X) a.e. for some V ∈ ∂∣∣∣M∣∣∣∗. (5.2)\n\nDefine P(X) := E(Y ∣ X).\n\n4\n\n\fCRAM BACKFITTING ALGORITHM — FIRST PENALTY\n\nInput: Data (X_i, Y_i), regularization parameter λ.\n\nInitialize M̂_j = 0, for j = 1, . . . , p.\n\nIterate until convergence:\n\nFor each j = 1, . . . , p:\n\n(1) Compute the residual: Z_j = Y − ∑_{k≠j} M̂_k(X_k);\n\n(2) Estimate P_j = E[Z_j ∣ X_j] by smoothing: P̂_j = S_j Z_j;\n\n(3) Compute SVD: (1/n) P̂_j⊺ P̂_j = U diag(τ) U⊺;\n\n(4) Soft threshold: M̂_j = U diag([1 − λ/√τ]_+) U⊺ P̂_j;\n\n(5) Center: M̂_j ← M̂_j − mean(M̂_j).\n\nOutput: Component functions M̂_j and estimator M̂(X_i) = ∑_j M̂_j(X_ij).\n\nFigure 1: The CRAM backfitting algorithm, using the first penalty, which penalizes each component.\n\nProposition 5.1. Let E(P P⊺) = U diag(τ) U⊺ be the singular value decomposition and define\n\nM = U diag([1 − λ/√τ]_+) U⊺ P, (5.3)\n\nwhere [x]_+ = max(x, 0). Then M satisfies stationary condition (5.2), and is a minimizer of (5.1).\n\nProof. Assume the singular values are sorted as τ_1 ≥ τ_2 ≥ ⋯ ≥ τ_q, and let r be the largest index such that √τ_r > λ. Thus, M has rank r. Note that √E(M M⊺) = U diag([√τ − λ]_+) U⊺, and therefore\n\nλ (√E(M M⊺))^{−1} M = U diag(λ/√τ_{1:r}, 0_{q−r}) U⊺ P, (5.4)\n\nwhere x_{1:k} = (x_1, . . . , x_k) and c_k = (c, . . . , c). It follows that\n\nM + λ (√E(M M⊺))^{−1} M = U diag(1_r, 0_{q−r}) U⊺ P. (5.5)\n\nNow define\n\nH = (1/λ) U diag(0_r, 1_{q−r}) U⊺ P (5.6)\n\nand take V = (√E(M M⊺))^{−1} M + H. Then we have M + λV = P.\n\nIt remains to show that H satisfies the conditions of the subdifferential in (4.3). Since √E(H H⊺) = U diag(0_r, √τ_{r+1}/λ, . . . , √τ_q/λ) U⊺, we have ∣∣∣H∣∣∣sp ≤ 1. Also, E(M H⊺) = 0_{q×q} since\n\ndiag(1 − λ/√τ_{1:r}, 0_{q−r}) diag(0_r, 1_{q−r}/λ) = 0_{q×q}. (5.7)\n\nSimilarly, E(M M⊺) H = 0_{q×q} since\n\ndiag((√τ_{1:r} − λ)², 0_{q−r}) diag(0_r, 1_{q−r}/λ) = 0_{q×q}. (5.8)\n\nIt follows that V ∈ ∂∣∣∣M∣∣∣∗.
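Proposition 5.1 can be sanity-checked with a finite-sample surrogate, replacing E(P P⊺) by (1/n) P⊺P for an n×q matrix P of function values. The sketch below is our own illustration (the matrices, seed, and parameter values are arbitrary): it applies the spectral soft-thresholding to each row of P (the row-wise form of U diag([1 − λ/√τ]_+) U⊺ acting on the q-vector) and verifies that the resulting covariance has eigenvalues [√τ − λ]²_+, exactly as in the proof. For a PSD matrix the SVD coincides with the eigendecomposition, so `eigh` is used.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, lam = 500, 4, 0.6

# Finite-sample analogue of Proposition 5.1: P is an n x q matrix of
# "function values", and (1/n) P^T P = U diag(tau) U^T plays the role
# of E(P P^T).
P = rng.normal(size=(n, q)) @ rng.normal(size=(q, q))
tau, U = np.linalg.eigh(P.T @ P / n)          # ascending eigenvalues
order = np.argsort(tau)[::-1]                 # sort descending, as in the proof
tau, U = tau[order], U[:, order]

# Soft-threshold the spectrum: each row of P is mapped by
# U diag([1 - lam/sqrt(tau)]_+) U^T.
shrink = np.maximum(1.0 - lam / np.sqrt(np.maximum(tau, 1e-12)), 0.0)
M = P @ U @ np.diag(shrink) @ U.T

# The covariance of M should have eigenvalues [sqrt(tau) - lam]_+^2.
tau_M = np.sort(np.linalg.eigvalsh(M.T @ M / n))[::-1]
expected = np.maximum(np.sqrt(np.maximum(tau, 0.0)) - lam, 0.0) ** 2
```

Directions whose singular value √τ_s falls below λ are annihilated, which is how the estimate acquires reduced rank.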
The analysis above justifies a backfitting algorithm for estimating a constrained rank additive model with the first penalty, where the objective is\n\nmin_{M_j} {(1/2) E∥Y − ∑_{j=1}^p M_j(X_j)∥²_2 + λ ∑_{j=1}^p ∣∣∣M_j∣∣∣∗}. (5.9)\n\nFor a given coordinate j, we form the residual Z_j = Y − ∑_{k≠j} M_k, and then compute the projection P_j = E(Z_j ∣ X_j), with singular value decomposition E(P_j P_j⊺) = U diag(τ) U⊺. We then update\n\nM_j = U diag([1 − λ/√τ]_+) U⊺ P_j (5.10)\n\nand proceed to the next variable. This is a Gauss-Seidel procedure that parallels the population backfitting algorithm for SPAM [6]. In the sample version we replace the conditional expectation P_j = E(Z_j ∣ X_j) by a nonparametric linear smoother, P̂_j = S_j Z_j. The algorithm is given in Figure 1. Note that to predict at a point x not included in the training set, the smoother matrices are constructed using that point; that is, P̂_j(x_j) = S_j(x_j)⊺ Z_j.\n\nThe algorithm for penalty 2 is similar. In step (3) of the algorithm in Figure 1 we compute the SVD of (1/n) P̂_{1:p}⊺ P̂_{1:p}. Then, in step (4) we soft threshold according to M̂_{1:p} = U diag([1 − λ/√τ]_+) U⊺ P̂_{1:p}. Both algorithms can be viewed as functional projected gradient descent procedures.\n\n5\n\n\f
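Figure 1 leaves the smoother matrices S_j abstract. The sketch below is our own illustration, not the authors' code: the Gaussian kernel smoother, the bandwidth, the fixed number of sweeps, and the synthetic data are all assumptions, and the spectral shrinkage is applied to each row of the smoothed residual matrix, the row-wise sample analogue of step (4).

```python
import numpy as np

def kernel_smoother(x, bandwidth=0.3):
    # Row-normalized Gaussian kernel smoother matrix S_j for one covariate.
    d = (x[:, None] - x[None, :]) / bandwidth
    W = np.exp(-0.5 * d ** 2)
    return W / W.sum(axis=1, keepdims=True)

def cram_backfit(X, Y, lam, n_sweeps=20, bandwidth=0.3):
    # Sample CRAM backfitting with the first penalty (sketch of Figure 1),
    # run for a fixed number of Gauss-Seidel sweeps for simplicity.
    n, p = X.shape
    q = Y.shape[1]
    S = [kernel_smoother(X[:, j], bandwidth) for j in range(p)]
    M = [np.zeros((n, q)) for _ in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            Zj = Y - sum(M[k] for k in range(p) if k != j)     # (1) residual
            Pj = S[j] @ Zj                                     # (2) smooth
            tau, U = np.linalg.eigh(Pj.T @ Pj / n)             # (3) spectrum of (1/n) P^T P
            shrink = np.maximum(1 - lam / np.sqrt(np.maximum(tau, 1e-12)), 0)
            Mj = Pj @ U @ np.diag(shrink) @ U.T                # (4) soft threshold, row-wise
            M[j] = Mj - Mj.mean(axis=0)                        # (5) center
    return M

# Tiny synthetic example: q = 3 responses sharing a rank-one signal in X_1.
rng = np.random.default_rng(2)
n, p = 120, 2
X = rng.uniform(-1, 1, size=(n, p))
signal = np.sin(np.pi * X[:, [0]])
Y = signal @ np.array([[1.0, 0.5, -0.5]]) + 0.1 * rng.normal(size=(n, 3))
Y = Y - Y.mean(axis=0)

M_fit = cram_backfit(X, Y, lam=0.1)    # mild penalty: fits the shared signal
M_huge = cram_backfit(X, Y, lam=1e6)   # heavy penalty: every component is zeroed
```

With a very large λ every smoothed residual is thresholded to zero, while with a small λ the fitted components capture most of the shared rank-one signal.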
6 Excess Risk Bounds\n\nThe population risk of a q×p regression matrix M(X) = [M_1(X_1) ⋯ M_p(X_p)] is\n\nR(M) = E∥Y − M(X) 1_p∥²_2,\n\nwith sample version denoted R̂(M). Consider all models that can be written as\n\nM(X) = U ⋅ D ⋅ V(X)⊺,\n\nwhere U is an orthogonal q×r matrix, D is a positive diagonal matrix, and V(X) = [v_{js}(X_j)] satisfies E(V⊺V) = I_r. Writing (A ; B) for the (q+r)×q matrix that stacks A on top of B, the population risk can be reexpressed as\n\nR(M) = tr{(−I_q ; D U⊺)⊺ E[(Y ; V(X)⊺ 1_p)(Y ; V(X)⊺ 1_p)⊺] (−I_q ; D U⊺)} = tr{(−I_q ; D U⊺)⊺ Σ(V) (−I_q ; D U⊺)},\n\nwith the blocks of Σ(V) denoted Σ_YY, Σ_YV, Σ_YV⊺, Σ_VV, and similarly for the sample risk, with Σ̂_n(V) replacing Σ(V) := Cov((Y, V(X)⊺ 1_p)) above. The “uncontrollable” contribution to the risk, which does not depend on M, is R_u = tr{Σ_YY}. We can express the remaining “controllable” risk as\n\nR_c(M) = R(M) − R_u = tr{(−2 I_q ; D U⊺)⊺ Σ(V) (0_q ; D U⊺)}.\n\nUsing the von Neumann trace inequality, tr(AB) ≤ ∥A∥_p ∥B∥_{p′} where 1/p + 1/p′ = 1, we have\n\nR_c(M) − R̂_c(M) ≤ ∥(−2 I_q ; D U⊺)⊺ (Σ(V) − Σ̂_n(V))∥sp ∥(0_q ; D U⊺)∥∗ ≤ ∥(−2 I_q ; D U⊺)⊺∥sp ∥Σ(V) − Σ̂_n(V)∥sp ∥D∥∗ ≤ C max(2, ∥D∥sp) ∥Σ(V) − Σ̂_n(V)∥sp ∥D∥∗ ≤ C max{2, ∥D∥²∗} ∥Σ(V) − Σ̂_n(V)∥sp, (6.1)\n\nwhere here and in the following C is a generic constant.
For the last factor in (6.1), it holds that\n\nsup_V ∥Σ(V) − Σ̂_n(V)∥sp ≤ C sup_V sup_{w∈N} w⊺ (Σ(V) − Σ̂_n(V)) w,\n\nwhere N is a 1/2-covering of the unit (q+r)-sphere, which has size ∣N∣ ≤ 6^{q+r} ≤ 36^q; see [8]. We now assume that the functions v_{sj}(x_j) are uniformly bounded from a Sobolev space of order two. Specifically, let {ψ_{jk} : k = 0, 1, . . .} denote a uniformly bounded, orthonormal basis with respect to L²[0,1], and assume that v_{sj} ∈ H_j, where\n\nH_j = {f_j : f_j(x_j) = ∑_{k=0}^∞ a_{jk} ψ_{jk}(x_j), ∑_{k=0}^∞ a²_{jk} k⁴ ≤ K²}\n\nfor some 0 < K < ∞. The L∞-covering number of H_j satisfies log N(H_j, ε) ≤ K/√ε.\n\n6\n\n\fSuppose that Y − E(Y ∣ X) = W is Gaussian and the true regression function E(Y ∣ X) is bounded. Then the family of random variables Z(V,w) := √n ⋅ w⊺(Σ(V) − Σ̂_n(V))w is sub-Gaussian and sample continuous. It follows from a result of Cesa-Bianchi and Lugosi [1] that\n\nE(sup_V sup_{w∈N} w⊺(Σ(V) − Σ̂_n(V))w) ≤ (C/√n) ∫_0^B √(q log(36) + log(pq) + K/√ε) dε\n\nfor some constant B. Thus, by Markov's inequality we conclude that\n\nsup_V ∥Σ(V) − Σ̂_n(V)∥sp = O_P(√((q + log(pq))/n)). (6.2)\n\nIf ∣∣∣M∣∣∣∗ = ∥D∥∗ = o((n/(q + log(pq)))^{1/4}), then returning to (6.1), this gives us a bound on R_c(M) − R̂_c(M) that is o_P(1).
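The √(q/n)-type scaling in (6.2) is easy to probe by simulation in the simplest setting, a plain Gaussian sample covariance with identity target. This is our own quick check (it does not reproduce the supremum over V, and the seed and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
q = 10

def spectral_deviation(n):
    # Spectral-norm distance between a sample covariance and its target I_q.
    Z = rng.normal(size=(n, q))        # population covariance is I_q
    Sigma_hat = Z.T @ Z / n
    return np.linalg.norm(Sigma_hat - np.eye(q), ord=2)

dev_small = spectral_deviation(500)
dev_large = spectral_deviation(50000)
# The deviation should shrink roughly like sqrt(q / n) as n grows.
```

At n = 50000 the deviation is an order of magnitude smaller than at n = 500, consistent with the square-root rate.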
More precisely, we define a class of matrices of functions:\n\nM_n = {M : M(X) = U D V(X)⊺, with E(V⊺V) = I, v_{sj} ∈ H_j, ∥D∥∗ = o((n/(q + log(pq)))^{1/4})}.\n\nThen, for a fitted matrix M̂ chosen from M_n, writing M∗ = arg min_{M∈M_n} R(M), we have\n\nR(M̂) − inf_{M∈M_n} R(M) = R(M̂) − R̂(M̂) − (R(M∗) − R̂(M∗)) + (R̂(M̂) − R̂(M∗)) ≤ [R(M̂) − R̂(M̂)] − [R(M∗) − R̂(M∗)].\n\nSubtracting R_u − R̂_u from each of the bracketed differences, we obtain that\n\nR(M̂) − inf_{M∈M_n} R(M) ≤ [R_c(M̂) − R̂_c(M̂)] − [R_c(M∗) − R̂_c(M∗)] ≤ 2 sup_{M∈M_n} {R_c(M) − R̂_c(M)} ≤ O(∥D∥²∗ sup_V ∥Σ(V) − Σ̂_n(V)∥sp) by (6.1) = o_P(1) by (6.2).\n\nThis proves the following result.\n\nProposition 6.1. Let M̂ minimize the empirical risk (1/n) ∑_i ∥Y_i − ∑_j M_j(X_ij)∥²_2 over the class M_n. Then\n\nR(M̂) − inf_{M∈M_n} R(M) →P 0.\n\n7 Application to Gene Expression Data\n\nTo illustrate the proposed nonparametric reduced rank regression techniques, we consider data on gene expression in E. coli from the “DREAM 5 Network Inference Challenge”^1 [3]. In this challenge genes were classified as transcription factors (TFs) or target genes (TGs). Transcription factors regulate the target genes, as well as other TFs.\n\nWe focus on predicting the expression levels Y for a particular set of q = 27 TGs, using the expression levels X for p = 6 TFs. Our motivation for analyzing these 33 genes is that, according to the gold standard gene regulatory network used for the DREAM 5 challenge, the 6 TFs form the parent set common to two additional TFs, which have the 27 TGs as their child nodes. In fact, the two intermediate nodes d-separate the 6 TFs and the 27 TGs in a Bayesian network interpretation of this gold standard. This means that if we treat the gold standard as a causal network, then up to noise, the functional relationship between X and Y is given by the composition of a map g: R^6 → R^2 and a map h: R^2 → R^27. If g and h are both linear, their composition h∘g is a linear map of rank no more than 2. As observed in Section 2, such a reduced rank linear model is a special case of an additive model with reduced rank in the sense of penalty 2. More generally, if g is an additive function and h is linear, then h∘g has rank at most 2 in the sense of penalty 2. Higher rank can in principle occur under functional composition, since even a univariate additive map h: R → R^q may have rank up to q under our penalties (recall that penalty 1 and 2 coincide for univariate maps).\n\n1 http://wiki.c2b2.columbia.edu/dream/index.php/D5c4\n\n7\n\n\fFigure 2: Left: Penalty 1 with large tuning parameter (λ = 20). Right: Penalty 2 with tuning parameter obtained through 10-fold cross-validation (λ = 5). Plotted points are residuals holding out the given predictor.\n\nThe backfitting algorithm of Figure 1 with penalty 1 and a rather aggressive choice of the tuning parameter λ produces the estimates shown in Figure 2, for which we have selected three of the 27 TGs. Under such strong regularization, the 5th column of functions is rank zero and, thus, identically zero. The remaining columns have rank one; the estimated fitted values are scalar multiples of one another. We also see that scalings can be different for different columns. The third plot in the third row shows a slightly negative slope, indicating a negative scaling for this particular estimate.
The\nremaining functions in this row are oriented similarly to the other rows, indicating the same, positive\nscaling. This property characterizes the difference between penalties 1 and 2; in an application of\npenalty 2, the scalings would have been the same across all functions in a given row.\n\nNext, we illustrate a higher-rank solution for penalty 2. Choosing the regularization parameter \u03bb by\nten-fold cross-validation gives a \ufb01t of rank 5, considerably lower than 27, the maximum possible\nrank. Figure 2 shows a selection of three coordinates of the \ufb01tted functions. Under rank \ufb01ve, each\nrow of functions is a linear combination of up to \ufb01ve other, linearly independent rows. We remark\nthat the use of cross-validation generally produces somewhat more complex models than is necessary\nto capture an underlying low-rank data-generating mechanism. Hence, if the causal relationships for\nthese data were indeed additive and low rank, then the true low rank might well be smaller than \ufb01ve.\n\n8 Summary\n\nThis paper introduced two penalties that induce reduced rank \ufb01ts in multivariate additive nonpara-\nmetric regression. Under linearity, the penalties specialize to group lasso and nuclear norm penalties\nfor classical reduced rank regression. Examining the subdifferentials of each of these penalties, we\ndeveloped back\ufb01tting algorithms for the two resulting optimization problems that are based on soft-\nthresholding of singular values of smoothed residual matrices. The algorithms were demonstrated\non a gene expression data set constructed to have a naturally low-rank structure. We also provided a\npersistence analysis that shows error tending to zero under a scaling assumption on the sample size\nn and the dimensions q and p of the regression problem.\n\nAcknowledgements\n\nResearch supported in part by NSF grants IIS-1116730, DMS-0746265, and DMS-1203762,\nAFOSR grant FA9550-09-1-0373, ONR grant N000141210762, and an Alfred P. 
Sloan Fellowship.\n\n8\n\n\fReferences\n\n[1] Nicolò Cesa-Bianchi and Gábor Lugosi. On prediction of individual sequences. The Annals of Statistics, 27(6):1865–1894, 1999.\n\n[2] Maryam Fazel. Matrix rank minimization with applications. Technical report, Stanford University, 2002. Doctoral Dissertation, Electrical Engineering Department.\n\n[3] D. Marbach, J. C. Costello, R. Küffner, N. Vega, R. J. Prill, D. M. Camacho, K. R. Allison, the DREAM5 Consortium, M. Kellis, J. J. Collins, and G. Stolovitzky. Wisdom of crowds for robust gene network inference. Nature Methods, 9(8):796–804, 2012.\n\n[4] Sahand Negahban and Martin J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics, 39:1069–1097, 2011.\n\n[5] Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. arxiv:1008.3654, 2010.\n\n[6] Pradeep Ravikumar, John Lafferty, Han Liu, and Larry Wasserman. Sparse additive models. Journal of the Royal Statistical Society, Series B, Methodological, 71(5):1009–1030, 2009.\n\n[7] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.\n\n[8] Roman Vershynin. How close is the sample covariance matrix to the actual covariance matrix? arxiv:1004.3484, 2010.\n\n[9] G. A. Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, 170:1039–1053, 1992.\n\n[10] Ming Yuan, Ali Ekici, Zhaosong Lu, and Renato Monteiro. Dimension reduction and coefficient estimation in multivariate linear regression. J. R. Statist. Soc. B, 69(3):329–346, 2007.\n\n[11] Ming Yuan and Yi Lin.
Model selection and estimation in regression with grouped variables.\nJournal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49\u201367, 2006.\n\n9\n\n\f", "award": [], "sourceid": 767, "authors": [{"given_name": "Rina", "family_name": "Foygel", "institution": null}, {"given_name": "Michael", "family_name": "Horrell", "institution": null}, {"given_name": "Mathias", "family_name": "Drton", "institution": null}, {"given_name": "John", "family_name": "Lafferty", "institution": null}]}