{"title": "Linear Dependent Dimensionality Reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 145, "page_last": 152, "abstract": "", "full_text": "Linear Dependent Dimensionality Reduction\n\nNathan Srebro\n\nTommi Jaakkola\n\nDepartment of Electrical Engineering and Computer Science\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\nnati@mit.edu,tommi@ai.mit.edu\n\nAbstract\n\nWe formulate linear dimensionality reduction as a semi-parametric esti-\nmation problem, enabling us to study its asymptotic behavior. We gen-\neralize the problem beyond additive Gaussian noise to (unknown) non-\nGaussian additive noise, and to unbiased non-additive models.\n\n1 Introduction\n\nFactor models are often natural in the analysis of multi-dimensional data. The underly-\ning premise of such models is that the important aspects of the data can be captured via a\nlow-dimensional representation (\u201cfactor space\u201d). The low-dimensional representation may\nbe useful for lossy compression as in typical applications of PCA, for signal reconstruc-\ntion as in factor analysis or non-negative matrix factorization [1], for understanding the\nsignal structure [2], or for prediction as in applying SVD for collaborative \ufb01ltering [3]. In\nmany situations, including collaborative \ufb01ltering and structure exploration, the \u201cimportant\u201d\naspects of the data are the dependencies between different attributes. For example, in col-\nlaborative \ufb01ltering we rely on a representation that summarizes the dependencies among\nuser preferences. More generally, we seek to identify a low-dimensional space that captures\nthe dependent aspects of the data, and separate them from independent variations. 
Our goal is to relax restrictions on the form of each of these components, such as Gaussianity, additivity and linearity, while maintaining a principled, rigorous framework that allows analysis of the methods.

We begin by studying the probabilistic formulations of the problem, focusing on the assumptions that are made about the dependent, low-rank "signal" and independent "noise" distributions. We consider a general semi-parametric formulation that emphasizes what is being estimated and allows us to discuss asymptotic behavior (Section 2). We then study the standard (PCA) approach, show that it is appropriate for additive i.i.d. noise (Section 3), and present a generic estimator that is appropriate also for unbiased non-additive models (Section 4). In Section 5 we confront the non-Gaussianity directly, develop maximum-likelihood estimators in the presence of Gaussian mixture additive noise, and show that the consistency of such maximum-likelihood estimators should not be taken for granted.

2 Dependent Dimensionality Reduction

Our starting point is the problem of identifying linear dependencies in the presence of independent identically distributed Gaussian noise. In this formulation, we observe a data matrix Y ∈ […]

Let s1 ≥ ··· ≥ sk > 0 be the non-zero eigenvalues of ΛX. Since z has variance exactly σ² in any direction, the principal directions of variation are not affected by it, and the eigenvalues of ΛY are exactly s1 + σ², …, sk + σ², σ², …, σ², with the leading k eigenvectors being the eigenvectors of ΛX. This ensures an eigenvalue gap of sk > 0 between the invariant subspace of ΛY spanned by the eigenvectors of ΛX and its complement, and we can bound the norm of the canonical sines between V0 and the leading k eigenvectors of Λ̂n by |Λ̂n − ΛY|/sk [8].
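As a concrete illustration of this consistency argument (my own sketch, not from the paper), the canonical sines between the true subspace V0 and the leading eigenvectors of the empirical covariance can be computed numerically; all names and parameter values below are assumptions chosen for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20000, 10, 2

# Planted signal subspace V0 (orthonormal columns) and i.i.d. Gaussian noise.
V0 = np.linalg.qr(rng.standard_normal((d, k)))[0]
X = rng.standard_normal((n, k)) @ V0.T          # dependent low-rank signal
Y = X + 0.5 * rng.standard_normal((n, d))       # additive i.i.d. noise, sigma = 0.5

# L2 (PCA) estimator: leading k eigenvectors of the empirical covariance.
cov = np.cov(Y, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
V_hat = eigvecs[:, -k:]                         # eigh sorts eigenvalues ascending

# Canonical angles: their cosines are the singular values of V0' V_hat.
cosines = np.linalg.svd(V0.T @ V_hat, compute_uv=False)
sines = np.sqrt(np.clip(1 - cosines**2, 0, None))
print(np.linalg.norm(sines))                    # small for large n
```

As the sample size n grows, the sine norm shrinks at the rate of the covariance estimation error divided by the eigengap sk, matching the bound above.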
Since |Λ̂n − ΛY| → 0 a.s., we conclude that the estimator is consistent.

4 The Variance-Ignoring Estimator

We turn to additive noise with independent, but not identically distributed, coordinates. If the noise variances are known, the ML estimator corresponds to minimizing the column-weighted (inversely proportional to the variances) Frobenius norm of Y − X, and can be calculated from the leading eigenvectors of a scaled empirical covariance matrix [9]. If the variances are not known, e.g. when the scale of different coordinates is not known, there is no ML estimator: at least k coordinates of each y can always be exactly matched, and so the likelihood is unbounded when up to k variances approach zero.

³We call this an L2 estimator not because it minimizes the matrix L2-norm |Y − X|_2, which it does, but because it minimizes the vector L2-norms |y − x|_2².

⁴We should also be careful about signals that occupy only a proper subspace of V0, and be satisfied with any rank-k subspace containing the support of x, but for simplicity of presentation we assume this does not happen and x is of full rank k.

Figure 1: Norm of sines of canonical angles to correct subspace: (a) Random rank-2 subspaces in ℝ^10; Gaussian noise of different scales in different coordinates, between 0.17 and 1.7 signal strength. (b) Random rank-2 subspaces in ℝ^10, 500 sample rows, and Gaussian noise with varying distortion (mean over 200 simulations; bars are one standard deviation tall). (c) Observations are exponentially distributed with means in the rank-2 subspace spanned by (1 1 1 1 1 1 1 1 1 1)′ and (1 0 1 0 1 0 1 0 1 0)′.

The L2 estimator is not satisfactory in this scenario. The covariance matrix ΛZ is still diagonal, but is no longer a scaled identity.
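A quick population-level check (my own, with made-up variances) makes the failure concrete: with one very noisy coordinate, the leading eigenvector of ΛY aligns with that coordinate axis rather than with the signal direction.

```python
import numpy as np

d = 5
v0 = np.ones(d) / np.sqrt(d)                         # planted rank-1 signal direction
noise_var = np.array([9.0, 0.01, 0.01, 0.01, 0.01])  # one very noisy coordinate

# Population covariance: rank-1 signal (unit variance) plus diagonal noise.
cov_Y = np.outer(v0, v0) + np.diag(noise_var)

top = np.linalg.eigh(cov_Y)[1][:, -1]                # leading eigenvector
print(np.abs(top @ v0))                              # far from 1: the estimator is biased
```

Here the leading eigenvector essentially recovers the noisy axis e1, not v0, no matter how many samples are observed.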
The additional variance introduced by the noise differs across directions, and these differences may overwhelm the "signal" variance along V0, biasing the leading eigenvectors of ΛY, and thus the limit of the L2 estimator, toward axes with high "noise" variance. The fact that this variability is independent of the variability in other coordinates is ignored, and the L2 estimator is asymptotically biased.

Instead of recovering the directions of greatest variability, we recover the covariance structure directly. In the limit, Λ̂n → ΛY = ΛX + ΛZ, a sum of a rank-k matrix and a diagonal matrix. In particular, the non-diagonal entries of Λ̂n approach those of ΛX. We can thus seek a rank-k matrix Λ̂X approximating Λ̂n, e.g. in a sum-squared sense, except on the diagonal. This is a (zero-one) weighted low-rank approximation problem. We optimize Λ̂X by iteratively seeking a rank-k approximation of Λ̂n with diagonal entries filled in from the last iterate of Λ̂X (this can be viewed as an EM procedure [5]). The row-space of the resulting Λ̂X is then an estimator for the signal subspace. Note that the L2 estimator is the row-space of the rank-k matrix minimizing the unweighted sum-squared distance to Λ̂n.

Figures 1(a,b) demonstrate this variance-ignoring estimator on simulated data with non-identical Gaussian noise. The estimator reconstructs the signal-space almost as well as the ML estimator, even though it does not have access to the true noise variance.

Discussing consistency in the presence of non-identical noise with unknown variances is problematic, since the signal subspace is not necessarily identifiable.
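The diagonal-filling EM iteration described above can be sketched as follows (a minimal numpy sketch; the function name, iteration count, and test covariance are mine, and the rank-k step keeps the eigenvalues of largest magnitude so that each truncation is an exact best rank-k fit):

```python
import numpy as np

def variance_ignoring_subspace(cov, k, iters=100):
    """Zero-one weighted low-rank approximation via EM: fit a rank-k matrix to
    `cov` everywhere except the diagonal, by repeatedly filling the diagonal
    from the current rank-k iterate and re-truncating."""
    off = ~np.eye(cov.shape[0], dtype=bool)
    fit, errs = cov.copy(), []
    for _ in range(iters):
        w, v = np.linalg.eigh(fit)
        idx = np.argsort(np.abs(w))[-k:]             # best rank-k: largest |eigenvalue|
        low = (v[:, idx] * w[idx]) @ v[:, idx].T
        errs.append(np.sum((low - cov)[off] ** 2))   # off-diagonal residual only
        fit = cov.copy()
        np.fill_diagonal(fit, np.diag(low))          # free diagonal, data off-diagonal
    return v[:, idx], np.array(errs)

# Rank-1 signal along v0 plus unequal diagonal noise (made-up numbers).
d = 5
v0 = np.ones(d) / np.sqrt(d)
cov_Y = np.outer(v0, v0) + np.diag([1.0, 0.04, 0.04, 0.04, 0.04])
v_hat, errs = variance_ignoring_subspace(cov_Y, k=1)
print(np.abs(v_hat.ravel() @ v0), errs[0], errs[-1])
```

Each EM step is guaranteed not to increase the off-diagonal residual, though, like any EM procedure, it may stop at a local optimum.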
For example, the combined covariance matrix ΛY = (2 1; 1 2) can arise from a rank-one signal covariance ΛX = (a 1; 1 1/a) for any 1/2 ≤ a ≤ 2, each corresponding to a different signal subspace. Counting the number of parameters and constraints suggests identifiability when k < d − (√(8d+1) − 1)/2, but this is by no means a precise guarantee. Anderson and Rubin [10] present several conditions on ΛX which are sufficient for identifiability but require k < ⌊d/2⌋, and other weaker conditions which are necessary.

Non-Additive Noise  The above estimation method is also useful in a less straightforward situation. Until now we have considered only additive noise, in which the distribution of yi − xi was independent of xi. We will now relax this restriction and allow more general conditional distributions yi|xi, requiring only that E[yi|xi] = xi. With this requirement, together with the structural constraint (the yi are independent given x), for any i ≠ j:

Cov[yi, yj] = E[yiyj] − E[yi]E[yj] = E[E[yiyj|x]] − E[E[yi|x]]E[E[yj|x]]
            = E[E[yi|x]E[yj|x]] − E[xi]E[xj] = E[xixj] − E[xi]E[xj] = Cov[xi, xj].

As in the non-identical additive noise case, ΛY agrees with ΛX except on the diagonal. Even if yi|xi is identically conditionally distributed for all i, the difference ΛY − ΛX is not in general a scaled identity:

Var[yi] = E[E[yi²|xi]] − E[yi]² = E[E[yi²|xi] − E[yi|xi]²] + E[E[yi|xi]²] − E[yi]² = E[Var[yi|xi]] + Var[xi].
Unlike the additive noise case, the variance of yi|xi depends on xi, and so its expectation depends on the distribution of xi.

These observations suggest using the variance-ignoring estimator. Figure 1(c) demonstrates how such an estimator succeeds in reconstruction when yi|xi is exponentially distributed with mean xi, even though the standard L2 estimator is not applicable. We cannot guarantee consistency, because the decomposition of the covariance matrix might not be unique, but when k < ⌊d/2⌋ this is not likely to happen. Note that if the conditional distribution y|x is known, then even if the decomposition is not unique, the correct signal covariance might be identifiable based on the relationship between the signal marginals and the expected conditional variance of y|x; this, however, is not captured by the variance-ignoring estimator.

5 Low Rank Approximation with a Gaussian Mixture Noise Model

We return to additive noise, but, seeking better estimation with limited data, we confront non-Gaussian noise distributions directly: we would like to find the maximum-likelihood X when Y = X + Z and the Zij are distributed according to a Gaussian mixture:

pZ(zij) = Σ_{c=1}^{m} pc (2πσc²)^(−1/2) exp(−(zij − μc)² / (2σc²)).

To do so, we introduce latent variables Cij specifying the mixture component of the noise at Yij, and solve the problem using EM. In the Expectation step, we compute the posterior probabilities Pr(Cij|Yij; X) based on the current low-rank parameter matrix X.
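This E-step has a simple closed form; a sketch (function name and toy mixture parameters are mine), computed in log-space for numerical stability:

```python
import numpy as np

def e_step_posteriors(Y, X, p, mu, sigma):
    """Posterior Pr(C_ij = c | Y_ij; X) for Gaussian-mixture noise Z = Y - X."""
    Z = (Y - X)[..., None]                       # residuals, trailing component axis
    log_r = (np.log(p) - np.log(sigma)
             - 0.5 * ((Z - mu) / sigma) ** 2)    # log p_c + log N(z; mu_c, sigma_c^2) + const
    log_r -= log_r.max(axis=-1, keepdims=True)   # stabilize before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=-1, keepdims=True)

# Toy check with made-up parameters: a narrow component and a wide one.
p = np.array([0.9, 0.1]); mu = np.zeros(2); sigma = np.array([1.0, 10.0])
Y = np.array([[0.1, 25.0]]); X = np.zeros((1, 2))
r = e_step_posteriors(Y, X, p, mu, sigma)
print(r)   # small residual -> narrow component; huge residual -> wide component
```

A small residual is attributed almost entirely to the narrow component, while an outlier-sized residual is attributed to the wide one.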
In the Maximization step we need to find the low-rank matrix X that maximizes the posterior expected log-likelihood:

EC|Y[log Pr(Y = X + Z | C; X)]
    = −Σ_{ij} Σ_c Pr(Cij = c | Yij) (Xij − (Yij − μc))² / (2σc²) + Const
    = −(1/2) Σ_{ij} Wij (Xij − Aij)² + Const,                                  (1)

where Wij = Σ_c Pr(Cij = c | Yij) / σc²  and  Aij = Yij − Σ_c Pr(Cij = c | Yij) μc / (σc² Wij).

This is a weighted Frobenius low-rank approximation (WLRA) problem. Equipped with a WLRA optimization method [5], we can now perform EM iterations in order to find the matrix X maximizing the likelihood of the observed matrix Y. At each M step it is enough to perform a single WLRA optimization iteration, which is guaranteed to improve the WLRA objective, and so also the likelihood. The method can be augmented to handle an unknown Gaussian mixture by introducing an optimization of the mixture parameters at each M iteration.

Experiments with GSMs  We report here initial experiments with ML estimation using bounded Gaussian scale mixtures [11], i.e. mixtures of zero-mean Gaussians with variance bounded from below. Gaussian scale mixtures (GSMs) are a rich class of symmetric distributions, which include non-log-concave and heavy-tailed distributions. We investigated two noise distributions: a 'Gaussian with outliers' distribution formed as a mixture of two zero-mean Gaussians with widely varying variances, and a Laplace distribution p(z) ∝ e^(−|z|), which is an infinite scale mixture of Gaussians.
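Before turning to the results: the weights and targets of the WLRA surrogate in (1) are cheap to compute from the E-step posteriors. A sketch with names of my choosing; the sign of the μ term follows from the convention Y = X + Z with component means μc, so the target shifts Yij against the mean:

```python
import numpy as np

def m_step_targets(Y, r, mu, sigma):
    """Weights W_ij and targets A_ij of Eq. (1): the WLRA surrogate for the M-step.
    `r` holds the E-step posteriors with a trailing component axis."""
    W = (r / sigma**2).sum(axis=-1)                  # W_ij = sum_c r_c / sigma_c^2
    A = Y - (r * mu / sigma**2).sum(axis=-1) / W     # A_ij = Y_ij - sum_c r_c mu_c / (sigma_c^2 W_ij)
    return W, A

# Toy check with made-up values: zero-mean components leave A = Y.
r = np.array([[[0.3, 0.7]]]); mu = np.zeros(2); sigma = np.array([1.0, 2.0])
Y = np.array([[1.5]])
W, A = m_step_targets(Y, r, mu, sigma)
print(W, A)   # W = 0.3/1 + 0.7/4 = 0.475; A = Y since mu = 0
```

For zero-mean scale mixtures (the GSM case used in the experiments), A is simply Y, and only the entry weights W change across EM iterations.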
Figures 2(a,b) show the quality of reconstruction of the L2 estimator and the ML bounded-GSM estimator, for these two noise distributions, for a fixed sample size of 300 rows, under varying signal strengths. We allowed ten Gaussian components, and did not observe any significant change in the estimator when the number of components was increased.

Figure 2: Norm of sines of canonical angles to correct subspace: (a) Random rank-3 subspace in ℝ^10 with Laplace noise. (b) Random rank-2 subspace in ℝ^10 with 0.99N(0,1) + 0.01N(0,100) noise. (c) span(2,1,1)′ ⊂ ℝ³ with 0.9N(0,1) + 0.1N(0,25) noise. The ML estimator converges to (2.34, 1, 1). Bars are one standard deviation tall. Inset: sine norm of the ML estimator plotted against the sine norm of the L2 estimator.

The ML estimator is overall more accurate than the L2 estimator: it succeeds in reliably reconstructing the low-rank signal for signals which are approximately three times weaker than those necessary for reliable reconstruction using the L2 estimator. The improvement in performance is not as dramatic, but still noticeable, for Laplace noise.

Comparison with Newton's Methods  Confronted with a general additive noise distribution, the approach presented here would be to rewrite, or approximate, the noise distribution as a Gaussian mixture and use WLRA to learn X via EM. A different approach is to consider the second-order Taylor expansion of the log-likelihood with respect to the entries of X, and to iteratively maximize it using WLRA [5, 7]. Such an approach requires calculating the first and second derivatives of the density. If the density is not specified analytically, or is unknown, these quantities need to be estimated. But beyond these issues, which can be overcome, lies the major problem of Newton's method: the noise density must be strictly log-concave and differentiable.
If the distribution is not log-concave, the quadratic expansion of the log-likelihood will be unbounded and will not admit an optimum. Attempting to ignore this fact, and for example "optimizing" U given V using the equations derived for non-negative weights, would actually drive us towards a saddle-point rather than a local optimum. The non-concavity does not only mean that we are not guaranteed a global optimum (which we are not guaranteed in any case, due to the non-convexity of the low-rank requirement); it does not yield even local improvements. On the other hand, approximating the distribution as a Gaussian mixture and using the EM method might still get stuck in local minima, but it is at least guaranteed to make local improvements.

Limiting ourselves to only log-concave distributions is a rather strong limitation, as it precludes, for example, all heavy-tailed distributions. Consider even the "balanced tail" Laplace distribution p(z) ∝ e^(−|z|). Since the log-density is piecewise linear, a quadratic approximation of it is a line, which of course does not attain a minimum value.

Consistency  Despite the gains in reconstruction presented above, the ML estimator may suffer from an asymptotic bias, making it inferior to the L2 estimator on large samples. We study the asymptotic limit of the ML estimator for a known product distribution p. We first establish a necessary and sufficient condition for consistency of the estimator.

The ML estimator is the minimizer of the empirical mean of the random function Φ(V) = min_u(−log p(y − uV′)).
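For concreteness (my own sketch, not the paper's): when p is i.i.d. Gaussian, the inner minimization over u is ordinary least squares, so Φ(V) is the squared residual of projecting y onto span(V), plus a normalization constant:

```python
import numpy as np

def phi(y, V, sigma=1.0):
    """Phi(V) = min_u -log p(y - u V') for i.i.d. Gaussian noise: the minimizing
    u V' is the orthogonal projection of y onto the column span of V."""
    u = np.linalg.lstsq(V, y, rcond=None)[0]     # least squares <=> Gaussian ML
    resid = y - V @ u
    d = y.shape[0]
    return 0.5 * resid @ resid / sigma**2 + 0.5 * d * np.log(2 * np.pi * sigma**2)

# For y already in span(V), only the normalization constant remains.
V = np.array([[1.0], [0.0]])                     # span(e1) in R^2
y = np.array([3.0, 0.0])
print(phi(y, V))
```

In this Gaussian case, minimizing the empirical mean of Φ(V) is exactly the L2 (PCA) estimator, which is why its consistency could be argued through the covariance spectrum alone.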
When the number of samples increases, the empirical means converge to the true means, and if E[Φ(V1)] < E[Φ(V2)], then with probability approaching one V2 will not minimize Ê[Φ(V)]. For the ML estimator to be consistent, E[Φ(V)] must be minimized by V0, establishing a necessary condition for consistency.

The sufficiency of this condition rests on the uniform convergence of {Ê[Φ(V)]}, which does not generally exist, or at least on uniform divergence from E[Φ(V0)]. It should be noted that the issue here is whether the ML estimator converges at all, since if it does converge, it must converge to the minimizer of E[Φ(V)]. Such convergence can be demonstrated at least in the special case when the marginal noise density p(zi) is continuous, strictly positive, and has finite variance and differential entropy. Under these conditions, the ML estimator is consistent if and only if V0 is the unique minimizer of E[Φ(V)].

When discussing E[Φ(V)], the expectation is with respect to both the noise distribution and the signal distribution. This is not quite satisfactory, as we would like results which are independent of the signal distribution, beyond the rank of its support. To do so, we must ensure the expectation of Φ(V) is minimized on V0 for all possible signals (and not only in expectation). Denote the objective φ(y; V) = min_u(−log p(y − uV′)). For any x ∈