{"title": "Self-Organizing Rules for Robust Principal Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 467, "page_last": 474, "abstract": null, "full_text": "Self-Organizing Rules for Robust Principal Component Analysis \n\nLei Xu(1,2)* and Alan Yuille(1) \n\n1. Division of Applied Sciences, Harvard University, Cambridge, MA 02138 \n2. Dept. of Mathematics, Peking University, Beijing, P.R. China \n\nAbstract \n\nIn the presence of outliers, the existing self-organizing rules for Principal Component Analysis (PCA) perform poorly. Using statistical physics techniques, including the Gibbs distribution, binary decision fields and effective energies, we propose self-organizing PCA rules which are capable of resisting outliers while fulfilling various PCA-related tasks, such as obtaining the first principal component vector, the first k principal component vectors, and directly finding the subspace spanned by the first k principal component vectors without solving for each vector individually. Comparative experiments have shown that the proposed robust rules improve the performances of the existing PCA algorithms significantly when outliers are present. \n\n1 INTRODUCTION \n\nPrincipal Component Analysis (PCA) is an essential technique for data compression and feature extraction, and it has been widely used in statistical data analysis, communication theory, pattern recognition and image processing. In the neural network literature, many studies have been made on learning rules for implementing PCA or on networks closely related to PCA (see Xu & Yuille, 1993 for a detailed reference list which contains more than 30 papers related to these issues). The existing rules can fulfil various PCA-type tasks for a number of application purposes. \n\n*Present address: Dept. 
of Brain and Cognitive Sciences, E10-243, Massachusetts Institute of Technology, Cambridge, MA 02139. \n\n467 \n\n468 \n\nXu and Yuille \n\nHowever, almost all the previously mentioned PCA algorithms are based on the assumption that the data has not been spoiled by outliers (except Xu, Oja & Suen, 1992, where outliers can be resisted to some extent). In practice, real data often contains some outliers, and usually they are not easy to separate from the data set. As shown by the experiments described in this paper, these outliers will significantly worsen the performances of the existing PCA learning algorithms. Currently, little attention has been paid to this problem in the neural network literature, although the problem is very important for real applications. \n\nRecently, there has been some success in applying the statistical physics approach to a variety of computer vision problems (Yuille, 1990; Yuille, Yang & Geiger, 1990; Yuille, Geiger & Bulthoff, 1991). In particular, it has also been shown that some techniques developed in robust statistics (e.g., redescending M-estimators, least-trimmed-squares estimators) appear naturally within the Bayesian formulation by the use of the statistical physics approach. In this paper we adapt this approach to tackle the problem of robust PCA. Robust rules are proposed for various PCA-related tasks, such as obtaining the first principal component vector, the first k principal component vectors, and principal subspaces. Comparative experiments have been made, and the results show that our robust rules improve the performances of the existing PCA algorithms significantly when outliers are present. \n\n2 PCA LEARNING AND ENERGY MINIMIZATION \n\nThere exist a number of self-organizing rules for finding the first principal component. 
Three of them are listed as follows (Oja, 1982, 1985; Xu, 1991, 1993): \n\nm(t+1) = m(t) + alpha_a(t)(x y - m(t) y^2),   (1) \n\nm(t+1) = m(t) + alpha_a(t)(x y - (m(t)/(m(t)^T m(t))) y^2),   (2) \n\nm(t+1) = m(t) + alpha_a(t)[y(x - x_hat) + (y - y')x],   (3) \n\nwhere y = m(t)^T x, x_hat = y m(t), y' = m(t)^T x_hat, and alpha_a(t) >= 0 is the learning rate, which decreases to zero as t -> infinity while satisfying certain conditions, e.g., sum_t alpha_a(t) = infinity and sum_t alpha_a(t)^q < infinity for some q > 1. \n\nEach of the three rules will converge to the principal component vector phi almost surely under some mild conditions, which are studied in detail by Oja (1982, 1985) and Xu (1991, 1993). Regarding m as the weight vector of a linear neuron with output y = m^T x, all three rules can be considered as modifications of the well-known Hebbian rule m(t+1) = m(t) + alpha_a(t) x y through additional terms that prevent ||m(t)|| from going to infinity as t -> infinity. \n\nThe performances of these rules deteriorate considerably when the data contains outliers. Although some outlier-resisting versions of eq.(1) and eq.(2) have recently been proposed (Xu, Oja & Suen, 1992), they work well only for data which is not severely spoiled by outliers. In this paper, we adopt a totally different approach: we generalize eq.(1), eq.(2) and eq.(3) into more robust versions by using the statistical physics approach. \n\nTo do so, we first need to connect these rules to energy functions. It follows from Xu (1991, 1993) and Xu & Yuille (1993) that the rules eq.(2) and eq.(3) are, respectively, on-line gradient descent rules for minimizing J_1(m) and J_2(m)(1): \n\nJ_1(m) = (1/N) sum_{i=1}^{N} (x_i^T x_i - m^T x_i x_i^T m / (m^T m)),   (4) \n\nJ_2(m) = (1/N) sum_{i=1}^{N} ||x_i - x_hat_i||^2.   (5) \n\nIt has also been proved that the rule given by eq.(1) satisfies (Xu, 1991, 1993): (a) h_1^T h_2 >= 0 and E(h_1)^T E(h_2) >= 0, with h_1 = x y - m y^2 and h_2 = x y - (m/(m^T m)) y^2; (b) E(h_1)^T E(h_3) > 0, with h_3 = y(x - x_hat) + (y - y')x; (c) both J_1 and J_2 have only one local (also global) minimum tr(Sigma) - phi^T Sigma phi, and all the other critical points (i.e., the points satisfying dJ_k(m)/dm = 0, k = 1, 2) are saddle points. Here Sigma = E{x x^T}, and phi is the eigenvector of Sigma corresponding to the largest eigenvalue. \n\nThat is, the rule eq.(1) is a downhill algorithm for minimizing J_1 in both the on-line sense and the average sense, and for minimizing J_2 in the average sense. \n\n(1) We have J_1(m) >= 0, since x^T x - m^T x x^T m/(m^T m) = ||x||^2 sin^2 theta_{xm} >= 0. \n\n3 GENERALIZED ENERGY AND ROBUST PCA \n\nWe further regard J_1(m), J_2(m) as special cases of the following general energy: \n\nJ(m) = (1/N) sum_{i=1}^{N} z(x_i, m), z(x_i, m) >= 0,   (6) \n\nwhere z(x_i, m) is the portion of energy contributed by the sample x_i; for J_1 and J_2, respectively, it is the corresponding summand in eq.(4) and eq.(5): \n\nz_1(x_i, m) = x_i^T x_i - m^T x_i x_i^T m / (m^T m), z_2(x_i, m) = ||x_i - x_hat_i||^2.   (7) \n\nFollowing Yuille (1990a & b), we now generalize the energy eq.(6) into \n\nE(V, m) = sum_{i=1}^{N} V_i z(x_i, m) + E_prior(V),   (8) \n\nwhere V = {V_i, i = 1, ..., N} is a binary field, with each V_i being a random variable taking value either 0 or 1. V_i acts as a decision indicator for deciding whether x_i is an outlier or a sample. When V_i = 1, the portion of energy contributed by the sample x_i is taken into consideration; otherwise, it is equivalent to discarding x_i as an outlier. E_prior(V) is the a priori portion of energy contributed by the a priori distribution of {V_i}. A natural choice is \n\nE_prior(V) = eta sum_{i=1}^{N} (1 - V_i).   (9) \n\nThis choice of prior has a natural interpretation: for fixed m it is energetically favourable to set V_i = 1 (i.e., not to regard x_i as an outlier) if z(x_i, m) < eta (i.e., the portion of energy contributed by x_i is smaller than a prespecified threshold), and to set it to 0 otherwise. \n\nBased on E(V, m), we define a Gibbs distribution (Parisi, 1988): \n\nP[V, m] = (1/Z) e^{-beta E(V, m)},   (10) \n\nwhere Z is the partition function which ensures sum_V sum_m P[V, m] = 1. Then we compute \n\nP_margin(m) = (1/Z) sum_V e^{-beta sum_i {V_i z(x_i, m) + eta(1 - V_i)}} = (1/Z) prod_i sum_{V_i in {0,1}} e^{-beta {V_i z(x_i, m) + eta(1 - V_i)}} = (1/Z_m) e^{-beta E_eff(m)},   (11) \n\nE_eff(m) = -(1/beta) sum_i log{1 + e^{-beta(z(x_i, m) - eta)}}.   (12) \n\nE_eff is called the effective energy. Each term in the sum for E_eff is approximately z(x_i, m) for small values of z but becomes constant as z(x_i, m) -> infinity. In this way outliers, which are more likely to yield large values of z(x_i, m), are treated differently from samples, and thus the estimate m obtained by minimizing E_eff(m) will be robust and able to resist outliers. \n\nE_eff(m) is usually not a convex function and may have many local minima. The statistical physics framework suggests using deterministic annealing to minimize E_eff(m): that is, using the following gradient descent rule eq.(13) to minimize E_eff(m) for small beta, and then tracking the minimum as beta increases to infinity (the zero-temperature limit): \n\nm(t+1) = m(t) - alpha_b(t) sum_i [1/(1 + e^{beta(z(x_i, m(t)) - eta)})] dz(x_i, m(t))/dm(t).   (13) \n\nMore specifically, with z chosen to correspond to the energies J_1 and J_2 respectively, we have the following batch-way learning rules for robust PCA: \n\nm(t+1) = m(t) + alpha_b(t) sum_i [1/(1 + e^{beta(z(x_i, m(t)) - eta)})] (x_i y_i - (m(t)/(m(t)^T m(t))) y_i^2),   (14) \n\nm(t+1) = m(t) + alpha_b(t) sum_i [1/(1 + e^{beta(z(x_i, m(t)) - eta)})] [y_i(x_i - x_hat_i) + (y_i - y_i')x_i].   (15) \n\nFor data that comes incrementally or in an on-line way, we correspondingly have the following adaptive or stochastic approximation versions: \n\nm(t+1) = m(t) + alpha_a(t) [1/(1 + e^{beta(z(x_i, m(t)) - eta)})] (x_i y_i - (m(t)/(m(t)^T m(t))) y_i^2),   (16) \n\nm(t+1) = m(t) + alpha_a(t) [1/(1 + e^{beta(z(x_i, m(t)) - eta)})] [y_i(x_i - x_hat_i) + (y_i - y_i')x_i].   (17) \n\nIt can be observed that the difference between eq.(2) and eq.(16), or between eq.(3) and eq.(17), is that the learning rate alpha_a(t) has been modified by a multiplicative factor \n\nalpha_m(t) = 1/(1 + e^{beta(z(x_i, m(t)) - eta)}),   (18) \n\nwhich adaptively modifies the learning rate to suit the current input x_i. This modifying factor has a similar function to that used in Xu, Oja & Suen (1992) for robust line fitting, but the modifying factor eq.(18) is more sophisticated and performs better. \n\nBased on the connection between the rule eq.(1) and J_1 or J_2, given in sec.2, we can also formally use the modifying factor alpha_m(t) to turn the rule eq.(1) into the following robust version: \n\nm(t+1) = m(t) + alpha_a(t) [1/(1 + e^{beta(z(x_i, m(t)) - eta)})] (x_i y_i - m(t) y_i^2).   (19) \n\n4 ROBUST RULES FOR k PRINCIPAL COMPONENTS \n\nIn a similar way to SGA (Oja, 1992) and GHA (Sanger, 1989), we can generalize the robust rules eq.(19), eq.(16) and eq.(17) into the following general form of robust rules for finding the first k principal components: \n\nm_j(t+1) = m_j(t) + alpha_a(t) [1/(1 + e^{beta(z(x_i(j), m_j(t)) - eta)})] Delta m_j(x_i(j), m_j(t)),   (20) \n\nx_i(j+1) = x_i(j) - sum_{r=1}^{j-1} y_i(r) m_r(t), y_i(j) = m_j(t)^T x_i(j), x_i(0) = x_i,   (21) \n\nwhere Delta m_j(x_i(j), m_j(t)) and z(x_i(j), m_j(t)) have four possibilities (Xu & Yuille, 1993). As an example, one of them is given here: \n\nDelta m_j(x_i(j), m_j(t)) = x_i(j) y_i(j) - m_j(t) y_i(j)^2, z(x_i(j), m_j(t)) = x_i(j)^T x_i(j) - y_i(j)^2/(m_j(t)^T m_j(t)). \n\nIn this case, eq.(20) can be regarded as a generalization of GHA (Sanger, 1989). \n\nWe can also develop an alternative set of rules for a type of net with asymmetric lateral weights, as used in Rubner & Schulten (1990). These rules can also obtain the first k principal components robustly in the presence of outliers (Xu & Yuille, 1993). \n\n5 ROBUST RULES FOR PRINCIPAL SUBSPACE \n\nLet M = [m_1, ..., m_k], Phi = [phi_1, ..., phi_k], y = [y_1, ..., y_k]^T and y = M^T x. It follows from Oja (1989) and Xu (1991) that the rules eq.(1) and eq.(3) can be generalized into eq.(22) and eq.(23), respectively: \n\nM(t+1) = M(t) + alpha_A(t) [x y^T - M(t) y y^T],   (22) \n\nM(t+1) = M(t) + alpha_A(t) [(x - u_hat) y^T + x (y - y')^T], u_hat = M y, y' = M^T u_hat.   (23) \n\nIn the case without outliers, by both rules the weight matrix M(t) will converge to a matrix M(infinity) whose column vectors m_j, j = 1, ..., k, span the k-dimensional principal subspace (Oja, 1989; Xu, 1991, 1993), although the vectors are, in general, not equal to the k principal component vectors phi_j, j = 1, ..., k. \n\nSimilar to the previously used procedure, we have the following results: \n\n(1) We can show that eq.(23) is an on-line or stochastic approximation rule which minimizes the energy J_3 in the gradient descent way (Xu, 1991, 1993): \n\nJ_3(M) = (1/N) sum_{i=1}^{N} ||x_i - u_hat_i||^2, u_hat = M y, y = M^T x,   (24) \n\nand that in the average sense the subspace rule eq.(22) is also an on-line \"downhill\" rule for minimizing the energy function J_3. \n\n(2) We can also generalize the non-robust rules eq.(22) and eq.(23) into robust versions by using the statistical physics approach again: \n\nM(t+1) = M(t) + alpha_A(t) [1/(1 + e^{beta(||x_i - u_hat_i||^2 - eta)})] [(x_i - u_hat_i) y_i^T + x_i (y_i - y_i')^T],   (25) \n\nM(t+1) = M(t) + alpha_A(t) [1/(1 + e^{beta(||x_i - u_hat_i||^2 - eta)})] [x_i y_i^T - M(t) y_i y_i^T].   (26) \n\n6 EXAMPLES OF EXPERIMENTAL RESULTS \n\nThe data x come from a population of 400 samples with zero mean. 
These samples are located on an elliptic ring centered at the origin of R^3, with its largest elliptic axis along the direction (-1, 1, 0) and with the plane of its other two axes intersecting the x-y plane at an acute angle (30\u00b0). Among the 400 samples, 10 points (only 2.5%) are randomly chosen and replaced by outliers. The resulting data set is shown in Fig.1. \n\nBefore the outliers were introduced, either the conventional sample-variance-matrix based approach (i.e., solving S phi = lambda phi, with S = (1/N) sum_{i=1}^{N} x_i x_i^T) or the unrobust rules eqs.(1), (2) and (3) can find the correct 1st principal component vector of this data set. \n\nOn the data set contaminated by outliers, shown in Fig.1, the result of the sample-variance-matrix based approach has an angular error from phi_p of 71.04\u00b0, a result that is definitely unacceptable. The results of using the proposed robust rules eq.(19), eq.(16) and eq.(17) are shown in Fig.2(a), in comparison with those of their unrobust counterparts, the rules eq.(1), eq.(2) and eq.(3). We observe that all the unrobust rules reach solutions with errors of more than 21\u00b0 from the correct direction of phi_p. By contrast, the robust rules still maintain very good accuracy: the error is about 0.36\u00b0. Fig.2(b) gives the results of solving for the first two principal component vectors. Again, the unrobust rule produces large errors of around 23\u00b0, while the robust rules have an error of about 1.7\u00b0. Fig.3 shows the results of solving for the 2-dimensional principal subspace; it is easy to see the significant improvements obtained by using the robust rules. 
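As a concrete illustration of how the modifying factor works, the following minimal Python/NumPy sketch implements the robust on-line rule eq.(19), i.e. Oja's Hebbian update gated by the factor of eq.(18). The values of beta and eta, the learning-rate schedule and the random initialization are illustrative assumptions for the sketch, not the settings used in the experiments reported above, and no deterministic annealing of beta is performed.

```python
import numpy as np

# Sketch of the robust on-line rule eq.(19): Oja's Hebbian update gated by
# the multiplicative factor a_m(t) = 1 / (1 + exp(beta * (z - eta))), eq.(18).
# beta, eta and the learning-rate schedule are illustrative choices only.
def robust_oja(X, beta=1.0, eta=2.0, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    m = rng.standard_normal(X.shape[1])
    m /= np.linalg.norm(m)                       # random unit-norm initial weight vector
    t = 0
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:     # one pass over shuffled samples
            t += 1
            lr = 1.0 / (100.0 + t)               # decreasing learning rate a_a(t)
            y = m @ x                            # linear neuron output y = m^T x
            z = x @ x - y * y / (m @ m)          # energy portion of x, the summand of eq.(4)
            g = 1.0 / (1.0 + np.exp(np.clip(beta * (z - eta), -50.0, 50.0)))
            m = m + lr * g * (x * y - m * y * y)  # gated Oja update, eq.(19)
    return m / np.linalg.norm(m)
```

A sample whose energy portion z greatly exceeds the threshold eta gets a gate g near zero and is effectively discarded as an outlier, while a sample with small z is learned from at nearly the full rate; this is the role played by the binary decision field V after marginalization.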
\n\nFigure 1: The projections of the data on the x-y, y-z and z-x planes, with 10 outliers. \n\nAcknowledgements \n\nWe would like to thank DARPA and the Air Force for support with contracts AFOSR-89-0506 and F4969092-J-0466. \n\nWe would also like to mention that some further issues about the proposed robust rules are studied in Xu & Yuille (1993), including the selection of the parameters alpha, beta and eta, the extension of the rules to robust Minor Component Analysis (MCA), and the relations of the rules to the two main types of existing robust PCA algorithms in the statistics literature, as well as to Maximum Likelihood (ML) estimation of finite mixture distributions. \n\nReferences \n\nE. Oja, J. Math. Bio. 16, 1982, 267-273. \nE. Oja & J. Karhunen, J. Math. Anal. Appl. 106, 1985, 69-84. \nE. Oja, Int. J. Neural Systems 1, 1989, 61-68. \nE. Oja, Neural Networks 5, 1992, 927-935. \nG. Parisi, Statistical Field Theory, Addison-Wesley, Reading, Mass., 1988. \nJ. Rubner & K. Schulten, Biological Cybernetics 62, 1990, 193-199. \nT.D. Sanger, Neural Networks 2, 1989, 459-473. \nL. Xu, Proc. of IJCNN'91-Singapore, Nov. 1991, 2368-2373. \nL. Xu, Least mean square error reconstruction for self-organizing neural nets, Neural Networks 6, 1993, in press. \nL. Xu, E. Oja & C.Y. Suen, Neural Networks 5, 1992, 441-457. \nL. Xu & A.L. Yuille, Robust principal component analysis by self-organizing rules based on statistical physics approach, IEEE Trans. Neural Networks, 1993, in press. \nA.L. Yuille, Neural Computation 2, 1990, 1-24. \nA.L. Yuille, D. Geiger & H.H. Bulthoff, Network 2, 1991, 423-442. \n\nFigure 2: The learning curves obtained in the comparative experiments for principal component vectors. (a) For the first principal component vector: RA1, RA2 and RA3 denote the robust rules eq.(19), eq.(16) and eq.(17), respectively, and UA1, UA2 and UA3 denote the rules eq.(1), eq.(2) and eq.(3), respectively. The horizontal axis denotes the learning steps, and the vertical axis is theta_{m(t), phi_{p1}}, with theta_{x,y} denoting the acute angle between x and y. (b) For the first two principal component vectors, by the robust rule eq.(20) and its unrobust counterpart GHA: UAk1 and UAk2 denote the learning curves of the angles theta_{m_1(t), phi_{p1}} and theta_{m_2(t), phi_{p2}}, respectively, obtained by GHA; RAk1 and RAk2 denote the learning curves of the angles obtained by using the robust rule eq.(20). In both (a) and (b), phi_{pj}, j = 1, 2, is the correct 1st and 2nd principal component vector, respectively. \n\nFigure 3: The learning curves obtained in the comparative experiments for solving for the 2-dimensional principal subspace. Each learning curve expresses the change of the residual e_r(t) = sum_{j=1}^{k} ||m_j(t) - sum_{r=1}^{k} (m_j(t)^T phi_{pr}) phi_{pr}||^2 with the learning steps. The smaller the residual, the closer the estimated principal subspace is to the correct one. SUB1 and SUB2 denote the unrobust rules eq.(22) and eq.(23), respectively, and RSUB1 and RSUB2 denote the robust rules eq.(26) and eq.(25), respectively.", "award": [], "sourceid": 686, "authors": [{"given_name": "Lei", "family_name": "Xu", "institution": null}, {"given_name": "Alan", "family_name": "Yuille", "institution": null}]}