{"title": "Linear Multilayer Independent Component Analysis for Large Natural Scenes", "book": "Advances in Neural Information Processing Systems", "page_first": 897, "page_last": 904, "abstract": null, "full_text": " Linear Multilayer Independent Component\n Analysis for Large Natural Scenes\n\n\n\n Yoshitatsu Matsuda \n Kazunori Yamaguchi Laboratory\n Department of General Systems Studies\n Graduate School of Arts and Sciences\n The University of Tokyo\n Japan 153-8902\n matsuda@graco.c.u-tokyo.ac.jp\n\n\n Kazunori Yamaguchi\n yamaguch@graco.c.u-tokyo.ac.jp\n\n\n\n\n Abstract\n\n In this paper, linear multilayer ICA (LMICA) is proposed for extracting\n independent components from quite high-dimensional observed signals\n such as large-size natural scenes. There are two phases in each layer of\n LMICA. One is the mapping phase, where a one-dimensional mapping\n is formed by a stochastic gradient algorithm which makes more highly-\n correlated (non-independent) signals be nearer incrementally. Another\n is the local-ICA phase, where each neighbor (namely, highly-correlated)\n pair of signals in the mapping is separated by the MaxKurt algorithm.\n Because LMICA separates only the highly-correlated pairs instead of all\n ones, it can extract independent components quite efficiently from ap-\n propriate observed signals. In addition, it is proved that LMICA always\n converges. Some numerical experiments verify that LMICA is quite ef-\n ficient and effective in large-size natural image processing.\n\n\n\n1 Introduction\n\nIndependent component analysis (ICA) is a recently-developed method in the fields of\nsignal processing and artificial neural networks, and has been shown to be quite useful\nfor the blind separation problem [1][2][3] [4]. The linear ICA is formalized as follows. Let\ns and A are N -dimensional source signals and N N mixing matrix. Then, the observed\nsignals x are defined as\n x = As. 
(1)\n\nThe purpose is to find A (or its inverse W) when only the observed (mixed) signals are given. In other words, ICA blindly extracts the source signals from M samples of the observed signals as follows:\n\n Ŝ = W X, (2)\n\nwhere X is an N × M matrix of the observed signals and Ŝ is the estimate of the source signals. This is a typical ill-conditioned problem, but ICA can solve it by assuming that the source signals are generated according to independent and non-Gaussian probability distributions. In general, ICA algorithms find W by maximizing a criterion (called the contrast function) such as a higher-order statistic (e.g. the kurtosis) of every component of Ŝ. That is, ICA algorithms can be regarded as optimization methods for such criteria. Some efficient algorithms for this optimization problem have been proposed, for example, the fast ICA algorithm [5][6], the relative gradient algorithm [4], and JADE [7][8].\n\n(Author homepage: http://www.graco.c.u-tokyo.ac.jp/~matsuda)\n\nNow, suppose that quite high-dimensional observed signals (namely, N quite large) are given, such as large-size natural scenes. In this case, even these efficient algorithms are of little use because they have to find all N² components of W. Recently, we proposed a new algorithm for this problem, which can find global independent components by integrating local ICA modules. Developing this approach further, we propose in this paper a new efficient ICA algorithm named \"the linear multilayer ICA algorithm (LMICA).\" It will be shown that LMICA is considerably more efficient than other standard ICA algorithms in the processing of natural scenes. This paper is an extension of our previous works [9][10].\n\nThis paper is organized as follows. In Section 2, the algorithm is described. 
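Before the formal description in Section 2, the two alternating phases of LMICA can be previewed in code. The sketch below is our own simplified illustration, not the authors' implementation: `pairwise_rotation_angle` follows the closed-form pairwise rotation of Eq. (10), while `greedy_reorder` is a crude deterministic stand-in for the stochastic-gradient mapping phase; all function names are ours, and X is assumed to be a centered N × M NumPy array.

```python
import numpy as np

def pairwise_rotation_angle(xi, xj):
    # Closed-form angle maximizing the pair's fourth-power sum (cf. Eqs. (10)-(11)).
    alpha = np.sum(xi**3 * xj - xi * xj**3)
    beta = np.sum(xi**4 + xj**4 - 6.0 * xi**2 * xj**2) / 4.0
    return 0.25 * np.arctan2(alpha, beta)

def greedy_reorder(X):
    # Stand-in for the mapping phase: greedily chain signals so that
    # pairs with large sum_k x_ik^2 x_jk^2 become neighbors.
    E = X**2
    C = E @ E.T                      # C[i, j] = sum_k x_ik^2 x_jk^2
    order, used = [0], {0}
    while len(order) < X.shape[0]:
        row = C[order[-1]].copy()
        row[list(used)] = -np.inf    # exclude already-placed signals
        nxt = int(np.argmax(row))
        order.append(nxt)
        used.add(nxt)
    return order

def lmica_layer(X):
    # One layer: mapping phase (here greedy), then local-ICA phase that
    # rotates each neighboring pair by its optimal angle.
    X = X[greedy_reorder(X)]         # fancy indexing copies, caller's X intact
    for i in range(X.shape[0] - 1):
        t = pairwise_rotation_angle(X[i], X[i + 1])
        c, s = np.cos(t), np.sin(t)
        X[i], X[i + 1] = c * X[i] + s * X[i + 1], -s * X[i] + c * X[i + 1]
    return X
```

Since the identity rotation is always a candidate, each pairwise rotation can only increase the total sum of fourth powers, so stacking such layers improves the kurtosis contrast monotonically; this is the intuition behind the convergence claim above.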
In Section 3, numerical experiments verify that LMICA is quite efficient in image processing and can extract some interesting edge detectors from large natural scenes. Lastly, this paper is concluded in Section 4.\n\n2 Algorithm\n\n2.1 basic idea\n\nLMICA extracts all the independent components approximately by repeating the following two phases. One is the mapping phase, which brings more highly-correlated signals nearer to each other. The other is the local-ICA phase, where each neighboring pair of signals in the mapping is separated by the MaxKurt algorithm [8]. The mechanism of LMICA is illustrated in Fig. 1. Note that this illustration holds only in the ideal case where the mixing matrix A is generated according to such a hierarchical model; it does not hold for an arbitrary A. It will be shown in Section 3 that this hierarchical model is quite effective at least for natural scenes.\n\n2.2 mapping phase\n\nIn the mapping phase, the given signals X are arranged in a one-dimensional array so that pairs (i, j) with higher Σ_k x_ik^2 x_jk^2 are placed nearer. Letting Y = (y_i) be the coordinates of the signals (y_i for the i-th signal), the following objective function is defined:\n\n Φ(Y) = Σ_{i,j} Σ_k x_ik^2 x_jk^2 (y_i - y_j)^2. (3)\n\nThe optimal mapping is found by minimizing Φ with respect to Y under the constraints that Σ_i y_i = 0 and Σ_i y_i^2 = 1. It is well-known that such optimization problems can be solved efficiently by a stochastic gradient algorithm [11][12]. In this case, the stochastic gradient algorithm is given as follows (see [10] for the details of the derivation):\n\n y_i(T+1) := y_i(T) - η_T (β z_i y_i - γ z_i), (4)\n\nFigure 1: The illustration of LMICA (the ideal case): Each number from 1 to 8 means a source signal. 
In the first local-ICA phase, each neighboring pair of the completely-mixed signals (denoted \"1-8\") is partially separated into \"1-4\" and \"5-8.\" Next, the mapping phase rearranges the partially-separated signals so that more highly-correlated signals are nearer. In consequence, the four \"1-4\" signals (and similarly the \"5-8\" ones) are brought together. Then, the local-ICA phase partially separates the neighboring pairs into \"1-2,\" \"3-4,\" \"5-6,\" and \"7-8.\" By repeating the two phases, LMICA can extract all the sources quite efficiently.\n\nwhere η_T is the step size at the T-th time step, z_i = x_ik^2 (k is randomly selected from {1, ..., M} at each time step),\n\n β = Σ_i z_i, (5)\n\nand\n\n γ = Σ_i z_i y_i. (6)\n\nBy calculating β and γ before the update for each i, each update requires just O(N) computation. Eq. (4) is guaranteed to converge to a local minimum of the objective function Φ(Y) if η_T decreases sufficiently slowly (lim_{T→∞} η_T = 0 and Σ_T η_T = ∞).\n\nBecause Y in the above method is continuous, each continuous y_i is replaced by its rank within Y at the end of the mapping phase. That is, y_i := 1 for the largest y_i, y_j := N for the smallest one, and so on. The corresponding permutation is given as π(i) = y_i.\n\nThe total procedure of the mapping phase for given X is as follows:\n\nmapping phase\n\n 1. x_ik := x_ik - x̄_i for each i and k, where x̄_i is the mean Σ_k x_ik / M.\n 2. y_i = i and π(i) = i for each i.\n 3. Until convergence, repeat the following steps:\n (a) Select k randomly from {1, ..., M}, and let z_i = x_ik^2 for each i.\n (b) Update each y_i by Eq. (4).\n (c) Normalize Y to satisfy Σ_i y_i = 0 and Σ_i y_i^2 = 1.\n 4. Discretize y_i.\n 5. 
Update X by x_π(i),k := x_ik for each i and k.\n\n2.3 local-ICA phase\n\nIn the local-ICA phase, the following contrast function ψ(X) (the negated sum of kurtoses) is used (the MaxKurt algorithm in [8]):\n\n ψ(X) = - Σ_{i,k} x_ik^4, (7)\n\nand ψ(X) is minimized by \"rotating\" the neighboring pairs of signals (namely, under an orthogonal transformation). For each neighboring pair (i, i+1), a rotation matrix R_i(θ) is given as\n\n R_i(θ) = [ I_{i-1} 0 0 0 ; 0 cos θ sin θ 0 ; 0 -sin θ cos θ 0 ; 0 0 0 I_{N-i-1} ], (8)\n\nwhere I_n is the n × n identity matrix. Then, the optimal angle θ̂ is given as\n\n θ̂ = argmin_θ ψ(X(θ)), (9)\n\nwhere X(θ) = R_i(θ) X. After some tedious transformation of the equations (see [8]), it is shown that θ̂ is determined analytically by the following equations:\n\n sin 4θ̂ = α_ij / sqrt(α_ij^2 + β_ij^2), cos 4θ̂ = β_ij / sqrt(α_ij^2 + β_ij^2), (10)\n\nwhere\n\n α_ij = Σ_k (x_ik^3 x_jk - x_ik x_jk^3), β_ij = Σ_k (x_ik^4 + x_jk^4 - 6 x_ik^2 x_jk^2) / 4, (11)\n\nand j = i + 1.\n\nNow, the procedure of the local-ICA phase for given X is as follows:\n\nlocal-ICA phase\n\n 1. Let W_local = I_N and A_local = I_N.\n 2. For each i = 1, ..., N - 1:\n (a) Find the optimal angle θ̂ by Eq. (10).\n (b) X := R_i(θ̂) X, W_local := R_i W_local, and A_local := A_local R_i^t.\n\n2.4 complete algorithm\n\nThe complete algorithm of LMICA for any given observed signals X is given by repeating the mapping phase and the local-ICA phase alternately. Here, P is the permutation matrix corresponding to π.\n\nlinear multilayer ICA algorithm\n\n 1. Initial settings: Let X be the given observed signal matrix, and let W and A be I_N.\n 2. Repetition: Do the following two phases alternately L times.\n (a) Mapping phase: Find the optimal permutation matrix P and the optimally-arranged signals X by the mapping phase. 
Then, W := P W and A := A P^t.\n (b) Local-ICA phase: Find the optimal matrices W_local and A_local and the updated X by the local-ICA phase. Then, W := W_local W and A := A A_local.\n\n2.5 some remarks\n\nRelation to the MaxKurt algorithm. Eq. (10) is exactly the same as in the MaxKurt algorithm [8]. The crucial difference between LMICA and MaxKurt is that LMICA optimizes just the N - 1 neighboring pairs instead of all N(N-1)/2 pairs as in MaxKurt. In LMICA, the pairs with higher \"costs\" (higher Σ_k x_ik^2 x_jk^2) are brought nearer in the mapping phase, so the independent components can be extracted effectively by optimizing just the neighboring pairs.\n\nContrast function. In order to keep consistency between this paper and our previous work [10], the following contrast function is used instead of Eq. (7) in Section 3:\n\n ψ(X) = Σ_{i,j,k} x_ik^2 x_jk^2. (12)\n\nThe minimization of Eq. (12) is equivalent to that of Eq. (7) under an orthogonal transformation.\n\nPre-whitening. Though LMICA (which is based on MaxKurt) presupposes that X is pre-whitened, the algorithm in Section 2.4 is applicable to any raw X without pre-whitening. Because no pre-whitening method suitable for LMICA has been found yet, raw images of natural scenes are given as X in the numerical experiments in Section 3. In this non-whitened case, the mixing matrix A is restricted to being orthogonal and the influence of the second-order statistics is not removed. Nevertheless, it will be shown in Section 3 that the higher-order statistics of X yield some interesting results.\n\n3 Results\n\nIt is well-known that various local edge detectors can be extracted from natural scenes by standard ICA algorithms [13][14]. Here, LMICA was applied to the same problem. 30000 samples of natural scenes of 12 × 12 pixels were given as the observed signals X; that is, N = 144 and M = 30000. The original natural scenes were downloaded from http://www.cis.hut.fi/projects/ica/data/images/. 
The number of layers L was set to 720, where one layer means one pair of the mapping and local-ICA phases. For comparison, experiments without the mapping phase were also carried out, where the mapping Y was randomly generated. In addition, the standard MaxKurt algorithm [8] was run for 10 iterations. The contrast function (Eq. (12)) was calculated at each layer and averaged over 10 independently generated X's. Fig. 2-(a) shows the decreasing curves of ψ for normal LMICA and for LMICA without the mapping phase. The cross points show the result at each iteration of MaxKurt. Because one iteration of MaxKurt is equivalent to 72 layers of LMICA with respect to the number of optimized signal pairs, a scaling factor of 72 is applied. Surprisingly, LMICA nearly converged to the optimal point within just 10 layers. The number of parameters within 10 layers is 143 × 10, which is much smaller than the degrees of freedom of A (144 · 143 / 2). This suggests that LMICA gives a quite suitable model for natural scenes. The calculation times, with the values of ψ, are shown in Table 1. The table shows that the time costs of the mapping phase are not much higher than those of the local-ICA phase. The fact that 10 layers of LMICA required much less time (22 sec.) than one iteration of MaxKurt (94 sec.) and reached a nearly optimal ψ (4.91) verifies the efficiency of LMICA. Note that an iteration of MaxKurt cannot be stopped halfway. Fig. 3 shows 5 × 5 representative edge detectors at several layers of LMICA. At the 20th layer (Fig. 3-(a)), rough and local edge detectors are recognizable, though a little unclear. As the layers proceed, the edge detectors become clearer and more global (see Figs. 3-(b) and 3-(c)). It is interesting that the ICA-like local edges (where the higher-order statistics are dominant) of the early stage are transformed into PCA-like global edges (where the second-order statistics are dominant) at the later stage (see [13]). 
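Returning to the efficiency comparison: the scaling factor of 72 used in Fig. 2-(a) and the parameter count at the 10th layer are simple pair-counting arithmetic, which can be checked directly (N = 144 as in the experiment above):

```python
N = 144                                          # 12 x 12 pixel patches
pairs_per_maxkurt_iteration = N * (N - 1) // 2   # MaxKurt sweeps all pairs
pairs_per_lmica_layer = N - 1                    # LMICA rotates only neighbors
print(pairs_per_maxkurt_iteration // pairs_per_lmica_layer)    # -> 72

rotation_angles_in_10_layers = 10 * pairs_per_lmica_layer      # 143 x 10 = 1430
degrees_of_freedom_of_A = N * (N - 1) // 2                     # 144 * 143 / 2 = 10296
print(rotation_angles_in_10_layers < degrees_of_freedom_of_A)  # -> True
```

So ten LMICA layers adjust fewer than one seventh of the rotation parameters that a single full MaxKurt sweep does, which is why the near-convergence within 10 layers is notable.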
For comparison, Fig. 3-(d) shows the result at the 10th iteration of MaxKurt. It is similar to Fig. 3-(c), as expected.\n\nIn addition, we used large-size natural scenes: 100000 samples of natural scenes of 64 × 64 pixels were given as X. MaxKurt and other well-known ICA algorithms are not feasible for such a large-scale problem because they require huge computation. Fig. 2-(b) shows the decreasing curve of ψ for the large-size natural scenes. LMICA was run for 1000 layers and consumed about 69 hours on an Intel 2.8 GHz CPU. The curve shows that ψ decreased rapidly in the first 20 layers and converged around the 500th layer. This verifies that LMICA is quite efficient in the analysis of large-size natural scenes. Fig. 4 shows some edge detectors generated at the 1000th layer. It is interesting that some \"compound\" detectors such as a \"cross\" were generated in addition to simple \"long-edge\" detectors. In a famous previous work [13] which applied ICA and PCA to small-size natural scenes, symmetric global edge detectors similar to our \"compound\" ones could be generated by PCA, which manages only the second-order statistics. On the other hand, asymmetric local edge detectors similar to our simple \"long-edge\" ones could not be generated by PCA and could be extracted only by ICA, which utilizes the higher-order statistics. In comparison, our LMICA could extract various local and global detectors simultaneously from large-size natural scenes. Besides, it is expected from the results for small-size images (see Fig. 3) that various other detectors emerge at each layer. In summary, these results show that LMICA can efficiently extract many useful and various detectors from large-size natural scenes. It is also interesting that there was a plateau in the neighborhood of the 10th layer. It suggests that large-size natural scenes may be generated by two different generative models. 
A closer inspection, however, is beyond the scope of this paper.\n\n4 Conclusion\n\nIn this paper, we proposed the linear multilayer ICA algorithm (LMICA). We carried out numerical experiments on natural scenes, which verified that LMICA can find approximations of the independent components quite efficiently and is applicable to large problems. We are now analyzing the results of LMICA on large-size natural scenes of 64 × 64 pixels, and we are planning to apply this algorithm to quite large-scale images such as those of 256 × 256 pixels. We are also planning to utilize LMICA in the data mining of quite high-dimensional data spaces, such as text mining. In addition, we are trying to find a pre-whitening method suitable for LMICA. Some normalization techniques in the local-ICA phase may be promising.\n\nTable 1: Calculation times with the values of the contrast function ψ (Eq. (12)): They are averages over 10 runs at the 10th layer (approximation) and the 720th layer (convergence) in LMICA (the normal one and the one without the mapping phase). In addition, those of 10 iterations in MaxKurt (approximately corresponding to L = 10 × 72 = 720) are shown. They were measured on an Intel 2.8 GHz CPU.\n\n LMICA | LMICA without mapping | MaxKurt (10 iterations)\n 10th layer: 22 sec. (ψ = 4.91) | 9.3 sec. (ψ = 17.6) | -\n 720th layer: 1600 sec. (ψ = 4.57) | 670 sec. (ψ = 4.57) | 940 sec. (ψ = 4.57)\n\nReferences\n\n[1] C. Jutten and J. Herault. Blind separation of sources (part I): An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1):1-10, July 1991.\n\n[2] P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287-314, 1994.\n\n[3] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995.\n\n[4] J.-F. Cardoso and Beate Laheld. Equivariant adaptive source separation. 
IEEE Transactions on Signal Processing, 44(12):3017-3030, December 1996.\n\n[5] A. Hyvarinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483-1492, 1997.\n\n[6] A. Hyvarinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626-634, 1999.\n\n[7] Jean-Francois Cardoso and Antoine Souloumiac. Blind beamforming for non-Gaussian signals. IEE Proceedings-F, 140(6):362-370, December 1993.\n\n[8] Jean-Francois Cardoso. High-order contrasts for independent component analysis. Neural Computation, 11(1):157-192, January 1999.\n\n[9] Yoshitatsu Matsuda and Kazunori Yamaguchi. Linear multilayer ICA algorithm integrating small local modules. In Proceedings of ICA2003, pages 403-408, Nara, Japan, 2003.\n\n[10] Yoshitatsu Matsuda and Kazunori Yamaguchi. Linear multilayer independent component analysis using stochastic gradient algorithm. In Independent Component Analysis and Blind Source Separation - ICA2004, volume 3195 of LNCS, pages 303-310, Granada, Spain, September 2004. Springer-Verlag.\n\n[11] Yoshitatsu Matsuda and Kazunori Yamaguchi. Global mapping analysis: stochastic approximation for multidimensional scaling. International Journal of Neural Systems, 11(5):419-426, 2001.\n\n[12] Yoshitatsu Matsuda and Kazunori Yamaguchi. An efficient MDS-based topographic mapping algorithm. Neurocomputing, 2005. In press.\n\n[13] A. J. Bell and T. J. Sejnowski. The \"independent components\" of natural scenes are edge filters. Vision Research, 37(23):3327-3338, December 1997.\n\n[14] J. H. van Hateren and A. van der Schaaf. Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London B, 265:359-366, 1998.\n\n(a). for small-size images. (b). 
for large-size images.\n\nFigure 2: Decreasing curves of the contrast function ψ along the number of layers (in log scale): (a). For small-size natural scenes of 12 × 12 pixels. The solid and dotted curves show the decrease of ψ by LMICA and by LMICA without the mapping phase (random mapping), respectively. The cross points show the results of MaxKurt; each iteration of MaxKurt approximately corresponds to 72 layers with respect to the number of optimized signal pairs. (b). For large-size natural scenes of 64 × 64 pixels. The curve displays the decrease of ψ by LMICA over 1000 layers.\n\n(a). at 20th layer. (b). at 100th layer. (c). at 720th layer. (d). MaxKurt.\n\nFigure 3: Representative edge detectors from natural scenes of 12 × 12 pixels: (a). The basis vectors generated by LMICA at the 20th layer. (b). At the 100th layer. (c). At the 720th layer. (d). After 10 iterations of the MaxKurt algorithm.\n\nFigure 4: Representative edge detectors from natural scenes of 64 × 64 pixels.\n", "award": [], "sourceid": 2563, "authors": [{"given_name": "Yoshitatsu", "family_name": "Matsuda", "institution": null}, {"given_name": "Kazunori", "family_name": "Yamaguchi", "institution": null}]}