{"title": "Kernel-Based Extraction of Slow Features: Complex Cells Learn Disparity and Translation Invariance from Natural Images", "book": "Advances in Neural Information Processing Systems", "page_first": 269, "page_last": 276, "abstract": null, "full_text": "Kernel-based Extraction of Slow Features: Complex Cells Learn Disparity and Translation Invariance from Natural Images\n\nAlistair Bray and Dominique Martinez*\nCORTEX Group, LORIA-INRIA, Nancy, France\nbray@loria.fr, dmartine@loria.fr\n\nAbstract\n\nIn Slow Feature Analysis (SFA [1]), it has been demonstrated that high-order invariant properties can be extracted by projecting inputs into a nonlinear space and computing the slowest changing features in this space; this has been proposed as a simple general model for learning nonlinear invariances in the visual system. However, this method is highly constrained by the curse of dimensionality, which limits it to simple theoretical simulations. This paper demonstrates that by using a different but closely related objective function for extracting slowly varying features ([2, 3]), and then exploiting the kernel trick, this curse can be avoided. Using this new method we show that both the complex-cell properties of translation invariance and disparity coding can be learnt simultaneously from natural images when complex cells are driven by simple cells also learnt from the image.\n\nThe notion of maximising an objective function based upon the temporal predictability of output has been progressively applied in modelling the development of invariances in the visual system. 
Földiák used it indirectly via a Hebbian trace rule for modelling the development of translation invariance in complex cells [4] (closely related to many other models [5, 6, 7]); this rule has been used to maximise invariance as one component of a hierarchical system for object and face recognition [8]. On the other hand, similar functions have been maximised directly in networks for extracting linear [2] and nonlinear [9, 1] visual invariances. Direct maximisation of such functions has recently been used to model complex cells [10] and as an alternative to maximising sparseness/independence in modelling simple cells [11]. Slow Feature Analysis [1] combines many of the best properties of these methods to provide a good general nonlinear model. That is, it uses an objective function that minimises the first-order temporal derivative of the outputs; it provides a closed-form solution which maximises this function by projecting inputs into a nonlinear space; it exploits sphering (or PCA-whitening) of the data to ensure that all outputs have unit variance and are uncorrelated. However, the method suffers from the curse of dimensionality in that the nonlinear feature space soon becomes very large as the input dimension grows, and yet this feature space must be represented explicitly in order for the essential sphering to occur.\n\nThe alternative that we propose here is to use the objective function of Stone [2, 9], which maximises output variance over a long period whilst minimising variance over a shorter period; in the linear case, this can be implemented by a biologically plausible mixture of Hebbian and anti-Hebbian learning on the same synapses [2]. In recent work, Stone has proposed a closed-form solution for maximising this function in the linear domain of blind source separation that does not involve data-sphering.\n\n*http://www.loria.fr/equipes/cortex/
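In the linear case this objective has a direct closed-form reading: maximise F = V/U, the ratio of output variance measured about a long-term mean to that measured about a short-term mean, via a generalised eigendecomposition. The following is a minimal illustrative sketch of that computation; the function name, half-life parameterisation and ridge constant are assumptions for illustration, not the implementation of [2, 9]:

```python
# Illustrative sketch of Stone's linear objective: maximise F = V/U,
# the ratio of long-term to short-term output variance.
# All names and constants here are assumptions, not the authors' code.
import numpy as np
from scipy.linalg import eigh

def stone_slow_features(X, hl_long=200.0, hl_short=2.0):
    """X: (T, d) time series. Returns (ratios, W) sorted by predictability."""
    T, d = X.shape

    def centred(half_life):
        # Deviations from an exponential running mean with the given half-life.
        lam = 2.0 ** (-1.0 / half_life)
        mean = np.zeros(d)
        out = np.empty_like(X, dtype=float)
        for t in range(T):
            mean = lam * mean + (1.0 - lam) * X[t]
            out[t] = X[t] - mean
        return out

    Xl = centred(hl_long)    # long-term deviations  -> numerator V
    Xs = centred(hl_short)   # short-term deviations -> denominator U
    C_long = Xl.T @ Xl / T
    C_short = Xs.T @ Xs / T + 1e-8 * np.eye(d)   # small ridge for stability
    # F = (w^T C_long w) / (w^T C_short w) is maximised by the top
    # generalised eigenvector of C_long w = lambda * C_short w.
    ratios, W = eigh(C_long, C_short)
    order = np.argsort(ratios)[::-1]             # most predictable first
    return ratios[order], W[:, order]
```

On a toy input whose first component is a slow sinusoid and whose second is white noise, the top generalised eigenvector concentrates its weight on the slow component; note that no sphering of the data is required.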
\n\nThis paper describes how this method can be kernelised. The use of the \"kernel trick\" allows projection of inputs into a nonlinear kernel-induced feature space of very high (possibly infinite) dimension which is never explicitly represented or accessed. This leads to an efficient method that maps to an architecture that could be biologically implemented either by Sigma-Pi neurons, or fixed RBF networks (as described for SFA [1]). We demonstrate that using this method to extract features that vary slowly in natural images leads to the development of both the complex-cell properties of translation invariance and disparity coding simultaneously.\n\n1 Finding Slow Features with kernels\n\nGiven l time-series vectors x_i, the aim is to maximise Stone's objective F = V/U, the ratio of a long-term variance V of the output to a short-term variance U. Writing the weight vector as an expansion in the mapped training data, w = sum_{i=1}^{l} alpha_i phi(x_i), all inner products reduce to kernel evaluations k(x_i, x_j) = phi(x_i)^T phi(x_j), and maximising F reduces to a generalised eigenproblem in the expansion coefficients alpha_i. The output is then the projection y = w^T phi(x) of a new input x onto w, which is equivalent to y = sum_{i=1}^{l} alpha_i k(x_i, x).\n\nFinding a sparse solution\n\nIf the eigenproblem is solved on the entire training set then this algorithm also suffers from the curse of dimensionality, since the (l x l) matrices easily become computationally intractable. A sparse solution using a small subset p of the training data in the expansion is therefore essential: this is called the basis set BS. The output is now y = sum_{i in BS} alpha_i k(x_i, x), and the solution must lie in the subspace spanned by BS. The kernel elements K_ij are computed between the p basis vectors x_i and the l training data x_j. Thus K, and its long- and short-term centred versions Kbar and Ktilde, are rectangular (p x l), but the covariance matrices (Kbar Kbar^T) and (Ktilde Ktilde^T) used in the eigenproblem are only (p x p). This approach can effectively solve very large problems, provided p << l. The question of course is how to choose the basis vectors: it is both necessary and sufficient that they span the space of the solution in the kernel-induced feature space. In a recent version of the algorithm [12] we use the sparse greedy method of [13] as a preprocessing step. 
This efficiently finds a small basis set that minimises the least-squares error between data points in feature space and those reconstructed in the feature space defined by the basis set. In the simulations below we used a less efficient greedy algorithm that performed equally well here, but requires a considerably larger basis set^1.\n\nThe complete online algorithm requires minimal memory, making it ideal for very large data sets. The implementation estimates the long- and short-term kernel means online using exponential time averages parameterised by half-lives lambda_s, lambda_l (as in [9]). Likewise, the covariance matrices (Kbar Kbar^T) and (Ktilde Ktilde^T) are updated online at each time step; e.g. (Kbar Kbar^T) is updated to (Kbar Kbar^T) + kbar kbar^T, where kbar is the column vector of kernel values centred using the long-term mean and computed for the current time step. There is therefore no need to explicitly compute or store kernel matrices.\n\n2 Simulation Results\n\nThe simulation was performed using a grey-level stereo pair of resolution 128x128, shown in Figure 1[a]. A new 2D direction 0\u00b0 < theta <= 360\u00b0 was selected every 64 time steps, and the image was translated by one pixel per time step in this direction (with toroidal wrap-around).\n\nA set of 20 monocular simple cells was learnt using the algorithm described in [11] that maximises a nonlinear measure of temporal correlation (TRS) between the present and a previous output, based upon the transfer function g(y) = ln cosh(y).\n\n^1 Vectors x are added to BS if, for all y in BS, |k(x, y)| <= tau, where the threshold tau is slowly annealed down from tau_0 = 1, and the size of BS is set at 400.\n\nFigure 1: Training on natural images. [a] Stereo pair. [b] Linear filters that maximise TRS [11]. [c] Output of filters for left image. [d] Output of nonlinear complex cells in binocular simulation. [e] Output of complex cells in monocular simulation.
\n\nWe chose this algorithm since it is based on a nonlinear measure of temporal correlation and yet provides a linear sparse-distributed coding, very similar to that of ICA for describing simple cells [14]. We did not use the objective function described above since in the linear case it yields filters similar to the local Fourier series^2. The filters were optimised for this particular stereo pair; simulations using a greater variation of more natural images resulted in more spatially localised filters very similar to those in [14, 11]. We used only the 20 most predictable filters since results did not improve through use of the full set. The simple cell receptive field was 8x8, and during learning data was provided by both eyes at one position in the image^3. The oriented Gabor-like weight vectors for the 20 cells contributing most to the TRS objective function are shown in Figure 1[b], and the result of processing the left image with these linear filters is shown in Figure 1[c].\n\nThe complex cells received input from these 20 types of simple cells when processing both the left and right eye images. Complex cells had a spatial receptive field of 4x4; each cell therefore received 320 simple cell inputs (2x4x4x20); these were normalised to have unit variance and zero mean.\n\n^2 An intuitive explanation for this necessity for nonlinearity in the objective function is provided in [11]; in brief, the temporal correlation of the output of a Gabor-like linear filter is low, whilst a similar correlation for a measure of the power in the filter is high.\n\n^3 The dimension of the PCA-whitened space was reduced from 63 to 40, with Delta t = 1, eta = 10^-3, delta = 10^-1; 10^5 input vectors were used.\n\nFigure 2: Testing on the simulated pair used in [9]. [a] Artificial stereo pair. [b] Underlying disparity function. [c] Output of the most predictable complex cell trained on Figure 1[a].\n\n
The most predictable features were extracted for this input vector over 10^5 time steps, using the kernel-based method described above, using data at just one position in the image. The basis set was made up of 400 input vectors, and a polynomial kernel of degree 2 was used. The temporal half-lives for estimating the short- and long-term means in U and V were lambda_s = 2, lambda_l = 200. The algorithm therefore extracts 400 outputs; we display the outputs for the 8 most predictable (determined by highest eigenvalues) in Figure 1[d]; further values were hard to interpret. Below this, in Figure 1[e], we show the complex outputs obtained if we substitute the right image with the left one in the stereo pair, so making the simulation monocular.\n\nConsider first the monocular simulation in [e]. It is visually apparent that the most predictable units are strongly selective for regions of iso-orientation (looking quite different to any simple cell response in [c]). In this particular image, this results in different \"T\"-shaped parts of the Pentagon of considerable size being distinctly isolated. Since in our network the complex cell receptive field size in the image is only 50% greater than that for the simple cells, this implies translation invariance: over the time (or space) that a simple cell of the correct orientation gives a strong but transitory response, the complex cell provides a strong continuous response. That is, its response is invariant to the phase that determines the profile of the simple cell response.\n\nConsider now the stereo simulation in [d]. This tendency is still present (e.g. the 3rd output), but it is confounded with another parameter that isolates the complete shape of the Pentagon from the background. 
This is most striking in the output provided by the first feature; that is, this parameter is the most predictable in the image (providing an eigenvalue lambda = V/U = 7.28, as opposed to lambda ~ 4 for the \"T\"-shapes in [e]). This parameter is binocular disparity, generated by the variation in depth of the Pentagon roof compared to the ground. The proof of this lies in Figure 2. Here we have taken the artificial stereo pair used in [9], shown in Figure 2[a], that has been generated using the known eggshell disparity function shown in Figure 2[b]. We presented this to the network trained wholly on the Pentagon stereo pair; it can be seen that the most predictable component, shown in Figure 2[c], replicates the disparity function of [b]^4.\n\n^4 The output is somewhat noisy, partly because the image has few linear features like those in Figure 1[b]; if we train the simple and complex cells on this image we get a much cleaner result.\n\n3 Discussion\n\nThe simulation above confirms that the linear properties of simple cells, and two of the nonlinear properties of complex cells (translation invariance and disparity coding), can be extracted simultaneously from natural images through maximising functions of temporal coherence in the input. Although these properties have been dealt with in others' work discussed above, they have been considered either in isolation or through theoretical simulation. It is only because the kernel-based method we present allows us to work efficiently with large amounts of data in a nonlinear feature space derived from high-dimensional input that we have been able to extract both complex cell properties together from realistic image data.\n\nThe method described above is computationally efficient. 
It is also biologically plausible in as much as [a] it uses a reasonable objective function based on temporal coherence of output, and [b] the final computation required to extract these most predictable outputs could be performed either by Sigma-Pi neurons, or fixed RBF networks (as in SFA [1]). However, we do not claim either that the precise formulation of the objective function is biologically exact, or that a biological system would use the same means to arrive at the final architecture that computes the optimal solution: the learning algorithm is certainly different. Our approach is therefore focussed on the constraints provided by [a] and [b].\n\nThe method also exploits a distributed representation for maximising the objective function that results from the generalised eigenvector solution. Is this plausible given the emphasis that has been laid on sparse coding early in the visual system [15]? Sparse representations are often the result of constraining different outputs to be uncorrelated or, stronger, independent. However, as one ascends the perceptual pathway generating more reduced nonlinear representations, even the constraint of uncorrelated output may be too strong, or unnecessary, to create the highly robust representations exploited by the brain. For example, Rolls reports and defends a highly distributed coding of faces in infero-temporal cortical areas, with cells responding to a large proportion of stimuli to some degree ([16], chapter 5). Our method enforces the constraint that successive eigenvectors are orthogonal in the metrics defined by the long- and short-term covariances, and can result in the partly correlated output expected in the robust distributed coding Rolls proposes. However, this would not be the case if the long-term means are estimated with a temporal half-life sufficiently large that these means do not differ from the true expected values. 
\n\nFinally, although maximising the sparseness of representation may be inappropriate \nin deeper cortex, one might suggest that the coding of parameters we obtain in our \nsimulation is not highly distributed across outputs: in reality each complex cell \nresponds to a limited range of disparity and orientation. However, it can be seen \nin Figure l[d]) that there is a clear separation of orientation, and some mixing of \ndisparity and orientation-sensitivity. It is a feature of our method that different \noutputs must have different measures of predictability (i.e. eigenvalues) . In the \ncase of sparse coding of translation invariance, for example, there is no obvious \nreason why this assumption should be met by cells coding different orientations \nalone; it can however be enforced by coding different mixtures of orientation and \ndisparity parameters leading to distinct eigenvalues. There is certainly no practical \nor biological reason why these parameters should be carried separately in the visual \nsystem (see [1] for discussion). \n\n\fIn conclusion, this work provides further support for the fruitful approach of ex(cid:173)\ntracting non-trivial parameters through maximisation of objective functions based \non temporal properties of perceptual input. One of the challenges here is to extend \ncurrent linear models into the nonlinear domain whilst limiting the extra complexity \nthey bring, which can lead to excess degrees of freedom and computational prob(cid:173)\nlems. We have described here a kernel-based method that goes some way towards \nthis, extracting disparity and translation simultaneously for complex cells trained \non natural images. \n\nReferences \n\n[1] L. Wiskott and T .J . Sejnowski. Slow feature analysis: Unsupervised learning of \n\ninvariances. Neural Computation, 14(4) , 2002. \n\n[2] J. V. Stone and A. J. Bray. A learning rule for extracting spatio-temporal invariances. 
\n\nNetwork: Computation in Neural Systems, 6(3):429- 436, 1995. \n\n[3] James V. Stone. Blind source separation using temporal predictability. Neural Com(cid:173)\n\nputation, (13):1559- 1574, 200l. \n\n[4] P. Foldiak. Learning invariance from transformation sequences. Neural Computation, \n\n3(2):194- 200, 1991. \n\n[5] H. G. Barrow and A. J. Bray. A model of adaptive development of complex cortical \ncells. In 1. Aleksander and J. Taylor, editors, Artificial Neural Networks II: Proceedings \nof the International Conference on Artificial Neural Networks. Elsevier Publishers, \n1992. \n\n[6] K. Fukushima. Self-organisation of shift-invariant receptive fields. N eural N etworks, \n\n12:826- 834, 1999. \n\n[7] M. Stewart Bartlett and T.J. Sejnowski. Learning viewpoint invariant face represen(cid:173)\n\ntations from visual experience in an attractor network. Network: Computation in \nNeural Systems, 9(3):399- 417, 1998. \n\n[8] E. T . Rolls and T. Milward. A model of invariant object recognition in the visual \nsystem: Learning rules, activation functions , lateral inhibition, and information-based \nperformance measures. Neural Computation, 12:2547- 2572, 2000. \n\n[9] J. V. Stone. Learning perceptually salient visual parameters using spatiotemporal \n\nsmoothness constraints. N eural Computation, 8(7):1463- 1492, October 1996. \n\n[10] K. Kayser, W. Einhiiuser, O. Dummer, P. Konig, and K. Kording. Extracting slow \nsubspaces from natural videos leads to complex cells. In ICANN 2001, LNCS 2130, \npages 1075- 1080. Springer-Verlag Berlin Heidelberg 2001 , 200l. \n\n[11] J. Hurri and A. Hyvarinen. Simple-cell-like receptive fields maximise temporal coher(cid:173)\nence in natural video. Submitted, http://www.cis.hut.fi/)armo/publications. 2002. \n\n[12] D. Martinez and A. Bray. Nonlinear blind source separation using kernels. IEEE \n\nTrans. Neural Networks, 14(1):228- 235, Jan. 2003. \n\n[13] G. Baudat and F . Anouar. Kernel-based methods and function approximation. 
International Joint Conference on Neural Networks IJCNN, pages 1244-1249, 2001.\n\n[14] A. J. Bell and T. J. Sejnowski. The independent components of natural scenes are edge filters. Vision Research, 37:3327-3338, 1997.\n\n[15] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607-609, 1996.\n\n[16] E. T. Rolls and G. Deco. Computational Neuroscience of Vision. Oxford University Press, 2002.", "award": [], "sourceid": 2209, "authors": [{"given_name": "Alistair", "family_name": "Bray", "institution": null}, {"given_name": "Dominique", "family_name": "Martinez", "institution": null}]}