{"title": "Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 988, "page_last": 996, "abstract": "Many models for computations in recurrent networks of neurons assume that the network state moves from some initial state to some fixed point attractor or limit cycle that represents the output of the computation. However, experimental data show that in response to a sensory stimulus the network state moves from its initial state through a trajectory of network states and eventually returns to the initial state, without reaching an attractor or limit cycle in between. This type of network response, where salient information about external stimuli is encoded in characteristic trajectories of continuously varying network states, raises the question of how a neural system could compute with such a code and arrive, for example, at a temporally stable classification of the external stimulus. We show that a known unsupervised learning algorithm, Slow Feature Analysis (SFA), could be an important ingredient for extracting stable information from these network trajectories. In fact, if sensory stimuli are more often followed by another stimulus from the same class than by a stimulus from another class, SFA approaches the classification capability of Fisher's Linear Discriminant (FLD), a powerful algorithm for supervised learning. 
We apply this principle to simulated cortical microcircuits, and show that it enables readout neurons to learn discrimination of spoken digits and detection of repeating firing patterns within a stream of spike trains with the same firing statistics, without requiring any supervision for learning.", "full_text": "Replacing supervised classi\ufb01cation learning by\n\nSlow Feature Analysis in spiking neural networks\n\nStefan Klamp\ufb02, Wolfgang Maass\n\nInstitute for Theoretical Computer Science\n\nGraz University of Technology\n\nA-8010 Graz, Austria\n\n{klampfl,maass}@igi.tugraz.at\n\nAbstract\n\nIt is open how neurons in the brain are able to learn without supervision to discrim-\ninate between spatio-temporal \ufb01ring patterns of presynaptic neurons. We show\nthat a known unsupervised learning algorithm, Slow Feature Analysis (SFA), is\nable to acquire the classi\ufb01cation capability of Fisher\u2019s Linear Discriminant (FLD),\na powerful algorithm for supervised learning, if temporally adjacent samples are\nlikely to be from the same class. We also demonstrate that it enables linear readout\nneurons of cortical microcircuits to learn the detection of repeating \ufb01ring patterns\nwithin a stream of spike trains with the same \ufb01ring statistics, as well as discrimi-\nnation of spoken digits, in an unsupervised manner.\n\n1 Introduction\n\nSince the presence of supervision in biological learning mechanisms is rare, organisms often have\nto rely on the ability of these mechanisms to extract statistical regularities from their environment.\nRecent neurobiological experiments [1] have suggested that the brain uses some type of slowness\nobjective to learn the categorization of external objects without a supervisor. Slow Feature Analysis\n(SFA) [2] could be a possible mechanism for that. 
We establish a relationship between the unsupervised\nSFA learning method and a commonly used method for supervised classi\ufb01cation learning:\nFisher\u2019s Linear Discriminant (FLD) [3]. More precisely, we show that SFA approximates the classi\ufb01cation\ncapability of FLD by replacing the supervisor with the simple heuristic that two temporally\nadjacent samples in the input time series are likely to be from the same class. Furthermore, we\ndemonstrate in simulations of a cortical microcircuit model that SFA could also be an important\ningredient in extracting temporally stable information from trajectories of network states and that it\nsupports the idea of \u201canytime\u201d computing, i.e., it provides information about the stimulus identity\nnot only at the end of a trajectory of network states, but already much earlier.\n\nThis paper is structured as follows. We start in section 2 with brief recaps of the de\ufb01nitions of\nSFA and FLD. We discuss the relationship between these methods for unsupervised and supervised\nlearning in section 3, and investigate the application of SFA to trajectories in section 4. In section 5\nwe report results of computer simulations of several SFA readouts of a cortical microcircuit model.\nSection 6 concludes with a discussion.\n\n2 Basic De\ufb01nitions\n\n2.1 Slow Feature Analysis (SFA)\n\nSlow Feature Analysis (SFA) [2] is an unsupervised learning algorithm that extracts the slowest\ncomponents $y_i$ from a multi-dimensional input time series $x$ by minimizing the temporal variation\n$\\Delta(y_i)$ of the output signal $y_i$, which is de\ufb01ned in [2] as the average of its squared temporal derivative.\nThus the objective is to minimize\n\n$$\\min \\; \\Delta(y_i) := \\langle \\dot{y}_i^2 \\rangle_t. \\qquad (1)$$\n\nThe notation $\\langle \\cdot \\rangle_t$ denotes averaging over time, and $\\dot{y}$ is the time derivative of $y$. The additional\nconstraints of zero mean ($\\langle y_i \\rangle_t = 0$) and unit variance ($\\langle y_i^2 \\rangle_t = 1$) avoid the trivial constant solution\n$y_i(t) \\equiv 0$. 
If multiple slow features are extracted, a third constraint ($\\langle y_i y_j \\rangle_t = 0, \\; \\forall j < i$)\nensures that they are decorrelated and ordered by decreasing slowness, i.e., $y_1$ is the slowest feature\nextracted, $y_2$ the second slowest feature, and so on. In other words, SFA \ufb01nds those functions $g_i$ out\nof a certain prede\ufb01ned function space that produce the slowest possible outputs $y_i = g_i(x)$ under\nthese constraints.\n\nThis optimization problem is hard to solve in the general case [4], but if we assume that the time\nseries $x$ has zero mean ($\\langle x \\rangle_t = 0$) and if we only allow linear functions $y = w^T x$ the problem\nsimpli\ufb01es to the objective\n\n$$\\min \\; J_{SFA}(w) := \\frac{w^T \\langle \\dot{x} \\dot{x}^T \\rangle_t \\, w}{w^T \\langle x x^T \\rangle_t \\, w}. \\qquad (2)$$\n\nThe matrix $\\langle x x^T \\rangle_t$ is the covariance matrix of the input time series and $\\langle \\dot{x} \\dot{x}^T \\rangle_t$ denotes the covariance\nmatrix of time derivatives (or time differences, for discrete time) of the input time series. The\nweight vector $w$ which minimizes (2) is the solution to the generalized eigenvalue problem\n\n$$\\langle \\dot{x} \\dot{x}^T \\rangle_t \\, w = \\lambda \\langle x x^T \\rangle_t \\, w \\qquad (3)$$\n\ncorresponding to the smallest eigenvalue $\\lambda$. To make use of a larger function space one typically\nconsiders linear combinations $y = w^T z$ of \ufb01xed nonlinear expansions $z = h(x)$ and performs the\noptimization (2) in this high-dimensional space.\n\n2.2 Fisher\u2019s Linear Discriminant (FLD)\n\nFisher\u2019s Linear Discriminant (FLD) [3], on the other hand, is a supervised learning method, since\nit is applied to labeled training examples $\\langle x, c \\rangle$, where $c \\in \\{1, \\ldots, C\\}$ is the class to which this\nexample $x$ belongs. 
The goal is to \ufb01nd a weight vector $w$ so that the ability to predict the class of $x$\nfrom the value of $w^T x$ is maximized.\n\nFLD searches for that projection direction $w$ which maximizes the separation between classes while\nat the same time minimizing the variance within classes, thereby minimizing the class overlap of the\nprojected values:\n\n$$\\max \\; J_{FLD}(w) := \\frac{w^T S_B w}{w^T S_W w}. \\qquad (4)$$\n\nFor $C$ point sets $S_c$, each with $N_c$ elements and means $\\mu_c$, $S_B$ is the between-class covariance\nmatrix given by the separation of the class means, $S_B = \\sum_c N_c (\\mu_c - \\mu)(\\mu_c - \\mu)^T$, and $S_W$ is the\nwithin-class covariance matrix given by $S_W = \\sum_c \\sum_{x \\in S_c} (x - \\mu_c)(x - \\mu_c)^T$. Again, the vector\n$w$ optimizing (4) can be viewed as the solution to a generalized eigenvalue problem,\n\n$$S_B w = \\lambda S_W w, \\qquad (5)$$\n\ncorresponding to the largest eigenvalue $\\lambda$.\n\n3 SFA can acquire the classi\ufb01cation capability of FLD\n\nSFA and FLD receive different data types as inputs: unlabeled time series for SFA, in contrast to\nlabeled single data points for the FLD. Therefore, in order to apply the unsupervised SFA learning\nalgorithm to the same classi\ufb01cation problem as the supervised FLD, we have to convert the labeled\ntraining samples into a time series of unlabeled data points that can serve as an input to the SFA\nalgorithm$^1$. In the following we investigate the relationship between the weight vectors found by\nboth methods for a particular way of time series generation.\n\n$^1$A \ufb01rst link between SFA and pattern recognition has been established in [5]. There the optimization is\nperformed over all possible pattern pairs of the same class. However, it might often be implausible to have\naccess to such an arti\ufb01cial time series, e.g., from the perspective of a readout neuron that receives input on-the-\ufb02y. 
We take a different approach and apply the standard SFA algorithm to a time series consisting of randomly\nselected patterns of the classi\ufb01cation problem, where the class at each time step is switched with a certain\nprobability.\n\n2\n\n\fWe consider a classi\ufb01cation problem with $C$ classes, i.e., assume we are given point sets\n$S_1, S_2, \\ldots, S_C \\subset \\mathbb{R}^n$. Let $N_c$ be the number of points in $S_c$ and let $N = \\sum_{c=1}^{C} N_c$ be the total\nnumber of points. In order to create a time series $x_t$ out of these point sets we de\ufb01ne a Markov model\nwith $C$ states $S = \\{1, 2, \\ldots, C\\}$, one for each class, and choose at each time step $t = 1, \\ldots, T$ a\nrandom point from the class that corresponds to the current state in the Markov model. We de\ufb01ne\nthe transition probability from state $i \\in S$ to state $j \\in S$ as\n\n$$P_{ij} = \\begin{cases} a \\cdot \\frac{N_j}{N} & \\text{if } i \\neq j, \\\\ 1 - \\sum_{k \\neq j} P_{ik} & \\text{if } i = j, \\end{cases} \\qquad (6)$$\n\nwith some appropriate constant $a > 0$. The stationary distribution of this Markov model is\n$\\pi = (N_1/N, N_2/N, \\ldots, N_C/N)$. We choose the initial distribution $p_0 = \\pi$, i.e., at any time $t$ the\nprobability that point $x_t$ is chosen from class $c$ is $N_c/N$.\nFor this particular way of generating the time series from the original classi\ufb01cation problem we can\nexpress the matrices $\\langle x x^T \\rangle_t$ and $\\langle \\dot{x} \\dot{x}^T \\rangle_t$ of the SFA objective (2) in terms of the within-class and\nbetween-class scatter matrices of the FLD (4), $S_W$ and $S_B$, in the following way [6]:\n\n$$\\langle x x^T \\rangle_t = \\frac{1}{N} S_W + \\frac{1}{N} S_B \\qquad (7)$$\n\n$$\\langle \\dot{x} \\dot{x}^T \\rangle_t = \\frac{2}{N} S_W + a \\cdot \\frac{2}{N} S_B \\qquad (8)$$\n\nNote that only $\\langle \\dot{x} \\dot{x}^T \\rangle_t$ depends on $a$, whereas $\\langle x x^T \\rangle_t$ does not.\nFor small $a$ we can neglect the effect of $S_B$ on $\\langle \\dot{x} \\dot{x}^T \\rangle_t$ in (8). 
In this case the time series consists\nmainly of transitions within a class, whereas switching between the two classes is relatively rare.\nTherefore the covariance of time derivatives is mostly determined by the within-class scatter of\nthe two point sets, and both matrices become approximately proportional: $\\langle \\dot{x} \\dot{x}^T \\rangle_t \\approx 2/N \\cdot S_W$.\nMoreover, if we assume that $S_W$ (and therefore $\\langle \\dot{x} \\dot{x}^T \\rangle_t$) is positive de\ufb01nite, we can rewrite the SFA\nobjective (2) as\n\n$$\\min J_{SFA}(w) \\;\\Leftrightarrow\\; \\max \\frac{1}{J_{SFA}(w)} \\;\\Leftrightarrow\\; \\max \\frac{w^T \\langle x x^T \\rangle_t \\, w}{w^T \\langle \\dot{x} \\dot{x}^T \\rangle_t \\, w} \\;\\Leftrightarrow\\; \\max \\; \\frac{1}{2} + \\frac{1}{2} \\cdot \\frac{w^T S_B w}{w^T S_W w} \\;\\Leftrightarrow\\; \\max J_{FLD}(w). \\qquad (9)$$\n\nThat is, the weight vector that optimizes the SFA objective (2) also optimizes the FLD objective\n(4). For $C > 2$ this equivalence can be seen by recalling the de\ufb01nition of SFA as a generalized\neigenvalue problem (3) and inserting (7) and (8):\n\n$$\\begin{aligned} \\langle \\dot{x} \\dot{x}^T \\rangle_t \\, W &= \\langle x x^T \\rangle_t \\, W \\Lambda \\\\ S_B W &= S_W W \\left[ 2 \\Lambda^{-1} - E \\right], \\end{aligned} \\qquad (10)$$\n\nwhere $W = (w_1, \\ldots, w_n)$ is the matrix of generalized eigenvectors, $\\Lambda = \\mathrm{diag}(\\lambda_1, \\ldots, \\lambda_n)$ is\nthe diagonal matrix of generalized eigenvalues, and $E$ denotes the identity matrix. The last line of (10) is just the formulation of FLD as\na generalized eigenvalue problem (5). More precisely, the eigenvectors of the SFA problem are also\neigenvectors of the FLD problem. Note that the eigenvalues correspond by $\\lambda_i^{FLD} = 2/\\lambda_i^{SFA} - 1$,\nwhich means the order of eigenvalues is reversed ($\\lambda_i^{SFA} > 0$). Thus, the subspace spanned by the\nslowest features is the same that optimizes separability in terms of Fisher\u2019s Discriminant, and the\nslowest feature is the weight vector which achieves maximal separation.\n\nFigure 1A demonstrates this relationship on a sample two-class problem in two dimensions for the\nspecial case of $N_1 = N_2 = N/2$. 
In this case at each time the class is switched with probability\n$p = a/2$ or is left unchanged with probability $1 - p$. We interpret the weight vectors found by\nboth methods as normal vectors of hyperplanes in the input space, which we place simply onto\nthe mean value $\\mu$ of all training data points (i.e., the hyperplanes are de\ufb01ned as $w^T x = \\theta$ with\n$\\theta = w^T \\mu$). One sees that the weight vector found by the application of SFA to the training time\nseries $x_t$ generated with $p = 0.2$ is approximately equal to the weight vector resulting from FLD on\nthe initial sets of training points. This demonstrates that SFA has extracted the class of the points as\nthe slowest varying feature by \ufb01nding a direction that separates both classes.\n\n3\n\n\fFigure 1: Relationship between SFA and FLD for a two-class problem in 2D. (A) Sample point\nsets with 250 points for each class. The dashed line indicates a hyperplane corresponding to the\nweight vector $w_{FLD}$ resulting from the application of FLD to the two-class problem. The black\nsolid line shows a hyperplane for the weight vector $w_{SFA}$ resulting from SFA applied to the time\nseries generated from these training points as described in the text ($T = 5000$, $p = 0.2$). The\ndotted line displays an additional SFA hyperplane resulting from a time series generated with $p = 0.45$.\nAll hyperplanes are placed onto the mean value of all training points. (B) Dependence of\nthe error between the weight vectors found by FLD and SFA on the switching probability $p$. This\nerror is de\ufb01ned as the average angle between the weight vectors obtained on 100 randomly chosen\nclassi\ufb01cation problems. Error bars denote the standard error of the mean.\n\nFigure 1B quanti\ufb01es the deviation of the weight vector resulting from the application of SFA to the\ntime series from the one found by FLD on the original points. We use the average angle between both\nweight vectors as an error measure. 
It can be seen that if $p$ is low, i.e., transitions between classes\nare rare compared to transitions within a class, the angle between the vectors is small and SFA\napproximates FLD very well. The angle increases moderately with increasing $p$; even with higher\nvalues of $p$ (up to 0.45) the approximation is reasonable and a good classi\ufb01cation by the slowest\nfeature can be achieved (see dotted hyperplane in Figure 1A). As soon as $p$ reaches a value of about\n0.5, the error grows almost immediately to the maximal value of 90\u00b0. For $p = 0.5$ ($a = 1$) points\nare chosen independently of their class, making the matrices $\\langle \\dot{x} \\dot{x}^T \\rangle_t$ and $\\langle x x^T \\rangle_t$ proportional. This\nmeans that every possible vector $w$ is a solution to the generalized eigenvalue problem (3), resulting\nin an average angle of about 45\u00b0.\n\n4 Application to trajectories of training examples\n\nIn the previous section we have shown that SFA approximates the classi\ufb01cation capability of FLD\nif the probability is low that two successive points in the input time series to SFA are from different\nclasses. Apart from this temporal structure induced by the class information, however, these samples\nare chosen independently at each time step. In this section we investigate how the SFA objective\nchanges when the input time series consists of a sequence of trajectories of samples instead of\nindividual points only.\nFirst, we consider a time series $x_t$ consisting of multiple repetitions of a \ufb01xed prede\ufb01ned trajectory\n$\\tilde{t}$, which is embedded into noise input consisting of a random number of points drawn from the\nsame distribution as the trajectory points, but independently at each time step. It is easy to show\n[6] that for such a time series the SFA objective (2) reduces to \ufb01nding the eigenvector of the matrix\n$\\tilde{\\Sigma}_t$ corresponding to the largest eigenvalue. 
$\\tilde{\\Sigma}_t$ is the covariance matrix of the trajectory $\\tilde{t}$ with $\\tilde{t}$\ndelayed by one time step, i.e., it measures the temporal covariances (hence the index $t$) of $\\tilde{t}$ with time\nlag 1. Since the transitions between two successive points of the trajectory $\\tilde{t}$ occur much more often\nin the time series $x_t$ than transitions between any other possible pair of points, SFA has to respond\nas smoothly as possible (i.e., maximize the temporal correlations) during $\\tilde{t}$ in order to produce the\nslowest possible output. This means that SFA is able to detect repetitions of $\\tilde{t}$ by responding during\nsuch instances with a distinctive shape.\n\nNext, we consider a classi\ufb01cation problem given by $C$ sets of trajectories, $T_1, T_2, \\ldots, T_C \\subset (\\mathbb{R}^n)^{\\tilde{T}}$,\ni.e., the elements of each set $T_c$ are sequences of $\\tilde{T}$ $n$-dimensional points. We generate a time\nseries according to the same Markov model as described in the previous section, except that we\ndo not choose individual points at each time step; rather, we generate a sequence of trajectories.\nFor this time series we can express the matrices $\\langle x x^T \\rangle_t$ and $\\langle \\dot{x} \\dot{x}^T \\rangle_t$ in terms of the within-class\nand between-class scatter of the individual points of the trajectories in $T_c$, analogously to (7) and\n(8) [6]. While the expression for $\\langle x x^T \\rangle_t$ is unchanged, the temporal correlations induced by the\nuse of trajectories do affect the covariance of temporal differences $\\langle \\dot{x} \\dot{x}^T \\rangle_t$. First,\nthis matrix additionally depends on the temporal covariance $\\tilde{\\Sigma}_t$ with time lag 1 of all available\ntrajectories in all sets $T_c$. 
Second, the effective switching probability is reduced by a factor of $1/\\tilde{T}$.\nWhenever a trajectory is selected, $\\tilde{T}$ points from the same class are presented in succession.\nThis means that even for a small switching probability$^2$ the objective of SFA cannot be solely reduced\nto the FLD objective, but rather that there is a trade-off between the tendency to separate\ntrajectories of different classes (as explained by the relation between $S_B$ and $S_W$) and the tendency\nto produce smooth responses during individual trajectories (determined by the temporal covariance\nmatrix $\\tilde{\\Sigma}_t$):\n\n$$\\min \\; J_{SFA}(w) = \\frac{w^T \\langle \\dot{x} \\dot{x}^T \\rangle_t \\, w}{w^T \\langle x x^T \\rangle_t \\, w} \\approx \\frac{1}{N} \\cdot \\frac{w^T S_W w}{w^T \\langle x x^T \\rangle_t \\, w} - \\tilde{p} \\cdot \\frac{w^T \\tilde{\\Sigma}_t w}{w^T \\langle x x^T \\rangle_t \\, w}, \\qquad (11)$$\n\nwhere $N$ is here the total number of points in all trajectories and $\\tilde{p}$ is the fraction of transitions\nbetween two successive points of the time series that belong to the same trajectory. The weight vector\n$w$ which minimizes the \ufb01rst term in (11) is equal to the weight vector found by the application\nof FLD to the classi\ufb01cation problem of the individual trajectory points (note that $S_B$ enters (11)\nthrough $\\langle x x^T \\rangle_t$, cf. eq. (9)). The weight vector which maximizes the second term is the one which\nproduces the slowest possible response during individual trajectories. If the separation between the\ntrajectory classes is large compared to the temporal correlations (i.e., the \ufb01rst term in (11) dominates\nfor the resulting $w$) the slowest feature will be similar to the weight vector found by FLD on the\ncorresponding classi\ufb01cation problem. 
On the other hand, as the temporal correlations of the trajec-\ntories increase, i.e., the trajectories themselves become smoother, the slowest feature will tend to\nfavor exploiting this temporal structure of the trajectories over the separation of different classes (in\nthis case, (11) is dominated by the second term for the resulting w).\n\n5 Application to linear readouts of a cortical microcircuit model\n\nIn the following we discuss several computer simulations of a cortical microcircuit of spiking neu-\nrons that demonstrate the theoretical arguments given in the previous section. We trained a number\nof linear SFA readouts3 on a sequence of trajectories of network states, each of which is de\ufb01ned\nby the low-pass \ufb01ltered spike trains of the neurons in the circuit. Such recurrent circuits typically\nprovide a temporal integration of the input stream and project it nonlinearly into a high-dimensional\nspace [7], thereby boosting the expressive power of the subsequent linear SFA readouts. Note, how-\never, that the optimization (2) implicitly performs an additional whitening of the circuit response. As\na model for a cortical microcircuit model we use the laminar circuit from [8] consisting of 560 spik-\ning neurons organized into layers 2/3, 4, and 5, with layer-speci\ufb01c connection probabilities obtained\nfrom experimental data [9, 10].\n\nIn the \ufb01rst experiment we investigated the ability of SFA to detect a repeating \ufb01ring pattern within\nnoise input of the same \ufb01ring statistics. 
We recorded circuit trajectories in response to 200 repetitions\nof a \ufb01xed spike pattern which are embedded into a continuous Poisson input stream of the same rate.\nWe then trained linear SFA readouts on this sequence of circuit trajectories (we used an exponential\n\ufb01lter with $\\tau$ = 30ms and a sample time of 1ms).\n\n$^2$In fact, for suf\ufb01ciently long trajectories the SFA objective becomes effectively independent of the switching\nprobability.\n\n$^3$We interpret the linear combination de\ufb01ned by each slow feature as the weight vector of a hypothetical\nlinear readout.\n\n5\n\n\fFigure 2: Detecting embedded spike patterns. (A) From top to bottom: sample stimulus sequence,\nresponse spike trains of the network, and slowest features. The stimulus consists of 10 channels\nand is de\ufb01ned by repetitions of a \ufb01xed spike pattern (dark gray) which are embedded into random\nPoisson input of the same rate. The pattern has a length of 250ms and is made up by Poisson spike\ntrains of rate 20Hz. The period between two repetitions is drawn uniformly between 100ms and\n500ms. The response spike trains of the laminar circuit of [8] are shown separated into layers 2/3, 4,\nand 5. The numbers of neurons in the layers are indicated on the left, but only the response of every\n12th neuron is plotted. Shown are the 5 slowest features, $y_1$ to $y_5$, for the network response shown\nabove. The dashed lines indicate values of 0. (B) Phase plots of low-pass \ufb01ltered versions (leaky\nintegration, $\\tau$ = 100ms) of individual slow features in response to a test sequence of 50 embedded\npatterns plotted against each other (black: traces during the pattern, gray: during random Poisson\ninput).\n\n
The period of Poisson input in between two such\npatterns was randomly chosen.\n\nAt \ufb01rst glance there is no clear difference in Figure 2A between the raw SFA responses during\nperiods of pattern presentations and during phases of noise input due to the same \ufb01ring statistics.\nHowever, we found that on average the slow feature responses during noise input are zero, whereas\na characteristic response remains during pattern presentations. This effect is predicted by the the-\noretical arguments in section 4. It can be seen in phase plots of traces that are obtained by a leaky\nintegration of the slowest features in response to a test sequence of 50 embedded patterns (see Figure\n2B) that the slow features span a subspace where the response during pattern presentations can be\nnicely separated from the response during noise input. That is, by simple threshold operations on the\nlow-pass \ufb01ltered versions of the slowest features one can in principle detect the presence of patterns\nwithin the continuous input stream. Furthermore, this extracted information is not only available\nafter a pattern has been presented, but already during the presentation of the pattern, which supports\nthe idea of \u201canytime\u201d computing.\n\nIn the second experiment we tested whether SFA is able to discriminate two classes of trajectories\nas described in section 4. We performed a speech recognition task using the dataset considered orig-\ninally in [11] and later in the context of biological circuits in [7, 12, 13]. This isolated spoken digits\ndataset consists of the audio signals recorded from 5 speakers pronouncing the digits \u201czero\u201d, \u201cone\u201d,\n..., \u201cnine\u201d in ten different utterances (trials) each. We preprocessed the raw audio \ufb01les with a model\nof the cochlea [14] and converted the resulting analog cochleagrams into 20 spike trains (using the\nalgorithm in [15]) that serve as input to our microcircuit model (see Figure 3A). 
\n\n6\n\n\fFigure 3: SFA applied to digit recognition of a single speaker. (A) From top to bottom: cochleagrams,\ninput spike trains, response spike trains of the network, and traces of different linear readouts.\nEach cochleagram has 86 channels with analog values between 0 and 1 (white, near 1; black, near\n0). Stimulus spike trains are shown for two different utterances of the given digit (black and gray,\nthe black spike times correspond to the cochleagram shown above). The response spike trains of\nthe laminar circuit from [8] are shown separated into layers 2/3, 4, and 5. The number of neurons\nin each layer is indicated on the left, but only the response of every 12th neuron is plotted. The\nresponses to the two stimulus spike trains in the panel above are shown superimposed with the corresponding\ncolor. Each readout trace corresponds to a weighted sum ($\\Sigma$) of network states of the\nblack responses in the panel above. The trace of the slowest feature (\u201cSF1\u201d, see B) is compared to\ntraces of readouts trained by FLD and SVM with linear kernel to discriminate at any time between\nthe network states of the two classes. All weight vectors are normalized to length 1. The dotted line\ndenotes the threshold of the respective linear classi\ufb01er. (B) Response of the 5 slowest features $y_1$ to\n$y_5$ of the previously learned SFA in response to trajectories of the three test utterances of each class\nnot used for training (black, digit \u201cone\u201d; gray, digit \u201ctwo\u201d). The slowness index $\\eta = \\frac{T}{2\\pi} \\sqrt{\\Delta(y)}$\n[2] is calculated from these output signals. The angle $\\alpha$ denotes the deviation of the projection\ndirection of the respective feature from the direction found by FLD. The thick curves in the shaded\narea display the mean SFA responses over all three test trajectories for each class. (C) Phase plots\nof individual slow features plotted against each other (thin lines: individual responses, thick lines:\nmean response over all test trajectories).\n\nWe tried to discriminate between trajectories in response to inputs corresponding to utterances of digits \u201cone\u201d and\n\u201ctwo\u201d, of a single speaker. We kept three utterances of each digit for testing and generated from the\nremaining training samples a sequence of 100 input samples, recorded for each sample the response\nof the circuit, and concatenated the resulting trajectories in time. Note that here we did not switch\nthe classes of two successive trajectories with a certain probability because, as explained in the previous\nsection, for long trajectories the SFA response is independent of this switching probability.\nRather, we trained linear SFA readouts on a completely random trajectory sequence.\n\n7\n\n\fFigure 3B shows the 5 slowest features, $y_1$ to $y_5$, ordered by decreasing slowness in response to the\ntrajectories corresponding to the three remaining test utterances for each class, digit \u201cone\u201d and digit\n\u201ctwo\u201d. In this example, already the slowest feature $y_1$ extracts the class of the input patterns almost\nperfectly: it responds with positive values for trajectories in response to utterances of digit \u201ctwo\u201d\nand with negative values for utterances of digit \u201cone\u201d. This property of the extracted features, to\nrespond differently for different stimulus classes, is called the What-information [2]. The second\nslowest feature $y_2$, on the other hand, responds with shapes whose sign is independent of the pattern\nidentity. One can say that, in principle, $y_2$ encodes simply the presence of and the location within a\nresponse. This is a typical example of a representation of Where-information [2], i.e., the \u201cpattern\nlocation\u201d regardless of the identity of the pattern. 
The other slow features $y_3$ to $y_5$ do not extract either\nWhat- or Where-information explicitly, but rather a mixed version of both. As a measure for the\ndiscriminative capability of a speci\ufb01c SFA response, i.e., its quality as a possible classi\ufb01er, we measured\nthe angle between the projection direction corresponding to this slow feature and the direction\nof the FLD. It can be seen in Figure 3B that the slowest feature $y_1$ is closest to the FLD. Hence,\naccording to (11), this constitutes an example where the separation between classes dominates, but\nis already signi\ufb01cantly in\ufb02uenced by the temporal correlations of the circuit trajectories.\n\nFigure 3C shows phase plots of the slow features from Figure 3B plotted against each other.\nIn the three plots involving feature $y_1$ it can be seen that the directions of the response vector (i.e.,\nthe vector composed of the slow feature values at a particular point in time) cluster at class-speci\ufb01c\nangles, which is characteristic of What-information. On the other hand, these phase plots tend to\nform loops in phase space (instead of just straight lines from the origin), where each point on this\nloop corresponds to a position within the trajectory. This is a typical property of Where-information.\nSimilar responses have been theoretically predicted in [4] and found in simulations of a hierarchical\n(nonlinear) SFA network trained with a sequence of one-dimensional trajectories [2].\n\nThis experiment demonstrates that SFA extracts information about the spoken digit in an unsupervised\nmanner by projecting the circuit trajectories onto a subspace where they are nicely separable\nso that they can easily be classi\ufb01ed by later processing stages. Moreover, this information is provided\nnot only at the end of a speci\ufb01c trajectory, but is made available already much earlier. 
After\nsuf\ufb01cient training, the slowest feature $y_1$ in Figure 3B responds with positive or negative values indicating\nthe stimulus class during almost the whole duration of the network trajectory. This again\nsupports the idea of \u201canytime\u201d computing. It can be seen in the bottom panel of Figure 3A that the\nslowest feature, which is obtained in an unsupervised manner, achieves a good separation between\nthe two test trajectories, comparable to the supervised methods of FLD and Support Vector Machine\n(SVM) [16] with linear kernel.\n\n6 Discussion\n\nThe results of our paper show that Slow Feature Analysis is in fact a very powerful tool, which is\nable to approximate the classi\ufb01cation capability that results from supervised classi\ufb01cation learning.\nIts elegant formulation as a generalized eigenvalue problem has allowed us to establish a relationship\nto the supervised method of Fisher\u2019s Linear Discriminant (FLD). A more detailed discussion of\nthis relationship, including complete derivations, can be found in [6]. If temporally contiguous points\nin the time series are likely to belong to the same class, SFA is able to extract the class as a slowly\nvarying feature in an unsupervised manner. This ability is of particular interest in the context of\nbiologically realistic neural circuits because it could enable readout neurons to extract from the trajectories\nof network states information about the stimulus \u2013 without any \u201cteacher\u201d, whose existence\nis highly dubious in the brain. 
We have shown in computer simulations of a cortical microcircuit\nmodel that linear readouts trained with SFA are able to detect specific spike patterns within a stream\nof spike trains with the same firing statistics and to discriminate between different spoken digits.\nMoreover, SFA provides in these tasks an \u201canytime\u201d classification capability.\n\nAcknowledgments\n\nWe would like to thank Henning Sprekeler and Laurenz Wiskott for stimulating discussions. This\npaper was written under partial support by the Austrian Science Fund FWF project # S9102-N13\nand project # FP6-015879 (FACETS), project # FP7-216593 (SECO) and project # FP7-231267\n(ORGANIC) of the European Union.\n\nReferences\n\n[1] N. Li and J. J. DiCarlo. Unsupervised natural experience rapidly alters invariant object representation in visual cortex. Science, 321:1502\u20131507, 2008.\n\n[2] L. Wiskott and T. J. Sejnowski. Slow feature analysis: unsupervised learning of invariances. Neural Computation, 14(4):715\u2013770, 2002.\n\n[3] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179\u2013188, 1936.\n\n[4] L. Wiskott. Slow feature analysis: A theoretical analysis of optimal free responses. Neural Computation, 15(9):2147\u20132177, 2003.\n\n[5] P. Berkes. Pattern recognition with slow feature analysis. Cognitive Sciences EPrint Archive (CogPrint) 4104, February 2005. http://cogprints.org/4104/.\n\n[6] S. Klampfl and W. Maass. A theoretical basis for emergent pattern discrimination in neural systems through slow feature extraction. Submitted for publication, 2009.\n\n[7] W. Maass, T. Natschl\u00e4ger, and H. Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531\u20132560, 2002.\n\n[8] S. H\u00e4usler and W. Maass.
A statistical analysis of information processing properties of lamina-specific cortical microcircuit models. Cerebral Cortex, 17(1):149\u2013162, 2007.\n\n[9] A. Gupta, Y. Wang, and H. Markram. Organizing principles for a diversity of GABAergic interneurons and synapses in the neocortex. Science, 287:273\u2013278, 2000.\n\n[10] A. M. Thomson, D. C. West, Y. Wang, and A. P. Bannister. Synaptic connections and small circuits involving excitatory and inhibitory neurons in layers 2\u20135 of adult rat and cat neocortex: triple intracellular recordings and biocytin labelling in vitro. Cerebral Cortex, 12(9):936\u2013953, 2002.\n\n[11] J. J. Hopfield and C. D. Brody. What is a moment? Transient synchrony as a collective mechanism for spatio-temporal integration. Proc. Nat. Acad. Sci. USA, 98(3):1282\u20131287, 2001.\n\n[12] D. Verstraeten, B. Schrauwen, D. Stroobandt, and J. Van Campenhout. Isolated word recognition with the liquid state machine: a case study. Inf. Process. Lett., 95(6):521\u2013528, 2005.\n\n[13] R. Legenstein, D. Pecevski, and W. Maass. A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback. PLoS Computational Biology, 4(10):1\u201327, 2008.\n\n[14] R. F. Lyon. A computational model of filtering, detection, and compression in the cochlea. In Proc. IEEE Int. Conf. Acoustics Speech and Signal Processing, pages 1282\u20131285, May 1982.\n\n[15] B. Schrauwen and J. Van Campenhout. BSA, a fast and accurate spike train encoding scheme. In Proceedings of the International Joint Conference on Neural Networks, 2003.\n\n[16] B. Sch\u00f6lkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.\n", "award": [], "sourceid": 597, "authors": [{"given_name": "Stefan", "family_name": "Klampfl", "institution": null}, {"given_name": "Wolfgang", "family_name": "Maass", "institution": null}]}