{"title": "Least Informative Dimensions", "book": "Advances in Neural Information Processing Systems", "page_first": 413, "page_last": 421, "abstract": "We present a novel non-parametric method for finding a subspace of stimulus features that contains all information about the response of a system. Our method generalizes similar approaches to this problem such as spike triggered average, spike triggered covariance, or maximally informative dimensions. Instead of maximizing the mutual information between features and responses directly, we use integral probability metrics in kernel Hilbert spaces to minimize the information between uninformative features and the combination of informative features and responses. Since estimators of these metrics access the data via kernels, are easy to compute, and exhibit good theoretical convergence properties, our method can easily be generalized to populations of neurons or spike patterns. By using a particular expansion of the mutual information, we can show that the informative features must contain all information if we can make the uninformative features independent of the rest.", "full_text": "Least Informative Dimensions\n\nFabian H. Sinz\n\nDepartment for Neuroethology\n\nEberhard Karls University T\u00a8ubingen\n\nfabee@epagoge.de\n\nAnna St\u00a8ockl\n\nDepartment for Functional Zoology\n\nLund University, Sweden\n\nAnna.Stockl@biol.lu.se\n\nJan Grewe\n\nDepartment for Neuroethology\n\nEberhard Karls University T\u00a8ubingen\njan.grewe@uni-tuebingen.de\n\nJan Benda\n\nDepartment for Neuroethology\n\nEberhard Karls University T\u00a8ubingen\njan.benda@uni-tuebingen.de\n\nAbstract\n\nWe present a novel non-parametric method for \ufb01nding a subspace of stimulus fea-\ntures that contains all information about the response of a system. Our method\ngeneralizes similar approaches to this problem such as spike triggered average,\nspike triggered covariance, or maximally informative dimensions. 
Instead of max-\nimizing the mutual information between features and responses directly, we use\nintegral probability metrics in kernel Hilbert spaces to minimize the information\nbetween uninformative features and the combination of informative features and\nresponses. Since estimators of these metrics access the data via kernels, are easy\nto compute, and exhibit good theoretical convergence properties, our method can\neasily be generalized to populations of neurons or spike patterns. By using a par-\nticular expansion of the mutual information, we can show that the informative\nfeatures must contain all information if we can make the uninformative features\nindependent of the rest.\n\n1\n\nIntroduction\n\nAn important aspect of deciphering the neural code is to determine those stimulus features popula-\ntions of sensory neurons are most sensitive to. Approaches to that problem include white noise anal-\nysis [2, 14], in particular spike-triggered average [4] or spike-triggered covariance [3, 19], canonical\ncorrelation analysis or population receptive \ufb01elds [12], generalized linear models [18, 15], or max-\nimally informative dimensions [22]. All these techniques have in common that they optimize a\nstatistical dependency measure between stimuli and spike responses over the choice of a linear sub-\nspace. The particular algorithms differ in the dimensionality of the subspace they extract (one- vs.\nmulti-dimensional), the statistical measure they use (correlation, likelihood, relative entropy), and\nwhether an extension to population responses is feasible or not. While spike-triggered average uses\ncorrelation and is restricted to a single subspace, spike-triggered covariance and canonical correla-\ntion analysis can already extract multi-dimensional subspaces but are still restricted to second-order\nstatistics. 
Maximally informative dimensions is the only technique of the above that can extract multiple dimensions that are informative also with respect to higher-order statistics. However, an extension to spike patterns or population responses is not straightforward because of the curse of dimensionality. Here we approach the problem from a different perspective and propose an algorithm that can extract a multi-dimensional subspace containing all relevant information about the neural responses Y in terms of Shannon's mutual information (if such a subspace exists). Our method does not commit to a particular parametric model, and can easily be extended to spike patterns or population responses.

In general, the problem of finding the most informative subspace of the stimuli X about the responses Y can be described as finding an orthogonal matrix Q (a basis for R^n) that separates X into informative and non-informative features (U, V)^T = QX. Since Q is orthogonal, the mutual information I[X : Y] between X and Y can be decomposed as [5]

I[Y : X] = I[Y : U, V] = E_{X,Y}[log p(U, V, Y) / (p(U, V) p(Y))]
         = I[Y : U] + E_{Y,V}[log p(Y, V | U) / (p(Y | U) p(V | U))]
         = I[Y : U] + E_U[I[Y | U : V | U]].    (1)

Since the two terms on the right hand side of equation (1) are always non-negative and sum up to the mutual information between Y and X, two ways to obtain maximally informative features U about Y would be to either maximize I[Y : U] or to minimize E_U[I[Y | U : V | U]] via the choice of Q. The first possibility is along the lines of maximally informative dimensions [22] and involves direct estimation of the mutual information. The second possibility, which avoids direct estimation, has been proposed by Fukumizu and colleagues [5, 6] (we discuss both in Section 3).
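The decomposition (1) is an exact consequence of the chain rule for mutual information; it can be sanity-checked numerically on a small discrete joint distribution. The toy distribution below is a hypothetical illustration, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: binary U, V, Y with a random strictly positive joint p[u, v, y].
p = rng.random((2, 2, 2)) + 0.1
p /= p.sum()

def mi(pxy):
    """Mutual information of a 2-D joint table (natural log)."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return float(np.sum(pxy * np.log(pxy / (px * py))))

# I[Y : X] with X = (U, V): merge the (u, v) axes into one.
i_y_x = mi(p.reshape(4, 2))

# I[Y : U]: marginalize over v.
i_y_u = mi(p.sum(axis=1))

# E_U[ I[Y|U : V|U] ]: the conditional mutual information I(Y; V | U).
i_y_v_given_u = 0.0
for u in range(2):
    pu = p[u].sum()
    i_y_v_given_u += pu * mi(p[u] / pu)  # p[u] is the joint of (V, Y) given U = u

# decomposition (1): I[Y : X] = I[Y : U] + E_U[ I[Y|U : V|U] ]
assert abs(i_y_x - (i_y_u + i_y_v_given_u)) < 1e-10
```

The identity holds exactly for any joint distribution, which is what makes the two optimization routes (maximizing the first term or minimizing the second) equivalent in principle.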
Here, we explore\na third possibility, which trades practical advantages against a slightly more restrictive objective. The\nidea is to obtain maximally informative features U by making V as independent as possible from\nthe combination of U and Y . For this reason, we name our approach least informative dimensions\n(LID). Formally, least informative dimensions tries to minimize the mutual information between the\npair Y , U and V . Using the chain rule for multi information we can write it as (see supplementary\nmaterial)\n\nI [Y , U : V ] = I [Y : X] + I [U : V ] \u2212 I [Y : U ] .\n\n(2)\n\nThis means that minimizing I [Y , U : V ] is equivalent to maximizing I [Y : U ] while simultane-\nously minimizing I [U : V ]. Note that I [Y , U : V ] = 0 implies I [U : V ] = 0. Therefore, if Q\ncan be chosen such that I [Y , U : V ] = 0 equation (2) reduces to I [Y : X] = I [Y : U ], pushing\nall information about Y into U.\nSince each new choice of Q requires the estimation of the mutual information between (potentially\nhigh-dimensional) variables, direct optimization is hard or unfeasible. For this reason, we resort to\nanother dependency measure which is easier to estimate but shares its minimum with mutual infor-\nmation, that is, it is zero if and only if the mutual information is zero. The objective is to choose Q\nsuch that (Y , U ) and V are independent in that dependency measure. If we can \ufb01nd such a Q, then\nwe know that I [Y , U : V ] is zero as well, which means that U are the most informative features in\nterms of the Shannon mutual information. This will allow us to obtain maximally informative fea-\ntures without ever having to estimate a mutual information. The easier estimation procedure comes\nat the cost of only being able to link the alternative dependency measure to the mutual information\nif both of them are zero. 
If there is no Q that achieves this, we will still get informative features in the alternative measure, but it is not clear how informative they are in terms of mutual information.

2 Least informative dimensions

This section describes how to efficiently find a Q such that I[Y, U : V] = 0 (if such a Q exists). Unless noted otherwise, (U, V)^T = QX, where U denotes the informative and V the uninformative features. The mutual information is a special case of the relative entropy

D_KL[p || q] = E_{X~p}[log(p(X) / q(X))]

between two distributions p and q. While being linked to the rich theoretical background of Shannon information theory, the relative entropy is known to be hard to estimate [25]. Alternatives to relative entropy of increasing practical interest are the integral probability metrics (IPM), defined as [25, 17]

γ_F(X : Z) = sup_{f ∈ F} |E_X[f(X)] − E_Z[f(Z)]|.    (3)

Intuitively, the metric in equation (3) searches for a witness function f which can detect a difference in the distributions of two random variables X and Z. If no such witness function can be found, the distributions must be equal. If F is chosen to be a sufficiently rich reproducing kernel Hilbert space H [21], then the supremum in equation (3) can be computed explicitly and the divergence can be computed in closed form [7]. This particular type of IPM is called maximum mean discrepancy (MMD) [9, 7, 10].

A kernel k : X × X → R is a symmetric function such that the matrix K_ij = k(x_i, x_j) is positive (semi-)definite for every selection of points x_1, ..., x_m ∈ X [21]. In that case, the functions k(·, x) are elements of a reproducing kernel Hilbert space (RKHS) of functions H.
This space is endowed with a dot product ⟨·, ·⟩_H with the so-called reproducing property ⟨k(·, x), f⟩_H = f(x) for f ∈ H. In particular, ⟨k(·, x), k(·, x')⟩_H = k(x, x'). When setting F in equation (3) to be the unit ball in H, the IPM can be computed in closed form as the norm of the difference between the mean functions in H [7, 10, 8, 26]:

γ_H(X : Z) = ||E_X[k(·, X)] − E_Z[k(·, Z)]||_H
           = (E_{X,X'}[k(X, X')] − 2 E_{X,Z}[k(X, Z)] + E_{Z,Z'}[k(Z, Z')])^(1/2),    (4)

where the first equality is derived in [7], and the second equality uses the bi-linearity of the dot product and the reproducing property of k. Furthermore, (X, X') ~ P_X × P_X and (Z, Z') ~ P_Z × P_Z are two independent random variables drawn from the marginal distributions of X and Z, respectively.

The function E_X[k(·, X)] is an embedding of the distribution of X into the RKHS H via X ↦ E_X[k(·, X)]. If this map is injective, that is, if it uniquely represents the probability distribution of X, then equation (4) is zero if and only if the probability distributions of X and Z are the same. Kernels with that property are called characteristic, in analogy to the characteristic function φ_X(t) = E_X[exp(i t^T X)] [26, 27]. This means that for characteristic kernels MMD is zero exactly if the relative entropy D_KL[p || q] is zero as well. Since the mutual information is the relative entropy between the joint distribution and the product of the marginals, we can use MMD to search for a Q such that γ_H(P_{Y,U,V} : P_{Y,U} × P_V) is zero¹, which then implies that I[Y, U : V] = 0 as well.
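The closed-form expression (4) can be turned into an estimator by replacing expectations with empirical means, as the next paragraph describes. A minimal numpy sketch, where the Gaussian kernel, bandwidth, and sample sizes are illustrative assumptions:

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / sigma^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def mmd2(X, Z, sigma=1.0):
    """Biased empirical estimate of squared MMD, cf. eq. (4)."""
    return (rbf(X, X, sigma).mean()
            - 2 * rbf(X, Z, sigma).mean()
            + rbf(Z, Z, sigma).mean())

rng = np.random.default_rng(0)
# same distribution -> estimate near zero (up to O(1/m) bias and sampling noise)
same = mmd2(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
# shifted distribution -> clearly positive
diff = mmd2(rng.normal(size=(500, 2)), rng.normal(loc=2.0, size=(500, 2)))
assert same < 0.05 and diff > 0.1
```

As the text notes, this involves only summation over three kernel matrices, in contrast to direct relative-entropy estimation.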
The finite sample version of (4) is simply given by replacing the expectations with the empirical mean (and possibly some bias correction) [7, 10, 8]. The estimation of γ_H therefore only involves summation over three kernel matrices and can be done in a few lines of code. Unlike for the relative entropy, the empirical estimation of MMD is therefore much more feasible. Furthermore, the residual error of the empirical estimator can be shown to decrease on the order of 1/√m, where m is the number of data points [25]. Note in particular that this rate does not depend on the dimensionality of the data.

Objective function The objective function for our optimization problem now has the following form: We transform input examples x_i into features u_i and v_i via (u_i, v_i) = Q x_i. Then we use a kernel k((u_i, v_i, y_i), (u_j, v_j, y_j)) to compute and minimize MMD with respect to the choice of Q. In order to do that efficiently, a few adaptations are required. First, without loss of generality, we minimize the squared MMD instead of MMD itself

γ²_H(Z_1, Z_2) = E_{Z_1,Z_1'}[k(Z_1, Z_1')] − 2 E_{Z_1,Z_2}[k(Z_1, Z_2)] + E_{Z_2,Z_2'}[k(Z_2, Z_2')],    (5)

where Z_1 = (Y, U, V) ~ P_{Y,U,V} and Z_2 = (Y, U, V) ~ P_{Y,U} × P_V.

Second, in order to get samples from P_{Y,U} × P_V, we assume that our kernel takes the form k((u_i, v_i, y_i), (u_j, v_j, y_j)) = k_1((u_i, y_i), (u_j, y_j)) · k_2(v_i, v_j). For this special case, one can incorporate the independence assumption between U, Y and V directly by using the fact that for independent random variables the expectation of the product is equal to the product of the expectations, that is,

E[k_1((u_i, y_i), (u_j, y_j)) · k_2(v_i, v_j)] = E[k_1((u_i, y_i), (u_j, y_j))] E[k_2(v_i, v_j)].

This special case of MMD is equivalent to the Hilbert-Schmidt Independence Criterion (HSIC) [9, 23] and can be computed as

γ̂²_hs = 1/(m − 1)² tr(K_1 H K_2 H),    (6)

where K_1 and K_2 denote the matrices of pairwise kernel values between the data sets {(u_i, y_i)}_{i=1}^m and {v_i}_{i=1}^m, respectively, and H_ij = δ_ij − m⁻¹.

Note, however, that one could in principle also optimize (5) for a non-factorizing kernel by simply shuffling the (u_i, y_i) and v_i across examples. We can also use shuffling to assess whether the optimal value γ̂²_hs found during the optimization is significantly different from zero by comparing the value to a null distribution over γ̂²_hs obtained from datasets where the (u_i, y_i) and v_i have been permuted across examples.

Minimization procedure and gradients For optimizing (6) with respect to Q we use gradient descent over the orthogonal group SO(n).

¹ With some abuse of notation, we wrote MMD as a function of the probability measures.
The optimization can be carried out by computing the unconstrained gradient ∇_Q γ of the objective function with respect to Q (treating Q as an ordinary matrix), projecting that gradient onto the tangent space of SO(n), and performing a line search along the gradient direction. We now present the necessary formulae to implement the optimization in a modular fashion. We first show how to compute the gradient ∇_Q γ in terms of the gradients ∇_{u_i,v_i} γ̂²_hs, then we show how to compute the ∇_{u_i,v_i} γ̂²_hs in terms of derivatives of kernel functions, and finally demonstrate how the formulae change when approximating the kernel matrices with an incomplete Cholesky decomposition.

Given the unconstrained gradient ∇_Q γ, the projection onto the tangent space is given by ζ = Q ∇_Q γ^T Q − ∇_Q γ [13, eq. (22)]. The function is then minimized by performing a line search along π(Q + tζ), where π is the projection onto SO(n), which can easily be computed via singular value decomposition of Q + tζ and setting the singular values to one [13, prop. 7].

This means that all we need for the gradient descent on SO(n) is the unconstrained gradient ∇_Q γ. This gradient takes the form of a sum of outer products [16, eq. (20)]

∇_Q γ̂²_hs = Σ_{i=1}^m ∂γ̂²_hs/∂(u_i, v_i) · x_i^T = J^T Ξ,   J = (∂γ̂²_hs/∂(u_i, v_i))_i,

where the matrix Ξ contains the stimuli x_i in its rows. The first k columns J^(u), corresponding to the dimensions of the features u_i, and the last n − k columns J^(v), corresponding to the dimensions of the features v_i, are given by

J^(u)_η = 2/(m − 1)² diag(H K_2 H D^(u)T_η)   and   J^(v)_η = 2/(m − 1)² diag(H K_1 H D^(v)T_η),

where

(D^(u)_η)_ij = ∂/∂u_iη k((u_i, v_i, y_i), (u_j, v_j, y_j))

contains the partial derivatives of the kernel with respect to the ηth dimension of u (and analogously for v) in the first argument (see supplementary material for the derivation).

Efficient implementation with incomplete Cholesky decomposition of the kernel matrix So far, the evaluation of HSIC requires the computation of two m × m kernel matrices in each step. For larger datasets this can quickly become computationally prohibitive. In order to speed up computation time, we approximate the kernel matrices by an incomplete Cholesky decomposition K = LL^T, where L ∈ R^{m×ℓ} is a "tall" matrix [1]. In that case, HSIC can be computed much faster as the trace of a product of two ℓ × ℓ matrices because

tr(K_1 H K_2 H) = tr(L_1^T H L_2 L_2^T H L_1),

where H L_k can be efficiently computed by centering L_k on its row mean.
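The projected-gradient step described above (tangent projection ζ = Q ∇γ^T Q − ∇γ, then the SVD-based projection π back onto the orthogonal group) can be sketched as follows; the random matrix G stands in for the actual HSIC gradient ∇_Q γ:

```python
import numpy as np

def project_to_tangent(Q, G):
    # zeta = Q G^T Q - G: projection of the unconstrained gradient G
    # onto the tangent space of the orthogonal group at Q (cf. [13], eq. (22)).
    return Q @ G.T @ Q - G

def retract(M):
    # pi(M): nearest orthogonal matrix, via SVD with singular values set to one.
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

rng = np.random.default_rng(0)
n = 5
Q = retract(rng.normal(size=(n, n)))   # random orthogonal starting point
G = rng.normal(size=(n, n))            # stand-in for the unconstrained gradient

zeta = project_to_tangent(Q, G)
Q_new = retract(Q + 0.1 * zeta)        # one line-search trial point pi(Q + t*zeta)

# sanity checks: Q^T zeta is skew-symmetric (a tangent direction), and the
# retraction returns an orthogonal matrix
S = Q.T @ zeta
assert np.allclose(S, -S.T)
assert np.allclose(Q_new @ Q_new.T, np.eye(n), atol=1e-10)
```

In the actual algorithm one would evaluate γ̂²_hs at π(Q + tζ) for several step sizes t and keep the best, then recompute the gradient.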
Also in this case, the\nmatrix J can be computed ef\ufb01ciently in terms of derivatives of sub-matrices of the kernel matrix\n(see supplementary material for the exact formulae).\n\n3 Related work\n\nKernel dimension reduction in regression [5, 6] Fukumizu and colleagues \ufb01nd maximally in-\nformative features U by minimizing EU [I [V | U : Y | U ]] in equation (1) via conditional kernel\n\n4\n\n\fcovariance operators. They show that the covariance operator equals zero if and only if Y is con-\nditionally independent of V given U, that is, Y \u22a5\u22a5V | U. In that case, U carries all information\nabout Y . Although their approach is closest to ours, it differs in a few key aspects: In contrast to our\napproach, their objective involves the inversion of a\u2014potentially large\u2014kernel matrix which needs\nadditional regularization in order to be invertible. A conceptual difference is that we are optimizing\na slightly more restrictive problem because their objective does not attempt to make U independent\nof V as well. However, this will not make a difference in many practical cases, since many stimulus\ndistributions are Gaussian for which the dependencies between U and V can be removed by pre-\nwhitening the stimulus data before training LID. In that case I [U : V ] = 0 for every choice of Q\nand equation (2) becomes equivalent to maximizing the mutual information between U and Y . The\nadvantage of our formulation of the problem is that it allows us to detect and quantify independence\nby comparing the current \u02c6\u03b3hs to its null distribution obtained by shuf\ufb02ing the (yi, ui) against vi\nacross examples. This is hardly possible in the conditional case. Also note that for spherically sym-\nmetric data I [U : V ] = const. for every choice of Q. In that case equation (2) becomes equivalent\nto maximizing I [Y : U ]. However, a residual redundancy remains which would show up when\ncomparing \u02c6\u03b32\nhs to its null distribution. 
Finally, the use of kernel covariance operators is bound to kernels that factorize. In principle, our method is also applicable to non-factorizing kernels if we use γ_H instead of γ_hs and obtain the samples from the product distribution P_{Y,U} × P_V via shuffling.

Maximally informative dimensions [22] Sharpee and colleagues maximize the relative entropy

I_spike = D_KL[p(v^T s | spike) || p(v^T s)]

between the distribution of stimuli projected onto informative dimensions given a spike, and the marginal distribution of the projection. This relative entropy is the part of the mutual information which is carried by the arrival of a single spike, since

I[v^T s : {spike, no spike}] = p(spike) · I_spike + p(no spike) · I_no spike.

Their method is also completely non-parametric and captures higher-order dependencies between a stimulus and a single spike. However, by focusing on single spikes and the spike-triggered density only, it neglects the dependencies between spikes and the information carried by the silence of the neuron [28]. Additionally, the generalization to spike patterns or population responses is non-trivial because the information between the projected stimuli and spike patterns σ_1, ..., σ_ℓ becomes I[v^T s : σ] = Σ_i p(σ_i) · I_{σ_i}. This requires the estimation of a conditional distribution p(v^T s | σ_i) for each pattern σ_i, which can quickly become prohibitive when the number of patterns grows exponentially.

4 Experiments

In all the experiments below, we demonstrate the validity of our methods on controlled artificial examples and on P-unit recordings from electric fish.
We use an RBF kernel on the v_i and a tensor RBF kernel on the (u_i, y_i):

k(v_i, v_j) = exp(−||v_i − v_j||² / σ²)   and   k((u_i, y_i), (u_j, y_j)) = exp(−||u_i y_i^T − u_j y_j^T||² / σ²).

The derivatives of the kernels can be found in the supplementary material. Unless noted otherwise, the σ were chosen to be the median of pairwise Euclidean distances between data points. In all artificial experiments, Q was chosen randomly.

Linear Non-Linear Poisson Model (LNP) In this experiment, we trained LID on a simple linear nonlinear Poisson (LNP) neuron y_i ~ Poisson(⌊⟨w, x_i⟩ − θ⌋_+) with an exponentially decaying filter and a rectifying non-linearity (see Figure 1, left). We used m = 5000 data points x_i from a 20-dimensional standard normal distribution N(0, I) as input. The offset was chosen such that approximately 35% non-zero spike counts in the y_i were obtained. We used one informative and 19 non-informative dimensions, and set σ = 1 for the tensor kernel.

After optimization, the first dimension q_1 of Q converged to the filter w (Figure 1). We compared the HSIC values γ̂_hs[{(y_i, u_i)}_{i=1,...,m} : {v_i}_{i=1,...,m}] before and after the optimization to their null distribution obtained by shuffling. Before the optimization, the dependence of (Y, U) and V

Figure 1: Left: LNP Model. The informative dimension (gray during optimization, black after optimization) converges to the true filter of an LNP model (blue line). Before optimization (Y, U) and V are dependent as shown by the left inset (null distribution obtained via shuffling in gray, dashed line shows actual HSIC value).
After the optimization (right inset) the HSIC value is even below\nthe null distribution. Right: Two state neuron. LID correctly identi\ufb01es the subspace (blue dashed)\nin which the two true \ufb01lters (solid black) reside since projections of the \ufb01lters on the subspace (red\ndashed) closely resemble the original \ufb01lters.\n\nis correctly detected (Figure 1, left, insets). After convergence the actual HSIC value lies left to the\nnull distribution\u2019s domain. Since the appropriate test for independence would be one-sided, the null\nhypothesis \u201c(Y , U ) is independent of V \u201d would not be rejected in this case.\n\nTwo state neuron In this experiment, we simulated a neuron with two states that were both at-\ntained in 50% of the trials (see Figure 1, right). This time, the output consisted of four \u201cbins\u201d\nwhose statistics varied depending on the state. In the \ufb01rst\u2014steady rate\u2014state, the four bins con-\ntained spike counts drawn from an LNP neuron with exponentially decaying \ufb01lter as above. In the\nsecond\u2014burst\u2014state, the \ufb01rst two bins were drawn from Poisson distribution with a \ufb01xed base rate\nindependent of the stimulus. The second two bins were drawn from an LNP neuron with a modu-\nlated exponential \ufb01lter and higher gain. We used m = 8000 input stimuli from a 20-dimensional\nstandard normal distribution. We use two informative dimensions and set \u03c3 of the tensor kernel to\ntwo times the median of the pairwise distances. LID correctly identi\ufb01ed the subspace associated\nwith the two \ufb01lters also in this case (Figure 1, right).\n\nArti\ufb01cial complex cell\nIn a second experiment, we estimated the two-dimensional subspace as-\nsociated with a arti\ufb01cial complex cell. We generated a quadrature pair w1 and w2 of two 10-\ndimensional \ufb01lters (see Figure 2, left). We used m = 8000 input points from a standard nor-\nmal distribution. 
Responses were generated from a Poisson distribution with the rate given by\n\u03bbi = (cid:104)w1, xi(cid:105)2 + (cid:104)w2, xi(cid:105)2. This led to about 34% non-zero neural responses. When using two\ninformative subspaces, LID was able to identify the subspace correctly (Figure 2, left). When com-\nparing the HSIC value against the null distribution found via shuf\ufb02ing, the \ufb01nal value indicated no\nfurther dependencies. When only a one-dimensional subspace was used (Figure 2, right), LID did\nnot converge to the correct subspace. Importantly, the HSIC value after optimization was clearly\noutside the support of the null distribution, thereby correctly indicating residual dependencies.\n\nP-Unit recordings from weakly electric \ufb01sh Finally, we applied our method to P-unit recordings\nfrom the weakly electric \ufb01sh Eigenmannia virescens. These weakly electric \ufb01sh generate a dipole-\nlike electric \ufb01eld which changes polarity with a frequency at about 300Hz. Sensors in the skin of the\n\ufb01sh are tuned to this carrier frequency and respond to amplitude changes caused by close-by objects\nwith different conductive properties than water [20]. In the present recordings, the immobilized \ufb01sh\nwas stimulated with 10s of 300 \u2212 600Hz low-pass \ufb01ltered full \ufb01eld frozen Gaussian white noise\namplitude modulations of its own \ufb01eld. Neural activity was recorded intra-cellularly from the P-unit\nafferents.\nSpikes were binned with 1ms precision. We selected m = 8400 random time points in the spike\nresponse and the corresponding preceding 20ms of the input (20 dimensions). We used the same\n\n6\n\n\fFigure 2: Arti\ufb01cial Complex Cell. Left: The original \ufb01lters are 90\u00b0 phase shifted Gabor \ufb01lters\nwhich form an orthogonal basis for a two-dimensional subspace. 
After optimization, the two infor-\nmative dimensions of LID (\ufb01rst two rows of Q) converge to that subspace and also form a pair of\n90\u00b0 phase shifted \ufb01lters (note that even if the \ufb01lters are not the same, they span the same subspace).\nComparing the HSIC values before and after optimization shows that this subspace contains the\nrelevant information (left and right inset). Right: If only a one-dimensional informative subspace\nis used, the \ufb01lter only slightly converges to the subspace. After optimization, a comparison of the\nHSIC value to the null distribution obtained via shuf\ufb02ing indicates residual dependencies which are\nnot explained by the one-dimensional subspace (left and right inset).\n\nFigure 3: Most informative feature for a weakly electric \ufb01sh P-Unit: A random \ufb01lter (blue trace)\nexhibits HSIC values that are clearly outside the domain of the null distribution (left inset). Using\nthe spike triggered average (red trace) moves the HSIC values of the \ufb01rst feature of Q already inside\nthe null distribution (middle inset). Further optimization with LID re\ufb01nes the feature (black trace)\nand brings the HSIC values closer to zero (right inset). After optimization, the informative feature\nU is independent of the features V because the \ufb01rst row and column of the covariance matrix of the\ntransformed Gaussian input show no correlations. The fact that one informative feature is suf\ufb01cient\nto bring the HSIC values inside the null distribution indicates that a single subspace captures all\ninformation conveyed by these sensory neurons.\n\nkernels as in the experiment on the LNP model. We initialized the \ufb01rst row in Q with the normal-\nized spike triggered average (STA; Figure 3, left, red trace). We neither pre-whitened the data for\ncomputing the STA nor for the optimization of LID. 
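The STA initialization used here can be sketched on a hypothetical LNP toy (not the recorded P-unit data; the filter, threshold, and sample size are assumptions): with stimuli X (one snippet per row) and binned spike counts y, the normalized STA is the spike-weighted mean stimulus.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.exp(-np.arange(20) / 5.0)               # assumed ground-truth filter
X = rng.normal(size=(5000, 20))                # white-noise stimulus snippets
y = rng.poisson(np.clip(X @ w - 1.0, 0, None)) # LNP-style spike counts

# spike-triggered average: spike-weighted mean stimulus, then normalized
sta = X.T @ y / y.sum()
sta /= np.linalg.norm(sta)                     # candidate first row of Q

# for Gaussian white noise and an LNP neuron, the STA aligns with the filter
assert sta @ w / np.linalg.norm(w) > 0.8
```

Such an initialization puts the optimization close to a good optimum, which matters given the non-convexity discussed in Section 5.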
Unlike a random feature (Figure 3, left, blue\ntrace), the spike triggered average already achieves HSIC values within the null distribution (Figure\n3, left and middle inset). The most informative feature corresponding to U looks very similar to the\nSTA but shifts the HSIC value deeper into the domain of the null distribution (Figure 3, right inset).\n\n7\n\n\fThis indicates that one single subspace in the input is suf\ufb01cient to carry all information between the\ninput and the neural response.\n\n5 Discussion\n\nHere we presented a non-parametric method to estimate a subspace of the stimulus space that con-\ntains all information about a response variable Y . Even though our method is completely generic\nand applicable to arbitrary input-output pairs of data, we focused on the application in the con-\ntext of sensory neuroscience. The advantage of the generic approach is that Y can in principle be\nanything from spike counts, to spike patterns or population responses. Since our method \ufb01nds the\nmost informative dimensions by making the complement of those dimensions as independent from\nthe data as possible, we termed it least informative dimensions (LID). We use the Hilbert-Schmidt\nindependence criterion to minimize the dependencies between the uninformative features and the\ncombination of informative features and outputs. This measure is easy to implement, avoids the\nneed to estimate mutual information, and its estimator has good convergence properties independent\nof the dimensionality of the data. 
Even though our approach only estimates the informative features and not mutual information itself, it can help to estimate mutual information by reducing the number of dimensions.

As in the approach by Fukumizu and colleagues, it might be that no Q exists such that I[Y, U : V] = 0. In that situation, the price to pay for an easier measure is that it is hard to make definite statements about the informativeness of the features U in terms of the Shannon information, since γ_H = I[Y, U : V] = 0 is the point that connects γ_H to the mutual information. As demonstrated in the experiments, we can detect this case by comparing the actual value of γ̂_H to an empirical null distribution of γ̂_H values obtained by shuffling the v_i against the (u_i, y_i) pairs. However, if γ_H ≠ 0, theoretical upper bounds on the mutual information are unfortunately not available. In fact, using results from [25] and Pinsker's inequality one can show that γ²_H bounds the mutual information from below. One might now be tempted to think that maximizing γ_H[Y, U] might be a better way to find informative features. While this might be a way to get some informative features [24], it is not possible to link the features to informativeness in terms of Shannon mutual information, because the point that builds the bridge between the two dependency measures is where both of them are zero. Anywhere else the bound may not be tight, so the maximally informative features in terms of γ_H and in terms of mutual information can be different.

Another problem our approach shares with many algorithms that detect higher-order dependencies is the non-convexity of the objective function. In practice, we found that the degree to which this poses a problem very much depends on the problem at hand.
For instance, while the subspaces of the LNP or the two-state neuron were detected reliably, the two-dimensional subspace of the artificial complex cell seems to pose a harder problem. It is likely that the choice of kernel has an influence on the landscape of the objective function. We plan to explore this relationship in more detail in the future. In general, a good initialization of Q helps to get close to the global optimum.

Beyond that, however, integral probability metric approaches to maximally informative dimensions offer a great chance to avoid many problems associated with direct estimation of mutual information, and to extend it to much more interesting output structures than single spikes.

Acknowledgements

Fabian Sinz would like to thank Lucas Theis and Sebastian Gerwinn for helpful discussions and comments on the manuscript. This study is part of the research program of the Bernstein Center for Computational Neuroscience, Tübingen, funded by the German Federal Ministry of Education and Research (BMBF; FKZ: 01GQ1002).

References

[1] F. R. Bach and M. I. Jordan. Predictive low-rank decomposition for kernel methods. In Proceedings of the 22nd International Conference on Machine Learning (ICML '05), pages 33–40, New York, NY, USA, 2005. ACM Press.

[2] E. de Boer and P. Kuyper. Triggered correlation. IEEE Transactions on Biomedical Engineering, 1968.
Annals of Statistics, 37(4):1871–1905, 2009.

[7] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520, Cambridge, MA, 2007. MIT Press.

[8] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.

[9] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Algorithmic Learning Theory, pages 63–77. Springer Berlin/Heidelberg, 2005.

[10] A. Gretton, K. Fukumizu, Z. Harchaoui, and B. K. Sriperumbudur. A fast, consistent kernel two-sample test. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems, pages 673–681. Curran, Red Hook, NY, USA, 2009.

[11] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007.

[12] J. Macke, G. Zeck, and M. Bethge. Receptive fields without spike-triggering. Advances in Neural Information Processing Systems 20, pages 1–8, 2007.

[13] J. H. Manton. Optimization algorithms exploiting unitary constraints. IEEE Transactions on Signal Processing, 50(3):635–650, 2002.

[14] P. Z. Marmarelis and K. Naka. White-noise analysis of a neuron chain: an application of the Wiener theory. Science, 175(27):1276–1278, 1972.

[15] P. McCullagh and J. A. Nelder. Generalized Linear Models, Second Edition. Chapman and Hall, 1989.

[16] T. P. Minka. Old and new matrix algebra useful for statistics. MIT Media Lab Note, pages 1–19, 2000.

[17] A. Müller.
Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.

[18] L. Paninski. Maximum likelihood estimation of cascade point-process neural encoding models. Network: Computation in Neural Systems, 15(4):243–262, 2004.

[19] J. W. Pillow and E. P. Simoncelli. Dimensionality reduction in neural models: an information-theoretic generalization of spike-triggered average and covariance analysis. Journal of Vision, 6(4):414–428, 2006.

[20] H. Scheich, T. H. Bullock, and R. H. Hamstra. Coding properties of two classes of afferent nerve fibers: high-frequency electroreceptors in the electric fish, Eigenmannia. Journal of Neurophysiology, 36(1):39–60, 1973.

[21] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, volume 98 of Adaptive Computation and Machine Learning. MIT Press, 2001.

[22] T. Sharpee, N. C. Rust, and W. Bialek. Analyzing neural responses to natural signals: maximally informative dimensions. Neural Computation, 16(2):223–250, 2004.

[23] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learning Theory: 18th International Conference, pages 13–31. Springer-Verlag, Berlin/Heidelberg, 2007.

[24] L. Song, A. Smola, A. Gretton, J. Bedo, and K. Borgwardt. Feature selection via dependence maximization. Journal of Machine Learning Research, 13(May):1393–1434, 2012.

[25] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, and G. R. G. Lanckriet. On integral probability metrics, φ-divergences and binary classification. Technical report, arXiv, 2009.

[26] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures.
In Proceedings of the 21st Annual Conference on Learning Theory, pages 111–122. Omnipress, 2008.

[27] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.

[28] R. S. Williamson, M. Sahani, and J. W. Pillow. Equating information-theoretic and likelihood-based methods for neural dimensionality reduction. Technical report, arXiv, 2013.