{"title": "Relevant sparse codes with variational information bottleneck", "book": "Advances in Neural Information Processing Systems", "page_first": 1957, "page_last": 1965, "abstract": "In many applications, it is desirable to extract only the relevant aspects of data. A principled way to do this is the information bottleneck (IB) method, where one seeks a code that maximises information about a relevance variable, Y, while constraining the information encoded about the original data, X. Unfortunately however, the IB method is computationally demanding when data are high-dimensional and/or non-gaussian. Here we propose an approximate variational scheme for maximising a lower bound on the IB objective, analogous to variational EM. Using this method, we derive an IB algorithm to recover features that are both relevant and sparse. Finally, we demonstrate how kernelised versions of the algorithm can be used to address a broad range of problems with non-linear relation between X and Y.", "full_text": "Relevant sparse codes with variational information\n\nbottleneck\n\nMatthew Chalk\n\nIST Austria\nAm Campus 1\n\nA - 3400 Klosterneuburg, Austria\n\nOlivier Marre\n\nInstitut de la Vision\n\n17, Rue Moreau\n\n75012, Paris, France\n\nGasper Tkacik\n\nIST Austria\nAm Campus 1\n\nA - 3400 Klosterneuburg, Austria\n\nAbstract\n\nIn many applications, it is desirable to extract only the relevant aspects of data.\nA principled way to do this is the information bottleneck (IB) method, where\none seeks a code that maximizes information about a \u2018relevance\u2019 variable, Y ,\nwhile constraining the information encoded about the original data, X. Unfor-\ntunately however, the IB method is computationally demanding when data are\nhigh-dimensional and/or non-gaussian. Here we propose an approximate varia-\ntional scheme for maximizing a lower bound on the IB objective, analogous to\nvariational EM. 
Using this method, we derive an IB algorithm to recover features\nthat are both relevant and sparse. Finally, we demonstrate how kernelized versions\nof the algorithm can be used to address a broad range of problems with non-linear\nrelation between X and Y .\n\n1\n\nIntroduction\n\nAn important problem, for both humans and machines, is to extract relevant information from complex\ndata. To do so, one must be able to de\ufb01ne which aspects of data are relevant and which should be\ndiscarded. The \u2018information bottleneck\u2019 (IB) approach, developed by Tishby and colleagues [1],\nprovides a principled way to approach this problem. The idea behind the IB approach is to use\nadditional \u2018variables of interest\u2019 to determine which aspects of a signal are relevant. For example, for\nspeech signals, variables of interest could be the words being pronounced, or alternatively, the speaker\nidentity. One then seeks a coding scheme that retains maximal information about these variables of\ninterest, constrained on the information encoded about the input.\nThe IB approach has been used to tackle a wide variety of problems, including \ufb01ltering, prediction and\nlearning [2-5]. However, it quickly becomes intractable with high-dimensional and/or non-gaussian\ndata. Consequently, previous research has primarily focussed on tractable cases, where the data\ncomprises a countably small number of discrete states [1-5], or is gaussian [6].\nHere, we extend the IB algorithm of Tishby et al. [1] using a variational approximation. The algorithm\nmaximizes a lower bound on the IB objective function, and is closely related to variational EM.\nUsing this approach, we derive an IB algorithm that can be effectively applied to \u2018sparse\u2019 data in\nwhich input and relevance variables are generated by sparsely occurring latent features. The resulting\nsolutions share many properties with previous sparse coding models, used to model early sensory\nprocessing [7]. 
However, unlike these sparse coding models, the learned representation depends on: (i) the relation between the input and variable of interest; (ii) the trade-off between encoding quality and compression. Finally, we present a kernelized version of the algorithm that can be applied to a large range of problems with a non-linear relation between the input data and variables of interest.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

2 Variational IB

Let us define an input variable X, as well as a 'relevance variable', Y, with joint distribution p(y, x). The goal of the IB approach is to compress the variable X through another variable R, while conserving information about Y. Mathematically, we seek an encoding model, p(r|x), that maximizes:

L_{p(r|x)} = I(R;Y) - \gamma I(R;X) \equiv \big\langle \log p(y|r) - \log p(y) + \gamma \log p(r) - \gamma \log p(r|x) \big\rangle_{p(r,x,y)}, \qquad (1)

where 0 < γ < 1 is a Lagrange multiplier that determines the strength of the bottleneck.
Tishby and colleagues showed that the IB loss function can be optimized by applying the iterative updates: p_{t+1}(r|x) \propto p_t(r) \exp\big[ -\tfrac{1}{\gamma} \sum_y p(y|x) \log \tfrac{p(y|x)}{p_t(y|r)} \big], p_{t+1}(r) = \int_x p(x)\, p_{t+1}(r|x), and p_{t+1}(y|r) = \int_x p(y|x)\, p_{t+1}(x|r) [1]. Unfortunately, however, when p(x, y) is high-dimensional and/or non-gaussian these updates become intractable, and approximations are required.
Due to the positivity of the KL divergence, we can write \langle \log q(\cdot) \rangle_{p(\cdot)} \le \langle \log p(\cdot) \rangle_{p(\cdot)} for any approximating distribution q(·). This allows us to formulate a variational lower bound on the IB objective function:

\tilde{L}_{p(r|x), q(y|r), q(r)} = \frac{1}{N} \sum_{n=1}^{N} \big\langle \log q(y_n|r) + \gamma \log q(r) - \gamma \log p(r|x_n) \big\rangle_{p(r|x_n)} \le L_{p(r|x)}, \qquad (2)

where q(y_n|r) and q(r) are variational distributions, and we have replaced the expectation over p(x, y) with the empirical expectation over the training data. (Note that, for notational simplicity, we have also omitted the constant term, H_Y = -\langle \log p(y) \rangle_{p(y)}.)
Setting q(y_n|r) ← p(y_n|r) and q(r) ← p(r) fully tightens the bound (so that L̃ = L), and leads to the iterative algorithm of Tishby et al. However, when these exact updates are not possible, one can instead choose a restricted class of distributions q(y|r) ∈ Q_{y|r} and q(r) ∈ Q_r for which inference is tractable. Thus, to maximize L̃ with respect to parameters Θ of the encoding distribution, p(r|x, Θ), we repeat the following steps until convergence:

• For fixed Θ, find {q_new(y|r), q_new(r)} = argmax_{ {q(y|r), q(r)} ∈ {Q_{y|r}, Q_r} } L̃
• For fixed q(y|r) and q(r), find Θ = argmax_Θ L̃.

We note that using a simple approximation for the decoding distribution, q(y|r), can carry additional benefits, besides rendering the IB algorithm tractable. Specifically, while an advantage of mutual information is its generality, in certain cases this can also be a drawback. That is, because Shannon information does not make any assumptions about the code, it is not always apparent how information should be best extracted from the responses: just because information is 'there' does not mean we know how to get at it.
In contrast, using a simple approximation for the decoding distribution, q(y|r) (e.g. linear gaussian), constrains the IB algorithm to find solutions where information about Y can be easily extracted from the responses (e.g.
via linear regression).

3 Sparse IB

In previous work on gaussian IB [6], responses were equal to a linear projection of the input, plus noise: r = W x + η, where W is an N_r × N_x matrix of encoding weights, and η ∼ N(η|0, Σ), where Σ is an N_r × N_r covariance matrix. When the joint distribution, p(x, y), is gaussian, it follows that the marginal and decoding distributions, p(r) and p(y|r), are also gaussian, and the parameters of the encoding distribution, W and Σ, can be found analytically.
To illustrate the capabilities of the variational algorithm, while permitting comparison to gaussian IB, we begin by adding a single degree of complexity. In common with gaussian IB, we consider a linear gaussian encoder, p(r|x) = N(r|W x, Σ), and decoder, q(y|r) = N(y|U r, Λ). However, unlike gaussian IB, we use a student-t distribution to approximate the response marginal: q(r) = \prod_i \mathrm{Student}(r_i|0, \omega_i^2, \nu_i), with scale and shape parameters, ω_i^2 and ν_i, respectively. When the shape parameter, ν_i, is small, the student-t distribution is heavy-tailed, or 'sparse', compared to a gaussian distribution. Thus, we call the resulting algorithm 'sparse IB'. Unlike gaussian IB, the introduction of a student-t marginal means the IB algorithm cannot be solved analytically, and one requires approximations.

3.1 Iterative algorithm

Recall that the IB objective function consists of two terms: I(R;Y) and I(R;X).
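To make the model concrete, the three ingredients just defined (linear gaussian encoder, linear gaussian decoder, product-of-student-t marginal) can be sketched as follows. This is an illustrative sketch with made-up toy dimensions, not the authors' code; the marginal q(r) is the sum of `log_student_t` over units.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
Nx, Nr, Ny = 4, 3, 2                       # toy dimensions (illustrative only)

W = rng.standard_normal((Nr, Nx))          # encoding weights
Sigma = 0.1 * np.eye(Nr)                   # encoding noise covariance
U = rng.standard_normal((Ny, Nr))          # decoding weights

def encode(x):
    """Sample from the linear gaussian encoder p(r|x) = N(r | W x, Sigma)."""
    return W @ x + rng.multivariate_normal(np.zeros(Nr), Sigma)

def log_student_t(r, omega2, nu):
    """Log density of a zero-mean student-t with scale omega^2 and shape nu:
    the per-unit factor of the variational response marginal q(r)."""
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi * omega2)
            - 0.5 * (nu + 1) * math.log1p(r ** 2 / (nu * omega2)))

def log_gauss(r, omega2):
    """Zero-mean gaussian log density of the same scale, for comparison."""
    return -0.5 * math.log(2 * math.pi * omega2) - 0.5 * r ** 2 / omega2
```

For small ν the student-t places far more mass in its tails than a gaussian of the same scale, which is what makes the marginal 'sparse'; as ν grows it converges to the gaussian, recovering the gaussian IB setting.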
We begin by describing how to optimize a lower bound on the first of these terms, and an upper bound on the second, with respect to the variational distributions q(y|r) and q(r), respectively.
The first term of the IB objective function is bounded from below by:

I(R;Y) \ge -\frac{1}{2} \log|\Lambda| - \frac{1}{2N} \sum_n \big\langle (y_n - U r)^T \Lambda^{-1} (y_n - U r) \big\rangle_{p(r|x_n)} + \mathrm{const.} \qquad (3)

Maximizing the lower bound on I(R;Y) with respect to the decoding parameters, U and Λ, gives:

U = C_{xy}^T W^T \big( W C_{xx} W^T + \Sigma \big)^{-1}, \qquad \Lambda = C_{yy} - U W C_{xy}, \qquad (4)

where C_{yy} = \frac{1}{N} \sum_n y_n y_n^T, C_{xy} = \frac{1}{N} \sum_n x_n y_n^T, and C_{xx} = \frac{1}{N} \sum_n x_n x_n^T.
Unfortunately, it is not straightforward to express the bound on I(R;X) in closed form. Instead, we use an additional variational approximation, utilising the fact that the student-t distribution can be expressed as an infinite mixture of gaussians: \mathrm{Student}(r|0, \omega^2, \nu) = \int_\eta N(r|0, \omega^2/\eta)\, \mathrm{Gamma}(\eta|\tfrac{\nu}{2}, \tfrac{\nu}{2})\, d\eta [8]. Following a standard EM procedure [9], one can thus write a tractable lower bound on the log-likelihood, l ≡ log[Student(r|0, ω^2, ν)], which corresponds to an upper bound on the bottleneck term:

I(R;X) \le \frac{1}{N} \sum_{i,n} \big\langle -\log q(r_i) + \log p(r_i|x_n) \big\rangle_{p(r_i|x_n)} \le \sum_i \Big[ \frac{1}{2} \log \omega_i^2 + \frac{1}{2N\omega_i^2} \sum_{n=1}^{N} \xi_{ni} \langle r_{ni}^2 \rangle + f(\nu_i, \xi_i, a_i) \Big] - \frac{1}{2} \log|\Sigma| + \mathrm{const.}, \qquad (5)

where ξ_ni and a_i denote variational parameters for the ith unit and nth data instance. We used the shorthand notation \langle r_{ni}^2 \rangle = w_i x_n x_n^T w_i^T + \sigma_i^2, where σ_i^2 is the ith diagonal element of Σ and w_i is the ith row of W.
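The closed-form decoder update in eq. (4) can be checked numerically. The sketch below (toy data of our own invention, not the paper's experiments) computes U and Λ from the empirical covariances and confirms that Λ comes out as a valid symmetric, positive-definite residual covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
N, Nx, Ny, Nr = 2000, 4, 2, 3              # toy sizes (illustrative only)

X = rng.standard_normal((N, Nx))           # rows are inputs x_n
Y = X @ rng.standard_normal((Nx, Ny)) + 0.1 * rng.standard_normal((N, Ny))

W = rng.standard_normal((Nr, Nx))          # current encoding weights
Sigma = 0.05 * np.eye(Nr)                  # encoding noise covariance

# empirical covariances C_xx, C_xy, C_yy
Cxx = X.T @ X / N
Cxy = X.T @ Y / N
Cyy = Y.T @ Y / N

# eq. (4): optimal linear gaussian decoder for fixed W and Sigma
U = Cxy.T @ W.T @ np.linalg.inv(W @ Cxx @ W.T + Sigma)
Lam = Cyy - U @ W @ Cxy                    # residual (decoder) covariance
```

Because Λ is the residual covariance of the best linear read-out from the noisy responses, it is bounded below by the regression noise floor, so its eigenvalues stay strictly positive.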
For notational simplicity, terms that do not depend on the encoding parameters were pushed into the function f(ν_i, ξ_i, a_i) (footnote 1).
Minimizing the upper bound on I(R;X) with respect to ω_i^2, ξ_ni and a_i (for fixed ν_i) gives:

\omega_i^2 = \frac{1}{N} \sum_{n=1}^{N} \xi_{ni} \langle r_{ni}^2 \rangle, \qquad \xi_{ni} = \frac{\nu_i + 1}{\nu_i + \langle r_{ni}^2 \rangle / \omega_i^2}, \qquad a_i = \tfrac{1}{2}(\nu_i + 1). \qquad (6)

The shape parameter, ν_i, is then found numerically on each iteration (for fixed ξ_ni and a_i), by solving:

\psi\big(\tfrac{\nu_i}{2}\big) - \log\big(\tfrac{\nu_i}{2}\big) = 1 + \frac{1}{N} \sum_{n=1}^{N} \Big[ \psi(a_i) - \log \frac{a_i}{\xi_{ni}} - \xi_{ni} \Big], \qquad (7)

where ψ(·) is the digamma function [9].
Next we maximize the full variational objective function L̃ with respect to the encoding distribution, p(r|x) (for fixed q(y|r) and q(r)).
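The updates in eq. (6) form a simple fixed-point iteration that can be run directly. A minimal numerical sketch (made-up second moments, ν held fixed; in the full algorithm ν_i would then be updated by a 1-D root solve of eq. (7), which needs the digamma function, e.g. scipy.special.digamma):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1000
nu = 3.0                                   # shape parameter, held fixed here
r2 = rng.standard_normal(N) ** 2 + 1e-3    # stand-ins for the moments <r_ni^2>

omega2 = r2.mean()                         # initialise the scale
for _ in range(200):
    xi = (nu + 1.0) / (nu + r2 / omega2)   # eq. (6): variational weights xi_ni
    omega2 = (xi * r2).mean()              # eq. (6): scale update omega_i^2

a = 0.5 * (nu + 1.0)                       # eq. (6): gamma shape parameter a_i
```

Responses whose second moment greatly exceeds ω^2 are down-weighted (ξ_ni < 1), which is how the heavy tails of the student-t marginal enter the subsequent Σ and W updates.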
Maximizing L̃ with respect to the encoding noise covariance, Σ, gives:

\Sigma^{-1} = \frac{1}{\gamma} U^T \Lambda^{-1} U + \Omega^{-1} \frac{1}{N} \sum_{n=1}^{N} \Xi_n, \qquad (8)

Footnote 1: f(\nu_i, \xi_i, a_i) = \log \Gamma\big(\tfrac{\nu_i}{2}\big) - \tfrac{\nu_i}{2} \log \tfrac{\nu_i}{2} - \frac{1}{N} \sum_n \Big[ \tfrac{\nu_i - 1}{2} \big( \psi(a_i) - \ln \tfrac{a_i}{\xi_{ni}} \big) - \tfrac{\nu_i}{2} \xi_{ni} + H_{ni} \Big], where H_ni is the entropy of a gamma distribution with shape and rate parameters a_i and a_i/ξ_ni, respectively [9].

Figure 1: Behaviour of sparse IB and gaussian IB algorithms on a denoising task. (A) Artificial image patches were constructed from combinations of orientated edge-like features. Patches were corrupted with white noise to generate the input, X. The goal of the IB algorithm is to learn a linear code that maximizes information about the original patches, Y, constrained on information encoded about the input, X. (B) A selection of linear encoding (left) and decoding (right) filters obtained with the gaussian IB algorithm. (C) Same as B, but for the sparse IB algorithm. (D) Response histograms for the 10 units with highest variance, for the gaussian (red) and sparse (blue) IB algorithms. (E) Information curves for the gaussian (red) and sparse (blue) algorithms, alongside a 'null' model, where responses were equal to the original input, plus white noise. (F) Fraction of response variance attributed to signal fluctuations, for each unit.
Solid and dashed curves correspond to strong and weak bottlenecks, respectively (corresponding to the vertical dashed lines in panel E).

In eq. (8), Ω and Ξ_n are N_r × N_r diagonal covariance matrices, with diagonal elements Ω_ii = ω_i^2 and (Ξ_n)_ii = ξ_ni, respectively.
Finally, taking the derivative of L̃ with respect to the encoding weights, W, gives:

\frac{\partial \tilde{L}}{\partial W} = U^T \Lambda^{-1} C_{xy}^T - U^T \Lambda^{-1} U W C_{xx} - \gamma \Omega^{-1} \frac{1}{N} \sum_n \Xi_n W x_n x_n^T. \qquad (9)

Setting the derivative to zero, we can solve for W directly. One may verify that, when the variational parameters, ξ_ni, are unity, the above iterative updates are identical to the iterative gaussian IB algorithm described in [6].

3.2 Simulations

In our framework, the approximation of the response marginal, q(r), plays an analogous role to the prior distribution in a probabilistic generative model. Thus, we hypothesized that a sparse approximation for the response marginal, q(r), would permit the IB algorithm to recover sparsely occurring input features, analogous to the effect of using a sparse prior.

Figure 2: Variant of the task in figure 1, in which the input noise is spatially correlated. (A) Example input X and patch, Y. Spatial noise correlations were aligned along the vertical direction. (B) Subset of decoding filters obtained with the sparse IB algorithm.
(C) Distribution of encoded orientations. (D) Example stimulus (left) and reconstruction (right) of bars presented at variable orientations (presented with zero input noise, so that X ≡ Y for this example).

To show this, we constructed artificial 9 × 9 image patches from combinations of orientated bar features. Each bar had a gaussian cross-section of width 1.2 pixels, with maximum amplitude drawn from a standard normal distribution. Patches were constructed by linearly combining 3 bars, with uniformly random orientation and position.
Initially, we considered a simple de-noising task, where the input, X, was a noisy version of the original image patches (gaussian noise, with variance σ^2 = 0.005; figure 1A). Training data consisted of 10,000 patches. Figures 1B and 1C show a selection of encoding (W) and decoding (U) filters obtained with the gaussian and sparse IB models, respectively. As predicted, only the sparse IB model was able to recover the original bar features. In addition, response histograms were considerably more heavy-tailed for the sparse IB model (fig. 1D).
The relevant information, I(R;Y), encoded by the sparse model was greater than for the gaussian model, over a range of bottleneck strengths (fig. 1E). While the difference may appear small, it is consistent with work showing that sparse coding models achieve only a small improvement in log-likelihood for natural image patches [10]. We also plotted the information curve for a 'null model', with responses sampled from p(r|x) = N(r|x, σ^2 I). Interestingly, the performance of this null model was almost identical to that of the gaussian IB model.
Figure 1F plots the fraction of response variance due to the signal, w_i C_{xx} w_i^T / (w_i C_{xx} w_i^T + σ_i^2), for each unit. Solid and dashed curves denote strong and weak bottlenecks, respectively.
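The artificial stimuli used in these simulations can be generated roughly as described above (9 × 9 patches, three gaussian-profile bars of width 1.2 pixels at random orientations and positions, white noise of variance 0.005). The sketch below is a guessed reconstruction of that procedure; all names are illustrative, not the authors' code:

```python
import numpy as np

def bar_patch(rng, size=9, n_bars=3, width=1.2, noise_var=0.005):
    """Artificial patch: linear sum of oriented bars with gaussian
    cross-section (target Y), plus a white-noise-corrupted input X."""
    ii, jj = np.mgrid[:size, :size]
    y = np.zeros((size, size))
    for _ in range(n_bars):
        theta = rng.uniform(0, np.pi)              # random orientation
        offset = rng.uniform(-size / 2, size / 2)  # random position
        amp = rng.standard_normal()                # amplitude ~ N(0, 1)
        # signed distance of each pixel from the bar's centre line
        d = ((ii - size // 2) * np.cos(theta)
             + (jj - size // 2) * np.sin(theta) - offset)
        y += amp * np.exp(-d ** 2 / (2 * width ** 2))
    x = y + np.sqrt(noise_var) * rng.standard_normal((size, size))
    return x, y

rng = np.random.default_rng(0)
x, y = bar_patch(rng)                              # one (input, target) pair
```

A training set is then just many such (x, y) pairs; for the correlated-noise variant of figure 2, the white noise would be replaced by noise with a gaussian spatial covariance envelope.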
In both cases, the gaussian model gave a smooth spectrum of response magnitudes, while the sparse model was more 'all-or-nothing'.
One way the sparse IB algorithm differs qualitatively from traditional sparse coding algorithms is that the learned representation depends on the relation between X and Y, rather than just the input statistics. To illustrate this, we conducted simulations with patches corrupted by spatially correlated noise, aligned along the vertical direction (fig. 2A). The spatial covariance of the noise was described by a gaussian envelope, with standard deviation 3 pixels in the vertical direction and 1 pixel in the horizontal direction.
Figure 2B shows a selection of decoding filters obtained from the sparse IB model, with correlated input noise. The shape of individual filters was qualitatively similar to those obtained with uncorrelated noise (fig. 1C). However, with this stimulus, the IB model avoided 'wasting' bits by representing features co-orientated with the noise (fig. 2C). Consequently, it was not possible to reconstruct vertical bars from the responses, when bars were presented alone, even with zero noise (fig. 2D).

4 Kernel IB

One way to improve the IB algorithm is to consider non-linear encoders. A general choice is: p(r|x) = N(r|W φ(x), Σ), where φ(x) is an embedding to a high-dimensional non-linear feature space.

Figure 3: Behaviour of kernel IB algorithm on an occlusion task. (A) Image patches were the same as for figure 1. However, the input, X, was restricted to 2 columns to either side of the patch. The target variable, Y, was the central region. (B) Subset of decoding filters, U, for the sparse kernel IB ('sparse kIB') algorithm.
(C) As for B, for other versions of the IB algorithm. (D) Information curves for the gaussian kIB (blue), sparse kIB (green) and sparse IB (red) algorithms. The bottleneck strength for the other panels in this figure is indicated by a vertical dashed line. (E) Response histogram for the 10 units with highest variance, for the gaussian and sparse kIB models. (F) (above) Three test stimuli, used to demonstrate the non-linear properties of the sparse kIB code. (below) Reconstruction obtained from responses to each test stimulus. (G) Responses of two units which showed strong responses to stimulus 3. The decoding filters for these units are shown above the bar plots.

The variational objective functions for both gaussian and sparse IB algorithms are quadratic in the responses, and thus can be expressed in terms of dot products of the row vector φ(x). Consequently, every solution for w_i can be expressed as an expansion of mapped training data, w_i = \sum_{n=1}^{N} a_{in} \phi(x_n) [11]. It follows that the variational IB algorithm can be expressed in 'dual space', with responses to the nth input drawn from r ∼ N(r|A k_n, Σ), where A is an N_r × N matrix of expansion coefficients, and k_n is the nth column of the N × N kernel-gram matrix, K, with elements K_nm = φ(x_n) φ(x_m)^T. In this formulation, the problem of finding the linear encoding weights, W, is replaced by finding the expansion coefficients, A.
The advantage of expressing the algorithm in the dual space is that we never have to deal with φ(x) directly, so we are free to consider high- (or even infinite-) dimensional feature spaces. However, without additional constraints on the expansion coefficients, A, the IB algorithm becomes degenerate (i.e. the solutions are independent of the input, X). A standard way to deal with this is to add an L2 regularization term that favours solutions with small expansion coefficients.
Here, this is achieved by replacing φ_n^T φ_n with φ_n^T φ_n + λI, where λ is a fixed regularization parameter. Doing so, the derivative of L̃ with respect to A becomes:

\frac{\partial \tilde{L}}{\partial A} = U^T \Lambda^{-1} Y K - \sum_n \big( U^T \Lambda^{-1} U + \gamma \Omega^{-1} \Xi_n \big) A \big( k_n k_n^T + \lambda K \big). \qquad (10)

Setting the derivative to zero and solving for A directly requires inverting an N N_r × N N_r matrix, which is expensive. Instead, one can use an iterative solver (we used the conjugate gradients squared method). In addition, the computational complexity can be reduced by restricting the solution to lie on a subspace of training instances, such that w_i = \sum_{n=1}^{M} a_{in} \phi(x_n), where M < N.

Figure 4: Behaviour of kernel IB algorithm on handwritten digit data. (A) As with figure 3, we considered an occlusion task. This time, units were provided with the left hand side of the image patch, and had to reconstruct the right hand side. (B) Response distribution for 10 neurons with highest variance, for the gaussian (blue) and sparse (green) kIB algorithms. (C) Decoding filters for a subset of units, obtained with the sparse kIB algorithm. Note that, for clearer visualization, we show here the decoding filter for the entire image patch, not just the occluded region. (D) A selection of decoding filters obtained with the alternative IB algorithms.
The derivation does not change, only now K has dimensions M × N [11].
When q(r) is gaussian (equivalent to setting Ξ_n = I), solving for A gives:

A = \big( U^T \Lambda^{-1} U + \gamma \Omega^{-1} \big)^{-1} U^T \Lambda^{-1} A_{KRR}, \qquad (11)

where A_KRR = Y (K + λI)^{-1} are the coefficients obtained from kernel ridge regression (KRR). This suggests the following two-stage algorithm: first, we learn the regularisation constant, λ, and parameters of the kernel matrix, K, to maximize KRR performance on hold-out data; next, we perform variational IB, with fixed K and λ.

4.1 Simulations

To illustrate the capabilities of the kernel IB algorithm, we considered an 'occlusion' task, with the outer columns of each patch presented as input, X (2 columns to the far left and right), and the inner columns as the relevance variable, Y, to be reconstructed. Image patches were as before. Note that performing the occlusion task optimally requires detecting combinations of features presented to either side of the occluded region, and is thus inherently nonlinear.
We used gaussian kernels, with scale parameter, κ, and regularisation constant, λ, chosen to maximize KRR performance on test data. Both test and training data consisted of 10,000 images. However, A was restricted to lie on a subset of 1000 randomly chosen training patches (see earlier).
Figure 3B shows a selection of decoding filters (U) learned by the sparse kernel IB algorithm ('sparse kIB'). A large fraction of filters resembled near-horizontal bars, traversing the occluded region. This was not the case for the sparse linear IB algorithm, which recovered localized blobs either side of the occluded region, nor the gaussian linear or kernelized models, which recovered non-local features (fig. 3C). Figure 3D shows a small but significant improvement in performance for the sparse kIB versus the gaussian kIB model.
Most noticeable, however, is the distribution of responses, which are much more heavy-tailed for the sparse kIB algorithm (fig. 3E).
To demonstrate the non-linear behaviour of the sparse kIB model, we presented bar segments: first to either side of the occluded patch, then to both sides simultaneously. When bar segments were presented to both sides simultaneously, the sparse kIB model 'filled in' the missing bar segment, in contrast to the reconstruction obtained with single bar segments (fig. 3F). This behaviour was reflected in the non-linear responses of certain encoding units, which were large when two segments were presented together, but near zero when one segment was presented alone (fig. 3G).
Finally, we repeated the occlusion task with handwritten digits, taken from the USPS dataset (www.gaussianprocess.org/gpml/data). We used 4649 training and 4649 test patches, of 16×16 pixels. However, expansion coefficients were restricted to lie on a subset of 500 randomly chosen patches. We set X and Y to be the left and right sides of each patch, respectively (fig. 4A).
In common with the artificial data, the response distributions achieved with the sparse kIB algorithm were more heavy-tailed than for the gaussian kIB algorithm (fig. 4B). Likewise, recovered decoding filters closely resembled handwritten digits, and extended far into the occluded region (fig. 4C). This was not the case for the alternative IB algorithms (fig. 4D).

5 Discussion

Previous work has shown close parallels between the IB framework and maximum-likelihood estimation in a latent variable model [12, 13].
For the sparse IB algorithm presented here, maximizing the IB objective function is closely related to maximizing the likelihood of a 'sparse coding' latent variable model, with student-t prior and linear gaussian likelihood function. However, unlike traditional sparse coding models, the encoding (or 'recognition') model p(r|x) is conditioned on a separate set of inputs, X, distinct from the image patches themselves. Thus, the solutions depend on the relation between X and Y, not just the image statistics (e.g. see fig. 2). Second, an additional parameter, γ, not present in sparse coding models, controls the trade-off between encoding and compression. Finally, in contrast to traditional sparse coding algorithms, IB gives an unambiguous ordering of features, which can be arranged according to the response variance of each unit (fig. 1F).
Our work is also closely related to the IM algorithm, proposed by Barber et al. to solve the information maximization ('infomax') problem [14]. However, a general issue with infomax problems is that they are usually ill-posed, necessitating additional ad hoc constraints on the encoding weights or responses [15]. In contrast, in the IB approach, such constraints emerge automatically from the bottleneck term.
A related method to find low-dimensional projections of X/Y pairs is canonical correlation analysis ('CCA'), and its kernel analogue [16]. In fact, the features obtained with gaussian IB are identical to those obtained with CCA [6]. However, unlike CCA, the number and 'scale' of the features are not specified in advance, but determined by the bottleneck parameter, γ. Secondly, kernel CCA is symmetric in X and Y, and thus performs nonlinear embedding of both X and Y. In contrast, the IB problem is asymmetric: we are interested in recovering Y from an input X. Thus, only X is kernelized, while the decoder remains linear.
Finally, the features obtained from gaussian IB (and thus, CCA) differ qualitatively from those of the sparse IB algorithm, which recovers sparse features that account jointly for X and Y.
Sparse IB can be extended to the nonlinear regime using a kernel expansion. For the gaussian model, the expansion coefficients, A, are a linear projection of the coefficients used for kernel ridge regression ('KRR'). A general disadvantage of KRR is that it can be difficult to know which aspects of X are relied on to perform the regression. In contrast, the kernel IB framework provides an intermediate representation, allowing one to visualize the features that jointly account for both X and Y (figs. 3B & 4C). Furthermore, this learned representation permits generalisation across different tasks that rely on the same set of latent features; something not possible with KRR.
Finally, the IB approach has important implications for models of early sensory processing [17, 18]. Notably, 'efficient coding' models typically consider the low-noise limit, where the goal is to reduce the neural response redundancy [7]. In contrast, the IB approach provides a natural way to explore the family of solutions that emerge as one varies internal coding constraints (by varying γ) and external constraints (by varying the input, X) [19, 20]. Further, our simulations suggest how the framework can be used to go beyond early sensory processing: for example, to explain higher-level cognitive phenomena such as perceptual filling-in (fig. 3G). In future, it would be interesting to explore how the IB framework can be used to extend the efficient coding theory, by accounting for modulations in sensory processing that occur due to changing task demands (i.e. via changes to the relevance variable, Y), rather than just the input statistics (X).

References

[1] Tishby, N., Pereira, F. C. & Bialek, W.
(1999) The information bottleneck method. In The 37th Annual Allerton Conference on Communication, Control and Computing, pp. 368–377.
[2] Bialek, W., Nemenman, I. & Tishby, N. (2001) Predictability, complexity, and learning. Neural Computation, 13(11), pp. 2409–2463.
[3] Slonim, N. (2003) Information bottleneck theory and applications. PhD thesis, Hebrew University of Jerusalem.
[4] Chechik, G. & Tishby, N. (2002) Extracting relevant structures with side information. In Advances in Neural Information Processing Systems 15.
[5] Hofmann, T. & Gondek, D. (2003) Conditional information bottleneck clustering. In 3rd IEEE International Conference on Data Mining, workshop on clustering large data sets.
[6] Chechik, G., Globerson, A., Tishby, N. & Weiss, Y. (2005) Information bottleneck for gaussian variables. Journal of Machine Learning Research, 6, pp. 165–188.
[7] Simoncelli, E. P. & Olshausen, B. A. (2001) Natural image statistics and neural representation. Annual Review of Neuroscience, 24, pp. 1193–1216.
[8] Andrews, D. F. & Mallows, C. L. (1974) Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B, 36(1), pp. 99–102.
[9] Scheffler, C. (2008) A derivation of the EM updates for finding the maximum likelihood parameter estimates of the student-t distribution. Technical note. URL www.inference.phy.cam.ac.uk/cs482/publications/scheffler2008derivation.pdf
[10] Eichhorn, J., Sinz, F. & Bethge, M. (2009) Natural image coding in V1: how much use is orientation selectivity? PLoS Computational Biology, 5(4), e1000336.
[11] Mika, S., Ratsch, G., Weston, J., Scholkopf, B., Smola, A. J. & Muller, K. R. (1999) Invariant feature extraction and classification in kernel spaces. In Advances in Neural Information Processing Systems 12, pp. 526–532.
[12] Slonim, N. & Weiss, Y. (2002) Maximum likelihood and the information bottleneck. In Advances in Neural Information Processing Systems, pp. 335–342.
[13] Elidan, G. & Friedman, N.
(2003) The information bottleneck EM algorithm. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, pp. 200–208.
[14] Barber, D. & Agakov, F. (2004) The IM algorithm: a variational approach to information maximization. In Advances in Neural Information Processing Systems 16, pp. 201–208.
[15] Doi, E., Gauthier, J. L., Field, G. D., Shlens, J., Sher, A. & Greschner, M. (2012) Efficient coding of spatial information in the primate retina. The Journal of Neuroscience, 32(46), pp. 16256–16264.
[16] Hardoon, D. R., Szedmak, S. & Shawe-Taylor, J. (2004) Canonical correlation analysis: an overview with application to learning methods. Neural Computation, 16(12), pp. 2639–2664.
[17] Bialek, W., de Ruyter van Steveninck, R. R. & Tishby, N. (2006) Efficient representation as a design principle for neural coding and computation. In 2006 IEEE International Symposium on Information Theory, pp. 659–663.
[18] Palmer, S. E., Marre, O., Berry, M. J. & Bialek, W. (2015) Predictive information in a sensory population. Proceedings of the National Academy of Sciences, 112(22), pp. 6908–6913.
[19] Doi, E. & Lewicki, M. S. (2005) Sparse coding of natural images using an overcomplete set of limited capacity units. In Advances in Neural Information Processing Systems 17, pp. 377–384.
[20] Tkacik, G., Prentice, J. S., Balasubramanian, V. & Schneidman, E. (2010) Optimal population coding by noisy spiking neurons. Proceedings of the National Academy of Sciences, 107(32), pp. 14419–14424.
", "award": [], "sourceid": 1061, "authors": [{"given_name": "Matthew", "family_name": "Chalk", "institution": "IST Austria"}, {"given_name": "Olivier", "family_name": "Marre", "institution": "Institut de la vision"}, {"given_name": "Gasper", "family_name": "Tkacik", "institution": "Institute of Science and Technology Austria"}]}