{"title": "An Application of the Principle of Maximum Information Preservation to Linear Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 186, "page_last": 194, "abstract": null, "full_text": "186 \n\nAN APPLICATION OF THE PRINCIPLE OF MAXIMUM INFORMATION PRESERVATION TO LINEAR SYSTEMS \n\nRalph Linsker \n\nIBM T. J. Watson Research Center, Yorktown Heights, NY 10598 \n\nABSTRACT \n\nThis paper addresses the problem of determining the weights for a set of linear filters (model \"cells\") so as to maximize the ensemble-averaged information that the cells' output values jointly convey about their input values, given the statistical properties of the ensemble of input vectors. The quantity that is maximized is the Shannon information rate, or equivalently the average mutual information between input and output. Several models for the role of processing noise are analyzed, and the biological motivation for considering them is described. For simple models in which nearby input signal values (in space or time) are correlated, the cells resulting from this optimization process include center-surround cells and cells sensitive to temporal variations in input signal. \n\nINTRODUCTION \n\nI have previously proposed [Linsker, 1987, 1988] a principle of \"maximum information preservation,\" also called the \"infomax\" principle, that may account for certain aspects of the organization of a layered perceptual network. The principle applies to a layer L of cells (which may be the input layer or an intermediate layer of the network) that provides input to a next layer M. The mapping of the input signal vector L onto an output signal vector M, f: L → M, is characterized by a conditional probability density function (\"pdf\") p(M|L). The set S of allowed mappings f is specified. The input pdf p_L(L) is also given. (In the cases considered here, there is no feedback from M to L.) 
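For concreteness, the quantity being maximized can be evaluated numerically in a toy setting. The sketch below (function and variable names are my own, not from the paper) anticipates the linear models analyzed later: for a Gaussian input L with covariance Q^L and output M = C L + noise with i.i.d. Gaussian noise of variance B, the mutual information between L and M is (1/2) ln Det(I + C Q^L C^T / B).

```python
import numpy as np

def gaussian_info_rate(C, QL, B):
    """Mutual information (in nats) between Gaussian input L ~ N(0, QL)
    and output M = C @ L + noise, with i.i.d. Gaussian noise of variance B.
    R = (1/2) ln Det(I + C QL C^T / B)."""
    n_out = C.shape[0]
    W = np.eye(n_out) + C @ QL @ C.T / B
    return 0.5 * np.linalg.slogdet(W)[1]

# Toy example: two output cells reading three correlated inputs.
rng = np.random.default_rng(0)
QL = np.array([[1.0, 0.5, 0.25],
               [0.5, 1.0, 0.5],
               [0.25, 0.5, 1.0]])
C = rng.standard_normal((2, 3))
r_low_noise = gaussian_info_rate(C, QL, B=1.0)
r_high_noise = gaussian_info_rate(C, QL, B=10.0)
# More noise conveys less information: r_high_noise < r_low_noise.
```

Scaling C up drives the rate arbitrarily high, which is why the models below constrain the weights or fold the weight magnitudes into the noise model.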
The infomax principle states that a mapping f should be chosen for which the Shannon information rate [Shannon, 1949] \n\nR(f) = ∫ dL p_L(L) ∫ dM p(M|L) log[p(M|L)/p_M(M)]   (1) \n\nis a maximum (over all f in the set S). Here p_M(M) = ∫ dL p_L(L) p(M|L) is the pdf of the output signal vector M. R is identical to the average mutual information between L and M. \n\nTo understand better how the infomax principle may be applied to biological systems and complex synthetic networks, it is useful to solve the infomax optimization problem explicitly for simpler systems whose properties are nonetheless biologically motivated. This paper therefore deals with the practical computation of infomax solutions for cases in which the mappings f are constrained to be linear. \n\nINFOMAX SOLUTIONS FOR A SET OF LINEAR FILTERS \n\nWe consider the case of linear model \"neurons\" with multivariate Gaussian input and additive Gaussian noise. There are N input (L) cells and N' output (M) cells. The input column vector L = (L_1, L_2, ..., L_N)^T is randomly selected from an N-dimensional Gaussian distribution having mean zero. That is, \n\np_L(L) = [(2π)^N Det Q^L]^{-1/2} exp[−(1/2) L^T (Q^L)^{-1} L],   (2) \n\nwhere Q^L is the covariance matrix of the input activities, Q^L_{ij} = ∫ dL p_L(L) L_i L_j. (Superscript T denotes the matrix transpose.) \n\nTo specify the set S of allowed mappings f: L → M, we define a processing model that includes a description of (i) how noise enters during processing, (ii) the independent variables over which we are to maximize R, and (iii) any constraints on their values. Figure 1 shows several such models. We shall analyze the simplest, then explain the motivation for the more complex models and analyze them in turn. \n\nModel A -- Additive noise of constant variance \n\nIn Model A of Fig. 1 the output signal value of the nth M cell is: \n\nM_n = Σ_i C_{ni} L_i + ν_n.   (3) \n\nThe noise components ν_n are independently and identically distributed (\"i.i.d.
\") random variables drawn from a Gaussian distribution having a mean of zero and variance B. \n\nEach mapping f: L → M is characterized by the values of the {C_{ni}} and the noise parameter B. The elements of the covariance matrix of the output activities are (using Eqn. 3) \n\nQ^M_{nm} = Σ_{i,j} C_{ni} Q^L_{ij} C_{mj} + B δ_{nm},   (4) \n\nwhere δ_{nm} = 1 if n = m and 0 otherwise. Evaluating Eqn. 1 for this processing model gives the information rate: \n\nR(f) = (1/2) ln Det W(f),   (5) \n\nwhere W_{nm} = Q^M_{nm}/B. (R is the difference of two entropy terms. See [Shannon, 1949], p. 57, for the entropy of a Gaussian distribution.) \n\nIf the components C_{ni} of the C matrix are allowed to be arbitrarily large, then the information rate can be made arbitrarily large, and the effects of noise become arbitrarily small. One way to limit C is to impose a \"resource constraint\" on each M cell. An example of such a constraint is Σ_i C²_{ni} = 1 for all n. One can then attempt directly, using numerical methods, to maximize Eqn. 5 over all allowed C for given B. However, when some additional conditions (below) are satisfied, further analytical progress can be made. \n\nSuppose the N L-cells are uniformly spaced along the line interval [0,1] with periodic boundary conditions, so that cell N is next to cell 1. [The analysis can be extended to a two- (or higher-) dimensional array in a straightforward manner.] Suppose also that (for given N) the covariance Q^L_{ij} of the input values at cells i and j is a function Q^L(s_{ij}) only of the displacement s_{ij} from i to j. (We deal with the periodicity by defining s_{ab} = b − a − Nγ_{ab} and choosing the integer γ_{ab} such that −N/2 ≤ s_{ab} < N/2.) Then Q^L is a Toeplitz matrix, and its eigenvalues {λ_k} are the components of the discrete Fourier transform (\"F.T.\") of Q^L(s): \n\nλ_k = Σ_s Q^L(s) exp(−2πiks/N), −N/2 ≤ k < N/2.   (6) \n\nWe now impose two more conditions: (1) N' = N.
This simplifies the resulting expressions, but is otherwise inessential, as we shall discuss. (2) We constrain each M cell to have the same arrangement of C-values relative to the M cell's position. That is, C_{ni} is to be a function C(s_{ni}) only of the displacement s_{ni} from n to i. This constraint substantially reduces the computational demands. We would not expect it to hold in general in a biologically realistic model -- since different M cells should be allowed to develop different arrangements of weights -- although even then it could be used as an Ansatz to provide a lower bound on R. The section, \"Temporally-correlated input patterns,\" deals with a situation in which it is biologically plausible to impose this constraint. \n\nFigure 1. Four processing models (A)-(D): Each diagram shows a single M cell (indexed by n) having output activity M_n. Inputs {L_i} may be common to many M cells. All noise contributions (dotted lines) are uncorrelated with one another and with {L_i}. GC = gain control (see text). \n\nUnder these conditions, Q^M is also a Toeplitz matrix. Its eigenvalues are the components of the F.T. of Q^M(s_{nm}). For N' = N these eigenvalues are (B + λ_k Z_k), where Z_k = |c_k|² and c_k = Σ_s C(s) exp(−2πiks/N) is the F.T. of C(s). [This expression for the eigenvalues is obtained by rewriting Eqn. 4 as Q^M(s_{nm}) = B δ_{n−m,0} + Σ_{i,j} C(s_{ni}) Q^L(s_{ij}) C(s_{mj}), and taking the F.T. of both sides.] Therefore \n\nR = (1/2) Σ_k ln[1 + λ_k Z_k / B].   (7) \n\nWe want to maximize R subject to Σ_s C(s)² = 1, which is equivalent to Σ_k Z_k = N.
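This constrained maximization over the {Z_k} can be carried out numerically by bisecting on the common Lagrange level (the "water level" of the analogy developed below). The sketch here is my own (function names and tolerances are assumptions, not from the paper): it maximizes (1/2) Σ_k ln[1 + λ_k Z_k / B] subject to Σ_k Z_k = N with Z_k ≥ 0.

```python
import numpy as np

def optimal_spectrum(lam, B, N, tol=1e-12):
    """Maximize (1/2)*sum(log(1 + lam_k*Z_k/B)) subject to sum(Z_k) = N,
    Z_k >= 0.  The optimizer has the form Z_k = max(w - B/lam_k, 0) for a
    common level w, found here by bisection."""
    lam = np.asarray(lam, dtype=float)
    floor = B / lam                       # "vessel" profile B/lambda_k
    lo = floor.min()                      # at lo, sum(Z) = 0 <= N
    hi = floor.min() + N + B
    while np.maximum(hi - floor, 0.0).sum() < N:
        hi *= 2.0                         # widen bracket if needed
    while hi - lo > tol:
        w = 0.5 * (lo + hi)
        if np.maximum(w - floor, 0.0).sum() > N:
            hi = w
        else:
            lo = w
    return np.maximum(0.5 * (lo + hi) - floor, 0.0)

# Toy spectrum: a few strong modes, several weak ones.
lam = np.array([5.0, 4.0, 2.0, 0.5, 0.1, 0.01])
Z = optimal_spectrum(lam, B=1.0, N=6.0)
R = 0.5 * np.sum(np.log1p(lam * Z / 1.0))
# Z concentrates on high-lambda modes; sufficiently weak modes get Z_k = 0.
```

Modes whose noise-to-signal floor B/λ_k lies above the final level receive Z_k = 0, exactly the cutoff behavior of the solution derived next.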
\nUsing the Lagrange multiplier method, we maximize Λ = R + μ(Σ_k Z_k − N) over all nonnegative {Z_k}. Solving ∂Λ/∂Z_k = 0 and requiring Z_k ≥ 0 for all k gives the solution: \n\nZ_k = max[(−1/2μ) − (B/λ_k), 0],   (8) \n\nwhere (given B) μ is chosen such that Σ_k Z_k = N. \n\nNote that while the optimal {Z_k} are uniquely determined, the phases of the {c_k} are completely arbitrary [except that since the {C(s)} are real, we must have c_k* = c_{−k} for all k]. The {C(s)} values are therefore not uniquely determined. Fig. 2a shows two of the solutions for an example in which Q^L(s) = exp[−(s/s_0)²] with s_0 = 6, N = N' = 64, and B = 1. Both solutions have Z_0, Z_{±1}, ..., Z_{±6} = 5.417, 5.409, 5.378, 5.306, 5.134, 4.689, 3.376, and all other Z_k = 0. Setting all c_k phases to zero yields the solid curve; a particular random choice of phases yields the dotted curve. We shall later see that imposing locality conditions on the {C(s)} (e.g., penalizing nonzero C(s) for large |s|) can remove the phase ambiguity. \n\nOur solution (Eqn. 8) can be described in terms of a so-called \"water-filling\" analogy: If one plots B/λ_k versus k, then Z_k is the depth of \"water\" at k when one \"pours\" into the \"vessel\" defined by the B/λ_k curve a total quantity of \"water\" that corresponds to Σ_k Z_k = N and brings the \"water level\" to (−1/2μ). \n\nLet us contrast this problem with two other problems to which the \"water-filling\" analogy has been applied in the information-theory literature. In our notation, they are: \n\n1. Given a transfer function {C(s)} and the noise variance B, how should a given total input signal power Σ_k λ_k be apportioned among the various wavenumbers k so as to maximize the information rate R [Gallager, 1968]? Our problem is complementary to this: we fix the input signal properties and seek an optimal transfer function subject to constraints. \n\n2.
Rate-distortion (R-D) calculation [Berger, 1971]: Given a distortion measure (that defines a \"distance\" between the actual input signal and an estimate of it that can be reconstructed from the channel's output), and the input power spectrum {λ_k}, what choice of {Z_k} minimizes the average distortion for given information rate (or minimizes the required rate for given distortion)? In the R-D problem there is a process of reconstruction, and a given measure for assessing the \"goodness\" of reconstruction. In contrast, in our network there is no reconstruction of the input signal, and no criterion of the \"goodness\" of such a hypothetical reconstruction is provided. \n\nNote also that infomax optimization is not the same as computing which channel (that is, which mapping f: L → M) selected from an allowed set has the maximum information-theoretic capacity. In that problem, one is free to encode the inputs before transmission so as to make optimal use of (i.e., \"achieve the capacity of\") the channel. In our case, there is no such pre-encoding; the input ensemble is prescribed (by the environment or by the output of an earlier processing stage) and we need to maximize the channel rate for that ensemble. \n\nThe simplifying condition that N = N' (above) is unnecessarily restrictive. Eqn. 7 can be easily generalized to the case in which N is a multiple of N' and the N' M cells are uniformly spaced on the unit interval. Moreover, in the limit that 1/N' is much smaller than the correlation length scale of Q^L, it can be shown that R is unchanged when we simultaneously increase N' and B by the same factor. (For example, two adjacent M cells each having noise variance 2B jointly convey the same information \n\nFigure 2.
 Example infomax solutions C(s) for locally-correlated inputs: (a) Model A; region of nonnegligible C(s) extends over all s; phase ambiguity in c_k yields nonunique C(s) solutions, two of which are shown. See text for details. (b) Models C (solid curve) and D (dotted curve) with Gaussian g(s)^{−1} favoring short connections; shows center-surround receptive fields, more pronounced in Model D. (c) \"Temporal receptive field\" using Model D for temporally correlated scalar input to a single M cell; C(s) is the weight applied to the input signal that occurred s time steps ago. Spacing between ordinate marks is 0.1; Σ_s C(s)² = 1 in each case. \n\nabout L as one M cell having noise variance B.) For biological applications we are mainly interested in cases in which there are many L cells [so that C(s) can be treated as a function of a continuous variable] and many M cells (so that the effect of the noise process is described by the single parameter B/N). \n\nThe analysis so far shows two limitations of Model A. First, the constraint Σ_i C²_{ni} = 1 is quite arbitrary. (It certainly does not appear to be a biologically natural constraint to impose!) Second, for biological applications we are interested in predicting the favored values of {C(s)}, but the phase ambiguity prevents this. In the next section we show that a modified noise model leads naturally, without arbitrary constraints on Σ_i C²_{ni}, to the same results derived above. We then turn to a model that favors local connections over long-range ones, and that resolves the phase ambiguity issue. \n\nModel B -- Independent noise on each input line \n\nIn Model B of Fig. 1 each input L_i to the nth M cell is corrupted by i.i.d. Gaussian noise ν_{ni} of mean zero and variance B.
The output is \n\nM_n = Σ_i C_{ni} (L_i + ν_{ni}).   (9) \n\nSince each ν_{ni} is independent of all other noise terms (and of the inputs {L_i}), we find \n\nQ^M_{nm} = Σ_{i,j} C_{ni} Q^L_{ij} C_{mj} + B δ_{nm} Σ_i C²_{ni}.   (10) \n\nWe may rewrite the last term as B δ_{nm} (Σ_i C²_{ni})^{1/2} (Σ_j C²_{mj})^{1/2}. The information rate is then R = (1/2) ln Det W where \n\nW_{nm} = δ_{nm} + (Σ_{i,j} C_{ni} Q^L_{ij} C_{mj}) / [B (Σ_i C²_{ni})^{1/2} (Σ_j C²_{mj})^{1/2}].   (11) \n\nDefine C'_{ni} = C_{ni} (Σ_k C²_{nk})^{−1/2}; then W_{nm} = δ_{nm} + (Σ_{i,j} C'_{ni} Q^L_{ij} C'_{mj})/B. Note that this is identical (except for the replacement C → C') to the expression following Eqn. (5), in which Q^M was given by Eqn. (4). By definition, the {C'_{ni}} satisfy Σ_i C'²_{ni} = 1 for all n. Therefore, the problem of maximizing R for this model (with no constraints on Σ_i C²_{ni}) is identical to the problem we solved in the previous section. \n\nModel C -- Favoring of local connections \n\nSince the arborizations of biological cells tend to be spatially localized in many cases, we are led to consider constraints or cost terms that favor localization. There are various ways to implement this. Here we present a way of modifying the noise process so that the infomax principle itself favors localized solutions, without requiring additional terms unrelated to information transmission. \n\nModel C of Fig. 1 is the same as Model B, except that now the longer connections are \"noisier\" than the shorter ones. That is, the variance of ν_{ni} is B_0 g(s_{ni}), where g(s) increases with |s|. [Equivalently, one could attenuate the signal on the (i → n) line by g(s_{ni})^{1/2} and have the same noise variance B_0 on all lines.] \n\nThis change causes the last term of Eqn. 10 to be replaced by B_0 δ_{nm} Σ_i g(s_{ni}) C²_{ni}. Under the conditions discussed earlier (Toeplitz Q^L and Q^M, and N' = N), we derive \n\nR = (1/2) Σ_k ln[1 + λ_k Z_k / (B_0 Σ_s g(s) C(s)²)].   (12) \n\nRecall that the {c_k} are related to {C(s)} by a Fourier transform (see just before Eqn. 7). To compute which choice of {C(s)} maximizes R for a given problem, we used a gradient ascent algorithm several times, each time using a different random set of initial {C(s)} values.
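The gradient-ascent computation just described can be sketched numerically. The code below is my own illustration, not the paper's actual optimizer: it evaluates a rate of the Eqn.-12 form via the FFT and climbs it with a crude finite-difference gradient and step-halving; function names, step sizes, and tolerances are all assumed.

```python
import numpy as np

def model_c_rate(C, lam, g, B0):
    """Rate of the Eqn.-12 form:
    R = (1/2) * sum_k log(1 + lam_k*|c_k|^2 / (B0 * sum_s g(s)*C(s)^2)),
    where lam holds the eigenvalues of the Toeplitz input covariance."""
    Z = np.abs(np.fft.fft(C)) ** 2
    noise = B0 * np.sum(g * C ** 2)
    return 0.5 * np.sum(np.log1p(lam * Z / noise))

def ascend(lam, g, B0, steps=300, lr=0.1, eps=1e-6, seed=0):
    """Naive finite-difference gradient ascent from a random start,
    accepting a step only when it improves R (step-halving otherwise)."""
    rng = np.random.default_rng(seed)
    C = rng.standard_normal(lam.size) * 0.1
    for _ in range(steps):
        base = model_c_rate(C, lam, g, B0)
        grad = np.zeros_like(C)
        for i in range(C.size):
            Cp = C.copy()
            Cp[i] += eps
            grad[i] = (model_c_rate(Cp, lam, g, B0) - base) / eps
        trial = C + lr * grad
        if model_c_rate(trial, lam, g, B0) > base:
            C = trial
        else:
            lr *= 0.5   # backtrack on overshoot
    return C

# Parameters in the spirit of the Fig. 2b example.
N, s0, s1, B0 = 32, 4.0, 6.0, 0.1
s = np.arange(N)
s[s >= N // 2] -= N                              # displacements in [-N/2, N/2)
lam = np.real(np.fft.fft(np.exp(-(s / s0) ** 2)))  # eigenvalues of Q_L
g = np.exp((s / s1) ** 2)                        # longer connections noisier
C = ascend(lam, g, B0)
```

Note that the overall scale of C cancels between numerator and denominator of the rate, so the optimizer finds C only up to a multiplicative factor, matching the sign/scale ambiguity noted in the text.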
For the problems whose solutions are exhibited in Figs. 2b and 2c, multiple starting points usually yielded the same solution to within the error tolerance specified for the algorithm [apart from an arbitrary factor by which all of the C(s)'s can be multiplied without affecting R], and that solution had the largest R of any obtained for the given problem. That is, a limitation sometimes associated with gradient ascent algorithms -- namely, that they may yield multiple \"solutions\" that are local, but far from global, maxima -- did not appear to be a difficulty in these cases. \n\nFig. 2b (solid curve) shows the infomax solution for an example having Q^L(s) = exp[−(s/s_0)²] and g(s) = exp[(s/s_1)²] with s_0 = 4, s_1 = 6, N = N' = 32, and B_0 = 0.1. There is a central excitatory peak flanked by shallow inhibitory sidelobes (and weaker additional oscillations). (As noted, the negative of this solution, having a central inhibitory region and excitatory sidelobes, gives the same R.) As B_0 is increased (a range from 0.001 to 20 was studied), the peak broadens, the sidelobes become shallower (relative to the peak), and the receptive fields of nearby M cells increasingly overlap. This behavior is an example of the \"redundancy-diversity\" tradeoff discussed in [Linsker, 1988]. \n\nModel D -- Bounded output variance \n\nOur previous models all produce output values M_n whose variance is not explicitly constrained. More biologically realistic cells have limited output variance. For example, a cell's firing rate must lie between zero and some maximum value. Thus, the output of a model nonlinear cell is often taken to be a sigmoid function of (Σ_i C_{ni} L_i). \n\nWithin the context of linear cell models, we can capture the effect of a bounded output variance by using Model D of Fig. 1. We pass the intermediate output Σ_i C_{ni}(L_i + ν_{ni}) through a gain control GC that normalizes the output variance to unity, then we add a final (i.i.d.
Gaussian) noise term ν'_n of variance B_1. That is, \n\nM_n = GC · Σ_i C_{ni}(L_i + ν_{ni}) + ν'_n, with GC = V(C)^{−1/2}.   (13) \n\nWithout the last term, this model would be identical to Model C, since multiplying both the signal and the ν_{ni} noise by the same factor GC would not affect R. The last term in effect fixes the number of output values that can be discriminated (i.e., not confounded with each other by the noise process ν'_n) to be of order B_1^{−1/2}. \n\nThe information rate for this model is derived to be (cf. Eqn. 12): \n\nR = (1/2) Σ_k ln[1 + λ_k Z_k / (B_0 Σ_s g(s) C(s)² + B_1 V(C))],   (14) \n\nwhere V(C) is the variance of the intermediate output before it is passed through GC: \n\nV(C) = Σ_{i,j} C(s_{ni}) Q^L(s_{ij}) C(s_{nj}) + B_0 Σ_i g(s_{ni}) C²_{ni}.   (15) \n\nFig. 2b (dotted curve) shows the infomax solution (numerically obtained as above) for the same Q^L(s) and g(s) functions and parameter values as were used to generate the solid curve (for Model C), but with the new parameter B_1 = 0.4. The effect of the new B_1 noise process in this case is to deepen the inhibitory sidelobes (relative to the central peak). The more pronounced center-surround character of the resulting M cell dampens the response of the cell to differences (between different input patterns) in the spatially uniform component of the input pattern. This response property allows the L → M mapping to be infomax-optimal when the dynamic range of the cells' output response is constrained. (A competing effect can complicate the analysis: If B_1 is increased much further, for example to 50 in the case discussed, the sidelobes move to larger s and become shallower. This behavior resembles that discussed at the end of the previous section for the case of increasing B_0; in the present case it is the overall noise level that is being increased when B_1 increases and B_0 is kept constant.) \n\nTemporally-correlated input patterns \n\nLet us see how infomax can be used to extract regularities in input time series, as contrasted with the spatially-correlated input patterns discussed above.
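A rate of the Model-D form is cheap to evaluate in the Fourier domain. The sketch below is my own (helper names assumed): for the Toeplitz case it computes the signal part of the intermediate variance spectrally, using the identity Σ_{s,s'} C(s) Q^L(s'−s) C(s') = (1/N) Σ_k λ_k |c_k|², which follows from Parseval's theorem for circular correlation.

```python
import numpy as np

def model_d_rate(C, QLs, g, B0, B1):
    """Information rate of the Eqn.-14 form for a shared weight profile
    C(s) on a ring of N cells.  QLs: covariance function Q_L(s) laid out
    in FFT wraparound order; g: noise scaling g(s) in the same order."""
    N = C.size
    lam = np.real(np.fft.fft(QLs))       # eigenvalues of the Toeplitz Q_L
    Z = np.abs(np.fft.fft(C)) ** 2
    gnoise = B0 * np.sum(g * C ** 2)     # input-noise contribution
    V = lam @ Z / N + gnoise             # variance of intermediate output
    return 0.5 * np.sum(np.log1p(lam * Z / (gnoise + B1 * V)))

# Parameters in the spirit of the Fig. 2b example.
N, s0, s1 = 32, 4.0, 6.0
s = np.arange(N)
s[s >= N // 2] -= N                      # displacements in [-N/2, N/2)
QLs = np.exp(-(s / s0) ** 2)
g = np.exp((s / s1) ** 2)
C = np.exp(-(s / 3.0) ** 2)              # an arbitrary local profile
C /= np.sqrt(np.sum(C ** 2))
r_c = model_d_rate(C, QLs, g, B0=0.1, B1=0.0)   # B1 = 0 reduces to Model C
r_d = model_d_rate(C, QLs, g, B0=0.1, B1=0.4)
# Turning on the output noise B1 can only lower the rate: r_d < r_c.
```

Setting B_1 = 0 recovers the Model C rate, which makes the comparison between the solid and dotted curves of Fig. 2b easy to reproduce numerically for any candidate C(s).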
We consider a single M cell that, at each discrete time denoted by n, can process inputs {L_i} from earlier times i ≤ n (via delay lines, for example). We use the same Model D as before. There are two differences: First, we want g(s) = ∞ for all s > 0 (input lines from future times are \"infinitely noisy\"). [A technical point: Our use of periodic boundary conditions, while computationally convenient, means that the input value that will occur s time steps from now is the same value that occurred (N − s) steps ago. We deal with this by choosing g(s) to equal 1 at s = 0, to increase as s → −N/2 (going into the past), and to increase further as s decreases from +N/2 to 1, corresponding to increasingly remote past times. The periodicity causes no unphysical effects, provided that we make g(s) increase rapidly enough (or make N large enough) so that C(s) is negligible for time intervals comparable to N.] Second, the fact that C_{ni} is a function only of s_{ni} is now a consequence of the constancy of connection weights C(s) of a single M cell with time, rather than merely a convenient Ansatz to facilitate the infomax computation for a set of many M cells (as it was in previous sections). \n\nThe infomax solution is shown in Fig. 2c for an example having Q^L(s) = exp[−(s/s_0)²] and g(s) = exp[−t(s)/s_1], with t(s) = s for s ≤ 0 and t(s) = s − N for s ≥ 1; s_0 = 4, s_1 = 6, N = 32, B_0 = 0.1, and B_1 = 0.4. The result is that the \"temporal receptive field\" of the M cell is excitatory for recent times, and inhibitory for somewhat more remote times (with additional weaker oscillations). The cell's output can be viewed approximately as a linear combination of a smoothed input and a smoothed first time derivative of the input, just as the output of the center-surround cell of Fig.
2b can be viewed as a linear combination of a smoothed input and a smoothed second spatial derivative of the input. As in Fig. 2b, setting B_1 = 0 (not shown) lessens the relative inhibitory contribution. \n\nSUMMARY \n\nTo gain insight into the operation of the principle of maximum information preservation, we have applied the principle to the problem of the optimal design of an array of linear filters under various conditions. The filter models that have been used are motivated by certain features that appear to be characteristic of biological networks. These features include the favoring of short connections and the constrained range of output signal values. When nearby input signals (in space or time) are correlated, the infomax-optimal solutions for the cases studied include (1) center-surround cells and (2) cells sensitive to temporal variations in input. The results of the mathematical analysis presented here apply also to arbitrary input covariance functions of the form Q^L(|i − j|). We have also presented more general expressions for the information rate, which can be used even when Q^L is not of this form. The cases discussed illustrate the operation of the infomax principle in some relatively simple but instructive situations. The analysis and results suggest how the principle may be applied to more biologically realistic networks and input ensembles. \n\nReferences \n\nT. Berger, Rate Distortion Theory (Prentice-Hall, Englewood Cliffs, N.J., 1971), chap. 4. \n\nR. G. Gallager, Information Theory and Reliable Communication (John Wiley and Sons, N.Y., 1968), p. 388. \n\nR. Linsker, in: Neural Information Processing Systems (Denver, Nov. 1987), ed. D. Z. Anderson (Amer. Inst. of Physics, N.Y.), pp. 485-494. \n\nR. Linsker, Computer 21 (3), 105-117 (March 1988). \n\nC. E. Shannon and W. Weaver, The Mathematical Theory of Communication (Univ. of Illinois Press, Urbana, 1949).
", "award": [], "sourceid": 102, "authors": [{"given_name": "Ralph", "family_name": "Linsker", "institution": null}]}