{"title": "A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels", "book": "Advances in Neural Information Processing Systems", "page_first": 307, "page_last": 314, "abstract": null, "full_text": "A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels\n\nEizaburo Doi (1), Doru C. Balcan (2), & Michael S. Lewicki (1,2)\n(1) Center for the Neural Basis of Cognition, (2) Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213\n{edoi,dbalcan,lewicki}@cnbc.cmu.edu\n\nAbstract\n\nBiological sensory systems are faced with the problem of encoding a high-fidelity sensory signal with a population of noisy, low-fidelity neurons. This problem can be expressed in information theoretic terms as coding and transmitting a multi-dimensional, analog signal over a set of noisy channels. Previously, we have shown that robust, overcomplete codes can be learned by minimizing the reconstruction error with a constraint on the channel capacity. Here, we present a theoretical analysis that characterizes the optimal linear coder and decoder for one- and two-dimensional data. The analysis allows for an arbitrary number of coding units, thus including both under- and over-complete representations, and provides a number of important insights into optimal coding strategies. In particular, we show how the form of the code adapts to the number of coding units and to different data and noise conditions to achieve robustness. We also report numerical solutions for robust coding of high-dimensional image data and show that these codes are substantially more robust than other image codes such as ICA and wavelets.\n\n1 Introduction\n\nIn neural systems, the representational capacity of a single neuron is estimated to be as low as 1 bit/spike [1, 2]. The characteristics of the optimal coding strategy under such conditions, however, remain an open question. 
Recent efficient coding models for sensory coding such as sparse coding and ICA have provided many insights into visual sensory coding (for a review, see [3]), but those models made the implicit assumption that the representational capacity of individual neurons was infinite. Intuitively, such a limit on representational precision should strongly influence the form of the optimal code. In particular, it should be possible to increase the number of limited capacity units in a population to form a more precise representation of the sensory signal. However, to the best of our knowledge, such a code has not been characterized analytically, even in the simplest case. Here we present a theoretical analysis of this problem for one- and two-dimensional data for arbitrary numbers of units. For simplicity, we assume that the encoder and decoder are both linear, and that the goal is to minimize the mean squared error (MSE) of the reconstruction. In contrast to our previous report, which examined noisy overcomplete representations [4], the cost function does not contain a sparsity prior. This simplification makes the cost depend only on statistics up to second order, making it analytically tractable while preserving the robustness to noise.\n\n2 The model\n\nTo define our model, we assume that the data is N-dimensional, has zero mean and covariance matrix Σ_x, and define two matrices W ∈ R^(M×N) and A ∈ R^(N×M). For each data point x, its representation r in the model is the linear transform of x through matrix W, perturbed by the additive noise (i.e., channel noise) n ~ N(0, σ_n^2 I_M):\n\nr = Wx + n = u + n. (1)\n\nWe refer to W as the encoding matrix and its row vectors as encoding vectors. The reconstruction of a data point from its representation is simply the linear transform of the latter, using matrix A:\n\nx̂ = Ar = AWx + An. (2)\n\nWe refer to A as the decoding matrix and its column vectors as decoding vectors. The term AWx in eq. 
2 determines how the reconstruction depends on the data, while An reflects the channel noise in the reconstruction. When there is no channel noise (n = 0), AW = I is equivalent to perfect reconstruction. A graphical description of this system is shown in Fig. 1.\n\n[Figure 1 diagram: Data x -> Encoder W -> Noiseless Representation u -> (+ Channel Noise n) -> Noisy Representation r -> Decoder A -> Reconstruction x̂]\n\nFigure 1: Diagram of the model.\n\nThe goal of the system is to form an accurate representation of the data that is robust to the presence of channel noise. We quantify the accuracy of the reconstruction by the mean squared error (MSE) over a set of data. The error of each sample is ε = x - x̂ = (I_N - AW)x - An, and the MSE is expressed in matrix form:\n\nℰ(A, W) = tr{(I_N - AW) Σ_x (I_N - AW)^T} + σ_n^2 tr{AA^T}, (3)\n\nwhere we used ℰ = ⟨ε^T ε⟩ = tr(⟨ε ε^T⟩). Note that, due to the MSE objective along with the zero-mean assumptions, the optimal solution depends solely on second-order statistics of the data and the noise. Since the SNR is limited in the neural representation [1, 2], we assume that each coding unit has a limited variance ⟨u_i^2⟩ = σ_u^2 so that the SNR is limited to the same constant value γ^2 = σ_u^2/σ_n^2. As the channel capacity of information is defined by C = (1/2) ln(γ^2 + 1), this is equivalent to limiting the capacity of each unit to the same level. We will call this constraint the channel capacity constraint. Now our problem is to minimize eq. 3 under the channel capacity constraint. To solve it, we will include this constraint in the parametrization of W. Let Σ_x = E D E^T be the eigenvalue decomposition of the data covariance matrix, and denote S = D^(1/2) = diag(√λ_1, ..., √λ_N), where λ_i ≡ D_ii are the eigenvalues of Σ_x. As we will see shortly, it is convenient to define V ≡ W E S/σ_u; then the condition ⟨u_i^2⟩ = σ_u^2 implies that\n\nVV^T = C_u = ⟨uu^T⟩/σ_u^2, (4)\n\nwhere C_u is the correlation matrix of the representation u. 
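As a minimal sketch of the channel capacity constraint described above (not part of the original analysis), the following converts between the per-unit SNR γ^2 and the capacity C = (1/2) ln(γ^2 + 1). The function names are hypothetical, and the conversion to bits simply divides by ln 2.

```python
import math

def capacity_nats(snr):
    # Per-unit capacity C = (1/2) ln(snr + 1), in nats, where snr = gamma^2.
    return 0.5 * math.log(snr + 1.0)

def capacity_bits(snr):
    # Same capacity expressed in bits.
    return capacity_nats(snr) / math.log(2.0)

def snr_from_bits(bits):
    # Inverse relation: gamma^2 = 2^(2 * bits) - 1.
    return 2.0 ** (2.0 * bits) - 1.0

def noise_variance(unit_variance, snr):
    # sigma_n^2 = sigma_u^2 / gamma^2 for a unit with variance sigma_u^2.
    return unit_variance / snr

print(snr_from_bits(1.0), capacity_bits(snr_from_bits(1.0)))
```

Under this reading, a capacity of 1 bit per unit corresponds to γ^2 = 3.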
Now the problem is formulated as a constrained optimization: finding the parameters that satisfy eq. 4 and minimize ℰ.\n\n3 The optimal solutions and their characteristics\n\nIn this section we analyze the optimal solutions in some simple cases, namely for 1-dimensional (1-D) and 2-dimensional (2-D) data.\n\n3.1 1-D data\n\nIn the 1-D case the MSE (eq. 3) is expressed as\n\nℰ = σ_x^2 (1 - aw)^2 + σ_n^2 ||a||_2^2, (5)\n\nwhere σ_x^2 = Σ_x ∈ R, a = A ∈ R^(1×M) and w = W ∈ R^(M×1). By solving the necessary condition for the minimum, ∂ℰ/∂a = 0, with the channel capacity constraint (eq. 4), the entries of the optimal solutions are\n\nw_i = ±σ_u/σ_x, a_i = (1/w_i) · γ^2/(M γ^2 + 1), (6)\n\nand the smallest value of the MSE is\n\nℰ = σ_x^2/(M γ^2 + 1). (7)\n\nThis minimum depends on the SNR (γ^2) and on the number of units (M), and it is monotonically decreasing with respect to both. Furthermore, we can compensate for a decrease in SNR by an increase of the number of units. Note that a_i are responsible for this adaptive behavior as w_i do not vary with either γ^2 or M, in the 1-D case. The second term in eq. 5 leads the optimal a into having as small a norm as possible, while the first term prevents it from being arbitrarily small. The optimum is given by the best trade-off between them.\n\n3.2 2-D data\n\nIn the 2-D case, the channel capacity constraint (eq. 4) restricts V such that the row vectors of V should be on the unit circle. Therefore V can be parameterized as\n\nV = [cos θ_1, sin θ_1; ...; cos θ_M, sin θ_M], (8)\n\nwhere θ_i ∈ [0, 2π) is the angle between the i-th row of V and the principal eigenvector of the data e_1 (E = [e_1, e_2], λ_1 ≥ λ_2 > 0). The necessary condition for the minimum ∂ℰ/∂A = O implies\n\nA = σ_u E S V^T (σ_u^2 VV^T + σ_n^2 I_M)^(-1). (9)\n\nUsing eqs. 
8 and 9, the MSE can be expressed as\n\nℰ = [(λ_1 + λ_2)((M/2) γ^2 + 1) - (γ^2/2)(λ_1 - λ_2) Re(Z)] / [((M/2) γ^2 + 1)^2 - (1/4) γ^4 |Z|^2], (10)\n\nwhere by definition\n\nZ = Σ_{k=1}^{M} z_k = Σ_{k=1}^{M} [cos(2θ_k) + i sin(2θ_k)]. (11)\n\nNow the problem has been reduced to finding simply a complex number Z that minimizes ℰ. Note that Z defines θ_k in V, which in turn defines W (by definition; see eq. 4) and A (eq. 9). In the following we analyze the problem in two complementary cases: when the data variance is isotropic (i.e., λ_1 = λ_2), and when it is anisotropic (λ_1 > λ_2). As we will see, the solutions are qualitatively different in these two cases.\n\n3.2.1 Isotropic case\n\nIsotropy of the data variance implies λ_1 = λ_2 ≡ σ_x^2, and (without loss of generality) E = I, which simplifies the MSE (eq. 10) as\n\nℰ = 2σ_x^2 ((M/2) γ^2 + 1) / [((M/2) γ^2 + 1)^2 - (1/4) γ^4 |Z|^2]. (12)\n\nTherefore, ℰ is minimized whenever |Z|^2 is minimized. If M = 1, |Z|^2 = |z_1|^2 is always 1 by definition (eq. 11), yielding the optimal solutions\n\nW = (σ_u/σ_x) V, A = (σ_x/σ_u) · γ^2/(γ^2 + 1) · V^T, (13)\n\nwhere V = V(θ_1), ∀θ_1 ∈ [0, 2π). Eq. 13 means that the orientation of the encoding and decoding vectors is arbitrary, and that the length of those vectors is adjusted exactly as in the 1-D case (eq. 6 with M = 1; Fig. 2). The minimum MSE is given by\n\nℰ = σ_x^2/(γ^2 + 1) + σ_x^2. (14)\n\nThe first term is the same as in the 1-D case (eq. 7 with M = 1), corresponding to the error component along the axis that the encoding/decoding vectors represent, while the second term is the whole data variance along the axis orthogonal to the encoding/decoding vectors, along which no reconstruction is made. If M ≥ 2, there exists a set of angles θ_k for which |Z|^2 is 0. This can be verified by representing Z in the complex plane (Z-diagram in Fig. 2) and observing that there is always a configuration of connected, unit-length bars that starts from, and ends up at the origin, thus indicating that Z = |Z|^2 = 0. 
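The claim that Z can be made zero for any M ≥ 2 can be checked numerically. The sketch below is an illustration only: it uses evenly spaced angles θ_k = πk/M as one convenient choice (many other configurations also give Z = 0), evaluates Z as in eq. 11, and plugs it into the isotropic MSE as read from eq. 12. Function names are hypothetical.

```python
import cmath
import math

def z_sum(thetas):
    # Z = sum_k [cos(2 theta_k) + i sin(2 theta_k)], as in eq. 11.
    return sum(cmath.exp(2j * t) for t in thetas)

def isotropic_mse(var_x, snr, thetas):
    # MSE for isotropic 2-D data (eq. 12), with snr = gamma^2
    # and var_x = sigma_x^2; M is the number of angles supplied.
    m = 0.5 * len(thetas) * snr + 1.0
    z = z_sum(thetas)
    return 2.0 * var_x * m / (m * m - 0.25 * snr * snr * abs(z) ** 2)

def evenly_spaced(num_units):
    # Assumed example configuration: theta_k = pi * k / M.
    return [math.pi * k / num_units for k in range(num_units)]

for num_units in range(1, 6):
    thetas = evenly_spaced(num_units)
    print(num_units, abs(z_sum(thetas)), isotropic_mse(1.0, 1.0, thetas))
```

For M = 1 the single bar cannot return to the origin, so |Z| = 1 and the error along the unrepresented axis remains; for M ≥ 2 the printed |Z| is zero up to rounding.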
Accordingly, the optimal solution is\n\nW = (σ_u/σ_x) V, A = (σ_x/σ_u) · γ^2/((M/2) γ^2 + 1) · V^T, (15)\n\nwhere the optimal V = V(θ_1, ..., θ_M) is given by such θ_1, ..., θ_M for which Z = 0. Specifically, if M = 2, then z_1 and z_2 must be antiparallel but are not otherwise constrained, making the pair of decoding vectors (and that of encoding vectors) orthogonal, yet free to rotate. Note that both the encoding and the decoding vectors are parallel to the rows of V (eq. 15), and the angle of z_k from the real axis is twice as large as that of a_k (or w_k). Likewise, if M = 3, the decoding vectors should be evenly distributed yet still free to rotate; if M = 4, the four vectors should just be two pairs of orthogonal vectors (not necessarily evenly distributed); if M ≥ 5, there is no obvious regularity. With Z = 0, the MSE is minimized as\n\nℰ = 2σ_x^2/((M/2) γ^2 + 1). (16)\n\nThe minimum MSE (eq. 16) depends on the SNR (γ^2) and overcompleteness ratio (M/N) exactly in the same manner as explained in the 1-D case (eq. 7), considering that in both cases the numerator is the data variance, tr(Σ_x). We present examples in Fig. 2: given M = 2, the reconstruction gets worse by lowering the SNR from 10 to 1; however, the reconstruction can be improved by increasing the number of units for a fixed SNR (γ^2 = 1). Just as in the 1-D case, the norm of the decoding vectors gets smaller by increasing M or decreasing γ^2, which is explicitly described by eq. 15.\n\n[Figure 2 panels: M = 1 (γ^2 = 1); M = 2 (γ^2 = 10 and γ^2 = 1); M = 3, 4, 5 (γ^2 = 1). Rows: Variance, Encoding, Decoding, Z-Diagram.]\n\nFigure 2: The optimal solutions for isotropic data. M is the number of units and γ^2 is the SNR in the representation. \"Variance\" shows the variance ellipses for the data (gray) and the reconstruction (magenta). For perfect reconstruction, the two ellipses should overlap. \"Encoding\" and \"Decoding\" show encoding vectors (red) and decoding vectors (blue), respectively. The gray vectors show the principal axes of the data, e_1 and e_2. \"Z-Diagram\" represents Z = Σ_k z_k (eq. 
11) in the complex plane, where each unit length bar corresponds to a z_k, and the marked end point represents the coordinates of Z. The set of green dots in a plot corresponds to optimal values of Z; when this set reduces to a single dot, the optimal Z is unique. In general there could be multiple configurations of bars for a single Z, implying multiple equivalent solutions of A and W for a given Z. For M = 2 and γ^2 = 10, we drew with gray dotted bars an example of Z that is not optimal (corresponding encoding and decoding vectors not shown).\n\n3.2.2 Anisotropic case\n\nIn the anisotropic condition λ_1 > λ_2, the MSE (eq. 10) is minimized when Z = Re(Z) ≥ 0 for a fixed value of |Z|^2. Therefore, the problem is reduced to seeking a real value Z = y ∈ [0, M] that minimizes\n\nℰ = [(λ_1 + λ_2)((M/2) γ^2 + 1) - (γ^2/2)(λ_1 - λ_2) y] / [((M/2) γ^2 + 1)^2 - (1/4) γ^4 y^2]. (17)\n\nIf M = 1, then y = cos 2θ_1 from eq. 11, and therefore, ℰ in eq. 17 is minimized iff θ_1 = 0, yielding the optimal solutions\n\nW = (σ_u/√λ_1) e_1^T, A = (√λ_1/σ_u) · γ^2/(γ^2 + 1) · e_1. (18)\n\nIn contrast to the isotropic case with M = 1, the encoding and decoding vectors are specified along the principal axis (e_1) as illustrated in Fig. 3. The minimum MSE is\n\nℰ = λ_1/(γ^2 + 1) + λ_2. (19)\n\nThis is the same form as in the isotropic case (eq. 14) except that the first term is now related to the variance along the principal axis, λ_1, by which the encoding/decoding vectors can most effectively be utilized for representing the data, while the second term is specified as the data variance along the minor axis, λ_2, by which the loss of reconstruction is mostly minimized. Note that it is a similar mechanism of dimensionality reduction as using PCA. If M ≥ 2, then we can derive the optimal y from the necessary condition for the minimum, dℰ/dy = 0, which yields\n\n[M - ((√λ_1 - √λ_2)/(√λ_1 + √λ_2)) y + 2/γ^2] · [M - ((√λ_1 + √λ_2)/(√λ_1 - √λ_2)) y + 2/γ^2] = 0. (20)\n\nLet γ_c^2 denote the SNR critical point, where\n\nγ_c^2 = (√(λ_1/λ_2) - 1)/M. (21)\n\nIf γ^2 ≥ γ_c^2, then eq. 20 has a root within its domain [0, M],\n\ny = ((√λ_1 - √λ_2)/(√λ_1 + √λ_2)) (2/γ^2 + M), (22)\n\nwith y = M if γ^2 = γ_c^2. Accordingly the optimal solutions are given by\n\nW = V diag(σ_u/√λ_1, σ_u/√λ_2) E^T, A = ((√λ_1 + √λ_2)/(2σ_u)) · γ^2/((M/2) γ^2 + 1) · E V^T, (23)\n\nwhere the optimal V = V(θ_1, ..., θ_M) is given by the Z-diagram as illustrated in Fig. 3, which we will describe shortly. The minimum MSE is given by\n\nℰ = ((√λ_1 + √λ_2)^2/2) · 1/((M/2) γ^2 + 1). (24)\n\nNote that eqs. 23-24 are reduced to eqs. 15-16 if λ_1 = λ_2.\n\nIf the SNR is smaller than γ_c^2, then dℰ/dy = 0 does not have a root within the domain. However, dℰ/dy is always negative, and hence, ℰ decreases monotonically on [0, M]. The minimum is therefore obtained when y = M, yielding the optimal solutions\n\nW = (σ_u/√λ_1) 1_M e_1^T, A = (√λ_1/σ_u) · γ^2/(M γ^2 + 1) · e_1 1_M^T, (25)\n\nwhere 1_M = (1, ..., 1)^T ∈ R^M, and the minimum is given by\n\nℰ = λ_1/(M γ^2 + 1) + λ_2. (26)\n\nNote that ℰ takes the same form as in M = 1 (eq. 19) except that we can now decrease the error by increasing the number of units. To summarize, if the representational resource is too limited either by M or γ^2, the best strategy is to represent only the principal axis.\n\nNow we describe the optimal solutions using the Z-diagram (Fig. 3). First, the optimal solutions differ depending on the SNR. If γ^2 > γ_c^2, the optimal Z is a certain point between 0 and M on the real axis. Specifically, for M = 2 the optimal configuration of the unit-length connected bars is unique (up to flipping about the x-axis), meaning that the encoding/decoding vectors are symmetric about the principal axis; for M ≥ 3, there are infinitely many configurations of the bars starting from the origin and ending at the optimal Z, and nothing can be added about their regularity. If γ^2 ≤ γ_c^2, the optimal Z is M, and the optimal configuration is obtained only when all the bars align on the real axis. 
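The two regimes of the anisotropic case can be illustrated with a small numerical check. This is a sketch under stated assumptions rather than part of the original derivation: the formulas follow a reading of eqs. 17, 21 and 22, the eigenvalues 1.87 and 0.13 are the ones quoted in the Fig. 3 caption, and the function names and the grid search are hypothetical helpers added only for verification.

```python
import math

def critical_snr(lam1, lam2, num_units):
    # SNR critical point (eq. 21): (sqrt(lam1 / lam2) - 1) / M.
    return (math.sqrt(lam1 / lam2) - 1.0) / num_units

def mse_of_y(lam1, lam2, num_units, snr, y):
    # MSE as a function of the real value y = Z (eq. 17), snr = gamma^2.
    m = 0.5 * num_units * snr + 1.0
    numerator = (lam1 + lam2) * m - 0.5 * snr * (lam1 - lam2) * y
    return numerator / (m * m - 0.25 * snr * snr * y * y)

def optimal_y(lam1, lam2, num_units, snr):
    # Interior root (eq. 22) above the critical SNR, else the boundary y = M.
    if snr <= critical_snr(lam1, lam2, num_units):
        return float(num_units)
    root1, root2 = math.sqrt(lam1), math.sqrt(lam2)
    return (root1 - root2) / (root1 + root2) * (2.0 / snr + num_units)

def grid_argmin_y(lam1, lam2, num_units, snr, steps=20000):
    # Brute-force minimizer of eq. 17 over [0, M], for cross-checking only.
    best_y = 0.0
    best_mse = mse_of_y(lam1, lam2, num_units, snr, 0.0)
    for k in range(1, steps + 1):
        y = num_units * k / steps
        value = mse_of_y(lam1, lam2, num_units, snr, y)
        if value < best_mse:
            best_y, best_mse = y, value
    return best_y

for snr in (1.0, 2.0, 10.0):
    print(snr, optimal_y(1.87, 0.13, 2, snr), grid_argmin_y(1.87, 0.13, 2, snr))
```

With these eigenvalues and M = 2, the critical SNR is about 1.4, so γ^2 = 1 falls in the degenerate regime (y = M) while γ^2 = 2 and γ^2 = 10 give an interior optimum.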
In this case, encoding/decoding vectors are all parallel to the principal axis (e_1), as described by eq. 25. Such a degenerate representation is unique for the anisotropic case and is determined by γ_c^2 (eq. 21). We can avoid the degeneration either by increasing the SNR (e.g., Fig. 3, M = 2 with different γ^2) or by increasing the number of units (γ^2 = 1 with different M). Also, the optimal solutions for the overcomplete representation are, in general, not obtained by simple replication (except in the degenerate case). For example, for γ^2 = 1 in Fig. 3, the optimal solution for M = 8 is not identical to the replication of the optimal solution for M = 2, and we can formally prove it by using eq. 22. For M = 1 and for the degenerate case, where only one axis in two-dimensional space is represented, the optimal strategy is to preserve information along the principal axis at the cost of losing all information along the minor axis. Such a biased representation is also found for the non-degenerate case. We can see in Fig. 3 that the data along the principal axis is more accurately reconstructed than that along the minor axis; if there is no bias, the ellipse for the reconstruction should be similar to that of the data. More precisely, we can prove that the error ratio along e_1 is smaller than that along e_2 at the ratio of √λ_2 : √λ_1 (note the switch of the subscripts), which describes the representation bias toward the main axis.\n\n[Figure 3 panels: several combinations of M (1, 2, 3, 8) and γ^2 (1, 2, 10). Rows: Variance, Encoding, Decoding, Z-Diagram.]\n\nFigure 3: The optimal solutions for anisotropic data. Notations are as in Fig. 2. We set λ_1 = 1.87 and λ_2 = 0.13. γ^2 > γ_c^2 holds for all M ≥ 2 but the one with M = 2 and γ^2 = 1.\n\n4 Application to image coding\n\nIn the case of high-dimensional data we can employ an algorithm similar to the one in [4], to numerically compute optimal solutions that minimize the MSE subject to the channel capacity constraint. Fig. 
4 presents the performance of our model when applied to image coding in the presence of channel noise. The data were 8 × 8 pixel blocks taken from a large image, and for comparison we considered representations with M = 64 (\"1×\") and M = 512 (\"8×\") units. As for the channel capacity, each unit has 1.0 bit precision as in the neural representation [1]. The robust coding model shows a dramatic reduction in the reconstruction error when compared to alternatives such as ICA and wavelet codes. This underscores the importance of taking into account the channel capacity constraint for better understanding the neural representation.\n\n[Figure 4 panels: Original; ICA (32.5%); Wavelets (34.8%); Robust Coding (1×) (3.8%); Robust Coding (8×) (0.6%)]\n\nFigure 4: Reconstruction using one bit channel capacity representations. To ensure that all models had the same precision of 1.0 bit for each coefficient, we added Gaussian noise to the coefficients of the ICA and \"Daubechies 9/7\" wavelet codes as in the robust coding. For each representation, we displayed percentage error of the reconstruction. The results are consistent using other images, block sizes, or wavelet filters.\n\n5 Discussion\n\nIn this study we measured the accuracy of the reconstruction by the MSE. An alternative measure could be, as in [5, 3], mutual information I(x, x̂) between the data and the reconstruction. However, we can prove that this measure does not yield optimal solutions for the robust coding problem. Assuming the data is Gaussian and the representation is complete, we can prove that the mutual information is upper-bounded,\n\nI(x, x̂) = (1/2) ln det(γ^2 VV^T + I_N) ≤ (N/2) ln(γ^2 + 1), (27)\n\nwith equality iff VV^T = I, i.e., when the representation u is whitened (see eq. 4). This result holds even for anisotropic data, which is different from the optimal MSE code that can employ correlated, or even degenerate, representations. As ICA is one form of whitening, the results in Fig. 
4 demonstrate the suboptimality of whitening in the MSE sense. The optimal MSE code over noisy channels was examined previously in [6] for N-dimensional data. However, in that work the capacity constraint was defined for a population, and only the case of undercomplete codes was examined. In the model studied here, motivated by the neural representation, the capacity constraint is imposed for individual units. Furthermore, the model allows for an arbitrary number of units, which provides a way to arbitrarily improve the robustness of the code using a population code. The theoretical analysis for one- and two-dimensional cases quantifies the amount of error reduction as a function of the SNR and the number of units along with the data covariance matrix. Finally, our numerical results for higher-dimensional image data demonstrate a dramatic improvement in the robustness of the code over both conventional transforms such as wavelets and also representations optimized for statistical efficiency such as ICA.\n\nReferences\n\n[1] A. Borst and F. E. Theunissen. Information theory and neural coding. Nature Neuroscience, 2:947-957, 1999.\n[2] N. K. Dhingra and R. G. Smith. Spike generator limits efficiency of information transfer in a retinal ganglion cell. Journal of Neuroscience, 24:2914-2922, 2004.\n[3] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, 2001.\n[4] E. Doi and M. S. Lewicki. Sparse coding of natural images using an overcomplete set of limited capacity units. In Advances in NIPS, volume 17, pages 377-384. MIT Press, 2005.\n[5] J. J. Atick and A. N. Redlich. What does the retina know about natural scenes? Neural Computation, 4:196-210, 1992.\n[6] K. I. Diamantaras, K. Hornik, and M. G. Strintzis. Optimal linear compression under unreliable representation and robust PCA neural models. IEEE Trans. Neur. 
Netw., 10(5):1186-1195, 1999.\n", "award": [], "sourceid": 2867, "authors": [{"given_name": "Eizaburo", "family_name": "Doi", "institution": null}, {"given_name": "Doru", "family_name": "Balcan", "institution": null}, {"given_name": "Michael", "family_name": "Lewicki", "institution": null}]}