{"title": "Associative Memory via a Sparse Recovery Model", "book": "Advances in Neural Information Processing Systems", "page_first": 2701, "page_last": 2709, "abstract": "An associative memory is a structure learned from a dataset $\\mathcal{M}$ of vectors (signals) in a way such that, given a noisy version of one of the vectors as input, the nearest valid vector from $\\mathcal{M}$ (nearest neighbor) is provided as output, preferably via a fast iterative algorithm. Traditionally, binary (or $q$-ary) Hopfield neural networks are used to model the above structure. In this paper, for the first time, we propose a model of associative memory based on sparse recovery of signals. Our basic premise is simple. For a dataset, we learn a set of linear constraints that every vector in the dataset must satisfy. Provided these linear constraints possess some special properties, it is possible to cast the task of finding nearest neighbor as a sparse recovery problem. Assuming generic random models for the dataset, we show that it is possible to store super-polynomial or exponential number of $n$-length vectors in a neural network of size $O(n)$. Furthermore, given a noisy version of one of the stored vectors corrupted in near-linear number of coordinates, the vector can be correctly recalled using a neurally feasible algorithm.", "full_text": "Associative Memory via a Sparse Recovery Model\n\nArya Mazumdar\nDepartment of ECE\n\nUniversity of Minnesota Twin Cities\n\narya@umn.edu\n\nAnkit Singh Rawat\u21e4\n\nComputer Science Department\nCarnegie Mellon University\n\nasrawat@andrew.cmu.edu\n\nAbstract\n\nAn associative memory is a structure learned from a dataset M of vectors (signals)\nin a way such that, given a noisy version of one of the vectors as input, the nearest\nvalid vector from M (nearest neighbor) is provided as output, preferably via a fast\niterative algorithm. Traditionally, binary (or q-ary) Hop\ufb01eld neural networks are\nused to model the above structure. 
In this paper, for the first time, we propose a model of associative memory based on sparse recovery of signals. Our basic premise is simple. For a dataset, we learn a set of linear constraints that every vector in the dataset must satisfy. Provided these linear constraints possess some special properties, it is possible to cast the task of finding the nearest neighbor as a sparse recovery problem. Assuming generic random models for the dataset, we show that it is possible to store a super-polynomial or exponential number of n-length vectors in a neural network of size O(n). Furthermore, given a noisy version of one of the stored vectors corrupted in a near-linear number of coordinates, the vector can be correctly recalled using a neurally feasible algorithm.

1 Introduction

Neural associative memories with exponential storage capacity and a large (potentially linear) fraction of error-correction guarantee have been the topic of extensive research for the past three decades. A networked associative memory model must have the ability to learn and remember an arbitrary but specific set of n-length messages. At the same time, when presented with a noisy query, i.e., an n-length vector close to one of the messages, the system must be able to recall the correct message. While the first task is called the learning phase, the second one is referred to as the recall phase.

Associative memories are traditionally modeled by what is called binary Hopfield networks [15], where a weighted graph of size n is considered with each vertex representing a binary state neuron. The edge weights of the network are learned from the set of binary vectors to be stored by the Hebbian learning rule [13]. It has been shown in [22] that, to recover the correct vector in the presence of a linear (in n) number of errors, it is not possible to store more than O(n/log n) arbitrary binary vectors in the above model of learning.
In the pursuit of networks that can store an exponential (in n) number of messages, some works [26, 12, 21] do show the existence of Hopfield networks that can store ≈ 1.22^n messages. However, for such Hopfield networks, only a small number of errors in the query renders the recall phase unsuccessful. Hopfield networks that store non-binary message vectors are studied in [17, 23], where the storage capacity of such networks against a large fraction of errors is again shown to be linear in n. There have been multiple efforts to increase the storage capacity of associative memories to exponential by moving away from the framework of Hopfield networks (in terms of both the learning and the recall phases) [14, 11, 19, 25, 18]. These efforts also involve relaxing the requirement of storing collections of arbitrary messages. In [11], Gripon and Berrou stored O(n^2) sparse message vectors in the form of neural cliques. Another setting, where neurons have been assumed to have a large (albeit constant) number of states, and at the same time the message set (or the dataset) is assumed to form a linear subspace, is considered in [19, 25, 18].

*This work was done when the author was with the Dept. of ECE, University of Texas at Austin, TX, USA.

The most basic premise of the works on neural associative memory is to design a graph dynamic system such that the vectors to be stored are the steady states of the system. One way to attain this is to learn a set of constraints that every vector in the dataset must satisfy. The inclusion relation between the variables in the vectors and the constraints can be represented by a bipartite graph (cf. Fig. 1).
For the recall phase, noise removal can be done by running belief propagation on this bipartite graph. It can be shown that the correct message is recovered successfully under conditions such as sparsity and expansion properties of the graph. This is the main idea that has been explored in [19, 25, 18]. In particular, under the assumption that the messages belong to a linear subspace, [19, 25] propose associative memories that can store an exponential number of messages while tolerating at most a constant number of errors. This approach is further refined in [18], where each message vector from the dataset is assumed to comprise overlapping sub-vectors which belong to different linear subspaces. The learning phase finds the (sparse) linear constraints for the subspaces associated with these sub-vectors. For the recall phase, belief propagation decoding ideas from error-correcting codes have then been used. In [18], Karbasi et al. show that the associative memories obtained in this manner can store exponential (in n) messages. They further show that the recall phase can correct a linear (in n) number of random errors provided that the bipartite graph associated with the linear constraints learnt during the learning phase has certain structural properties.

Our work is very closely related to the above principle. Instead of finding a sparse set of constraints, we aim to find a set of linear constraints that satisfy 1) the coherence property, 2) the null-space property or 3) the restricted isometry property (RIP). Indeed, for a large class of random signal models, we show that such constraints must exist and can be found in polynomial time. Any of the three above-mentioned properties provides a sufficient condition for recovery of sparse signals or vectors [8, 6].
Under the assumption that the noise in the query vector is sparse, denoising can be done very efficiently via iterative sparse recovery algorithms that are neurally feasible [9]. A neurally feasible algorithm for our model employs only local computations at the vertices of the corresponding bipartite graph based on the information obtained from their neighboring nodes.

Figure 1: The complete bipartite graph corresponding to the associative memory. Here, we depict only a small fraction of edges. The edge weights of the bipartite graph are obtained from the linear constraints satisfied by the messages. Information can flow in both directions in the graph, i.e., from a message node to a constraint node and from a constraint node to a message node. In the steady state the n message nodes store the n coordinates of a valid message, and all m constraint nodes are satisfied, i.e., the weighted sum of the values stored on the neighboring message nodes (according to the associated edge weights) is equal to zero. Note that an edge is relevant for the information flow iff the corresponding edge weight is nonzero.

1.1 Our techniques and results

Our main provable results pertain to two different models of datasets, and are given below.

Theorem 1 (Associative memory with sub-gaussian dataset model).
It is possible to store a dataset of size ≈ exp(n^{3/4}) of n-length vectors in a neural network of size O(n) such that a neurally feasible algorithm can output the correct vector from the dataset given a noisy version of the vector corrupted in Θ(n^{1/4}) coordinates.

Theorem 2 (Associative memory with dataset spanned by random rows of a fixed orthonormal basis). It is possible to store a dataset of size ≈ exp(r) of n-length vectors in a neural network of size O(n) such that a neurally feasible algorithm can output the correct vector from the dataset given a noisy version of the vector corrupted in Θ((n − r)/log^6 n) coordinates.

Theorem 1 follows from Prop. 1 and Theorem 3, while Theorem 2 follows from Prop. 2 and 3, and by also noting the fact that all r-vectors over any finite alphabet can be linearly mapped to exp(r) points in a space of dimensionality r. The neural feasibility of the recovery follows from the discussion of Sec. 5. In contrast with [18], our sparse recovery based approach provides associative memories that are robust against a stronger error model which comprises adversarial error patterns as opposed to random error patterns. Even though we demonstrate associative memories which have sub-exponential storage capacity and can tolerate a sub-linear (but polynomial) number of errors, the neurally feasible recall phase is guaranteed to recover the message vector from adversarial errors. On the other hand, the recovery guarantees in [18, Theorems 3 and 5] hold if the bipartite graph obtained during the learning phase possesses certain structures (e.g., degree sequence). However, it is not apparent in their work if the learnt bipartite graph indeed has these structural properties. Similar to the aforementioned papers, our operations are performed over real numbers. We show the dimensionality of the dataset to be large enough, as referenced in Theorems 1 and 2.
As in previous works such as [18], we can therefore find a large number of points, exponential in the dimensionality, with finite (integer) alphabet that can be treated as the message vectors or dataset.

Our main contribution is to bring the model of sparse recovery into the domain of associative memory - a very natural connection. The main techniques that we employ are as follows: 1) In Sec. 3, we present two models of ensembles for the dataset. The dataset belongs to subspaces that have an associated orthogonal subspace with a 'good' basis. These good bases for the orthogonal subspaces satisfy one or more of the conditions introduced in Sec. 2, a section that provides some background material on sparse recovery and various sufficient conditions relevant to the problem. 2) In Sec. 4, we briefly describe a way to obtain a 'good' null basis for the dataset. The found bases serve as measurement matrices that allow for sparse recovery. 3) Sec. 5 focuses on the recall phases of the proposed associative memories. The algorithms are for sparse recovery, but stated in a way that is implementable in a neural network.

In Sec. 6, we present some experimental results showcasing the performance of the proposed associative memories. In Appendix C, we describe another approach to construct associative memory based on the dictionary learning problem [24].

2 Definitions and mathematical preliminaries

Notation: We use lowercase boldface letters to denote vectors. Uppercase boldface letters represent matrices. For a matrix B, B^T denotes the transpose of B. A vector is called k-sparse if it has only k nonzero entries. For a vector x ∈ R^n and any set of coordinates I ⊆ [n] ≡ {1, 2, ..., n}, x_I denotes the projection of x onto the coordinates of I.
For any set of coordinates I ⊆ [n], I^c ≡ [n] \ I. Similarly, for a matrix B, B_I denotes the sub-matrix formed by the rows of B that are indexed by the set I. We use span(B) to denote the subspace spanned by the columns of B. Given an m × n matrix B, denote the columns of the matrix as b_j, j = 1, ..., n, and assume, for all the matrices in this section, that the columns are all unit norm, i.e., ‖b_j‖_2 = 1.

Definition 1 (Coherence). The mutual coherence of the matrix B is defined to be

  µ(B) = max_{i ≠ j} |⟨b_i, b_j⟩|.  (1)

Definition 2 (Null-space property). The matrix B is said to satisfy the null-space property with parameters (k, α < 1) if ‖h_I‖_1 ≤ α‖h_{I^c}‖_1 for every vector h ∈ R^n with Bh = 0 and any set I ⊆ [n], |I| = k.

Definition 3 (RIP). A matrix B is said to satisfy the restricted isometry property with parameters k and δ, or the (k, δ)-RIP, if for all k-sparse vectors x ∈ R^n,

  (1 − δ)‖x‖_2^2 ≤ ‖Bx‖_2^2 ≤ (1 + δ)‖x‖_2^2.  (2)

Next we list some results pertaining to sparse signal recovery guarantees based on these aforementioned parameters. The sparse recovery problem seeks the solution x̂, having the smallest number of nonzero entries, of the underdetermined system of equations Bx = r, where B ∈ R^{m×n} and x ∈ R^n. The basis pursuit algorithm for sparse recovery provides the following estimate:

  x̂ = arg min_{Bx = r} ‖x‖_1.  (3)

Let x_k denote the projection of x on its largest k coordinates.

Proposition 1. If B has the null-space property with parameters (k, α < 1), then we have

  ‖x̂ − x‖_1 ≤ (2(1 + α)/(1 − α)) ‖x − x_k‖_1.  (4)

The proof of this is quite standard and delegated to Appendix A.

Proposition 2 ([5]).
The (2k, √2 − 1)-RIP of the sampling matrix implies, for a constant c,

  ‖x̂ − x‖_2 ≤ (c/√k) ‖x − x_k‖_1.  (5)

Furthermore, it can be easily seen that any matrix is (k, (k − 1)µ)-RIP, where µ is the mutual coherence of the sampling matrix.

3 Properties of the datasets

In this section, we show that, under reasonable random models that represent quite general assumptions on the datasets, it is possible to learn linear constraints on the messages that satisfy one of the sufficient properties for sparse recovery: incoherence, the null-space property or RIP. We mainly consider two models for the dataset: 1) a sub-gaussian model; 2) the span of a random set from an orthonormal basis.

3.1 Sub-gaussian model for the dataset and the null-space property

In this section we consider message sets that are spanned by a basis matrix whose entries are distributed according to a sub-gaussian distribution. Sub-gaussian distributions are prevalent in the machine learning literature and provide a broad class of random models to analyze and validate various learning algorithms. We refer the reader to [27, 10] for background on these distributions. Let A ∈ R^{n×r} be an n × r random matrix that has independent zero-mean sub-gaussian random variables as its entries. We assume that the subspace spanned by the columns of the matrix A represents our dataset M. The main result of this section is the following.

Theorem 3. The dataset above satisfies a set of linear constraints that has the null-space property. That is, for any h ∈ M ≡ span(A), the following holds with high probability:

  ‖h_I‖_1 ≤ α‖h_{I^c}‖_1  for all I ⊆ [n] such that |I| ≤ k,  (6)

for k = O(n^{1/4}), r = O(n/k) and a constant α < 1.

The rest of this section is dedicated to the proof of this theorem. But before we present the proof, we state a result from [27] which we utilize to prove Theorem 3.

Proposition 3.
[27, Theorem 5.39] Let A be an s × r matrix whose rows a_i are independent sub-gaussian isotropic random vectors in R^r. Then for every t ≥ 0, with probability at least 1 − 2 exp(−ct²) one has

  √s − C√r − t ≤ s_min(A) = min_{x ∈ R^r : ‖x‖_2 = 1} ‖Ax‖_2
  ≤ s_max(A) = max_{x ∈ R^r : ‖x‖_2 = 1} ‖Ax‖_2 ≤ √s + C√r + t.  (7)

Here C and c depend on the sub-gaussian norms of the rows of the matrix A.

Proof of Theorem 3: Consider an n × r matrix A which has independent sub-gaussian isotropic random vectors as its rows. Now for a given set I ⊆ [n], we can focus on two disjoint sub-matrices of A: 1) A_1 = A_I and 2) A_2 = A_{I^c}.

Using Proposition 3 with s = |I|, we know that, with probability at least 1 − 2 exp(−ct²), we have

  s_max(A_1) = max_{x ∈ R^r : ‖x‖_2 = 1} ‖A_1 x‖_2 ≤ √|I| + C√r + t.  (8)

Since we know that ‖A_1 x‖_1 ≤ √|I| ‖A_1 x‖_2, using (8) the following holds with probability at least 1 − 2 exp(−ct²):

  ‖(Ax)_I‖_1 = ‖A_1 x‖_1 ≤ √|I| ‖A_1 x‖_2 ≤ |I| + C√(|I| r) + t√|I|  for all x ∈ R^r : ‖x‖_2 = 1.  (9)

We now consider A_2.
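Before continuing the proof, the concentration of the extreme singular values in Proposition 3 is easy to check numerically. The sketch below (Gaussian entries, with illustrative sizes and the constant C taken close to 1, which is what one observes for Gaussian matrices) compares s_min and s_max of a tall s × r matrix with √s ∓ √r.

```python
import numpy as np

rng = np.random.default_rng(1)
s, r = 2000, 100                         # tall s x r matrix, isotropic Gaussian rows
A = rng.standard_normal((s, r))

sv = np.linalg.svd(A, compute_uv=False)  # singular values, descending
smax, smin = sv[0], sv[-1]

# Proposition 3 predicts smin >~ sqrt(s) - C*sqrt(r) - t and
# smax <~ sqrt(s) + C*sqrt(r) + t with probability 1 - 2exp(-c t^2).
lower = np.sqrt(s) - np.sqrt(r)          # ~ 34.7 for these sizes
upper = np.sqrt(s) + np.sqrt(r)          # ~ 54.7 for these sizes
```

With these dimensions the empirical s_min and s_max land close to the predicted √s ∓ √r window, which is what makes A_1 and A_2 behave as near-isometries in the proof.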
It follows from Proposition 3 with s = |I^c| = n − |I| that with probability at least 1 − 2 exp(−ct²),

  s_min(A_2) = min_{x ∈ R^r : ‖x‖_2 = 1} ‖A_2 x‖_2 ≥ √(n − |I|) − C√r − t.  (10)

Combining (10) with the observation that ‖A_2 x‖_1 ≥ ‖A_2 x‖_2, the following holds with probability at least 1 − 2 exp(−ct²):

  ‖(Ax)_{I^c}‖_1 = ‖A_2 x‖_1 ≥ ‖A_2 x‖_2 ≥ √(n − |I|) − C√r − t  for all x ∈ R^r : ‖x‖_2 = 1.  (11)

Note that we are interested in showing that for all h ∈ M we have

  ‖h_I‖_1 ≤ α‖h_{I^c}‖_1  for all I ⊆ [n] such that |I| ≤ k.  (12)

This is equivalent to showing that the following holds for all x ∈ R^r : ‖x‖_2 = 1:

  ‖(Ax)_I‖_1 ≤ α‖(Ax)_{I^c}‖_1  for all I ⊆ [n] such that |I| ≤ k.  (13)

For a given I ⊆ [n], we utilize (9) and (11) to guarantee that (13) holds with probability at least 1 − 2 exp(−ct²) as long as

  |I| + C√(|I| r) + t√|I| ≤ α(√(n − |I|) − C√r − t).  (14)

Now, given that k = |I| satisfies (14), (13) holds for all I ⊂ [n] : |I| = k with probability at least

  1 − 2 (n choose k) exp(−ct²) ≥ 1 − 2 (en/k)^k exp(−ct²).  (15)

Let us consider the following set of parameters: k = O(n^{1/4}), r = O(n/k) = O(n^{3/4}) and t = Θ(√(k log(n/k))). This set of parameters ensures that (14) holds with overwhelming probability (cf. (15)).

Remark 1. In Theorem 3, we specify one particular set of parameters for which the null-space property holds. Using (14) and (15), it can be shown that the null-space property in general holds for the following set of parameters: k = O(√n / log n), r = O(n/k) and t = Θ(√(k log(n/k))). Therefore, it is possible to trade off the number of correctable errors during the recall phase (denoted by k) with the dimension of the dataset (represented by r).

3.2 Span of a random set of rows of an orthonormal basis

Next, in this section, we consider the ensemble of signals spanned by a random subset of rows from a fixed orthonormal basis B. Assume B to be an n × n matrix with orthonormal rows.
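A minimal numerical sketch of this ensemble follows; the index-set name Λ (`Lam` in the code), the QR-generated orthonormal basis, and all sizes are our illustrative choices, not prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 64, 8

# A fixed n x n orthonormal basis B (orthonormal rows), here from a QR factorization.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
B = Q.T

Lam = rng.choice(n, size=r, replace=False)  # random index set Lambda with |Lambda| = r
comp = np.setdiff1d(np.arange(n), Lam)      # the complement of Lambda

# Messages live in span of the selected rows; the complementary rows annihilate them.
u = rng.standard_normal(r)
h = B[Lam].T @ u        # one message vector from the dataset
residual = B[comp] @ h  # ~0: the complementary rows form a null-space basis
```

The key point, used below, is that the sub-matrix of rows indexed by the complement of Λ is exactly a basis for the null space of the dataset.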
Let Λ ⊆ [n] be a random index set such that E(|Λ|) = r. The vectors in the dataset have the form h = B_Λ^T u for some u ∈ R^{|Λ|}. In other words, the dataset M ≡ span(B_Λ^T).

In this case, B_{Λ^c} constitutes a basis matrix for the null space of the dataset. Since we have selected the set Λ randomly, the set Ω ≡ Λ^c is also a random set with E(|Ω|) = n − E(|Λ|) = n − r.

Proposition 4 ([7]). Assume that B is an n × n orthonormal basis for R^n with the property that max_{i,j} |B_{i,j}| ≤ ν. Consider a random |Ω| × n matrix C obtained by selecting a random set of rows of B indexed by the set Ω ⊆ [n] such that E(|Ω|) = m. Then the matrix C obeys the (k, δ)-RIP with probability at least 1 − O(n^{−ρ/α}) for some fixed constant ρ > 0, where k = αm / (ν² log^6 n).

Therefore, we can invoke Proposition 4 to conclude that the matrix B_{Λ^c} obeys the (k, δ)-RIP with k = α(n − r) / (ν² log^6 n), with ν being the largest absolute value among the entries of B_{Λ^c}.

4 Learning the constraints: null space with small coherence

In the previous section, we described some random ensembles of datasets that can be stored in an associative memory based on sparse recovery. This approach involves finding a basis for the orthogonal subspace to the message or the signal subspace (dataset). Indeed, our learning algorithm simply finds a null space of the dataset M. While obtaining the basis vectors of null(M), we require them to satisfy the null-space property, RIP or small mutual coherence, so that a signal can be recovered from its noisy version via the basis pursuit algorithm, which can be neurally implemented (see Sec. 5.2). However, for a given set of message vectors, it is computationally intractable to check if the obtained (learnt) orthogonal basis has the null-space property or RIP with suitable parameters associated with these properties.
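By contrast, mutual coherence costs only one Gram matrix. The sketch below (illustrative sizes, random unit-column matrix) computes µ(B) and empirically checks the coherence-implied RIP constant (k − 1)µ from Prop. 2; by Gershgorin's theorem the check is guaranteed to pass whenever (k − 1)µ < 1.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, k = 1000, 2000, 5
B = rng.standard_normal((m, n))
B /= np.linalg.norm(B, axis=0, keepdims=True)  # unit-norm columns, cf. Sec. 2

# Mutual coherence: largest off-diagonal entry of the absolute Gram matrix.
G = np.abs(B.T @ B)
np.fill_diagonal(G, 0.0)
mu = G.max()
delta = (k - 1) * mu        # B is (k, (k-1)*mu)-RIP (Prop. 2)

# Empirically verify (1 - delta) <= ||Bx||^2 / ||x||^2 <= (1 + delta) on k-sparse x.
ratios = []
for _ in range(200):
    x = np.zeros(n)
    S = rng.choice(n, size=k, replace=False)
    x[S] = rng.standard_normal(k)
    ratios.append(np.linalg.norm(B @ x) ** 2 / np.linalg.norm(x) ** 2)
```

This O(n² m) check is the tractable certificate referred to in the next paragraph; certifying the NSP or RIP directly would require examining all size-k supports.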
The mutual coherence of the orthogonal basis, on the other hand, can indeed be verified in a tractable manner. Further, the more straightforward iterative soft thresholding algorithm will be successful if null(M) has low coherence. This will also lead to fast convergence of the recovery algorithm (see Sec. 5.1). Towards this, we describe one approach that ensures the selection of an orthogonal basis that has the smallest possible mutual coherence. Subsequently, using the mutual coherence based recovery guarantees for sparse recovery, this basis enables an efficient recall phase for the associative memory.

One underlying assumption we make on the dataset is that it is less than full-dimensional. That is, the dataset must belong to a low-dimensional subspace, so that its null space is not trivial. In practical cases, M is approximately low-dimensional. We use a preprocessing step, employing principal component analysis (PCA), to make sure that the dataset is low-dimensional. We do not indulge in a more detailed description of this phase, as it is quite standard (see [18]).

Algorithm 1 Find null space with low coherence
  Input: The dataset M of n-dimensional vectors; an initial coherence µ_0 and a step size δ
  Output: An m × n orthogonal matrix B and coherence µ
  Preprocessing: Perform PCA on M
  Find the n × r basis matrix A of M
  for l = 0, 1, 2, ... do
    Find a feasible point of the quadratically constrained quadratic program (QCQP) below (interior point method): BA = 0; ‖b_i‖ = 1 for all i ∈ [n]; |⟨b_i, b_j⟩| ≤ µ_l, where B is (n − r) × n
    if no feasible point is found then
      break
    else
      µ ← µ_l
      µ_{l+1} = µ_l − δ
    end if
  end for

5 Recall via neurally feasible algorithms

We now focus on the second aspect of an associative memory, namely the recovery phase. For the signal model that we consider in this paper, the recovery phase is equivalent to solving a sparse signal recovery problem.
Given a noisy vector y = x + e, where x is from the dataset, we can use the basis B of the null space associated with our dataset, constructed during the learning phase, to obtain r = By = Be. Now, given that e is sufficiently sparse and the matrix B obeys the properties of Sec. 2, we can solve for e using a sparse recovery algorithm. Subsequently, we can remove the error vector e from the noisy signal y to reconstruct the underlying message vector x. There is a plethora of algorithms available in the literature to solve this problem. However, we note that for the purpose of an associative memory, the recovery phase should be neurally feasible and computationally simple. In other words, each node (or storage unit) should be able to recover the coordinate associated with it locally, by applying simple computations to the information received from its neighboring nodes (potentially in an iterative manner).

5.1 Recovery via the iterative soft thresholding (IST) algorithm

Among the various sparse recovery algorithms in the literature, the iterative soft thresholding (IST) algorithm is a natural candidate for implementing the recovery phase of the associative memories in our setting. The IST algorithm tries to solve the following unconstrained ℓ1-regularized least squares problem, which is closely related to the basis pursuit problem described in (3) and (18):

  ê = arg min_e ν‖e‖_1 + (1/2)‖Be − r‖².  (16)

The (t + 1)-th iteration of the IST algorithm is described as follows:

  (IST)  e^{t+1} = η_S(e^t − τ B^T(Be^t − r); λ = τν).  (17)

Here, τ is a constant and η_S(x; λ) = (sgn(x_1)(|x_1| − λ)_+, sgn(x_2)(|x_2| − λ)_+, ..., sgn(x_n)(|x_n| − λ)_+) denotes the soft thresholding (or shrinkage) operator.
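The whole recall pipeline around iteration (17) can be sketched compactly. The null-space basis below is obtained by SVD, a simplification of Algorithm 1 that skips the coherence-minimizing QCQP; the sizes, support, ν and iteration count are illustrative.

```python
import numpy as np

def ist(B, r, nu, n_iter):
    """IST iteration (17): e <- soft_threshold(e - tau*B^T(B e - r); tau*nu)."""
    tau = 1.0 / np.linalg.norm(B, 2) ** 2          # 1/lambda_max(B^T B), cf. Prop. 5
    e = np.zeros(B.shape[1])
    for _ in range(n_iter):
        g = e - tau * (B.T @ (B @ e - r))          # gradient step on 0.5*||Be - r||^2
        e = np.sign(g) * np.maximum(np.abs(g) - tau * nu, 0.0)  # shrinkage
    return e

rng = np.random.default_rng(7)
n, r_dim = 100, 40
A = rng.standard_normal((n, r_dim))                # basis of the message subspace (Sec. 3.1)
U = np.linalg.svd(A)[0]
B = U[:, r_dim:].T                                 # "learnt" null-space basis: B @ A = 0

x = A @ rng.standard_normal(r_dim)                 # a stored message
e_true = np.zeros(n)
support = [7, 23, 55, 71, 90]
e_true[support] = [1.5, -2.0, 1.0, 2.5, -1.2]      # sparse noise on the query
y = x + e_true                                     # noisy query

syn = B @ y                                        # syndrome: equals B @ e_true since B x = 0
e_hat = ist(B, syn, nu=0.01, n_iter=3000)          # recover the sparse noise
x_hat = y - e_hat                                  # denoised message
```

Each IST step uses only a matrix-vector product and a coordinate-wise threshold, which is exactly the locality that makes it implementable on the bipartite network of Fig. 1.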
Note that the IST algorithm as described in (17) is neurally feasible, as it involves only 1) matrix-vector multiplications and 2) soft thresholding of each coordinate of a vector independently of the values of the other coordinates. In Appendix B, we describe in detail how the IST algorithm can be performed over a bipartite neural network with B as its edge-weight matrix. Under suitable assumptions on the coherence of the measurement matrix B, the IST algorithm is also known to converge to the correct k-sparse vector e [20]. In particular, Maleki [20] allows the thresholding parameter to be varied in every iteration such that all but at most the largest k coordinates (in terms of their absolute values) are mapped to zero by the soft thresholding operation. In this setting, Maleki shows that the IST algorithm recovers the correct support of the optimal solution in finitely many steps and subsequently converges to the true solution very fast. However, we are interested in the analysis of the IST algorithm in a setting where the thresholding parameter is kept at a suitable constant depending on the other system parameters, so that the algorithm remains neurally feasible. Towards this, we note that there exists a general analysis of the IST algorithm even without the coherence assumption.

Proposition 5 ([4, Theorem 3.1]). Let {e^t}_{t≥1} be as defined in (17) with 1/τ ≥ λ_max(B^T B).¹ Then, for any t ≥ 1, h(e^t) − h(ê) ≤ ‖e^0 − ê‖² / (2t), where h(e) = (1/2)‖r − Be‖² + ν‖e‖_1 is the objective function defined in (16).
5.2 Recovery via the Bregman iterative algorithm

Recall that the basis pursuit algorithm refers to the following optimization problem:

  ê = arg min_e {‖e‖_1 : r = Be}.  (18)

Even though the IST algorithm as described in the previous subsection solves the problem in (16), the parameter value ν needs to be set small enough so that the recovered solution ê nearly satisfies the constraint Be = r in (18). However, if we insist on recovering the solution e which exactly meets the constraint, one can employ the Bregman iterative algorithm from [29]. The Bregman iterative algorithm relies on the Bregman distance D^p_{‖·‖_1}(·,·) based on ‖·‖_1, which is defined as follows:

  D^p_{‖·‖_1}(e_1, e_2) = ‖e_1‖_1 − ‖e_2‖_1 − ⟨p, e_1 − e_2⟩,

where p ∈ ∂‖e_2‖_1 is a sub-gradient of the ℓ1-norm at the point e_2. The (t + 1)-th iteration of the Bregman iterative algorithm is then defined as follows:

  e^{t+1} = arg min_e D^{p^t}_{‖·‖_1}(e, e^t) + (1/2)‖Be − r‖²
          = arg min_e ‖e‖_1 − (p^t)^T e + (1/2)‖Be − r‖² − ‖e^t‖_1 + (p^t)^T e^t,  (19)

  p^{t+1} = p^t − B^T(Be^{t+1} − r).  (20)

Note that, for the (t + 1)-th iteration, the objective function in (19) is essentially equivalent to the objective function in (16). Therefore, each iteration of the Bregman iterative algorithm can be solved using the IST algorithm. It is shown in [29] that after a finite number of iterations of the Bregman iterative algorithm, one recovers the solution of the problem in (18) (Theorems 3.2 and 3.3 in [29]).

Remark 2. We know that the IST algorithm is neurally feasible. Furthermore, the step described in (20) is neurally feasible as it only involves matrix-vector multiplications in the spirit of Eq. (17). Since each iteration of the Bregman iterative algorithm relies only on these two operations, it follows that the Bregman iterative algorithm is neurally feasible as well.
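The iterations (19)-(20) can be sketched in their equivalent "add back the residual" form, a standard reformulation of [29], with IST as the inner solver; the parameters and the warm-starting of the inner solver are our illustrative choices.

```python
import numpy as np

def ist(B, r, nu, n_iter, e0=None):
    """Inner solver for (16): min_e nu*||e||_1 + 0.5*||B e - r||^2."""
    tau = 1.0 / np.linalg.norm(B, 2) ** 2
    e = np.zeros(B.shape[1]) if e0 is None else e0.copy()
    for _ in range(n_iter):
        g = e - tau * (B.T @ (B @ e - r))
        e = np.sign(g) * np.maximum(np.abs(g) - tau * nu, 0.0)
    return e

def bregman(B, r, nu, outer, inner):
    """Bregman iterations: repeatedly solve the (16)-type problem against an
    updated right-hand side r_acc <- r_acc + (r - B e), which drives the
    constraint B e = r toward exact satisfaction (cf. [29])."""
    r_acc = np.zeros_like(r)
    e = np.zeros(B.shape[1])
    for _ in range(outer):
        r_acc = r_acc + (r - B @ e)   # add back the residual (the role of Eq. (20))
        e = ist(B, r_acc, nu, inner, e0=e)  # warm-started inner IST solve
    return e

rng = np.random.default_rng(8)
n, r_dim = 100, 40
U = np.linalg.svd(rng.standard_normal((n, r_dim)))[0]
B = U[:, r_dim:].T                    # null-space basis with orthonormal rows

e_true = np.zeros(n)
e_true[[4, 30, 62, 88]] = [2.0, -1.5, 1.0, -2.5]
e_hat = bregman(B, B @ e_true, nu=0.05, outer=15, inner=1000)
```

As in Remark 2, both building blocks are a matrix-vector product plus a coordinate-wise operation, so the outer loop adds no operation beyond what IST already needs.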
It should be noted that the neural feasibility of the Bregman iterative algorithm was discussed in [16] as well; however, the neural structures employed by [16] are different from ours.

¹Note that λ_max(B^T B), the maximum eigenvalue of the matrix B^T B, serves as a Lipschitz constant for the gradient ∇f(e) of the function f(e) = (1/2)‖r − Be‖².

[Figure 2: four panels plotting the probability of failure against the sparsity of the noise, each with curves for m = 500 (PD), m = 500 (BI), m = 700 (PD) and m = 700 (BI). Panels: (a) Gaussian matrix and Gaussian noise; (b) Gaussian matrix and Discrete noise; (c) Bernoulli matrix and Gaussian noise; (d) Bernoulli matrix and Discrete noise.]

Figure 2: Performance of the proposed associative memory approach during the recall phase. The PD algorithm refers to the primal dual algorithm used to solve the linear program associated with the problem in (18).
The BI algorithm refers to the Bregman iterative algorithm described in Sec. 5.2.

6 Experimental results

In this section, we demonstrate the feasibility of the associative memory framework using computer generated data. Along the lines of the discussion in Sec. 3.1, we first sample an n × r sub-gaussian matrix A with i.i.d. entries. We consider two sub-gaussian distributions: 1) the Gaussian distribution and 2) the Bernoulli distribution over {+1, −1}. The message vectors to be stored are then assumed to be spanned by the r columns of the sampled matrix. For the learning phase, we find a good basis for the subspace orthogonal to the space spanned by the columns of the matrix A. For noise during the recall phase, we consider two noise models: 1) Gaussian noise and 2) discrete noise, where each nonzero element takes a value in the set {−M, −(M − 1), ..., M} \ {0}.

Figure 2 presents our simulation results for n = 1000. For the recall phase, we employ the Bregman iterative (BI) algorithm with the IST algorithm as a subroutine. We also plot the performance of the primal dual (PD) algorithm based linear programming solution for the recovery problem of interest (cf. (18)). This allows us to gauge the disadvantage due to the restriction of working with a neurally feasible recovery algorithm, e.g., the BI algorithm in our case. Furthermore, we consider message sets with two different dimensions, which amounts to m = 500 and m = 700. Note that the dimension of the message set is n − m. We run 50 iterations of the recovery algorithms for a given set of parameters to obtain estimates of the probability of failure (of exact recovery of the error vector). In Fig. 2a, we focus on the setting with a Gaussian basis matrix (for the message set) and unit-variance zero-mean Gaussian noise during the recall phase.
It is evident that the proposed associative memory does allow for the exact recovery of error vectors up to a certain sparsity level. This corroborates our findings in Sec. 3. We also note that the performance of the BI algorithm is very close to that of the PD algorithm. Fig. 2b shows the performance of the recall phase for the setting with a Gaussian basis for the message set and the discrete noise model with M = 4. In this case, even though the BI algorithm is able to exactly recover the noise vector up to a particular sparsity level, its performance is worse than that of the PD algorithm. The performance of the recall phase with Bernoulli basis matrices for the message set is shown in Figs. 2c and 2d. The results are similar to those in the case of Gaussian basis matrices for the message sets.

References

[1] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon. Learning sparsely used overcomplete dictionaries via alternating minimization. CoRR, abs/1310.7991, 2013.
[2] S. Arora, R. Ge, T. Ma, and A. Moitra. Simple, efficient, and neural algorithms for sparse coding. CoRR, abs/1503.00778, 2015.
[3] S. Arora, R. Ge, and A. Moitra. New algorithms for learning incoherent and overcomplete dictionaries. arXiv preprint arXiv:1308.6273, 2013.
[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[5] E. J. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9):589–592, 2008.
[6] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Inf. Theory, 52(2):489–509, 2006.
[7] E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. on Inf.
Theory, 52(12):5406–5425, Dec 2006.
[8] D. L. Donoho. Compressed sensing. IEEE Trans. on Inf. Theory, 52(4):1289–1306, 2006.
[9] D. L. Donoho, A. Maleki, and A. Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.
[10] S. Foucart and H. Rauhut. A Mathematical Introduction to Compressive Sensing. Birkhauser Basel, 2013.
[11] V. Gripon and C. Berrou. Sparse neural networks with large learning diversity. IEEE Transactions on Neural Networks, 22(7):1087–1096, 2011.
[12] D. J. Gross and M. Mezard. The simplest spin glass. Nuclear Physics B, 240(4):431–452, 1984.
[13] D. O. Hebb. The organization of behavior: A neuropsychological theory. Psychology Press, 2005.
[14] C. Hillar and N. M. Tran. Robust exponential memory in Hopfield networks. arXiv preprint arXiv:1411.4625, 2014.
[15] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.
[16] T. Hu, A. Genkin, and D. B. Chklovskii. A network of spiking neurons for computing sparse representations in an energy-efficient way. Neural Computation, 24(11):2852–2872, 2012.
[17] S. Jankowski, A. Lozowski, and J. M. Zurada. Complex-valued multistate neural associative memory. IEEE Transactions on Neural Networks, 7(6):1491–1496, Nov 1996.
[18] A. Karbasi, A. H. Salavati, and A. Shokrollahi. Convolutional neural associative memories: Massive capacity with noise tolerance. CoRR, abs/1407.6513, 2014.
[19] K. R. Kumar, A. H. Salavati, and A. Shokrollahi. Exponential pattern retrieval capacity with non-binary associative memory. In 2011 IEEE Information Theory Workshop (ITW), pages 80–84, Oct 2011.
[20] A. Maleki. Coherence analysis of iterative thresholding algorithms.
In 47th Annual Allerton Conference on Communication, Control, and Computing (Allerton 2009), pages 236–243, Sept 2009.
[21] R. J. McEliece and E. C. Posner. The number of stable points of an infinite-range spin glass memory. Telecommunications and Data Acquisition Progress Report, 83:209–215, 1985.
[22] R. J. McEliece, E. C. Posner, E. R. Rodemich, and S. S. Venkatesh. The capacity of the Hopfield associative memory. IEEE Transactions on Information Theory, 33(4):461–482, 1987.
[23] M. K. Muezzinoglu, C. Guzelis, and J. M. Zurada. A new design method for the complex-valued multi-state Hopfield associative memory. IEEE Transactions on Neural Networks, 14(4):891–899, July 2003.
[24] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[25] A. H. Salavati and A. Karbasi. Multi-level error-resilient neural networks. In 2012 IEEE International Symposium on Information Theory Proceedings (ISIT), pages 1064–1068, July 2012.
[26] F. Tanaka and S. F. Edwards. Analytic theory of the ground state properties of a spin glass. I. Ising spin glass. Journal of Physics F: Metal Physics, 10(12):2769, 1980.
[27] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
[28] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Trans. on Inf. Theory, 55(5):2183–2202, May 2009.
[29] W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing.
SIAM Journal on Imaging Sciences, 1(1):143–168, 2008.