{"title": "Unsupervised Learning by Convex and Conic Coding", "book": "Advances in Neural Information Processing Systems", "page_first": 515, "page_last": 521, "abstract": null, "full_text": "Unsupervised Learning by Convex and Conic Coding\n\nD. D. Lee and H. S. Seung\n\nBell Laboratories, Lucent Technologies\n\nMurray Hill, NJ 07974\n\n{ddlee|seung}@bell-labs.com\n\nAbstract\n\nUnsupervised learning algorithms based on convex and conic encoders are proposed. The encoders find the closest convex or conic combination of basis vectors to the input. The learning algorithms produce basis vectors that minimize the reconstruction error of the encoders. The convex algorithm develops locally linear models of the input, while the conic algorithm discovers features. Both algorithms are used to model handwritten digits and compared with vector quantization and principal component analysis. The neural network implementations involve feedback connections that project a reconstruction back to the input layer.\n\n1 Introduction\n\nVector quantization (VQ) and principal component analysis (PCA) are two widely used unsupervised learning algorithms, based on two fundamentally different ways of encoding data. In VQ, the input is encoded as the index of the closest prototype stored in memory. In PCA, the input is encoded as the coefficients of a linear superposition of a set of basis vectors. VQ can capture nonlinear structure in input data, but is weak because of its highly localized or \"grandmother neuron\" representation. Many prototypes are typically required to adequately represent the input data when the number of dimensions is large. On the other hand, PCA uses a distributed representation, so it needs only a small number of basis vectors to model the input. Unfortunately, it can only model linear structures.\n\nLearning algorithms based on convex and conic encoders are introduced here.
These encoders are less constrained than VQ but more constrained than PCA. As a result, they are able to produce sparse distributed representations that are efficient to compute. The resulting learning algorithms can be understood as approximate matrix factorizations and can also be implemented as neural networks with feedforward and feedback connections between neurons.\n\nFigure 1: The affine, convex, and conic hulls for two basis vectors.\n\n2 Affine, convex, conic, and point encoding\n\nGiven a set of basis vectors {w_a}, the linear combination sum_{a=1}^r v_a w_a is called affine if sum_a v_a = 1; convex if sum_a v_a = 1 and v_a >= 0; and conic if v_a >= 0.\n\nThe complete sets of affine, convex, and conic combinations are called respectively the affine, convex, and conic hulls of the basis. These hulls are geometrically depicted in Figure 1. The convex hull contains only interpolations of the basis vectors, whereas the affine hull contains not only the convex hull but also linear extrapolations. The conic hull also contains the convex hull but is not constrained to stay within the set sum_a v_a = 1. It extends to any nonnegative combination of the basis vectors and forms a cone in the vector space.\n\nFour encoders are considered in this paper. The convex and conic encoders are novel, and find the nearest point to the input x in the convex and conic hulls of the basis vectors. These encoders are compared with the well-known affine and point encoders. The affine encoder finds the nearest point to x in the affine hull and is equivalent to the encoding in PCA, while the point encoder, or VQ, finds the nearest basis vector to the input. All of these encoders minimize the reconstruction error:\n\nE = || x - sum_a v_a w_a ||^2    (1)\n\nThe constraints on v_a for the convex, conic, and affine encoders were described above. Point encoding can be thought of as a heavily constrained optimization of Eq.
(1): a single v_a must equal unity while all the rest vanish.\n\nEfficient algorithms exist for computing all of these encodings. The affine and point encoders are the fastest. Affine encoding is simply a linear transformation of the input vector. Point encoding is a nonlinear operation, but is computationally simple since it involves only a minimum-distance computation. The convex and conic encoders require solving a quadratic programming problem. These encodings are more computationally demanding than the affine and point encodings; nevertheless, polynomial-time algorithms do exist. The tractability of these problems is related to the fact that the cost function in Eq. (1) has no local minima on the convex domains in question. These encodings should be contrasted with computationally inefficient ones. A natural modification of the point encoder with combinatorial expressiveness can be obtained by allowing v to be any vector of zeros and ones [1, 2]. Unfortunately, with this constraint the optimization of Eq. (1) becomes an integer programming problem and is quite inefficient to solve.\n\nThe convex and conic encodings of an input generally contain coefficients v_a that vanish, due to the nonnegativity constraints in the optimization of Eq. (1). This method of obtaining sparse encodings is distinct from the method of simply truncating a linear combination by discarding small coefficients [3].\n\n3 Learning\n\nCorresponding learning algorithms exist for each of the encoders described above; they minimize the average reconstruction error over an ensemble of inputs. If a training set of m examples is arranged as the columns of an N x m matrix X, then the learning and encoding minimization can be expressed as:\n\nmin_{W,V} || X - W V ||^2    (2)\n\nwhere ||X||^2 denotes the sum of the squared elements of X.
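Computing a convex or conic code amounts to a small quadratic program. As a rough illustration only (the paper does not specify its solver; the function name and the projected-gradient scheme below are assumptions of this sketch), the conic code of a single input x can be found by projected gradient descent on Eq. (1):

```python
import numpy as np

def conic_encode(x, W, n_iter=500):
    """Minimize ||x - W v||^2 over v >= 0 (the conic encoding of Eq. 1).

    The cost is a convex quadratic, so projected gradient descent with a
    fixed step 1/L (L = largest eigenvalue of W^T W) converges to the
    global minimum; the projection step enforces nonnegativity.
    """
    L = np.linalg.norm(W, 2) ** 2        # squared spectral norm of W
    v = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = W.T @ (W @ v - x)         # gradient of the quadratic cost
        v = np.maximum(v - grad / L, 0)  # project back onto the cone v >= 0
    return v

# An input lying inside the cone of the basis is reconstructed almost
# exactly, and the recovered code is nonnegative.
W = np.array([[1.0, 0.2],
              [0.0, 0.8],
              [0.5, 0.1]])
x = W @ np.array([0.3, 0.4])
v = conic_encode(x, W)
```

Convex encoding additionally requires sum_a v_a = 1, which would replace the simple clipping step with a projection onto the probability simplex.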
Learning and encoding can thus be described as the approximate factorization of the data matrix X into an N x r matrix W of r basis vectors and an r x m matrix V of m code vectors.\n\nAssuming that the input vectors in X have been scaled to the range [0,1], the constraints on the optimizations in Eq. (2) are given by:\n\nAffine: 0 <= W_ia <= 1, sum_a V_aj = 1;\nConvex: 0 <= W_ia <= 1, V_aj >= 0, sum_a V_aj = 1;\nConic: 0 <= W_ia <= 1, V_aj >= 0,\n\nwhere j indexes the training examples. The nonnegativity constraints on W and V prevent cancellations from occurring in the linear combinations, and their importance will be seen shortly. The upper bound on W is chosen such that the basis vectors are normalized to the same range as the inputs X. We noted earlier that the computation for encoding is tractable, since the cost function in Eq. (2) is a quadratic function of V. However, when considered as a function of both W and V, the cost function is quartic, and finding its global minimum for learning can be very difficult. The issue of local minima is discussed in the following example.\n\n4 Example: modeling handwritten digits\n\nWe applied Affine, Convex, Conic, and VQ learning to the USPS database [4], which consists of examples of handwritten digits segmented from actual zip codes. Each of the 7291 training and 2007 test images was normalized to a 16 x 16 grid with pixel intensities in the range [0,1]. There were noticeable segmentation errors resulting in unrecognizable digits, but these images were left in both the training and test sets. The training examples were segregated by digit class, and separate basis vectors were trained for each of the classes using the four encodings. Figure 2 shows our results for the digit class \"2\" with r = 25 basis vectors.\n\nFigure 2: Basis vectors for \"2\" found by VQ, Affine (PCA), Convex, and Conic learning.\n\nThe k-means algorithm was used to find the VQ basis vectors shown in Figure 2.
Because the encoding is over a discontinuous and highly constrained space, there exist many local minima of Eq. (2). To deal with this problem, the algorithm was restarted from various initial conditions and the best solution was chosen. The resulting basis vectors look like \"2\" templates and are blurry, because each basis vector is the mean of a large number of input images.\n\nAffine determines the affine space that best models the input data. As can be seen in the figure, the individual basis vectors have no obvious interpretation. Although the space found by Affine is unique, its representation by basis vectors is degenerate. Any set of r linearly independent vectors drawn from the affine space can be used to represent it. This is due to the fact that the product WV is invariant under the transformation W -> WS and V -> S^{-1}V.(1)\n\nConvex finds the r basis vectors whose convex hull best fits the input data. The optimization was performed by alternating between projected gradient steps on W and V. The constraint that the column sums of V equal unity was implemented by adding a quadratic penalty term. In contrast to Affine, the basis vectors are interpretable as templates and are less blurred than those found by VQ. Many of the elements of W, and also of V, are zero at the minimum. This eliminates many invariant transformations S, because they would violate the nonnegativity constraints on W and V. From our simulations, it appears that most of the degeneracy seen in Affine is lifted by the nonnegativity constraints.\n\n(1) Affine is essentially equivalent to PCA, except that they represent the affine space in different ways. Affine represents it with r points chosen from the space. PCA represents the affine space with a single point from the space and r - 1 orthonormal directions. This is still a degenerate representation, but PCA fixes it by taking the point to be the sample mean and the r - 1 directions to be the eigenvectors of the covariance matrix of X with the largest eigenvalues.\n\nConic finds basis vectors whose conic hull best models the input images. The learning algorithm is similar to Convex, except there is no penalty term on the sum of the activities. The Conic representation allows combinations of basis vectors, not just interpolations between them. As a result, the basis vectors found are features rather than templates, as seen in Figure 2. In contrast to Affine, the nonnegativity constraint leads to features that are interpretable as correlated strokes. As the number of basis vectors r increases, these features decrease in size until they become individual pixels.\n\nFigure 3: Activities and reconstructions of a \"2\" using convex and conic coding. (Panels: input image, convex coding activities, conic coding activities, convex reconstructions, conic reconstructions.)\n\nThese models were used to classify novel test images. Recognition was accomplished by separately reconstructing the test images with the different digit models and associating the image with the model having the smallest reconstruction error. Figure 3 illustrates an example of classifying a \"2\" using the convex and conic encodings. The basis vectors are displayed weighted by their activities v_a, and the sparsity of the representations can be clearly seen. The bottom part of the figure shows the different reconstructions generated by the various digit models.
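This recognition-by-reconstruction scheme can be sketched in a few lines (a toy illustration under assumed names: `conic_encode`, `classify`, and the two tiny synthetic bases below stand in for the trained per-digit models and are not from the paper):

```python
import numpy as np

def conic_encode(x, W, n_iter=500):
    # Projected gradient descent on ||x - W v||^2 with v >= 0 (Eq. 1).
    L = np.linalg.norm(W, 2) ** 2
    v = np.zeros(W.shape[1])
    for _ in range(n_iter):
        v = np.maximum(v - W.T @ (W @ v - x) / L, 0.0)
    return v

def classify(x, bases):
    # Reconstruct x with each class's basis; pick the smallest error.
    errors = {}
    for label, W in bases.items():
        v = conic_encode(x, W)
        errors[label] = np.sum((x - W @ v) ** 2)
    return min(errors, key=errors.get)

# Two toy "digit classes" whose cones occupy different pixel subsets.
bases = {
    0: np.array([[1.0, 0.5], [0.8, 1.0], [0.0, 0.0], [0.0, 0.0]]),
    1: np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.3], [0.2, 1.0]]),
}
x = bases[1] @ np.array([0.6, 0.2])  # an input drawn from class 1's cone
label = classify(x, bases)           # -> 1
```

Because the class-1 cone reconstructs x almost exactly while the class-0 cone cannot, the smallest-error rule assigns the input to class 1.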
\n\nWith r = 25 patterns per digit class, Convex incorrectly classified 113 digits out of \nthe 2007 test examples for an overall error rate of 5.6%. This is virtually identical \nto the performance of k = 1 nearest neighbor (112 errors) and linear r = 25 peA \nmodels (111 errors). However, scaling up the convex models to r = 100 patterns \nresults in an error rate of 4.4% (89 errors). This improvement arises because the \nlarger convex hulls can better represent the overall nonlinear nature of the input \n\n\f520 \n\nD. D. Lee and H. S. Seung \n\ndistributions. This is good performance relative to other methods that do not \nuse prior knowledge of invariances, such as the support vector machine (4.0% [5]). \nHowever, it is not as good as methods that do use prior knowledge, such as nearest \nneighbor with tangent distance (2.6% [6]). \nOn the other hand, Conic coding with r = 25 results in an error rate of 6.8% \n(138 errors). With larger basis sets r > 50, Conic shows worse performance as the \nfeatures shrink to small spots. These results indicate that by itself, Conic does not \nyield good models; non-trivial correlations still remain in the Va and also need to \nbe taken into account. For instance, while the conic basis for \"9\" can fit some \"7\" 's \nquite well with little reconstruction error, the codes Va are distinct from when it \nfits \"9\" 'so \n\n5 Neural network implementation \n\nConic and Convex were described above as matrix factorizations. Alternatively, \nthe encoding can be performed by a neural network dynamics [7] and the learning \nby a synaptic update rule. We describe here the implementation for the Conic \nnetwork; the Convex network is similar. The Conic network has a layer of N \nerror neurons ei and a layer of r encoding neurons Va. The fixed point of the \nencoding dynamics \n\ndVa \ndt +Va \n\ndei \n-+ e \u00b7 \ndt \nt \n\n[i:e;Wi. +v.f \n\nt=l \n\nr \n\nXi - ~WiaVa, \n\na=l \n\n(3) \n\n(4) \n\noptimizes Eq. 
(1), finding the best conic encoding of the input x_i. The rectification nonlinearity [x]^+ = max(x, 0) enforces the nonnegativity constraint. The error neurons subtract the reconstruction from the input x_i. The excitatory connection from e_i to v_a is equal and opposite to the inhibitory connection from v_a back to e_i. The Hebbian synaptic weight update\n\ndelta W_ia = eta e_i v_a    (5)\n\nis made following convergence of the encoding dynamics for each input, while respecting the bound constraints on W_ia. This performs stochastic gradient descent on the ensemble reconstruction error with learning rate eta.\n\n6 Discussion\n\nConvex coding is similar to other locally linear models [8, 9, 10, 11]. Distance to a convex hull was previously used in nearest neighbor classification [12], though no learning algorithm was proposed. Conic coding is similar to the noisy-OR [13, 14] and harmonium [15] models. The main difference is that these previous models contain discrete binary variables, whereas Conic uses continuous ones. The use of analog rather than binary variables makes the encoding computationally tractable and allows for interpolation between basis vectors.\n\nHere we have emphasized the geometrical interpretation of Convex and Conic coding. They can also be viewed as probabilistic hidden variable models. The inputs x_i are visible while the v_a are hidden variables, and the reconstruction error in Eq. (1) is related to the log likelihood, log P(x_i|v_a). No explicit model P(v_a) for the hidden variables was used, which limited the quality of the Conic models in particular. The feature discovery capabilities of Conic, however, make it a promising tool for building hierarchical representations. We are currently working on extending these new coding schemes and learning algorithms to multilayer networks.\n\nWe acknowledge the support of Bell Laboratories. We thank C. Burges, C.
Cortes, and Y. LeCun for providing us with the USPS database. We are also grateful to K. Clarkson, R. Freund, L. Kaufman, L. Saul, and M. Wright for helpful discussions.\n\nReferences\n\n[1] Hinton, GE & Zemel, RS (1994). Autoencoders, minimum description length and Helmholtz free energy. Advances in Neural Information Processing Systems 6, 3-10.\n\n[2] Ghahramani, Z (1995). Factorial learning and the EM algorithm. Advances in Neural Information Processing Systems 7, 617-624.\n\n[3] Olshausen, BA & Field, DJ (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607-609.\n\n[4] Le Cun, Y et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Comput. 1, 541-551.\n\n[5] Scholkopf, B, Burges, C & Vapnik, V (1995). Extracting support data for a given task. KDD-95 Proceedings, 252-257.\n\n[6] Simard, P, Le Cun, Y & Denker, J (1993). Efficient pattern recognition using a new transformation distance. Advances in Neural Information Processing Systems 5, 50-58.\n\n[7] Tank, DW & Hopfield, JJ (1986). Simple neural optimization networks: an A/D converter, signal decision circuit, and a linear programming circuit. IEEE Trans. Circ. Syst. CAS-33, 533-541.\n\n[8] Bezdek, JC, Coray, C, Gunderson, R & Watson, J (1981). Detection and characterization of cluster substructure. SIAM J. Appl. Math. 40, 339-357; 358-372.\n\n[9] Bregler, C & Omohundro, SM (1995). Nonlinear image interpolation using manifold learning. Advances in Neural Information Processing Systems 7, 973-980.\n\n[10] Hinton, GE, Dayan, P & Revow, M (1996). Modeling the manifolds of images of handwritten digits. IEEE Trans. Neural Networks, submitted.\n\n[11] Hastie, T, Simard, P & Sackinger, E (1995). Learning prototype models for tangent distance. Advances in Neural Information Processing Systems 7, 999-1006.\n\n[12] Haas, HPA, Backer, E & Boxma, I (1980).
Convex hull nearest neighbor rule. Fifth Intl. Conf. on Pattern Recognition Proceedings, 87-90.\n\n[13] Dayan, P & Zemel, RS (1995). Competition and multiple cause models. Neural Comput. 7, 565-579.\n\n[14] Saund, E (1995). A multiple cause mixture model for unsupervised learning. Neural Comput. 7, 51-71.\n\n[15] Freund, Y & Haussler, D (1992). Unsupervised learning of distributions on binary vectors using two layer networks. Advances in Neural Information Processing Systems 4, 912-919.\n", "award": [], "sourceid": 1242, "authors": [{"given_name": "Daniel", "family_name": "Lee", "institution": null}, {"given_name": "H. Sebastian", "family_name": "Seung", "institution": null}]}