{"title": "Classification via Minimum Incremental Coding Length (MICL)", "book": "Advances in Neural Information Processing Systems", "page_first": 1633, "page_last": 1640, "abstract": null, "full_text": "Classi\ufb01cation via Minimum Incremental Coding\n\nLength (MICL)\n\nJohn Wright\u2217, Yi Ma\n\nUniversity of Illinois at Urbana-Champaign\n\nCoordinated Science Laboratory\n{jnwright,yima}@uiuc.edu\n\nYangyu Tao, Zhouchen Lin, Heung-Yeung Shum\n\nVisual Computing Group\nMicrosoft Research Asia\n\n{v-yatao,zhoulin,hshum}@microsoft.com\n\nAbstract\n\nWe present a simple new criterion for classi\ufb01cation, based on principles from lossy\ndata compression. The criterion assigns a test sample to the class that uses the min-\nimum number of additional bits to code the test sample, subject to an allowable\ndistortion. We prove asymptotic optimality of this criterion for Gaussian data and\nanalyze its relationships to classical classi\ufb01ers. Theoretical results provide new\ninsights into relationships among popular classi\ufb01ers such as MAP and RDA, as\nwell as unsupervised clustering methods based on lossy compression [13]. Mini-\nmizing the lossy coding length induces a regularization effect which stabilizes the\n(implicit) density estimate in a small-sample setting. Compression also provides\na uniform means of handling classes of varying dimension. This simple classi-\n\ufb01cation criterion and its kernel and local versions perform competitively against\nexisting classi\ufb01ers on both synthetic examples and real imagery data such as hand-\nwritten digits and human faces, without requiring domain-speci\ufb01c information.\n\n1 Introduction\n\nOne quintessential problem in statistical learning [9, 20] is to construct a classi\ufb01er from labeled\ntraining data (xi, yi) \u223ciid pX,Y (x, y). Here, xi \u2208 Rn is the observation, and yi \u2208 {1, . . . , K} its\nassociated class label. The goal is to construct a classi\ufb01er g : Rn \u2192 {1, . . . 
, K} which minimizes\nthe expected risk (or probability of error): $g^* = \\arg\\min_g E[I_{g(X) \\neq Y}]$, where the expectation is taken\nwith respect to pX,Y . When the conditional class distributions pX|Y (x|y) and the class priors pY (y)\nare given, the maximum a posteriori (MAP) assignment\n\n$$\\hat{y}(x) = \\arg\\min_{y \\in \\{1,\\dots,K\\}} \\; -\\ln p_{X|Y}(x|y) - \\ln p_Y(y) \\qquad (1)$$\n\ngives the optimal classifier. This amounts to a minimum coding length principle: the optimal\nclassifier minimizes the Shannon optimal (lossless) coding length of the test data x with respect to\nthe distribution of the true class. The first term is the number of bits needed to code x w.r.t. the\ndistribution of class y, and the second term is the number of bits needed to code the label y for x.\n\nIssues with Learning the Distributions from Training Samples. In the typical classification\nsetting, the distributions pX|Y (x|y) and pY (y) need to be learned from a set of labeled training\ndata. Conventional approaches to model estimation (implicitly) assume that the distributions are\nnondegenerate and the samples are sufficiently dense. However, these assumptions fail in many\nclassification problems which are vital for applications in computer vision [10,11]. For instance, the\nset of images of a human face taken from different angles and under different lighting conditions\noften lies in a low-dimensional subspace or submanifold of the ambient space [2]. As a result, the\nassociated distributions are degenerate or nearly degenerate. Moreover, due to the high dimensionality\nof imagery data, the set of training images is typically sparse.\n\n\u2217The authors gratefully acknowledge support from grants NSF Career IIS-0347456, NSF CRS-EHS-0509151, NSF CCF-TF-0514955, and ONR YIP N00014-05-1-0633.\n\nInferring the generating probability distribution pX,Y from a sparse set of samples is an inherently\nill-conditioned problem [20]. 
Furthermore, in the case of degenerate distributions, the classical\nlikelihood function (1) does not have a well-de\ufb01ned maximum [20]. Thus, to infer the distribution\nfrom the training data or to use it to classify new observations, the distribution or its likelihood\nfunction needs to be properly \u201cregularized.\u201d Typically, this is accomplished either explicitly via\nsmoothness constraints, or implicitly via parametric assumptions on the distribution [3]. However,\neven if the distributions are assumed to be generic Gaussians, explicit regularization is still necessary\nto achieve good small-sample performance [6].\nIn many real problems in computer vision, the distributions associated with different classes of data\nhave different model complexity. For instance, when detecting a face in an image, features associated\nwith the face often have a low-dimensional structure which is \u201cembedded\u201d as a submanifold in a\ncloud of essentially random features from the background. Model selection criteria such as minimum\ndescription length (MDL) [12, 16] serve as important modi\ufb01cations to MAP for model estimation\nacross classes of different complexity. It selects the model that minimizes the overall coding length\nof the given (training) data, hence the name \u201cminimum description length\u201d [1]. Notice, however, that\nMDL does not specify how the model complexity should be properly accounted for when classifying\nnew test data among models that have different dimensions.\n\nSolution from Lossy Data Coding. Given the dif\ufb01culty of learning the (potentially degenerate)\ndistributions pX|Y (x|y) from a few samples in a high-dimensional space, it makes more sense to\nseek good \u201csurrogates\u201d for implementing the minimum coding length principle (1). 
Our idea is to\nmeasure how efficiently a new observation can be encoded by each class of the training data subject\nto an allowable distortion, and to assign the new observation to the class that requires the minimum\nnumber of additional bits. We dub this criterion \u201cminimum incremental coding length\u201d (MICL) for\nclassification. It serves as a counterpart of the MDL principle for model estimation and as a surrogate\nfor the minimum coding length principle for classification.\nThe proposed MICL criterion naturally addresses the issues of regularization and model complexity.\nRegularization is introduced through the use of lossy coding, i.e. coding the test data x up to an\nallowable distortion1 (placing our approach along the lines of lossy MDL [15]). This contrasts with\nShannon\u2019s optimal lossless coding length, which requires precise knowledge of the true distributions.\nLossy coding length also accounts for model complexity by directly measuring the difference in the\nvolume (hence dimension) of the training data with and without the new observation.\n\nRelationships to Existing Classifiers. While MICL and MDL both minimize a coding-theoretic\nobjective, MICL differs strongly from traditional MDL approaches to classification such as those\nproven inconsistent in [8]. Those methods choose a decision boundary that minimizes the total\nnumber of bits needed to code the boundary and the samples it incorrectly classifies. In contrast, MICL\nuses coding length directly as a measure of how well the training data represent the new sample.\nThe inconsistency result of [8] does not apply in this modified context. Within the lossy data\ncoding framework, we establish that the MICL criterion leads to a family of classifiers that generalize\nthe conventional MAP classifier (1). 
We prove that for Gaussian distributions, the MICL criterion\nasymptotically converges to a regularized version of MAP2 (see Theorem 1) and give a precise es-\ntimate of the convergence rate (see Theorem 2). Thus, lossy coding induces a regularization effect\nsimilar to Regularized Discriminant Analysis (RDA) [6], with similar gains in \ufb01nite sample per-\nformance with respect to MAP/QDA. The fully Bayesian approach to model estimation, in which\nposterior distributions over model parameters are estimated also provides \ufb01nite sample gains over\n\n1Information Bottleneck also uses lossy coding, but in an unsupervised manner, for clustering, feature\n\nselection and dimensionality reduction [19]. We apply lossy coding in the supervised (classi\ufb01cation) setting.\n\n2MAP subject to a Gaussian assumption is often referred to as Quadratic Discriminant Analysis (QDA) [9].\n\n2\n\n\fML/MAP [14]. However, that method is sensitive to the choice of prior when the number of samples\nis less than the dimension of the space, a situation that poses no dif\ufb01culty to our proposed classi\ufb01er.\nWhen the distributions involved are not Gaussian, the MICL criterion can still be applied locally,\nsimilar to the popular k-Nearest Neighbor (k-NN) classi\ufb01er. However, the local MICL classi-\n\ufb01er signi\ufb01cantly improves the k-NN classi\ufb01er as it accounts for both the number of samples and\nthe distribution of the samples within the neighborhood. MICL can also be kernelized to handle\nnonlinear/non-Gaussian data, an extension similar to the generalization of Support Vector Machines\n(SVM) to nonlinear decision boundaries. 
The kernelized version of MICL provides a simple\nalternative to the SVM approach of constructing a linear decision boundary in the embedded (kernel)\nspace, and better exploits the covariance structure of the embedded data.\n\n2 Classification Criterion and Analysis\n\n2.1 Minimum Incremental Coding Length\n\nA lossy coding scheme [5] maps vectors X = (x1, . . . , xm) \u2208 Rn\u00d7m to a sequence of binary bits,\nfrom which the original vectors can be recovered up to an allowable distortion $E[\\|\\hat{x} - x\\|^2] \\le \\varepsilon^2$.\nThe length of the bit sequence is then a function L\u03b5(X ) : Rn\u00d7m \u2192 Z+. If we encode each class\nof training data $\\mathcal{X}_j \\doteq \\{x_i : y_i = j\\}$ separately using L\u03b5(Xj) bits, the entire training dataset can be\nrepresented by a two-part code using $\\sum_{j=1}^{K} L_\\varepsilon(\\mathcal{X}_j) - |\\mathcal{X}_j| \\log_2 p_Y(j)$ bits. Here, the second term is\nthe minimum number of bits needed to (losslessly) code the class labels yi.\nNow, suppose we are given a test observation x \u2208 Rn, whose associated class label y(x) = j is\nunknown. If we code x jointly with the training data Xj of the jth class, the number of additional\nbits needed to code the pair (x, y) is \u03b4L\u03b5(x, j) = L\u03b5(Xj \u222a {x}) \u2212 L\u03b5(Xj) + L(j). Here, the first two\nterms measure the excess bits needed to code (x, Xj) up to distortion \u03b52, while the last term L(j)\nis the cost of losslessly coding the label y(x) = j. One may view these as \u201cfinite-sample lossy\u201d\nsurrogates for the Shannon coding lengths in the ideal classifier (1). This interpretation naturally\nleads to the following classifier:\nCriterion 1 (Minimum Incremental Coding Length). 
Assign x to the class which minimizes the\nnumber of additional bits needed to code (x, \u02c6y), subject to the distortion \u03b5:\n\n$$\\hat{y}(x) \\doteq \\arg\\min_{j \\in \\{1,\\dots,K\\}} \\delta L_\\varepsilon(x, j). \\qquad (2)$$\n\nThe above criterion (2) can be taken as a general principle for classification, in the sense that it can be\napplied using any lossy coding scheme. Nevertheless, effective classification demands that the\nchosen coding scheme be approximately optimal for the given data. From a finite sample perspective,\nL\u03b5 should approximate the Kolmogorov complexity of X , while in an asymptotic, statistical setting\nit should approach the lower bound given by the rate-distortion of the generating distribution [5].\n\nLossy Coding of Gaussian Data. We will first consider a coding length function L\u03b5 introduced\nand rigorously justified in [13], which is (asymptotically) optimal for Gaussians. The (implicit) use\nof a coding scheme which is optimal for Gaussian sources is equivalent to assuming that the\nconditional class distributions pX|Y can be well-approximated by Gaussians. After rigorously analyzing\nthis admittedly restrictive scenario, we will extend the MICL classifier (with this same L\u03b5 function)\nto arbitrary, multimodal distributions via an effective local Gaussian approximation.\nFor a multivariate Gaussian source N (\u00b5, \u03a3), the average number of bits needed to code a vector\nsubject to a distortion \u03b52 is approximately $R_\\varepsilon(\\Sigma) \\doteq \\frac{1}{2} \\log_2 \\det\\big(I + \\frac{n}{\\varepsilon^2} \\Sigma\\big)$ (bits/vector). Observations\nX = (x1, . . . , xm) with sample mean $\\hat{\\mu} = \\frac{1}{m} \\sum_i x_i$ and covariance $\\hat{\\Sigma}(\\mathcal{X}) = \\frac{1}{m-1} \\sum_i (x_i - \\hat{\\mu})(x_i - \\hat{\\mu})^T$\ncan be represented up to expected distortion \u03b52 using \u2248 mR\u03b5(\u02c6\u03a3) bits. The optimal\ncodebook is adaptive to the data, and can be encoded by representing the principal axes of the\ncovariance using an additional nR\u03b5(\u02c6\u03a3) bits. Encoding the mean vector \u00b5 requires an additional\n$\\frac{n}{2} \\log_2\\big(1 + \\frac{\\hat{\\mu}^T \\hat{\\mu}}{\\varepsilon^2}\\big)$ bits. The total number of bits required to code X is therefore\n\n$$L_\\varepsilon(\\mathcal{X}) \\doteq \\frac{m+n}{2} \\log_2 \\det\\Big(I + \\frac{n}{\\varepsilon^2} \\hat{\\Sigma}(\\mathcal{X})\\Big) + \\frac{n}{2} \\log_2\\Big(1 + \\frac{\\hat{\\mu}^T \\hat{\\mu}}{\\varepsilon^2}\\Big). \\qquad (3)$$\n\nThe first term gives the number of bits needed to represent the distribution of the xi about their mean,\nand the second gives the cost of representing the mean. The above function well-approximates the\noptimal coding length for Gaussian data, and has also been shown to give a good upper bound on the\nnumber of bits needed to code finitely many samples lying on a linear subspace (e.g., a degenerate\nGaussian distribution) [13].\n\nFigure 1 (panels: MICL, k-NN, SVM-RBF): MICL harnesses linear structure in the data to interpolate (left) and extrapolate (center) in\nsparsely sampled regions. Popular classifiers such as k-NN and SVM-RBF do not (right).\n\nCoding the Class Label. Since the label Y is discrete, it can be coded losslessly. If the test class\nlabels Y are known to have the marginal distribution P [Y = j] = \u03c0j, then the optimal coding\nlengths are (within one bit): L(j) = \u2212 log2 \u03c0j. In practice, we may replace \u03c0j with the estimate\n\u02c6\u03c0j = |Xj|/m. Notice that as in the MAP classifier, the \u03c0j essentially form a prior on class labels.\nCombining this coding length for the class label with the coding length function (3) for the observations,\nwe summarize the MICL criterion (2) as Algorithm 1 below:\n\nAlgorithm 1 (MICL Classifier).\n1: Input: m training samples partitioned into K classes X1,X2, . . . 
,XK and a test sample x.\n2: Compute prior distribution of class labels \u02c6\u03c0j = |Xj|/m.\n3: Compute incremental coding length of x for each class:\n\n$$\\delta L_\\varepsilon(x, j) = L_\\varepsilon(\\mathcal{X}_j \\cup \\{x\\}) - L_\\varepsilon(\\mathcal{X}_j) - \\log_2 \\hat{\\pi}_j,$$\n\nwhere\n\n$$L_\\varepsilon(\\mathcal{X}) \\doteq \\frac{m+n}{2} \\log_2 \\det\\Big(I + \\frac{n}{\\varepsilon^2} \\hat{\\Sigma}(\\mathcal{X})\\Big) + \\frac{n}{2} \\log_2\\Big(1 + \\frac{\\hat{\\mu}^T \\hat{\\mu}}{\\varepsilon^2}\\Big).$$\n\n4: Output: \u02c6y(x) = arg minj=1,...,K \u03b4L\u03b5(x, j).\n\nThe L\u03b5(Xj \u222a {x}) can be computed in O(min(m, n)^2) time (see [21]), allowing the MICL classifier\nto be directly applied to high-dimensional data. Figure 1 shows the performance of Algorithm 1 on\ntwo toy problems. In both cases, the MICL criterion harnesses the covariance structure of the data\nto achieve good classification in sparsely sampled regions. In the left example, the criterion\ninterpolates the data structure to achieve correct classification, even near the origin where the samples\nare sparse. In the right example, the criterion extrapolates the horizontal line to the other side of the\nplane. Methods such as k-NN and SVM do not achieve the same effect. Notice, however, that these\ndecision boundaries are similar to what MAP/QDA would give. This raises an important question:\nwhat is the precise relationship between MICL and MAP, and when is MICL superior?\n\n2.2 Asymptotic Behavior and Relationship to MAP\n\nIn this section, we analyze the asymptotic behavior of Algorithm 1 as the number of training samples\ngoes to infinity. The following result, whose proof is given in [21], indicates that MICL converges\nto a regularized version of ML/MAP, subject to a reward on the dimension of the classes:\nTheorem 1 (Asymptotic MICL [21]). Let the training samples $\\{(x_i, y_i)\\}_{i=1}^{m} \\sim_{iid} p_{X,Y}(x, y)$, with\n$\\mu_j \\doteq E[X|Y = j]$, $\\Sigma_j \\doteq \\mathrm{Cov}(X|Y = j)$. Then as m \u2192 \u221e, the MICL criterion coincides\n(asymptotically, with probability one) with the decision rule\n\n$$\\hat{y}(x) = \\arg\\max_{j=1,\\dots,K} \\; L_G\\Big(x \\,\\Big|\\, \\mu_j, \\Sigma_j + \\frac{\\varepsilon^2}{n} I\\Big) + \\ln \\pi_j + \\frac{1}{2} D_\\varepsilon(\\Sigma_j), \\qquad (4)$$\n\nwhere LG(\u00b7| \u00b5, \u03a3) is the log-likelihood function for a N (\u00b5, \u03a3) distribution, and $D_\\varepsilon(\\Sigma_j) \\doteq \\mathrm{tr}\\big(\\Sigma_j (\\Sigma_j + \\frac{\\varepsilon^2}{n} I)^{-1}\\big)$\nis the effective dimension of the j-th model, relative to the distortion \u03b52.\n\nFigure 2: Left: Excess risk incurred by using MAP rather than MICL, as a function of \u03b5 and m. (a)\nisotropic Gaussians. (b) anisotropic Gaussians. Right: Excess risk for nested classes, as a function\nof n and m. (c) MICL vs. MAP. (d) MICL vs. RDA. In all examples, MICL is superior for n \u226b m.\n\nThis result shows that asymptotically, MICL generates a family of MAP-like classifiers parametrized\nby the distortion \u03b52. If all of the distributions are nondegenerate (i.e. their covariance matrices \u03a3j\nare nonsingular), then $\\lim_{\\varepsilon \\to 0} (\\Sigma_j + \\frac{\\varepsilon^2}{n} I) = \\Sigma_j$ and $\\lim_{\\varepsilon \\to 0} D_\\varepsilon(\\Sigma_j) = n$, a constant across the\nvarious classes. Thus, for nondegenerate data, the family of classifiers induced by MICL contains\nthe conventional MAP classifier (1) at \u03b5 = 0. Given a finite number, m, of samples, any reasonable\nrule for choosing the distortion \u03b52 should therefore ensure that \u03b5 \u2192 0 as m \u2192 \u221e. This guarantees\nthat for non-degenerate distributions, MICL converges to the asymptotically optimal MAP criterion.\nSimulations (e.g., Figure 1) suggest that the limiting behavior provides useful information even\nfor finite training data. 
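
As an illustration of the finite-sample rule, the coding length (3) and Algorithm 1 can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (log2_det_spd, coding_length, micl_classify) and the dictionary layout of the training data are hypothetical, every class is assumed to be non-empty, the determinant is computed by a dense Cholesky factorization rather than the O(min(m, n)^2) computation of [21], and a class with a single sample uses a divisor of 1 in the covariance to avoid division by zero.

```python
import math


def log2_det_spd(a):
    # log2(det(a)) for a symmetric positive definite matrix, via Cholesky.
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(s)
                total += 2.0 * math.log2(low[i][j])
            else:
                low[i][j] = s / low[j][j]
    return total


def coding_length(samples, eps):
    # L_eps(X) from equation (3); samples is a list of m vectors of length n.
    m, n = len(samples), len(samples[0])
    mu = [sum(x[d] for x in samples) / m for d in range(n)]
    denom = max(m - 1, 1)  # assumption: guard the single-sample case
    scale = n / eps ** 2
    mat = [[(1.0 if r == c else 0.0)
            + scale * sum((x[r] - mu[r]) * (x[c] - mu[c]) for x in samples) / denom
            for c in range(n)] for r in range(n)]
    mean_sq = sum(v * v for v in mu)
    return (0.5 * (m + n) * log2_det_spd(mat)
            + 0.5 * n * math.log2(1.0 + mean_sq / eps ** 2))


def micl_classify(classes, x, eps):
    # classes maps each label to its list of training vectors (hypothetical layout).
    m_total = sum(len(xs) for xs in classes.values())
    best_label, best_cost = None, float('inf')
    for label, xs in classes.items():
        # Incremental coding length plus the cost of coding the label.
        cost = (coding_length(xs + [x], eps) - coding_length(xs, eps)
                - math.log2(len(xs) / m_total))
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label
```

With two small classes sampled along the horizontal and the vertical axis, a query far out on the horizontal axis is assigned to the horizontal class, which mirrors the extrapolation behavior described for Figure 1.
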
The following result, proven in [21], verifies that the MICL discriminant\nfunctions \u03b4L\u03b5(x, j) converge quickly to their limiting form $\\delta L_\\varepsilon^{\\infty}(x, j)$:\nTheorem 2 (MICL Convergence Rate [21]). As the number of samples, m \u2192 \u221e, the MICL criterion\n(2) converges to its asymptotic form (4) at a rate of $m^{-1/2}$. More specifically, with probability at least\n1 \u2212 \u03b1, $|\\delta L_\\varepsilon(z, j) - \\delta L_\\varepsilon^{\\infty}(z, j)| \\le c(\\alpha) \\cdot m^{-1/2}$ for some constant c(\u03b1) > 0.\n\n2.3 Improvements over MAP\n\nIn the above, we have established that asymptotically, the MICL criterion (4) is just as good\nas the MAP criterion. Nevertheless, the MICL criterion makes several important modifications to\nMAP, which significantly improve its performance on sparsely sampled or degenerate data.\n\nRegularization and Finite-Sample Behavior. Notice that the first two terms of the asymptotic\nMICL criterion (4) have the form of a MAP criterion, based on an $N(\\mu, \\Sigma + \\frac{\\varepsilon^2}{n} I)$ distribution.\nThis is somewhat equivalent to softening the distribution by $\\frac{\\varepsilon^2}{n}$ along each dimension, and has two\nimportant effects. First, it renders the associated MAP decision rule well-defined, even if the true\ndata distribution is (almost) degenerate. Second, even for non-degenerate distributions, there is empirical\nevidence that for appropriately chosen \u03b5, $\\hat{\\Sigma} + \\frac{\\varepsilon^2}{n} I$ gives more stable finite-sample classification [6].\nFigure 2 demonstrates this effect on two simple examples. The generating distributions are\nparameterized as (a) $\\mu_1 = [-\\frac{1}{2}, 0]$, $\\mu_2 = [\\frac{1}{2}, 0]$, $\\Sigma_1 = \\Sigma_2 = I$, and (b) $\\mu_1 = [-\\frac{3}{4}, 0]$, $\\mu_2 = [\\frac{3}{4}, 0]$,\n$\\Sigma_1 = \\mathrm{diag}(1, 4)$, $\\Sigma_2 = \\mathrm{diag}(4, 1)$. In each example, we vary the number of training samples, m,\nand the distortion \u03b5. For each (m, \u03b5) combination, we draw m training samples from two Gaussian\ndistributions N (\u00b5i, \u03a3i), i = 1, 2, and estimate the Bayes risk of the resulting MICL and MAP\nclassifiers. This procedure is repeated 500 times, to estimate the overall Bayes risk with respect to\nvariations in the training data. Figure 2 visualizes the difference in risks, $R_{MAP} - R_{MICL}$. Positive\nvalues indicate that MICL is outperforming MAP. The red line approximates the zero level-set,\nwhere the two methods perform equally well. In the isotropic case (a), MICL outperforms MAP for\nall sufficiently large \u03b5, with a larger performance gain when the number of samples is small. In the\nanisotropic case (b), for most \u03b5, MICL dramatically outperforms MAP for small sample sizes. We\nwill see in the next example that this effect becomes more pronounced as the dimension increases.\n\n[Figure 2 plots. Panels (a), (b): log \u03b5 versus number of training samples, color scale $R_{MAP} - R_{MICL}$. Panels (c), (d): ambient dimension versus number of training samples, color scales $R_{MAP} - R_{MICL}$ and $R_{RDA} - R_{MICL}$.]\n\nDimension Reward. The effective dimension term D\u03b5(\u03a3j) in the asymptotic MICL criterion (4) can\nbe rewritten as $D_\\varepsilon(\\Sigma_j) = \\sum_{i=1}^{n} \\lambda_i / (\\frac{\\varepsilon^2}{n} + \\lambda_i)$, where \u03bbi is the ith eigenvalue of \u03a3j. If the data lie\nnear a d-dimensional subspace ($\\lambda_1, \\dots, \\lambda_d \\gg \\varepsilon^2/n$ and $\\lambda_{d+1}, \\dots, \\lambda_n \\ll \\varepsilon^2/n$), D\u03b5 \u2248 d. In general,\nD\u03b5 can be viewed as a \u201csoftened\u201d estimate of the dimension3, relative to the distortion \u03b52. 
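
The eigenvalue form of the effective dimension is easy to evaluate numerically. The sketch below is illustrative only: the function name effective_dimension and the example spectrum in the note that follows are assumptions, not taken from the paper, and eps is assumed to be strictly positive.

```python
def effective_dimension(eigenvalues, eps):
    # D_eps = sum_i lambda_i / (eps^2 / n + lambda_i), with n the ambient dimension.
    n = len(eigenvalues)
    floor = eps ** 2 / n  # the distortion allotted to each coordinate
    return sum(lam / (floor + lam) for lam in eigenvalues)
```

For the spectrum [4.0, 1.0, 0.0, 0.0] with eps = 0.2, two eigenvalues lie far above eps^2 / n = 0.01 and two are zero, so the returned value is close to 2, in line with the approximation for data near a d-dimensional subspace.
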
MICL\ntherefore rewards distributions that have relatively higher dimension.4 However, this effect is\nsomewhat countered by the regularization induced by \u03b5, which rewards lower dimensional distributions.\nFigure 2 (right) empirically compares MICL to the conventional MAP and the regularized MAP (or\nRDA [6]). We draw m samples from three nested Gaussian distributions: one of full rank n, one of\nrank n/2, and one of rank 1. All samples are corrupted by 4% Gaussian noise. We estimate the Bayes\nrisk for each (m, n) combination as in the previous example. The regularization parameter in RDA\nand the distortion \u03b5 for MICL are chosen independently for each trial by cross validation. Plotted\nare the (estimated) differences in risk, $R_{MAP} - R_{MICL}$ (Fig. 2 (c)) and $R_{RDA} - R_{MICL}$ (Fig. 2\n(d)). The red lines again correspond to the zero level-set of the difference. Unsurprisingly, MICL\noutperforms MAP for most (m, n), and the effect is most pronounced when n is large and m is\nsmall. When m is much smaller than n (e.g. the bottom row of Figure 2 right), MICL demonstrates\na significant performance gain with respect to RDA. As the number of samples increases, there is a\nregion where RDA is slightly better. For most (m, n), MICL and RDA are close in performance.\n\n2.4 Extensions to Non-Gaussian Data\n\nIn practice, the data distribution(s) of interest may not be Gaussian. If the rate-distortion function is\nknown, one could, in principle, carry out similar analysis as for the Gaussian case. Nevertheless, in\nthis subsection, we discuss two practical modifications to the MICL criterion that are applicable to\narbitrary distributions and preserve the desirable properties discussed in the previous subsections.\nKernel MICL Criterion. 
Since $X X^T$ and $X^T X$ have the same non-zero eigenvalues,\n\n$$\\log_2 \\det(I + \\alpha X X^T) = \\log_2 \\det(I + \\alpha X^T X). \\qquad (5)$$\n\nThis identity shows that L\u03b5(X ) can also be computed from the inner products between the xi. If the\ndata x (of each class) are not Gaussian but there exists a nonlinear map \u03c8 : Rn \u2192 H such that the\ntransformed data \u03c8(x) are (approximately) Gaussian, we can replace the inner product $x_1^T x_2$ with\na symmetric positive definite kernel function $k(x_1, x_2) \\doteq \\psi(x_1)^T \\psi(x_2)$. Choosing a proper kernel\nfunction will improve classification performance for non-Gaussian distributions. In practice, popular\nchoices include the polynomial kernel $k(x_1, x_2) = (x_1^T x_2 + 1)^d$, the radial basis function (RBF)\nkernel $k(x_1, x_2) = \\exp(-\\gamma \\|x_1 - x_2\\|^2)$ and their variants. Implementation details, including how\nto properly account for the mean and dimension of the embedded data, are given in [21].\nA similar transformation is used to generate nonlinear decision boundaries with SVM. Notice,\nhowever, that whereas SVM constructs a linear decision boundary in the lifted space H, kernel MICL\nexploits the covariance structure of the lifted data, generating decision boundaries that are\n(asymptotically) quadratic. In Section 3 we will see that even for real data whose statistical nature is unclear,\nkernel MICL outperforms SVM when applied with the same kernel function.\n\nLocal MICL Criterion. For real data whose distribution is unknown, it may be difficult to find an\nappropriate kernel function. In this case, MICL can still be applied locally, in a neighborhood of the\ntest sample x. Let $N^k(x)$ denote the k nearest neighbors of x in the training set X . Training data in\nthis neighborhood that belong to each class are $N_j^k(x) \\doteq \\mathcal{X}_j \\cap N^k(x)$, j = 1, . . . , K. In the MICL\nclassifier (Algorithm 1), we replace the incremental coding length \u03b4L\u03b5(x, j) by its local version:\n\n$$\\delta L_\\varepsilon(x, j) = L_\\varepsilon(N_j^k(x) \\cup \\{x\\}) - L_\\varepsilon(N_j^k(x)) + L(j), \\qquad (6)$$\n\nwith $L(j) = -\\log_2(|N_j^k(x)| / |N^k(x)|)$. Theorem 1 implies that this gives a universal classifier:\nCorollary 3. Suppose the conditional density pj(x) = p(x|y = j) of each class is nondegenerate.\nThen if k = o(m) and k, m \u2192 \u221e, the local MICL criterion converges to the MAP criterion (1).\nThis follows, since as the radius of the neighborhood shrinks, the cost of coding the class label,\n$-\\log_2(|N_j^k(x)| / |N^k(x)|) \\to -\\log_2 p_j(x)$, dominates the coding length (6). In this asymptotic\nsetting the local MICL criterion behaves like k-Nearest Neighbor (k-NN). However, the finite-sample\nbehavior of the local MICL criterion can differ drastically from that of k-NN, especially\nwhen the samples are sparse and the distributions involved are almost degenerate. In this case, from\n(4), local MICL effectively approximates the local shape of the distribution pj(x) by a (regularized)\nGaussian, exploiting structure in the distribution of the nearest neighbors (see Figure 3).\n\n3This quantity has been dubbed the effective number of parameters in the context of ridge regression [9].\n4This contrasts with the dimension penalties typical in model selection/estimation.\n\nFigure 3 (panels: (a) KMICL-RBF, (b) SVM-RBF, (c) LMICL, (d) 5-NN): Nonlinear extensions to MICL, compared to SVM and k-NN. Local MICL produces a\nsmoother and more intuitive decision boundary than k-NN. Kernel MICL and SVM produce similar\nboundaries, that are smoother and better respect the data structure than those given by local methods.\n\nMNIST:\nMethod | Error | Method | Error\nLMICL | 1.6% | SVM-Poly [20] | 1.4%\nk-NN | 3.1% | Best [18] | 0.4%\n\nUSPS:\nMethod | Error | Method | Error\nLMICL | 4.9% | KMICL-Poly | 4.7%\nk-NN | 5.3% | SVM-Poly [4] | 5.3%\n\nTable 1: Results for handwritten digit recognition. Left: MNIST dataset. Right: USPS dataset, with\nidentical preprocessing and kernel function. Here, kernel-MICL slightly outperforms SVM.\n\n3 Experiments with Real Imagery Data\n\nUsing experiments on real data, we demonstrate that MICL and its nonlinear variants approach the\nbest results from more sophisticated systems, without relying on domain-specific information.\n\nHandwritten Digit Recognition. We first test the MICL classifier on two standard datasets for\nhandwritten digit recognition (Table 1 left). The MNIST handwritten digit dataset [10] consists of\n60,000 training images and 10,000 test images. We achieved better results using the local version\nof MICL, due to the non-Gaussian distribution of the data. With k = 20 and \u03b5 = 150, local MICL\nachieves a test error of 1.59%, outperforming simple methods such as k-NN as well as many more\ncomplicated neural network approaches (e.g. LeNet-1 [10]). MICL\u2019s error rate approaches the\nbest result for a generic learning machine (1.4% error for SVM with a degree-4 polynomial kernel).\nProblem-specific approaches have resulted in lower error rates, however, with the best reported result\nachieved using a specially engineered neural network [18].\nWe also test on the challenging USPS digits database (Table 1 right). Here, even humans have\nconsiderable difficulties (\u2248 2.5% error). With k = 35 and \u03b5 = 0.03, local MICL achieves an error\nrate of 4.88%, again outperforming k-NN. We further compare the performance of kernel MICL\nto SVM (using [4]) on this dataset with the same homogeneous, degree 3 polynomial kernel, and\nidentical preprocessing (normalization and centering), allowing us to compare pure classification\nperformance. 
Here, SVM achieves a 5.3% error, while kernel-MICL achieves an error rate of 4.7%\nwith distortion \u03b5 = 0.0067 (chosen automatically by cross-validation). Using domain-specific\ninformation, one can achieve better results. For instance, [17] achieves 2.7% error using tangent distance\nto a large number of prototypes. Other preprocessing steps, synthetic training images, or more\nadvanced skew-correction and normalization techniques have been applied to lower the error rate for\nSVM (e.g. 4.1% in [20]). While we have avoided extensive preprocessing here, so as to isolate the\neffect of the classifier, such preprocessing can be readily incorporated into our framework.\n\nFace Recognition. We further verify MICL\u2019s effectiveness on sparsely sampled high-dimensional\ndata using the Yale Face Database B [7], which tests illumination sensitivity of face recognition\nalgorithms. Following [7, 11], we use subsets 1 and 2 for training, and report the average test error\nacross the four subsets. We apply Algorithm 1, not the local or kernel version, with \u03b5 = 75. MICL\nsignificantly outperforms classical subspace techniques on this problem (see Table 2), with an error of\n0.9%, near the best reported results in [7, 11] that were obtained using a domain-specific model of\nillumination for face images. We suggest that the source of this improved performance is precisely\nthe regularization induced by lossy coding. In this problem the number of training vectors per class,\n19, is small compared to the dimension, n = 32,256 (for raw 168 \u00d7 192 images).\n\nMethod | Error | Method | Error\nMICL | 0.9% | Eigenface [7] | 25.8%\nSubspace [7] | 4.6% | Best [11] | 0%\n(Training: Subsets 1, 2. Testing: Subsets 1-4.)\n\nTable 2: Face recognition under widely varying illumination. MICL outperforms classical face\nrecognition methods such as Eigenfaces on Yale Face Database B [7].\n\n
Simulations (e.g.\nFigure 2) show that this is exactly the circumstance in which MICL is superior to MAP and even\nRDA. Interestingly, this suggests that directly exploiting degenerate or low-dimensional structures\nvia MICL renders dimensionality reduction before classifying unnecessary or even undesirable.\n\n4 Conclusion\n\nWe have proposed and studied a new information theoretic classi\ufb01cation criterion, Minimum In-\ncremental Coding Length (MICL), establishing its optimality for Gaussian data. MICL generates a\nfamily of classi\ufb01ers that inherit many of the good properties of MAP, RDA, and k-NN, while ex-\ntending their working conditions to sparsely sampled or degenerate high-dimensional observations.\nMICL and its kernel and local versions approach best reported performance on high-dimensional vi-\nsual recognition problems without domain-speci\ufb01c engineering. Due to its simplicity and \ufb02exibility,\nwe believe MICL can be successfully applied to a wide range of real-world classi\ufb01cation problems.\n\nReferences\n[1] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling.\n\nIEEE Transactions on Information Theory, 44(6):2743\u20132760, 1998.\n\n[2] R. Basri and D. Jacobs. Lambertian re\ufb02ection and linear subspaces. PAMI, 25(2):218\u2013 233, 2003.\n[3] P. Bickel and B. Li. Regularization in statistics. TEST, 15(2):271\u2013344, 2006.\n[4] C. Chang and C. Lin. LIBSVM: a library for support vector machines, 2001.\n[5] T. Cover and J. Thomas. Elements of Information Theory. Wiley Series in Telecommunications, 1991.\n[6] J. Friedman. Regularized discriminant analysis. JASA, 84:165\u2013175, 1989.\n[7] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face\n\nrecognition under variable lighting and pose. PAMI, 23(6):643\u2013660, 2001.\n\n[8] P. Grunwald and J. Langford. 
Suboptimal behaviour of Bayes and MDL in classification under misspecification. In Proceedings of Conference on Learning Theory, 2004.\n[9] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.\n[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n[11] K. Lee, J. Ho, and D. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. PAMI, 27(5):684\u2013698, 2005.\n[12] J. Li. A source coding approach to classification by vector quantization and the principle of minimum description length. In IEEE DCC, pages 382\u2013391, 2002.\n[13] Y. Ma, H. Derksen, W. Hong, and J. Wright. Segmentation of multivariate mixed data via lossy data coding and compression. PAMI, 29(9):1546\u20131562, 2007.\n[14] D. MacKay. Developments in probabilistic modelling with neural networks \u2013 ensemble learning. In Proc. 3rd Annual Symposium on Neural Networks, pages 191\u2013198, 1995.\n[15] M. Madiman, M. Harrison, and I. Kontoyiannis. Minimum description length vs. maximum likelihood in lossy data compression. In IEEE International Symposium on Information Theory, 2004.\n[16] J. Rissanen. Modeling by shortest data description. Automatica, 14:465\u2013471, 1978.\n[17] P. Simard, Y. LeCun, and J. Denker. Efficient pattern recognition using a new transformation distance. In Proceedings of NIPS, volume 5, 1993.\n[18] P. Simard, D. Steinkraus, and J. Platt. Best practice for convolutional neural networks applied to visual document analysis. In ICDAR, pages 958\u2013962, 2003.\n[19] N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Allerton, 1999.\n[20] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 2000.\n[21] J. Wright, Y. Tao, Z. Lin, Y. Ma, and H. Shum. 
Classification via minimum incremental coding length (MICL). Technical report, UILU-ENG-07-2201, http://perception.csl.uiuc.edu/coding, 2007.\n", "award": [], "sourceid": 23, "authors": [{"given_name": "John", "family_name": "Wright", "institution": null}, {"given_name": "Yangyu", "family_name": "Tao", "institution": null}, {"given_name": "Zhouchen", "family_name": "Lin", "institution": null}, {"given_name": "Yi", "family_name": "Ma", "institution": null}, {"given_name": "Heung-yeung", "family_name": "Shum", "institution": null}]}