{"title": "Compressive Feature Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2931, "page_last": 2939, "abstract": "This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word $k$-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full $k$-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning.", "full_text": "Compressive Feature Learning\n\nHristo S. Paskov\n\nDepartment of Computer Science\n\nStanford University\n\nhpaskov@cs.stanford.edu\n\nJohn C. Mitchell\n\nDepartment of Computer Science\n\nStanford University\n\nmitchell@cs.stanford.edu\n\nRobert West\n\nDepartment of Computer Science\n\nStanford University\n\nwest@cs.stanford.edu\n\nTrevor J. 
Hastie\n\nDepartment of Statistics\n\nStanford University\n\nhastie@stanford.edu\n\nAbstract\n\nThis paper addresses the problem of unsupervised feature learning for text data.\nOur method is grounded in the principle of minimum description length and uses\na dictionary-based compression scheme to extract a succinct feature set. Specif-\nically, our method \ufb01nds a set of word k-grams that minimizes the cost of re-\nconstructing the text losslessly. We formulate document compression as a bi-\nnary optimization task and show how to solve it approximately via a sequence of\nreweighted linear programs that are ef\ufb01cient to solve and parallelizable. As our\nmethod is unsupervised, features may be extracted once and subsequently used in\na variety of tasks. We demonstrate the performance of these features over a range\nof scenarios including unsupervised exploratory analysis and supervised text cate-\ngorization. Our compressed feature space is two orders of magnitude smaller than\nthe full k-gram space and matches the text categorization accuracy achieved in the\nfull feature space. This dimensionality reduction not only results in faster training\ntimes, but it can also help elucidate structure in unsupervised learning tasks and\nreduce the amount of training data necessary for supervised learning.\n\n1\n\nIntroduction\n\nMachine learning algorithms rely critically on the features used to represent data; the feature set\nprovides the primary interface through which an algorithm can reason about the data at hand. A typ-\nical pitfall for many learning problems is that there are too many potential features to choose from.\nIntelligent subselection is essential in these scenarios because it can discard noise from irrelevant\nfeatures, thereby requiring fewer training examples and preventing over\ufb01tting. 
Computationally, a smaller feature set is almost always advantageous as it requires less time and space to train the algorithm and make inferences [10, 9].\n\nVarious heuristics have been proposed for feature selection, one class of which works by evaluating each feature separately with respect to its discriminative power. Some examples are document frequency, chi-square value, information gain, and mutual information [26, 9]. More sophisticated methods attempt to achieve feature sparsity by optimizing objective functions containing an L1 regularization penalty [25, 27].\n\nUnsupervised feature selection methods [19, 18, 29, 13] are particularly attractive. First, they do not require labeled examples, which are often expensive to obtain (e.g., when humans have to provide them) or might not be available in advance (e.g., in text classification, the topic to be retrieved might be defined only at some later point). Second, they can be run a single time in an offline preprocessing step, producing a reduced feature space that allows for subsequent rapid experimentation. Finally, a good data representation obtained in an unsupervised way captures inherent structure and can be used in a variety of machine learning tasks such as clustering, classification, or ranking.\n\nIn this work we present a novel unsupervised method for feature selection for text data based on ideas from data compression and formulated as an optimization problem. As the universe of potential features, we consider the set of all word k-grams.1 The basic intuition is that substrings appearing frequently in a corpus represent a recurring theme in some of the documents, and hence pertain to class representation. However, it is not immediately clear how to implement this intuition. For instance, consider a corpus of NIPS papers. 
The bigram \u2018supervised learning\u2019 will appear often, but\nso will the constituent unigrams \u2018supervised\u2019 and \u2018learning\u2019. So shall we use the bigram, the two\nseparate unigrams, or a combination, as features?\nOur solution invokes the principle of minimum description length (MDL) [23]: First, we compress\nthe corpus using a dictionary-based lossless compression method. Then, the substrings that are used\nto reconstruct each document serve as the feature set. We formulate the compression task as a nu-\nmerical optimization problem. The problem is non-convex, but we develop an ef\ufb01cient approximate\nalgorithm that is linear in the number of words in the corpus and highly parallelizable. In the ex-\nample, the bigram \u2018supervised learning\u2019 would appear often enough to be added to the dictionary;\n\u2018supervised\u2019 and \u2018learning\u2019 would also be chosen as features if they appear separately in combina-\ntions other than \u2018supervised learning\u2019 (because the compression paradigm we choose is lossless).\nWe apply our method to two datasets and compare it to a canonical bag-of-k-grams representation.\nOur method reduces the feature set size by two orders of magnitude without incurring a loss of\nperformance on several text categorization tasks. Moreover, it expedites training times and requires\nsigni\ufb01cantly less labeled training data on some text categorization tasks.\n\n2 Compression and Machine Learning\n\nOur work draws on a deep connection between data compression and machine learning, exempli-\n\ufb01ed early on by the celebrated MDL principle [23]. More recently, researchers have experimented\nwith off-the-shelf compression algorithms as machine learning subroutines. 
Instances are Frank et\nal.\u2019s [7] compression-based approach to text categorization, as well as compression-based distance\nmeasures, where the basic intuition is that, if two texts x and y are very similar, then the compressed\nversion of their concatenation xy should not be much longer than the compressed version of either\nx or y separately. Such approaches have been shown to work well on a variety of tasks such as\nlanguage clustering [1], authorship attribution [1], time-series clustering [6, 11], anomaly detection\n[11], and spam \ufb01ltering [3].\nDistance-based approaches are akin to kernel methods, and thus suffer from the problem that con-\nstructing the full kernel matrix for large datasets might be infeasible. Furthermore, Frank et al.\n[7] deplore that \u201cit is hard to see how ef\ufb01cient feature selection could be incorporated\u201d into the\ncompression algorithm. But Sculley and Brodley [24] show that many compression-based distance\nmeasures can be interpreted as operating in an implicit high-dimensional feature space, spanned by\nthe dictionary elements found during compression. We build on this observation to address Frank et\nal.\u2019s above-cited concern about the impossibility of feature selection for compression-based meth-\nods. Instead of using an off-the-shelf compression algorithm as a black-box kernel operating in\nan implicit high-dimensional feature space, we develop an optimization-based compression scheme\nwhose explicit job it is to perform feature selection.\nIt is illuminating to discuss a related approach suggested (as future work) by Sculley and Brodley\n[24], namely \u201cto store substrings found by Lempel\u2013Ziv schemes as explicit features\u201d. This simplistic\napproach suffers from a serious \ufb02aw that our method overcomes. Imagine we want to extract features\nfrom an entire corpus. 
We would proceed by concatenating all documents in the corpus into a single large document D, which we would compress using a Lempel\u2013Ziv algorithm. The problem is that the extracted substrings are dependent on the order in which we concatenate the documents to form the input D. For the sake of concreteness, consider LZ77 [28], a prominent member of the Lempel\u2013Ziv family (but the argument applies equally to most standard compression algorithms). Starting from the current cursor position, LZ77 scans D from left to right, consuming characters until it has found the longest prefix matching a previously seen substring. It then outputs a pointer to that previous instance\u2014we interpret this substring as a feature\u2014and continues with the remaining input string (if no prefix matches, the single next character is output). This approach produces different feature sets depending on the order in which documents are concatenated. Even in small instances such as the 3-document collection {D1 = abcd, D2 = ceab, D3 = bce}, the order (D1, D2, D3) yields the feature set {ab, bc}, whereas (D2, D3, D1) results in {ce, ab} (plus, trivially, the set of all single characters).\n\nAs we will demonstrate in our experiments section, this instability has a real impact on performance and is therefore undesirable.\n\n1In the remainder of this paper, the term \u2018k-grams\u2019 includes sequences of up to (rather than exactly) k words.\n\nFigure 1: Toy example of our optimization problem for text compression. Three different solutions shown for representing the 8-word document D = manamana in terms of dictionary and pointers. Dictionary cost: number of characters in dictionary. Pointer cost: \u03bb \u00d7 number of pointers. Costs given as dictionary cost + pointer cost. Left: dictionary cost only (\u03bb = 0). Right: expensive pointer cost (\u03bb = 8). Center: balance of dictionary and pointer costs (\u03bb = 1). 
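The order dependence in the 3-document example above can be reproduced with a minimal greedy matcher. This is a deliberately simplified LZ77-style pass, not the full algorithm (no sliding window, no pointer encoding), and its exact feature sets can differ from the variant discussed in the text; the point it illustrates is only that concatenation order changes the extracted substrings.

```python
def lz77_features(text, min_len=2):
    """Greedy LZ77-style pass: at each position, take the longest prefix of
    the remaining input that occurs in the already-scanned text; record
    matches of at least min_len characters as 'features'."""
    feats, i = set(), 0
    while i < len(text):
        seen = text[:i]
        match_len = 0
        for l in range(1, len(text) - i + 1):
            if text[i:i + l] in seen:
                match_len = l
            else:
                break
        if match_len >= min_len:
            feats.add(text[i:i + match_len])
        i += max(match_len, 1)  # emit a single character if nothing matched
    return feats

# concatenating the same three documents in two different orders
print(lz77_features("abcd" + "ceab" + "bce"))  # {'ab', 'bc'} for this variant
print(lz77_features("ceab" + "bce" + "abcd"))  # a different set
```

Either way, the multi-character substrings found depend on the concatenation order, which is exactly the instability the experiments section measures.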
Our approach, like LZ77, seeks common substrings. However, our\nformulation is not affected by the concatenation order of corpus documents and does not suffer from\nLZ77\u2019s instability issues.\n\n3 Compressive Feature Learning\n\nThe MDL principle implies that a good feature representation for a document D = x1x2 . . .xn of\nn words minimizes some description length of D. Our dictionary-based compression scheme ac-\ncomplishes this by representing D as a dictionary\u2014a subset of D\u2019s substrings\u2014and a sequence of\npointers indicating where copies of each dictionary element should be placed in order to fully recon-\nstruct the document. The compressed representation is chosen so as to minimize the cost of storing\neach dictionary element in plaintext as well as all necessary pointers. This scheme achieves a shorter\ndescription length whenever it can reuse dictionary elements at different locations in D.\nFor a concrete example, see Fig. 1, which shows three ways of representing a document D in terms\nof a dictionary and pointers. These representations are obtained by using the same pointer storage\ncost \u03bb for each pointer and varying \u03bb. The two extreme solutions focus on minimizing either the\ndictionary cost (\u03bb = 0) or the pointer cost (\u03bb = 8) solely, while the middle solution (\u03bb = 1) trades\noff between minimizing a combination of the two. We are particularly interested in this tradeoff:\nwhen all pointers have the same cost, the dictionary and pointer costs pull the solution in opposite\ndirections. Varying \u03bb allows us to \u2018interpolate\u2019 between the two extremes of minimum dictionary\ncost and minimum pointer cost. In other words, \u03bb can be interpreted as tracing out a regularization\npath that allows a more \ufb02exible representation of D.\nTo formalize our compression criterion, let S = {xi . . 
.xi+t\u22121 | 1 \u2264 t \u2264 k, 1 \u2264 i \u2264 n\u2212t+1} be the set of all unique k-grams in D, and P = {(s,l) | s = xl . . . xl+|s|\u22121} be the set of all m = |P| (potential) pointers. Without loss of generality, we assume that P is an ordered set, i.e., each i \u2208 {1, . . . , m} corresponds to a unique pi \u2208 P, and we define J(s) \u2282 {1, . . . , m} to be the set of indices of all pointers which share the same string s. Given a binary vector w \u2208 {0,1}^m, w reconstructs word xj if for some wi = 1 the corresponding pointer pi = (s,l) satisfies l \u2264 j < l + |s|. This notation uses wi to indicate whether pointer pi should be used to reconstruct (part of) D by pasting a copy of string s into location l. Finally, w reconstructs D if every xj is reconstructed by w.\n\nCompressing D can be cast as a binary linear minimization problem over w; this bit vector tells us which pointers to use in the compressed representation of D and it implicitly defines the dictionary (a subset of S). In order to ensure that w reconstructs D, we require that Xw \u2265 1. Here X \u2208 {0,1}^{n\u00d7m} indicates which words each wi = 1 can reconstruct: the i-th column of X is zero everywhere except for a contiguous sequence of ones corresponding to the words which wi = 1 reconstructs. 
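On a toy document the search space is small enough to enumerate exhaustively, which makes the pointer set, the coverage constraint Xw \u2265 1, and the dictionary/pointer cost tradeoff of Fig. 1 concrete. The sketch below (hypothetical helper name; dictionary cost = word length of each distinct string used, pointer cost = \u03bb per pointer, as in Fig. 1) brute-forces the best binary w for a 4-word document; it is an illustration, not the paper's algorithm.

```python
from itertools import product

def compress_brute_force(words, k, lam):
    """Enumerate all binary pointer vectors w and return the cheapest lossless
    representation: cost = lam * (#pointers used) + total word-length of the
    distinct strings those pointers use (the implicit dictionary)."""
    n = len(words)
    # all potential pointers (s, l): substring s of <= k words starting at l
    pointers = [(tuple(words[l:l + t]), l)
                for t in range(1, k + 1) for l in range(n - t + 1)]
    best = None
    for bits in product([0, 1], repeat=len(pointers)):
        used = [p for p, b in zip(pointers, bits) if b]
        covered = set()
        for s, l in used:
            covered.update(range(l, l + len(s)))
        if covered != set(range(n)):   # Xw >= 1: every word must be reconstructed
            continue
        dict_cost = sum(len(s) for s in {s for s, _ in used})
        cost = dict_cost + lam * len(used)
        if best is None or cost < best[0]:
            best = (cost, used)
    return best

doc = ["a", "b", "a", "b"]
for lam in [0, 1, 3]:   # tracing out the regularization path of Fig. 1
    cost, used = compress_brute_force(doc, k=4, lam=lam)
    print(lam, cost, sorted({" ".join(s) for s, _ in used}))
```

Raising \u03bb moves the optimum from cheap dictionaries built of short pieces (\u03bb = 0) through reused mid-length strings ("a b" pasted twice at \u03bb = 1) to a single pointer covering the whole document (\u03bb = 3), mirroring the three solutions in Fig. 1.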
Next, we assume the pointer storage cost of setting wi = 1 is given by di \u2265 0 and that the cost of storing any s \u2208 S is c(s). Note that s must be stored in the dictionary if \u2016w_{J(s)}\u2016_\u221e = 1, i.e., some pointer using s is used in the compression of D. Putting everything together, our lossless compression criterion is\n\nminimize_w  w^T d + \u2211_{s\u2208S} c(s) \u2016w_{J(s)}\u2016_\u221e  subject to  Xw \u2265 1, w \u2208 {0,1}^m.  (1)\n\nFinally, multiple documents can be compressed jointly by concatenating them in any order into a large document and disallowing any pointers that span document boundaries. Since this objective is invariant to the document concatenation order, it does not suffer from the same problems as LZ77 (cf. Section 2).\n\n4 Optimization Algorithm\n\nThe binary constraint makes the problem in (1) non-convex. We solve it approximately via a series of related convex problems P^{(1)}, P^{(2)}, . . . that converge to a good optimum. Each P^{(i)} relaxes the binary constraint to only require 0 \u2264 w \u2264 1 and solves the weighted optimization problem\n\nminimize_w  w^T \u02dcd^{(i)} + \u2211_{s\u2208S} c(s) \u2016D^{(i)}_{J(s)J(s)} w_{J(s)}\u2016_\u221e  subject to  Xw \u2265 1, 0 \u2264 w \u2264 1.  (2)\n\nHere, D^{(i)} is an m \u00d7 m diagonal matrix of positive weights and \u02dcd^{(i)} = D^{(i)}d for brevity. We use an iterative reweighting scheme that sets D^{(1)} = I and updates the weights from the solution w^{(i)} to P^{(i)}. This scheme is inspired by the iterative reweighting method of Cand\u00e8s et al. [5] for solving problems involving L0 regularization. At a high level, reweighting can be motivated by noting that (2) recovers the correct binary solution if \u03b5 is sufficiently small and we use as weights a nearly binary solution to (1). Since we do not know the correct weights, we estimate them from our best guess to the solution of (1). 
In turn, the weights D^{(i+1)}_{jj} = max{1, (w^{(i)}_j + \u03b5)^{\u22121}} punish coefficients that were small in w^{(i)} and, taken together with the constraint Xw \u2265 1, push the solution to be binary.\n\nADMM Solution We demonstrate an efficient and parallel algorithm to solve (2) based on the Alternating Direction Method of Multipliers (ADMM) [2]. Problem (2) is a linear program solvable by a general-purpose method in O(m^3) time. However, if all potential dictionary elements are no longer than k words in length, we can use problem structure to achieve a run time of O(k^2 n) per step of ADMM, i.e., linear in the document length. This is helpful because k is relatively small in most scenarios: long k-grams tend to appear only once and are not helpful for compression. Moreover, they are rarely used in NLP applications since the relevant signal is captured by smaller fragments.\n\nADMM is an optimization framework that operates by splitting a problem into two subproblems that are individually easier to solve. It alternates solving the subproblems until they both agree on the solution, at which point the full optimization problem has been solved. More formally, the optimum of a convex function h(w) = f(w) + g(w) can be found by minimizing f(w) + g(z) subject to the constraint that w = z. ADMM accomplishes this by operating on the augmented Lagrangian\n\nL_\u03c1(w,z,y) = f(w) + g(z) + y^T(w \u2212 z) + (\u03c1/2)\u2016w \u2212 z\u2016_2^2.  (3)\n\nIt minimizes L_\u03c1 with respect to w and z while maximizing with respect to the dual variable y \u2208 R^m in order to enforce the condition w = z. This minimization is accomplished by, at each step, solving for w, then z, then updating y accordingly [2]. 
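The alternating structure can be illustrated on a small self-contained instance. This is not the paper's compression objective: here f is a simple quadratic and g the indicator of {z \u2265 1}, chosen so that the answer, the projection of v onto {x \u2265 1}, is known in closed form and both subproblem minimizers are one-liners.

```python
import numpy as np

def admm_project_above_one(v, rho=1.0, iters=200):
    """ADMM for: minimize 0.5*||w - v||^2 + I+(z - 1)  subject to  w = z.
    The optimum is the projection of v onto {x : x >= 1}."""
    w = np.zeros_like(v)
    z = np.ones_like(v)
    y = np.zeros_like(v)                      # dual variable enforcing w = z
    for _ in range(iters):
        # w-step: argmin_w 0.5||w - v||^2 + y^T w + (rho/2)||w - z||^2
        w = (v + rho * z - y) / (1.0 + rho)
        # z-step: argmin_z I+(z - 1) - y^T z + (rho/2)||w - z||^2
        z = np.maximum(w + y / rho, 1.0)
        # dual ascent on y
        y = y + rho * (w - z)
    return z

print(admm_project_above_one(np.array([0.2, 3.0, -1.0])))  # ~ [1.0, 3.0, 1.0]
```

The compression problem replaces the quadratic with f of (4) and the indicator with the coverage polyhedron, but the loop skeleton (w-step, z-step, dual update) is the same.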
These steps are repeated until convergence.\n\nDropping the D^{(i)} superscripts for legibility, we can exploit problem structure by splitting (2) into\n\nf(w) = w^T \u02dcd + \u2211_{s\u2208S} c(s)\u2016D_{J(s)J(s)} w_{J(s)}\u2016_\u221e + I+(w),  g(z) = I+(Xz \u2212 1)  (4)\n\nwhere I+(\u00b7) is 0 if its argument is non-negative and \u221e otherwise. We eliminated the w \u2264 1 constraint because it is unnecessary\u2014any optimal solution will automatically satisfy it.\n\nMinimizing w The dual of this problem is a quadratic knapsack problem solvable in linear expected time [4]; we provide a similar algorithm that solves the primal formulation. We solve for each w_{J(s)} separately since the optimization is separable in each block of variables. It can be shown [21] that w_{J(s)} = 0 if \u2016D^{\u22121}_{J(s)J(s)} q_{J(s)}\u2016_1 \u2264 c(s), where q_{J(s)} = max{\u03c1 z_{J(s)} \u2212 \u02dcd_{J(s)} \u2212 y_{J(s)}, 0} and the max operation is applied elementwise. Otherwise, w_{J(s)} is non-zero and the L\u221e norm only affects the maximal coordinates of D_{J(s)J(s)} w_{J(s)}. For simplicity of exposition, we assume that the coefficients of w_{J(s)} are sorted in decreasing order according to D_{J(s)J(s)} q_{J(s)}, i.e., [D_{J(s)J(s)} q_{J(s)}]_j \u2265 [D_{J(s)J(s)} q_{J(s)}]_{j+1}. This is always possible by permuting coordinates. We show in [21] that, if D_{J(s)J(s)} w_{J(s)} has r maximal coordinates, then\n\nw_{J(s)_j} = D^{\u22121}_{J(s)_j J(s)_j} min{ D_{J(s)_j J(s)_j} q_{J(s)_j},  (\u2211_{v=1}^{r} D^{\u22121}_{J(s)_v J(s)_v} q_{J(s)_v} \u2212 c(s)) / (\u2211_{v=1}^{r} D^{\u22122}_{J(s)_v J(s)_v}) }.  (5)\n\nWe can find r by searching for the smallest value of r for which exactly r coefficients in D_{J(s)J(s)} w_{J(s)} are maximal when determined by the formula above. 
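In the unweighted special case D_{J(s)J(s)} = I (and with the \u03c1-scaling folded into q), the update (5) reduces to w_j = min(q_j, t) with a single threshold t set over the r largest coordinates; this is the proximal operator of c\u00b7\u2016\u00b7\u2016\u221e applied to a non-negative q, i.e., the minimizer of (1/2)\u2016w \u2212 q\u2016^2 + c max_j w_j. A direct sketch of this simplified case (hypothetical helper name, not the paper's full weighted, linear-expected-time algorithm):

```python
def group_update(q, c):
    """Solve min_w 0.5*||w - q||^2 + c*max(w) for elementwise q >= 0
    (the D = I special case of the blockwise update): w_j = min(q_j, t)."""
    if sum(q) <= c:                 # ||q||_1 <= c  =>  the whole block is zeroed
        return [0.0] * len(q)
    qs = sorted(q, reverse=True)
    t = 0.0
    for r in range(1, len(qs) + 1):
        t = (sum(qs[:r]) - c) / r            # candidate threshold for r maxima
        if r == len(qs) or qs[r] <= t:       # exactly r coordinates sit above t
            break
    return [min(qj, t) for qj in q]

print(group_update([3.0, 1.0], c=1.0))   # [2.0, 1.0]
print(group_update([1.0, 1.0], c=3.0))   # [0.0, 0.0] -- the block is zeroed
```

The first branch is the zeroing test from the text (\u2016q\u2016_1 \u2264 c); the loop is the search over r, growing the set of tied maximal coordinates until the threshold is consistent.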
As discussed in [21], an algorithm similar to the linear-time median-finding algorithm can be used to determine w_{J(s)} in linear expected time.\n\nMinimizing z Solving for z is tantamount to projecting a weighted combination of w and y onto the polyhedron given by Xz \u2265 1 and is best solved by taking the dual. It can be shown [21] that the dual optimization problem is\n\nminimize_\u03b1  (1/2)\u03b1^T H \u03b1 \u2212 \u03b1^T(\u03c11 \u2212 X(y + \u03c1w))  subject to  \u03b1 \u2265 0  (6)\n\nwhere \u03b1 \u2208 R^n_+ is a dual variable enforcing Xz \u2265 1 and H = XX^T. Strong duality obtains, and z can be recovered via z = \u03c1^{\u22121}(y + \u03c1w + X^T \u03b1).\n\nThe matrix H has special structure when S is a set of k-grams no longer than k words. In this case, [21] shows that H is a (k \u2212 1)-banded positive definite matrix, so we can find its Cholesky decomposition in O(k^2 n). We then use an active-set Newton method [12] to solve (6) quickly in approximately 5 Cholesky decompositions. A second important property of H is that, if N documents n_1, . . . , n_N words long are compressed jointly and no k-gram spans two documents, then H is block-diagonal with block i an n_i \u00d7 n_i (k \u2212 1)-banded matrix. This allows us to solve (6) separately for each document. Since the majority of the time is spent solving for z, this property allows us to parallelize the algorithm and speed it up considerably.\n\n5 Experiments\n\n20 Newsgroups Dataset The majority of our experiments are performed on the 20 Newsgroups dataset [15, 22], a collection of about 19K messages approximately evenly split among 20 different newsgroups. Since each newsgroup discusses a different topic, some more closely related than others, we investigate our compressed features\u2019 ability to elucidate class structure in supervised and unsupervised learning scenarios. 
We use the \u201cby-date\u201d 60%/40% training/testing split described in\n[22] for all classi\ufb01cation tasks. This split makes our results comparable to the existing literature\nand makes the task more dif\ufb01cult by removing correlations from messages that are responses to one\nanother.\n\n5\n\n\fFeature Extraction and Training We compute a bag-of-k-grams representation from a com-\npressed document by counting the number of pointers that use each substring in the compressed\nversion of the document. This method retrieves the canonical bag-of-k-grams representation when\nall pointers are used, i.e., w = 1. Our compression criterion therefore leads to a less redundant\nrepresentation. Note that we extract features for a document corpus by compressing all of its doc-\numents jointly and then splitting into testing and training sets. Since this process involves no label\ninformation, it ensures that our estimate of testing error is unbiased.\nAll experiments were limited to using 5-grams as features, i.e., k = 5 for our compression algorithm.\nEach substring\u2019s dictionary cost was its word length and the pointer cost was uniformly set to 0 \u2264\n\u03bb \u2264 5. We found that an overly large \u03bb hurts accuracy more than an overly small value since the\nformer produces long, infrequent substrings, while the latter tends to a unigram representation. It is\nalso worthwhile to note that the storage cost (i.e., the value of the objective function) of the binary\nsolution was never more than 1.006 times the storage cost of the relaxed solution, indicating that we\nconsistently found a good local optimum.\nFinally, all classi\ufb01cation tasks use an Elastic-Net\u2013regularized logistic regression classi\ufb01er imple-\nmented by glmnet [8]. Since this regularizer is a mix of L1 and L2 penalties, it is useful for feature\nselection but can also be used as a simple L2 ridge penalty. 
Before training, we normalize each document by its L1 norm and then normalize features by their standard deviation. We use this scheme so as to prevent overly long documents from dominating the feature normalization.\n\nLZ77 Comparison Our first experiment demonstrates LZ77\u2019s sensitivity to document ordering on a simple binary classification task of predicting whether a document is from the alt.atheism (A) or comp.graphics (G) newsgroup. Features were computed by concatenating documents in different orders: (1) by class, i.e., all documents in A before those in G, or G before A; (2) randomly; (3) by alternating the class every other document. Fig. 2 shows the testing error compared to features computed from our criterion. Error bars were estimated by bootstrapping the testing set 100 times, and all regularization parameters were chosen to minimize testing error while \u03bb was fixed at 0.03. As predicted in Section 2, document ordering has a marked impact on performance, with the by-class and random orders performing significantly worse than the alternating ordering. Moreover, order invariance and the ability to tune the pointer cost lets our criterion select a better set of 5-grams.\n\nFigure 2: Misclassification error and standard error bars when classifying alt.atheism (A) vs. comp.graphics (G) from 20 Newsgroups. The four leftmost results are on features from running LZ77 on documents ordered by class (AG, GA), randomly (Rand), or by alternating classes (Alt); the rightmost is on our compressed features.\n\nPCA Next, we investigate our features in a typical exploratory analysis scenario: a researcher looking for interesting structure by plotting all pairs of the top 10 principal components of the data. 
In particular, we verify PCA\u2019s ability to recover binary class structure for the A and G newsgroups, as well as multiclass structure for the A, comp.sys.ibm.pc.hardware (PC), rec.motorcycles (M), sci.space (S), and talk.politics.mideast (PM) newsgroups. Fig. 3 plots the pair of principal components that best exemplifies class structure using (1) compressed features and (2) all 5-grams. For the sake of fairness, the components were picked by training a logistic regression on every pair of the top 10 principal components and selecting the pair with the lowest training error. In both the binary and multiclass scenarios, PCA is inundated by millions of features when using all 5-grams and cannot display good class structure. In contrast, compression reduces the feature set to tens of thousands (by two orders of magnitude) and clearly shows class structure. The star pattern of the five classes stands out even when class labels are hidden.\n\nFigure 3: PCA plots for 20 Newsgroups. Left: alt.atheism (blue), comp.graphics (red). Right: alt.atheism (blue), comp.sys.ibm.pc.hardware (green), rec.motorcycles (red), sci.space (cyan), talk.politics.mideast (magenta). Top: compressed features (our method). 
Bottom: all 5-grams.\n\nTable 1: Classification accuracy on the 20 Newsgroups and IMDb datasets\n\nMethod | 20 Newsgroups | IMDb\nDiscriminative RBM [16] | 76.2 | \u2014\nBag-of-Words SVM [14, 20] | 80.8 | 88.2\nNa\u00efve Bayes [17] | 81.8 | \u2014\nWord Vectors [20] | \u2014 | 88.9\nAll 5-grams | 82.8 | 90.6\nCompressed (our method) | 83.0 | 90.4\n\nClassification Tasks Table 1 compares the performance of compressed features with all 5-grams on two tasks: (1) categorizing posts from the 20 Newsgroups corpus into one of 20 classes; (2) categorizing movie reviews collected from IMDb [20] into one of two classes (there are 25,000 training and 25,000 testing examples evenly split between the classes). For completeness, we include comparisons with previous work for 20 Newsgroups [16, 14, 17] and IMDb [20]. All regularization parameters, including \u03bb, were chosen through 10-fold cross-validation on the training set. We also did not L1-normalize documents in the binary task because it was found to be counterproductive on the training set.\n\nOur classification performance is state of the art in both tasks, with the compressed and all-5-gram features tied in performance. Since both datasets feature copious amounts of labeled data, we expect the 5-gram features to do well because of the power of the Elastic-Net regularizer. What is remarkable is that the compression retains useful features without using any label information. There are tens of millions of 5-grams, but compression reduces them to hundreds of thousands (by two orders of magnitude). This has a particularly noticeable impact on training time for the 20 Newsgroups dataset. 
Cross-validation takes 1 hour with compressed features and 8\u201316 hours for all 5-grams on our reference computer depending on the sparsity of the resulting classifier.\n\nFigure 4: Classification accuracy as the training set size varies for two classification tasks from 20 Newsgroups: (a) alt.atheism (A) vs. comp.graphics (G); (b) rec.sport.baseball (B) vs. rec.sport.hockey (H). To demonstrate the effects of feature selection, L2 indicates L2-regularization while EN indicates elastic-net regularization.\n\nTraining-Set Size Our final experiment explores the impact of training-set size on binary-classification accuracy for the A vs. G and rec.sport.baseball (B) vs. rec.sport.hockey (H) newsgroups. Fig. 4 plots testing error as the amount of training data varies, comparing compressed features to full 5-grams; we explore the latter with and without feature selection enabled (i.e., Elastic Net vs. L2 regularizer). We resampled the training set 100 times for each training-set size and report the average accuracy. All regularization parameters were chosen to minimize the testing error (so as to eliminate effects from imperfect tuning) and \u03bb = 0.03 in both tasks. For the A\u2013G task, the compressed features require substantially less data than the full 5-grams to come close to their best testing error. The B\u2013H task is harder and all three classifiers benefit from more training data, although the gap between compressed features and all 5-grams is widest when less than half of the training data is available. In all cases, the compressed features outperform the full 5-grams, indicating that the latter may benefit from even more training data. 
In future work it will be interesting to investigate\nthe ef\ufb01cacy of compressed features on more intelligent sampling schemes such as active learning.\n\n6 Discussion\n\nWe develop a feature selection method for text based on lossless data compression. It is unsupervised\nand can thus be run as a task-independent, one-off preprocessing step on a corpus. Our method\nachieves state-of-the-art classi\ufb01cation accuracy on two benchmark datasets despite selecting features\nwithout any knowledge of the class labels. In experiments comparing it to a full 5-gram model, our\nmethod reduces the feature-set size by two orders of magnitude and requires only a fraction of the\ntime to train a classi\ufb01er. It selects a compact feature set that can require signi\ufb01cantly less training\ndata and reveals unsupervised problem structure (e.g., when using PCA).\nOur compression scheme is more robust and less arbitrary compared to a setup which uses off-the-\nshelf compression algorithms to extract features from a document corpus. At the same time, our\nmethod has increased \ufb02exibility since the target k-gram length is a tunable parameter. Importantly,\nthe algorithm we present is based on iterative reweighting and ADMM and is fast enough\u2014linear\nin the input size when k is \ufb01xed, and highly parallelizable\u2014to allow for computing a regularization\npath of features by varying the pointer cost. Thus, we may adapt the compression to the data at hand\nand select features that better elucidate its structure.\nFinally, even though we focus on text data in this paper, our method is applicable to any sequential\ndata where the sequence elements are drawn from a \ufb01nite set (such as the universe of words in the\ncase of text data). In future work we plan to compress click stream data from users browsing the\nWeb. 
We also plan to experiment with approximate text representations obtained by making our criterion lossy.

Acknowledgments

We would like to thank Andrej Krevl, Jure Leskovec, and Julian McAuley for their thoughtful discussions and help with our paper.