{"title": "Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries", "book": "Advances in Neural Information Processing Systems", "page_first": 900, "page_last": 908, "abstract": "", "full_text": "Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries\r\n\r\nZhen James Xiang Hao Xu Peter J. Ramadge Department of Electrical Engineering, Princeton University Princeton, NJ 08544, USA {zxiang,haoxu,ramadge}@princeton.edu\r\n\r\nAbstract\r\nLearning sparse representations on data adaptive dictionaries is a state-of-the-art method for modeling data. But when the dictionary is large and the data dimension is high, it is a computationally challenging problem. We explore three aspects of the problem. First, we derive new, greatly improved screening tests that quickly identify codewords that are guaranteed to have zero weights. Second, we study the properties of random projections in the context of learning sparse representations. Finally, we develop a hierarchical framework that uses incremental random projections and screening to learn, in small stages, a hierarchically structured dictionary for sparse representations. Empirical results show that our framework can learn informative hierarchical sparse representations more efficiently.\r\n\r\n1\r\n\r\nIntroduction\r\n\r\nConsider approximating a p-dimensional data point x by a linear combination x Bw of m (possibly linearly dependent) codewords in a dictionary B = [b1 , b2 , . . . , bm ]. Doing so by imposing the additional constraint that w is a sparse vector, i.e., x is approximated as a weighted sum of only a few codewords in the dictionary, has recently attracted much attention [1]. As a further refinement, when there are many data points xj , the dictionary B can be optimized to make the representations wj as sparse as possible. This leads to the following problem. Given n data points in Rp organized as matrix X = [x1 , x2 , . . . 
, x_n] ∈ R^{p×n}, we want to learn a dictionary B = [b_1, b_2, ..., b_m] ∈ R^{p×m} and sparse representation weights W = [w_1, w_2, ..., w_n] ∈ R^{m×n} so that each data point x_j is well approximated by Bw_j with w_j a sparse vector:\r\n\r\nmin_{B,W} (1/2)‖X − BW‖_F² + λ‖W‖_1   s.t. ‖b_i‖_2² ≤ 1, i = 1, 2, ..., m.   (1)\r\n\r\nHere ‖·‖_F and ‖·‖_1 denote the Frobenius norm and element-wise l1-norm of a matrix, respectively.\r\n\r\nThere are two advantages to this representation method. First, the dictionary B is adapted to the data. In the spirit of many modern approaches (e.g. PCA, SMT [2], tree-induced bases [3, 4]), rather than fixing B a priori (e.g. Fourier, wavelet, DCT), problem (1) assumes minimal prior knowledge and uses sparsity as a cue to learn a dictionary adapted to the data. Second, the new representation w is obtained by a nonlinear mapping of x. Algorithms such as Laplacian eigenmaps [5] and LLE [6] also use nonlinear mappings x → w. By comparison, l1-regularization enjoys a simple formulation with a single tuning parameter (λ). In many other approaches (including [2-4]), although the codewords in B are cleverly chosen, the new representation w is simply a linear mapping of x, e.g. w = Bᵀx. In this case, training a linear model on w cannot learn nonlinear structure in the data. As a final point, we note that the human visual cortex uses similar mechanisms to encode visual scenes [7], and sparse representation has exhibited superior performance on difficult computer vision problems such as face [8] and object [9] recognition.\r\n\r\nThe challenge, however, is that solving the non-convex optimization problem (1) is computationally expensive. Most state-of-the-art algorithms solve (1) by iteratively optimizing W and B. For a fixed B, optimizing W requires solving n p-dimensional lasso problems of size m. Using LARS [10] with a Cholesky-based implementation, each lasso problem has a computation cost of O(δmp + δm²), where δ is the number of nonzero coefficients [11]. 
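The alternating scheme just described can be sketched in a few lines. The following is a hypothetical minimal implementation of one lasso subproblem (our own code, not the toolbox of [12]); it uses ISTA (proximal gradient) as a simple stand-in for LARS:

```python
import numpy as np

def ista_lasso(x, B, lam, n_iter=500):
    """Solve min_w 0.5*||x - B w||^2 + lam*||w||_1 by ISTA
    (proximal gradient); a simple stand-in for LARS."""
    L = np.linalg.norm(B, 2) ** 2          # Lipschitz constant of the gradient
    w = np.zeros(B.shape[1])
    for _ in range(n_iter):
        g = B.T @ (B @ w - x)              # gradient of the smooth part
        z = w - g / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return w
```

For a fixed B, each column of W is one such call; at the optimum the KKT conditions give |Bᵀ(x − Bw)| ≤ λ componentwise, which is a convenient correctness check.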
For a fixed W, optimizing B is a least squares problem of pm variables and m constraints. In an efficient algorithm [12], the dual formulation has only m variables but still requires inverting m × m matrices (O(m³) complexity). To address this challenge, we examine decomposing a large dictionary learning problem into a set of smaller problems. First (§2), we explore dictionary screening [13, 14] to select a subset of codewords to use in each lasso optimization. We derive two new screening tests that are significantly better than existing tests when the data points and codewords are highly correlated, a typical scenario in sparse representation applications [15]. We also provide simple geometric intuition for guiding the derivation of screening tests. Second (§3), we examine projecting data onto a lower dimensional space so that we can control information flow in our hierarchical framework and solve sparse representations with smaller p. We identify an important property of the data that's implicitly assumed in sparse representation problems (scale indifference) and study how random projection preserves this property. These results are inspired by [16] and related work in compressed sensing. Finally (§4), we develop a framework for learning a hierarchical dictionary (similar in spirit to [17] and DBN [18]). To do so we exploit our results on screening and random projection and impose a zero-tree-like structured sparsity constraint on the representation. This constraint is similar to the formulation in [19]. The key difference is that we learn the sparse representation stage-wise in layers and use the exact zero-tree sparsity constraint to utilize the information in previous layers to simplify the computation, whereas [19] uses a convex relaxation to approximate the structured sparsity constraint and learns the sparse representation (of all layers) by solving a single large optimization problem. 
Our idea of using incremental random projections is inspired by the work in [20, 21]. Finally, unlike [12] (which addresses the same computational challenge), we focus on a high level reorganization of the computations rather than improving basic optimization algorithms. Our framework can be combined with all existing optimization algorithms, e.g. [12], to attain faster results.\r\n\r\n2 Reducing the Dictionary By Screening\r\n\r\nIn this section we assume that all data points and codewords are normalized: ‖x_j‖_2 = ‖b_i‖_2 = 1, 1 ≤ j ≤ n, 1 ≤ i ≤ m (we discuss the implications of this assumption in §3). When B is fixed, finding the optimal W in (1) requires solving n subproblems. The j-th subproblem finds w_j for x_j. For notational simplicity, in this section we drop the index j and denote x = x_j, w = w_j = [w_1, w_2, ..., w_m]ᵀ. Each subproblem is then of the form:\r\n\r\nmin_{w_1,...,w_m} (1/2)‖x − Σ_{i=1}^m w_i b_i‖_2² + λ Σ_{i=1}^m |w_i|.   (2)\r\n\r\nTo address the challenge of solving (2) for large m, we first explore simple screening tests that identify and discard codewords b_i guaranteed to have optimal solution w̃_i = 0. El Ghaoui's SAFE rule [13] is an example of a simple screening test. We introduce some simple geometric intuition for screening and use this to derive new tests that are significantly better than existing tests for the type of problems of interest here. To this end, it will help to consider the dual problem of (2):\r\n\r\nmax_θ (1/2)‖x‖_2² − (λ²/2)‖θ − x/λ‖_2²   s.t. |θᵀb_i| ≤ 1, i = 1, 2, ..., m.   (3)\r\n\r\nAs is well known (see the supplemental material), the optimal solution w̃ = [w̃_1, w̃_2, ..., w̃_m]ᵀ of the primal problem and the optimal solution θ̃ of the dual problem are related through:\r\n\r\nx = Σ_{i=1}^m w̃_i b_i + λθ̃,   θ̃ᵀb_i ∈ {sign w̃_i} if w̃_i ≠ 0, θ̃ᵀb_i ∈ [−1, 1] if w̃_i = 0.   (4)\r\n\r\nThe dual formulation gives useful geometric intuition. 
Since ‖x‖_2 = ‖b_i‖_2 = 1, x and all b_i lie on the unit sphere S^{p−1} (Fig. 1(a)). For y on S^{p−1}, P(y) = {z : zᵀy = 1} is the tangent hyperplane of S^{p−1} at y and H(y) = {z : zᵀy ≤ 1} is the corresponding closed half space containing the origin.\r\n\r\nFigure 1: (a) Geometry of the dual problem. (b) Illustration of a sphere test. (c) The solid red, dotted blue and solid magenta circles leading to sphere tests ST1/SAFE, ST2, ST3, respectively. (d) The discarding thresholds of ST2 (our new test) and ST1/SAFE as functions of λ/λ_max, for λ_max = 0.8 (top) and λ_max = 0.9 (bottom). A higher threshold yields a better test.\r\n\r\nThe constraints in (3) indicate that a feasible θ must be in H(b_i) and H(−b_i) for all i. To find the θ̃ that maximizes the objective in (3), we must find a feasible θ closest to x/λ. By (4), if θ̃ is not on P(b_i) or P(−b_i), then w̃_i = 0 and we can safely discard b_i from problem (2).\r\n\r\nLet λ_max = max_i |xᵀb_i| and b* ∈ {±b_i}_{i=1}^m be selected so that λ_max = xᵀb*. Note that θ = x/λ_max is a feasible solution for (3). λ_max is also the largest λ for which (2) has a nonzero solution. If λ > λ_max, then x/λ itself is feasible, making it the optimal solution. Since it is not on any hyperplane P(b_i) or P(−b_i), w̃_i = 0, i = 1, ..., m. Hence we assume that λ ≤ λ_max.\r\n\r\nThese observations can be used for screening as follows. If we know that θ̃ is within a region R, then we can discard those b_i for which the tangent hyperplanes P(b_i) and P(−b_i) don't intersect R, since by (4) the corresponding w̃_i will be 0. Moreover, if the region R is contained in a closed ball (e.g. 
the shaded blue area in Fig. 1(b)) centered at q with radius r, i.e., {θ : ‖θ − q‖_2 ≤ r}, then one can discard all b_i for which |qᵀb_i| is smaller than a threshold determined by the common tangent hyperplanes of the spheres ‖θ − q‖_2 = r and S^{p−1}. This \"sphere test\" is made precise in the following lemma (all lemmata are proved in the supplemental material).\r\n\r\nLemma 1. If the solution θ̃ of (3) satisfies ‖θ̃ − q‖_2 ≤ r, then |qᵀb_i| < 1 − r ⟹ w̃_i = 0.\r\n\r\nEl Ghaoui's SAFE rule [13] is a sphere test of the simplest form. To see this, note that x/λ_max is a feasible point of (3), so the optimal θ̃ cannot be further away from x/λ than x/λ_max. Therefore we have the constraint ‖θ̃ − x/λ‖_2 ≤ 1/λ − 1/λ_max (solid red ball in Fig. 1(c)). Plugging q = x/λ and r = 1/λ − 1/λ_max into Lemma 1 yields El Ghaoui's SAFE rule:\r\n\r\nSphere Test #1 (ST1/SAFE): If |xᵀb_i| < λ − 1 + λ/λ_max, then w̃_i = 0.\r\n\r\nNote that the SAFE rule is weakest when λ_max is large, i.e., when the codewords are very similar to the data points, a frequent situation in applications [15]. To see that there is room for improvement, consider the constraint θ̃ᵀb* ≤ 1. This puts θ̃ in the intersection of the previous closed ball (solid red) and H(b*). This is indicated by the shaded green region in Fig. 1(c). Since this intersection is small when λ_max is large, a better test results by selecting R to be the shaded green region. However, to simplify the test, we relax R to a closed ball and use the sphere test of Lemma 1. Two relaxations, the solid magenta ball and the dotted blue ball in Fig. 1(c), are detailed in the following lemma.\r\n\r\nLemma 2. If θ satisfies (a) ‖θ − x/λ‖_2 ≤ 1/λ − 1/λ_max and (b) θᵀb* ≤ 1, then θ satisfies\r\n\r\n‖θ − (x/λ − (λ_max/λ − 1)b*)‖_2 ≤ √(1/λ_max² − 1) (λ_max/λ − 1),   (5)\r\n\r\nand\r\n\r\n‖θ − x/λ_max‖_2 ≤ 2 √(1/λ_max² − 1) (λ_max/λ − 1).   (6)\r\n\r\nBy Lemma 2, since θ̃ satisfies (a) and (b), it satisfies (5) and (6). We start with (6) because of its similarity to the closed ball constraint used to derive ST1/SAFE (solid red ball). 
Plugging q = x/λ_max and r = 2√(1/λ_max² − 1)(λ_max/λ − 1) into Lemma 1 yields our first new test:\r\n\r\nSphere Test #2 (ST2): If |xᵀb_i| < λ_max(1 − 2√(1/λ_max² − 1)(λ_max/λ − 1)), then w̃_i = 0.\r\n\r\nSince ST2 and ST1/SAFE both test |xᵀb_i| against thresholds, we can compare the tests by plotting their thresholds. We do so for λ_max = 0.8, 0.9 in Fig. 1(d). The thresholds must be positive and large to be useful. ST2 is most useful when λ_max is large. Indeed, we have the following lemma:\r\n\r\nLemma 3. When λ_max > √3/2, if ST1/SAFE discards b_i, then ST2 also discards b_i.\r\n\r\nFinally, to use the closed ball constraint (5), we plug q = x/λ − (λ_max/λ − 1)b* and r = √(1/λ_max² − 1)(λ_max/λ − 1) into Lemma 1 to obtain a second new test:\r\n\r\nSphere Test #3 (ST3): If |xᵀb_i − (λ_max − λ)b*ᵀb_i| < λ(1 − √(1/λ_max² − 1)(λ_max/λ − 1)), then w̃_i = 0.\r\n\r\nST3 is slightly more complex. It requires finding b* and computing a weighted sum of inner products. But ST3 is always better than ST2 since its sphere lies strictly inside that of ST2:\r\n\r\nLemma 4. Given any x, b* and λ, if ST2 discards b_i, then ST3 also discards b_i.\r\n\r\nTo summarize, ST3 completely outperforms ST2, and when λ_max is larger than √3/2 ≈ 0.866, ST2 completely outperforms ST1/SAFE. Empirical comparisons are given in §5. By making two passes through the dictionary, the above tests can be efficiently implemented on large-scale dictionaries that can't fit in memory. The first pass holds x, u, b_i ∈ R^p in memory at once and computes u(i) = xᵀb_i. By simple bookkeeping, after pass one we have b* and λ_max. The second pass holds u, b*, b_i in memory at once, computes b*ᵀb_i and executes the test.\r\n\r\n3 Random Projections of the Data\r\n\r\nIn §4 we develop a framework for learning a hierarchical dictionary and this involves the use of random data projections to control information flow to the levels of the hierarchy. The motivation for using random projections will become clear, and is specifically discussed, in §4. 
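As a numerical sanity check on the sphere tests of §2, here is a small sketch (hypothetical helper, our own variable names) implementing the three thresholds; Lemmata 3 and 4 predict how the discard sets nest.

```python
import numpy as np

def screen(x, B, lam):
    """Sphere tests ST1/SAFE, ST2, ST3 for the lasso subproblem (2).
    Returns three boolean masks; True = codeword can be safely discarded."""
    c = B.T @ x                                  # correlations x^T b_i
    lam_max = np.max(np.abs(c))
    i_star = np.argmax(np.abs(c))
    b_star = B[:, i_star] * np.sign(c[i_star])   # ensure x^T b* = lam_max
    s = np.sqrt(1.0 / lam_max**2 - 1.0) * (lam_max / lam - 1.0)
    st1 = np.abs(c) < lam - 1.0 + lam / lam_max
    st2 = np.abs(c) < lam_max * (1.0 - 2.0 * s)
    st3 = np.abs(c - (lam_max - lam) * (B.T @ b_star)) < lam * (1.0 - s)
    return st1, st2, st3
```

On highly correlated data (large λ_max), one can verify empirically that every codeword discarded by ST1/SAFE is discarded by ST2, and every codeword discarded by ST2 is discarded by ST3.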
Here we lay some groundwork by studying basic properties of random projections in learning sparse representations.\r\n\r\nWe first revisit the normalization assumption ‖x_j‖_2 = ‖b_i‖_2 = 1, 1 ≤ j ≤ n, 1 ≤ i ≤ m, made in §2. The assumption that all codewords are normalized, ‖b_i‖_2 = 1, is necessary for (1) to be meaningful; otherwise we can increase the scale of b_i and inversely scale the i-th row of W to lower the loss. The assumption that all data points are normalized, ‖x_j‖_2 = 1, warrants a more careful examination. To see this, assume that the data {x_j}_{j=1}^n are samples from an underlying low dimensional smooth manifold X and that one desires a correspondence between codewords and local regions on X. Then we require the following scale indifference (SI) property to hold:\r\n\r\nDefinition 1. X satisfies the SI property if ∀x_1, x_2 ∈ X with x_1 ≠ x_2, and ∀α ≠ 0, x_1 ≠ αx_2.\r\n\r\nIntuitively, SI means that X doesn't contain points differing only in scale, and it implies that points x_1, x_2 from distinct regions on X will use different codewords in their representation. SI is usually implicitly assumed [9, 15] but it will be important for what follows to make the condition explicit. SI is true in many typical applications of sparse representation, for example, for image signals when we are interested in the image content regardless of image luminance. When SI holds we can indeed normalize the data points to S^{p−1} = {x : ‖x‖_2 = 1}.\r\n\r\nSince a random projection of the original data doesn't preserve the normalization ‖x_j‖_2 = 1, it's important for the random projection to preserve the SI property so that it is reasonable to renormalize the projected data. We will show that this is indeed the case under certain assumptions. Suppose we use a random projection matrix T ∈ R^{d×p}, with orthonormal rows, to project the data to R^d (d < p) and use TX as the new data matrix. Such T can be generated by running the Gram-Schmidt procedure on d p-dimensional random row vectors with i.i.d. Gaussian entries. 
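A matrix T of this kind can be generated as follows (a minimal sketch; we use a QR factorization, which performs the same orthogonalization as Gram-Schmidt):

```python
import numpy as np

def random_projection(d, p, rng):
    """T in R^{d x p} with orthonormal rows: orthogonalize d i.i.d.
    Gaussian p-dimensional vectors (QR does the Gram-Schmidt step)."""
    G = rng.standard_normal((p, d))
    Q, _ = np.linalg.qr(G)      # p x d matrix with orthonormal columns
    return Q.T                  # d x p matrix with orthonormal rows
```

Because the rows are orthonormal, T is a partial isometry (‖Tx‖_2 ≤ ‖x‖_2 for every x), and pairwise distances shrink by roughly √(d/p) on average, consistent with the distance-preservation bound discussed next.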
It's known that for certain sets X, with high probability random projection preserves pairwise distances:\r\n\r\n(1 − ε)√(d/p) ≤ ‖Tx_1 − Tx_2‖_2 / ‖x_1 − x_2‖_2 ≤ (1 + ε)√(d/p).   (7)\r\n\r\nFor example, when X contains only κ-sparse vectors, we only need d = O(κ ln(p/κ)), and when X is a K-dimensional Riemannian submanifold, we only need d = O(K ln p) [16]. We will show that when the pairwise distances are preserved as in (7), the SI property will also be preserved:\r\n\r\nTheorem 1. Define S(X) = {z : z = αx, x ∈ X, |α| ≤ 1}. If X satisfies SI and (7) is satisfied ∀(x_1, x_2) ∈ S(X) × S(X), then T(X) = {z : z = Tx, x ∈ X} also satisfies SI.\r\n\r\nProof. If T(X) doesn't satisfy SI, then by Definition 1, ∃(x_1, x_2) ∈ X × X with x_1 ≠ x_2 and α ≠ 0 s.t. Tx_1 = αTx_2. Without loss of generality we can assume that |α| ≤ 1 (otherwise we can exchange the positions of x_1 and x_2). Since x_1 and αx_2 are both in S(X), using (7) gives that ‖x_1 − αx_2‖_2 ≤ ‖Tx_1 − αTx_2‖_2 / ((1 − ε)√(d/p)) = 0. So x_1 = αx_2. This contradicts the SI property of X.\r\n\r\nFor example, if X contains only κ-sparse vectors, so does S(X). If X is a Riemannian submanifold, so is S(X). Therefore applying random projections to these X will preserve SI with high probability. For the case of κ-sparse vectors, under some strong conditions, we can prove that random projection always preserves SI. (Proofs of the theorems below are in the supplemental material.)\r\n\r\nTheorem 2. If X satisfies SI and has a κ-sparse representation using dictionary B, then the projected data T(X) satisfies SI if (2κ − 1)M(TB) < 1, where M(·) is the matrix mutual coherence.\r\n\r\nCombining (7) with Theorem 1 or 2 provides an important insight: the projected data TX contains rough information about the original data X and we can continue to use the formulation (1) on TX to extract such information. Actually, if we do this for a Riemannian submanifold X, then we have:\r\n\r\nTheorem 3. 
Let the data points lie on a K-dimensional compact Riemannian submanifold X ⊂ R^p with volume V, condition number 1/τ, and geodesic covering regularity R (see [16]). Assume that in the optimal solution of (1) for the projected data (replacing X with TX), data points Tx_1 and Tx_2 have nonzero weights on the same set of codewords. Let w_j be the new representation of x_j and δ_j = ‖Tx_j − Bw_j‖_2 be the length of the residual (j = 1, 2). With probability 1 − ρ:\r\n\r\n(p/d)(1 − ε_1)(1 − ε_2)(‖w_1 − w_2‖_2² − 2δ_1² − 2δ_2²) ≤ ‖x_1 − x_2‖_2² ≤ (p/d)(1 + ε_1)(1 + ε_2)(‖w_1 − w_2‖_2² + 2δ_1² + 2δ_2²),   (8)\r\n\r\nwith ε_1 = O(√(K ln(NVRτ⁻¹) ln(1/ρ)) / d^{0.5−γ}) (for any small γ > 0) and ε_2 = (κ − 1)M(B).\r\n\r\nTherefore the distances between the sparse representation weights reflect the original data point distances. We believe a similar result should also hold when X contains only κ-sparse vectors.\r\n\r\n4 Learning a Hierarchical Dictionary\r\n\r\nOur hierarchical framework decomposes a large dictionary learning problem into a sequence of smaller, hierarchically structured dictionary learning problems. The result is a tree of dictionaries. High levels of the tree give coarse representations, deeper levels give more detailed representations, and the codewords at the leaves form the final dictionary. The tree is grown top-down in l levels by refining the dictionary at the previous level to give the dictionary at the next level. Random data projections are used to control the information flow to different layers. We also enforce a zero-tree constraint on the sparse representation weights so that the zero weights in the previous level will force the corresponding weights in the next level to be zero. At each stage we combine this zero-tree constraint with our new screening tests to reduce the size of the lasso problems that must be solved. 
In detail, we use l random projections T_k ∈ R^{d_k×p} (1 ≤ k ≤ l) to extract information incrementally from the data in l stages. Each T_k has orthonormal rows and the rows of distinct T_k are orthogonal. At level k we learn a dictionary B_k ∈ R^{d_k×m_k} and weights W_k ∈ R^{m_k×n} by solving a small sparse representation problem similar to (1):\r\n\r\nmin_{B_k,W_k} (1/2)‖T_kX − B_kW_k‖_F² + λ_k‖W_k‖_1   s.t. ‖b_i^(k)‖_2² ≤ 1, i = 1, 2, ..., m_k.   (9)\r\n\r\nHere b_i^(k) is the i-th column of matrix B_k and m_k is assumed to be a multiple of m_{k−1}, so the number of codewords m_k increases with k. We solve (9) for level k = 1, 2, ..., l sequentially.\r\n\r\nAn additional constraint is required to enforce a tree structure. Denote the i-th element of the j-th column of W_k by w_j^(k)(i) and organize the weights at level k > 1 in m_{k−1} groups, one per level k − 1 codeword. The i-th group has m_k/m_{k−1} weights, {w_j^(k)(r·m_{k−1} + i), 0 ≤ r < m_k/m_{k−1}}, and has weight w_j^(k−1)(i) as its parent weight. To enforce a tree structure we require that a child weight is zero if its parent weight is zero. So for every level k ≥ 2, data point j (1 ≤ j ≤ n), group i (1 ≤ i ≤ m_{k−1}), and weight w_j^(k)(r·m_{k−1} + i) (0 ≤ r < m_k/m_{k−1}), we enforce:\r\n\r\nw_j^(k−1)(i) = 0 ⟹ w_j^(k)(r·m_{k−1} + i) = 0.   (10)\r\n\r\nThis imposed tree structure is analogous to a \"zero-tree\" in EZW wavelet compression [22]. In addition, (10) means that the weights of the previous layer select a small subset of codewords to enter the lasso optimization. When solving for w_j^(k), (10) reduces the number of codewords from m_k to (m_k/m_{k−1})·‖w_j^(k−1)‖_0, a considerable reduction since w_j^(k−1) is sparse. Thus the screening rules in §2 and the imposed screening rule (10) work together to reduce the effective dictionary size. 
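The index bookkeeping behind the zero-tree rule is simple; a hypothetical sketch (our own variable names):

```python
import numpy as np

def allowed_codewords(parent_w, m_prev, m_k):
    """Zero-tree rule: a child weight w^(k)(r*m_prev + i) may be nonzero
    only if its parent weight w^(k-1)(i) is nonzero. Returns the sorted
    indices of level-k codewords allowed into the lasso optimization."""
    assert m_k % m_prev == 0
    nonzero_parents = np.flatnonzero(parent_w)   # active groups i
    r = np.arange(m_k // m_prev)                 # children per group
    idx = (r[:, None] * m_prev + nonzero_parents[None, :]).ravel()
    return np.sort(idx)
```

The number of surviving codewords is (m_k/m_{k−1}) times the number of nonzero parent weights, matching the reduction described above.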
The tree structure in the weights introduces a similar hierarchical tree structure in the dictionaries {B_k}_{k=1}^l: the codewords {b_{r·m_{k−1}+i}^(k), 0 ≤ r < m_k/m_{k−1}} are the children of codeword b_i^(k−1). This tree structure provides a heuristic way of updating B_k. When k > 1, the m_k codewords in layer k are naturally divided into m_{k−1} groups, so we can solve B_k by optimizing each group sequentially. This is similar to block coordinate descent. For i = 1, 2, ..., m_{k−1}, let B' = [b_{r·m_{k−1}+i}^(k)]_{r=0}^{m_k/m_{k−1}−1} denote the codewords in group i. Let W' be the submatrix of W_k containing only the (r·m_{k−1} + i)-th rows of W_k, r = 0, 1, ..., m_k/m_{k−1} − 1. W' is the weight matrix for B'. Denote the remaining codewords and weights by B'' and W''. For all m_{k−1} groups in random order, we fix B'' and update B' by solving (1) for data matrix T_kX − B''W''. This reduces the complexity from O(m_k^q) to O(m_k^q/m_{k−1}^{q−1}), where O(m^q) is the complexity of updating a dictionary of size m. Since q ≥ 3, this offers big computational savings but might yield a suboptimal solution of (9).\r\n\r\nAfter finalizing W_k and B_k, we can solve an unconstrained QP to find C_k = arg min_C ‖X − CW_k‖_F². C_k is useful for visualization purposes; it represents the points on the original data manifold corresponding to B_k.\r\n\r\nIn principle, our framework can use any orthogonal projection matrix T_k. We choose random projections because they're simple and, more importantly, because they provide a mechanism to control the amount of information extracted at each layer. If all T_k are randomly generated independently of X, then on average, the amount of information in T_kX is proportional to d_k. 
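The unconstrained QP for C_k has a closed-form least-squares solution; a minimal sketch (hypothetical helper name):

```python
import numpy as np

def backproject_codewords(X, Wk):
    """C_k = argmin_C ||X - C W_k||_F^2, solved by least squares.
    C_k maps the level-k codewords back to the original data space
    for visualization."""
    # Solve W_k^T C^T = X^T in the least-squares sense.
    Ct, *_ = np.linalg.lstsq(Wk.T, X.T, rcond=None)
    return Ct.T
```

When W_k has full row rank this coincides with C_k = X W_kᵀ (W_k W_kᵀ)⁻¹, the usual normal-equations solution.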
This allows us to control the flow of information to each layer so that we avoid using all the information in one layer.\r\n\r\n5 Experiments\r\n\r\nWe tested our framework on: (a) the COIL rotational image data set [23], (b) the MNIST digit classification data set [24], and (c) the extended Yale B face recognition data set [25, 26]. The basic sparse representation problem (1) is solved using the toolbox provided in [12] to iteratively optimize B and W until an iteration results in a loss function reduction of less than 0.01%.\r\n\r\nCOIL Rotational Image Data Set: This is intended as a small scale illustration of our framework. We use the 72 128×128 color images of object No. 80 rotating around a circle in 15-degree increments (18 images shown in Fig. 2(a)). We ran the traditional sparse representation algorithm to compare the three screening tests in §2. The dictionary size is m = 16 and we vary λ. As shown in Fig. 2(c), ST3 discards a larger fraction of codewords than ST2 and ST2 discards a larger fraction than ST1/SAFE. We ran the same algorithms on 200 random data projections and the results are almost identical. The average λ_max for these two situations is 0.98.\r\n\r\nNext we test our hierarchical framework using two layers. We set (d_2, m_2) = (200, 16) so that the second layer solves a problem of the same scale as in the previous paragraph. We demonstrate how the result of the first layer, with (d_1, m_1, λ_1) = (100, 4, 0.5), helps the second layer discard more codewords when the tree constraint (10) is imposed. Fig. 2(b) illustrates this constraint: the 16 second layer codewords are organized in 4 groups of 4 (only 2 groups shown). The weight on any codeword in a group has to be zero if the parent codeword in the first layer has weight zero. This imposed constraint discards many more codewords in the screening stage than any of the three tests in §2 (Fig. 2(d)). 
Finally, the illustrated codewords and weights in Fig. 2(b) are the actual values in\r\n\r\nFigure 2: (a): Example images of the data set. (b): Illustration of a two layer hierarchical sparse representation. (c): Comparison of the three screening tests for sparse representation: average percentage of codewords discarded in the prescreening by ST3, ST2 and ST1/SAFE, on the original and on the projected data, as a function of λ/λ_max. (d): Screening performance in the second layer of our hierarchical framework using combinations of screening criteria: (10) + ST3, ST3 only, (10) + ST2, ST2 only, (10) + ST1/SAFE, and ST1/SAFE only. 
The imposed constraint (10) helps to discard significantly more codewords when λ is small.\r\n\r\nFigure 3: Left: MNIST: The tradeoff between classification accuracy (%) on the testing set and average encoding time (ms) for various sparse representation methods: traditional sparse representation with m = 64, 128, 192, 256, 512 (6 different λ settings each); our hierarchical framework with (m_1, m_2) = (32, 512), (m_1, m_2) = (64, 2048), and (m_1, m_2, m_3) = (16, 256, 4096) (6 settings each); and baselines (the same linear classifier using 250 principal components or original pixel values). Our hierarchical framework yields better performance in less time. The average encoding time doesn't apply to baseline methods. The performance of traditional sparse representation is consistent with [9]. Right: Face Recognition: The recognition rate (%) on the testing set (top) and average encoding time for a testing image (bottom) versus the number of random projections used (32, 64, 128, 256; 0.1%-0.8% of the image size), for traditional sparse representation, our hierarchical framework, our framework with PCA projections, and a linear classifier. Traditional sparse representation has the best accuracy and is very close to the similar method SRC in [8] (SRC's recognition rate is cited from [8] but data on encoding time is not available). Our hierarchical framework achieves a good tradeoff between accuracy and speed. 
Using PCA projections in our framework yields worse performance since these projections do not spread information across the layers.\r\n\r\nC_2 and W_2 when λ_2 = 0.4 (the marked point in Fig. 2(d)). The sparse representation gives a multiresolution representation of the rotational pattern: the first layer encodes rough orientation and the second layer refines it.\r\n\r\nThe next two experiments evaluate the performance of sparse representation by (1) the accuracy of a classification task using the columns in W (or in [W_1ᵀ, W_2ᵀ, ..., W_lᵀ]ᵀ for our framework) as features, and (2) the average encoding time required to obtain these weights for a testing data point. This time is highly correlated with the total time needed for iterative dictionary learning. We used linear SVM (liblinear [27]) with parameters tuned by 10-fold cross-validation on the training set.\r\n\r\nMNIST Digit Classification: This data set contains 70,000 28×28 hand written digit images (60,000 training, 10,000 testing). We ran the traditional sparse representation algorithm for dictionary size m ∈ {64, 128, 192, 256} and λ ∈ {0.06, 0.08, 0.11, 0.16, 0.23, 0.32}. In the left panel of Fig. 3, each curve contains settings with the same m but with different λ. Points to the right correspond to smaller λ values (less sparse solutions and more difficult computation). There is a tradeoff between speed (x-axis) and classification performance (y-axis). To see where our framework stands, we tested the following settings: (a) 2 layers with (d_1, d_2) = (200, 500), (m_1, m_2) = (32, 512), λ_1 = 0.23 and varying λ_2; (b) (m_1, m_2) = (64, 2048) and everything else in (a) unchanged; (c) 3 layers with (d_1, d_2, d_3) = (100, 200, 400), (m_1, m_2, m_3) = (16, 256, 4096), (λ_1, λ_2) = (0.16, 0.11) and varying λ_3. The plot shows that compared to the traditional sparse representation, our hierarchical framework achieves roughly a 1% accuracy improvement given the same encoding time and roughly a 2X speedup given the same accuracy. 
Using 3 layers also offers competitive performance but doesn't outperform the 2 layer setting.\r\n\r\nFace Recognition: For each of 38 subjects we used 64 cropped frontal face views under differing lighting conditions, randomly divided into 32 training and 32 testing images. This set-up mirrors that in [8]. In this experiment we start with the randomly projected data (p ∈ {32, 64, 128, 256} random projections of the original 192×128 data) and use this data as follows: (a) learn a traditional non-hierarchical sparse representation; (b) our framework, i.e., sample the data in two stages using orthogonal random projections and learn a 2 layer hierarchical sparse representation; (c) use PCA projections to replace the random projections in (b); (d) directly apply a linear classifier without first learning a sparse representation. For (a) we used m = 1024, λ = 0.030 for p = 32, 64 and λ = 0.029 for p = 128, 256 (tuned to yield the same average sparsity for different p). For (b) we used (m_1, m_2) = (32, 1024), (d_1, d_2) = (3p/8, 5p/8), λ_1 = 0.02 and λ_2 the same as λ in (a). For (c) we used the same settings as in (b) except that the random projection matrices T_1, T_2 in our framework are now set to the PCA projection matrices (calculate the SVD X = USVᵀ with singular values in descending order, then use the first d_1 columns of U as the rows in T_1 and the next d_2 columns of U as the rows in T_2). The results in the right panel of Fig. 3 suggest that our framework strikes a good balance between speed and accuracy. The PCA variant of our framework has worse performance because the first 3p/8 projections contain too much information, leaving the second layer too little information (which also drags down the speed for lack of sparsity and structure). This reinforces our argument at the end of §4 about the advantage of random projections. The fact that a linear SVM performs well given enough random projections suggests this data set does not have a strong nonlinear structure. 
Finally, at any iteration, the average λ/λmax over all data points ranges from 0.76 to 0.91 across all settings in the MNIST experiment, and from 0.82 to nearly 1 in the face recognition experiment (except for the second layer of the PCA variant, where the average λ/λmax can be as low as 0.54). As expected, λ/λmax is large, a situation that favors our new screening tests (ST2, ST3).

6 Conclusion

Our theoretical results and algorithmic framework make headway on the computational challenge of learning sparse representations on large dictionaries for high dimensional data. The new screening tests greatly reduce the size of the lasso problems to be solved, and the tests are shown, both theoretically and empirically, to be much more effective than the existing ST1/SAFE test. We have shown that, under certain conditions, random projection preserves the scale indifference (SI) property with high probability, thus providing an opportunity to learn informative sparse representations from data of fewer dimensions. Finally, the new hierarchical dictionary learning framework employs random data projections to control the flow of information to the layers, screening to eliminate unnecessary codewords, and a tree constraint to select a small number of candidate codewords based on the weights learnt in the previous stage. By doing so, it can deal with large m and p simultaneously. The new framework exhibited impressive performance on the tested data sets, achieving equivalent classification accuracy with less computation time.

Acknowledgements

This research was partially supported by NSF grant CCF-1116208. Zhen James Xiang thanks Princeton University for support through the Charlotte Elizabeth Procter honorific fellowship.

References

[1] M. Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.
[2] G. Cao and C.A. Bouman.
Covariance estimation for high dimensional data vectors using the sparse matrix transform. In Advances in Neural Information Processing Systems, 2008.
[3] A.B. Lee, B. Nadler, and L. Wasserman. Treelets: An adaptive multi-scale basis for sparse unordered data. The Annals of Applied Statistics, 2(2):435–471, 2008.
[4] M. Gavish, B. Nadler, and R.R. Coifman. Multiscale wavelets on trees, graphs and high dimensional data: Theory and applications to semi-supervised learning. In International Conference on Machine Learning, 2010.
[5] M. Belkin and P. Niyogi. Using manifold structure for partially labeled classification. In Advances in Neural Information Processing Systems, pages 953–960, 2003.
[6] S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[7] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[8] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2009.
[9] K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local coordinate coding. In Advances in Neural Information Processing Systems, volume 22, 2009.
[10] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–451, 2004.
[11] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60, 2010.
[12] H. Lee, A. Battle, R. Raina, and A.Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, volume 19, page 801, 2007.
[13] L.E. Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse supervised learning. arXiv preprint arXiv:1009.3515, 2010.
[14] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R.J.
Tibshirani. Strong rules for discarding predictors in lasso-type problems. arXiv preprint arXiv:1011.2234, 2010.
[15] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan. Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE, 98(6):1031–1044, 2010.
[16] R.G. Baraniuk and M.B. Wakin. Random projections of smooth manifolds. Foundations of Computational Mathematics, 9(1):51–77, 2009.
[17] Y. Lin, T. Zhang, S. Zhu, and K. Yu. Deep coding network. In Advances in Neural Information Processing Systems, 2010.
[18] G.E. Hinton, S. Osindero, and Y.W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[19] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In International Conference on Machine Learning, 2010.
[20] M.B. Wakin, D.L. Donoho, H. Choi, and R.G. Baraniuk. High-resolution navigation on non-differentiable image manifolds. In IEEE International Conference on Acoustics, Speech and Signal Processing, volume 5, pages 1073–1076, 2005.
[21] M.F. Duarte, M.A. Davenport, M.B. Wakin, J.N. Laska, D. Takhar, K.F. Kelly, and R.G. Baraniuk. Multiscale random projections for compressive classification. In IEEE International Conference on Image Processing, volume 6, 2007.
[22] J.M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing, 41(12):3445–3462, 1993.
[23] S.A. Nene, S.K. Nayar, and H. Murase. Columbia object image library (COIL-100). Technical Report CUCS-006-96, Dept. of Computer Science, Columbia University, 1996.
[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[25] A.S. Georghiades, P.N. Belhumeur, and D.J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):643–660, 2001.
[26] K.C. Lee, J. Ho, and D.J. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):684–698, 2005.
[27] R.E. Fan, K.W. Chang, C.J. Hsieh, X.R. Wang, and C.J. Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.