{"title": "Online L1-Dictionary Learning with Application to Novel Document Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 2258, "page_last": 2266, "abstract": "Given their pervasive use, social media, such as Twitter, have become a leading source of breaking news. A key task in the automated identification of such news is the detection of novel documents from a voluminous stream of text documents in a scalable manner. Motivated by this challenge, we introduce the problem of online L1-dictionary learning where unlike traditional dictionary learning, which uses squared loss, the L1-penalty is used for measuring the reconstruction error. We present an efficient online algorithm for this problem based on alternating directions method of multipliers, and establish a sublinear regret bound for this algorithm. Empirical results on news-stream and Twitter data, shows that this online L1-dictionary learning algorithm for novel document detection gives more than an order of magnitude speedup over the previously known batch algorithm, without any significant loss in quality of results. Our algorithm for online L1-dictionary learning could be of independent interest.", "full_text": "Online (cid:96)1-Dictionary Learning with Application to\n\nNovel Document Detection\n\nShiva Prasad Kasiviswanathan\u2217\nGeneral Electric Global Research\n\nkasivisw@gmail.com\n\nHuahua Wang\u2020\n\nUniversity of Minnesota\nhuwang@cs.umn.edu\n\nArindam Banerjee\u2020\nUniversity of Minnesota\n\nbanerjee@cs.umn.edu\n\nPrem Melville\n\nIBM T.J. Watson Research Center\n\npmelvil@us.ibm.com\n\nAbstract\n\nGiven their pervasive use, social media, such as Twitter, have become a leading\nsource of breaking news. A key task in the automated identi\ufb01cation of such news\nis the detection of novel documents from a voluminous stream of text documents\nin a scalable manner. 
Motivated by this challenge, we introduce the problem of online ℓ1-dictionary learning where, unlike traditional dictionary learning, which uses squared loss, the ℓ1-penalty is used for measuring the reconstruction error. We present an efficient online algorithm for this problem based on the alternating directions method of multipliers, and establish a sublinear regret bound for this algorithm. Empirical results on news-stream and Twitter data show that this online ℓ1-dictionary learning algorithm for novel document detection gives more than an order of magnitude speedup over the previously known batch algorithm, without any significant loss in quality of results.\n\n1 Introduction\n\nThe high volume and velocity of social media, such as blogs and Twitter, have propelled them to the forefront as sources of breaking news. On Twitter, it is possible to find the latest updates on diverse topics, from natural disasters to celebrity deaths; and identifying such emerging topics has many practical applications, such as in marketing, disease control, and national security [14]. The key challenge in automatic detection of breaking news is being able to detect novel documents in a stream of text, where a document is considered novel if it is “unlike” documents seen in the past. Recently, this has been made possible by dictionary learning, which has emerged as a powerful data representation framework. In dictionary learning, each data point y is represented as a sparse linear combination Ax of dictionary atoms, where A is the dictionary and x is a sparse vector [1, 12]. 
A dictionary learning approach can be easily converted into a novel document detection method: let A be a dictionary representing all documents till time t−1; for a new document y arriving at time t, if one cannot find a sparse combination x of the dictionary atoms such that the reconstruction Ax yields a small loss, then y is not well represented by the dictionary A and is hence novel compared to documents in the past. At the end of timestep t, the dictionary is updated to represent all the documents till time t.\n\n(* Part of this work was done while the author was a postdoc at the IBM T.J. Watson Research Center.)\n\n(† H. Wang and A. Banerjee were supported in part by NSF CAREER grant IIS-0953274, NSF grants IIS-0916750, 1029711, IIS-0812183, and NASA grant NNX12AQ39A.)\n\nKasiviswanathan et al. [10] presented such a (batch) dictionary learning approach for detecting novel documents/topics. They used an ℓ1-penalty on the reconstruction error (instead of the squared loss commonly used in the dictionary learning literature), as the ℓ1-penalty has been found to be more effective for text analysis (see Section 3). They also showed this approach outperforms other techniques, such as a nearest-neighbor approach popular in the related area of First Story Detection [16]. We build upon this work by proposing an efficient algorithm for online dictionary learning with an ℓ1-penalty.\n\nOur online dictionary learning algorithm is based on the online alternating directions method, which was recently proposed by Wang and Banerjee [19] to solve online composite optimization problems with additional linear equality constraints. Traditional online convex optimization methods such as [25, 8, 5, 6, 22] require explicit computation of the subgradient, making them computationally expensive to apply in our high-volume text setting, whereas in our algorithm the subgradients are computed implicitly. 
The algorithm has simple closed-form updates for all steps, yielding a fast and scalable algorithm for updating the dictionary. Under suitable assumptions (to cope with the non-convexity of the dictionary learning problem), we establish an O(√T) regret bound for the objective, matching the regret bounds of existing methods [25, 5, 6, 22]. Using this online algorithm for ℓ1-dictionary learning, we obtain an online algorithm for novel document detection, which we empirically validate on traditional news-streams as well as streaming data from Twitter. Experimental results show a substantial speedup over the batch ℓ1-dictionary learning based approach of Kasiviswanathan et al. [10], without a loss of performance in detecting novel documents.\n\nRelated Work. Online convex optimization is an area of active research; for a detailed survey of the literature we refer the reader to [18]. Online dictionary learning was recently introduced by Mairal et al. [12], who showed that it provides a scalable approach for handling large dynamic datasets. They considered an ℓ2-penalty and showed that their online algorithm converges to the minimum objective value in the stochastic case (i.e., with distributional assumptions on the data). However, the ideas proposed in [12] do not translate to the ℓ1-penalty. The problem of novel document/topic detection was also addressed by a recent work of Saha et al. [17], who proposed a non-negative matrix factorization based approach for capturing evolving and novel topics. However, their algorithm operates over a sliding time window (it does not have online regret guarantees) and works only for the ℓ2-penalty.\n\n2 Preliminaries\n\nNotation. Vectors are always column vectors and are denoted by boldface letters. For a matrix Z, its norms are ‖Z‖1 = Σi,j |zij| (the entrywise ℓ1-norm) and ‖Z‖F^2 = Σi,j zij^2 (the squared Frobenius norm). 
For arbitrary real matrices the standard inner product is defined as ⟨Y, Z⟩ = Tr(Yᵀ Z). We use Ψmax(Z) to denote the largest eigenvalue of ZᵀZ. For a scalar r ∈ R, let sign(r) = 1 if r > 0, −1 if r < 0, and 0 if r = 0. Define soft(r, T) = sign(r) · max{|r| − T, 0}. The operators sign and soft are extended to matrices by applying them to every entry. 0m×n denotes the all-zeros matrix of size m × n; the subscript is omitted when the dimension of the matrix is clear from the context.\n\nDictionary Learning Background. Dictionary learning is the problem of estimating a collection of basis vectors over which a given data collection can be accurately reconstructed, often with sparse encodings. It falls into a general category of techniques known as matrix factorization. Classic dictionary learning techniques for sparse representation (see [1, 15, 12] and references therein) consider a finite training set of signals P = [p1, . . . , pn] ∈ R^{m×n} and optimize the empirical cost function f(A) = Σ_{i=1}^n l(pi, A), where l(·,·) is a loss function such that l(pi, A) is small if A is “good” at representing the signal pi in a sparse fashion. Here, A ∈ R^{m×k} is referred to as the dictionary. In this paper, we use an ℓ1-loss function with an ℓ1-regularization term, and define\n\nl(pi, A) = min_x ‖pi − Ax‖1 + λ‖x‖1,\n\nwhere λ is the regularization parameter. We define the problem of dictionary learning as that of minimizing the empirical cost f(A). 
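For concreteness, the sign and soft operators just defined act entrywise and have direct one-line implementations. The sketch below is ours, in Python/NumPy (the paper's own implementation is in Matlab):

```python
import numpy as np

def soft(R, T):
    """Elementwise soft-thresholding: sign(r) * max(|r| - T, 0).

    Works on scalars, vectors, or matrices, since all operations
    are applied entrywise, matching the extension described above.
    """
    return np.sign(R) * np.maximum(np.abs(R) - T, 0.0)
```

For example, `soft(np.array([3.0, -0.5, 1.0]), 1.0)` shrinks every entry toward zero by 1 and clips the small entries, giving `[2.0, 0.0, 0.0]`.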
In other words, dictionary learning is the following optimization problem:\n\nmin_A f(A) = min_{A,X} Σ_{i=1}^n l(pi, A) = min_{A,X} ‖P − AX‖1 + λ‖X‖1.\n\nFor maintaining interpretability of the results, we additionally require that the matrices A and X be non-negative. To prevent A from being arbitrarily large (which would lead to arbitrarily small values of X), we add a scaling constraint on A as follows. Let A be the convex set of matrices defined as\n\nA = {A ∈ R^{m×k} : A ≥ 0m×k and ‖Aj‖1 ≤ 1 for all j = 1, . . . , k},\n\nwhere Aj is the jth column of A. We use ΠA to denote the Euclidean projection onto the nearest point in the convex set A. The resulting optimization problem can be written as\n\nmin_{A∈A, X≥0} ‖P − AX‖1 + λ‖X‖1.    (1)\n\nThe optimization problem (1) is in general non-convex. But if one of the variables, either A or X, is held fixed, the objective function with respect to the other variable becomes convex (in fact, the problem can be transformed into a linear program).\n\n3 Novel Document Detection Using Dictionary Learning\n\nIn this section, we describe the problem of novel document detection and explain how dictionary learning can be used to tackle this problem. Our problem setup is similar to [10].\n\nNovel Document Detection Task. We assume documents arrive in streams. Let {Pt : Pt ∈ R^{mt×nt}, t = 1, 2, 3, . . .} denote a sequence of streaming matrices, where each column of Pt represents a document arriving at time t. Here, Pt represents the term-document matrix observed at time t. Each document is represented in some conventional vector space model such as TF-IDF [13]. The timestep t could be at any granularity, e.g., it could be the day on which the document arrives. We use nt to represent the number of documents arriving at time t. 
We normalize Pt such that each column (document) in Pt has unit ℓ1-norm. For simplicity of exposition, we assume that mt = m for all t (see Footnote 1). We use the notation P[t] to denote the term-document matrix obtained by horizontally concatenating the matrices P1, . . . , Pt, i.e., P[t] = [P1 | P2 | . . . | Pt]. Let Nt be the number of documents arriving at time ≤ t; then P[t] ∈ R^{m×Nt}. Under this setup, the goal of novel document detection is to identify documents in Pt that are “dissimilar” to the documents in P[t−1].\n\nSparse Coding to Detect Novel Documents. Let At ∈ R^{m×k} denote the dictionary matrix after time t−1, where the dictionary At is a good basis for representing all the documents in P[t−1]. The exact construction of the dictionary is described later. Now, consider a document y ∈ R^m appearing at time t. We say that it admits a sparse representation over At if y can be “well” approximated as a linear combination of a few columns of At. Modeling a vector with such a sparse decomposition is known as sparse coding. In most practical situations it may not be possible to represent y as Atx, e.g., if y has new words which are absent in At. In such cases, one can write y = Atx + e, where e is an unknown noise vector. We consider the following sparse coding formulation:\n\nl(y, At) = min_{x≥0} ‖y − Atx‖1 + λ‖x‖1.    (2)\n\nThe formulation (2) naturally takes into account both the reconstruction error (through the ‖y − Atx‖1 term) and the complexity of the sparse decomposition (through the ‖x‖1 term). It is quite easy to transform (2) into a linear program. Hence, it can be solved using a variety of methods. 
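To make the linear-program reduction concrete: splitting the residual y − Atx into its positive and negative parts turns (2) into a standard-form LP. Below is our illustrative sketch using SciPy's `linprog`; the solver choice is ours, the paper itself solves this step with ADMM:

```python
import numpy as np
from scipy.optimize import linprog

def sparse_code_lp(y, A, lam):
    """Solve min_{x >= 0} ||y - A x||_1 + lam * ||x||_1 as a linear program.

    Split the residual as y - A x = ep - en with ep, en >= 0. Since x >= 0,
    ||x||_1 = sum(x), so the problem becomes:
        min  lam * 1'x + 1'ep + 1'en
        s.t. A x + ep - en = y,  and  x, ep, en >= 0.
    Returns the sparse code x and the optimal objective value.
    """
    m, k = A.shape
    c = np.concatenate([lam * np.ones(k), np.ones(2 * m)])
    A_eq = np.hstack([A, np.eye(m), -np.eye(m)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:k], res.fun
```

If y exactly equals a dictionary atom, the optimal objective is just λ (the cost of the single nonzero coefficient), which is the kind of small value that marks a document as non-novel.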
In our experiments, we use the alternating directions method of multipliers (ADMM) [2] to solve (2). ADMM has recently gathered significant attention in the machine learning community due to its wide applicability to a range of learning problems with complex objective functions [2].\n\nWe can use sparse coding to detect novel documents as follows. For each document y arriving at time t, we first solve (2) to check whether y can be well approximated as a sparse linear combination of the atoms of At. If the objective value l(y, At) is “big”, we mark the document as novel; otherwise we mark it as non-novel. Since we have normalized all documents in Pt to unit ℓ1-length, the objective values are on the same scale.\n\n(Footnote 1: As new documents come in and new terms are identified, we expand the vocabulary and zero-pad the previous matrices so that at the current time t, all previous and current documents have a representation over the same vocabulary space.)\n\nChoice of the Error Function. A very common choice of reconstruction error is the ℓ2-penalty. In fact, in the presence of isotropic Gaussian noise, the ℓ2-penalty on e = y − Atx gives the maximum likelihood estimate of x [21, 23]. However, for text documents the noise vector e rarely satisfies the Gaussian assumption, as some of its coefficients contain large, impulsive values. For example, in fields such as politics and sports, a certain term may become suddenly dominant in a discussion [10]. In such cases, imposing an ℓ1-penalty on the error is a better choice than imposing an ℓ2-penalty (e.g., recent research [21, 24, 20] has successfully shown the superiority of the ℓ1- over the ℓ2-penalty for a different but related application domain, face recognition). 
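A one-dimensional toy example (ours, not from the paper) illustrates the robustness argument: when fitting a single constant atom to a vector with one impulsive entry, the ℓ2 fit is the mean of the entries while the ℓ1 fit is the median, and only the former is dragged away by the outlier:

```python
import numpy as np

# Toy illustration: reconstruct y with a single all-ones atom, i.e. pick a
# scalar x minimizing ||y - x*1||. The l2-optimal x is the mean of y's
# entries; the l1-optimal x is their median.
y = np.array([1.0, 1.0, 1.0, 1.0, 50.0])  # one impulsive "bursty term"

x_l2 = y.mean()      # minimizes ||y - x*1||_2^2  -> 10.8, far from the bulk
x_l1 = np.median(y)  # minimizes ||y - x*1||_1    -> 1.0, ignores the outlier

print(x_l2, x_l1)
```

The same effect, coordinate by coordinate, is why a few suddenly dominant terms inflate an ℓ2 reconstruction error much more than an ℓ1 one.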
We empirically validate the superiority of using the ℓ1-penalty for novel document detection in Section 5.\n\nSize of the Dictionary. Ideally, in our application setting, changing the size of the dictionary (k) dynamically with t would lead to more efficient and effective sparse coding. However, in our theoretical analysis we make the simplifying assumption that k is a constant independent of t. In our experiments, we allow for small increases in the size of the dictionary over time when required.\n\nBatch Algorithm for Novel Document Detection. We now describe a simple batch algorithm (slightly modified from [10]) for detecting novel documents. Algorithm BATCH alternates between a novel document detection step and a batch dictionary learning step.\n\nAlgorithm 1: BATCH\nInput: P[t−1] ∈ R^{m×Nt−1}, Pt = [p1, . . . , pnt] ∈ R^{m×nt}, At ∈ R^{m×k}, λ ≥ 0, ζ ≥ 0\nNovel Document Detection Step:\nfor j = 1 to nt do\n  Solve: xj = argmin_{x≥0} ‖pj − Atx‖1 + λ‖x‖1\n  if ‖pj − Atxj‖1 + λ‖xj‖1 > ζ, mark pj as novel\nBatch Dictionary Learning Step:\nSet P[t] ← [P[t−1] | p1, . . . , pnt]\nSolve: [At+1, X[t]] = argmin_{A∈A, X≥0} ‖P[t] − AX‖1 + λ‖X‖1\n\nBatch Dictionary Learning. We now describe the batch dictionary learning step. At time t, the dictionary learning step is (see Footnote 2)\n\n[At+1, X[t]] = argmin_{A∈A, X≥0} ‖P[t] − AX‖1 + λ‖X‖1.    (3)\n\nEven though conceptually simple, Algorithm BATCH is computationally inefficient. The bottleneck is the dictionary learning step. As t increases, so does the size of P[t], so solving (3) becomes prohibitive even with efficient optimization techniques. 
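The detection step, in contrast, is cheap: it is one sparse-coding solve plus a threshold test per document. A schematic sketch of that step follows; the `sparse_code` argument is a hypothetical interface to any solver for formulation (2), such as an LP or ADMM solver:

```python
import numpy as np

def detect_novel_step(P_t, A_t, lam, zeta, sparse_code):
    """Detection step of Algorithm BATCH: flag column j of P_t as novel when
    ||p_j - A_t x_j||_1 + lam * ||x_j||_1 > zeta, where x_j >= 0 is returned
    by the supplied sparse coder for formulation (2)."""
    flags = []
    for j in range(P_t.shape[1]):
        y = P_t[:, j]
        x = sparse_code(y, A_t, lam)
        obj = np.abs(y - A_t @ x).sum() + lam * np.abs(x).sum()
        flags.append(obj > zeta)
    return np.array(flags)
```

Because every document is normalized to unit ℓ1-length, a single threshold ζ is meaningful across all columns.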
To achieve computational efficiency, the authors of [10] solved an approximation of (3) in which the dictionary learning step only updates the A's and not the X's (see Footnote 3). This leads to faster running times, but because of the approximation, the quality of the dictionary degrades over time and the performance of the algorithm decreases. In this paper, we propose an online learning algorithm for (3) and show that this online algorithm is both computationally efficient and generates good-quality dictionaries under reasonable assumptions.\n\n4 Online ℓ1-Dictionary Learning\n\nIn this section, we introduce the online ℓ1-dictionary learning problem and propose an efficient algorithm for it. The standard goal of online learning is to design algorithms whose regret is sublinear in time T, since this implies that “on average” the algorithm performs as well as the best fixed strategy in hindsight [18]. Now consider the ℓ1-dictionary learning problem defined in (3). Since this problem is non-convex, it may not be possible to design efficient (i.e., polynomial running time) algorithms that solve it without making any assumptions on either the dictionary (A) or the sparse code (X). This also means that it may not be possible to design an efficient online algorithm with sublinear regret without making any assumptions on either A or X, because an efficient online algorithm with sublinear regret would imply an efficient algorithm for solving (1) in the offline case. Therefore, we focus on obtaining regret bounds for the dictionary update, assuming that at each timestep the sparse codes given to the batch and online algorithms are “close”. This motivates the following problem.\n\nDefinition 4.1 (Online ℓ1-Dictionary Learning Problem). At time t, the online algorithm picks Ât+1 ∈ A. 
Then nature (the adversary) reveals (Pt+1, X̂t+1) with Pt+1 ∈ R^{m×n} and X̂t+1 ∈ R^{k×n}. The problem is to pick the sequence of Ât+1's such that the following regret function is minimized (see Footnote 4):\n\nR(T) = Σ_{t=1}^T ‖Pt − ÂtX̂t‖1 − min_{A∈A} Σ_{t=1}^T ‖Pt − AXt‖1,\n\nwhere X̂t = Xt + Et and Et is an error matrix dependent on t.\n\nThe regret defined above admits a discrepancy between the sparse coding matrices supplied to the batch and online algorithms, through the error matrix. The reason for this generality is that in our application setting, the sparse coding matrices used for updating the dictionaries of the batch and online algorithms could be different. We will later establish the conditions on the Et's under which we can achieve sublinear regret. All missing proofs and details appear in the full version of the paper [11].\n\n(Footnote 2: In our algorithms, it is quite straightforward to replace the condition A ∈ A by some other condition A ∈ C, where C is some closed non-empty convex set.)\n\n(Footnote 3: In particular, define (recursively) X̃[t] = [X̃[t−1] | x1, . . . , xnt], where the xj's come from the novel document detection step at time t. In [10], the dictionary learning step is At+1 = argmin_{A∈A} ‖P[t] − AX̃[t]‖1.)\n\n4.1 Online ℓ1-Dictionary Algorithm\n\nIn this section, we design an algorithm for the online ℓ1-dictionary learning problem, which we call Online Inexact ADMM (OIADMM; see Footnote 5), and bound its regret. First note that, because of the non-smooth ℓ1-norms involved, it is computationally expensive to apply standard online learning algorithms like online gradient descent [25, 8], COMID [6], FOBOS [5], and RDA [22], as they require computing a costly subgradient at every iteration. 
The subgradient of ‖P − AX‖1 at A = Ā is (X · sign(XᵀĀᵀ − Pᵀ))ᵀ.\n\nOur algorithm for online ℓ1-dictionary learning is based on the online alternating direction method recently proposed by Wang and Banerjee [19]. Our algorithm first performs a simple variable substitution by introducing an equality constraint. The update for each of the resulting variables has a closed-form solution, without the need to estimate the subgradients explicitly.\n\nAlgorithm 2: OIADMM\nInput: Pt ∈ R^{m×n}, Ât ∈ R^{m×k}, Δt ∈ R^{m×n}, X̂t ∈ R^{k×n}, βt ≥ 0, τt ≥ 0\nΓ̃t ← Pt − ÂtX̂t\nΓt+1 = argmin_Γ ‖Γ‖1 + ⟨Δt, Γ̃t − Γ⟩ + (βt/2)‖Γ̃t − Γ‖F^2    (⟹ Γt+1 = soft(Γ̃t + Δt/βt, 1/βt))\nGt+1 ← −(Δt/βt + Γ̃t − Γt+1)X̂tᵀ\nÂt+1 = argmin_{A∈A} βt(⟨Gt+1, A − Ât⟩ + (1/2τt)‖A − Ât‖F^2)    (⟹ Ât+1 = ΠA(max{0, Ât − τtGt+1}))\nΔt+1 = Δt + βt(Pt − Ât+1X̂t − Γt+1)\nReturn Ât+1 and Δt+1\n\nThe Algorithm OIADMM is simple. 
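In code, one OIADMM step is only a few matrix operations. Below is our NumPy sketch of Algorithm 2; the column projection onto A is implemented via simplex projection in the style of Duchi et al. [4], an implementation choice on our part (the paper's own code is in Matlab):

```python
import numpy as np

def soft(R, T):
    """Elementwise soft-thresholding: sign(r) * max(|r| - T, 0)."""
    return np.sign(R) * np.maximum(np.abs(R) - T, 0.0)

def project_column(a):
    """Euclidean projection of one column onto {a >= 0, ||a||_1 <= 1}.

    Clip negatives first; if the l1 cap already holds we are done,
    otherwise project onto the probability simplex (Duchi et al. style).
    """
    a = np.maximum(a, 0.0)
    if a.sum() <= 1.0:
        return a
    u = np.sort(a)[::-1]                       # sorted in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(a) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(a - theta, 0.0)

def oiadmm_step(P_t, A_t, Delta_t, X_t, beta_t, tau_t):
    """One OIADMM update (Algorithm 2): all steps are closed-form,
    so no subgradient of the l1 objective is ever computed."""
    Gamma_tilde = P_t - A_t @ X_t
    Gamma_next = soft(Gamma_tilde + Delta_t / beta_t, 1.0 / beta_t)
    G_next = -(Delta_t / beta_t + Gamma_tilde - Gamma_next) @ X_t.T
    A_next = np.maximum(A_t - tau_t * G_next, 0.0)
    A_next = np.apply_along_axis(project_column, 0, A_next)  # Pi_A
    Delta_next = Delta_t + beta_t * (P_t - A_next @ X_t - Gamma_next)
    return A_next, Delta_next
```

Each step costs a constant number of m×n and m×k matrix products, which is where the O(mnk) per-timestep running time discussed below comes from.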
Consider the following minimization problem at time t:\n\nmin_{A∈A} ‖Pt − AX̂t‖1.\n\nWe can rewrite this minimization problem as:\n\nmin_{A∈A, Γ} ‖Γ‖1 such that Pt − AX̂t = Γ.    (4)\n\nThe augmented Lagrangian of (4) is:\n\nL(A, Γ, Δ) = ‖Γ‖1 + ⟨Δ, Pt − AX̂t − Γ⟩ + (βt/2)‖Pt − AX̂t − Γ‖F^2,    (5)\n\nwhere Δ ∈ R^{m×n} is a multiplier and βt > 0 is a penalty parameter.\n\n(Footnote 4: For ease of presentation and analysis, we assume that m and n don't vary with time. One could allow for changing m and n by carefully adjusting the sizes of the matrices by zero-padding.)\n\n(Footnote 5: The name OIADMM reflects the fact that the algorithm is based on the alternating directions method of multipliers (ADMM) procedure.)\n\nOIADMM is summarized in Algorithm 2. The algorithm generates a sequence of iterates {Γt, Ât, Δt}, t = 1, 2, . . .. At each time t, instead of solving (4) completely, it runs only one ADMM update step on the variables (Γt, Ât, Δt). The complete analysis of Algorithm 2 is presented in the full version of the paper [11]. Here, we just summarize the main result in the following theorem.\n\nTheorem 4.2. Let {Γt, Ât, Δt} be the sequences generated by the OIADMM procedure and R(T) be the regret as defined above. Assume the following conditions hold: (i) for all t, the Frobenius norm of ∂‖Γt‖1 is upper bounded by Φ; (ii) Â1 = 0m×k and ‖Aopt‖F ≤ D; (iii) Δ1 = 0m×n; and (iv) for all t, 1/τt ≥ 2Ψmax(X̂t). Setting, for all t, βt = Φ√T / (D√τm), where τm = max_t {1/τt}, we have\n\nR(T) ≤ ΦD√(τmT) + Σ_{t=1}^T ‖AoptEt‖1.\n\nIn the above theorem one could replace τm by any upper bound on it (i.e., we don't need to know τm exactly).\n\nCondition on the Et's for Sublinear Regret. In a standard online learning setting, the (Pt, X̂t) made available to the online learning algorithm would be the same as the (Pt, Xt) made available to the batch dictionary learning algorithm in hindsight, so that X̂t = Xt ⟹ Et = 0, yielding an O(√T) regret. More generally, as long as Σ_{t=1}^T ‖Et‖p = o(T) for some suitable p-norm, we get a sublinear regret bound (see Footnote 6). For example, if {Zt} is a sequence of matrices such that ‖Zt‖p = O(1) for all t, then setting Et = t^{−ε}Zt with ε > 0 yields a sublinear regret. This gives a sufficient condition for sublinear regret, and it is an interesting open problem to extend the analysis to other cases.\n\nRunning Time. For the ith column of the dictionary matrix, the projection onto A can be done in O(si log m) time, where si is the number of non-zero elements in the ith column, using the projection-onto-ℓ1-ball algorithm of Duchi et al. [4]. The simplest implementation of OIADMM takes O(mnk) time at each timestep because of the matrix multiplications involved.\n\n5 Experimental Results\n\nIn this section, we present experiments to compare and contrast the performance of the ℓ1-batch and ℓ1-online dictionary learning algorithms for the task of novel document detection. 
We also present results highlighting the superiority of using an ℓ1- over an ℓ2-penalty on the reconstruction error for this task (validating the discussion in Section 3).\n\nImplementation of BATCH. In our implementation, we grow the dictionary size by η in each timestep. Growing the dictionary size is essential for the batch algorithm because, as t increases, the number of columns of P[t] also increases, and therefore a larger dictionary is required to compactly represent all the documents in P[t]. For solving (3), we use alternating minimization over the variables. The pseudo-code description is given in the full version of the paper [11]. The optimization problems arising in the sparse coding and dictionary learning steps are solved using ADMMs.\n\nOnline Algorithm for Novel Document Detection. Our online algorithm (see Footnote 7) uses the same novel document detection step as Algorithm BATCH, but dictionary learning is done using OIADMM. For a pseudo-code description, see the full version of the paper [11]. Notice that the sparse coding matrices X1, . . . , Xt of Algorithm BATCH could be different from X̂1, . . . , X̂t. If these sequences of matrices are close to each other, then we have a sublinear regret on the objective function (see Footnote 8).\n\nEvaluation of Novel Document Detection. For performance evaluation, we assume that documents in the corpus have been manually identified with a set of topics. For simplicity, we assume that each document is tagged with the single most dominant topic that it is associated with, which we call the true topic of that document. We call a document y arriving at time t novel if the true topic of y has not appeared before time t. So at time t, given a set of documents, the task of novel document detection is to classify each document as either novel (positive) or non-novel (negative). For evaluating this classification task, we use the standard Area Under the ROC Curve (AUC) [13].\n\n(Footnote 6: This follows from Hölder's inequality, which gives Σ_{t=1}^T ‖AoptEt‖1 ≤ ‖Aopt‖q (Σ_{t=1}^T ‖Et‖p) for 1 ≤ p, q ≤ ∞ and 1/p + 1/q = 1, and by assuming that ‖Aopt‖q is bounded. Here, ‖·‖p denotes the Schatten p-norm.)\n\n(Footnote 7: In our experiments, the number of documents introduced in each timestep is almost of the same order, and hence there is no need to change the size of the dictionary across timesteps for the online algorithm.)\n\n(Footnote 8: As noted earlier, we cannot do a comparison without making any assumptions.)\n\nPerformance Evaluation for ℓ1-Dictionary Learning. We use a simple reconstruction error measure for comparing the dictionaries produced by our ℓ1-batch and ℓ1-online algorithms. We want the dictionary at time t to be a good basis for representing all the documents in P[t] ∈ R^{m×Nt}. This leads us to define the sparse reconstruction error (SRE) of a dictionary A at time t as\n\nSRE(A) def= (1/Nt) min_{X≥0} (‖P[t] − AX‖1 + λ‖X‖1).\n\nA dictionary with a smaller SRE is better on average at sparsely representing the documents in P[t].\n\nNovel Document Detection using ℓ2-Dictionary Learning. To justify the choice of using an ℓ1-penalty (on the reconstruction error) for novel document detection, we performed experiments comparing the ℓ1- and ℓ2-penalties for this task. In the ℓ2-setting, for the sparse coding step we used a fast implementation of the LARS algorithm with positivity constraints [7], and the dictionary learning was done by solving a non-negative matrix factorization problem with additional sparsity constraints (also known as the non-negative sparse coding problem [9]). 
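Because the SRE objective decouples across the columns of P[t], it can be estimated with any per-document solver for formulation (2). A sketch of ours follows; `sparse_code` is a hypothetical solver interface, not an API from the paper:

```python
import numpy as np

def sre(A, P, lam, sparse_code):
    """Sparse reconstruction error of dictionary A on document matrix P:
    (1/N) * min_{X >= 0} (||P - A X||_1 + lam * ||X||_1), computed column
    by column since the objective decouples across the N documents."""
    total = 0.0
    for j in range(P.shape[1]):
        y = P[:, j]
        x = sparse_code(y, A, lam)
        total += np.abs(y - A @ x).sum() + lam * np.abs(x).sum()
    return total / P.shape[1]
```

A dictionary with smaller SRE is, on average, a better sparse basis for the documents seen so far, which is exactly how the batch and online dictionaries are compared below.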
A complete pseudo-code description is given in the full version of the paper [11] (see Footnote 9).\n\nExperimental Setup. All reported results are based on a Matlab implementation running on a quad-core 2.33 GHz Intel processor with 32GB RAM. The regularization parameter λ is set to 0.1, which yields reasonable sparsities in our experiments. The OIADMM parameter τt is set to 1/(2Ψmax(X̂t)) (chosen according to Theorem 4.2) and βt is fixed to 5 (obtained through tuning). The ADMM parameters for the sparse coding and batch dictionary learning steps are set as suggested in [10] (refer to the full version [11]). In the batch algorithms, we grow the dictionary sizes by η = 10 in each timestep. The threshold value ζ is treated as a tunable parameter.\n\n5.1 Experiments on News Streams\n\nOur first dataset is drawn from the NIST Topic Detection and Tracking (TDT2) corpus, which consists of news stories from the first half of 1998. In our evaluation, we used a set of 9000 documents represented over 19528 terms and distributed into the top 30 TDT2 human-labeled topics over a period of 27 weeks. We introduce the documents in groups. At timestep 0, we introduce the first 1000 documents, and these documents are used for initializing the dictionary. We use an alternating minimization procedure over the variables of (1) to initialize the dictionary. In these experiments the size of the initial dictionary is k = 200. In each subsequent timestep t ∈ {1, . . . , 8}, we provide the batch and online algorithms the same set of 1000 documents. In Figure 1, we present novel document detection results for those timesteps where at least one novel document was introduced. Table 1 shows the corresponding AUC numbers. 
The results show that using an ℓ1-penalty on the reconstruction error is better for novel document detection than using an ℓ2-penalty.\n\nFigure 1: ROC curves (true positive rate vs. false positive rate) for TDT2, for the timesteps where novel documents were introduced (1, 2, 5, 6, and 8), comparing ONLINE, BATCH-IMPL, and L2-BATCH.\n\n(Footnote 9: We used the SPAMS package http://spams-devel.gforge.inria.fr/ in our implementation.)\n\nTable 1: AUC numbers for the ROC plots in Figure 1.\n\nTimestep | No. of Novel Docs. | No. of Non-novel Docs. | AUC ℓ1-online | AUC ℓ1-batch | AUC ℓ2-batch\n1 | 19 | 981 | 0.791 | 0.815 | 0.674\n2 | 53 | 947 | 0.694 | 0.704 | 0.586\n5 | 116 | 884 | 0.732 | 0.764 | 0.601\n6 | 66 | 934 | 0.881 | 0.898 | 0.816\n8 | 65 | 935 | 0.757 | 0.760 | 0.701\nAvg. | | | 0.771 | 0.788 | 0.676\n\nComparison of the ℓ1-online and ℓ1-batch Algorithms. The ℓ1-online and ℓ1-batch algorithms have almost identical performance in terms of detecting novel documents (see Table 1). However, the online algorithm is much more computationally efficient. In Figure 2(a), we compare the running times of these algorithms. As noted earlier, the running time of the batch algorithm goes up as t increases (as it has to optimize over the entire past). However, the running time of the online algorithm is independent of the past and depends only on the number of documents introduced in each timestep (which in this case is always 1000). Therefore, the running time of the online algorithm is almost the same across different timesteps. 
As expected, the run-time gap between the ℓ1-batch and ℓ1-online algorithms widens as t increases: in the first timestep the online algorithm is 5.4 times faster, and this factor rapidly increases to 11.5 in just 7 timesteps.

In Figure 2(b), we compare the dictionaries produced by the ℓ1-batch and ℓ1-online algorithms under the SRE metric. In the first few timesteps, the SRE of the dictionaries produced by the online algorithm is slightly lower than that of the batch algorithm. However, this gets corrected after a few timesteps, and, as expected, the batch algorithm produces better dictionaries later on.

Figure 2: Running time and SRE plots for the TDT2 and Twitter datasets. [Four panels: (a) running time for TDT2, (b) SRE for TDT2, (c) running time for Twitter, (d) SRE for Twitter.]

5.2 Experiments on Twitter

Our second dataset is from an application of monitoring Twitter for marketing and PR for smartphone and wireless providers. We used the Twitter Decahose to collect a 10% sample of all tweets (posts) from Sept 15 to Oct 05, 2011. From this, we filtered the tweets relevant to "Smartphones" using a scheme presented in [3], which utilizes the Wikipedia ontology to do the filtering. Our dataset comprises 127760 tweets over these 21 days, and the vocabulary size is 6237 words. We used the tweets from Sept 15 to 21 (34292 in number) to initialize the dictionaries. Subsequently, at each timestep, we give as input to both algorithms all the tweets from a given day (for a period of 14 days between Sept 22 and Oct 05). Since this dataset is unlabeled, we do a quantitative evaluation of the ℓ1-batch vs. ℓ1-online algorithms (in terms of SRE) and a qualitative evaluation of the ℓ1-online algorithm on the novel document detection task. Here, the size of the initial dictionary is k = 100. Figure 2(c) shows the running times on the Twitter dataset.
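Both the TDT2 and Twitter comparisons report dictionary quality via the SRE metric. Assuming SRE averages the ℓ1 sparse coding objective over the documents of a timestep (the paper's precise definition is in the full version [11]), it can be sketched as follows; the function name and this exact formula are an illustrative stand-in, not the paper's evaluation code.

```python
import numpy as np

def sparse_reconstruction_error(Y, A, X, lam=0.1):
    """Hypothetical SRE: the l1 reconstruction cost plus the l1 sparsity
    penalty, averaged over the n documents (columns of Y). This mirrors the
    l1-dictionary learning objective; lower values mean the dictionary A
    reconstructs the documents better with sparse codes X."""
    n = Y.shape[1]
    return (np.abs(Y - A @ X).sum() + lam * np.abs(X).sum()) / n
```

Under this reading, the curves in Figures 2(b) and 2(d) compare how well the batch and online dictionaries explain the documents of each timestep at a fixed sparsity level.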
At the first timestep the online algorithm is already 10.8 times faster, and this speedup escalates to 18.2 by the 14th timestep. Figure 2(d) shows the SRE of the dictionaries produced by these algorithms. In this case, the SRE of the dictionaries produced by the batch algorithm is consistently better than that of the online algorithm, but as the running time plots suggest, this improvement comes at a very steep price.

Date         Sample Novel Tweets Detected Using our Online Algorithm
2011-09-26   Android powered 56 percent of smartphones sold in the last three months. Sad thing is it can't lower the rating of ios!
2011-09-29   How Windows 8 is faster, lighter and more efficient: WP7 Droid Bionic Android 2.3.4 HP TouchPad white ipods 72
2011-10-03   U.S. News: AT&T begins sending throttling warnings to top data hogs: AT&T did away with its unlimited da... #iPhone
2011-10-04   Can't wait for the iphone 4s #letstalkiphone
2011-10-05   Everybody put an iPhone up in the air one time #ripstevejobs

Table 2: Sample novel documents detected by our online algorithm.

Table 2 shows a representative set of novel tweets identified by our online algorithm. Using a completely automated process (refer to the full version [11]), we are able to detect breaking news and trends relevant to the smartphone market, such as AT&T throttling data bandwidth, the launch of the iPhone 4S, and the death of Steve Jobs.

References

[1] M. Aharon, M. Elad, and A. Bruckstein.
The K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Transactions on Signal Processing, 54(11), 2006.

[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 2011.

[3] V. Chenthamarakshan, P. Melville, V. Sindhwani, and R. D. Lawrence. Concept Labeling: Building Text Classifiers with Minimal Supervision. In IJCAI, pages 1225-1230, 2011.

[4] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient Projections onto the ℓ1-Ball for Learning in High Dimensions. In ICML, pages 272-279, 2008.

[5] J. Duchi and Y. Singer. Efficient Online and Batch Learning using Forward Backward Splitting. JMLR, 10:2873-2898, 2009.

[6] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite Objective Mirror Descent. In COLT, pages 14-26, 2010.

[7] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise Coordinate Optimization. The Annals of Applied Statistics, 1(2):302-332, 2007.

[8] E. Hazan, A. Agarwal, and S. Kale. Logarithmic Regret Algorithms for Online Convex Optimization. Machine Learning, 69(2-3):169-192, 2007.

[9] P. O. Hoyer. Non-Negative Sparse Coding. In IEEE Workshop on Neural Networks for Signal Processing, pages 557-565, 2002.

[10] S. P. Kasiviswanathan, P. Melville, A. Banerjee, and V. Sindhwani. Emerging Topic Detection using Dictionary Learning. In CIKM, pages 745-754, 2011.

[11] S. P. Kasiviswanathan, H. Wang, A. Banerjee, and P. Melville. Online ℓ1-Dictionary Learning with Application to Novel Document Detection. http://www.cse.psu.edu/~kasivisw/fullonlinedict.pdf.

[12] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online Learning for Matrix Factorization and Sparse Coding. JMLR, 11:19-60, 2010.

[13] C.
Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[14] P. Melville, J. Leskovec, and F. Provost, editors. Proceedings of the First Workshop on Social Media Analytics. ACM, 2010.

[15] B. Olshausen and D. Field. Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1? Vision Research, 37(23):3311-3325, 1997.

[16] S. Petrović, M. Osborne, and V. Lavrenko. Streaming First Story Detection with Application to Twitter. In HLT '10, pages 181-189. ACL, 2010.

[17] A. Saha and V. Sindhwani. Learning Evolving and Emerging Topics in Social Media: A Dynamic NMF Approach with Temporal Regularization. In WSDM, pages 693-702, 2012.

[18] S. Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning, 4(2), 2012.

[19] H. Wang and A. Banerjee. Online Alternating Direction Method. In ICML, 2012.

[20] J. Wright and Y. Ma. Dense Error Correction via L1-Minimization. IEEE Transactions on Information Theory, 56(7):3540-3560, 2010.

[21] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust Face Recognition via Sparse Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210-227, Feb. 2009.

[22] L. Xiao. Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization. JMLR, 11:2543-2596, 2010.

[23] A. Y. Yang, S. S. Sastry, A. Ganesh, and Y. Ma. Fast L1-Minimization Algorithms and an Application in Robust Face Recognition: A Review. In International Conference on Image Processing, pages 1849-1852, 2010.

[24] J. Yang and Y. Zhang. Alternating Direction Algorithms for L1-Problems in Compressive Sensing. SIAM Journal on Scientific Computing, 33(1):250-278, 2011.

[25] M. Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In ICML,
In ICML,\n\npages 928\u2013936, 2003.\n\n9\n\n\f", "award": [], "sourceid": 1116, "authors": [{"given_name": "Shiva", "family_name": "Kasiviswanathan", "institution": null}, {"given_name": "Huahua", "family_name": "Wang", "institution": null}, {"given_name": "Arindam", "family_name": "Banerjee", "institution": null}, {"given_name": "Prem", "family_name": "Melville", "institution": null}]}