{"title": "Regularized Co-Clustering with Dual Supervision", "book": "Advances in Neural Information Processing Systems", "page_first": 1505, "page_last": 1512, "abstract": "By attempting to simultaneously partition both the rows (examples) and columns (features) of a data matrix, Co-clustering algorithms often demonstrate surprisingly impressive performance improvements over traditional one-sided (row) clustering techniques. A good clustering of features may be seen as a combinatorial transformation of the data matrix, effectively enforcing a form of regularization that may lead to a better clustering of examples (and vice-versa). In many applications, partial supervision in the form of a few row labels as well as column labels may be available to potentially assist co-clustering. In this paper, we develop two novel semi-supervised multi-class classification algorithms motivated respectively by spectral bipartite graph partitioning and matrix approximation (e.g., non-negative matrix factorization) formulations for co-clustering. These algorithms (i) support dual supervision in the form of labels for both examples and/or features, (ii) provide principled predictive capability on out-of-sample test data, and (iii) arise naturally from the classical Representer theorem applied to regularization problems posed on a collection of Reproducing Kernel Hilbert Spaces. Empirical results demonstrate the effectiveness and utility of our algorithms.", "full_text": "Regularized Co-Clustering with Dual Supervision\n\nVikas Sindhwani\n\nJianying Hu\n\nAleksandra Mojsilovic\n\nIBM Research, Yorktown Heights, NY 10598\n\n{vsindhw, jyhu, aleksand}@us.ibm.com\n\nAbstract\n\nBy attempting to simultaneously partition both the rows (examples) and columns (features) of a data matrix, Co-clustering algorithms often demonstrate surprisingly impressive performance improvements over traditional one-sided row clustering techniques. 
A good clustering of features may be seen as a combinatorial transformation of the data matrix, effectively enforcing a form of regularization that may lead to a better clustering of examples (and vice-versa). In many applications, partial supervision in the form of a few row labels as well as column labels may be available to potentially assist co-clustering. In this paper, we develop two novel semi-supervised multi-class classification algorithms motivated respectively by spectral bipartite graph partitioning and matrix approximation formulations for co-clustering. These algorithms (i) support dual supervision in the form of labels for both examples and/or features, (ii) provide principled predictive capability on out-of-sample test data, and (iii) arise naturally from the classical Representer theorem applied to regularization problems posed on a collection of Reproducing Kernel Hilbert Spaces. Empirical results demonstrate the effectiveness and utility of our algorithms.\n\n1 Introduction\n\nConsider the setting where we are given large amounts of unlabeled data together with dual supervision in the form of a few labeled examples as well as a few labeled features, and the goal is to estimate an unknown classification function. This setting arises naturally in numerous applications. Imagine, for example, the problem of inferring sentiment (\u201cpositive\u201d versus \u201cnegative\u201d) associated with presidential candidates from online political blog posts represented as word vectors, given the following: (a) a vast collection of blog posts easily downloadable from the web (unlabeled examples), (b) a few blog posts whose sentiment for a candidate is manually identified (labeled examples), and (c) prior knowledge of words that reflect positive (e.g., \u2018superb\u2019) and negative (e.g., \u2018awful\u2019) sentiment (labeled features). 
Most existing semi-supervised algorithms do not explicitly incorporate feature supervision. They typically implement the cluster assumption [3] by learning decision boundaries such that unlabeled points belonging to the same cluster are given the same label, and empirical loss over labeled examples is concurrently minimized. In situations where the classes are predominantly supported on unknown subsets of similar features, it is clear that feature supervision can potentially illuminate the true cluster structure inherent in the unlabeled examples over which the cluster assumption ought to be enforced.\n\nEven when feature supervision is not available, there is ample empirical evidence in numerous recent papers in the co-clustering literature (see e.g., [5, 1] and references therein), suggesting that the clustering of columns (features) of a data matrix can lead to massive improvements in the quality of row (examples) clustering. An intuitive explanation is that column clustering enforces a form of dimensional reduction or implicit regularization that is responsible for performance enhancements observed in many applications such as text clustering, microarray data analysis and video content mining [1]. In this paper, we utilize data-dependent co-clustering regularizers for semi-supervised learning in the presence of partial dual supervision.\n\nOur starting point is the spectral bipartite graph partitioning approach of [5] which we briefly review in Section 2.1. This approach effectively applies spectral clustering on a graph representation of the data matrix and is also intimately related to Singular Value Decomposition. In Section 2.2 we review an equivalence between this approach and a matrix approximation objective function that is minimized under orthogonality constraints [6]. 
By dropping the orthogonality constraints but imposing non-negativity constraints, one is led to a large family of co-clustering algorithms that arise from the non-negative matrix factorization literature.\n\nBased on the algorithmic intuitions embodied in the algorithms above, we develop two semi-supervised classification algorithms that extend the spectral bipartite graph partitioning approach and the matrix approximation approach respectively. We start with Reproducing Kernel Hilbert Spaces (RKHSs) defined over both row and column spaces. These RKHSs are then coupled through co-clustering regularizers. In the first algorithm, we directly adopt a graph Laplacian regularizer constructed from the bipartite graph of [5] and include it as a row and column smoothing term in the standard regularization objective function. The solution is obtained by solving a convex optimization problem. This approach may be viewed as a modification of the Manifold Regularization framework [2] where we now jointly learn row and column classification functions. In the second algorithm proposed in this paper, we instead add a (non-convex) matrix approximation term to the objective function, which is then minimized using a block-coordinate descent procedure.\n\nUnlike their unsupervised counterparts, our methods support dual supervision and naturally possess out-of-sample extension. In Section 4, we provide experimental results where we compare against various baseline approaches, and highlight the performance benefits of feature supervision.\n\n2 Co-Clustering Algorithms\n\nLet X denote the data matrix with n data points and d features. The methods that we discuss in this section output a row partition function $\\pi_r : \\{i\\}_{i=1}^{n} \\mapsto \\{j\\}_{j=1}^{m_r}$ and a column partition function $\\pi_c : \\{i\\}_{i=1}^{d} \\mapsto \\{j\\}_{j=1}^{m_c}$ that give cluster assignments to row and column indices respectively. Here, $m_r$ is the desired number of row clusters and $m_c$ is the desired number of column clusters. Below, by $x_i$ we mean the ith example (row) and by $f_j$ we mean the jth column (feature) in the data matrix.\n\n2.1 Bipartite Graph Partitioning\n\nIn the co-clustering technique introduced by [5], the data matrix is modeled as a bipartite graph with examples (rows) as one set of nodes and features (columns) as another. An edge (i, j) exists if feature $f_j$ assumes a non-zero value in example $x_i$, in which case the edge is given a weight of $X_{ij}$. This bipartite graph is undirected and there are no inter-example or inter-feature edges. The adjacency matrix, W, and the normalized Laplacian [4], M, of this graph are given by,\n\n$$W = \\begin{pmatrix} 0 & X \\\\ X^T & 0 \\end{pmatrix}, \\qquad M = I - D^{-1/2} W D^{-1/2} \\tag{1}$$\n\nwhere D is the diagonal degree matrix defined by $D_{ii} = \\sum_j W_{ij}$ and I is the (n + d) \u00d7 (n + d) identity matrix. Guided by the premise that column clustering induces row clustering while row clustering induces column clustering, [5] propose to find an optimal partitioning of the nodes of the bipartite graph. This method is restricted to obtaining co-clusterings where $m_r = m_c = m$. The m-partitioning is obtained by minimizing the relaxation of the normalized cut objective function using standard spectral clustering techniques. This reduces to first constructing a spectral representation of rows and columns given by the eigenvectors of M associated with its smallest eigenvalues, and then performing standard k-means clustering on this representation, to finally obtain the partition functions $\\pi_r$, $\\pi_c$. Due to the special structure of Eqn. 
1, it can be shown that the spectral representation used in this algorithm is related to the singular vectors of a normalized version of X.\n\n2.2 Matrix Approximation Formulation\n\nIn [6] it is shown that the bipartite spectral graph partitioning is closely related to solving the following matrix approximation problem,\n\n$$(F_r^\\star, F_c^\\star) = \\operatorname*{argmin}_{F_r^T F_r = I,\\; F_c^T F_c = I} \\| X - F_r F_c^T \\|_{fro}$$\n\nwhere $F_r$ is an n \u00d7 m matrix and $F_c$ is a d \u00d7 m matrix. Once the minimization is performed, $\\pi_r(i) = \\operatorname{argmax}_j (F_r^\\star)_{ij}$ and $\\pi_c(i) = \\operatorname{argmax}_j (F_c^\\star)_{ij}$. In a non-negative matrix factorization approach, the orthogonality constraints are dropped to make the optimization easier while non-negativity constraints $F_r, F_c \\geq 0$ are introduced with the goal of lending better interpretability to the solutions. There are numerous multiplicative update algorithms for NMF which essentially have the flavor of alternating non-convex optimization. In our empirical comparisons in Section 4, we use the Alternating Constrained Least Squares (ACLS) approach of [12]. In Section 3.2 we consider a 3-factor non-negative matrix approximation to incorporate unequal values of $m_r$ and $m_c$, and to improve the quality of the approximation. See [7, 13] for more details on matrix tri-factorization based formulations for co-clustering.\n\n3 Objective Functions for Regularized Co-clustering with Dual Supervision\n\nLet us consider examples x to be elements of $\\mathcal{R} \\subset \\Re^d$. We consider column values f for each feature to be a data point in $\\mathcal{C} \\subset \\Re^n$. Our goal is to learn partition functions defined over the entire row and column spaces (as opposed to matrix indices), i.e., $\\pi_r : \\mathcal{R} \\mapsto \\{i\\}_{i=1}^{m_r}$ and $\\pi_c : \\mathcal{C} \\mapsto \\{i\\}_{i=1}^{m_c}$. For this purpose, let us introduce $k_r : \\mathcal{R} \\times \\mathcal{R} \\to \\Re$ to be the row kernel that defines an associated RKHS $\\mathcal{H}_r$. 
Similarly, $k_c : \\mathcal{C} \\times \\mathcal{C} \\to \\Re$ denotes the column kernel whose associated RKHS is $\\mathcal{H}_c$. Below, we define $\\pi_r$, $\\pi_c$ using these real valued function spaces.\n\nConsider a simultaneous assignment of rows into $m_r$ classes and columns into $m_c$ classes. For any data point x, denote $F_r(x) = [f_r^1(x) \\cdots f_r^{m_r}(x)]^T \\in \\Re^{m_r}$ to be a vector whose elements are soft class assignments where $f_r^j \\in \\mathcal{H}_r$ for all j. For the given n data points, denote $F_r$ to be the $n \\times m_r$ class assignment matrix. Correspondingly, $F_c(f)$ is defined for any feature $f \\in \\mathcal{C}$, and $F_c$ denotes the associated column class assignment matrix. Additionally, we are given dual supervision in the form of label matrices $Y_r \\in \\Re^{n \\times m_r}$ and $Y_c \\in \\Re^{d \\times m_c}$ where $(Y_r)_{ij} = 1$ specifies that the ith example is labeled with class j (similarly for the feature labels matrix $Y_c$). The associated row sum for a labeled point is 1. Unlabeled points have all-zero rows, and the row sums are therefore 0. Let $J_r$ ($J_c$) denote a diagonal matrix of size n \u00d7 n (d \u00d7 d) whose diagonal entry is 1 for labeled examples (features) and 0 otherwise. By $I_s$ we will denote an identity matrix of size s \u00d7 s. We use the notation tr(A) to mean the trace of the matrix A.\n\n3.1 Manifold Regularization with Bipartite Graph Laplacian (MR)\n\nIn this approach, we set up the following optimization problem,\n\n$$\\operatorname*{argmin}_{F_r \\in \\mathcal{H}_r^{m_r},\\; F_c \\in \\mathcal{H}_c^{m_c}} \\; \\frac{\\gamma_r}{2} \\sum_{i=1}^{m_r} \\| f_r^i \\|_{\\mathcal{H}_r}^2 + \\frac{\\gamma_c}{2} \\sum_{i=1}^{m_c} \\| f_c^i \\|_{\\mathcal{H}_c}^2 + \\frac{1}{2} \\operatorname{tr}\\left[ (F_r - Y_r)^T J_r (F_r - Y_r) \\right] + \\frac{1}{2} \\operatorname{tr}\\left[ (F_c - Y_c)^T J_c (F_c - Y_c) \\right] + \\frac{\\mu}{2} \\operatorname{tr}\\left[ \\begin{pmatrix} F_r^T & F_c^T \\end{pmatrix} M \\begin{pmatrix} F_r \\\\ F_c \\end{pmatrix} \\right] \\tag{2}$$\n\nThe first two terms impose the usual RKHS norm on the class indicator functions for rows and columns. The middle two terms measure squared loss on labeled data. 
The final term measures smoothness of the row and column indicator functions with respect to the bipartite graph introduced in Section 2.1. This term also incorporates unlabeled examples and features. $\\gamma_r$, $\\gamma_c$, $\\mu$ are real-valued parameters that trade off the various regularization terms.\n\nBy the Representer Theorem, the solution has the form,\n\n$$f_r^j(x) = \\sum_{i=1}^{n} \\alpha_{ij} k_r(x, x_i), \\; 1 \\leq j \\leq m_r, \\qquad f_c^j(f) = \\sum_{i=1}^{d} \\beta_{ij} k_c(f, f_i), \\; 1 \\leq j \\leq m_c \\tag{3}$$\n\nLet \u03b1, \u03b2 denote the corresponding optimal expansion coefficient matrices. Then, plugging in Eqn. 3 and solving the optimization problem, the solution is easily seen to be given by,\n\n$$\\left[ \\begin{pmatrix} \\gamma_r I_n & 0 \\\\ 0 & \\gamma_c I_d \\end{pmatrix} + \\mu M \\begin{pmatrix} K_r & 0 \\\\ 0 & K_c \\end{pmatrix} + \\begin{pmatrix} J_r K_r & 0 \\\\ 0 & J_c K_c \\end{pmatrix} \\right] \\begin{pmatrix} \\alpha \\\\ \\beta \\end{pmatrix} = \\begin{pmatrix} Y_r \\\\ Y_c \\end{pmatrix} \\tag{4}$$\n\nwhere $K_r$, $K_c$ are Gram matrices over datapoints and features respectively. The partition functions are then defined by\n\n$$\\pi_r(x) = \\operatorname*{argmax}_{1 \\leq j \\leq m} \\sum_{i=1}^{n} \\alpha_{ij} k_r(x, x_i), \\qquad \\pi_c(f) = \\operatorname*{argmax}_{1 \\leq j \\leq m} \\sum_{i=1}^{d} \\beta_{ij} k_c(f, f_i) \\tag{5}$$\n\nAs in Section 2.1, we assume $m_r = m_c = m$. If the linear system above is solved by explicitly computing the matrix inverse, the computational cost is $O((n + d)^3 + (n + d)^2 m)$. This approach is closely related to the Manifold Regularization framework of [2], and may be viewed as a modification of the Laplacian Regularized Least Squares (LAPRLS) algorithm, which uses a Euclidean nearest neighbor row similarity graph to capture the manifold structure in the data. 
Instead of using the squared loss, one can develop variants using the SVM Hinge loss or the logistic loss function. One can also use a large family of graph regularizers derived from the graph Laplacian [3]. In particular, we use the iterated Laplacian of the form $M^p$ where p is an integer.\n\n3.2 Matrix Approximation under Dual Supervision (MA)\n\nWe now consider an alternative objective function where instead of the graph Laplacian regularizer, we add a penalty term that measures how well the data matrix is approximated by a trifactorization $F_r Q F_c^T$,\n\n$$\\operatorname*{argmin}_{F_r \\in \\mathcal{H}_r^{m_r},\\; F_c \\in \\mathcal{H}_c^{m_c},\\; Q \\in \\Re^{m_r \\times m_c}} \\; \\frac{\\gamma_r}{2} \\sum_{i=1}^{m_r} \\| f_r^i \\|_{\\mathcal{H}_r}^2 + \\frac{\\gamma_c}{2} \\sum_{i=1}^{m_c} \\| f_c^i \\|_{\\mathcal{H}_c}^2 + \\frac{1}{2} \\operatorname{tr}\\left[ (F_r - Y_r)^T J_r (F_r - Y_r) \\right] + \\frac{1}{2} \\operatorname{tr}\\left[ (F_c - Y_c)^T J_c (F_c - Y_c) \\right] + \\frac{\\mu}{2} \\| X - F_r Q F_c^T \\|_{fro}^2 \\tag{6}$$\n\nAs before, the first two terms above enforce smoothness, the third and fourth terms measure squared loss over labels and the final term enforces co-clustering. The classical Representer Theorem (Eqn. 3) can again be applied since the above objective function only depends on point evaluations and RKHS norms of functions in $\\mathcal{H}_r$, $\\mathcal{H}_c$. The optimal expansion coefficient matrices, \u03b1, \u03b2, in this case are obtained by solving,\n\n$$\\operatorname*{argmin}_{\\alpha, \\beta, Q} \\; \\mathcal{J}(\\alpha, \\beta, Q) = \\frac{\\gamma_r}{2} \\operatorname{tr}\\left[ \\alpha^T K_r \\alpha \\right] + \\frac{\\gamma_c}{2} \\operatorname{tr}\\left[ \\beta^T K_c \\beta \\right] + \\frac{1}{2} \\operatorname{tr}\\left[ (K_r \\alpha - Y_r)^T J_r (K_r \\alpha - Y_r) \\right] + \\frac{1}{2} \\operatorname{tr}\\left[ (K_c \\beta - Y_c)^T J_c (K_c \\beta - Y_c) \\right] + \\frac{\\mu}{2} \\| X - K_r \\alpha Q \\beta^T K_c \\|_{fro}^2 \\tag{7}$$\n\nThis problem is not convex in \u03b1, \u03b2, Q. We propose a block coordinate descent algorithm for the problem above. Keeping two variables fixed, the optimization over the other is a convex problem with a unique solution. 
This guarantees monotonic decrease of the objective function and convergence to a stationary point. We get the simple update equations given below,\n\n$$\\frac{\\partial \\mathcal{J}}{\\partial Q} = 0 \\implies Q = (\\alpha^T K_r^2 \\alpha)^{-1} (\\alpha^T K_r X K_c \\beta) (\\beta^T K_c^2 \\beta)^{-1} \\tag{8}$$\n\n$$\\frac{\\partial \\mathcal{J}}{\\partial \\alpha} = 0 \\implies [\\gamma_r I_n + J_r K_r] \\alpha + \\mu K_r \\alpha Z_c = J_r Y_r + \\mu X K_c \\beta Q^T \\tag{9}$$\n\n$$\\frac{\\partial \\mathcal{J}}{\\partial \\beta} = 0 \\implies [\\gamma_c I_d + J_c K_c] \\beta + \\mu K_c \\beta Z_r = J_c Y_c + \\mu X^T K_r \\alpha Q \\tag{10}$$\n\n$$\\text{where } Z_c = Q \\beta^T K_c^2 \\beta Q^T, \\qquad Z_r = Q^T \\alpha^T K_r^2 \\alpha Q \\tag{11}$$\n\nIn Eqn 8, we assume that the appropriate matrix inverses exist. Eqns 9 and 10 are generalized Sylvester matrix equations of the form $A X B^\\top + C X D^\\top = E$ whose unique solution X under certain regularity conditions can be exactly obtained by an extended version of the classical Bartels-Stewart method [9] whose complexity is $O((p+q)^3)$ for a p \u00d7 q matrix variable X. Alternatively, one can solve the linear system [10]: $(B \\otimes A + D \\otimes C) \\operatorname{vec}(X) = \\operatorname{vec}(E)$ where $\\otimes$ denotes the Kronecker product and vec(X) vectorizes X in a column oriented way (it behaves as the matlab operator X(:)). Thus, the solutions to Eqns (9,10) are as follows,\n\n$$[I_{m_r} \\otimes (\\gamma_r I_n + J_r K_r) + \\mu Z_c \\otimes K_r] \\operatorname{vec}(\\alpha) = \\operatorname{vec}(J_r Y_r + \\mu X K_c \\beta Q^T) \\tag{12}$$\n\n$$[I_{m_c} \\otimes (\\gamma_c I_d + J_c K_c) + \\mu Z_r \\otimes K_c] \\operatorname{vec}(\\beta) = \\operatorname{vec}(J_c Y_c + \\mu X^T K_r \\alpha Q) \\tag{13}$$\n\nThese linear systems are of size $n m_r \\times n m_r$ and $d m_c \\times d m_c$ respectively. It is computationally prohibitive to solve these systems by direct matrix inversion. 
We use an iterative conjugate gradients (CG) technique instead, which can exploit hot-starts from the previous solution, and the fact that the matrix vector products can be computed relatively efficiently as follows,\n\n$$[I_{m_r} \\otimes (\\gamma_r I_n + J_r K_r) + \\mu Z_c \\otimes K_r] \\operatorname{vec}(\\alpha) = \\operatorname{vec}(\\mu K_r \\alpha Z_c^\\top) + \\gamma_r \\operatorname{vec}(\\alpha) + \\operatorname{vec}(J_r K_r \\alpha)$$\n\nTo optimize \u03b1 (\u03b2) given fixed Q and \u03b2 (\u03b1), we run CG with a stringent tolerance of $10^{-10}$ and maximum of 200 iterations starting from the \u03b1 (\u03b2) from the previous iteration. In an outer loop, we monitor the relative decrease in the objective function and terminate when the relative improvement falls below 0.0001. We use a maximum of 40 outer iterations where each iteration performs one round of \u03b1, \u03b2, Q optimization. Empirically, we find that the block coordinate descent approach often converges surprisingly quickly (see Section 4.2). The final classification is given by Eqn. 5.\n\n4 Empirical Study\n\nIn this section, we present an empirical study aimed at comparing the proposed algorithms with several baselines: (i) Unsupervised co-clustering with spectral bipartite graph partitioning (BIPARTITE) and non-negative matrix factorization (NMF), (ii) supervised performance of standard regularized least squares classification (RLS) that ignores unlabeled data, and (iii) one-sided semi-supervised performance obtained with Laplacian RLS (LAPRLS) which uses a Euclidean nearest-neighbor row similarity graph. The goal is to observe whether dual supervision particularly along features can help improve classification performance, and whether joint RKHS regularization as formulated in our algorithms (abbreviated MR for the manifold regularization based method of Section 3.1 and MA for the matrix approximation method of Section 3.2) along both rows and columns leads to good quality out-of-sample prediction. 
In the experiments below, the hyperparameters of RLS and LAPRLS are selected over a grid for best performance on the unlabeled set. We use Gaussian kernels with width $\\sigma_r$ for rows and $\\sigma_c$ for columns. These were set to $2^k \\sigma_{0r}$, $2^k \\sigma_{0c}$ respectively where $\\sigma_{0r}$, $\\sigma_{0c}$ are the (1/m)-quantiles of pairwise Euclidean distances among rows and columns respectively for an m class problem, and k is tuned over {\u22122, \u22121, 0, 1, 2} to optimize 3-fold cross-validation performance of fully supervised RLS. The values $\\gamma_r$, $\\gamma_c$, $\\mu$ are loosely tuned for MA, MR with respect to a single random split of the data into training and validation set; more careful hyperparameter tuning may further improve the results presented below.\n\nWe focus on performance in predicting row labels. To enable comparison with the unsupervised co-clustering methods, we use the widely used F-measure defined on pairs of examples as follows:\n\n$$\\text{Precision} = \\frac{\\text{Number of Pairs Correctly Predicted}}{\\text{Number of Pairs Predicted to be In Same Cluster or Class}}$$\n\n$$\\text{Recall} = \\frac{\\text{Number of Pairs Correctly Predicted}}{\\text{Number of Pairs in the Same Cluster or Class}}$$\n\n$$\\text{F-measure} = \\frac{2 \\cdot \\text{Precision} \\cdot \\text{Recall}}{\\text{Precision} + \\text{Recall}} \\tag{14}$$\n\n4.1 A Toy Dataset\n\nWe generated a toy 2-class dataset with 200 examples per class and 100 features to demonstrate the main observations. The feature vector for a positive example is of the form [2u \u2212 0.1  2u + 0.1], and for a negative example is of the form [2u + 0.1  2u \u2212 0.1], where u is a 50-dimensional random vector whose entries are uniformly distributed over the unit interval. It is clear that there is substantial overlap between the two classes. 
Given a column partitioning $\\pi_c$, consider the following transformation:\n\n$$T(x) = \\left( \\frac{\\sum_{i : \\pi_c(i) = 1} x_i}{|\\{i : \\pi_c(i) = 1\\}|}, \\; \\frac{\\sum_{i : \\pi_c(i) = -1} x_i}{|\\{i : \\pi_c(i) = -1\\}|} \\right)$$\n\nthat maps examples in $\\Re^{100}$ to the plane $\\Re^2$ by replacing each partition with a single feature whose value equals the mean of all features in that partition. For the correct column partitioning, $\\pi_c(i) = 1$ for $1 \\leq i \\leq 50$ and $\\pi_c(i) = -1$ for $50 < i \\leq 100$, the examples under the action of T are shown in Figure 1 (left). It is clear that T renders the data to be almost separable. It is therefore natural to attempt to (effectively) learn T in a semi-supervised manner. In Figure 1 (right), we plot the learning curves of various algorithms with respect to increasing number of row and column labels. On this dataset, co-clustering techniques (BIPARTITE, NMF) perform fairly well, and even significantly better than RLS, which has an optimized F-measure of 67% with 25 row labels. With increasing amounts of column labels, the learning curves of MR and MA steadily lift, eventually outperforming the unsupervised techniques. The hyperparameters used in this experiment are: $\\sigma_r = 2.1$, $\\sigma_c = 4.1$, $\\gamma_r = \\gamma_c = 0.001$, $\\mu = 10$ for MR and 0.001 for MA.\n\nFigure 1: left: Examples in the toy dataset under the transformation defined by the correct column partitioning. right: Performance comparison \u2013 the number of column labels used are marked.\n\n[Figure 1 plots not reproduced. Right panel: F-measure versus Number of Row Labels (5 to 25), with curves for bipartite, nmf, and MR and MA at 5, 10, 25 and 50 column labels.]\n\n4.2 Text Categorization\n\nWe performed experiments on document-word matrices drawn from the 20-newsgroups dataset preprocessed as in [15]. 
The preprocessed data has been made publicly available by the authors of [15] (see footnote 1). For each word w and class c, we computed a score as follows:\n\n$$\\text{score}(w, c) = -P(Y = c) \\log P(Y = c) + P(W = w) P(Y = c | W = w) \\log P(Y = c | W = w) + P(W \\neq w) P(Y = c | W \\neq w) \\log P(Y = c | W \\neq w),$$\n\nwhere P(Y = c) is the fraction of documents whose category is c, P(W = w) is the fraction of times word w is encountered, and $P(Y = c | W = w)$ ($P(Y = c | W \\neq w)$) is the fraction of documents with class c when w is present (absent). With these signs, the mutual information between the indicator random variable for w and the class variable, i.e., the class entropy minus the conditional class entropy given the indicator, is $\\sum_c \\text{score}(w, c)$. We simulated manual labeling of words by associating w with the class $\\operatorname{argmax}_c \\text{score}(w, c)$. Finally, we restricted attention to 631 words with highest overall mutual information and 2000 documents that belong to the following 5 classes: comp.graphics, rec.motorcycles, rec.sport.baseball, sci.space, talk.politics.mideast. Since words of talk.politics.mideast accounted for more than half the vocabulary, we used the class normalization prescribed in [11] to handle the imbalance in the labeled data.\n\nResults presented in Table 1 are averaged over 10 runs. In each run, we randomly split the documents into training and test sets, in the ratio 1 : 3. The training set is then further split into labeled and unlabeled sets by randomly selecting 75 labeled documents. We experimented with increasing number of randomly chosen word labels. The hyperparameters are as follows: $\\sigma_r = 0.43$, $\\sigma_c = 0.69$, $\\gamma_r = \\gamma_c = \\mu = 1$ for MR and $\\gamma_r = \\gamma_c = 0.0001$, $\\mu = 0.01$ for MA.\n\nWe observe that even without any word supervision MR outperforms all the baseline approaches: unsupervised co-clustering with BIPARTITE and NMF, standard RLS that only uses labeled documents, and also LAPRLS which uses a graph Laplacian based on document similarity for semi-supervised learning. 
This validates the effectiveness of the bipartite document and word graph regularizer. As the amount of word supervision increases, the performance of both MR and MA improves gracefully. The out-of-sample extension to test data is of good quality, considering that our test sets are much larger than our training sets. We also observed that the mean number of (outer) iterations required for convergence of MA generally decreases as labels are increased from 0 to 500 (number of word labels in parentheses): 28.7 (0), 12.2 (100), 12.7 (200), 9.3 (350), 7.8 (500).\n\n1 At http://www.princeton.edu/~nslonim/data/20NG_data_74000.mat.gz\n\nTable 1: Performance on 5-Newsgroups Dataset with 75 row labels\n\n(a) F-measure on Unlabeled Set\nBIPARTITE    NMF          RLS          LAPRLS\n54.8 (7.8)   54.4 (6.2)   62.2 (3.1)   62.5 (3.0)\n\n(b) F-measure on Test Set\nRLS          LAPRLS\n61.2 (1.7)   61.9 (1.4)\n\n(c) F-measure on Unlabeled Set\n# col labs   MR           MA\n0            64.7 (1.3)   60.4 (5.6)\n100          72.3 (2.2)   59.6 (5.7)\n200          77.0 (2.5)   69.2 (7.1)\n350          78.6 (2.1)   75.1 (4.1)\n500          79.3 (1.6)   77.1 (5.8)\n\n(d) F-measure on Test Set\n# col labs   MR           MA\n0            57.1 (2.1)   60.3 (7.0)\n100          60.9 (2.4)   60.9 (5.0)\n200          66.2 (2.8)   66.2 (6.2)\n350          68.1 (1.9)   70.3 (4.4)\n500          69.1 (2.4)   71.0 (6.0)\n\nIn Figure 2, we show the top unlabeled words for each class sorted by the real-valued prediction score assigned by MR (in one run trained with 100 labeled words). 
Intuitively, the main words associated with the class are retrieved.\n\nFigure 2: Top unlabeled words categorized by MR\n\nCOMP.GRAPHICS: polygon, gifs, conversion, shareware, graphics, rgb, vesa, viewers, gif, format, viewer, amiga, raster, ftp, jpeg, manipulation\n\nREC.MOTORCYCLES: biker, archive, dogs, yamaha, plo, wheel, riders, motorcycle, probes, ama, rockies, neighbors, saudi, kilometers\n\nREC.SPORT.BASEBALL: clemens, morris, pitched, hr, batters, dodgers, offense, reds, rbi, wins, mets, innings, ted, defensive, sox, inning\n\nSCI.SPACE: oo, servicing, solar, scispace, scheduled, atmosphere, missions, telescope, bursts, orbiting, energy, observatory, island, hst, dark\n\nTALK.POLITICS.MIDEAST: turkish, greek, turkey, hezbollah, armenia, territory, ohanus, appressian, sahak, melkonian, civilians, greeks\n\n4.3 Project Categorization\n\nWe also considered a problem that arises in a real business-intelligence setting. The dataset is composed of 1169 projects tracked by the Integrated Technology Services division of IBM. These projects need to be categorized into 8 predefined product categories within IBM\u2019s Server Services product line, with the eventual goal of performing various follow-up business analyses at the granularity of categories. Each project is represented as a 112-dimensional vector specifying the distribution of skills required for its delivery. Therefore, each feature is associated with a particular job role/skill set (JR/SS) combination, e.g., \u201cdata-specialist (oracle database)\u201d. Domain experts validated project (row) labels and additionally provided category labels for 25 features deemed to be important skills for delivering projects in the corresponding category. By demonstrating our algorithms on this dataset, we are able to validate a general methodology with which to approach project categorization across all service product lines (SPLs) on a regular basis. 
The amount of dual supervision available in other SPLs is indeed severely limited as both the project categories and skill definitions are constantly evolving due to the highly dynamic business environment.\n\nResults presented in Table 2 are averaged over 10 runs. In each run, we randomly split the projects into training and test sets, in the ratio 3 : 1. The training set is then further split into labeled and unlabeled sets by randomly selecting 30 labeled projects. We experimented with increasing number of randomly chosen column labels, from none to all 25 available labels. The hyperparameters are as follows: $\\gamma_r = \\gamma_c = 0.0001$, $\\sigma_r = 0.69$, $\\sigma_c = 0.27$ chosen as described earlier. Results in Tables 2(c), 2(d) are obtained with $\\mu = 10$ for MR, $\\mu = 0.001$ for MA.\n\nWe observe that BIPARTITE performs significantly better than NMF on this dataset, and is competitive with supervised RLS performance that relies only on labeled data. By using LAPRLS, performance can be slightly boosted. We find that MR outperforms all approaches significantly even with very few column labels. 
We conjecture that the comparatively lower mean and high variance in the performance of MA on this dataset is due to suboptimal local minima issues, which may be alleviated using annealing techniques or multiple random starts, commonly used for Transductive SVMs [3]. From Tables 2(c), 2(d) we also observe that both methods give high quality out-of-sample extension on this problem.\n\nTable 2: Performance on IBM Project Categorization Dataset with 30 row labels\n\n(a) F-measure on Unlabeled Set\nBIPARTITE    NMF          RLS          LAPRLS\n89.1 (2.7)   56.5 (1.1)   88.1 (7.3)   90.20 (5.8)\n\n(b) F-measure on Test Set\nRLS          LAPRLS\n87.8 (8.4)   90.2 (6.0)\n\n(c) F-measure on Unlabeled Set\n# col labs   MR           MA\n0            92.7 (4.6)   90.7 (4.8)\n5            94.9 (1.8)   87.8 (6.4)\n10           93.0 (4.2)   89.0 (8.0)\n15           92.3 (7.0)   89.1 (7.4)\n25           98.0 (0.5)   92.2 (6.0)\n\n(d) F-measure on Test Set\n# col labs   MR           MA\n0            89.2 (5.5)   90.0 (5.5)\n5            93.3 (1.7)   87.4 (6.6)\n10           91.9 (4.2)   89.1 (8.3)\n15           92.2 (5.2)   89.2 (8.8)\n25           96.4 (1.6)   92.1 (6.8)\n\n5 Conclusion\n\nWe have developed semi-supervised kernel methods that support partial supervision along both dimensions of the data. Empirical studies show promising results and highlight the previously untapped benefits of feature supervision in semi-supervised settings. For an application of closely related algorithms to blog sentiment classification, we point the reader to [14]. For recent work on text categorization with labeled features instead of labeled examples, see [8].\n\nReferences\n\n[1] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D.S. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. JMLR, 8:1919\u20131986, 2007.\n\n[2] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7:2399\u20132434, 2006.\n\n[3] O. Chapelle, B. 
Sch\u00f6lkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, 2006.\n\n[4] F. Chung, editor. Spectral Graph Theory. AMS, 1997.\n\n[5] I. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In KDD, 2001.\n\n[6] C. Ding, X. He, and H.D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In SDM, 2005.\n\n[7] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In KDD, 2006.\n\n[8] G. Druck, G. Mann, and A. McCallum. Learning from labeled features using generalized expectation criteria. In SIGIR, 2008.\n\n[9] J. Gardiner, A.J. Laub, J.J. Amato, and C.B. Moler. Solution of the Sylvester matrix equation $AXB^T + CXD^T = E$. ACM Transactions on Mathematical Software, 18(2):223\u2013231, 1992.\n\n[10] D. Harville. Matrix Algebra From a Statistician\u2019s Perspective. Springer, New York, 1997.\n\n[11] T.M. Huang and V. Kecman. Semi-supervised learning from unbalanced labeled data: an improvement. Lecture Notes in Computer Science, 3215:765\u2013771, 2004.\n\n[12] A. Langville, C. Meyer, and R. Albright. Initializations for the non-negative matrix factorization. In KDD, 2006.\n\n[13] T. Li and C. Ding. The relationships among various nonnegative matrix factorization methods for clustering. In ICDM, 2006.\n\n[14] V. Sindhwani and P. Melville. Document-word co-regularization for semi-supervised sentiment analysis. In ICDM, 2008.\n\n[15] N. Slonim and N. Tishby. Document clustering using word clusters via the information bottleneck method. In SIGIR, 2000.", "award": [], "sourceid": 983, "authors": [{"given_name": "Vikas", "family_name": "Sindhwani", "institution": null}, {"given_name": "Jianying", "family_name": "Hu", "institution": null}, {"given_name": "Aleksandra", "family_name": "Mojsilovic", "institution": null}]}