{"title": "Regularized Learning with Networks of Features", "book": "Advances in Neural Information Processing Systems", "page_first": 1401, "page_last": 1408, "abstract": "For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, or when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should therefore be expected to have similar weights in a good model. Here we present a framework for regularized learning in settings where one has prior knowledge about which features are expected to have similar and dissimilar weights. This prior knowledge is encoded as a graph whose vertices represent features and whose edges represent similarities and dissimilarities between them. During learning, each feature's weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using graphs of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature graphs constructed from declarative human knowledge, as well as from auxiliary task learning, significantly improve prediction accuracy.", "full_text": "Regularized Learning with Networks of Features\n\nTed Sandler, Partha Pratim Talukdar, and Lyle H. Ungar\n\nDepartment of Computer & Information Science, University of Pennsylvania\n\n{tsandler,partha,ungar}@cis.upenn.edu\n\nDepartment of Computer Science, U.C. Berkeley\n\nJohn Blitzer\n\nblitzer@cs.berkeley.edu\n\nAbstract\n\nFor many supervised learning problems, we possess prior knowledge about which\nfeatures yield similar information about the target variable. 
In predicting the topic\nof a document, we might know that two words are synonyms, and when perform-\ning image recognition, we know which pixels are adjacent. Such synonymous or\nneighboring features are near-duplicates and should be expected to have similar\nweights in an accurate model. Here we present a framework for regularized learn-\ning when one has prior knowledge about which features are expected to have sim-\nilar and dissimilar weights. The prior knowledge is encoded as a network whose\nvertices are features and whose edges represent similarities and dissimilarities be-\ntween them. During learning, each feature\u2019s weight is penalized by the amount\nit differs from the average weight of its neighbors. For text classi\ufb01cation, reg-\nularization using networks of word co-occurrences outperforms manifold learn-\ning and compares favorably to other recently proposed semi-supervised learning\nmethods. For sentiment analysis, feature networks constructed from declarative\nhuman knowledge signi\ufb01cantly improve prediction accuracy.\n\n1 Introduction\n\nFor many important problems in machine learning, we have a limited amount of labeled training\ndata and a very high-dimensional feature space. A common approach to alleviating the dif\ufb01culty\nof learning in these settings is to regularize a model by penalizing a norm of its parameter vector.\nThe most commonly used norms in classi\ufb01cation, L1 and L2, assume independence among model\nparameters [1]. However, we often have access to information about dependencies between param-\neters. For example, with spatio-temporal data, we usually know which measurements were taken at\npoints nearby in space and time. And in natural language processing, digital lexicons such as Word-\nNet can indicate which words are synonyms or antonyms [2]. For the biomedical domain, databases\nsuch as KEGG and DIP list putative protein interactions [3, 4]. 
And in the case of semi-supervised\nlearning, dependencies can be inferred from unlabeled data [5, 6]. Consequently, we should be able\nto learn models more effectively if we can incorporate dependency structure directly into the norm\nused for regularization.\n\nHere we introduce regularized learning with networks of features, a framework for constructing cus-\ntomized norms on the parameters of a model when we have prior knowledge about which parameters\nare likely to have similar values. Since our focus is on classi\ufb01cation, the parameters we consider are\nfeature weights in a linear classi\ufb01er. The prior knowledge is encoded as a network or graph whose\nnodes represent features and whose edges represent similarities between the features in terms of how\nlikely they are to have similar weights. During learning, each feature\u2019s weight is penalized by the\namount it differs from the average weight of its neighbors. This regularization objective is closely connected to the unsupervised dimensionality reduction method, locally linear embedding (LLE),\nproposed by Roweis and Saul [7]. In LLE, each data instance is assumed to be a linear combina-\ntion of its nearest neighbors on a low dimensional manifold. In this work, each feature\u2019s weight is\npreferred (though not required) to be a linear combination of the weights of its neighbors.\n\nSimilar to other recent methods for incorporating prior knowledge in learning, our framework can\nbe viewed as constructing a Gaussian prior with non-diagonal covariance matrix on the model pa-\nrameters [6, 8]. However, instead of constructing the covariance matrix directly, it is induced from\na network. The network is typically sparse in that each feature has only a small number of neigh-\nbors. However, the induced covariance matrix is generally dense. 
Consequently, we can implicitly\nconstruct rich and dense covariance matrices over large feature spaces without incurring the space\nand computational blow-ups that would be incurred if we attempted to construct these matrices\nexplicitly.\n\nRegularization using networks of features is especially appropriate for high-dimensional feature\nspaces such as are encountered in text processing where the local distances required by tradi-\ntional manifold classi\ufb01cation methods [9, 10] may be dif\ufb01cult to estimate accurately, even with\nlarge amounts of unlabeled data. We show that regularization with feature-networks derived from\nword co-occurrence statistics outperforms manifold regularization and another, more recent, semi-\nsupervised learning approach [5] on the task of text classi\ufb01cation. Feature network based regu-\nlarization also supports extensions which provide \ufb02exibility in modeling parameter dependencies,\nallowing for feature dissimilarities and the introduction of feature classes whose weights have com-\nmon but unknown means. We demonstrate that these extensions improve classi\ufb01cation accuracy\non the task of classifying product reviews in terms of how favorable they are to the products in\nquestion [11]. Finally, we contrast our approach with related regularization methods.\n\n2 Regularized Learning with Networks of Features\n\nWe assume a standard supervised learning framework in which we are given a training set of instances $T = \{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^d$ and associated labels $y_i \in \mathcal{Y}$. We wish to learn a linear\nclassi\ufb01er parameterized by weight vector $w \in \mathbb{R}^d$ by minimizing a convex loss function $l(x, y; w)$\nover the training instances, $(x_i, y_i)$. For many problems, the dimension, $d$, is much larger than the\nnumber of labeled instances, $n$. Therefore, it is important to impose some constraints on $w$. 
Here\nwe do this using a directed network or graph, G, whose vertices, $V = \{1, \ldots, d\}$, correspond to the\nfeatures of our model and whose edges link features whose weights are believed to be similar. The\nedges of G are non-negative with larger weights indicating greater similarity. Conversely, a weight\nof zero means that two features are not believed a priori to be similar. As has been shown elsewhere\n[5, 6, 8], such similarities can be inferred from prior domain knowledge, auxiliary task learning, and\nstatistics computed on unlabeled data. For the time being we assume that G is given and defer its\nconstruction to the experimental work of section 4.\nThe weights of G are encoded by a matrix, $P$, where $P_{ij} \geq 0$ gives the weight of the directed edge\nfrom vertex $i$ to vertex $j$. We constrain the out-degree of each vertex to sum to one, $\sum_j P_{ij} = 1$, so\nthat no feature \u201cdominates\u201d the graph. Because the semantics of the graph are that linked features\nshould have similar weights, we penalize each feature\u2019s weight by the squared amount it differs from\nthe weighted average of its neighbors. This gives us the following criterion to optimize in learning:\n\n$$\mathrm{loss}(w) = \sum_{i=1}^{n} l(x_i, y_i; w) + \alpha \sum_{j=1}^{d} \Big( w_j - \sum_{k} P_{jk} w_k \Big)^2 + \beta \|w\|_2^2, \qquad (1)$$\n\nwhere we have added a ridge term to make the loss strictly convex. The hyperparameters $\alpha$ and $\beta$\nspecify the amount of network and ridge regularization respectively. The regularization penalty can\nbe rewritten as $w^\top M w$ where $M = \alpha (I - P)^\top (I - P) + \beta I$. The matrix $M$ is symmetric positive\nde\ufb01nite, and therefore our criterion possesses a Bayesian interpretation in which the weight vector,\n$w$, is a priori normally distributed with mean zero and covariance matrix $\frac{1}{2} M^{-1}$.\nMinimizing equation (1) is equivalent to \ufb01nding the MAP estimate for $w$. 
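As a concrete sketch of this criterion, the penalty matrix $M = \alpha (I-P)^\top (I-P) + \beta I$ and the regularized objective can be assembled with sparse matrices. The function names and the choice of logistic loss below are illustrative, not prescribed by the paper:

```python
import numpy as np
import scipy.sparse as sp

def network_penalty_matrix(P, alpha, beta):
    """M = alpha * (I - P)^T (I - P) + beta * I, the penalty matrix of eq. (1).

    P is a sparse matrix of directed edge weights whose rows sum to one."""
    d = P.shape[0]
    A = sp.identity(d, format="csr") - P
    return alpha * (A.T @ A) + beta * sp.identity(d, format="csr")

def regularized_loss(w, X, y, P, alpha, beta):
    """Logistic loss over (X, y) with y in {-1, +1}, plus the penalty w^T M w."""
    margins = y * (X @ w)
    data_loss = np.log1p(np.exp(-margins)).sum()
    M = network_penalty_matrix(P, alpha, beta)
    return data_loss + w @ (M @ w)
```

Note that when every row of P sums to one, a constant weight vector incurs no network penalty at all; only the ridge term remains, matching the intuition that linked features "agree" when their weights are equal.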
The gradient of (1) with\nrespect to $w$ is $\nabla_w \mathrm{loss} = \sum_{i=1}^{n} \nabla_w\, l(x_i, y_i; w) + 2Mw$ and therefore requires only an additional\nmatrix multiply on top of computing the loss over the training data. If P is sparse, as it is in\nour experiments\u2014i.e., it has only kd entries for k \u226a d\u2014then the matrix multiply is O(d). Thus\nequation (1) can be minimized very quickly. Additionally, the induced covariance matrix $M^{-1}$\nwill typically be dense even though P is sparse, showing that we can construct dense covariance\nstructures over $w$ without incurring storage and computation costs.\n\n2.1 Relationship to Locally Linear Embedding\n\nLocally linear embedding (LLE) is an unsupervised learning method for embedding high dimen-\nsional data in a low dimensional vector space. The data $\{\vec{X}_i\}_{i=1}^{n}$ is assumed to lie on a low dimen-\nsional manifold of dimension $c$ within a high dimensional vector space of dimension $d$ with $c \ll d$.\nSince the data lies on a manifold, each point is approximately a convex combination of its nearest\nneighbors on the manifold. That is, $\vec{X}_i \approx \sum_{j \sim i} P_{ij} \vec{X}_j$, where $j \sim i$ denotes the samples, $j$, which\nlie close to $i$ on the manifold. As above, the matrix P has non-negative entries and its rows sum to\none. The set of low dimensional coordinates, $\{\vec{Y}_i\}_{i=1}^{n}$, $\vec{Y}_i \in \mathbb{R}^c$, are found by minimizing the sum\nof squares cost:\n\n$$\mathrm{cost}(\{\vec{Y}_i\}) = \sum_i \Big\| \vec{Y}_i - \sum_j P_{ij} \vec{Y}_j \Big\|_2^2, \qquad (2)$$\n\nsubject to the constraint that the $\{\vec{Y}_i\}$ have unit variance in each of the $c$ dimensions. The solution\nto equation (2) is found by performing eigen-decomposition on the matrix $(I - P)^\top (I - P) = U \Lambda U^\top$ where $U$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues. The\nLLE coordinates are obtained from the eigenvectors, $u_1, \ldots, u_c$, whose eigenvalues, $\lambda_1, \ldots, \lambda_c$, are\nsmallest$^1$ by setting $\vec{Y}_i = (u_{1i}, \ldots, u_{ci})^\top$. 
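The LLE solution just described can be sketched in a few lines (a dense-eigendecomposition sketch with an illustrative function name; the constant eigenvector with eigenvalue zero is skipped, per the footnote, so that the embedding is centered):

```python
import numpy as np

def lle_coordinates(P, c):
    """LLE coordinates from a row-stochastic neighbor-weight matrix P.

    Eigendecomposes (I - P)^T (I - P) and keeps the c eigenvectors with the
    smallest eigenvalues, skipping the constant eigenvector (eigenvalue 0,
    since rows of P sum to one) so that the embedding is centered."""
    d = P.shape[0]
    A = np.eye(d) - P
    lam, U = np.linalg.eigh(A.T @ A)   # eigenvalues in ascending order
    return U[:, 1:c + 1]               # row i holds the coordinates of point i
```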
Looking at equation (1) and ignoring the ridge term, it is\nclear that our feature network regularization penalty is identical to LLE except that the embedding\nis found for the feature weights rather than data instances. However, there is a deeper connection.\nIf we let $L(Y, Xw)$ denote the unregularized loss over the training set where $X$ is the $n \times d$ matrix\nof instances and $Y$ is the $n$-vector of class labels, we can express equation (1) in matrix form as\n\n$$w^* = \operatorname*{argmin}_{w}\; L(Y, Xw) + w^\top \big( \alpha (I - P)^\top (I - P) + \beta I \big) w. \qquad (3)$$\n\nDe\ufb01ning $\tilde{X}$ to be $X U (\alpha \Lambda + \beta I)^{-1/2}$ where $U$ and $\Lambda$ are from the eigen-decomposition above, it is\nnot hard to show that equation (3) is equivalent to the alternative ridge regularized learning problem\n\n$$\tilde{w}^* = \operatorname*{argmin}_{\tilde{w}}\; L(Y, \tilde{X} \tilde{w}) + \tilde{w}^\top \tilde{w}. \qquad (4)$$\n\nThat is, the two minimizers, $w^*$ and $\tilde{w}^*$, yield the same predictions: $\hat{Y} = X w^* = \tilde{X} \tilde{w}^*$. Consequently,\nwe can view feature network regularization as: 1) \ufb01nding an embedding for the features using LLE\nin which all of the eigenvectors are used and scaled by the inverse square-roots of their eigenvalues\n(plus a smoothing term, $\beta I$, that makes the inverse well-de\ufb01ned); 2) projecting the data instances\nonto these coordinates; and 3) learning a ridge-penalized model for the new representation. In using\nall of the eigenvectors, the dimensionality of the feature embedding is not reduced. However, in\nscaling the eigenvectors by the inverse square-roots of their eigenvalues, the directions of least cost\nin the network regularized problem become the directions of maximum variance in the associated\nridge regularized problem, and hence are the directions of least cost in the ridge problem. 
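This equivalence is easy to check numerically. The sketch below assumes squared loss, $L(Y, Xw) = \|Y - Xw\|^2$, so that both problems have closed-form solutions; the variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha, beta = 5, 8, 2.0, 0.5

X = rng.normal(size=(n, d))
Y = rng.normal(size=n)
P = rng.random(size=(d, d))
np.fill_diagonal(P, 0.0)
P /= P.sum(axis=1, keepdims=True)       # rows sum to one

# Problem (3): min_w ||Y - Xw||^2 + w^T (alpha*K + beta*I) w
K = (np.eye(d) - P).T @ (np.eye(d) - P)
w = np.linalg.solve(X.T @ X + alpha * K + beta * np.eye(d), X.T @ Y)

# Problem (4): plain ridge on X~ = X U (alpha*Lambda + beta*I)^(-1/2)
lam, U = np.linalg.eigh(K)
Xt = X @ U / np.sqrt(alpha * lam + beta)
wt = np.linalg.solve(Xt.T @ Xt + np.eye(d), Xt.T @ Y)

assert np.allclose(X @ w, Xt @ wt)      # identical predictions
```

The two weight vectors differ, but the fitted values agree, as stated above.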
As a result,\nthe effective dimensionality of the learning problem is reduced to the extent that the distribution\nof inverted eigenvalues is sharply peaked. When the best representation for classi\ufb01cation has high\ndimension, it is faster to solve (3) than to compute a large eigenvector basis and solve (4). In the high\ndimensional problems of section 4, we \ufb01nd that regularization with feature networks outperforms\nLLE-based regression.\n\n3 Extensions to Feature Network Regularization\n\nIn this section, we pose a number of extensions and alternatives to feature network regularization as\nformulated in section 2, including the modeling of classes of features whose weights are believed\nto share the same unknown means, the incorporation of feature dissimilarities, and two alternative\nregularization criteria based on the graph Laplacian.\n\n1More precisely, eigenvectors $u_2, \ldots, u_{c+1}$ are used so that the $\{\vec{Y}_i\}$ are centered.\n\n3.1 Regularizing with Classes of Features\n\nIn machine learning, features can often be grouped into classes, such that all the weights of the\nfeatures in a given class are drawn from the same underlying distribution. For example, words can\nbe grouped by part of speech, by meaning (as in WordNet\u2019s synsets), or by clustering based on the\nwords they co-occur with or the documents they occur in. Using an appropriately constructed feature\ngraph, we can model the case in which the underlying distributions are believed to be Gaussians with\nknown, identical variances but with unknown means. That is, the case in which there are $k$ disjoint\nclasses of features $\{C_i\}_{i=1}^{k}$ whose weights are drawn i.i.d. $N(\mu_i, \sigma^2)$ with $\mu_i$ unknown but $\sigma^2$\nknown and shared across all classes.\n\nThe straightforward approach to modeling this scenario might seem to be to link all the features\nwithin a class to each other, forming a clique, but this does not lead to the desired interpretation.\nAdditionally, the number of edges in this construction scales quadratically in the clique sizes, result-\ning in feature graphs that are not sparse. Our approach is therefore to create $k$ additional \u201cvirtual\u201d\nfeatures, $f_1, \ldots, f_k$, that do not appear in any of the data instances but whose weights $\hat{\mu}_1, \ldots, \hat{\mu}_k$ serve\nas the estimates for the true but unknown means, $\mu_1, \ldots, \mu_k$. In creating the feature graph, we link\neach feature to the virtual feature for its class with an edge of weight one. The virtual features,\nthemselves, do not possess any out-going links.\n\nDenoting the class of feature $i$ as $c(i)$, and setting the hyperparameters $\alpha$ and $\beta$ in equation (1) to\n$1/(2\sigma^2)$ and $0$, respectively, yields a network regularization cost of $\frac{1}{2\sigma^2} \sum_{i=1}^{d} (w_i - \hat{\mu}_{c(i)})^2$. Since\nthe virtual features do not appear in any instances, i.e. their values are zero in every data instance,\ntheir weights are free to take on whatever values minimize the network regularization cost in (1),\nin particular the estimates of the class means, $\mu_1, \ldots, \mu_k$. Consequently, minimizing the network\nregularization penalty maximizes the log-likelihood for the intended scenario. 
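A minimal sketch of this virtual-feature construction (the helper name is hypothetical; scipy.sparse assumed). Real features occupy indices 0..d-1 and the k virtual features occupy indices d..d+k-1:

```python
import numpy as np
import scipy.sparse as sp

def class_graph(d, class_of):
    """Edge matrix P for d real features plus k virtual class features.

    class_of maps each real feature index to a class in {0, ..., k-1}. Each
    real feature gets one out-edge of weight one to its class's virtual
    feature; virtual features have no out-going edges."""
    k = max(class_of.values()) + 1
    rows = list(class_of.keys())
    cols = [d + class_of[i] for i in rows]
    data = np.ones(len(rows))
    return sp.csr_matrix((data, (rows, cols)), shape=(d + k, d + k))
```

Under the penalty of equation (1), each real feature is then charged by its squared difference from its class's virtual weight, recovering the cost stated above.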
We can extend this\nconstruction to model the case in which the feature weights are drawn from a mixture of Gaussians\nby connecting each feature to a number of virtual features with edge weights that sum to one.\n\n3.2 Incorporating Feature Dissimilarities\n\nFeature network regularization can also be extended to induce features to have opposing weights.\nSuch feature \u201cdissimilarities\u201d can be useful in tasks such as sentiment prediction where we would\nlike weights for words such as \u201cgreat\u201d or \u201cfantastic\u201d to have opposite signs from their negated bigram\ncounterparts \u201cnot great\u201d and \u201cnot fantastic,\u201d and from their antonyms. To model dissimilarities, we\nconstruct a separate graph whose edges represent anti-correlations between features. Regularizing\nover this graph encourages each feature\u2019s weight to equal the negative of the average of the neigh-\nboring weights. To do this, we encode the dissimilarity graph using a matrix $Q$, de\ufb01ned analogously\nto the matrix $P$, and add the term $\sum_i \big( w_i + \sum_j Q_{ij} w_j \big)^2$ to the network regularization criterion,\nwhich can be written as $w^\top (I + Q)^\top (I + Q) w$. The matrix $(I + Q)^\top (I + Q)$ is positive semide\ufb01nite\nlike its similarity graph counterpart. Goldberg et al. [12] use a similar construction with the graph\nLaplacian in order to incorporate dissimilarities between instances in manifold learning.\n\n3.3 Regularizing Features with the Graph Laplacian\n\nA natural alternative to the network regularization criterion given in section 2 is to regularize the\nfeature weights using a penalty derived from the graph Laplacian [13]. Here, the feature graph\u2019s edge\nweights are given by a symmetric matrix, $W$, whose entries, $W_{ij} \geq 0$, give the weight of the edge\nbetween features $i$ and $j$. 
The Laplacian penalty is $\frac{1}{2} \sum_{i,j} W_{ij} (w_i - w_j)^2$, which can be written as\n$w^\top (D - W) w$, where $D = \mathrm{diag}(W \mathbf{1})$ is the vertex degree matrix. The main difference between the\nLaplacian penalty and the network penalty in equation (1) is that the Laplacian penalizes each edge\nequally (modulo the edge weights) whereas the network penalty penalizes each feature equally. In\ngraphs where there are large differences in vertex degree, the Laplacian penalty will therefore focus\nmost of the regularization cost on features with many neighbors. Experiments in section 4 show\nthat the criterion in (1) outperforms the Laplacian penalty as well as a related penalty derived from\nthe normalized graph Laplacian, $\frac{1}{2} \sum_{i,j} W_{ij} \big( w_i / \sqrt{D_{ii}} - w_j / \sqrt{D_{jj}} \big)^2$. The normalized Laplacian\npenalty assumes that $\sqrt{D_{jj}}\, w_i \approx \sqrt{D_{ii}}\, w_j$, which is different from assuming that linked features\nshould have similar weights.\n\nFigure 1: Left: Accuracy of feature network regularization (FNR) and \ufb01ve baselines on \u201c20 newsgroups\u201d data.\nRight: Accuracy of FNR compared to reported accuracies of three other semi-supervised learning methods.\n\n4 Experiments\n\nWe evaluated logistic regression augmented with feature network regularization on two natural lan-\nguage processing tasks. The \ufb01rst was document classi\ufb01cation on the 20 Newsgroups dataset, a\nwell-known document classi\ufb01cation benchmark. 
The second was sentiment classi\ufb01cation of prod-\nuct reviews, the task of classifying user-written reviews according to whether they are favorable or\nunfavorable to the product under review based on the review text [11]. Feature graphs for the two\ntasks were constructed using different information. For document classi\ufb01cation, the feature graph\nwas constructed using feature co-occurrence statistics gleaned from unlabeled data. In sentiment\nprediction, both co-occurrence statistics and prior domain knowledge were used.\n\n4.1 Experiments on 20 Newsgroups\n\nWe evaluated feature network based regularization on the 20 newsgroups classi\ufb01cation task using\nall twenty classes. The feature set was restricted to the 11,376 words which occurred in at least 20\ndocuments, not counting stop-words. Word counts were transformed by adding one and taking logs.\nTo construct the feature graph, each feature (word) was represented by a binary vector denoting its\npresence/absence in each of the 20,000 documents of the dataset. To measure similarity between\nfeatures, we computed cosines between these binary vectors. Each feature was linked to the 25\nother features with highest cosine scores, provided that the scores were above a minimum threshold\nof 0.10. 
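This graph construction might be sketched as follows (illustrative names; a dense implementation for clarity, with the row normalization of section 2 applied at the end):

```python
import numpy as np
import scipy.sparse as sp

def cosine_feature_graph(B, n_neighbors=25, min_sim=0.10):
    """Build the row-normalized matrix P from a binary presence matrix B.

    B is (n_docs, n_features); column j marks the documents containing
    feature j. Each feature links to its n_neighbors most cosine-similar
    features, keeping only scores at or above min_sim."""
    Bn = B / np.maximum(np.linalg.norm(B, axis=0), 1e-12)  # unit columns
    S = Bn.T @ Bn                                          # cosine scores
    np.fill_diagonal(S, 0.0)
    d = S.shape[1]
    P = np.zeros((d, d))
    for i in range(d):
        top = np.argsort(S[i])[::-1][:n_neighbors]
        keep = top[S[i, top] >= min_sim]
        P[i, keep] = S[i, keep]
    row_sums = P.sum(axis=1, keepdims=True)
    P = np.divide(P, row_sums, out=np.zeros_like(P), where=row_sums > 0)
    return sp.csr_matrix(P)
```

For the 20 newsgroups setup described here, B would be the 20,000 x 11,376 presence matrix; the dense cosine matrix is the main memory cost, and a sparse or chunked computation would be used at that scale.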
The edge weights of the graph were set to these cosine scores and the matrix P was\nconstructed by normalizing each vertex\u2019s out-degree to sum to one.\n\nFigure 1 (left) shows feature network regularization compared against \ufb01ve other baselines: logis-\ntic regression with an L2 (ridge) penalty; principal components logistic regression (PCR) in which\neach instance was projected onto the largest 200 right singular vectors of the $n \times d$ matrix, $X$; LLE-\nlogistic regression in which each instance was projected onto the smallest 200 eigenvectors of the\nmatrix $(I - P)^\top (I - P)$ described in section 2; and logistic regression regularized by the normalized\nand unnormalized graph Laplacians described in section 3.3. Results at each training set size are\naverages of \ufb01ve trials with training sets sampled to contain an equal number of documents per class.\nFor ridge, the amount of L2 regularization was chosen using cross validation on the training set.\nSimilarly, for feature network regularization and the Laplacian regularizers, the hyperparameters \u03b1\nand \u03b2 were chosen through cross validation on the training set using a simple grid search. The ratio\nof \u03b1 to \u03b2 tended to be around 100:1. For PCR and LLE-logistic regression, the number of eigenvec-\ntors used was chosen to give good performance on the test set at both large and small training set\nsizes. All models were trained using L-BFGS with a maximum of 200 iterations. Learning a single model took between 30 seconds and two minutes, with convergence typically achieved\nbefore the full 200 iterations.\n\nFigure 2: Accuracy of feature network regularization on the sentiment datasets using feature classes and\ndissimilarity edges to regularize the small set of SentiWordNet features.\n\nThe results in \ufb01gure 1 show that feature network regularization with a graph constructed from unla-\nbeled data outperforms all baselines and increases accuracy by 4%-17% over the plain ridge penalty,\nan error reduction of 17%-30%. Additionally, it outperforms the related LLE regression. We conjec-\nture this is because in tuning the hyperparameters, we can adaptively tune the dimensionality of the\nunderlying data representation. Moreover, by rescaling the eigenvectors according to their eigenvalues, fea-\nture network regularization keeps more information about the directions of least cost in weight space\nthan does LLE regression, which does not rescale the eigenvectors but simply keeps or discards them\n(i.e. scales them by 1 or 0).\n\nFigure 1 (right) compares feature network regularization against two external approaches that lever-\nage unlabeled data: a multi-task learning approach called alternating structure optimization (ASO),\nand our reimplementation of a manifold learning method which we refer to as \u201clocal/global consis-\ntency\u201d [5, 10]. To make a fair comparison against the reported results for ASO, training sets were\nsampled so as not to necessarily contain an equal number of documents per class. 
Accuracies are\ngiven for the highest and lowest performing variants of ASO reported in [5]. Our reimplementation\nof local/global consistency used the same document preprocessing described in [10]. However, the\ngraph was constructed so that each document had only K = 10 neighbors (the authors in [10] use\na fully connected graph which does not \ufb01t in memory for the entire 20 newsgroups dataset). Clas-\nsi\ufb01cation accuracy of local/global consistency did not vary much with K and up to 500 neighbors\nwere tried for each document. Here we see that feature network regularization is competitive with\nthe other semi-supervised methods and performs best at all but the smallest training set size.\n\n4.2 Sentiment Classi\ufb01cation\n\nFor sentiment prediction, we obtained the product review datasets used in [11]. Each dataset con-\nsists of reviews downloaded from Amazon.com for one of four different product domains: books,\nDVDs, electronics, and kitchen appliances. The reviews have an associated number of \u201cstars,\u201d rang-\ning from 0 to 5, rating the quality of a product. The goal of the task is to predict whether a review\nhas more than (positive) or less than (negative) 3 stars associated with it based only on the text in the\nreview. We performed two sets of experiments in which prior domain knowledge was incorporated\nusing feature networks. In both, we used a list of sentimentally-charged words obtained from the\nSentiWordNet database [14], a database which associates positive and negative sentiment scores to\neach word in WordNet. In the \ufb01rst experiment, we constructed a set of feature classes in the manner\ndescribed in section 3.1 to see if such classes could be used to boot-strap weight polarities for groups\nof features. 
In the second, we computed similarities between words in terms of how similarly they\nco-occur with the sentimentally charged words.\n\nFrom SentiWordNet we extracted a list of roughly 200 words with high positive and negative sen-\ntiment scores that also occurred in the product reviews at least 100 times. Words to which Senti-\nWordNet gave a high \u2018positive\u2019 score were placed in a \u201cpositive words\u201d cluster and words given\na high \u2018negative\u2019 score were placed in a \u201cnegative words\u201d cluster. As described in section 3.1, all\nwords in the positive cluster were attached to a virtual feature representing the mean feature weight\nof the positive cluster words, and all words in the negative cluster were attached to a virtual feature\nrepresenting the mean weight of the negative cluster words. We also added a dissimilarity edge (de-\nscribed in section 3.2) between the positive and negative clusters\u2019 virtual features to induce the two\nclasses of features to have opposite means.\n\nFigure 3: Accuracy of feature network and ridge regularization on four sentiment classi\ufb01cation datasets.\n\nAs shown in \ufb01gure 2, imposing feature clusters on the\ntwo classes of words improves performance noticeably while the addition of the feature dissimilarity\nedge does not yield much bene\ufb01t. 
When it helps, it is only for the smallest training set sizes.\n\nThis simple set of experiments demonstrated the applicability of feature classes for inducing groups\nof features to have similar means, and that the words extracted from SentiWordNet were relatively\nhelpful in determining the sentiment of a review. However, the number of features used in these\nexperiments was too small to yield reasonable performance in an applied setting. Thus we extended\nthe feature sets to include all unigram and bigram word-features which occurred in ten or more\nreviews. The total number of reviews and size of the feature sets is given in table 1.\n\nTable 1: Sentiment Data Statistics\n\nDataset      Instances  Features  Edges\nbooks        13,161     29,404    470,034\nDVDs         13,005     31,475    419,178\nelectronics  8,922      15,104    343,890\nkitchen      7,760      11,658    305,926\n\nThe method used to construct the feature graph in the 20 newsgroups experiments was not well suited for sentiment prediction since plain feature co-occurrence statistics tended to \ufb01nd groups of words that showed up in reviews for products of the same type, e.g., digital cameras or laptops. While such similarities are useful in predicting what type of product is being reviewed, they are of little help in determining whether a review is favorable or unfavorable. Thus, to align features along dimensions of \u2018sentiment,\u2019 we computed the correlations of all features with the SentiWordNet features so that\neach word was represented as a 200 dimensional vector of correlations with these highly charged\nsentiment words. Distances between these correlation vectors were computed in order to determine\nwhich features should be linked. We next computed each feature\u2019s 100 nearest neighbors. Two fea-\ntures were linked if both were in the other\u2019s set of nearest 100 neighbors. 
For simplicity, the edge\nweights were set to one and the graph weight matrix was then row-normalized in order to construct\nthe matrix P . The number of edges in each feature graph is given in table 1.\nThe \u2018kitchen\u2019 dataset was used as a development dataset in order to arrive at the method for con-\nstructing the feature graph and for choosing the hyperparameter values: \u03b1 = 9.9 and \u03b2 = 0.1.\nFigure 3 gives accuracy results for all four sentiment datasets at training sets of 50 to 1000 in-\nstances. The results show that linking features which are similarly correlated with sentiment-loaded\nwords yields improvements on every dataset and at every training set size.\n\n5 Related Work\n\nMost similar to the work presented here is that of the fused lasso (Tibshirani et al. [15]) which can\nbe interpreted as using the graph Laplacian regularizer but with an L1 norm instead of L2 on the\nresiduals of weight differences: $\sum_i \sum_{j \sim i} |w_i - w_j|$ and all edge weights set to one. As the authors\ndiscuss, an L1 penalty prefers that weights of linked features be exactly equal so that the residual\nvector of weight differences is sparse. L1 is appropriate if the true weights are believed to be exactly\nequal, but in many settings, features are near copies of one another whose weights should be similar\nrather than identical. Thus in these settings, penalizing squared differences rather than absolute\nones is more appropriate. Optimizing L1 feature weight differences also leads to a much harder\noptimization problem, making it less applicable in large scale learning. Li and Li [13] regularize\nfeature weights using the normalized graph Laplacian in their work on biomedical prediction tasks.\nAs shown, this criterion does not work as well on the text prediction problems considered here.\n\nKrupka and Tishby [8] proposed a method for inducing feature-weight covariance matrices using\ndistances in a \u201cmeta-feature\u201d space. 
Under their framework, two features positively covary if they are close in this space, and they approach independence as they grow distant. The authors represent each feature i as a vector of meta-features, u_i, and compute the entries of the feature-weight covariance matrix as C_ij = exp(−‖u_i − u_j‖² / (2σ²)). Obviously, the choice of which is more appropriate, a feature graph or a metric space, is application-dependent. However, it is less obvious how to incorporate feature dissimilarities in a metric space. A second difference is that our work defines the regularizer in terms of C⁻¹ ≈ (I − P)⊤(I − P) rather than C itself. While C⁻¹ is constructed to be sparse via a nearest-neighbor graph, the induced covariance matrix, C, need not be sparse. Thus, working with C⁻¹ allows one to construct dense covariance matrices without having to store them explicitly. Finally, Raina et al. [6] learn a feature-weight covariance matrix via auxiliary task learning. Interestingly, the entries of this covariance matrix are learned jointly with a regression model that predicts feature-weight covariances as a function of meta-features. However, since their approach explicitly predicts each entry of the covariance matrix, it is restricted to learning smaller models, consisting of hundreds rather than tens of thousands of features.

6 Conclusion

We have presented regularized learning with networks of features, a simple and flexible framework for incorporating expectations about feature-weight similarities into learning. Feature similarities are modeled using a feature graph, and the weight of each feature is preferred to be close to the average of its neighbors. On the task of document classification, feature network regularization is superior to several related criteria, as well as to a manifold learning approach in which the graph models similarities between instances rather than between features.
Extensions for modeling feature classes, as well as feature dissimilarities, yielded benefits on the problem of sentiment prediction.

References

[1] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2001.
[2] C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.
[3] H. Ogata, S. Goto, K. Sato, W. Fujibuchi, H. Bono, and M. Kanehisa. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1):29–34, 1999.
[4] I. Xenarios, D.W. Rice, L. Salwinski, M.K. Baron, E.M. Marcotte, and D. Eisenberg. DIP: The Database of Interacting Proteins. Nucleic Acids Research, 28(1):289–291, 2000.
[5] R.K. Ando and T. Zhang. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. JMLR, 6:1817–1853, 2005.
[6] R. Raina, A.Y. Ng, and D. Koller. Constructing informative priors using transfer learning. In ICML, 2006.
[7] S.T. Roweis and L.K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500):2323–2326, 2000.
[8] E. Krupka and N. Tishby. Incorporating Prior Knowledge on Features into Learning. In AISTATS, 2007.
[9] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. JMLR, 7:2399–2434, 2006.
[10] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS, 2004.
[11] J. Blitzer, M. Dredze, and F. Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In ACL, 2007.
[12] A.B. Goldberg, X. Zhu, and S. Wright. Dissimilarity in Graph-Based Semi-Supervised Classification. In AISTATS, 2007.
[13] C. Li and H. Li. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics, 24(9):1175–1182, 2008.
[14] A. Esuli and F.
Sebastiani. SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining. In LREC, 2006.
[15] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and Smoothness via the Fused Lasso. Journal of the Royal Statistical Society, Series B, 67(1):91–108, 2005.