{"title": "Projection Retrieval for Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 3023, "page_last": 3031, "abstract": "In many applications, classification systems require human intervention in the loop. In such cases the decision process must be transparent and comprehensible, while simultaneously requiring minimal assumptions on the underlying data distributions. To tackle this problem, we formulate it as an axis-aligned subspace-finding task under the assumption that query-specific information dictates the complementary use of the subspaces. We develop a regression-based approach called RECIP that efficiently solves this problem by finding projections that minimize a nonparametric conditional entropy estimator. Experiments show that the method is accurate in identifying the informative projections of the dataset, picking the correct ones to classify query points, and facilitates visual evaluation by users.", "full_text": "Projection Retrieval for Classification\n\nMadalina Fiterau\nMachine Learning Department\nCarnegie Mellon University\nPittsburgh, PA 15213\nmfiterau@cs.cmu.edu\n\nArtur Dubrawski\nSchool of Computer Science\nCarnegie Mellon University\nPittsburgh, PA 15213\nawd@cs.cmu.edu\n\nAbstract\n\nIn many applications, classification systems require human intervention in the loop. In such cases the decision process must be transparent and comprehensible, while simultaneously requiring minimal assumptions on the underlying data distributions. To tackle this problem, we formulate an axis-aligned subspace-finding task under the assumption that query-specific information dictates the complementary use of the subspaces. We develop a regression-based approach called RECIP that efficiently solves this problem by finding projections that minimize a nonparametric conditional entropy estimator. 
Experiments show that the method is accurate in identifying the informative projections of the dataset, picking the correct views to classify query points, and facilitates visual evaluation by users.\n\n1 Introduction and problem statement\n\nIn the domain of predictive analytics, many applications which keep human users in the loop require the use of simple classification models. Often, it is required that a test point be 'explained' (classified) using a simple low-dimensional projection of the original feature space. This is the Projection Retrieval for Classification (PRC) problem. The interaction with the user proceeds as follows: the user provides the system a query point; the system searches for a projection in which the point can be accurately classified; the system displays the classification result as well as an illustration of how the classification decision was reached in the selected projection.\nSolving the PRC problem is relevant in many practical applications. For instance, consider a nuclear threat detection system installed at a border checkpoint. Vehicles crossing the border are scanned with sensors so that a large array of measurements of radioactivity and secondary contextual information is collected. These observations are fed into a classification system that determines whether the scanned vehicle may carry a threat. Given the potentially devastating consequences of a false negative, a border control agent is requested to validate the prediction and decide whether to submit the vehicle for a costly further inspection. With the positive classification rate of the system kept under strict bounds because of limitations in the control process, the risk of false negatives is increased. Despite its crucial role, human intervention should only be solicited for cases in which there are reasons to doubt the validity of classification. 
In order for a user to attest to the validity of a decision, the user must have a good understanding of the classification process, which happens more readily when the classifier only uses the original dataset features rather than combinations of them, and when the discrimination models are low-dimensional.\nIn this context, we aim to learn a set of classifiers in low-dimensional subspaces and a decision function which selects the subspace under which a test point is to be classified. Assume we are given a dataset {(x1, y1) . . . (xn, yn)} \u2208 X^n \u00d7 {0, 1}^n and a class of discriminators H. The model will contain a set \u03a0 of subspaces of X, drawn from the set of all axis-aligned subspaces of the original feature space (the power set of the features). To each projection \u03c0i \u2208 \u03a0 corresponds one discriminator from a given hypothesis space, hi \u2208 H. The model will also contain a selection function g : X \u2192 \u03a0 \u00d7 H, which yields, for a query point x, the projection/discriminator pair with which this point will be classified. The notation \u03c0(x) refers to the projection of the point x onto the subspace \u03c0, while h(\u03c0(x)) represents the predicted label for x. Formally, we describe the model class as\n\nMd = { \u03a0 = {\u03c0 : \u03c0 axis-aligned, dim(\u03c0) \u2264 d},\n       H = {hi : hi \u2208 H, hi : \u03c0i \u2192 Y, \u2200i = 1 . . . |\u03a0|},\n       g \u2208 {f : X \u2192 {1 . . . |\u03a0|}} },\n\nwhere dim(\u03c0) denotes the dimensionality of the subspace determined by the projection \u03c0. 
Note that only projections up to size d will be considered, where d is a parameter specific to the application. The set H contains one discriminator from the hypothesis class H for each projection.\nIntuitively, the aim is to minimize the expected classification error over Md; however, a notable modification is that the projection, and implicitly the discriminator, are chosen according to the data point that needs to be classified. Given a query x in the space X, g(x) will yield the subspace \u03c0g(x) onto which the query is projected and the discriminator hg(x) for it. Distinct test points can be handled using different combinations of subspaces and discriminators. We consider models that minimize 0/1 loss. Hence, the PRC problem can be stated as follows:\n\nM* = arg min_{M \u2208 Md} E_{X,Y} [ y \u2260 hg(x)(\u03c0g(x)(x)) ]\n\nThere are limitations to the type of selection function g that can be learned. A simple example for which g can be recovered is a set of signal readings x for which, if one of the readings xi exceeds a threshold ti, the label can be predicted just based on xi. A more complex one is a dataset containing regulatory variables, that is, for xi in the interval [ak, bk] the label only depends on (x_k^1 . . . x_k^{nk}) - datasets that fall into the latter category fulfill what we call the Subspace-Separability Assumption.\nThis paper proposes an algorithm called RECIP that solves the PRC problem for a class of nonparametric classifiers. We evaluate the method on artificial data to show that indeed it correctly identifies the underlying structure for data satisfying the Subspace-Separability Assumption. 
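For concreteness, data satisfying the Subspace-Separability Assumption can be synthesized along the lines just described. The generator below is an illustrative sketch only; the feature indices, thresholds, and batch construction are our own assumptions, not the exact protocol used later in the experiments:

```python
import numpy as np

def make_subspace_separable(n=600, n_features=7, q=3, noise=0.0, seed=0):
    """Sketch of a dataset satisfying the Subspace-Separability Assumption:
    each batch k is flagged by a regulatory feature exceeding a threshold,
    and its labels depend only on one 2-D axis-aligned projection."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, n_features))
    y = np.zeros(n, dtype=int)
    batch = rng.integers(0, q, size=n)
    for k in range(q):
        idx = np.flatnonzero(batch == k)
        a, b = k % n_features, (k + 1) % n_features
        # regulatory variable: x_a > 0.75 marks membership in batch k
        X[idx, a] = rng.uniform(0.75, 1.0, size=idx.size)
        # within the batch, the label depends only on the (x_a, x_b) subspace
        y[idx] = (X[idx, b] > 0.5).astype(int)
    flip = rng.random(n) < noise  # label noise as a proportion of points
    y[flip] = 1 - y[flip]
    return X, y, batch
```

With noise = 0, a point's label in batch k is a deterministic function of two coordinates only, which is exactly the kind of structure a PRC solver is meant to recover.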
We show some case studies to illustrate how RECIP offers insight into applications requiring human intervention.\nThe use of dimensionality reduction techniques is a common preprocessing step in applications where the use of simplified classification models is preferable. Methods that learn linear combinations of features, such as Linear Discriminant Analysis, are not quite appropriate for the task considered here, since we prefer to rely natively on the dimensions available in the original feature space. Feature selection methods, such as the lasso, are suitable for identifying sets of relevant features, but do not consider interactions between them. Our work better fits the areas of class-dependent feature selection and context-specific classification, highly connected to the concept of Transductive Learning [6]. Other context-sensitive methods are Lazy and Data-Dependent Decision Trees, [5] and [10] respectively. In Ting et al [14], the Feating submodel selection relies on simple attribute splits followed by fitting local predictors, though the algorithm itself is substantially different. Obozinski et al present a subspace selection method in the context of multitask learning [11]. Gu et al propose a joint method for feature selection and subspace learning [7]; however, their classification model is not particularly query specific. Alternatively, algorithms that transform complex or unintelligible models into user-friendly equivalents have been proposed [3, 2, 1, 8]. Algorithms specifically designed to yield understandable models are precious few. Here we note a rule learning method described in [12], even though the resulting rules can make visualization difficult, while itemset mining [9] is not specifically designed for classification. 
Unlike those approaches, our method retrieves subsets of the feature space for use in a way that is complementary to the basic task at hand (classification) while providing query-specific information.\n\n2 Recovering informative projections with RECIP\n\nTo solve PRC, we need means by which to ascertain which projections are useful in terms of discriminating data from the two classes. Since our model allows the use of distinct projections depending on the query point, it is expected that each projection would potentially benefit different areas of the feature space. A(\u03c0) refers to the area of the feature space where the projection \u03c0 is selected:\n\nA(\u03c0) = {x \u2208 X : \u03c0g(x) = \u03c0}\n\nThe objective becomes\n\nmin_{M \u2208 Md} E_{X\u00d7Y} [ y \u2260 hg(x)(\u03c0g(x)(x)) ] = min_{M \u2208 Md} \u03a3_{\u03c0\u2208\u03a0} p(A(\u03c0)) E( y \u2260 hg(x)(\u03c0g(x)(x)) | x \u2208 A(\u03c0) ).\n\nThe expected classification error over A(\u03c0) is linked to the conditional entropy of Y|X. Fano's inequality provides a lower bound on the error while Feder and Merhav [4] derive a tight upper bound on the minimal error probability in terms of the entropy. This means that conditional entropy characterizes the potential of a subset of the feature space to separate data, which is more generic than simply quantifying classification accuracy for a specific discriminator.\nIn view of this connection between classification accuracy and entropy, we adapt the objective to:\n\nmin_{M \u2208 Md} \u03a3_{\u03c0\u2208\u03a0} p(A(\u03c0)) H(Y | \u03c0(X); X \u2208 A(\u03c0))    (1)\n\nThe method we propose optimizes an empirical analog of (1) which we develop below and for which we will need the following result.\nProposition 2.1. 
Given a continuous variable X \u2208 X and a binary variable Y, where X is sampled from the mixture model f(x) = p(y = 0) f0(x) + p(y = 1) f1(x) = p0 f0(x) + p1 f1(x), then\n\nH(Y|X) = \u2212p0 log p0 \u2212 p1 log p1 \u2212 p0 DKL(f0 || f) \u2212 p1 DKL(f1 || f).\n\nNext, we will use the nonparametric estimator presented in [13] for the Tsallis \u03b1-divergence. Given samples Ui \u223c U, with i = 1 . . . n, and Vj \u223c V, with j = 1 . . . m, the divergence is estimated as follows:\n\nT\u0302\u03b1(U || V) = 1/(\u03b1 \u2212 1) [ (1/n) \u03a3_{i=1}^{n} ( (n \u2212 1) \u03bdk(Ui, U \\ ui)^d / (m \u03bdk(Ui, V)^d) )^{1\u2212\u03b1} B(k, \u03b1) \u2212 1 ],    (2)\n\nwhere d is the dimensionality of the variables U and V and \u03bdk(z, Z) represents the distance from z to its kth nearest neighbor of the set of points Z. For \u03b1 \u2248 1 and n \u2192 \u221e, T\u0302\u03b1(u || v) \u2248 DKL(u || v).\n\n2.1 Local estimators of entropy\n\nWe will now plug (2) into the formula obtained by Proposition 2.1 to estimate the quantity (1). We use the notation X0 to represent the n0 samples from X which have the labels Y equal to 0, and X1 to represent the n1 samples from X which have the labels set to 1. 
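As a rough illustration, the estimator (2) can be implemented with brute-force nearest-neighbor search. This is only a sketch of the estimator from [13], not a faithful reimplementation: the bias-correction term B(k, \u03b1) is approximated by 1, which is our own simplification for \u03b1 near 1.

```python
import numpy as np

def kth_nn_dist(z, Z, k):
    """Distance from z to its k-th nearest neighbor in the point set Z."""
    return np.sort(np.linalg.norm(Z - z, axis=1))[k - 1]

def tsallis_divergence(U, V, k=3, alpha=0.99):
    """Sketch of the nonparametric Tsallis alpha-divergence estimator (2),
    with the correction B(k, alpha) taken to be 1 (an assumption)."""
    n, d = U.shape
    m = V.shape[0]
    terms = []
    for i in range(n):
        rho = kth_nn_dist(U[i], np.delete(U, i, axis=0), k)  # within-sample distance
        nu = kth_nn_dist(U[i], V, k)                         # cross-sample distance
        terms.append((((n - 1) * rho**d) / (m * nu**d)) ** (1.0 - alpha))
    # prefactor 1/(alpha - 1) so that the alpha -> 1 limit recovers the KL divergence
    return (np.mean(terms) - 1.0) / (alpha - 1.0)
```

For \u03b1 \u2248 1 the value approximates DKL(u||v): samples from well-separated densities yield a large positive estimate, while samples from the same density yield a value near zero.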
Also, Xy(x) represents the set of samples that have labels equal to the label of x, and X\u00acy(x) the data that have labels opposite to the label of x.\n\nH\u0302(Y|X; X \u2208 A) = \u2212H(p0) \u2212 H(p1) \u2212 p0 T\u0302(f0^x || f^x) \u2212 p1 T\u0302(f1^x || f^x) + C,  \u03b1 \u2248 1\n\nH\u0302(Y|X; X \u2208 A) \u221d (1/n) \u03a3_{xi\u2208X0} I[xi \u2208 A] ( (n0 \u2212 1) \u03bdk(xi, X0 \\ xi)^d / (n \u03bdk(xi, X \\ xi)^d) )^{1\u2212\u03b1}\n             + (1/n) \u03a3_{xi\u2208X1} I[xi \u2208 A] ( (n1 \u2212 1) \u03bdk(xi, X1 \\ xi)^d / (n \u03bdk(xi, X \\ xi)^d) )^{1\u2212\u03b1}\n\u221d (1/n) \u03a3_{xi\u2208X0} I[xi \u2208 A] ( (n0 \u2212 1) \u03bdk(xi, X0 \\ xi)^d / (n \u03bdk(xi, X1 \\ xi)^d) )^{1\u2212\u03b1}\n + (1/n) \u03a3_{xi\u2208X1} I[xi \u2208 A] ( (n1 \u2212 1) \u03bdk(xi, X1 \\ xi)^d / (n \u03bdk(xi, X0 \\ xi)^d) )^{1\u2212\u03b1}\n\u221d (1/n) \u03a3_{i=1}^{n} I[xi \u2208 A] ( (n \u2212 1) \u03bdk(xi, Xy(xi) \\ xi)^d / (n \u03bdk(xi, X\u00acy(xi) \\ xi)^d) )^{1\u2212\u03b1}    (3)\n\nThe estimator for the entropy of the data that is classified with projection \u03c0 is as follows:\n\nH\u0302(Y|\u03c0(X); X \u2208 A(\u03c0)) \u221d (1/n) \u03a3_{i=1}^{n} I[xi \u2208 A(\u03c0)] ( (n \u2212 1) \u03bdk(\u03c0(xi), \u03c0(Xy(xi)) \\ \u03c0(xi))^d / (n \u03bdk(\u03c0(xi), \u03c0(X\u00acy(xi)) \\ \u03c0(xi))^d) )^{1\u2212\u03b1}    (4)\n\nFrom (3), and using the fact that I[xi \u2208 A(\u03c0)] = I[\u03c0g(xi) = \u03c0], for which we use the notation I[g(xi) \u2192 \u03c0], we estimate the objective as\n\nmin_{M\u2208Md} \u03a3_{\u03c0\u2208\u03a0} (1/n) \u03a3_{i=1}^{n} I[g(xi) \u2192 \u03c0] ( (n \u2212 1) \u03bdk(\u03c0(xi), \u03c0(Xy(xi)) \\ \u03c0(xi))^d / (n \u03bdk(\u03c0(xi), \u03c0(X\u00acy(xi)) \\ \u03c0(xi))^d) )^{1\u2212\u03b1}\n\nTherefore, the contribution of each data point to the objective corresponds to a distance ratio on the projection 
\u03c0* where the class of the point is obtained with the highest confidence (data is separable in the neighborhood of the point). We start by computing the distance-based metric of each point on each projection of size up to d; there are d* such projections.\nThis procedure yields an extended set of features Z, which we name local entropy estimates:\n\nZij = ( \u03bdk(\u03c0j(xi), \u03c0j(Xy(xi)) \\ \u03c0j(xi)) / \u03bdk(\u03c0j(xi), \u03c0j(X\u00acy(xi)) \\ \u03c0j(xi)) )^{d(1\u2212\u03b1)},  \u03b1 \u2248 1,  j \u2208 {1 . . . d*}    (5)\n\nFor each training data point, we compute the best distance ratio amid all the projections, which is simply Ti = min_{j\u2208[d*]} Zij.\nThe objective can then be further rewritten as a function of the entropy estimates:\n\nmin_{M\u2208Md} \u03a3_{i=1}^{n} \u03a3_{\u03c0j\u2208\u03a0} I[g(xi) \u2192 \u03c0j] Zij    (6)\n\nFrom the definition of T, it is also clear that\n\nmin_{M\u2208Md} \u03a3_{i=1}^{n} \u03a3_{\u03c0j\u2208\u03a0} I[g(xi) \u2192 \u03c0j] Zij \u2265 \u03a3_{i=1}^{n} Ti.    (7)\n\n2.2 Projection selection as a combinatorial problem\n\nConsidering form (6) of the objective, and given that the estimates Zij are constants depending only on the training set, the projection retrieval problem reduces to finding g for all training points, which will implicitly select the projection set of the model. Naturally, one might assume the best-performing classification model is the one containing all the axis-aligned subspaces. This model achieves the lower bound (7) for the training set. However, the larger the set of projections, the more values the function g takes, and thus the problem of selecting the correct projection becomes more difficult. It becomes apparent that the number of projections should be somehow restricted to allow interpretability. 
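The construction of the extended feature set (5) and of T can be sketched directly from the definitions. The brute-force distance computation below is illustrative rather than the scalable implementation the paper alludes to:

```python
import numpy as np
from itertools import combinations

def local_entropy_estimates(X, y, k=3, alpha=0.99, d=2):
    """Sketch of the matrix Z from (5): one column per axis-aligned projection of
    size up to d, one row per training point; T[i] = min_j Z[i, j]."""
    n, p = X.shape
    projections = [c for s in range(1, d + 1) for c in combinations(range(p), s)]
    Z = np.empty((n, len(projections)))
    for j, cols in enumerate(projections):
        P = X[:, list(cols)]
        for i in range(n):
            same = P[y == y[i]]   # includes the point itself; skipped below
            opp = P[y != y[i]]
            nu_same = np.sort(np.linalg.norm(same - P[i], axis=1))[k]
            nu_opp = np.sort(np.linalg.norm(opp - P[i], axis=1))[k - 1]
            Z[i, j] = (nu_same / nu_opp) ** (len(cols) * (1.0 - alpha))
    return Z, Z.min(axis=1), projections
```

On an informative projection, same-class neighbors are closer than opposite-class ones, so the corresponding column of Z is systematically smaller than on a noise projection.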
Assuming a hard threshold of at most t projections, the optimization (6) becomes an entry selection problem over the matrix Z, where one value must be picked from each row under a limitation on the number of columns that can be used. This problem cannot be solved exactly in polynomial time. Instead, it can be formulated as an optimization problem under \u21131 constraints.\n\n2.3 Projection retrieval through regularized regression\n\nTo transform the projection retrieval to a regression problem we consider T, the minimum obtainable value of the entropy estimator for each point, as the output which the method needs to predict. Each row i of the parameter matrix B represents the degrees to which the entropy estimates on each projection contribute to the entropy estimator of point xi. Thus, the sum over each row of B is 1, and the regularization penalty applies to the number of non-zero columns in B:\n\nmin_B || T \u2212 (Z \u2299 B) J_{|\u03a0|,1} ||\u00b2_2 + \u03bb \u03a3_{i=1}^{d*} I[Bi \u2260 0]\nsubject to |Bk|\u21131 = 1, k = 1 . . . n,    (8)\n\nwhere (Z \u2299 B)ij = Zij Bij, Bi denotes the i-th column of B, Bk the k-th row, and J_{|\u03a0|,1} is a column vector of ones.\n\nThe problem with this optimization is that it is not convex. A typical workaround for this issue is to use the convex relaxation for Bi \u2260 0, that is, the \u21131 norm. This would transform the penalized term to \u03a3_{i=1}^{d*} |Bi|\u21131, the sum of the column norms. However, the sum of the column norms equals the sum of the row norms, \u03a3_{k=1}^{n} |Bk|\u21131 = n, so this penalty really has no effect.\nAn alternative mechanism to encourage the non-zero elements in B to populate a small number of columns is to add a penalty term of the form B\u03b4, where \u03b4 is a d*-size column vector with each element representing the penalty for a column in B. With no prior information about which subspaces are more informative, \u03b4 starts as an all-1 vector. An initial value for B is obtained through the optimization (8). 
Since our goal is to handle data using a small number of projections, \u03b4 is then updated such that its value is lower for the denser columns in B. This update resembles the re-weighting in adaptive lasso. The matrix B itself is updated, and this 2-step process continues until convergence of \u03b4. Once \u03b4 converges, the projections corresponding to the non-zero columns of B are added to the model. The procedure is shown in Algorithm 1.\n\nAlgorithm 1: RECIP\n\n\u03b4 = [1 . . . 1]\nrepeat\n    B = arg min_B || T \u2212 (Z \u2299 B) J ||\u00b2_2 + \u03bb |B\u03b4|\u21131, subject to |Bk|\u21131 = 1, k = 1 . . . n\n    \u03b4i = |Bi|\u21131, i = 1 . . . d*  (update the differential penalty)\n    \u03b4 = 1 \u2212 \u03b4 / |\u03b4|\u21131\nuntil \u03b4 converges\nreturn \u03a0 = {\u03c0i : |Bi|\u21131 > 0, i = 1 . . . d*}\n\n2.4 Lasso for projection selection\n\nWe will compare our algorithm to lasso regularization that ranks the projections in terms of their potential for data separability. We write this as an \u21131-penalized optimization on the extended feature set Z, with the objective T: min_\u03b2 |T \u2212 Z\u03b2|\u00b2 + \u03bb|\u03b2|\u21131. The lasso penalty on the coefficient vector encourages sparsity. For a high enough \u03bb, the sparsity pattern in \u03b2 is indicative of the usefulness of the projections. The lasso on entropy contributions was not found to perform well, as it is not query specific and will find one projection for all data. We improved it by allowing it to iteratively find projections; this robust version offers increased performance by reweighting the data, thus focusing on different subsets of it. Although better than running lasso on entropy contributions, the robust lasso does not match RECIP's performance, as the projections are selected gradually rather than jointly. 
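Algorithm 1 can be illustrated end-to-end if the constrained inner optimization is replaced by a greedy stand-in: each point places all of its row mass on the column minimizing Z[i, j] + \u03bb\u03b4[j]. This greedy inner step is our own simplification for exposition, not the regression solver RECIP actually uses:

```python
import numpy as np

def select_projections_greedy(Z, lam=0.1, n_iter=10):
    """Sketch of the RECIP reweighting loop (Algorithm 1) with a greedy inner step.
    delta is the differential column penalty; denser columns get a smaller one."""
    n, m = Z.shape
    delta = np.ones(m)
    for _ in range(n_iter):
        choice = np.argmin(Z + lam * delta, axis=1)  # per-row column choice
        B = np.zeros((n, m))
        B[np.arange(n), choice] = 1.0                # each row of B sums to one
        col_mass = B.sum(axis=0)
        delta = 1.0 - col_mass / col_mass.sum()      # re-weighting step of Algorithm 1
    return np.flatnonzero(B.sum(axis=0) > 0)
```

Rows that are best explained by the same projection reinforce that column, so its penalty shrinks, mimicking the adaptive-lasso-style re-weighting.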
Running the standard lasso on the original design matrix yields a set of relevant variables, and it is not immediately clear how the solution would translate to the desired class.\n\n2.5 The selection function\n\nOnce the projections are selected, the second stage of the algorithm deals with assigning the projection with which to classify a particular query point. An immediate way of selecting the correct projection starts by computing the local entropy estimator for each subspace with each class assignment. Then, we may select the label/subspace combination that minimizes the empirical entropy:\n\n(i*, \u03b8*) = arg min_{i,\u03b8} ( \u03bdk(\u03c0i(x), \u03c0i(X\u03b8)) / \u03bdk(\u03c0i(x), \u03c0i(X\u00ac\u03b8)) )^{dim(\u03c0i)(1\u2212\u03b1)},  i = 1 . . . d*,  \u03b1 \u2248 1    (9)\n\n3 Experimental results\n\nIn this section we illustrate the capability of RECIP to retrieve informative projections of data and their use in support of interpreting results of classification. First, we analyze how well RECIP can identify subspaces in synthetic data whose distribution obeys the subspace separability assumption (3.1). As a point of reference, we also present classification accuracy results (3.2) for both the synthetic data and a few real-world sets. This is to quantify the extent of the trade-off between fidelity of attainable classifiers and desired informativeness of the projections chosen by RECIP. We expect RECIP's classification performance to be slightly, but not substantially, worse when compared to relevant classification algorithms trained to maximize classification accuracy. 
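At query time, the selection rule (9) reduces to comparing k-NN distance ratios over the retained projections. A direct sketch, again with brute-force distances and with projections given as tuples of feature indices (an assumption about the representation):

```python
import numpy as np

def classify_query(x, X, y, projections, k=3, alpha=0.99):
    """Sketch of the selection function (9): score every (projection, label) pair
    by a k-NN distance ratio and return the minimizing pair."""
    best_score, best = np.inf, (None, None)
    for j, cols in enumerate(projections):
        xp = x[list(cols)]
        for theta in (0, 1):
            d_in = np.sort(np.linalg.norm(X[y == theta][:, list(cols)] - xp, axis=1))[k - 1]
            d_out = np.sort(np.linalg.norm(X[y != theta][:, list(cols)] - xp, axis=1))[k - 1]
            score = (d_in / d_out) ** (len(cols) * (1.0 - alpha))
            if score < best_score:
                best_score, best = score, (j, theta)
    return best  # (index of the chosen projection, predicted label)
```

The chosen projection is also the one displayed to the user, which is what makes the prediction inspectable.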
Finally, we present a few examples (3.3) of informative projections recovered from real-world data and their utility in explaining to human users the decision processes applied to query points.\nA set of artificial data used in our experiments contains q batches of data points, each of them made classifiable with high accuracy using one of the available 2-dimensional subspaces (x_k^1, x_k^2) with k \u2208 {1 . . . q}. The data in batch k also have the property that x_k^1 > tk. This is done such that the batch a point belongs to can be detected from x_k^1; thus, x_k^1 is a regulatory variable. We control the amount of noise added to the thusly created synthetic data by varying the proportion of noisy data points in each batch. The results below are for datasets with 7 features each, with the number of batches q ranging between 1 and 7. We kept the number of features deliberately low in order to prevent excessive variation between any two sets generated this way, and to enable computing meaningful estimates of the expectation and variance of performance, while enabling creation of complicated data in which synthetic patterns may substantially overlap (using 7 features and 7 2-dimensional patterns implies that dimensions of at least 4 of the patterns will overlap). We implemented our method to be scalable to the size and dimensionality of data and, although for brevity we do not include a discussion of this topic here, we have successfully run RECIP against data with 100 features.\nThe parameter \u03b1 is a value close to 1, because the Tsallis divergence converges to the KL divergence as \u03b1 approaches 1. For the experiments on real-world data, d was set to the total number of features (all projections were considered). For the artificial data experiments, we report results for d = 2, as they do not change significantly for d \u2265 2 because this data was synthesized to contain bidimensional informative projections. 
In general, if d is too low, the correct full set of projections will not be found, but it may be recovered partially. If d is chosen too high, there is a risk that a selected projection p will contain irrelevant features compared to the true projection p0. However, this situation only occurs if the noise introduced by these features in the estimators makes the entropy contributions on p and p0 statistically indistinguishable for a large subset of the data. Users will choose d according to the desired/acceptable complexity of the resulting model. If the results are to be visually interpreted by a human, values of 2 or 3 are reasonable for d.\n\n3.1 Recovering informative projections\n\nTable 1 shows how well RECIP recovers the q subspaces corresponding to the synthesized batches of data. We measure precision (the proportion of the recovered projections that are known to be informative) and recall (the proportion of known informative projections that are recovered by the algorithm). In Table 1, rows correspond to the number of distinct synthetic batches injected in data, q, and subsequent columns correspond to increasing amounts of noise in the data. We note that the observed precision is nearly perfect: the algorithm makes only 2 mistakes over the entire set of experiments, and those occur for highly noisy setups. The recall is nearly perfect as long as there is little overlap among the dimensions, that is, when the injections do not interfere with each other. As the number of projections increases, the chances of overlap among the affected features also increase, which makes the data more confusing, resulting in a gradual drop of recall until only about 3 or 4 of the 7 subspaces known to be informative can be recovered. We have also used lasso as described in 2.4 in an attempt to recover projections. 
This setup only manages to recover one of the informative subspaces, regardless of how the regularization parameter is tuned.\n\n3.2 Classification accuracy\n\nTable 2 shows the classification accuracy of RECIP, obtained using synthetic data. As expected, the observed performance is initially high when there are few known informative projections in the data, and it decreases as noise and ambiguity of the injected patterns increase.\nMost types of ensemble learners would use a voting scheme to arrive at the final classification of a testing sample, rather than use a model selection scheme. For this reason, we have also compared the predictive accuracy revealed by RECIP against a method based on majority voting among multiple candidate subspaces. Table 4 shows that the accuracy of this technique is lower than the accuracy of RECIP, regardless of whether the informative projections are recovered by the algorithm or assumed to be known a priori. This confirms the intuition that a selection-based approach can be more effective than voting for data which satisfies the subspace separability assumption.\nFor reference, we have also classified the synthetic data using the K-Nearest-Neighbors algorithm using all available features at once. The results of that experiment are shown in Table 5. Since RECIP uses neighbor information, K-NN is conceptually the closest among the popular alternatives. Compared to RECIP, K-NN performs worse when there are fewer synthetic patterns injected in the data to form informative projections. This is because some of the features then used by K-NN are noisy. As more features become informative, the K-NN accuracy improves. 
This example shows the benefit of a selective approach to the feature space, using a subset of the most explanatory projections to support not only explanatory analyses but also classification tasks in such circumstances.\n\n3.3 RECIP case studies using real-world data\n\nTable 3 summarizes the RECIP and K-NN performance on UCI datasets. We also test the methods using the Cell dataset, containing a set of measurements, such as area and perimeter, of biological cells, with separate labels marking cells subjected to treatment and control cells. In the Vowel data, the nearest-neighbor approach works exceptionally well, even outperforming random forests (0.94 accuracy), which is an indication that all features are jointly relevant. For d lower than the number of features, RECIP picks projections of only one feature, but if there is no such limitation, RECIP picks the space of all the features as informative.\n\nTable 1: Projection recovery for artificial datasets with 1 . . . 7 informative features and noise level 0 . . . 0.2 in terms of mean and variance of Precision and Recall. 
Mean/var obtained for each setting\nby repeating the experiment with datasets with different informative projections.\n\n0\n1\n1\n1\n1\n1\n1\n1\n\n0.02\n1\n1\n1\n1\n1\n1\n1\n\n0\n1\n1\n1\n0.9643\n0.7714\n0.6429\n0.6327\n\n0.02\n1\n1\n1\n0.9643\n0.7429\n0.6905\n0.5918\n\nMean\n0.05\n1\n1\n1\n1\n1\n1\n1\n\nMean\n0.05\n1\n1\n0.9524\n0.9643\n0.8286\n0.6905\n0.5918\n\nPRECISION\n\n0.1\n0.9286\n1\n1\n1\n1\n1\n1\n\n0.1\n1\n1\n0.9524\n0.9643\n0.8571\n0.6905\n0.5714\n\n0.2\n0.9286\n1\n1\n1\n1\n1\n1\n\nRECALL\n\n0\n0\n0\n0\n0\n0\n0\n0\n\n0.02\n0\n0\n0\n0\n0\n0\n0\n\n0.2\n1\n1\n1\n0.9286\n0.7714\n0.6905\n0.551\n\n0\n0\n0\n0\n0.0077\n0.0163\n0.0113\n0.0225\n\n0.02\n0\n0\n0\n0.0077\n0.0196\n0.0113\n0.02\n\nVariance\n0.05\n0\n0\n0\n0\n0\n0\n0\n\nVariance\n0.05\n0\n0\n0.0136\n0.0077\n0.0049\n0.0272\n0.0258\n\n0.1\n0.0306\n0\n0\n0\n0\n0\n0\n\n0.1\n0\n0\n0.0136\n0.0077\n0.0082\n0.0113\n0.0233\n\n0.2\n0.0306\n0\n0\n0\n0\n0\n0\n\n0.2\n0\n0\n0\n0.0128\n0.0278\n0.0113\n0.02\n\nTable 2: RECIP Classi\ufb01cation Accuracy on Arti\ufb01cial Data\n\nCLASSIFICATION ACCURACY\n\n0.02\n0.0000\n0.0001\n0.0005\n0.0020\n0.0044\n0.0021\n0.0040\n\n0\n0.0000\n0.0001\n0.0004\n0.0020\n0.0042\n0.0025\n0.0034\n\n0.2\n0.9420\n0.8946\n0.8618\n0.8187\n0.7782\n0.7511\n0.7205\n\nMean\n0.05\n0.9686\n0.9227\n0.8764\n0.8589\n0.8105\n0.7669\n0.7347\n\nVariance\n0.05\n0.0000\n0.0001\n0.0016\n0.0019\n0.0033\n0.0026\n0.0042\nCLASSIFICATION ACCURACY - KNOWN 
PROJECTIONS\nVariance\n0.05\n0.0000\n0.0001\n0.0005\n0.0014\n0.0018\n0.0016\n0.0018\n\nMean\n0.05\n0.9686\n0.9227\n0.8914\n0.8657\n0.8523\n0.8377\n0.8256\n\n0.2\n0.9514\n0.8946\n0.8618\n0.8331\n0.8209\n0.8074\n0.7988\n\n0.1\n0.9543\n0.9067\n0.8640\n0.8454\n0.8105\n0.7632\n0.7278\n\n0.1\n0.9637\n0.9067\n0.8777\n0.8541\n0.8429\n0.8285\n0.8122\n\n0.02\n0.9731\n0.9297\n0.8967\n0.8685\n0.8009\n0.7739\n0.7399\n\n0.02\n0.9731\n0.9297\n0.8967\n0.8781\n0.8641\n0.8497\n0.8371\n\n0\n0.0000\n0.0001\n0.0004\n0.0011\n0.0015\n0.0014\n0.0015\n\n0.02\n0.0000\n0.0001\n0.0005\n0.0011\n0.0015\n0.0015\n0.0018\n\n0.2\n0.0000\n0.0001\n0.0007\n0.0020\n0.0023\n0.0021\n0.0020\nTable 3: Accuracy of K-NN and RECIP\n\n0.1\n0.0001\n0.0001\n0.0007\n0.0014\n0.0019\n0.0023\n0.0021\n\n0.1\n0.0008\n0.0001\n0.0028\n0.0025\n0.0036\n0.0025\n0.0042\n\n0.2\n0.0007\n0.0001\n0.0007\n0.0032\n0.0044\n0.0027\n0.0045\n\n0\n0.9751\n0.9333\n0.9053\n0.8725\n0.8113\n0.7655\n0.7534\n\n0\n0.9751\n0.9333\n0.9053\n0.8820\n0.8714\n0.8566\n0.8429\n\n1\n2\n3\n4\n5\n6\n7\n\n1\n2\n3\n4\n5\n6\n7\n\n1\n2\n3\n4\n5\n6\n7\n\n1\n2\n3\n4\n5\n6\n7\n\nDataset\nBreast Cancer Wis\nBreast Tissue\nCell\nMiniBOONE*\n\nKNN RECIP\n0.8415\n0.8275\n1.0000\n1.0000\n0.7640\n0.7072\n0.7896\n0.7396\n0.7680\nSpam 0.7680\n0.9839\n0.9839\nVowel\n\nIn Spam data, the two most informative projections are\n\u2019Capital Length Total\u2019 (CLT)/\u2019Capital Length Longest\u2019\n(CLL) and CLT/\u2019Frequency of word your\u2019 (FWY). Fig-\nure 1 shows these two projections, with the dots repre-\nsenting training points. The red dots represent points la-\nbeled as spam while the blue ones are non-spam. The\ncircles are query points that have been assigned to be clas-\nsi\ufb01ed with the projection in which they are plotted. The green circles are correctly classi\ufb01ed points,\nwhile the magenta circles - far fewer - are the incorrectly classi\ufb01ed ones. 
Not only does the importance of text in capital letters make sense for a spam-filtering dataset, but the points that select those projections are almost flawlessly classified. Additionally, assuming the user would need to attest the validity of classification for the first plot, he or she would have no trouble seeing that the circled data points are located in a region predominantly populated with examples of spam, so any non-spam entry appears suspicious. Both of the magenta-colored cases fall into this category, and they can therefore be flagged for further investigation.

Figure 1: Spam Dataset Selected Subspaces

Table 4: Classification accuracy using RECIP-learned projections (or known projections, in the lower section) within a voting model instead of a selection model

CLASSIFICATION ACCURACY - VOTING ENSEMBLE
                        Mean                                     Variance
Projections   0        0.02     0.05     0.1      0.2      0        0.02     0.05     0.1      0.2
1             0.9751   0.9731   0.9686   0.9317   0.9226   0.0000   0.0000   0.0000   0.0070   0.0053
2             0.7360   0.7354   0.7331   0.7303   0.7257   0.0002   0.0002   0.0001   0.0002   0.0001
3             0.7290   0.7266   0.7163   0.7166   0.7212   0.0002   0.0002   0.0008   0.0006   0.0002
4             0.6934   0.6931   0.6932   0.6904   0.6867   0.0008   0.0008   0.0008   0.0008   0.0009
5             0.6715   0.6602   0.6745   0.6688   0.6581   0.0013   0.0014   0.0013   0.0014   0.0013
6             0.6410   0.6541   0.6460   0.6529   0.6512   0.0008   0.0007   0.0010   0.0006   0.0005
7             0.6392   0.6342   0.6268   0.6251   0.6294   0.0009   0.0011   0.0012   0.0012   0.0012

CLASSIFICATION ACCURACY - VOTING ENSEMBLE, KNOWN PROJECTIONS
                        Mean                                     Variance
Projections   0        0.02     0.05     0.1      0.2      0        0.02     0.05     0.1      0.2
1             0.9751   0.9731   0.9686   0.9637   0.9514   0.0000   0.0000   0.0000   0.0001   0.0000
2             0.7360   0.7354   0.7331   0.7303   0.7257   0.0002   0.0002   0.0001   0.0002   0.0001
3             0.7409   0.7385   0.7390   0.7353   0.7325   0.0010   0.0012   0.0010   0.0011   0.0010
4             0.7110   0.7109   0.7083   0.7067   0.7035   0.0041   0.0041   0.0042   0.0042   0.0043
5             0.7077   0.7070   0.7050   0.7034   0.7008   0.0015   0.0015   0.0015   0.0016   0.0016
6             0.6816   0.6807   0.6801   0.6790   0.6747   0.0008   0.0008   0.0008   0.0008   0.0009
7             0.6787   0.6783   0.6772   0.6767   0.6722   0.0008   0.0009   0.0009   0.0008   0.0008

Table 5: Classification accuracy for artificial data with the K-Nearest Neighbors method

CLASSIFICATION ACCURACY - KNN
                        Mean                                     Variance
Projections   0        0.02     0.05     0.1      0.2      0        0.02     0.05     0.1      0.2
1             0.7909   0.7843   0.7747   0.7652   0.7412   0.0002   0.0002   0.0002   0.0002   0.0002
2             0.7940   0.7911   0.7861   0.7790   0.7655   0.0001   0.0001   0.0001   0.0001   0.0001
3             0.7964   0.7939   0.7901   0.7854   0.7756   0.0000   0.0001   0.0001   0.0000   0.0000
4             0.7990   0.7972   0.7942   0.7904   0.7828   0.0001   0.0001   0.0001   0.0001   0.0001
5             0.8038   0.8024   0.8002   0.7970   0.7905   0.0001   0.0001   0.0001   0.0001   0.0001
6             0.8043   0.8032   0.8015   0.7987   0.7930   0.0001   0.0001   0.0001   0.0001   0.0001
7             0.8054   0.8044   0.8028   0.8004   0.7955   0.0001   0.0001   0.0001   0.0001   0.0001

4 Conclusion

This paper considers the problem of Projection Retrieval for Classification. It is relevant in applications where the decision process must be easy to understand in order to enable human interpretation of the results. We have developed a principled, regression-based algorithm designed to recover small sets of low-dimensional subspaces that support interpretability. It optimizes the selection using individual data-point-specific entropy estimators. In this context, the proposed algorithm follows the idea of transductive learning, and the role of the resulting projections bears resemblance to high-confidence regions known in conformal prediction models.
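The voting model evaluated in Table 4 aggregates per-projection predictions instead of routing each query to a single selected projection. A minimal sketch, with plain k-NN voters standing in for the per-projection classifiers (hypothetical code, not the experimental implementation):

```python
import numpy as np
from collections import Counter

def vote_classify(X, y, q, projections, k=5):
    """Voting model: each 2-D projection casts a k-NN vote on the
    query's label, and the majority label across projections wins."""
    votes = Counter()
    for pair in projections:
        cols = list(pair)
        dist = np.linalg.norm(X[:, cols] - q[cols], axis=1)
        nearest = np.argsort(dist)[:k]
        local_pred = Counter(y[nearest]).most_common(1)[0][0]
        votes[local_pred] += 1
    return votes.most_common(1)[0][0]
```

Passing a known set of projections rather than a learned one corresponds to the lower section of Table 4.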
Empirical results obtained using simulated and real-world data show the effectiveness of our method in finding informative projections that enable accurate classification while maintaining transparency of the underlying decision process.

Acknowledgments

This material is based upon work supported by the NSF under Grant No. IIS-0911032.

[Figure 1 plots: 'Capital Run Length Total' vs. 'Capital Run Length Longest', and 'Capital Run Length Total' vs. frequency of word 'your', each titled 'Informative Projection for the Spam dataset'.]