{"title": "Efficient High-Order Interaction-Aware Feature Selection Based on Conditional Mutual Information", "book": "Advances in Neural Information Processing Systems", "page_first": 4637, "page_last": 4645, "abstract": "This study introduces a novel feature selection approach CMICOT, which is a further evolution of filter methods with sequential forward selection (SFS) whose scoring functions are based on conditional mutual information (MI). We state and study a novel saddle point (max-min) optimization problem to build a scoring function that is able to identify joint interactions between several features. This method fills the gap of MI-based SFS techniques with high-order dependencies. In this high-dimensional case, the estimation of MI has prohibitively high sample complexity. We mitigate this cost using a greedy approximation and binary representatives what makes our technique able to be effectively used. The superiority of our approach is demonstrated by comparison with recently proposed interaction-aware filters and several interaction-agnostic state-of-the-art ones on ten publicly available benchmark datasets.", "full_text": "Ef\ufb01cient High-Order Interaction-Aware Feature\n\nSelection Based on Conditional Mutual Information\n\nAlexander Shishkin, Anastasia Bezzubtseva, Alexey Drutsa,\n\nIlia Shishkov, Ekaterina Gladkikh, Gleb Gusev, Pavel Serdyukov\n\nYandex; 16 Leo Tolstoy St., Moscow 119021, Russia\n\n{sisoid,nstbezz,adrutsa,ishfb,kglad,gleb57,pavser}@yandex-team.ru\n\nAbstract\n\nThis study introduces a novel feature selection approach CMICOT, which is a\nfurther evolution of \ufb01lter methods with sequential forward selection (SFS) whose\nscoring functions are based on conditional mutual information (MI). We state and\nstudy a novel saddle point (max-min) optimization problem to build a scoring\nfunction that is able to identify joint interactions between several features. 
This\nmethod \ufb01lls the gap left by MI-based SFS techniques, which cannot account for\nhigh-order dependencies. In this high-dimensional case, the estimation of MI has prohibitively high sample\ncomplexity. We mitigate this cost using a greedy approximation and binary\nrepresentatives, which makes our technique effective in practice. The superiority of\nour approach is demonstrated by comparison with recently proposed interaction-\naware \ufb01lters and several interaction-agnostic state-of-the-art ones on ten publicly\navailable benchmark datasets.\n\n1\n\nIntroduction\n\nFeature selection is an important topic in machine learning [8, 2, 17], since it improves the\nperformance of learning systems while reducing their computational costs. Feature selection methods\nare usually grouped into three main categories: wrapper, embedded, and \ufb01lter methods [8]. Filters are\ncomputationally cheap and are independent of a particular learning model, which makes them popular\nand broadly applicable. In this paper, we focus on the most popular \ufb01lters, which are based on mutual\ninformation (MI) and apply the sequential forward selection (SFS) strategy to obtain an optimal\nsubset of features [17]. In applications such as web search, features may be highly relevant only\njointly (having a low relevance separately). A challenging task is to account for such interactions [17].\nExisting SFS-based \ufb01lters [18, 3, 24] are able to account for interactions of only up to 3 features.\nIn this study, we \ufb01ll this gap by providing an effective SFS-based \ufb01lter that accounts for feature\ndependences of higher orders. The search for t-way interacting features is turned into a novel saddle\npoint (max-min) optimization problem for the MI of the target variable and the candidate feature with\nits complementary team conditioned on its opposing team of previously selected features. 
We show\nthat, on the one hand, the saddle value of this conditional MI is a low-dimensional approximation\nof the CMI score1 and, on the other hand, solving that problem poses two practical challenges:\n(a) prohibitively high computational complexity and (b) sample complexity, i.e., the large number of\ninstances required to accurately estimate the MI. These issues are addressed by two novel techniques:\n(a) a two-stage greedy search for the approximate solution of the above-mentioned problem whose\ncomputational complexity is O(i) at each i-th SFS iteration; and (b) binary representation of features\nthat reduces the dimension of the space of joint distributions by a factor of (q/2)^(2t) for q-value\nfeatures. Being reasonable and intuitive, these techniques together constitute the main contribution of\nour study: a novel SFS method CMICOT that is able to identify joint interactions between multiple\n\n1The CMI \ufb01lter is believed to be a \u201cnorth star\" for the vast majority of the state-of-the-art \ufb01lters [2].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\ffeatures. We also empirically validate our approach with 3 state-of-the-art classi\ufb01cation models on\n10 publicly available benchmark datasets and compare it with known interaction-aware SFS-based\n\ufb01lters and several state-of-the-art ones.\n\n2 Preliminaries and related work\n\nInformation-theoretic measures. The mutual information (MI) of two random variables f and\ng is de\ufb01ned as I(f ; g) = H(f ) + H(g) \u2212 H(f, g), where H(f ) = \u2212E [log P(f )] is Shannon\u2019s\nentropy [4]2. The conditional mutual information of two random variables f and g given the variable\nh is I(f ; g | h) = I(f ; g, h) \u2212 I(f ; h). The conditional MI measures the amount of additional\ninformation about the variable f carried by g in addition to that carried by the variable h. 
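For concreteness, the plug-in (empirical-frequency) estimators behind these definitions can be sketched as follows; this is our illustrative code, not the authors', and it assumes aligned sample lists of discrete values:

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Plug-in estimate of Shannon's entropy H(f) from samples of a discrete variable."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def mutual_information(f, g):
    """I(f; g) = H(f) + H(g) - H(f, g), estimated from paired samples."""
    return entropy(f) + entropy(g) - entropy(list(zip(f, g)))

def conditional_mi(f, g, h):
    """I(f; g | h) = I(f; (g, h)) - I(f; h)."""
    return mutual_information(f, list(zip(g, h))) - mutual_information(f, h)
```

For example, when g is an exact copy of f the estimate equals H(f), and when g carries no information it is 0. Note that these point estimates are biased on small samples, which is why more sophisticated (e.g., Bayesian) estimators are also mentioned in the text.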
Given sample data, entropy\n(and, hence, MI and conditional MI) of discrete variables can simply be estimated using the\nempirical frequencies (point estimates) [15] or in a more sophisticated way (e.g., by means of\nthe Bayesian framework [10]). More details on different entropy estimators can be found in [15].\nBackground of the feature selection based on MI. Let F be a set of features that could be used by\na classi\ufb01er to predict a variable c representing a class label. The objective of a feature selection (FS)\nprocedure is to \ufb01nd a feature subset So \u2286 F of a given size k \u2208 N that maximizes its joint MI with\nthe class label c, i.e., So = argmax{S:S\u2286F,|S|\u2264k} I(c; S). In our paper, we focus on this simple but\ncommonly studied FS objective in the context of MI-based \ufb01lters [2], though there is a wide variety\nof other de\ufb01nitions of an optimal subset of features [17] (e.g., the all-relevant problem [13]).\nIn order to avoid an exhaustive search of an optimal subset S, most \ufb01lters are based on sub-optimal\nsearch strategies. The most popular one is the sequential forward selection (SFS) [20, 23, 17], which\nstarts with an empty set (S0 := \u2205) and iteratively increases it by adding one currently unselected\nfeature on each step (Si := Si\u22121 \u222a {fi}, i = 1, . . . , k, and So := Sk). The feature fi is usually\nselected by maximizing a certain scoring function (also called score) Ji(f ) that is calculated with\nrespect to the currently selected features Si\u22121, i.e., fi := argmaxf\u2208F\\Si\u22121 Ji(f ).\nA trivial feature selection approach is to select the top-k features in terms of their MI with the class label\nc [12]. This technique is referred to as MIM [2] and is a particular case of the SFS strategy based\non the score J^MIM_i(f ) := I(c; f ). Note that the resulting set may contain a lot of redundant features,\nsince the scoring function J^MIM_i(\u00b7) is independent from the already selected features Si\u22121. 
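The SFS strategy and the MIM score described above can be sketched as follows; this is a toy illustration of ours, where the feature names, data layout, and the plug-in MI helper are all our assumptions:

```python
from collections import Counter
from math import log2

def H(xs):
    """Plug-in entropy estimate from samples."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def mi(xs, ys):
    """Plug-in estimate of I(x; y)."""
    return H(xs) + H(ys) - H(list(zip(xs, ys)))

def sfs(feature_names, score, k):
    """Generic SFS: start from an empty set and greedily add the feature
    maximizing the score J_i(f) computed w.r.t. the selected set S_{i-1}."""
    selected, remaining = [], set(feature_names)
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda f: score(f, selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# MIM: J(f) = I(c; f), ignoring the already selected features entirely.
data = {"f1": [0, 0, 1, 1], "f2": [0, 1, 0, 1], "f3": [1, 1, 1, 1]}
c = [0, 0, 1, 1]
mim = lambda f, S: mi(c, data[f])
```

Here f1 duplicates the label (1 bit of MI) and is selected first; since J^MIM ignores S_{i-1}, an exact copy of f1 would be ranked just as high, which is precisely the redundancy problem discussed next.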
Among\nmethods that take into account the redundancy between features [2, 17], the most popular and widely\napplicable ones are MIFS [1], JMI [21, 14], CMIM [6, 19], and mRMR [16]. Brown et al. [2] uni\ufb01ed\nthese techniques under one framework, where they are different low-order approximations of the CMI\nfeature selection approach. This method is based on the score equal to the MI of the label with the\nevaluated feature conditioned on the already selected features:\n\nJ^CMI_i(f ) := I(c; f | Si\u22121).\n\n(1)\n\nThe main drawback of CMI is the sample complexity, namely, the exponential growth of the dimension\nof the distribution of the tuple (c, f, Si\u22121) with respect to i. The larger the dimension is, the larger the\nnumber of instances required to accurately estimate the conditional MI in Eq. (1). Therefore, this\ntechnique is not usable in the case of small samples and in cases when a large number of features\nshould be selected [2]. This is also observed in our experiment in Appendix F.2, where the empirical\nscore estimated over high dimensions results in drastically low performance of CMI.\nThus, low-dimensional approximations of Eq. (1) are preferable in practice. For instance, the\nCMIM approach approximates Eq. (1) by\n\nJ^CMIM_i(f ) := min_{g\u2208Si\u22121} I(c; f | g),\n\n(2)\n\ni.e., one replaces the redundancy of f with respect to the whole subset Si\u22121 by the worst redundancy\nwith respect to one feature from this subset. 
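The CMIM approximation in Eq. (2) can be sketched in the same style; helper names are ours and plug-in estimates are assumed:

```python
from collections import Counter
from math import log2

def H(xs):
    """Plug-in entropy estimate from samples."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def mi(xs, ys):
    """Plug-in estimate of I(x; y)."""
    return H(xs) + H(ys) - H(list(zip(xs, ys)))

def cmi(xs, ys, zs):
    """I(x; y | z) = I(x; (y, z)) - I(x; z)."""
    return mi(xs, list(zip(ys, zs))) - mi(xs, zs)

def cmim_score(c, f, selected):
    """J_CMIM(f) = min over g in S_{i-1} of I(c; f | g);
    falls back to plain I(c; f) when nothing has been selected yet."""
    if not selected:
        return mi(c, f)
    return min(cmi(c, f, g) for g in selected)
```

For example, a candidate that is an exact copy of an already selected feature g gets I(c; f | g) = 0 and hence a zero score, which is exactly the worst-single-feature redundancy penalty described above.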
The other popular methods (mentioned above) are\nparticular cases of the following approximation of the I(c; f | Si\u22121):\n\n(f ) := I(c; f ) \u2212 (cid:88)\n\n(cid:16)\n\nJ \u03b2,\u03b3\ni\n\n(cid:17)\n\none random vector variable, e.g., I(f ; g, h) := I(cid:0)f ; (g, h)(cid:1) and, for F = \u222an\n\n2From here on in the paper, variables separated by commas or a set of variables in MI expressions are treated as\ni=1{fi}, I(f ; F ) := I(f ; f1, .., fn).\n\ng\u2208Si\u22121\n\n\u03b2I(g; f ) \u2212 \u03b3I(g; f | c)\n\n,\n\n(3)\n\n2\n\n\fre\ufb01ned by adding the three-way feature interaction terms(cid:80)\n\ne.g., MIFS (\u03b2 \u2208 [0, 1], \u03b3 = 0), mRMR (\u03b2 = |Si\u22121|\u22121, \u03b3 = 0), and JMI (\u03b2 = \u03b3 = |Si\u22121|\u22121).\nAn important but usually neglected aspect in FS methods is feature complementariness [8, 3] (also\nknown as synergy [24] and interaction [11]). In general, complementary features are those that\nappear to have low relevance to the target class c individually, but whose combination is highly\nrelevant [25, 24]. In the next subsection, we provide a brief overview of existing studies on \ufb01lters that\ntake into account feature interaction. A reader interested in a formalized concept of feature relevance,\nredundancy, and interaction is referred to [11] and [24].\nRelated work on interaction-aware \ufb01lters. To the best of our knowledge, existing interaction-aware\n\ufb01lters that utilize the pure SFS strategy with a MI-based scoring function are the following ones.\nRelaxMRMR [18] is a modi\ufb01cation of the mRMR method, whose scoring function in Eq. (3) was\nh,g\u2208Si\u22121,h(cid:54)=g I(f ; h | g). The method\nRCDFS [3] is a special case of Eq. (3), where \u03b2 = \u03b3 are equal to a transformation of the standard\ndeviation of the set {I(f ; h)}h\u2208Si\u22121. 
The approach IWFS [24] is based on the following idea: at\neach step i, for each unselected feature f \u2208 F \\ Si, one calculates the next step score Ji+1(f ) as\nthe current score Ji(f ) multiplied by a certain measure of interaction between this feature f and the\nfeature fi selected at the current step. Both RCDFS and IWFS can catch dependences between no\nmore than 2 features, while RelaxMRMR is able to identify an interaction of up to 3 features, but\nits score\u2019s computational complexity is O(i^2), which makes it unusable in real applications. None of these\nmethods can be straightforwardly extended to incorporate interactions of a higher order.\nIn our study, we propose a general methodology that \ufb01lls the gap between the ideal (\u201coracle\") but\ninfeasible CMI method, which takes all interactions into account, and the above-described methods\nthat account for up to 3 interacting features. Our method can be effectively used in practice, since its\nscore\u2019s computational complexity grows only linearly, O(i) (as in most state-of-the-art SFS-\ufb01lters).\n\n3 Proposed feature selection\n\nIn this section, we introduce a novel feature selection approach based on the SFS strategy whose\nscore is built by solving a novel optimization problem and comprises two novel techniques that\nmake the approach ef\ufb01cient and effective in practice.\n\n3.1 Score with t-way interacting complementary and opposing teams\nOur FS method has a parameter t \u2208 N that speci\ufb01es the desired number of features whose\nmutual interaction (referred to as a t-way feature interaction) should be taken into account by the\nscoring function Ji(\u00b7). We build the scoring function according to the following intuitions.\nFirst, the amount of relevant information carried by a t-way interaction of a candidate feature f has the\nform I(c; f, H) for some set of features H of size |H| \u2264 t \u2212 1. 
Second, we remove the redundant part\nof this information w.r.t. the already selected features Si\u22121 and obtain the non-redundant information\npart I(c; f, H | Si\u22121). Following the heuristic of the CMIM method, this could be approximated\nby use of a small subset G \u2286 Si\u22121, |G| \u2264 s \u2208 N, i.e., by the low-dimensional approximation\nmin{G\u2286Si\u22121,|G|\u2264s} I(c; f, H | G) (assuming s \u226a i). Third, since in the SFS strategy one has to\nselect only one feature at an iteration i, this approximated additional information of the candidate f\nwith H w.r.t. Si\u22121 will be gained with the feature f at this SFS iteration only if all complementary\nfeatures H have been already selected (i.e., H \u2286 Si\u22121). In this way, the score of the candidate f\nshould be equal to the maximal additional information estimated using the above reasoning, i.e., we\ncome to the score that is a solution of the following saddle point (max-min) optimization problem:\n\n\u25e6J^(t,s)_i(f ) := max_{H\u2286Si\u22121, |H|\u2264t\u22121} min_{G\u2286Si\u22121, |G|\u2264s} I(c; f, H | G).\n\n(4)\n\nWe refer to the set {f} \u222a H^o_f , where H^o_f is an optimal set H in Eq. (4), as an optimal complementary\nteam of the feature f \u2208 F \\ Si\u22121, while an optimal set G in Eq. (4) is referred to as an optimal\nopposing team to this feature f (and, thus, to its complementary team as well) and is denoted by G^o_f .\nThe described approach is inspired by methods of greedy learning of ensembles of decision trees [7],\nwhere an ensemble of trees is built by sequentially adding a decision tree that maximizes the gain in\nlearning quality. In this way, our complementary team corresponds to the features used in a candidate\n\n3\n\n\fdecision tree, while our opposing team corresponds to the features used to build previous trees in the\nensemble. 
Since they are already selected by SFS, they are expectedly stronger than f and we can\nassume that, at the early iterations, a greedy machine learning algorithm would more likely use these\nfeatures rather than the new feature f once we add it into the feature set. So, Eq. (4) tries to mimic\nthe maximal amount of information that feature f can provide additionally to the worst-case baseline\nbuilt on Si\u22121.\nStatement 1. For t, s + 1 \u2265 i, the score \u25e6J^(t,s)_i from Eq. (4) is equal to the score J^CMI_i from Eq. (1).\nThe proof\u2019s sketch is: (a) justify the identity \u25e6J^(t,s)_i(f ) = max_{H\u2286Si\u22121} min_{G\u2286Si\u22121\\H} I(c; f | H, G)\nfor t, s + 1 \u2265 i; (b) get a contradiction to the assumption that there are no optimal subsets H and G\nsuch that Si\u22121 = H \u222a G. A detailed proof of Statement 1 is given in Appendix A. Thus, we argue that\nthe score \u25e6J^(t,s)_i from Eq. (4) is a low-dimensional approximation of the CMI score J^CMI_i (see footnote 3).\nThe score from Eq. (4) is of a general nature and reasonable, but, to the best of our knowledge, was\nnever considered in existing studies. However, this score is not suitable for effective application,\nsince it suffers from two practical issues:\n(PI.a) computational complexity: ef\ufb01cient search of the optimal sets H^o_f and G^o_f in Eq. (4);\n(PI.b) sample complexity: accurate estimation of the MI over features with a large dimension of their\njoint distribution.\nWe address these research problems and propose the following solutions to them: in Sec. 3.2, the\nissue (PI.a) is overcome in a greedy fashion, while, in Sec. 3.3, the issue (PI.b) is mitigated by means\nof binary representatives.\n\n3.2 Greedy approximation of the score \u25e6J^(t,s)_i\n\nNote that an exhaustive search of a saddle point in Eq. 
(4) requires C(i\u22121, t\u22121) \u00b7 C(i\u22121, s) MI calculations (a product of two binomial coef\ufb01cients), which\ncan make calculation of the scoring function \u25e6J^(t,s)_i infeasible at a large iteration i even for low\nteam sizes t, s > 1. In order to overcome this issue, we propose the following greedy search for\nsub-optimal complementary and opposing teams.\nAt the \ufb01rst stage, we start from a greedy search of a sub-optimal set H, which cannot be done straight-\nforwardly, since Eq. (4) comprises both max and min operators. The latter one requires a search of\nan optimal G, which we want to do at the second stage (after H). Hence, the double optimization\nproblem needs to be replaced by a simpler one which does not utilize a search of G.\nProposition 1. (1) For any H \u2286 Si\u22121 such that |H| \u2264 s, the following holds:\n\nmin_{G\u2286Si\u22121,|G|\u2264s} I(c; f, H | G) \u2264 I(c; f | H).\n\n(5)\n\n(2) If s \u2265 t \u2212 1, then the score given by the following optimization problem\n\nmax_{H\u2286Si\u22121,|H|\u2264t\u22121} I(c; f | H)\n\n(6)\n\nis an upper bound for the score \u25e6J^(t,s)_i from Eq. (4).\n\nThe optimization problem Eq. (6) seems reasonable due to the following properties: (a) in fact, the\nsearch of H in Eq. (6) is maximization of the additional information carried by the candidate f\nw.r.t. no more than t \u2212 1 already selected features from Si\u22121; (b) if a candidate f is a combination of\nfeatures from H, then the right hand side in Eq. (5) is 0 and the inequality becomes an equality.\nSo, we greedily search the maximum in Eq. (6), obtaining the (greedy) complementary team {f} \u222a Hf ,\nwhere Hf := {h1, . . . , ht\u22121} is de\ufb01ned by4\n\nhj := argmax_{h\u2208Si\u22121} I(c; f | h1, . . . , hj\u22121, h),  j = 1, . . . , t \u2212 1.\n\n(7)\n\n3Moreover, the CMIM score from Eq. (2) is a special case of Eq. 
(4) with s = t = 1 and the restriction G \u2260 \u2205.\n4If several elements provide an optimum (the case of ties), then we randomly select one of them.\n\n4\n\n\fAt the second stage, given the complementary team {f} \u222a Hf , we greedily search the (greedy)\nopposing team Gf := {g1, . . . , gs} in the following way:\n\ngj := argmin_{g\u2208Si\u22121} I(c; f, h1, . . . , hmin{j,t}\u22121 | g1, . . . , gj\u22121, g),  j = 1, . . . , s.\n\n(8)\n\nFinally, given the teams {f} \u222a Hf and Gf , we get the following greedy approximation of \u25e6J^(t,s)_i(f ):\n\nJ^(t,s)_i(f ) := I(c; f, Hf | Gf ).\n\n(9)\n\nThis score requires (t + s \u2212 1)i MI calculations (see Eq. (7)\u2013(9)), which is a linear dependence\non the iteration i as in most state-of-the-art SFS-based \ufb01lters [2]. Thus, we built an ef\ufb01cient\napproximation of the score \u25e6J^(t,s)_i and resolved the issue (PI.a).\nNote that we have two options at the minimization stage: either to search among all members of the\nset Hf at each step (as in Eq. (A.7) in Appendix A.3), or (what we actually do in Eq. (8)) to use only\nthe \ufb01rst few members of Hf . The latter option demonstrates noticeably better MAUC performance and\nalso results in a 0 score for a feature that is a copy of an already selected one (Proposition 2), while the\nformer does not (Remark A.2 in Appendix A.3). That is why we chose this option.\nProposition 2. Let s \u2265 t and let a candidate feature f \u2208 F \\ Si\u22121 be such that its copy \u02dcf \u2261 f is\nalready selected, \u02dcf \u2208 Si\u22121; then, in the absence of ties in Eq. 
(8) for j \u2264 t, the score J^(t,s)_i(f ) = 0.\n\nProposition 2 shows that the FS approach based on the greedy score J^(t,s)_i(f ) remains conservative,\ni.e., a copy of an already selected feature will not be selected, despite the fact that it exploits sub-optimal\nteams in contrast to the FS approach based on the optimal score \u25e6J^(t,s)_i(f ).\n\n3.3 Binary representatives of features\n\nAs mentioned in Sec. 2, an FS method that is based on calculation of MI over more than three\nfeatures is usually not popular in practice, since a large number of features implies a large dimension\nof their joint distribution, which leads to a large number of instances required to accurately estimate the\nMI [2]. Both our optimal score \u25e6J^(t,s)_i and our greedy one J^(t,s)_i suffer from the same issue (PI.b) as\nwell, since they exploit high-dimensional MI in Eq. (4) and Eq. (7)\u2013(9). For instance, if we deal with\nbinary classi\ufb01cation and each feature in F has q unique values (e.g., continuous features are usually\npreprocessed into discrete variables with q \u2265 5 [18]), then the dimension of the joint distribution of\nfeatures in Eq. (9) is equal to 2 \u00b7 q^(t+s) (e.g., \u2248 4.9 \u00b7 10^8 for t = s = 6, q = 5). In our method, we\ncannot reduce the number of features used in MIs (since t-way interaction constitutes the key basis\nof our approach), but we can mitigate the effect of the sample complexity by the following novel\ntechnique, which we demonstrate on our greedy score J^(t,s)_i. Let F consist of discrete features5.\nDe\ufb01nition 1. For each discrete feature f \u2208 F , we denote by B[f ] the binary transformation of f,\ni.e., the set of binary variables (referred to as the binary representatives (BR) of f) that all together\nconstitute a vector containing the same information as f 6. For any subset F\u2032 \u2286 F , the set of binary\nrepresentatives of all features from F\u2032 is denoted by B[F\u2032] = \u222a_{f\u2208F\u2032} B[f ].\nThen, we replace all features by their binary representatives at each stage of our score calculation.\nNamely, in Eq. (7) and Eq. (8), (a) the searches are performed for each binary representative b \u2208 B[f ]\ninstead of f; (b) the set H^bin_b of the complementary team is found among B[Si\u22121] \u222a B[f ]; while\n(c) the opposing team G^bin_b is found among B[Si\u22121] (exact formulas can be found in Algorithm 1,\nlines 12 and 15). Finally, the score of a feature f in this FS approach based on binary representatives\nis de\ufb01ned as the best score among the binary representatives B[f ] of the candidate f:\n\nJ^(t,s),bin_i(f ) := max_{b\u2208B[f ]} I(c; b, H^bin_b | G^bin_b).\n\n(10)\n\n5If there is a non-discrete feature, then we apply a discretization (e.g., by equal-width, equal-frequency\nbinnings [5], MDL [22, 3], etc.), which is the state-of-the-art preprocessing of continuous features in \ufb01lters.\n6For instance, for f with values in {x_l}^q_{l=1}, one could take B[f ] = {I_{f=x_l}}^{q\u22121}_{l=1}, where I_X is X\u2019s indicator,\nor take the bits of a binary encoding of {x_l}^q_{l=1}, which is a smallest set (i.e., |B[f ]| = \u2308log2 q\u2309) among possible B[f ].\n\n5\n\n\fAlgorithm 1 Pseudo-code of the CMICOT feature selection method (an implementation of this\nalgorithm is available at https://github.com/yandex/CMICOT).\n1: Input: F \u2014 the set of all features; B[f ], f \u2208 F , \u2014 set of binary representatives built on f;\n2: c \u2014 the target variable; k \u2208 N \u2014 the number of features to be selected;\n3: t \u2208 N, s \u2208 Z+ \u2014 the team sizes (parameters of the algorithm);\n4: Output: S \u2014 the set of selected features;\n5: Initialize:\n6: fbest := argmax_{f\u2208F} max_{b\u2208B[f ]} I(c; b); // Select the \ufb01rst feature\n7: S := {fbest}; Sbin := B[fbest];\n8: while |S| < k and |F \\ S| > 0 do\n9: for f \u2208 F \\ S do\n10: for b \u2208 B[f ] do\n11: for j := 1 to t \u2212 1 do\n12: hj := argmax_{h\u2208Sbin\u222aB[f ]} I(c; b | h1, .., hj\u22121, h); // Search for complementary feat.\n13: end for\n14: for j := 1 to s do\n15: gj := argmin_{g\u2208Sbin} I(c; b, h1, .., hmin{j,t}\u22121 | g1, .., gj\u22121, g); // Search for opp. feat.\n16: end for\n17: Ji[b] := I(c; b, h1, .., ht\u22121 | g1, .., gs); // Calculate the score of the binary rep. b\n18: end for\n19: Ji[f ] := max_{b\u2208B[f ]} Ji[b]; // Calculate the score of the feature f\n20: end for\n21: fbest := argmax_{f\u2208F\\S} Ji[f ]; // Select the best candidate feature at the current step\n22: S := S \u222a {fbest}; Sbin := Sbin \u222a B[fbest];\n23: end while\n\nNote that, in the previous example with a binary target variable c and q-value features, the dimension\nof the joint distribution of binary representatives used to calculate the MI in J^(t,s),bin_i is equal to 2^(1+t+s),\nwhich is (q/2)^(t+s) times smaller (the dimension reduction rate) than for the MI in J^(t,s)_i. For\ninstance, for t = s = 6, q = 5, the MI from Eq. (10) deals with \u2248 8.2 \u00b7 10^3 dimensions, which is\n\u2248 6 \u00b7 10^4 times lower than the \u2248 4.9 \u00b7 10^8 ones for the MI from Eq. (9). The described technique has been\ninspired by the intuition that two binary representatives of two different features probably interact on\naverage better than two binary representatives of one feature (see App. A.5.1). 
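To make the two-stage greedy score of Eq. (7)–(9) and the BR encoding concrete, here is a compact sketch of ours (not the released implementation): a plug-in MI estimator, features as aligned Python lists, hypothetical names, and the greedy score shown without BRs for clarity, plus the dimension arithmetic quoted above:

```python
from collections import Counter
from math import ceil, log2

def H(xs):
    """Plug-in estimate of Shannon's entropy from samples."""
    n = len(xs)
    return -sum(k / n * log2(k / n) for k in Counter(xs).values())

def cmi(c, xcols, zcols):
    """I(c; X | Z) = H(c,Z) + H(X,Z) - H(c,X,Z) - H(Z); plain MI if Z is empty."""
    X = list(zip(*xcols))
    if not zcols:
        return H(c) + H(X) - H(list(zip(c, X)))
    Z = list(zip(*zcols))
    return H(list(zip(c, Z))) + H(list(zip(X, Z))) - H(list(zip(c, X, Z))) - H(Z)

def cmicot_score(c, f, selected, data, t, s):
    """Greedy J^(t,s)_i(f) of Eq. (9): stage 1 grows the complementary team
    H_f (Eq. 7), stage 2 the opposing team G_f (Eq. 8)."""
    cols = lambda names: [data[n] for n in names]
    if not selected:
        return cmi(c, [f], [])
    Hf, Gf = [], []
    for _ in range(t - 1):                     # Eq. (7): argmax_h I(c; f | H, h)
        Hf.append(max(selected, key=lambda h: cmi(c, [f], cols(Hf) + [data[h]])))
    for j in range(1, s + 1):                  # Eq. (8): argmin_g I(c; f, H_<j | G, g)
        team = [f] + cols(Hf[: min(j, t) - 1])
        Gf.append(min(selected, key=lambda g: cmi(c, team, cols(Gf) + [data[g]])))
    return cmi(c, [f] + cols(Hf), cols(Gf))    # Eq. (9)

def binary_representatives(values):
    """B[f]: the bits of an index encoding, |B[f]| = ceil(log2 q) (footnote 6)."""
    levels = {v: i for i, v in enumerate(sorted(set(values)))}
    width = max(1, ceil(log2(len(levels))))
    return [[(levels[v] >> b) & 1 for v in values] for b in range(width)]

# Dimension arithmetic from the text, for a binary target and q-valued features:
q, t, s = 5, 6, 6
dim_raw = 2 * q ** (t + s)      # Eq. (9): ~4.9e8 cells to estimate
dim_bin = 2 ** (1 + t + s)      # Eq. (10): 8192 cells with BRs
reduction = (q / 2) ** (t + s)  # ~6e4-fold reduction
```

On an XOR toy example (c = a XOR b, with a already selected), the candidate b has I(c; b) = 0 yet receives a full 1-bit score for t = s = 2, while an exact copy of a scores 0, matching the conservativeness of Proposition 2.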
Therefore, we believe\nthat the BR modi\ufb01cation retains the score\u2019s awareness to the most interactions between features.\nSurely, on the one hand, the BR technique can also be applied to any state-of-the-art SFS-\ufb01lter [2] or\nany existing interaction-aware one (RelaxMRMR [18], RCDSFS [3], and IWFS [24]), but the effect\non them will not be striking breakthrough, since these \ufb01lters exploit no more than 3 features in one\nMI, and the dimension reduction rate will thus be not more than (q/2)3 (e.g., \u2248 15.6 for q = 5). On\nthe other hand, this technique is of a general nature and represents a self-contained contribution to\nML community, since it may be applied with noticeable pro\ufb01t to SFS-based \ufb01lters with MIs of higher\norders (possibly not yet invented).\n\n3.4 CMICOT feature selection method\n\nWe summarize Sec. 3.1\u2013Sec. 3.3 in our novel feature selection method that is based on sequential\nforward selection strategy with the scoring function from Eq. (10). We refer to this FS method as\nCMICOT (Conditional Mutual Information with Complementary and Opposing Teams) and present\nits pseudo-code in Algorithm 1, which has a form of a SFS strategy with a speci\ufb01c algorithm to\ncalculate the score (lines 10\u201319). In order to bene\ufb01t from Prop. 1 and 2, one has to select s \u2265 t, and,\nfor simplicity, from here on in this paper we consider only equally limited teams, i.e., t = s.\nProposition 3. Let |B[f ]| \u2264 \u03bd, \u2200f \u2208 F , |F| \u2264 M, and entropies in MIs are calculated over\nN instances, then O(i\u03bd2t2N ) simple operations are needed to calculate the score J (t,t),bin\nand\nO(k2\u03bd2t2M N ) simple operations are needed to select top-k features by CMICOT from Alg. 1.\n\ni\n\nLet us remind how each of our techniques contributes to the presented above computational complexity\nof the score. First, the factor t2 is an expected payment for the ability to be aware of t-way interactions\n(Sec. 3.1). 
Second, the two-stage greedy technique from Sec. 3.2 makes the score\u2019s computational\ncomplexity linearly depend on the SFS iteration i. Third, utilization of the BR technique from Sec. 3.3,\non the one hand, seems to increase the computational complexity by the factor \u03bd^2, but, on the other\n\n6\n\n\fhand, we know that it drastically reduces the sample complexity (i.e., the number of instances required\nto accurately estimate the used MIs). For simplicity, let us assume that each feature has 2^\u03bd values and\nis transformed to \u03bd binary ones. If we do not use the BR technique, the complexity will be lower by\nthe factor \u03bd^2 for the same number of instances N, but estimation of the MIs will require (2^\u03bd/2)^(2t)\ntimes more instances to achieve the same level of accuracy as with the BRs. Hence, the BR technique\nactually reduces the computational complexity by the factor 2^(2t(\u03bd\u22121))/\u03bd^2. Note that the team size\nt can be used to trade off between the number of instances available in the sample dataset and the\nmaximal number of features whose joint interaction could be taken into account in an SFS manner.\nFinally, for a given dataset and a given team size t, the score\u2019s computational complexity linearly\ndepends on the i-th SFS iteration, on the one hand, as in most state-of-the-art SFS-\ufb01lters [2] like\nCMIM, MIFS, mRMR, JMI, etc. (see Eq. (2)\u2013(3)). On the other hand, scores of existing interaction-\naware ones have either the same (O(i) for RCDFS [3]), or higher (O(M \u2212 i) for IWFS [24] and\nO(i^2) for RelaxMRMR [18]) order of complexity w.r.t. i. 
Thus, we conclude that our FS method is\nnot inferior in ef\ufb01ciency to any baseline \ufb01lter, but is able to identify feature dependences of higher\norders than these baselines.\n\n4 Experimental evaluation\n\nWe compare our CMICOT approach with (a) all known interaction-aware SFS-based \ufb01lters (RelaxM-\nRMR [18], IWFS [24], and RCDFS [3]); (b) the state-of-the-art \ufb01lters [2] (MIFS, mRMR, CMIM,\nJMI, DISR, and FCBF (CBFS)); and (c) the idealistic but practically infeasible CMI method (see\nSec. 2 and [2]). In our experiments, we consider t = 1, . . . , 10 to validate that CMICOT is able to\ndetect interactions of a considerably higher order than its competitors.\nEvaluation on synthetic data. First, we study the ability to detect high-order feature dependencies\nusing synthetic datasets where relevant and interacting features are a priori known. A synthetic\ndataset has a feature set F , which contains a group of jointly interacting relevant features Fint, and its\ntarget c is a deterministic function of Fint for half of the examples (|F \\ Fint| = 15 and |Fint| = 2, . . . , 11\nin our experiments). The smaller k0 = min{k | Fint \u2286 Sk}, the more effective the considered FS\nmethod, since it builds a smaller set of features needed to construct the best possible classi\ufb01er.\nWe conduct an experiment where, \ufb01rst, we randomly sample 100 datasets from the prede\ufb01ned joint\ndistribution (more details in Appendix C). Second, we calculate k0 for each of the studied FS methods\non these datasets. Finally, we average k0 over the datasets and present the results in Figure 1 (a). We\nsee, \ufb01rst, that CMICOT with t \u2265 |Fint| signi\ufb01cantly outperforms all baselines, except the idealistic\nCMI method, whose results are similar to those of CMICOT. This is expected, since CMI is infeasible only for\nlarge k, and, in App. F.2, we show that CMICOT is the closest approximation of true CMI among\nall baselines. 
Second, the team size t clearly tracks the number of interacting features, which\nprovides experimental evidence for the ability of CMICOT to identify high-order feature interactions.\nEvaluation on benchmark real data. Following the state-of-the-art practice [6, 22, 2, 18, 24, 3],\nwe conduct an extensive empirical evaluation of the effectiveness of our CMICOT approach on\n10 large public datasets from the UCI ML Repo (including the NIPS\u20192003 FS competition datasets) and\none private dataset from one of the most popular search engines7. We employ three state-of-the-\nart classi\ufb01ers: Naive Bayes Classi\ufb01er (NBC), k-Nearest Neighbor (kNN), and AdaBoost [6] (see\nApp. B). Their performance on a set of features is measured by means of AUC [2] (MAUC [9])\nfor a binary (multi-class) target variable. First, we apply each of the FS methods to select the top-k\nfeatures Sk for each dataset and for k = 1, .., 50 [2, 24, 3]. Given k \u2208 {1, .., 50}, a dataset, and a\ncertain classi\ufb01er, we measure the performance of an FS method (1) in terms of the (M)AUC of the\nclassi\ufb01er built on the selected features Sk and (2) in terms of the rank of the FS method among\nthe other FS methods w.r.t. (M)AUC. The resulting (M)AUC and rank averaged over all datasets\nare shown in Fig. 1(b,c) for kNN and AdaBoost. From these \ufb01gures we see that our CMICOT method for\nt = 6 (see footnote 8) noticeably outperforms all baselines for the classi\ufb01cation models kNN and AdaBoost (see footnote 9)\nstarting from approximately k = 10. We attribute this threshold to the size of the teams in the CMICOT\n7The number of features, instances, and target classes varies from 85 to 5000, from 452 to 10^5, and from 2\nto 26, respectively. More characteristics of the datasets and their preprocessing can be found in Appendix D.\n8Our experimentation on CMICOT with different t = 1, . . . 
, 10 on our datasets showed that t = 5 and 6 are\n\nthe most reasonable in terms of classi\ufb01er performance (see Appendix E.1.1).\n\n9The results of CMICOT on NBC classi\ufb01er are similar to the ones of other baselines. This is expected\nsince NBC does not exploit high-order feature dependences, which is the key advantage of CMICOT. Note that\n\n7\n\n\fFigure 1: (a) Comparison of the performance of SFS-based \ufb01lters in terms of average k0 on synthetic\ndatasets. (b) Average values of (M)AUC for compared FS methods and (c) their ranks w.r.t. (M)AUC\nk = 1, .., 50 and for the kNN and AdaBoost classi\ufb01cation models over all datasets (see also App. C,E).\nmethod, which should select different teams more likely when |Si\u22121| > 2t (= 12 for t = 6). The\ncurves on Fig. 1 (b,c) are obtained over a test set, while a 10-fold cross-validation [2, 18] is also\napplied for several key points (e.g. k = 10, 20, 50) to estimate the signi\ufb01cance of differences in\nclassi\ufb01cation quality. The detailed results of this CV for k = 50 on representative datasets are given\nin Appendix E.2. A more comprehensive details on these and other experiments are in App. E and F.\nWe \ufb01nd that our approach either signi\ufb01cantly outperforms baselines (most one for kNN and AdaBoost),\nor have non-signi\ufb01cantly different difference with the other (most one for NBC). Note that the\ninteraction awareness of RelaxMRMR, RCDFS and IWFS is apparently not enough to outperform\nCMIM, our strongest competitor. In fact, there is no comparison of RelaxMRMR and IWFS with\nCMIM in [3, 24], while RCDFS is outperformed by CMIM on some datasets including the only one\nutilized in both [18] and our work. 
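The (M)AUC measure of Hand and Till [9] used for multi-class targets averages symmetrized one-vs-one AUCs over all class pairs; a minimal sketch, assuming `probs[:, a]` holds the classifier's score for class `classes[a]` (the function names are ours):

```python
import numpy as np

def auc(scores, labels, pos):
    """Mann-Whitney AUC of `scores` ranking label == pos above the rest,
    counting ties as 1/2."""
    p, n = scores[labels == pos], scores[labels != pos]
    wins = (p[:, None] > n[None, :]).sum()
    ties = (p[:, None] == n[None, :]).sum()
    return (wins + 0.5 * ties) / (len(p) * len(n))

def mauc(probs, labels):
    """Hand-and-Till MAUC: mean of A_hat(i, j) = (A(i|j) + A(j|i)) / 2
    over all unordered class pairs (i, j), each restricted to the
    samples of those two classes."""
    classes = np.unique(labels)
    vals = []
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            i, j = classes[a], classes[b]
            mask = (labels == i) | (labels == j)
            vals.append(0.5 * (auc(probs[mask, a], labels[mask], i)
                               + auc(probs[mask, b], labels[mask], j)))
    return float(np.mean(vals))
```

For a binary target this reduces to the ordinary AUC, consistent with the AUC/(M)AUC convention used in the evaluation.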
We also compare CMICOT with and without the BR technique: on the one hand, we observed that CMICOT without BRs loses in performance to the variant with BRs on the datasets with non-binary features, which emphasizes the importance of problem (PI.b); on the other hand, on the binary datasets (poker, ranking, and semeion; see App. E), where the two CMICOT variants coincide, the results establish the effectiveness of our approach independently of the BR technique.

5 Conclusions

We proposed a novel feature selection method CMICOT that is based on sequential forward selection and is able to identify high-order feature interactions. The technique, based on a two-stage greedy search and binary representatives of features, makes our approach effectively applicable to datasets of different sizes for a restricted team size t. We also empirically validated our approach for t up to 10 by means of 3 state-of-the-art classification models (NBC, kNN, and AdaBoost) on 10 publicly available benchmark datasets and compared it with the known interaction-aware SFS-based filters (RelaxMRMR, IWFS, and RCDFS) and several state-of-the-art ones (CMIM, JMI, CBFS, and others). We conclude that our FS algorithm, unlike all competitor methods, is capable of detecting interactions between up to t features. The overall performance of our algorithm is the best among the state-of-the-art competitors.

Acknowledgments

We are grateful to Mikhail Parakhin for important remarks which resulted in significant improvement of the paper presentation.

RelaxMRMR also showed its poorest performance on NBC in [18], while the evaluations of IWFS and RCDFS in [24, 3] did not consider NBC at all.

References

[1] R. Battiti. Using mutual information for selecting features in supervised neural net learning. Neural Networks, IEEE Transactions on, 5(4):537–550, 1994.

[2] G. Brown, A. Pocock, M.-J. Zhao, and M. Luján.
Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. JMLR, 13(1):27–66, 2012.

[3] Z. Chen, C. Wu, Y. Zhang, Z. Huang, B. Ran, M. Zhong, and N. Lyu. Feature selection with redundancy-complementariness dispersion. arXiv preprint arXiv:1502.00231, 2015.

[4] T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012.

[5] J. Dougherty, R. Kohavi, M. Sahami, et al. Supervised and unsupervised discretization of continuous features. In ICML, volume 12, pages 194–202, 1995.

[6] F. Fleuret. Fast binary feature selection with conditional mutual information. JMLR, 5:1531–1555, 2004.

[7] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 2001.

[8] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. JMLR, 3:1157–1182, 2003.

[9] D. J. Hand and R. J. Till. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 2001.

[10] M. Hutter. Distribution of mutual information. NIPS, 1:399–406, 2002.

[11] A. Jakulin and I. Bratko. Analyzing attribute dependencies. Springer, 2003.

[12] D. D. Lewis. Feature selection and feature extraction for text categorization. In Proceedings of the Workshop on Speech and Natural Language, pages 212–217. ACL, 1992.

[13] J. Liu, C. Zhang, C. A. McCarty, P. L. Peissig, E. S. Burnside, and D. Page. High-dimensional structured feature screening using binary Markov random fields. In AISTATS, pages 712–721, 2012.

[14] P. E. Meyer, C. Schretter, and G. Bontempi. Information-theoretic feature selection in microarray data using variable complementarity. IEEE Journal of STSP, 2(3):261–274, 2008.

[15] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.

[16] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. PAMI, 27(8):1226–1238, 2005.

[17] J. R. Vergara and P. A. Estévez. A review of feature selection methods based on mutual information. Neural Computing and Applications, 24(1):175–186, 2014.

[18] N. X. Vinh, S. Zhou, J. Chan, and J. Bailey. Can high-order dependencies improve mutual information based feature selection? Pattern Recognition, 2015.

[19] G. Wang and F. H. Lochovsky. Feature selection with conditional mutual information maximin in text categorization. In ACM CIKM, pages 342–349. ACM, 2004.

[20] A. W. Whitney. A direct method of nonparametric measurement selection. Computers, IEEE Transactions on, 100(9):1100–1103, 1971.

[21] H. Yang and J. Moody. Feature selection based on joint mutual information. In Proceedings of International ICSC Symposium on Advances in Intelligent Data Analysis, pages 22–25. Citeseer, 1999.

[22] L. Yu and H. Liu. Efficient feature selection via analysis of relevance and redundancy. JMLR, 5:1205–1224, 2004.

[23] M. Zaffalon and M. Hutter. Robust feature selection by mutual information distributions. In UAI, pages 577–584. Morgan Kaufmann Publishers Inc., 2002.

[24] Z. Zeng, H. Zhang, R. Zhang, and C. Yin. A novel feature selection method considering feature interaction. Pattern Recognition, 48(8):2656–2666, 2015.

[25] Z. Zhao and H. Liu. Searching for interacting features in subset selection. Intelligent Data Analysis, 13(2):207–228, 2009.