{"title": "Causal meets Submodular: Subset Selection with Directed Information", "book": "Advances in Neural Information Processing Systems", "page_first": 2649, "page_last": 2657, "abstract": "We study causal subset selection with Directed Information as the measure of prediction causality. Two typical tasks, causal sensor placement and covariate selection, are correspondingly formulated into cardinality constrained directed information maximizations. To attack the NP-hard problems, we show that the first problem is submodular while not necessarily monotonic. And the second one is ``nearly'' submodular. To substantiate the idea of approximate submodularity, we introduce a novel quantity, namely submodularity index (SmI), for general set functions. Moreover, we show that based on SmI, greedy algorithm has performance guarantee for the maximization of possibly non-monotonic and non-submodular functions, justifying its usage for a much broader class of problems. We evaluate the theoretical results with several case studies, and also illustrate the application of the subset selection to causal structure learning.", "full_text": "Causal meets Submodular: Subset Selection with\n\nDirected Information\n\nYuxun Zhou\n\nDepartment of EECS\n\nUC Berkeley\n\nyxzhou@berkeley.edu\n\nCostas J. Spanos\n\nDepartment of EECS\n\nUC Berkeley\n\nspanos@berkeley.edu\n\nAbstract\n\nWe study causal subset selection with Directed Information as the measure of\nprediction causality. Two typical tasks, causal sensor placement and covariate\nselection, are correspondingly formulated into cardinality constrained directed\ninformation maximizations. To attack the NP-hard problems, we show that the first\nproblem is submodular while not necessarily monotonic. And the second one is\n\u201cnearly\u201d submodular. To substantiate the idea of approximate submodularity, we\nintroduce a novel quantity, namely submodularity index (SmI), for general set\nfunctions. 
Moreover, we show that based on SmI, the greedy algorithm has a performance\nguarantee for the maximization of possibly non-monotonic and non-submodular\nfunctions, justifying its usage for a much broader class of problems. We evaluate\nthe theoretical results with several case studies, and also illustrate the application\nof the subset selection to causal structure learning.\n\n1 Introduction\n\nA wide variety of research disciplines, including computer science, economics, biology and social\nscience, involve causality analysis of a network of interacting random processes. In particular, many\nof those tasks are closely related to subset selection. For example, in sensor network applications,\nwith a limited budget it is necessary to place sensors at information \u201csources\u201d that provide the best\nobservability of the system. To better predict a stock under consideration, investors need to select\ncausal covariates from a pool of candidate information streams. We refer to the first type of problem\nas \u201ccausal sensor placement\u201d, and the second one as \u201ccausal covariate selection\u201d.\nTo solve the aforementioned problems we first need a causality measure for multiple random\nprocesses. In the literature, there exist two types of causality definitions: one is related to time\nseries prediction (called Granger-type causality) and the other to counterfactuals [18]. We focus\non Granger-type prediction causality substantiated with Directed Information (DI), a tool from\ninformation theory. Recently, a large body of work has successfully employed DI in many research\nfields, including influence mining in gene networks [14], causal relationship inference in neural spike\ntrain recordings [19], and message transmission analysis in social media [23]. 
Compared to model-based\nor testing-based methods such as [2][21], DI is not limited by model assumptions and can\nnaturally capture non-linear and non-stationary dependence among random processes. In addition, it\nhas a clear information-theoretic interpretation and admits well-established estimation techniques. In\nthis regard, we formulate causal sensor placement and covariate selection as cardinality constrained\ndirected information maximization problems.\nWe then need an efficient algorithm that makes optimal subset selection. Although subset selection, in\ngeneral, is not tractable due to its combinatorial nature, the study of greedy heuristics for submodular\nobjectives has shown promising results in both theory and practice. To list a few, following the\npioneering work [8] that proves the near optimal 1 \u2212 1/e guarantee, [12] [1] investigate the\nsubmodularity of mutual information under a Gaussian assumption, and then use a greedy algorithm for sensor\nplacement. In the context of speech and Natural Language Processing (NLP), the authors of [13]\nadopt submodular objectives that encourage a small vocabulary subset and broad coverage, and then\nproceed to maximization with a modified greedy method. In [3], the authors combine insights from\nspectral theory and submodularity analysis of the R^2 score, and their result remarkably explains the near\noptimal performance of forward regression and orthogonal matching pursuit.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn this work, we also attack the causal subset selection problem via submodularity analysis. We show\nthat the objective function of causal sensor placement, i.e., DI from the selected set to its complement,\nis submodular, although not monotonic. The objective of causal covariate selection, i.e., DI\nfrom the selected set to some target process, is not submodular in general but is \u201cnearly\u201d submodular in\nparticular cases. 
Since classic results require strict submodularity and monotonicity, which cannot\nbe established for our purpose, we propose a novel measure of the degree of submodularity and show\nthat the performance guarantee of greedy algorithms can be obtained for possibly non-monotonic\nand non-submodular functions. Our contributions are: (1) Two important causal subset selection\nobjectives are formulated with directed information and the corresponding submodularity analyses\nare conducted. (2) The SmI dependent performance bound implies that submodularity is better\ncharacterized by a continuous indicator than being used as a \u201cyes or no\u201d property, which extends the\napplication of greedy algorithms to a much broader class of problems.\nThe rest of the paper is organized as follows. In the next section, we briefly review the notions of directed\ninformation and submodular functions. Section 3 is devoted to problem formulation and submodularity\nanalysis. In Section 4, we introduce SmI and provide theoretical results on the performance guarantee of\nrandom and deterministic greedy algorithms. Finally in Section 5, we conduct experiments to justify\nour theoretical findings and illustrate a causal structure learning application.\n\n2 Preliminary\n\nDirected Information\nConsider two random processes X^n and Y^n. We use the convention that\nX^i = {X_0, X_1, ..., X_i}, with t = 0, 1, ..., n as the time index. Directed Information from X^n to Y^n\nis defined in terms of mutual information:\n\nI(X^n \u2192 Y^n) = \u2211_{t=1}^{n} I(X^t; Y_t | Y^{t\u22121})    (1)\n\nwhich can be viewed as the aggregated dependence between the history of process X and the current\nvalue of process Y , given past observations of Y . The above definition captures a natural intuition\nabout causal relationship, i.e., the unique information X^t has on Y_t, when the past Y^{t\u22121} is known.\nWith causally conditioned entropy defined by H(Y^n||X^n) \u225c \u2211_{t=1}^{n} H(Y_t | Y^{t\u22121}, X^t), the directed\ninformation from X^n to Y^n when causally conditioned on the series Z^n can be written as\n\nI(X^n \u2192 Y^n||Z^n) = H(Y^n||Z^n) \u2212 H(Y^n||X^n, Z^n) = \u2211_{t=1}^{n} I(X^t; Y_t | Y^{t\u22121}, Z^t)    (2)\n\nObserve that causally conditioned directed information is expressed as the difference between two\ncausally conditioned entropies, which can be considered as \u201ccausal uncertainty reduction\u201d. With\nthis interpretation one is able to relate directed information to Granger Causality. Denote \u00afX as the\ncomplement of X in a universal set V , then,\nTheorem 1 [20] With log loss, I(X^n \u2192 Y^n || \u00afX^t) is precisely the value of the side information\n(expected cumulative reduction in loss) that X has, when sequentially predicting Y with the knowledge\nof \u00afX. 
The predictors are distributions with minimal expected loss.\n\nIn particular, with linear models directed information is equivalent to Granger causality for jointly\nGaussian processes.\nSubmodular Function\nThere are three equivalent definitions of submodular functions, and each\nof them reveals a distinct character of submodularity, a diminishing return property that universally\nexists in economics, game theory and network systems.\n\nDefinition 1 A submodular function is a set function f : 2^\u2126 \u2192 R, which satisfies one of the three\nequivalent definitions:\n(1) For every S, T \u2286 \u2126 with S \u2286 T , and every x \u2208 \u2126 \\ T , we have that\n\nf(S \u222a {x}) \u2212 f(S) \u2265 f(T \u222a {x}) \u2212 f(T)    (3)\n\n(2) For every S, T \u2286 \u2126, we have that\n\nf(S) + f(T) \u2265 f(S \u222a T) + f(S \u2229 T)    (4)\n\n(3) For every S \u2286 \u2126, and x_1, x_2 \u2208 \u2126 \\ S, we have that\n\nf(S \u222a {x_1}) + f(S \u222a {x_2}) \u2265 f(S \u222a {x_1, x_2}) + f(S)    (5)\n\nA set function f is called supermodular if \u2212f is submodular. The first definition is directly related\nto the diminishing return property. The second definition is better understood with the classic\nmax k-cover problem [4]. The third definition indicates that the contribution of two elements is\nmaximized when they are added individually to the base set. Throughout this paper, we will denote\nf_x(S) \u225c f(S \u222a x) \u2212 f(S) as the \u201cfirst order derivative\u201d of f at base set S for further analysis.\n\n3 Problem Formulation and Submodularity Analysis\n\nIn this section, we first formulate two typical subset selection problems into cardinality constrained\ndirected information maximization. Then we address the issues of submodularity and monotonicity\nin detail. 
All proofs involved in this and the other sections are given in the supplementary material.\nCausal Sensor Placement and Covariates Selection by Maximizing DI\nTo motivate the first\nformulation, imagine we are interested in placing sensors to monitor pollution particles in a vast\nregion. Ideally, we would like to put k sensors, which is a given budget, at pollution sources to better\npredict the particle dynamics for other areas of interest. As such, the placement locations can be\nobtained by maximizing the directed information from the selected location set S to its complement \u00afS\n(in the universal set V that contains all candidate sites). Then this type of \u201ccausal sensor placement\u201d\nproblem can be written as\n\nargmax_{S \u2286 V, |S| \u2264 k} I(S^n \u2192 \u00afS^n)    (OPT1)\n\nRegarding the causal covariates selection problem, the goal is to choose a subset S from a universal\nset V , such that S has maximal prediction causality to a (or several) target process Y . To leverage\nsparsity, the cardinality constraint |S| \u2264 k is also imposed on the number of selected covariates.\nAgain with directed information, this type of subset selection problem reads\n\nargmax_{S \u2286 V, |S| \u2264 k} I(S^n \u2192 Y^n)    (OPT2)\n\nThe above two optimizations are hard even in the most reduced cases: Consider a collection of\ncausally independent Gaussian processes, then the above problems are equivalent to the D-optimal\ndesign problem, which has been shown to be NP-hard [11]. Unless P = NP, no polynomial-time\nalgorithm solves the maximization exactly, so resorting to tractable approximations is necessary.\nSubmodularity Analysis of the Two Objectives\nFortunately, we can show that the objective\nfunction of OPT1, the directed information from selected processes to unselected ones, is submodular.\n\nTheorem 2 The objective I(S^n \u2192 \u00afS^n) as a function of S \u2286 V is submodular.\nThe problem is that OPT1 is not monotonic for all S, which can be seen since both I(\u2205 \u2192 V ) and\nI(V \u2192 \u2205) are 0 by definition. On the other hand, the deterministic greedy algorithm has guaranteed\nperformance only when the objective function is monotonic up to 2k elements. In the literature, several\nworks have addressed the issue of maximizing non-monotonic submodular functions [6][7][17].\nIn this work we mainly analyze the random greedy technique proposed in [17], which is simpler\ncompared to other alternatives and achieves best-known guarantees.\nConcerning the second objective OPT2, we make a slight detour and take a look at the property of its\n\u201cfirst derivative\u201d.\n\nProposition 1 f_x(S) = I(S^n \u222a x^n \u2192 Y^n) \u2212 I(S^n \u2192 Y^n) = I(x^n \u2192 Y^n || S^n)\nThus, the derivative is the directed information from processes x to Y causally conditioned on S.\nBy the first definition of submodularity, if the derivative is decreasing in S, i.e. if f_x(S) \u2265 f_x(T)\nfor any S \u2286 T \u2286 V and x \u2208 V \\ T , then the objective I(S^n \u2192 Y^n) is a submodular function.\nIntuition may suggest this is true since knowing more (conditioning on a larger set) seems to reduce\nthe dependence (and also the causality) of two phenomena under consideration. However, in general,\nthis conjecture is not correct, and a counterexample could be constructed by having \u201cexplaining away\u201d\nprocesses. Hence the difficulty encountered in solving OPT2 is that the objective is not submodular.\nNote that with some extra conditional independence assumptions we can justify its submodularity,\nProposition 2 If for any two processes s_1, s_2 \u2208 S, we have the conditional independence that\n(s_{1t} \u22a5\u22a5 s_{2t} | Y_t), then I(S^n \u2192 Y^n) is a monotonic submodular function of set S.\nIn practice, the assumption made in the above proposition is hard to check. 
Yet one may wonder whether,\nif the conditional dependence is weak or sparse, a greedy selection still works to some extent\nbecause the submodularity is not severely deteriorated. Extending this idea we propose the Submodularity\nIndex (SmI), a novel measure of the degree of submodularity for general set functions, and we will\nprovide the performance guarantee of greedy algorithms as a function of SmI.\n\n4 Submodularity Index and Performance Guarantee\n\nFor ease of notation, we use f to denote a general set function and treat directed information\nobjectives as special realizations. It\u2019s worth mentioning that in the literature, several efforts have already\nbeen made to characterize approximate submodularity, such as the \u03b5 relaxation of definition (3)\nproposed in [5] for a dictionary selection objective, and the submodularity ratio proposed in [3].\nCompared to existing works, the SmI suggested in this work (1) is more generally defined for all set\nfunctions, (2) does not presume monotonicity, and (3) is more suitable for tasks involving information,\ninfluence, and coverage metrics in terms of computational convenience.\nSmI Definition and its Properties We start by defining the local submodular index for a function\nf at location A for a candidate set S\n\n\u03d5_f(S, A) \u225c \u2211_{x \u2208 S} f_x(A) \u2212 f_S(A)    (6)\n\nwhich can be considered as an extension of the third definition (5) of submodularity. In essence,\nit captures the difference between the sum of individual effects and the aggregated effect on the first\nderivative of the function. 
Moreover, it has the following property:\n\nProposition 3 For a given submodular function f, the local submodular index \u03d5_f(S, A) is a\nsupermodular function of S.\n\nNow we define SmI by minimizing over set variables:\nDefinition 2 For a set function f : 2^V \u2192 R the submodularity index (SmI) for a location set L and\na cardinality k, denoted by \u03bb_f(L, k), is defined as\n\n\u03bb_f(L, k) \u225c min_{A \u2286 L, S \u2229 A = \u2205, |S| \u2264 k} \u03d5_f(S, A)    (7)\n\nThus, SmI is the smallest possible value of the local submodular index subject to |S| \u2264 k. Note\nthat we implicitly assume |S| \u2265 2 in the above definition, as in the cases where |S| \u2208 {0, 1}, SmI\nreduces to 0. Besides, the definition of submodularity can be alternatively posed with SmI,\nLemma 1 A set function f is submodular if and only if \u03bb_f(L, k) \u2265 0, \u2200 L \u2286 V and k.\nFor functions that are already submodular, SmI measures how strong the submodularity is. We call a\nfunction super-submodular if its SmI is strictly larger than zero. On the other hand, for functions that\nare not submodular, SmI provides an indicator of how close the function is to submodular. We call a\nfunction quasi-submodular if it has a negative but close to zero SmI.\n\nDirect computation of SmI by solving (7) is hard. For the purpose of obtaining a performance guarantee,\nhowever, a lower bound of SmI is sufficient and is much easier to compute. Consider the objective of\n(OPT1), which is already a submodular function. By using Proposition 3, we conclude that its local\nsubmodular index is a supermodular function for a fixed location set. Hence computing (7) becomes\na cardinality constrained supermodular minimization problem for each location set. 
Besides, the\nfollowing decomposition is useful to avoid extra work on directed information estimation:\nProposition 4 The local submodular index of the function I({\u2022}^n \u2192 {V \\ \u2022}^n) can be decomposed\nas\n\n\u03d5_{I({\u2022}^n \u2192 {V \\ \u2022}^n)}(S^n, A^n) = \u03d5_{H({V \\ \u2022}^n)}(S^n, A^n) + \u2211_{t=1}^{n} \u03d5_{H({\u2022} | V^{t\u22121})}(S_t, A_t),\n\nwhere H(\u2022) is the entropy function.\n\nThe lower bound of SmI for the objective of OPT2 is more involved. With some work on an alternative\nrepresentation of causally conditioned directed information, we obtain that\nLemma 2 For any location set L \u2286 V , cardinality k, and target process set Y , we have\n\n\u03bb_{I({\u2022}^n \u2192 Y^n)}(L, k) \u2265 min_{W \u2286 V, |W| \u2264 |L|+k} \u2211_{t=1}^{n} { G_{|L|+k}(W^t, Y^{t\u22121}) \u2212 G_{|L|+k}(W^t, Y^t) }    (8)\n\n\u2265 \u2212 max_{W \u2286 V, |W| \u2264 |L|+k} I(W^n \u2192 Y^n) \u2265 \u2212I(V^n \u2192 Y^n)    (9)\n\nwhere the function G_k(W, Z) \u225c \u2211_{w \u2208 W} H(w|Z) \u2212 kH(W|Z) is supermodular in W .\n\nSince (8) is in fact minimizing (maximizing) the difference of two supermodular (submodular)\nfunctions, one can use existing approximate or exact algorithms [10] [16] to compute the lower bound.\n(9) is often a weak lower bound, although it is much easier to compute.\nRandom Greedy Algorithm and Performance Bound with SmI With the introduction of SmI,\nin this subsection, we analyze the performance of the random greedy algorithm for maximizing\nnon-monotonic, quasi- or super-submodular functions in a unified framework. 
The results broaden the\ntheoretical guarantee for a much richer class of functions.\n\nThe randomized greedy algorithm was recently proposed in [17] [22] for maximizing cardinality\nconstrained non-monotonic submodular functions. Also in [17], a 1/e expected performance\nbound was provided. The overall procedure is summarized in Algorithm 1 for reference. Note\nthat the random greedy algorithm only requires O(k|V |) calls of the function evaluation, making\nit suitable for large-scale problems.\n\nAlgorithm 1 Random Greedy for Subset Selection\n\n  S_0 \u2190 \u2205\n  for i = 1, ..., k do\n    M_i = argmax_{M_i \u2286 V \\ S_{i\u22121}, |M_i| = k} \u2211_{u \u2208 M_i} f_u(S_{i\u22121})\n    Draw u_i uniformly from M_i\n    S_i \u2190 S_{i\u22121} \u222a {u_i}\n  end for\n\nIn order to analyze the performance of the algorithm, we start with two lemmas that reveal more\nproperties of SmI. The first lemma shows that the monotonicity of the first derivative of a general set\nfunction f could be controlled by its SmI.\nLemma 3 Given a set function f : 2^V \u2192 R and the corresponding SmI \u03bb_f(L, k) defined in\n(7), let B = A \u222a {y_1, ..., y_M} and x \u2208 \u00afB. For an ordering {j_1, ..., j_M}, define\nB_m = A \u222a {y_{j_1}, ..., y_{j_m}}, B_0 = A, B_M = B. Then we have\n\nf_x(A) \u2212 f_x(B) \u2265 max_{{j_1,...,j_M}} \u2211_{m=0}^{M\u22121} \u03bb_f(B_m, 2) \u2265 M \u03bb_f(B, 2)    (10)\n\nEssentially, the above result implies that as long as SmI can be lower bounded by some small negative\nnumber, the submodularity (the decreasing derivative property (3) in Definition 1) is not severely\ndegraded. The second lemma provides an SmI dependent bound on the expected value of a function\nwith random arguments.\nLemma 4 Let the set function f : 2^V \u2192 R be quasi-submodular with \u03bb_f(L, k) \u2264 0. 
Also let S(p) be\na random subset of S, with each element appearing in S(p) with probability at most p. Then we have\n\nE[f(S(p))] \u2265 (1 \u2212 p1) f(\u2205) + \u03b3_{S,p}, with \u03b3_{S,p} \u225c \u2211_{i=1}^{|S|} (i \u2212 1) p \u03bb_f(S_i, 2).\n\nNow we present the main theorem and provide refined bounds for two different cases when the function\nis monotonic (but not necessarily submodular) or submodular (but not necessarily monotonic).\n\nTheorem 3 For a general (possibly non-monotonic, non-submodular) function f, let the optimal\nsolution of the cardinality constrained maximization be denoted as S\u2217, and the solution of the random\ngreedy algorithm be Sg. Then\n\nE[f(Sg)] \u2265 ( 1/e + \u03be^f_{Sg,k} / E[f(Sg)] ) f(S\u2217)\n\nwhere \u03be^f_{Sg,k} = \u03bb_f(Sg, k) + (k(k\u22121)/2) min{\u03bb_f(Sg, 2), 0}.\n\nThe role of SmI in determining the performance of the random greedy algorithm is revealed: the\nbound consists of 1/e \u2248 0.3679 plus a term that is a function of SmI. If SmI = 0, the 1/e bound in\nprevious literature is recovered. For super-submodular functions, as SmI is strictly larger than zero, the\ntheorem provides a stronger guarantee by incorporating SmI. For quasi-submodular functions having\nnegative SmI, although a degraded guarantee is produced, the bound is only slightly deteriorated\nwhen SmI is close to zero. In short, the above theorem not only encompasses existing results as\nspecial cases, but also suggests that we should view submodularity and monotonicity as \u201ccontinuous\u201d\nproperties of set functions. Besides, greedy heuristics should not be restricted to the maximization\nof submodular functions, but can also be applied to \u201cquasi-submodular\u201d functions because a near\noptimal solution is still achievable theoretically. 
As such, we can formally define quasi-submodular\nfunctions as those having an SmI such that \u03be^f_{S,k} / E[f(S)] > \u22121/e.\n\nCorollary 1 For monotonic functions in general, the random greedy algorithm achieves\n\nE[f(Sg)] \u2265 ( 1 \u2212 1/e + \u03bb\u2032_f(Sg, k) / E[f(Sg)] ) f(S\u2217)\n\nand the deterministic greedy algorithm also achieves\n\nf(Sg) \u2265 ( 1 \u2212 1/e + \u03bb\u2032_f(Sg, k) / f(Sg) ) f(S\u2217),\n\nwhere\n\n\u03bb\u2032_f(Sg, k) = \u03bb_f(Sg, k)                  if \u03bb_f(Sg, k) < 0\n\u03bb\u2032_f(Sg, k) = (1 \u2212 1/e)^2 \u03bb_f(Sg, k)    if \u03bb_f(Sg, k) \u2265 0.\n\nWe see that in the monotonic case, we get a stronger bound for submodular functions compared to the\n1 \u2212 1/e \u2248 0.6321 guarantee. Similarly, for quasi-submodular functions, the guarantee is degraded\nbut not too much if SmI is close to 0. Note that the objective function of OPT2 fits into this category.\nFor submodular but non-monotonic functions, e.g., the objective function of OPT1, we have\n\nCorollary 2 For submodular functions that are not necessarily monotonic, the random greedy algorithm\nhas performance\n\nE[f(Sg)] \u2265 ( 1/e + \u03bb_f(Sg, k) / E[f(Sg)] ) f(S\u2217)\n\n5 Experiment and Applications\n\nIn this section, we conduct experiments to verify the theoretical results, and provide an example that\nuses subset selection for causal structure learning.\nData and Setup\nThe synthetic data is generated with the Bayes Net Toolbox (BNT) [15]\nusing dynamic Bayesian network models. Two sets of data, denoted by D1 and D2, are simulated,\ncontaining 15 and 35 processes, respectively. For simplicity, all processes are {0, 1} valued.\nThe processes are created with both simultaneous and historical dependence on each other. The order\n(memory length) of the historical dependence is set to 3. 
The MCMC sampling engine is used to\ndraw n = 10^4 points for both D1 and D2. The stock market dataset, denoted by ST, contains hourly\nvalues of 41 stocks and indexes for the years 2014-2015. Note that data imputation is performed to\namend a few missing values, and all processes are aligned in time. Moreover, we detrend each time\nseries with a recursive HP-filter [24] to remove long-term daily or monthly seasonalities that are not\nrelevant for hourly analysis.\nDirected information is estimated with the procedure proposed in [9], which adopts the context tree\nweighting algorithm as an intermediate step to learn universal probability assignment. Interested\nreaders are referred to [19][20] for other possible estimators. The maximal context tree depth is set to\n5, which is sufficient for both the synthetic datasets and the real-world ST dataset.\nCausal Subset Selection Results\n\nFigure 1: Solution and Bounds for OPT1 on D1\n\nFigure 2: Solution and Bounds for OPT2 on ST\n\nFirstly, the causal sensor placement problem, OPT1, is solved on data set D1 with the random greedy\nalgorithm. Figure 1 shows the optimal solution by exhaustive search (red-star), random greedy\nsolution (blue-circle), the 1/e reference bound (cyan-triangle), and the bound with SmI\n(magenta-diamond), each for cardinality constraints imposed from k = 2 to k = 8. It is seen that the random\ngreedy solution is close to the true optimum. In terms of computational time, the greedy method\nfinishes in less than five minutes, while the exhaustive search takes about 10 hours on this small-scale\nproblem (|V | = 15). Comparing the two bounds in Figure 1, we see that the theoretical guarantee is\ngreatly improved, and a much tighter bound is produced with SmI. The corresponding normalized\nSmI values, defined by SmI / f(Lg), are shown in the first row of Table 1. 
As a consequence of those strictly\npositive SmI values and Corollary 2, the guarantees are made greater than 1/e. This observation\njustifies that the bounds with SmI are better indicators of the performance of the greedy heuristic.\n\nTable 1: Normalized submodularity index (NSmI) for OPT1 on D1 and OPT2 on ST at locations of\ngreedy selections. Cardinality is imposed from k = 2 to k = 8.\n\nk                          2       3       4       5       6       7       8\nnormalized SmI for OPT1    0.382   0.284   0.175   0.082   0.141   0.078   0.074\nnormalized SmI for OPT2   -0.305   0.071  -0.068  -0.029   0.030   0.058   0.092\n\nSecondly, the causal covariates selection problem, OPT2, is solved on the ST dataset with the stock XOM\nused as the target process Y . The results of random greedy, exhaustive search, and performance\nbound (Corollary 1) are shown in Figure 2, and normalized SmIs are listed in the second row of\nTable 1. Note that the 1 \u2212 1/e reference line (green-triangle) in the figure is only for comparison\npurposes and is NOT an established bound. We observe that although the objective is not submodular,\nthe random greedy algorithm is still near optimal. As we compare the reference line and the bound\ncalculated with SmI (magenta-diamond), we see that the performance guarantee can be either larger\nor smaller than 1 \u2212 1/e, depending on the sign of SmI. By definition, SmI measures the submodularity\nof a function at a location set. Hence, the SmI computed at each greedy selection captures the \u201clocal\u201d\nsubmodularity of the function. The central insight gained from this experiment is that a function\nlacking general submodularity, such as the objective function of OPT2, can be quasi-submodular\n(SmI \u2264 0, SmI \u2248 0) or super-submodular (SmI > 0) at different locations. 
Accordingly the\nperformance guarantee can be either larger or smaller than 1 \u2212 1/e, depending on the values of SmI.\nApplication: Causal Structure Learning\nThe greedy method for subset selection can be used\nin many situations. Here we briefly illustrate the structure learning application based on covariates\nselection. As is detailed in the supplementary material and [20], one can show that the causal structure\nlearning problem can be reduced to solving argmax_{S \u2286 V, |S| \u2264 k} I(S^n \u2192 X_i^n) for each node i \u2208 V ,\nassuming the maximal in-degree is bounded by k for all nodes. Since the above problem is exactly the\ncovariate selection considered in this work, we can reconstruct the causal structure for a network of\nrandom processes by simply using the greedy heuristic for each node.\nFigure 3 and Figure 4 illustrate the structure learning results on D1 and D2, respectively. In both\nfigures, the left subfigure is the ground truth structure, i.e., the dynamic Bayesian networks that are\nused in the data generation. Note that each node in the figure represents a random process, and an\nedge from node i to j indicates a causal (including both simultaneous and historical) influence. The\nsubfigure on the right shows the reconstructed causal graph. Comparing the two subfigures in Figure 3,\nwe observe that the simple structure learning method performs almost flawlessly. In fact, only the\nedge 6 \u2192 4 is missed. On the larger case D2 with 35 processes, the method still works\nrelatively well, correctly reconstructing 82.69% of the causal relations. 
Given that only the maximal in\ndegree for all nodes is assumed a priori, these results not only justify the greedy approximation for\nthe subset selection, but also demonstrate its effectiveness in causal structure learning applications.\n\nFigure 3: Ground truth structure (left) versus\nReconstructed causal graph (right), D1 dataset\n\nFigure 4: Ground truth structure (left) versus\nReconstructed causal graph (right), D2 dataset\n\n6 Conclusion\n\nMotivated by the problems of source detection and causal covariate selection, we start with two for-\nmulations of directed information based subset selection, and then we provide detailed submodularity\nanalysis for both of the objective functions. To extend the greedy heuristics to possibly non-monotonic,\napproximately submodular functions, we introduce an novel notion, namely submodularity index, to\ncharacterize the \u201cdegree\u201d of submodularity for general set functions. More importantly, we show that\nwith SmI, the theoretical performance guarantee of greedy heuristic can be naturally extended to a\nmuch broader class of problems. We also point out several bounds and techniques that can be used to\ncalculate SmI ef\ufb01ciently for the objectives under consideration. Experimental results on the synthesis\nand real data sets reaf\ufb01rmed our theoretical \ufb01ndings, and also demonstrated the effectiveness of\nsolving subset selection for learning causal structures.\n\n7 Acknowledgments\n\nThis research is funded by the Republic of Singapore\u2019s National Research Foundation through a grant\nto the Berkeley Education Alliance for Research in Singapore (BEARS) for the Singapore-Berkeley\nBuilding Ef\ufb01ciency and Sustainability in the Tropics (SinBerBEST) Program. BEARS has been\nestablished by the University of California, Berkeley as a center for intellectual excellence in research\nand education in Singapore. We also thank the reviews for their helpful suggestions.\n\nReferences\n[1] A. S. A. 
Krause and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient\nalgorithms and empirical studies. Journal of Machine Learning Research (JMLR), 9:235\u2013284, 2008.\n\n[2] A. Lozano, H. Li, A. Niculescu-Mizil, Y. Liu, C. Perlich, J. Hosking, and N. Abe. Spatial-temporal causal modeling for climate change\nattribution. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD\u201909), 2009.\n\n[3] A. Das and D. Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse\napproximation and dictionary selection. Proc. of ICML 2011, Seattle, WA, 2011.\n\n[4] Z. Abrams, A. Goel, and S. Plotkin. Set k-cover algorithms for energy efficient monitoring in wireless\nsensor networks. In Proceedings of the 3rd international symposium on Information processing in sensor\nnetworks, pages 424\u2013432. ACM, 2004.\n\n[5] V. Cevher and A. Krause. Greedy dictionary selection for sparse representation. Selected Topics in Signal\nProcessing, IEEE Journal of, 5(5):979\u2013988, 2011.\n\n[6] U. Feige, V. S. Mirrokni, and J. Vondrak. Maximizing non-monotone submodular functions. SIAM Journal\non Computing, 40(4):1133\u20131153, 2011.\n\n[7] M. Feldman, J. Naor, and R. Schwartz. A unified continuous greedy algorithm for submodular maximization.\nIn Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on, pages 570\u2013579.\nIEEE, 2011.\n\n[8] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set\nfunctions. Mathematical Programming, 14(1):265\u2013294, 1978.\n\n[9] J. Jiao, H. H. Permuter, L. Zhao, Y.-H. Kim, and T. Weissman. 
Universal estimation of directed information.\nInformation Theory, IEEE Transactions on, 59(10):6220\u20136242, 2013.\n\n[10] Y. Kawahara and T. Washio. Prismatic algorithm for discrete dc programming problem. In Advances in\nNeural Information Processing Systems, pages 2106\u20132114, 2011.\n\n[11] C.-W. Ko, J. Lee, and M. Queyranne. An exact algorithm for maximum entropy sampling. Operations\nResearch, 43(4):684\u2013691, 1995.\n\n[12] A. Krause and C. Guestrin. Near-optimal value of information in graphical models. UAI, 2005.\n\n[13] H. Lin and J. Bilmes. A class of submodular functions for document summarization. ACL/HLT, 2011.\n\n[14] P. Mathai, N. C. Martins, and B. Shapiro. On the detection of gene network interconnections using directed\nmutual information. In Information Theory and Applications Workshop, 2007, pages 274\u2013283. IEEE, 2007.\n\n[15] K. Murphy et al. The bayes net toolbox for matlab. Computing science and statistics, 33(2):1024\u20131034,\n2001.\n\n[16] M. Narasimhan and J. A. Bilmes. A submodular-supermodular procedure with applications to discriminative\nstructure learning. arXiv preprint arXiv:1207.1404, 2012.\n\n[17] N. Buchbinder, M. Feldman, J. Naor, and R. Schwartz. Submodular maximization with cardinality\nconstraints. ACM-SIAM Symposium on Discrete Algorithms (SODA), 2014.\n\n[18] J. Pearl. Causality: Models, Reasoning and Inference (Second Edition). Cambridge university press, 2009.\n\n[19] C. J. Quinn, T. P. Coleman, N. Kiyavash, and N. G. Hatsopoulos. Estimating the directed information to\ninfer causal relationships in ensemble neural spike train recordings. Journal of computational neuroscience,\n30(1):17\u201344, 2011.\n\n[20] C. J. Quinn, N. Kiyavash, and T. P. Coleman. Directed information graphs. Information Theory, IEEE\nTransactions on, 61(12):6887\u20136909, 2015.\n\n[21] N. A. Sheehan, V. Didelez, P. R. Burton, and M. D. Tobin. Mendelian randomisation and causal inference in\nobservational epidemiology. 
PLoS Medicine, 2008.\n\n[22] U. Feige, V. S. Mirrokni, and J. Vondrak. Maximizing non-monotone submodular functions. SIAM Journal on\nComputing, 40(4):1133\u20131153, 2011.\n\n[23] G. Ver Steeg and A. Galstyan. Information-theoretic measures of influence based on content dynamics. In\nProceedings of the sixth ACM international conference on Web search and data mining, pages 3\u201312. ACM,\n2013.\n\n[24] Y. Zhou, Z. Kang, L. Zhang, and C. Spanos. Causal analysis for non-stationary time series in sensor-rich\nsmart buildings. In Automation Science and Engineering (CASE), 2013 IEEE International Conference on,\npages 593\u2013598. IEEE, 2013.\n", "award": [], "sourceid": 1371, "authors": [{"given_name": "Yuxun", "family_name": "Zhou", "institution": "UC Berkeley"}, {"given_name": "Costas", "family_name": "Spanos", "institution": "University of California, Berkeley"}]}