{"title": "Selecting Observations against Adversarial Objectives", "book": "Advances in Neural Information Processing Systems", "page_first": 777, "page_last": 784, "abstract": null, "full_text": "Selecting Observations against Adversarial Objectives

Andreas Krause, SCS, CMU
H. Brendan McMahan, Google, Inc.
Carlos Guestrin, SCS, CMU
Anupam Gupta, SCS, CMU

Abstract

In many applications, one has to actively select among a set of expensive observations before making an informed decision. Often, we want to select observations which perform well when evaluated with an objective function chosen by an adversary. Examples include minimizing the maximum posterior variance in Gaussian Process regression, robust experimental design, and sensor placement for outbreak detection. In this paper, we present the Submodular Saturation algorithm, a simple and efficient algorithm with strong theoretical approximation guarantees for the case where the possible objective functions exhibit submodularity, an intuitive diminishing returns property. Moreover, we prove that better approximation algorithms do not exist unless NP-complete problems admit efficient algorithms. We evaluate our algorithm on several real-world problems. For Gaussian Process regression, our algorithm compares favorably with state-of-the-art heuristics described in the geostatistics literature, while being simpler, faster and providing theoretical guarantees. For robust experimental design, our algorithm performs favorably compared to SDP-based algorithms.

1 Introduction
In tasks such as sensor placement for environmental temperature monitoring or experimental design, one has to select among a large set of possible, but expensive, observations. Often, there are several different objective functions which we want to simultaneously optimize. 
For example, in the environmental monitoring problem, we want to minimize the marginal posterior variance of our temperature estimate at all locations simultaneously. In experimental design, we often have uncertainty about the model parameters, and we want our experiments to be informative no matter what the true parameters of the model are. These problems can be interpreted as a game: We select a set of observations (sensor locations, experiments), and an adversary selects an objective function (location to evaluate predictive variance, model parameters, etc.) to test us on. Often, the individual objective functions (e.g., the marginal variance at one location, or the information gain for a fixed set of parameters [1, 2]) satisfy submodularity, an intuitive diminishing returns property: Adding a new observation helps less if we have already made many observations, and more if we have made few observations thus far. While NP-hard, the problem of selecting an optimal set of k observations maximizing a single submodular objective can be approximately solved using a simple greedy forward-selection algorithm, which is guaranteed to perform near-optimally [3]. However, as we show, this simple myopic algorithm performs arbitrarily badly in the case of an adversarially chosen objective. In this paper, we address this problem. In particular: (1) We present SATURATE, an efficient algorithm for settings where an adversarially-chosen submodular objective function must be optimized. Our algorithm guarantees solutions which are at least as informative as the optimal solution, at only a slightly higher cost. (2) We prove that our approximation guarantee is best possible and cannot be improved unless NP-complete problems admit efficient algorithms. 
(3) We extensively evaluate our algorithm on several real-world tasks, including minimizing the maximum posterior variance in Gaussian Process regression, finding experiment designs which are robust with respect to parameter uncertainty, and sensor placement for outbreak detection.
2 The adversarial observation selection problem
Observation selection with a single submodular objective. Observation selection problems can often be modeled using set functions: We have a finite set V of observations to choose from, and a utility function F which assigns a real number F(A) to each A ⊆ V, quantifying its informativeness. In many settings, such as the ones described above, the utility F exhibits the property of submodularity: adding an observation helps more, the fewer observations made so far [2]. Formally, F is submodular [3] if, for all A ⊆ B ⊆ V and s ∈ V \\ B, it holds that F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B); F is monotonic if for all A ⊆ B ⊆ V it holds that F(A) ≤ F(B), and F is normalized if F(∅) = 0. Hence, many observation selection problems can be formalized as

max_{A⊆V} F(A), subject to |A| ≤ k, (2.1)

where F is normalized, monotonic and submodular, and k is a bound on the number of observations we can make. Since solving the problem (2.1) is generally NP-hard [4], in practice heuristics are often used. One such heuristic is the greedy algorithm. This algorithm starts with the empty set, and iteratively adds the element s* = argmax_{s∈V\\A} F(A ∪ {s}), until k elements have been selected. Perhaps surprisingly, a fundamental result by Nemhauser et al. 
[3] states that for submodular functions, the greedy algorithm achieves a constant factor approximation: The set AG obtained by the greedy algorithm achieves at least a constant fraction (1 − 1/e) of the objective value obtained by the optimal solution, i.e., F(AG) ≥ (1 − 1/e) max_{|A|≤k} F(A). Moreover, no polynomial time algorithm can provide a better approximation guarantee unless P = NP [4].
Observation selection with adversarial objectives. In many applications (such as those discussed below), one wants to simultaneously optimize multiple objectives. Here, we are given a collection of monotonic submodular functions F1, . . . , Fm, and we want to solve

max_{A⊆V} min_i Fi(A), subject to |A| ≤ k. (2.2)

Problem (2.2) can be considered a game: First, we (the max-player) select a set of observations A, and then our opponent (the min-player) selects a criterion Fi to test us on. Our goal is to select a set A of observations which performs well against an opponent who chooses the worst possible Fi knowing our choice A. Thereby, we try to find a pure equilibrium of a sequential game on a matrix, with one row per A, and one column per Fi. Note that even if the Fi are all submodular, G(A) = min_i Fi(A) is not submodular. In fact, we show below that, in this setting, the simple greedy algorithm (which performs near-optimally in the single-criterion setting) can perform arbitrarily badly.
Examples of adversarial observation selection problems. We consider three instances of adversarial selection problems. Sec. 4 provides more details and experimental results for these domains. Several more examples are presented in the longer version of this paper [5].
Minimizing the maximum Kriging variance. Consider a Gaussian Process (GP) [6] XV defined over a finite set of locations (indices) V. Here, XV is a set of random variables, one variable Xs for each location s ∈ V. 
Given a set of locations A ⊆ V which we observe, we can compute the predictive distribution P(X_{V\\A} | XA = xA), i.e., the distribution of the variables X_{V\\A} at the unobserved locations V \\ A, conditioned on the measurements at the selected locations, XA = xA. Let σ²_{s|A} be the residual variance after making observations at A. Let ΣAA be the covariance matrix of the measurements at the chosen locations A, and ΣsA be the vector of cross-covariances between the measurements at s and A. Then, the variance σ²_{s|A} = σ²_s − ΣsA ΣAA⁻¹ ΣAs depends only on the set A, and not on the observed values xA. Assume that the a priori variance σ²_s is constant for all locations s (in Sec. 3, we show our approach generalizes to non-constant marginal variances). We want to select locations A such that the maximum marginal variance is as small as possible. Equivalently, we can define the variance reduction Fs(A) = σ²_s − σ²_{s|A}, and desire that the minimum variance reduction over all locations s is as large as possible. Das and Kempe [1] show that, in many practical cases, the variance reduction Fs is a monotonic submodular function.
Robust experimental designs. Another application is experimental design under nonlinear dynamics [7]. The goal is to estimate a set of parameters θ of a nonlinear function y = f(x, θ) + w, by providing a set of experimental stimuli x, and measuring the (noisy) response y. In many cases, experimental design for linear models (where y = A(x)ᵀθ + w) with Gaussian noise w can be efficiently solved [8]. In the nonlinear case, the common approach is to linearize f around an initial parameter estimate θ0, i.e., y = f(x, θ0) + V(x)(θ − θ0) + w, where V(x) is the Jacobian of f with respect to the parameters θ, evaluated at θ0. In [7], it was shown that the efficiency of the design can be very sensitive with respect to the initial parameter estimates θ0. Consequently, they develop an efficient semi-definite program (SDP) for E-optimal design (i.e., the goal is to minimize the maximum eigenvalue of the error covariance) which is robust against perturbations of the Jacobian V. However, it might be more natural to directly consider robustness with respect to perturbations of the initial parameter estimates θ0, around which the linearization is performed. We show how to find (Bayesian A-optimal) designs which are robust against uncertainty in these parameter estimates. In this setting, the objectives Fθ0(A) are the reductions of the trace of the parameter covariance, Fθ0(A) = tr(Σ^{(θ0)}_θ) − tr(Σ^{(θ0)}_{θ|A}), where Σ^{(θ0)} is the joint covariance of observations and parameters after linearization around θ0; thus, Fθ0 is the sum of marginal parameter variance reductions, which are individually monotonic and (often) submodular [1], and so Fθ0 is monotonic and submodular as well. Hence, in order to find a robust design, we maximize the minimum variance reduction, where the minimum is taken over (a discretization into a finite subset of) all initial parameter values θ0.
Sensor placement for outbreak detection. Another class of examples are outbreak detection problems on graphs, such as contamination detection in water distribution networks [9]. Here, we are given a graph G = (V, E), and a phenomenon spreading dynamically over the graph. We define a set of intrusion scenarios I; each scenario i ∈ I models an outbreak (e.g., spreading of contamination) starting from a given node s ∈ V in the network. 
By placing sensors at a set of locations A \u2286 V,\nwe can detect such an outbreak, and incur a utility Fi(A) (e.g., reduction in detection time or\npopulation affected).\nIn [9], it was shown that these utilities Fi are monotonic and submodular\nfor a large class of utility functions. In the adversarial setting, the adversary observes our sensor\nplacement A, and then decides on an intrusion i for which our utility Fi(A) is as small as possible.\nHence, our goal is to \ufb01nd a placement A which performs well against such an adversarial opponent.\nHardness of the adversarial observation selection problem. Given the near-optimal perfor-\nmance of the greedy algorithm for the single-objective problem, a natural question is if the per-\nformance guarantee generalizes to the more complex adversarial setting. Unfortunately, this is far\nfrom true. Consider the case with two submodular functions, F1 and F2, where the set of observa-\ntions is V = {s1, s2, t1, t2}. We set F1(\u2205) = F2(\u2205) = 0, and de\ufb01ne F1(A) = 1 if s1 \u2208 A, otherwise\n\u03b5 times the number of ti contained in A. Similarly, if s2 \u2208 A, we set F2(A) = 1, otherwise \u03b5 times\nthe number of ti contained in A. Both F1 and F2 are submodular and monotonic. Optimizing for\na set of 2 elements, the greedy algorithm maximizing G(A) = min{F1(A), F2(A)} would choose\nthe set {t1, t2}, since such choice increases G by 2\u03b5, whereas adding si would not increase the score.\nHowever, the optimal solution with k = 2 is {s1, s2}, with a score of 1. Hence, as \u03b5 \u2192 0, the greedy\nalgorithm performs arbitrarily worse than the optimal solution. Our next hope would be to obtain a\ndifferent good approximation algorithm. However, we can show that most likely this is not possible:\nTheorem 1. Unless P = NP, there cannot exist any polynomial time approximation algorithm for\nProblem (2.2). 
More precisely: Let n be the size of the problem instance, and γ(·) > 0 be any positive function of n. If there exists a polynomial-time algorithm which is guaranteed to find a set A′ of size k such that min_i Fi(A′) ≥ γ(n) max_{|A|≤k} min_i Fi(A), then P = NP.
Thus, unless P = NP, there cannot exist any algorithm which is guaranteed to provide, e.g., even an exponentially small fraction (γ(n) = 2⁻ⁿ) of the optimal solution. All proofs can be found in [5].
3 The Submodular Saturation Algorithm
Since Theorem 1 rules out any approximation algorithm which respects the constraint k on the size of the set A, our only hope for non-trivial guarantees requires us to relax this constraint. We now present an algorithm that finds a set of observations which perform at least as well as the optimal set, but at slightly increased cost; moreover, we show that no efficient algorithms can provide better guarantees (under reasonable complexity-theoretic assumptions). For now we assume all Fi take only integral values; this assumption is relaxed later. The key idea is to consider the following alternative formulation:

max_{c,A} c, subject to c ≤ Fi(A) for 1 ≤ i ≤ m and |A| ≤ αk. (3.1)

We want a set A of size at most αk, such that Fi(A) ≥ c for all i, and c is as large as possible. Here α ≥ 1 is a parameter relaxing the constraint on |A|: if α = 1, we recover the original problem (2.2). We solve program (3.1) as follows: For each value c, we find the cheapest set A with Fi(A) ≥ c for all i. If this cheapest set has at most αk elements, then c is feasible. A binary search on c allows us to find the optimal solution with the maximum feasible c. We first show how to approximately solve Equation (3.1) for a fixed c. 
For c > 0 define F̂i,c(A) = min{Fi(A), c}, the original function Fi truncated at score level c; these F̂i,c functions are also submodular [10].

GPC (F̄c, c):
  A ← ∅;
  while F̄c(A) < c do
    foreach s ∈ V \\ A do δs ← F̄c(A ∪ {s}) − F̄c(A);
    A ← A ∪ {argmax_s δs};
Algorithm 1: The greedy submodular partial cover (GPC) algorithm.

SATURATE (F1, . . . , Fm, k, α):
  cmin ← 0; cmax ← min_i Fi(V); Abest ← ∅;
  while (cmax − cmin) ≥ 1/m do
    c ← (cmin + cmax)/2; ∀A define F̄c(A) ← (1/m) Σ_i min{Fi(A), c}; A ← GPC(F̄c, c);
    if |A| > αk then cmax ← c; else cmin ← c; Abest ← A;
Algorithm 2: The Submodular Saturation algorithm.

Let F̄c(A) = (1/m) Σ_i F̂i,c(A) be their average value; submodular functions are closed under convex combinations, so F̄c is submodular and monotonic. Furthermore, Fi(A) ≥ c for all 1 ≤ i ≤ m if and only if F̄c(A) = c. Hence, in order to determine whether some c is feasible, we solve a submodular covering problem:

Ac = argmin_{A⊆V} |A|, such that F̄c(A) = c. (3.2)

Such problems are NP-hard in general [4], but in [11] it is shown that the greedy algorithm (c.f., Algorithm 1) achieves near-optimal performance on this problem. Using this result, we find:
Lemma 2. Given monotonic submodular functions F1, . . . , Fm and a (feasible) constant c, Algorithm 1 (with input F̄c) finds a set AG such that Fi(AG) ≥ c for all i, and |AG| ≤ α|A*|, where A* is the optimal solution, and α = 1 + log(max_{s∈V} Σ_i Fi(s)) ≥ 1 + log(m max_{s∈V} F̄c(s)).¹
Note that we can compute this approximation guarantee α for any given instance of the adversarial observation selection problem. 
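The two procedures above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes the objectives are given as plain set functions, the function names and the feasibility bookkeeping are ours, and it demonstrates the algorithms on the two-objective hardness example from Section 2:

```python
def gpc(F_bar, c, V):
    # Algorithm 1: greedily grow A until the averaged truncated
    # objective F_bar saturates at level c.
    A = set()
    while F_bar(A) < c:
        s_best = max(V - A, key=lambda s: F_bar(A | {s}) - F_bar(A))
        if F_bar(A | {s_best}) <= F_bar(A):
            break  # no marginal gain left; level c is not achievable
        A.add(s_best)
    return A

def saturate(Fs, k, V, alpha=1.0):
    # Algorithm 2: binary search on c; a level c is declared feasible
    # if GPC covers it with at most alpha*k elements.
    m = len(Fs)
    c_min, c_max = 0.0, min(F(V) for F in Fs)
    A_best = set()
    while c_max - c_min >= 1.0 / m:
        c = (c_min + c_max) / 2.0
        F_bar = lambda A, c=c: sum(min(F(A), c) for F in Fs) / m
        A = gpc(F_bar, c, V)
        if len(A) > alpha * k:
            c_max = c
        else:
            c_min, A_best = c, A
    return A_best

# The hardness example from Section 2, with eps = 0.01:
eps = 0.01
V = {'s1', 's2', 't1', 't2'}
def F1(A): return 1.0 if 's1' in A else eps * len(A & {'t1', 't2'})
def F2(A): return 1.0 if 's2' in A else eps * len(A & {'t1', 't2'})

A = saturate([F1, F2], k=2, V=V)
# naive greedy on min{F1, F2} picks {t1, t2}; SATURATE recovers {s1, s2}
```

In each binary-search round, the truncated average F̄c is rebuilt for the current c; the lazy-evaluation speedup with a priority queue mentioned below (c.f., [12]) would avoid recomputing all marginal gains in every GPC iteration.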
Hence, if for a given value of c the greedy algorithm returns a set of size greater than αk, there cannot exist a solution A′ with |A′| ≤ k with Fi(A′) ≥ c for all i; thus, the optimal solution to the adversarial observation selection problem must be less than c. We can use this argument to conduct a binary search to find the optimal value of c. We call Algorithm 2, which formalizes this procedure, the submodular saturation algorithm (SATURATE), as the algorithm considers the truncated objectives F̂i,c, and chooses sets which saturate all these objectives. Theorem 3 (given below) states that SATURATE is guaranteed to find a set which achieves adversarial score min_i Fi at least as high as the optimal solution, if we allow the set to be logarithmically larger than the optimal solution.
Theorem 3. For any integer k, SATURATE finds a solution AS such that min_i Fi(AS) ≥ max_{|A|≤k} min_i Fi(A) and |AS| ≤ αk, for α = 1 + log(max_{s∈V} Σ_i Fi(s)). The total number of submodular function evaluations is O(|V|² m log(Σ_i Fi(V))).
However, if α < 1 + log(max_{s∈V} Σ_i Fi(s)), the guarantee of Theorem 3 does not hold; the algorithm still makes sense for any value of α. If we had an exact algorithm for submodular coverage, α = 1 would be the correct choice. Since the greedy algorithm solves submodular coverage very effectively, in our experiments, we call SATURATE with α = 1, which empirically performs very well. The worst-case running time guarantee is quite pessimistic, and in practice the algorithm is much faster: Using a priority queue and lazy evaluations, Algorithm 1 can be sped up drastically (c.f., [12] for details). Furthermore, in practical implementations, one would stop GPC once αk + 1 elements have been selected, which already proves that the optimal solution with k elements cannot achieve score c. Also, Algorithm 2 can be terminated once cmax − cmin is sufficiently small; in our experiments, 10-15 iterations usually sufficed.
One might ask whether the guarantee on the size of the set, α, can be improved. Unfortunately, this is not likely, as the following theorem shows:
Theorem 4. If there were a polynomial time algorithm which, for any integer k, is guaranteed to find a solution AS such that min_i Fi(AS) ≥ max_{|A|≤k} min_i Fi(A) and |AS| ≤ βk, where β ≤ (1 − ε)(1 + log max_{s∈V} Σ_i Fi(s)) for some fixed ε > 0, then NP ⊆ DTIME(n^{log log n}).
Here, DTIME(n^{log log n}) is a class of deterministic, slightly superpolynomial (but sub-exponential) algorithms [4]; the inclusion NP ⊆ DTIME(n^{log log n}) is considered unlikely [4].
¹This bound is only meaningful for integral Fi, otherwise it could be arbitrarily improved by scaling the Fi.
Extensions. We now show how the assumptions made in our presentation above can be relaxed.
Non-integral objectives. Most objective functions Fi in the observation selection setting are not integral (e.g., marginal variances of GPs). If they take rational numbers, we can scale the objectives by multiplying by their common denominator. If we allow small additive error, we can approximate their values by their leading digits. An analysis similar to the one presented in [2] can be used to bound the effect of this approximation on the theoretical guarantees obtained by the algorithm.
Non-constant thresholds. Consider the example of Minimax Kriging Designs for GP regression. Here, the Fi(A) = σ²_i − σ²_{i|A} denote the variance reductions at location i. However, rather than guaranteeing that Fi(A) ≥ c for all i (which, in this example, means that the minimum variance reduction is c), we want to guarantee that σ²_{i|A} ≤ c for all i. We can easily adapt our approach to handle this case: Instead of defining F̂i,c(A) = min{Fi(A), c}, we define F̂i,c(A) = min{Fi(A), σ²_i − c}, and then again perform binary search over c, but searching for the smallest c instead. The algorithm, using objectives modified in this way, will bear the same approximation guarantees.
Non-uniform observation costs. We can extend SATURATE to the setting where different observations have different costs. Suppose a cost function g : V → R⁺ assigns each element s ∈ V a positive cost g(s); the cost of a set of observations is then g(A) = Σ_{s∈A} g(s). The problem is to find A* = argmax_{A⊆V} min_i Fi(A) subject to g(A) ≤ B, where B > 0 is a budget we can spend on making observations. In this case, we use the rule δs ← (F̄c(A ∪ {s}) − F̄c(A))/g(s) in Algorithm 1. For this modified algorithm, Theorem 3 still holds, with |A| replaced by g(A) and k replaced by B.
4 Experimental Results
Minimax Kriging. We use SATURATE to select observations in a GP to minimize the maximum posterior variance. We consider precipitation data from the Pacific Northwest of the United States [13]. We discretize the space into 167 locations. In order to estimate variance reduction, we consider the empirical covariance of 50 years of data, which we preprocessed as described in [2]. In the geostatistics literature, the predominant choice of optimization algorithms are carefully tuned local search procedures, prominently simulated annealing (c.f., [14, 15]). We compare our SATURATE algorithm against a state-of-the-art implementation of such a simulated annealing (SA) algorithm, first proposed by [14]. 
We use an optimized implementation described recently by\n[15]. This algorithm has 7 parameters which need to be tuned, describing the annealing schedule,\ndistribution of iterations among several inner loops, etc. We use the parameter settings as reported\nby [15], and report the best result of the algorithm among 10 random trials. In order to compare\nobservation sets of the same size, we called SATURATE with \u03b1 = 1.\nFig. 1(a) compares simulated annealing, SATURATE, and the greedy algorithm which greedily\nselects elements which decrease the maximum variance the most. We also used SATURATE to\ninitialize the simulated annealing algorithm (using only a single run of simulated annealing, as\nopposed to 10 random trials). SATURATE obtains placements which are drastically better than\nthe placements obtained by the greedy algorithm. Furthermore, the performance is very close\nto the performance of the simulated annealing algorithm. When selecting 30 and more sensors,\nSATURATE strictly outperforms the simulated annealing algorithm. Furthermore, as Fig. 1(b)\nshows, SATURATE is signi\ufb01cantly faster than simulated annealing, by factors of 5-10 for larger\nproblems. When using SATURATE in order to initialize the simulated annealing algorithm, the\nresulting performance almost always resulted in the best solutions we were able to \ufb01nd, while\nstill executing faster than simulated annealing with 10 random restarts as proposed by [15]. These\nresults indicate that SATURATE compares favorably to state-of-the-art local search heuristics, while\nbeing faster, requiring no parameters to tune, and providing theoretical approximation guarantees.\nOptimizing for the maximum variance could potentially be considered too pessimistic. Hence\nwe compared placements obtained by SATURATE, minimizing the maximum marginal posterior\nvariance, with placements obtained by the greedy algorithm, where we minimize the average\nmarginal variance. 
Note that, whereas the reduction of the maximum variance is non-submodular, the average variance reduction is (often) submodular [1], and hence the greedy algorithm can be expected to provide near-optimal placements. Fig. 1(c) presents the maximum and average marginal variances for both algorithms. Our results show that if we optimize for the maximum variance we still achieve comparable average variance. If we optimize for average variance however, the maximum posterior variance remains much higher. In the longer version of this paper [5], we present results on two more real data sets, which are qualitatively similar to those discussed here.
(a) Algorithm comparison (b) Running time (c) Avg. vs max. variance
Figure 1: (a) SATURATE, greedy and SA on the precipitation data. SATURATE performs comparably with the fine-tuned SA algorithm, and outperforms it for larger placements. (b) Running times for the same experiment. (c) Optimizing for the maximum variance (using SATURATE) leads to low average variance, but optimizing for average variance (using greedy) does not lead to low maximum variance.
Robust Experimental Design. We consider the robust design of experiments for the Michaelis-Menten mass-action kinetics model, as discussed in [7]. The goal is least-square parameter estimation for a function y = f(x, θ), where x is the chosen experimental stimulus (the initial substrate concentration S0), and θ = (θ1, θ2) are two parameters as described in [7]. The stimulus x is chosen from a menu of six options, x ∈ {1/8, 1, 2, 4, 8, 16}, each of which can be repeatedly chosen. The goal is to produce a fractional design w = (w1, . . . , w6), where each component wi measures the relative frequency according to which the stimulus xi is chosen. Since f is nonlinear, f is linearized around an initial parameter estimate θ0 = (θ01, θ02), and approximated by its Jacobian Vθ0. 
Classical experimental design considers the error covariance of the least squares estimate θ̂, Cov(θ̂ | θ0, w) = σ²(Vθ0ᵀ W Vθ0)⁻¹, where W = diag(w), and aims to find designs w which minimize this error covariance. E-optimality, the criterion adopted by [7], measures smallness in terms of the maximum eigenvalue of the error covariance matrix. The optimal w can be found using Semidefinite Programming (SDP) [8].
The estimate Cov(θ̂ | θ0, w) depends on the initial parameter estimate θ0, where linearization is performed. However, since the goal is parameter estimation, a “certain circularity is involved” [7]. To avoid this problem, [7] find a design wρ(θ0) by solving a robust SDP which minimizes the error size, subject to a worst-case (adversarially-chosen) perturbation Δ on the Jacobian Vθ0; the robustness parameter ρ bounds the spectral norm of Δ. As evaluation criterion, [7] define a notion of efficiency, which is the error size of the optimal design with correct initial parameter estimate, divided by the error when using a robust design obtained at the wrong initial parameter estimates, i.e.,

efficiency ≡ λmax[Cov(θ̂ | θtrue, wopt(θtrue))] / λmax[Cov(θ̂ | θtrue, wρ(θ0))],

where wopt(θ) is the E-optimal design for parameter θ. 
They show that for appropriately chosen values of ρ, the robust design is more efficient than the optimal design, if the initial parameter θ0 does not equal the true parameter.
While their results are very promising, an arguably more natural approach than perturbing the Jacobian would be to perturb the initial parameter estimate, around which linearization is performed. E.g., if the function f describes a process which behaves characteristically differently in different “phases”, and the parameter θ controls which of the phases the process is in, then a robust design should intuitively “hedge” the design against the behavior in each possible phase. In such a case, the uniform distribution (which the robust SDP chooses for large ρ) would not be the most robust design.
If we discretize the space of possible parameter perturbations (within a reasonably chosen interval), we can use SATURATE to find robust experimental designs. While the classical E-optimality is not submodular [2], Bayesian A-optimality is (often) submodular [1, 2]. Here, the goal is to minimize the trace of the error covariance instead of its maximum eigenvalue. Furthermore, we equip the parameters θ with an uninformative normal prior (which we chose as diag([20², 20²])), and then minimize the expected trace of the posterior error covariance, tr(Σ_{θ|A}). Here, A is a discrete design of 20 experiments, where each option xi can be chosen repeatedly. In order to apply SATURATE, for each θ, we define Fθ(A) as the normalized variance reduction Fθ(A) = (1/Zθ)(σ²_θ − σ²_{θ|A}). The normalization Zθ is chosen such that Fθ(A) = 1 if A = argmax_{|A′|=20} Fθ(A′), i.e., if A is chosen to maximize only Fθ. SATURATE is then used to maximize the worst-case normalized variance reduction.
(a) Robust experimental design (b) Water network, objective Z1 (c) Water network, objective Z2
Figure 2: (a) Efficiency of robust SDP of [7] and SATURATE on a biological experimental design problem. For a large range of initial parameter estimates, SATURATE outperforms the SDP solutions. (b,c) SATURATE, greedy and SA in the water network setting, when optimizing worst-case detection time (Z1) and affected population (Z2). SATURATE performs comparably to SA for Z2 and strictly outperforms SA for Z1.
We reproduced the experiment of [7], where the initial estimate of the second component θ02 of θ0 was varied between 0 and 16, the “true” value being θ2 = 2. For each initial estimate of θ02, we computed a robust design, using the SDP approach and using SATURATE, and compared them using the efficiency metric of [7]. We first optimized designs which are robust against a small perturbation of the initial parameter estimate. For the SDP, we chose a robustness parameter ρ = 10⁻³, as reported in [7]. For SATURATE, we considered the interval [θ/(1+ε), θ(1+ε)], discretized in a 5 × 5 grid, with ε = .1. Fig. 2(a) shows three characteristically different regions, A, B, C, separated by vertical lines. 
In region B, which contains the true parameter setting, the E-optimal design (which is optimal if the true parameter is known, i.e., θ02 = θ2) performs similarly to both robust methods. Hence, in region B (i.e., small deviation from the true parameter), robustness is not really necessary. Outside of region B however, where the standard E-optimal design performs badly, both robust designs do not perform well either. This is an intuitive result, as they were optimized to be robust only to small parameter perturbations.
Consequently, we compared designs which are robust against a large parameter range. For SDP, we chose ρ = 16.3, which is the maximum spectral variation of the Jacobian when we consider all initial estimates of θ02 varying between 0 and 16. For SATURATE, we optimized a single design which achieves the maximum normalized variance reduction over all values of θ02 between 0 and 16. Fig. 2(a) shows that, in this case, the design obtained by SATURATE achieves an efficiency of 69%, whereas the efficiency of the SDP design is only 52%. In regions A and C, the SATURATE design strictly outperforms the other robust designs. This experiment indicates that designs which are robust against a large range of initial parameter estimates, as provided by SATURATE, can be more efficient than designs which are robust against perturbations of the Jacobian (the SDP approach).
Outbreak Detection. Consider a city water distribution network, delivering water to households via a system of pipes, pumps, and junctions. Accidental or malicious intrusions can cause contaminants to spread over the network, and we want to select a few locations (pipe junctions) to install sensors, in order to detect these contaminations as quickly as possible. 
In August 2006, the Battle of Water Sensor Networks (BWSN) [16] was organized as an international challenge to find the best sensor placements for a real (but anonymized) metropolitan water distribution network consisting of 12,527 nodes. In this challenge, a set of intrusion scenarios is specified, and for each scenario a realistic simulator provided by the EPA [17] is used to simulate the spread of the contaminant for a 48 hour period. An intrusion is considered detected when one selected node shows positive contaminant concentration. BWSN considered a variety of impact measures, including the time to detection (called Z1) and the size of the affected population calculated using a realistic disease model (Z2). The goal of BWSN was to minimize the expectation of the impact measures Z1 and Z2 given a uniform distribution over intrusion scenarios.
In this paper, we consider the adversarial setting, where an opponent chooses the contamination scenario with knowledge of the sensor locations. The objective functions Z1 and Z2 are in fact submodular for a fixed intrusion scenario [9], and so the adversarial problem of minimizing the impact of the worst possible intrusion fits into our model. For these experiments, we consider scenarios which affect at least 10% of the network, resulting in a total of 3424 scenarios. Figures 2(b) and 2(c) compare the greedy algorithm, SATURATE and the simulated annealing (SA) algorithm for the problem of optimizing the worst-case detection time (Z1) and the worst-case affected population (Z2).
Interestingly, the behavior is very different for the two objectives. For the affected population (Z2), greedy performs reasonably, and SA sometimes even outperforms SATURATE. For the detection time (Z1), however, the greedy algorithm does not improve the objective at all, and SA performs poorly. The reason is that for Z2, the maximum achievable scores, Fi(V), vary drastically, since some scenarios have much higher impact than others. Hence, there is a strong "gradient": the adversarial objective changes quickly as the high-impact scenarios are covered. This gradient allows greedy and SA to work well. In contrast, for Z1, the maximum achievable scores, Fi(V), are constant, since all scenarios have the same simulation duration. Unless all scenarios are detected, the worst-case detection time stays constant at the simulation length. Hence, many of the node exchanges proposed by SA, as well as the addition of a new sensor location by greedy, do not change the adversarial objective at all, leaving the algorithms with no useful signal to follow. As in the GP kriging setting, our results show that optimizing the worst-case score leads to reasonable performance on the average-case score, but not necessarily vice versa.
5 Conclusions
In this paper, we considered the problem of selecting observations which are informative with respect to an objective function chosen by an adversary. We demonstrated how this class of problems encompasses the problems of finding designs which minimize the maximum posterior variance in Gaussian Process regression, robust experimental design, and detecting events spreading over graphs.
In each of these settings, the individual objectives are submodular and can be maximized near-optimally using, e.g., the greedy algorithm; the adversarial objective, however, is not submodular. We proved that no approximation algorithm can exist for the adversarial problem if the constraint on the observation set size must be met exactly, unless P = NP. Consequently, we presented an efficient approximation algorithm, SATURATE, which finds observation sets that are guaranteed to be at least as informative as the optimal solution, and only logarithmically more expensive. In a strong sense, this guarantee is the best possible. We extensively evaluated our algorithm on several real-world problems. For Gaussian Process regression, we showed that SATURATE compares favorably to state-of-the-art heuristics, while being simpler, faster, and providing theoretical guarantees. For robust experimental design, SATURATE performs favorably compared to SDP-based approaches.
Acknowledgements This work was partially supported by NSF Grants No. CNS-0509383, CNS-0625518, CCF-0448095, CCF-0729022, and a gift from Intel. Anupam Gupta and Carlos Guestrin were partly supported by Alfred P. Sloan Fellowships, Carlos Guestrin by an IBM Faculty Fellowship and Andreas Krause by a Microsoft Research Graduate Fellowship.
References
[1] A. Das and D. Kempe. Algorithms for subset selection in linear regression. Manuscript, 2007.
[2] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. To appear in JMLR, 2007.
[3] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of the approximations for maximizing submodular set functions. Mathematical Programming, 14:265–294, 1978.
[4] U. Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4), 1998.
[5] A. Krause, B. McMahan, C. Guestrin, and A. Gupta. Robust submodular observation selection. Technical report, CMU-ML-08-100, 2008.
[6] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006.
[7] P. Flaherty, M. Jordan, and A. Arkin. Robust design of biological experiments. In NIPS, 2006.
[8] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, March 2004.
[9] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD, 2007.
[10] T. Fujito. Approximation algorithms for submodular set cover with applications. TIEICE, 2000.
[11] L. A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2:385–393, 1982.
[12] T. G. Robertazzi and S. C. Schwartz. An accelerated sequential algorithm for producing D-optimal designs. SIAM Journal of Scientific and Statistical Computing, 10(2):341–358, March 1989.
[13] M. Widmann and C. S. Bretherton. 50 km resolution daily precipitation for the Pacific Northwest. http://www.jisao.washington.edu/data sets/widmann/, May 1999.
[14] J. Sacks and S. Schiller. Statistical Decision Theory and Related Topics IV, Vol. 2. Springer, 1988.
[15] D. P. Wiens. Robustness in spatial studies II: minimax design. Environmetrics, 16:205–217, 2005.
[16] A. Ostfeld, J. G. Uber, and E. Salomons. Battle of water sensor networks: A design challenge for engineers and algorithms. In 8th Symposium on Water Distribution Systems Analysis, 2006.
[17] L. A. Rossman. The EPANET programmer's toolkit for analysis of water distribution systems. In Annual Water Resources Planning and Management Conference, 1999.