{"title": "Approximate Supermodularity Bounds for Experimental Design", "book": "Advances in Neural Information Processing Systems", "page_first": 5403, "page_last": 5412, "abstract": "This work provides performance guarantees for the greedy solution of experimental design problems. In particular, it focuses on A- and E-optimal designs, for which typical guarantees do not apply since the mean-square error and the maximum eigenvalue of the estimation error covariance matrix are not supermodular. To do so, it leverages the concept of approximate supermodularity to derive non-asymptotic worst-case suboptimality bounds for these greedy solutions. These bounds reveal that as the SNR of the experiments decreases, these cost functions behave increasingly as supermodular functions. As such, greedy A- and E-optimal designs approach (1-1/e)-optimality. These results reconcile the empirical success of greedy experimental design with the non-supermodularity of the A- and E-optimality criteria.", "full_text": "Approximate Supermodularity Bounds for\n\nExperimental Design\n\nLuiz F. O. Chamon and Alejandro Ribeiro\n\n{luizf,aribeiro}@seas.upenn.edu\n\nElectrical and Systems Engineering\n\nUniversity of Pennsylvania\n\nAbstract\n\nThis work provides performance guarantees for the greedy solution of experimen-\ntal design problems. In particular, it focuses on A- and E-optimal designs, for\nwhich typical guarantees do not apply since the mean-square error and the maxi-\nmum eigenvalue of the estimation error covariance matrix are not supermodular.\nTo do so, it leverages the concept of approximate supermodularity to derive non-\nasymptotic worst-case suboptimality bounds for these greedy solutions. These\nbounds reveal that as the SNR of the experiments decreases, these cost functions\nbehave increasingly as supermodular functions. As such, greedy A- and E-optimal\ndesigns approach (1 \u2212 e\u22121)-optimality. 
These results reconcile the empirical success of greedy experimental design with the non-supermodularity of the A- and E-optimality criteria.\n\n1 Introduction\n\nExperimental design consists of selecting which experiments to run or measurements to observe in order to estimate some variable of interest. Finding good designs is a ubiquitous problem with applications in regression, semi-supervised learning, multivariate analysis, and sensor placement [1\u201310]. Nevertheless, selecting a set of k experiments that optimizes a generic figure of merit is NP-hard [11, 12]. In some situations, however, an approximate solution with optimality guarantees can be obtained in polynomial time. For example, this is possible when the cost function possesses a diminishing returns property known as supermodularity, in which case greedy search is near-optimal. Greedy solutions are particularly attractive for large-scale problems due to their iterative nature and because they have lower computational complexity than typical convex relaxations [11, 12].\n\nSupermodularity, however, is a stringent condition not met by important performance metrics. For instance, it is well known that neither the mean-square error (MSE) nor the maximum eigenvalue of the estimation error covariance matrix is supermodular [1, 13, 14]. Nevertheless, greedy algorithms have been successfully used to minimize these functions despite the lack of theoretical guarantees. The goal of this paper is to reconcile these observations by showing that these figures of merit, used in A- and E-optimal experimental designs, are approximately supermodular. To do so, it introduces different measures of approximate supermodularity and derives near-optimality results for these classes of functions. It then bounds how much the MSE and the maximum eigenvalue of the error covariance matrix violate supermodularity, leading to performance guarantees for greedy A- and E-optimal designs. 
More to the point, the main results of this work are:\n\n1. The greedy solution of the A-optimal design problem is within a multiplicative (1 \u2212 e^{\u2212\u03b1}) factor of the optimal with \u03b1 \u2265 [1 + O(\u03b3)]^{-1}, where \u03b3 upper bounds the signal-to-noise ratio (SNR) of the experiments (Theorem 3).\n\n2. The value of the greedy solution of an E-optimal design problem is at most (1 \u2212 e\u22121)(f(D\u22c6) + k\u03b5), where \u03b5 \u2264 O(\u03b3) (Theorem 4).\n\n3. As the SNR of the experiments decreases, the performance guarantees for greedy A- and E-optimal designs approach the classical 1 \u2212 1/e.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThis last observation is particularly interesting since careful selection of experiments is more important in low SNR scenarios. In fact, unless experiments are highly correlated, designs have similar performances in high SNR. Also, note that the guarantees in this paper are not asymptotic and hold in the worst case, i.e., they hold for problems of any dimension and for designs of any size.\n\nNotation: Lowercase boldface letters represent vectors (x), uppercase boldface letters are matrices (X), and calligraphic letters denote sets/multisets (A). We write #A for the cardinality of A and P(B) to denote the collection of all finite multisets of the set B. To say X is a positive semidefinite (PSD) matrix we write X \u2ab0 0, so that for X, Y \u2208 R^{n\u00d7n}, X \u2aaf Y \u21d4 b^T X b \u2264 b^T Y b for all b \u2208 R^n. Similarly, we write X \u227b 0 when X is positive definite.\n\n2 Optimal experimental design\n\nLet E be a pool of possible experiments. The outcome of experiment e \u2208 E is a multivariate measurement y_e \u2208 R^{n_e} defined as\n\ny_e = A_e \u03b8 + v_e, (1)\n\nwhere \u03b8 \u2208 R^p is a parameter vector with a prior distribution such that E[\u03b8] = \u00af\u03b8 and E(\u03b8 \u2212 \u00af\u03b8)(\u03b8 \u2212 \u00af\u03b8)^T = R_\u03b8 \u227b 0; A_e is an n_e \u00d7 p observation matrix; and v_e \u2208 R^{n_e} is a zero-mean random variable with arbitrary covariance matrix R_e = E[v_e v_e^T] \u227b 0 that represents the experiment uncertainty. The {v_e} are assumed to be uncorrelated across experiments, i.e., E[v_e v_f^T] = 0 for all e \u2260 f, and independent of \u03b8. These experiments aim to estimate\n\nz = H\u03b8, (2)\n\nwhere H is an m \u00d7 p matrix. Appropriately choosing H is important given that the best experiments to estimate \u03b8 are not necessarily the best experiments to estimate z. For instance, if \u03b8 is to be used for classification, then H can be chosen so as to optimize the design with respect to the output of the classifier. Alternatively, transductive experimental design can be performed by taking H to be a collection of data points from a test set [6]. Finally, H = I, the identity matrix, recovers the classical \u03b8-estimation case.\n\nThe experiments to be used in the estimation of z are collected in a multiset D called a design. Note that D may contain elements of E with repetitions. Given a design D, an optimal Bayesian estimate \u02c6z_D can be readily computed. The estimation error of \u02c6z_D is measured by the error covariance matrix K(D). An expression for the estimator and its error matrix in terms of the problem constants is given in the following proposition.\n\nProposition 1 (Bayesian estimator). Let the experiments be defined as in (1). 
For M_e = A_e^T R_e^{-1} A_e and a design D \u2208 P(E), the unbiased affine estimator of z with the smallest error covariance matrix in the PSD cone is given by\n\n\u02c6z_D = H [ R_\u03b8^{-1} + \u2211_{e\u2208D} M_e ]^{-1} [ \u2211_{e\u2208D} A_e^T R_e^{-1} y_e + R_\u03b8^{-1} \u00af\u03b8 ]. (3)\n\nThe corresponding error covariance matrix K(D) = E[(z \u2212 \u02c6z_D)(z \u2212 \u02c6z_D)^T | \u03b8, {M_e}_{e\u2208D}] is given by the expression\n\nK(D) = H [ R_\u03b8^{-1} + \u2211_{e\u2208D} M_e ]^{-1} H^T. (4)\n\nProof. See extended version [15].\n\nThe experimental design problem consists of selecting a design D of cardinality at most k that minimizes the overall estimation error. This can be explicitly stated as the problem of choosing D with #D \u2264 k that minimizes the error covariance K(D) whose expression is given in (4). Note that (4) can account for unregularized (non-Bayesian) experimental design by removing R_\u03b8 and using a pseudo-inverse [16]. However, the error covariance matrix is no longer monotone in this case\u2014see Lemma 1. Providing guarantees for this scenario is the subject of future work.\n\nThe minimization of the PSD matrix K(D) in experimental design is typically attempted using scalarization procedures generically known as alphabetical design criteria, the most common of which are A-, D-, and E-optimal design [17]. These are tantamount to selecting different figures of merit to compare the matrices K(D). Our focus in this paper is mostly on A- and E-optimal designs, but we also consider D-optimal designs for comparison. 
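As a concrete illustration of Proposition 1, the error covariance in (4) can be evaluated directly. The sketch below is a minimal NumPy implementation on synthetic problem data (the dimensions, matrices, and the specific design are assumptions for illustration, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic problem constants (assumptions for illustration, not the paper's data):
p, m, n_e, E = 4, 4, 2, 10
A = [rng.normal(size=(n_e, p)) for _ in range(E)]   # observation matrices A_e
R_e = [np.eye(n_e) for _ in range(E)]               # noise covariances R_e
R_theta = np.eye(p)                                 # prior covariance of theta
H = np.eye(m)                                       # z = H theta

def M(e):
    """Information matrix M_e = A_e^T R_e^{-1} A_e of experiment e."""
    return A[e].T @ np.linalg.inv(R_e[e]) @ A[e]

def K(D):
    """Error covariance K(D) = H [R_theta^{-1} + sum_{e in D} M_e]^{-1} H^T, eq. (4).
    The design D is a multiset: a list of experiment indices, repetitions allowed."""
    info = np.linalg.inv(R_theta) + sum((M(e) for e in D), np.zeros((p, p)))
    return H @ np.linalg.inv(info) @ H.T

# Adding experiments can only decrease K in the PSD order (Lemma 1):
gap = K([0, 3]) - K([0, 3, 3])
assert np.all(np.linalg.eigvalsh(gap) >= -1e-9)
```

Here the repeated index 3 reflects that designs are multisets: running an experiment twice contributes its information matrix twice.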
A design D with k experiments is said to be A-optimal if it minimizes the estimation MSE, which is given by the trace of the covariance matrix:\n\nminimize_{|D|\u2264k} Tr[K(D)] \u2212 Tr[H R_\u03b8 H^T]. (P-A)\n\nNotice that it is customary to say a design is A-optimal when H = I in (P-A), whereas the notation V-optimal is reserved for the case when H is arbitrary [17]. We do not make this distinction here for conciseness.\n\nA design is E-optimal if, instead of minimizing the MSE as in (P-A), it minimizes the largest eigenvalue of the covariance matrix K(D), i.e.,\n\nminimize_{|D|\u2264k} \u03bb_max[K(D)] \u2212 \u03bb_max[H R_\u03b8 H^T]. (P-E)\n\nSince the trace of a matrix is the sum of its eigenvalues, we can think of (P-E) as a robust version of (P-A). While the design in (P-A) seeks to reduce the estimation error in all directions, the design in (P-E) seeks to reduce the estimation error in the worst direction. Equivalently, given that \u03bb_max(X) = max_{||u||_2=1} u^T X u, we can interpret (P-E) with H = I as minimizing the MSE for an adversarial choice of z.\n\nA D-optimal design is one in which the objective is to minimize the log-determinant of the estimator's covariance matrix:\n\nminimize_{|D|\u2264k} log det[K(D)] \u2212 log det[H R_\u03b8 H^T]. (P-D)\n\nThe motivation for using the objective in (P-D) is that the log-determinant of K(D) is proportional to the volume of the confidence ellipsoid when the data are Gaussian. Note that the trace, maximum eigenvalue, and determinant of H R_\u03b8 H^T in (P-A), (P-E), and (P-D) are constants and do not affect the respective optimization problems. They are subtracted so that the objectives vanish when D = \u2205, the empty set. 
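The three normalized objectives can be written down in a few lines; the following sketch (toy data with H = I, unit prior, and unit noise, all assumptions for brevity) also checks that each objective vanishes at the empty design and is monotone decreasing:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance (an assumption for illustration): H = I, scalar experiments,
# unit prior and noise covariances.
p = 4
A = [rng.normal(size=(1, p)) for _ in range(8)]
R_theta = np.eye(p)

def K(D):
    """Error covariance of eq. (4) with H = I and R_e = I."""
    info = np.linalg.inv(R_theta) + sum((A[e].T @ A[e] for e in D), np.zeros((p, p)))
    return np.linalg.inv(info)

C = R_theta                                # = H R_theta H^T for H = I
f_A = lambda D: np.trace(K(D)) - np.trace(C)                              # (P-A)
f_E = lambda D: np.linalg.eigvalsh(K(D))[-1] - np.linalg.eigvalsh(C)[-1]  # (P-E)
f_D = lambda D: np.linalg.slogdet(K(D))[1] - np.linalg.slogdet(C)[1]      # (P-D)

for f in (f_A, f_E, f_D):
    assert abs(f([])) < 1e-12              # normalized: the objective vanishes at the empty design
    assert f([0, 1]) <= f([0]) + 1e-12     # monotone decreasing
```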
This simplifies the exposition in Section 4.\n\nAlthough the problem formulations in (P-A), (P-E), and (P-D) are integer programs known to be NP-hard, the use of greedy methods for their solution is widespread with good performance in practice. In the case of D-optimal design, this is justified theoretically because the objective of (P-D) is supermodular, which implies greedy methods are (1 \u2212 e\u22121)-optimal [2, 11, 12]. The objectives in (P-A) and (P-E), on the other hand, are not supermodular in general [1, 13, 14] and it is not known why their greedy optimization yields good results in practice\u2014conditions for the MSE to be supermodular exist but are restrictive [1]. The goal of this paper is to derive performance guarantees for greedy solutions of A- and E-optimal design problems. We do so by developing different notions of approximate supermodularity to show that A- and E-optimal design problems are not far from supermodular.\n\nRemark 1. Besides its intrinsic value as a minimizer of the volume of the confidence ellipsoid, (P-D) is often used as a surrogate for (P-A) when A-optimality (MSE) is considered the appropriate metric. It is important to point out that this is only justified when the problem has some inherent structure that suggests the minimum volume ellipsoid is somewhat symmetric. Otherwise, since the volume of an ellipsoid can be reduced by decreasing the length of a single principal axis, using (P-D) can lead to designs that perform well\u2014in the MSE sense\u2014along a few directions of the parameter space and poorly along all others. 
Formally, this can be seen by comparing the variation of the log-determinant and trace functions with respect to the eigenvalues of the PSD matrix K:\n\n\u2202 log det(K) / \u2202\u03bb_j(K) = 1/\u03bb_j(K) and \u2202 Tr(K) / \u2202\u03bb_j(K) = 1.\n\nThe gradient of the log-determinant is largest in the direction of the smallest eigenvalue of the error covariance matrix. In contrast, the MSE gives equal weight to all directions of the space. The latter therefore yields balanced designs, whereas the former tends to flatten the confidence ellipsoid unless the problem has a specific structure.\n\n3 Approximate supermodularity\n\nConsider a multiset function f : P(E) \u2192 R for which the value corresponding to an arbitrary multiset D \u2208 P(E) is denoted by f(D). We say the function f is normalized if f(\u2205) = 0 and we say f is monotone decreasing if for all multisets A \u2286 B it holds that f(A) \u2265 f(B). Observe that if a function is normalized and monotone decreasing, it must be that f(D) \u2264 0 for all D. The objectives of (P-A), (P-E), and (P-D) are normalized and monotone decreasing multiset functions, since adding experiments to a design decreases the covariance matrix uniformly in the PSD cone\u2014see Lemma 1.\n\nWe say that a multiset function f is supermodular if for all pairs of multisets A, B \u2208 P(E), A \u2286 B, and elements u \u2208 E it holds that\n\nf(A) \u2212 f(A \u222a {u}) \u2265 f(B) \u2212 f(B \u222a {u}).\n\nSupermodular functions encode a notion of diminishing returns as the sets grow. 
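On small instances the inequality can be verified exhaustively. The sketch below (assumed toy data, not from the paper) checks the diminishing-returns inequality for the D-optimality objective, which is supermodular, over all nested pairs A ⊆ B drawn from a small pool:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)

p, n_exp = 3, 5
A = [rng.normal(size=(1, p)) for _ in range(n_exp)]

def f_D(D):
    """Normalized D-optimality objective log det K(D) - log det R_theta,
    with H = I, R_theta = I, and unit noise, so f_D([]) = 0."""
    info = np.eye(p) + sum((A[e].T @ A[e] for e in D), np.zeros((p, p)))
    return -np.linalg.slogdet(info)[1]

# Exhaustively check f(A) - f(A u {u}) >= f(B) - f(B u {u}) for all A subset of B:
for B in combinations(range(n_exp), 3):
    for r in range(len(B) + 1):
        for A_sub in combinations(B, r):
            for u in range(n_exp):
                gain_A = f_D(list(A_sub)) - f_D(list(A_sub) + [u])
                gain_B = f_D(list(B)) - f_D(list(B) + [u])
                assert gain_A >= gain_B - 1e-9
```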
Their relevance in this paper is due to the celebrated bound on the suboptimality of their greedy minimization [18]. Specifically, construct a greedy solution by starting with G_0 = \u2205 and incorporating elements (experiments) e \u2208 E greedily, so that at the h-th iteration we incorporate the element whose addition to G_{h-1} results in the largest reduction in the value of f:\n\nG_h = G_{h-1} \u222a {e}, with e = argmin_{u\u2208E} f(G_{h-1} \u222a {u}). (5)\n\nThe recursion in (5) is repeated for k steps to obtain a greedy solution with k elements. Then, if f is normalized, monotone decreasing, and supermodular,\n\nf(G_k) \u2264 (1 \u2212 e\u22121) f(D\u22c6), (6)\n\nwhere D\u22c6 = argmin_{|D|\u2264k} f(D) is the optimal design selection of cardinality not larger than k [18]. We emphasize that in contrast to the classical greedy algorithm, (5) allows the same element to be selected multiple times.\n\nThe optimality guarantee in (6) applies to (P-D) because its objective is supermodular. This is not true of the cost functions of (P-A) and (P-E). We address this issue by postulating that if a function does not violate supermodularity too much, then its greedy minimization should have close to supermodular performance. To formalize this idea, we introduce two measures of approximate supermodularity and derive near-optimal bounds based on these properties. It is worth noting that, as intuitive as it may be, such results are not straightforward. In fact, [19] showed that even functions \u03b4-close to supermodular cannot be optimized in polynomial time.\n\nWe start with the following multiplicative relaxation of the supermodular property.\n\nDefinition 1 (\u03b1-supermodularity). 
A multiset function f : P(E) \u2192 R is \u03b1-supermodular, for \u03b1 : N \u00d7 N \u2192 R, if for all multisets A, B \u2208 P(E), A \u2286 B, and all u \u2208 E it holds that\n\nf(A) \u2212 f(A \u222a {u}) \u2265 \u03b1(#A, #B) [f(B) \u2212 f(B \u222a {u})]. (7)\n\nNotice that for \u03b1 \u2265 1, (7) reduces to the original definition of supermodularity, in which case we refer to the function simply as supermodular [11, 12]. On the other hand, when \u03b1 < 1, f is said to be approximately supermodular. Notice that if f is decreasing, then (7) always holds for \u03b1 \u2261 0. We are therefore interested in the largest \u03b1 for which (7) holds, i.e.,\n\n\u03b1(a, b) = min_{A,B\u2208P(E), A\u2286B, u\u2208E, #A=a, #B=b} [f(A) \u2212 f(A \u222a {u})] / [f(B) \u2212 f(B \u222a {u})]. (8)\n\nInterestingly, \u03b1 not only measures how much f violates supermodularity, but it also quantifies the loss in performance guarantee incurred from these violations.\n\nTheorem 1. Let f be a normalized, monotone decreasing, and \u03b1-supermodular multiset function. Then, for \u00af\u03b1 = min_{a<\u2113, b<\u2113+k} \u03b1(a, b), the greedy solution from (5) obeys\n\nf(G_\u2113) \u2264 [ 1 \u2212 \u220f_{h=0}^{\u2113-1} ( 1 \u2212 1 / \u2211_{s=0}^{k-1} \u03b1(h, h+s)^{-1} ) ] f(D\u22c6) \u2264 (1 \u2212 e^{\u2212\u00af\u03b1\u2113/k}) f(D\u22c6). (9)\n\nProof. See extended version [15].\n\nTheorem 1 bounds the suboptimality of the greedy solution from (5) when its objective is \u03b1-supermodular. At the same time, it quantifies the effect of relaxing the supermodularity hypothesis typically used to provide performance guarantees in these settings. In fact, if f is supermodular (\u03b1 \u2261 1) and for \u2113 = k, we recover the 1 \u2212 e\u22121 \u2248 0.63 guarantee from [18]. 
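The greedy recursion in (5) is straightforward to implement; a minimal sketch for a caller-supplied cost f, representing multisets as lists so the same experiment can be picked repeatedly (the A-optimality toy cost below is an assumption for illustration):

```python
import numpy as np

def greedy_design(f, pool, k):
    """Greedy minimization following recursion (5): at each of k steps, add the
    experiment (repetitions allowed) whose inclusion most reduces f."""
    G = []
    for _ in range(k):
        best = min(pool, key=lambda u: f(G + [u]))
        G = G + [best]
    return G

# Toy A-optimality cost on synthetic data (an assumption for illustration):
rng = np.random.default_rng(3)
p = 4
A = [rng.normal(size=(1, p)) for _ in range(20)]

def f_A(D):
    """Tr K(D) with H = I, R_theta = I, and unit noise."""
    info = np.eye(p) + sum((A[e].T @ A[e] for e in D), np.zeros((p, p)))
    return np.trace(np.linalg.inv(info))

G = greedy_design(f_A, range(20), k=5)
assert len(G) == 5
assert f_A(G) <= f_A(G[:-1]) <= f_A([])   # each greedy step decreases the cost
```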
On the other hand, for an approximately supermodular function (\u00af\u03b1 < 1), the result in (9) shows that the 63% guarantee can be recovered by selecting a set of size \u2113 = \u00af\u03b1\u22121 k. Thus, \u03b1 not only measures how much f violates supermodularity, but also gives a factor by which the cardinality constraint must be violated to obtain a supermodular near-optimal certificate. As with the original bound in [18], it is worth noting that (9) is not tight and that better results are typical in practice (see Section 5).\n\nAlthough \u03b1-supermodularity gives a multiplicative approximation factor, finding meaningful bounds on \u03b1 can be challenging for certain multiset functions, such as the E-optimality criterion in (P-E). It is therefore useful to look at approximate supermodularity from a different perspective, as in the following definition.\n\nDefinition 2 (\u03b5-supermodularity). A multiset function f : P(E) \u2192 R is \u03b5-supermodular, for \u03b5 : N \u00d7 N \u2192 R, if for all multisets A, B \u2208 P(E), A \u2286 B, and all u \u2208 E it holds that\n\nf(A) \u2212 f(A \u222a {u}) \u2265 f(B) \u2212 f(B \u222a {u}) \u2212 \u03b5(#A, #B). (10)\n\nAgain, we say f is supermodular if \u03b5(a, b) \u2264 0 for all a, b and approximately supermodular otherwise. As with \u03b1, we want the best \u03b5 that satisfies (10), which is given by\n\n\u03b5(a, b) = max_{A,B\u2208P(E), A\u2286B, u\u2208E, #A=a, #B=b} f(B) \u2212 f(B \u222a {u}) \u2212 f(A) + f(A \u222a {u}). (11)\n\nIn contrast to \u03b1-supermodularity, we obtain an additive approximation guarantee for the greedy minimization of \u03b5-supermodular functions.\n\nTheorem 2. 
Let f be a normalized, monotone decreasing, and \u03b5-supermodular multiset function. Then, for \u00af\u03b5 = max_{a<\u2113, b<\u2113+k} \u03b5(a, b), the greedy solution from (5) obeys\n\nf(G_\u2113) \u2264 [ 1 \u2212 ( 1 \u2212 1/k )^\u2113 ] f(D\u22c6) + (1/k) \u2211_{s=0}^{k-1} \u2211_{h=0}^{\u2113-1} ( 1 \u2212 1/k )^{\u2113-1-h} \u03b5(h, h+s) \u2264 (1 \u2212 e^{\u2212\u2113/k}) (f(D\u22c6) + k\u00af\u03b5). (12)\n\nProof. See extended version [15].\n\nAs before, \u03b5 quantifies the loss in performance guarantee due to relaxing supermodularity. Indeed, (12) reveals that \u03b5-supermodular functions have the same guarantees as a supermodular function up to an additive factor of \u0398(k\u00af\u03b5). In fact, if \u00af\u03b5 \u2264 (ek)\u22121 |f(D\u22c6)| (recall that f(D\u22c6) \u2264 0 due to normalization), then taking \u2113 = 3k recovers the 63% approximation factor of supermodular functions. This same factor is obtained for \u03b1 \u2265 1/3-supermodular functions.\n\nWith the certificates of Theorems 1 and 2 in hand, we now proceed with the study of the A- and E-optimality criteria. In the next section, we derive explicit bounds on their \u03b1- and \u03b5-supermodularity, respectively, thus providing near-optimal performance guarantees for greedy A- and E-optimal designs.\n\n4 Near-optimal experimental design\n\nTheorems 1 and 2 apply to functions that are (i) normalized, (ii) monotone decreasing, and (iii) approximately supermodular. By construction, the objectives of (P-A) and (P-E) are normalized [(i)]. The following lemma establishes that they are also monotone decreasing [(ii)] by showing that K is a decreasing set function in the PSD cone. The definition of Loewner order and the monotonicity of the trace operator readily give the desired results [16].\n\nLemma 1. 
The matrix-valued set function K(D) in (4) is monotonically decreasing with respect to the PSD cone, i.e., A \u2286 B \u21d2 K(A) \u2ab0 K(B).\n\nProof. See extended version [15].\n\nThe main results of this section provide the final ingredient [(iii)] for Theorems 1 and 2 by bounding the approximate supermodularity of the A- and E-optimality criteria. We start by showing that the objective of (P-A) is \u03b1-supermodular.\n\nTheorem 3. The objective of (P-A) is \u03b1-supermodular with\n\n\u03b1(a, b) \u2265 (1 / \u03ba(H)^2) \u00b7 \u03bb_min[R_\u03b8^{-1}] / ( \u03bb_max[R_\u03b8^{-1}] + a \u00b7 \u2113_max ), for all b \u2208 N, (13)\n\nwhere \u2113_max = max_{e\u2208E} \u03bb_max(M_e), M_e = A_e^T R_e^{-1} A_e, and \u03ba(H) = \u03c3_max / \u03c3_min is the \u21132-norm condition number of H, with \u03c3_max and \u03c3_min denoting the largest and smallest singular values of H, respectively.\n\nProof. See extended version [15].\n\nTheorem 3 bounds the \u03b1-supermodularity of the objective of (P-A) in terms of the condition number of H, the prior covariance matrix, and the measurements' SNR. To facilitate the interpretation of this result, let the SNR of the e-th experiment be \u03b3_e = Tr[M_e] and suppose R_\u03b8 = \u03c3_\u03b8^2 I, H = I, and \u03b3_e \u2264 \u03b3 for all e \u2208 E. Then, for \u2113 = k greedy iterations, (13) implies\n\n\u00af\u03b1 \u2265 1 / (1 + 2k\u03c3_\u03b8^2 \u03b3),\n\nfor \u00af\u03b1 as in Theorem 1. This deceptively simple bound reveals that the MSE behaves as a supermodular function at low SNRs. Formally, \u03b1 \u2192 1 as \u03b3 \u2192 0. In contrast, the performance guarantee from Theorem 3 degrades in high SNR scenarios. In this case, however, greedy methods are expected to give good results since designs yield similar estimation errors (as illustrated in Section 5). 
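Under the simplifying assumptions above (R_theta = sigma_theta^2 I, H = I, and per-experiment SNR at most gamma), this lower bound is trivial to evaluate; the sketch below (with illustrative, assumed SNR values) shows it approaching 1, the supermodular case, as gamma decreases:

```python
def alpha_bar_lower(k, sigma_theta_sq, gamma):
    """Lower bound on the alpha-supermodularity constant of the A-optimality
    objective for R_theta = sigma_theta^2 I, H = I, per-experiment SNR <= gamma,
    and l = k greedy iterations."""
    return 1.0 / (1.0 + 2 * k * sigma_theta_sq * gamma)

# The bound improves monotonically as the SNR decreases:
bounds = [alpha_bar_lower(k=40, sigma_theta_sq=1.0, gamma=g)
          for g in (10.0, 1.0, 0.1, 0.001)]
assert all(0 < b <= 1 for b in bounds)
assert bounds == sorted(bounds)
```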
The greedy solution of (P-A) also approaches the 1 \u2212 1/e guarantee when the prior on \u03b8 is concentrated (\u03c3_\u03b8^2 \u226a 1), i.e., when the problem is heavily regularized. These observations also hold for a generic H as long as it is well-conditioned. Even if \u03ba(H) \u226b 1, we can replace H by \u02dcH = DH for some diagonal matrix D \u227b 0 without affecting the design, since z is arbitrarily scaled. The scaling D can be designed to minimize the condition number of \u02dcH by leveraging preconditioning and balancing methods [20, 21].\n\nProceeding, we derive guarantees for E-optimal designs using \u03b5-supermodularity.\n\nTheorem 4. The cost function of (P-E) is \u03b5-supermodular with\n\n\u03b5(a, b) \u2264 (b \u2212 a) \u03c3_max(H)^2 \u03bb_max(R_\u03b8)^2 \u2113_max, (14)\n\nwhere \u2113_max = max_{e\u2208E} \u03bb_max(M_e), M_e = A_e^T R_e^{-1} A_e, and \u03c3_max(H) is the largest singular value of H.\n\nProof. See extended version [15].\n\nFigure 1: A-optimal design: (a) Thm. 3; (b) A-optimality (low SNR); (c) A-optimality (high SNR). The plots show the unnormalized A-optimality value for clarity.\n\nFigure 2: E-optimal design: (a) Thm. 4; (b) E-optimality (low SNR); (c) E-optimality (high SNR). The plots show the unnormalized E-optimality value for clarity.\n\nUnder the same assumptions as above, Theorem 4 gives\n\n\u00af\u03b5 \u2264 2k\u03c3_\u03b8^4 \u03b3,\n\nfor \u00af\u03b5 as in Theorem 2. Thus, \u03b5 \u2192 0 as \u03b3 \u2192 0. In other words, the behavior of the objective of (P-E) approaches that of a supermodular function as the SNR decreases. The same holds for concentrated priors, i.e., lim_{\u03c3_\u03b8^2 \u2192 0} \u00af\u03b5 = 0. Once again, it is worth noting that when the SNRs of the experiments are large, almost every design has the same E-optimal performance as long as the experiments are not too correlated. 
Thus, greedy design is also expected to give good results under these conditions. Finally, the proofs of Theorems 3 and 4 suggest that better bounds can be found when the designs are constructed without replacement, i.e., when only one of each experiment is allowed in the design.\n\n5 Numerical examples\n\nIn this section, we illustrate the previous results in some numerical examples. To do so, we draw the elements of A_e from an i.i.d. zero-mean Gaussian random variable with variance 1/p and p = 20. The noise {v_e} are also Gaussian random variables with R_e = \u03c3_v^2 I. We take \u03c3_v^2 = 10^{-1} in high SNR and \u03c3_v^2 = 10 in low SNR simulations. The experiment pool contains #E = 200 experiments.\n\nStarting with A-optimal design, we display the bound from Theorem 3 in Figure 1a for multivariate measurements of size n_e = 5 and designs of size k = 40. Here, \u201cequivalent \u03b1\u201d is the single \u02c6\u03b1 that gives the same near-optimal certificate (9) as using (13). As expected, \u02c6\u03b1 approaches 1 as the SNR decreases. In fact, for \u221210 dB it is already close to 0.75, which means that by selecting a design of size \u2113 = 55 we would be within 1 \u2212 1/e of the optimal design of size k = 40. Figures 1b and 1c compare greedy A-optimal designs with the convex relaxation of (P-A) in low and high SNR scenarios. The designs are obtained from the continuous solutions using the hard constraint, with replacement method of [10] and a simple design truncation as in [22]. Therefore, these simulations consider univariate measurements (n_e = 1). For comparison, a design sampled uniformly at random with replacement from E is also presented. 
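The setup above can be reproduced schematically; the sketch below is a simplified reimplementation (not the authors' code), with dimensions reduced from the paper's p = 20 and #E = 200 (an assumption made only to keep the example fast), building a greedy A-optimal design with replacement in the low-SNR regime:

```python
import numpy as np

rng = np.random.default_rng(4)

# Reduced dimensions (the paper uses p = 20 and #E = 200); the values here are
# illustrative assumptions.
p, n_pool, k = 10, 50, 15
sigma_v_sq = 10.0                          # low-SNR regime, as in the text
A = [rng.normal(scale=1 / np.sqrt(p), size=(1, p)) for _ in range(n_pool)]

def mse(D):
    """A-optimality value Tr K(D) with H = I, R_theta = I, R_e = sigma_v^2 I."""
    info = np.eye(p) + sum((A[e].T @ A[e] / sigma_v_sq for e in D), np.zeros((p, p)))
    return np.trace(np.linalg.inv(info))

# Greedy design with replacement, recursion (5):
G = []
for _ in range(k):
    G.append(min(range(n_pool), key=lambda u: mse(G + [u])))

# A random design with replacement, for comparison as in Figure 1b:
R = list(rng.integers(0, n_pool, size=k))

# Every added experiment decreases the MSE (Lemma 1), so the greedy design
# improves monotonically on the empty design:
assert mse(G) <= mse(G[:-1]) <= mse([])
```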
Note that, as mentioned before, the performance difference across designs is small for high SNR\u2014notice the scale in Figures 1c and 2c\u2014so that even random designs perform well.\n\nFor the E-optimality criterion, the bound from Theorem 4 is shown in Figure 2a, again for multivariate measurements of size n_e = 5 and designs of size k = 40. Once again, \u201cequivalent \u03b5\u201d is the single value \u02c6\u03b5 that yields the same guarantee as using (14). In this case, the bound degradation in high SNR is more pronounced. This reflects the difficulty in bounding the approximate supermodularity of the E-optimality cost function. Still, Figures 2b and 2c show that greedy E-optimal designs have good performance when compared to convex relaxations or random designs. Note that, though it is not intended for E-optimal designs, we again display the results of the sampling post-processing from [10]. In Figure 2b, the random design is omitted due to its poor performance.\n\n5.1 Cold-start survey design for recommender systems\n\nRecommender systems use semi-supervised learning methods to predict user ratings based on few rated examples. These methods are useful, for instance, to streaming service providers who are interested in using predicted ratings of movies to provide recommendations. For new users, these systems suffer from a \u201ccold-start problem,\u201d which refers to the fact that it is hard to provide accurate recommendations without knowing a user\u2019s preference on at least a few items. 
For this reason, services explicitly ask users for ratings in initial surveys before emitting any recommendation. Selecting which movies should be rated to better predict a user\u2019s preferences can be seen as an experimental design problem. In the following example, we use a subset of the EachMovie dataset [23] to illustrate how greedy experimental design can be applied to address this problem.\n\nWe randomly selected a training and test set containing 9000 and 3000 users respectively. Following the notation from Section 2, each experiment in E represents a movie (|E| = 1622) and the observation vector A_e collects the ratings of movie e for each user in the training set. The parameter \u03b8 is used to express the rating of a new user in terms of those in the training set. Our hope is that we can extrapolate the observed ratings, i.e., {y_e}_{e\u2208D}, to obtain the rating for a movie f \u2209 D as \u02c6y_f = A_f \u02c6\u03b8. Since the mean absolute error (MAE) is commonly used in this setting, we choose to work with the A-optimality criterion. We also let H = I and take a non-informative prior \u00af\u03b8 = 0 and R_\u03b8 = \u03c3_\u03b8^2 I with \u03c3_\u03b8^2 = 100.\n\nAs expected, greedy A-optimal design is able to find small sets of movies that lead to good prediction. For k = 10, for example, MAE = 2.3, steadily reducing until MAE < 1.8 for k \u2265 35. These are considerably better results than a random movie selection, for which the MAE varies between 2.8 and 3.3 for k between 10 and 50. Instead of focusing on the raw ratings, we may be interested in predicting the user\u2019s favorite genre. This is a challenging task due to the heavily skewed dataset. For instance, 32% of the movies are dramas whereas only 0.02% are animations. Still, we use the simplest possible classifier by selecting the category with the highest average estimated ratings. By using greedy design, we can obtain a misclassification rate of approximately 25% by observing 100 ratings, compared to over 45% error rate for a random design.\n\n6 Related work\n\nOptimal experimental design: Classical experimental design typically relies on convex relaxations to solve optimal design problems [17, 22]. However, because these are semidefinite programs (SDPs) or sequential second-order cone programs (SOCPs), their computational complexity can hinder their use in large-scale problems [5, 7, 22, 24]. Another issue with these relaxations is that some sort of post-processing is required to extract a valid design from their continuous solutions [5, 22]. For D-optimal designs, this can be done with (1 \u2212 e\u22121)-optimality [25, 26]. For A-optimal designs, [10] provides near-optimal randomized schemes for large enough k.\n\nGreedy optimization guarantees: The (1 \u2212 e\u22121)-suboptimality of greedy search for supermodular minimization under cardinality constraints was established in [18]. To deal with the fact that the MSE is not supermodular, \u03b1-supermodularity with constant \u03b1 was introduced in [27] along with explicit lower bounds. This concept is related to the submodularity ratio introduced by [3] to obtain guarantees similar to Theorem 1 for dictionary selection and forward regression. However, the bounds on the submodularity ratio from [3, 28] depend on the sparse eigenvalues of K or restricted strong convexity constants of the A-optimal objective, which are NP-hard to compute. Explicit bounds for the submodularity ratio of A-optimal experimental design were recently obtained in [29]. Nevertheless, neither [27] nor [29] consider multisets. Hence, to apply their results we must operate on an extended ground set containing k unique copies of each experiment, which makes the bounds uninformative. 
For instance, in the setting of Section 5, Theorem 3 guarantees 0.1-optimality at 0 dB SNR, whereas [29] guarantees 2.5 × 10⁻⁶-optimality. The concept of ε-supermodularity was first explored in [30] for a constant ε. There, guarantees for dictionary selection were derived by bounding ε using an incoherence assumption on the Ae. Finally, a more stringent definition of approximately submodular functions was put forward in [19] by requiring the function to be upper and lower bounded by a submodular function. They show strong impossibility results unless the function is O(1/k)-close to submodular. Approximate submodularity is sometimes referred to as weak submodularity (e.g., [28]), though it is not related to the weak submodularity concept from [31].

7 Conclusions

Greedy search is known to be an empirically effective method to find A- and E-optimal experimental designs despite the fact that these objectives are not supermodular. We reconciled these observations by showing that the A- and E-optimality criteria are approximately supermodular and by deriving near-optimality guarantees for this class of functions. By quantifying their supermodularity violations, we showed that the behavior of the MSE and the maximum eigenvalue of the error covariance matrix becomes increasingly supermodular as the SNR decreases. An important open question is whether these results can be improved using additional knowledge. Can we exploit some structure of the observation matrices (e.g., Fourier, random)? What if the parameter vector is sparse but with unknown support (e.g., compressive sensing)? Are there practical experiment properties other than the SNR that lead to small supermodularity violations?
Finally, we hope that this approximate supermodularity framework can be extended to other problems.

Acknowledgments

This work was supported by the National Science Foundation (CCF 1717120) and in part by the ARO (W911NF1710438).

References

[1] A. Das and D. Kempe, "Algorithms for subset selection in linear regression," in ACM Symp. on Theory of Comput., 2008, pp. 45–54.

[2] A. Krause, A. Singh, and C. Guestrin, "Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies," J. Mach. Learning Research, vol. 9, pp. 235–284, 2008.

[3] A. Das and D. Kempe, "Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection," in Int. Conf. on Mach. Learning, 2011.

[4] Y. Washizawa, "Subset kernel principal component analysis," in Int. Workshop on Mach. Learning for Signal Process., 2009.

[5] S. Joshi and S. Boyd, "Sensor selection via convex optimization," IEEE Trans. Signal Process., vol. 57, no. 2, pp. 451–462, 2009.

[6] K. Yu, J. Bi, and V. Tresp, "Active learning via transductive experimental design," in Int. Conf. on Mach. Learning, 2006, pp. 1081–1088.

[7] P. Flaherty, A. Arkin, and M.I. Jordan, "Robust design of biological experiments," in Advances in Neural Information Processing Systems, 2006, pp. 363–370.

[8] X. Zhu, "Semi-supervised learning literature survey," 2008, http://pages.cs.wisc.edu/~jerryzhu/research/ssl/semireview.html.

[9] S. Liu, S.P. Chepuri, M. Fardad, E. Maşazade, G. Leus, and P.K. Varshney, "Sensor selection for estimation with correlated measurement noise," IEEE Trans. Signal Process., vol. 64, no. 13, pp. 3509–3522, 2016.

[10] Y. Wang, A.W. Yu, and A. Singh, "On computationally tractable selection of experiments in regression models," 2017, arXiv:1601.02068v5.

[11] F. Bach, "Learning with submodular functions: A convex optimization perspective," Foundations and Trends in Machine Learning, vol. 6, no. 2-3, pp. 145–373, 2013.

[12] A. Krause and D. Golovin, "Submodular function maximization," in Tractability: Practical Approaches to Hard Problems. Cambridge University Press, 2014.

[13] G. Sagnol, "Approximation of a maximum-submodular-coverage problem involving spectral functions, with application to experimental designs," Discrete Appl. Math., vol. 161, no. 1-2, pp. 258–276, 2013.

[14] T.H. Summers, F.L. Cortesi, and J. Lygeros, "On submodularity and controllability in complex dynamical networks," IEEE Trans. Contr. Netw. Syst., vol. 3, no. 1, pp. 91–101, 2016.

[15] L.F.O. Chamon and A. Ribeiro, "Approximate supermodularity bounds for experimental design," 2017, arXiv:1711.01501.

[16] R.A. Horn and C.R. Johnson, Matrix Analysis, Cambridge University Press, 2013.

[17] F. Pukelsheim, Optimal Design of Experiments, SIAM, 2006.

[18] G.L. Nemhauser, L.A. Wolsey, and M.L. Fisher, "An analysis of approximations for maximizing submodular set functions—I," Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.

[19] T. Horel and Y. Singer, "Maximization of approximately submodular functions," in Advances in Neural Information Processing Systems, 2016, pp. 3045–3053.

[20] M. Benzi, "Preconditioning techniques for large linear systems: A survey," Journal of Computational Physics, vol. 182, no. 2, pp. 418–477, 2002.

[21] R.D. Braatz and M. Morari, "Minimizing the Euclidean condition number," SIAM Journal on Control and Optimization, vol. 32, no. 6, pp. 1763–1768, 1994.

[22] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.

[23] Digital Equipment Corporation, "EachMovie dataset," http://www.gatsby.ucl.ac.uk/~chuwei/data/EachMovie/.

[24] G. Sagnol, "Computing optimal designs of multiresponse experiments reduces to second-order cone programming," Journal of Statistical Planning and Inference, vol. 141, no. 5, pp. 1684–1708, 2011.

[25] T. Horel, S. Ioannidis, and M. Muthukrishnan, "Budget feasible mechanisms for experimental design," in Latin American Theoretical Informatics Symposium, 2014.

[26] A.A. Ageev and M.I. Sviridenko, "Pipage rounding: A new method of constructing algorithms with proven performance guarantee," Journal of Combinatorial Optimization, vol. 8, no. 3, pp. 307–328, 2004.

[27] L.F.O. Chamon and A. Ribeiro, "Near-optimality of greedy set selection in the sampling of graph signals," in Global Conf. on Signal and Inform. Process., 2016.

[28] E.R. Elenberg, R. Khanna, A.G. Dimakis, and S. Negahban, "Restricted strong convexity implies weak submodularity," 2016, arXiv:1612.00804.

[29] A. Bian, J.M. Buhmann, A. Krause, and S. Tschiatschek, "Guarantees for greedy maximization of non-submodular functions with applications," in Int. Conf. on Mach. Learning, 2017.

[30] A. Krause and V. Cevher, "Submodular dictionary selection for sparse representation," in Int. Conf. on Mach. Learning, 2010.

[31] A. Borodin, D.T.M. Le, and Y. Ye, "Weakly submodular functions," 2014, arXiv:1401.6697v5.