{"title": "Elementary Symmetric Polynomials for Optimal Experimental Design", "book": "Advances in Neural Information Processing Systems", "page_first": 2139, "page_last": 2148, "abstract": "We revisit the classical problem of optimal experimental design (OED) under a new mathematical model grounded in a geometric motivation. Specifically, we introduce models based on elementary symmetric polynomials; these polynomials capture \"partial volumes\" and offer a graded interpolation between the widely used A-optimal and D-optimal design models, obtaining each of them as special cases. We analyze properties of our models, and derive both greedy and convex-relaxation algorithms for computing the associated designs. Our analysis establishes approximation guarantees on these algorithms, while our empirical results substantiate our claims and demonstrate a curious phenomenon concerning our greedy algorithm. Finally, as a byproduct, we obtain new results on the theory of elementary symmetric polynomials that may be of independent interest.", "full_text": "Elementary Symmetric Polynomials for Optimal\n\nExperimental Design\n\nMassachusetts Institute of Technology\n\nMassachusetts Institute of Technology\n\nZelda Mariet\n\nCambridge, MA 02139\nzelda@csail.mit.edu\n\nSuvrit Sra\n\nCambridge, MA 02139\n\nsuvrit@mit.edu\n\nAbstract\n\nWe revisit the classical problem of optimal experimental design (OED) under a\nnew mathematical model grounded in a geometric motivation. Speci\ufb01cally, we\nintroduce models based on elementary symmetric polynomials; these polynomials\ncapture \u201cpartial volumes\u201d and offer a graded interpolation between the widely\nused A-optimal design and D-optimal design models, obtaining each of them as\nspecial cases. We analyze properties of our models, and derive both greedy and\nconvex-relaxation algorithms for computing the associated designs. 
Our analysis establishes approximation guarantees on these algorithms, while our empirical results substantiate our claims and demonstrate a curious phenomenon concerning our greedy method. Finally, as a byproduct, we obtain new results on the theory of elementary symmetric polynomials that may be of independent interest.

1 Introduction

Optimal Experimental Design (OED) develops the theory of selecting experiments to perform in order to estimate a hidden parameter as well as possible. It operates under the assumption that experiments are costly and cannot be run as many times as necessary, or even run once without tremendous difficulty [33]. OED has been applied in a large number of experimental settings [35, 9, 28, 46, 36], and has close ties to related machine-learning problems such as outlier detection [15, 22], active learning [19, 18], and Gaussian-process-driven sensor placement [27], among others.

We revisit the classical setting where each experiment depends linearly on a hidden parameter $\theta \in \mathbb{R}^m$. We assume there are $n$ possible experiments whose outcomes $y_i \in \mathbb{R}$ can be written as

    $y_i = x_i^\top \theta + \epsilon_i$,   $1 \le i \le n$,

where $x_i \in \mathbb{R}^m$ and the $\epsilon_i$ are independent, zero-mean, homoscedastic noises. OED seeks to answer the question: how do we choose a set $S$ of $k$ experiments that allows us to estimate $\theta$ without bias and with minimal variance?

Given a feasible set $S$ of experiments (i.e., one for which $\sum_{i \in S} x_i x_i^\top$ is invertible), the Gauss-Markov theorem shows that the lowest variance for an unbiased estimate $\hat\theta$ satisfies $\mathrm{Var}[\hat\theta] = (\sum_{i \in S} x_i x_i^\top)^{-1}$. However, $\mathrm{Var}[\hat\theta]$ is a matrix, and matrices do not admit a total order, making it difficult to compare different designs.
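To make the comparison problem concrete, the following sketch (our illustration, not from the paper; the function names `covariance` and `loewner_geq` are ours) builds the Gauss-Markov covariance $(\sum_{i \in S} x_i x_i^\top)^{-1}$ for two designs and checks that neither dominates the other in Löwner order, which is why a scalarization $\Phi$ is needed:

```python
import numpy as np

def covariance(X, S):
    # Gauss-Markov: the lowest variance of an unbiased estimate of theta
    # for design S is (sum_{i in S} x_i x_i^T)^{-1}.
    A = X[S].T @ X[S]
    return np.linalg.inv(A)

def loewner_geq(A, B, tol=1e-12):
    # A >= B in Loewner order iff A - B is positive semidefinite.
    return np.linalg.eigvalsh(A - B).min() >= -tol

X = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 3.0]])
V1 = covariance(X, [0, 3])   # design {x_1, x_4}: variance diag(1, 1/9)
V2 = covariance(X, [1, 2])   # design {x_2, x_3}: variance diag(1/9, 1)
# Neither covariance dominates the other: the designs are Loewner-incomparable.
assert not loewner_geq(V1, V2) and not loewner_geq(V2, V1)
# A scalarization such as the trace (A-optimality) still lets us compare them.
assert np.isclose(np.trace(V1), np.trace(V2))
```

Here the two designs happen to tie under the trace; different choices of $\Phi$ break such ties differently, which is exactly the design-criterion question the paper studies.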
Hence, OED is cast as an optimization problem that seeks an optimal design $S^*$:

    $S^* \in \mathrm{argmin}_{S \subseteq [n],\, |S| \le k} \; \Phi\Big( \big( \sum_{i \in S} x_i x_i^\top \big)^{-1} \Big)$,    (1.1)

where $\Phi$ maps positive definite matrices to $\mathbb{R}$ to compare the variances for each design, and may help elicit different properties that a solution should satisfy, either statistical or structural.

Elfving [16] derived some of the earliest theoretical results for the linear dependency setting, focusing on the case where one is interested in reconstructing a predefined linear combination $c^\top \theta$ of the underlying parameters (C-optimal design). Kiefer [26] introduced a more general approach to OED, by considering matrix means on positive definite matrices as a general way of evaluating optimality [33, Ch. 6], and Yu [48] derived general conditions for a map $\Phi$ under which a class of multiplicative algorithms for optimal design has guaranteed monotonic convergence.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Nonetheless, the theory of OED branches into multiple variants of (1.1) depending on the choice of $\Phi$, among which A-optimal design ($\Phi = \mathrm{trace}$) and D-optimal design ($\Phi = \mathrm{determinant}$) are probably the two most popular choices. Each of these choices has a wide range of applications as well as statistical, algorithmic, and other theoretical results. We refer the reader to the classic book [33], which provides an excellent overview and introduction to the topic; see also the summaries in [1, 35]. For A-optimal design, Wang et al. [44] recently derived greedy and convex-relaxation approaches; [11] considers the problem of constrained adaptive sensing, where $\theta$ is supposed sparse.
D-optimal design has historically been more popular, with several approaches to solving the related optimization problem [17, 38, 31, 20]. The dual problem of D-optimality, the Minimum Volume Covering Ellipsoid (MVCE) problem, is also a well-known and deeply studied optimization problem [3, 34, 43, 41, 14, 42]. Experimental design has also been studied in more complex settings: [8] considers Bayesian optimal design, and under certain conditions non-linear settings can be approached with linear OED [13, 25].

Due to the popularity of A- and D-optimal design, the theory surrounding these two sub-problems has diverged significantly. However, both the trace and the determinant are special cases of fundamental spectral polynomials of matrices: elementary symmetric polynomials (ESPs), which have been extensively studied in matrix theory, combinatorics, information theory, and other areas due to their importance in the theory of polynomials [24, 30, 21, 6, 23, 4].

These considerations motivate us to derive a broader view of optimal design, which we call ESP-design, where $\Phi$ is obtained from an elementary symmetric polynomial. This allows us to consider A-optimal design and D-optimal design as special cases of ESP-design, and thus to treat the entire ESP class in a unified manner. Let us state the key contributions of this paper more precisely below.

Contributions

• We introduce ESP-design, a new, general framework for OED that leverages geometric properties of positive definite matrices to interpolate between A- and D-optimality.
ESP-design offers an intuitive setting in which to gradually scale between A-optimal and D-optimal design.

• We develop a convex relaxation as well as greedy algorithms to compute the associated designs. As a byproduct of our convex relaxation, we prove that ESPs are geodesically log-convex on the Riemannian manifold of positive definite matrices; this result may be of independent interest.

• We extend a result of Avron and Boutsidis [2] on determinantal column-subset selection to ESPs; as a consequence we obtain a greedy algorithm with provable optimality bounds for ESP-design.

Experiments on synthetic and real data illustrate the performance of our algorithms and confirm that ESP-design can be used to obtain designs with properties that scale between those of both A- and D-optimal designs, allowing users to tune trade-offs between their different benefits (e.g., predictive error, sparsity, etc.). We show that our greedy algorithm generates designs of equal quality to the famous Fedorov exchange algorithm [17], while running in a fraction of the time.

2 Preliminaries

We begin with some background material that also serves to set our notation. We omit proofs for brevity, as they can be found in standard sources such as [6].

We define $[n] \triangleq \{1, 2, \dots, n\}$. For $S \subseteq [n]$ and $M \in \mathbb{R}^{n \times m}$, we write $M_S$ for the $|S| \times m$ matrix created by keeping only the rows of $M$ indexed by $S$, and $M[S|S']$ for the submatrix with rows indexed by $S$ and columns indexed by $S'$; by $x_{(i)}$ we denote the vector $x$ with its $i$-th component removed. For a vector $v \in \mathbb{R}^m$, the elementary symmetric polynomial (ESP) of order $\ell \in \mathbb{N}$ is defined by

    $e_\ell(v) \triangleq \sum_{1 \le i_1 < \dots < i_\ell \le m} \prod_{j=1}^{\ell} v_{i_j} = \sum_{I \subseteq [m],\, |I| = \ell} \prod_{j \in I} v_j$.    (2.1)

Let $S_m^+$ ($S_m^{++}$) be the cone of positive semidefinite (positive definite) matrices of order $m$. We denote by $\lambda(M)$ the eigenvalues (in decreasing order) of a symmetric matrix $M$. Def.
(2.1) extends to matrices naturally; ESPs are spectral functions, as we set $E_\ell(M) \triangleq e_\ell \circ \lambda(M)$. Additionally, they enjoy another representation that allows us to interpret them as "partial volumes", namely,

    $E_\ell(M) = \sum_{S \subseteq [n],\, |S| = \ell} \det(M[S|S])$.    (2.2)

The following proposition captures basic properties of ESPs that we will require in our analysis.

Proposition 2.1. Let $M \in \mathbb{R}^{m \times m}$ be symmetric and $1 \le \ell \le m$; also let $A, B \in S_m^+$. We have the following properties: (i) if $A \succeq B$ in Löwner order, then $E_\ell(A) \ge E_\ell(B)$; (ii) if $M$ is invertible, then $E_\ell(M^{-1}) = \det(M^{-1})\, E_{m-\ell}(M)$; (iii) $\nabla e_\ell(\lambda) = [e_{\ell-1}(\lambda_{(i)})]_{1 \le i \le m}$.

3 ESP-design

A-optimal design uses $\Phi \equiv \mathrm{tr}$ in (1.1), and thus selects designs with low average variance. Geometrically, this translates into selecting confidence ellipsoids whose bounding boxes have a small diameter. Conversely, D-optimal design uses $\Phi \equiv \det$ in (1.1), and selects vectors that correspond to the ellipsoid with the smallest volume; as a result it is more sensitive to outliers in the data.¹ We introduce a natural model that scales between A- and D-optimal design. Indeed, by recalling that both the trace and the determinant are special cases of ESPs, we obtain a new model as fundamental as A- and D-optimal design, while being able to interpolate between the two in a graded manner. Unless otherwise indicated, we consider that we are selecting experiments without repetition.

3.1 Problem formulation

Let $X \in \mathbb{R}^{n \times m}$ ($m \ll n$) be a design matrix with full column rank, and let $k \in \mathbb{N}$ be the budget ($m \le k \le n$). Define $\Gamma_k = \{ S \subseteq [n] \text{ s.t. }$
$|S| \le k,\; X_S^\top X_S \succ 0 \}$ to be the set of feasible designs that allow unbiased estimates of $\theta$. For $\ell \in \{1, \dots, m\}$, we introduce the ESP-design model:

    $\min_{S \in \Gamma_k} \; f_\ell(S) \triangleq \tfrac{1}{\ell} \log E_\ell\big( (X_S^\top X_S)^{-1} \big)$.    (3.1)

We keep the $1/\ell$ factor in (3.1) to highlight the homogeneity ($E_\ell$ is a polynomial of degree $\ell$) of our design criterion, as advocated in [33, Ch. 6].

For $\ell = 1$, (3.1) yields A-optimal design, while for $\ell = m$ it yields D-optimal design. For $1 < \ell < m$, ESP-design interpolates between these two extremes. Geometrically, we may view it as seeking an ellipsoid with the smallest average volume for $\ell$-dimensional slices (taken across sets of size $\ell$). Alternatively, ESP-design can also be interpreted as a regularized version of D-optimal design via Prop. 2.1-(ii). In particular, for $\ell = m - 1$, we recover a form of regularized D-optimal design:

    $f_{m-1}(S) = \tfrac{1}{m-1} \big[ \log\det\big( (X_S^\top X_S)^{-1} \big) + \log \|X_S\|_F^2 \big]$.

Problem (3.1) is a known hard combinatorial optimization problem (in particular for $\ell = m$), which precludes an exact optimal solution. However, its objective enjoys remarkable properties that help us derive efficient algorithms for its approximate solution. The first of these is based on a natural convex relaxation, obtained below.

3.2 Continuous relaxation

We describe below a traditional approach that relaxes (3.1) by relaxing the constraint on $S$, allowing elements of the set to have fractional multiplicities. The new optimization problem takes the form

    $\min_{z \in \Gamma_k^c} \; \tfrac{1}{\ell} \log E_\ell\big( (X^\top \mathrm{Diag}(z) X)^{-1} \big)$,    (3.2)

where $\Gamma_k^c$ denotes the set of vectors $\{ z \in \mathbb{R}^n \mid 0 \le z_i \le 1 \}$ such that $X^\top \mathrm{Diag}(z) X$ remains invertible and $\mathbf{1}^\top z \le k$. The following is a direct consequence of Prop. 2.1-(i):

Proposition 3.1.
Let $z^*$ be the optimal solution to (3.2). Then $\|z^*\|_1 = k$.

Convexity of $f_\ell$ on $\Gamma_k^c$ (where, by abuse of notation, $f_\ell$ also denotes the continuous relaxation in (3.2)) can be obtained as a consequence of [32]; however, we obtain it as a corollary of Lemma 3.3, which shows that $\log E_\ell$ is geodesically convex. This result seems to be new, and is stronger than convexity of $f_\ell$; hence it may be of independent interest.

¹For a more in-depth discussion of the geometric interpretation of various optimal designs, refer to e.g. [7, Section 7.5].

Definition 3.2 (geodesic convexity). A function $f : S_m^{++} \to \mathbb{R}$ defined on the Riemannian manifold $S_m^{++}$ under the Riemannian metric $g_P(X, Y) = \mathrm{tr}(P^{-1} X P^{-1} Y)$ is called geodesically convex if it satisfies

    $f(P \#_t Q) \le (1 - t) f(P) + t f(Q)$,   $t \in [0, 1]$, $P, Q \succ 0$,

where we use the traditional notation $P \#_t Q := P^{1/2} (P^{-1/2} Q P^{-1/2})^t P^{1/2}$ to denote the geodesic between $P$ and $Q \in S_m^{++}$.

Lemma 3.3. The function $E_\ell$ is geodesically log-convex on the set of positive definite matrices.

Corollary 3.4. The map $M \mapsto E_\ell^{1/\ell}\big( (X^\top M X)^{-1} \big)$ is log-convex on the set of PD matrices.

For further details on the theory of geodesically convex functions on $S_m^{++}$ and their optimization, we refer the reader to [40]. We prove Lemma 3.3 and Corollary 3.4 in Appendix A.

From Corollary 3.4, we immediately obtain that (3.2) is a convex optimization problem, and it can therefore be solved using a variety of efficient algorithms.
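Lemma 3.3 is easy to sanity-check numerically. The following sketch (our code, not the paper's; helper names `mpow`, `geodesic`, and `E_ell` are ours) verifies $E_\ell(P \#_t Q) \le E_\ell(P)^{1-t} E_\ell(Q)^t$ on random positive definite matrices:

```python
import itertools
import numpy as np

def mpow(A, s):
    # Fractional power of a symmetric positive definite matrix via eigendecomposition.
    lam, U = np.linalg.eigh(A)
    return (U * lam**s) @ U.T

def geodesic(P, Q, t):
    # P #_t Q = P^{1/2} (P^{-1/2} Q P^{-1/2})^t P^{1/2}
    Ph, Pmh = mpow(P, 0.5), mpow(P, -0.5)
    return Ph @ mpow(Pmh @ Q @ Pmh, t) @ Ph

def E_ell(M, ell):
    # ESP of the eigenvalues of M (spectral definition of E_ell).
    lam = np.linalg.eigvalsh(M)
    return sum(np.prod(c) for c in itertools.combinations(lam, ell))

rng = np.random.default_rng(1)
m, ell = 5, 3
A, B = rng.standard_normal((m, m)), rng.standard_normal((m, m))
P, Q = A @ A.T + np.eye(m), B @ B.T + np.eye(m)
for t in (0.25, 0.5, 0.75):
    lhs = E_ell(geodesic(P, Q, t), ell)
    rhs = E_ell(P, ell)**(1 - t) * E_ell(Q, ell)**t
    assert lhs <= rhs * (1 + 1e-9)   # geodesic log-convexity (Lemma 3.3)
```

This is only an empirical spot check, of course; the proof is in Appendix A.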
Projected gradient descent turns out to be particularly easy to apply, because we only require projection onto the intersection of the cube $0 \le z \le 1$ and the plane $\{ z \mid \mathbf{1}^\top z = k \}$ (as a consequence of Prop. 3.1). Projection onto this intersection is a special case of the so-called continuous quadratic knapsack problem, which is a very well-studied problem and can be solved essentially in linear time [10, 12].

Remark 3.5. The convex relaxation remains log-convex when points can be chosen with multiplicity, in which case the projection step is also significantly simpler, requiring only $z \ge 0$.

We conclude the analysis of the continuous relaxation by showing a bound on the support of its solution under some mild assumptions:

Theorem 3.6. Let $\phi$ be the mapping from $\mathbb{R}^m$ to $\mathbb{R}^{m(m+1)/2}$ such that $\phi(x) = (\xi_{ij} x_i x_j)_{1 \le i \le j \le m}$, with $\xi_{ij} = 1$ if $i = j$ and $2$ otherwise. Let $\tilde\phi(x) = (\phi(x), 1)$ be the affine version of $\phi$. If the images under $\tilde\phi$ of any set of $m(m+1)/2$ distinct rows of $X$ are independent, then the support of the optimum $z^*$ of (3.2) satisfies $\|z^*\|_0 \le k + \frac{m(m+1)}{2}$.

The proof is identical to that of [44, Lemma 3.5], which shows such a result for A-optimal design; we relegate it to Appendix B.

4 Algorithms and analysis

Solving the convex relaxation (3.2) does not directly provide a solution to (3.1); first, we must round the relaxed solution $z^* \in \Gamma_k^c$ to a discrete solution $S \in \Gamma_k$. We present two possibilities: (i) rounding the solution of the continuous relaxation (§4.1); and (ii) a greedy approach (§4.2).

4.1 Sampling from the continuous relaxation

For conciseness, we concentrate on sampling without replacement, but note that these results extend with minor changes to sampling with replacement (see [44]). Wang et al. [44] discuss the sampling scheme described in Alg.
1 for A-optimal design; the same idea easily extends to ESP-design. In particular, Alg. 1, applied to a solution of (3.2), provides the same asymptotic guarantees as those proven in [44, Lemma 3.2] for A-optimal design.

Algorithm 1: Sample from $z^*$
  Data: budget $k$, $z^* \in \mathbb{R}^n$
  Result: $S$ of size $k$
  $S \leftarrow \emptyset$
  while $|S| < k$ do
    Sample $i \in [n] \setminus S$ uniformly at random
    Sample $x \sim \mathrm{Bernoulli}(z^*_i)$
    if $x = 1$ then $S \leftarrow S \cup \{i\}$
  return $S$

Theorem 4.1. Let $\Sigma_* = X^\top \mathrm{Diag}(z^*) X$, and suppose $\|\Sigma_*^{-1}\|_2\, \kappa(\Sigma_*)\, \|X\|_\infty^2 \log m = O(1)$. The subset $S$ constructed by sampling as above verifies, with probability $p = 0.8$,

    $E_\ell\big( (X_S^\top X_S)^{-1} \big)^{1/\ell} \le O(1) \cdot E_\ell\big( (X_{S^*}^\top X_{S^*})^{-1} \big)^{1/\ell}$.

Theorem 4.1 shows that, under reasonable conditions, we can probabilistically construct a good approximation to the optimal solution in linear time, given the solution $z^*$ to the convex relaxation.

4.2 Greedy approach

In addition to the solution based on convex relaxation, ESP-design admits an intuitive greedy approach, despite not being a submodular optimization problem in general. Here, elements are removed one by one from a base set of experiments; this greedy removal, as opposed to greedy addition, turns out to be much more practical. Indeed, since $f_\ell$ is not defined for sets of size smaller than $k$, it is hard to greedily add experiments to the empty set and then bound the objective function after $k$ items have been added.
This difficulty precludes analyses such as [45, 39] for optimizing non-submodular set functions by bounding their "curvature".

Algorithm 2: Greedy algorithm
  Data: matrix $X$, budget $k$, initial set $S_0$
  Result: $S$ of size $k$
  $S \leftarrow S_0$
  while $|S| > k$ do
    Find $i \in S$ such that $S \setminus \{i\}$ is feasible and $i$ minimizes $f_\ell(S \setminus \{i\})$
    $S \leftarrow S \setminus \{i\}$
  return $S$

Bounding the performance of Algorithm 2 relies on the following lemma.

Lemma 4.2. Let $X \in \mathbb{R}^{n \times m}$ ($n \ge m$) be a matrix with full column rank, and let $k$ be a budget with $m \le k \le n$. Let $S$ of size $k$ be a subset of $[n]$ drawn with probability $P \propto \det(X_S^\top X_S)$. Then

    $\mathbb{E}_{S \sim P}\Big[ E_\ell\big( (X_S^\top X_S)^{-1} \big) \Big] \le \prod_{i=1}^{\ell} \frac{n - m + i}{k - m + i} \cdot E_\ell\big( (X^\top X)^{-1} \big)$,    (4.1)

with equality if $X_S^\top X_S \succ 0$ for all subsets $S$ of size $k$.

Lemma 4.2 extends a result from [2, Lemma 3.9] on column-subset selection via volume sampling to all ESPs. In particular, it follows that removing one element (by volume-sampling a set of size $n - 1$) will in expectation decrease $f_\ell$ by a multiplicative factor, which is clearly also attained by a greedy minimization. This argument then entails the following bound on Algorithm 2's performance. Proofs of both results are in Appendix C.

Theorem 4.3.
Algorithm 2 initialized with a set $S_0$ of size $n_0$ produces a set $S_+$ of size $k$ such that

    $E_\ell\big( (X_{S_+}^\top X_{S_+})^{-1} \big) \le \prod_{j=1}^{\ell} \frac{n_0 - m + j}{k - m + j} \cdot E_\ell\big( (X_{S_0}^\top X_{S_0})^{-1} \big)$.    (4.2)

As Wang et al. [44] note regarding A-optimal design, (4.2) provides a trivial optimality bound on the greedy algorithm when initialized with $S_0 = \{1, \dots, n\}$, since $f_\ell(\{1, \dots, n\}) \le f_\ell(S^*)$ (where $S^*$ denotes the optimal set):

    $E_\ell\big( (X_{S_+}^\top X_{S_+})^{-1} \big)^{1/\ell} \le \frac{n - m + \ell}{k - m + 1} \cdot E_\ell\big( (X_{S^*}^\top X_{S^*})^{-1} \big)^{1/\ell}$.    (4.3)

However, this naive initialization can be replaced by the support of the convex-relaxation solution $z^*$; in the common scenario described by Theorem 3.6, we then obtain the following result:

Theorem 4.4. Let $\tilde\phi$ be the mapping defined in Theorem 3.6, and assume that the images under $\tilde\phi$ of any $m(m+1)/2$ distinct rows of $X$ are always independent. Then the outcome $S_+$ of the greedy algorithm initialized with the support of the solution to the continuous relaxation verifies

    $f_\ell(S_+) \le \log\Big( \frac{k + m(m-1)/2 + \ell}{k - m + 1} \Big) + f_\ell(S^*)$.

4.3 Computational considerations

Computing the $\ell$-th elementary symmetric polynomial of a vector of size $m$ can be done in $O(m \log^2 \ell)$ time using the Fast Fourier Transform for polynomial multiplication, due to the construction introduced by Ben-Or (see [37]); hence, computing $f_\ell(S)$ requires $O(n m^2)$ time, where the cost is dominated by computing $X_S^\top X_S$. Alg. 1 runs in expected time $O(n)$; Alg. 2 costs $O(m^2 n^3)$.

5 Further Implications

We close our theoretical presentation by discussing a potentially important geometric problem related to ESP-design.
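As an aside, the evaluation cost discussed in §4.3 is easy to prototype. The sketch below (our code; for prototyping we use the simpler $O(m\ell)$ dynamic program over the coefficients of $\prod_i (1 + v_i t)$ rather than the asymptotically faster Ben-Or/FFT construction the paper cites):

```python
import numpy as np

def esp(vals, ell):
    # O(m * ell) dynamic program: coef[j] holds e_j of the values seen so far,
    # i.e. the degree-j coefficient of prod_i (1 + v_i * t).
    coef = [1.0] + [0.0] * ell
    for v in vals:
        for j in range(ell, 0, -1):   # reverse order: each v used at most once
            coef[j] += v * coef[j - 1]
    return coef[ell]

def f_ell(X, S, ell):
    # f_ell(S) = (1/ell) log E_ell((X_S^T X_S)^{-1}); the eigenvalues of the
    # inverse are the reciprocals of those of X_S^T X_S.
    lam = np.linalg.eigvalsh(X[list(S)].T @ X[list(S)])
    return np.log(esp(1.0 / lam, ell)) / ell

assert esp([1.0, 2.0, 3.0], 2) == 11.0   # 1*2 + 1*3 + 2*3
```

For $\ell = 1$ this recovers the A-optimality criterion (sum of inverse eigenvalues) and for $\ell = m$ the D-optimality criterion (product), matching the special cases of (3.1).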
In particular, our motivation here is the dual problem of D-optimal design (i.e., the dual of the convex relaxation of D-optimal design): this is nothing but the well-known Minimum Volume Covering Ellipsoid (MVCE) problem, which is of great interest to the optimization community in its own right; see the recent book [42] for an excellent account.

With this motivation, we now develop the dual formulation for ESP-design. We start by deriving $\nabla E_\ell(A)$, for which we recall that $E_\ell(\cdot)$ is a spectral function, whereby the spectral calculus of Lewis [29] becomes applicable, saving us from intractable multilinear algebra [23]. More precisely, say $U^\top \Lambda U$ is the eigendecomposition of $A$, with $U$ unitary. Then, as $E_\ell(A) = e_\ell \circ \lambda(A)$,

    $\nabla E_\ell(A) = U^\top \mathrm{Diag}(\nabla e_\ell(\Lambda)) U = U^\top \mathrm{Diag}\big( e_{\ell-1}(\Lambda_{(i)}) \big) U$.    (5.1)

We can now derive the dual of ESP-design (we consider only $z \ge 0$); in this case problem (3.2) is

    $\sup_{A \succ 0,\, z \ge 0} \; \inf_{\mu \in \mathbb{R},\, H} \; -\tfrac{1}{\ell} \log E_\ell(A) - \mathrm{tr}\big( H (A^{-1} - X^\top \mathrm{Diag}(z) X) \big) - \mu (\mathbf{1}^\top z - k)$,

which admits as dual

    $\inf_{\mu \in \mathbb{R},\, H} \; \sup_{A \succ 0,\, z \ge 0} \; \underbrace{ -\tfrac{1}{\ell} \log E_\ell(A) - \mathrm{tr}(H A^{-1}) }_{g(A)} + \mathrm{tr}(H X^\top \mathrm{Diag}(z) X) - \mu (\mathbf{1}^\top z - k)$.    (5.2)

We easily show that $H \succeq 0$ and that $g$ reaches its maximum on $S_m^{++}$ for $A$ such that $\nabla g = 0$. Rewriting $A = U^\top \Lambda U$, we have

    $\nabla g(A) = 0 \iff \Lambda\, \mathrm{Diag}\big( e_{\ell-1}(\Lambda_{(i)}) \big)\, \Lambda = e_\ell(\Lambda)\, U H U^\top$.

In particular, $H$ and $A$ are co-diagonalizable, with $\Lambda\, \mathrm{Diag}\big( e_{\ell-1}(\Lambda_{(i)}) \big)\, \Lambda = \mathrm{Diag}(h_1, \dots, h_m)$.
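The spectral gradient formula (5.1) can be checked against finite differences; a small sketch (our code, with our helper names; note numpy's `eigh` returns $A = U \Lambda U^\top$ with eigenvectors as columns, so the conjugation is written accordingly):

```python
import itertools
import numpy as np

def e(vals, l):
    # Elementary symmetric polynomial e_l, with the convention e_0 = 1.
    if l == 0:
        return 1.0
    return sum(np.prod(c) for c in itertools.combinations(vals, l))

def E_ell(M, ell):
    return e(np.linalg.eigvalsh(M), ell)

def grad_E_ell(A, ell):
    # (5.1): with A = U diag(lam) U^T, the gradient is
    # U diag(e_{ell-1}(lam with i-th entry removed)) U^T.
    lam, U = np.linalg.eigh(A)
    d = [e(np.delete(lam, i), ell - 1) for i in range(len(lam))]
    return U @ np.diag(d) @ U.T

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)                  # random positive definite matrix
G = grad_E_ell(A, 2)
D = rng.standard_normal((4, 4)); D = (D + D.T) / 2   # symmetric direction
h = 1e-6
fd = (E_ell(A + h * D, 2) - E_ell(A - h * D, 2)) / (2 * h)
assert np.isclose(fd, np.sum(G * D), rtol=1e-4)      # directional derivative match
```

For $\ell = m$ this gradient reduces to $\det(A) A^{-1}$, the familiar gradient of the determinant, consistent with D-optimal design as the $\ell = m$ special case.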
The eigenvalues of $A$ must thus satisfy the system of equations

    $\lambda_i^2\, e_{\ell-1}(\lambda_1, \dots, \lambda_{i-1}, \lambda_{i+1}, \dots, \lambda_m) = h_i\, e_\ell(\lambda_1, \dots, \lambda_m)$,   $1 \le i \le m$.

Let $a(H)$ be such a matrix (notice that $a(H) = \nabla g^*(0)$). Since $f_\ell$ is convex, $g(a(H)) = f_\ell^\star(-H)$, where $f_\ell^\star$ is the Fenchel conjugate of $f_\ell$. Finally, the dual optimization problem is given by

    $\sup_{x_i^\top H x_i \le 1,\, H \succeq 0} f_\ell^\star(-H) \;=\; \sup_{x_i^\top H x_i \le 1,\, H \succeq 0} \tfrac{1}{\ell} \log E_\ell(a(H))$.

Details of the calculation are provided in Appendix D. In the general case, neither $a(H)$ nor even $E_\ell(a(H))$ admits a closed form that we know of. Nevertheless, we recover the well-known duals of A-optimal design and D-optimal design as special cases.

Corollary 5.1. For $\ell = 1$, $a(H) = \mathrm{tr}(H^{1/2}) H^{1/2}$, and for $\ell = m$, $a(H) = H$. Consequently, we recover the dual formulations of A- and D-optimal design.

6 Experimental results

We compared the following methods of solving (3.1):

– UNIF / UNIFFDV: $k$ experiments are sampled uniformly / with Fedorov exchange
– GREEDY / GREEDYFDV: greedy algorithm (relaxed init.) / with Fedorov exchange
– SAMPLE: sampling (relaxed init.) as in Algorithm 1.

We also report the results for the solution of the continuous relaxation (RELAX); the convex optimization was solved using projected gradient descent, the projection being done with the code from [12].

6.1 Synthetic experiments: optimization comparison

We generated the experimental matrix $X$ by sampling $n$ vectors of size $m$ from a multivariate Gaussian distribution with mean 0 and sparse precision matrix $\Sigma^{-1}$ (density $d$ ranging from 0.3 to 0.9).
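This setup, together with the greedy rounding of §4.2, can be prototyped in a few lines. A simplified sketch (our code, not the experimental harness used for the reported results: tiny $n$ and $m$, naive initialization $S_0 = [n]$, no Fedorov exchange, and feasibility is assumed, which holds almost surely for Gaussian data):

```python
import numpy as np

def esp(vals, ell):
    # e_ell via the O(m * ell) coefficient recurrence for prod_i (1 + v_i t).
    coef = [1.0] + [0.0] * ell
    for v in vals:
        for j in range(ell, 0, -1):
            coef[j] += v * coef[j - 1]
    return coef[ell]

def f_ell(X, S, ell):
    # Objective (3.1): (1/ell) log E_ell((X_S^T X_S)^{-1}).
    lam = np.linalg.eigvalsh(X[list(S)].T @ X[list(S)])
    return np.log(esp(1.0 / lam, ell)) / ell

def greedy_design(X, k, ell):
    # Algorithm 2 with S0 = [n]: repeatedly delete the element whose removal
    # yields the smallest objective, until only k experiments remain.
    S = set(range(X.shape[0]))
    while len(S) > k:
        S.remove(min(S, key=lambda i: f_ell(X, S - {i}, ell)))
    return sorted(S)

rng = np.random.default_rng(5)
X = rng.standard_normal((30, 4))
S = greedy_design(X, k=8, ell=2)
assert len(S) == 8
# Sanity check via Prop. 2.1-(i): the full design always has a lower objective.
assert f_ell(X, range(30), 2) <= f_ell(X, S, 2) + 1e-9
```

A full reproduction would add the convex-relaxation initialization (Theorem 4.4) and the Fedorov exchange baselines.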
Due to the runtime of the Fedorov methods, results are reported for only one run; results averaged over multiple iterations (as well as for other distributions over $X$) are provided in Appendix E.

As shown in Fig. 1, the greedy algorithm applied to the convex relaxation's support outperforms sampling from the convex relaxation solution, and does as well as the usual Fedorov algorithm UNIFFDV; GREEDYFDV marginally improves upon the greedy algorithm and UNIFFDV. Strikingly, GREEDY provides designs of comparable quality to UNIFFDV; furthermore, as very few local exchanges improve upon its design, running the Fedorov algorithm with GREEDY initialization is much faster (Table 1). This is confirmed by Table 2, which shows the number of experiments in common for different algorithms: GREEDY and GREEDYFDV only differ on very few elements. As the budget $k$ increases, the difference in performance between SAMPLE, GREEDY, and the continuous relaxation decreases, and the simpler SAMPLE algorithm becomes competitive.
Table 3 reports the support of the continuous relaxation solution for ESP-design with $\ell = 10$.

Table 1: Runtimes (s) ($\ell = 10$, $d = 0.6$)

  k          | 40       | 80       | 120      | 160      | 200
  GREEDY     | 2.8·10^1 | 2.7·10^1 | 3.1·10^1 | 4.0·10^1 | 5.2·10^1
  GREEDYFDV  | 6.6·10^1 | 2.2·10^2 | 3.2·10^2 | 1.2·10^2 | 1.3·10^2
  UNIFFDV    | 1.6·10^3 | 4.1·10^3 | 6.0·10^3 | 6.2·10^3 | 4.7·10^3

Table 2: Common items between solutions ($\ell = 10$, $d = 0.6$)

  k                     | 40 | 80 | 120 | 160 | 200
  |GREEDY ∩ UNIFFDV|    | 26 | 76 | 114 | 155 | 200
  |GREEDY ∩ GREEDYFDV|  | 40 | 78 | 117 | 160 | 200
  |UNIFFDV ∩ GREEDYFDV| | 26 | 75 | 113 | 155 | 200

Table 3: $\|z^*\|_0$ ($\ell = 10$)

  k       | 40     | 80      | 120     | 160     | 200
  d = 0.3 | 93 ± 3 | 117 ± 3 | 148 ± 2 | 181 ± 3 | 213 ± 2
  d = 0.6 | 92 ± 7 | 117 ± 4 | 145 ± 4 | 180 ± 3 | 214 ± 4
  d = 0.9 | 88 ± 3 | 116 ± 3 | 147 ± 4 | 179 ± 3 | 214 ± 1

6.2 Real data

We used the Concrete Compressive Strength dataset [47] (with column normalization) from the UCI repository to evaluate ESP-design on real data; this dataset consists of 1030 possible experiments for modeling concrete compressive strength as a linear combination of 8 physical parameters. In Figure 2(a), OED chose $k$ experiments to run to estimate $\theta$, and we report the normalized prediction error on the remaining $n - k$ experiments. The best choice of OED for this problem is of course A-optimal design, which shows the smallest predictive error. In Figure 2(b), we report the fraction of non-zero entries in the design matrix $X_S$; higher values of $\ell$ correspond to increasing sparsity.
This confirms that OED allows us to scale between the extremes of A-optimal design and D-optimal design to tune desirable side-effects of the design; for example, sparsity in a design matrix can indicate not needing to tune a potentially expensive experimental parameter, which is instead left at its default value.

[Figure 1 shows $f_\ell(S)$ against the budget $k$ for $\ell = 1$ (A-opt), $\ell = 10$, and $\ell = 20$ (D-opt), at densities $d = 0.3, 0.6, 0.9$, comparing GREEDY, GREEDYFDV, SAMPLE, RELAX, UNIF, and UNIFFDV.]

Figure 1: Synthetic experiments, $n = 500$, $m = 30$. The greedy algorithm performs as well as the classical Fedorov approach; as $k$ increases, all designs except UNIF converge towards the continuous relaxation, making SAMPLE the best approach for large designs.

[Figure 2 shows (a) predictive MSE and (b) the ratio of non-zero entries against the budget $k$, for $\ell = 1$ (A-opt), $\ell = 3$, $\ell = 6$, and $\ell = 8$ (D-opt).]

Figure 2: Predicting concrete compressive strength via the greedy method; higher $\ell$ increases the sparsity of the design matrix $X_S$, at the cost of marginally decreasing predictive performance.

7 Conclusion and future work

We introduced the family of ESP-design problems, which evaluate the quality of an experimental design using elementary symmetric polynomials, and showed that typical approaches to optimal design, such as continuous relaxation and greedy algorithms, can be extended to this broad family of problems, which covers A-optimal design and D-optimal design as special cases.

We derived new properties of elementary symmetric polynomials: we showed that they are geodesically log-convex on the space of positive definite matrices, enabling fast solutions to the relaxed ESP optimization problem. We furthermore showed in Lemma 4.2 that volume sampling, applied to the columns of the design matrix $X$, has a constant multiplicative impact on the objective function $E_\ell\big( (X_S^\top X_S)^{-1} \big)$, extending Avron and Boutsidis [2]'s result from the trace to all elementary symmetric polynomials. This allows us to derive a greedy algorithm with performance guarantees, which empirically performs as well as Fedorov exchange, in a fraction of the runtime.

However, our work still includes some open questions: in deriving the Lagrangian dual of the optimization problem, we had to introduce the function $a(H)$, which maps $S_m^{++}$ to itself; although $a(H)$ is known for $\ell = 1$ and $\ell = m$, its form for other values of $\ell$ is unknown, making the dual form a purely theoretical object in the general case. Whether the closed form of $a$ can be derived, or whether $E_\ell(a(H))$ can be obtained with only knowledge of $H$, remains an open problem.
Due to the importance of the dual form of D-optimal design as the Minimum Volume Covering Ellipsoid, we believe that further investigation of the general dual form of ESP-design will provide valuable insight, both into optimal design and into the general theory of optimization.

Acknowledgements

Suvrit Sra acknowledges support from NSF grant IIS-1409802 and DARPA Fundamental Limits of Learning grant W911NF-16-1-0551.

References

[1] A. Atkinson, A. Donev, and R. Tobias. Optimum Experimental Designs, with SAS. Oxford Statistical Science Series. OUP Oxford, 2007.
[2] H. Avron and C. Boutsidis. Faster subset selection for matrices and applications. SIAM J. Matrix Analysis Applications, 34(4):1464–1499, 2013.
[3] E. R. Barnes. An algorithm for separating patterns by ellipsoids. IBM Journal of Research and Development, 26:759–764, 1982.
[4] H. H. Bauschke, O. Güler, A. S. Lewis, and H. S. Sendov. Hyperbolic polynomials and convex analysis. Canad. J. Math., 53(3):470–488, 2001.
[5] R. Bhatia. Matrix Analysis. Springer, 1997.
[6] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2007.
[7] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[8] K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statistical Science, 10:273–304, 1995.
[9] D. A. Cohn. Neural network exploration using optimal experiment design. In Neural Networks, pages 679–686. Morgan Kaufmann, 1994.
[10] R. Cominetti, W. F. Mascarenhas, and P. J. S. Silva. A Newton's method for the continuous quadratic knapsack problem. Mathematical Programming Computation, 6(2):151–169, 2014.
[11] M. A. Davenport, A. K. Massimino, D. Needell, and T. Woolf. Constrained adaptive sensing. IEEE Transactions on Signal Processing, 64(20):5437–5449, 2016.
[12] T. A. Davis, W. W. Hager, and J. T. Hungerford.
An efficient hybrid algorithm for the separable convex quadratic knapsack problem. ACM Trans. Math. Softw., 42(3):22:1–22:25, 2016.
[13] H. Dette, V. B. Melas, and W. K. Wong. Locally D-optimal designs for exponential regression models. Statistica Sinica, 16(3):789–803, 2006.
[14] A. N. Dolia, T. De Bie, C. J. Harris, J. Shawe-Taylor, and D. M. Titterington. The minimum volume covering ellipsoid estimation in kernel-defined feature spaces. In European Conference on Machine Learning, pages 630–637, 2006.
[15] E. N. Dolia, N. M. White, and C. J. Harris. D-optimality for minimum volume ellipsoid with outliers. In Proceedings of the Seventh International Conference on Signal/Image Processing and Pattern Recognition, pages 73–76, 2004.
[16] G. Elfving. Optimum allocation in linear regression theory. Ann. Math. Statist., 23(2):255–262, 1952.
[17] V. Fedorov. Theory of Optimal Experiments. Probability and Mathematical Statistics. Academic Press, 1972.
[18] Y. Gu and Z. Jin. Neighborhood preserving D-optimal design for active learning and its application to terrain classification. Neural Computing and Applications, 23(7):2085–2092, 2013.
[19] X. He. Laplacian regularized D-optimal design for active learning and its application to image retrieval. IEEE Trans. Image Processing, 19(1):254–263, 2010.
[20] T. Horel, S. Ioannidis, and S. Muthukrishnan. Budget Feasible Mechanisms for Experimental Design, pages 719–730. Springer Berlin Heidelberg, 2014.
[21] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[22] D. A. Jackson and Y. Chen. Robust principal component analysis and outlier detection with ecological data. Environmetrics, 15(2):129–139, 2004.
[23] T. Jain. Derivatives for antisymmetric tensor powers and perturbation bounds. Linear Algebra and its Applications, 435(5):1111–1121, 2011.
[24] R. Jozsa and G.
Mitchison. Symmetric polynomials in information theory: Entropy and subentropy. Journal of Mathematical Physics, 56(6), 2015.
[25] A. I. Khuri, B. Mukherjee, B. K. Sinha, and M. Ghosh. Design issues for generalized linear models: A review. Statist. Sci., 21(3):376–399, 2006.
[26] J. Kiefer. Optimal design: Variation in structure and performance under change of criterion. Biometrika, 62:277–288, 1975.
[27] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9(Feb):235–284, 2008.
[28] F. S. Lasheras, J. V. Vilán, P. G. Nieto, and J. del Coz Díaz. The use of design of experiments to improve a neural network model in order to predict the thickness of the chromium layer in a hard chromium plating process. Mathematical and Computer Modelling, 52(7–8):1169–1176, 2010.
[29] A. S. Lewis. Derivatives of spectral functions. Math. Oper. Res., 21(3):576–588, 1996.
[30] I. G. Macdonald. Symmetric Functions and Hall Polynomials. Oxford University Press, 1998.
[31] A. J. Miller and N.-K. Nguyen. A Fedorov exchange algorithm for D-optimal design. Applied Statistics, 43:669–677, 1994.
[32] W. W. Muir. Inequalities concerning the inverses of positive definite matrices. Proceedings of the Edinburgh Mathematical Society, 19(2):109–113, 1974.
[33] F. Pukelsheim. Optimal Design of Experiments. Society for Industrial and Applied Mathematics, 2006.
[34] P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. John Wiley & Sons, Inc., 1987.
[35] G. Sagnol. Optimal design of experiments with application to the inference of traffic matrices in large networks: second order cone programming and submodularity. Theses, École Nationale Supérieure des Mines de Paris, 2010.
[36] A. Schein and L. Ungar.
A-Optimality for Active Learning of Logistic Regression Classifiers, 2004.
[37] A. Shpilka and A. Wigderson. Depth-3 arithmetic circuits over fields of characteristic zero. Computational Complexity, 10(1):1–27, 2001.
[38] S. Silvey, D. Titterington, and B. Torsney. An algorithm for optimal designs on a design space. Communications in Statistics - Theory and Methods, 7(14):1379–1389, 1978.
[39] J. D. Smith and M. T. Thai. Breaking the bonds of submodularity: Empirical estimation of approximation ratios for monotone non-submodular greedy maximization. CoRR, abs/1702.07002, 2017.
[40] S. Sra and R. Hosseini. Conic geometric optimization on the manifold of positive definite matrices. SIAM J. Optimization (SIOPT), 25(1):713–739, 2015.
[41] P. Sun and R. M. Freund. Computation of minimum-volume covering ellipsoids. Operations Research, 52(5):690–706, 2004.
[42] M. Todd. Minimum-Volume Ellipsoids. Society for Industrial and Applied Mathematics, 2016.
[43] L. Vandenberghe, S. Boyd, and S.-P. Wu. Determinant maximization with linear matrix inequality constraints. SIAM J. Matrix Anal. Appl., 19(2):499–533, 1998.
[44] Y. Wang, A. W. Yu, and A. Singh. On computationally tractable selection of experiments in regression models, 2016.
[45] Z. Wang, B. Moran, X. Wang, and Q. Pan. Approximation for maximizing monotone non-decreasing set functions with a greedy method. J. Comb. Optim., 31(1):29–43, 2016.
[46] T. C. Xygkis, G. N. Korres, and N. M. Manousakis. Fisher information based meter placement in distribution grids via the D-optimal experimental design. IEEE Transactions on Smart Grid, PP(99), 2016.
[47] I.-C. Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808, 1998.
[48] Y. Yu. Monotonic convergence of a general algorithm for computing optimal designs. Ann.
Statist., 38(3):1593–1606, 2010.