{"title": "Structure Learning for Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1044, "page_last": 1052, "abstract": "We describe a family of global optimization procedures that automatically decompose optimization problems into smaller loosely coupled problems, then combine the solutions of these with message passing algorithms. We show empirically that these methods excel in avoiding local minima and produce better solutions with fewer function evaluations than existing global optimization methods. To develop these methods, we introduce a notion of coupling between variables of optimization that generalizes the notion of coupling that arises from factoring functions into terms that involve small subsets of the variables. It therefore subsumes the notion of independence between random variables in statistics, sparseness of the Hessian in nonlinear optimization, and the generalized distributive law. Despite being more general, this notion of coupling is easier to verify empirically -- making structure estimation easy -- yet it allows us to migrate well-established inference methods on graphical models to the setting of global optimization.", "full_text": "Structure Learning for Optimization\n\nShulin (Lynn) Yang\n\nDepartment of Computer Science\n\nUniversity of Washington\n\nSeattle, WA 98195\n\nyang@cs.washington.edu\n\nAli Rahimi\nRed Bow Labs\n\nBerkeley, CA 94704\n\nali@redbowlabs.com\n\nAbstract\n\nWe describe a family of global optimization procedures that automatically decom-\npose optimization problems into smaller loosely coupled problems. The solutions\nof these are subsequently combined with message passing algorithms. We show\nempirically that these methods produce better solutions with fewer function eval-\nuations than existing global optimization methods. To develop these methods, we\nintroduce a notion of coupling between variables of optimization. 
This notion\nof coupling generalizes the notion of independence between random variables in\nstatistics, sparseness of the Hessian in nonlinear optimization, and the general-\nized distributive law. Despite its generality, this notion of coupling is easier to\nverify empirically, making structure estimation easy, while allowing us to migrate\nwell-established inference methods on graphical models to the setting of global\noptimization.\n\n1 Introduction\nWe consider optimization problems where the objective function is costly to evaluate and may be\naccessed only by evaluating it at requested points. In this setting, the function is a black box, and\nwe have no access to its derivative or its analytical structure. We propose solving such optimization\nproblems by \ufb01rst estimating the internal structure of the black box function, then optimizing the\nfunction with message passing algorithms that take advantage of this structure. This lets us solve\nglobal optimization problems as a sequence of small grid searches that are coordinated by dynamic\nprogramming. We are motivated by the problem of tuning the parameters of computer programs to\nimprove their accuracy or speed. For the programs that we consider, it can take several minutes to\nevaluate these performance measures under a particular parameter setting.\nMany optimization problems exhibit only loose coupling between many of the variables of opti-\nmization. For example, to tune the parameters of an audio-video streaming program, the parameters\nof the audio codec could conceivably be tuned independently of the parameters of the video codec.\nSimilarly, to tune the networking component that glues these codecs together it suf\ufb01ces to consider\nonly a few parameters of the codecs, such as their output bit-rate. 
Such notions of conditional decou-\npling are conveniently depicted in a graphical form that represents the way the objective function\nfactors into a sum or product of terms each involving only a small subset of the variables. This\nfactorization structure can then be exploited by optimization procedures such as dynamic program-\nming on trees or junction trees. Unfortunately, the factorization structure of a function is dif\ufb01cult to\nestimate from function evaluation queries only.\nWe introduce a notion of decoupling that can be more readily estimated from function evaluations.\nAt the same time, this notion of decoupling is more general than the factorization notion of decou-\npling in that functions that do not factorize may still exhibit this type of decoupling. We say that two\nvariables are decoupled if the optimal setting of one variable does not depend on the setting of the\nother. This is formalized below in a way that parallels the notion of conditional decoupling between\nrandom variables in statistics. This parallel allows us to migrate much of the machinery developed\nfor inference on graphical models to global optimization. For example, decoupling can be visual-\nized with a graphical model whose semantics are similar to those of a Markov network. Analogs of\nthe max-product algorithm on trees, the junction tree algorithm, and loopy belief propagation can be\nreadily adapted to global optimization. We also introduce a simple procedure to estimate decoupling\nstructure.\nThe resulting recipe for global optimization is to \ufb01rst estimate the decoupling structure of the objec-\ntive function, then to optimize it with a message passing algorithm that utilizes this structure. The\nmessage passing algorithm relies on a simple grid search to solve the sub-problems it generates. 
In\nmany cases, using the same number of function evaluations, this procedure produces solutions with\nobjective values that improve over those produced by existing global optimizers by as much as 10%.\nThis happens because knowledge of the independence structure allows this procedure to explore the\nobjective function only along directions that cause the function to vary, and because the grid search\nthat solves the sub-problems does not get stuck in local minima.\n2 Related work\nThe idea of estimating and exploiting loose coupling between variables of optimization appears\nimplicitly in Quasi-Newton methods that numerically estimate the Hessian matrix, such as BFGS\n(Nocedal & Wright, 2006, Chap. 6). Indeed, the sparsity pattern of the Hessian indicates the pairs\nof terms that do not interact with each other in a second-order approximation of the function. This\nis strictly a less powerful notion of coupling than the factorization model, which we argue below, is\nin turn less powerful than our notion of decoupling.\nOthers have proposed approximating the objective function while simultaneously optimizing over\nit Srinivas et al. (2010). The procedure we develop here seeks only to approximate decoupling\nstructure of the function, a much simpler task to carry out accurately.\nA similar notion of decoupling has been explored in the decision theory literature Keeney & Raiffa\n(1976); Bacchus & Grove (1996), where decoupling was used to reason about preferences and util-\nities during decision making. In contrast, we use decoupling to solve black-box optimization prob-\nlems and present a practical algorithm to estimate the decoupling structure.\n3 Decoupling between variables of optimization\nA common way to minimize an objective function over many variables is to factorize it into terms,\neach of which involves only a small subset of the variables Aji & McEliece (2000). 
Such a repre-\nsentation, if it exists, can be optimized via a sequence of small optimization problems with dynamic\nprogramming. This insight motivates message passing algorithms for inference on graphical mod-\nels. For example, rather than minimizing the function $f_1(x, y, z) = g_1(x, y) + g_2(y, z)$ over its three\nvariables simultaneously, one can compute the function $g_3(y) = \\min_z g_2(y, z)$, then the function\n$g_4(x) = \\min_y [g_1(x, y) + g_3(y)]$, and \ufb01nally minimize $g_4$ over $x$. A similar idea works for the\nfunction $f_2(x, y, z) = g_1(x, y) g_2(y, z)$ and indeed, whenever the operator that combines the factors\nis associative, commutative, and allows the \u201cmin\u201d operator to distribute over it.\nHowever, it is not necessary for a function to factorize for it to admit a simple dynamic programming\nprocedure. For example, a factorization for the function $f_3(x, y, z) = x^2 y^2 z^2 + x^2 + y^2 + z^2$ is\nelusive, yet the arguments of $f_3$ are decoupled in the sense that the setting of any two variables does\nnot affect the optimal setting of the third. For example, $\\operatorname{argmin}_x f_3(x, y_0, z_0)$ is always $x = 0$, and\nsimilarly for $y$ and $z$. This decoupling allows us to optimize over the variables separately. This is\nnot a trivial property. For example, the function $f_4(x, y, z) = (x - y)^2 + (y - z)^2$ exhibits no\nsuch decoupling between $x$ and $y$ because the minimizer $\\operatorname{argmin}_x f_4(x, y_0, z_0)$ is $y_0$, which is\nobviously a function of the second argument of $f_4$. The following de\ufb01nition formalizes this concept:\nDe\ufb01nition 1 (Blocking and decoupling). Let $f : \\Omega \\to \\mathbb{R}$ be a function on a compact domain and let\n$\\mathcal{X} \\times \\mathcal{Y} \\times \\mathcal{Z} \\subseteq \\Omega$ be a subset of the domain. We say that the coordinates $\\mathcal{Z}$ block $\\mathcal{X}$ from $\\mathcal{Y}$ under\n$f$ if the set of minimizers of $f$ over $\\mathcal{X}$ does not change for any setting of the variables $\\mathcal{Y}$ given a\nsetting of the variables $\\mathcal{Z}$:\n\n$$\\forall\\, Y_1 \\in \\mathcal{Y},\\ Y_2 \\in \\mathcal{Y},\\ Z \\in \\mathcal{Z}: \\quad \\operatorname*{argmin}_{X \\in \\mathcal{X}} f(X, Y_1, Z) = \\operatorname*{argmin}_{X \\in \\mathcal{X}} f(X, Y_2, Z).$$\n\n
We will say that $\\mathcal{X}$ and $\\mathcal{Y}$ are decoupled conditioned on $\\mathcal{Z}$ under $f$, or $\\mathcal{X} \\perp_f \\mathcal{Y} \\mid \\mathcal{Z}$, if $\\mathcal{Z}$ blocks $\\mathcal{X}$\nfrom $\\mathcal{Y}$ and $\\mathcal{Z}$ blocks $\\mathcal{Y}$ from $\\mathcal{X}$ under $f$ at the same time.\nWe will simply say that $\\mathcal{X}$ and $\\mathcal{Y}$ are decoupled, or $\\mathcal{X} \\perp_f \\mathcal{Y}$, when $\\mathcal{X} \\perp_f \\mathcal{Y} \\mid \\mathcal{Z}$, $\\Omega = \\mathcal{X} \\times \\mathcal{Y} \\times \\mathcal{Z}$,\nand $f$ is understood from context.\n\nFor a given function $f(x_1, \\ldots, x_n)$, decoupling between the variables can be represented graphically\nwith an undirected graph analogous to a Markov network:\nDe\ufb01nition 2. A graph $G = (\\{x_1, \\ldots, x_n\\}, E)$ is a coupling graph for a function $f(x_1, \\ldots, x_n)$ if\n$(i, j) \\notin E$ implies $x_i$ and $x_j$ are decoupled under $f$.\nThe following result mirrors the notion of separation in Markov networks and makes it easy to reason\nabout decoupling between groups of variables with coupling graphs (see the appendix for a proof):\nProposition 1. Let $\\mathcal{X}, \\mathcal{Y}, \\mathcal{Z}$ be groups of nodes in a coupling graph for a function $f$. If every path\nfrom a node in $\\mathcal{X}$ to a node in $\\mathcal{Y}$ passes through a node in $\\mathcal{Z}$, then $\\mathcal{X} \\perp_f \\mathcal{Y} \\mid \\mathcal{Z}$.\nFunctions that factorize as a product of terms exhibit this type of decoupling. For subsets of variables\n$\\mathcal{X}, \\mathcal{Y}, \\mathcal{Z}$, we say $\\mathcal{X}$ is conditionally separated from $\\mathcal{Y}$ by $\\mathcal{Z}$ by factorization, or $\\mathcal{X} \\perp_\\otimes \\mathcal{Y} \\mid \\mathcal{Z}$, if $\\mathcal{X}$ and\n$\\mathcal{Y}$ are separated in that way in the Markov network induced by the factorization of $f$. The following\nis a generalization of the familiar result that factorization implies the global Markov property (Koller\n& Friedman, 2009, Thm. 4.3) and follows from Aji & McEliece (2000):\nTheorem 1 (Factorization implies decoupling). Let $f(x_1, \\ldots, x_n)$ be a function on a compact do-\nmain, and let $\\mathcal{X}_1, \\ldots, \\mathcal{X}_S, \\mathcal{X}, \\mathcal{Y}, \\mathcal{Z}$ be subsets of $\\{x_1, \\ldots, x_n\\}$. Let $\\otimes$ be any commutative associa-\ntive semi-ring operator over which the min operator distributes. If $f$ factorizes as\n$f(x_1, \\ldots, x_n) = \\bigotimes_{s=1}^{S} g_s(X_s)$, then $\\mathcal{X} \\perp_f \\mathcal{Y} \\mid \\mathcal{Z}$ whenever $\\mathcal{X} \\perp_\\otimes \\mathcal{Y} \\mid \\mathcal{Z}$.\n
However, decoupling is strictly more powerful than factorization. While $\\mathcal{X} \\perp_\\otimes \\mathcal{Y}$ implies $\\mathcal{X} \\perp_f \\mathcal{Y}$,\nthe reverse is not necessarily true: there exist functions that admit no factorization at all, yet whose\narguments are completely mutually decoupled. Appendix B gives an example.\n4 Optimization procedures that utilize decoupling\nWhen a cost function factorizes, dynamic programming algorithms can be used to optimize over the\nvariables Aji & McEliece (2000). When a cost function exhibits decoupling as de\ufb01ned above, the\nsame dynamic programming algorithms can be applied with a few minor modi\ufb01cations.\nThe algorithms below refer to a function $f$ whose arguments are partitioned over the sets\n$\\mathcal{X}_1, \\ldots, \\mathcal{X}_n$. Let $X_i^*$ denote the optimal value of $X_i \\in \\mathcal{X}_i$. We will take simplifying liberties with\nthe order of the arguments of $f$ when this causes no ambiguity. We will also replace the variables\nthat do not participate in the optimization (per decoupling) with an ellipsis.\n\n4.1 Optimization over trees\nSuppose the coupling graph between some partitioning $\\mathcal{X}_1, \\ldots, \\mathcal{X}_m$ of the arguments of $f$ is tree-\nstructured, in the sense that $\\mathcal{X}_i \\perp_f \\mathcal{X}_j$ unless the edge $(i, j)$ is in the tree. To optimize over $f$ with\ndynamic programming, de\ufb01ne $X_0$ arbitrarily as the root of the tree, let $p_i$ denote the index of the\nparent of $X_i$, and let $C_i^1, C_i^2, \\ldots$ denote the indices of its children. At each leaf node $\\ell$, construct the\nfunctions\n\n$$\\hat{X}_\\ell(X_{p_\\ell}) := \\operatorname*{argmin}_{X_\\ell \\in \\mathcal{X}_\\ell} f(X_\\ell, X_{p_\\ell}). \\tag{1}$$\n\nBy decoupling, the optimal value of $X_\\ell$ depends only on the optimal value of its parent, so $X_\\ell^* = \\hat{X}_\\ell(X_{p_\\ell}^*)$.\nFor all other nodes $i$, de\ufb01ne recursively starting from the parents of the leaf nodes the functions\n\n$$\\hat{X}_i(X_{p_i}) = \\operatorname*{argmin}_{X_i \\in \\mathcal{X}_i} f(X_i, X_{p_i}, \\hat{X}_{C_i^1}(X_i), \\hat{X}_{C_i^2}(X_i), \\ldots). \\tag{2}$$\n\n
Again, the optimal value of $X_i$ depends only on the optimal setting of its parent, $X_{p_i}^*$, and it can be\nveri\ufb01ed that $X_i^* = \\hat{X}_i(X_{p_i}^*)$.\nIn our implementation of this algorithm, to represent a function $\\hat{X}_i(X)$, we discretize its argument\ninto a grid, and store the function as a table. To compute the entries of the table, a subordinate global\noptimizer computes the minimization that appears in the de\ufb01nition of $\\hat{X}_i$.\n\n4.2 Optimization over junction trees\n\nEven when the coupling graph for a function is not tree-structured, a thin junction tree can often be\nconstructed for it. A variant of the above algorithm that mirrors the junction tree algorithm can be\nused to ef\ufb01ciently search for the optima of the function.\nRecall that a tree $T$ of cliques is a junction tree for a graph $G$ if it satis\ufb01es the following three\nproperties: there is one path between each pair of cliques; for each clique $C$ of $G$ there is some\nclique $A$ in $T$ such that $C \\subseteq A$; for each pair of cliques $A$ and $B$ in $T$ that contain node $i$ of $G$, each\nclique on the unique path between $A$ and $B$ also contains $i$.\nThese properties guarantee that $T$ is tree-structured, that it covers all nodes and edges in $G$, and that\ntwo nodes $v$ and $u$ in two different cliques $\\mathcal{X}_i$ and $\\mathcal{X}_j$ are decoupled from each other conditioned on\nthe union of the cliques on the path between $u$ and $v$ in $T$. Many heuristics exist for constructing a\nthin junction tree for a graph Jensen & Graven-Nielsen (2007); Huang & Darwiche (1996).\nTo search for the minimizers of $f$, using a junction tree for its coupling graph, denote by $\\mathcal{X}_{i,j} :=\n\\mathcal{X}_i \\cap \\mathcal{X}_j$ the intersection of the groups of variables $\\mathcal{X}_i$ and $\\mathcal{X}_j$ and by $\\mathcal{X}_{i \\setminus j} = \\mathcal{X}_i \\setminus \\mathcal{X}_j$ the set of nodes\nin $\\mathcal{X}_i$ but not in $\\mathcal{X}_j$. 
At every leaf clique $\\ell$ of the junction tree, construct the function\n\n$$\\hat{X}_\\ell(X_{\\ell,p_\\ell}) := \\operatorname*{argmin}_{X_{\\ell \\setminus p_\\ell} \\in \\mathcal{X}_{\\ell \\setminus p_\\ell}} f(X_\\ell). \\tag{3}$$\n\nFor all other cliques $i$, compute recursively starting from the parents of the leaf cliques\n\n$$\\hat{X}_i(X_{i,p_i}) = \\operatorname*{argmin}_{X_{i \\setminus p_i} \\in \\mathcal{X}_{i \\setminus p_i}} f(X_i, \\hat{X}_{C_i^1}(X_{i,C_i^1}), \\hat{X}_{C_i^2}(X_{i,C_i^2}), \\ldots). \\tag{4}$$\n\nAs before, decoupling between the cliques, conditioned on the intersection of the cliques, guarantees\nthat $\\hat{X}_i(X_{i,p_i}^*) = X_i^*$. And as before, our implementation of this algorithm stores the intermediate\nfunctions as tables by discretizing their arguments.\n\n4.3 Other strategies\n\nWhen the cliques of the junction tree are large, the subordinate optimizations in the above algorithm\nbecome costly. In such cases, the following adaptations of approximate inference algorithms are\nuseful:\n\n\u2022 The algorithm of Section 4.1 can be applied to a maximal spanning tree of the coupling\ngraph.\n\n\u2022 Analogously to Loopy Belief Propagation Pearl (1997), an arbitrary neighbor of each node\ncan be declared as its parent, and the steps of Section 4.1 can be applied to each node until\nconvergence.\n\n\u2022 Loops in the coupling graph can be broken by conditioning on a node in each loop, resulting\nin a tree-structured coupling graph conditioned on those nodes. The optimizer of Section\n4.1 then searches for the minima conditioned on the value of those nodes in the inner loop\nof a global optimizer that searches for good settings for the conditioned nodes.\n\n5 Graph structure learning\nIt is possible to estimate decoupling structure between the arguments of a function $f$ with the help\nof a subordinate optimizer that only evaluates $f$.\nA straightforward application of De\ufb01nition 1 to assess empirically whether groups of variables $\\mathcal{X}$\nand $\\mathcal{Y}$ are decoupled conditioned on a group of variables $\\mathcal{Z}$ would require comparing the minimizer\nof $f$ over $\\mathcal{X}$ for every possible value of $\\mathcal{Z}$ and $\\mathcal{Y}$. 
This is not practical because it is at least as dif\ufb01cult\nas minimizing $f$. Instead, we rely on the following proposition, which follows directly from De\ufb01nition 1:\nProposition 2 (Invalidating decoupling). If for some $Z \\in \\mathcal{Z}$ and $Y_0, Y_1 \\in \\mathcal{Y}$, we have\n$\\operatorname{argmin}_{X \\in \\mathcal{X}} f(X, Y_0, Z) \\neq \\operatorname{argmin}_{X \\in \\mathcal{X}} f(X, Y_1, Z)$, then $\\mathcal{X} \\not\\perp_f \\mathcal{Y} \\mid \\mathcal{Z}$.\nFollowing this result, an approximate coupling graph can be constructed by positing and invalidating\ndecoupling relations. Starting with a graph containing no edges, we consider all groupings $\\mathcal{X} = \\{x_i\\}$,\n$\\mathcal{Y} = \\{x_j\\}$, $\\mathcal{Z} = \\Omega \\setminus \\{x_i, x_j\\}$ of variables $x_1, \\ldots, x_n$. We posit various values of $Z \\in \\mathcal{Z}$, $Y_0 \\in \\mathcal{Y}$\nand $Y_1 \\in \\mathcal{Y}$ under this grouping, and compute the minimizers over $X \\in \\mathcal{X}$ of $f(X, Y_0, Z)$ and\n$f(X, Y_1, Z)$ with a subordinate optimizer. If the minimizers differ, then by the above proposition,\n$\\mathcal{X}$ and $\\mathcal{Y}$ are not decoupled conditioned on $\\mathcal{Z}$, and an edge is added between $x_i$ and $x_j$ in the graph.\nAlgorithm 1 summarizes this procedure.\n\nAlgorithm 1 Estimating the coupling graph of a function.\ninput A function $f : \\mathcal{X}_1 \\times \\cdots \\times \\mathcal{X}_n \\to \\mathbb{R}$, with $\\mathcal{X}_i$ compact; a discretization $\\hat{\\mathcal{X}}_i$ of $\\mathcal{X}_i$; a similarity\nthreshold $\\epsilon > 0$; the number of times, $N_Z$, to sample $Z$.\noutput A coupling graph $G = ([x_1, \\ldots, x_n], E)$.\n$E \\leftarrow \\emptyset$\nfor $i, j \\in [1, \\ldots, n]$; $y_0, y_1 \\in \\hat{\\mathcal{X}}_j$; $1 \\ldots N_Z$ do\n  $Z \\sim U(\\hat{\\mathcal{X}}_1 \\times \\cdots \\times \\hat{\\mathcal{X}}_n \\setminus \\hat{\\mathcal{X}}_i \\times \\hat{\\mathcal{X}}_j)$\n  $\\hat{x}_0 \\leftarrow \\operatorname{argmin}_{x \\in \\hat{\\mathcal{X}}_i} f(x, y_0, Z)$; $\\hat{x}_1 \\leftarrow \\operatorname{argmin}_{x \\in \\hat{\\mathcal{X}}_i} f(x, y_1, Z)$\n  if $\\|\\hat{x}_0 - \\hat{x}_1\\| \\geq \\epsilon$ then\n    $E \\leftarrow E \\cup \\{(i, j)\\}$\n  end if\nend for\n\nIn practice, we \ufb01nd that decoupling relationships are correctly recovered if values of $Y_0$ and $Y_1$ are\nchosen by quantizing $\\mathcal{Y}$ into a set $\\hat{\\mathcal{Y}}$ of 4 to 10 uniformly spaced discrete values and exhaustively\nexamining the settings of $Y_0$ and $Y_1$ in $\\hat{\\mathcal{Y}}$. 
A few values of Z (fewer than \ufb01ve) sampled uniformly at\nrandom from a similarly discretized set \u02c6Z suf\ufb01ce.\n6 Experiments\n\nWe evaluate a two step process for global optimization: \ufb01rst estimating decoupling between vari-\nables using the algorithm of Section 5, then optimizing with this structure using an algorithm from\nSection 4. Whenever Algorithm 1 detects tree-structured decoupling, we use the tree optimizer of\nSection 4.1. Otherwise we either construct a junction tree and apply the junction tree optimizer of\nSection 4.2 if the junction tree is thin, or we approximate the graph with a maximum spanning tree\nand apply the tree solver of Section 4.1.\nWe compare this approach with three state-of-the-art black-box optimization procedures: Direct\nSearch Perttunen et al. (1993) (a deterministic space carving strategy), FIPS Mendes et al. (2004) (a\nbiologically inspired randomized algorithm), and MEGA Hazen & Gupta (2009) (a multiresolution\nsearch strategy with numerically computed gradients). We use a publicly available implementation\nof Direct Search 1, and an implementation of FIPS and MEGA available from the authors of MEGA.\nWe set the number of particles for FIPS and MEGA to the square of the dimension of the problem\nplus one, following the recommendation of their authors.\nAs the subordinate optimizer for Algorithm 1, we use a simple grid search for all our experiments.\nAs the subordinate optimizer for the algorithms of Section 4, we experiment with grid search and\nthe aforementioned state-of-the-art global optimizers.\nWe report results on both synthetic and real optimization problems. For each experiment, we report\nthe quality of the solution each algorithm produces after a preset number of function calls. 
To vary\nthe number of function calls the baseline methods invoke, we vary the number of times they iterate.\nSince our method does not iterate, we vary the number of function calls its subordinate optimizer\ninvokes (when the subordinate optimizer is grid search, we vary the grid resolution).\nThe experiments demonstrate that using grid search as a subordinate strategy is suf\ufb01cient to produce\nbetter solutions than all the other global optimizers we evaluated.\n\n1Available from http://www4.ncsu.edu/~ctk/Finkel_Direct/.\n\nTable 1: Value of the iterates of the functions of Table 2 after 10,000 function evaluations (for\nour approach, this includes the function evaluations for structure learning). MIN is the ground\ntruth optimal value when available. GR is the number of discrete values along each dimension for\noptimization. Direct Search (DIR), FIPS and MEGA are three state-of-the-art algorithms for global\noptimization.\n\nFunction (n=50)  min  GR   Ours     DIR     FIPS     MEGA\nColville         0    100  0        3e-6    3.75     2e-14\nLevy             0    400  0.013    2.80    3.22     4.20\nMichalewicz      n/a  400  -48.9    -18.2   -18.4    -1.3e-3\nRastrigin        0    400  0        0       4.2e-3   23.6\nSchwefel         0    400  8.6      1.9e4   1.4e4    1.6e4\nDixon&Price      0    20   1        0.667   0.914    16.8\nRosenbrock       0    20   0        2.9e4   48.4     5.7e4\nTrid             n/a  20   -2.2e4   -185    3.3e4    -41\nPowell           0    6    19.4     324     0.014    121\n\n6.1 Synthetic objective functions\nWe evaluated the above strategies on a standard benchmark of synthetic optimization problems 2\nshown in Appendix A. These are functions of 50 variables and are used as black-box functions\nin our experiments. In these experiments, the subordinate grid search of Algorithm 1 discretized\neach dimension into four discrete values. The algorithms of Section 4 also used grid search as a\nsubordinate optimizer. 
For this grid search, each dimension was discretized into $GR = (E_{max} / N_{mc})^{1/S_{mc}}$\ndiscrete values where $E_{max}$ is a cap on the number of function evaluations to perform, $S_{mc}$ is the\nsize of the largest clique in the junction tree, and $N_{mc}$ is the number of nodes in the junction tree.\nFigure 1 shows that in all cases, Algorithm 1 recovered decoupling structure exactly even for very\ncoarse grids. Values of NZ greater than 1 did not improve the quality of the recovered graph,\njustifying our heuristic of keeping NZ small. We used NZ = 1 in the remainder of this subsection.\nTable 1 summarizes the quality of the solutions produced by the various algorithms after 10,000\nfunction evaluations. Our approach outperformed the others on most of these problems. As ex-\npected, it performed particularly well on functions that exhibit sparse coupling, such as Levy, Rast-\nrigin, and Schwefel.\nIn addition to achieving better solutions given the same number of function evaluations, our approach\nalso imposed lower computational overhead than the other methods: to process the entire benchmark\nof this section takes our approach 2.2 seconds, while Direct Search, FIPS and MEGA take 5.7\nminutes, 3.7 minutes and 53.3 minutes respectively.\n\n[Figure 1 plots. Panels: Colville, Levy, Rosenbrock, Powell. Bottom x-axis: grid resolution (2 to 6); top x-axis: number of evaluations (4.9e3 to 4.4e4); y-axis ticks: 0% to 100%.]\n\nFigure 1: Very coarse gridding is suf\ufb01cient in Algorithm 1 to correctly recover decoupling structure.\nThe plots show percentage of incorrectly 
recovered edges in the coupling graph on four synthetic\ncost functions as a function of the grid resolution (bottom x-axis) and the number of function evalu-\nations (top x-axis). NZ = 1 in these experiments.\n6.2 Experiments on real applications\n\nWe considered the real-world problem of automatically tuning the parameters of machine vision\nand machine learning programs to improve their accuracy on new datasets. We sought to tune the\nparameters of a face detector, a document topic classi\ufb01er, and a scene recognizer to improve their\naccuracy on new application domains. Automatic parameter tuning allows a user to quickly tune\na program\u2019s default parameters to their speci\ufb01c application domain without tedious trial and error.\nTo perform this tuning automatically, we treated the accuracy of a program as a black box function\nof the parameter values passed to it. These were challenging optimization problems because the\nderivative of the function is elusive and each function evaluation can take minutes. Because the\noutput of a program tends to depend in a structured way on its parameters, our method achieved\nsigni\ufb01cant speedups over existing global optimizers.\n\n2Acquired from http://www-optima.amp.i.kyoto-u.ac.jp/member/student/hedar/Hedar_files/go.htm.\n\n6.2.1 Face detection\nThe \ufb01rst application was a face detector. The program has \ufb01ve parameters: the size, in pixels, of\nthe smallest face to consider; the minimum distance, in pixels, between detected faces; a \ufb02oating\npoint subsampling rate for building a multiresolution pyramid of the input image; a boolean \ufb02ag that\ndetermines whether to apply non-maximal suppression; and the choice of one of four wavelets to\nuse. Our goal was to minimize the detection error rate of this program on the GENKI-SZSL dataset\nof 3,500 faces 3. 
Depending on the parameter settings, evaluating the accuracy of the program on\nthis dataset takes between 2 seconds and 2 minutes.\nAlgorithm 1 was run with a grid search as a subordinate optimizer with three discrete values along\nthe continuous dimensions. It invoked 90 function evaluations and produced a coupling graph\nwherein the \ufb01rst three of the above parameters formed a clique and where the remaining two pa-\nrameters were decoupled from the others. Given this coupling graph, our junction tree optimizer with\ngrid search (with the continuous dimensions quantized into 10 discrete values) invoked 1000 func-\ntion evaluations, and found parameter settings for which the accuracy of the detector was 7% better\nthan the parameter settings found by FIPS and Direct Search after the same number of function eval-\nuations. FIPS and Direct Search fail to improve their solution even after 1800 evaluations. MEGA\nfails to improve over the initial detection error of 50.84% with any number of iterations. To evaluate\nthe accuracy of our method under different numbers of function invocations, we varied the grid res-\nolution between 2 and 12. See Figure 2. These experiments demonstrate how a grid search can help\novercome local minima that cause FIPS and Direct Search to get stuck.\n\n[Figure 2 plot. Series: junction tree solver with grid search, Direct Search, FIPS. x-axis: number of evaluations (0 to 1500); y-axis: face detector classi\ufb01cation error (%), ticks 20 to 80.]\n\nFigure 2: Depending on the number of function evaluations allowed, our method produces parameter\nsettings for the face detector that are better than those recovered by FIPS or Direct Search by as much\nas 7%.\n6.2.2 Scene recognition\nThe second application was a visual scene recognizer. It extracts GIST features Oliva & Torralba\n(2001) from an input image and classi\ufb01es these features with a linear SVM. 
Our task was to tune\nthe six parameters of GIST to improve the recognition accuracy on a subset of the LabelMe dataset\n4, which includes images of scenes such as coasts, mountains, streets, etc. The parameters of the\nrecognizer include a radial cut-off frequency (in cycles/pixel) of a circular \ufb01lter that reduces illumi-\nnation effects, the number of bins in a radial histogram of the response of a spatial \ufb01lter, and\nthe number of image regions in which to compute these histograms. Evaluating the classi\ufb01cation\nerror under a set of parameters requires extracting GIST features with these parameters on a training\nset, training a linear SVM, then applying the extractor and classi\ufb01er to a test set. Each evaluation\ntakes between 10 and 20 minutes depending on the parameter settings.\n\n3Available from http://mplab.ucsd.edu.\n4Available from http://labelme.csail.mit.edu.\n\nAlgorithm 1 was run with a grid search as the subordinate optimizer, discretizing the search space\ninto four discrete values along each dimension. This results in a graph that admits no thin junction\ntree, so we approximate it with a maximal spanning tree. We then apply the tree optimizer of Section\n4.1 using as subordinate optimizers Direct Search, FIPS, and grid search (with \ufb01ve discrete values\nalong each dimension). After a total of roughly 300 function evaluations, the tree optimizer with\nFIPS produces parameters that result in a classi\ufb01cation error of 29.17%. With the same number\nof function evaluations, Direct Search and FIPS produce parameters that result in classi\ufb01cation\nerrors of 33.33% and 31.13% respectively. 
The tree optimizer with Direct Search and grid search as\nsubordinate optimizers resulted in error rates of 31.72% and 33.33%.\nIn this application, the proposed method enjoys only modest gains of about 2% because the variables\nare tightly coupled, as indicated by the denseness of the graph and the thickness of the junction tree.\n\n6.2.3 Multi-class classi\ufb01cation\nThe third application was to tune the hyperparameters of a multi-class SVM classi\ufb01er on the RCV1-\nv2 text categorization dataset 5. This dataset consists of a training set of 23,149 documents and a\ntest set of 781,265 documents each labeled with one of 101 topics Lewis et al. (2004). Our task\nwas to tune the 101 regularization parameters of the 1 vs. all classi\ufb01ers that comprise a multi-class\nclassi\ufb01er. The objective was the so-called macro-average F-score Tague (1981) on the test set. The\nF score for one category is F = 2rp/(r + p), where r and p are the recall and precision rates\nfor that category. The macro-average F score is the average of the F scores over all categories.\nEach evaluation requires training the classi\ufb01er using the given hyperparameters and evaluating the\nresulting classi\ufb01er on the test set, and takes only a second since the text features have been pre-\ncomputed.\nAlgorithm 1 with grid search as a subordinate optimizer with a grid resolution of three discrete values\nalong each dimension found no coupling between the hyperparameters. As a result, the algorithms\nof Section 4.1 reduce to optimizing over each one-dimensional parameter independently. We carried\nout these one-dimensional optimizations with Direct Search, FIPS, and grid search (discretizing each\ndimension into 100 values). After roughly 100,000 evaluations, these resulted in similar scores of\nF = 0.6764, 0.6720, and 0.6743, respectively. 
But with the same number of evaluations, off-the-\nshelf Direct Search and FIPS result in scores of F = 0.6324 and 0.6043, respectively, nearly 11%\nworse.\nThe cost of estimating the structure in this problem was large, since it grows quadratically with the\nnumber of classes, but worth the effort because it indicated that each variable should be optimized\nindependently, ultimately resulting in huge speedups 6.\n7 Conclusion\nWe quanti\ufb01ed the coupling between variables of optimization in a way that parallels the notion of\nindependence in statistics. This lets us identify decoupling between variables in cases where the\nfunction does not factorize, making it strictly stronger than the notion of decoupling in statistical\nestimation. This type of decoupling is also easier to evaluate empirically. Despite these differences,\nthis notion of decoupling allows us to migrate to global optimization many of the message pass-\ning algorithms that were developed to leverage factorization in statistics and optimization. These\ninclude belief propagation and the junction tree algorithm. We show empirically that optimizing\ncost functions by applying these algorithms to an empirically estimated decoupling structure out-\nperforms existing black box optimization procedures that rely on numerical gradients, deterministic\nspace carving, or biologically inspired searches. Notably, we observe that it is advantageous to\ndecompose optimization problems into a sequence of small deterministic grid searches using this\ntechnique, as opposed to employing existing black box optimizers directly.\n\n5Available from http://trec.nist.gov/data/reuters/reuters.html.\n6After running these experiments, we discovered a result of Fan & Lin (2007) showing that optimizing the\nmacro-average F-measure is equivalent to optimizing per-category F-measure, thereby validating decoupling\nstructure recovered by Algorithm 1.\n\n8\n\n\fReferences\nAji, S. and McEliece, R. 
The generalized distributive law and free energy minimization. IEEE Transactions on Information Theory, 46(2), March 2000.\n\nBacchus, F. and Grove, A. Utility independence in a qualitative decision theory. In Proceedings of the 6th International Conference on Principles of Knowledge Representation and Reasoning, pp. 542\u2013552, 1996.\n\nFan, R. E. and Lin, C. J. A study on threshold selection for multi-label classification. Technical report, National Taiwan University, 2007.\n\nHazen, M. and Gupta, M. Gradient estimation in global optimization algorithms. Congress on Evolutionary Computation, pp. 1841\u20131848, 2009.\n\nHuang, C. and Darwiche, A. Inference in belief networks: A procedural guide. International Journal of Approximate Reasoning, 15(3):225\u2013263, 1996.\n\nJensen, F. and Graven-Nielsen, T. Bayesian Networks and Decision Graphs. Springer, 2007.\n\nKeeney, R. L. and Raiffa, H. Decisions with Multiple Objectives: Preferences and Value Trade-offs. Wiley, 1976.\n\nKoller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.\n\nLewis, D., Yang, Y., Rose, T., and Li, F. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004.\n\nMendes, R., Kennedy, J., and Neves, J. The fully informed particle swarm: Simpler, maybe better. IEEE Transactions on Evolutionary Computation, 1(1):204\u2013210, 2004.\n\nNocedal, J. and Wright, S. Numerical Optimization. Springer, 2nd edition, 2006.\n\nOliva, A. and Torralba, A. Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision, 43:145\u2013175, 2001.\n\nPearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1997.\n\nPerttunen, C., Jones, D., and Stuckman, B. 
Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157\u2013181, 1993.\n\nSrinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), 2010.\n\nTague, J. M. The pragmatics of information retrieval experimentation. Information Retrieval Experiment, pp. 59\u2013102, 1981.\n", "award": [], "sourceid": 642, "authors": [{"given_name": "Shulin", "family_name": "Yang", "institution": null}, {"given_name": "Ali", "family_name": "Rahimi", "institution": null}]}