{"title": "MIMIC: Finding Optima by Estimating Probability Densities", "book": "Advances in Neural Information Processing Systems", "page_first": 424, "page_last": 430, "abstract": null, "full_text": "MIMIC: Finding Optima by Estimating \n\nProbability Densities \n\nJeremy S. De Bonet, Charles L. Isbell, Jr., Paul Viola \n\nCambridge, MA 02139 \n\nArtificial Intelligence Laboratory \n\nMassachusetts Institute of Technology \n\nAbstract \n\nIn many optimization problems, the structure of solutions reflects \ncomplex relationships between the different input parameters. For \nexample, experience may tell us that certain parameters are closely \nrelated and should not be explored independently. Similarly, ex(cid:173)\nperience may establish that a subset of parameters must take on \nparticular values. Any search of the cost landscape should take \nadvantage of these relationships. We present MIMIC, a framework \nin which we analyze the global structure of the optimization land(cid:173)\nscape. A novel and efficient algorithm for the estimation of this \nstructure is derived. We use knowledge of this structure to guide a \nrandomized search through the solution space and, in turn, to re(cid:173)\nfine our estimate ofthe structure. Our technique obtains significant \nspeed gains over other randomized optimization procedures. \n\n1 \n\nIntroduction \n\nGiven some cost function C(x) with local minima, we may search for the optimal \nx in many ways. Variations of gradient descent are perhaps the most popular. \nWhen most of the minima are far from optimal, the search must either include a \nbrute-force component or incorporate randomization. Classical examples include \nSimulated Annealing (SA) and Genetic Algorithms (GAs) (Kirkpatrick, Gelatt and \nVecchi, 1983; Holland, 1975). In all cases, in the process of optimizing C(x) many \nthousands or perhaps millions of samples of C( x) are evaluated. 
Most optimization algorithms take these millions of pieces of information and compress them into a single point x, the current estimate of the solution (one notable exception is GAs, to which we will return shortly). Imagine splitting the search process into two parts, both taking t/2 time steps. Both parts are structurally identical: taking a description of C(), they start their search from some initial point. The sole benefit enjoyed by the second part of the search over the first is that the initial point is perhaps closer to the optimum. Intuitively, there must be some additional information that could be learned from the first half of the search, if only to warn the second half about avoidable mistakes and pitfalls. \n\nWe present an optimization algorithm called Mutual-Information-Maximizing Input Clustering (MIMIC). It attempts to communicate information about the cost function, obtained from one iteration of the search, directly to later iterations of the search. It does this in an efficient and principled way. There are two main components of MIMIC: first, a randomized optimization algorithm that samples from those regions of the input space most likely to contain the minimum of C(); second, an effective density estimator that can be used to capture a wide variety of structure on the input space, yet is computable from simple second-order statistics on the data. MIMIC's results on simple cost functions indicate an order of magnitude improvement in performance over related approaches. Further experiments on a k-color map coloring problem yield similar improvements. \n\n2 Related Work \n\nMany well known optimization procedures neither represent nor utilize the structure of the optimization landscape. 
In contrast, Genetic Algorithms (GAs) attempt to capture this structure by an ad hoc embedding of the parameters onto a line (the chromosome). The intent of the crossover operation in standard genetic algorithms is to preserve and propagate a group of parameters that might be partially responsible for generating a favorable evaluation. Even when such groups exist, many of the offspring generated do not preserve the structure of these groups because the choice of crossover point is random. \n\nIn problems where the benefit of a parameter is completely independent of the value of all other parameters, the population simply encodes information about the probability distribution over each parameter. In this case, the crossover operation is equivalent to sampling from this distribution; the more crossovers the better the sample. Even in problems where fitness is obtained through the combined effects of clusters of inputs, the GA crossover operation is beneficial only when its randomly chosen clusters happen to closely match the underlying structure of the problem. Because of the rarity of such a fortuitous occurrence, the benefit of the crossover operation is greatly diminished. As a result, GAs have a checkered history in function optimization (Baum, Boneh and Garrett, 1995; Lang, 1995). One of our goals is to incorporate insights from GAs in a principled optimization framework. \n\nThere have been other attempts to capture the advantages of GAs. Population Based Incremental Learning (PBIL) attempts to incorporate the notion of a candidate population by replacing it with a single probability vector (Baluja and Caruana, 1995). Each element of the vector is the probability that a particular bit in a solution is on. During the learning process, the probability vector can be thought of as a simple model of the optimization landscape. 
Bits whose values are firmly established have probabilities that are close to 1 or 0. Those that are still unknown have probabilities close to 0.5. \n\nWhen it is the structure of the components of a candidate rather than the particular values of the components that determines how it fares, it can be difficult to move PBIL's representation towards a viable solution. Nevertheless, even in these sorts of problems PBIL often outperforms genetic algorithms, because those algorithms are hindered by the fact that random crossovers are infrequently beneficial. \n\nA very distinct but related technique was proposed by Sabes and Jordan for a reinforcement learning task (Sabes and Jordan, 1995). In their framework, the learner must generate actions so that a reinforcement function can be completely explored. Simultaneously, the learner must exploit what it has learned so as to optimize the long-term reward. Sabes and Jordan chose to construct a Boltzmann distribution from the reinforcement function: p(x) = exp(R(x)/T) / Z_T, where R(x) is the reinforcement function for action x, T is the temperature, and Z_T is a normalization factor. They use this distribution to generate actions. At high temperatures this distribution approaches the uniform distribution and results in random exploration of R(). At low temperatures only those actions which garner large reinforcement are generated. By reducing T, the learner progresses from an initially randomized search to a more directed search about the true optimal action. Interestingly, their estimate for p(x) is to some extent a model of the optimization landscape which is constructed during the learning process. To our knowledge, Sabes and Jordan have neither attempted optimization over high-dimensional spaces, nor attempted to fit p(x) with a complex model. 
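As a minimal sketch of the Boltzmann exploration scheme just described, the sampler below draws actions with probability proportional to exp(R(x)/T) over a finite action set. The function name, action set, and reinforcement values are illustrative assumptions, not taken from Sabes and Jordan:

```python
import math
import random

def boltzmann_sample(actions, reinforcement, temperature, rng):
    """Draw one action from p(x) = exp(R(x)/T) / Z_T over a finite set."""
    weights = [math.exp(reinforcement(a) / temperature) for a in actions]
    # Z_T is the sum of the unnormalized weights; sample by inverting
    # the cumulative distribution.
    r = rng.random() * sum(weights)
    acc = 0.0
    for a, w in zip(actions, weights):
        acc += w
        if r <= acc:
            return a
    return actions[-1]
```

At a low temperature nearly every draw is the highest-reinforcement action; at a high temperature the draws approach uniform, matching the exploration-to-exploitation schedule obtained by annealing T.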
\n\n3 MIMIC \n\nKnowing nothing else about C(x), it might not be unreasonable to search for its minimum by generating points from a uniform distribution over the inputs, p(x). Such a search allows none of the information generated by previous samples to affect the generation of subsequent samples. Not surprisingly, much less work might be necessary if samples were generated from a distribution, p^θ(x), that is uniformly distributed over those x's where C(x) ≤ θ and has a probability of 0 elsewhere. For example, if we had access to p^θM(x) for θM = min_x C(x), a single sample would be sufficient to find an optimum. \n\nThis insight suggests a process of successive approximation: given a collection of points for which C(x) ≤ θ0, a density estimator for p^θ0(x) is constructed. From this density estimator additional samples are generated, a new threshold established, θ1 = θ0 - ε, and a new density estimator created. The process is repeated until the values of C(x) cease to improve. \n\nThe MIMIC algorithm begins by generating a random population of candidates chosen uniformly from the input space. From this population the median fitness is extracted and is denoted θ0. The algorithm then proceeds: \n\n1. Update the parameters of the density estimator of p^θi(x) from a sample. \n2. Generate more samples from the distribution p^θi(x). \n3. Set θi+1 equal to the Nth percentile of the data. Retain only the points less than θi+1. \n\nThe validity of this approach is dependent on two critical assumptions: p^θ(x) can be successfully approximated with a finite amount of data; and D(p^θ-ε(x) || p^θ(x)) is small enough so that samples from p^θ(x) are also likely to be samples from p^θ-ε(x) (where D(p||q) is the Kullback-Leibler divergence between p and q). Bounds on these conditions can be used to prove convergence in a finite number of successive approximation steps. 
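The successive-approximation loop above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the density model here is a product of independent per-bit marginals (a stand-in for the pairwise-chain estimator developed below), a fixed iteration count replaces the stopping rule, and all names and parameter values are ours:

```python
import random

def mimic_minimize(cost, n_bits, n_samples=200, keep=0.5, iters=30, seed=0):
    # Initial population: candidates chosen uniformly from the input space.
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_samples)]
    for _ in range(iters):
        # Step 3: retain only points with C(x) <= theta_i, the
        # Nth-percentile cutoff of the current population's costs.
        pop.sort(key=cost)
        survivors = pop[: max(2, int(n_samples * keep))]
        # Step 1: refit the density estimator to the surviving sample
        # (independent per-bit marginals, purely for brevity here).
        p1 = [sum(x[j] for x in survivors) / len(survivors)
              for j in range(n_bits)]
        # Step 2: draw a fresh population from the fitted model.
        pop = [[1 if rng.random() < p1[j] else 0 for j in range(n_bits)]
               for _ in range(n_samples)]
    return min(pop, key=cost)
```

On a toy cost such as the number of 1 bits, the marginals collapse toward the minimizer within a few dozen iterations.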
\n\nThe performance of this approach is dependent on the nature of the density approximator used. We have chosen to estimate the conditional distributions for every pair of parameters in the representation, a total of O(n^2) numbers. In the next section we will show how we use these conditional distributions to construct a joint distribution which is closest in the KL sense to the true joint distribution. Such an approximator is capable of representing clusters of highly related parameters. While this might seem similar to the intuitive behavior of crossover, this representation is strictly more powerful. More importantly, our clusters are learned from the data, and are not pre-defined by the programmer. \n\n4 Generating Events from Conditional Probabilities \n\nThe joint probability distribution over a set of random variables, X = {Xi}, is: \n\np(X) = p(X1 | X2 ... Xn) p(X2 | X3 ... Xn) ... p(Xn-1 | Xn) p(Xn) \n\nGiven only pairwise conditional probabilities, p(Xi | Xj), and unconditional probabilities, p(Xi), we are faced with the task of generating samples that match as closely as possible the true joint distribution, p(X). It is not possible to capture all possible joint distributions of n variables using only the unconditional and pairwise conditional probabilities; however, we would like to describe the true joint distribution as closely as possible. Below, we derive an algorithm for choosing such a description. Given a permutation of the numbers between 1 and n, π = i1 i2 ... in, we define a class of probability distributions, pπ(X): \n\npπ(X) = p(Xi1 | Xi2) p(Xi2 | Xi3) ... p(Xin-1 | Xin) p(Xin) (1) \n\nThe distribution pπ(X) uses π as an ordering for the pairwise conditional probabilities. Our goal is to choose the permutation π that maximizes the agreement between pπ(X) and the true distribution p(X). 
The agreement between two distributions can be measured by the Kullback-Leibler divergence: \n\nD(p || pπ) = ∫ p [log p - log pπ] dX \n= Ep[log p] - Ep[log pπ] \n= -h(p) - Ep[log p(Xi1 | Xi2) p(Xi2 | Xi3) ... p(Xin-1 | Xin) p(Xin)] \n= -h(p) + h(Xi1 | Xi2) + h(Xi2 | Xi3) + ... + h(Xin-1 | Xin) + h(Xin). (2) \n\nThis divergence is always non-negative, with equality only in the case where pπ(X) and p(X) are identical distributions. The optimal π is defined as the one that minimizes this divergence. For a distribution that can be completely described by pairwise conditional probabilities, the optimal π will generate a distribution that will be identical to the true distribution. Insofar as the true distribution cannot be captured this way, the optimal pπ(X) will diverge from that distribution. \n\nThe first term in the divergence does not depend on π. Therefore, the cost function, Jπ(X), we wish to minimize is: \n\nJπ(X) = h(Xi1 | Xi2) + h(Xi2 | Xi3) + ... + h(Xin-1 | Xin) + h(Xin). (3) \n\nThe optimal π is the one that produces the lowest pairwise entropy with respect to the true distribution. By searching over all n! permutations, it is possible to determine the optimal π. In the interests of computational efficiency, we employ a straightforward greedy algorithm to pick a permutation: \n\n1. in = argmin_j h(Xj). \n2. ik = argmin_j h(Xj | Xik+1), where j ≠ ik+1, ..., in and k = n-1, n-2, ..., 2, 1, \n\nwhere h() is the empirical entropy. Once a distribution is chosen, generating samples is also straightforward: \n\n1. Choose a value for Xin based on its empirical probability p(Xin). \n2. For k = n-1, n-2, ..., 2, 1, choose a value for Xik based on the empirical conditional probability p(Xik | Xik+1). \n\nThe first algorithm runs in time O(n^2) and the second in time O(n). 
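The two procedures above can be sketched compactly for binary variables. The empirical entropies are computed directly from a sample; the names and tie-breaking order are ours, and the sampler falls back to the marginal when a parent value was never observed (a detail the text does not specify):

```python
import math
import random
from collections import Counter

def entropy(samples, j):
    # Empirical entropy h(X_j).
    n = len(samples)
    return -sum((c / n) * math.log(c / n)
                for c in Counter(x[j] for x in samples).values())

def cond_entropy(samples, j, k):
    # h(X_j | X_k) = h(X_j, X_k) - h(X_k).
    n = len(samples)
    joint = Counter((x[j], x[k]) for x in samples)
    h_joint = -sum((c / n) * math.log(c / n) for c in joint.values())
    return h_joint - entropy(samples, k)

def greedy_chain(samples, n_vars):
    # Returns [i_n, i_{n-1}, ..., i_1]: the minimum-entropy root first,
    # then each variable with minimal conditional entropy given the
    # previously chosen one.
    remaining = list(range(n_vars))
    order = [min(remaining, key=lambda j: entropy(samples, j))]
    remaining.remove(order[0])
    while remaining:
        nxt = min(remaining, key=lambda j: cond_entropy(samples, j, order[-1]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def sample_chain(samples, order, rng):
    # Ancestral sampling along the chain using empirical frequencies.
    x = [None] * len(samples[0])
    x[order[0]] = rng.choice(samples)[order[0]]          # marginal p(X_{i_n})
    for parent, child in zip(order, order[1:]):
        matches = [s[child] for s in samples if s[parent] == x[parent]]
        x[child] = rng.choice(matches) if matches else rng.choice(samples)[child]
    return x
```

On data in which two variables are perfectly correlated, the greedy step places them adjacent in the chain, and every generated sample preserves the correlation.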
\n\n5 Experiments \n\nTo measure the performance of MIMIC, we performed three benchmark experiments and compared our results with those obtained using several standard optimization algorithms. \n\nWe will use four algorithms in our comparisons: \n\n1. MIMIC - the algorithm above with 200 samples per iteration \n2. PBIL - standard population based incremental learning \n3. RHC - randomized hill climbing \n4. GA - a standard genetic algorithm with single crossover and 10% mutation rate \n\n5.1 Four Peaks \n\nThe Four Peaks problem is taken from (Baluja and Caruana, 1995). Given an N-dimensional input vector X, the four peaks evaluation function is defined as: \n\nf(X, T) = max[tail(0, X), head(1, X)] + R(X, T) (4) \n\nwhere \n\ntail(b, X) = number of trailing b's in X (5) \nhead(b, X) = number of leading b's in X (6) \nR(X, T) = N if tail(0, X) > T and head(1, X) > T, and 0 otherwise (7) \n\nThere are two global maxima for this function. They are achieved either when there are T + 1 leading 1's followed by all 0's, or when there are T + 1 trailing 0's preceded by all 1's. There are also two suboptimal local maxima that occur with a string of all 1's or all 0's. For large values of T, this problem becomes increasingly more difficult because the basins of attraction for the inferior local maxima become larger. \n\nResults for running the algorithms are shown in figure 1. In all trials, T was set to be 10% of N, the total number of inputs. The MIMIC algorithm consistently maximizes the function with approximately one tenth the number of evaluations required by the second best algorithm. 
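Equations (4)-(7) transcribe directly into code; the function and variable names below are our own, with X represented as a 0/1 list:

```python
def four_peaks(x, t):
    # head(1, x): length of the leading run of 1's;
    # tail(0, x): length of the trailing run of 0's.
    head1 = next((i for i, b in enumerate(x) if b != 1), len(x))
    tail0 = next((i for i, b in enumerate(reversed(x)) if b != 0), len(x))
    # R(x, t) = N when both runs exceed the threshold t.
    reward = len(x) if (tail0 > t and head1 > t) else 0
    return max(tail0, head1) + reward
```

A string of T + 1 leading 1's followed by all 0's scores 2N - (T + 1), while the all-ones string scores only N, illustrating the two kinds of maxima described above.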
\n\nFigure 1: Number of evaluations of the Four-Peak cost function for different algorithms, plotted for a variety of problem sizes. \n\n5.2 Six Peaks \n\nThe Six Peaks problem is a slight variation on Four Peaks where \n\nR(X, T) = N if (tail(0, X) > T and head(1, X) > T) or (tail(1, X) > T and head(0, X) > T), and 0 otherwise (8) \n\nThis function has two additional global maxima, where there are T + 1 leading 0's followed by all 1's or where there are T + 1 trailing 1's preceded by all 0's. In this case, it is not the values of the candidates that is important, but their structure: the first T + 1 positions should take on the same value, the last T + 1 positions should take on the same value, these two groups should take on different values, and the middle positions should all take on the same value. \n\nResults for this problem are shown in figure 2. As might be expected, PBIL performed worse than on the Four Peaks problem because it tends to oscillate in the middle of the space while contradictory signals pull it back and forth. The random crossover operation of the GA occasionally was able to capture some of the underlying structure, resulting in an improved relative performance of the GA. As we expected, the MIMIC algorithm was able to capture the underlying structure of the problem, and combine information from all the maxima. Thus MIMIC consistently maximizes the Six Peaks function with approximately one fiftieth the number of evaluations required by the other algorithms. 
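The Six Peaks reward in equation (8) transcribes the same way. For the mirrored strings to score as full global maxima, the base term must also consider tail(1, X) and head(0, X); the excerpt leaves this detail implicit, so the symmetric form below is our reading, and the helper names are ours:

```python
def _run(x, b, from_front):
    # Length of the run of value b at the front (or back) of x.
    seq = x if from_front else list(reversed(x))
    return next((i for i, v in enumerate(seq) if v != b), len(seq))

def six_peaks(x, t):
    head1, head0 = _run(x, 1, True), _run(x, 0, True)
    tail0, tail1 = _run(x, 0, False), _run(x, 1, False)
    # R(x, t) = N if either mirrored pair of runs exceeds t (equation 8).
    reward = len(x) if (tail0 > t and head1 > t) or (tail1 > t and head0 > t) else 0
    return max(tail0, head1, tail1, head0) + reward
```

Under this reading, the four structured strings all score 2N - (T + 1), while the uniform strings remain suboptimal local maxima at N.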
\n\n5.3 Max K-Coloring \n\nA graph is K-Colorable if it is possible to assign one of K colors to each of the nodes of the graph such that no adjacent nodes have the same color. Determining whether a graph is K-Colorable is known to be NP-Complete. Here, we define Max K-Coloring to be the task of finding a coloring that minimizes the number of adjacent pairs colored the same. \n\nResults for this problem are shown in figure 2. We used a subset of graphs with a single solution (up to permutations of color) so that the optimal solution is dependent only on the structure of the parameters. Because of this, PBIL performs poorly. GAs perform better because any crossover point is representative of some of the underlying structure of the graphs used. Finally, MIMIC performs best because it is able to capture all of the structural regularity within the inputs. \n\nFigure 2: Number of evaluations of the Six-Peak cost function (left) and the K-Color cost function (right) for a variety of problem sizes. \n\n6 Conclusions \n\nWe have described MIMIC, a novel optimization algorithm that converges faster and more reliably than several other existing algorithms. MIMIC accomplishes this in two ways. First, it performs optimization by successively approximating the conditional distribution of the inputs given a bound on the cost function. 
Throughout this process, the optimum of the cost function becomes gradually more likely. As a result, MIMIC directly communicates information about the cost function from the early stages to the later stages of the search. Second, MIMIC attempts to discover common underlying structure about optima by computing second-order statistics and sampling from a distribution consistent with those statistics. \n\nAcknowledgments \n\nIn this research, Jeremy De Bonet is supported by the DOD Multidisciplinary Research Program of the University Research Initiative, Charles Isbell by a fellowship granted by AT&T Labs-Research, and Paul Viola by Office of Naval Research Grant No. N00014-96-1-0311. Greg Galperin helped in the preparation of this paper. \n\nReferences \n\nBaluja, S. and Caruana, R. (1995). Removing the genetics from the standard genetic algorithm. Technical report, Carnegie Mellon University. \n\nBaum, E. B., Boneh, D., and Garrett, C. (1995). Where genetic algorithms excel. In Proceedings of the Conference on Computational Learning Theory, New York. Association for Computing Machinery. \n\nHolland, J. H. (1975). Adaptation in Natural and Artificial Systems. The University of Michigan Press. \n\nKirkpatrick, S., Gelatt, C., and Vecchi, M. (1983). Optimization by simulated annealing. Science, 220(4598):671-680. \n\nLang, K. (1995). Hill climbing beats genetic search on a boolean circuit synthesis problem of Koza's. In Twelfth International Conference on Machine Learning. \n\nSabes, P. N. and Jordan, M. I. (1995). Reinforcement learning by probability matching. In Touretzky, D. S., Mozer, M. M., and Perrone, M., editors, Advances in Neural Information Processing Systems, volume 8, Denver 1995. MIT Press, Cambridge. 
\n\n\f", "award": [], "sourceid": 1328, "authors": [{"given_name": "Jeremy", "family_name": "De Bonet", "institution": null}, {"given_name": "Charles", "family_name": "Isbell", "institution": null}, {"given_name": "Paul", "family_name": "Viola", "institution": null}]}