{"title": "The Noisy Euclidean Traveling Salesman Problem and Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 351, "page_last": 358, "abstract": null, "full_text": "The Noisy Euclidean Traveling Salesman Problem and Learning \n\nMikio L. Braun, Joachim M. Buhmann \nbraunm@cs.uni-bonn.de, jb@cs.uni-bonn.de \nInstitute for Computer Science, Dept. III, University of Bonn \nRömerstraße 164, 53117 Bonn, Germany \n\nAbstract \n\nWe consider noisy Euclidean traveling salesman problems in the plane, which are random combinatorial problems with underlying structure. Gibbs sampling is used to compute average trajectories, which estimate the underlying structure common to all instances. This procedure requires identifying the exact relationship between permutations and tours. In a learning setting, the average trajectory is used as a model to construct solutions to new instances sampled from the same source. Experimental results show that the average trajectory can in fact estimate the underlying structure, and that overfitting effects occur if the trajectory adapts too closely to a single instance. \n\n1 Introduction \n\nThe approach in combinatorial optimization is traditionally single-instance and worst-case-oriented: an algorithm is tested against the worst possible single instance. In reality, algorithms are often applied to a large number of related instances, the average-case performance being the measurement of interest. This constitutes a completely different problem: given a set of similar instances, construct solutions which are good on average. We call this kind of problem multiple-instance and average-case-oriented. Since the instances share some information, this problem might be expected to be simpler than solving all instances separately, even for NP-hard problems. 
\nWe will study the following example of a multiple-instance average-case problem, which is built from the Euclidean traveling salesman problem (TSP) in the plane. Consider a salesman who makes weekly trips. At the beginning of each week, the salesman has a new set of appointments for the week, for which he has to plan the shortest round-trip. The locations of the appointments will not be completely random, because there are certain areas which have a higher probability of containing an appointment, for example cities, or business districts within cities. Instead of solving the planning problem each week from scratch, a clever salesman will try to exploit the underlying density and have a rough trip pre-planned, which he will only adapt from week to week. \n\nAn idealizing formalization of this setting is as follows. Fix the number of appointments n ∈ ℕ. Let x_1, ..., x_n ∈ ℝ² and σ ∈ ℝ₊. Then, the locations of the appointments for each week are given as samples from the normally distributed random vectors (i ∈ {1, ..., n}) \n\nX_i ∼ N(x_i, σ²·I_2). (1) \n\nThe random vector (X_1, ..., X_n) will be called a scenario, sampled appointment locations a (sampled) instance. The task consists in finding the permutation π ∈ S_n which minimizes π ↦ d_{π(n)π(1)} + Σ_{i=1}^{n-1} d_{π(i)π(i+1)}, where d_{ij} := ||X_i − X_j||_2 and S_n is the set of all bijective functions on the set {1, ..., n}. Typical examples are depicted in figure 1(a)-(c). \n\nIt turns out that the multiple-instance average-case setting is related to learning theory, especially to the theory of cost-based unsupervised learning. This relationship becomes clear if one considers the performance measure of interest. The algorithm takes a set of instances I_1, ..., I_n as input and outputs a number of solutions s_1, ..., s_n. It is then measured by the average performance (1/n) Σ_{k=1}^{n} C(s_k, I_k), where C(s, I) denotes the cost of solution s on instance I. We now modify the performance measure as follows. Given a finite number of instances I_1, ..., I_n, the algorithm has to construct a solution s' on a newly sampled instance I'. The performance is then measured by the expected cost E(C(s', I')). This can be interpreted as a learning task: the instances I_1, ..., I_n are the training data, E(C(s', I')) is the analogue of the expected risk or cost, and the set of solutions is identified with the hypothesis class in learning theory. \n\nIn this paper, the setting presented in the previous paragraph is studied with the further restriction that only one training instance is present. From this training instance, an average solution is constructed, represented by a closed curve in the plane. This average trajectory is supposed to capture the essential structure of the underlying probability density, similar to the centroids in K-means clustering. Then, the average trajectory is used as a seed for a simple heuristic which constructs solutions on newly drawn instances. The average trajectories are computed by geometrically averaging tours which are drawn by a Gibbs sampler at finite temperature. This will be discussed in detail in sections 2 and 3. It turns out that the temperature acts as a scale or smoothing parameter. A few comments concerning the selection of this parameter are given in section 6. \n\nThe technical content of our approach is reminiscent of the \"elastic net\" approaches of Durbin and Willshaw (see [2], [5]), but differs in many points. It is based on a completely different algorithmic approach using Gibbs sampling and a general technique for averaging tours. 
Our algorithm has polynomial complexity per Monte Carlo step, and convergence is guaranteed by the usual bounds for Markov chain Monte Carlo simulation and Gibbs sampling. Furthermore, the goal is not to provide a heuristic for computing the best solution, but to extract the relevant statistics of the Gibbs distribution at finite temperature to generate the average trajectory, which will be used to compute solutions on future instances. \n\n2 The Metropolis algorithm \n\nThe Metropolis algorithm is a well-known algorithm which simulates a homogeneous Markov chain whose distribution converges to the Gibbs distribution. We assume that the reader is familiar with the concepts; we give here only a brief sketch of the relevant results and refer to [6], [3] for further details. Let M be a finite set and f: M → ℝ. The Gibbs distribution at temperature T ∈ ℝ₊ is given by (m ∈ M) \n\ng_T(m) := exp(−f(m)/T) / Σ_{m' ∈ M} exp(−f(m')/T). (2) \n\nThe Metropolis algorithm works as follows. We start with any element m ∈ M and set X_1 ← m. For i ≥ 2, apply a random local update m' := φ(X_i). Then set \n\nX_{i+1} := m' with probability min{1, exp(−(f(m') − f(X_i))/T)}, and X_{i+1} := X_i else. (3) \n\nThis scheme converges to the Gibbs distribution if certain conditions on φ are met. Furthermore, an L²-law of large numbers holds: for h: M → ℝ, (1/n) Σ_{k=1}^{n} h(X_k) → Σ_{m ∈ M} g_T(m) h(m) in L². For the TSP, M = S_n and φ is the Lin-Kernighan two-change [4], which consists in choosing two indices i, j at random and reversing the path between the appointments i and j. Note that the Lin-Kernighan two-change and its generalizations for neighborhood search are powerful heuristics in themselves. \n\n3 Averaging Tours \n\nOur goal is to compute the average trajectory, which should grasp the underlying structure common to all instances, with respect to the Gibbs measure at non-zero temperature T. 
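The sampling step of section 2, with the two-change as the local update φ, can be sketched in a few lines (a minimal illustration in our own notation; the function names and the generator interface are not from the paper):

```python
import math
import random

def tour_length(perm, dist):
    """Cost f of the closed tour represented by the permutation `perm`."""
    n = len(perm)
    return sum(dist[perm[i]][perm[(i + 1) % n]] for i in range(n))

def two_change(perm):
    """Two-change move: reverse the path between two random positions."""
    i, j = sorted(random.sample(range(len(perm)), 2))
    return perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]

def metropolis(dist, n_steps, T):
    """Metropolis chain on S_n at temperature T; yields one permutation per step."""
    perm = list(range(len(dist)))
    cost = tour_length(perm, dist)
    for _ in range(n_steps):
        cand = two_change(perm)
        cand_cost = tour_length(cand, dist)
        # Accept with probability min{1, exp(-(f(m') - f(X_i))/T)}, cf. eq. (3).
        if cand_cost <= cost or random.random() < math.exp(-(cand_cost - cost) / T):
            perm, cost = cand, cand_cost
        yield list(perm)
```

In practice only every k-th sample would be kept, to decouple consecutive states as in the experiments of section 5.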
The Metropolis algorithm produces a sequence of permutations π_1, π_2, ... with P{π_n = ·} → g_T(·) for n → ∞. Since permutations cannot be added, we cannot simply compute the empirical mean of the π_n. Instead, we map permutations to their corresponding trajectories. \n\nDefinition 1 (trajectory) The trajectory of π ∈ S_n given n points X_1, ..., X_n is a mapping Γ(π): {1, ..., n} → ℝ² defined by Γ(π)(i) := X_{π(i)}. The set of all trajectories (for all sets of n points) is denoted by T_n (this is the set of all mappings γ: {1, ..., n} → ℝ²). \n\nAddition of trajectories and multiplication with scalars can be defined pointwise. Then it is technically possible to compute (1/t) Σ_{k=1}^{t} Γ(π_k). Unfortunately, this does not yield the desired results, since the relation between permutations and tours is not one-to-one. For example, the permutation obtained by starting the tour at a different city still corresponds to the same tour. We therefore need to define the addition of trajectories in a way which is independent of the choice of permutation (and therefore trajectory) to represent the tour. We will first study the relationship between tours and permutations in some detail, since we feel that the concepts introduced here might be generally useful for analyzing combinatorial optimization problems. \n\nDefinition 2 (tour and length of a tour) Let G = (V, E) be a complete (undirected) graph with V = {1, ..., n} and E = {{i, j} | i, j ∈ V, i ≠ j}. A subset t ⊆ E is called a tour iff |t| = n, for every v ∈ V there exist exactly two e_1, e_2 ∈ t such that v ∈ e_1 and v ∈ e_2, and (V, t) is connected. Given a symmetric matrix (d_{ij}) of distances, the length of a tour t is defined by C(t) := Σ_{{i,j} ∈ t} d_{ij}. \n\nThe tour corresponding to a permutation π ∈ S_n is given by \n\nt(π) := {{π(1), π(n)}} ∪ ⋃_{i=1}^{n-1} {{π(i), π(i+1)}}. (4) \n\nIf t(π) = t for a permutation π and a tour t, we say that π represents t. 
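The mapping (4) and the resulting many-to-one relation between permutations and tours can be illustrated directly (0-based indices; the helper name is ours):

```python
def tour_of(perm):
    """Edge set t(pi) of eq. (4): the n undirected edges {pi(i), pi(i+1)}, cyclically."""
    n = len(perm)
    return frozenset(frozenset((perm[i], perm[(i + 1) % n])) for i in range(n))

# Starting the tour at a different city, or reversing its direction, changes
# the permutation but not the tour it represents.
p = [0, 1, 2, 3, 4]
assert tour_of(p) == tour_of(p[1:] + p[:1]) == tour_of(p[::-1])
```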
We call two permutations π, π' equivalent if they represent the same tour, and write π ∼ π'. Let [π] denote the equivalence class of π as usual. Note that the length of a permutation is fully determined by its equivalence class. Therefore, ∼ describes the intrinsic symmetries of the TSP formulated as an optimization problem on S_n, denoted by TSP(S_n). \n\nWe have to define the addition ⊕ of trajectories such that the sum is independent of the representation. This means that for two tours t_1, t_2 such that t_1 is represented by π_1, π_1' and t_2 by π_2, π_2', it holds that Γ(π_1) ⊕ Γ(π_2) ∼ Γ(π_1') ⊕ Γ(π_2'). The idea will be to normalize both summands before addition. We will first study the exact representation symmetry of TSP(S_n). \n\nThe TSP(S_n) symmetry group. Algebraically speaking, S_n is a group with concatenation of functions as multiplication, so we can characterize the equivalence classes of ∼ by studying the set of operations on a permutation which map to the same equivalence class. We define a group action of S_n on itself by right translation (π, g ∈ S_n): \n\n\"·\": S_n × S_n → S_n, g · π := πg^{-1}. (5) \n\nNote that any permutation in S_n can be mapped to any other by an appropriate group action (namely π to π' by (π'^{-1}π) · π), such that the group action of S_n on itself suffices to study the equivalence classes of ∼. \n\nFor certain g ∈ S_n, it holds that t(g · π) = t(π). We want to determine the maximal set H_t of elements which keep t invariant. It even holds that H_t is a subgroup of S_n: the identity is trivially in H_t, and if g, h are t-invariant, then t((gh^{-1}) · π) = t(g · (h^{-1} · π)) = t(h^{-1} · π) = t(h · (h^{-1} · π)) = t(π). H_t will be called the symmetry group of TSP(S_n), and it follows that [π] = H_t · π := {h · π | h ∈ H_t}. \n\nThe shift σ and reversal ρ are defined by (i ∈ {1, ..., n}) \n\nσ(i) := i + 1 for i < n, σ(n) := 1, ρ(i) := n + 1 − i, (6) \n\nand we set H := ⟨ρ, σ⟩, the group generated by σ and ρ. It holds that (this result is an easy consequence of ρρ = id_{1,...,n}, ρσ = σ^{-1}ρ and σ^n = id_{1,...,n}) \n\nH = {σ^k | k ∈ {1, ..., n}} ∪ {ρσ^k | k ∈ {1, ..., n}}. (7) \n\nThe fundamental result is \n\nTheorem 1 Let t be the mapping which sends permutations to tours as defined in (4). Then H_t = H, where H_t is the set of all t-invariant permutations and H is defined in (7). \n\nProof: It is obvious that H ⊆ H_t. Now let h ∈ H_t. We are going to prove that t-invariant permutations are completely determined by their values on 1 and 2. Let k := h(1). Then h(2) = σ(k) or h(2) = σ^{-1}(k), because otherwise h would give rise to a link {π(h(1)), π(h(2))} ∉ t(π). For the same reason, h(3) must be mapped to σ^{±2}(k). Since h must be bijective, h(3) ≠ h(1), so that the sign of the exponent must be the same as for h(2). In general, h(i) = σ^{±(i−1)}(k). Now note that for i, k ∈ {1, ..., n}, σ^i(k) = σ^k(i), and therefore \n\nh = σ^{k−1} if h(i) = σ^{i−1}(k), and h = ρσ^{n−k} if h(i) = σ^{−i+1}(k). □ \n\nAdding trajectories. We can now define equivalence for trajectories. First define a group action of S_n on T_n analogously to (5): the action of h ∈ H_t on γ ∈ T_n is given by h · γ := γ ∘ h^{-1}. Furthermore, we say that γ ∼ η if H_t · γ = H_t · η. \n\nOur approach is motivated geometrically. We measure distances between trajectories as follows. Let d: ℝ² × ℝ² → ℝ₊ be a metric. Then define (γ, η ∈ T_n) \n\nd(γ, η) := Σ_{k=1}^{n} d(γ(k), η(k)). (8) \n\nBefore adding two trajectories, we first choose equivalent representations γ', η' which minimize d(γ', η'). Because of the results presented so far, searching through all equivalent trajectories is computationally tractable. Note that for h ∈ H_t, it holds that d(h · γ, h · η) = d(γ, η), as h only reorders the summands. It follows that it suffices to change the representation of only one argument, since d(h · γ, i · η) = d(γ, h^{-1}i · η). So the time complexity of one addition reduces to 2n computations of distances which involve n subtractions each. \n\nThe normalizing action is defined by (γ, η ∈ T_n) \n\nn_{γ,η} := argmin_{n ∈ H_t} d(γ, n · η). (9) \n\nAssuming that the normalizing action is unique¹, we can prove \n\nTheorem 2 Let γ, η be two trajectories, and n_{γ,η} the unique normalizing action as defined in (9). Then the operation \n\nγ ⊕ η := γ + n_{γ,η} · η (10) \n\nis representation invariant. \n\nProof: Let γ' = g · γ, η' = h · η for g, h ∈ H_t. We claim that n_{γ',η'} = g n_{γ,η} h^{-1}. The normalizing action is defined by \n\nn_{γ',η'} = argmin_{n' ∈ H_t} d(γ', n' · η') = argmin_{n' ∈ H_t} d(g · γ, n'h · η) = argmin_{n' ∈ H_t} d(γ, g^{-1}n'h · η), (11) \n\nby inserting g^{-1} in parallel before both arguments in the last step. Since the normalizing action is unique, it follows that for the n' realizing the minimum it holds that g^{-1}n'h = n_{γ,η}, and therefore n' = n_{γ',η'} = g n_{γ,η} h^{-1}. Now consider the sum \n\nγ' ⊕ η' = g · γ + (g n_{γ,η} h^{-1}) · (h · η) = g · (γ + n_{γ,η} · η) = g · (γ ⊕ η), (12) \n\nwhich proves the representation independence. □ \n\nThe sum of more than two trajectories can be defined by normalizing everything with respect to the first summand, so that empirical sums (1/t) ⊕_{i=1}^{t} Γ(π_i) are now well-defined. \n\n4 Inferring Solutions on New Instances \n\nWe transfer a trajectory to a new set of appointments X_1, ..., X_n by computing the relaxed tour using the following finite-horizon adaption technique: First of all, passing times t_i for all appointments are computed. We extend the domain of a trajectory γ from {1, ..., n} to the interval [1, n + 1) by linear interpolation. 
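Looking back at section 3, the representation-invariant sum of eqs. (8)-(10) amounts to enumerating the 2n representations given by Theorem 1 and adding the one closest to the first summand. A minimal sketch (0-based trajectories as lists of points; the names are ours):

```python
def equivalent_reps(eta):
    """All 2n representations h . eta for h in H = {shifts} union {reversal * shifts}."""
    reps = []
    for g in (eta, eta[::-1]):          # identity and the reversal rho
        for k in range(len(eta)):       # the shifts sigma^k
            reps.append(g[k:] + g[:k])
    return reps

def l1(gamma, eta):
    """Trajectory distance (8) with the l1 metric on R^2 (as used in section 5)."""
    return sum(abs(g[0] - e[0]) + abs(g[1] - e[1]) for g, e in zip(gamma, eta))

def oplus(gamma, eta):
    """Representation-invariant sum (10): add the representation of eta
    realizing the normalizing action (9) with respect to gamma."""
    best = min(equivalent_reps(eta), key=lambda rep: l1(gamma, rep))
    return [(g[0] + b[0], g[1] + b[1]) for g, b in zip(gamma, best)]
```

The brute-force minimum over 2n candidates matches the complexity argument above: one addition costs 2n distance evaluations of n terms each.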
Then we define t_i such that γ(t_i) is the earliest point with minimal distance between appointment X_i and the trajectory. The passing times can be calculated easily by simple geometric considerations. The permutation which sorts (t_i)_{i=1}^{n} is the relaxed solution of γ to (X_i). \n\nIn a post-processing step, self-intersections are removed first. Then, segments of length w are optimized by exhaustive search. Let π be the relaxed solution. The path from π(i) to π(i + w + 2) (index addition is modulo n) is replaced by the best alternative through the appointments π(i + 1), ..., π(i + w + 1). Iterate for all i ∈ {1, ..., n} until there is no further improvement. Since this procedure has time complexity w!n, it can only be done efficiently for small w. \n\n¹Otherwise, perturb the locations of the appointments by infinitesimal changes. \n\n5 Experiments \n\nFor the experiments, we used the following set-up: we took the ||·||_1-norm to determine the normalizing action. Typical sample sizes for the Markov chain Monte Carlo integration were 1000, with 100 steps in between to decouple consecutive samples. Scenarios were modeled after eq. (1), where the x_i were chosen to form simple geometric shapes. \n\nAverage trajectories for different temperatures are plotted in figures 1(a)-(c). As the temperature decreases, the average trajectory converges to the trajectory of a single locally optimal tour. The graphs demonstrate that the temperature T acts as a smoothing parameter. \n\nTo estimate the expected risk of an average trajectory, the post-processed relaxed (PPR) solutions were averaged over 100 new instances (see figure 1(d)-(g)) in order to estimate the expected costs. The costs of the best solutions are good approximations, within 5% of the average minimum as determined by careful simulated annealing. An interesting effect occurs: the expected costs have their minimum at non-zero temperature. 
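The PPR solutions above start from the relaxed solution of section 4. A crude sketch of that step, which approximates the passing times by sampling the linearly interpolated closed trajectory instead of projecting exactly (the names and the sampling resolution are our own choices):

```python
import math

def relaxed_solution(traj, points, samples_per_segment=20):
    """Order appointments by the earliest time at which the closed,
    linearly interpolated trajectory comes closest to them."""
    n = len(traj)
    times = []
    for px, py in points:
        best_t, best_d = 0.0, math.inf
        for i in range(n):                  # segment traj[i] -> traj[(i+1) % n]
            (ax, ay), (bx, by) = traj[i], traj[(i + 1) % n]
            for s in range(samples_per_segment):
                u = s / samples_per_segment
                x, y = ax + u * (bx - ax), ay + u * (by - ay)
                d = math.hypot(x - px, y - py)
                if d < best_d - 1e-12:      # strict: keep the earliest minimum
                    best_t, best_d = i + u, d
        times.append(best_t)
    return sorted(range(len(points)), key=lambda i: times[i])
```

The subsequent exhaustive-search post-processing over windows of length w is omitted here.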
The corresponding trajectories are plotted in figure 1(e),(f). They recover the structure of the scenario. In other words, average trajectories computed at temperatures which are too low start to overfit to noise present only in the instance for which they were computed. So computation of the global optimum of a noisy combinatorial optimization problem might not be the right strategy, because the solutions might not reflect the underlying structure. Averaging over many suboptimal solutions provides much better statistics. \n\n6 Selection of the Temperature \n\nThe question remains how to select the optimal temperature. This problem is essentially the same as determining the correct model complexity in learning theory, and therefore no fully satisfying answer is readily available. The problem is nevertheless suited for the application of the heuristic provided by the empirical risk approximation (ERA) framework [1], which will be briefly sketched here. \n\nThe main idea of ERA is to coarse-grain the set of hypotheses M by treating hypotheses as equivalent which are only slightly different. Hypotheses whose ℓ1 mutual distance (defined in a similar fashion as (8)) is smaller than a parameter γ ∈ ℝ₊ are considered statistically equivalent. Selecting a subset of solutions such that ℓ1-spheres of radius γ cover M results in the coarse-grained hypothesis set M_γ. VC-type large deviation bounds depending on the size of the coarse-grained hypothesis class can now be derived: \n\nP{C(m_γ) − min_{m ∈ M} C(m) > 2ε} ≤ 2|M_γ| sup_{m ∈ M_γ} exp(−n(ε − γ)² / (a_m + ε(ε − γ))), (13) \n\nwith a_m depending on the distribution. The bound weighs two competing effects: on the one hand, increasing γ introduces a systematic bias in the estimation; on the other hand, decreasing γ increases the cardinality of the hypothesis class. Given a confidence δ > 0, the probability of being worse than ε > 0 on a second instance and γ are linked, so an optimal coarsening γ can be determined. ERA then advocates to either sample from the γ-sphere around the empirical minimizer or to average over these solutions. \n\nNow it is well known that the Gibbs sampler is concentrated on solutions whose costs are below a certain threshold. Therefore, ERA is suited for our approach. In the relating equation, the log-cardinality of the approximation set occurs, which is usually interpreted as microcanonical entropy. This relates back to statistical physics, the starting point of our whole approach. Interpreting γ as energy, we can compute the stop temperature from the optimal γ. Using the well-known relation from statistical physics ∂(entropy)/∂(energy) = T^{-1}, we can derive a lower bound on the optimal temperature depending on variance estimates of the specific scenario given. \n\n7 Conclusion \n\nIn reality, optimization algorithms are often applied to many similar instances. We pointed out that this can be interpreted as a learning problem. The underlying structure of similar instances should be extracted and used in order to reduce the computational complexity of computing solutions to related instances. \n\nStarting with the noisy Euclidean TSP, the construction of average tours is studied in this paper, which involves determining the exact relationship between permutations and tours, and identifying the intrinsic symmetries of the TSP. We hope that this technique might prove to be useful for other applications in the field of averaging over solutions of combinatorial problems. The average trajectories are able to capture the underlying structure common to all instances. A heuristic for constructing solutions on new instances is proposed. An empirical study of these procedures is conducted with results satisfying our expectations. 
\nIn terms of learning theory, overfitting effects can be observed. This phenomenon points at a deep connection between combinatorial optimization problems with noise and learning theory, which might be bidirectional. On the one hand, we believe that noisy (in contrast to random) combinatorial optimization problems are dominant in reality. Robust algorithms could be built by first estimating the undistorted structure and then using this structure as a guideline for constructing solutions for single instances. On the other hand, hardness of efficient optimization might be linked to the inability to extract meaningful structure. These connections, which are the subject of further studies, link statistical complexity to computational complexity. \n\nAcknowledgments \n\nThe authors would like to thank Naftali Tishby, Scott Kirkpatrick and Michael Clausen for their helpful comments and discussions. \n\nReferences \n\n[1] J. M. Buhmann and M. Held. Model selection in clustering by uniform convergence bounds. Advances in Neural Information Processing Systems, 12:216-222, 1999. \n[2] R. Durbin and D. Willshaw. An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326:689-691, 1987. \n[3] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671-680, 1983. \n[4] S. Lin and B. Kernighan. An effective heuristic algorithm for the traveling salesman problem. Operations Research, 21:498-516, 1973. \n[5] P. D. Simic. Statistical mechanics as the underlying theory of \"elastic\" and \"neural\" optimizations. Network, 1:89-103, 1990. \n[6] G. Winkler. Image Analysis, Random Fields and Dynamic Monte Carlo Methods, volume 27 of Applications of Mathematics. Springer, Heidelberg, 1995. 
\n\nFigure 1: (a) Average trajectories at different temperatures for n = 100 appointments on a circle with σ² = 0.03. (b) Average trajectories at different temperatures, for multiple Gaussian sources, n = 50 and σ² = 0.025. (c) The same for an instance with structure on two levels. (d) Average tour length of the post-processed relaxed (PPR) solutions for the circle instance plotted in (a). The PPR width was w = 5. The average fits to noise in the data if the temperature is too low, leading to overfitting phenomena. Note that the average best solution is ≤ 16.5. (e) The average trajectory with the smallest average length of its PPR solutions in (d). (f) Average tour length as in (d). The average best solution is ≤ 10.80. (g) Lowest-temperature trajectory with small average PPR solution length in (f).", "award": [], "sourceid": 2049, "authors": [{"given_name": "Mikio", "family_name": "Braun", "institution": null}, {"given_name": "Joachim", "family_name": "Buhmann", "institution": null}]}