{"title": "Unification of Information Maximization and Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 508, "page_last": 514, "abstract": null, "full_text": "Unification of Information Maximization and Minimization\n\nRyotaro Kamimura\nInformation Science Laboratory\nTokai University\n1117 Kitakaname Hiratsuka Kanagawa 259-12, Japan\nE-mail: ryo@cc.u-tokai.ac.jp\n\nAbstract\n\nIn this paper, we propose a method to unify information maximization and minimization in hidden units. Information maximization and minimization are performed on two different levels: the collective level and the individual level. Accordingly, two kinds of information, collective information and individual information, are defined. By maximizing collective information and minimizing individual information, networks that are simple in terms of the number of connections and the number of hidden units can be generated. The networks obtained are expected to give better generalization and more interpretable internal representations. The method was applied to the inference of the maximum onset principle of an artificial language. In this problem, it was shown that individual information minimization does not contradict collective information maximization. In addition, experimental results confirmed improved generalization performance, because over-training can be significantly suppressed.\n\n1 Introduction\n\nThere have been many attempts to interpret neural networks from an information theoretic point of view [2], [4], [5]. In supervised learning, information has been either maximized or minimized, depending on the problem. In these methods, information is defined by the outputs of hidden units. 
Thus, the methods aim to control hidden unit activity patterns in an optimal manner. Information maximization methods have been used to interpret internal representations explicitly and, at the same time, to reduce the number of necessary hidden units [5]. Information minimization methods, on the other hand, have been used mainly to improve generalization performance [2], [4] and to speed up learning. Thus, if information can be maximized and minimized simultaneously, information theoretic methods can be expected to apply to a wide range of problems.\n\nIn this paper, we unify the two methods mentioned above, namely, information maximization and information minimization, into one framework in order to improve generalization performance and to interpret internal representations explicitly. However, it is apparently impossible to maximize and minimize simultaneously the information defined by the hidden unit activity. Our goal is therefore to maximize and minimize information on two different levels, namely, the collective and individual levels. This means that information is maximized collectively over the hidden units and minimized for individual input-hidden connections. The seemingly contradictory proposition of simultaneous information maximization and minimization can be overcome by assuming the existence of these two levels of information control.\n\nInformation is supposed to be controlled by an information controller that is located outside the neural network and used exclusively to control information. By assuming the information controller, we can clearly see how appropriately defined information can be maximized or minimized. 
In addition, introducing the concept of the information controller makes the actual implementation of information methods much easier.\n\n2 Concept of Information\n\nIn this section, we explain the concept of information in the general framework of information theory. Let $Y$ take on a finite number of possible values $y_1, y_2, \\ldots, y_M$ with probabilities $p(y_1), p(y_2), \\ldots, p(y_M)$, respectively. Then, the initial uncertainty $H(Y)$ of the random variable $Y$ is defined by\n\n$$H(Y) = -\\sum_{j=1}^{M} p(y_j) \\log p(y_j). \\qquad (1)$$\n\nNow, consider the conditional uncertainty after the observation of another random variable $X$, taking possible values $x_1, x_2, \\ldots, x_S$ with probabilities $p(x_1), p(x_2), \\ldots, p(x_S)$, respectively. The conditional uncertainty $H(Y \\mid X)$ can be defined as\n\n$$H(Y \\mid X) = -\\sum_{s=1}^{S} p(x_s) \\sum_{j=1}^{M} p(y_j \\mid x_s) \\log p(y_j \\mid x_s). \\qquad (2)$$\n\nWe can easily verify that the conditional uncertainty is always less than or equal to the initial uncertainty. Information is usually defined as the decrease of this uncertainty [1]:\n\n$$\\begin{aligned} I(Y \\mid X) &= H(Y) - H(Y \\mid X) \\\\ &= -\\sum_{j=1}^{M} p(y_j) \\log p(y_j) + \\sum_{s=1}^{S} p(x_s) \\sum_{j=1}^{M} p(y_j \\mid x_s) \\log p(y_j \\mid x_s) \\\\ &= \\sum_{s=1}^{S} \\sum_{j=1}^{M} p(x_s) p(y_j \\mid x_s) \\log \\frac{p(y_j \\mid x_s)}{p(y_j)} \\\\ &= \\sum_{s=1}^{S} p(x_s) I(Y \\mid x_s), \\end{aligned} \\qquad (3)$$\n\nwhere\n\n$$I(Y \\mid x_s) = \\sum_{j=1}^{M} p(y_j \\mid x_s) \\log \\frac{p(y_j \\mid x_s)}{p(y_j)}, \\qquad (4)$$\n\nwhich is referred to as conditional information. In particular, when the prior uncertainty is maximum, that is, when the prior probabilities are equi-probable ($1/M$), the information is\n\n$$I(Y \\mid X) = \\log M + \\sum_{s=1}^{S} p(x_s) \\sum_{j=1}^{M} p(y_j \\mid x_s) \\log p(y_j \\mid x_s), \\qquad (5)$$\n\nwhere $\\log M$ is the maximum uncertainty concerning $Y$. 
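
As a hedged illustration of equations (1) to (3), the following Python sketch computes the initial uncertainty, the conditional uncertainty and their difference for a small discrete example. The function names and the toy probabilities are assumptions introduced here for illustration; they do not come from the paper.

```python
import math


def entropy(p):
    # Equation (1): H = -sum_j p_j log p_j, with 0 log 0 taken as 0.
    return -sum(q * math.log(q) for q in p if q > 0)


def information(p_x, p_y_given_x):
    # Equation (3): I(Y | X) = H(Y) - H(Y | X).
    m = len(p_y_given_x[0])
    # Marginal p(y_j) = sum_s p(x_s) p(y_j | x_s).
    p_y = [sum(px * row[j] for px, row in zip(p_x, p_y_given_x))
           for j in range(m)]
    # Equation (2): conditional uncertainty averaged over the inputs.
    h_cond = sum(px * entropy(row) for px, row in zip(p_x, p_y_given_x))
    return entropy(p_y) - h_cond


# Toy example (hypothetical numbers): S = 2 inputs, M = 2 outcomes.
p_x = [0.5, 0.5]
p_y_given_x = [[0.9, 0.1], [0.1, 0.9]]
info = information(p_x, p_y_given_x)
print(info)
```

In this toy case the marginal over the two outcomes is uniform, so the value agrees with equation (5): the initial uncertainty equals log M and only the conditional term varies.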
\n\n3 Formulation of Information Controller\n\nIn this section, we apply the concept of information to actual network architectures and define collective information and individual information. The notation of the previous section is changed into the notation ordinarily used for neural networks.\n\n3.1 Unification by Information Controller\n\nTwo kinds of information, collective information and individual information, are controlled by an information controller. The information controller is devised to make the mechanism of information maximization and minimization more explicit. As shown in Figure 1, the information controller is composed of two subcomponents, that is, an individual information minimizer and a collective information maximizer. The collective information maximizer is used to increase collective information as much as possible. The individual information minimizer is used to decrease individual information. By this minimization, the majority of connections are pushed toward zero, so that on its own it tends to leave all the hidden units intermediately activated. When the collective information maximizer and the individual information minimizer are applied simultaneously, the hidden unit activity pattern is a pattern of maximum information in which only one hidden unit is on, while all the other hidden units are off. However, the multiple strongly negative connections that would otherwise produce a maximum information state are replaced by extremely weak input-hidden connections. Strongly negative connections are inhibited by the individual information minimization. 
This means that, with the information controller, information can be maximized and, at the same time, one of the most important properties of information minimization, namely, weight decay or weight elimination, can be approximately realized. Consequently, the information controller can generate much simplified networks in terms of both hidden units and input-hidden connections.\n\n3.2 Collective Information Maximizer\n\nA neural network to be controlled is composed of input, hidden and output units with bias, as shown in Figure 1. The $j$th hidden unit receives a net input from the input units and, at the same time, from the collective information maximizer:\n\n$$u_j^s = x_j + \\sum_{k=0}^{L} w_{jk} \\xi_k^s, \\qquad (6)$$\n\nwhere $x_j$ is the information maximizer from the collective information maximizer to the $j$th hidden unit, $L$ is the number of input units, $w_{jk}$ is the connection from the $k$th input unit to the $j$th hidden unit and $\\xi_k^s$ is the $k$th element of the $s$th input pattern.\n\nFigure 1: A network architecture realizing the information controller. [Diagram labels: bias, bias-hidden connections $w_{j0}$, bias-output connections, input-hidden connections, target, individual information minimizer, information maximizers $x_j$, collective information maximizer, information controller.]\n\nThe $j$th hidden unit produces an activity or an output by a sigmoidal activation function:\n\n$$v_j^s = f(u_j^s) = \\frac{1}{1 + \\exp(-u_j^s)}. \\qquad (7)$$\n\nThe collective information maximizer is used to maximize the information contained in the hidden units. For this purpose, we should define collective information. 
Now, suppose that, in the formulation of information in the previous section, the symbols $X$ and $Y$ represent a set of input patterns and a set of hidden units, respectively. Then, let us approximate the probability $p(y_j \\mid x_s)$ by a normalized output $p_j^s$ of the $j$th hidden unit, computed by\n\n$$p_j^s = \\frac{v_j^s}{\\sum_{m=1}^{M} v_m^s}, \\qquad (8)$$\n\nwhere the summation is over all the hidden units. It is reasonable to suppose that at an initial stage all the hidden units are activated randomly or uniformly and that all the input patterns are also given to the network randomly. Thus, the probability $p(y_j)$ of the activation of hidden units at the initial stage is equi-probable, that is, $1/M$. The probability $p(x_s)$ of input patterns is also supposed to be equi-probable, namely, $1/S$. Thus, the information in equation (3) is rewritten as\n\n$$\\begin{aligned} I(Y \\mid X) &\\approx -\\sum_{j=1}^{M} \\frac{1}{M} \\log \\frac{1}{M} + \\frac{1}{S} \\sum_{s=1}^{S} \\sum_{j=1}^{M} p_j^s \\log p_j^s \\\\ &= \\log M + \\frac{1}{S} \\sum_{s=1}^{S} \\sum_{j=1}^{M} p_j^s \\log p_j^s, \\end{aligned} \\qquad (9)$$\n\nwhere $\\log M$ is the maximum uncertainty. This information is considered to be the information acquired in the course of learning.\n\nFigure 2: An interpretation of an input-hidden connection for defining the individual information. [Diagram labels: input unit (C) $k$, hidden unit (D) $j$, fires, does not fire.]\n\nInformation maximizers are updated to increase collective information. To obtain the update rule, we differentiate the information function with respect to the information maximizers $x_j$:\n\n$$\\Delta x_j = \\beta S \\frac{\\partial I(Y \\mid X)}{\\partial x_j} = \\beta \\sum_{s=1}^{S} \\left( \\log p_j^s - \\sum_{m=1}^{M} p_m^s \\log p_m^s \\right) p_j^s (1 - v_j^s), \\qquad (10)$$\n\nwhere $\\beta$ is a parameter. 
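
The collective information and the update of the information maximizers described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the author's implementation: the normalized output is assumed to be each hidden activity divided by the sum of activities over all hidden units, element 0 of every pattern is assumed to be a bias input fixed at 1, and all function names and toy values are hypothetical.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def hidden_outputs(x, w, pattern):
    # Equations (6) and (7): net input plus information maximizer, then sigmoid.
    v = [sigmoid(xj + sum(wjk * xi for wjk, xi in zip(wj, pattern)))
         for xj, wj in zip(x, w)]
    total = sum(v)
    # Assumed form of equation (8): activities normalized over hidden units.
    p = [vj / total for vj in v]
    return v, p


def collective_information(x, w, patterns):
    # Equation (9): log M + (1/S) * sum_s sum_j p log p.
    acc = 0.0
    for pattern in patterns:
        _, p = hidden_outputs(x, w, pattern)
        acc += sum(pj * math.log(pj) for pj in p)
    return math.log(len(x)) + acc / len(patterns)


def maximizer_update(x, w, patterns, beta):
    # Equation (10): beta * sum_s (log p_j - sum_m p_m log p_m) p_j (1 - v_j).
    delta = [0.0] * len(x)
    for pattern in patterns:
        v, p = hidden_outputs(x, w, pattern)
        plogp = sum(pm * math.log(pm) for pm in p)
        for j in range(len(x)):
            delta[j] += beta * (math.log(p[j]) - plogp) * p[j] * (1.0 - v[j])
    return delta


# Toy setup (hypothetical values): M = 3 hidden units, bias element plus 2 inputs.
x = [0.0, 0.0, 0.0]
w = [[0.1, 0.5, -0.3], [0.0, -0.2, 0.4], [0.2, 0.1, 0.1]]
patterns = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
step = maximizer_update(x, w, patterns, beta=0.015)
x_new = [xj + dj for xj, dj in zip(x, step)]
print(collective_information(x, w, patterns),
      collective_information(x_new, w, patterns))
```

Because the update is the gradient of the collective information scaled by the parameter and the number of patterns, a small step of this kind should raise the collective information slightly. Very negative net inputs would overflow the exponential in this naive sigmoid, so a practical version would need a numerically safer form.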
\n\n3.3 Individual Information Minimization\n\nTo represent individual information by the concept of information discussed in the previous section, we consider an output $p_{jk}$ from the $j$th hidden unit with only a connection from the $k$th input unit to the $j$th hidden unit:\n\n$$p_{jk} = f(w_{jk}) = \\frac{1}{1 + \\exp(-w_{jk})}, \\qquad (11)$$\n\nwhich is supposed to be the probability of the firing of the $j$th hidden unit, given the firing of the $k$th input unit, as shown in Figure 2. Since this is a probability given the firing of the $k$th input unit, conditional information is appropriate for measuring the information. In addition, it is reasonable to suppose that the probability of the firing of the $j$th hidden unit is $1/2$ at an initial stage of learning, because we have no knowledge of the hidden units. Thus, the conditional information for a pair of the $k$th input unit and the $j$th hidden unit is formulated as\n\n$$\\begin{aligned} I_{jk}(D \\mid \\text{fires}) &\\approx -p_{jk} \\log \\frac{1}{2} - (1 - p_{jk}) \\log \\left( 1 - \\frac{1}{2} \\right) + p_{jk} \\log p_{jk} + (1 - p_{jk}) \\log (1 - p_{jk}) \\\\ &= \\log 2 + p_{jk} \\log p_{jk} + (1 - p_{jk}) \\log (1 - p_{jk}). \\end{aligned} \\qquad (12)$$\n\nIf connections are close to zero, this function is close to minimum information, meaning that it is impossible to estimate the firing of the $j$th hidden unit. If connections are larger in magnitude, the information is larger and the correlation between input and hidden units is larger. Total individual information is the sum of all the individual information, namely,\n\n$$I(D \\mid \\text{fires}) = \\sum_{j=1}^{M} \\sum_{k=0}^{L} I_{jk}(D \\mid \\text{fires}), \\qquad (13)$$\n\nbecause each connection is treated separately or independently.\n\nTable 1: An example of input-hidden connections $w_{jk}$ obtained by the information controller. The parameters $\\beta$, $\\mu$ and $\\eta$ were 0.015, 0.0008, and 0.01, respectively. Columns xi_1 to xi_4 are the input units.\n\nHidden unit    xi_1    xi_2    xi_3    xi_4  Bias w_j0  Maximizer x_j\n1              3.09   10.77   26.48   13.82      22.07         -60.88\n2             -3.35    0.11    0.33   -3.08      -0.95           1.63\n3             -0.01    0.00    0.00   -0.01       0.00         -10.93\n4              0.00    0.00    0.00    0.00       0.00         -10.94\n5              0.00    0.00   -0.01    0.00       0.00         -10.97\n6              0.02    0.01   -0.04    0.01       0.06         -12.01\n7              0.00    0.00   -0.01    0.00       0.00         -11.01\n8              0.00    0.00   -0.01    0.00       0.00         -11.00\n9              0.02    0.01   -0.03    0.01       0.03         -11.61\n10             0.01    0.00   -0.02    0.00       0.07         -11.67\n\nThe individual information minimization directly controls the input-hidden connections. By differentiating the individual information function and a cross entropy cost function $G$ with respect to the input-hidden connections $w_{jk}$, we have the rule for updating the input-hidden connections:\n\n$$\\Delta w_{jk} = -\\mu \\frac{\\partial I(D \\mid \\text{fires})}{\\partial w_{jk}} - \\eta \\frac{\\partial G}{\\partial w_{jk}} = -\\mu w_{jk} p_{jk} (1 - p_{jk}) + \\eta \\sum_{s=1}^{S} \\delta_j^s \\xi_k^s, \\qquad (14)$$\n\nwhere $\\delta_j^s$ is an ordinary delta for the cross entropy function and $\\eta$ and $\\mu$ are parameters. Thus, the rule for updating input-hidden connections is closely related to the weight decay method. Clearly, individual information minimization corresponds to diminishing the strength of input-hidden connections. 
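
The link between individual information and weight decay can be checked numerically. The sketch below is illustrative only: it assumes that the firing probability of a single connection is the sigmoid of that connection, which is the form consistent with the first term of update rule (14), and it covers only that first term, not the cross entropy term. The function names and toy weights are hypothetical.

```python
import math


def firing_probability(w_jk):
    # Assumed form of equation (11): sigmoid of a single connection.
    return 1.0 / (1.0 + math.exp(-w_jk))


def individual_information(w_jk):
    # Equation (12): log 2 + p log p + (1 - p) log(1 - p).
    # Note: for very large |w_jk| the probability rounds to 0 or 1 and the
    # logarithm fails, so this naive form is only meant for moderate weights.
    p = firing_probability(w_jk)
    return math.log(2.0) + p * math.log(p) + (1.0 - p) * math.log(1.0 - p)


def decay_term(w_jk, mu):
    # First term of equation (14): -mu * w_jk * p_jk * (1 - p_jk).
    p = firing_probability(w_jk)
    return -mu * w_jk * p * (1.0 - p)


# Toy weights (hypothetical): information grows with the connection magnitude.
weights = [0.0, 0.5, 3.0]
infos = [individual_information(wv) for wv in weights]
decays = [decay_term(wv, mu=0.0008) for wv in weights]
print(infos)
print(decays)
```

The decay term always has the opposite sign to the connection, so it pulls every input-hidden connection toward zero, much as weight decay does, but with a strength that fades for very large connections because the sigmoid saturates.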
\n\n4 Results and Discussion\n\nThe information controller was applied to the segmentation of strings of an artificial language into appropriate minimal elements, that is, syllables. Table 1 shows the input-hidden connections together with the biases and the information maximizers. Hidden units were ordered by the magnitude of the relevance of each hidden unit [6]. Collective information and individual information could be sufficiently maximized and minimized, respectively: relative collective and individual information were 0.94 and 0.13. In this state, all the input-hidden connections except those into the first two hidden units are almost zero. The information maximizers $x_j$ are all strongly negative for the remaining hidden units. These negative information maximizers make eight hidden units (from the third to the tenth hidden unit) inactive, that is, close to zero. By carefully examining the first two hidden units, we could see that the first and the second hidden units are concerned with rules for syllabification and an exceptional case.\n\nThen, networks were trained to infer the well-formedness of strings in addition to the segmentation, in order to examine generalization performance. Table 2 shows generalization errors for 200 and 300 training patterns. As clearly shown in the table, the best generalization performance in terms of both RMS and error rates is obtained by the information controller. Thus, experimental results confirmed that in all cases the generalization performance of the information controller is better than that of the other methods. In addition, experimental results confirmed that the better generalization performance is due to the suppression of over-training by the information controller.\n\nTable 2: Generalization performance comparison for 200 and 300 training patterns. Averages in the table are averages over seven of the ten generalization errors obtained with ten different initial values. Generalization errors are given as RMS and as error rates, each with its average and standard deviation.\n\n(a) 200 patterns\n\nMethods                  RMS Avg.  RMS Std.  Error Rate Avg.  Error Rate Std.\nStandard                    0.188     0.010            0.087            0.015\nWeight Decay                0.183     0.004            0.082            0.009\nWeight Elimination          0.172     0.014            0.064            0.015\nInformation Controller      0.167     0.011            0.052            0.008\n\n(b) 300 patterns\n\nMethods                  RMS Avg.  RMS Std.  Error Rate Avg.  Error Rate Std.\nStandard                    0.108     0.009            0.024            0.009\nWeight Decay                0.110     0.003            0.012            0.004\nWeight Elimination          0.083     0.005            0.009            0.006\nInformation Controller      0.072     0.006            0.008            0.004\n\nReferences\n\n[1] R. Ash, Information Theory, John Wiley & Sons: New York, 1965.\n\n[2] G. Deco, W. Finnoff and H. G. Zimmermann, \"Unsupervised mutual information criterion for elimination of overtraining in Supervised Multilayer Networks,\" Neural Computation, Vol. 7, pp. 86-107, 1995.\n\n[3] R. Kamimura, \"Entropy minimization to increase the selectivity: selection and competition in neural networks,\" Intelligent Engineering Systems through Artificial Neural Networks, ASME Press, pp. 227-232, 1992.\n\n[4] R. Kamimura, T. Takagi and S. Nakanishi, \"Improving generalization performance by information minimization,\" IEICE Transactions on Information and Systems, Vol. E78-D, No. 2, pp. 163-173, 1995.\n\n[5] R. Kamimura and S. 
Nakanishi, \"Hidden information maximization for feature detection and rule discovery,\" Network: Computation in Neural Systems, Vol. 6, pp. 577-602, 1995.\n\n[6] M. C. Mozer and P. Smolensky, \"Using relevance to reduce network size automatically,\" Connection Science, Vol. 1, No. 1, pp. 3-16, 1989.\n", "award": [], "sourceid": 1282, "authors": [{"given_name": "Ryotaro", "family_name": "Kamimura", "institution": null}]}