{"title": "A Real Time Clustering CMOS Neural Engine", "book": "Advances in Neural Information Processing Systems", "page_first": 755, "page_last": 762, "abstract": null, "full_text": "A Real Time Clustering CMOS \n\nNeural Engine \n\nT. Serrano-Gotarredona, B. Linares-Barranco, and J. L. Huertas \n\nDept. of Analog Design, National Microelectronics Center (CNM), Ed. CICA, Av. Reina \n\nMercedes sIn, 41012 Sevilla, SPAIN. Phone: (34)-5-4239923, Fax: (34)-5-4624506, \n\nE-mail: bernabe@cnm.us.es \n\nAbstract \n\nWe  describe  an  analog  VLSI  implementation  of the ARTI  algorithm \n(Carpenter, 1987). A prototype chip has been fabricated in a standard low \ncost  1.5~m double-metal single-poly CMOS process. It has a die area of \nlcm2  and  is  mounted in  a  12O-pins PGA package.  The chip realizes  a \nmodified version of the original ARTI  architecture.  Such modification \nhas been shown to preserve all computational properties of the original \nalgorithm  (Serrano,  1994a),  while  being  more  appropriate  for  VLSI \nrealizations. The chip implements an ARTI network with 100 F 1 nodes \nand 18 F2 nodes. It can therefore cluster 100 binary pixels input patterns \ninto up to 18 different categories. Modular expansibility of the system is \npossible  by  assembling  an  NxM  array  of  chips  without  any  extra \ninterfacing circuitry, resulting in an F 1 layer with  l00xN nodes, and an \nF2  layer  with  18xM nodes.  Pattern  classification  is  performed  in  less \nthan  1.8~s, which  means  an  equivalent  computing  power of 2.2x109 \nconnections and connection-updates per second. Although internally the \nchip is analog in nature, it interfaces to the outside world through digital \nsignals, thus having a true asynchrounous digital behavior. Experimental \nchip  test  results  are  available,  which  have  been  obtained  through  test \nequipments for digital chips. 
\n\n1 INTRODUCTION \n\nThe original ART1 algorithm (Carpenter, 1987) proposed in 1987 is a massively parallel architecture for a self-organizing neural binary-pattern recognition machine. In response to arbitrary orderings of arbitrarily many and complex binary input patterns, ART1 is capable of learning, in an unsupervised way, stable recognition codes. \n\nFig. 1: Modified Fast Learning or Type-3 ART1 implementation algorithm. The flow is: initialize weights z_{ij} = 1; read input pattern I = (I_1, I_2, ..., I_N); compute T_j = L_A \sum_{i=1}^{N} z_{ij} I_i - L_B \sum_{i=1}^{N} z_{ij} + L_M; Winner-Take-All: y_l = 1 if T_l = max_j {T_j}, y_j = 0 if j != l; update the winning template z_{iJ}^{new} = I_i z_{iJ}^{old}. \n\nThe ART1 architecture is described by a set of Short Term Memory (STM) and another set of Long Term Memory (LTM) time-domain nonlinear differential equations. It is valid to assume that the STM equations settle much faster (instantaneously) than the LTM equations, so that the STM differential equations can be substituted by nonlinear algebraic equations that describe the steady state of the STM differential equations. Furthermore, in the fast-learning mode (Carpenter, 1987), the LTM differential equations are likewise substituted by their corresponding steady-state nonlinear algebraic equations. This way, the ART1 architecture can be behaviorally modelled by the sequential application of nonlinear algebraic equations. Three different levels of ART1 implementations (both in software and in hardware) can therefore be distinguished: \n\nType-1: Full Model Implementation: both STM and LTM time-domain differential equations are realized. 
This implementation is the most expensive (both in software and in hardware), and requires a large amount of computational power. \n\nType-2: STM Steady-State Implementation: only the LTM time-domain differential equations are implemented. The STM behavior is governed by nonlinear algebraic equations. This implementation requires fewer resources than the previous one. However, a proper sequencing of STM events has to be introduced artificially, which is architecturally implicit in the Type-1 implementation. \n\nType-3: Fast Learning Implementation: both STM and LTM are implemented with algebraic equations. This implementation is computationally the least expensive one. In this case an artificial sequencing of STM and LTM events has to be done. \n\nThe implementation presented in this paper realizes a modified version of the original ART1 Type-3 algorithm, more suitable for VLSI implementations. Such a modified ART1 system has been shown to preserve all computational properties of the original ART1 architecture (Serrano, 1994a). The flow diagram that describes the modified ART1 architecture is shown in Fig. 1. Note that there is only one binary-valued weight template (z_{ij}), instead of the two weight templates (one binary-valued and the other real-valued) of the original ART1. For a more detailed discussion of the modified ART1 algorithm refer to (Serrano, 1994a, 1994b). \n\nIn the next Section we will provide an analog current-mode based circuit that implements in hardware the flow diagram of Fig. 1. Note that, although internally this circuit is analog in nature, from its input and output signals point of view it is a true asynchronous digital circuit, easy to interface with any conventional digital machine. 
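The flow diagram of Fig. 1 maps directly onto a few lines of software. The following Python fragment is a behavioral sketch of the modified Type-3 algorithm, combining the Fig. 1 flow with the vigilance gating described in the next Section; the function name `art1_present` and the device of masking rejected categories with -inf are our illustrative choices, not part of the chip.

```python
import numpy as np

def art1_present(I, z, LA, LB, LM, rho):
    """Behavioral sketch of one pattern presentation (Fig. 1 flow).

    I : binary input pattern, shape (N,)
    z : binary weight templates, shape (M, N), reset state is all ones
    Returns the winning category index; the winner's row of z is updated.
    """
    T = LA * (z @ I) - LB * z.sum(axis=1) + LM  # choice terms T_j
    V = LA * (z @ I)                            # match terms V_j
    Vrho = LA * I.sum()                         # vigilance reference
    T = np.where(V >= rho * Vrho, T, -np.inf)   # vigilance gates the WTA
    winner = int(np.argmax(T))                  # winner-take-all
    z[winner] &= I                              # fast learning: z^new = I * z^old
    return winner
```

For example, with z reset to all ones, presenting a 40-pixel pattern commits the first node to it; a disjoint 50-pixel pattern then fails the vigilance test against that node and recruits a fresh, uncommitted one.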
Finally, in Section 3 we will provide experimental results measured from the chip using digital data acquisition test equipment. \n\n2 CIRCUIT DESCRIPTION \n\nThe ART1 chip reported in this paper has an F1 layer with 100 neurons and an F2 layer with 18 neurons. This means that it can handle binary input patterns of 100 pixels each, and cluster them into up to 18 different categories, according to a digitally adjustable vigilance parameter ρ. The circuit architecture of the chip is shown in Fig. 2(a). It consists of an array of 18x100 synapses, a 1x100 array of \"vigilance synapses\", a unity-gain 18-output current mirror, an adjustable-gain 18-output current mirror (with ρ = 0.0, 0.1, ..., 0.9)^1, 18 current-comparator-controlled switches, and an 18-input-current Winner-Take-All (WTA) (Serrano, 1994b). The inputs to the circuit are the 100 binary digital input voltages I_i, and the outputs of the circuit are the 18 digital output voltages y_j. External control signals allow to change parameters ρ, L_A, L_B, and L_M. Also, extra circuitry has been added for reading the internal weights z_{ij} while the system is learning. \n\nEach row of synapses generates two currents, \n\nT_j = L_A \sum_{i=1}^{100} z_{ij} I_i - L_B \sum_{i=1}^{100} z_{ij} + L_M,  V_j = L_A \sum_{i=1}^{100} z_{ij} I_i   (1) \n\nwhile the row of the \"vigilance synapses\" generates the current \n\nV_ρ = L_A \sum_{i=1}^{100} I_i   (2) \n\nEach of the current comparators compares the current V_j versus ρ V_ρ, and allows current T_j to reach the WTA only if V_j >= ρ V_ρ. This way competition and vigilance occur simultaneously and in parallel, speeding up significantly the search process. \n\nFig. 2(b) shows the content of a synapse in the 18x100 array. It consists of three current sources with switches, two digital AND gates and a flip-flop. 
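Equations (1)-(2) are simply current sums along shared row wires, so they can be checked numerically. The sketch below is our own illustration (the bias-current values are arbitrary examples): it accumulates the per-synapse switched source currents of Fig. 2(b) and confirms that the row totals reproduce the closed-form expressions and the comparator decision V_j >= ρ V_ρ.

```python
import numpy as np

rng = np.random.default_rng(1)
LA, LB, LM = 10e-6, 9.5e-6, 950e-6   # bias currents (amps), example values
rho = 0.7

I = rng.integers(0, 2, 100)          # binary input pattern I_i
z = rng.integers(0, 2, 100)          # one row of binary weights z_ij

# Per-synapse switched current sources, summed on the two row wires
Tj = np.sum(LA * I * z - LB * z) + LM   # Eq. (1), first row current
Vj = np.sum(LA * I * z)                 # Eq. (1), second row current
Vrho = LA * np.sum(I)                   # Eq. (2), vigilance row

# The current comparator enables T_j toward the WTA only on sufficient match
passes = Vj >= rho * Vrho
```

In the chip these sums are free: each synapse dumps its current onto the row wire and Kirchhoff's current law performs the addition.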
Each synapse receives two input voltages, I_i and y_j, and two global control voltages, φ_l (to enable/disable learning) and reset (to initialize all weights z_{ij} to '1'). Each synapse generates two currents, L_A I_i z_{ij} - L_B z_{ij} and L_A I_i z_{ij}, which are summed up for all the synapses in the same row to generate the currents T_j and V_j. If learning is enabled (φ_l = 1) the value of z_{ij} will change to I_i z_{ij} if y_j = 1. The \"vigilance synapses\" consist each of a current source of value L_A with a switch controlled by the input voltage I_i. The current comparators are those proposed in (Dominguez-Castro, 1992), the WTA used is reported in (Lazzaro, 1989), and the digitally adjustable current mirror is based on (Loh, 1989), while its continuous gain fine-tuning mechanism has been taken from (Adams, 1991). \n\n1. An additional pin of the chip can fine-tune ρ between 0.9 and 1.0. \n\nFig. 2: (a) System Diagram of Current-Mode ART1 Chip, (b) Circuit Diagram of Synapse \n\nFig. 3: Tree-based current-mirror scheme for matched current sources \n\nThe circuit has been designed in such a way that the WTA operates with a precision around 1.5% (~6 bits). This means that all L_A and L_B current sources have to match within an error of less than that. From a circuit implementation point of view this is not easy to achieve, since there are 5500 current sources spread over a die area of 1 cm^2. Typical mismatch between reasonable-size MOS transistors inside such an area extension can be expected to be above 10% (Pelgrom, 1989). 
To overcome this problem we implemented a tree-based current mirror scheme, as shown in Fig. 3. Starting from a unique current reference, and using high-precision 10-output (or fewer) current mirrors (each yielding a precision around 0.2%), only up to four cascades are needed. This way, the current mismatch attained at the synapse current sources was around 1% for currents between 5 µA and 10 µA. This is shown in Fig. 4, where the measured dc output current error (in %) versus input current of the tree-based configuration is depicted for 18 of the 3600 L_A synapse sources. \n\nFig. 4: Measured current mirror cascade mismatch (1%/div) for L_A for currents below 10 µA \n\n3 EXPERIMENTAL RESULTS \n\nFig. 5 shows a microphotograph of a prototype chip fabricated in a standard digital double-metal, single-poly 1.5 µm low-cost CMOS process. The chip die area is 1 cm^2, and it is mounted in a 120-pin PGA package. Fig. 6 shows a typical training sequence accomplished by the chip and obtained experimentally using test equipment for digital chips. The only task performed by the test equipment was to provide the input data patterns I (first column in Fig. 6), detect which of the output nodes became '1' (pattern with a vertical bar to its right), and extract the learned weights. Each 10x10 square in Fig. 6 represents either a 100-pixel input vector I, or one row of 100-pixel synaptic weights z_j = (z_{j1}, z_{j2}, ..., z_{j100}). Each row of squares in Fig. 6 represents the input pattern (first square) and the 18 vectors z_j after learning has been performed for this input pattern. The sequence shown in Fig. 6 has been obtained for ρ = 0.7, L_A = 10 µA, L_B = 9.5 µA, and L_M = 950 µA. 
Only two iterations of input pattern presentations were necessary, in this case, for the system to learn and self-organize in response to these 18 input patterns. The last row in Fig. 6 shows the final learned templates. Fig. 7 shows final learned templates for different values of ρ. The numbers below each square indicate the input patterns that have been clustered into each z_j category. \n\nDelay time measurements have been performed for the feedforward action of the chip (establishment of currents T_j, V_j, and V_ρ, and their competitions until the WTA settles), and for the updating of weights. The feedforward delay is pattern- and bias-current (L_A, L_B, L_M) dependent, but has been measured to be always below 1.6 µs. The learning time is constant and is around 180 ns. Therefore, throughput time is less than 1.8 µs. A digital neuroprocessor able to perform a connections/s, b connection-updates/s, and with a dedicated WTA section with a c seconds delay, must satisfy \n\n3700/a + 100/b + c = 1.8 µs   (3) \n\nto meet the performance of our prototype chip. If a = b and c = 100 ns, the equivalent speed would be a = b = 2.2 x 10^9 connections and connection-updates per second. \n\nFig. 5: Microphotograph of ART1 chip \n\n4 CONCLUSIONS \n\nA high-speed analog current-mode categorizer chip has been built using a standard low-cost digital CMOS process. The high performance of the chip is achieved thanks to a simplification of the original ART1 algorithm. The simplifications introduced are such that all the original computational properties are preserved. Experimental chip test results are provided. 
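Equation (3) counts 3700 connections evaluated per presentation (which we read as 18x100 each for the T_j and V_j sums plus the 100 vigilance synapses) and 100 weight updates. Solving it for the equivalent digital rate under the text's assumption a = b is a one-liner; the function name and structure are ours.

```python
def equivalent_rate(wta_delay, throughput=1.8e-6, connections=3700, updates=100):
    """Solve Eq. (3) for a = b: connections/a + updates/b + c = throughput."""
    return (connections + updates) / (throughput - wta_delay)

a = equivalent_rate(100e-9)  # c = 100 ns, as in the text
```

This reproduces the quoted figure of roughly 2.2 x 10^9 connections and connection-updates per second.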
\n\nFig. 6: Test sequence obtained experimentally for ρ = 0.7, L_A = 10 µA, L_B = 9.5 µA, and L_M = 950 µA \n\nFig. 7: Categorization of the input patterns for L_A = 3.2 µA, L_B = 3.0 µA, L_M = 400 µA, and different values of ρ \n\nReferences \n\nW. J. Adams and J. Ramirez-Angulo. (1991) \"Extended Transconductance Adjustment/Linearisation Technique,\" Electronics Letters, vol. 27, no. 10, pp. 842-844, May 1991. \n\nG. A. Carpenter and S. Grossberg. (1987) \"A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine,\" Computer Vision, Graphics, and Image Processing, vol. 37, pp. 54-115, 1987. \n\nR. Dominguez-Castro, A. Rodriguez-Vazquez, F. Medeiro, and J. L. Huertas. (1992) \"High Resolution CMOS Current Comparators,\" Proc. of the 1992 European Solid-State Circuits Conference (ESSCIRC'92), pp. 242-245, 1992. \n\nJ. Lazzaro, S. Ryckebusch, M. A. Mahowald, and C. Mead. (1989) \"Winner-Take-All Networks of O(n) Complexity,\" in Advances in Neural Information Processing Systems, vol. 1, D. S. Touretzky (Ed.), Los Altos, CA: Morgan Kaufmann, 1989, pp. 703-711. \n\nK. Loh, D. L. Hiser, W. J. Adams, and R. L. Geiger. (1989) \"A Robust Digitally Programmable and Reconfigurable Monolithic Filter Structure,\" Proc. of the 1989 Int. Symp. on Circuits and Systems (ISCAS'89), Portland, Oregon, vol. 1, pp. 110-113, 1989. \n\nM. J. M. Pelgrom, A. C. J. Duinmaijer, and A. P. G. Welbers. (1989) \"Matching Properties of MOS Transistors,\" IEEE Journal of Solid-State Circuits, vol. 24, no. 5, pp. 1433-1440, October 1989. \n\nT. Serrano-Gotarredona and B. Linares-Barranco. (1994a) \"A Modified ART1 Algorithm more suitable for VLSI Implementations,\" submitted for publication (journal paper). \n\nT. Serrano-Gotarredona and B. Linares-Barranco. (1994b) \"A Real-Time Clustering Microchip Neural Engine,\" submitted for publication (journal paper). \n", "award": [], "sourceid": 997, "authors": [{"given_name": "Teresa", "family_name": "Serrano-Gotarredona", "institution": null}, {"given_name": "Bernab\u00e9", "family_name": "Linares-Barranco", "institution": null}, {"given_name": "Jos\u00e9", "family_name": "Huertas", "institution": null}]}